
AgentBenchAudit/agent-benchmark-evidence-reports

Release repository for agent benchmark evidence-reporting artifacts and reproduction workflows.

HTML

★ 0 stars · AI relevance: 100 · solo dev · tool sigs

SUMMARY AI summary by gpt-5-mini

This repository provides the public release of an evidence-reporting layer for interactive-agent benchmarks introduced in the linked paper. It does not change tasks, agents, environments, or native evaluators; instead it adds a post-run audit that asks what the stored run artifacts actually support.

Who uses it: benchmark authors, evaluators, auditors, and researchers who need to interpret or validate benchmark success claims.

Key features:
- Formalization of the outcome–evidence gap and case-specific checklists tied to each benchmark’s success claim.
- Classification of completed records into Evidence Pass, Evidence Fail, and Unknown.
- Packaged records, release-validation utilities, and rescoring helpers to report evidence-supported score bounds (intervals) over fixed record sets.
- Tracking and submission of upstream conflict reports with progress status.
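The three-way classification and the score intervals mentioned above can be illustrated with a minimal Python sketch. It is not the repository's rescoring helper: the `Evidence` enum and the `evidence_score_bounds` function are hypothetical names, and the bounding rule is an assumption (Unknown records count as failures for the lower bound and as passes for the upper bound). The summary does not state how the release actually computes its intervals.

```python
from collections import Counter
from enum import Enum


class Evidence(Enum):
    """Audit labels named in the summary; the enum itself is hypothetical."""
    PASS = "Evidence Pass"
    FAIL = "Evidence Fail"
    UNKNOWN = "Unknown"


def evidence_score_bounds(labels):
    """Return (lower, upper) bounds on the evidence-supported score.

    Assumed rule, not taken from the repository:
      lower = share of records labeled Evidence Pass
      upper = share labeled Evidence Pass or Unknown
    """
    if not labels:
        raise ValueError("need at least one audited record")
    counts = Counter(labels)
    total = len(labels)
    lower = counts[Evidence.PASS] / total
    upper = (counts[Evidence.PASS] + counts[Evidence.UNKNOWN]) / total
    return lower, upper


# Made-up record set, purely for illustration: 6 pass, 3 fail, 1 unknown.
records = [Evidence.PASS] * 6 + [Evidence.FAIL] * 3 + [Evidence.UNKNOWN]
lower, upper = evidence_score_bounds(records)
print(f"evidence-supported score in [{lower:.2f}, {upper:.2f}]")
```

Under this assumed rule the interval collapses to a single point when no record is Unknown, and it widens as more records lack the artifacts needed to decide either way.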