OpenAI and Paradigm launch EVMbench for smart contract security
EVMbench is an open benchmark created by OpenAI together with Paradigm to evaluate how AI systems handle smart contract security tasks.
Scope and motivation
Smart contracts on Ethereum and other EVM-compatible networks currently hold more than $100 billion in assets. Their code is open source, and a contract cannot be changed once it has been deployed.
Because vulnerabilities in immutable contracts can cause significant financial losses, the benchmark aims to measure AI performance across common security tasks in a reproducible setting.
Evaluation modes
EVMbench assesses agents in three distinct modes designed to reflect real-world adversarial and defensive activities without interacting with live networks.
- Vulnerability discovery: locate bugs and insecure constructs in contract source code.
- Patch generation: propose fixes that preserve original contract logic while removing vulnerabilities.
- Exploit execution: simulate siphoning funds within an isolated sandbox to validate exploitability.
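The three modes can be pictured as separate pass/fail checks. The sketch below is illustrative only: the names `Mode`, `patch_accepted`, and `exploit_succeeded` are hypothetical, and the criteria are assumptions read off the descriptions above, not EVMbench's actual grading code.

```python
from enum import Enum


class Mode(Enum):
    """The three evaluation modes named in the article."""
    DETECT = "vulnerability discovery"
    PATCH = "patch generation"
    EXPLOIT = "exploit execution"


def patch_accepted(functional_tests_pass: bool, exploit_still_works: bool) -> bool:
    """Assumed criterion: a patch keeps the original behavior (the
    contract's functional tests still pass) and removes the bug
    (the known exploit no longer works)."""
    return functional_tests_pass and not exploit_still_works


def exploit_succeeded(attacker_before: int, attacker_after: int) -> bool:
    """Assumed criterion: the attacker's balance grew inside the sandbox."""
    return attacker_after > attacker_before
```

Under these assumptions, a patch that blocks the exploit but breaks the contract's normal behavior would be rejected, which matches the requirement that fixes preserve the original logic.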
Dataset and scenarios
The benchmark incorporates 120 real vulnerabilities drawn from 40 audits, with many cases sourced from Code4rena competition reports.
EVMbench also includes scenarios from the audit of Tempo, the Layer-1 blockchain developed by Stripe for fast stablecoin transfers.
Isolated testing environment
To prevent test-time manipulation, OpenAI runs agents against an isolated local blockchain replica where transactions follow a fixed, deterministic sequence.
This sandbox ensures agents cannot alter outcomes or reuse external network state, making results comparable across runs and systems.
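A toy model shows why a fixed transaction order makes runs comparable. This is a minimal sketch, not the benchmark's harness: `Tx`, `replay`, `GENESIS`, and `SEQUENCE` are hypothetical names, and a real EVM replica tracks far more state than a table of balances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tx:
    sender: str
    recipient: str
    amount: int


def replay(initial_balances: dict, txs: list) -> dict:
    """Apply transactions in the given fixed order and return the final balances."""
    balances = dict(initial_balances)  # copy so the starting state is never mutated
    for tx in txs:
        if balances.get(tx.sender, 0) < tx.amount:
            continue  # skip transfers the sender cannot afford
        balances[tx.sender] -= tx.amount
        balances[tx.recipient] = balances.get(tx.recipient, 0) + tx.amount
    return balances


# Hypothetical starting state and transaction sequence.
GENESIS = {"vault": 100, "attacker": 0}
SEQUENCE = [Tx("vault", "attacker", 60), Tx("vault", "attacker", 60)]

run_a = replay(GENESIS, SEQUENCE)
run_b = replay(GENESIS, SEQUENCE)
print(run_a == run_b)  # the same inputs in the same order give the same final state
```

Because `replay` depends only on the starting state and the ordered transaction list, two runs can differ only if the agent's own transactions differ, which is the property that makes scores comparable.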
Key findings
On a task framed as "steal funds from contract" with a known vulnerability, GPT-5.3-Codex succeeded in 72% of attempts under the benchmark conditions.
However, automatic discovery of unknown vulnerabilities and reliable patching remain difficult, as agents often identify a single issue and stop rather than performing exhaustive contract analysis.
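The cost of stopping after the first finding is easy to see with a recall-style measure. The sketch below assumes discovery is judged by the share of known vulnerabilities an agent reports; the article does not spell out the scoring, and the vulnerability labels are made up. Three issues are used because 120 vulnerabilities across 40 audits averages three per audit.

```python
def recall(reported: set[str], known: set[str]) -> float:
    """Fraction of the known vulnerabilities that the agent reported."""
    if not known:
        raise ValueError("known must contain at least one vulnerability")
    return len(reported & known) / len(known)


# Hypothetical audit with three known issues.
known = {"reentrancy", "unchecked-call", "access-control"}

stops_early = {"reentrancy"}  # agent reports one issue and stops
exhaustive = set(known)       # agent keeps going until all are found

print(recall(stops_early, known))  # one of three issues found
print(recall(exhaustive, known))   # all three issues found
```

Under this assumed measure, an agent that finds one real bug and stops would be credited with only a third of the audit's issues, however accurate that single finding is.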