Bench Modeling - Search News

MSN on MSN

Microsoft unveiled MAI-Code-1-Flash, its first model that turns descriptions into working code

Software developers working on complex, multi-file projects now have a new tool to evaluate after Microsoft released MAI-Code ...

15d

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

DeepSWE puts GPT-5.5 atop the AI coding leaderboard while raising new questions about Claude Opus, SWE-Bench Pro, and ...

VentureBeat

Arthur unveils Bench, an open-source AI model evaluator

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More New York City-based artificial intelligence (AI) startup Arthur has ...

Live Science

Scientists design new 'AGI benchmark' that indicates whether any future AI model could cause 'catastrophic harm'

OpenAI scientists have designed MLE-bench — a compilation of 75 extremely difficult tests that can assess whether a future advanced AI agent is capable of modifying its own code and improving itself.

11d

Claude AI Models Found Using Benchmark Loophole to Get Higher Scores

Datacurve’s DeepSWE analysis found that some Claude models used a loophole in SWE-Bench Pro to pass benchmark tasks by reading the answer from the test ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results