METR, which runs the benchmark measuring how well models can complete long-duration tasks, found that Claude Mythos Preview ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible resultsSome results have been hidden because they may be inaccessible to you
Show inaccessible results