The first long-context LRM trained with reinforcement learning for long-context reasoning. It outperforms flagship models such as o3-mini and performs on par with Claude 3.7 Sonnet Thinking, demonstrating leading performance on long-context document QA tasks.
Context Window
128.0K
Max Output
41.0K
Input Price (Auto)
$0.14/1M
Output Price (Auto)
$0.60/1M
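As a quick sanity check on the listed auto-routing rates, here is a minimal sketch of estimating per-request cost; the token counts in the example are hypothetical, not from this page:

```python
# Listed auto-routing prices, in USD per 1M tokens (from the card above).
INPUT_PRICE_PER_M = 0.14
OUTPUT_PRICE_PER_M = 0.60

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical example: a request filling most of the 128K context window
# with a 4K-token reply.
cost = request_cost(120_000, 4_000)
print(f"${cost:.4f}")  # -> $0.0192
```

At these rates, even a near-full-context request costs about two cents, with input tokens dominating the total.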
Performance metrics and benchmarks
Sourced from Artificial Analysis.
Intelligence Index
14.5
Auto routing is available for this model. Explicit provider selection is not available.
GPQA Diamond
Graduate-level scientific reasoning
53.5%
Better than 38% of models compared
HLE
Humanity's Last Exam
4.3%
Better than 23% of models compared
IFBench
Instruction-following benchmark
31.5%
Better than 21% of models compared
AA-LCR
Long context reasoning evaluation
0.0%
Better than 7% of models compared
SciCode
Python programming for scientific computing
28.0%
Better than 45% of models compared
LiveCodeBench
Contamination-free coding benchmark
28.8%
Better than 32% of models compared
AIME 2025
American Invitational Mathematics Examination 2025
19.7%
Better than 22% of models compared
AIME
American Invitational Mathematics Examination
30.3%
Better than 58% of models compared
MMLU-Pro
Professional and academic subject knowledge
72.7%
Better than 42% of models compared
Math-500
Diverse mathematical problem solving benchmark
86.9%
Better than 56% of models compared
Last updated May 15, 2026