How Far Are We From True Auto Research?

An in-depth analysis of frontier CLI-agents conducting end-to-end research across diverse fields and compute resources

Claude Code (Opus 4.6) · Codex (GPT-5.4) · Kimi Code (K2.5)

Zhengxin Zhang*, Ning Wang*, Sainyam Galhotra, Claire Cardie
Cornell University • *Order determined by coin flip

117 agent papers • 13 research domains


1. Introduction

Recent years have seen a wave of auto-research systems, including Auto Research [1], AI Scientist [2], and Analemma's FARS [3]. These projects have shown impressive progress in automated end-to-end scientific research. Rather than proposing another complex framework, we study a more direct setting: whether widely used general-purpose CLI agents (e.g., Claude Code, Codex, Kimi Code) can carry out research with only minimal guidance. Moreover, no comprehensive evaluation exists of the quality of agent-conducted research across diverse domains or under varying computational constraints (e.g., CPU-only vs. GPU environments), which makes it difficult to systematically assess the true capabilities and limitations of such agents in realistic research scenarios.

We evaluate three off-the-shelf CLI agents — Claude Code with Opus 4.6 [4, 5], Codex with GPT-5.4 [6, 7], and Kimi Code with K2.5 [8, 9] — using a minimal scaffold on 13 diverse CS domains (5 CPU-only + 8 GPU environments). We employ a peer-review protocol where three CLI agents evaluate each paper alongside its code, and additionally use the Stanford Agentic Reviewer [10] for external validation.

Key Findings

Each CLI agent develops a distinct research persona: Claude Code is the full-stack researcher — balancing both strategies (46% empirical, 33% novel methods) with the longest, most comprehensive papers. Codex is the empirical scientist — 87% empirical studies, highest integrity, but lower novelty. Kimi Code is the system builder — 79% of its papers are framed as novel methods with acronym-heavy names, yet it still scores low on novelty because many of its ideas sound more like repackaged existing work than genuinely new contributions.

Stanford Agentic Review (SAR) (0–10, ICLR scale): Claude Code 5.45 > FARS 5.06 > Codex 4.93 > Kimi Code 4.24 (Section 4).

Peer Review (PR) (0–10, ICLR scale): Claude Code 4.59 > Codex 4.51 > Kimi Code 3.38 (Section 5).

Human Inspection:

  • We further confirm that experimental rigor is the number-one weakness: agents fail to plan, write, and execute experiments well, which limits the significance and scope of the papers. Fabricated results and an undue preference for negative findings are the main reasons the agentic reviewer falls short in its reviews, raising faithfulness concerns for frontier models. In conclusion, although some papers offer genuine insights, all of them fall far behind top-tier venues: there is still a long way to go for true auto research.

2. Framework Overview

    Each CLI agent follows a standardized pipeline:

    1. Ideation — Generate a research idea and experiment plan; self-review for up to 3 iterations.
    2. Experiments — Write and execute code, collect results; self-review for up to 3 iterations.
    3. Paper Writing — Produce a paper; self-review for up to 3 iterations.
    4. Review — Evaluate via Stanford Agentic Reviewer and triple peer review (all three agents review each paper alongside its code).
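The per-stage self-review gate above can be sketched as a small driver loop. This is a minimal illustration only: `stage_fn`, `review_fn`, and the 7.0 acceptance threshold are hypothetical names and values, not details specified in this report.

```python
def run_with_self_review(stage_fn, review_fn, max_iters=3, threshold=7.0):
    """Run one pipeline stage (ideation, experiments, or paper writing),
    then self-review and revise up to max_iters times.

    stage_fn(feedback) -> artifact; review_fn(artifact) -> (score, feedback).
    All names and the threshold are illustrative assumptions.
    """
    artifact = stage_fn(feedback=None)          # initial attempt
    for _ in range(max_iters):
        score, feedback = review_fn(artifact)
        if score >= threshold:                  # gate passed: stop revising
            break
        artifact = stage_fn(feedback=feedback)  # revise using the critique
    return artifact
```

The same loop is reused at each of the three stages, which is why the tables in Section 7 report revision outcomes per gate.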

    3. Experiment Setup

Agent        Model     CPU Seeds  GPU Seeds  Trials/Seed  Total Papers
Claude Code  Opus 4.6  5          8          3            39
Codex        GPT-5.4   5          8          3            39
Kimi Code    K2.5      5          8          3            39

    CPU seeds: Causal Learning, Compiler Optimization, Data Integration & Cleaning, Operating System Design, Probabilistic Methods

    GPU seeds: AI for Biology, Computer Vision, Datasets & Benchmarks, Generative Models, Interpretability, NLP, Privacy in ML, Supervised Representation Learning

Hardware: CPU-only or 1× NVIDIA RTX A6000 (48GB). Each agent gets 4 CPUs and 60GB RAM. We additionally re-run the GPU seeds on NVIDIA H100 (80GB) GPUs (Section 9).

    Peer review (PR): Every paper is reviewed by all three agents (Claude Code, Codex, Kimi Code), scoring on 9 dimensions (novelty, soundness, significance, clarity, reproducibility, experimental rigor, references, reference integrity, results integrity) on a 0–10 ICLR scale.
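One plausible way to turn these per-dimension scores into a single PR score per paper is a mean over the 9 dimensions for each reviewer, then a mean across the three reviewers; the exact aggregation formula is our assumption, not stated in the text.

```python
from statistics import mean

DIMENSIONS = [
    "novelty", "soundness", "significance", "clarity", "reproducibility",
    "experimental rigor", "references", "reference integrity", "results integrity",
]

def pr_score(reviews):
    """Aggregate peer review: per-reviewer mean over the 9 dimensions,
    then mean across reviewers. `reviews` maps reviewer name to a dict of
    {dimension: score on the 0-10 ICLR scale}. The two-level mean is an
    assumed aggregation scheme."""
    per_reviewer = [mean(r[d] for d in DIMENSIONS) for r in reviews.values()]
    return mean(per_reviewer)
```

Averaging per reviewer first keeps a strict and a lenient reviewer equally weighted even if a reviewer skips a dimension in some variant of the protocol.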

Stanford Agentic Review (SAR): Every paper is reviewed by the Stanford Agentic Reviewer, which provides ICLR-calibrated scores on a 0–10 scale.

    4. Main Results on Stanford Agentic Review

    Automated Research Systems Comparison

    We evaluate the performance of three CLI agents on the Stanford Agentic Reviewer (SAR). For comparison, we also collect scores from Analemma's FARS, which reports SAR scores for a total of 102 papers.

Claude Code  5.45  (n=39, $200)
FARS         5.06  (n=102, $186K)
Codex        4.93  (n=39, $200)
Kimi Code    4.24  (n=39, $100)
    Score Distribution
    Figure 1: Score distributions of automated research systems on the SAR. Claude Code has the highest median and tightest distribution above 5.0.
System       n    SAR μ  SAR σ  Min  Max  ≥5.0
Claude Code  39   5.45   0.70   3.1  6.3  82%
FARS         102  5.06   0.62   3.0  6.3  76%
Codex        39   4.93   0.85   2.7  6.3  64%
Kimi Code    39   4.24   0.84   2.0  5.3  26%
    Finding 1: Claude Code achieves the highest performance among automated research systems (5.45 > FARS 5.06 > Codex 4.93 > Kimi Code 4.24). Claude Code outperforms Analemma's FARS despite using only a $200 Max subscription vs. FARS's $186,000 budget, suggesting that complex specialized auto-research frameworks are unnecessary — a general-purpose CLI agent with minimal scaffolding already surpasses them.

    Comparison with Human-Authored ICLR 2025 Papers

To calibrate these scores against human-authored research, we additionally submitted 200 real ICLR 2025 papers (100 accepted, 100 rejected) to the same Stanford Agentic Reviewer, establishing a human baseline.

ICLR Accepted  5.59  (n=100, Human=6.54)
Claude Code    5.45  (n=39)
ICLR Weighted  5.42  (32% acc / 68% rej)
ICLR Rejected  5.34  (n=100, Human=5.02)
System                        n    SAR μ  SAR σ  Human μ  Human σ
ICLR 2025 Accepted            100  5.59   0.59   6.54     0.80
Claude Code                   39   5.45   0.70   n/a      n/a
ICLR 2025 Weighted (32%/68%)  200  5.42   0.70   5.50     0.81
ICLR 2025 Rejected            100  5.34   0.75   5.02     0.81

    SAR = Stanford Agentic Reviewer score. Human = average OpenReview score from ICLR 2025 human reviewers.

    Finding 2: Claude Code achieves performance comparable to the average ICLR 2025 submission (SAR: 5.45 vs. 5.42), but remains below the level of accepted papers (5.59), indicating a non-trivial gap to top-tier research quality. More importantly, SAR exhibits substantially weaker discriminative ability than human reviewers. Human reviewers show a 1.52-point gap between accepted and rejected papers (6.54 vs. 5.02), whereas SAR compresses this gap to only 0.25 points (5.59 vs. 5.34).
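The weighted baseline follows directly from the 32%/68% acceptance split. Mixing the per-group SAR means reproduces the reported 5.42, and the same mixture of the human means gives 5.51, close to the table's 5.50 (the small difference presumably reflects rounding of the reported group means):

```python
# 32% of ICLR 2025 submissions accepted, 68% rejected (per the table above)
w_acc, w_rej = 0.32, 0.68

sar_weighted = w_acc * 5.59 + w_rej * 5.34     # weighted SAR mean
human_weighted = w_acc * 6.54 + w_rej * 5.02   # weighted human mean

print(round(sar_weighted, 2))    # 5.42
print(round(human_weighted, 2))  # 5.51
```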

    Score Heatmap (Seed × Trial)

    SAR score heatmap
    Figure 2: SAR score heatmap across all seeds and trials for each agent.

    Per-Domain Breakdown

    We break down SAR scores across 13 research domains (5 CPU-only, 8 GPU) and compare CPU vs GPU performance for each agent.

    Per-domain Stanford scores
    Figure 3: SAR scores by research domain, ranked by average score across agents.

    CPU vs GPU Performance (SAR)

Agent        CPU   GPU   Gap
Claude Code  5.49  5.42  +0.07
Codex        4.55  5.16  −0.61
Kimi Code    4.12  4.32  −0.20
Finding 3: Performance varies substantially across domains: Datasets & Benchmarks scores highest (5.79) while Data Integration & Cleaning scores lowest (4.39), a 1.40-point spread. The CPU vs GPU gap is agent-dependent: Claude Code performs nearly identically across platforms (+0.07), while Codex scores significantly higher on GPU tasks (5.16 vs 4.55, gap = −0.61). However, under PR all agents perform better on CPU tasks than on GPU tasks (see Section 5).

    Qualitative Review Analysis

    We analyze the text of Stanford reviews to identify common weakness themes via keyword matching across 7 categories: Experimental Design (baselines, ablations, evaluation, statistical rigor), Overclaiming (unsupported or exaggerated claims), Scope & Scale (limited or toy experiments), Clarity & Presentation (unclear writing, confusing notation), Results Inconsistency (contradictions or mismatches within the paper), Reference Quality (fake citations, placeholder references, future dates), and Novelty (incremental or trivial contributions). We also measure the balance between strengths and weaknesses across agents.

    Weakness themes
    Figure 4: Weakness themes in SAR (7 categories).
Agent        Avg Strengths (words)  Avg Weaknesses (words)  Ratio (W/S)
Claude Code  215                    316                     1.47
Codex        212                    276                     1.30
Kimi Code    184                    339                     1.85
    Finding 4: Experimental design is the dominant weakness across all agents, with Kimi Code receiving the most criticism (256 mentions). Missing baselines and comparisons are a major driver of this weakness. Kimi Code also leads in overclaiming and results inconsistency (79 vs Claude Code's 44 and Codex's 27; 54 vs Codex's 17), while Codex is flagged most for novelty & scope (66) — consistent with its empiricist research style. Meanwhile, Kimi Code shows a much larger weakness-to-strength ratio (1.85× vs 1.30× for Codex), reflecting the overall quality gap.
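The theme tagging can be approximated with simple substring matching. The keyword lists below are illustrative stand-ins; the actual lexicon used for Figure 4 is not published here.

```python
# Illustrative keyword lists per weakness theme (assumed, not the real lexicon).
THEMES = {
    "Experimental Design": ["baseline", "ablation", "evaluation", "statistical"],
    "Overclaiming": ["overclaim", "unsupported", "exaggerat"],
    "Scope & Scale": ["toy", "limited scope", "small-scale"],
    "Clarity & Presentation": ["unclear", "confusing", "notation"],
    "Results Inconsistency": ["inconsisten", "contradict", "mismatch"],
    "Reference Quality": ["fake citation", "placeholder reference"],
    "Novelty": ["incremental", "trivial"],
}

def count_weakness_themes(review_text):
    """Count keyword hits per theme in the weaknesses text of a review.
    Substring matching (e.g. 'exaggerat' catches 'exaggerated') is a
    deliberate simplification."""
    text = review_text.lower()
    return {theme: sum(text.count(kw) for kw in kws)
            for theme, kws in THEMES.items()}
```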

    Human-Annotated SAR Decisions

We manually inspected all papers together with their SAR reviews and found that the overall assessment score is not a reliable indicator of SAR's final recommendation: some papers scoring above 6 were not recommended for acceptance, while others scoring around 4.5 were. Because of this inconsistency, we manually assigned accept or reject labels based on SAR's final recommendation rather than its numeric score. When a review does not explicitly state a decision, we inferred the recommendation from its overall tone, judging whether it leaned toward acceptance or rejection. We treat conditional accept, accept with revision, and borderline (accept) as accept decisions; all others are treated as rejections.

Category                      Accept  Reject  Accept %  Accept/Reject
ICLR 2025 Accepted            76      24      76.0%     3.17
ICLR 2025 Weighted (32%/68%)  59.7    40.3    59.7%     1.48
ICLR 2025 Rejected            52      48      52.0%     1.08
Claude Code                   16      23      41.0%     0.70
FARS (Analemma)               22      80      21.6%     0.28
Codex                         5       34      12.8%     0.15
Kimi Code                     2       37      5.1%      0.05
    Finding 5: The human-annotated SAR final-decision acceptance rates show a clear gap between human-authored and agent-generated papers. All human-written paper groups achieve higher acceptance rates: 52% for rejected ICLR papers and 76% for accepted ICLR papers, compared with only about 41% for the strongest agent-generated group. Among the agentic systems, this further confirms that Claude Code with a minimal scaffold substantially outperforms FARS papers (21.6%). Codex ranks next with 12.8%, and Kimi Code performs the worst at 5.1%.
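The labeling rule above can be written down directly; the tone-based inference for reviews with no explicit decision still requires human judgment and is not captured here.

```python
# Recommendation strings that count as acceptance under our protocol.
ACCEPT_VARIANTS = {
    "accept", "conditional accept", "accept with revision", "borderline (accept)",
}

def label_decision(recommendation: str) -> str:
    """Map an explicit SAR recommendation string to an accept/reject label:
    conditional accept, accept with revision, and borderline (accept) all
    count as accept; everything else is treated as a rejection."""
    rec = recommendation.strip().lower()
    return "accept" if rec in ACCEPT_VARIANTS else "reject"
```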

    5. Main Results on Peer Review

    We design a peer review (PR) protocol where each of the three CLI agents reviews every paper alongside its experiment code, execution logs, and results artifacts. Unlike SAR, which evaluates only the PDF, this enables reviewers to verify whether reported results match actual experimental outputs, detect incomplete experiments, and identify fabricated claims. Additionally, reviewers are required to check reference validity via tools, flagging hallucinated or nonexistent citations.

Claude Code  4.59 ±0.12 SE  (best: 6.0, 36% ≥ 5.0)
Codex        4.51 ±0.06 SE  (best: 5.3, 5% ≥ 5.0)
Kimi Code    3.38 ±0.16 SE  (best: 5.3, 3% ≥ 5.0)
    Peer Review Score Distribution
    Figure 5: PR score distributions. Claude Code and Codex cluster around 4–5, while Kimi Code has a wider spread with more low scores.

    Score Heatmap (Seed × Trial)

    Score heatmap
    Figure 6: PR score heatmap across all seeds and trials. Codex is the most consistent (std=0.32), while Kimi Code has the highest variance (std=0.80).

    Per-Domain Breakdown

    Per-domain peer review scores
    Figure 7: PR scores by research domain, ranked by average score across agents.
    Finding 6: PR scores drop substantially compared to SAR across all agents (Claude Code: 4.59 vs. 5.45, Codex: 4.51 vs. 4.93, Kimi Code: 3.38 vs. 4.24). Claude Code achieves the highest scores overall and produces the single best paper (score = 6.0). Performance varies across domains with a 0.92-point spread. Codex is the most consistent across domains (std=0.21, range 4.22–5.00), while Claude Code (std=0.44, range 3.78–5.33) and Kimi Code (std=0.59, range 2.67–4.33) show much higher variance. Kimi Code struggles most on GPU-intensive domains (Generative Models, Datasets & Benchmarks, Supervised Repr. Learning: all 2.67).

    CPU vs GPU Performance

             Peer Review (PR)               Stanford (SAR)
Agent        CPU        GPU        Gap     CPU        GPU        Gap
Claude Code  4.73±0.20  4.50±0.10  +0.23   5.49±0.16  5.42±0.15  +0.07
Codex        4.53±0.07  4.50±0.05  +0.03   4.55±0.23  5.16±0.15  −0.61
Kimi Code    3.69±0.22  3.19±0.18  +0.50   4.12±0.23  4.32±0.16  −0.20
    Finding 7: PR and SAR show opposite CPU/GPU trends. Under PR, all agents score higher on CPU tasks (especially Kimi Code: +0.50). Under SAR, Codex and Kimi Code score higher on GPU tasks (Codex: −0.61). GPU domains (vision, NLP, generative models) are well-established fields where agents can produce better-looking papers, but GPU experiments are harder to execute correctly — CUDA issues, memory limits, and training instabilities lead to more incomplete runs and mismatched results when reviewers verify the code. CPU tasks are simpler to run and verify, yielding more reliable experiments. This divergence highlights that SAR alone is insufficient — it rewards presentation quality over experimental substance.

    Per-Dimension Review Breakdown

    Dimension scores
    Figure 8: Per-dimension PR scores. Codex leads on reproducibility (7.65), reference integrity (8.41), and results integrity (7.94). Claude Code leads on novelty (5.42) and significance (5.23). Kimi Code trails on all dimensions.
    Finding 8: Agents exhibit a clear divergence between creative and reliability-oriented dimensions. Claude Code performs best on creative aspects such as novelty (5.42) and significance (5.23), while Codex leads on the majority of reliability-related dimensions (7 out of 9), including reproducibility, reference integrity, and results integrity. This pattern is consistent with the different research strategies adopted by the agents (see Section 6). Across all agents, experimental rigor is surprisingly the weakest dimension (3.29–5.56) rather than novelty. Further analysis suggests that missing baselines and not fully executed experimental plans are the primary causes, despite having sufficient available compute budget (see Section 9).

    Integrity & Reference Analysis

    We further manually validate every paper with its artifacts and PRs across four integrity categories:

Category                   Claude Code  Codex      Kimi Code
Results mismatch only      6/39 (15%)   2/39 (5%)  4/39 (10%)
Setting mismatch only      10/39 (26%)  1/39 (3%)  5/39 (13%)
Both (results + setting)   12/39 (31%)  2/39 (5%)  30/39 (77%)
Fake reference             14/39 (36%)  3/39 (8%)  28/39 (72%)
Mean refs per paper        15.1         12.8       10.5
Reference integrity score  7.34         8.41       6.26
    Finding 9: Code-aware peer review detects integrity issues that SAR cannot. Codex identifies the largest number of fabricated results, with Claude Code ranking second, while Kimi Code flags the fewest. We manually verified that most of the issues raised by Codex and Claude are valid. In contrast, Kimi Code performs the most superficial integrity checks and occasionally produces false positives. Based on our manual annotation of all 117 papers, 77% of Kimi Code papers have both results and setting mismatches, and 72% contain hallucinated references. Claude Code is intermediate: 31% have both results + setting mismatch, and 36% contain fake references. Codex is the most trustworthy — only 5% have both mismatches and 8% have fake references, consistent with its cautious empiricist research style.

    Programming Language Usage

    We analyze the programming languages and file counts across all agent-generated codebases to understand how each agent structures its experiments.

    Claude languages
    Claude Code
    Codex languages
    Codex
    Kimi languages
    Kimi Code
    Finding 10: All three agents overwhelmingly default to Python (Claude Code 96.6%, Codex 100%, Kimi Code 99.2%), even in domains where other languages are typically used, such as operating systems. Claude Code and Kimi Code occasionally make use of shell scripts.

    6. Research Style Analysis

    Beyond scores, we compare how each CLI agent approaches research — what types of papers they write, how they structure arguments, and what their titles reveal about underlying research strategy. This comparative analysis shows that the three agents have developed fundamentally different research personas, which in turn explain many of the quality differences observed in Sections 4–5.

    Research Type Distribution

    We manually classify each paper based on its title and abstract into four categories: novel method (proposes a new named algorithm or system), benchmarking (creates evaluation tools), extension (improves existing methods), and empirical study (studies without proposing new methods).

Type                   Claude Code  Codex     Kimi Code
Novel method           13 (33%)     3 (8%)    31 (79%)
New benchmark          3 (8%)       0 (0%)    4 (10%)
Extension/improvement  5 (13%)      2 (5%)    0 (0%)
Empirical study        18 (46%)     34 (87%)  4 (10%)

    Title Structure Analysis

Metric                            Claude Code  Codex     Kimi Code
Avg title length                  102 chars    99 chars  92 chars
Question titles                   10%          28%       0%
Colon structure (Name: Subtitle)  74%          46%       85%
Contains acronym (2+ caps)        15%          49%       51%
Named method (starts with name)   21%          38%       46%
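These title metrics are easy to recompute. The sketch below shows our reading of each column; in particular, "Contains acronym (2+ caps)" is interpreted as any run of two or more capital letters, which is an assumption.

```python
import re

def title_metrics(title: str) -> dict:
    """Compute the title-structure metrics from the table above for one
    title. The acronym test (a run of 2+ capitals) is our interpretation
    of 'Contains acronym (2+ caps)'."""
    return {
        "length": len(title),                              # avg title length
        "question": title.rstrip().endswith("?"),          # question titles
        "colon_structure": ": " in title,                  # Name: Subtitle
        "acronym": bool(re.search(r"[A-Z]{2,}", title)),   # 2+ caps in a row
    }
```

For example, `title_metrics("CAGER: Causal Geometric Explanation Recovery")` flags both the colon structure and the acronym, matching Kimi Code's dominant style.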

    Title Vocabulary

    The most frequent words in each agent's titles reveal distinct research personas:

Agent        Top Keywords                                              Personality
Claude Code  learning, when, causal, adaptive, pipelines, contrastive  Full-stack researcher (spreads across research types)
Codex        study, benchmark, negative, matched, controlled, pilot    Empirical scientist (controlled studies)
Kimi Code    adaptive, aware, guided, dynamic, gradient, contrastive   System builder (named frameworks)

    Title Patterns

    Each agent has a distinct title style:

Codex (question-based, hypothesis-testing; 28% of titles use question marks, the highest of any agent):
  "Do Shared Decoders Improve Prototype-Edit Reusability?"
  "When Does Clarification Supervision Transfer to Formal Reasoning?"
  "How Much Signal Is in Early Training Trajectories?"

Claude Code (descriptive analysis; unique "The X of Y" essayistic framing):
  "The Algebra of Compiler Passes: An Empirical Study of Idempotency"
  "The Bandwidth Knapsack: Optimal Migration Scheduling"
  "The Functional Anatomy of Sparse Features in Language Models"

Kimi Code (acronym-heavy named frameworks; zero question titles, always declarative):
  "CAGER: Causal Geometric Explanation Recovery"
  "DU-VPT: Decomposed Uncertainty-Guided Visual Prompt Tuning"
  "VAST: Velocity-Adaptive Spatially-varying Timesteps"
    Finding 11: Research-type distribution and title style reveal three distinct research strategies. Codex is overwhelmingly empirical (87%), and its titles are frequently framed as questions, controlled studies, or benchmark-style evaluations. This helps explain why Codex performs better on integrity-related dimensions, yet still lags behind Claude Code on novelty and significance. Kimi Code, by contrast, is dominated by named methods (79%) and acronym-led titles, reinforcing its system-builder profile. But its reference base is relatively dated — 41% of citations are from before 2022 — which likely weakens genuine novelty, making many proposals read more like repackaging than truly new ideas. Claude Code is more balanced across empirical studies and method papers (46% empirical, 33% novel methods), with lower acronym use and broader analytical framing, making it closer to a full-stack researcher with the most diversified research portfolio.

    Paper Structure

    How do the papers differ in structure and technical depth?

Metric                  Claude Code  Codex  Kimi Code
Paper length (words)    4,023        3,421  2,461
Method section (words)  572          531    394
Equations               3.8          2.3    4.0
Figures                 4.8          4.1    0.8
Tables                  6.0          4.2    4.0
Algorithm blocks        0.6          0.0    0.6
Theorems / proofs       0.3          0.0    0.4
Complexity analysis     77%          10%    64%

    All 39 papers per agent analyzed (117 total).

    Finding 12: The three agents write fundamentally different papers. Claude Code writes the longest papers with the most figures and tables, comprehensive and visually rich. Codex writes moderate-length papers with the fewest equations, no algorithm blocks, and no theorems, consistent with its empirical style. Kimi Code writes the shortest papers but relies most heavily on formal surface cues, using the most equations and the strongest theorem-style presentation despite offering very little visual grounding.

    7. Reviewer Analysis

    Our pipeline uses two layers of review. First, each agent self-reviews its own work at every stage (ideation, experiments, paper writing) for up to 3 revision rounds. Second, all three agents cross-review every paper in a peer-review setup. Below we examine whether self-review revision actually improves quality, and how much bias the peer reviewers introduce.

    Self-Review Revision Effectiveness

    When an agent scores below the self-review threshold, it revises and re-evaluates. The table below tracks whether scores improve, stay the same, or decline after revision.

Agent / Gate              Improved  Same  Declined  Avg Delta
Claude Code / Idea        100%      0%    0%        +2.1
Claude Code / Experiment  88%       8%    4%        +2.2
Claude Code / Paper       43%       34%   23%       +0.0
Codex / Idea              35%       51%   15%       +0.4
Codex / Experiment        78%       20%   2%        +2.0
Codex / Paper             64%       29%   7%        +1.5
Kimi Code / Idea          100%      0%    0%        +3.0
Kimi Code / Experiment    91%       9%    0%        +2.2
Kimi Code / Paper         61%       34%   5%        +1.6
Finding 13: Self-review is highly effective for Claude Code and Kimi Code in ideation (100% improvement, avg. +2.1 to +3.0) and experiments (88–91% improvement, avg. +2.2), but shows limited gains for paper writing. This is expected, as paper quality is largely constrained by the underlying ideas and experimental results. In contrast, Codex exhibits a different pattern: it shows relatively modest improvements in ideation, but stronger gains in both experiments and paper writing.
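The revision-tracking statistics in the table can be derived from (score before, score after) pairs; a minimal sketch, assuming one pair per revision round:

```python
def revision_outcomes(score_pairs):
    """Summarize self-review revisions from (before, after) score pairs:
    fraction improved / same / declined, plus the average score delta."""
    n = len(score_pairs)
    deltas = [after - before for before, after in score_pairs]
    return {
        "improved": sum(d > 0 for d in deltas) / n,
        "same": sum(d == 0 for d in deltas) / n,
        "declined": sum(d < 0 for d in deltas) / n,
        "avg_delta": sum(deltas) / n,
    }
```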

    Peer Review Bias

    Each paper is reviewed by all three agents. The score distributions reveal large systematic differences between reviewers.

    Reviewer scores
    Figure 9: How each reviewer scores papers from each agent (all 117 papers). Codex is the strictest reviewer (μ=1.9 for Kimi papers), Kimi Code is the most lenient (μ=6.2 for Claude papers).
Reviewer ↓ / Agent →  Claude Code  Codex  Kimi Code  Avg Given
Claude Code           4.6          4.0    3.2        3.9
Codex                 2.9          3.9    1.9        2.9
Kimi Code             6.2          5.7    5.0        5.7
    Finding 14: Reviewer severity differs dramatically across agents. Codex is the strictest reviewer (avg. 2.9), while Kimi Code is the most generous (avg. 5.7), scoring the same set of papers nearly 2× higher. Self-review is not uniformly inflated across agents: Claude Code and Codex both give their own papers their highest average scores (4.6 and 3.9), but Kimi Code does not — it scores Claude Code highest (6.2) and its own papers lower (5.0). Even so, Kimi Code's ratings remain substantially more lenient overall, including toward its own work, despite producing the weakest papers on average (PR 3.38). This points to a broader generation-verification gap and shows why multi-reviewer aggregation is necessary.
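Reviewer severity can be read off the cross-review matrix as a row mean. Note that the table's "Avg Given" column is computed over all 117 papers, so a uniform mean over the three per-author averages can differ slightly (e.g., Kimi Code's row mean is 5.6 vs. the reported 5.7):

```python
from statistics import mean

# Cross-review matrix (reviewer -> {author: mean PR score}), from Figure 9.
SCORES = {
    "Claude Code": {"Claude Code": 4.6, "Codex": 4.0, "Kimi Code": 3.2},
    "Codex":       {"Claude Code": 2.9, "Codex": 3.9, "Kimi Code": 1.9},
    "Kimi Code":   {"Claude Code": 6.2, "Codex": 5.7, "Kimi Code": 5.0},
}

def reviewer_leniency(scores):
    """Mean score each reviewer hands out, averaged over authors."""
    return {reviewer: round(mean(by_author.values()), 1)
            for reviewer, by_author in scores.items()}
```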

    8. Time Analysis

    Time per stage
    Figure 11: Time breakdown per pipeline stage. Experiments dominate for all agents. Claude Code spends the most time overall (13.7h avg), Kimi Code is fastest (4.1h).
    Wall time
    Figure 12: Wall time distribution. Claude Code has a long tail with some runs exceeding 30+ hours due to experiment self-review revision loops.
    Finding 15: All three agents spend the majority of their time on experiments, where Claude Code (13.7h total) is roughly 3× slower than Kimi Code (4.1h) and 2× slower than Codex (6.9h). For ideation, Codex spends the most time, while for paper writing, Claude Code takes the longest. Notably, self-review stages across all three phases take only a very small share of the total time and are almost negligible compared with other stages.

    9. H100 GPU Scaling Experiments

    Do agents produce better research with more powerful hardware? We re-run all 8 GPU seeds on NVIDIA H100 (80GB) GPUs — nearly double the VRAM of the A6000 (48GB) used in our main experiments. All other settings remain identical. Codex runs 3 independent trials per seed (24 total).

    Overall Comparison: A6000 vs H100

Codex (A6000)  4.51   (GPU PR, n=24)
Codex (H100)   4.26   (GPU PR, n=24)
Delta          −0.25  (not significant)

    Per-Seed Breakdown (Peer Review, H100)

Seed                         Codex (H100)  Codex (A6000)
AI for Biology               4.89          4.67
Interp. of Learned Repr.     4.78          4.67
Computer Vision              4.33          4.33
Natural Language Processing  4.22          4.33
Privacy in ML                4.22          4.00
Supervised Repr. Learning    4.22          4.67
Generative Models            3.78          4.33
Datasets & Benchmarks        3.67          4.67
Average                      4.26          4.50

    Codex H100: 3 trials per seed, scores averaged. A6000 scores from main experiments (Section 5).

    Finding 16: Upgrading from A6000 (48GB) to H100 (80GB), Codex surprisingly scores lower on H100 (4.26 vs 4.50). Per-seed analysis reveals no consistent pattern — some seeds improve (e.g., AI Bio and Privacy), and some seeds degrade (e.g., Gen. Models and D&B). This suggests that computational resources are not the main bottleneck for "true auto research". Instead, as the earlier results indicate, the more important constraints lie in experiment design.

    10. Case Studies: Paper–Artifact Divergence

    In this section, we present six illustrative case studies that support the conclusions of our manual inspection. Each case highlights a distinct failure pattern we observed across the 117 agent-generated papers, spans all three agents, and embeds the actual paper PDF so readers can inspect the evidence directly.

    Case 1 · Claude Code

    When Do Causal Discovery Algorithms Disagree? Diagnosing Assumption Violations via Per-Edge Profiling

    Claude Code can occasionally produce a paper with a real insight, but weak evidence still prevents it from being convincing.

    The paper offers a genuinely useful observation: distributional diagnostics can be detected more reliably, while structural diagnostics remain close to chance level. This kind of asymmetry is a meaningful takeaway and shows that agent-generated papers can still surface nontrivial empirical insights. However, the paper ultimately remains unconvincing because the evidence is weak. The results are not strong, the experimental settings appear chaotic in the artifacts, and the paper sometimes highlights its own method even when it is not the best or second-best. In addition, some key notions are not clearly grounded, which further weakens the paper's faithfulness.


    Case 2 · Claude Code

    The Algebra of Compiler Passes: An Empirical Study of Idempotency, Commutativity, and Convergence in LLVM Optimization Pipelines

    Claude Code may overclaim or present unsupported results when experiments are weak.

    The paper claims evaluation on 87 benchmarks, but the artifact only supports a 20-benchmark subset. It also contains reference errors and relies heavily on synthetic programs. As a result, the paper overstates both the scale and the practical value of its findings. This case supports our observation that when experiments fail to produce sufficiently strong evidence, agents may compensate by inflating claims or presenting unsupported results. It also shows why artifact-aware review is essential: the mismatch is not obvious from the paper alone, but becomes clear once the code and outputs are inspected.


    Case 3 · Codex

    Do Corruption-Family Text Residuals Help Zero-Shot CLIP? A Controlled Baseline Study

    Codex reduces fabrication partly by running much narrower experiments.

    Rather than producing large or ambitious evaluations, Codex tends to run controlled but very limited experiments. Here the study uses only a single frozen CLIP backbone on CIFAR-10, making the empirical scope too narrow to support broad conclusions. The idea is also close to prior prompt-based and unlabeled adaptation methods, so the novelty is modest. This case supports our claim that Codex's lower fabrication rate comes in part from being more conservative experimentally. However, that conservatism comes at a cost: the evidence is too limited, which leads to weaker papers overall.


    Case 4 · Kimi Code

    DU-VPT: Decomposed Uncertainty-Guided Visual Prompt Tuning for Test-Time Adaptation

    Kimi Code fabricates experimental results directly rather than actually running the experiments.

    The artifact contains hard-coded benchmark statistics, and the reported per-run metrics are generated by sampling around these constants rather than by real model outputs. The published results mirror those prewritten target values almost exactly. Moreover, several analyses claimed in the paper, including forgetting analysis and shift-type diagnosis accuracy, have no implementation or logs in the artifact. This case supports our conclusion that Kimi Code often appears to fabricate results directly rather than obtaining them through actual experiments.


    Case 5 · Kimi Code

    UniSched: A Critical Analysis of Simulation-Based Evaluation for CXL-Aware CPU Scheduling

    Even when Kimi Code produces code, the method and implementation often do not match.

    The reported results and settings do not align with the artifact, and the code itself contains clear problems, including implementation bugs in PMU-based task classification. The result files also show behaviors inconsistent with the paper's claims, such as nonzero migration counts where the method's story would suggest otherwise. Unlike Case 4, where the main issue is direct fabrication, this case shows that even when code exists, the implemented system often fails to correspond to the method described in the paper. It therefore supports our claim that Kimi's failures are not limited to fake numbers, but also include deeper mismatches between method, code, and evaluation.


    Case 6 · Claude Code

    Characterizing Operator Interaction Effects in Data Cleaning Pipelines

    A common failure is missing relevant baselines even when the paper substantially overlaps with prior work.

    This case illustrates a common failure across agent-generated papers: missing the most relevant prior baseline even when the proposed study substantially overlaps with it. Here, the paper is highly similar to ShapleyPipe, yet it does not cite or compare against that work. As a result, the evaluation is incomplete at its core: without the most relevant baseline, the paper cannot establish either novelty or empirical advantage convincingly. This case therefore supports our broader observation that many agent-generated papers compare mainly against older or easier baselines while overlooking the most important recent or closely related methods.


    11. Human Inspection: Findings & Implications

    A synthesis of what we learned after manually inspecting all 117 agent-generated papers, their artifacts, and their 351 peer reviews.

    What the papers look like

    We manually inspected all papers, together with their artifacts and reviews, and identified several recurring patterns. We categorize the papers into three groups: methods, empirical studies, and benchmark papers. Across these categories, method and benchmark papers are primarily heuristic or incremental extensions of prior work, while the empirical studies are generally limited in depth.

    Where each agent fails

    We find that, across all agents, the main weakness lies in experimental design and rigor.

    • Claude Code conducts the largest number of experiments, but these evaluations are still narrow in scope, often missing comparisons with recent baselines (Case 6). Moreover, it sometimes appears to fabricate results when experiments fail or produce unsatisfactory outcomes (Case 2).
    • Codex runs experiments at a much smaller scale than the other agents — for example, on only one small dataset with a single model (Case 3). This reduces its hallucination and fabrication rate, but it also leads to lower scores because the empirical evaluation is too limited.
    • Kimi Code, by contrast, often appears to fabricate experimental results outright rather than actually running the experiments (Case 4). Even when it produces code, there is frequently a mismatch between the proposed method and the implementation, and its experiments fail at a high rate (Case 5).

    More broadly, all agents tend to compare against older methods rather than recent ones, even when newer baselines are mentioned in the related work section.

    What the reviewers catch (and miss)

    For the reviews, we find that both the Stanford Agentic Reviewer (SAR) and our peer-review protocol (PR) tend to view negative results favorably, often treating them as a sign of honesty and a strength rather than a weakness. In assessing novelty, SAR is clearly weaker than PR at searching for related work, which leads it to assign consistently higher novelty scores. Since SAR cannot inspect artifacts, it almost never questions the validity of experimental results; instead, it mainly checks for inconsistencies between claims made in different parts of the paper and those reported in the experimental section.

    For PR, Codex identifies the largest number of fabricated results, with Claude Code second and Kimi Code flagging the fewest. We manually verified that most of the issues raised by Codex and Claude Code are valid. In contrast, Kimi Code performs the most superficial integrity checks and occasionally produces false positives. Overall, the inability to inspect code artifacts causes SAR to substantially overrate auto-research papers compared with PR, indicating that artifact inspection should be a first-class component of future agentic reviewer systems. Finally, most citation problems are not wholly invented references but errors in author names or conference venues; the cited papers themselves usually exist.

    Where this leaves us

    Overall, in terms of paper quality, all current agents still fall well short of the bar for top-tier venues. That said, some papers do offer useful insights, particularly the stronger cases such as Case 1. A central open challenge is how to effectively harness agents to design and carry out comprehensive experiments. At the same time, our findings raise broader concerns about the faithfulness of current frontier models.

    References

    1. Andrej Karpathy. "Auto Research." 2025. github.com/karpathy/autoresearch
    2. Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." arXiv:2408.06292, 2024.
    3. Analemma Intelligence. "Introducing FARS: Fully Automated Research System." 2025. analemma.ai/blog/introducing-fars
    4. Anthropic. "Claude Opus 4.6." 2026. anthropic.com/news/claude-opus-4-6
    5. Anthropic. "Claude Code." code.claude.com
    6. OpenAI. "Introducing GPT-5.4." 2026. openai.com/index/introducing-gpt-5-4
    7. OpenAI. "Codex." openai.com/codex
    8. Moonshot AI. "Kimi K2.5." kimi.com/ai-models/kimi-k2-5
    9. Moonshot AI. "Kimi Code." kimi.com/code
    10. Stanford ML Group. "Stanford Agentic Reviewer." paperreview.ai