How Far Are We From True Auto Research?

An in-depth analysis of frontier CLI-agents conducting end-to-end research across diverse fields and compute resources

Claude Code (Opus 4.6) · Codex (GPT-5.4) · Kimi Code (K2.5)

Zhengxin Zhang*, Ning Wang*, Sainyam Galhotra, Claire Cardie
Cornell University • *Order determined by coin flip

117 agent papers • 13 research domains


1. Introduction

Recent years have seen a wave of auto-research systems, including Auto Research [1], AI Scientist [2], and Analemma's FARS [3]. These projects have shown impressive progress in automated end-to-end scientific research. Rather than proposing another complex framework, we study a more direct setting: whether widely used general-purpose CLI agents (e.g., Claude Code, Codex, Kimi Code) can carry out research with only minimal guidance. Moreover, no comprehensive evaluation exists of the quality of agent-conducted research across diverse domains or under varying computational constraints (e.g., CPU-only vs. GPU environments), which makes it difficult to systematically assess the true capabilities and limitations of such agents in realistic research scenarios.

We evaluate three off-the-shelf CLI agents — Claude Code with Opus 4.6 [4, 5], Codex with GPT-5.4 [6, 7], and Kimi Code with K2.5 [8, 9] — using a minimal scaffold on 13 diverse CS domains (5 CPU-only + 8 GPU environments). We employ a peer-review protocol where three CLI agents evaluate each paper alongside its code, and additionally use the Stanford Agentic Reviewer [10] for external validation.

Key Findings

Each CLI agent develops a distinct research persona: Claude Code is the full-stack researcher — balancing both strategies (46% empirical, 33% novel methods) with the longest, most comprehensive papers. Codex is the empirical scientist — 87% empirical studies, highest integrity, but lower novelty. Kimi Code is the system builder — 79% of its papers are framed as novel methods with acronym-heavy names, yet it still scores low on novelty because many of its ideas sound more like repackaged existing work than genuinely new contributions.

Stanford Agentic Review (SAR) (0–10, ICLR scale): Claude Code 5.45 > FARS 5.06 > Codex 4.93 > Kimi Code 4.24 (Section 4).

Peer Review (PR) (0–10, ICLR scale): Claude Code 4.59 > Codex 4.51 > Kimi Code 3.38 (Section 5).

Human Inspection:

  • We further confirm that experimental rigor is the number-one weakness: agents fail to plan, write, and execute experiments well, which limits the significance and scope of the papers. Fabricated results and an undue preference for negative findings are the main reasons the agentic reviewer falls short in its reviews, raising faithfulness concerns for frontier models. In conclusion, although some papers offer genuine insights, all of them fall far behind top-tier venues: there is still a long way to go for true auto research.

2. Framework Overview

    Each CLI agent follows a standardized pipeline:

    1. Ideation — Generate a research idea and experiment plan; self-review for up to 3 iterations.
    2. Experiments — Write and execute code, collect results; self-review for up to 3 iterations.
    3. Paper Writing — Produce a paper; self-review for up to 3 iterations.
    4. Review — Evaluate via Stanford Agentic Reviewer and triple peer review (all three agents review each paper alongside its code).
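The per-stage self-review gate above can be sketched as a small driver loop. This is a minimal illustration only: `stage_fn`, `review_fn`, and the 7.0 acceptance threshold are hypothetical names and values, not details specified in this report.

```python
def run_with_self_review(stage_fn, review_fn, max_iters=3, threshold=7.0):
    """Run one pipeline stage (ideation, experiments, or paper writing),
    then self-review and revise up to max_iters times.

    stage_fn(feedback) -> artifact; review_fn(artifact) -> (score, feedback).
    All names and the threshold are illustrative assumptions.
    """
    artifact = stage_fn(feedback=None)          # initial attempt
    for _ in range(max_iters):
        score, feedback = review_fn(artifact)
        if score >= threshold:                  # gate passed: stop revising
            break
        artifact = stage_fn(feedback=feedback)  # revise using the critique
    return artifact
```

The same loop is reused at each of the three stages, which is why the tables in Section 7 report revision outcomes per gate.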

    3. Experiment Setup

Agent        Model     CPU Seeds  GPU Seeds  Trials/Seed  Total Papers
Claude Code  Opus 4.6  5          8          3            39
Codex        GPT-5.4   5          8          3            39
Kimi Code    K2.5      5          8          3            39

    CPU seeds: Causal Learning, Compiler Optimization, Data Integration & Cleaning, Operating System Design, Probabilistic Methods

    GPU seeds: AI for Biology, Computer Vision, Datasets & Benchmarks, Generative Models, Interpretability, NLP, Privacy in ML, Supervised Representation Learning

Hardware: CPU-only or 1× NVIDIA RTX A6000 (48GB). Each agent gets 4 CPUs and 60GB RAM. We additionally re-run the GPU seeds on NVIDIA H100 (80GB) GPUs (Section 9).

    Peer review (PR): Every paper is reviewed by all three agents (Claude Code, Codex, Kimi Code), scoring on 9 dimensions (novelty, soundness, significance, clarity, reproducibility, experimental rigor, references, reference integrity, results integrity) on a 0–10 ICLR scale.
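One plausible way to turn these per-dimension scores into a single PR score per paper is a mean over the 9 dimensions for each reviewer, then a mean across the three reviewers; the exact aggregation formula is our assumption, not stated in the text.

```python
from statistics import mean

DIMENSIONS = [
    "novelty", "soundness", "significance", "clarity", "reproducibility",
    "experimental rigor", "references", "reference integrity", "results integrity",
]

def pr_score(reviews):
    """Aggregate peer review: per-reviewer mean over the 9 dimensions,
    then mean across reviewers. `reviews` maps reviewer name to a dict of
    {dimension: score on the 0-10 ICLR scale}. The two-level mean is an
    assumed aggregation scheme."""
    per_reviewer = [mean(r[d] for d in DIMENSIONS) for r in reviews.values()]
    return mean(per_reviewer)
```

Averaging per reviewer first keeps a strict and a lenient reviewer equally weighted even if a reviewer skips a dimension in some variant of the protocol.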

Stanford Agentic Review (SAR): Every paper is reviewed by the Stanford Agentic Reviewer, which provides ICLR-calibrated scores on a 0–10 scale.

    4. Main Results on Stanford Agentic Review

    Automated Research Systems Comparison

    We evaluate the performance of three CLI agents on the Stanford Agentic Reviewer (SAR). For comparison, we also collect scores from Analemma's FARS, which reports SAR scores for a total of 102 papers.

Claude Code  5.45  (n=39, $200)
FARS         5.06  (n=102, $186K)
Codex        4.93  (n=39, $200)
Kimi Code    4.24  (n=39, $100)
    Score Distribution
    Figure 1: Score distributions of automated research systems on the SAR. Claude Code has the highest median and tightest distribution above 5.0.
System       n    SAR μ  SAR σ  Min  Max  ≥5.0
Claude Code  39   5.45   0.70   3.1  6.3  82%
FARS         102  5.06   0.62   3.0  6.3  76%
Codex        39   4.93   0.85   2.7  6.3  64%
Kimi Code    39   4.24   0.84   2.0  5.3  26%
    Finding 1: Claude Code achieves the highest performance among automated research systems (5.45 > FARS 5.06 > Codex 4.93 > Kimi Code 4.24). Claude Code outperforms Analemma's FARS despite using only a $200 Max subscription vs. FARS's $186,000 budget, suggesting that complex specialized auto-research frameworks are unnecessary — a general-purpose CLI agent with minimal scaffolding already surpasses them.

    Comparison with Human-Authored ICLR 2025 Papers

To calibrate these scores against human-authored research, we additionally submitted 200 real ICLR 2025 papers (100 accepted, 100 rejected) to the same Stanford Agentic Reviewer, establishing a human baseline.

ICLR Accepted  5.59  (n=100, Human=6.54)
Claude Code    5.45  (n=39)
ICLR Weighted  5.42  (32% acc / 68% rej)
ICLR Rejected  5.34  (n=100, Human=5.02)
System                        n    SAR μ  SAR σ  Human μ  Human σ
ICLR 2025 Accepted            100  5.59   0.59   6.54     0.80
Claude Code                   39   5.45   0.70   n/a      n/a
ICLR 2025 Weighted (32%/68%)  200  5.42   0.70   5.50     0.81
ICLR 2025 Rejected            100  5.34   0.75   5.02     0.81

    SAR = Stanford Agentic Reviewer score. Human = average OpenReview score from ICLR 2025 human reviewers.

    Finding 2: Claude Code achieves performance comparable to the average ICLR 2025 submission (SAR: 5.45 vs. 5.42), but remains below the level of accepted papers (5.59), indicating a non-trivial gap to top-tier research quality. More importantly, SAR exhibits substantially weaker discriminative ability than human reviewers. Human reviewers show a 1.52-point gap between accepted and rejected papers (6.54 vs. 5.02), whereas SAR compresses this gap to only 0.25 points (5.59 vs. 5.34).
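The weighted baseline follows directly from the 32%/68% acceptance split. Mixing the per-group SAR means reproduces the reported 5.42, and the same mixture of the human means gives 5.51, close to the table's 5.50 (the small difference presumably reflects rounding of the reported group means):

```python
# 32% of ICLR 2025 submissions accepted, 68% rejected (per the table above)
w_acc, w_rej = 0.32, 0.68

sar_weighted = w_acc * 5.59 + w_rej * 5.34     # weighted SAR mean
human_weighted = w_acc * 6.54 + w_rej * 5.02   # weighted human mean

print(round(sar_weighted, 2))    # 5.42
print(round(human_weighted, 2))  # 5.51
```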

    Score Heatmap (Seed × Trial)

    SAR score heatmap
    Figure 2: SAR score heatmap across all seeds and trials for each agent.

    Per-Domain Breakdown

    We break down SAR scores across 13 research domains (5 CPU-only, 8 GPU) and compare CPU vs GPU performance for each agent.

    Per-domain Stanford scores
    Figure 3: SAR scores by research domain, ranked by average score across agents.

    CPU vs GPU Performance (SAR)

Agent        CPU   GPU   Gap
Claude Code  5.49  5.42  +0.07
Codex        4.55  5.16  −0.61
Kimi Code    4.12  4.32  −0.20
Finding 3: Performance varies substantially across domains: Datasets & Benchmarks scores highest (5.79) while Data Integration & Cleaning scores lowest (4.39), a 1.40-point spread. The CPU vs GPU gap is agent-dependent: Claude Code performs nearly identically across platforms (+0.07), while Codex scores significantly higher on GPU tasks (5.16 vs 4.55, gap = −0.61). However, under PR all agents perform better on CPU tasks than on GPU tasks (see Section 5).

    Qualitative Review Analysis

    We analyze the text of Stanford reviews to identify common weakness themes via keyword matching across 7 categories: Experimental Design (baselines, ablations, evaluation, statistical rigor), Overclaiming (unsupported or exaggerated claims), Scope & Scale (limited or toy experiments), Clarity & Presentation (unclear writing, confusing notation), Results Inconsistency (contradictions or mismatches within the paper), Reference Quality (fake citations, placeholder references, future dates), and Novelty (incremental or trivial contributions). We also measure the balance between strengths and weaknesses across agents.

    Weakness themes
    Figure 4: Weakness themes in SAR (7 categories).
Agent        Avg Strengths (words)  Avg Weaknesses (words)  Ratio (W/S)
Claude Code  215                    316                     1.47
Codex        212                    276                     1.30
Kimi Code    184                    339                     1.85
    Finding 4: Experimental design is the dominant weakness across all agents, with Kimi Code receiving the most criticism (256 mentions). Missing baselines and comparisons are a major driver of this weakness. Kimi Code also leads in overclaiming and results inconsistency (79 vs Claude Code's 44 and Codex's 27; 54 vs Codex's 17), while Codex is flagged most for novelty & scope (66) — consistent with its empiricist research style. Meanwhile, Kimi Code shows a much larger weakness-to-strength ratio (1.85× vs 1.30× for Codex), reflecting the overall quality gap.
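The theme tagging can be approximated with simple substring matching. The keyword lists below are illustrative stand-ins; the actual lexicon used for Figure 4 is not published here.

```python
# Illustrative keyword lists per weakness theme (assumed, not the real lexicon).
THEMES = {
    "Experimental Design": ["baseline", "ablation", "evaluation", "statistical"],
    "Overclaiming": ["overclaim", "unsupported", "exaggerat"],
    "Scope & Scale": ["toy", "limited scope", "small-scale"],
    "Clarity & Presentation": ["unclear", "confusing", "notation"],
    "Results Inconsistency": ["inconsisten", "contradict", "mismatch"],
    "Reference Quality": ["fake citation", "placeholder reference"],
    "Novelty": ["incremental", "trivial"],
}

def count_weakness_themes(review_text):
    """Count keyword hits per theme in the weaknesses text of a review.
    Substring matching (e.g. 'exaggerat' catches 'exaggerated') is a
    deliberate simplification."""
    text = review_text.lower()
    return {theme: sum(text.count(kw) for kw in kws)
            for theme, kws in THEMES.items()}
```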

    Human-Annotated SAR Decisions

We manually inspected all papers together with their SAR reviews and found that the overall assessment score is not a reliable indicator of SAR's final recommendation: some papers scoring above 6 were not recommended for acceptance, while others scoring around 4.5 were. Because of this inconsistency, we manually assigned accept or reject labels based on SAR's final recommendation rather than its numeric score. When a review does not explicitly state a decision, we inferred the recommendation from its overall tone, judging whether it leaned toward acceptance or rejection. We treat conditional accept, accept with revision, and borderline (accept) as accept decisions; all others are treated as rejections.

Category                      Accept  Reject  Accept %  Accept/Reject
ICLR 2025 Accepted            76      24      76.0%     3.17
ICLR 2025 Weighted (32%/68%)  59.7    40.3    59.7%     1.48
ICLR 2025 Rejected            52      48      52.0%     1.08
Claude Code                   16      23      41.0%     0.70
FARS (Analemma)               22      80      21.6%     0.28
Codex                         5       34      12.8%     0.15
Kimi Code                     2       37      5.1%      0.05
    Finding 5: The human-annotated SAR final-decision acceptance rates show a clear gap between human-authored and agent-generated papers. All human-written paper groups achieve higher acceptance rates: 52% for rejected ICLR papers and 76% for accepted ICLR papers, compared with only about 41% for the strongest agent-generated group. Among the agentic systems, this further confirms that Claude Code with a minimal scaffold substantially outperforms FARS papers (21.6%). Codex ranks next with 12.8%, and Kimi Code performs the worst at 5.1%.
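The labeling rule above can be written down directly; the tone-based inference for reviews with no explicit decision still requires human judgment and is not captured here.

```python
# Recommendation strings that count as acceptance under our protocol.
ACCEPT_VARIANTS = {
    "accept", "conditional accept", "accept with revision", "borderline (accept)",
}

def label_decision(recommendation: str) -> str:
    """Map an explicit SAR recommendation string to an accept/reject label:
    conditional accept, accept with revision, and borderline (accept) all
    count as accept; everything else is treated as a rejection."""
    rec = recommendation.strip().lower()
    return "accept" if rec in ACCEPT_VARIANTS else "reject"
```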

    5. Main Results on Peer Review

    We design a peer review (PR) protocol where each of the three CLI agents reviews every paper alongside its experiment code, execution logs, and results artifacts. Unlike SAR, which evaluates only the PDF, this enables reviewers to verify whether reported results match actual experimental outputs, detect incomplete experiments, and identify fabricated claims. Additionally, reviewers are required to check reference validity via tools, flagging hallucinated or nonexistent citations.

Claude Code  4.59 ±0.12 SE  (best: 6.0, 36% ≥ 5.0)
Codex        4.51 ±0.06 SE  (best: 5.3, 5% ≥ 5.0)
Kimi Code    3.38 ±0.16 SE  (best: 5.3, 3% ≥ 5.0)
    Peer Review Score Distribution
    Figure 5: PR score distributions. Claude Code and Codex cluster around 4–5, while Kimi Code has a wider spread with more low scores.

    Score Heatmap (Seed × Trial)

    Score heatmap
    Figure 6: PR score heatmap across all seeds and trials. Codex is the most consistent (std=0.32), while Kimi Code has the highest variance (std=0.80).

    Per-Domain Breakdown

    Per-domain peer review scores
    Figure 7: PR scores by research domain, ranked by average score across agents.
    Finding 6: PR scores drop substantially compared to SAR across all agents (Claude Code: 4.59 vs. 5.45, Codex: 4.51 vs. 4.93, Kimi Code: 3.38 vs. 4.24). Claude Code achieves the highest scores overall and produces the single best paper (score = 6.0). Performance varies across domains with a 0.92-point spread. Codex is the most consistent across domains (std=0.21, range 4.22–5.00), while Claude Code (std=0.44, range 3.78–5.33) and Kimi Code (std=0.59, range 2.67–4.33) show much higher variance. Kimi Code struggles most on GPU-intensive domains (Generative Models, Datasets & Benchmarks, Supervised Repr. Learning: all 2.67).

    CPU vs GPU Performance

             Peer Review (PR)               Stanford (SAR)
Agent        CPU        GPU        Gap     CPU        GPU        Gap
Claude Code  4.73±0.20  4.50±0.10  +0.23   5.49±0.16  5.42±0.15  +0.07
Codex        4.53±0.07  4.50±0.05  +0.03   4.55±0.23  5.16±0.15  −0.61
Kimi Code    3.69±0.22  3.19±0.18  +0.50   4.12±0.23  4.32±0.16  −0.20
    Finding 7: PR and SAR show opposite CPU/GPU trends. Under PR, all agents score higher on CPU tasks (especially Kimi Code: +0.50). Under SAR, Codex and Kimi Code score higher on GPU tasks (Codex: −0.61). GPU domains (vision, NLP, generative models) are well-established fields where agents can produce better-looking papers, but GPU experiments are harder to execute correctly — CUDA issues, memory limits, and training instabilities lead to more incomplete runs and mismatched results when reviewers verify the code. CPU tasks are simpler to run and verify, yielding more reliable experiments. This divergence highlights that SAR alone is insufficient — it rewards presentation quality over experimental substance.

    Per-Dimension Review Breakdown

    Dimension scores
    Figure 8: Per-dimension PR scores. Codex leads on reproducibility (7.65), reference integrity (8.41), and results integrity (7.94). Claude Code leads on novelty (5.42) and significance (5.23). Kimi Code trails on all dimensions.
    Finding 8: Agents exhibit a clear divergence between creative and reliability-oriented dimensions. Claude Code performs best on creative aspects such as novelty (5.42) and significance (5.23), while Codex leads on the majority of reliability-related dimensions (7 out of 9), including reproducibility, reference integrity, and results integrity. This pattern is consistent with the different research strategies adopted by the agents (see Section 6). Across all agents, experimental rigor is surprisingly the weakest dimension (3.29–5.56) rather than novelty. Further analysis suggests that missing baselines and not fully executed experimental plans are the primary causes, despite having sufficient available compute budget (see Section 9).

    Integrity & Reference Analysis

    We further manually validate every paper with its artifacts and PRs across four integrity categories:

Category                   Claude Code  Codex      Kimi Code
Results mismatch only      6/39 (15%)   2/39 (5%)  4/39 (10%)
Setting mismatch only      10/39 (26%)  1/39 (3%)  5/39 (13%)
Both (results + setting)   12/39 (31%)  2/39 (5%)  30/39 (77%)
Fake reference             14/39 (36%)  3/39 (8%)  28/39 (72%)
Mean refs per paper        15.1         12.8       10.5
Reference integrity score  7.34         8.41       6.26
    Finding 9: Code-aware peer review detects integrity issues that SAR cannot. Codex identifies the largest number of fabricated results, with Claude Code ranking second, while Kimi Code flags the fewest. We manually verified that most of the issues raised by Codex and Claude are valid. In contrast, Kimi Code performs the most superficial integrity checks and occasionally produces false positives. Based on our manual annotation of all 117 papers, 77% of Kimi Code papers have both results and setting mismatches, and 72% contain hallucinated references. Claude Code is intermediate: 31% have both results + setting mismatch, and 36% contain fake references. Codex is the most trustworthy — only 5% have both mismatches and 8% have fake references, consistent with its cautious empiricist research style.

    Programming Language Usage

    We analyze the programming languages and file counts across all agent-generated codebases to understand how each agent structures its experiments.

    Claude languages
    Claude Code
    Codex languages
    Codex
    Kimi languages
    Kimi Code
    Finding 10: All three agents overwhelmingly default to Python (Claude Code 96.6%, Codex 100%, Kimi Code 99.2%), even in domains where other languages are typically used, such as operating systems. Claude Code and Kimi Code occasionally make use of shell scripts.

    6. Research Style Analysis

    Beyond scores, we compare how each CLI agent approaches research — what types of papers they write, how they structure arguments, and what their titles reveal about underlying research strategy. This comparative analysis shows that the three agents have developed fundamentally different research personas, which in turn explain many of the quality differences observed in Sections 4–5.

    Research Type Distribution

    We manually classify each paper based on its title and abstract into four categories: novel method (proposes a new named algorithm or system), benchmarking (creates evaluation tools), extension (improves existing methods), and empirical study (studies without proposing new methods).

Type                   Claude Code  Codex     Kimi Code
Novel method           13 (33%)     3 (8%)    31 (79%)
New benchmark          3 (8%)       0 (0%)    4 (10%)
Extension/improvement  5 (13%)      2 (5%)    0 (0%)
Empirical study        18 (46%)     34 (87%)  4 (10%)

    Title Structure Analysis

Metric                            Claude Code  Codex     Kimi Code
Avg title length                  102 chars    99 chars  92 chars
Question titles                   10%          28%       0%
Colon structure (Name: Subtitle)  74%          46%       85%
Contains acronym (2+ caps)        15%          49%       51%
Named method (starts with name)   21%          38%       46%
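These title metrics are easy to recompute. The sketch below shows our reading of each column; in particular, "Contains acronym (2+ caps)" is interpreted as any run of two or more capital letters, which is an assumption.

```python
import re

def title_metrics(title: str) -> dict:
    """Compute the title-structure metrics from the table above for one
    title. The acronym test (a run of 2+ capitals) is our interpretation
    of 'Contains acronym (2+ caps)'."""
    return {
        "length": len(title),                              # avg title length
        "question": title.rstrip().endswith("?"),          # question titles
        "colon_structure": ": " in title,                  # Name: Subtitle
        "acronym": bool(re.search(r"[A-Z]{2,}", title)),   # 2+ caps in a row
    }
```

For example, `title_metrics("CAGER: Causal Geometric Explanation Recovery")` flags both the colon structure and the acronym, matching Kimi Code's dominant style.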

    Title Vocabulary

    The most frequent words in each agent's titles reveal distinct research personas:

Agent        Top Keywords                                              Personality
Claude Code  learning, when, causal, adaptive, pipelines, contrastive  Full-stack researcher (spreads across research types)
Codex        study, benchmark, negative, matched, controlled, pilot    Empirical scientist (controlled studies)
Kimi Code    adaptive, aware, guided, dynamic, gradient, contrastive   System builder (named frameworks)

    Title Patterns

    Each agent has a distinct title style:

Codex (question-based, hypothesis-testing; 28% of titles use question marks, the highest of any agent):
  "Do Shared Decoders Improve Prototype-Edit Reusability?"
  "When Does Clarification Supervision Transfer to Formal Reasoning?"
  "How Much Signal Is in Early Training Trajectories?"

Claude Code (descriptive analysis; unique "The X of Y" essayistic framing):
  "The Algebra of Compiler Passes: An Empirical Study of Idempotency"
  "The Bandwidth Knapsack: Optimal Migration Scheduling"
  "The Functional Anatomy of Sparse Features in Language Models"

Kimi Code (acronym-heavy named frameworks; zero question titles, always declarative):
  "CAGER: Causal Geometric Explanation Recovery"
  "DU-VPT: Decomposed Uncertainty-Guided Visual Prompt Tuning"
  "VAST: Velocity-Adaptive Spatially-varying Timesteps"
    Finding 11: Research-type distribution and title style reveal three distinct research strategies. Codex is overwhelmingly empirical (87%), and its titles are frequently framed as questions, controlled studies, or benchmark-style evaluations. This helps explain why Codex performs better on integrity-related dimensions, yet still lags behind Claude Code on novelty and significance. Kimi Code, by contrast, is dominated by named methods (79%) and acronym-led titles, reinforcing its system-builder profile. But its reference base is relatively dated — 41% of citations are from before 2022 — which likely weakens genuine novelty, making many proposals read more like repackaging than truly new ideas. Claude Code is more balanced across empirical studies and method papers (46% empirical, 33% novel methods), with lower acronym use and broader analytical framing, making it closer to a full-stack researcher with the most diversified research portfolio.

    Paper Structure

    How do the papers differ in structure and technical depth?

Metric                  Claude Code  Codex  Kimi Code
Paper length (words)    4,023        3,421  2,461
Method section (words)  572          531    394
Equations               3.8          2.3    4.0
Figures                 4.8          4.1    0.8
Tables                  6.0          4.2    4.0
Algorithm blocks        0.6          0.0    0.6
Theorems / proofs       0.3          0.0    0.4
Complexity analysis     77%          10%    64%

    All 39 papers per agent analyzed (117 total).

    Finding 12: The three agents write fundamentally different papers. Claude Code writes the longest papers with the most figures and tables, comprehensive and visually rich. Codex writes moderate-length papers with the fewest equations, no algorithm blocks, and no theorems, consistent with its empirical style. Kimi Code writes the shortest papers but relies most heavily on formal surface cues, using the most equations and the strongest theorem-style presentation despite offering very little visual grounding.

    7. Reviewer Analysis

    Our pipeline uses two layers of review. First, each agent self-reviews its own work at every stage (ideation, experiments, paper writing) for up to 3 revision rounds. Second, all three agents cross-review every paper in a peer-review setup. Below we examine whether self-review revision actually improves quality, and how much bias the peer reviewers introduce.

    Self-Review Revision Effectiveness

    When an agent scores below the self-review threshold, it revises and re-evaluates. The table below tracks whether scores improve, stay the same, or decline after revision.

Agent / Gate              Improved  Same  Declined  Avg Delta
Claude Code / Idea        100%      0%    0%        +2.1
Claude Code / Experiment  88%       8%    4%        +2.2
Claude Code / Paper       43%       34%   23%       +0.0
Codex / Idea              35%       51%   15%       +0.4
Codex / Experiment        78%       20%   2%        +2.0
Codex / Paper             64%       29%   7%        +1.5
Kimi Code / Idea          100%      0%    0%        +3.0
Kimi Code / Experiment    91%       9%    0%        +2.2
Kimi Code / Paper         61%       34%   5%        +1.6
Finding 13: Self-review is highly effective for Claude Code and Kimi Code in ideation (100% improvement, avg. +2.1 to +3.0) and experiments (88–91% improvement, avg. +2.2), but shows limited gains for paper writing. This is expected, as paper quality is largely constrained by the underlying ideas and experimental results. In contrast, Codex exhibits a different pattern: it shows relatively modest improvements in ideation, but stronger gains in both experiments and paper writing.
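The revision-tracking statistics in the table can be derived from (score before, score after) pairs; a minimal sketch, assuming one pair per revision round:

```python
def revision_outcomes(score_pairs):
    """Summarize self-review revisions from (before, after) score pairs:
    fraction improved / same / declined, plus the average score delta."""
    n = len(score_pairs)
    deltas = [after - before for before, after in score_pairs]
    return {
        "improved": sum(d > 0 for d in deltas) / n,
        "same": sum(d == 0 for d in deltas) / n,
        "declined": sum(d < 0 for d in deltas) / n,
        "avg_delta": sum(deltas) / n,
    }
```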

    Peer Review Bias

    Each paper is reviewed by all three agents. The score distributions reveal large systematic differences between reviewers.

    Reviewer scores
    Figure 9: How each reviewer scores papers from each agent (all 117 papers). Codex is the strictest reviewer (μ=1.9 for Kimi papers), Kimi Code is the most lenient (μ=6.2 for Claude papers).
Reviewer ↓ / Agent →  Claude Code  Codex  Kimi Code  Avg Given
Claude Code           4.6          4.0    3.2        3.9
Codex                 2.9          3.9    1.9        2.9
Kimi Code             6.2          5.7    5.0        5.7
    Finding 14: Reviewer severity differs dramatically across agents. Codex is the strictest reviewer (avg. 2.9), while Kimi Code is the most generous (avg. 5.7), scoring the same set of papers nearly 2× higher. Self-review is not uniformly inflated across agents: Claude Code and Codex both give their own papers their highest average scores (4.6 and 3.9), but Kimi Code does not — it scores Claude Code highest (6.2) and its own papers lower (5.0). Even so, Kimi Code's ratings remain substantially more lenient overall, including toward its own work, despite producing the weakest papers on average (PR 3.38). This points to a broader generation-verification gap and shows why multi-reviewer aggregation is necessary.
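Reviewer severity can be read off the cross-review matrix as a row mean. Note that the table's "Avg Given" column is computed over all 117 papers, so a uniform mean over the three per-author averages can differ slightly (e.g., Kimi Code's row mean is 5.6 vs. the reported 5.7):

```python
from statistics import mean

# Cross-review matrix (reviewer -> {author: mean PR score}), from Figure 9.
SCORES = {
    "Claude Code": {"Claude Code": 4.6, "Codex": 4.0, "Kimi Code": 3.2},
    "Codex":       {"Claude Code": 2.9, "Codex": 3.9, "Kimi Code": 1.9},
    "Kimi Code":   {"Claude Code": 6.2, "Codex": 5.7, "Kimi Code": 5.0},
}

def reviewer_leniency(scores):
    """Mean score each reviewer hands out, averaged over authors."""
    return {reviewer: round(mean(by_author.values()), 1)
            for reviewer, by_author in scores.items()}
```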

    8. Time Analysis

    Time per stage
    Figure 11: Time breakdown per pipeline stage. Experiments dominate for all agents. Claude Code spends the most time overall (13.7h avg), Kimi Code is fastest (4.1h).
    Wall time
    Figure 12: Wall time distribution. Claude Code has a long tail with some runs exceeding 30+ hours due to experiment self-review revision loops.
    Finding 15: All three agents spend the majority of their time on experiments, where Claude Code (13.7h total) is roughly 3× slower than Kimi Code (4.1h) and 2× slower than Codex (6.9h). For ideation, Codex spends the most time, while for paper writing, Claude Code takes the longest. Notably, self-review stages across all three phases take only a very small share of the total time and are almost negligible compared with other stages.

    9. H100 GPU Scaling Experiments

    Do agents produce better research with more powerful hardware? We re-run all 8 GPU seeds on NVIDIA H100 (80GB) GPUs — nearly double the VRAM of the A6000 (48GB) used in our main experiments. All other settings remain identical. Codex runs 3 independent trials per seed (24 total).

    Overall Comparison: A6000 vs H100

Codex (A6000)  4.51   (GPU PR, n=24)
Codex (H100)   4.26   (GPU PR, n=24)
Delta          −0.25  (not significant)

    Per-Seed Breakdown (Peer Review, H100)

Seed                         Codex (H100)  Codex (A6000)
AI for Biology               4.89          4.67
Interp. of Learned Repr.     4.78          4.67
Computer Vision              4.33          4.33
Natural Language Processing  4.22          4.33
Privacy in ML                4.22          4.00
Supervised Repr. Learning    4.22          4.67
Generative Models            3.78          4.33
Datasets & Benchmarks        3.67          4.67
Average                      4.26          4.50

    Codex H100: 3 trials per seed, scores averaged. A6000 scores from main experiments (Section 5).

    Finding 16: Upgrading from A6000 (48GB) to H100 (80GB), Codex surprisingly scores lower on H100 (4.26 vs 4.50). Per-seed analysis reveals no consistent pattern — some seeds improve (e.g., AI Bio and Privacy), and some seeds degrade (e.g., Gen. Models and D&B). This suggests that computational resources are not the main bottleneck for "true auto research". Instead, as the earlier results indicate, the more important constraints lie in experiment design.

    10. Case Studies: Paper–Artifact Divergence

    In this section, we present six illustrative case studies that support the conclusions of our manual inspection. Each case highlights a distinct failure pattern we observed across the 117 agent-generated papers, spans all three agents, and embeds the actual paper PDF so readers can inspect the evidence directly.

    Case 1 · Claude Code

    When Do Causal Discovery Algorithms Disagree? Diagnosing Assumption Violations via Per-Edge Profiling

    Claude Code can occasionally produce a paper with a real insight, but weak evidence still prevents it from being convincing.

    The paper offers a genuinely useful observation: distributional diagnostics can be detected more reliably, while structural diagnostics remain close to chance level. This kind of asymmetry is a meaningful takeaway and shows that agent-generated papers can still surface nontrivial empirical insights. However, the paper ultimately remains unconvincing because the evidence is weak. The results are not strong, the experimental settings appear chaotic in the artifacts, and the paper sometimes highlights its own method even when it is not the best or second-best. In addition, some key notions are not clearly grounded, which further weakens the paper's faithfulness.


    Case 2 · Claude Code

    The Algebra of Compiler Passes: An Empirical Study of Idempotency, Commutativity, and Convergence in LLVM Optimization Pipelines

    Claude Code may overclaim or present unsupported results when experiments are weak.

    The paper claims evaluation on 87 benchmarks, but the artifact only supports a 20-benchmark subset. It also contains reference errors and relies heavily on synthetic programs. As a result, the paper overstates both the scale and the practical value of its findings. This case supports our observation that when experiments fail to produce sufficiently strong evidence, agents may compensate by inflating claims or presenting unsupported results. It also shows why artifact-aware review is essential: the mismatch is not obvious from the paper alone, but becomes clear once the code and outputs are inspected.


    Case 3 · Codex

    Do Corruption-Family Text Residuals Help Zero-Shot CLIP? A Controlled Baseline Study

    Codex reduces fabrication partly by running much narrower experiments.

    Rather than producing large or ambitious evaluations, Codex tends to run controlled but very limited experiments. Here the study uses only a single frozen CLIP backbone on CIFAR-10, making the empirical scope too narrow to support broad conclusions. The idea is also close to prior prompt-based and unlabeled adaptation methods, so the novelty is modest. This case supports our claim that Codex's lower fabrication rate comes in part from being more conservative experimentally. However, that conservatism comes at a cost: the evidence is too limited, which leads to weaker papers overall.


    Case 4 · Kimi Code

    DU-VPT: Decomposed Uncertainty-Guided Visual Prompt Tuning for Test-Time Adaptation

    Kimi Code fabricates experimental results directly rather than actually running the experiments.

    The artifact contains hard-coded benchmark statistics, and the reported per-run metrics are generated by sampling around these constants rather than by real model outputs. The published results mirror those prewritten target values almost exactly. Moreover, several analyses claimed in the paper, including forgetting analysis and shift-type diagnosis accuracy, have no implementation or logs in the artifact. This case supports our conclusion that Kimi Code often appears to fabricate results directly rather than obtaining them through actual experiments.


    Case 5 · Kimi Code

    UniSched: A Critical Analysis of Simulation-Based Evaluation for CXL-Aware CPU Scheduling

    Even when Kimi Code produces code, the method and implementation often do not match.

    The reported results and settings do not align with the artifact, and the code itself contains clear problems, including implementation bugs in PMU-based task classification. The result files also show behaviors inconsistent with the paper's claims, such as nonzero migration counts where the method's story would suggest otherwise. Unlike Case 4, where the main issue is direct fabrication, this case shows that even when code exists, the implemented system often fails to correspond to the method described in the paper. It therefore supports our claim that Kimi's failures are not limited to fake numbers, but also include deeper mismatches between method, code, and evaluation.


    Case 6 · Claude Code

    Characterizing Operator Interaction Effects in Data Cleaning Pipelines

    A common failure is missing relevant baselines even when the paper substantially overlaps with prior work.

    This case illustrates a common failure across agent-generated papers: missing the most relevant prior baseline even when the proposed study substantially overlaps with it. Here, the paper is highly similar to ShapleyPipe, yet it does not cite or compare against that work. As a result, the evaluation is incomplete at its core: without the most relevant baseline, the paper cannot establish either novelty or empirical advantage convincingly. This case therefore supports our broader observation that many agent-generated papers compare mainly against older or easier baselines while overlooking the most important recent or closely related methods.


    11. Human Inspection: Findings & Implications

    A synthesis of what we learned after manually inspecting all 117 agent-generated papers, their artifacts, and their 351 peer reviews.

    What the papers look like

    We manually inspected all papers, together with their artifacts and reviews, and identified several recurring patterns. We categorize the papers into three groups: methods, empirical studies, and benchmark papers. Across these categories, method and benchmark papers are primarily heuristic or incremental extensions of prior work, while the empirical studies are generally limited in depth.

    Where each agent fails

    We find that, across all agents, the main weakness lies in experimental design and rigor.

    • Claude Code conducts the largest number of experiments, but these evaluations are still narrow in scope, often missing comparisons with recent baselines (Case 6). Moreover, it sometimes appears to fabricate results when experiments fail or produce unsatisfactory outcomes (Case 2).
    • Codex runs experiments at a much smaller scale than the other agents — for example, on only one small dataset with a single model (Case 3). This reduces its hallucination and fabrication rate, but it also leads to lower scores because the empirical evaluation is too limited.
    • Kimi Code, by contrast, often appears to fabricate experimental results outright rather than actually running the experiments (Case 4). Even when it produces code, there is frequently a mismatch between the proposed method and the implementation, and its experiments fail at a high rate (Case 5).

    More broadly, all agents tend to compare against older methods rather than recent ones, even when newer baselines are mentioned in the related work section.

    What the reviewers catch (and miss)

    For the reviews, we find that both the Stanford Agentic Reviewer (SAR) and our peer-review protocol (PR) tend to view negative results favorably, often treating them as a sign of honesty and a strength rather than a weakness. In assessing novelty, SAR is clearly weaker than PR at searching for related work, which leads it to assign consistently higher novelty scores. Since SAR cannot inspect artifacts, it almost never questions the validity of experimental results; instead, it mainly checks for inconsistencies between claims made in different parts of the paper and those reported in the experimental section.

    For PR, Codex identifies the largest number of fabricated results, with Claude Code second and Kimi Code flagging the fewest. We manually verified that most of the issues raised by Codex and Claude Code are valid. In contrast, Kimi Code performs the most superficial integrity checks and occasionally produces false positives. Overall, the inability to inspect code artifacts causes SAR to substantially overrate auto-research papers compared with PR, indicating that artifact inspection should be a first-class component of future agentic reviewer systems. Finally, most citation problems are not wholly invented references but errors in author names or conference venues; the cited papers themselves usually exist.

    Where this leaves us

    Overall, in terms of paper quality, all current agents still fall well short of the bar for top-tier venues. That said, some papers do offer useful insights, particularly the stronger cases such as Case 1. A central open challenge is how to effectively harness agents to design and carry out comprehensive experiments. At the same time, our findings raise broader concerns about the faithfulness of current frontier models.

    References

    1. Andrej Karpathy. "Auto Research." 2025. github.com/karpathy/autoresearch
    2. Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." arXiv:2408.06292, 2024.
    3. Analemma Intelligence. "Introducing FARS: Fully Automated Research System." 2025. analemma.ai/blog/introducing-fars
    4. Anthropic. "Claude Opus 4.6." 2026. anthropic.com/news/claude-opus-4-6
    5. Anthropic. "Claude Code." code.claude.com
    6. OpenAI. "Introducing GPT-5.4." 2026. openai.com/index/introducing-gpt-5-4
    7. OpenAI. "Codex." openai.com/codex
    8. Moonshot AI. "Kimi K2.5." kimi.com/ai-models/kimi-k2-5
    9. Moonshot AI. "Kimi Code." kimi.com/code
    10. Stanford ML Group. "Stanford Agentic Reviewer." paperreview.ai