🤖 The AI-for-Science Hype, Peer Review Failures, and the Path Forward

Category: Artificial intelligence & Machine Learning
Tags: ai review, ai for science, negative results, bias
2 Posts · 2 Posters · 144 Views
    cqsyf (Super Users) · #1

    Last week I read an eye-opening essay titled "I got fooled by AI-for-science hype—here’s what it taught me" by physicist Nick McGreivy. It is an honest account of how high expectations for AI in physics ran headfirst into brittle models, weak baselines, and, most critically, a scientific publishing system that rarely reports when AI fails.

    This story isn’t just about physics; it’s a mirror reflecting systemic issues in peer-reviewed AI research, especially in computer science conferences like NeurIPS, ICML, and ICLR. Let’s talk about why this is happening, and where we might go from here.


    🚩 From Plasma Physics to Peer Review Pitfalls

    Nick's journey started with high hopes for using machine learning to solve partial differential equations (PDEs). He tried PINNs (Physics-Informed Neural Networks), inspired by their 14,000 citations and glowing papers. But he quickly found they didn’t work on even modestly different PDEs.
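    For readers who haven't met PINNs: the core idea is to train a neural network to satisfy a PDE by penalizing its residual at sampled points. Below is a minimal, hypothetical sketch of that idea on a toy 1D problem in PyTorch; it is not the original paper's setup and not Nick's experiments.

    ```python
    # Minimal PINN sketch (illustrative toy, not the original paper's setup):
    # fit u(x) with u''(x) = -sin(x) on [0, pi] and u(0) = u(pi) = 0,
    # whose exact solution is u(x) = sin(x).
    import math
    import torch

    torch.manual_seed(0)

    net = torch.nn.Sequential(
        torch.nn.Linear(1, 32), torch.nn.Tanh(),
        torch.nn.Linear(32, 32), torch.nn.Tanh(),
        torch.nn.Linear(32, 1),
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    for step in range(2000):
        # Collocation points where the PDE residual is penalized.
        x = torch.rand(128, 1) * math.pi
        x.requires_grad_(True)
        u = net(x)
        du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
        d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
        pde_residual = d2u + torch.sin(x)       # should be ~0 if u'' = -sin(x)

        # Boundary conditions u(0) = u(pi) = 0.
        xb = torch.tensor([[0.0], [math.pi]])
        bc_residual = net(xb)

        loss = (pde_residual ** 2).mean() + (bc_residual ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    ```

    The appeal is obvious: no mesh, no solver, just a loss function. The catch, as Nick found, is that the same recipe can quietly fail once the PDE or its parameters change.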

    “Eventually, I realized that the original PINN paper had selectively reported successes and omitted failures.”

    And it wasn’t just PINNs. In a review of 76 ML-for-PDE papers:

    • 79% used weak baselines to claim superiority.
    • Very few reported negative results.
    • The most hyped outcomes were the most misleading.

    [Figure: 60 out of 76 AI-for-PDE papers used weak baselines; comparisons against strong baselines drastically reduced the performance claims.]


    🧪 This Sounds Familiar... Hello, Computer Science

    These patterns (selective reporting, weak baselines, cherry-picking, and data leakage) aren't just physics problems. They're endemic to AI research culture, particularly in CS conferences:

    1. Survivorship Bias

    Conferences reward “novel and positive results.” Negative results? Not welcome. This distorts the field, making AI look more capable than it is.

    2. Reviewer Incentives

    Reviewers are often working on similar topics. Accepting “hyped” papers may boost their own citation networks, leading to conflicts of interest.

    3. Benchmark Gaming

    How many papers claim state-of-the-art by subtly tweaking test splits or cherry-picking metrics? Quite a few. The culture of leaderboard chasing dilutes scientific rigor.
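    As a toy illustration of how this distorts reported numbers (my own sketch, not data from the article): even when a method has zero true advantage over its baseline, picking the best run out of many seeds will still "show" a gain.

    ```python
    # Toy illustration: both methods have the same true accuracy (0.80);
    # runs differ only by noise, yet cherry-picking the best seed shows a "gain".
    import numpy as np

    rng = np.random.default_rng(0)
    n_seeds = 20
    baseline = 0.80 + rng.normal(0, 0.01, n_seeds)
    method   = 0.80 + rng.normal(0, 0.01, n_seeds)

    print(f"honest report : {method.mean():.3f} +/- {method.std():.3f} "
          f"vs baseline {baseline.mean():.3f} +/- {baseline.std():.3f}")
    print(f"cherry-picked : best seed {method.max():.3f} "
          f"vs worst baseline run {baseline.min():.3f}")
    ```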

    [Figure: AI mentions across disciplines rose from roughly 2% of papers in 2015 to nearly 8% in 2022, raising questions about depth vs. hype.]


    💬 My Take: Science Is Not a Startup Pitch

    McGreivy rightly points out:

    “AI adoption is exploding among scientists less because it benefits science and more because it benefits the scientists themselves.”

    Just as a startup might inflate a TAM (Total Addressable Market) to impress VCs, many researchers unintentionally inflate their AI’s potential to impress reviewers and get published. The difference? In science, these distortions can misguide entire research agendas.


    🔮 What Needs to Change

    Let’s be constructive. Here’s what I believe we need in CS/AI peer review:

    ✅ Stronger Baselines and Ablations

    Papers should compare against state-of-the-art, not strawmen. Ablation studies should be mandatory for claims about novel components.

    ✅ Negative Results Track

    NeurIPS experimented with this before. We need a permanent home for negative or inconclusive findings.

    ✅ Reviewer Education

    Provide checklists to reviewers: Have strong baselines been used? Are the gains statistically significant? Is there a reproducibility artifact?
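    On the "statistically significant" item in particular, even a tiny check goes a long way. Here's a hedged sketch of what a reviewer (or author) could run, assuming per-seed scores are reported for both methods; the numbers below are made up for illustration.

    ```python
    # Hedged checklist sketch: paired t-test over per-seed scores.
    # The scores are hypothetical; the point is that a claimed gain should
    # survive a significance (and effect-size) check across seeds.
    import numpy as np
    from scipy import stats

    baseline = np.array([0.801, 0.796, 0.805, 0.799, 0.803])   # hypothetical
    proposed = np.array([0.808, 0.802, 0.804, 0.806, 0.809])   # hypothetical

    res = stats.ttest_rel(proposed, baseline)   # paired test over matched seeds
    gain = (proposed - baseline).mean()
    print(f"mean gain = {gain:.4f}, p-value = {res.pvalue:.3f}")
    ```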

    ✅ Independent Benchmarks

    The NLP community has benefited from shared tasks and datasets (GLUE, SuperGLUE). We need more of these in other domains.
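    As a concrete example of how low the barrier has become, a GLUE task can be pulled in a couple of lines with the Hugging Face datasets library (shown here as one assumed setup; any equivalent benchmark loader makes the same point):

    ```python
    # Example of a shared benchmark: loading GLUE's SST-2 via the Hugging Face
    # `datasets` library. Fixed, community-maintained splits make it harder to
    # quietly tweak the evaluation protocol.
    from datasets import load_dataset

    sst2 = load_dataset("glue", "sst2")     # official train/validation/test splits
    print(sst2)
    print(sst2["validation"][0])            # a single labeled example
    ```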


    🙏 Finally

    The article is a refreshing reminder: Just because something works with AI doesn’t mean it’s better. And just because it’s published doesn’t mean it’s trustworthy.

    Let’s aim for a culture where rigor beats razzle-dazzle, and where peer review serves truth, not trendiness.

    Would love to hear your thoughts. Have you seen similar issues in your domain?


    Images from the article "I got fooled by AI-for-science hype — here’s what it taught me" by Nick McGreivy.

      Joanne · #2

      Truly a nice read, a really reflective piece.
