§17.3

Automated, Agent-Driven Predictive Workflows

Part IV built predictive models the slow way: a human framed the task, engineered the features, fit the model, read the metrics, and shipped it. Every step demanded both judgment and labor. The promise of agentic AI is to automate the labor while keeping the judgment: an agent that can do the exploratory analysis, propose features, train and compare models, and then watch the deployed model and retrain it when the world shifts. How much of that is real today, and how much is still a research demo, is the subject of this section.

The honest answer is that agents are genuinely useful for parts of the predictive lifecycle and genuinely unreliable for the whole of it. Both halves of that sentence are backed by hard benchmark numbers, and a manager needs to hold both at once.

Can an agent do data science?

The most rigorous measurement is OpenAI's MLE-bench, which drops agents into 75 real Kaggle competitions. The best setup — a strong reasoning model with purpose-built scaffolding — earned at least a bronze medal in just 16.9% of competitions on a single attempt, roughly doubling to 34.1% when allowed eight tries.1 Other benchmarks tell the same story from different angles: on DSBench's realistic analysis tasks the best agent solved about a third;2 on BixBench's open-ended bioinformatics work, frontier models scored in the teens.3 Yet on closed-form analysis questions — a clean CSV, a specific question — the same class of model answers correctly about three-quarters of the time.4
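The gap between one try and eight tries is worth a moment of arithmetic. The sketch below is a back-of-envelope check, not part of MLE-bench: it assumes, purely for illustration, that every attempt is an independent draw with the same 16.9% odds on every competition, and asks what eight tries would then yield. The variable names are ours.

```python
# Reported MLE-bench figures quoted in the text above.
p_single = 0.169       # medal rate on a single attempt
observed_at_8 = 0.341  # medal rate when allowed eight attempts
attempts = 8

# ASSUMPTION (ours, for illustration): attempts are independent and every
# competition is equally hard. Then P(at least one medal) = 1 - (1 - p)^k.
independent_at_8 = 1 - (1 - p_single) ** attempts

print(f"if tries were independent: {independent_at_8:.1%}")
print(f"actually reported:         {observed_at_8:.1%}")
```

Under that assumption eight tries would land a medal roughly 77% of the time, far above the reported 34.1%. One plausible reading is that retries mostly re-win the competitions the agent could already handle, while the rest stay out of reach however many times it tries. More attempts help, but they are not a substitute for capability.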

How far are data-science agents from experts? It depends on the task

Win a Kaggle medal (best agent, 1 try): 16.9%
Win a Kaggle medal (8 tries): 34.1%
Solve a realistic data-analysis task: 34.1%
Open-answer bioinformatics analysis: 17%
Closed-form analysis question: 74.6%
Match/beat an expert on real knowledge work: 47.6%

These benchmarks measure different things, so the bars are not directly comparable — but the pattern is clear. On narrow, closed-form questions, agents are strong; on open-ended, end-to-end modeling work, they still trail experts by a wide margin. Benchmark wins are not the same as production reliability.

Figure 1. The same agents, graded on different tasks. Strong on narrow, well-specified questions; weak on open-ended, end-to-end modeling. The bars measure different benchmarks and are not directly comparable — but the gradient is the lesson.

The pattern is consistent: the more an answer is pinned down, the better agents do; the more the task is open-ended judgment, the worse. And capability is climbing fast. On OpenAI's GDPval, which rates model output against human experts on real economically valuable work, the strongest model matched or beat experts 47.6% of the time — up roughly fourfold from about 12% a year earlier.5

The research-grade frontier

The ceiling is rising in ways that would have seemed implausible in 2023. Sakana AI's automated “AI Scientist” generated a machine-learning paper (hypothesis, experiments, and write-up) that passed peer review at an ICLR 2025 workshop, scoring above the acceptance threshold and higher than roughly 55% of human-authored submissions; the work was documented in Nature in 2026.6 It is a milestone worth taking seriously but not over-reading: one accepted workshop paper out of several attempts, in a venue chosen for the experiment. The direction is unmistakable; the reliability is not yet there.

What is actually shipping

For everyday analytics, the agents are already in the tools. Google's Gemini-powered Data Science Agent has been free inside Colab since March 2025, autonomously writing and running notebook code for exploratory analysis, cleaning, feature work, and prediction;7 in August 2025 Google extended data-science and data-engineering agents into BigQuery, where a plain-English request becomes a working pipeline.8 On the predictive-modeling side, vendors like DataRobot and H2O.ai have fused classic AutoML with agents, so a single system can build a model and then monitor, explain, and act on it.9

The loop that matters

For a manager, the exciting capability is not one-shot model-building — it is closing the operating loop that Part IV opened. A deployed model decays as the world drifts; the discipline is to monitor it, detect the drift with statistical tests, retrain, validate, and roll out carefully. An agent can run every station of that loop. What it should not do is decide unsupervised that a new model is good enough to replace the old one.
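To make "detect the drift with statistical tests" concrete, here is a minimal sketch of two tests often used for this job: the two-sample Kolmogorov-Smirnov (KS) statistic and the Population Stability Index (PSI). It uses only the Python standard library and synthetic data; the function names, the ten-bin choice, and the sample sizes are illustrative assumptions, not a prescribed method. A production pipeline would more likely call a statistics library.

```python
import bisect
import math
import random


def ks_statistic(reference, live):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    for x in ref + cur:  # the gap can only change at observed values
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def psi(reference, live, bins=10, eps=1e-6):
    """Population Stability Index, with bins cut at the reference quantiles."""
    ref = sorted(reference)
    edges = [ref[len(ref) * i // bins] for i in range(1, bins)]  # bins - 1 cut points

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            counts[bisect.bisect_right(edges, x)] += 1
        # eps keeps an empty bin from causing log(0) or division by zero
        return [max(c / len(sample), eps) for c in counts]

    expected, actual = shares(reference), shares(live)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Synthetic feature values: training data, an unchanged world, a shifted world.
rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same_world = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted_world = [rng.gauss(0.8, 1.0) for _ in range(2000)]

ks_same, psi_same = ks_statistic(training, same_world), psi(training, same_world)
ks_shift, psi_shift = ks_statistic(training, shifted_world), psi(training, shifted_world)

print(f"unchanged world: KS={ks_same:.3f}  PSI={psi_same:.3f}")
print(f"shifted world:   KS={ks_shift:.3f}  PSI={psi_shift:.3f}")
```

Both numbers stay near zero when the live data still looks like the training data and rise sharply when it does not. Where to draw the alarm line is a policy choice; a commonly quoted rule of thumb treats a PSI above roughly 0.25 as a major shift, but the right threshold depends on the feature and on the cost of a false alarm.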

The predictive loop, now agent-driven

[Diagram] Agent + durable execution (survives crashes · runs for days · resumes)
Train (fit / refit) → Deploy (canary rollout) → Monitor (live metrics) → Detect drift (KS · PSI tests) → Decide (retrain? alert?)
Gate: human approves promotion before a new model goes live

The loop itself is the same one Part IV deployed by hand. What is new is that an agent can run every station — watch the metrics, run the drift tests, retrain, and stage a canary — while a human stays “above the loop,” approving the promotions that matter.

Figure 2. The predict → monitor → drift → retrain loop, now agent-driven. The agent turns the cranks; the human keeps an approval gate on the one step that changes what customers experience — promoting a new model to production.
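The approval gate can be written down as a small decision policy. The sketch below is one hypothetical way to encode it: the names `Evidence`, `Action`, and `decide`, and the `min_gain` threshold, are all assumptions made for illustration. The point it shows is structural: the agent can reach every outcome on its own except promotion, which requires an explicit human sign-off.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP_CURRENT = "no drift: keep the live model"
    RETRAIN = "drift: agent retrains a candidate"
    ALERT = "candidate no better: alert a human, keep the live model"
    REQUEST_APPROVAL = "candidate staged: wait for human sign-off"
    PROMOTE = "approved: promote candidate to production"


@dataclass
class Evidence:
    drift_detected: bool
    live_score: float                        # validation metric, higher is better
    candidate_score: Optional[float] = None  # None until a candidate exists
    human_approved: bool = False


def decide(ev, min_gain=0.01):
    """Map the evidence gathered so far to the next action in the loop."""
    if not ev.drift_detected:
        return Action.KEEP_CURRENT
    if ev.candidate_score is None:
        return Action.RETRAIN
    if ev.candidate_score - ev.live_score < min_gain:
        return Action.ALERT
    # Better on paper is not enough: promotion is the human's call.
    return Action.PROMOTE if ev.human_approved else Action.REQUEST_APPROVAL


for ev in [
    Evidence(drift_detected=False, live_score=0.80),
    Evidence(drift_detected=True, live_score=0.80),
    Evidence(drift_detected=True, live_score=0.80, candidate_score=0.84),
    Evidence(drift_detected=True, live_score=0.80, candidate_score=0.84, human_approved=True),
]:
    print(decide(ev).value)
```

Keeping the gate in code rather than in a policy document means the pipeline cannot promote a model by accident: without `human_approved`, the best a candidate can do is wait.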

The reliability layer

A loop that runs for days and survives crashes needs more than a cron job. This is where durable execution engines come in (Temporal, Apache Airflow, Vercel's Workflow Development Kit): they persist a workflow's state at every step and resume exactly where they left off after a failure, letting a single run span minutes to months.10,12 They are also where human checkpoints get built in: Airflow's 3.1 release in late 2025 added first-class human-in-the-loop operators so a pipeline can pause for an approval before continuing.11 An agentic pipeline without this substrate is a science project; with it, it is infrastructure.
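The core idea of durable execution, persisting state after every step and resuming from the last checkpoint, fits in a few lines. The toy below is a conceptual sketch only: it is not the API of Temporal, Airflow, or Vercel's kit, and the step names, the JSON checkpoint file, and the `crash_before` switch are hypothetical devices for the demonstration.

```python
import json
import os
import tempfile

STEPS = ["monitor", "detect_drift", "retrain", "validate", "stage_canary"]


def run_workflow(state_path, do_step, crash_before=None):
    """Run STEPS in order, checkpointing after each; skip steps already recorded."""
    state = {"done": []}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)  # pick up where the previous run stopped
    for step in STEPS:
        if step in state["done"]:
            continue  # completed in an earlier run; do not repeat it
        if step == crash_before:
            raise RuntimeError(f"simulated crash before {step}")
        do_step(step)
        state["done"].append(step)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a corrupt checkpoint behind.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, state_path)
    return state["done"]


executed = []  # records every step that actually ran, across both runs
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "loop_state.json")
    try:
        run_workflow(path, executed.append, crash_before="retrain")
    except RuntimeError:
        pass  # the process "died" after two steps
    finished = run_workflow(path, executed.append)  # second run resumes

print(executed)
```

The second run skips the two steps already checkpointed, so each step executes exactly once despite the crash. Real engines add what this toy leaves out: retries, timers, distributed workers, and the ability to pause a run until a human approves.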

Sources

Verified June 2026

  1. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering · OpenAI / arXiv 2410.07095 (ICLR 2025), 2024. arxiv.org/abs/2410.07095
  2. DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? · arXiv 2409.07703, 2024. arxiv.org/abs/2409.07703
  3. BixBench: A Comprehensive Benchmark for LLM-based Agents in Computational Biology · FutureHouse, 2025. www.futurehouse.org/research-announcements/bixbench
  4. InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks · arXiv 2401.05507 (ICML 2024), 2024. arxiv.org/abs/2401.05507
  5. GDPval: Measuring the Performance of Our Models on Real-World Economically Valuable Tasks · OpenAI / arXiv 2510.04374, 2025. openai.com/index/gdpval
  6. The AI Scientist: Towards Fully Automated AI Research, Now Published in Nature · Sakana AI, 2026. sakana.ai/ai-scientist-nature
  7. Data Science Agent in Colab: The Future of Data Analysis with Gemini · Google Developers Blog, 2025. developers.googleblog.com/en/data-science-agent-in-colab-with-gemini
  8. Google Unveils Enterprise Data Science and Engineering AI Agents · SiliconANGLE, 2025. siliconangle.com/2025/08/05/google-unveils-enterprise-data-science-engineering-ai-agents-provide-real-time-analysis
  9. DataRobot Announces Agent Workforce Platform, Built with NVIDIA · DataRobot, 2025. www.datarobot.com/newsroom/press/datarobot-announces-agent-workforce-platform-built-with-nvidia
  10. Temporal for AI: Durable Execution for Long-Running Workflows · Temporal Technologies, 2025. temporal.io/solutions/ai
  11. Apache Airflow 3.1: Human-in-the-Loop operators · Apache Airflow, 2025. airflow.apache.org/blog/airflow-3.1.0
  12. Built-in Durability: Introducing the Workflow Development Kit · Vercel, 2025. vercel.com/blog/introducing-workflow
  13. Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027 · Gartner, 2025. www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027