Artificial Intelligence and Minecraft — From Project Malmo to Voyager

In 2016 Microsoft Research announced Project Malmo, an open-source platform for training AI agents inside Minecraft. A decade later, Minecraft is the most-used benchmark in AI agent research — from DeepMind's hierarchical learners to NVIDIA's GPT-4-powered Voyager. A reading of how a children's sandbox became the field's preferred test.

In March 2016, AlphaGo defeated Lee Sedol 4-1 in a five-game match. The match was a generational signal: a deep reinforcement-learning system had beaten one of the world's strongest players at a game whose state space was thought to be too large for the search methods that had earlier beaten the world chess champion. A few months after the match, Microsoft Research publicly released Project Malmo, a platform for training AI agents inside Minecraft, built as a mod for the Java Edition. The naming was deliberate: Malmö is a Swedish city, a nod to Mojang's Swedish roots (Mojang's own offices are in Stockholm); the project framed itself as Microsoft Research's contribution to the open question of how to take the AlphaGo-class approach beyond board games.

The thesis at the time was that Minecraft would be a useful intermediate step between the toy environments where reinforcement learning had been working (Atari games, board games) and the kind of physical-world embodiment that the field eventually wanted to reach (robotics, autonomous vehicles). The argument was that a Minecraft world contains many of the features of a real environment — partial observability, long planning horizons, tool use, procedural variety, mineable resources — at a tiny fraction of the cost of real robotics.

That thesis aged unusually well. A decade later, Minecraft is the single most-published benchmark in AI agent research. The 2023 announcement of NVIDIA's Voyager — a GPT-4-powered embodied agent that learns by writing and refining its own Minecraft code — was the headline moment, but it was the culmination of a research arc that had been building since Malmo's launch. Project Malmo, DeepMind's hierarchical agents, OpenAI's Video Pre-Training (VPT), MineRL, MineDojo, Mineflayer, JARVIS-1, GROOT, STEVE — the field's full menagerie of embodied-agent research either uses Minecraft directly or borrows its task taxonomy.

Why Minecraft and Not Something Else

The reasons Minecraft became the standard benchmark are partly technical and partly historical. Five features matter more than the others:

  1. Open-ended tasks. Most prior reinforcement-learning environments (Atari, MuJoCo, board games) have a clear objective: maximize a single scalar reward. Minecraft has many viable objectives (gather wood, build a shelter, find diamond, defeat the Ender Dragon) and an agent has to choose which to pursue. This open-endedness pushed the field to develop hierarchical and goal-conditioned methods that the simpler benchmarks did not require.
  2. Long horizons. Mining a diamond from a fresh spawn takes a competent human player ~30 minutes of real time and involves a partially-ordered sequence of subtasks (find wood → make a workbench → make a wooden pickaxe → mine stone → make a stone pickaxe → find iron → smelt iron → make iron pickaxe → find diamond). The credit-assignment problem at this horizon length is brutal, and the field needed an environment that forced researchers to solve it.
  3. Tool use and crafting. Minecraft's crafting recipes are a discrete combinatorial action space layered on top of the continuous movement space. Agents that succeed in Minecraft have to learn to plan over tools, not just over movements. This is structurally closer to what real-world robots have to do than the pure-movement benchmarks that preceded it.
  4. Visual variety and procedural generation. Every Minecraft world is procedurally generated. An agent that memorizes one world fails on the next. Procedural generation forced the field to build generalizable agents rather than environment-specific policies, a substantial step toward the kind of generalization the field wanted.
  5. Cost. A Minecraft instance runs on a laptop. A robot environment costs a robot. The cost-per-experiment differential is several orders of magnitude, which means PhD students at small labs can publish meaningful Minecraft results that they could not publish on physical robotics. The field's adoption rate followed the cost curve.

None of these features is unique to Minecraft individually. The combination of all five in a single, already-popular environment with active modding support and decades of community-built tooling is what made it the de-facto standard.
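The partially ordered subtask chain in item 2 can be made concrete as a dependency graph. The sketch below is illustrative only: the `PREREQS` map is a hypothetical, heavily simplified version of the real crafting tree (sticks, furnaces and item counts are left out), and `plan` is a name introduced here, not part of any Minecraft research framework.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, simplified prerequisite map: item -> items needed first.
# The real crafting tree also involves sticks, a furnace, fuel and counts.
PREREQS = {
    "wood": set(),
    "workbench": {"wood"},
    "wooden_pickaxe": {"workbench"},
    "stone": {"wooden_pickaxe"},
    "stone_pickaxe": {"stone", "workbench"},
    "iron_ore": {"stone_pickaxe"},
    "iron_ingot": {"iron_ore"},
    "iron_pickaxe": {"iron_ingot", "workbench"},
    "diamond": {"iron_pickaxe"},
}


def plan(goal, prereqs=PREREQS):
    """Return every subtask needed for `goal`, prerequisites first."""
    # Walk backwards from the goal to collect only the relevant subgraph.
    needed = {}
    stack = [goal]
    while stack:
        item = stack.pop()
        if item not in needed:
            needed[item] = prereqs[item]
            stack.extend(prereqs[item])
    # A topological order is one valid way to sequence the partial order.
    return list(TopologicalSorter(needed).static_order())


if __name__ == "__main__":
    print(" -> ".join(plan("diamond")))
```

Even in this toy form the goal sits nine steps from the start, and an agent rewarded only on reaching `diamond` gets no signal about which of the earlier steps mattered. That is the credit-assignment problem in miniature.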

A Decade of Minecraft Agent Research

2016: Project Malmo (Microsoft Research)

The opening move. Project Malmo released as a fully open-source Minecraft mod with a Python API for controlling an in-game agent. Katja Hofmann's team at Microsoft Research Cambridge UK led the project. The initial release supported single-agent and multi-agent tasks, configurable observation spaces (RGB, depth, symbolic), and a curriculum of starter environments. Malmo was the platform on which most of the field's early work was done; even projects that later switched to other frameworks usually started from a Malmo-based prototype.

2017-2019: Establishing the Baseline

The 2017-2019 window was the period of establishing what reinforcement-learning agents could and could not do in Minecraft. The MineRL competition series, launched in 2019, set the agenda: a standard set of tasks (Treechop, Navigate, ObtainIronPickaxe, ObtainDiamond) with human-demonstration data and a fixed compute budget. The competitions revealed that pure reinforcement-learning approaches were not solving the long-horizon tasks; the top entries combined human demonstrations with imitation learning and only used RL for fine-tuning.

This was the field's first big lesson from Minecraft: the long-horizon problem was not solvable by tabula-rasa reinforcement learning at the compute budgets available. Either researchers needed orders of magnitude more compute (which is what DeepMind would do), or they needed a different approach (which is what OpenAI would do with VPT).

2022: OpenAI Video Pre-Training (VPT)

OpenAI's VPT paper (June 2022) trained an agent on 70,000 hours of YouTube video of humans playing Minecraft. Because online video carries no record of the keys pressed, the team first trained an inverse dynamics model on a much smaller set of gameplay recorded with logged inputs, then used it to label the YouTube footage. The resulting model, a transformer architecture predicting keyboard-and-mouse actions from screen pixels, was an imitation-learning approach at a scale that had not been tried in the field. After fine-tuning, the agent could craft diamond tools in the standard Minecraft progression, a task that pure RL agents had not solved at that point.

The VPT result was significant beyond Minecraft: it was an existence proof that internet-scale unlabeled video could be used as a training signal for embodied tasks. The methodology has since been generalized to robotics and to broader embodied-agent training.
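How unlabeled video becomes a training signal is easier to see in miniature. VPT's route, as the paper describes it, goes through an inverse dynamics model (IDM): a model that looks at consecutive frames and infers which action was taken between them. The toy below is a sketch of that idea only. The world is a one-dimensional position, the "models" are lookup tables built by majority vote, and all function names are hypothetical; nothing here reflects OpenAI's actual code or architecture.

```python
from collections import Counter, defaultdict


def fit_idm(labeled):
    """Learn action-from-transition on a small labeled set.

    labeled: iterable of (obs, next_obs, action) triples.
    Returns a dict mapping the observed change to the most common action.
    """
    votes = defaultdict(Counter)
    for obs, next_obs, action in labeled:
        votes[next_obs - obs][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}


def pseudo_label(idm, video):
    """Attach inferred actions to an action-free observation sequence."""
    pairs = []
    for obs, next_obs in zip(video, video[1:]):
        delta = next_obs - obs
        if delta in idm:  # skip transitions the IDM never saw
            pairs.append((obs, idm[delta]))
    return pairs


def behavior_clone(pairs):
    """Fit a policy (obs -> action) by majority vote over labeled pairs."""
    votes = defaultdict(Counter)
    for obs, action in pairs:
        votes[obs][action] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in votes.items()}


if __name__ == "__main__":
    small_labeled_set = [(0, 1, "right"), (1, 0, "left"), (3, 4, "right")]
    unlabeled_video = [0, 1, 2, 3, 4]  # positions only, no actions
    idm = fit_idm(small_labeled_set)
    policy = behavior_clone(pseudo_label(idm, unlabeled_video))
    print(policy)
```

The design point is the asymmetry in data: the labeled set only has to be large enough to learn "what action explains this transition", which is a much easier problem than "what action should come next", and the large unlabeled corpus then supplies the behavior.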

2023: Voyager (NVIDIA + Caltech + University of Texas Austin)

The headline moment. Voyager, published in May 2023, put GPT-4 where a trained policy network would normally sit. The architecture: a high-level planner (GPT-4) decided what to do next; a code-writing module (also GPT-4) wrote JavaScript code to execute that decision via the Mineflayer bot framework; a skill-library system stored and retrieved the agent's previous successful skills.

Voyager was, by its authors' description, the first LLM-powered agent that could, in standard Minecraft survival mode, autonomously progress from a fresh spawn through the wooden, stone, iron and diamond tool tiers without human intervention and without task-specific reinforcement learning. It also did this without training or fine-tuning a model of its own, because the heavy lifting was offloaded to a foundation model that had been pre-trained on internet text.

The Voyager result reframed the question. The agent wasn't learning Minecraft from scratch; it was using a foundation model that had absorbed Minecraft wiki pages, YouTube tutorial transcripts, and Minecraft modding documentation in its pre-training, and applying that knowledge through a code-writing loop. The "agent" in Voyager is, in some sense, the entire stack of internet knowledge about Minecraft, deployed against a single instance of the game.
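The planner, code writer and skill library described above fit together as a loop. The sketch below shows that control flow under loud assumptions: the three callables stand in for GPT-4 prompts and a Mineflayer execution step, retrieval uses word overlap where a real system would more likely use embedding similarity, and every name here (`SkillLibrary`, `agent_loop`, `max_retries`) is hypothetical. Voyager's generated skills are JavaScript; Python is used only to illustrate the orchestration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Stores code for tasks that succeeded, keyed by task description."""

    skills: dict[str, str] = field(default_factory=dict)

    def add(self, description: str, code: str) -> None:
        self.skills[description] = code

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return up to k stored skills whose descriptions overlap the query.

        Jaccard word overlap is a stand-in for embedding similarity.
        """
        q = set(query.lower().split())

        def score(description: str) -> float:
            d = set(description.lower().split())
            return len(q & d) / len(q | d)

        ranked = sorted(self.skills, key=score, reverse=True)
        return [self.skills[d] for d in ranked[:k] if score(d) > 0]


def agent_loop(propose_task, write_code, execute, library, steps, max_retries=3):
    """Run the propose -> write -> execute -> store cycle `steps` times.

    propose_task(completed) -> task description
    write_code(task, context, feedback) -> program text
    execute(code) -> (succeeded, feedback)
    """
    completed = []
    for _ in range(steps):
        task = propose_task(completed)
        context = library.retrieve(task)  # reuse earlier skills as context
        feedback = ""
        for _ in range(max_retries):
            code = write_code(task, context, feedback)
            succeeded, feedback = execute(code)
            if succeeded:
                library.add(task, code)  # only verified skills are stored
                completed.append(task)
                break
    return completed
```

Storing only code that has been executed successfully is the design choice that matters: the library grows into a set of verified building blocks, so later tasks can be attempted by composing earlier ones rather than starting over.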

Figure: Minecraft as an AI benchmark, key milestones 2016-2025. Malmo (Microsoft Research, 2016); MineRL competition series begins (2019); VPT (OpenAI, 70k hours of video, 2022); DreamerV3 (DeepMind, diamond, 2023); Voyager (NVIDIA, GPT-4 agent, 2023); JARVIS-1 (CASIA, multimodal, 2023); GROOT-2 (multimodal instruction, 2024). Major shift through the period: pure-RL approaches yielded to imitation learning on internet-scale video and then to foundation-model agents. The benchmark itself remained Minecraft.
Key milestones in the Minecraft AI-benchmark research arc. The benchmark held steady while the methodology shifted twice: from pure RL to imitation-on-video, then to foundation-model agents.

2024-2025: Embodied Foundation Models

The current research frontier on Minecraft has moved past the "can an agent do this" framing. Most published methods can now navigate the basic crafting tree. The open questions are about generalization: can an agent trained on one Minecraft world or mod transfer to a different one? Can the same agent architecture work on Robotic Process Automation tasks, web browsing, or actual robotics? Can the system learn new skills from natural-language instructions alone?

The 2023-2025 generation of papers (JARVIS-1, GROOT-2, STEVE-Eye, the various embodied-agent extensions from CMU and Stanford) evaluates Minecraft agents on these transfer questions rather than on standalone task completion. Minecraft has, in this sense, served its purpose as a benchmark: the field has moved past the question of whether agents can play it.

Why Microsoft Bought Mojang

Figure: Diamond-pickaxe completion rate, methodology-era trend, 2018-2026. Approximate values: ~0% pure RL; ~5% MineRL competitions; ~50% VPT (2022); ~90% Voyager (2023); >95% foundation-model baseline (2025). Indicative curve based on published reproduction rates. Methodology shifts (RL → imitation → foundation models) drive the visible step changes.
Indicative diamond-pickaxe-task completion rate across the methodology eras. The two largest step changes — 2022 (imitation learning on internet-scale video) and 2023 (foundation-model agents) — reframe what the benchmark is actually measuring.

The 2014 acquisition of Mojang by Microsoft for $2.5 billion was widely read at the time as a play for the gaming franchise. Minecraft's status as one of the best-selling games of all time made that read defensible. The benchmark story, in retrospect, is the other half of the answer.

Microsoft Research's investment in Project Malmo in 2015-2016, and the more recent integration of Minecraft into Microsoft's broader AI agent product lines (Copilot Studio, the various agentic-AI offerings), suggest the company has been operating on a longer-term thesis. Owning the platform that has become the field's standard benchmark gives Microsoft a privileged position in defining how embodied AI gets evaluated and published. That position is arguably worth more than the game franchise alone.

This is the same kind of strategic pattern that the gaming industry has seen elsewhere — Konami's pachinko pivot was the inverse case (extract value from an IP catalog without continuing to develop it), and Valve's Steam-platform strategy is the corresponding pattern in PC distribution (the platform is the actual business). Microsoft's Minecraft thesis is its own variant: the catalog title is also the field's preferred research instrument, and owning both is structurally valuable.

What the Minecraft Benchmark Says About AI

The research arc that ran from Project Malmo to Voyager and beyond is, taken as a whole, a particular kind of statement about how the field of AI research actually evolves. Three patterns are visible:

1. Benchmarks shape methodology more than methodology shapes benchmarks. The field's methods shifted from pure RL to imitation-on-video to foundation-model agents over a single decade, but the benchmark — Minecraft — held steady throughout. Researchers built new methods specifically to score on the benchmark, which means the benchmark itself was driving methodology choices. This is a normal pattern in scientific fields but one that the AI field's own self-narrative often underplays.

2. The field's preferred environments need to be culturally legible. Minecraft is not just technically convenient; it is culturally familiar. Researchers, reviewers and journalists understand what "obtaining a diamond" means without explanation. The benchmark's communicability is a substantial part of its value. Pure technical convenience (e.g., the MuJoCo robotic environments, which are technically much cleaner) does not produce the same level of community engagement because the tasks are not culturally legible.

3. The benchmark's commercial success was load-bearing. The decade of Minecraft-as-benchmark research relied on Microsoft's continued commitment to the game as a commercial product. If Mojang had been a niche studio whose game went out of print, the Malmo platform would have died on the vine. The benchmark's stability across a decade is directly enabled by the commercial-game economics that made Minecraft one of the best-selling games of all time. The AI research field has, in this sense, taken a free ride on the gaming industry's catalog economics: without the popularity of the game, the field's benchmark of choice would not have been viable.

That third point matters because it makes Minecraft a useful example of how the AI field has been structurally dependent on the cultural products it studies. The benchmark is durable because the game is durable. The methods improve year-on-year because Mojang keeps shipping content. The cultural legibility holds because the game is part of the broader children-and-teenagers cultural baseline. Take any of those props away, and the research field's preferred testbed would have to migrate. That migration would be expensive and slow.

What the State of Play Looks Like in 2026

Subsystem diagram of the four places AI / ML attaches to Minecraft — world generation, mob behaviour, villager trades and command-block scripting — with the vanilla scripted layer (phosphor) and the modded research layer (amber) shown separately.
Four AI surfaces · the scripted vanilla layer (phosphor) is where the game already runs; the research layer (amber) is where every shipped AI mod attaches.

By 2026, "AI agent in Minecraft" research has largely shifted from being a flagship publishable result to being a baseline competence that other claims are evaluated against. Voyager, JARVIS-1, and the various GROOT iterations established that foundation-model agents can complete the standard Minecraft progression; the publication question now is whether they can do so more efficiently, in more varied worlds, or with less per-task data than the prior best result.
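Comparing agents "in more varied worlds" implies scoring them over many procedurally generated worlds rather than one. A minimal sketch of such a harness follows, assuming a hypothetical `run_episode(seed)` callable that returns `True` when the agent completes the task on the world generated from that seed. The Wilson score interval is one common choice for the uncertainty on a success rate; it is used here as an illustration, not as a convention taken from any of the papers above.

```python
import math


def success_rate(run_episode, seeds, z=1.96):
    """Estimate task success over a set of world seeds.

    run_episode(seed) -> bool is a hypothetical hook into the environment.
    Returns (rate, (low, high)), where the interval is the Wilson score
    interval at the confidence level implied by z (1.96 is roughly 95%).
    """
    seeds = list(seeds)
    n = len(seeds)
    if n == 0:
        raise ValueError("need at least one seed")
    wins = sum(1 for seed in seeds if run_episode(seed))
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, (center - half, center + half)


if __name__ == "__main__":
    # Stand-in agent that succeeds on even-numbered seeds only.
    rate, (low, high) = success_rate(lambda seed: seed % 2 == 0, range(100))
    print(f"success rate {rate:.2f}, interval [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside the rate guards against reading a few lucky worlds as a methodological advance, which matters more as headline completion rates crowd toward the top of the scale.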

The longer-term direction the field has taken is to use Minecraft as a calibration environment for agents that are ultimately targeted at non-Minecraft tasks: browser automation, code generation, robotic control. The thinking is that an agent that can succeed in Minecraft has demonstrated the capacity for long-horizon goal-conditioned behavior, tool use and self-correcting planning, which are the capabilities a useful real-world agent needs. The benchmark's purpose has shifted from "can an agent play Minecraft" to "if an agent can play Minecraft, what else can we infer about it."

What remains true is that the field's expectation is shaped by the benchmark. An agent that cannot complete the diamond-pickaxe progression in 2026 is, by community consensus, not yet a competent agent. An agent that can do so but cannot also handle the unexplored variant tasks (modded worlds, novel goals, multi-agent coordination) is still incomplete. The bar moves forward, the benchmark moves with it, and Minecraft continues to occupy the position that nothing else in the field has displaced.

Frequently Asked Questions

What is Project Malmo?

Project Malmo is an open-source Microsoft Research platform, launched July 2016, for training AI agents inside Minecraft. It provides a Python API for controlling an in-game agent, configurable observation spaces, and a curriculum of starter environments. Project Malmo is the foundational platform for most subsequent Minecraft-based AI research.

Who runs Project Malmo?

Katja Hofmann at Microsoft Research Cambridge UK led the original Project Malmo team. The project has since been folded into the broader Microsoft Research Game Intelligence group, with continued open-source releases through Microsoft Research's GitHub.

What is Voyager?

Voyager is an embodied agent published by NVIDIA, Caltech and the University of Texas at Austin in May 2023. It prompts GPT-4 to write JavaScript code that controls a Minecraft bot via the Mineflayer framework. Its authors describe it as the first LLM-powered embodied lifelong-learning agent in Minecraft; it progressed autonomously from a fresh spawn to diamond-tier tools without task-specific reinforcement learning.

What is the MineRL competition?

MineRL is a series of AI competitions launched in 2019, organized by Carnegie Mellon University and Microsoft Research. The competition provides a fixed set of Minecraft tasks (Treechop, Navigate, ObtainIronPickaxe, ObtainDiamond) with human-demonstration data and a per-entry compute budget. MineRL is the closest the field has to a standardized leaderboard for Minecraft agents.

Why is Minecraft a good AI benchmark?

Five features matter: open-ended objective set, long planning horizons, tool use and crafting, procedural visual variety, and per-experiment cost low enough for academic research. None of the five is unique to Minecraft, but the combination in a single popular environment is hard to replicate in other game environments at comparable cost.
