Open vs Closed AI: How Far Apart Are They?
Open-source models tend to be about one year behind state-of-the-art closed-source models
Open-source large language models (LLMs) and their closed-source counterparts have each played significant roles in advancing artificial intelligence. Epoch AI recently released a report that explores the dynamics between these two types of models, offering a thorough examination of trends in openness, performance benchmarks, and training compute. It concludes that the most capable open models generally trail the most advanced closed models by roughly one year—though this gap might shorten soon.
Context and Definitions
For years, openness has been a guiding principle in AI research, encouraging transparency and collaborative efforts. Openness, however, comes with risks, including the potential misuse of powerful AI systems and commercial disincentives for fully sharing cutting-edge model weights and data. Consequently, today’s AI development landscape displays varying degrees of openness. Some models remain unreleased, others are open in restricted forms, and still others—like Meta’s Llama variants—release their weights for wider usage and experimentation.
The authors of the Epoch AI report—Ben Cottier, Josh You, Natalia Martemianova, and David Owen—define an open model as one that makes its trained weights downloadable, regardless of whether its code or data is fully accessible. Meanwhile, closed models only offer access through an API or hosted service, or they are unreleased altogether. The authors also introduce a finer-grained spectrum, noting that “open” itself may include tight license restrictions, and “closed” can include partial access through structured APIs. This nuanced view informs their study of how each model category progresses in capability over time.
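To make that spectrum concrete, the sketch below encodes the access categories described above as a small Python taxonomy. The category names and helper function are illustrative assumptions, not the report’s actual labels:

```python
from enum import Enum

class AccessLevel(Enum):
    """Illustrative openness categories based on the distinctions above
    (not the report's exact taxonomy)."""
    UNRELEASED = "no public access"
    API_ONLY = "hosted or API access only; weights withheld"
    OPEN_WEIGHTS_RESTRICTED = "downloadable weights under a restrictive license"
    OPEN_WEIGHTS_PERMISSIVE = "downloadable weights under a permissive license"

def is_open(level: AccessLevel) -> bool:
    """The report's working definition: a model is 'open' if its weights are
    downloadable, regardless of license terms or whether code/data are shared."""
    return level in (
        AccessLevel.OPEN_WEIGHTS_RESTRICTED,
        AccessLevel.OPEN_WEIGHTS_PERMISSIVE,
    )

# A Llama-style release with commercial-use restrictions still counts as open
# under this definition, while an API-only model does not.
print(is_open(AccessLevel.OPEN_WEIGHTS_RESTRICTED))  # True
print(is_open(AccessLevel.API_ONLY))                 # False
```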
Historical Trends in Open vs Closed Models
From a broad perspective, the AI community has oscillated between openness and caution:
2018-2023 shift: Although many influential organizations grew more guarded over time (for example, OpenAI stopped releasing model weights after GPT-2), a majority of the notable models in the authors’ database from 2019 through 2023 were nonetheless open. This likely stems from the overall proliferation of AI research, the popularity of hosting platforms like HuggingFace, and growth in academic labs that prefer open releases.
Commercial incentives: Businesses such as OpenAI, Google DeepMind, and Anthropic benefit from keeping their most valuable models proprietary. For them, user adoption, subscription revenue, and brand control may outweigh the benefits that community contributions and openness bring. However, Meta has bucked this trend, regularly publishing its Llama models with downloadable weights, albeit under licenses that typically include commercial-use restrictions.
Performance Gap and Benchmarks
A central objective of the report is to estimate how far behind open LLMs are compared to state-of-the-art closed LLMs. To measure this, the authors analyze several benchmarks that test language comprehension, reasoning, and math skills:
Benchmark Lag: Across multiple standard benchmarks (MMLU, GPQA, GSM1K, BBH), open models have trailed closed ones by 5 to 22 months. For instance, GPT-3 (text-davinci-001, a model that has since been deprecated) outperformed open competitors on MMLU in 2020, and it took over two years before BLOOM-176B matched or exceeded that performance. (A sketch of how such lag figures can be estimated appears after this list.)
Shortening Gaps: Recent evaluations (e.g., on GPQA) suggest a faster catch-up. For example, Llama 3.1 405B nearly closed the gap with Claude 3 Opus in only five months. This short interval signals that open-source initiatives backed by major players like Meta might continue to reduce the overall lag.
Quality vs Quantity: Benchmark results showed that some open models can reach similar accuracy scores while using fewer compute resources. However, the report cautions that many factors—like data contamination, specialized tuning for leaderboard tasks, or differences in test conditions—may influence such results. It remains uncertain how efficiently any given model’s architecture will scale to future frontier levels of capability.
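To illustrate how such lag figures can be derived, here is a minimal sketch that, for each closed frontier result on a benchmark, finds the first later open model to match or exceed its score. The records below are placeholder values for illustration, not actual benchmark scores, and the helpers are hypothetical rather than the authors’ pipeline:

```python
from datetime import date

# Placeholder (model, release date, score) records for one benchmark.
# These numbers are illustrative only, not real evaluation results.
closed_models = [
    ("closed-frontier-2020", date(2020, 6, 1), 44.0),
    ("closed-frontier-2023", date(2023, 3, 1), 86.0),
]
open_models = [
    ("open-model-2022", date(2022, 11, 1), 45.0),
    ("open-model-2024", date(2024, 7, 1), 87.0),
]

def months_between(d0: date, d1: date) -> float:
    """Approximate difference between two dates, in months."""
    return (d1 - d0).days / 30.44

def benchmark_lag(closed, opened):
    """For each closed frontier score, find the earliest later open model
    that matches or exceeds it and report the delay in months."""
    lags = {}
    for c_name, c_date, c_score in closed:
        candidates = [
            (o_date, o_name)
            for o_name, o_date, o_score in opened
            if o_score >= c_score and o_date > c_date
        ]
        if candidates:
            o_date, o_name = min(candidates)  # earliest qualifying open model
            lags[c_name] = (o_name, round(months_between(c_date, o_date), 1))
    return lags

print(benchmark_lag(closed_models, open_models))
# {'closed-frontier-2020': ('open-model-2022', 29.0),
#  'closed-frontier-2023': ('open-model-2024', 16.0)}
```

In practice, the report also has to account for confounders such as data contamination and differing evaluation setups, which a simple score comparison like this ignores.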
Training Compute as a Proxy for Frontier Capabilities
The authors highlight training compute (measured in total floating-point operations, or FLOP) as a key indicator of a model’s potential capabilities. By charting the training compute of noteworthy open and closed models over several years, they make the following observations:
15-Month Lag: For top-1 models in each category (meaning those that used the highest compute at the time of their release), open models consistently trail closed ones by about 15 months. This echoes the 5- to 22-month benchmark figure, suggesting a stable—and not drastically widening—gap.
Growth Rates: Both open and closed top-1 models have scaled their compute at a similar rate (about 4.6 times per year) since around 2018. Although open models are behind in absolute terms, they match the growth trajectory, implying that the time lag remains steady. In essence, once closed models scale up, open models follow roughly a year later (the sketch after this list shows what that lag implies in compute terms).
Potential Catch-up: Notably, Meta’s Llama 3.1 405B matched the approximate compute budget of GPT-4 about 16 months after GPT-4’s release. Because GPT-4 remains a key reference point for advanced capabilities, many observers see Llama’s achievement as a milestone in narrowing the open-vs-closed gap. Future models, such as Llama 4, may close it even further.
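Because both frontiers grow at roughly the same rate, the time lag can be converted into an equivalent compute ratio and back. The short sketch below works through that arithmetic using the report’s roughly 4.6x-per-year growth figure; the conversion itself is an illustration, not a calculation taken from the report:

```python
import math

GROWTH_PER_YEAR = 4.6   # top-1 training compute growth factor (from the report)
LAG_YEARS = 15 / 12     # ~15-month lag of open top-1 models behind closed ones

# If both series grow 4.6x per year but the open series is shifted 15 months
# later, the closed frontier leads by a roughly constant compute ratio.
compute_ratio = GROWTH_PER_YEAR ** LAG_YEARS
print(f"Compute ratio implied by a 15-month lag: about {compute_ratio:.1f}x")
# -> about 6.7x

def months_of_progress(ratio: float, growth: float = GROWTH_PER_YEAR) -> float:
    """Convert a compute multiple into the months of frontier scaling it represents."""
    return 12 * math.log(ratio) / math.log(growth)

print(f"A 10x compute jump corresponds to about {months_of_progress(10):.0f} months of scaling")
# -> about 18 months, roughly the scale of Meta's "almost 10x" plan for Llama 4
```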
The Role of Meta’s Llama in Future Projections
One reason for optimism among open-source proponents is Meta’s declared commitment to large-scale open models:
Llama 4 Plans: Mark Zuckerberg has stated that Meta’s forthcoming Llama 4 will use “almost 10x more” compute than the 405B-parameter version of Llama 3. If Meta releases Llama 4’s weights in mid-2025, it could conceivably land squarely on the trend lines that predict the capabilities of next-generation closed models.
Business and Ecosystem Factors: Whether Meta stays open depends on its broader strategic objectives. If the synergy from community-driven development, user adoption, and reputational advantages outweighs any competitive or regulatory risks, Meta might see enough value to continue its openness. In contrast, rising training costs and potential liability could eventually prompt caution, as is typical for other large AI developers. Regulatory developments could also push Meta toward closed-source releases of its next-generation large language models.
Economics and Regulation
The authors emphasize that the economic value of open frontier models remains an open question. Training a leading-edge model now routinely costs tens or even hundreds of millions of dollars, with the cost projected to reach billions for the most ambitious projects. If the financial returns from selling direct access far exceed those from community contributions, AI labs might prefer secrecy.
Additionally, looming regulations could curtail how and to whom open models are released. If developers can be held legally responsible for harmful use of their models, they might hesitate to share weights widely. Conversely, the competitiveness of the global AI market—exemplified by ChatGPT’s 350 million users vs Meta’s nearly 500 million monthly users for its AI assistant—might fuel a continued arms race that incentivizes at least one major player, like Meta, to remain open.
Implications for AI Safety, Policy, and Research
The fact that open models are about a year behind in raw capabilities, yet remain close enough to produce near-frontier performance, has two major ramifications:
Positive Side: Open models foster broader research on safety and alignment. By enabling experts around the world to probe, stress-test, and fine-tune near-frontier models, new insights might emerge on how to mitigate potential harms. Furthermore, open ecosystems can innovate at a rapid pace, potentially contributing novel architectures, training strategies, and domain-specific adaptations.
Negative Side: The easier a model is to download, fine-tune, and redistribute, the harder it becomes to enforce guardrails. If a near-frontier model is repurposed, it might power disinformation campaigns, malicious code generation, or other problematic uses. The one-year lag effectively offers policymakers and large labs a short window to evaluate novel frontiers before such models become widely accessible in open form.
Conclusions
Overall, the report finds that open large language models typically lag behind the leading closed models by about one year, judging by both benchmark performance and training compute. However, that gap may narrow in 2025 and beyond, particularly if Meta follows through on its plan for Llama 4 to match or exceed closed-model compute levels. In contrast, many smaller developers of open models, lacking Meta’s resources, will remain further behind.