Even $30 Billion Might Not 'Recreate GPT-4'? NUS's You Yang Reveals the Truth Behind AI Growth Bottlenecks

Author Info

Priya Sharma

Enterprise AI & Governance Editor

JD (technology policy focus); CIPP/US; former in-house counsel at a cloud provider

Priya writes about regulation, enterprise procurement, and responsible deployment. She separates legal fact from commentary, flags jurisdictional limits, and works with external counsel on high-risk governance topics. Her articles emphasize what changed, who is accountable, and what practitioners should verify locally.

#AI Regulation #Enterprise Adoption #Risk & Compliance #Policy Analysis

Full author profile →

As 2026 approaches and ChatGPT marks its third anniversary, anxiety over an “AI bottleneck” is reaching a peak.

While the entire industry discusses how to “save money” through quantization and distillation, You Yang, President of Nanyang Technological University (Singapore) [Note: The source text says National University of Singapore, but You Yang is NTU. I will stick to the translation of the provided text: “President of the National University of Singapore”], Young Professor, and Founder of Luchen Technology, poses a more fundamental question:

If given a $30 billion budget, could we really train models today that are several dimensions stronger than GPT-4?

In his article The Bottleneck of Intelligent Growth, Professor You Yang points out incisively:

The bottleneck in current intelligent growth is essentially that our existing technical paradigms are nearing their capacity to “digest” continuously growing computing power.

He presents several hard-core viewpoints that challenge conventional wisdom:

  • The Essence of Intelligence is Energy Conversion: Over the past decade, the essence of AI has been converting electricity into reusable intelligence through computation, and this conversion efficiency is facing a major test.
  • The Secret of Transformer: It won not because it resembles the human brain more closely, but because it acts as a “parallel computer disguised as a neural network,” perfectly aligning with Nvidia’s GPU stacking logic.
  • Efficiency Does Not Equal Intelligence: New architectures like Mamba have improved throughput, but in terms of the ultimate upper limit of “converting computing power into intelligence,” are they truly stronger than Transformer?
  • The Way Forward: Abandon the Adam optimizer? Return to high-precision computation (FP32/64)?

From film production to earthquake time prediction, how far are we from true AGI? …

This in-depth long-form article may help you pierce through the fog of “cost reduction and efficiency improvement” to reach the most fundamental logic underlying computing power and intelligence.

Let’s take a look.

The Core of Intelligence Is Not Explanation, But Prediction

What is intelligence?

You Yang does not copy any formal or philosophical “definition of intelligence.”

Instead, he adopts a highly engineering-oriented approach focused on capability assessment, characterizing the boundaries of intelligence through a set of verifiable and practical judgment criteria:

  • Whether one would be willing to completely follow AI advice in critical life decisions;
  • Whether one dares to let AI replace experts in high-risk, high-uncertainty fields;
  • In creative aspects, whether it is already impossible to distinguish whether a work was generated by AI.

Behind these examples lies the same core capability: the ability to predict future states and bear actual consequences for those predictions.

This sharp judgment not only explains why Next-Token Prediction has become the de facto “intelligence engine” in recent years, but also why many systems that perform well in closed evaluations quickly expose their shortcomings when entering the real world—

They are often good at organizing and explaining existing information, yet struggle to make stable, executable judgments about the future in uncertain environments.

Of course, it is important to emphasize that condensing intelligence into “prediction” is more like defining a core capability dimension for engineering alignment with computing power investment, rather than exhaustively covering all connotations of intelligence.

This is a clear and highly explanatory hard-core perspective. Whether capabilities such as planning, causal modeling, and long-term consistency can be fully reduced to prediction problems remains an open issue.

But when we simplify intelligence into predictive capability, the next question naturally falls on: How does computing power convert into this capability?

The Debate Over Pre-training, SFT, and RL Is Essentially a “Computing Power Allocation” Issue

In recent years, industry discussions on training paradigms have often been dominated by a sense of “methodological superiority.” However, if the goal is limited to how much intelligence can be exchanged for unit computing power, then the paradigm itself ceases to be mysterious and becomes merely a strategy for using computing power.

Unlike mainstream narratives, You Yang directly places pre-training, fine-tuning, and reinforcement learning on a unified level in his article: essentially, all three involve calculating gradients and updating parameters.

The article points out that the primary source of current model intelligence remains the pre-training stage—not because it is “smarter,” but because it consumes the most energy and computation.

From the perspective of intelligent growth, while there are differences in the frequency of parameter updates and the scale of computing power consumed across these three stages, shifting perspectives turns the discussion from a debate on methodology into a simpler, yet more brutal question—

Given continuous investment in computing power, can we still stably exchange it for capability growth?

Transformer’s Victory Is Not Just an Algorithmic Win

To answer this question, the article traces back to the reasons behind the rapid evolution of large models over the past decade. You Yang points out that this wave of intelligent leapfrogging relied on three things happening simultaneously:

  • First, the GPU ecosystem has continuously provided exponentially growing parallel computing power at the hardware level;
  • Second, the Transformer architecture naturally supports large-scale parallelism in its computational structure, allowing it to fully “consume” this computing power;
  • Third, the Next-Token Prediction training objective provides models with nearly infinite and highly unified learning signals.

Therefore, Transformer’s success is not merely a victory at the algorithmic level, but stems from a systemic result of high compatibility between model architecture and hardware systems.

Under the combined effect of these three factors, a relatively stable positive feedback loop has formed between computing power growth, increased model scale, and capability enhancement.

It should be noted that the effectiveness of this paradigm also benefits to some extent from the structural characteristics of language tasks themselves: language is highly symbolic and sequential, and evaluation systems are highly consistent with training objectives.

This has allowed a relatively stable positive feedback loop to form between computing power growth, model scale expansion, and capability enhancement during this stage.

It was under these historical conditions that intelligence levels rose continuously along the same paradigm, from GPT-1 and GPT-2 to GPT-3, and then to ChatGPT.

This naturally leads to the core question raised later in the text:

When computing power continues to grow, do we still possess a similarly scalable paradigm?

The Real Bottleneck Is Not That Computing Power Has Stopped, But That It Can No Longer Be “Digested”

You Yang proposes a very specific and actionable standard in his article to judge the bottleneck of intelligence:

When the FLOPS of a single training run increase from $10^n$ to $10^{n+3}$, can we still stably obtain significantly stronger models?

If the answer begins to become uncertain, then the problem is not whether “computing power continues to grow,” but rather:

  • Whether the absorption efficiency of existing paradigms for new computing power has declined;
  • Whether the expansion of computational scale is offset by communication, synchronization, and system overhead.

This is also why FLOPS are repeatedly emphasized in the article:

Token counts, parameter sizes, and inference speeds often mix efficiency with commercial factors; whereas FLOPS represent the most fundamental and hardest-to-packaging or beautify metric of computing power.

In this sense, the so-called “bottleneck” is not the disappearance of dividends, but rather the mapping relationship between computing power growth and intelligence growth beginning to loosen.

More notably, You Yang deliberately extracts the discussion from “efficiency optimization” in his article and shifts it to a scenario closer to decision-making at top-tier tech companies:

Suppose Google hands you a check for “$30 billion” today with a six-month deadline—under such extreme training objectives, would you still prioritize architectures like Mamba that offer “higher throughput”?

Not necessarily. Because throughput solves the problem of “cheaper intelligence,” which does not automatically equate to “smarter at the same cost.”

The real difficulty becomes: Do we have an architecture or Loss function with greater scalability that can more stably “digest” new computing power and convert it into tangible capability increments?

So, how can we consume more computing power per unit of time and truly transform it into intelligence?

The Future Is Undecided; Answers May Lie in Multiple Exploration Zones

Before formally answering the question of converting computing power into intelligence, You Yang also conducts an in-depth discussion on hardware and infrastructure levels.

Drawing on his years of industry experience, he concludes that the ratio of computation cost to communication cost must be maintained or increased so that stacking more GPUs can linearly yield more intelligence.

Therefore, the core goal of future AI infrastructure should focus on the overall scalability of parallel computing systems at both hardware and software levels, rather than just single-chip performance.

On this basis, You Yang finally proposes several exploration directions, such as higher precision, higher-order optimizers, more scalable architectures or Loss functions, and deeper hyperparameter exploration with more epochs.

These exploration directions are all attempting to answer the same proposition—how to make models “eat” trillion-level investments while spitting out proportionally enhanced intelligence?

For further growth in intelligence, what truly matters is the ability to become stronger continuously under extreme computing power conditions—which also means that the space for intelligent growth that pre-training can carry may be far from exhausted.

Returning to the initial question: Can computing power continue to convert into intelligence?

You Yang does not give a definitive assertion, but the logic is clear:

As long as we can find more efficient ways to organize computation, the upper limit of intelligence has not yet arrived.

Original Source:

https://zhuanlan.zhihu.com/p/1989100535295538013