ERNIE 5.1 Tested in Detail at Mona Vale Beach

Baidu's ERNIE 5.1 beats DeepSeek on the hardest tool-assisted agent math tasks, competes with Claude and Gemini, and was trained at only 6% of the usual training cost. It sits in the global top tier on agents, math, knowledge, and search while cutting total parameters to about a third and active parameters to about half of ERNIE 5's.

Baidu pulled this off by retraining from scratch a model that is derived from ERNIE 5 but aggressively compressed, using a single training run that flexes across depth, width, and routing sparsity.

Post-training then runs in four stages that keep capabilities broad without the usual trade-offs, ending with targeted reinforcement learning for creative writing and open-ended chat.

If you are exploring Claude-based coding stacks alongside these comparisons, see this practical guide to install Claude Code in Antigravity.

Benchmarks at a glance

On Text Arena, the top four are Claude Opus variants, then Gemini, Meta’s Muse Spark, GPT, and Grok. ERNIE 5.1 preview sits at 13 globally and is the only Chinese model on the list.

Two leaderboards, two top 15 finishes, one Chinese model.


On the toughest agent math benchmark with tools, ERNIE beats DeepSeek V4 Pro, with Gemini 3.1 Pro edging it overall.

On knowledge and reasoning it is competitive with Claude Opus and DeepSeek, with Gemini slightly ahead.

All of that from a model trained at a fraction of the cost.


For a contrasting drafting-oriented system in this cohort, check out the Gemma 4 drafter.

What changed in training

The model family is retrained from scratch: it is derived from ERNIE 5 but heavily compressed. Total parameters are cut to about a third and active parameters to about half. The low training cost is the headline outcome.

Dynamic depth, width, and routing

Normally you train separate models at different sizes, which means multiple full training runs.

Baidu trained one model that flexes dynamically across depth, width, and routing sparsity all at once.

One training run, an entire family of model sizes.
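To make the idea concrete, here is a minimal sketch of how one training loop could pick a different sub-network on every step. Everything in it is an illustrative assumption rather than Baidu's published method: the names `ElasticConfig`, `sample_subnetwork`, and `training_schedule`, the choice lists, and the uniform sampling are all hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical choices along each axis; the article gives no real figures.
DEPTH_CHOICES = [24, 36, 48]        # number of transformer layers kept
WIDTH_CHOICES = [0.5, 0.75, 1.0]    # fraction of hidden width kept
TOP_K_CHOICES = [2, 4, 8]           # experts routed per token (routing sparsity)


@dataclass(frozen=True)
class ElasticConfig:
    depth: int
    width_fraction: float
    top_k: int


def sample_subnetwork(rng: random.Random) -> ElasticConfig:
    """Pick one sub-network along all three axes for a single training step."""
    return ElasticConfig(
        depth=rng.choice(DEPTH_CHOICES),
        width_fraction=rng.choice(WIDTH_CHOICES),
        top_k=rng.choice(TOP_K_CHOICES),
    )


def training_schedule(steps: int, seed: int = 0) -> list[ElasticConfig]:
    """Return the sub-network used at each step of one shared training run."""
    rng = random.Random(seed)
    return [sample_subnetwork(rng) for _ in range(steps)]


if __name__ == "__main__":
    for step, cfg in enumerate(training_schedule(5)):
        print(step, cfg)
```

Because every step updates the same shared weights, any configuration from the choice lists could in principle be extracted afterwards as its own model size, which is the "one run, whole family" claim in miniature.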


Four stages that keep breadth without trade-offs

Stage 1 is standard instruction fine-tuning across all domains.

Stage 2 trains separate expert models for code, reasoning, and agents in complete isolation.

Stage 3 fuses all those experts into one at the same time.

Stage 4 adds reinforcement learning for creative writing and open-ended chat, producing strong control over tone and style. Clean separation, clean fusion, no compromises.
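The article does not say how Stage 3 fuses the experts. As a purely illustrative stand-in, the sketch below merges all experts in one step with a weighted average of their parameters, a simple and widely used model-merging technique. The `fuse_experts` helper and the toy parameter dictionaries are hypothetical and are not Baidu's actual procedure.

```python
from typing import Dict, Optional

Params = Dict[str, float]  # toy stand-in for a model's weight tensors


def fuse_experts(experts: Dict[str, Params],
                 weights: Optional[Dict[str, float]] = None) -> Params:
    """Merge every expert at once via a weighted average of each parameter."""
    if not experts:
        raise ValueError("need at least one expert")
    if weights is None:
        # Default assumption: every expert contributes equally.
        weights = {name: 1.0 for name in experts}
    total = sum(weights[name] for name in experts)
    keys = next(iter(experts.values())).keys()
    fused: Params = {}
    for key in keys:
        fused[key] = sum(weights[name] * params[key]
                         for name, params in experts.items()) / total
    return fused


if __name__ == "__main__":
    # Stage 2 output (hypothetical): three experts trained in isolation.
    experts = {
        "code": {"w1": 1.0, "w2": 0.0},
        "reasoning": {"w1": 0.0, "w2": 1.0},
        "agents": {"w1": 0.5, "w2": 0.5},
    }
    print(fuse_experts(experts))
```

The point of the sketch is only the shape of the step: all experts enter a single merge together, rather than being folded in one after another.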

Trials on the hosted platform

One-shot tool build in a self-contained HTML file

I turned on thinking mode and gave it a single, very complex one-shot prompt: build a black box locator app, in one self-contained HTML file, for a plane crash near Fiji. This tests whether it can fit what would normally be multi-file reasoning into one HTML file, along with JavaScript-heavy logic and a long dependency chain.

It produced a full dashboard with signal analysis calculations, a movable map, interactive clicks that update signal paths, live latitudes and longitudes, and a depth estimate with location confirmed near Tonga.

The map is stretchable and draggable, and the left panel shows everything I asked for.

There was a slight map repetition mistake, which I can live with given the completeness of the build.
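For a sense of the geometry such an app has to get right, here is a small sketch of two standard building blocks: great-circle distance via the haversine formula and longitude normalization. It is not the code ERNIE generated. The coordinates for Suva and Nuku'alofa are approximate, and linking the map repetition to longitude wrap-around at the antimeridian, which runs between Fiji and Tonga, is only a guess.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def wrap_longitude(lon_deg: float) -> float:
    """Normalize a longitude in degrees to the range [-180, 180)."""
    return (lon_deg + 180.0) % 360.0 - 180.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    # Wrap the longitude difference so crossing 180 degrees takes the short way.
    dlam = math.radians(wrap_longitude(lon2 - lon1))
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


if __name__ == "__main__":
    suva = (-18.14, 178.44)        # Fiji, approximate
    nukualofa = (-21.14, -175.20)  # Tonga, approximate
    print(f"{haversine_km(*suva, *nukualofa):.0f} km")
```

A map that does not normalize longitudes this way tends to draw the same region more than once when panned across the date line, which is one plausible source of the repetition.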


Multilingual translation stress test

I asked it to translate “My neighbor has nine kids yet claims she never found true love.” into 80+ languages across common, regional, and low resource families.

I spot-checked several with external references and they read well.

From Indian regional languages to those of Southeast Asia, Eastern Europe, and Central Asia, plus Dutch, Kurdish, and Serbian, the phrasing holds up.

Even output that looked like gibberish to me stayed consistent in style and orthography.
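Orthography is the one part of a spot check like this that can be partly automated. The sketch below is a rough heuristic, not a translation-quality metric: it reports the dominant Unicode script of a string so you can confirm that, say, a Hindi output is actually written in Devanagari. The `dominant_script` helper and the sample greetings are illustrative assumptions and are not taken from the test run.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most common script name among the letters in `text`."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode character names start with the script,
            # e.g. "CYRILLIC SMALL LETTER A".
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "NONE"


if __name__ == "__main__":
    # Hypothetical samples; replace with the model's real translations.
    samples = {"Hindi": "नमस्ते", "Russian": "Привет", "Dutch": "Hallo"}
    for language, text in samples.items():
        print(language, dominant_script(text))
```

A mismatch between the requested language and the detected script is a cheap red flag worth checking by hand. A match says nothing about whether the translation is correct.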

If you need speech output to pair with multilingual workflows, see Dramabox TTS.

Agentic advisor with web search

With thinking mode on and web search enabled, I set up a near-impossible brief: a living dodo, extinct since 1681, is found in a remote highland forest in Mauritius, and access shuts in 72 hours. Three groups clash.

A pharma company wants DNA extraction rights, a conservation body demands full isolation, and a local religious group claims the bird is a sacred woman that must not be touched. The instructions required Stephy to look up real dodo biology, current de-extinction science, and Mauritian conservation law, then finish with a clear, decisive, prioritized plan without hedging.

The chain of thought focused on what matters, sliced the problem cleanly, and moved briskly to action. References like Colombian veterinary protocols, the Copenhagen genome reference, and Colossal Biosciences appeared via web search. The negotiation framework gave each group a win while nobody got full control.

The final order from Stephy was precise, presented hour by hour, with lines like “call the priest, get the money, deploy the nebulizer.” A few beats felt a bit convenient, but that is nitpicking given the rigor and clarity.

Quick steps to reproduce my tests

Step 1: Open the hosted ERNIE 5.1 interface and enable thinking mode.


Step 2: Issue a single one-shot prompt to build a self-contained HTML black box locator with a movable map, signal analysis, coordinates, and a depth estimate.

Inspect the left dashboard and interactive map behavior.


Step 3: Switch to the instant model and translate the sentence about the neighbor with nine kids into a long list of languages, including regional and low resource families.


Step 4: Enable thinking and web search, then run the dodo brief with the rule that the last paragraph must deliver a definitive, prioritized plan with no hedging.



Final thoughts

ERNIE 5.1 shows that a 6% training cost does not mean weak performance. It competes with the best on agents, math, and reasoning, shows strong tool use in complex one-shot builds, and handles multilingual prompts with surprising breadth.

If Baidu can do this with 6%, imagine what comes next.
