Minimax New Open-Source AI Model From China SHOCKS The Industry (CRUSHES DeepSeek)

Minimax M1: The Open-Source Language Model Redefining Context and Performance

Minimax just dropped a 1 million token language model that anyone can use. For free. No paywall, no API limits, no vendor lock-in.

And it’s not just big. It’s smarter, faster, and cheaper to train than anything we’ve seen from the open-source world. We’re talking full book series memory, 80,000 token responses and performance that challenges models costing over $100 million.

Built for just half a million. This changes the game, so let’s talk about it. Alright, the biggest number in the launch post is the context window.

1 million input tokens with room for an 80,000 token reply. A token is nothing mysterious. It is simply a tiny chunk of text, often a word piece that the model understands.
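If you want to see tokens for yourself, here is a tiny sketch using the Hugging Face Transformers library. The gpt2 tokenizer is only a stand-in for illustration; Minimax ships its own tokenizer alongside the M1 weights.

```python
# Quick look at how text turns into tokens. "gpt2" is a stand-in tokenizer;
# any modern tokenizer gives a similar order of magnitude.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
sample = "A token is just a small chunk of text, often a piece of a word."
ids = tok(sample)["input_ids"]
print(len(sample.split()), "words ->", len(ids), "tokens")
# English prose usually lands around 3/4 of a word per token, so a 1M-token
# window holds on the order of 750,000 words of input.
```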

If you mashed all the Harry Potter books into a single prompt, you would still be safely below a million tokens. So M1 can keep an entire book series in its short-term memory while it writes. For comparison, OpenAI's GPT-4o can juggle about one-eighth of that, Claude 4 Opus can hold a fifth, and Google Gemini 2.5 Pro matches the million on input but has a shorter reply limit.

The well-known open-source model DeepSeek R1 tops out at 128,000 both ways. In simple terms, M1 has breathing room that other public models do not. Holding that much text usually slams into the transformer’s problem of attention costs ballooning as sequences get longer.

Inside Minimax M1: Smarter Scaling with Mixture of Experts and Lightning Attention

Minimax sidestepped this by combining two ideas. First, the model uses a mixture-of-experts design. Think of it as 32 specialist submodels that share one brain.

Each token wakes up only a handful of those specialists, so although the entire model contains 456 billion parameters, only about 46 billion are active for any given token.
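To make the "only a handful of specialists wake up" idea concrete, here is a minimal sketch of top-k expert routing in PyTorch. It is not Minimax's actual layer; the hidden size, the expert count, and k are illustrative.

```python
# Minimal top-k mixture-of-experts routing (illustrative sizes, not M1's real layer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=32, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores each token for every expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                              # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # only the selected experts ever run
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

print(TopKMoE()(torch.randn(10, 64)).shape)            # torch.Size([10, 64])
```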

Second, they swapped the classic every-token-looks-at-every-other-token attention for what they call lightning attention, a linear technique that keeps the math cost almost flat as the prompt grows. One regular transformer layer follows every seven lightning-attention blocks, so the model still keeps the strengths of the standard architecture, but most of the heavy lifting happens in the lighter mode.
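Lightning attention itself adds block-wise and hardware-aware tricks that this snippet does not attempt, but the core linear-attention idea, replacing the full n-by-n score matrix with a running summary, looks roughly like the sketch below. Shapes and the feature map are illustrative.

```python
# Toy causal linear attention: per-token cost stays constant as the sequence grows,
# so total cost scales linearly instead of quadratically. Not lightning attention
# itself, just the underlying linear-attention idea.
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps=1e-6):
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0   # positive feature map
    d = q.shape[-1]
    kv_state = torch.zeros(d, d)            # running sum of outer(k_i, v_i)
    k_state = torch.zeros(d)                # running sum of k_i
    out = torch.empty_like(v)
    for i in range(q.shape[0]):             # one linear pass over the sequence
        kv_state += torch.outer(k[i], v[i])
        k_state += k[i]
        out[i] = (q[i] @ kv_state) / (q[i] @ k_state + eps)
    return out

q = k = v = torch.randn(16, 8)
print(causal_linear_attention(q, k, v).shape)   # torch.Size([16, 8])
```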

The payoff is clear in their graphs. When M1 generates a 100,000-token answer, it burns through only about a quarter of the floating-point operations that DeepSeek R1 would need at the same length, and at 64,000 tokens it is under half. That efficiency also shows up in the price tag for training. Minimax completed the entire reinforcement learning stage in three weeks on 512 NVIDIA H800 graphics cards, racking up a rental cost of about $534,700.
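As a back-of-the-envelope check, assuming the quoted three weeks mean all 512 cards running around the clock (that is my assumption, not a figure from the report), the budget implies a rental rate of roughly two dollars per GPU-hour:

```python
# Back-of-the-envelope: what hourly H800 rate does the reported budget imply?
gpus = 512
weeks = 3
gpu_hours = gpus * weeks * 7 * 24          # 258,048 GPU-hours
total_cost = 534_700                       # reported rental cost in USD
print(gpu_hours, round(total_cost / gpu_hours, 2))   # 258048 ~2.07 USD per GPU-hour
```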

By contrast, DeepSeek R1’s team reported spending between $5 and $6 million, and early estimates for OpenAI’s GPT-4 training broke the $100 million mark. The difference comes partly from lightning attention’s low floating-point operations and partly from a fresh reinforcement learning algorithm that Minimax calls CISPO, short for Clipped Importance Sampling Policy Optimization. Traditional PPO, which many labs use, avoids instability by clipping big gradient steps for tokens that were unlikely in the previous model pass.

CISPO and Curriculum Learning: How Minimax Trained M1 to Reason Like a Human

Unfortunately, those rare tokens often hold the wait-let-me-rethink moments that real reasoning depends on, so PPO can strangle creativity. CISPO flips the rule. Instead of cutting off the gradient, it caps the importance-sampling weight and lets every token keep contributing.
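In code, the contrast between the two objectives looks roughly like the sketch below. This is a simplified reading of the idea, not Minimax's implementation; the tensor shapes, the advantage estimates, and the clipping bound eps_high are all illustrative.

```python
# Simplified contrast between PPO's clipped objective and a CISPO-style one.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                  # importance-sampling weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()            # unlikely tokens can lose their gradient

def cispo_style_loss(logp_new, logp_old, advantages, eps_high=2.0):
    ratio = torch.exp(logp_new - logp_old)
    capped = torch.clamp(ratio, max=eps_high).detach()      # cap the weight, not the gradient
    return -(capped * advantages * logp_new).mean()         # every token still contributes

logp_old, logp_new, adv = torch.randn(8), torch.randn(8), torch.randn(8)
print(ppo_clip_loss(logp_new, logp_old, adv).item(),
      cispo_style_loss(logp_new, logp_old, adv).item())
```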

In Minimax's own tests on a separate 32-billion-parameter Qwen model, CISPO hit the same math-contest score in half the number of training steps as its closest competitor. Before any of that clever reinforcement learning work began, Minimax carried out another 7.5 trillion tokens of standard pre-training on a data mix that was 70% STEM, code, books, and explicit reasoning. They then ran a supervised pass that injected long chain-of-thought answers so the model would naturally lay out each step of a solution.

Only after this foundation was solid did the reinforcement learning phase kick in, and the curriculum there is worth a closer look because it mirrors how people learn. First came tasks with answers that can be checked by handwritten rules. The team cleaned and deduplicated tens of thousands of real, competition-level math problems in the style of the American Invitational Mathematics Examination (AIME), included only those where the answer format made automated checking straightforward, and filtered out items that were either too easy or essentially impossible for current models.

They generated 53,000 logical puzzles, everything from cipher grids to Sudoku variants, using an internal synthesis tool and gave each puzzle a strict verifier. They also collected 30,000 competitive programming tasks, and where the original websites lacked tests, the older Minimax Text-01 model produced comprehensive test suites so that every code solution could be compiled and run. For a taste of real software engineering, they mirrored GitHub issues into a sandbox that can build and test actual repositories, meaning the model must read a bug report, locate the broken file, write a patch, and prove it works by passing unit tests before earning a reward.
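To give a toy sense of what "checkable by handwritten rules" means in practice, here are two illustrative checkers of my own, far simpler than Minimax's actual verifiers and sandbox:

```python
# Toy rule-based verifiers: a reward is granted only when the check passes.
import re

def verify_math_answer(model_output: str, gold: str) -> bool:
    """AIME-style answers are integers, so compare the last integer in the output."""
    numbers = re.findall(r"-?\d+", model_output)
    return bool(numbers) and int(numbers[-1]) == int(gold)

def verify_code(solution_fn, test_cases) -> bool:
    """A code solution earns its reward only by passing every (input, expected) pair."""
    return all(solution_fn(x) == expected for x, expected in test_cases)

print(verify_math_answer("... so the final answer is 204.", "204"))   # True
print(verify_code(lambda n: n * n, [(2, 4), (3, 9)]))                 # True
```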

Training Stability and Reward Optimization: How Minimax Fine-Tuned M1 for Accuracy and Coherence

The second batch involved tasks with a single correct answer that is hard to check with pure rules: detailed science questions, tricky factoids, or problems that allow several correct phrasings. Minimax trained a generative reward model, called GenRM, by comparing human-labeled answer pairs and tuning until the system reliably picked the better response. Finally came open-ended conversation, writing, and instruction following.

Here, GenRM used a simple three-way judgment (better, same, worse) against a reference answer that was itself selected through a Swiss-round tournament among outputs from several strong models. Throughout, the engineers monitored GenRM for length bias, because reward models sometimes hand out extra points just for longer replies. If the policy started stretching answers without real gains in accuracy, they paused, recalibrated the reward model, and resumed training.
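A length-bias monitor can be as simple as checking whether reward tracks answer length while accuracy stays flat. The snippet below is only a sketch of that idea, and the 0.3 correlation threshold is an arbitrary illustration.

```python
# Toy length-bias check for a reward model: flag runs where reward follows length.
import numpy as np

def length_bias_flag(rewards, lengths, accuracy_gain, corr_threshold=0.3):
    corr = np.corrcoef(rewards, lengths)[0, 1]
    # Suspicious only if reward correlates with length while accuracy is not improving.
    return corr > corr_threshold and accuracy_gain <= 0.0, corr

rewards = [0.2, 0.5, 0.6, 0.9]
lengths = [800, 2000, 3500, 7000]
print(length_bias_flag(rewards, lengths, accuracy_gain=0.0))   # (True, high correlation)
```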

Minimax also had to keep the training run from crashing. Early on, they spotted that probabilities inside the training loop did not match the ones seen during inference. The culprit was low numerical precision in the final language model head, so they switched that layer to full 32-bit floats, and the mismatch vanished.
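The fix itself is conceptually tiny: keep the output head in full precision even when everything else runs in bfloat16. A minimal illustration, with made-up layer sizes:

```python
# Keep the final LM head in float32 while the trunk runs in bfloat16, so
# training-time and inference-time token probabilities agree. Sizes are illustrative.
import torch
import torch.nn as nn

hidden = torch.randn(4, 1024, dtype=torch.bfloat16)   # activations from the bf16 trunk
lm_head = nn.Linear(1024, 32_000, bias=False)          # fp32 by default; small vocab for the demo

logits = lm_head(hidden.float())                       # upcast before the head
probs = torch.softmax(logits, dim=-1)
print(probs.dtype)                                     # torch.float32
```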

Another headache was output loops. Occasionally, the model would fall into a repetitive pattern, churning out boilerplate for thousands of tokens. To prevent the model from getting stuck in a loop where it keeps repeating itself with full confidence, basically saying the same thing over and over for thousands of words, they added a simple rule.

Pushing Limits: How Minimax M1 Achieves Long-Form Precision and Competitive Benchmark Scores

If the model writes about 3,000 tokens in a row with extremely high certainty, it just stops.
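As a sketch of that guard (the window size is taken from the description above, while the probability threshold is my own illustrative value):

```python
# Toy repetition guard: stop generating once `window` consecutive tokens were all
# emitted with very high probability (threshold is illustrative).
def should_stop(token_probs, window=3000, threshold=0.99):
    if len(token_probs) < window:
        return False
    return all(p > threshold for p in token_probs[-window:])

print(should_stop([0.995] * 3000))            # True
print(should_stop([0.995] * 2999 + [0.4]))    # False
```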

On top of that, they had to adjust how the model learns during training. The gradient values it was working with were almost unimaginably tiny, so the usual optimizer settings didn't handle them well, and the team retuned those controls to keep the delicate updates stable. With everything stable, the team pushed the model's reply length step by step: they started at 40,000 tokens, then expanded the limit to 48,000, 56,000, 64,000, 72,000, and finally 80,000.

They only opened the next gate once perplexity leveled off and the longest real generations brushed the ceiling. Each jump forced them to rebalance the dataset because, in later stages, bad answers tended to run longer than good ones. They addressed that by mixing sample-level and token-level losses, tightening gradient clipping, and lowering CISPO's clipping bound.
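The sample-level versus token-level distinction is easiest to see in code. The blend below is my own illustrative sketch, with an arbitrary mixing weight, not the exact recipe from the report.

```python
# Token-level averaging lets long answers dominate the gradient; sample-level
# averaging gives every answer equal weight. Mixing the two tempers long, bad answers.
import torch

def mixed_loss(token_losses, sample_lengths, alpha=0.5):
    token_level = token_losses.mean()                      # every token weighted equally
    per_sample = torch.stack([chunk.mean()                 # every answer weighted equally
                              for chunk in token_losses.split(sample_lengths)])
    return alpha * token_level + (1 - alpha) * per_sample.mean()

losses = torch.rand(100)                 # per-token losses for three answers, flattened
print(mixed_loss(losses, sample_lengths=[10, 30, 60]))
```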

So how does all that translate into real scores? On AIME 2024, the 80k version answers 86% correctly, just a notch behind the freshly patched DeepSeek-R1-0528 and comfortably ahead of earlier open-weight models. It turns in the same mid-70s range on AIME 2025 and almost 97% on the smaller MATH-500 set. On LiveCodeBench, which mixes everyday coding tasks, M1 lands at 65%, matching the big Qwen3 model and trailing DeepSeek by only a few points.

FullStackBench, which focuses on full project edits, records 68%. In knowledge and logic tests, GPQA Diamond shows 70%, ZebraLogic lands in the mid-80s, and MMLU-Pro sits just above 80%. The tough Humanity's Last Exam (HLE) benchmark, when run without external tools, places M1's text-only score around 8%, the same ballpark as most open models and well under Gemini 2.5 Pro or OpenAI o3, meaning there is still space to grow on real-world reasoning under heavy constraint.

Real-World Intelligence: How Minimax M1 Excels in Reasoning, Tool Use, and Long-Context Tasks

Software engineering is where M1 leverages its fine-tuned sandbox skills. On SWE-bench Verified, it repairs 56% of issues, a fraction shy of the latest DeepSeek but well ahead of everything else you can download today. Long-context understanding is perhaps its strongest suit.

On the 128,000-token version of OpenAI's MRCR benchmark, M1 posts 73.4, beating both Claude 4 Opus and OpenAI o3 and sitting second only to Gemini 2.5 Pro. On the million-token MRCR it still clears 56, and LongBench v2, which goes up to 2 million words of context, shows just over 61. For tool use, TAU-bench measures how well an agent follows airline and retail domain policies while calling APIs.

M1's scores there hover around 62 and 63, outpacing every other open model and edging ahead of Gemini in some runs. Pure factual recall is tougher: SimpleQA hands M1 about 18 out of 100, stronger than most open peers but behind DeepSeek's 27.

A broad multi-turn assistant check called MultiChallenge, judged by GPT-4o for coherence and helpfulness, slots M1 into the mid-40s, which is about level with Claude 4 and the revamped DeepSeek but below Gemini and OpenAI's latest. One pattern shows up in the training charts Minimax released: as reinforcement learning steps tick upward, the average answer length on math and coding problems climbs past 20,000 tokens, and the accuracy curves rise almost in lockstep.

In other words, letting the model think out loud for longer really does improve the final answer, provided the underlying attention can keep costs down. That’s the philosophical bet Minimax is making. Extend test time compute efficiently, and you get smarter models without simply scaling parameter count.

If you want to try M1 yourself, the company suggests vLLM as the serving backend because it handles large expert models with smart memory pooling, but you can start with the standard Transformers library if that's easier. The repo includes structured function calling, and the demo chatbot plugs in online search, image and video generation, text-to-speech, and voice cloning, so you can build a fairly complete assistant or agent without adding external tooling.
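A minimal serving sketch with vLLM might look like the following. The Hugging Face repo id, the tensor-parallel size, and the sampling settings are my assumptions; check the official model card and repo for the recommended configuration.

```python
# Hedged vLLM sketch; repo id, parallelism, and sampling values are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M1-80k",   # assumed Hugging Face repo id
    trust_remote_code=True,             # custom MoE + lightning-attention architecture
    tensor_parallel_size=8,             # a 456B-parameter model needs several GPUs
    max_model_len=128_000,              # raise toward 1M if your hardware allows
)

params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=4096)
out = llm.generate(["Explain the bug in this stack trace: ..."], params)
print(out[0].outputs[0].text)
```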

Because the license is so permissive, you can even keep everything on-premises, which is appealing for companies with strict data policies. That’s the full story for now. All the code, weights, and the technical report are linked below if you want a deeper look.

Thanks for staying with me, and I’ll catch you in the next article.

Also Read: Google Just Introduced NEW FORM of Intelligence (Evolving Nonstop)
