
NVIDIA just squeezed a massive AI model into half the space

NVIDIA quietly dropped something interesting on Hugging Face last week, and the backstory is almost as good as the release itself.

When MiniMax unleashed their M3 model earlier this month, developers had the usual two-part reaction: impressed by the benchmarks, then immediately stressed about the hardware bill. At 428 billion parameters, M3 is not what you'd call a casual deployment. So the question on everyone's mind wasn't just performance — it was "how do I actually run this thing?"

Turns out, all it took was one cheeky reply on X.

A tweet, a joke, and a Hugging Face drop

When M3 launched, a systematic trading account called Quant Capital tagged NVIDIA AI and jokingly asked for an NVFP4-quantized version, sweetening the deal by offering to "buy another DGX Spark" if they came through. A couple of weeks later, NVIDIA posted a Hugging Face link in the replies, and just like that, MiniMax-M3-NVFP4 was live.

Whether the request actually nudged the team or it was already in the pipeline, the casual way this release landed made it feel like someone inside NVIDIA thought it was funny enough to play along.

What NVFP4 actually does here

A quick breakdown for anyone not living inside precision format documentation: NVFP4 is a 4-bit floating-point data type that NVIDIA built specifically for its new Blackwell GPU architecture. By running M3 through its Model Optimizer and quantizing the weights down to NVFP4, NVIDIA cut the model's disk footprint and GPU memory requirements roughly in half compared to the standard FP8 version, since each weight now takes about 4 bits instead of 8.
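For a feel of what 4-bit block quantization does to a handful of weights, here is a minimal Python sketch. It is not NVIDIA's Model Optimizer code. It assumes the commonly described NVFP4 layout (E2M1 values sharing one scale per 16-weight block) and skips the FP8 rounding of that scale and the extra per-tensor scale. The function names and sample weights are made up for illustration.

```python
# Illustrative NVFP4-style quantize/dequantize for one block of 16 weights.
# Assumptions (not from the article): E2M1 value grid, 16-element blocks,
# one scale per block. Real kernels store the scale in FP8 and add a
# per-tensor FP32 scale; both are skipped here for brevity.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes a 4-bit E2M1 value can hold
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes); each code is a signed value from the E2M1 grid."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the shared scale and 4-bit codes."""
    return [c * scale for c in codes]


# Hypothetical example weights, evenly spread between -0.15 and 0.15.
weights = [0.02 * i - 0.15 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f} max_error={max_error:.4f}")
```

The takeaway is that each weight collapses onto one of a few grid points, and the shared scale keeps the rounding error bounded relative to the largest weight in the block.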

That's a big deal when you're talking about a model this size. M3 uses a Mixture-of-Experts (MoE) architecture, so only about 23 billion of its 428 billion parameters are active for any given token. That helps with speed, but every expert's weights still have to sit in memory regardless. Shrinking that footprint by roughly half without gutting the model's capability is exactly the kind of optimization the open-weight community has been waiting for.
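The numbers above are easy to sanity-check with some back-of-the-envelope arithmetic. The sketch below assumes every weight is quantized (real checkpoints usually keep some layers in higher precision) and that NVFP4 costs about 4.5 bits per weight once a shared 8-bit scale per 16-weight block is counted. It ignores KV cache and activations, so treat the output as a rough floor, not a deployment spec.

```python
# Back-of-the-envelope weight footprint for a 428B-parameter MoE model.
# Assumes every weight is quantized, which real checkpoints rarely do.

TOTAL_PARAMS = 428e9   # total parameters, from the article
ACTIVE_PARAMS = 23e9   # parameters active per token, from the article

FP8_BITS = 8.0
# Assumption: 4-bit values plus one 8-bit scale shared by each 16-weight block.
NVFP4_BITS = 4.0 + 8.0 / 16


def weight_gb(params, bits_per_param):
    """Weight storage in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


fp8_gb = weight_gb(TOTAL_PARAMS, FP8_BITS)
nvfp4_gb = weight_gb(TOTAL_PARAMS, NVFP4_BITS)
active_share = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"FP8 weights:   ~{fp8_gb:.0f} GB")
print(f"NVFP4 weights: ~{nvfp4_gb:.0f} GB ({nvfp4_gb / fp8_gb:.0%} of FP8)")
print(f"Active per token: {active_share:.1%} of all parameters")
```

Under these assumptions the NVFP4 weights land a little above half the FP8 size, which matches the "roughly in half" framing, while only a small fraction of the parameters does work on any single token.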

On that last point, the benchmarks are reassuring. The NVFP4 version scores 91.92 on the GPQA Diamond benchmark (a graduate-level science reasoning test), compared to 92.53 for the FP8 baseline. That 0.61-point gap is negligible in practical terms. The model also keeps its full 1-million-token context window and supports text, images, and up to 30 minutes of video. So nothing got stripped out.
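To put that gap in proportion, a quick calculation helps. This uses only the two GPQA Diamond scores quoted above; the variable names are just for illustration.

```python
# Accuracy cost of quantization, using the two scores quoted in the article.
fp8_score = 92.53    # FP8 baseline on GPQA Diamond
nvfp4_score = 91.92  # NVFP4 checkpoint on GPQA Diamond

absolute_drop = fp8_score - nvfp4_score
relative_drop = absolute_drop / fp8_score

print(f"Absolute drop: {absolute_drop:.2f} points")
print(f"Relative drop: {relative_drop:.2%} of the baseline score")
```

In other words, the quantized model keeps more than 99% of the baseline score on this one benchmark. A single benchmark is not a full quality audit, though.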

Still some rough edges

This is very much a cutting-edge release, and it shows. You can't just spin it up with a stable vLLM build right now. Instead, you'll need to pull a nightly Docker image that includes specific pull requests to get the checkpoint running at all. Early users on X have also flagged slower-than-expected token generation on some setups, with one report citing around 12 tokens per second on 4x RTX 6000 Ada GPUs, well below the roughly 90 tokens per second they'd typically see with other 4-bit models.

These kinds of rough edges are pretty normal for a release that's this fresh and this tightly coupled to new hardware. The expectation is that vLLM support will stabilize as the PRs get merged into mainline builds.

The bigger picture

Frontier models keep getting larger, and there's been a real concern that the open-weight ecosystem would start falling behind simply because of hardware constraints. A quantized M3 that runs on fewer GPUs without meaningful quality loss pushes back against that trend.

As for Quant Capital — looks like they owe someone a DGX receipt.
