Large language models can solve challenging math problems. However, making them work efficiently at scale requires more than a strong checkpoint. You need the right serving stack, quantization strategy, and decoding methods—often spread across different tools that don’t work together cleanly. Teams end up juggling containers, conversion scripts, and ad‑hoc glue code to compare BF16 vs FP8 or to test a speculative decoding setup.
This post shows how to build a fast, reproducible inference pipeline using the NVIDIA NeMo-Skills library to manage NVIDIA TensorRT-LLM. It is a streamlined version of the setup we used to win the AI Mathematical Olympiad Prize 2024, which achieved 4x faster batched inference on two NVIDIA H100 GPUs with FP8 quantization and ReDrafter speculative decoding. The same workflow can run on a single workstation or scale out on a cluster with minimal changes.
By the end of this blog post, you’ll learn how to:
- Prepare and quantize an OpenMath model to an FP8 TensorRT-LLM engine.
- Train and integrate a ReDrafter draft model for speculative decoding.
- Launch an optimized inference server with optional tool-calling through a secure code sandbox.
- Benchmark latency and throughput across BF16, FP8, and FP8+ReDrafter configurations.

If you’re following along, we recommend a machine with two H100 (or comparable FP8-capable) GPUs or a Slurm cluster with similar nodes.
Setting up your environment
Our first step is to establish a consistent, isolated environment. We’ll use an NVIDIA PyTorch NGC container and install the essential libraries: TensorRT-LLM for model optimization and NeMo-Skills for overall pipeline management. FP8 inference requires an NVIDIA GPU with hardware FP8 support, such as one based on the NVIDIA Ada Lovelace, NVIDIA Hopper, NVIDIA Blackwell, or NVIDIA Rubin architecture. For this example, we assume two GPUs are available.
Container setup and library installation
Once inside the nvcr.io/nvidia/pytorch:25.05-py3 container, run the following commands to install TensorRT-LLM and NeMo-Skills:
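The exact commands depend on your container and on the release versions you target. As a rough sketch, assuming the PyPI package name tensorrt_llm and the public NeMo-Skills GitHub repository (both are assumptions here, and no versions are pinned), the Python helper below prints the two pip commands and runs them only when dry_run=False:

```python
import subprocess
import sys

# Assumed package sources: the PyPI name "tensorrt_llm" and the NeMo-Skills
# GitHub repository. No versions are pinned; choose ones that match your container.
INSTALL_COMMANDS = [
    [sys.executable, "-m", "pip", "install", "tensorrt_llm"],
    [sys.executable, "-m", "pip", "install",
     "git+https://github.com/NVIDIA/NeMo-Skills.git"],
]


def install(commands=INSTALL_COMMANDS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return len(commands)


install()  # dry run: prints the commands without installing anything
```

Keeping the default as a dry run makes it easy to review the commands before changing the environment. Check the TensorRT-LLM and NeMo-Skills documentation for the versions that match your container.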
Preparing model weights
The next step is preparing our large language model (LLM). We’ll download the nvidia/OpenMath-Nemotron-14B-Kaggle model and transform it into an optimized TensorRT-LLM engine using FP8 quantization.
Note on FP8 Quantization: FP8 (8-bit floating point) quantization is highly efficient but requires GPUs that support E4M3 FP8 (like NVIDIA Hopper GPUs). For other GPUs, int8_wo (8-bit integer with weight-only quantization) is recommended and doesn’t require calibration.
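To see why FP8 needs calibration data, it helps to look at how a scaling factor can be derived. The sketch below assumes simple per-tensor static scaling, where the largest absolute value observed during calibration is mapped onto 448, the largest finite E4M3 magnitude. It illustrates the idea only and is not the TensorRT-LLM quantization code; rounding to the FP8 grid is omitted.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_tensor_scale(calibration_values):
    """Return the scale that maps the observed absolute maximum onto E4M3_MAX."""
    amax = max(abs(v) for v in calibration_values)
    return amax / E4M3_MAX if amax > 0 else 1.0


def scale_and_clip(values, scale):
    """Divide by the scale and clip to the FP8 range (rounding is omitted)."""
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]


# Toy "activations" seen during calibration (made-up numbers).
calib = [0.5, -3.5, 7.0, 1.25]
scale = per_tensor_scale(calib)  # 7.0 / 448 = 0.015625

# A value twice as large as anything seen in calibration saturates at 448.
scaled = scale_and_clip([7.0, 14.0], scale)
print(scale, scaled)
```

If the calibration data underestimates the ranges seen at inference time, large values are clipped. This is why the calibration set should resemble real prompts and solutions.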
Downloading model weights and datasets
Generate a Hugging Face token and export it as an environment variable. Then use the Hugging Face CLI to download the necessary models and datasets.
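As an alternative to the CLI, the same downloads can be sketched with the huggingface_hub Python API. The dataset repository ID nvidia/OpenMathReasoning, the local directory names, and the HF_TOKEN variable name are assumptions for illustration. Note that the full dataset is large.

```python
import os

# The model ID comes from this post; the dataset ID and the local directory
# names are assumptions for illustration.
DOWNLOADS = [
    {
        "repo_id": "nvidia/OpenMath-Nemotron-14B-Kaggle",
        "repo_type": "model",
        "local_dir": "OpenMath-Nemotron-14B-Kaggle",
    },
    {
        "repo_id": "nvidia/OpenMathReasoning",
        "repo_type": "dataset",
        "local_dir": "OpenMathReasoning",
    },
]


def download_all(downloads=DOWNLOADS, token=None):
    """Download every repository in `downloads` and return the local paths."""
    # Imported lazily so the plan above can be inspected without the library.
    from huggingface_hub import snapshot_download

    token = token or os.environ.get("HF_TOKEN")
    return [snapshot_download(token=token, **item) for item in downloads]


# Uncomment to start the downloads (requires network access and disk space):
# print(download_all())
```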
Preparing the calibration dataset for FP8 quantization
For FP8 quantization, a small calibration dataset that is representative of the inference data is essential. We’ll create it from a subset of the OpenMathReasoning dataset and save the resulting math calibration dataset in Hugging Face format.
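A minimal sketch of the sampling step follows. It assumes each source row has problem and generated_solution fields and that the calibration tool reads a single text field; these names, the sample count, and the length cap are assumptions, not settings taken from the pipeline.

```python
import random


def build_calibration_records(rows, num_samples=512, max_chars=4096, seed=0):
    """Sample rows and join problem and solution into one 'text' field each."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    rows = list(rows)
    picked = rng.sample(rows, min(num_samples, len(rows)))
    return [
        {"text": (row["problem"] + "\n\n" + row["generated_solution"])[:max_chars]}
        for row in picked
    ]


# Toy rows standing in for the real dataset (field names are assumptions).
toy_rows = [
    {"problem": "What is 2 + 2?", "generated_solution": "2 + 2 = 4."},
    {"problem": "Solve x + 1 = 3.", "generated_solution": "x = 2."},
    {"problem": "What is 3 * 3?", "generated_solution": "3 * 3 = 9."},
]
records = build_calibration_records(toy_rows, num_samples=2)
print(len(records), "calibration records")

# Optional, if the `datasets` library is installed:
# from datasets import Dataset
# Dataset.from_list(records).save_to_disk("calibration_dataset")
```

Fixing the seed means repeated runs quantize against the same subset, which keeps comparisons between engines fair.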
Converting and quantizing to TensorRT-LLM engine
Now, convert the Hugging Face model to a TensorRT-LLM engine, applying FP8 quantization and using the prepared calibration dataset. This step generates the FP8 quantized LLM inference engine.
After this command, your FP8 LLM engine is ready for deployment.
Accelerating inference with ReDrafter
To push our inference efficiency further, we integrate ReDrafter. This speculative decoding technique uses a smaller “draft” model to propose several tokens ahead, which the main LLM then verifies, so responses are generated faster. ReDrafter is an RNN-based inference method developed by Apple. The ReDrafter implementation is compatible with most models supported by the TensorRT-LLM library.
Installing and training ReDrafter
First, install the ReDrafter library. The tokenizer and training data for the draft model should be the same as those used for the base model. If the original training data is not available, base model generations can also be used for training the draft model.
During training, monitor the redrafter2_top1 score. A score above 0.6 indicates close to a 2x runtime speedup (60% of steps accept the next three drafted tokens).
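For intuition on how acceptance rates translate into speed, here is a back-of-the-envelope estimate. It uses a simplified model that we assume only for illustration: every drafted token is accepted independently with the same probability, and drafting stops at the first rejection. This is not the definition of redrafter2_top1, and it ignores the cost of running the draft head.

```python
def expected_tokens_per_step(accept_prob, num_draft_tokens):
    """Expected tokens emitted per verification step under the simplified model.

    One token always comes from the base model. The i-th drafted token counts
    only if all i drafted tokens up to it were accepted, hence accept_prob ** i.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    return sum(accept_prob ** i for i in range(num_draft_tokens + 1))


# 1 + 0.6 + 0.36 + 0.216 = 2.176 tokens per step on average
print(round(expected_tokens_per_step(0.6, 3), 3))
```

Under these assumptions, an acceptance probability of 0.6 with three drafted tokens yields a little over two tokens per base-model step, which is in line with the roughly 2x figure above.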
Building the TensorRT-LLM engine for the ReDrafter model
Now, we’ll convert our trained ReDrafter model into a TensorRT-LLM checkpoint and then combine it with our main LLM to create the final, accelerated TensorRT-LLM engine.
First, clone the TensorRT-LLM repository to access its conversion scripts:
Next, convert the trained ReDrafter PyTorch checkpoint to a TensorRT-LLM checkpoint.
Finally, build the combined TensorRT-LLM engine: the base model with a draft head for speculative decoding.
Your TensorRT-LLM engine, now supercharged with ReDrafter, is ready to be served!
Benchmarking and results
We’ve prepared a companion notebook where you can try out the full pipeline yourself. The notebook was run with the container and installations described in the container setup section above, using two H100 GPUs for inference. In the notebook, you can:
- Run inference on different TensorRT-LLM engines (BF16, FP8, FP8+ReDrafter).
- Compare performance benchmarks such as time to first token and throughput per device.
- Explore advanced controls, such as early stopping after a fixed time or terminating after the first N generations complete.
- Run inference with tool-calling.

Here’s a sample of the kind of benchmark results you’ll see:
| Metric | BF16 | FP8 | FP8+ReDrafter |
|---|---|---|---|
| Total generation time (s) | 144.2 | 64.7 | 30.5 |
| Average sample throughput (tok/s) | 34.6 | 75.2 | 138.5 |
Full benchmarks and code are available in the notebook. Check out the AIMO-2 Winning Solution paper for more results.
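To put the table in relative terms, the short script below computes speedups against the BF16 baseline. The numbers are copied from the table; the dictionary layout and function name are just for this illustration.

```python
# Values copied from the benchmark table above.
results = {
    "BF16": {"total_time_s": 144.2, "throughput_tok_s": 34.6},
    "FP8": {"total_time_s": 64.7, "throughput_tok_s": 75.2},
    "FP8+ReDrafter": {"total_time_s": 30.5, "throughput_tok_s": 138.5},
}


def speedups(results, baseline="BF16"):
    """Return time speedup and throughput gain of each config over the baseline."""
    base = results[baseline]
    return {
        name: {
            "time_speedup": round(base["total_time_s"] / r["total_time_s"], 2),
            "throughput_gain": round(
                r["throughput_tok_s"] / base["throughput_tok_s"], 2
            ),
        }
        for name, r in results.items()
    }


for name, s in speedups(results).items():
    print(f"{name}: {s['time_speedup']}x faster, {s['throughput_gain']}x throughput")
```

For this sample run, FP8 alone gives roughly a 2.2x gain over BF16, and adding ReDrafter brings the throughput gain to about 4x.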
The OpenMath LLM is a powerful tool-integrated reasoning model. This means it doesn’t just generate text: it can also write Python code and have it executed in a secure sandbox to solve problems. In the companion notebook, we provide an example of how to launch both the LLM server and its accompanying code execution sandbox.
The interaction works like this:
1. The LLM generates Python code wrapped in <tool_call> and </tool_call> tokens.
2. The inference engine extracts and sends this code to the sandbox.
3. The sandbox executes the code and returns the results.
4. The output is fed back to the LLM for continued generation or to finalize its answer.

Here’s an example of such an interaction:
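As a minimal sketch of this loop (not the NeMo-Skills implementation), the Python below extracts the code between the tool-call tokens, runs it, and appends the output to the transcript. The in-process executor, the fake_generate stand-in for the LLM, and the [tool output] marker are all hypothetical and exist only for this illustration. The executor is not a secure sandbox and should never be used on untrusted model output.

```python
import contextlib
import io
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_call(text):
    """Return the code inside the first <tool_call>...</tool_call> pair, or None."""
    match = TOOL_CALL_RE.search(text)
    return match.group(1).strip() if match else None


def run_untrusted_demo(code):
    """Toy executor: runs code in-process and captures stdout. NOT a sandbox."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__tool__"})
    except Exception as exc:  # report errors back to the model as text
        buffer.write(f"{type(exc).__name__}: {exc}")
    return buffer.getvalue().strip()


def solve(generate, prompt, max_rounds=4):
    """Alternate between generation and code execution until no tool call remains."""
    transcript = prompt
    for _ in range(max_rounds):
        completion = generate(transcript)
        transcript += completion
        code = extract_tool_call(completion)
        if code is None:
            break  # the model produced a final answer
        transcript += f"\n[tool output]\n{run_untrusted_demo(code)}\n"
    return transcript


def fake_generate(transcript):
    """Hypothetical stand-in for the LLM: one tool call, then a final answer."""
    if "[tool output]" not in transcript:
        return "<tool_call>\nprint(17 * 23)\n</tool_call>"
    return "The answer is 391."


print(solve(fake_generate, "What is 17 * 23?\n"))
```

The max_rounds cap bounds how many times the model may call the tool, so a model that keeps emitting tool calls cannot loop forever.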
To turn off tool-calling in the companion notebook, use get_model instead of get_code_execution_model as shown in the NeMo-Skills docs.
Try it yourself. Run the companion notebook to benchmark these performance improvements on your hardware and experiment with tool-calling capabilities.