NVIDIA Blackwell Redefines Generative AI Performance in MLPerf Inference Benchmark


August 29, 2024 by our News Team

NVIDIA is leading the way in the generative AI race, with their powerful data center infrastructure and innovative technologies like the Blackwell platform and Triton Inference Server, as well as their impressive results in the latest MLPerf benchmarks.

  • NVIDIA's platforms showcased impressive performance across all data center tests in the latest MLPerf benchmarks.
  • The upcoming NVIDIA Blackwell platform boasts up to four times the performance compared to its predecessor, thanks to its innovative second-generation Transformer Engine and FP4 Tensor Cores.
  • NVIDIA's continuous software innovation, including the Triton Inference Server, allows for reduced costs and faster deployment times for businesses looking to leverage AI.


The Generative AI Race: How NVIDIA is Shaping the Future of Data Centers and Beyond

As businesses scramble to harness the power of generative AI, the demand for robust data center infrastructure is skyrocketing. It’s one thing to train large language models (LLMs)—that’s like prepping for a marathon—but delivering real-time services powered by those models? That’s akin to sprinting a 100-meter dash while juggling. It’s a challenge that requires not just speed but also precision, and NVIDIA is stepping up to the plate with some impressive tech.

In the latest round of MLPerf benchmarks, dubbed Inference v4.1, NVIDIA’s platforms showcased their expertise across all data center tests. The spotlight was particularly bright on the upcoming NVIDIA Blackwell platform, which boasted up to four times the performance compared to its predecessor, the H100 Tensor Core GPU. This leap was largely due to its innovative second-generation Transformer Engine and FP4 Tensor Cores. If you’re wondering how that translates in real-world terms, think of it as upgrading from a bicycle to a high-speed train when it comes to processing LLM workloads like Llama 2 70B.
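For readers wondering what "FP4" means in practice, here is a minimal sketch of 4-bit floating-point quantization. It assumes the common E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and one shared scale per block of values. The article does not say which format or scaling scheme Blackwell's Transformer Engine uses, so the function name and the scaling rule below are illustrative assumptions, not NVIDIA's implementation.

```python
# Magnitudes representable by a 4-bit E2M1 float (assumed layout:
# 1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Round a block of floats to FP4 using one shared scale.

    Returns (scale, dequantized_values). Hypothetical helper for
    illustration only.
    """
    if not values:
        return 1.0, []
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return 1.0, [0.0] * len(values)
    # Map the largest magnitude in the block onto the largest FP4 value.
    scale = peak / FP4_MAGNITUDES[-1]
    dequantized = []
    for v in values:
        target = abs(v) / scale
        # Pick the closest representable magnitude.
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        dequantized.append((nearest if v >= 0 else -nearest) * scale)
    return scale, dequantized


if __name__ == "__main__":
    scale, approx = quantize_block([1.0, -2.2, 12.0])
    print(scale, approx)
```

With only eight magnitudes per sign, each value costs a quarter of the memory and bandwidth of a 16-bit number, at the price of coarser rounding (here -2.2 is stored as -2.0).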

But it’s not just about raw power. The NVIDIA H200 Tensor Core GPU also made waves, delivering stellar results across a range of benchmarks, including the newly introduced Mixtral 8x7B model. This model, with its impressive 46.7 billion parameters, is a game changer. It activates only a fraction of those parameters at any given time—12.9 billion per token—allowing for quicker responses and a broader range of tasks. It’s like having a Swiss Army knife that only unfolds the tools you need for the job at hand, making it both versatile and efficient.
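The "only unfolds the tools you need" idea can be sketched in a few lines. The example below assumes Mixtral's published design of 8 experts with 2 selected per token; the router scores are made up and the function names are hypothetical. It also works out the share of parameters active per token from the two figures quoted above.

```python
import math

TOTAL_PARAMS_B = 46.7   # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 12.9  # parameters used per token, in billions (from the article)


def active_fraction(active=ACTIVE_PARAMS_B, total=TOTAL_PARAMS_B):
    """Share of the model's parameters touched for a single token."""
    return active / total


def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token.

    Returns a list of (expert_index, weight) pairs whose weights are a
    softmax over the selected experts' scores and sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


if __name__ == "__main__":
    print(f"active share per token: {active_fraction():.1%}")
    # Illustrative router scores for 8 experts (invented values).
    print(route_token([0.1, 2.0, -1.0, 2.0, 0.0, 0.3, -0.5, 1.0]))
```

The active share comes out a little above 2/8 because the attention layers and other shared weights run for every token; only the expert feed-forward blocks are skipped.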

The Need for Speed

As LLMs grow in complexity, so does the need for computing power to handle inference requests. Imagine trying to serve millions of users simultaneously while keeping response times low; it’s no small feat. Here’s where multi-GPU compute comes into play. NVIDIA’s NVLink and NVSwitch technologies are designed for high-speed communication between GPUs, which is crucial for real-time large model inference. With the Blackwell platform extending these capabilities to support up to 72 GPUs, the potential for scaling is enormous.

What’s even more interesting is the collaborative spirit in this space. NVIDIA isn’t going it alone; they’ve partnered with heavyweights like Cisco, Dell, and Lenovo, all of whom submitted impressive MLPerf results. This collective effort underscores the widespread adoption and availability of NVIDIA’s platforms, which is great news for businesses looking to dive into generative AI.

Continuous Innovation

One thing that stands out about NVIDIA is its relentless push for software innovation. Their platforms are not static; they evolve. The latest inference round saw the H200 GPU delivering a remarkable 27% increase in generative AI performance compared to the previous round. It’s like getting a free upgrade on your smartphone every month. Who wouldn’t want that?

A particularly noteworthy tool in NVIDIA’s arsenal is the Triton Inference Server. This open-source server allows organizations to consolidate multiple framework-specific inference servers into a single platform. The result? Reduced costs and significantly faster deployment times—from months to mere minutes. It’s a game changer for companies that want to leverage AI without getting bogged down in technical complexities.

Bringing AI to the Edge

But what about the edge, you ask? This is where things get really exciting. Deployed at the edge, generative AI models can transform sensor data—think images and videos—into actionable insights in real time. The NVIDIA Jetson platform is designed for just this purpose, enabling developers to run various models, including LLMs and vision transformers, locally.

In the latest MLPerf benchmarks, the NVIDIA Jetson AGX Orin system-on-modules achieved more than 6.2 times the throughput and a 2.4 times improvement in latency on the GPT-J LLM workload compared with the previous round. This means developers can now leverage a general-purpose 6-billion-parameter model to interact seamlessly with human language right at the edge. It’s a bit like having a personal assistant that’s not just smart but also incredibly efficient.

Conclusion: A Future Powered by NVIDIA

The latest MLPerf Inference benchmarks have clearly demonstrated NVIDIA’s versatility and leadership in performance, extending from data centers to the edge. As we continue to witness the rapid evolution of AI-powered applications and services, it’s clear that NVIDIA is not just keeping pace; they’re setting the tempo.

If you’re curious about diving deeper into these results, NVIDIA’s technical blog has all the nitty-gritty details. And for those eager to jump into the action, H200 GPU-powered systems are now available from several providers, including CoreWeave and Dell Technologies. The race is on, and it’s shaping up to be a thrilling ride.


About Our Team

Our team comprises industry insiders with extensive experience in computers, semiconductors, games, and consumer electronics. With decades of collective experience, we’re committed to delivering timely, accurate, and engaging news content to our readers.

Background Information


About Dell:

Dell is a global technology leader providing comprehensive solutions in hardware, software, and services. Known for its customizable computers and enterprise solutions, Dell offers a diverse range of laptops, desktops, servers, and networking equipment. With a commitment to innovation and customer satisfaction, Dell caters to a wide range of consumer and business needs, making it an important player in the tech industry.


About Lenovo:

Lenovo, formerly known as "Legend," is a prominent global technology company that offers an extensive portfolio of computers, smartphones, servers, and electronic devices. Notably, Lenovo acquired IBM's personal computer division, including the ThinkPad line of laptops, in 2005. With a strong presence in laptops and PCs, Lenovo's products cater to a wide range of consumer and business needs. Committed to innovation and quality, Lenovo delivers reliable and high-performance solutions, making it a significant player in the tech industry.


About NVIDIA:

NVIDIA has firmly established itself as a leader in the realm of client computing, continuously pushing the boundaries of innovation in graphics and AI technologies. With a deep commitment to enhancing user experiences, NVIDIA's client computing business focuses on delivering solutions that power everything from gaming and creative workloads to enterprise applications. Known for its GeForce graphics cards, the company has redefined high-performance gaming, setting industry standards for realistic visuals, fluid frame rates, and immersive experiences. Complementing its gaming expertise, NVIDIA's Quadro and NVIDIA RTX graphics cards cater to professionals in design, content creation, and scientific fields, enabling real-time ray tracing and AI-driven workflows that raise productivity and creativity. By integrating graphics, AI, and software, NVIDIA continues to shape the landscape of client computing in a rapidly evolving digital world.


Technology Explained


Blackwell: Blackwell is an AI computing architecture designed to supercharge tasks like training large language models. These powerful GPUs boast features like a next-gen Transformer Engine and support for lower-precision calculations, enabling them to handle complex AI workloads significantly faster and more efficiently than before. While aimed at data centers, the innovations within Blackwell are expected to influence consumer graphics cards as well.


GPU: GPU stands for Graphics Processing Unit and is a specialized type of processor designed to handle graphics-intensive tasks. It is used in the computer industry to render images, videos, and 3D graphics. GPUs are used in gaming consoles, PCs, and mobile devices to provide a smooth and immersive gaming experience. They are also used in the medical field to create 3D models of organs and tissues, and in the automotive industry to create virtual prototypes of cars. GPUs are also used in the field of artificial intelligence to process large amounts of data and create complex models. GPUs are becoming increasingly important in the computer industry as they are able to process large amounts of data quickly and efficiently.


Latency: Latency is the time it takes for a computer system to respond to a request. It is a major factor in the performance of networks, storage systems, and other computer systems, because it determines how quickly a user or program sees a result. Low latency is essential for applications that require fast response times, such as online gaming, streaming media, and real-time data processing. High latency causes delays, resulting in slow response times and poor performance. To reduce latency, computer systems use techniques such as caching, load balancing, and parallel processing.


LLM: A Large Language Model (LLM) is a highly advanced artificial intelligence system, often based on transformer architectures such as those behind GPT-3.5, designed to comprehend and produce human-like text on a massive scale. LLMs possess exceptional capabilities in various natural language understanding and generation tasks, including answering questions, generating creative content, and delivering context-aware responses to textual inputs. These models undergo extensive training on vast datasets to grasp the nuances of language, making them invaluable tools for applications like chatbots, content generation, and language translation.


Tensor Cores: Tensor Cores are a type of specialized hardware designed to accelerate deep learning and AI applications. They are used in the computer industry to speed up the training of deep learning models and to enable faster inference. Tensor Cores are capable of performing matrix operations at a much faster rate than traditional CPUs, allowing for faster training and inference of deep learning models. This technology is used in a variety of applications, including image recognition, natural language processing, and autonomous driving. Tensor Cores are also used in the gaming industry to improve the performance of games and to enable more realistic graphics.
