NVIDIA Hopper: A Remarkable Leap in Generative AI Performance at MLPerf


March 27, 2024 by our News Team

NVIDIA's TensorRT-LLM software has nearly tripled the performance of NVIDIA Hopper architecture GPUs on the GPT-J LLM compared with results from just six months ago, showcasing the ability of NVIDIA's comprehensive platform to handle the demanding requirements of generative AI.

  • NVIDIA's TensorRT-LLM software has boosted the performance of NVIDIA Hopper architecture GPUs on the GPT-J LLM by nearly 3 times compared to their results just six months ago.
  • Leading companies are already leveraging TensorRT-LLM to optimize their models, showcasing its effectiveness in the real world.
  • The H200 GPUs, making their debut in MLPerf, achieved record-breaking results by producing up to 31,000 tokens per second on MLPerf's Llama 2 benchmark, demonstrating their impressive speed and capabilities.


NVIDIA has achieved a notable feat in the world of generative AI. In the latest MLPerf benchmarks, NVIDIA's TensorRT-LLM software, designed to accelerate inference on large language models, boosted the performance of NVIDIA Hopper architecture GPUs on the GPT-J LLM by nearly 3 times compared with their results just six months ago. This speedup highlights the capabilities of NVIDIA's comprehensive platform of chips, systems, and software in handling the demanding requirements of generative AI.

Leading companies are already leveraging TensorRT-LLM to optimize their models, and NVIDIA NIM, a set of inference microservices that includes inferencing engines like TensorRT-LLM, simplifies the deployment of NVIDIA’s inference platform for businesses.

The MLPerf benchmarks showcased NVIDIA’s TensorRT-LLM running on the latest H200 Tensor Core GPUs, which are memory-enhanced versions of the Hopper GPUs. This combination delivered the fastest performance in MLPerf’s largest-ever test of generative AI. The benchmark utilized Llama 2, a large language model with a staggering 70 billion parameters. Compared to the GPT-J LLM used in previous benchmarks, Llama 2 is more than 10 times larger. The H200 GPUs, making their debut in MLPerf, achieved record-breaking results by producing up to 31,000 tokens per second on MLPerf’s Llama 2 benchmark. These results include up to 14% gains from a custom thermal solution, showcasing the innovative cooling techniques employed by system builders to maximize the performance of Hopper GPUs.
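As a rough sanity check on those figures, the quoted 14% thermal gain implies the following back-of-envelope baseline. The 31,000 tokens-per-second peak comes from the article; the no-custom-cooling baseline below is an illustrative estimate, not a published result:

```python
# Illustrative arithmetic only: the peak figure is from the article;
# the implied baseline without the custom thermal solution is an estimate.
peak_tokens_per_s = 31_000   # H200 on MLPerf's Llama 2 70B benchmark
thermal_gain = 0.14          # "up to 14% gains from a custom thermal solution"

baseline = peak_tokens_per_s / (1 + thermal_gain)
print(f"Implied baseline: ~{baseline:,.0f} tokens/s")  # ~27,193 tokens/s
```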

NVIDIA is now shipping H200 GPUs, which will soon be available from nearly 20 leading system builders and cloud service providers. With 141 GB of HBM3E memory running at 4.8 TB/s, the H200 GPUs offer 76% more memory and 43% faster memory speed compared to their H100 predecessors. These accelerators are compatible with the same boards, systems, and software as the H100 GPUs. The increased memory capacity of the H200 GPUs enables a single GPU to run an entire Llama 2 70B model with the highest throughput, simplifying and accelerating inference processes.
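The percentage claims above can be checked against the H100's published SXM specifications. Note the H100 figures (80 GB of HBM3 at roughly 3.35 TB/s) are an assumption taken from public spec sheets, not from the article:

```python
# Verifying the article's H200-vs-H100 comparison.
# H100 SXM figures are assumed from public specifications.
h100_gb, h100_tbps = 80, 3.35
h200_gb, h200_tbps = 141, 4.8   # from the article

capacity_gain = h200_gb / h100_gb - 1        # ~0.76 -> "76% more memory"
bandwidth_gain = h200_tbps / h100_tbps - 1   # ~0.43 -> "43% faster"
print(f"{capacity_gain:.0%} more capacity, {bandwidth_gain:.0%} more bandwidth")
```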

NVIDIA’s GH200 Superchips take memory capacity to even greater heights, packing up to 624 GB of fast memory, including 144 GB of HBM3e. These Superchips combine a Hopper architecture GPU with a power-efficient NVIDIA Grace CPU on a single module. NVIDIA is the first to utilize HBM3e memory technology, which offers nearly 5 TB/second memory bandwidth. The GH200 Superchips have demonstrated outstanding performance in memory-intensive MLPerf tests, such as recommender systems.

In the latest round of MLPerf industry benchmarks, Hopper GPUs dominated every test of AI inference on a per-accelerator basis. These benchmarks cover a wide range of popular AI workloads and scenarios, including generative AI, recommendation systems, natural language processing, speech, and computer vision. NVIDIA was the only company to submit results on every workload in this round and every round since MLPerf’s data center inference benchmarks began in October 2020. These continued performance gains translate into lower costs for inference, which is a significant part of the daily workload for millions of NVIDIA GPUs worldwide.

NVIDIA’s commitment to pushing the boundaries of what’s possible in AI was evident in the special section of MLPerf benchmarks known as the open division. In this section, NVIDIA engineers showcased three innovative techniques. The first technique, known as structured sparsity, leverages NVIDIA A100 Tensor Core GPUs to reduce calculations and achieve up to 33% speedups on inference with Llama 2. The second technique, pruning, simplifies AI models by removing unnecessary components, resulting in up to 40% speedups on inference. Finally, the optimization technique called DeepCache reduced the computational requirements for inference with the Stable Diffusion XL model, leading to a remarkable 74% improvement in performance. These results were achieved using NVIDIA H100 Tensor Core GPUs.

MLPerf’s transparent and objective tests make it a trusted source for users evaluating AI systems and services. NVIDIA’s partners, including ASUS, Cisco, Dell Technologies, Fujitsu, Gigabyte, Google, Hewlett Packard Enterprise, Lenovo, Microsoft Azure, Oracle, QCT, Supermicro, VMware (recently acquired by Broadcom), and Wiwynn, participate in MLPerf to provide valuable insights for customers making informed buying decisions. All the software used by NVIDIA in these tests is available in the MLPerf repository. The optimizations developed by NVIDIA are continuously integrated into containers available on NGC, NVIDIA’s software hub for GPU applications, as well as NVIDIA AI Enterprise, a secure and supported platform that includes NIM inference microservices.

As the use cases, model sizes, and datasets for generative AI continue to expand, MLPerf evolves to incorporate real-world tests with popular models like Llama 2 70B and Stable Diffusion XL. NVIDIA is committed to keeping pace with the growing demands of large language models and recently announced that the upcoming NVIDIA Blackwell architecture GPUs will deliver unprecedented levels of performance required for multitrillion-parameter AI models.

Inference for large language models is a challenging task that requires expertise and a full-stack architecture like the one showcased by NVIDIA in MLPerf with Hopper architecture GPUs and TensorRT-LLM. With NVIDIA’s continuous advancements in AI technology, there is much more to come in this exciting field.

To learn more about the MLPerf benchmarks and the technical details of this round of inference testing, visit the MLPerf website.


About Our Team

Our team comprises industry insiders with extensive experience in computers, semiconductors, games, and consumer electronics. With decades of collective experience, we’re committed to delivering timely, accurate, and engaging news content to our readers.

Background Information


About ASUS: ASUS, founded in 1989 by Ted Hsu, M.T. Liao, Wayne Hsieh, and T.H. Tung, has become a multinational tech giant known for its diverse hardware products. Spanning laptops, motherboards, graphics cards, and more, ASUS has gained recognition for its innovation and commitment to high-performance computing solutions. The company has a significant presence in gaming technology, producing popular products that cater to enthusiasts and professionals alike. With a focus on delivering innovative and reliable technology, ASUS maintains its position as an important player in the industry.

ASUS website  ASUS LinkedIn

About Broadcom: Founded in 1961, Broadcom is a leading global technology company headquartered in the United States. They specialize in semiconductor and infrastructure software solutions. Broadcom's innovations in connectivity, networking, and storage technologies have made them a key player in the industry, with a focus on enabling seamless communication and connectivity in the digital world.

Broadcom website  Broadcom LinkedIn

About Dell: Dell is a global technology leader providing comprehensive solutions in hardware, software, and services. Known for its customizable computers and enterprise solutions, Dell offers a diverse range of laptops, desktops, servers, and networking equipment. With a commitment to innovation and customer satisfaction, Dell caters to a wide range of consumer and business needs, making it an important player in the tech industry.

Dell website  Dell LinkedIn

About Fujitsu: Fujitsu is a prominent Japanese technology company known for its wide array of computing solutions. With a history dating back to 1935, Fujitsu excels in producing personal computers, laptops, and tablets that combine innovation and reliability. In addition to consumer-focused products, Fujitsu is a key player in enterprise solutions, offering servers, storage systems, and data center services. The company's emphasis on quality, advanced features, and IT services has solidified its position as a significant player in the global computing industry.

Fujitsu website

About Gigabyte: Gigabyte Technology, an important player in the computer hardware industry, has established itself as a leading provider of innovative solutions and products catering to the ever-evolving needs of modern computing. With a strong emphasis on quality, performance, and technology, Gigabyte has gained recognition for its wide array of computer products. These encompass motherboards, graphics cards, laptops, desktop PCs, monitors, and other components that are integral to building high-performance systems. Known for their reliability and advanced features, Gigabyte's motherboards and graphics cards have become staples in the gaming and enthusiast communities, delivering the power and capabilities required for immersive gaming experiences and resource-intensive applications.

Gigabyte website  Gigabyte LinkedIn

About Google: Google, founded by Larry Page and Sergey Brin in 1998, is a multinational technology company known for its internet-related services and products. Initially known for its search engine, Google has since expanded into various domains including online advertising, cloud computing, software development, and hardware devices. With its innovative approach, Google has introduced influential products such as Google Search, Android OS, Google Maps, and Google Drive. The company's commitment to research and development has led to advancements in artificial intelligence and machine learning.

Google website  Google LinkedIn

About Lenovo: Lenovo, formerly known as "Legend Holdings," is a prominent global technology company that offers an extensive portfolio of computers, smartphones, servers, and electronic devices. Notably, Lenovo acquired IBM's personal computer division, including the ThinkPad line of laptops, in 2005. With a strong presence in laptops and PCs, Lenovo's products cater to a wide range of consumer and business needs. Committed to innovation and quality, Lenovo delivers reliable and high-performance solutions, making it a significant player in the tech industry.

Lenovo website  Lenovo LinkedIn

About Microsoft: Microsoft, founded by Bill Gates and Paul Allen in 1975 in Redmond, Washington, USA, is a technology giant known for its wide range of software products, including the Windows operating system, Office productivity suite, and cloud services like Azure. Microsoft also manufactures hardware, such as the Surface line of laptops and tablets, Xbox gaming consoles, and accessories.

Microsoft website  Microsoft LinkedIn

About NVIDIA: NVIDIA has firmly established itself as a leader in the realm of client computing, continuously pushing the boundaries of innovation in graphics and AI technologies. With a deep commitment to enhancing user experiences, NVIDIA's client computing business focuses on delivering solutions that power everything from gaming and creative workloads to enterprise applications. Known for its GeForce graphics cards, the company has redefined high-performance gaming, setting industry standards for realistic visuals, fluid frame rates, and immersive experiences. Complementing its gaming expertise, NVIDIA's Quadro and NVIDIA RTX graphics cards cater to professionals in design, content creation, and scientific fields, enabling real-time ray tracing and AI-driven workflows that elevate productivity and creativity to unprecedented heights. By seamlessly integrating graphics, AI, and software, NVIDIA continues to shape the landscape of client computing, fostering innovation and immersive interactions in a rapidly evolving digital world.

NVIDIA website  NVIDIA LinkedIn

About Oracle: Oracle Corporation is an American multinational technology company founded in 1977 and headquartered in Redwood City, California. It's one of the world's largest software and cloud computing companies, known for its enterprise software products and services. Oracle specializes in developing and providing database management systems, cloud solutions, software applications, and hardware infrastructure. Their flagship product, the Oracle Database, is widely used in businesses and organizations worldwide. Oracle also offers a range of cloud services, including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).

Oracle website  Oracle LinkedIn

About Supermicro: Supermicro is a reputable American technology company founded in 1993 and headquartered in San Jose, California. Specializing in high-performance server and storage solutions, Supermicro has become a trusted name in the data center industry. The company offers a wide range of innovative and customizable server hardware, including motherboards, servers, storage systems, and networking equipment, catering to the needs of enterprise clients, cloud service providers, and businesses seeking reliable infrastructure solutions.

Supermicro website  Supermicro LinkedIn

Technology Explained


Blackwell: Blackwell is an AI computing architecture designed to supercharge tasks like training large language models. These powerful GPUs boast features like a next-generation Transformer Engine and support for lower-precision calculations, enabling them to handle complex AI workloads significantly faster and more efficiently than before. While aimed at data centers, the innovations within Blackwell are expected to influence consumer graphics cards as well.


CPU: The Central Processing Unit (CPU) is the brain of a computer, responsible for executing instructions and performing calculations. As the component that controls all the others, it processes data, directs the flow of information within the system, manages input and output, and stores and retrieves data from memory. CPUs power everything from desktop computers and mobile devices to gaming consoles and supercomputers, making them essential to the functioning of any computer system.


GPU: GPU stands for Graphics Processing Unit, a specialized type of processor designed to handle graphics-intensive tasks such as rendering images, videos, and 3D graphics. GPUs power gaming consoles, PCs, and mobile devices, delivering smooth and immersive gaming experiences. They are also used in the medical field to create 3D models of organs and tissues, in the automotive industry to build virtual prototypes of cars, and in artificial intelligence to process large amounts of data and train complex models. Their ability to process large volumes of data quickly and efficiently makes GPUs increasingly important across the computer industry.


HBM3E: HBM3E is the latest generation of high-bandwidth memory (HBM), a type of stacked DRAM designed for demanding workloads such as artificial intelligence (AI). HBM3E offers faster data transfer rates, higher density, and lower power consumption than previous HBM versions. It is produced by memory makers including SK Hynix, Micron, and Samsung, with mass production beginning in 2024, and delivers more than 1 TB/s of bandwidth per stack. HBM3E is suited to AI systems that process large amounts of data, such as deep learning, machine learning, and computer vision.


LLM: A Large Language Model (LLM) is a highly advanced artificial intelligence system, often based on complex architectures like GPT-3.5, designed to comprehend and produce human-like text on a massive scale. LLMs possess exceptional capabilities in various natural language understanding and generation tasks, including answering questions, generating creative content, and delivering context-aware responses to textual inputs. These models undergo extensive training on vast datasets to grasp the nuances of language, making them invaluable tools for applications like chatbots, content generation, and language translation.


Stable Diffusion: Stable Diffusion is a deep-learning, text-to-image model that generates detailed images from natural-language prompts. It is a latent diffusion model: it learns to progressively remove noise from a compressed representation of an image, guided by the text prompt, until a coherent picture emerges. Released openly in 2022 by Stability AI together with academic and community collaborators, it can also perform related tasks such as inpainting, outpainting, and image-to-image translation. Stable Diffusion XL (SDXL), used in the MLPerf tests described above, is a larger version of the model that produces higher-resolution, more detailed images.


VMware: VMware is an industry leader in virtualization technology, allowing for the effective virtualization of computer hardware, networks, and operating systems. Any organization needing to quickly and efficiently deploy complex network environments can utilize VMware, allowing for greater flexibility than physical solutions. VMware products are especially popular in the enterprise segment, as they offer cost savings through increased worker productivity and improved resource utilization. Virtualized services hosted on VMware offer better scalability and reliability, as well as improved fault tolerance and cost savings associated with server consolidation. VMware also provides support for the latest hardware and software components, ensuring compatibility with a wide range of applications and services. Furthermore, VMware products provide a secure computing environment, as virtual machines are isolated from each other, preventing the spread of viruses and other threats from one virtual machine to another.




