Cerebras Introduces New AI Inference System, Promising Enhanced Processing Speed


August 28, 2024 by our News Team

Cerebras Systems launches Cerebras Inference, offering lightning-fast AI processing speeds and a revolutionary pricing model, while maintaining state-of-the-art accuracy and fostering partnerships with industry players.

  • Unmatched speed: Cerebras Inference can process AI tasks at lightning speed, up to 20 times faster than traditional GPU setups.
  • Cost-effective pricing: With a pay-as-you-go model starting at just 10 cents per million tokens, Cerebras offers a 100 times better price-performance ratio compared to traditional GPU solutions.
  • State-of-the-art accuracy: Cerebras maintains high accuracy while operating in the 16-bit domain, eliminating the trade-off between speed and precision.


Today, Cerebras Systems is stirring up the AI landscape with the launch of its latest offering: Cerebras Inference. If you’re not already familiar with Cerebras, they’re the folks behind some seriously impressive AI computing hardware, and this announcement is no exception. Imagine processing AI tasks at lightning speed: 1,800 tokens per second for the Llama 3.1 8B model and 450 tokens per second for the Llama 3.1 70B model. For context, that’s about 20 times faster than what you’d get from typical Nvidia GPU setups in hyperscale clouds.

Now, let’s break that down a bit. When we talk about “tokens,” we’re referring to chunks of text that AI models process. Think of them as the building blocks of language. So, when Cerebras claims to deliver 1,800 tokens per second, it’s not just a flashy number—it means real-time applications can become a reality. Whether you’re developing chatbots, virtual assistants, or any application that relies on understanding and generating human language, speed is crucial.
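To make that concrete, here is a back-of-the-envelope sketch in plain Python. The decode rates are the ones Cerebras quotes above; the tokens-per-word ratio is a common rule of thumb for English text, not a Cerebras figure:

```python
# Back-of-the-envelope: how long a chat reply takes at a given decode rate.
TOKENS_PER_WORD = 1.3  # rough rule of thumb for English (assumption, not a Cerebras figure)

def reply_latency_seconds(words: int, tokens_per_second: float) -> float:
    """Estimate the time to generate a reply of `words` words at a given decode rate."""
    tokens = words * TOKENS_PER_WORD
    return tokens / tokens_per_second

# Cerebras' quoted decode rates for Llama 3.1 8B and 70B:
for rate in (1_800, 450):
    print(f"{rate:>5} tok/s -> 300-word reply in ~{reply_latency_seconds(300, rate):.2f} s")
```

At 1,800 tokens per second, a 300-word reply streams out in roughly a fifth of a second, which is what makes genuinely real-time interaction plausible.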

Cerebras is also shaking up the pricing game. Their pay-as-you-go model starts at just 10 cents per million tokens for the smaller Llama 3.1 8B model, and 60 cents for the larger 70B version. In a world where costs can spiral out of control, this pricing structure offers a refreshing alternative, boasting a staggering 100 times better price-performance ratio compared to traditional GPU solutions. It’s a bit like finding a gourmet restaurant that also happens to have a happy hour.
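As a rough illustration of what that pricing means in practice, here is a minimal cost sketch. The per-million-token rates are the ones quoted above; the model identifiers and the daily token volume are illustrative assumptions, not figures from Cerebras:

```python
# Cost sketch: pay-as-you-go pricing at the quoted per-million-token rates.
PRICE_PER_MILLION = {
    "llama3.1-8b": 0.10,   # $0.10 per million tokens (quoted rate)
    "llama3.1-70b": 0.60,  # $0.60 per million tokens (quoted rate)
}

def monthly_cost(model: str, tokens_per_day: int, days: int = 30) -> float:
    """Dollar cost for a steady daily token volume over `days` days."""
    total_tokens = tokens_per_day * days
    return total_tokens / 1_000_000 * PRICE_PER_MILLION[model]

# Hypothetical chatbot pushing 5 million tokens a day:
print(f"8B:  ${monthly_cost('llama3.1-8b', 5_000_000):.2f}/month")
print(f"70B: ${monthly_cost('llama3.1-70b', 5_000_000):.2f}/month")
```

Even a fairly busy hypothetical chatbot lands at $15 a month on the 8B model and $90 on the 70B, which is where the price-performance claim gets its teeth.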

But what about accuracy? In the tech world, there’s often a trade-off between speed and precision. Cerebras claims to sidestep this dilemma by maintaining state-of-the-art accuracy throughout the inference process, all while operating in the 16-bit domain. This is significant because it means you don’t have to sacrifice quality for speed. Micah Hill-Smith, the CEO of Artificial Analysis, shared that Cerebras has set new benchmarks in AI inference, achieving speeds previously thought impossible while still delivering results that match the official Meta versions of Llama 3.1.
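Some fast inference services get their speed by quantizing weights below 16 bits, which is exactly the trade-off Cerebras says it avoids. This toy NumPy comparison (purely illustrative, not Cerebras code) shows why staying in the 16-bit domain preserves accuracy: casting to 16-bit floats introduces far less error than a naive 8-bit quantization of the same values:

```python
import numpy as np

# A small batch of "weights" drawn like typical trained-model values.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=100_000).astype(np.float32)

# 16-bit float: a simple cast, keeping roughly 3 decimal digits of precision.
w_fp16 = w.astype(np.float16).astype(np.float32)

# Naive 8-bit quantization: map the weight range onto 256 integer levels.
scale = np.abs(w).max() / 127
w_int8 = np.round(w / scale).astype(np.int8).astype(np.float32) * scale

print("fp16 mean abs error:", np.abs(w - w_fp16).mean())
print("int8 mean abs error:", np.abs(w - w_int8).mean())
```

On values of this scale, the 8-bit version's average error comes out more than an order of magnitude larger than the 16-bit cast, and small per-weight errors compound across billions of weights.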

Now, you might wonder why this matters. Well, inference is the fastest-growing segment of AI computing, making up about 40% of the total AI hardware market. The emergence of high-speed inference is akin to the arrival of broadband internet—suddenly, everything changes. Developers can build sophisticated applications that require real-time responses, like AI agents that can handle complex tasks without missing a beat.

Dr. Andrew Ng from DeepLearning.AI emphasized this potential, stating that Cerebras’ rapid inference capabilities are a game-changer for workflows that involve repeatedly prompting AI models. It’s like having a supercharged assistant who can keep up with your every request without breaking a sweat.

And the excitement doesn’t stop there. Industry leaders are buzzing about the implications of this technology. Kim Branson from GlaxoSmithKline noted that “speed and scale change everything,” while Russell D’sa from LiveKit highlighted how Cerebras’ capabilities could empower developers to create more human-like AI experiences, especially in voice and video applications.

Cerebras is also making its inference service accessible through three pricing tiers: Free, Developer, and Enterprise. The Free Tier offers generous usage limits, making it easy for anyone to dip their toes into AI development. The Developer Tier is designed for flexible deployments, and the Enterprise Tier caters to businesses needing dedicated support and custom solutions.

What’s particularly interesting is how Cerebras is fostering partnerships with other industry players, from Docker for consistent application deployment to Weights & Biases for operational efficiency. This collaborative spirit is essential in an ecosystem that thrives on innovation.

At the heart of Cerebras Inference is the CS-3 system, powered by the Wafer Scale Engine 3 (WSE-3). This isn’t just another chip; it’s a massive piece of technology that offers unparalleled memory bandwidth—7,000 times more than the Nvidia H100. This solves a critical challenge in generative AI: how to handle large amounts of data quickly and efficiently.
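Why does memory bandwidth matter so much? Token-by-token generation is typically memory-bound: producing each new token means streaming the model's weights through the compute units once, so bandwidth sets a hard ceiling on decode speed. A rough upper-bound estimate makes the point (the H100 figure is its published HBM bandwidth of roughly 3.35 TB/s; the rest is a simplified model that ignores batching, parallelism, and the KV cache):

```python
# Rough ceiling: a memory-bound model's decode rate is at most
# (memory bandwidth) / (bytes read per token) ~= bandwidth / model size.

def max_tokens_per_second(bandwidth_bytes_per_s: float,
                          params: float, bytes_per_param: int = 2) -> float:
    """Ceiling on single-stream decode rate if each token streams all weights once."""
    return bandwidth_bytes_per_s / (params * bytes_per_param)

H100_BW = 3.35e12   # ~3.35 TB/s HBM bandwidth (published figure)
LLAMA_70B = 70e9    # parameters, 2 bytes each in 16-bit

print(f"Single-H100 ceiling for 70B at 16-bit: "
      f"~{max_tokens_per_second(H100_BW, LLAMA_70B):.0f} tok/s")
```

By this simplified arithmetic, a single H100 tops out around 24 tokens per second per stream on a 70B model, which is why an architecture with orders of magnitude more on-chip bandwidth can post numbers like 450 tokens per second.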

For developers looking to integrate this new service, the transition is straightforward. The Cerebras Inference API is fully compatible with the OpenAI Chat Completions API, meaning you can migrate with just a few lines of code. It’s almost like switching from one streaming service to another; the content remains familiar, but the experience is faster and smoother.
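In practice, that migration can look something like the sketch below, which points the standard openai Python client at a different base URL. The endpoint URL and model name are assumptions based on Cerebras' launch documentation, so check the current docs before relying on them:

```python
from openai import OpenAI

# Point the standard OpenAI client at Cerebras' OpenAI-compatible endpoint.
# Base URL and model id are assumptions -- verify against Cerebras' docs.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key="YOUR_CEREBRAS_API_KEY",
)

response = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "In one sentence, what is inference?"}],
)
print(response.choices[0].message.content)
```

Because the request and response shapes match the Chat Completions API, existing streaming, retry, and prompt-handling code should carry over unchanged.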

In a world where everyone is clamoring for faster, better, and cheaper solutions, Cerebras Inference seems poised to redefine what’s possible in AI development. As we stand on the brink of this new era, one thing is clear: speed matters, and Cerebras is leading the charge.



Background Information


About NVIDIA:

NVIDIA has firmly established itself as a leader in the realm of client computing, continuously pushing the boundaries of innovation in graphics and AI technologies. With a deep commitment to enhancing user experiences, NVIDIA's client computing business focuses on delivering solutions that power everything from gaming and creative workloads to enterprise applications. Known for its GeForce graphics cards, the company has redefined high-performance gaming, setting industry standards for realistic visuals, fluid frame rates, and immersive experiences. Complementing its gaming expertise, NVIDIA's Quadro and NVIDIA RTX graphics cards cater to professionals in design, content creation, and scientific fields, enabling real-time ray tracing and AI-driven workflows that elevate productivity and creativity to unprecedented heights. By seamlessly integrating graphics, AI, and software, NVIDIA continues to shape the landscape of client computing, fostering innovation and immersive interactions in a rapidly evolving digital world.


Technology Explained


GPU: GPU stands for Graphics Processing Unit, a specialized processor designed to handle graphics-intensive tasks. GPUs render images, video, and 3D graphics in gaming consoles, PCs, and mobile devices, delivering smooth and immersive gaming experiences. They are also used in the medical field to create 3D models of organs and tissues, in the automotive industry to build virtual prototypes of cars, and in artificial intelligence to process large amounts of data and train complex models. Because they can process large volumes of data quickly and in parallel, GPUs have become increasingly important across the computer industry.




