In the early morning of December 7, alongside the release of its multimodal large model Gemini, Google also launched a new cloud AI accelerator, the Cloud TPU v5p, its most powerful and cost-effective TPU (Tensor Processing Unit) to date.
According to reports, each TPU v5p pod consists of up to 8,960 chips, interconnected using the highest-bandwidth chip-to-chip links to date (4,800 Gbps per chip) to ensure fast transfer speeds and optimal performance.
In terms of AI performance, the TPU v5p delivers 459 TFLOPS (459 trillion floating-point operations per second) of bfloat16 (16-bit floating-point) performance, or 918 TOPS (918 trillion integer operations per second) of INT8 (8-bit integer) performance, supports 95 GB of high-bandwidth memory, and provides 2,765 GB/s of memory bandwidth for data transfer.
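For a rough sense of pod-level scale, the back-of-envelope calculation below simply multiplies the per-chip figures above by the 8,960-chip pod size. These are theoretical peaks, not measured or officially published pod-level numbers:

```python
# Back-of-envelope pod-level peaks from the per-chip figures quoted above.
# Theoretical upper bounds only; sustained performance depends on the model,
# parallelism strategy, and interconnect utilization.
CHIPS_PER_POD = 8960          # chips in one TPU v5p pod
BF16_TFLOPS_PER_CHIP = 459    # peak bfloat16 per chip
INT8_TOPS_PER_CHIP = 918      # peak INT8 per chip
HBM_GB_PER_CHIP = 95          # high-bandwidth memory per chip

pod_bf16_exaflops = CHIPS_PER_POD * BF16_TFLOPS_PER_CHIP / 1e6  # TFLOPS -> EFLOPS
pod_int8_exaops = CHIPS_PER_POD * INT8_TOPS_PER_CHIP / 1e6      # TOPS -> EOPS
pod_hbm_tb = CHIPS_PER_POD * HBM_GB_PER_CHIP / 1e3              # GB -> TB

print(f"Peak bf16 per pod: ~{pod_bf16_exaflops:.1f} EFLOPS")    # ~4.1
print(f"Peak INT8 per pod: ~{pod_int8_exaops:.1f} EOPS")        # ~8.2
print(f"Total HBM per pod: ~{pod_hbm_tb:.0f} TB")               # ~851
```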
Compared with the TPU v4, the newly released TPU v5p offers twice the FLOPS (floating-point operations per second) and three times the high-bandwidth memory, a remarkable generational leap for the field of artificial intelligence.
In addition, in terms of model training, the TPU v5p shows a 2.8x generational improvement in LLM (large language model) training speed, and even compared with the TPU v5e there is a 50% improvement. Google is also squeezing out more computing power, as the TPU v5p is "4x more scalable than TPU v4 in terms of total available FLOPs per pod."
To sum up, TPU v5p vs. TPU v4:
2x increase in floating-point performance (459 TFLOPS bf16 / 918 TOPS int8).
Up to 3x more memory capacity (95 GB HBM).
LLM training speed increased by 2.8x.
Embedding-dense model training is 1.9x faster.
Memory bandwidth increased 2.25x (2,765 GB/s vs. 1,228 GB/s).
Twice the chip-to-chip interconnect bandwidth (4,800 Gbps vs. 2,400 Gbps).
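These ratios can be sanity-checked with simple arithmetic. The sketch below uses only the numbers quoted in this article, plus two widely cited TPU v4 figures that are assumptions here rather than article data (roughly 275 TFLOPS peak bf16 per chip and 4,096 chips per v4 pod):

```python
V5P_HBM_GBPS, V4_HBM_GBPS = 2765, 1228   # GB/s, both quoted above
V5P_ICI_GBPS, V4_ICI_GBPS = 4800, 2400   # Gbps, both quoted above
V5P_CHIPS, V5P_BF16_TFLOPS = 8960, 459   # quoted above
V4_CHIPS, V4_BF16_TFLOPS = 4096, 275     # assumed TPU v4 figures, not from the article

print(f"HBM bandwidth: {V5P_HBM_GBPS / V4_HBM_GBPS:.2f}x")              # ~2.25x
print(f"Chip-to-chip interconnect: {V5P_ICI_GBPS / V4_ICI_GBPS:.1f}x")  # 2.0x

# The "4x more scalable ... total available FLOPs per pod" claim:
pod_ratio = (V5P_CHIPS * V5P_BF16_TFLOPS) / (V4_CHIPS * V4_BF16_TFLOPS)
print(f"Peak bf16 FLOPs per pod: ~{pod_ratio:.1f}x")                    # ~3.7x, i.e. roughly 4x
```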
Google has long recognized that success requires having both the best hardware and the best software, which is why the company has built the AI Hypercomputer, a set of elements designed to work together on modern AI workloads. Google has integrated performance-optimized compute, optimal storage, and liquid cooling to make the most of this enormous hardware, and the resulting performance is indeed industry-leading.
On the software side, Google has stepped up its use of open software to tune its AI workloads to ensure optimal performance of its hardware.
Here's a rundown of the new software resources for AI Hypercomputer:
Extensive out-of-the-box support for popular ML frameworks such as JAX, TensorFlow, and PyTorch. Both JAX and PyTorch are powered by the OpenXLA compiler for building complex LLMs. XLA acts as the foundational backbone, enabling the creation of complex, multi-layered models (for example, Llama 2 training and inference on Cloud TPUs using PyTorch/XLA). It optimizes distributed architectures across a variety of hardware platforms, ensuring easy-to-use and efficient model development for different AI use cases (for example, AssemblyAI leverages JAX/XLA and Cloud TPUs for AI speech at scale).
Open and unique Multislice training and multi-host inference software make scaling, training, and serving workloads smooth and simple, respectively; developers can scale to tens of thousands of chips to support demanding AI workloads (see the sketch after this list).
Deep integration with Google Kubernetes Engine (GKE) and Google Compute Engine provides efficient resource management, a consistent operating environment, autoscaling, automatic node-pool provisioning, auto-checkpointing, auto-recovery, and timely failure handling.
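To make the software story concrete, here is a minimal, illustrative JAX sketch of the pattern these tools build on: a step function is written once, XLA (via jax.jit) compiles it, and a device mesh shards it across whatever accelerators are attached. The names, shapes, and toy "model" are invented for this example, and it is not Google's Multislice or multi-host inference stack; real multi-host runs would also need jax.distributed initialization:

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh over every accelerator visible to this process
# (TPU chips on Cloud TPU, or CPU devices when run locally).
devices = mesh_utils.create_device_mesh((jax.device_count(),))
mesh = Mesh(devices, axis_names=("data",))

@jax.jit  # XLA compiles and fuses the whole step into one program
def forward(w, x):
    return jnp.tanh(x @ w)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (1024, 512), dtype=jnp.bfloat16)  # toy batch
w = jax.random.normal(key, (512, 256), dtype=jnp.bfloat16)   # toy weights

# Data parallelism: split the batch across chips, replicate the weights.
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w = jax.device_put(w, NamedSharding(mesh, P(None, None)))

y = forward(w, x)  # executes as a single sharded XLA computation
print(y.shape, y.dtype, y.sharding)
```

On a Cloud TPU VM the same script simply sees more devices in the mesh; nothing in the model code changes, which is the point of the XLA-plus-sharding approach.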
Google's revolutionary approach to AI is evident in these new hardware and software elements, which aim to break down the barriers that have limited the industry. It will be interesting to see how the new Cloud TPU v5p processing units and the AI Hypercomputer help drive ongoing AI development, but one thing is for sure: they will certainly intensify the competition.
Editor: Xinzhixun-Rogue Sword.