Still far ahead: Amazon Web Services has built a solid foundation for IaaS

Mondo Technology Updated on 2024-01-29

Amazon Web Services has long been the bellwether of the cloud computing industry, and its annual re:Invent conference draws the attention of the whole industry. Not long ago, re:Invent 2023 was held. The event not only stayed true to Amazon Web Services' "customer first" philosophy, but also introduced a number of new IaaS offerings, iterating further on performance, cost, and security. Let's take a look at the highlights of this conference.

At every re:Invent conference, the most important news is the self-developed silicon. Since its debut in 2018, the Graviton series has grown to support more than 150 instance types, with over 2 million chips deployed in the cloud and more than 50,000 customers, and it has been adopted by top-100 customers. SAP, for example, is a major Graviton customer.

The combination of 96 Neoverse V2 cores, 2 MB of L2 cache per core, and 12 DDR5-5600 channels gives Graviton4 40 percent faster database processing, 30 percent faster web applications, and 45 percent faster large Java applications compared with Graviton3.
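As a quick sanity check on what those memory channels imply, assuming the standard 8-byte (64-bit) DDR5 channel width:

12 channels × 5600 MT/s × 8 B/transfer ≈ 537.6 GB/s of peak theoretical memory bandwidth.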

This launch deliberately emphasized the database and Java scenarios. From our earlier evaluations, these are indeed the key scenarios for ARM: compared with Graviton's other strong scenarios, its performance in these two had not been outstanding enough, which is why they received extra emphasis this time.

The main parameters of several generations of ARM products are as follows.

EC2 R8g, a compute product based on the ARM chip, supports 96 cores per CPU and 192 cores per machine (dual-socket).
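If you want to inspect these shapes yourself, here is a minimal boto3 sketch; the region is an assumption, and R8g availability depends on when and where the family reaches general availability:

```python
# List vCPU/memory shapes for the Graviton4-based R8g family via the EC2 API.
# A sketch: region and family availability are assumptions, not guarantees.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
pages = ec2.get_paginator("describe_instance_types").paginate(
    Filters=[{"Name": "instance-type", "Values": ["r8g.*"]}]
)
for page in pages:
    for it in sorted(page["InstanceTypes"], key=lambda x: x["VCpuInfo"]["DefaultVCpus"]):
        print(
            f"{it['InstanceType']:>16}: "
            f"{it['VCpuInfo']['DefaultVCpus']:>3} vCPU, "
            f"{it['MemoryInfo']['SizeInMiB'] // 1024} GiB"
        )
```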

Graviton supports a large number of cloud products, including databases, big data, containers, and FaaS.

The new Graviton4 has 96 cores, using Arm's "Demeter" Neoverse V2 core based on the Armv9 architecture, a 50% increase in core count over the 64-core Graviton3. This generation adopts a 7-die design, with the 12 DDR5 controllers distributed across 4 dies, 2 PCIe Gen5 dies, and CCIX dies for the NUMA interconnect.

The NUMA interconnect architecture exceeded expectations. CCIX implementations under the ARM architecture have not been particularly mature, with very high latency, but this generation of Graviton has actually crossed that hurdle.

Why did the ARM server architecture shift from the single-socket designs of the past to a NUMA design?

The positioning is presumably inseparable from large databases: SAP HANA and Aurora Limitless. Of course, this places high demands on CCIX interconnect latency, and excellent performance is expected.

Regarding the performance methodology: the first chart is a traditional benchmark, presumably SPECint 2017, a standardized and easy-to-run suite whose programs are relatively small and hard-pressed to reflect real business performance. The Graviton CPU design team instead uses "real workload" benchmarks to steer CPU design.

MySQL benchmarks show a 40% advantage over R7g, and more than 45% when testing Groovy Grails applications on 8 vCPUs.

According to the published statistics, Amazon Graviton now powers more than 150 Amazon EC2 instance types, more than 2 million Graviton processors have been built, and there are more than 50,000 customers, including Datadog, DIRECTV, Discovery, Formula 1 (F1), NextRoll, Nielsen, Pinterest, SAP, Snowflake, Sprinklr, Stripe, and Zendesk. SAP, for example, cut costs by 35%, sped up analytics, and reduced carbon emissions by 45% after adopting Amazon Graviton services.

This launch compared only against the previous ARM generation; no head-to-head x86 data was shown, so you can extrapolate from Graviton3's published numbers.

When Graviton3 was released in 2021, its performance improvement over Graviton2 can also serve as a reference for the generation-2-to-3 step.

Graviton3 adopts the Arm Neoverse V1 architecture, while Graviton4 is based on the next-generation V2 architecture. The key upgrade:

L2 cache: 1 MB → 2 MB per core
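On a running Linux guest, the per-core L2 size is easy to verify from sysfs; a minimal sketch, assuming the kernel exposes the standard cache hierarchy:

```python
# Read the per-core L2 cache size on a Linux guest (e.g., a Graviton instance).
# Assumes the standard sysfs cache hierarchy is exposed by the kernel.
from pathlib import Path

def cache_size(cpu: int = 0, level: int = 2) -> str:
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache")
    for idx in base.glob("index*"):
        if (idx / "level").read_text().strip() == str(level):
            return (idx / "size").read_text().strip()  # e.g., "2048K"
    return "unknown"

print("L2 per core:", cache_size())
```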

Arm has unveiled the V2 architecture; its main performance parameters are as follows.

The V2 architecture brings optimizations to both the front end and the back end of the core.

ARM's physical cores have natural advantages; what takes long accumulation is the server-side algorithms for high-performance loads, such as instruction prefetch, out-of-order execution, and cache-prefetch algorithms in random-access data scenarios. Compared with V1, V2 brings a 13% SIR improvement and a 10% reduction in SLC misses, which suggests memory access had been consuming a lot of performance. The most significant gains come from MOP fetch, hardware prefetch, and branch prediction/fetch/icache.

On x86, two products were released:

The first is the M7i, based on Intel Sapphire Rapids (SPR):

96 vCPUs per socket and 192 vCPUs in a dual-socket machine, with an integrated AI accelerator.

Up to 3.2 GHz 4th Gen Intel Xeon Scalable processors (Sapphire Rapids 8488C).

The new Advanced Matrix Extensions (AMX) accelerate matrix multiplication.

The latest DDR5 memory, which offers more bandwidth than DDR4.

M7i-Flex architecture: this generation has 1.5 times the cores of the previous generation but the same total IO capacity, so Flex can be priced lower and the savings passed on.

The maximum is 32 vCPUs, with IO of 12.5 Gbps network and 10 Gbps EBS.

Price/performance improves by 19% over the M6i, the price is 5% lower, and CPU performance is 15% higher.

M7i Product Specifications:

The M7i-Flex goes up to 32 cores; IO can be shared, with only a maximum of 12.5 Gbps network and 10 Gbps EBS guaranteed.

Its selling point is cost-effectiveness.
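The baseline differences between the two families are visible straight from the EC2 API; a sketch, with the sizes and region chosen only for illustration:

```python
# Compare the advertised vCPU, network, and EBS baselines of M7i vs M7i-Flex.
# A sketch; the chosen sizes and region are assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instance_types(InstanceTypes=["m7i.8xlarge", "m7i-flex.8xlarge"])
for it in resp["InstanceTypes"]:
    ebs = it["EbsInfo"].get("EbsOptimizedInfo", {})
    print(
        it["InstanceType"],
        it["VCpuInfo"]["DefaultVCpus"], "vCPU |",
        it["NetworkInfo"]["NetworkPerformance"], "|",
        ebs.get("BaselineBandwidthInMbps", "?"), "Mbps EBS baseline",
    )
```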

For large in-memory database scenarios such as SAP HANA, Oracle, or SQL Server, the U7i products were launched.

The U7i supports up to 896 vCPUs, the largest vCPU count in AWS Cloud. It offers up to 100 Gbps of Amazon Elastic Block Store (EBS) bandwidth, a 2.5x increase, enabling customers to load data into memory faster and improve backup speed. U7i instances support EBS io2 Block Express volumes to provide the best EBS performance on Amazon EC2, provide up to 100 Gbps of network bandwidth, and support ENA Express. U7i instances are ideal for customers with mission-critical in-memory databases, such as SAP HANA, Oracle, or SQL Server.

The second is the M7a, based on AMD Genoa.

Amazon EC2 M7a instances powered by AMD EPYC processors deliver up to 50% higher performance than M6a instances.

Key features: up to 3.7 GHz turbo 4th Gen AMD EPYC processors (AMD EPYC 9R14, "Genoa").

Up to 50 Gbps network bandwidth and 40 Gbps Amazon Elastic Block Store (Amazon EBS) bandwidth.

Up to 192 vCPUs and 768 GiB of memory per instance.

SAP-certified instances.

Supports always-on memory encryption with AMD Secure Memory Encryption (SME).

Support for new processor features such as AVX3-512, VNNI, and BFLOAT16.
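Whether these features are actually visible inside a guest can be checked from /proc/cpuinfo; a minimal sketch, assuming a Linux guest (the flag names follow kernel conventions):

```python
# Check whether the new instruction-set features are visible in the guest.
# Assumes a Linux guest; flag names follow /proc/cpuinfo conventions.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for feat in ("avx512f", "avx512_vnni", "avx512_bf16"):
    print(f"{feat}: {'yes' if feat in flags else 'no'}")
```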

A new question: the previous-generation Milan bare metal topped out at 192 vCPUs. Why hasn't that increased this time?

AMD Genoa natively offers 96 cores with 192 hardware threads. Why not launch a 384-vCPU product?

The answer is that AWS runs it with SMT off, exposing physical cores directly to users. This greatly alleviates the insufficient memory bandwidth and poor SMT scaling seen in the Milan era. At the same time, the M7a pursues per-vCPU bandwidth to ensure the best application performance and latency.
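SMT off is easy to confirm from inside an instance; a minimal sketch, assuming a Linux guest with the standard sysfs topology files:

```python
# Quick check that each vCPU maps to a physical core (SMT off), as on M7a.
# Assumes a Linux guest exposing the standard sysfs topology files.
from pathlib import Path

siblings = Path(
    "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list"
).read_text().strip()
# A single CPU id (e.g., "0") means one thread per core; "0,96" would mean SMT.
print("cpu0 thread siblings:", siblings)
```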

EBS and S3 are designed to evolve every year, increasing bandwidth and reducing latency.

First, bandwidth increases by roughly 30% per year. This year's platform uses a 100G network: 50G for the VPC, 40G for EBS, and the remainder for control traffic. As core density rises (192 → 256 → 384), the next generation is expected to move to 200G networks;

Second, storage latency is very important, and new acceleration products are released to users every year:

On EBS, EBS io2 Express cuts latency by 10x;

On the S3 object storage side, S3 Express One Zone cuts latency by 10x.

In the past we looked at these products from the user's point of view; this year we can finally look at the implementation architecture from the perspective of the EBS storage server.

The path is EC2 → Nitro → SRD → EBS server, with the storage servers using Graviton CPUs at scale. From experience, having ARM servers handle storage IO, compression, and checksumming plays to the strengths of physical cores.

EBS io2 Express, first released at last year's conference, gains more product specifications this year.

Compared with the previous io2, it provides 4x the bandwidth and capacity;

Compared with io1, it offers 10x lower latency and 100x higher reliability, cutting costs by 50% in high-IO-throughput scenarios.
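Provisioning such a volume is a single API call; a sketch, where the size and IOPS values are illustrative assumptions:

```python
# Create a provisioned-IOPS io2 volume (served via the Block Express
# architecture on supported instance families). Size/IOPS are illustrative.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
vol = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    VolumeType="io2",
    Size=500,      # GiB
    Iops=64000,    # provisioned IOPS; high values need Block Express support
)
print(vol["VolumeId"], vol["State"])
```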

io2 Express is built on the SRD protocol, which greatly increases bandwidth and reduces latency. In recent years, data centers have adopted DCTCP and RDMA protocols, greatly improving IaaS interconnect throughput.

A new product released this year has a somewhat long name: Amazon S3 Express One Zone. To address the latency problem described below, it acts as an intra-AZ cache accelerator for object storage; in practice it is easier to just call it S3 Express.

Typical S3 latency is 10-200 ms. For workloads such as ML, big data, and analytics, compute has to wait for data to be ready, and compute clusters burn money while they wait. That's where S3 Express comes in.

The following figure shows the architecture of the solution.

Here are a few key points:

The compute servers (EC2) sit in the same Availability Zone as S3 Express;

Latency improves by 10x: as seen above, a 100-millisecond delay wastes a great deal of compute time, and this can be optimized by an order of magnitude;

Presumably it is built on an SSD server cluster, so the price is expected to be about 10x that of HDD-backed S3.

Amazon S3 Express One Zone is a high-performance, single-zone Amazon S3 storage class designed to provide consistent, single-digit-millisecond data access for the most latency-sensitive applications. S3 Express One Zone is the lowest-latency cloud object storage class available today, with data access up to 10x faster and request costs 50% lower than S3 Standard, so applications benefit immediately from faster request completion. S3 Express One Zone provides performance elasticity similar to other S3 storage classes: as with the rest of Amazon S3, there is no need to plan or configure capacity or throughput requirements in advance, storage scales up or down as needed, and data is accessed through the Amazon S3 API. It is the first S3 storage class that lets you select a single Availability Zone and co-locate object storage with compute resources for the highest possible access speed. In addition, to further improve access speed and support hundreds of thousands of requests per second, data is stored in a new bucket type, the Amazon S3 directory bucket, each of which can support hundreds of thousands of transactions per second (TPS) regardless of key name or access pattern.
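A rough way to see the latency gap for yourself is to time GETs against both storage classes; a sketch, where the bucket and key names are placeholders (directory-bucket names must end with the "--<az-id>--x-s3" suffix, and a recent boto3 is assumed so that directory-bucket session auth is handled automatically):

```python
# Rough GET-latency comparison between a standard S3 bucket and an
# S3 Express One Zone directory bucket. Bucket/key names are placeholders.
import time
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

def avg_get_ms(bucket: str, key: str, n: int = 20) -> float:
    start = time.perf_counter()
    for _ in range(n):
        s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return (time.perf_counter() - start) / n * 1000.0

for bucket in ("my-standard-bucket", "my-cache--use1-az4--x-s3"):
    print(f"{bucket}: {avg_get_ms(bucket, 'sample.bin'):.1f} ms per GET")
```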

Confidential computing is very important for business, finance, and multi-party transactions, but adoption has been poor in recent years. Intel withdrew its SGX chip security solution, ARM has TrustZone, and AMD has its own distinct solution; for users, a unified scheme would be better.

Nitro Enclaves does this by keeping information such as security keys in a separate DPU space outside the user domain, avoiding the need to modify programs for different CPUs. It is compatible across vendors and generations; a blockchain case from the Bank of Brazil was also cited.

This year, the general-purpose compute network stays on the previous generation's 100G platform; network-enhanced instances reach 200G; and a single AI-network card reaches 400 Gbps. For the Nitro platform with ARM CPUs, it is relatively easy to use jumbo frames to double bandwidth in AI scenarios. Of course, for AI training, NVLink is still required for the 480 GB/s in-cabinet bus interconnect.

As more and more open-source software enters enterprises' production business, the future irreplaceability of cloud computing requires combining chips with software, and combining multiple product matrices, to create differentiated value for customers and deliver secure, high-performance, low-cost products and services.

From the ten-year cadence of product evolution, Amazon's cloud product strategy is clear:

Cost control: the self-developed Graviton chips cut power consumption by 60%, with the savings passed on to customers (historically priced about 20% lower).

Minimize prices for customers (M7i-Flex) while reducing its own costs (generations 6 and 7 share the 100G network platform).

Performance first: with AMD processor products, the M7a (Genoa) strategy unleashes physical compute (a 50% increase) while guaranteeing memory bandwidth (DDR5-4800, 50% more than the previous generation).

Graviton strategy: physical cores, large caches, and maximum memory bandwidth. At the same time, it chose the Arm V-series (V1, V2) architectures, accepting half the core density at roughly double the cost (compared with the N series) in exchange for the best per-core performance.

EBS Express and S3 Express deliver storage with higher bandwidth and lower latency.

Security first: from Nitro network encryption/decryption paid for in hardware overhead, to memory encryption that pays roughly a 10% latency overhead, it still provides customers with the most secure solution.

Launched Nitro Enclaves for confidential computing.

The facts show that Amazon Web Services offers users a rich set of cloud application choices, and that these choices are among the most advanced and high-end in the industry. This not only provides differentiated competitiveness, but also lets users adjust their business immediately to adapt to the future of digitalization.
