Edited by alan
In the new year, PyTorch also ushered in a major update!
Following the release of version 2.1, 521 developers around the world contributed 3,628 commits to the latest release, PyTorch 2.2.
The new version integrates FlashAttention-2, which provides a performance improvement of about 2x compared to the previous version.
PyTorch 2.2 also introduces AOTInductor, a new ahead-of-time extension of TorchInductor designed to compile and deploy PyTorch programs for non-Python server-side environments.
torch.distributed in PyTorch 2.2 supports a new abstraction called device_mesh for initializing and representing ProcessGroups.
PyTorch 2.2 also provides a standardized, configurable logging mechanism: TORCH_LOGS.
PyTorch 2.2 also brings a number of improvements to torch.compile, including improved support for compiling optimizers, as well as TorchInductor fusion and layout optimizations.
Finally, it's worth noting that PyTorch will deprecate macOS x86 support; PyTorch 2.2.x is the last release series to support macOS x64.
PyTorch 2.2 new features
First of all, note that building PyTorch 2.2 from source requires GCC 9.4 or later, as the PyTorch codebase has migrated from C++14 to C++17.
FlashAttention-2 addresses low GPU occupancy and unnecessary shared-memory reads and writes by optimizing the work partitioning between different thread blocks and warps on the GPU.
FlashAttention-2 tweaks the algorithm to reduce the amount of non-matmul computation, while improving the parallelism of the attention computation (even a single head can be spread across different thread blocks to increase occupancy), and it optimizes the work distribution between warps within each thread block to reduce communication through shared memory.
PyTorch 2.2 updates the FlashAttention kernel to v2. Note, however, that the previous FlashAttention kernel had a Windows implementation, and Windows users could force the use of the sdp_kernel context manager with only FlashAttention enabled.
In 2.2, if the sdp_kernel context manager must be used on Windows, use the memory-efficient or math kernel instead.
With FlashAttention-2 on board, torch.nn.functional.scaled_dot_product_attention is about 2x faster, reaching 50%-73% of theoretical peak FLOPs/s on A100 GPUs.
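As a quick illustration, here is a minimal sketch of calling scaled_dot_product_attention and of restricting the backend with the sdp_kernel context manager mentioned above; the shapes, dtype, and device here are made-up example values, not anything prescribed by the release notes.

```python
import torch
import torch.nn.functional as F

# Example query/key/value tensors: (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# By default PyTorch picks the fastest available backend
# (FlashAttention-2 on supported GPUs in 2.2).
out = F.scaled_dot_product_attention(q, k, v)

# Explicitly restrict the backend choice, e.g. memory-efficient attention only,
# which is the fallback Windows users are pointed to in 2.2.
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=False, enable_mem_efficient=True
):
    out_me = F.scaled_dot_product_attention(q, k, v)
```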
AOTInductor is an extension of TorchInductor for processing exported PyTorch models, optimizing them, and generating shared libraries as well as other related artifacts.
These compiled artifacts can be deployed in a non-Python environment and are often used for server-side inference.
The following example shows how to AOT-compile a model into a shared library.
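This is a minimal sketch along the lines of the AOTInductor tutorial; torch._export.aot_compile is still a private API in 2.2 and its exact signature may change, and the toy Net module is just a placeholder.

```python
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(64, 10)

    def forward(self, x):
        return self.fc(x)

model = Net().eval()
example_inputs = (torch.randn(8, 64),)

# Export and compile the model ahead of time into a shared library (.so)
# that can later be loaded and run from a non-Python (e.g. C++) server.
with torch.no_grad():
    so_path = torch._export.aot_compile(model, example_inputs)

print(so_path)  # path of the generated shared library
```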
AOTInductor supports the same backends as Inductor, including CUDA, ROCm, and CPU.
PyTorch 2.2 provides a standardized, configurable logging mechanism that can be used to analyze the state of various subsystems, such as compilation and distributed operations.
Logging can be enabled via the TORCH_LOGS environment variable, for example by setting it on the command line.
This lets you, say, set TorchDynamo's log level to logging.ERROR while setting TorchInductor's log level to logging.DEBUG.
Of course, the same configuration can also be done in code through an API:
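A minimal sketch of both forms is shown below; the TORCH_LOGS prefix convention and the torch._logging.set_logs call follow the torch logging documentation, so treat the exact syntax as illustrative rather than authoritative.

```python
# Command-line form (shell), set before launching the script:
#   TORCH_LOGS="-dynamo,+inductor" python train.py
# "-" lowers a component to logging.ERROR, "+" raises it to logging.DEBUG.

import logging
import torch

# Equivalent in-process API form.
torch._logging.set_logs(dynamo=logging.ERROR, inductor=logging.DEBUG)
```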
PyTorch 2.2 introduces a new abstraction for representing the ProcessGroups involved in distributed parallelism, called torch.distributed.device_mesh.
Setting up the distributed communicators (NCCL communicators) for distributed training is a cumbersome affair: users writing workloads with multiple degrees of parallelism need to manually set up and manage an NCCL communicator (ProcessGroup) for each parallelism dimension.
This process can be complex and error-prone. DeviceMesh can simplify this process and make it easier to manage.
DeviceMesh is a higher-level abstraction for managing ProcessGroups. It allows users to effortlessly create inter-node and intra-node process groups without having to worry about how to correctly set up ranks for different sub-process groups.
For example, one dimension of the mesh can represent data parallelism in FSDP, while another dimension can represent tensor parallelism within FSDP.
Users can also easily manage the underlying process groups through DeviceMesh to achieve multi-dimensional parallelism.
DeviceMesh is useful when dealing with multidimensional parallelism, such as 3D parallelism. As shown in the diagram above, when your parallel solution needs to communicate across and within each host, you can create a 2D mesh that connects the devices in each host and connects each device to its counterparts on the other hosts in a homogeneous setup.
With the help of init_device_mesh(), we can complete the above 2D setup in just two lines:
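Roughly like this, assuming 2 hosts with 4 GPUs each and a launch via torchrun; the mesh dimension names are just example labels.

```python
from torch.distributed.device_mesh import init_device_mesh

# 2-D mesh over 8 ranks: the first dimension spans hosts ("replicate"),
# the second spans the GPUs inside each host ("shard").
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
```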
And without DeviceMesh, we would probably need to write a pile of code like the following:
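Something along these lines, as a hedged sketch of the manual ProcessGroup bookkeeping, assuming the same 2-host x 4-GPU layout as above.

```python
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
dist.init_process_group("nccl")
torch.cuda.set_device(rank % torch.cuda.device_count())

# Manually build the intra-host ("shard") groups, e.g. ranks (0-3) and (4-7),
# and remember which one this rank belongs to.
shard_rank_lists = [list(range(0, 4)), list(range(4, 8))]
shard_groups = [dist.new_group(ranks) for ranks in shard_rank_lists]
current_shard_group = (
    shard_groups[0] if rank in shard_rank_lists[0] else shard_groups[1]
)

# Manually build the inter-host ("replicate") groups, e.g. (0,4), (1,5), ...
# Every rank must create every group, in the same order, so the loop runs on all ranks.
current_replicate_group = None
for i in range(4):
    replicate_ranks = [i, i + 4]
    group = dist.new_group(replicate_ranks)
    if rank in replicate_ranks:
        current_replicate_group = group
```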
Of course, if needed, we can still access the underlying ProcessGroups:
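For example, something like the following; the get_group accessor and whether it accepts dimension names follow the DeviceMesh recipe, so treat the exact name as an assumption for the version you are on.

```python
# Fetch the ProcessGroup behind each dimension of the mesh created above.
replicate_group = mesh_2d.get_group(mesh_dim="replicate")
shard_group = mesh_2d.get_group(mesh_dim="shard")
```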
The improvements around torch.compile roughly include the following:
The compiled optimizer improves performance across all benchmarks: HuggingFace +18%, TorchBench +19%, and TIMM +8% E2E.
The main feature Inductor was missing for multi-tensor optimizer compilation was efficient code generation for the foreach operator.
The compiled optimizer also adds support for cudagraphs.
Averaged over all models in the test suites, benchmark compile time increases by about 40 seconds per test suite; ongoing optimizations may reduce this to under 30 seconds.
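As a rough sketch of what "compiling the optimizer" looks like in user code; the model, optimizer, and shapes here are arbitrary examples rather than anything from the release notes.

```python
import torch

model = torch.nn.Linear(1024, 1024, device="cuda")
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Wrap only the optimizer step in torch.compile; Inductor can then fuse the
# per-parameter (foreach) updates into a small number of kernels.
@torch.compile
def opt_step():
    opt.step()

# One training step: populate gradients, then run the compiled step.
loss = model(torch.randn(16, 1024, device="cuda")).sum()
loss.backward()
opt_step()
```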
Inside the scheduler, the buffer lists registered during lowering are collected into ForeachKernelSchedulerNodes (a subclass of FusedSchedulerNode).
To check whether a fusion is legal, the writes performed by each internal SchedulerNode must match the reads of the consuming SchedulerNode at the same list index.
In addition, the normal vertical fusion rules must allow fusion at each index of the consumer and producer SchedulerNode lists.
If these conditions are met, the ForeachKernelSchedulerNodes are vertically fused into a single ForeachKernelSchedulerNode, in which the corresponding pointwise operations at each list index are fused.
With this fusion in place, a sequence of foreach operations can be fused into a single kernel, enabling full fusion of multi-tensor optimizers.
A number of performance optimizations have also been added to TorchInductor, including horizontal fusion support for torch.concat, improved convolution layout optimization, and improved scaled_dot_product_attention pattern matching.
PyTorch 2.2 also includes a number of performance enhancements for aarch64, including support for mkldnn weight pre-packing, an improved ideep primitive cache, and improved inference speed via oneDNN fixed-format kernel improvements.