Request 1188410 (accepted)

This request supersedes: request 1188322

- update to 2.3.1 with the following summarized highlights:
* from 2.0.x:
- torch.compile is the main API for PyTorch 2.0, which wraps your model and
returns a compiled model. It is a fully additive (and optional) feature
and hence 2.0 is 100% backward compatible by definition
- Accelerated Transformers introduce high-performance support for training
and inference using a custom kernel architecture for scaled dot product
attention (SDPA). The API is integrated with torch.compile() and model
developers may also use the scaled dot product attention kernels directly
by calling the new scaled_dot_product_attention() operator.
* from 2.1.x:
- automatic dynamic shape support in torch.compile,
torch.distributed.checkpoint for saving/loading distributed training jobs
on multiple ranks in parallel, and torch.compile support for the NumPy
API.
- In addition, this release offers numerous performance improvements (e.g.
CPU inductor improvements, AVX512 support, scaled-dot-product-attention
support) as well as a prototype release of torch.export, a sound
full-graph capture mechanism, and torch.export-based quantization.
* from 2.2.x:
- 2x performance improvements to scaled_dot_product_attention via
FlashAttention-v2 integration, as well as AOTInductor, a new
ahead-of-time compilation and deployment tool built for non-Python
server-side deployments.
* from 2.3.x:
- support for user-defined Triton kernels in torch.compile, allowing
users to migrate their own Triton kernels from eager mode without
experiencing performance complications or graph breaks. In addition, Tensor
Parallelism improves the experience for training Large Language Models
using native PyTorch functions, which has been validated on training