Train Your Large Model on Multiple GPUs with Fully Sharded Data Parallelism

This article is divided into five parts; they are:

• Introduction to Fully Sharded Data Parallel
• Preparing Model for FSDP Training
• Training Loop with FSDP
• Fine-Tuning FSDP Behavior
• Checkpointing FSDP Models

Sharding is a term originally used in database management systems, where it refers to dividing a database into smaller units, […]
