A growing problem with training ever-larger foundation models lies in the intricate synchronization of processes spanning thousands of GPUs and even more network connections. A single fault can spoil ...