{ "cells": [ { "cell_type": "markdown", "id": "dca3b3e8", "metadata": { "origin_pos": 0 }, "source": [ "# Concise Implementation for Multiple GPUs\n", ":label:`sec_multi_gpu_concise`\n", "\n", "Implementing parallelism from scratch for every new model is no fun. Moreover, there is significant benefit in optimizing synchronization tools for high performance. In the following we will show how to do this using high-level APIs of deep learning frameworks.\n", "The mathematics and the algorithms are the same as in :numref:`sec_multi_gpu`.\n", "Quite unsurprisingly you will need at least two GPUs to run code of this section.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "091940c4", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:25:31.520170Z", "iopub.status.busy": "2023-08-18T19:25:31.519608Z", "iopub.status.idle": "2023-08-18T19:25:35.444259Z", "shell.execute_reply": "2023-08-18T19:25:35.443315Z" }, "origin_pos": 2, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "import torch\n", "from torch import nn\n", "from d2l import torch as d2l" ] }, { "cell_type": "markdown", "id": "1171a169", "metadata": { "origin_pos": 3 }, "source": [ "## [**A Toy Network**]\n", "\n", "Let's use a slightly more meaningful network than LeNet from :numref:`sec_multi_gpu` that is still sufficiently easy and quick to train.\n", "We pick a ResNet-18 variant :cite:`He.Zhang.Ren.ea.2016`. Since the input images are tiny we modify it slightly. In particular, the difference from :numref:`sec_resnet` is that we use a smaller convolution kernel, stride, and padding at the beginning.\n", "Moreover, we remove the max-pooling layer.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "508e4ec5", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:25:35.448570Z", "iopub.status.busy": "2023-08-18T19:25:35.447857Z", "iopub.status.idle": "2023-08-18T19:25:35.457955Z", "shell.execute_reply": "2023-08-18T19:25:35.456979Z" }, "origin_pos": 5, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "#@save\n", "def resnet18(num_classes, in_channels=1):\n", " \"\"\"A slightly modified ResNet-18 model.\"\"\"\n", " def resnet_block(in_channels, out_channels, num_residuals,\n", " first_block=False):\n", " blk = []\n", " for i in range(num_residuals):\n", " if i == 0 and not first_block:\n", " blk.append(d2l.Residual(out_channels, use_1x1conv=True,\n", " strides=2))\n", " else:\n", " blk.append(d2l.Residual(out_channels))\n", " return nn.Sequential(*blk)\n", "\n", " # This model uses a smaller convolution kernel, stride, and padding and\n", " # removes the max-pooling layer\n", " net = nn.Sequential(\n", " nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1),\n", " nn.BatchNorm2d(64),\n", " nn.ReLU())\n", " net.add_module(\"resnet_block1\", resnet_block(64, 64, 2, first_block=True))\n", " net.add_module(\"resnet_block2\", resnet_block(64, 128, 2))\n", " net.add_module(\"resnet_block3\", resnet_block(128, 256, 2))\n", " net.add_module(\"resnet_block4\", resnet_block(256, 512, 2))\n", " net.add_module(\"global_avg_pool\", nn.AdaptiveAvgPool2d((1,1)))\n", " net.add_module(\"fc\", nn.Sequential(nn.Flatten(),\n", " nn.Linear(512, num_classes)))\n", " return net" ] }, { "cell_type": "markdown", "id": "fa6f89bb", "metadata": { "origin_pos": 6 }, "source": [ "## Network Initialization\n" ] }, { "cell_type": "markdown", "id": "ba314b72", "metadata": { "origin_pos": 8, "tab": [ "pytorch" ] }, "source": [ "We will initialize the network inside the training loop.\n", "For a refresher on 
{ "cell_type": "markdown", "id": "ba314b72", "metadata": { "origin_pos": 8, "tab": [ "pytorch" ] }, "source": [ "We will initialize the network inside the training loop.\n", "For a refresher on initialization methods, see :numref:`sec_numerical_stability`.\n" ] },
{ "cell_type": "code", "execution_count": 3, "id": "465a5950", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:25:35.461588Z", "iopub.status.busy": "2023-08-18T19:25:35.461046Z", "iopub.status.idle": "2023-08-18T19:25:35.473955Z", "shell.execute_reply": "2023-08-18T19:25:35.473185Z" }, "origin_pos": 10, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "net = resnet18(10)\n", "# Get a list of GPUs\n", "devices = d2l.try_all_gpus()\n", "# We will initialize the network inside the training loop" ] },
{ "cell_type": "markdown", "id": "2c115c28", "metadata": { "origin_pos": 17 }, "source": [ "## [**Training**]\n", "\n", "As before, the training code needs to perform several basic functions for efficient parallelism:\n", "\n", "* Network parameters need to be initialized across all devices.\n", "* While iterating over the dataset, minibatches are divided across all devices.\n", "* We compute the loss and its gradient in parallel across devices.\n", "* Gradients are aggregated and parameters are updated accordingly.\n", "\n", "In the end we compute the accuracy (again in parallel) to report the final performance of the network. The training routine is quite similar to implementations in previous chapters, except that we need to split and aggregate data.\n" ] },
{ "cell_type": "code", "execution_count": 4, "id": "699b241a", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:25:35.477572Z", "iopub.status.busy": "2023-08-18T19:25:35.477034Z", "iopub.status.idle": "2023-08-18T19:25:35.485278Z", "shell.execute_reply": "2023-08-18T19:25:35.484313Z" }, "origin_pos": 19, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "def train(net, num_gpus, batch_size, lr):\n", "    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)\n", "    devices = [d2l.try_gpu(i) for i in range(num_gpus)]\n", "    def init_weights(module):\n", "        if type(module) in [nn.Linear, nn.Conv2d]:\n", "            nn.init.normal_(module.weight, std=0.01)\n", "    net.apply(init_weights)\n", "    # Set the model on multiple GPUs\n", "    net = nn.DataParallel(net, device_ids=devices)\n", "    trainer = torch.optim.SGD(net.parameters(), lr)\n", "    loss = nn.CrossEntropyLoss()\n", "    timer, num_epochs = d2l.Timer(), 10\n", "    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])\n", "    for epoch in range(num_epochs):\n", "        net.train()\n", "        timer.start()\n", "        for X, y in train_iter:\n", "            trainer.zero_grad()\n", "            X, y = X.to(devices[0]), y.to(devices[0])\n", "            l = loss(net(X), y)\n", "            l.backward()\n", "            trainer.step()\n", "        timer.stop()\n", "        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(net, test_iter),))\n", "    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '\n", "          f'on {str(devices)}')" ] },
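{ "cell_type": "markdown", "id": "ad2f7c10", "metadata": { "tab": [ "pytorch" ] }, "source": [ "Before launching the full training run, it helps to see what `nn.DataParallel` does with a minibatch. The next cell is a minimal sanity check of our own (it is not part of the training code; the variable names and the batch shape are our choices, and it assumes at least one GPU is available): the wrapped network expects its parameters on the first device in `device_ids`, scatters each minibatch across the GPUs, and gathers the outputs back on that first device.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "ad2f7c11", "metadata": { "tab": [ "pytorch" ] }, "outputs": [], "source": [ "# A minimal sketch (our own check, not used by `train` below): wrap a fresh\n", "# network with nn.DataParallel and push one small random minibatch through it.\n", "toy_net = resnet18(10).to(devices[0])  # parameters must live on devices[0]\n", "parallel_net = nn.DataParallel(toy_net, device_ids=devices)\n", "X = torch.rand(4, 1, 28, 28, device=devices[0])  # shaped like Fashion-MNIST images\n", "Y = parallel_net(X)  # the batch is split across the GPUs; outputs are gathered\n", "print(Y.shape, Y.device)  # expected: torch.Size([4, 10]) on devices[0]" ] },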
{ "cell_type": "markdown", "id": "f1b844d7", "metadata": { "origin_pos": 20 }, "source": [ "Let's see how this works in practice. As a warm-up we [**train the network on a single GPU.**]\n" ] },
{ "cell_type": "code", "execution_count": 5, "id": "79601c9b", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:25:35.488871Z", "iopub.status.busy": "2023-08-18T19:25:35.488396Z", "iopub.status.idle": "2023-08-18T19:27:56.584731Z", "shell.execute_reply": "2023-08-18T19:27:56.583506Z" }, "origin_pos": 22, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test acc: 0.91, 12.2 sec/epoch on [device(type='cuda', index=0)]\n" ] }, { "data": { "text/plain": [ "<Figure: test acc vs. epoch training curve (SVG omitted)>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "train(net, num_gpus=1, batch_size=256, lr=0.1)" ] },
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "train(net, num_gpus=1, batch_size=256, lr=0.1)" ] }, { "cell_type": "markdown", "id": "8104cce9", "metadata": { "origin_pos": 23 }, "source": [ "Next we [**use 2 GPUs for training**]. Compared with LeNet\n", "evaluated in :numref:`sec_multi_gpu`,\n", "the model for ResNet-18 is considerably more complex. This is where parallelization shows its advantage. The time for computation is meaningfully larger than the time for synchronizing parameters. This improves scalability since the overhead for parallelization is less relevant.\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "96ccaa47", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:27:56.590412Z", "iopub.status.busy": "2023-08-18T19:27:56.589484Z", "iopub.status.idle": "2023-08-18T19:29:21.137466Z", "shell.execute_reply": "2023-08-18T19:29:21.136402Z" }, "origin_pos": 25, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test acc: 0.73, 7.5 sec/epoch on [device(type='cuda', index=0), device(type='cuda', index=1)]\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " 2023-08-18T19:29:21.101180\n", " image/svg+xml\n", " \n", " \n", " Matplotlib v3.7.2, https://matplotlib.org/\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "train(net, num_gpus=2, batch_size=512, lr=0.2)" ] }, { "cell_type": "markdown", "id": "eb397925", "metadata": { "origin_pos": 26 }, "source": [ "## Summary\n" ] }, { "cell_type": "markdown", "id": "59836808", "metadata": { "origin_pos": 28 }, "source": [ "* Data is automatically evaluated on the devices where the data can be found.\n", "* Take care to initialize the networks on each device before trying to access the parameters on that device. Otherwise you will encounter an error.\n", "* The optimization algorithms automatically aggregate over multiple GPUs.\n", "\n", "\n", "\n", "## Exercises\n" ] }, { "cell_type": "markdown", "id": "6eeda264", "metadata": { "origin_pos": 30, "tab": [ "pytorch" ] }, "source": [ "1. This section uses ResNet-18. Try different epochs, batch sizes, and learning rates. Use more GPUs for computation. What happens if you try this with 16 GPUs (e.g., on an AWS p2.16xlarge instance)?\n", "1. Sometimes, different devices provide different computing power. We could use the GPUs and the CPU at the same time. How should we divide the work? Is it worth the effort? Why? Why not?\n" ] }, { "cell_type": "markdown", "id": "e4642e6d", "metadata": { "origin_pos": 32, "tab": [ "pytorch" ] }, "source": [ "[Discussions](https://discuss.d2l.ai/t/1403)\n" ] } ], "metadata": { "language_info": { "name": "python" }, "required_libs": [] }, "nbformat": 4, "nbformat_minor": 5 }