{ "cells": [ { "cell_type": "markdown", "id": "4bffbe72", "metadata": { "origin_pos": 0 }, "source": [ "# Asynchronous Computation\n", ":label:`sec_async`\n", "\n", "Today's computers are highly parallel systems, consisting of multiple CPU cores (often multiple threads per core), multiple processing elements per GPU, and often multiple GPUs per device. In short, we can process many different things at the same time, often on different devices. Unfortunately Python is not a great way of writing parallel and asynchronous code, at least not without some extra help. After all, Python is single-threaded and this is unlikely to change in the future. Deep learning frameworks such as MXNet and TensorFlow adopt an *asynchronous programming* model to improve performance,\n", "while PyTorch uses Python's own scheduler leading to a different performance trade-off.\n", "For PyTorch, by default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on the CPU or other GPUs.\n", "\n", "Hence, understanding how asynchronous programming works helps us to develop more efficient programs, by proactively reducing computational requirements and mutual dependencies. This allows us to reduce memory overhead and increase processor utilization.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "e2543b3d", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:47:57.710213Z", "iopub.status.busy": "2023-08-18T19:47:57.709371Z", "iopub.status.idle": "2023-08-18T19:48:00.925162Z", "shell.execute_reply": "2023-08-18T19:48:00.923855Z" }, "origin_pos": 2, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "import os\n", "import subprocess\n", "import numpy\n", "import torch\n", "from torch import nn\n", "from d2l import torch as d2l" ] }, { "cell_type": "markdown", "id": "acc948cf", "metadata": { "origin_pos": 3 }, "source": [ "## Asynchrony via Backend\n" ] }, { "cell_type": "markdown", "id": "c7cd88ca", "metadata": { "origin_pos": 5, "tab": [ "pytorch" ] }, "source": [ "For a warmup consider the following toy problem: we want to generate a random matrix and multiply it. 
{ "cell_type": "markdown", "id": "acc948cf", "metadata": { "origin_pos": 3 }, "source": [ "## Asynchrony via Backend\n" ] },
{ "cell_type": "markdown", "id": "c7cd88ca", "metadata": { "origin_pos": 5, "tab": [ "pytorch" ] }, "source": [ "As a warmup, consider the following toy problem: we want to generate a random matrix and multiply it by itself. Let's do that both in NumPy and with a PyTorch tensor to see the difference.\n", "Note that the PyTorch `tensor` is defined on a GPU.\n" ] },
{ "cell_type": "code", "execution_count": 2, "id": "d76704a2", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:48:00.929925Z", "iopub.status.busy": "2023-08-18T19:48:00.928862Z", "iopub.status.idle": "2023-08-18T19:48:02.651564Z", "shell.execute_reply": "2023-08-18T19:48:02.650169Z" }, "origin_pos": 7, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "numpy: 1.4693 sec\n", "torch: 0.0022 sec\n" ] } ], "source": [ "# Warmup for GPU computation\n", "device = d2l.try_gpu()\n", "a = torch.randn(size=(1000, 1000), device=device)\n", "b = torch.mm(a, a)\n", "\n", "with d2l.Benchmark('numpy'):\n", "    for _ in range(10):\n", "        a = numpy.random.normal(size=(1000, 1000))\n", "        b = numpy.dot(a, a)\n", "\n", "with d2l.Benchmark('torch'):\n", "    for _ in range(10):\n", "        a = torch.randn(size=(1000, 1000), device=device)\n", "        b = torch.mm(a, a)" ] },
{ "cell_type": "markdown", "id": "3a488546", "metadata": { "origin_pos": 9, "tab": [ "pytorch" ] }, "source": [ "The benchmarked PyTorch version is orders of magnitude faster.\n", "The NumPy dot product is executed on the CPU while\n", "the PyTorch matrix multiplication is executed on the GPU, and hence the latter\n", "is expected to be much faster. But the huge time difference suggests something\n", "else must be going on.\n", "By default, GPU operations are asynchronous in PyTorch.\n", "Forcing PyTorch to finish all computation prior to returning shows\n", "what happened previously: computation is being executed by the backend\n", "while the frontend returns control to Python.\n" ] },
{ "cell_type": "code", "execution_count": 3, "id": "8ee490b4", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:48:02.657514Z", "iopub.status.busy": "2023-08-18T19:48:02.656839Z", "iopub.status.idle": "2023-08-18T19:48:02.671673Z", "shell.execute_reply": "2023-08-18T19:48:02.670413Z" }, "origin_pos": 11, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done: 0.0058 sec\n" ] } ], "source": [ "with d2l.Benchmark():\n", "    for _ in range(10):\n", "        a = torch.randn(size=(1000, 1000), device=device)\n", "        b = torch.mm(a, a)\n", "    torch.cuda.synchronize(device)" ] },
{ "cell_type": "markdown", "id": "63514ebe", "metadata": { "origin_pos": 13, "tab": [ "pytorch" ] }, "source": [ "Broadly speaking, PyTorch has a frontend for direct interaction with users, e.g., via Python, as well as a backend used by the system to perform the computation.\n", "As shown in :numref:`fig_frontends`, users can write PyTorch programs in various frontend languages, such as Python and C++. Regardless of the frontend programming language used, the execution of PyTorch programs occurs primarily in the C++ backend. Operations issued by the frontend language are passed on to the backend for execution.\n", "The backend manages its own threads that continuously collect and execute queued tasks.\n", "Note that for this to work, the backend must be able to keep track of the\n", "dependencies between various steps in the computational graph.\n", "Hence, it is not possible to parallelize operations that depend on each other.\n" ] },
{ "cell_type": "markdown", "id": "589e5a04", "metadata": { "origin_pos": 14 }, "source": [ "![Programming language frontends and deep learning framework backends.](../img/frontends.png)\n", ":width:`300px`\n", ":label:`fig_frontends`\n", "\n", "Let's look at another toy example to understand the dependency graph a bit better.\n" ] },
{ "cell_type": "code", "execution_count": 4, "id": "e7220a99", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:48:02.677729Z", "iopub.status.busy": "2023-08-18T19:48:02.677284Z", "iopub.status.idle": "2023-08-18T19:48:02.875593Z", "shell.execute_reply": "2023-08-18T19:48:02.874563Z" }, "origin_pos": 16, "tab": [ "pytorch" ] }, "outputs": [ { "data": { "text/plain": [ "tensor([[3., 3.]], device='cuda:0')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = torch.ones((1, 2), device=device)\n", "y = torch.ones((1, 2), device=device)\n", "z = x * y + 2\n", "z" ] },
{ "cell_type": "markdown", "id": "adf45eb3", "metadata": { "origin_pos": 17 }, "source": [ "![The backend tracks dependencies between various steps in the computational graph.](../img/asyncgraph.svg)\n", ":label:`fig_asyncgraph`\n", "\n", "The code snippet above is also illustrated in :numref:`fig_asyncgraph`.\n", "Whenever the Python frontend thread executes one of the first three statements, it simply adds the task to the backend queue. When the last statement's result needs to be *printed*, the Python frontend thread will wait for the C++ backend thread to finish computing the result of the variable `z`. One benefit of this design is that the Python frontend thread does not need to perform actual computations. Thus, there is little impact on the program's overall performance, regardless of Python's performance. :numref:`fig_threading` illustrates how the frontend and backend interact.\n", "\n", "![Interactions of the frontend and backend.](../img/threading.svg)\n", ":label:`fig_threading`\n", "\n", "## Barriers and Blockers\n" ] },
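{ "cell_type": "markdown", "id": "1f0a9c2e", "metadata": {}, "source": [ "So far we have used `torch.cuda.synchronize` as an explicit barrier: it blocks the Python frontend until the GPU backend has drained its queue. In practice, many everyday operations act as implicit barriers because they cannot complete without the actual numerical result: copying a tensor to the CPU, converting it to a Python scalar, or printing it all force the frontend to wait for the backend. The cell below is a short sketch of these synchronization points; it reuses the `a` and `device` variables defined above and is meant as an illustration rather than a benchmark.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "7d3b5a1c", "metadata": {}, "outputs": [], "source": [ "# Operations that make the frontend wait for the backend (a sketch)\n", "c = torch.mm(a, a)              # enqueued asynchronously; returns immediately\n", "torch.cuda.synchronize(device)  # explicit barrier: wait for all queued work\n", "c_cpu = c.cpu()                 # implicit barrier: the copy needs the result\n", "c_val = c.sum().item()          # implicit barrier: conversion to a Python scalar\n", "print(c[0, 0])                  # implicit barrier: printing needs the value" ] },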
{ "cell_type": "markdown", "id": "62ec654b", "metadata": { "origin_pos": 22 }, "source": [ "## Improving Computation\n" ] },
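{ "cell_type": "markdown", "id": "b4e92f10", "metadata": {}, "source": [ "A practical consequence of the queue-based design is a simple guideline: keep the backend queue filled and do not force a synchronization more often than necessary. A common pitfall is reading a loss value back to the CPU on every training step. The sketch below is a toy example (the tiny linear model and random data are illustrative stand-ins, not objects used elsewhere in this section); it contrasts synchronizing on every iteration via `.item()` with accumulating the loss on the GPU and transferring it once after the loop. For long-running loops it is still advisable to synchronize occasionally, for example once per minibatch or epoch, so that the task queue does not grow too large.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "2e7c1b9d", "metadata": {}, "outputs": [], "source": [ "net = nn.Linear(100, 1).to(device)      # toy stand-ins, not used elsewhere\n", "loss_fn = nn.MSELoss()\n", "optimizer = torch.optim.SGD(net.parameters(), lr=0.01)\n", "X = torch.randn(1024, 100, device=device)\n", "y = torch.randn(1024, 1, device=device)\n", "\n", "# Pattern 1: `.item()` copies the loss to the CPU and therefore forces a\n", "# synchronization on every iteration.\n", "total = 0.0\n", "for i in range(0, 1024, 64):\n", "    l = loss_fn(net(X[i:i + 64]), y[i:i + 64])\n", "    optimizer.zero_grad()\n", "    l.backward()\n", "    optimizer.step()\n", "    total += l.item()                   # blocks the frontend every step\n", "\n", "# Pattern 2: accumulate on the GPU and transfer the result only once,\n", "# keeping the backend queue filled in the meantime.\n", "total = torch.zeros(1, device=device)\n", "for i in range(0, 1024, 64):\n", "    l = loss_fn(net(X[i:i + 64]), y[i:i + 64])\n", "    optimizer.zero_grad()\n", "    l.backward()\n", "    optimizer.step()\n", "    total += l.detach()                 # stays on the GPU; no synchronization\n", "total.item()                            # a single blocking transfer" ] },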
{ "cell_type": "markdown", "id": "922d9547", "metadata": { "origin_pos": 26 }, "source": [ "## Summary\n", "\n", "* Deep learning frameworks may decouple the Python frontend from an execution backend. This allows for fast asynchronous insertion of commands into the backend and associated parallelism.\n", "* Asynchrony leads to a rather responsive frontend. However, use caution not to overfill the task queue, since it may lead to excessive memory consumption. It is recommended to synchronize for each minibatch to keep the frontend and backend approximately synchronized.\n", "* Chip vendors offer sophisticated performance analysis tools to obtain much more fine-grained insight into the efficiency of deep learning systems.\n" ] },
{ "cell_type": "markdown", "id": "58dca706", "metadata": { "origin_pos": 28 }, "source": [ "## Exercises\n" ] },
{ "cell_type": "markdown", "id": "3ca7cabd", "metadata": { "origin_pos": 30, "tab": [ "pytorch" ] }, "source": [ "1. On the CPU, benchmark the same matrix multiplication operations as in this section. Can you still observe asynchrony via the backend?\n" ] },
{ "cell_type": "markdown", "id": "80c5ecfb", "metadata": { "origin_pos": 32, "tab": [ "pytorch" ] }, "source": [ "[Discussions](https://discuss.d2l.ai/t/2564)\n" ] } ], "metadata": { "language_info": { "name": "python" }, "required_libs": [] }, "nbformat": 4, "nbformat_minor": 5 }