{
"cells": [
{
"cell_type": "markdown",
"id": "7a381239",
"metadata": {
"origin_pos": 0
},
"source": [
"# Deep Recurrent Neural Networks\n",
"\n",
":label:`sec_deep_rnn`\n",
"\n",
"Up until now, we have focused on defining networks \n",
"consisting of a sequence input, \n",
"a single hidden RNN layer,\n",
"and an output layer. \n",
"Despite having just one hidden layer \n",
"between the input at any time step\n",
"and the corresponding output,\n",
"there is a sense in which these networks are deep.\n",
"Inputs from the first time step can influence\n",
"the outputs at the final time step $T$ \n",
"(often 100s or 1000s of steps later).\n",
"These inputs pass through $T$ applications\n",
"of the recurrent layer before reaching \n",
"the final output. \n",
"However, we often also wish to retain the ability\n",
"to express complex relationships \n",
"between the inputs at a given time step\n",
"and the outputs at that same time step.\n",
"Thus we often construct RNNs that are deep\n",
"not only in the time direction \n",
"but also in the input-to-output direction.\n",
"This is precisely the notion of depth\n",
"that we have already encountered \n",
"in our development of MLPs\n",
"and deep CNNs.\n",
"\n",
"\n",
"The standard method for building this sort of deep RNN \n",
"is strikingly simple: we stack the RNNs on top of each other. \n",
"Given a sequence of length $T$, the first RNN produces \n",
"a sequence of outputs, also of length $T$.\n",
"These, in turn, constitute the inputs to the next RNN layer. \n",
"In this short section, we illustrate this design pattern\n",
"and present a simple example for how to code up such stacked RNNs.\n",
"Below, in :numref:`fig_deep_rnn`, we illustrate\n",
"a deep RNN with $L$ hidden layers.\n",
"Each hidden state operates on a sequential input\n",
"and produces a sequential output.\n",
"Moreover, any RNN cell (white box in :numref:`fig_deep_rnn`) at each time step\n",
"depends on both the same layer's \n",
"value at the previous time step\n",
"and the previous layer's value \n",
"at the same time step. \n",
"\n",
"\n",
":label:`fig_deep_rnn`\n",
"\n",
"Formally, suppose that we have a minibatch input\n",
"$\\mathbf{X}_t \\in \\mathbb{R}^{n \\times d}$ \n",
"(number of examples $=n$; number of inputs in each example $=d$) at time step $t$.\n",
"At the same time step, \n",
"let the hidden state of the $l^\\textrm{th}$ hidden layer ($l=1,\\ldots,L$) be $\\mathbf{H}_t^{(l)} \\in \\mathbb{R}^{n \\times h}$ \n",
"(number of hidden units $=h$)\n",
"and the output layer variable be \n",
"$\\mathbf{O}_t \\in \\mathbb{R}^{n \\times q}$ \n",
"(number of outputs: $q$).\n",
"Setting $\\mathbf{H}_t^{(0)} = \\mathbf{X}_t$,\n",
"the hidden state of\n",
"the $l^\\textrm{th}$ hidden layer\n",
"that uses the activation function $\\phi_l$\n",
"is calculated as follows:\n",
"\n",
"$$\\mathbf{H}_t^{(l)} = \\phi_l(\\mathbf{H}_t^{(l-1)} \\mathbf{W}_{\\textrm{xh}}^{(l)} + \\mathbf{H}_{t-1}^{(l)} \\mathbf{W}_{\\textrm{hh}}^{(l)} + \\mathbf{b}_\\textrm{h}^{(l)}),$$\n",
":eqlabel:`eq_deep_rnn_H`\n",
"\n",
"where the weights $\\mathbf{W}_{\\textrm{xh}}^{(l)} \\in \\mathbb{R}^{h \\times h}$ and $\\mathbf{W}_{\\textrm{hh}}^{(l)} \\in \\mathbb{R}^{h \\times h}$, together with\n",
"the bias $\\mathbf{b}_\\textrm{h}^{(l)} \\in \\mathbb{R}^{1 \\times h}$, \n",
"are the model parameters of the $l^\\textrm{th}$ hidden layer.\n",
"\n",
"At the end, the calculation of the output layer \n",
"is only based on the hidden state \n",
"of the final $L^\\textrm{th}$ hidden layer:\n",
"\n",
"$$\\mathbf{O}_t = \\mathbf{H}_t^{(L)} \\mathbf{W}_{\\textrm{hq}} + \\mathbf{b}_\\textrm{q},$$\n",
"\n",
"where the weight $\\mathbf{W}_{\\textrm{hq}} \\in \\mathbb{R}^{h \\times q}$ \n",
"and the bias $\\mathbf{b}_\\textrm{q} \\in \\mathbb{R}^{1 \\times q}$ \n",
"are the model parameters of the output layer.\n",
"\n",
"Just as with MLPs, the number of hidden layers $L$ \n",
"and the number of hidden units $h$ are hyperparameters\n",
"that we can tune.\n",
"Common RNN layer widths ($h$) are in the range $(64, 2056)$,\n",
"and common depths ($L$) are in the range $(1, 8)$. \n",
"In addition, we can easily get a deep-gated RNN\n",
"by replacing the hidden state computation in :eqref:`eq_deep_rnn_H`\n",
"with that from an LSTM or a GRU.\n"
]
},
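{
"cell_type": "markdown",
"id": "b3d91c07",
"metadata": {},
"source": [
"To make the parameter shapes concrete, the following minimal sketch counts\n",
"the learnable parameters of the deep RNN defined by :eqref:`eq_deep_rnn_H`\n",
"together with the output layer.\n",
"The helper `deep_rnn_param_count` is a hypothetical name introduced only for illustration,\n",
"and the sketch assumes that all hidden layers share the same width $h$.\n",
"Since the first hidden layer reads the $d$-dimensional input $\\mathbf{X}_t$,\n",
"its input-to-hidden weight has $d \\times h$ entries,\n",
"whereas every deeper layer reads an $h$-dimensional hidden state.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4e02a18",
"metadata": {},
"outputs": [],
"source": [
"def deep_rnn_param_count(d, h, L, q):\n",
"    # Hypothetical helper (not part of d2l): counts the parameters of a\n",
"    # vanilla deep RNN with L hidden layers of width h and q outputs\n",
"    total = 0\n",
"    for l in range(1, L + 1):\n",
"        # Layer 1 reads X_t (d features); deeper layers read H_t^(l-1)\n",
"        fan_in = d if l == 1 else h\n",
"        total += fan_in * h  # W_xh^(l)\n",
"        total += h * h       # W_hh^(l)\n",
"        total += h           # b_h^(l)\n",
"    total += h * q + q       # Output layer: W_hq and b_q\n",
"    return total\n",
"\n",
"# Illustrative sizes (assumed): 28 inputs and outputs, 32 hidden units\n",
"for L in (1, 2, 4):\n",
"    print(L, deep_rnn_param_count(d=28, h=32, L=L, q=28))"
]
},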
{
"cell_type": "code",
"execution_count": 1,
"id": "fcd91977",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:48:08.849784Z",
"iopub.status.busy": "2023-08-18T19:48:08.849030Z",
"iopub.status.idle": "2023-08-18T19:48:11.970528Z",
"shell.execute_reply": "2023-08-18T19:48:11.968861Z"
},
"origin_pos": 3,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"from torch import nn\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "markdown",
"id": "febf1c38",
"metadata": {
"origin_pos": 6
},
"source": [
"## Implementation from Scratch\n",
"\n",
"To implement a multilayer RNN from scratch,\n",
"we can treat each layer as an `RNNScratch` instance\n",
"with its own learnable parameters.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "dcbd1828",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:48:11.977198Z",
"iopub.status.busy": "2023-08-18T19:48:11.976182Z",
"iopub.status.idle": "2023-08-18T19:48:11.985097Z",
"shell.execute_reply": "2023-08-18T19:48:11.983616Z"
},
"origin_pos": 8,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"class StackedRNNScratch(d2l.Module):\n",
" def __init__(self, num_inputs, num_hiddens, num_layers, sigma=0.01):\n",
" super().__init__()\n",
" self.save_hyperparameters()\n",
" self.rnns = nn.Sequential(*[d2l.RNNScratch(\n",
" num_inputs if i==0 else num_hiddens, num_hiddens, sigma)\n",
" for i in range(num_layers)])"
]
},
{
"cell_type": "markdown",
"id": "d540a3a5",
"metadata": {
"origin_pos": 10
},
"source": [
"The multilayer forward computation\n",
"simply performs forward computation\n",
"layer by layer.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "16d0b69d",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:48:11.991389Z",
"iopub.status.busy": "2023-08-18T19:48:11.990127Z",
"iopub.status.idle": "2023-08-18T19:48:11.998937Z",
"shell.execute_reply": "2023-08-18T19:48:11.997525Z"
},
"origin_pos": 11,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"@d2l.add_to_class(StackedRNNScratch)\n",
"def forward(self, inputs, Hs=None):\n",
" outputs = inputs\n",
" if Hs is None: Hs = [None] * self.num_layers\n",
" for i in range(self.num_layers):\n",
" outputs, Hs[i] = self.rnns[i](outputs, Hs[i])\n",
" outputs = torch.stack(outputs, 0)\n",
" return outputs, Hs"
]
},
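{
"cell_type": "markdown",
"id": "d5f13b29",
"metadata": {},
"source": [
"The layer-by-layer logic can also be checked in isolation, without relying on `d2l`.\n",
"The following is a minimal sketch rather than the implementation used above:\n",
"it assumes $\\tanh$ activations, zero initial hidden states,\n",
"and randomly initialized parameters.\n",
"The function name `stacked_rnn_forward` and the layout of `params`\n",
"are hypothetical and chosen only for illustration.\n",
"The output sequence of each layer is stacked into a single tensor\n",
"and then serves as the input sequence of the next layer.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6a24c3a",
"metadata": {
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"\n",
"def stacked_rnn_forward(X, params):\n",
"    # X: (num_steps, batch_size, num_inputs)\n",
"    # params: one (W_xh, W_hh, b_h) tuple per hidden layer (assumed layout)\n",
"    outputs = X\n",
"    final_states = []\n",
"    for W_xh, W_hh, b_h in params:\n",
"        # Zero initial hidden state of shape (batch_size, num_hiddens)\n",
"        H = torch.zeros(outputs.shape[1], W_hh.shape[0])\n",
"        layer_outputs = []\n",
"        for X_t in outputs:  # Iterate over the time steps\n",
"            H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)\n",
"            layer_outputs.append(H)\n",
"        # This layer's output sequence is the next layer's input sequence\n",
"        outputs = torch.stack(layer_outputs, 0)\n",
"        final_states.append(H)\n",
"    return outputs, final_states\n",
"\n",
"# Arbitrary illustrative sizes\n",
"num_steps, batch_size, num_inputs, num_hiddens, num_layers = 5, 2, 4, 8, 3\n",
"params = [(torch.randn(num_inputs if i == 0 else num_hiddens,\n",
"                       num_hiddens) * 0.01,\n",
"           torch.randn(num_hiddens, num_hiddens) * 0.01,\n",
"           torch.zeros(num_hiddens))\n",
"          for i in range(num_layers)]\n",
"X = torch.randn(num_steps, batch_size, num_inputs)\n",
"outputs, Hs = stacked_rnn_forward(X, params)\n",
"print(outputs.shape, len(Hs))"
]
},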
{
"cell_type": "markdown",
"id": "8e65edf9",
"metadata": {
"origin_pos": 12
},
"source": [
"As an example, we train a deep GRU model on\n",
"*The Time Machine* dataset (same as in :numref:`sec_rnn-scratch`).\n",
"To keep things simple we set the number of layers to 2.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "dad8438b",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:48:12.003936Z",
"iopub.status.busy": "2023-08-18T19:48:12.003560Z",
"iopub.status.idle": "2023-08-18T19:52:08.500314Z",
"shell.execute_reply": "2023-08-18T19:52:08.499056Z"
},
"origin_pos": 13,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"data = d2l.TimeMachine(batch_size=1024, num_steps=32)\n",
"rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),\n",
" num_hiddens=32, num_layers=2)\n",
"model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)\n",
"trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)\n",
"trainer.fit(model, data)"
]
},
{
"cell_type": "markdown",
"id": "7b8d2e65",
"metadata": {
"origin_pos": 14
},
"source": [
"## Concise Implementation\n"
]
},
{
"cell_type": "markdown",
"id": "236b929c",
"metadata": {
"origin_pos": 15,
"tab": [
"pytorch"
]
},
"source": [
"Fortunately many of the logistical details required\n",
"to implement multiple layers of an RNN \n",
"are readily available in high-level APIs.\n",
"Our concise implementation will use such built-in functionalities.\n",
"The code generalizes the one we used previously in :numref:`sec_gru`,\n",
"letting us specify the number of layers explicitly \n",
"rather than picking the default of only one layer.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8d2a6b50",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:52:08.519473Z",
"iopub.status.busy": "2023-08-18T19:52:08.519078Z",
"iopub.status.idle": "2023-08-18T19:52:08.525324Z",
"shell.execute_reply": "2023-08-18T19:52:08.524335Z"
},
"origin_pos": 18,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"class GRU(d2l.RNN): #@save\n",
" \"\"\"The multilayer GRU model.\"\"\"\n",
" def __init__(self, num_inputs, num_hiddens, num_layers, dropout=0):\n",
" d2l.Module.__init__(self)\n",
" self.save_hyperparameters()\n",
" self.rnn = nn.GRU(num_inputs, num_hiddens, num_layers,\n",
" dropout=dropout)"
]
},
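{
"cell_type": "markdown",
"id": "f7b35d4b",
"metadata": {},
"source": [
"As a quick sanity check of what `num_layers` changes,\n",
"the sketch below calls `nn.GRU` directly on random data\n",
"(the sizes are arbitrary and chosen only for illustration).\n",
"With the default `batch_first=False`, the input has shape\n",
"(number of steps, batch size, number of inputs).\n",
"The returned output sequence comes from the top layer only,\n",
"whereas the returned state holds the final hidden state of every layer.\n",
"Note also that the `dropout` argument of `nn.GRU` acts on the outputs\n",
"of every layer except the last, so it only has an effect when `num_layers` is greater than 1.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8c46e5c",
"metadata": {
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"from torch import nn\n",
"\n",
"# Arbitrary illustrative sizes\n",
"num_steps, batch_size, num_inputs, num_hiddens, num_layers = 5, 2, 4, 8, 3\n",
"rnn = nn.GRU(num_inputs, num_hiddens, num_layers)\n",
"X = torch.randn(num_steps, batch_size, num_inputs)\n",
"outputs, H = rnn(X)\n",
"print(outputs.shape)  # Top-layer hidden states at every time step\n",
"print(H.shape)        # Final hidden state of each layer\n",
"# The last output of the top layer matches the top layer's final state\n",
"print(torch.allclose(outputs[-1], H[-1]))"
]
},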
{
"cell_type": "markdown",
"id": "4942ac43",
"metadata": {
"origin_pos": 21
},
"source": [
"The architectural decisions such as choosing hyperparameters \n",
"are very similar to those of :numref:`sec_gru`.\n",
"We pick the same number of inputs and outputs \n",
"as we have distinct tokens, i.e., `vocab_size`.\n",
"The number of hidden units is still 32.\n",
"The only difference is that we now \n",
"(**select a nontrivial number of hidden layers \n",
"by specifying the value of `num_layers`.**)\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e4201eec",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:52:08.528774Z",
"iopub.status.busy": "2023-08-18T19:52:08.528499Z",
"iopub.status.idle": "2023-08-18T19:55:24.406556Z",
"shell.execute_reply": "2023-08-18T19:55:24.405655Z"
},
"origin_pos": 23,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"gru = GRU(num_inputs=len(data.vocab), num_hiddens=32, num_layers=2)\n",
"model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)\n",
"trainer.fit(model, data)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d1f034f9",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:55:24.426604Z",
"iopub.status.busy": "2023-08-18T19:55:24.425906Z",
"iopub.status.idle": "2023-08-18T19:55:24.462233Z",
"shell.execute_reply": "2023-08-18T19:55:24.461393Z"
},
"origin_pos": 24,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"'it has for and the time th'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.predict('it has', 20, data.vocab, d2l.try_gpu())"
]
},
{
"cell_type": "markdown",
"id": "645263ba",
"metadata": {
"origin_pos": 27
},
"source": [
"## Summary\n",
"\n",
"In deep RNNs, the hidden state information is passed \n",
"to the next time step of the current layer \n",
"and the current time step of the next layer.\n",
"There exist many different flavors of deep RNNs, such as LSTMs, GRUs, or vanilla RNNs. \n",
"Conveniently, these models are all available \n",
"as parts of the high-level APIs of deep learning frameworks.\n",
"Initialization of models requires care. \n",
"Overall, deep RNNs require considerable amount of work \n",
"(such as learning rate and clipping) \n",
"to ensure proper convergence.\n",
"\n",
"## Exercises\n",
"\n",
"1. Replace the GRU by an LSTM and compare the accuracy and training speed.\n",
"1. Increase the training data to include multiple books. How low can you go on the perplexity scale?\n",
"1. Would you want to combine sources of different authors when modeling text? Why is this a good idea? What could go wrong?\n"
]
},
{
"cell_type": "markdown",
"id": "ea822948",
"metadata": {
"origin_pos": 29,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/1058)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}