{ "cells": [ { "cell_type": "markdown", "id": "7d2e5134", "metadata": { "origin_pos": 0 }, "source": [ "# Sentiment Analysis: Using Recurrent Neural Networks\n", ":label:`sec_sentiment_rnn` \n", "\n", "\n", "Like word similarity and analogy tasks,\n", "we can also apply pretrained word vectors\n", "to sentiment analysis.\n", "Since the IMDb review dataset\n", "in :numref:`sec_sentiment`\n", "is not very big,\n", "using text representations\n", "that were pretrained\n", "on large-scale corpora\n", "may reduce overfitting of the model.\n", "As a specific example\n", "illustrated in :numref:`fig_nlp-map-sa-rnn`,\n", "we will represent each token\n", "using the pretrained GloVe model,\n", "and feed these token representations\n", "into a multilayer bidirectional RNN\n", "to obtain the text sequence representation,\n", "which will\n", "be transformed into \n", "sentiment analysis outputs :cite:`Maas.Daly.Pham.ea.2011`.\n", "For the same downstream application,\n", "we will consider a different architectural\n", "choice later.\n", "\n", "![This section feeds pretrained GloVe to an RNN-based architecture for sentiment analysis.](../img/nlp-map-sa-rnn.svg)\n", ":label:`fig_nlp-map-sa-rnn`\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "351073ee", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:31:46.176830Z", "iopub.status.busy": "2023-08-18T19:31:46.176318Z", "iopub.status.idle": "2023-08-18T19:32:28.594759Z", "shell.execute_reply": "2023-08-18T19:32:28.593738Z" }, "origin_pos": 2, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "import torch\n", "from torch import nn\n", "from d2l import torch as d2l\n", "\n", "batch_size = 64\n", "train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)" ] }, { "cell_type": "markdown", "id": "76be3883", "metadata": { "origin_pos": 3 }, "source": [ "## Representing Single Text with RNNs\n", "\n", "In text classifications tasks,\n", "such as sentiment analysis,\n", "a varying-length text sequence \n", "will be transformed into fixed-length categories.\n", "In the following `BiRNN` class,\n", "while each token of a text sequence\n", "gets its individual\n", "pretrained GloVe\n", "representation via the embedding layer\n", "(`self.embedding`),\n", "the entire sequence\n", "is encoded by a bidirectional RNN (`self.encoder`).\n", "More concretely,\n", "the hidden states (at the last layer)\n", "of the bidirectional LSTM\n", "at both the initial and final time steps\n", "are concatenated \n", "as the representation of the text sequence.\n", "This single text representation\n", "is then transformed into output categories\n", "by a fully connected layer (`self.decoder`)\n", "with two outputs (\"positive\" and \"negative\").\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "f5afe3a9", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:28.599503Z", "iopub.status.busy": "2023-08-18T19:32:28.598815Z", "iopub.status.idle": "2023-08-18T19:32:28.605966Z", "shell.execute_reply": "2023-08-18T19:32:28.605157Z" }, "origin_pos": 5, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "class BiRNN(nn.Module):\n", " def __init__(self, vocab_size, embed_size, num_hiddens,\n", " num_layers, **kwargs):\n", " super(BiRNN, self).__init__(**kwargs)\n", " self.embedding = nn.Embedding(vocab_size, embed_size)\n", " # Set `bidirectional` to True to get a bidirectional RNN\n", " self.encoder = nn.LSTM(embed_size, num_hiddens, num_layers=num_layers,\n", " bidirectional=True)\n", " self.decoder = nn.Linear(4 * num_hiddens, 2)\n", 
"\n", " def forward(self, inputs):\n", " # The shape of `inputs` is (batch size, no. of time steps). Because\n", " # LSTM requires its input's first dimension to be the temporal\n", " # dimension, the input is transposed before obtaining token\n", " # representations. The output shape is (no. of time steps, batch size,\n", " # word vector dimension)\n", " embeddings = self.embedding(inputs.T)\n", " self.encoder.flatten_parameters()\n", " # Returns hidden states of the last hidden layer at different time\n", " # steps. The shape of `outputs` is (no. of time steps, batch size,\n", " # 2 * no. of hidden units)\n", " outputs, _ = self.encoder(embeddings)\n", " # Concatenate the hidden states at the initial and final time steps as\n", " # the input of the fully connected layer. Its shape is (batch size,\n", " # 4 * no. of hidden units)\n", " encoding = torch.cat((outputs[0], outputs[-1]), dim=1)\n", " outs = self.decoder(encoding)\n", " return outs" ] }, { "cell_type": "markdown", "id": "733003ba", "metadata": { "origin_pos": 6 }, "source": [ "Let's construct a bidirectional RNN with two hidden layers to represent single text for sentiment analysis.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "2a895a32", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:28.609785Z", "iopub.status.busy": "2023-08-18T19:32:28.609294Z", "iopub.status.idle": "2023-08-18T19:32:28.692548Z", "shell.execute_reply": "2023-08-18T19:32:28.691572Z" }, "origin_pos": 7, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "embed_size, num_hiddens, num_layers, devices = 100, 100, 2, d2l.try_all_gpus()\n", "net = BiRNN(len(vocab), embed_size, num_hiddens, num_layers)" ] }, { "cell_type": "code", "execution_count": 4, "id": "a17b69d9", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:28.698062Z", "iopub.status.busy": "2023-08-18T19:32:28.697265Z", "iopub.status.idle": "2023-08-18T19:32:28.708391Z", "shell.execute_reply": "2023-08-18T19:32:28.707453Z" }, "origin_pos": 9, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "def init_weights(module):\n", " if type(module) == nn.Linear:\n", " nn.init.xavier_uniform_(module.weight)\n", " if type(module) == nn.LSTM:\n", " for param in module._flat_weights_names:\n", " if \"weight\" in param:\n", " nn.init.xavier_uniform_(module._parameters[param])\n", "net.apply(init_weights);" ] }, { "cell_type": "markdown", "id": "6f879042", "metadata": { "origin_pos": 10 }, "source": [ "## Loading Pretrained Word Vectors\n", "\n", "Below we load the pretrained 100-dimensional (needs to be consistent with `embed_size`) GloVe embeddings for tokens in the vocabulary.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "b6ff3161", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:28.713419Z", "iopub.status.busy": "2023-08-18T19:32:28.712837Z", "iopub.status.idle": "2023-08-18T19:32:51.636936Z", "shell.execute_reply": "2023-08-18T19:32:51.636041Z" }, "origin_pos": 11, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "glove_embedding = d2l.TokenEmbedding('glove.6b.100d')" ] }, { "cell_type": "markdown", "id": "996032f3", "metadata": { "origin_pos": 12 }, "source": [ "Print the shape of the vectors\n", "for all the tokens in the vocabulary.\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "85c438f4", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:51.640667Z", "iopub.status.busy": "2023-08-18T19:32:51.640383Z", "iopub.status.idle": "2023-08-18T19:32:51.686003Z", "shell.execute_reply": 
"2023-08-18T19:32:51.685005Z" }, "origin_pos": 13, "tab": [ "pytorch" ] }, "outputs": [ { "data": { "text/plain": [ "torch.Size([49346, 100])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embeds = glove_embedding[vocab.idx_to_token]\n", "embeds.shape" ] }, { "cell_type": "markdown", "id": "5d9788ba", "metadata": { "origin_pos": 14 }, "source": [ "We use these pretrained\n", "word vectors\n", "to represent tokens in the reviews\n", "and will not update\n", "these vectors during training.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "39110559", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:51.689593Z", "iopub.status.busy": "2023-08-18T19:32:51.689043Z", "iopub.status.idle": "2023-08-18T19:32:51.694629Z", "shell.execute_reply": "2023-08-18T19:32:51.693728Z" }, "origin_pos": 16, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "net.embedding.weight.data.copy_(embeds)\n", "net.embedding.weight.requires_grad = False" ] }, { "cell_type": "markdown", "id": "fcf2cd33", "metadata": { "origin_pos": 17 }, "source": [ "## Training and Evaluating the Model\n", "\n", "Now we can train the bidirectional RNN for sentiment analysis.\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "4a70c144", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:51.698736Z", "iopub.status.busy": "2023-08-18T19:32:51.698181Z", "iopub.status.idle": "2023-08-18T19:34:15.725402Z", "shell.execute_reply": "2023-08-18T19:34:15.724177Z" }, "origin_pos": 19, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loss 0.277, train acc 0.884, test acc 0.861\n", "2608.4 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " 2023-08-18T19:34:15.663316\n", " image/svg+xml\n", " \n", " \n", " Matplotlib v3.7.2, https://matplotlib.org/\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "lr, num_epochs = 0.01, 5\n", "trainer = torch.optim.Adam(net.parameters(), lr=lr)\n", "loss = nn.CrossEntropyLoss(reduction=\"none\")\n", "d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)" ] }, { "cell_type": "markdown", "id": "74249641", "metadata": { "origin_pos": 20 }, "source": [ "We define the following function to predict the sentiment of a text sequence using the trained model `net`.\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "e27658b2", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:34:15.729313Z", "iopub.status.busy": "2023-08-18T19:34:15.728502Z", "iopub.status.idle": "2023-08-18T19:34:15.736056Z", "shell.execute_reply": "2023-08-18T19:34:15.734676Z" }, "origin_pos": 22, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "#@save\n", "def predict_sentiment(net, vocab, sequence):\n", " \"\"\"Predict the sentiment of a text sequence.\"\"\"\n", " sequence = torch.tensor(vocab[sequence.split()], device=d2l.try_gpu())\n", " label = torch.argmax(net(sequence.reshape(1, -1)), dim=1)\n", " return 'positive' if label == 1 else 'negative'" ] }, { "cell_type": "markdown", "id": "e8ccb33a", "metadata": { "origin_pos": 23 }, "source": [ "Finally, let's use the trained model to predict the sentiment for two simple sentences.\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "91f39ffa", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:34:15.740295Z", "iopub.status.busy": "2023-08-18T19:34:15.739642Z", "iopub.status.idle": "2023-08-18T19:34:15.754013Z", "shell.execute_reply": "2023-08-18T19:34:15.752828Z" }, "origin_pos": 24, "tab": [ "pytorch" ] }, "outputs": [ { "data": { "text/plain": [ "'positive'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predict_sentiment(net, vocab, 'this movie is so great')" ] }, { "cell_type": "code", "execution_count": 11, "id": "94b2f3e8", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:34:15.757463Z", "iopub.status.busy": "2023-08-18T19:34:15.756900Z", "iopub.status.idle": "2023-08-18T19:34:15.763631Z", "shell.execute_reply": "2023-08-18T19:34:15.762783Z" }, "origin_pos": 25, "tab": [ "pytorch" ] }, "outputs": [ { "data": { "text/plain": [ "'negative'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predict_sentiment(net, vocab, 'this movie is so bad')" ] }, { "cell_type": "markdown", "id": "89cb0585", "metadata": { "origin_pos": 26 }, "source": [ "## Summary\n", "\n", "* Pretrained word vectors can represent individual tokens in a text sequence.\n", "* Bidirectional RNNs can represent a text sequence, such as via the concatenation of its hidden states at the initial and final time steps. This single text representation can be transformed into categories using a fully connected layer.\n", "\n", "\n", "\n", "## Exercises\n", "\n", "1. Increase the number of epochs. Can you improve the training and testing accuracies? How about tuning other hyperparameters?\n", "1. Use larger pretrained word vectors, such as 300-dimensional GloVe embeddings. Does it improve classification accuracy?\n", "1. Can we improve the classification accuracy by using the spaCy tokenization? You need to install spaCy (`pip install spacy`) and install the English package (`python -m spacy download en`). In the code, first, import spaCy (`import spacy`). Then, load the spaCy English package (`spacy_en = spacy.load('en')`). 
{ "cell_type": "markdown", "id": "09007e06", "metadata": { "origin_pos": 28, "tab": [ "pytorch" ] }, "source": [ "[Discussions](https://discuss.d2l.ai/t/1424)\n" ] } ], "metadata": { "language_info": { "name": "python" }, "required_libs": [] }, "nbformat": 4, "nbformat_minor": 5 }