{ "cells": [ { "cell_type": "markdown", "id": "a8e2c9d2", "metadata": { "origin_pos": 0 }, "source": [ "# Sentiment Analysis and the Dataset\n", ":label:`sec_sentiment`\n", "\n", "\n", "With the proliferation of online social media\n", "and review platforms,\n", "a plethora of\n", "opinionated data\n", "has been logged,\n", "bearing great potential for\n", "supporting decision making processes.\n", "*Sentiment analysis*\n", "studies people's sentiments\n", "in their produced text,\n", "such as product reviews,\n", "blog comments,\n", "and\n", "forum discussions.\n", "It enjoys wide applications\n", "to fields as diverse as \n", "politics (e.g., analysis of public sentiments towards policies),\n", "finance (e.g., analysis of sentiments of the market),\n", "and \n", "marketing (e.g., product research and brand management).\n", "\n", "Since sentiments\n", "can be categorized\n", "as discrete polarities or scales (e.g., positive and negative),\n", "we can consider \n", "sentiment analysis \n", "as a text classification task,\n", "which transforms a varying-length text sequence\n", "into a fixed-length text category.\n", "In this chapter,\n", "we will use Stanford's [large movie review dataset](https://ai.stanford.edu/%7Eamaas/data/sentiment/)\n", "for sentiment analysis. \n", "It consists of a training set and a testing set, \n", "either containing 25000 movie reviews downloaded from IMDb.\n", "In both datasets, \n", "there are equal number of \n", "\"positive\" and \"negative\" labels,\n", "indicating different sentiment polarities.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "f9217be5", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:27:34.097526Z", "iopub.status.busy": "2023-08-18T19:27:34.097247Z", "iopub.status.idle": "2023-08-18T19:27:37.093684Z", "shell.execute_reply": "2023-08-18T19:27:37.092038Z" }, "origin_pos": 2, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "import os\n", "import torch\n", "from torch import nn\n", "from d2l import torch as d2l" ] }, { "cell_type": "markdown", "id": "4bf02ae4", "metadata": { "origin_pos": 3 }, "source": [ "## Reading the Dataset\n", "\n", "First, download and extract this IMDb review dataset\n", "in the path `../data/aclImdb`.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "6df26269", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:27:37.099078Z", "iopub.status.busy": "2023-08-18T19:27:37.097890Z", "iopub.status.idle": "2023-08-18T19:27:56.963098Z", "shell.execute_reply": "2023-08-18T19:27:56.962130Z" }, "origin_pos": 4, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading ../data/aclImdb_v1.tar.gz from http://d2l-data.s3-accelerate.amazonaws.com/aclImdb_v1.tar.gz...\n" ] } ], "source": [ "#@save\n", "d2l.DATA_HUB['aclImdb'] = (d2l.DATA_URL + 'aclImdb_v1.tar.gz',\n", " '01ada507287d82875905620988597833ad4e0903')\n", "\n", "data_dir = d2l.download_extract('aclImdb', 'aclImdb')" ] }, { "cell_type": "markdown", "id": "eac581ac", "metadata": { "origin_pos": 5 }, "source": [ "Next, read the training and test datasets. 
{ "cell_type": "markdown", "id": "eac581ac", "metadata": { "origin_pos": 5 }, "source": [ "Next, read the training and test datasets.\n", "Each example is a review and its label: 1 for \"positive\" and 0 for \"negative\".\n" ] },
{ "cell_type": "code", "execution_count": 3, "id": "7473bc5a", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:27:56.967361Z", "iopub.status.busy": "2023-08-18T19:27:56.966560Z", "iopub.status.idle": "2023-08-18T19:27:57.573255Z", "shell.execute_reply": "2023-08-18T19:27:57.572306Z" }, "origin_pos": 6, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# training examples: 25000\n", "label: 1 review: Zentropa has much in common with The Third Man, another noir\n", "label: 1 review: Zentropa is the most original movie I've seen in years. If y\n", "label: 1 review: Lars Von Trier is never backward in trying out new technique\n" ] } ], "source": [ "#@save\n", "def read_imdb(data_dir, is_train):\n", "    \"\"\"Read the IMDb review dataset text sequences and labels.\"\"\"\n", "    data, labels = [], []\n", "    for label in ('pos', 'neg'):\n", "        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',\n", "                                   label)\n", "        for file in os.listdir(folder_name):\n", "            with open(os.path.join(folder_name, file), 'rb') as f:\n", "                review = f.read().decode('utf-8').replace('\\n', '')\n", "                data.append(review)\n", "                labels.append(1 if label == 'pos' else 0)\n", "    return data, labels\n", "\n", "train_data = read_imdb(data_dir, is_train=True)\n", "print('# training examples:', len(train_data[0]))\n", "for x, y in zip(train_data[0][:3], train_data[1][:3]):\n", "    print('label:', y, 'review:', x[:60])" ] },
{ "cell_type": "markdown", "id": "e10fa446", "metadata": { "origin_pos": 7 }, "source": [ "## Preprocessing the Dataset\n", "\n", "Treating each word as a token\n", "and filtering out words that appear fewer than five times,\n", "we create a vocabulary out of the training dataset.\n" ] },
{ "cell_type": "code", "execution_count": 4, "id": "e5fc58b9", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:27:57.576587Z", "iopub.status.busy": "2023-08-18T19:27:57.576303Z", "iopub.status.idle": "2023-08-18T19:27:59.652710Z", "shell.execute_reply": "2023-08-18T19:27:59.651764Z" }, "origin_pos": 8, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "train_tokens = d2l.tokenize(train_data[0], token='word')\n", "vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])" ] },
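{ "cell_type": "markdown", "id": "5b8d2e4f", "metadata": {}, "source": [ "As a brief illustration (an added example, not part of the original section),\n", "we can look up the vocabulary indices of the first few tokens of a review.\n", "In `d2l.Vocab`, tokens filtered out by `min_freq` map to the index of the unknown token `<unk>`.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "8d2f5b3c", "metadata": { "tab": [ "pytorch" ] }, "outputs": [], "source": [ "# Added example: map a few tokens from the first review to their indices\n", "tokens = train_tokens[0][:8]\n", "print(tokens)\n", "print(vocab[tokens])" ] },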
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "d2l.set_figsize()\n", "d2l.plt.xlabel('# tokens per review')\n", "d2l.plt.ylabel('count')\n", "d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));" ] }, { "cell_type": "markdown", "id": "88f60611", "metadata": { "origin_pos": 11 }, "source": [ "As we expected,\n", "the reviews have varying lengths.\n", "To process\n", "a minibatch of such reviews at each time,\n", "we set the length of each review to 500 with truncation and padding,\n", "which is similar to \n", "the preprocessing step \n", "for the machine translation dataset\n", "in :numref:`sec_machine_translation`.\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "fca901d7", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:27:59.959214Z", "iopub.status.busy": "2023-08-18T19:27:59.958931Z", "iopub.status.idle": "2023-08-18T19:28:05.474448Z", "shell.execute_reply": "2023-08-18T19:28:05.473520Z" }, "origin_pos": 12, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([25000, 500])\n" ] } ], "source": [ "num_steps = 500 # sequence length\n", "train_features = torch.tensor([d2l.truncate_pad(\n", " vocab[line], num_steps, vocab['']) for line in train_tokens])\n", "print(train_features.shape)" ] }, { "cell_type": "markdown", "id": "6c1a0d71", "metadata": { "origin_pos": 13 }, "source": [ "## Creating Data Iterators\n", "\n", "Now we can create data iterators.\n", "At each iteration, a minibatch of examples are returned.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "70c6864a", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:28:05.477697Z", "iopub.status.busy": "2023-08-18T19:28:05.477416Z", "iopub.status.idle": "2023-08-18T19:28:05.509970Z", "shell.execute_reply": "2023-08-18T19:28:05.509156Z" }, "origin_pos": 15, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X: torch.Size([64, 500]) , y: torch.Size([64])\n", "# batches: 391\n" ] } ], "source": [ "train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])), 64)\n", "\n", "for X, y in train_iter:\n", " print('X:', X.shape, ', y:', y.shape)\n", " break\n", "print('# batches:', len(train_iter))" ] }, { "cell_type": "markdown", "id": "d4e0e168", "metadata": { "origin_pos": 16 }, "source": [ "## Putting It All Together\n", "\n", "Last, we wrap up the above steps into the `load_data_imdb` function.\n", "It returns training and test data iterators and the vocabulary of the IMDb review dataset.\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "0856deec", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:28:05.513373Z", "iopub.status.busy": "2023-08-18T19:28:05.512812Z", "iopub.status.idle": "2023-08-18T19:28:05.519500Z", "shell.execute_reply": "2023-08-18T19:28:05.518633Z" }, "origin_pos": 18, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "#@save\n", "def load_data_imdb(batch_size, num_steps=500):\n", " \"\"\"Return data iterators and the vocabulary of the IMDb review dataset.\"\"\"\n", " data_dir = d2l.download_extract('aclImdb', 'aclImdb')\n", " train_data = read_imdb(data_dir, True)\n", " test_data = read_imdb(data_dir, False)\n", " train_tokens = d2l.tokenize(train_data[0], token='word')\n", " test_tokens = d2l.tokenize(test_data[0], token='word')\n", " vocab = d2l.Vocab(train_tokens, min_freq=5)\n", " train_features = torch.tensor([d2l.truncate_pad(\n", " vocab[line], num_steps, vocab['']) for line 
{ "cell_type": "markdown", "id": "6c1a0d71", "metadata": { "origin_pos": 13 }, "source": [ "## Creating Data Iterators\n", "\n", "Now we can create data iterators.\n", "At each iteration, a minibatch of examples is returned.\n" ] },
{ "cell_type": "code", "execution_count": 7, "id": "70c6864a", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:28:05.477697Z", "iopub.status.busy": "2023-08-18T19:28:05.477416Z", "iopub.status.idle": "2023-08-18T19:28:05.509970Z", "shell.execute_reply": "2023-08-18T19:28:05.509156Z" }, "origin_pos": 15, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X: torch.Size([64, 500]) , y: torch.Size([64])\n", "# batches: 391\n" ] } ], "source": [ "train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])), 64)\n", "\n", "for X, y in train_iter:\n", "    print('X:', X.shape, ', y:', y.shape)\n", "    break\n", "print('# batches:', len(train_iter))" ] },
{ "cell_type": "markdown", "id": "d4e0e168", "metadata": { "origin_pos": 16 }, "source": [ "## Putting It All Together\n", "\n", "Finally, we wrap up the above steps in the `load_data_imdb` function.\n", "It returns training and test data iterators and the vocabulary of the IMDb review dataset.\n" ] },
{ "cell_type": "code", "execution_count": 8, "id": "0856deec", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:28:05.513373Z", "iopub.status.busy": "2023-08-18T19:28:05.512812Z", "iopub.status.idle": "2023-08-18T19:28:05.519500Z", "shell.execute_reply": "2023-08-18T19:28:05.518633Z" }, "origin_pos": 18, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "#@save\n", "def load_data_imdb(batch_size, num_steps=500):\n", "    \"\"\"Return data iterators and the vocabulary of the IMDb review dataset.\"\"\"\n", "    data_dir = d2l.download_extract('aclImdb', 'aclImdb')\n", "    train_data = read_imdb(data_dir, True)\n", "    test_data = read_imdb(data_dir, False)\n", "    train_tokens = d2l.tokenize(train_data[0], token='word')\n", "    test_tokens = d2l.tokenize(test_data[0], token='word')\n", "    vocab = d2l.Vocab(train_tokens, min_freq=5)\n", "    train_features = torch.tensor([d2l.truncate_pad(\n", "        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])\n", "    test_features = torch.tensor([d2l.truncate_pad(\n", "        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])\n", "    train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),\n", "                                batch_size)\n", "    test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),\n", "                               batch_size,\n", "                               is_train=False)\n", "    return train_iter, test_iter, vocab" ] },
{ "cell_type": "markdown", "id": "7cf41fda", "metadata": { "origin_pos": 19 }, "source": [ "## Summary\n", "\n", "* Sentiment analysis studies people's sentiments in the text they produce; it can be treated as a text classification problem that transforms a varying-length text sequence into a fixed-length text category.\n", "* After preprocessing, we can load Stanford's large movie review dataset (the IMDb review dataset) into data iterators together with a vocabulary.\n", "\n", "\n", "## Exercises\n", "\n", "\n", "1. What hyperparameters in this section can we modify to accelerate training sentiment analysis models?\n", "1. Can you implement a function to load the dataset of [Amazon reviews](https://snap.stanford.edu/data/web-Amazon.html) into data iterators and labels for sentiment analysis?\n" ] },
{ "cell_type": "markdown", "id": "a8e3879c", "metadata": { "origin_pos": 21, "tab": [ "pytorch" ] }, "source": [ "[Discussions](https://discuss.d2l.ai/t/1387)\n" ] } ], "metadata": { "language_info": { "name": "python" }, "required_libs": [] }, "nbformat": 4, "nbformat_minor": 5 }