{ "cells": [ { "cell_type": "markdown", "id": "0d9388b7", "metadata": { "origin_pos": 1 }, "source": [ "# Predicting House Prices on Kaggle\n", ":label:`sec_kaggle_house`\n", "\n", "Now that we have introduced some basic tools\n", "for building and training deep networks\n", "and regularizing them with techniques including\n", "weight decay and dropout,\n", "we are ready to put all this knowledge into practice\n", "by participating in a Kaggle competition.\n", "The house price prediction competition\n", "is a great place to start.\n", "The data is fairly generic and does not exhibit exotic structure\n", "that might require specialized models (as audio or video might).\n", "This dataset, collected by :citet:`De-Cock.2011`,\n", "covers house prices in Ames, Iowa from the period 2006--2010.\n", "It is considerably larger than the famous [Boston housing dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names) of Harrison and Rubinfeld (1978),\n", "boasting both more examples and more features.\n", "\n", "\n", "In this section, we will walk you through details of\n", "data preprocessing, model design, and hyperparameter selection.\n", "We hope that through a hands-on approach,\n", "you will gain some intuitions that will guide you\n", "in your career as a data scientist.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "1c33eb92", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:21.587414Z", "iopub.status.busy": "2023-08-18T19:32:21.586752Z", "iopub.status.idle": "2023-08-18T19:32:24.821984Z", "shell.execute_reply": "2023-08-18T19:32:24.820834Z" }, "origin_pos": 3, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas as pd\n", "import torch\n", "from torch import nn\n", "from d2l import torch as d2l" ] }, { "cell_type": "markdown", "id": "22637186", "metadata": { "origin_pos": 6 }, "source": [ "## Downloading Data\n", "\n", "Throughout the book, we will train and test 
models\n", "on various downloaded datasets.\n", "Here, we (**implement two utility functions**)\n", "for downloading and extracting zip or tar files.\n", "Again, we skip implementation details of\n", "such utility functions.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "c5b9dd70", "metadata": { "attributes": { "classes": [], "id": "", "n": "2" }, "execution": { "iopub.execute_input": "2023-08-18T19:32:24.826201Z", "iopub.status.busy": "2023-08-18T19:32:24.825720Z", "iopub.status.idle": "2023-08-18T19:32:24.831209Z", "shell.execute_reply": "2023-08-18T19:32:24.830384Z" }, "origin_pos": 7, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "def download(url, folder, sha1_hash=None):\n", " \"\"\"Download a file to folder and return the local filepath.\"\"\"\n", "\n", "def extract(filename, folder):\n", " \"\"\"Extract a zip/tar file into folder.\"\"\"" ] }, { "cell_type": "markdown", "id": "ff0b2664", "metadata": { "origin_pos": 8 }, "source": [ "## Kaggle\n", "\n", "[Kaggle](https://www.kaggle.com) is a popular platform\n", "that hosts machine learning competitions.\n", "Each competition centers on a dataset and many\n", "are sponsored by stakeholders who offer prizes\n", "to the winning solutions.\n", "The platform helps users to interact\n", "via forums and shared code,\n", "fostering both collaboration and competition.\n", "While leaderboard chasing often spirals out of control,\n", "with researchers focusing myopically on preprocessing steps\n", "rather than asking fundamental questions,\n", "there is also tremendous value in the objectivity of a platform\n", "that facilitates direct quantitative comparisons\n", "among competing approaches as well as code sharing\n", "so that everyone can learn what did and did not work.\n", "If you want to participate in a Kaggle competition,\n", "you will first need to register for an account\n", "(see :numref:`fig_kaggle`).\n", "\n", "![The Kaggle website.](../img/kaggle.png)\n", ":width:`400px`\n", 
":label:`fig_kaggle`\n", "\n", "On the house price prediction competition page, as illustrated\n", "in :numref:`fig_house_pricing`,\n", "you can find the dataset (under the \"Data\" tab),\n", "submit predictions, and see your ranking,\n", "The URL is right here:\n", "\n", "> https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n", "\n", "![The house price prediction competition page.](../img/house-pricing.png)\n", ":width:`400px`\n", ":label:`fig_house_pricing`\n", "\n", "## Accessing and Reading the Dataset\n", "\n", "Note that the competition data is separated\n", "into training and test sets.\n", "Each record includes the property value of the house\n", "and attributes such as street type, year of construction,\n", "roof type, basement condition, etc.\n", "The features consist of various data types.\n", "For example, the year of construction\n", "is represented by an integer,\n", "the roof type by discrete categorical assignments,\n", "and other features by floating point numbers.\n", "And here is where reality complicates things:\n", "for some examples, some data is altogether missing\n", "with the missing value marked simply as \"na\".\n", "The price of each house is included\n", "for the training set only\n", "(it is a competition after all).\n", "We will want to partition the training set\n", "to create a validation set,\n", "but we only get to evaluate our models on the official test set\n", "after uploading predictions to Kaggle.\n", "The \"Data\" tab on the competition tab\n", "in :numref:`fig_house_pricing`\n", "has links for downloading the data.\n", "\n", "To get started, we will [**read in and process the data\n", "using `pandas`**], which we introduced in :numref:`sec_pandas`.\n", "For convenience, we can download and cache\n", "the Kaggle housing dataset.\n", "If a file corresponding to this dataset already exists in the cache directory and its SHA-1 matches `sha1_hash`, our code will use the cached file to avoid clogging up your 
Internet with redundant downloads.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "48c41477", "metadata": { "attributes": { "classes": [], "id": "", "n": "30" }, "execution": { "iopub.execute_input": "2023-08-18T19:32:24.835437Z", "iopub.status.busy": "2023-08-18T19:32:24.834529Z", "iopub.status.idle": "2023-08-18T19:32:24.840555Z", "shell.execute_reply": "2023-08-18T19:32:24.839683Z" }, "origin_pos": 9, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "class KaggleHouse(d2l.DataModule):\n", " def __init__(self, batch_size, train=None, val=None):\n", " super().__init__()\n", " self.save_hyperparameters()\n", " if self.train is None:\n", " self.raw_train = pd.read_csv(d2l.download(\n", " d2l.DATA_URL + 'kaggle_house_pred_train.csv', self.root,\n", " sha1_hash='585e9cc93e70b39160e7921475f9bcd7d31219ce'))\n", " self.raw_val = pd.read_csv(d2l.download(\n", " d2l.DATA_URL + 'kaggle_house_pred_test.csv', self.root,\n", " sha1_hash='fa19780a7b011d9b009e8bff8e99922a8ee2eb90'))" ] }, { "cell_type": "markdown", "id": "45dfab1b", "metadata": { "origin_pos": 10 }, "source": [ "The training dataset includes 1460 examples,\n", "80 features, and one label, while the validation data\n", "contains 1459 examples and 80 features.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "7e9e8f7c", "metadata": { "attributes": { "classes": [], "id": "", "n": "31" }, "execution": { "iopub.execute_input": "2023-08-18T19:32:24.844705Z", "iopub.status.busy": "2023-08-18T19:32:24.843955Z", "iopub.status.idle": "2023-08-18T19:32:25.218067Z", "shell.execute_reply": "2023-08-18T19:32:25.217232Z" }, "origin_pos": 11, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading ../data/kaggle_house_pred_train.csv from http://d2l-data.s3-accelerate.amazonaws.com/kaggle_house_pred_train.csv...\n", "Downloading ../data/kaggle_house_pred_test.csv from http://d2l-data.s3-accelerate.amazonaws.com/kaggle_house_pred_test.csv...\n" ] }, { 
"name": "stdout", "output_type": "stream", "text": [ "(1460, 81)\n", "(1459, 80)\n" ] } ], "source": [ "data = KaggleHouse(batch_size=64)\n", "print(data.raw_train.shape)\n", "print(data.raw_val.shape)" ] }, { "cell_type": "markdown", "id": "017e793e", "metadata": { "origin_pos": 12 }, "source": [ "## Data Preprocessing\n", "\n", "Let's [**take a look at the first four and final two features\n", "as well as the label (SalePrice)**] from the first four examples.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "92621a85", "metadata": { "attributes": { "classes": [], "id": "", "n": "10" }, "execution": { "iopub.execute_input": "2023-08-18T19:32:25.221755Z", "iopub.status.busy": "2023-08-18T19:32:25.221161Z", "iopub.status.idle": "2023-08-18T19:32:25.230323Z", "shell.execute_reply": "2023-08-18T19:32:25.229502Z" }, "origin_pos": 13, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Id MSSubClass MSZoning LotFrontage SaleType SaleCondition SalePrice\n", "0 1 60 RL 65.0 WD Normal 208500\n", "1 2 20 RL 80.0 WD Normal 181500\n", "2 3 60 RL 68.0 WD Normal 223500\n", "3 4 70 RL 60.0 WD Abnorml 140000\n" ] } ], "source": [ "print(data.raw_train.iloc[:4, [0, 1, 2, 3, -3, -2, -1]])" ] }, { "cell_type": "markdown", "id": "caf77495", "metadata": { "origin_pos": 14 }, "source": [ "We can see that in each example, the first feature is the identifier.\n", "This helps the model determine each training example.\n", "While this is convenient, it does not carry\n", "any information for prediction purposes.\n", "Hence, we will remove it from the dataset\n", "before feeding the data into the model.\n", "Furthermore, given a wide variety of data types,\n", "we will need to preprocess the data before we can start modeling.\n", "\n", "\n", "Let's start with the numerical features.\n", "First, we apply a heuristic,\n", "[**replacing all missing values\n", "by the corresponding feature's mean.**]\n", "Then, to put all features on a common 
scale,\n", "we (***standardize* the data by\n", "rescaling features to zero mean and unit variance**):\n", "\n", "$$x \\leftarrow \\frac{x - \\mu}{\\sigma},$$\n", "\n", "where $\\mu$ and $\\sigma$ denote mean and standard deviation, respectively.\n", "To verify that this indeed transforms\n", "our feature (variable) such that it has zero mean and unit variance,\n", "note that $E[\\frac{x-\\mu}{\\sigma}] = \\frac{\\mu - \\mu}{\\sigma} = 0$\n", "and that $E[(x-\\mu)^2] = (\\sigma^2 + \\mu^2) - 2\\mu^2+\\mu^2 = \\sigma^2$.\n", "Intuitively, we standardize the data\n", "for two reasons.\n", "First, it proves convenient for optimization.\n", "Second, because we do not know *a priori*\n", "which features will be relevant,\n", "we do not want to penalize coefficients\n", "assigned to one feature more than any other.\n", "\n", "[**Next we deal with discrete values.**]\n", "These include features such as \"MSZoning\".\n", "(**We replace them by a one-hot encoding**)\n", "in the same way that we earlier transformed\n", "multiclass labels into vectors (see :numref:`subsec_classification-problem`).\n", "For instance, \"MSZoning\" assumes the values \"RL\" and \"RM\".\n", "Dropping the \"MSZoning\" feature,\n", "two new indicator features\n", "\"MSZoning_RL\" and \"MSZoning_RM\" are created with values being either 0 or 1.\n", "According to one-hot encoding,\n", "if the original value of \"MSZoning\" is \"RL\",\n", "then \"MSZoning_RL\" is 1 and \"MSZoning_RM\" is 0.\n", "The `pandas` package does this automatically for us.\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "11b2ad6e", "metadata": { "attributes": { "classes": [], "id": "", "n": "32" }, "execution": { "iopub.execute_input": "2023-08-18T19:32:25.233625Z", "iopub.status.busy": "2023-08-18T19:32:25.233151Z", "iopub.status.idle": "2023-08-18T19:32:25.239490Z", "shell.execute_reply": "2023-08-18T19:32:25.238692Z" }, "origin_pos": 15, "tab": [ "pytorch" ] }, "outputs": [], "source": [ 
"@d2l.add_to_class(KaggleHouse)\n", "def preprocess(self):\n", " # Remove the ID and label columns\n", " label = 'SalePrice'\n", " features = pd.concat(\n", " (self.raw_train.drop(columns=['Id', label]),\n", " self.raw_val.drop(columns=['Id'])))\n", " # Standardize numerical columns\n", " numeric_features = features.dtypes[features.dtypes!='object'].index\n", " features[numeric_features] = features[numeric_features].apply(\n", " lambda x: (x - x.mean()) / (x.std()))\n", " # Replace NAN numerical features by 0\n", " features[numeric_features] = features[numeric_features].fillna(0)\n", " # Replace discrete features by one-hot encoding\n", " features = pd.get_dummies(features, dummy_na=True)\n", " # Save preprocessed features\n", " self.train = features[:self.raw_train.shape[0]].copy()\n", " self.train[label] = self.raw_train[label]\n", " self.val = features[self.raw_train.shape[0]:].copy()" ] }, { "cell_type": "markdown", "id": "3d38d0ed", "metadata": { "origin_pos": 16 }, "source": [ "You can see that this conversion increases\n", "the number of features from 79 to 331 (excluding ID and label columns).\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "a9e39c34", "metadata": { "attributes": { "classes": [], "id": "", "n": "33" }, "execution": { "iopub.execute_input": "2023-08-18T19:32:25.242819Z", "iopub.status.busy": "2023-08-18T19:32:25.242192Z", "iopub.status.idle": "2023-08-18T19:32:25.356247Z", "shell.execute_reply": "2023-08-18T19:32:25.355251Z" }, "origin_pos": 17, "tab": [ "pytorch" ] }, "outputs": [ { "data": { "text/plain": [ "(1460, 331)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.preprocess()\n", "data.train.shape" ] }, { "cell_type": "markdown", "id": "4a4f19be", "metadata": { "origin_pos": 18 }, "source": [ "## Error Measure\n", "\n", "To get started we will train a linear model with squared loss. 
Not surprisingly, our linear model will not lead to a competition-winning submission but it does provide a sanity check to see whether there is meaningful information in the data. If we cannot do better than random guessing here, then there might be a good chance that we have a data processing bug. And if things work, the linear model will serve as a baseline giving us some intuition about how close the simple model gets to the best reported models, giving us a sense of how much gain we should expect from fancier models.\n", "\n", "With house prices, as with stock prices,\n", "we care about relative quantities\n", "more than absolute quantities.\n", "Thus [**we tend to care more about\n", "the relative error $\\frac{y - \\hat{y}}{y}$**]\n", "than about the absolute error $y - \\hat{y}$.\n", "For instance, if our prediction is off by \\$100,000\n", "when estimating the price of a house in rural Ohio,\n", "where the value of a typical house is \\$125,000,\n", "then we are probably doing a horrible job.\n", "On the other hand, if we err by this amount\n", "in Los Altos Hills, California,\n", "this might represent a stunningly accurate prediction\n", "(there, the median house price exceeds \\$4 million).\n", "\n", "(**One way to address this problem is to\n", "measure the discrepancy in the logarithm of the price estimates.**)\n", "In fact, this is also the official error measure\n", "used by the competition to evaluate the quality of submissions.\n", "After all, a small value $\\delta$ for $|\\log y - \\log \\hat{y}| \\leq \\delta$\n", "translates into $e^{-\\delta} \\leq \\frac{\\hat{y}}{y} \\leq e^\\delta$.\n", "This leads to the following root-mean-squared-error between the logarithm of the predicted price and the logarithm of the label price:\n", "\n", "$$\\sqrt{\\frac{1}{n}\\sum_{i=1}^n\\left(\\log y_i -\\log \\hat{y}_i\\right)^2}.$$\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "22cee03d", "metadata": { "attributes": { "classes": [], "id": "", "n": 
"60" }, "execution": { "iopub.execute_input": "2023-08-18T19:32:25.360088Z", "iopub.status.busy": "2023-08-18T19:32:25.359480Z", "iopub.status.idle": "2023-08-18T19:32:25.365132Z", "shell.execute_reply": "2023-08-18T19:32:25.364342Z" }, "origin_pos": 19, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "@d2l.add_to_class(KaggleHouse)\n", "def get_dataloader(self, train):\n", " label = 'SalePrice'\n", " data = self.train if train else self.val\n", " if label not in data: return\n", " get_tensor = lambda x: torch.tensor(x.values.astype(float),\n", " dtype=torch.float32)\n", " # Logarithm of prices\n", " tensors = (get_tensor(data.drop(columns=[label])), # X\n", " torch.log(get_tensor(data[label])).reshape((-1, 1))) # Y\n", " return self.get_tensorloader(tensors, train)" ] }, { "cell_type": "markdown", "id": "737b03af", "metadata": { "origin_pos": 20 }, "source": [ "## $K$-Fold Cross-Validation\n", "\n", "You might recall that we introduced [**cross-validation**]\n", "in :numref:`subsec_generalization-model-selection`, where we discussed how to deal\n", "with model selection.\n", "We will put this to good use to select the model design\n", "and to adjust the hyperparameters.\n", "We first need a function that returns\n", "the $i^\\textrm{th}$ fold of the data\n", "in a $K$-fold cross-validation procedure.\n", "It proceeds by slicing out the $i^\\textrm{th}$ segment\n", "as validation data and returning the rest as training data.\n", "Note that this is not the most efficient way of handling data\n", "and we would definitely do something much smarter\n", "if our dataset was considerably larger.\n", "But this added complexity might obfuscate our code unnecessarily\n", "so we can safely omit it here owing to the simplicity of our problem.\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "e6949856", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:25.368517Z", "iopub.status.busy": "2023-08-18T19:32:25.367949Z", "iopub.status.idle": 
"2023-08-18T19:32:25.372985Z", "shell.execute_reply": "2023-08-18T19:32:25.372067Z" }, "origin_pos": 21, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "def k_fold_data(data, k):\n", " rets = []\n", " fold_size = data.train.shape[0] // k\n", " for j in range(k):\n", " idx = range(j * fold_size, (j+1) * fold_size)\n", " rets.append(KaggleHouse(data.batch_size, data.train.drop(index=idx),\n", " data.train.loc[idx]))\n", " return rets" ] }, { "cell_type": "markdown", "id": "f7071050", "metadata": { "origin_pos": 22 }, "source": [ "[**The average validation error is returned**]\n", "when we train $K$ times in the $K$-fold cross-validation.\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "c626ec24", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:25.376435Z", "iopub.status.busy": "2023-08-18T19:32:25.375867Z", "iopub.status.idle": "2023-08-18T19:32:25.381314Z", "shell.execute_reply": "2023-08-18T19:32:25.380464Z" }, "origin_pos": 23, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "def k_fold(trainer, data, k, lr):\n", " val_loss, models = [], []\n", " for i, data_fold in enumerate(k_fold_data(data, k)):\n", " model = d2l.LinearRegression(lr)\n", " model.board.yscale='log'\n", " if i != 0: model.board.display = False\n", " trainer.fit(model, data_fold)\n", " val_loss.append(float(model.board.data['val_loss'][-1].y))\n", " models.append(model)\n", " print(f'average validation log mse = {sum(val_loss)/len(val_loss)}')\n", " return models" ] }, { "cell_type": "markdown", "id": "bcb1790f", "metadata": { "origin_pos": 24 }, "source": [ "## [**Model Selection**]\n", "\n", "In this example, we pick an untuned set of hyperparameters\n", "and leave it up to the reader to improve the model.\n", "Finding a good choice can take time,\n", "depending on how many variables one optimizes over.\n", "With a large enough dataset,\n", "and the normal sorts of hyperparameters,\n", "$K$-fold cross-validation tends to be\n", "reasonably resilient 
against multiple testing.\n", "However, if we try an unreasonably large number of options\n", "we might find that our validation\n", "performance is no longer representative of the true error.\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "c86184c4", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:25.384646Z", "iopub.status.busy": "2023-08-18T19:32:25.384079Z", "iopub.status.idle": "2023-08-18T19:32:37.095341Z", "shell.execute_reply": "2023-08-18T19:32:37.094054Z" }, "origin_pos": 25, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "average validation log mse = 0.17325432986021042\n" ] } ], "source": [ "trainer = d2l.Trainer(max_epochs=10)\n", "models = k_fold(trainer, data, k=5, lr=0.01)" ] }, { "cell_type": "markdown", "id": "158afa13", "metadata": { "origin_pos": 26 }, "source": [ "Notice that sometimes the training error\n", "for a set of hyperparameters can be very low,\n", "even as the error on $K$-fold cross-validation\n", "grows considerably higher.\n", "This indicates that we are overfitting.\n", "Throughout training you will want to monitor both numbers.\n", "Less overfitting might indicate that our data can support a more powerful model.\n", "Massive overfitting might suggest that we can gain\n", "by incorporating regularization techniques.\n", "\n", "## [**Submitting Predictions on Kaggle**]\n", "\n", "Now that we know what a good choice of hyperparameters should be,\n", "we can\n", "average the predictions\n", "on the test set\n", "made by all $K$ models.\n", "Saving the predictions in a csv file\n", "will simplify uploading the results to Kaggle.\n", "The following code will generate a file called `submission.csv`.\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "f4a3bcde", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T19:32:37.100208Z", "iopub.status.busy": "2023-08-18T19:32:37.099453Z", "iopub.status.idle": "2023-08-18T19:32:37.266811Z", "shell.execute_reply": "2023-08-18T19:32:37.265844Z" }, "origin_pos": 27, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "preds = [model(torch.tensor(data.val.values.astype(float), dtype=torch.float32))\n", " for model in models]\n", "# Taking exponentiation of predictions in the logarithm scale\n", "ensemble_preds = torch.exp(torch.cat(preds, 1)).mean(1)\n", "submission = pd.DataFrame({'Id':data.raw_val.Id,\n", " 'SalePrice':ensemble_preds.detach().numpy()})\n", "submission.to_csv('submission.csv', index=False)" ] }, { "cell_type": "markdown", "id": "206db870", "metadata": { 
"origin_pos": 28 }, "source": [ "Next, as demonstrated in :numref:`fig_kaggle_submit2`,\n", "we can submit our predictions on Kaggle\n", "and see how they compare with the actual house prices (labels)\n", "on the test set.\n", "The steps are quite simple:\n", "\n", "* Log in to the Kaggle website and visit the house price prediction competition page.\n", "* Click the “Submit Predictions” or “Late Submission” button.\n", "* Click the “Upload Submission File” button in the dashed box at the bottom of the page and select the prediction file you wish to upload.\n", "* Click the “Make Submission” button at the bottom of the page to view your results.\n", "\n", "![Submitting data to Kaggle](../img/kaggle-submit2.png)\n", ":width:`400px`\n", ":label:`fig_kaggle_submit2`\n", "\n", "## Summary and Discussion\n", "\n", "Real data often contains a mix of different data types and needs to be preprocessed.\n", "Rescaling real-valued data to zero mean and unit variance is a good default. So is replacing missing values with their mean.\n", "Furthermore, transforming categorical features into indicator features allows us to treat them like one-hot vectors.\n", "When we tend to care more about\n", "the relative error than about the absolute error,\n", "we can \n", "measure the discrepancy in the logarithm of the prediction.\n", "To select the model and adjust the hyperparameters,\n", "we can use $K$-fold cross-validation .\n", "\n", "\n", "\n", "## Exercises\n", "\n", "1. Submit your predictions for this section to Kaggle. How good are they?\n", "1. Is it always a good idea to replace missing values by a mean? Hint: can you construct a situation where the values are not missing at random?\n", "1. Improve the score by tuning the hyperparameters through $K$-fold cross-validation.\n", "1. Improve the score by improving the model (e.g., layers, weight decay, and dropout).\n", "1. 
What happens if we do not standardize the continuous numerical features as we have done in this section?\n" ] }, { "cell_type": "markdown", "id": "fe43aac4", "metadata": { "origin_pos": 30, "tab": [ "pytorch" ] }, "source": [ "[Discussions](https://discuss.d2l.ai/t/107)\n" ] } ], "metadata": { "language_info": { "name": "python" }, "required_libs": [] }, "nbformat": 4, "nbformat_minor": 5 }