{
"cells": [
{
"cell_type": "markdown",
"id": "0337f854",
"metadata": {
"origin_pos": 1
},
"source": [
"# Softmax Regression Implementation from Scratch\n",
":label:`sec_softmax_scratch`\n",
"\n",
"Because softmax regression is so fundamental,\n",
"we believe that you ought to know\n",
"how to implement it yourself.\n",
"Here, we limit ourselves to defining the\n",
"softmax-specific aspects of the model\n",
"and reuse the other components\n",
"from our linear regression section,\n",
"including the training loop.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "27f90605",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:40:38.020091Z",
"iopub.status.busy": "2023-08-18T19:40:38.019758Z",
"iopub.status.idle": "2023-08-18T19:40:41.731894Z",
"shell.execute_reply": "2023-08-18T19:40:41.728859Z"
},
"origin_pos": 3,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "markdown",
"id": "7f7c0f8d",
"metadata": {
"origin_pos": 6
},
"source": [
"## The Softmax\n",
"\n",
"Let's begin with the most important part:\n",
"the mapping from scalars to probabilities.\n",
"For a refresher, recall the operation of the sum operator\n",
"along specific dimensions in a tensor,\n",
"as discussed in :numref:`subsec_lin-alg-reduction`\n",
"and :numref:`subsec_lin-alg-non-reduction`.\n",
"[**Given a matrix `X` we can sum over all elements (by default) or only\n",
"over elements in the same axis.**]\n",
"The `axis` variable lets us compute row and column sums:\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4721b51f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:40:41.736669Z",
"iopub.status.busy": "2023-08-18T19:40:41.735590Z",
"iopub.status.idle": "2023-08-18T19:40:41.768196Z",
"shell.execute_reply": "2023-08-18T19:40:41.766964Z"
},
"origin_pos": 7,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"(tensor([[5., 7., 9.]]),\n",
" tensor([[ 6.],\n",
" [15.]]))"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])\n",
"X.sum(0, keepdims=True), X.sum(1, keepdims=True)"
]
},
{
"cell_type": "markdown",
"id": "e183bcdc",
"metadata": {
"origin_pos": 8
},
"source": [
"Computing the softmax requires three steps:\n",
"(i) exponentiation of each term;\n",
"(ii) a sum over each row to compute the normalization constant for each example;\n",
"(iii) division of each row by its normalization constant,\n",
"ensuring that the result sums to 1:\n",
"\n",
"(**\n",
"$$\\mathrm{softmax}(\\mathbf{X})_{ij} = \\frac{\\exp(\\mathbf{X}_{ij})}{\\sum_k \\exp(\\mathbf{X}_{ik})}.$$\n",
"**)\n",
"\n",
"The (logarithm of the) denominator\n",
"is called the (log) *partition function*.\n",
"It was introduced in [statistical physics](https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics))\n",
"to sum over all possible states in a thermodynamic ensemble.\n",
"The implementation is straightforward:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d2f22e34",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:40:41.772527Z",
"iopub.status.busy": "2023-08-18T19:40:41.771945Z",
"iopub.status.idle": "2023-08-18T19:40:41.777757Z",
"shell.execute_reply": "2023-08-18T19:40:41.776558Z"
},
"origin_pos": 9,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def softmax(X):\n",
" X_exp = torch.exp(X)\n",
" partition = X_exp.sum(1, keepdims=True)\n",
" return X_exp / partition # The broadcasting mechanism is applied here"
]
},
{
"cell_type": "markdown",
"id": "12b4d33d",
"metadata": {
"origin_pos": 10
},
"source": [
"For any input `X`, [**we turn each element\n",
"into a nonnegative number.\n",
"Each row sums up to 1,**]\n",
"as is required for a probability. Caution: the code above is *not* robust against very large or very small arguments. While it is sufficient to illustrate what is happening, you should *not* use this code verbatim for any serious purpose. Deep learning frameworks have such protections built in and we will be using the built-in softmax going forward.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "90ec733c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:40:41.782237Z",
"iopub.status.busy": "2023-08-18T19:40:41.781359Z",
"iopub.status.idle": "2023-08-18T19:40:41.793051Z",
"shell.execute_reply": "2023-08-18T19:40:41.792163Z"
},
"origin_pos": 12,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"(tensor([[0.2511, 0.1417, 0.1158, 0.2529, 0.2385],\n",
" [0.2004, 0.1419, 0.1957, 0.2504, 0.2117]]),\n",
" tensor([1., 1.]))"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = torch.rand((2, 5))\n",
"X_prob = softmax(X)\n",
"X_prob, X_prob.sum(1)"
]
},
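{
"cell_type": "markdown",
"id": "a3f1c9d0",
"metadata": {},
"source": [
"As a sanity check, the following minimal sketch compares the from-scratch computation\n",
"with PyTorch's built-in `torch.softmax` on a small input of moderate magnitude.\n",
"The helper `naive_softmax` and the tensor `Z` are names introduced here purely for illustration;\n",
"the helper simply repeats the three steps described above.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c9d1",
"metadata": {
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"\n",
"def naive_softmax(Z):\n",
"    # Illustrative copy of the three steps: exponentiate, sum each row, divide\n",
"    Z_exp = torch.exp(Z)\n",
"    return Z_exp / Z_exp.sum(1, keepdim=True)\n",
"\n",
"Z = torch.tensor([[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]])\n",
"# For moderate inputs the two results are expected to agree\n",
"# up to floating-point rounding\n",
"print(torch.allclose(naive_softmax(Z), torch.softmax(Z, dim=1)))"
]
},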
{
"cell_type": "markdown",
"id": "23d983d7",
"metadata": {
"origin_pos": 14
},
"source": [
"## The Model\n",
"\n",
"We now have everything that we need\n",
"to implement [**the softmax regression model.**]\n",
"As in our linear regression example,\n",
"each instance will be represented\n",
"by a fixed-length vector.\n",
"Since the raw data here consists\n",
"of $28 \\times 28$ pixel images,\n",
"[**we flatten each image,\n",
"treating them as vectors of length 784.**]\n",
"In later chapters, we will introduce\n",
"convolutional neural networks,\n",
"which exploit the spatial structure\n",
"in a more satisfying way.\n",
"\n",
"\n",
"In softmax regression,\n",
"the number of outputs from our network\n",
"should be equal to the number of classes.\n",
"(**Since our dataset has 10 classes,\n",
"our network has an output dimension of 10.**)\n",
"Consequently, our weights constitute a $784 \\times 10$ matrix\n",
"plus a $1 \\times 10$ row vector for the biases.\n",
"As with linear regression,\n",
"we initialize the weights `W`\n",
"with Gaussian noise.\n",
"The biases are initialized as zeros.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b88679dc",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:40:41.796788Z",
"iopub.status.busy": "2023-08-18T19:40:41.796307Z",
"iopub.status.idle": "2023-08-18T19:40:41.803043Z",
"shell.execute_reply": "2023-08-18T19:40:41.802032Z"
},
"origin_pos": 16,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"class SoftmaxRegressionScratch(d2l.Classifier):\n",
" def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):\n",
" super().__init__()\n",
" self.save_hyperparameters()\n",
" self.W = torch.normal(0, sigma, size=(num_inputs, num_outputs),\n",
" requires_grad=True)\n",
" self.b = torch.zeros(num_outputs, requires_grad=True)\n",
"\n",
" def parameters(self):\n",
" return [self.W, self.b]"
]
},
{
"cell_type": "markdown",
"id": "cffef3c8",
"metadata": {
"origin_pos": 19
},
"source": [
"The code below defines how the network\n",
"maps each input to an output.\n",
"Note that we flatten each $28 \\times 28$ pixel image in the batch\n",
"into a vector using `reshape`\n",
"before passing the data through our model.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6525b147",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:40:41.810424Z",
"iopub.status.busy": "2023-08-18T19:40:41.807390Z",
"iopub.status.idle": "2023-08-18T19:40:41.815642Z",
"shell.execute_reply": "2023-08-18T19:40:41.814520Z"
},
"origin_pos": 20,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"@d2l.add_to_class(SoftmaxRegressionScratch)\n",
"def forward(self, X):\n",
" X = X.reshape((-1, self.W.shape[0]))\n",
" return softmax(torch.matmul(X, self.W) + self.b)"
]
},
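{
"cell_type": "markdown",
"id": "a3f1c9d2",
"metadata": {},
"source": [
"To make the shapes concrete, here is a small sketch of the same\n",
"flatten-then-multiply computation on random data.\n",
"It assumes minibatches of shape (batch size, 1, 28, 28);\n",
"`imgs`, `W_demo`, and `b_demo` are stand-ins made up for this illustration\n",
"and are not part of the model defined above.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c9d3",
"metadata": {
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"\n",
"batch_size, num_inputs, num_outputs = 4, 784, 10\n",
"# Hypothetical minibatch of single-channel 28 x 28 images\n",
"imgs = torch.rand((batch_size, 1, 28, 28))\n",
"W_demo = torch.normal(0, 0.01, size=(num_inputs, num_outputs))\n",
"b_demo = torch.zeros(num_outputs)\n",
"\n",
"# Flatten every image: (4, 1, 28, 28) -> (4, 784)\n",
"flat = imgs.reshape((-1, num_inputs))\n",
"# Affine map: (4, 784) @ (784, 10) -> (4, 10); b_demo is broadcast over rows\n",
"logits = torch.matmul(flat, W_demo) + b_demo\n",
"print(flat.shape, logits.shape)"
]
},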
{
"cell_type": "markdown",
"id": "358c8926",
"metadata": {
"origin_pos": 21
},
"source": [
"## The Cross-Entropy Loss\n",
"\n",
"Next we need to implement the cross-entropy loss function\n",
"(introduced in :numref:`subsec_softmax-regression-loss-func`).\n",
"This may be the most common loss function\n",
"in all of deep learning.\n",
"At the moment, applications of deep learning\n",
"easily cast as classification problems\n",
"far outnumber those better treated as regression problems.\n",
"\n",
"Recall that cross-entropy takes the negative log-likelihood\n",
"of the predicted probability assigned to the true label.\n",
"For efficiency we avoid Python for-loops and use indexing instead.\n",
"In particular, the one-hot encoding in $\\mathbf{y}$\n",
"allows us to select the matching terms in $\\hat{\\mathbf{y}}$.\n",
"\n",
"To see this in action we [**create sample data `y_hat`\n",
"with 2 examples of predicted probabilities over 3 classes and their corresponding labels `y`.**]\n",
"The correct labels are $0$ and $2$ respectively (i.e., the first and third class).\n",
"[**Using `y` as the indices of the probabilities in `y_hat`,**]\n",
"we can pick out terms efficiently.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "be06d72f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:40:41.820273Z",
"iopub.status.busy": "2023-08-18T19:40:41.819514Z",
"iopub.status.idle": "2023-08-18T19:40:41.829451Z",
"shell.execute_reply": "2023-08-18T19:40:41.828524Z"
},
"origin_pos": 22,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([0.1000, 0.5000])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = torch.tensor([0, 2])\n",
"y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])\n",
"y_hat[[0, 1], y]"
]
},
{
"cell_type": "markdown",
"id": "100327e0",
"metadata": {
"origin_pos": 24,
"tab": [
"pytorch"
]
},
"source": [
"Now we can (**implement the cross-entropy loss function**) by averaging over the logarithms of the selected probabilities.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "6f5696bf",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:40:41.833839Z",
"iopub.status.busy": "2023-08-18T19:40:41.832979Z",
"iopub.status.idle": "2023-08-18T19:40:41.846258Z",
"shell.execute_reply": "2023-08-18T19:40:41.845315Z"
},
"origin_pos": 26,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor(1.4979)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def cross_entropy(y_hat, y):\n",
" return -torch.log(y_hat[list(range(len(y_hat))), y]).mean()\n",
"\n",
"cross_entropy(y_hat, y)"
]
},
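{
"cell_type": "markdown",
"id": "a3f1c9d4",
"metadata": {},
"source": [
"The indexing trick can be cross-checked against PyTorch's built-in negative log-likelihood.\n",
"The sketch below recomputes the loss on the same toy values\n",
"(stored under the illustrative names `probs` and `labels`)\n",
"and compares it with `torch.nn.functional.nll_loss`,\n",
"which expects log-probabilities and averages over examples by default.\n",
"Both printed values should coincide up to rounding.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c9d5",
"metadata": {
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"from torch.nn import functional as F\n",
"\n",
"labels = torch.tensor([0, 2])\n",
"probs = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])\n",
"\n",
"# Integer-array indexing picks probs[0, 0] and probs[1, 2]\n",
"picked = probs[torch.arange(len(probs)), labels]\n",
"manual = -torch.log(picked).mean()\n",
"# nll_loss takes log-probabilities and uses mean reduction by default\n",
"builtin = F.nll_loss(torch.log(probs), labels)\n",
"print(manual, builtin)"
]
},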
{
"cell_type": "code",
"execution_count": 9,
"id": "0d97074e",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:40:41.850484Z",
"iopub.status.busy": "2023-08-18T19:40:41.849744Z",
"iopub.status.idle": "2023-08-18T19:40:41.854383Z",
"shell.execute_reply": "2023-08-18T19:40:41.853500Z"
},
"origin_pos": 28,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"@d2l.add_to_class(SoftmaxRegressionScratch)\n",
"def loss(self, y_hat, y):\n",
" return cross_entropy(y_hat, y)"
]
},
{
"cell_type": "markdown",
"id": "133145ec",
"metadata": {
"origin_pos": 30
},
"source": [
"## Training\n",
"\n",
"We reuse the `fit` method defined in :numref:`sec_linear_scratch` to [**train the model with 10 epochs.**]\n",
"Note that the number of epochs (`max_epochs`),\n",
"the minibatch size (`batch_size`),\n",
"and learning rate (`lr`)\n",
"are adjustable hyperparameters.\n",
"That means that while these values are not\n",
"learned during our primary training loop,\n",
"they still influence the performance\n",
"of our model, both vis-à-vis training\n",
"and generalization performance.\n",
"In practice you will want to choose these values\n",
"based on the *validation* split of the data\n",
"and then, ultimately, to evaluate your final model\n",
"on the *test* split.\n",
"As discussed in :numref:`subsec_generalization-model-selection`,\n",
"we will regard the test data of Fashion-MNIST\n",
"as the validation set, thus\n",
"reporting validation loss and validation accuracy\n",
"on this split.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "0fed2d06",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:40:41.860751Z",
"iopub.status.busy": "2023-08-18T19:40:41.859826Z",
"iopub.status.idle": "2023-08-18T19:41:38.026253Z",
"shell.execute_reply": "2023-08-18T19:41:38.024988Z"
},
"origin_pos": 31,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"data = d2l.FashionMNIST(batch_size=256)\n",
"model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)\n",
"trainer = d2l.Trainer(max_epochs=10)\n",
"trainer.fit(model, data)"
]
},
{
"cell_type": "markdown",
"id": "c01a1d3e",
"metadata": {
"origin_pos": 32
},
"source": [
"## Prediction\n",
"\n",
"Now that training is complete,\n",
"our model is ready to [**classify some images.**]\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "ea0b6dd0",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:41:38.030970Z",
"iopub.status.busy": "2023-08-18T19:41:38.030251Z",
"iopub.status.idle": "2023-08-18T19:41:38.207822Z",
"shell.execute_reply": "2023-08-18T19:41:38.206713Z"
},
"origin_pos": 33,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([256])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X, y = next(iter(data.val_dataloader()))\n",
"preds = model(X).argmax(axis=1)\n",
"preds.shape"
]
},
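{
"cell_type": "markdown",
"id": "a3f1c9d6",
"metadata": {},
"source": [
"On a toy example, the prediction step amounts to picking the column\n",
"with the largest probability in each row;\n",
"comparing the result with the labels then gives the accuracy.\n",
"The tensors `toy_probs` and `toy_labels` below are invented for illustration only.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c9d7",
"metadata": {
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"\n",
"toy_probs = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.7, 0.2, 0.1]])\n",
"toy_labels = torch.tensor([2, 0, 0])\n",
"\n",
"# Most likely class for each row\n",
"toy_preds = toy_probs.argmax(dim=1)\n",
"# Boolean mask of correct predictions; its mean is the accuracy\n",
"correct = toy_preds.type(toy_labels.dtype) == toy_labels\n",
"print(toy_preds, correct.float().mean())"
]
},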
{
"cell_type": "markdown",
"id": "7a160103",
"metadata": {
"origin_pos": 34
},
"source": [
"We are more interested in the images we label *incorrectly*. We visualize them by\n",
"comparing their actual labels\n",
"(first line of text output)\n",
"with the predictions from the model\n",
"(second line of text output).\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "41ef8291",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:41:38.211804Z",
"iopub.status.busy": "2023-08-18T19:41:38.211212Z",
"iopub.status.idle": "2023-08-18T19:41:38.569431Z",
"shell.execute_reply": "2023-08-18T19:41:38.568219Z"
},
"origin_pos": 35,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"wrong = preds.type(y.dtype) != y\n",
"X, y, preds = X[wrong], y[wrong], preds[wrong]\n",
"labels = [a+'\\n'+b for a, b in zip(\n",
" data.text_labels(y), data.text_labels(preds))]\n",
"data.visualize([X, y], labels=labels)"
]
},
{
"cell_type": "markdown",
"id": "806939a6",
"metadata": {
"origin_pos": 36
},
"source": [
"## Summary\n",
"\n",
"By now we are starting to get some experience\n",
"with solving linear regression\n",
"and classification problems.\n",
"With it, we have reached what would arguably be\n",
"the state of the art of 1960--1970s of statistical modeling.\n",
"In the next section, we will show you how to leverage\n",
"deep learning frameworks to implement this model\n",
"much more efficiently.\n",
"\n",
"## Exercises\n",
"\n",
"1. In this section, we directly implemented the softmax function based on the mathematical definition of the softmax operation. As discussed in :numref:`sec_softmax` this can cause numerical instabilities.\n",
" 1. Test whether `softmax` still works correctly if an input has a value of $100$.\n",
" 1. Test whether `softmax` still works correctly if the largest of all inputs is smaller than $-100$?\n",
" 1. Implement a fix by looking at the value relative to the largest entry in the argument.\n",
"1. Implement a `cross_entropy` function that follows the definition of the cross-entropy loss function $\\sum_i y_i \\log \\hat{y}_i$.\n",
" 1. Try it out in the code example of this section.\n",
" 1. Why do you think it runs more slowly?\n",
" 1. Should you use it? When would it make sense to?\n",
" 1. What do you need to be careful of? Hint: consider the domain of the logarithm.\n",
"1. Is it always a good idea to return the most likely label? For example, would you do this for medical diagnosis? How would you try to address this?\n",
"1. Assume that we want to use softmax regression to predict the next word based on some features. What are some problems that might arise from a large vocabulary?\n",
"1. Experiment with the hyperparameters of the code in this section. In particular:\n",
" 1. Plot how the validation loss changes as you change the learning rate.\n",
" 1. Do the validation and training loss change as you change the minibatch size? How large or small do you need to go before you see an effect?\n"
]
},
{
"cell_type": "markdown",
"id": "840785bb",
"metadata": {
"origin_pos": 38,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/51)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}