{
"cells": [
{
"cell_type": "markdown",
"id": "1ac91c6c",
"metadata": {
"origin_pos": 0
},
"source": [
"# RMSProp\n",
":label:`sec_rmsprop`\n",
"\n",
"\n",
"One of the key issues in :numref:`sec_adagrad` is that the learning rate decreases at a predefined schedule of effectively $\\mathcal{O}(t^{-\\frac{1}{2}})$. While this is generally appropriate for convex problems, it might not be ideal for nonconvex ones, such as those encountered in deep learning. Yet, the coordinate-wise adaptivity of Adagrad is highly desirable as a preconditioner.\n",
"\n",
":citet:`Tieleman.Hinton.2012` proposed the RMSProp algorithm as a simple fix to decouple rate scheduling from coordinate-adaptive learning rates. The issue is that Adagrad accumulates the squares of the gradient $\\mathbf{g}_t$ into a state vector $\\mathbf{s}_t = \\mathbf{s}_{t-1} + \\mathbf{g}_t^2$. As a result $\\mathbf{s}_t$ keeps on growing without bound due to the lack of normalization, essentially linearly as the algorithm converges.\n",
"\n",
"One way of fixing this problem would be to use $\\mathbf{s}_t / t$. For reasonable distributions of $\\mathbf{g}_t$ this will converge. Unfortunately, it might take a very long time until the limit behavior starts to matter, since the procedure remembers the full trajectory of values. An alternative is to use a leaky average, just as we did in the momentum method, i.e., $\\mathbf{s}_t \\leftarrow \\gamma \\mathbf{s}_{t-1} + (1-\\gamma) \\mathbf{g}_t^2$ for some parameter $\\gamma \\in (0, 1)$. Keeping all other parts unchanged yields RMSProp.\n",
"\n",
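"To make the contrast concrete, consider a simplified setting (an assumption made purely for illustration) in which one coordinate of the squared gradient stays at a constant value $c > 0$ and the state starts at $s_0 = 0$. The two accumulators then have closed forms:\n",
"\n",
"$$\\begin{aligned}\n",
"\\textrm{Adagrad:} \\quad s_t & = s_{t-1} + c = t c, \\\\\n",
"\\textrm{leaky average:} \\quad s_t & = \\gamma s_{t-1} + (1-\\gamma) c = (1 - \\gamma^t) c.\n",
"\\end{aligned}$$\n",
"\n",
"Under this assumption the Adagrad scaling $1/\\sqrt{s_t}$ shrinks like $t^{-\\frac{1}{2}}$ forever, whereas the leaky average saturates at $c$, so the scaling settles at $1/\\sqrt{c}$ and any further decay has to come from $\\eta$ itself.\n",
"\n",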
"## The Algorithm\n",
"\n",
"Let's write out the equations in detail.\n",
"\n",
"$$\\begin{aligned}\n",
" \\mathbf{s}_t & \\leftarrow \\gamma \\mathbf{s}_{t-1} + (1 - \\gamma) \\mathbf{g}_t^2, \\\\\n",
" \\mathbf{x}_t & \\leftarrow \\mathbf{x}_{t-1} - \\frac{\\eta}{\\sqrt{\\mathbf{s}_t + \\epsilon}} \\odot \\mathbf{g}_t.\n",
"\\end{aligned}$$\n",
"\n",
"The constant $\\epsilon > 0$ is typically set to $10^{-6}$ to ensure that we do not suffer from division by zero or overly large step sizes. Given this expansion we are now free to control the learning rate $\\eta$ independently of the scaling that is applied on a per-coordinate basis. In terms of leaky averages we can apply the same reasoning as previously applied in the case of the momentum method. Expanding the definition of $\\mathbf{s}_t$ yields\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\mathbf{s}_t & = (1 - \\gamma) \\mathbf{g}_t^2 + \\gamma \\mathbf{s}_{t-1} \\\\\n",
"& = (1 - \\gamma) \\left(\\mathbf{g}_t^2 + \\gamma \\mathbf{g}_{t-1}^2 + \\gamma^2 \\mathbf{g}_{t-2}^2 + \\ldots \\right).\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"As before in :numref:`sec_momentum` we use $1 + \\gamma + \\gamma^2 + \\ldots = \\frac{1}{1-\\gamma}$. Hence the sum of weights is normalized to $1$, and the average effectively covers a window of about $1/(1-\\gamma)$ past observations. Let's visualize the weights for the past 40 time steps for various choices of $\\gamma$.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e0909b3e",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:38:18.653539Z",
"iopub.status.busy": "2023-08-18T19:38:18.653217Z",
"iopub.status.idle": "2023-08-18T19:38:22.412302Z",
"shell.execute_reply": "2023-08-18T19:38:22.410946Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import math\n",
"import torch\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d135b008",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:38:22.423079Z",
"iopub.status.busy": "2023-08-18T19:38:22.419951Z",
"iopub.status.idle": "2023-08-18T19:38:22.671490Z",
"shell.execute_reply": "2023-08-18T19:38:22.670027Z"
},
"origin_pos": 4,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
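"d2l.set_figsize()\n",
"gammas = [0.95, 0.9, 0.8, 0.7]\n",
"x = torch.arange(40).detach().numpy()  # time offsets 0, ..., 39\n",
"for gamma in gammas:\n",
"    # Weight (1 - gamma) * gamma^k given to the squared gradient k steps back\n",
"    d2l.plt.plot(x, (1-gamma) * gamma ** x, label=f'gamma = {gamma:.2f}')\n",
"d2l.plt.xlabel('time')\n",
"d2l.plt.legend();"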
]
},
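{
"cell_type": "markdown",
"id": "4b7e2a10",
"metadata": {},
"source": [
"As a rough check on the curves above, we can work out how quickly an observation fades. The squared gradient from $k$ steps ago carries the weight $(1-\\gamma)\\gamma^k$, so the most recent $n$ observations together account for\n",
"\n",
"$$(1-\\gamma) \\sum_{k=0}^{n-1} \\gamma^k = 1 - \\gamma^n$$\n",
"\n",
"of the total weight. The weight of a single observation halves after $k_{1/2} = \\ln 2 / \\ln(1/\\gamma)$ steps. For $\\gamma = 0.9$ this gives $k_{1/2} \\approx 6.6$, and the window of $1/(1-\\gamma) = 10$ steps holds $1 - 0.9^{10} \\approx 0.65$ of the total weight. For $\\gamma = 0.95$ the half-life roughly doubles to $k_{1/2} \\approx 13.5$.\n"
]
},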
{
"cell_type": "markdown",
"id": "7ad9385f",
"metadata": {
"origin_pos": 5
},
"source": [
"## Implementation from Scratch\n",
"\n",
"As before we use the quadratic function $f(\\mathbf{x})=0.1x_1^2+2x_2^2$ to observe the trajectory of RMSProp. Recall that in :numref:`sec_adagrad`, when we used Adagrad with a learning rate of 0.4, the variables moved only very slowly in the later stages of the algorithm since the learning rate decreased too quickly. Since $\\eta$ is controlled separately this does not happen with RMSProp.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "186c7b34",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:38:22.683908Z",
"iopub.status.busy": "2023-08-18T19:38:22.678341Z",
"iopub.status.idle": "2023-08-18T19:38:22.902749Z",
"shell.execute_reply": "2023-08-18T19:38:22.901714Z"
},
"origin_pos": 6,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 20, x1: -0.010599, x2: 0.000000\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def rmsprop_2d(x1, x2, s1, s2):\n",
"    # Gradients of f(x) = 0.1 * x1^2 + 2 * x2^2\n",
"    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6\n",
"    # Leaky average of the squared gradients\n",
"    s1 = gamma * s1 + (1 - gamma) * g1 ** 2\n",
"    s2 = gamma * s2 + (1 - gamma) * g2 ** 2\n",
"    # Per-coordinate scaled gradient step\n",
"    x1 -= eta / math.sqrt(s1 + eps) * g1\n",
"    x2 -= eta / math.sqrt(s2 + eps) * g2\n",
"    return x1, x2, s1, s2\n",
"\n",
"def f_2d(x1, x2):\n",
"    return 0.1 * x1 ** 2 + 2 * x2 ** 2\n",
"\n",
"eta, gamma = 0.4, 0.9\n",
"d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))"
]
},
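{
"cell_type": "markdown",
"id": "9d3c5e71",
"metadata": {},
"source": [
"The size of the very first step can be worked out by hand. Assume that the state starts at zero (an assumption about the training loop, which is not shown here) and that a coordinate has initial gradient $g$ with $g^2 \\gg \\epsilon$. That coordinate then moves by\n",
"\n",
"$$|\\Delta x| = \\frac{\\eta \\, |g|}{\\sqrt{(1-\\gamma) g^2 + \\epsilon}} \\approx \\frac{\\eta}{\\sqrt{1-\\gamma}}.$$\n",
"\n",
"With $\\eta = 0.4$ and $\\gamma = 0.9$ this is roughly $0.4 / \\sqrt{0.1} \\approx 1.26$, regardless of how large the gradient is. This is the sense in which RMSProp rescales each coordinate: provided both gradients are well away from zero, the steep $x_2$ direction and the flat $x_1$ direction initially advance at the same pace.\n"
]
},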
{
"cell_type": "markdown",
"id": "963e122d",
"metadata": {
"origin_pos": 7
},
"source": [
"Next, we implement RMSProp to be used in a deep network. This is equally straightforward.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e8e0e3f0",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:38:22.908239Z",
"iopub.status.busy": "2023-08-18T19:38:22.907762Z",
"iopub.status.idle": "2023-08-18T19:38:22.916224Z",
"shell.execute_reply": "2023-08-18T19:38:22.915291Z"
},
"origin_pos": 8,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def init_rmsprop_states(feature_dim):\n",
"    s_w = torch.zeros((feature_dim, 1))\n",
"    s_b = torch.zeros(1)\n",
"    return (s_w, s_b)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d820bd1b",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:38:22.919668Z",
"iopub.status.busy": "2023-08-18T19:38:22.919312Z",
"iopub.status.idle": "2023-08-18T19:38:22.928989Z",
"shell.execute_reply": "2023-08-18T19:38:22.924422Z"
},
"origin_pos": 11,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
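"def rmsprop(params, states, hyperparams):\n",
"    gamma, eps = hyperparams['gamma'], 1e-6\n",
"    for p, s in zip(params, states):\n",
"        with torch.no_grad():\n",
"            # Leaky average of the squared gradient, updated in place\n",
"            s[:] = gamma * s + (1 - gamma) * torch.square(p.grad)\n",
"            # Per-coordinate scaled gradient step\n",
"            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)\n",
"        # Reset the gradient for the next minibatch\n",
"        p.grad.data.zero_()"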
]
},
{
"cell_type": "markdown",
"id": "c22a86d4",
"metadata": {
"origin_pos": 13
},
"source": [
"We set the initial learning rate to 0.01 and the weighting term $\\gamma$ to 0.9. That is, $\\mathbf{s}$ effectively averages over the past $1/(1-\\gamma) = 10$ observations of the squared gradient.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "95618054",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:38:22.934493Z",
"iopub.status.busy": "2023-08-18T19:38:22.934151Z",
"iopub.status.idle": "2023-08-18T19:38:27.523745Z",
"shell.execute_reply": "2023-08-18T19:38:27.520739Z"
},
"origin_pos": 14,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loss: 0.245, 0.245 sec/epoch\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)\n",
"d2l.train_ch11(rmsprop, init_rmsprop_states(feature_dim),\n",
"               {'lr': 0.01, 'gamma': 0.9}, data_iter, feature_dim);"
]
},
{
"cell_type": "markdown",
"id": "b2a8d178",
"metadata": {
"origin_pos": 15
},
"source": [
"## Concise Implementation\n",
"\n",
"Since RMSProp is a rather popular algorithm, it is also available in PyTorch as `torch.optim.RMSprop`. All we need to do is pass this optimizer class to the training routine, assigning $\\gamma$ to the parameter `alpha`.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "be90a4f1",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:38:27.530243Z",
"iopub.status.busy": "2023-08-18T19:38:27.529340Z",
"iopub.status.idle": "2023-08-18T19:38:35.580724Z",
"shell.execute_reply": "2023-08-18T19:38:35.577932Z"
},
"origin_pos": 17,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loss: 0.246, 0.129 sec/epoch\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"trainer = torch.optim.RMSprop\n",
"d2l.train_concise_ch11(trainer, {'lr': 0.01, 'alpha': 0.9},\n",
"                       data_iter)"
]
},
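{
"cell_type": "markdown",
"id": "e2f86b3a",
"metadata": {},
"source": [
"When comparing the two implementations, note that the placement of $\\epsilon$ can differ. To the best of our knowledge, the update documented for `torch.optim.RMSprop` (without momentum or centering) adds $\\epsilon$ outside the square root, with a default of $10^{-8}$, rather than inside it as in the scratch version above:\n",
"\n",
"$$\\begin{aligned}\n",
"\\textrm{from scratch:} \\quad \\mathbf{x}_t & \\leftarrow \\mathbf{x}_{t-1} - \\frac{\\eta}{\\sqrt{\\mathbf{s}_t + \\epsilon}} \\odot \\mathbf{g}_t, \\\\\n",
"\\textrm{PyTorch:} \\quad \\mathbf{x}_t & \\leftarrow \\mathbf{x}_{t-1} - \\frac{\\eta}{\\sqrt{\\mathbf{s}_t} + \\epsilon} \\odot \\mathbf{g}_t.\n",
"\\end{aligned}$$\n",
"\n",
"The two forms agree closely whenever $\\mathbf{s}_t \\gg \\epsilon$ and differ noticeably only for coordinates whose squared-gradient average is close to zero.\n"
]
},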
{
"cell_type": "markdown",
"id": "de5d0655",
"metadata": {
"origin_pos": 19
},
"source": [
"## Summary\n",
"\n",
"* RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale coefficients.\n",
"* RMSProp shares with momentum the leaky averaging. However, RMSProp uses the technique to adjust the coefficient-wise preconditioner.\n",
"* The learning rate needs to be scheduled by the experimenter in practice.\n",
"* The coefficient $\\gamma$ determines how long the history is when adjusting the per-coordinate scale.\n",
"\n",
"## Exercises\n",
"\n",
"1. What happens experimentally if we set $\\gamma = 1$? Why?\n",
"1. Rotate the optimization problem to minimize $f(\\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. What happens to the convergence?\n",
"1. Try out what happens to RMSProp on a real machine learning problem, such as training on Fashion-MNIST. Experiment with different choices for adjusting the learning rate.\n",
"1. Would you want to adjust $\\gamma$ as optimization progresses? How sensitive is RMSProp to this?\n"
]
},
{
"cell_type": "markdown",
"id": "7ce477d8",
"metadata": {
"origin_pos": 21,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/1074)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}