{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a0b452b4",
   "metadata": {
    "origin_pos": 0
   },
   "source": [
    "# Word Embedding with Global Vectors (GloVe)\n",
    ":label:`sec_glove`\n",
    "\n",
    "\n",
    "Word-word co-occurrences\n",
    "within context windows\n",
    "may carry rich semantic information.\n",
    "For example,\n",
    "in a large corpus\n",
    "the word \"solid\" is\n",
    "more likely to co-occur\n",
    "with \"ice\" than with \"steam\",\n",
    "while the word \"gas\"\n",
    "probably co-occurs with \"steam\"\n",
    "more frequently than with \"ice\".\n",
    "Moreover,\n",
    "global corpus statistics\n",
    "of such co-occurrences\n",
    "can be precomputed:\n",
    "this can lead to more efficient training.\n",
    "To leverage statistical\n",
    "information in the entire corpus\n",
    "for word embedding,\n",
    "let's first revisit\n",
    "the skip-gram model in :numref:`subsec_skip-gram`,\n",
    "this time interpreting it\n",
    "using global corpus statistics\n",
    "such as co-occurrence counts.\n",
    "\n",
    "## Skip-Gram with Global Corpus Statistics\n",
    ":label:`subsec_skipgram-global`\n",
    "\n",
    "Denoting by $q_{ij}$\n",
    "the conditional probability\n",
    "$P(w_j\\mid w_i)$\n",
    "of word $w_j$ given word $w_i$\n",
    "in the skip-gram model,\n",
    "we have\n",
    "\n",
    "$$q_{ij}=\\frac{\\exp(\\mathbf{u}_j^\\top \\mathbf{v}_i)}{ \\sum_{k \\in \\mathcal{V}} \\exp(\\mathbf{u}_k^\\top \\mathbf{v}_i)},$$\n",
    "\n",
    "where\n",
    "for any index $i$\n",
    "vectors $\\mathbf{v}_i$ and $\\mathbf{u}_i$\n",
    "represent word $w_i$\n",
    "as the center word and context word,\n",
    "respectively, and $\\mathcal{V} = \\{0, 1, \\ldots, |\\mathcal{V}|-1\\}$\n",
    "is the index set of the vocabulary.\n",
    "\n",
    "Consider a word $w_i$\n",
    "that may occur multiple times\n",
    "in the corpus.\n",
    "In the entire corpus,\n",
    "all the context words\n",
    "that take $w_i$ as their center word\n",
    "form a *multiset* $\\mathcal{C}_i$\n",
    "of word indices\n",
    "that *allows for multiple instances of the same element*.\n",
    "For any element,\n",
    "its number of instances is called its *multiplicity*.\n",
    "To illustrate with an example,\n",
    "suppose that word $w_i$ occurs twice in the corpus\n",
    "and the indices of the context words\n",
    "that take $w_i$ as their center word\n",
    "in the two context windows\n",
    "are\n",
    "$k, j, m, k$ and $k, l, k, j$.\n",
    "Thus, multiset $\\mathcal{C}_i = \\{j, j, k, k, k, k, l, m\\}$, where\n",
    "the multiplicities of elements $j, k, l, m$\n",
    "are 2, 4, 1, 1, respectively.\n",
    "\n",
    "Now let's denote the multiplicity of element $j$ in\n",
    "multiset $\\mathcal{C}_i$ as $x_{ij}$.\n",
    "This is the global co-occurrence count\n",
    "of word $w_j$ (as the context word)\n",
    "and word $w_i$ (as the center word)\n",
    "in the same context window\n",
    "in the entire corpus.\n",
    "Using such global corpus statistics,\n",
    "the loss function of the skip-gram model\n",
    "is equivalent to\n",
    "\n",
    "$$-\\sum_{i\\in\\mathcal{V}}\\sum_{j\\in\\mathcal{V}} x_{ij} \\log\\,q_{ij}.$$\n",
    ":eqlabel:`eq_skipgram-x_ij`\n",
    "\n",
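    "The counts $x_{ij}$ in :eqref:`eq_skipgram-x_ij` depend only on the corpus and the context window size, so they can be precomputed in a single pass over the corpus. Below is a minimal sketch of such a pass, assuming (purely for illustration) a toy corpus that has already been mapped to word indices and a symmetric context window:\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "def count_cooccurrences(corpus, window_size=2):\n",
    "    \"\"\"Precompute the global co-occurrence counts x[(i, j)] in one pass.\"\"\"\n",
    "    x = defaultdict(int)\n",
    "    for sentence in corpus:\n",
    "        for center_pos, i in enumerate(sentence):\n",
    "            lo = max(0, center_pos - window_size)\n",
    "            hi = min(len(sentence), center_pos + window_size + 1)\n",
    "            for context_pos in range(lo, hi):\n",
    "                if context_pos != center_pos:\n",
    "                    x[(i, sentence[context_pos])] += 1\n",
    "    return x\n",
    "\n",
    "# A hypothetical toy corpus of word indices; with a symmetric window,\n",
    "# x[(i, j)] == x[(j, i)] by construction\n",
    "toy_corpus = [[0, 1, 2, 1], [2, 0, 1, 3]]\n",
    "x = count_cooccurrences(toy_corpus)\n",
    "print(x[(0, 1)], x[(1, 0)])\n",
    "```\n",
    "\n",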
    "We further denote by\n",
    "$x_i$\n",
    "the total number of context words\n",
    "in the context windows\n",
    "where $w_i$ occurs as their center word,\n",
    "which is equivalent to $|\\mathcal{C}_i|$.\n",
    "Letting $p_{ij}$\n",
    "be the conditional probability\n",
    "$x_{ij}/x_i$ for generating\n",
    "context word $w_j$ given center word $w_i$,\n",
    ":eqref:`eq_skipgram-x_ij`\n",
    "can be rewritten as\n",
    "\n",
    "$$-\\sum_{i\\in\\mathcal{V}} x_i \\sum_{j\\in\\mathcal{V}} p_{ij} \\log\\,q_{ij}.$$\n",
    ":eqlabel:`eq_skipgram-p_ij`\n",
    "\n",
    "In :eqref:`eq_skipgram-p_ij`, $-\\sum_{j\\in\\mathcal{V}} p_{ij} \\log\\,q_{ij}$ calculates\n",
    "the cross-entropy\n",
    "between\n",
    "the conditional distribution $p_{ij}$\n",
    "from global corpus statistics\n",
    "and the\n",
    "conditional distribution $q_{ij}$\n",
    "of model predictions.\n",
    "This loss\n",
    "is also weighted by $x_i$ as explained above.\n",
    "Minimizing the loss function in\n",
    ":eqref:`eq_skipgram-p_ij`\n",
    "will allow\n",
    "the predicted conditional distribution\n",
    "to get close to\n",
    "the conditional distribution\n",
    "from the global corpus statistics.\n",
    "\n",
    "\n",
    "Though commonly used\n",
    "for measuring the distance\n",
    "between probability distributions,\n",
    "the cross-entropy loss function may not be a good choice here.\n",
    "On the one hand, as we mentioned in :numref:`sec_approx_train`,\n",
    "properly normalizing $q_{ij}$\n",
    "requires a sum over the entire vocabulary,\n",
    "which can be computationally expensive.\n",
    "On the other hand,\n",
    "a large number of rare\n",
    "events in a large corpus\n",
    "are often assigned\n",
    "too much weight\n",
    "by the cross-entropy loss.\n",
    "\n",
    "## The GloVe Model\n",
    "\n",
    "In view of this,\n",
    "the *GloVe* model, which is based on squared loss, makes three changes\n",
    "to the skip-gram model :cite:`Pennington.Socher.Manning.2014`:\n",
    "\n",
    "1. Use variables $p'_{ij}=x_{ij}$ and $q'_{ij}=\\exp(\\mathbf{u}_j^\\top \\mathbf{v}_i)$\n",
    "that are not probability distributions\n",
    "and take the logarithm of both, so the squared loss term is $\\left(\\log\\,p'_{ij} - \\log\\,q'_{ij}\\right)^2 = \\left(\\mathbf{u}_j^\\top \\mathbf{v}_i - \\log\\,x_{ij}\\right)^2$.\n",
    "2. Add two scalar model parameters for each word $w_i$: the center word bias $b_i$ and the context word bias $c_i$.\n",
    "3. Replace the weight of each loss term with the weight function $h(x_{ij})$, where $h(x)$ is non-decreasing and takes values in the interval $[0, 1]$.\n",
    "\n",
    "Putting all things together, training GloVe amounts to minimizing the following loss function:\n",
    "\n",
    "$$\\sum_{i\\in\\mathcal{V}} \\sum_{j\\in\\mathcal{V}} h(x_{ij}) \\left(\\mathbf{u}_j^\\top \\mathbf{v}_i + b_i + c_j - \\log\\,x_{ij}\\right)^2.$$\n",
    ":eqlabel:`eq_glove-loss`\n",
    "\n",
    "For the weight function, a suggested choice is:\n",
    "$h(x) = (x/c)^\\alpha$ (e.g., $\\alpha = 0.75$) if $x < c$ (e.g., $c = 100$); otherwise $h(x) = 1$.\n",
    "In this case,\n",
    "because $h(0)=0$,\n",
    "the squared loss term for any $x_{ij}=0$ can be omitted\n",
    "for computational efficiency.\n",
    "For example,\n",
    "when using minibatch stochastic gradient descent for training,\n",
    "at each iteration\n",
    "we randomly sample a minibatch of *non-zero* $x_{ij}$\n",
    "to calculate gradients\n",
    "and update the model parameters.\n",
    "Note that these non-zero $x_{ij}$ are precomputed\n",
    "global corpus statistics;\n",
    "thus, the model is called GloVe\n",
    "for *Global Vectors*.\n",
    "\n",
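    "To make :eqref:`eq_glove-loss` concrete, below is a minimal NumPy sketch of the weight function $h$ and of the loss evaluated on a minibatch of non-zero entries of $x_{ij}$. The array names, the tiny vocabulary, and the sampled batch are hypothetical and serve only as an illustration, not as a reference implementation:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def weight_fn(x, c=100, alpha=0.75):\n",
    "    \"\"\"The GloVe weight h(x): (x/c)**alpha if x < c, else 1.\"\"\"\n",
    "    return np.where(x < c, (x / c) ** alpha, 1.0)\n",
    "\n",
    "def glove_loss(V, U, b, c_bias, centers, contexts, counts):\n",
    "    \"\"\"Weighted squared loss of eq_glove-loss on a batch of non-zero x_ij.\n",
    "\n",
    "    V, U: center and context word vectors, shape (vocab_size, embed_size)\n",
    "    b, c_bias: center and context word biases, shape (vocab_size,)\n",
    "    centers, contexts: word indices i and j of the sampled batch\n",
    "    counts: the corresponding non-zero co-occurrence counts x_ij\n",
    "    \"\"\"\n",
    "    scores = np.sum(V[centers] * U[contexts], axis=1)  # u_j^T v_i\n",
    "    err = scores + b[centers] + c_bias[contexts] - np.log(counts)\n",
    "    return np.sum(weight_fn(counts) * err ** 2)\n",
    "\n",
    "# Hypothetical tiny setup: 4 words, 8-dimensional vectors\n",
    "rng = np.random.default_rng(0)\n",
    "V = rng.normal(scale=0.1, size=(4, 8))\n",
    "U = rng.normal(scale=0.1, size=(4, 8))\n",
    "b, c_bias = np.zeros(4), np.zeros(4)\n",
    "centers, contexts = np.array([0, 1]), np.array([1, 0])\n",
    "counts = np.array([2.0, 2.0])\n",
    "print(glove_loss(V, U, b, c_bias, centers, contexts, counts))\n",
    "```\n",
    "\n",
    "Only the non-zero counts need to be stored and iterated over, which is what makes training on precomputed global statistics efficient.\n",
    "\n",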
    "It should be emphasized that\n",
    "if word $w_i$ appears in the context window of\n",
    "word $w_j$, then word $w_j$ also appears in the context window of word $w_i$.\n",
    "Therefore, $x_{ij}=x_{ji}$.\n",
    "Unlike word2vec,\n",
    "which fits the asymmetric conditional probability\n",
    "$p_{ij}$,\n",
    "GloVe fits the symmetric $\\log \\, x_{ij}$.\n",
    "Therefore, the center word vector and\n",
    "the context word vector of any word are mathematically equivalent in the GloVe model.\n",
    "However, in practice, owing to different initialization values,\n",
    "the same word may still get different values\n",
    "in these two vectors after training:\n",
    "GloVe sums them up as the output vector.\n",
    "\n",
    "\n",
    "\n",
    "## Interpreting GloVe from the Ratio of Co-occurrence Probabilities\n",
    "\n",
    "\n",
    "We can also interpret the GloVe model from another perspective.\n",
    "Using the same notation as in\n",
    ":numref:`subsec_skipgram-global`,\n",
    "let $p_{ij} \\stackrel{\\textrm{def}}{=} P(w_j \\mid w_i)$ be the conditional probability of generating the context word $w_j$ given $w_i$ as the center word in the corpus.\n",
    ":numref:`tab_glove`\n",
    "lists several co-occurrence probabilities\n",
    "given words \"ice\" and \"steam\"\n",
    "and their ratios based on statistics from a large corpus.\n",
    "\n",
    "\n",
    ":Word-word co-occurrence probabilities and their ratios from a large corpus (adapted from Table 1 in :citet:`Pennington.Socher.Manning.2014`)\n",
    ":label:`tab_glove`\n",
    "\n",
    "|$w_k$=|solid|gas|water|fashion|\n",
    "|:--|:-|:-|:-|:-|\n",
    "|$p_1=P(w_k\\mid \\textrm{ice})$|0.00019|0.000066|0.003|0.000017|\n",
    "|$p_2=P(w_k\\mid\\textrm{steam})$|0.000022|0.00078|0.0022|0.000018|\n",
    "|$p_1/p_2$|8.9|0.085|1.36|0.96|\n",
    "\n",
    "\n",
    "\n",
    "We can observe the following from :numref:`tab_glove` (a short code sketch for computing such ratios follows this list):\n",
    "\n",
    "* For a word $w_k$ that is related to \"ice\" but unrelated to \"steam\", such as $w_k=\\textrm{solid}$, we expect a larger ratio of co-occurrence probabilities, such as 8.9.\n",
    "* For a word $w_k$ that is related to \"steam\" but unrelated to \"ice\", such as $w_k=\\textrm{gas}$, we expect a smaller ratio of co-occurrence probabilities, such as 0.085.\n",
    "* For a word $w_k$ that is related to both \"ice\" and \"steam\", such as $w_k=\\textrm{water}$, we expect a ratio of co-occurrence probabilities that is close to 1, such as 1.36.\n",
    "* For a word $w_k$ that is unrelated to both \"ice\" and \"steam\", such as $w_k=\\textrm{fashion}$, we expect a ratio of co-occurrence probabilities that is close to 1, such as 0.96.\n",
    "\n",
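    "These ratios can be computed directly from a matrix of co-occurrence counts, since $p_{ij} = x_{ij}/x_i$ simply normalizes row $i$ of the counts. Below is a minimal sketch; the count matrix and the word indices are made up purely for illustration and do not reproduce the corpus statistics in :numref:`tab_glove`:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def cooccurrence_ratio(x, i1, i2, k):\n",
    "    \"\"\"Ratio P(w_k | w_i1) / P(w_k | w_i2) from a dense count matrix x.\"\"\"\n",
    "    p1 = x[i1, k] / x[i1].sum()  # p_{i1,k} = x_{i1,k} / x_{i1}\n",
    "    p2 = x[i2, k] / x[i2].sum()  # p_{i2,k} = x_{i2,k} / x_{i2}\n",
    "    return p1 / p2\n",
    "\n",
    "# Hypothetical counts: rows are center words, columns are context words\n",
    "x = np.array([[0.0, 10.0, 3.0, 1.0],\n",
    "              [10.0, 0.0, 1.0, 4.0],\n",
    "              [3.0, 1.0, 0.0, 2.0],\n",
    "              [1.0, 4.0, 2.0, 0.0]])\n",
    "ice, steam, solid = 0, 1, 2  # made-up indices for the three probe words\n",
    "print(cooccurrence_ratio(x, ice, steam, solid))  # > 1: closer to \"ice\"\n",
    "```\n",
    "\n",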
    "It can be seen that the ratio\n",
    "of co-occurrence probabilities\n",
    "can intuitively express\n",
    "the relationship between words.\n",
    "Thus, we can design a function\n",
    "of three word vectors\n",
    "to fit this ratio.\n",
    "For the ratio of co-occurrence probabilities\n",
    "${p_{ij}}/{p_{ik}}$\n",
    "with $w_i$ being the center word\n",
    "and $w_j$ and $w_k$ being the context words,\n",
    "we want to fit this ratio\n",
    "using some function $f$:\n",
    "\n",
    "$$f(\\mathbf{u}_j, \\mathbf{u}_k, {\\mathbf{v}}_i) \\approx \\frac{p_{ij}}{p_{ik}}.$$\n",
    ":eqlabel:`eq_glove-f`\n",
    "\n",
    "Among the many possible designs for $f$,\n",
    "we pick just one reasonable choice in the following.\n",
    "Since the ratio of co-occurrence probabilities\n",
    "is a scalar,\n",
    "we require that\n",
    "$f$ be a scalar function, such as\n",
    "$f(\\mathbf{u}_j, \\mathbf{u}_k, {\\mathbf{v}}_i) = f\\left((\\mathbf{u}_j - \\mathbf{u}_k)^\\top {\\mathbf{v}}_i\\right)$.\n",
    "Switching word indices\n",
    "$j$ and $k$ in :eqref:`eq_glove-f`\n",
    "inverts the ratio on the right-hand side,\n",
    "so it must hold that\n",
    "$f(x)f(-x)=1$;\n",
    "one possibility is $f(x)=\\exp(x)$,\n",
    "i.e.,\n",
    "\n",
    "$$f(\\mathbf{u}_j, \\mathbf{u}_k, {\\mathbf{v}}_i) = \\frac{\\exp\\left(\\mathbf{u}_j^\\top {\\mathbf{v}}_i\\right)}{\\exp\\left(\\mathbf{u}_k^\\top {\\mathbf{v}}_i\\right)} \\approx \\frac{p_{ij}}{p_{ik}}.$$\n",
    "\n",
    "Now let's pick\n",
    "$\\exp\\left(\\mathbf{u}_j^\\top {\\mathbf{v}}_i\\right) \\approx \\alpha p_{ij}$,\n",
    "where $\\alpha$ is a constant.\n",
    "Since $p_{ij}=x_{ij}/x_i$, after taking the logarithm on both sides we get $\\mathbf{u}_j^\\top {\\mathbf{v}}_i \\approx \\log\\,\\alpha + \\log\\,x_{ij} - \\log\\,x_i$.\n",
    "We may use additional bias terms to fit $- \\log\\, \\alpha + \\log\\, x_i$, such as the center word bias $b_i$ and the context word bias $c_j$:\n",
    "\n",
    "$$\\mathbf{u}_j^\\top \\mathbf{v}_i + b_i + c_j \\approx \\log\\, x_{ij}.$$\n",
    ":eqlabel:`eq_glove-square`\n",
    "\n",
    "Measuring the weighted squared error of\n",
    ":eqref:`eq_glove-square`,\n",
    "we obtain the GloVe loss function in\n",
    ":eqref:`eq_glove-loss`.\n",
    "\n",
    "\n",
    "\n",
    "## Summary\n",
    "\n",
    "* The skip-gram model can be interpreted using global corpus statistics such as word-word co-occurrence counts.\n",
    "* The cross-entropy loss may not be a good choice for measuring the difference between two probability distributions, especially for a large corpus. GloVe uses squared loss to fit precomputed global corpus statistics.\n",
    "* The center word vector and the context word vector are mathematically equivalent for any word in GloVe.\n",
    "* GloVe can be interpreted from the ratio of word-word co-occurrence probabilities.\n",
    "\n",
    "\n",
    "## Exercises\n",
    "\n",
    "1. If words $w_i$ and $w_j$ co-occur in the same context window, how can we use their distance in the text sequence to redesign the method for calculating the conditional probability $p_{ij}$? Hint: see Section 4.2 of the GloVe paper :cite:`Pennington.Socher.Manning.2014`.\n",
    "1. For any word, are its center word bias and context word bias mathematically equivalent in GloVe? Why?\n",
    "\n",
    "\n",
    "[Discussions](https://discuss.d2l.ai/t/385)\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "required_libs": []
 },
 "nbformat": 4,
 "nbformat_minor": 5
}