{
"cells": [
{
"cell_type": "markdown",
"id": "6ffa7e5c",
"metadata": {
"origin_pos": 0
},
"source": [
"# Object Detection and Bounding Boxes\n",
":label:`sec_bbox`\n",
"\n",
"\n",
"In earlier sections (e.g., :numref:`sec_alexnet`--:numref:`sec_googlenet`),\n",
"we introduced various models for image classification.\n",
"In image classification tasks,\n",
"we assume that there is only *one*\n",
"major object\n",
"in the image and we only focus on how to \n",
"recognize its category.\n",
"However, there are often *multiple* objects\n",
"in the image of interest.\n",
"We not only want to know their categories, but also their specific positions in the image.\n",
"In computer vision, we refer to such tasks as *object detection* (or *object recognition*).\n",
"\n",
"Object detection has been\n",
"widely applied in many fields.\n",
"For example, self-driving vehicles need to plan \n",
"travel routes\n",
"by detecting the positions\n",
"of vehicles, pedestrians, roads, and obstacles in the captured video images.\n",
"Besides,\n",
"robots may use this technique\n",
"to detect and localize objects of interest\n",
"throughout their navigation of an environment.\n",
"Moreover,\n",
"security systems\n",
"may need to detect abnormal objects, such as intruders or bombs.\n",
"\n",
"In the next few sections, we will introduce \n",
"several deep learning methods for object detection.\n",
"We will begin with an introduction\n",
"to *positions* (or *locations*) of objects.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "95e988d0",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:30:52.105742Z",
"iopub.status.busy": "2023-08-18T19:30:52.105044Z",
"iopub.status.idle": "2023-08-18T19:30:55.350387Z",
"shell.execute_reply": "2023-08-18T19:30:55.349034Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import torch\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "markdown",
"id": "66869c14",
"metadata": {
"origin_pos": 4
},
"source": [
"We will load the sample image to be used in this section. We can see that there is a dog on the left side of the image and a cat on the right.\n",
"They are the two major objects in this image.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3f5ff930",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:30:55.357694Z",
"iopub.status.busy": "2023-08-18T19:30:55.356638Z",
"iopub.status.idle": "2023-08-18T19:30:55.670346Z",
"shell.execute_reply": "2023-08-18T19:30:55.668923Z"
},
"origin_pos": 6,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"d2l.set_figsize()\n",
"img = d2l.plt.imread('../img/catdog.jpg')\n",
"d2l.plt.imshow(img);"
]
},
{
"cell_type": "markdown",
"id": "3656c899",
"metadata": {
"origin_pos": 7
},
"source": [
"## Bounding Boxes\n",
"\n",
"\n",
"In object detection,\n",
"we usually use a *bounding box* to describe the spatial location of an object.\n",
"The bounding box is a rectangle determined by the $x$ and $y$ coordinates of its upper-left corner and the $x$ and $y$ coordinates of its lower-right corner.\n",
"Another commonly used bounding box representation consists of the $(x, y)$\n",
"coordinates of the bounding box center, together with the width and height of the box.\n",
"\n",
"[**Here we define functions to convert between**] these (**two\n",
"representations**):\n",
"`box_corner_to_center` converts from the two-corner\n",
"representation to the center-width-height representation,\n",
"and `box_center_to_corner` does the reverse.\n",
"The input argument `boxes` should be a two-dimensional tensor of\n",
"shape ($n$, 4), where $n$ is the number of bounding boxes.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ae9d5814",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:30:55.674957Z",
"iopub.status.busy": "2023-08-18T19:30:55.674309Z",
"iopub.status.idle": "2023-08-18T19:30:55.683476Z",
"shell.execute_reply": "2023-08-18T19:30:55.682475Z"
},
"origin_pos": 8,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"#@save\n",
"def box_corner_to_center(boxes):\n",
" \"\"\"Convert from (upper-left, lower-right) to (center, width, height).\"\"\"\n",
" x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]\n",
" cx = (x1 + x2) / 2\n",
" cy = (y1 + y2) / 2\n",
" w = x2 - x1\n",
" h = y2 - y1\n",
" boxes = torch.stack((cx, cy, w, h), axis=-1)\n",
" return boxes\n",
"\n",
"#@save\n",
"def box_center_to_corner(boxes):\n",
" \"\"\"Convert from (center, width, height) to (upper-left, lower-right).\"\"\"\n",
" cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]\n",
" x1 = cx - 0.5 * w\n",
" y1 = cy - 0.5 * h\n",
" x2 = cx + 0.5 * w\n",
" y2 = cy + 0.5 * h\n",
" boxes = torch.stack((x1, y1, x2, y2), axis=-1)\n",
" return boxes"
]
},
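{
"cell_type": "markdown",
"id": "7c1e4a90",
"metadata": {},
"source": [
"To make the arithmetic behind the two functions above concrete, the following is a minimal sketch in plain Python for a single box. The helper names `corner_to_center` and `center_to_corner` and the box values are hypothetical, chosen for illustration only; they are not part of `d2l`.\n",
"\n",
"```python\n",
"def corner_to_center(box):\n",
"    # box is (x1, y1, x2, y2): upper-left and lower-right corners\n",
"    x1, y1, x2, y2 = box\n",
"    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)\n",
"\n",
"def center_to_corner(box):\n",
"    # box is (cx, cy, w, h): center, width, and height\n",
"    cx, cy, w, h = box\n",
"    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)\n",
"\n",
"corners = (10.0, 20.0, 50.0, 100.0)  # hypothetical box\n",
"center = corner_to_center(corners)\n",
"print(center)  # (30.0, 60.0, 40.0, 80.0)\n",
"print(center_to_corner(center))  # (10.0, 20.0, 50.0, 100.0)\n",
"```\n",
"\n",
"The tensor versions above apply the same formulas to every row of an ($n$, 4) input at once.\n"
]
},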
{
"cell_type": "markdown",
"id": "d416292d",
"metadata": {
"origin_pos": 9
},
"source": [
"We will [**define the bounding boxes of the dog and the cat in the image**] based\n",
"on the coordinate information.\n",
"The origin of the coordinates in the image\n",
"is the upper-left corner of the image, and to the right and down are the\n",
"positive directions of the $x$ and $y$ axes, respectively.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "48833313",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:30:55.687312Z",
"iopub.status.busy": "2023-08-18T19:30:55.686751Z",
"iopub.status.idle": "2023-08-18T19:30:55.691983Z",
"shell.execute_reply": "2023-08-18T19:30:55.690524Z"
},
"origin_pos": 10,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"# Here `bbox` is the abbreviation for bounding box\n",
"dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]"
]
},
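{
"cell_type": "markdown",
"id": "2b9d6f13",
"metadata": {},
"source": [
"One practical consequence of this coordinate convention is that image arrays are usually indexed as (row, column), so the $y$ coordinate comes first when slicing. The following is a small sketch using a nested list as a stand-in for an image; the 4-by-6 size and the box values are hypothetical.\n",
"\n",
"```python\n",
"# A tiny stand-in image: img[row][col], so rows follow y and columns follow x\n",
"height, width = 4, 6\n",
"img = [[(x, y) for x in range(width)] for y in range(height)]\n",
"\n",
"# Hypothetical box (x1, y1, x2, y2) with the origin at the upper-left corner\n",
"x1, y1, x2, y2 = 1, 0, 4, 2\n",
"\n",
"# Crop the box: slice rows with y first, then columns with x\n",
"crop = [row[x1:x2] for row in img[y1:y2]]\n",
"print(len(crop), len(crop[0]))  # 2 3\n",
"print(crop[0][0])  # (1, 0)\n",
"```\n",
"\n",
"The crop has `y2 - y1` rows and `x2 - x1` columns, which matches the height and width of the box.\n"
]
},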
{
"cell_type": "markdown",
"id": "b475a74c",
"metadata": {
"origin_pos": 11
},
"source": [
"We can verify the correctness of the two\n",
"bounding box conversion functions by converting to the center format and back, which should recover the original boxes.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4981a2a8",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:30:55.695405Z",
"iopub.status.busy": "2023-08-18T19:30:55.695042Z",
"iopub.status.idle": "2023-08-18T19:30:55.726532Z",
"shell.execute_reply": "2023-08-18T19:30:55.725574Z"
},
"origin_pos": 12,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[True, True, True, True],\n",
" [True, True, True, True]])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"boxes = torch.tensor((dog_bbox, cat_bbox))\n",
"box_center_to_corner(box_corner_to_center(boxes)) == boxes"
]
},
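{
"cell_type": "markdown",
"id": "e58a03c7",
"metadata": {},
"source": [
"The exact `==` comparison above works here because the coordinates are whole numbers, so every intermediate value is exactly representable in floating point. With fractional coordinates, rounding error may make an exact comparison fragile, and a tolerance-based check is a safer choice. Below is a minimal plain-Python sketch of that idea; the helper `round_trip`, the box values, and the tolerances are assumptions for illustration. For tensors, `torch.allclose` plays a similar role.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def round_trip(box):\n",
"    # corners -> (center, width, height) -> corners, using Python floats\n",
"    x1, y1, x2, y2 = box\n",
"    cx, cy, w, h = (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1\n",
"    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)\n",
"\n",
"box = (0.1, 0.2, 0.4, 0.9)  # hypothetical fractional coordinates\n",
"restored = round_trip(box)\n",
"\n",
"# Compare with a tolerance instead of exact equality\n",
"ok = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)\n",
"         for a, b in zip(box, restored))\n",
"print(ok)  # True\n",
"```\n"
]
},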
{
"cell_type": "markdown",
"id": "fa756189",
"metadata": {
"origin_pos": 13
},
"source": [
"Let's [**draw the bounding boxes in the image**] to check if they are accurate.\n",
"Before drawing, we will define a helper function `bbox_to_rect`. It converts a bounding box in the two-corner representation to the rectangle format of the `matplotlib` package.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "138d9471",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:30:55.730348Z",
"iopub.status.busy": "2023-08-18T19:30:55.730058Z",
"iopub.status.idle": "2023-08-18T19:30:55.736346Z",
"shell.execute_reply": "2023-08-18T19:30:55.735177Z"
},
"origin_pos": 14,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"#@save\n",
"def bbox_to_rect(bbox, color):\n",
" \"\"\"Convert bounding box to matplotlib format.\"\"\"\n",
" # Convert the bounding box (upper-left x, upper-left y, lower-right x,\n",
" # lower-right y) format to the matplotlib format: ((upper-left x,\n",
" # upper-left y), width, height)\n",
" return d2l.plt.Rectangle(\n",
" xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],\n",
" fill=False, edgecolor=color, linewidth=2)"
]
},
{
"cell_type": "markdown",
"id": "b591d667",
"metadata": {
"origin_pos": 15
},
"source": [
"After adding the bounding boxes to the image,\n",
"we can see that the main outlines of the two objects lie mostly inside the two boxes.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dc793942",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T19:30:55.740501Z",
"iopub.status.busy": "2023-08-18T19:30:55.739885Z",
"iopub.status.idle": "2023-08-18T19:30:56.104353Z",
"shell.execute_reply": "2023-08-18T19:30:56.101132Z"
},
"origin_pos": 16,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig = d2l.plt.imshow(img)\n",
"fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))\n",
"fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));"
]
},
{
"cell_type": "markdown",
"id": "3a185826",
"metadata": {
"origin_pos": 17
},
"source": [
"## Summary\n",
"\n",
"* Object detection recognizes not only all the objects of interest in the image, but also their positions. The position is generally represented by a rectangular bounding box.\n",
"* We can convert between two commonly used bounding box representations.\n",
"\n",
"## Exercises\n",
"\n",
"1. Find another image and try to label a bounding box that contains the object. Compare labeling bounding boxes and categories: which usually takes longer?\n",
"1. Why is the innermost dimension of the input argument `boxes` of `box_corner_to_center` and `box_center_to_corner` always 4?\n"
]
},
{
"cell_type": "markdown",
"id": "4915b119",
"metadata": {
"origin_pos": 19,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/1527)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}