Computer Vision

2025-09-01

I think taking a course in a subject that you are interested in is never particularly a bad thing.

This page will exclusively contain what I was taught in COMP9517 for a short while. That is just how it will be.

However, eventually we will morph out of this into the textbook, and subsequently out of that too.

For now, I just want a document to type in that I enjoy typing in!

Notes

in each of the below topics, there are a number of algorithms and facts that I must know for the final exam.

Image Formation

pinhole camera model
projective geometry
- vanishing lines and points
- mathematics: \[x' = -x \frac{f}{z} \] \[ y' = -y \frac{f}{z} \]
colour spaces:
- RGB: red green blue
  - strongly correlated channels
- HSV: hue saturation value
  - hue and saturation are confounded channels
- YCbCr
  - fast to compute, good for compression. the modern day standard
  - Y = luminance
  - Cb = blue colour difference
  - Cr = red colour difference
- L*a*b*: differences in luminance are more perceptually uniform
  - L* = lightness
  - a* = green to red spectrum (negative to positive)
  - b* = blue to yellow spectrum (negative to positive)
quantisation: digitises the image intensity / amplitude values

Image Processing

spatial domain

point operations

\[T: \mathbb{R} \rightarrow \mathbb{R}\quad g(x,y) =T(f(x,y))\]

contrast stretching: linearly maps input intensity range to full output range \[g(x,y) = \frac{f(x,y) - f_{min}}{f_{max} - f_{min}} \times 255\]
- enhances contrast when image uses only part of available intensity range
intensity thresholding: converts grayscale to binary \[g(x,y) = \begin{cases} 255 & \text{if } f(x,y) \geq T \\ 0 & \text{otherwise} \end{cases}\]
- automatic methods:
  - Otsu: maximises between-class variance to find optimal threshold
  - Isodata: iteratively refines threshold based on class means
  - multilevel: extends to multiple thresholds for multi-class segmentation
intensity inversion: creates photographic negative \[g(x,y) = L - 1 - f(x,y)\] where $L$ is number of intensity levels
log transformation: compresses high intensities, expands low ones \[g(x,y) = c \log(1 + f(x,y))\]
- useful for displaying images with wide dynamic range (e.g., Fourier spectra)
power transformation (gamma correction): \[g(x,y) = c \cdot f(x,y)^\gamma\]
- $\gamma < 1$: brightens image (expands dark regions)
- $\gamma > 1$: darkens image (compresses dark regions)
- $\gamma = 1$: linear (no change)
piecewise linear transformation: different linear functions for different intensity ranges
- allows precise control over specific intensity intervals
piecewise contrast stretching: applies contrast stretching to specific intensity ranges while leaving others unchanged
gray-level slicing: highlights specific intensity range
- with background: \[g(x,y) = \begin{cases} 255 & \text{if } A \leq f(x,y) \leq B \\ f(x,y) & \text{otherwise} \end{cases}\]
- without background: binary output for range $[A,B]$
bit-plane slicing: extracts specific bit planes to analyze contribution of each bit to image appearance
- higher-order bits contain most visual information
- lower-order bits often contain noise
histogram of pixel values: shows frequency distribution of intensity levels
- dark image: histogram concentrated on left
- bright image: histogram concentrated on right
- high contrast: histogram spread across full range
histogram-based thresholding "triangle": finds threshold by fitting triangle to histogram and locating point of maximum distance
histogram equalisation: redistributes intensities to achieve uniform histogram
- continuous: \[g = T(f) = \int_0^f p_r(w) dw\]
- discrete: \[g_k = \sum_{j=0}^k \frac{n_j}{n} \times (L-1)\]
- constrained: limits enhancement to prevent over-amplification of noise
histogram matching: transforms image histogram to match target histogram shape
- continuous: find $T_1^{-1}(T_2(z))$ where $T_1, T_2$ are CDFs
- discrete: use lookup table mapping
arithmetic operations: addition, subtraction, multiplication, division between images
- useful for image enhancement, noise reduction, change detection
logical operations: AND, OR, NOT operations on binary images
- fundamental for morphological operations and region manipulation
averaging: reduces noise by averaging multiple images of same scene \[\bar{g}(x,y) = \frac{1}{K} \sum_{i=1}^K f_i(x,y)\]

neighourhood operations

\[T: \mathbb{R^n} \rightarrow \mathbb{R}\quad g(x,y) = T(f(x,y),f(x+1,y),f(x-1,y),...\]

convolution
- linear, shift-invariant
- properties:
  - commutativity \[f_1 * f_2 = f_2 * f_1 \]
  - associativity \[f_1 * (f_2 * f_3) = (f_1 * f_2) * f_3 \]
  - distributivity \[f_1 * (f_2 * f_3) = f_1 * f_2 + f_1 * f_3\]
  - multiplicativity \[a(f_1*f_2) = (a f_1) * f_2 = f_1 * (a f_2) \]
  - derivation \[(f_1 * f_2)' = f_1'*f_2 = f_1*f_2' \]
  - theorem \[f_1 * f_2 \iff \hat{f_1} \hat{f_2}\] convolution in spatial domain amounts to multiplication in spectral domain
spatial filtering
linear shift-invariant operations
border problem
- padding: add more pixels with value 0
- clamping: repeat all border pixel values indefinitely
- wrapping: copy pixel values from opposite sides
- mirroring: reflect pixel values across borders

filtering methods

uniform filter
- smoothing
gaussian filter
- separable and circularly symmetric; the only such filter
- optimal joint localisation in spatial and frequency domain
- fourier transform (ft henceforth) is also a Gaussian
- n-fold convolution of any low-pass filter converges to a Gaussian
- infinitely smooth, so infinite derivatives exist
- good at keeping small objects (better than median). it is a smoothing filter.
median filter
- order-statistic filter
- sorts, then takes median
- can eliminate salt and pepper noise (which are just isolated intensity spikes)
- nonlinear filter
- better than gaussian at removing small objects
smoothing
- image blurring, noise reduction
differentiation
- forward, backward, central difference (finite differences because images are discrete)
separability
- improves computation efficiency
- examples: uniform, prewitt, sobel, gauss
pooling
- max / min / average
- makes image smaller
- combines filtering and downsampling in one operation

image enhancement

sharpening
- subtract Gaussian filtered from image, then add the produced "high-frequencies" back into the image.
- can also use the laplacean: $\nabla^2 f = f_{xx} + f_{yy} $ by subtracting it from the original image: \[f(x,y) - \nabla^2 f(x,y) \]
unsharp masking
- a classical sharpening technique that enhances edges by subtracting a blurred version from the original
- algorithm:
  1. create blurred version: $f_{blur} = f * G_\sigma$ (Gaussian blur)
  2. create mask: $f_{mask} = f - f_{blur}$ (high-frequency components)
  3. add scaled mask back: $f_{sharp} = f + k \cdot f_{mask}$ where $k$ is sharpening strength
- combined formula: $f_{sharp} = f + k(f - f * G_\sigma)$
- controls:
  - radius: size of Gaussian blur (larger = affects broader features)
  - amount: strength multiplier $k$ (higher = more sharpening)
  - threshold: minimum contrast required before sharpening is applied
gradient vector & magnitude
- \[\nabla f(x,y) = [f_x(x,y), f_y(x,y)]^T \]
- \[||\nabla f(x,y) || = \sqrt{f_x^2(x,y),f_y^2(x,y)} \]
edge detection
- use laplacean or intensity gradient

transform domain

high frequency -> rapidly changing intensities across pixels
low frequency -> large scale image structures
we process images in the frequency domain by first applying the Fourier transform

Fourier Transform

interpretations:
- frequencies correspond to patterns
- $F(0,0)$ is the total intensity over all pixels of the image
- noise (typically) corresponds to fluctuations in the highest frequencies
notation:
- $f(x)$ is the spatial input function
- $F(u)$ is the Fourier transform
- $e^{i\omega x} = \cos(\omega x) + i\sin(\omega x) $
- $\omega = 2\pi u$ is radial frequency
- $u$ is spatial frequency
forward fourier transform \[F(u) = \int^\infty_{-\infty} f(x)\; e^{\displaystyle -i 2\pi u x}\,\mathrm{d}x\]
inverse fourier transform \[f(x) = \int^\infty_{-\infty} F(u)\; e^{\displaystyle i 2\pi u x}\,\mathrm{d}u\]
properties:

Property	Spatial	Frequency
Superposition	$f_1(x) + f_2(x)$	$F_1(u) + F_2(u)$
Translation	$f(x-\Delta x)$	$F(u)e^{-i 2\pi u\Delta x}$
Convolution	$f(x)*h(x)$	$F(u)H(u)$
Correlation	$f(x) \otimes h(x)$	$F(u)H^*(u)$
Multiplication	$f(x)h(x)$	$F(u)*H(u)$
Scaling	$f(ax)$	$F(u/a)/a$
Differentiation	$f^{(n)}(x)$	$(i2\pi u)^n F(u)$

2D:
- forward fourier transform \[F(u,v) = \int^\infty_{-\infty}\int^\infty_{-\infty} f(x,y)\; e^{\displaystyle -i 2\pi (ux+vy)}\;\mathrm{d}x\,\mathrm{d}y\]
- inverse fourier transform \[f(x,y) = \int^\infty_{-\infty}\int^\infty_{-\infty} F(u,v)\; e^{\displaystyle -i 2\pi (ux+vy)}\;\mathrm{d}u\,\mathrm{d}v\]
- $f \leftrightarrow F$: fourier transform pair
- $F = R + i I$: real plus imaginary part
- $|F| = \sqrt{R^2 + I^2}$: Magnitude
- $\phi = \arctan(\frac{I}{R})$: Phase
Discrete:
- forward \[F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\;e^{\displaystyle -i 2 \pi (\frac{ux}{M} + \frac{vy}{N})} \] for $u=0... M-1$ and $v = 0... N -1$
- inverse \[f(x,y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u,v)\;e^{\displaystyle i 2\pi (\frac{ux}{M} + \frac{vy}{N})} \] for $x=0... M-1$ and $y = 0... N -1$

filtering

procedure:
1. multiply input image $f(x,y)$ by $(-1)^{x+y}$ to ensure centering $F(u,v)$
2. compute the transform $F(u,v)$ from image $f(x,y)$ using 2D DFT
3. multiply $F(u,v)$ by a centred filter $H(u,v)$ to obtain result $G(u,v)$
4. compute the inverse 2D DFT of $G(u,v)$ to obtain the spatial result $g(x,y)$
5. take the real component of $g(x,y)$ (imaginary component is zero)
6. multiply the result by $(-1)^{x+y}$ to remove the pattern introduced in step 1^^

convolution theorem (how does this relate to convolution?)

filtering in the frequency domain can be computationally more efficient
more intuitive in freq dom. i.e:
- low-pass = keep low frequencies, but attenuate ⊕ high frequencies
- high-pass = keep high freq, reduce low freq
- band-pass = keep frequencies in a given band. attenuate the rest
- take inverse to get the corresponding spatial filter

notch filtering = opposite of band-pass; attenuates a given range.
difference of Gaussians is a high-pass
gaussian filter = low-pass
image pyramids is for multi-resolution
approximation = the process of creating coarser resolution levels by successive filtering and downsampling
- Gaussian pyramid: $G_l = \text{DOWNSAMPLE}(G_{l-1} * g)$ where $g$ is Gaussian kernel
- Laplacian pyramid: $L_l = G_l - \text{EXPAND}(G_{l+1})$ (difference between levels)
- captures image information at different scales for efficient processing
reconstruction = the process of rebuilding higher resolution from coarser levels
- for Gaussian: simple upsampling and interpolation
- for Laplacian: $G_l = L_l + \text{EXPAND}(G_{l+1})$ (recursive reconstruction)
- perfect reconstruction possible with proper filter design
- applications: compression, blending, texture analysis, feature detection

Feature Representation

image features are vectors that are a compact representation of images. i.e. blobs, edges, corners, etc.
more efficient and robust way to represent images. also useful for further processing: object detection, image segmentation, classification, retrieval, stitching and object tracking.
note that pixel values are highly redundant and victim to light intensity, colour, angle changes, camera orientation
we wish for features to be: reproducible, salient and compact (aka robust, descriptive, efficient)

colour features

colour is easy to compute
invariant to image scaling, translation, rotation

colour histogram

represent the global distribution of pixel colours in an image
algorithm:
- step 1: quantize each color channel into bins (e.g., 256 levels → 32 bins)
- step 2: construct a histogram for each colour channel (R, G, B)
- step 3: concatenate the histograms (vectors) of all channels as the final feature vector
example: 32 bins per channel → 32×3 = 96-dimensional feature vector
properties:
- translation invariant: moving objects doesn't change histogram
- rotation invariant: rotating image doesn't change color distribution
- scale invariant: resizing preserves relative color frequencies
- not robust to illumination changes
variations:
- HSV histograms: often more perceptually meaningful
- 3D color histograms: joint RGB distribution (higher dimensional)
- normalized histograms: divide by total pixels for scale invariance
applications: image retrieval, object recognition, scene classification
limitations: loses spatial information, sensitive to lighting conditions

colour moments

moments based representation of colour distributions
gives a feature vector of only 9 elements (for RGB)
lower representation capability than above histogram

texture features

visual characteristics and appearance of objects
a powerful discriminating feature for identifying visual patterns
encodes properties of structural homogeneity beyond colour or intensity

haralick features

array of statistical descriptors of image patterns
captures spatial relationship between neighbouring pixels
step 1: construct the gray-level co-occurence matrix (GLCM) - representing the frequency of peixel intensity pairs occurring at a specific offset and direction
step 2: compute the Haralick feature descriptors from the GLCM - that summarises texture information (how pixel intensities are spatially related)
often used in practice due to their simplicity and interpretability.

glcm method

Gray-Level Co-occurrence Matrix construction:
- step 1: convert image to grayscale and quantize intensity levels (typically 8-256 levels)
- step 2: define spatial relationship parameters:
  - distance $d$: pixel separation (commonly 1, 2, 4, 8)
  - angle $\theta$: direction (0°, 45°, 90°, 135°)
- step 3: count co-occurrences of intensity pairs $(i,j)$ at distance $d$ in direction $\theta$
- step 4: normalize by total number of pixel pairs to get probabilities $P(i,j|d,\theta)$
GLCM properties:
- symmetric if considering both directions: $P(i,j) = P(j,i)$
- size: $N \times N$ where $N$ is number of gray levels
- multiple GLCMs computed for different $d$ and $\theta$ values
interpretation:
- diagonal elements: pixels with same intensity at specified distance
- off-diagonal: pixels with different intensities
- concentrated near diagonal: homogeneous texture
- spread throughout matrix: heterogeneous texture

haralick descriptors

14 statistical measures computed from GLCM $P(i,j)$:

Angular Second Moment (Energy): $\sum_i \sum_j P(i,j)^2$
- measures texture uniformity; high for homogeneous textures
Contrast: $\sum_i \sum_j (i-j)^2 P(i,j)$
- measures local intensity variation; high for high-contrast textures
Correlation: $\frac{\sum_i \sum_j (i-\mu_i)(j-\mu_j)P(i,j)}{\sigma_i \sigma_j}$
- measures linear dependency of gray levels; high for linear structures
Variance: $\sum_i (i-\mu)^2 \sum_j P(i,j)$
- measures spread of intensity distribution
Inverse Difference Moment (Homogeneity): $\sum_i \sum_j \frac{P(i,j)}{1+(i-j)^2}$
- measures texture homogeneity; high for similar gray levels
Sum Average: $\sum_{k=2}^{2N} k \cdot P_{x+y}(k)$
- where $P_{x+y}(k) = \sum_{i+j=k} P(i,j)$
Sum Entropy: $-\sum_{k=2}^{2N} P_{x+y}(k) \log P_{x+y}(k)$
Sum Variance: $\sum_{k=2}^{2N} (k - \text{Sum Average})^2 P_{x+y}(k)$
Entropy: $-\sum_i \sum_j P(i,j) \log P(i,j)$
- measures randomness; high for random textures
Difference Variance: variance of $P_{x-y}$
Difference Entropy: $-\sum_{k=0}^{N-1} P_{x-y}(k) \log P_{x-y}(k)$
Information Measures of Correlation (1 & 2): based on mutual information
Maximum Correlation Coefficient: second largest eigenvalue of $Q$

typical usage: compute subset (contrast, correlation, energy, homogeneity) for efficiency
combine descriptors from multiple directions and distances for rotation/scale invariance

local binary patterns

describe the spatial structure of local image texture

algorithm:

divide the image into cells of $N\times N$ pixels ($N=16$ or $N=32$)
compare each pixel in a given cell to each of its 8 neighbouring pixels
if the neighbour's value is greater than or equal to the centre, write 1; otherwise write 0.
this gives an 8-digit binary pattern per pixel, representing a value in the range $0...255$
count the number of times each 8-digit binary number occurs in the cell
this gives a 256-bin histogram (also known as the LBP feature vector)
combine the histograms of all cells of the given image
this gives the image-level LBP feature descriptor

Example: Consider a 3×3 neighborhood with center pixel value 45:

41	52	39
48	45	51	→ Binary pattern: 10011001₂ = 153₁₀
46	43	50

Compare each neighbor to center (45): 41<45→0, 52≥45→1, 39<45→0, etc.
Read clockwise from top-left: 01001100₂
Rotate to get minimum: 00110010₂ = 50₁₀ (for rotation invariance)
This pixel contributes +1 to bin 50 in the histogram
LBP can be multiresolution and rotation-invariant
- in the case of multiresolution, you vary the distance between the centre pixel and neighbouring pixels and vary the number of neighbouring pixels
- for rotation-invariance: vary the way of constructing the 8-digit binary number by performing bitwise shift to derive the smallest number
  - note: not all patterns have 8-shifted variants (i.e. 11001100 has only 4)
  - reduces LPB feature dimension from 256 to 36

scale-invariant feature transform

describes texture in a localised region around a keypoint
invariant to scaling, rotation, shift.
robust to affine distortion and illumination changes

algorithm

Scale-Space Extrema Detection: find maxima/minima in DoG images across scales
Keypoint Localisation: discard low-contrast keypoints and eliminate edge responses
Orientation Assignment: achieve rotation invariance by orientation assignment (make histogram of local gradient vectors) ⊕
Keypoint Descriptor: compute gradient orientation histograms

descriptor matching

nearest neighbour distance ratio (NNDR) \[NNDR = \frac{d_1}{d_2} = \frac{||D_a -D_B||}{||D_A-D_C||}\]

distance $d_1$ is to the first nearest neighbour
distance $d_2$ is to the second nearest neighbour
nearest neighbours in 128D feature space
reject matches with NNDR > 0.8

translation and rotation are rigid transformations
scaling, affine, perspective are nonrigid transformations

Transformation matrices:

Translation: \[\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\]
Rotation: \[\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\]
Scaling: \[\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\]
Affine (6 DOF): \[\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} a & b & t_x \\ c & d & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\]
Perspective/Homography (8 DOF): \[\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\] where $(x', y') = (x'/w', y'/w')$

fitting and alignment

Least Squares (LS) fitting of corresponding keypoints $(x_i,x_i')$
- find parameters $p$ that minimise the squared error E \[E = \sum_i ||T(\mathbf{x}_\mathbf{i}); \mathbf{p) - \mathbf{x}_\mathbf{i}'}||^2\]
RANdom SAmple Consensus (RANSAC) fitting
- least-squares is hampered by outliers
- better use a subset of the data and check inlier agreement
- RANSAC does this in an iterative way to find the optimum
- algorithm:
  1. sample (randomly) the number of points required to fit the model
  2. solve for the model parameters using the samples
  3. score by the fraction of inliers within a preset threshold of the model
  - repeat 1-3 until the best model is found with high confidence
TODO practise problem on solving the "transformation" given matched points A and B.

feature encoding

Bag-of-Words (BoW) takes variable number of local image features and encodes them into a fixed-dimensional histogram. it works in general, but for SIFT too.
Algorithm:
- extract local SIFT keypoint descriptors from training images
- create the "vocabulary" from the set of SIFT keypoint descriptors
  - use K-means clustering
    - partitions the training data into $k$ categories
    - algorithm:
- initialise: $k$ cluster centres (randomly)
- iterate:
  1. assign data (feature vectors) to the closest cluster (Euclidean distance)
  2. update cluster centres as the mean of the data samples in each cluster
- terminate:
  - when converged or the number of iterations reaches the maximum
  - this vocabulary represents the categories of local descriptors
  - cluster centres are the "visual words" in this "vocabulary" used to represent an image
  - each local feature descriptor is assigned to one visual word with the smallest distance
  - compute the number of local image feature descriptors assigned to each visual word
  - concatenate the numbers into a vector which is the "BoW" representation of the image
local features (that BoW takes in) can be LBP, SURF, BRIEF, ORB
- BoW in turn can be replaced with VLAD, Fisher Vector

shape features

essential characteristic of material objects
typically extracted after image segmentation
can be used to identify and classify objects
challenges
- invariant to rigid transformations
- tolerant to non-rigid deformations

basic shape features

net area; principal axes; convex area
convexity, concavity; convex hull ⊕ ; convex deficiency (set difference between the convex hull and the object)
compactness; circularity
- ^inversely related
- compactness: ratio of the area of a circle with the same perimeter as the object to the area of the object
- circularity: ratio of $4\pi$ times the area of an object to the second power of its perimeter ($4\pi A/p^=1$ for a circle)
elongation; eccentricity
- elongation: ratio between the length and width of the object's bounding box
- eccentricity: ratio of the length of the minor axis to the length of the major axis

boundary descriptors

Describe object shape using boundary/contour information:

chain code descriptor
- represents boundary as sequence of directional codes
- 4-connected: codes {0,1,2,3} for right, down, left, up
- 8-connected: codes {0,1,2,3,4,5,6,7} for 8 directions
- properties: translation invariant, rotation/scale dependent
- normalization: start from same relative position, compute difference chain code
local curvature descriptor
- measures rate of change of tangent direction along boundary
- discrete curvature: $k(i) = \angle(\mathbf{v}_{i-1}, \mathbf{v}_{i+1})$ where $\mathbf{v}_i$ is tangent vector
- captures local shape properties: corners (high curvature), straight segments (low curvature)
- can be smoothed to reduce noise effects
global curvature descriptors:
- total bending energy: $B=\oint_C \kappa^2(s)\; \mathrm{d}s$ measures overall "bendiness"
- total absolute curvature: $K=\oint_C |\kappa(s)|\;\mathrm{d}s$ measures total shape complexity
- both are invariant to rigid transformations
- sensitive to boundary noise, require smoothing
radial distance descriptor
- distance from centroid to boundary points: $r(\theta) = ||\mathbf{p}(\theta) - \mathbf{c}||$
- creates 1D signature function from 2D shape
- invariant to translation, rotation can be handled by circular shift
- Fourier descriptors: apply FFT to $r(\theta)$ for compact representation
- limitations: fails for non-star-shaped objects (multiple intersections per ray)

shape context

is a point-wise local feature descriptor
- pick $n$ points $p_i$ on the contour of a shape
- for each point, create a radial coordinate system centred at this point and compute a histogram $h_i$ based on the relative coordinates of the other $n-1$ points
- this is the shape context of $p_i$

Shape matching algorithm:

Extract $n$ points from each shape's contour (typically $n=100-200$)
Compute shape context histogram for each point:
- use log-polar coordinate system: $k$ radial bins, $l$ angular bins
- typical values: $k=5$, $l=12$ → 60-dimensional histogram per point
Find correspondence between points using Hungarian algorithm:
- cost matrix: $C_{ij} = \frac{1}{2}\sum_{k=1}^K \frac{[h_i(k) - h_j(k)]^2}{h_i(k) + h_j(k)}$ (chi-squared distance)
Estimate transformation (affine/TPS) from corresponding points
Compute shape distance as sum of matching costs + transformation cost

histogram of oriented gradients (HoG)

describes the distributions of gradient orientations in localised areas
does not require initial segmentation
algorithm:
1. calculate the gradient vector at each pixel
  - gradient magnitude
  - gradient orientation
2. construct the gradient histogram of all pixels in a cell
  - divide orientations into $N$ bins (typically $N=9$ bins evenly splitting 180 degrees)
  - assign the gradient magnitude of each pixel to the bin corresponding to its orientation
3. generate detection-window level HOG descriptor
  - concatenate cell histograms
  - block-normalise cell histograms

Sliding window detection:

HOG features are computed for a fixed-size detection window (e.g., 64×128 pixels for human detection)
slide this window across the entire image at multiple scales and positions
at each position, extract HOG descriptor and classify using trained SVM
algorithm:
1. create image pyramid (multiple scales) to detect objects of different sizes
2. for each scale level:
  - slide detection window across image (typically with 8-pixel stride)
  - compute HOG descriptor for each window position
  - classify using trained classifier (e.g., SVM)
3. apply non-maximum suppression to remove duplicate detections
computational optimization:
- compute gradients once for entire image
- reuse cell histograms between overlapping windows
- integral histograms for fast block normalization
use-case: detecting humans in images.

Pattern Recognition

automatically recognise patterns and regularities in data.
- object recognition; text classif; speech recognition; event detection; recommender systems
the different learning paradigms:
- supervised learning
- unsupervised learning
- semi-supervised learning
  - uses labelled and unlabelled data
- weakly supervised learning
  - noisy / limited / imprecise supervision signals in learning

concepts

objects: identifiable physical entities of which images are taken
regions: correspond to objects after image segmentation
classes: disjoint subsets of objects sharing common features
labels: associated with objects and indicate to which class they belong
classification:process of assigning labels to ojbects based on features
classifiers: algorithms / methods performing the classication task
patterns: are regularities in object features and are used by classifiers
pre-processing: aims to enhance images for further processing
feature extraction: reduces the data by measuring certain properties
feature descriptors: aims to keep only the most descriptive features
models: are (mathematical or statistical) descriptions of classes
training samples: objects with known labels used to build models
cost: consequence of making an incorrect decision / assignment
decision boundary: demarcation between regions in feature space

supervised learning

classification

nearest class mean

like k-means but without the iteration
would depend strongly on initialised parameters
algorithm:
- training: given training sample pairs ${(x_1,y_1),(x_2,y_2),...,(x_n,y_n)}$, the centrodi for each class $k$ is obtained as \[\mu_k = \frac{1}{|c_k|}\sum_{x_i\in c_k}x_i\]
- testing: each unknown object with feature vector $x$ is classified at class $k$ if $x$ is closer to the centroid of class $k$ than to any other class centroid
works well when classes are compact and far from each other
cannot handle outliers and noisy data well
no good for complex classes (multimodal, non-spherical)

k-nearest neighbours

euclidean distance used for continuous variables
hamming distance for discrete variables
decision surfaces are non-linear
cons:
- slow for big datasets (non-parametric)
- needs homogeneous feature types and scales
- number of variables $>> \implies$ curse of dimensionality
- hyper-parameter tuning of $k$ can be annoying

bayesian decision theory

introduces prior knowledge
assigns class to which it most likely belongs based on observed features
assume the following to be known: ⊕
- prior probability $p(c_i)$ for each class $c_i$
- class conditional distribution $p((x|c_i)$
compute the posterior probability $p(c_i|x)$ as follows:
- if all the classes are disjoint, by Bayes Rule, the posterior probabilities are given by \[p(c_i|x) = \frac{p(x|c_i)p(c_i)}{p(x)}\] \[\boxed{\begin{align}p(x,c_i) &= p(x|c_i)p(c_i) = p(c_i|x)p(x) \\ p(x) &= \sum_j p(x,c_j) = \sum_j p(x|c_j) p(c_j)\end{align}}\]
the bayesian decision rule is: $c=\mathrm{arg}\max_i (p(x|c_i))$
- which is equivalent to \[c=\mathrm{arg}\max_i (p(x|c_i)p(c_i))\] \[c(c_i|x) = \frac{p(x|c_i)p(c_i)}{p(x)}\propto p(x|c_i)p(c_i)\]
unusually, in the running example: uniform priced fish $\implies$ we should maximise the posterior probability and prices unequal $\implies$ we should minimise the loss.
pros:
- simple and efficient
- considers uncertainties
- permits new information updates
cons:
- struggles with complex data relationships
- choice of priors can be subjective

decision trees

nominal data = categorical
entropy: of a set of events $y = \set{y_1,y_2,...,y_n}$ is: \[H(y) = \sum_{i=1}^n -p(y_i) \log_2(p(y_i))\] where $p(y_i)$ is the probability of event $y_i$
- may be viewed as the average uncertainty of the information source
  - $H=0$ means source has no uncertainty
  - $H>0$ if the source information is uncertain
algorithm:
1. select a feature to place at the node
2. make one branch for each possible value (nominal) or range(numerical)
3. for each branch node, repeat steps 1 and 2 using only those samples that actually reach the branch
4. when all samples at a node have the same classifiaciton (or label), stop growing that part of the tree
information gain: \[IG(S,F) = H(S) - H(S|f)\]
- \[IG(S,f) = \mathrm{Entropy}(S) - \sum_{f_a\in \mathrm{values}(f)}\frac{|S_{fa}|}{S}\mathrm{Entropy}(S_{fa})\]
- use the feature with highest information gain to split on
pros:
- interpretable
- numerical, categorical data
- robust to outliers and missing values
- feature selection (helps determine the more important features)
cons:
- overfitting
- greedy algorithm

ensemble learning, random forests

training:
1. let $N$ be number of training instances and $M$ the number of features
2. sample $N$ instances at random with replacement from the original data
3. at each node select $m << M$ features at random and split on the best feature
4. grow each tree to the largest extent possible (no pruning)
5. repeat $B$ times. keep the value of $m$ constant during the forest growing
testing:
1. push a new sample down a tree and assign the label of the terminal node it ends up in
2. iterate over all trees in the ensemble to get $B$ predictions for the sample
3. report the majority vote of all trees as the random forest prediction
error rate depends on 2 factors:
1. correlation between any two trees in the forest.
  - increased correlation increases the forest error rate; uncorrelated trees lead to better generalisation
2. strength of each individual tree in the forest
  - strong tree has low error rate
  - increasing the strength of individual trees decreases the forest error rate
selecting parameter $m$:
- reducing $m$ reduces both the correlation and the strength
- increasing $m$ increases both the correlation and the strength
- somewhere in between is an "optimal" range
pros:
- high accuracy
- efficient and effective on large datasets
- can handle thousands of input features without feature selection
- handles missing values
cons:
- less interpretable than an individual decision tree

linear classification

\[f(x) = W^T x+ b\]

support vector machines

maximise margin - the distance to the closest sample
primal (hard-margin) \[\begin{equation} \min_{\mathbf{w},b} \frac{1}{2} ||\mathbf{w}||^2_2\\ \text{s.t.}\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad \forall i \end{equation}\]
distance between a point to a hyperplane \[d = \frac{|\mathbf{w}^T\mathbf{x}'+b|}{||\mathbf{w}||_2}\]
soft-margin primal \[\begin{equation} \min_{\mathbf{w},b} \frac{1}{2} ||\mathbf{w}||^2_2 + C \sum_i \xi_i\\ \text{s.t.}\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1-\xi_i, \quad \forall i \\ \xi_i \geq 0 \end{equation}\]
- small $C$: more tolerance on miss-classified samples for larger margin
- large $C$: focus on avoiding mistakes at the expense of smaller margin
- $C$ to infinity means going back to the hard margin SVM
- still a quadratic programming optimisation problem
pros:
- effective in high-dimensional feature spaces
- effective when number of features is larger than the training data size
- good whenn classes are separable
- works well when data is sparse
- can be extended to non-linear with the kernel trick
cons:
- does not perform well for overlapping classes
- hyperparameter tuning

multiclass classification

one versus rest: builds one classifier for one class versus the rest and assigns a test sample to the class that has the highest confidence score
one versus one: builds one classifier for every pair of classes and assigns a test sample to the class that has the highest number of predictions

classification performance metrics

ROC curve

receiver operating curve
relates the false positive to the true positive rate
generally, false alarms go up with attempts to correctly detect higher percentages of known objects
AUC (area under the ROC) summarises the overall performance

Confusion Matrix

matrix whose entry $(i,j)$ records the number of times an object of class $i$ was classified as class $j$
diagonal entries indicate success
reproduce the binary classification matrix TODO

Precision vs. Recall

\[\text{accuracy} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} \quad \left ( \frac{\text{correct}}{\text{total}} \right)\]
precision (= positive predictive value)
- fraction of the segmented object that is correctly segmented \[\text{P} = \frac{\text{TP}}{\text{TP}+\text{FP}}\]
F-measure (= harmonic mean of precision and recall) \[\text{F1} = \frac{2\text{RP}}{\text{R}+\text{P}}\]
recall (= sensitivity)
- fraction of the true object that is correctly segmented \[\text{R} = \frac{\text{TP}}{\text{TP}+\text{FN}}\]

regression

linear regression

\[f(x) = xW\]

least-squares regression

\[\text{RSS}(W) = \sum_{i=1}[y_i-f(x_i)]^2 = (Y-XW)^T(Y-XW)\]
$\hat{W} = \text{arg}\min_w \text{RSS}(W) = (X^TX)^{-1}X^TY$

regression performance evaluation

\[\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N (y_i - \hat{y_i})^2}\]

represents the standard deviation of the ppredicted values from the observed values

\[\text{MAE} = \frac{1}{N}\sum_{i=1}^N |y_i - \hat{y_i}|\]

represents the average of the absolute differences between the predicted and observed values
RMSE penalises big differences between predicted values and observed values more heavily.

\[R^2 = 1 - \frac{\sum_{i=1}^N(y_i-\hat{y_i})^2}{\sum_{i=1}^N (y_i - \bar{y})^2}\]

indicates how well the selected features explain the output variable
tends to always increase by adding extra features

normalisation: \[\frac{x-x_\text{min}}{x_\text{max} - x_\text{min}}\]

cross validation: data is split into $k$ subsets (folds) and at each iteration we keep one fold out for testing and use the rest for training.

this is repeated $K$ times until all folds have been used once as the test set.

Image Segmentation

basic segmentation

thresholding

works fine if regions have sufficiently different intensity distributions
problematic if regions have overlapping intensity distributions

k-means

may work if the number of clusters is known a priori
problematic if not

feature-based pixel classification

extract a patch around each pixel and compute its features
classify each pixel based on its features using a trained classifier

advanced segmentation

region splitting * and merging

recursively split the whole image into pieces based on local statistics
recursively merge pieces together (in a hierarchical fashion)
combine splitting and merging sequentially

connectivity

4-connectivity in 2d
8-connectivity in 2d
3d: 6-connected, 18, 26
connected components labelling algorithm:
- first pass:
  1. check each pixel (top left to bottom right)
  2. if an object pixel, check its neighbours ($N_4$ or $N_8$)
  3. if no neighbours have labels, assign a new label
  4. if neighbours do have labels, assign the smallest
  5. record label equivalences while assigning
- second pass
  1. check each pixel
  2. replace each label with its smallest equivalent
  3. all background pixels default to the zero label
merging by region growing:
- define a similarity measure
- start from one seed pixel for the region
- add neighbouring pixels to the region if they are similar
- repeat previous step until no more pixels are similar

watershed segmentation

based on the analogy of immersion of a topographic surface
meyer's flooding algorithm:
1. choose a set of markers to start the flooding
  - i.e. local minima. give each a different label
2. put neighbouring pixels of each marker into a priority queue
  - a pixel with more similar gray value has higher priority
3. pop the pixel with the highest priority level from the queue. if the neighbours of the popped pixel that have already been labelled all have the same label, then give the pixel that same label. put all non-labelled neighbours that have never been in the queue into the queue
4. repeat step 3 until the queue is empty
preprocessing:
- invert the image or compute edges if needed to get local minima markers
images often have many local minima, leading to heavy oversegmentation
preprocessing (image smoothing) may be needed to reduce false minima
postprocessing (basin merging) may be needed to reduce fragmentation
many different implementations and pre/postprocessing criteria exist
it is possible to start from user-defined markers instead of local minima

maximally stable extremal regions

try multiple thresholds and analyse the shape of the connected components
select regions with virtually constant shape over many thresholds

mean shifting

seeks stationary points (peaks / modes) in a density function
attempts to find all possible cluster centers in feature space
does not require knowing the number of clusters a priori
is a variant of iterative steepest-ascent method
algorithm:
1. initialise a random seed point $x$ and window $N$
2. calculate the mean (centre of gravity) $m(x)$ within $N$ \[m(x)=\frac{\sum_{x_i\in N(x)} K(x_i-x)x_i}{\sum_{x_i\in N(x)} K(x_i-x)}\qquad K(x) = \exp(-|x|^2)\] \[\text{Mean (centre of gravity)}\qquad\text{Kernel (weight function)}\]
3. shift the search window to the mean
4. repeat step 2 until convergence
advantages:
- model-free (does not assume any prior shape on data clusters)
- has just a single parameter (window size)
- finds variable number of modes (clusters)
- robust to outliers
limitations:
- computationally expensive (need to shift many windows)
- output depends on window size parameter value
- window size (bandwidth) selection is not trivial
- does not scale well with dimensionality of feature space

superpixel segmentation

improves efficiency:
- group similar pixels into a superpixel
- superpixels together are an oversegmentation of the image
- ultimate segmentation (classification, mergingg) performed on superpixels
simple linear iterative clustering (SLIC)
- popular superpixel generation algorithm
- preserves object boundaries, is fast and memory efficient
- algorithm:
  1. initialise cluster centres $C_j$ on pixel grid with step size $S$
  2. move $C_j$ to position in $3\times 3$ window with smallest gradient
  3. compute distance $D_{ij}$ for each pixel $i$ in $2S\times 2S$ window around $C_j$
  4. assign each pixel $i$ to the cluster $C_j$ with smallest distance $D_{ij}$
  5. recompute cluster centres as mean colour and position of pixels in $C_j$
  6. iterate (go to step 3) until the residual error is small \[\begin{align*} D &= \sqrt{\frac{d_{\text{lab}^2}}{m^2} + \frac{d_{xy}^2}{S^2}}\quad\text{(weight $m$ controls influence of colour over spatial distance)}\\ d_\text{lab} &= \sqrt{(l_j-l_i)^2 + (a_j-a_i)^2 + (b_j-b_i)^2}\quad\text{(distance in CIELAB space)}\\ d_{xy}&=\sqrt{(x_j-x_i)^2+(y_j-y_i)^2}\quad\text{(distance in pixel space)} \end{align*}\]

conditional random field

superpixels provide a basis for further segmentation
- determine spatial relationship between the superpixels
- compute similarities between superpixels
- group superpixels to form larger segments
conditional random field (CRF) approach
- a probabilistic graphical model that encodes the relationships between observations (i.e. superpixels) and constructs a consistent interpretation (i.e. a segmentation) for a set of data (i.e. an image)
nodes: superpixels (value based on features of superpixels)
edges: adjacency (value based on similarity between superpixels)
formulated as an energy minimisaion problem: $E(s,c) = \sum_i \psi (s_i,c_i) \sum_{ij}\Psi (s_i, s_j)$
- unary potentials $\psi$:
  - data term based on graph node values
  - computes the cost of superpixel $s_i$ belonging to class $c_i$
  - a lower cost means a higher likelihood of $s_i$ belonging to $c_i$
  - can be obtained via superpixel classification
- pairwise potentials $\Psi$:
  - smoothness term based on graph edge values
  - computes a cost of neighbourhood consistency
  - a cost is assigned if adjacent superpixels are assigned to different classes
  - higher similarity results in a lower cost (0 if assigned to the same class)
graph cutting by min-cut / max-flow algorithm

active contour segmentation (snakes method)

a contour-based approach to object segmentation
aims to locate object boundaries in images by curve fitting
represents the curve by set of control points and interpolation
iteratively moves the control points to fit the curve to the object
uses image, smoothness and user-guidance forces along curve
a.k.a snakes:
- smoothly follows high intensity gradients at the object boundary
- bridges areas of noise or missing gradients using smooth interpolation
missing gradient problem?

level-set segmentation

active contours / snakes are parametric models
- explicit representations of the object boundaries
- typically requires manual interaction to initialise the curve
- it is challenging to change the topology of the curve as it evolves
- curve reparametrisation may be required for big shape changes
level-set methods have become more popular alternatives
- implicit representations of the object boundaries
- boundaries defined by the zero-set of a higher dimensional function
- level-set function evolves to make the zero-set fit and track objects
- easily accommodates topological changes in object shape
- computationally more demanding than active contours
representation:
- define an initial 3D level-set function
- zero-level plane represents the 2D shape
- iteratively deform the function to fit the shape

evaluation

pixel classification

segmented object pixels: $S$
true object pixels: $T$
true positives: $TP = S \cup T$
- pixels correctly segmented as object
true negatives: $TN = S^c\cup T^c$
- pixels correctl segmented as background
false positives: $FP=S\cap T^c$
- pixels incorrectly segmented as object
false negatives: $FN=S^c \cap T$
- pixels incorrectly segmented as background

quantitative evaluation metrics

sensitivity (= true-positive rate)
- fraction of the true object that is correctly segmented \[\text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}}\]
specificity (= true-negative rate)
- fraction of the true background that is correctly segmented \[\text{TNR} = \frac{\text{TN}}{\text{TN}+\text{FP}}\]
precision (= positive predictive value)
- fraction of the segmented object that is correctly segmented \[\text{P} = \frac{\text{TP}}{\text{TP}+\text{FP}}\]
F-measure (= harmonic mean of precision and recall) \[\text{F1} = \frac{2\text{RP}}{\text{R}+\text{P}}\]
recall (= sensitivity)
- fraction of the true object that is correctly segmented \[\text{R} = \frac{\text{TP}}{\text{TP}+\text{FN}}\]
Jaccard similarity coefficient (JSC)
- intersection over union (IoU) = correctly segmented fraction of the union of the segmented object and the true object \[\text{JSC} = \frac{S\cap T}{S\cup T} = \frac{\text{TP}}{\text{FP}+\text{TP}+\text{FN}}\]
Dice similarity coefficient (DSC)
- correctly segmented fraction of the segmented object set joined with the true object set \[\text{DSC} = \frac{2|S\cap T|}{|S|\cup |T|} = \frac{2\text{TP}}{\text{FP}+2\text{TP}+\text{FN}}\]

receiver operating characteristics

plots the true-positive rate (sensitivity) versus the false positive rate (one minus the specificity) of a method as a function of its free parameters
auc
- higher is better.
- ranges from 0.0 to 1.0. can technically go negative, but maybe just invert your work at that point.

morphology

nonlinear, set‑theoretic post/pre‑processing to clean noise, split/merge objects, fill holes, extract shapes, measure distances.

basic set ops

translation $A_c = \{\,x + c \mid x \in A\,\}$
reflection $A^{r} = \{\, -x \mid x \in A\,\}$
complement $A^{c} = \{\,x \mid x \not\in A\,\}$
union $A \cup B = \{\,x \mid x \in A \text{ or } x \in B\,\}$
intersection $A \cap B = \{\,x \mid x \in A \text{ and } x \in B\,\}$
difference $A - B = \{\,x \mid x \in A \text{ and } x \not\in B\,\}$
size $|A|$ = number of elements

binary morphology (thresholded imgs)

core tools

dilation $I \oplus S = \{\,x \mid (S^{r})_x \cap I \neq \varnothing\,\}$
erosion $I \ominus S = \{\,x \mid S_x \subseteq I\,\}$
default struct elem: symmetric $3 \times 3$; use decomposition / iteration for bigger ones

compound tweaks

opening $I \circ S = (I \ominus S)\, \oplus S$ -> drops tiny foreground blobs
closing $I \bullet S = (I \oplus S)\, \ominus S$ -> fills tiny background gaps

edges & outlines

morph gradient $(I \oplus S) - (I \ominus S)$ -> outer + inner edge
1‑px outline $(I \oplus S) - I$

reconstruction tricks

keep chosen objs: repeatedly dilate marker inside mask until no change
kill boundary objs: marker = border; reconstruct; subtract result
fill holes: work on complement, reconstruct from border, then take complement back

shape analysis

distance transform: count erosion steps per pixel
ultimate erosion: local maxima of dist map -> object centres
ultimate dilation (no‑merge) -> voronoi tessellation
skeleton: iterative thinning that keeps connectivity

n‑d notes

swap $3 \times 3$ for $3 \times 3 \times 3$ voxels, etc.; everything generalises

grey‑scale morphology (pre‑process)

umbra view

treat an $n$-d image as an $(n+1)$-d binary volume under its surface

core ops (flat $S$)

dilation $I \oplus S = \max_{p \in S}\, I(x - p)$
erosion $I \ominus S = \min_{p \in S}\, I(x + p)$

filters

opening / closing: size‑selective smoothing (remove bright / dark specks)
morph gradient $D - E$; outer / inner variants give one‑sided edges
laplacian $D + E - 2I$
white top‑hat $I - (I \circ S)$ (bright peaks)
black top‑hat $(I \bullet S) - I$ (dark wells)

cheat‑sheet

binary morphology (post‑seg): hole fill, contour, split touchers, shape & dist
grey morphology (pre‑seg): kill noise & shading, emphasise structures

morphological operations diagrams

Structuring elements examples:

Distance transform and skeleton:

Deep Learning

stepping stones

linear model: $ \hat y = W x + b $
multilayer perceptron adds hidden layers and nonlinear activation $ \phi $
convolution shares weights in space, giving sparse connections and translation invariance
parameter count per conv: $ (k\_h \, k\_w \, c\_{in} + 1) \, c\_{out} $

core cnn layers

conv: kernel $k\_h \times k\_w$, stride $s$, pad $p$
- output height $h\_{out} = 1 + \frac{h + 2p - k\_h}{s}$
pooling: max or average, reduces resolution
batch norm: $ \hat x = (x-\mu)/\sqrt{\sigma^{2}+\epsilon} $ then scale and shift
activation: sigmoid, tanh, ReLU (default), leaky ReLU, ELU, GELU
dropout: zero units with prob $1-p$ at train time
transposed conv and unpooling learnable upsampling
dilated conv increases receptive field without extra weights

data sets

mnist 28x28 digits, 60k train
cifar‑10 32x32 rgb, 50k train
imagenet 224x224 crops, 1.3m train, 1000 classes
coco detection / segmentation, 80 classes

landmark architectures

lenet‑5 1989

2 conv + subsample then 3 fully connected

alexnet 2012

5 conv + 3 fc, relu, lrn, dropout, heavy data aug

vgg‑16 2014

stacks of 3x3 conv, about 144m params

inception (googlenet) 2014

parallel 1x1, 3x3, 5x5, pool paths then concat

resnet 2015

residual block $y = F(x) + x$, allows 50+ layers

senet 2017

squeeze‑excitation channel reweighting

densenet 2017

dense block concatenates all previous features

efficientnet 2019

compound scaling of depth, width, resolution

vision transformer 2020

image patches to tokens, pure self‑attention

convnext 2022

convnet with transformer design cues (patchify stem, layer norm)

training workflow

split data 70 train / 10 val / 20 test, avoid leakage
loss
- classification: softmax cross‑entropy $L = -\sum y\log p$
- detection: add smooth‑L1 bbox term
- seg: dice or cross‑entropy
metrics
- classification
  - accuracy $ (tp+tn)/(tp+tn+fp+fn) $
  - precision $ tp/(tp+fp) $
  - recall $ tp/(tp+fn) $
  - f1 $ 2pr/(p+r) $
- segmentation (nested list)
  - dice $ 2|x\cap y|/(|x|+|y|) $
  - iou (jaccard) $ |x\cap y|/|x\cup y| $
  - mean iou = average over classes
- detection
  - mAP, AP, precision‑recall curve
optimisers
- sgd, momentum, nesterov, adagrad, adadelta, adam, rmsprop
schedules: step, cosine, one‑cycle
overfit cures
- more data, early stop, data aug, weight decay, dropout, batch norm

transfer learning

load imagenet pretrained weights
freeze feature extractor, replace task head
fine tune later layers with small learning rate

object detection families

two stage: r‑cnn, fast, faster (adds rpn)
one stage: ssd, yolo, retinanet (focal loss)

segmentation architectures

fcn converts classification net to pixel output, uses skip fusion
u‑net encoder decoder with skip concat
- variants: attention u‑net, resunet, transunet
mask r‑cnn adds mask branch on faster r‑cnn

fully convolutional means

all layers are convolutional (or up‑convolution) so network outputs a spatial map whose size scales with input instead of a fixed vector

practical checklist

verify layer dimensions with formula above
start from pretrained backbone, freeze then unfreeze
augment: crop, flip, rotate, color jitter, cutout
use mixed precision to save memory
accumulate gradients for larger effective batch
track train and val curves; stop when val loss stops improving ⊕

exam practice

key benefit of cnns over anns for images: automatically learn hierarchical features
transfer learning with cnns: use a pretrained model then fine tune it
purpose of convolutional layers: apply learned kernels (dot product) at every spatial location to extract local patterns

Motion and Tracking

add a time axis to everything above (3D = 2D + t)
two problem families:
- motion estimation (what moved, by how much?) → change detection; sparse; dense
- object tracking (which object is where over time?) → state-space inference

scenarios

still camera → constant background; single or multiple moving objects
moving camera → coherent scene motion + possible independently moving objects
common uses: surveillance, gait/action recognition, HCI (gaze/gesture), traffic, robotics

change detection (background / frame differencing)

goal: detect pixels that changed between frames
simplest model: subtract a static “empty scene” background
- step 1: acquire background image $B(x,y)$
- step 2: compute difference $D_t(x,y)=|I_t(x,y)-B(x,y)|$ (or $I_t-I_{t-\Delta t}$)
- step 3: threshold $D_t>\tau$ to get a binary mask; post-process:
  - connected-components; remove tiny blobs (noise)
  - morphological closing (small disk) to fuse nearby regions
- step 4: output mask + bounding boxes of changed regions
caveats: camera jitter, lighting changes, shadows → use smoothing, adaptive backgrounds, illumination-invariant cues

sparse motion estimation (template matching around interest points)

detect “interesting” points (corners/ridges/edges or learned keypoints)
- examples: Canny (edges), Hessian (ridges), Harris (corners), SIFT, CNN features
- simple “interest operator”: take a window $w\!\times\!w$ at $(r,c)$; compute variances along horizontal/vertical/diagonals; mark point if $\min\{v_h,v_v,v_{d1},v_{d2}\}\ge t$
track each point by local search at $t+\Delta t$
- take template $T\subset I_t$ around the point; search a region $R\subset I_{t+\Delta t}$ for best match
- similarity measures:
  - cross-correlation (maximise) $\mathrm{CC}(\Delta x,\Delta y)=\sum\limits_{(x,y)\in T} I_t(x,y)\;I_{t+\Delta t}(x+\Delta x,y+\Delta y)$
  - SAD (minimise) $\mathrm{SAD}=\sum\limits_{(x,y)\in T}\!\left|I_t-I_{t+\Delta t}\right|$
  - SSD (minimise) $\mathrm{SSD}=\sum\limits_{(x,y)\in T}\!\left(I_t-I_{t+\Delta t}\right)^2$
  - mutual information (maximise) $\mathrm{MI}(A,B)=\sum_a\sum_b P_{AB}(a,b)\log_2\!\frac{P_{AB}(a,b)}{P_A(a)P_B(b)}$
practicalities:
- choose patch size $w$ and search radius by expected motion
- use image pyramids for large displacements; normalise intensity; reject low-contrast patches

dense motion estimation (optical flow)

assumptions (local, over $\Delta t$)
- brightness constancy: $f(x+\Delta x,y+\Delta y,t+\Delta t)=f(x,y,t)$
- small motion: first-order Taylor expansion suffices
- local coherence: neighbouring pixels share similar velocity
derivation (spatio-temporal gradient): \[ f_x\Delta x+f_y\Delta y+f_t\Delta t\approx 0 \;\Rightarrow\; f_x v_x+f_y v_y = -f_t \;\;\text{with}\;\; v_x=\tfrac{\Delta x}{\Delta t},\; v_y=\tfrac{\Delta y}{\Delta t} \]
- optical-flow constraint: one equation, two unknowns → aperture problem ⊕
Lucas–Kanade (local least-squares)
- assume constant $(v_x,v_y)$ in a small window; stack $n$ pixels $p_i$: \[ \begin{bmatrix}f_x(p_1)&f_y(p_1)\\ \vdots&\vdots\\ f_x(p_n)&f_y(p_n)\end{bmatrix} \!\begin{bmatrix}v_x\\ v_y\end{bmatrix} = -\begin{bmatrix}f_t(p_1)\\ \vdots\\ f_t(p_n)\end{bmatrix}, \quad \mathbf{v}=(A^\top A)^{-1}A^\top \mathbf{b} \]
- works best at corners (good conditioning of $A^\top A$); poor on flat regions/edges
(global alternative) Horn–Schunck (add smoothness): minimise $\iint (f_x v_x+f_y v_y+f_t)^2+\alpha^2(\|\nabla v_x\|^2+\|\nabla v_y\|^2)\,dx\,dy$
engineering notes:
- compute derivatives after Gaussian smoothing; multiscale pyramids for large motion
- handle outliers/occlusion with robust losses and consistency checks (forward/backward flow)

object tracking (state-space / Bayesian view)

goal: infer object state $X_i$ from image-derived measurements $Y_i$ over time
- state may include position, velocity, acceleration, size/shape, intensity/colour, …
hidden Markov assumptions
- Markov dynamics: $P(X_i|X_{0:i-1})=P(X_i|X_{i-1})$
- memoryless observations: $P(Y_i|X_{0:i},Y_{0:i-1})=P(Y_i|X_i)$
recursive filtering
- prediction: $P(X_i|Y_{0:i-1})=\int P(X_i|X_{i-1})\,P(X_{i-1}|Y_{0:i-1})\,dX_{i-1}$
- correction: $P(X_i|Y_{0:i})\propto P(Y_i|X_i)\,P(X_i|Y_{0:i-1})$
- estimators: EAP $ \hat x_i=\int x\,P(X_i|Y_{0:i})\,dx$; MAP $ \hat x_i=\arg\max_x P(X_i|Y_{0:i})$
association: decide which detections at time $i$ belong to which tracks (gating, nearest-neighbour, Hungarian) ⊕

Kalman filter (linear-Gaussian)

models: $x_i=A x_{i-1}+q_{i-1},\; q\sim\mathcal{N}(0,Q)$; $\;y_i=H x_i+r_i,\; r\sim\mathcal{N}(0,R)$
predict:
- $\hat x_i^- = A \hat x_{i-1}$
- $P_i^- = A P_{i-1}A^\top + Q$
update:
- $K_i = P_i^- H^\top (H P_i^- H^\top + R)^{-1}$
- $\hat x_i = \hat x_i^- + K_i(y_i - H\hat x_i^-)$
- $P_i = (I-K_i H)P_i^-$
notes:
- constant-velocity $A$ is standard; extend state for acceleration/size
- optimal under linear-Gaussian assumptions; very fast

particle filter (nonlinear / non-Gaussian)

represent posterior by weighted samples $\{(s_i^{(n)},\pi_i^{(n)})\}_{n=1}^N$
- propagate: $s_i^{(n)}\sim P(X_i|X_{i-1}=s_{i-1}^{(n)})$
- weight: $\pi_i^{(n)} \propto \pi_{i-1}^{(n)}\,P(Y_i|X_i=s_i^{(n)})$; normalise
- resample to avoid degeneracy; estimate $\hat x_i=\sum_n \pi_i^{(n)} s_i^{(n)}$
aliases: SMC, bootstrap filter, CONDENSATION
strengths: handles multimodal posteriors, nonlinear dynamics/measurements

multi-target tracking (sketch)

mixture of single-target posteriors: $P(X|Y)=\sum_{n=1}^N w_n\,P_n(X|Y)$
practicalities: track birth/death, occlusion handling, identity switches; cost grows with targets → use gating + efficient assignment

trajectory analysis (MSD etc.)

positions along a track: $p_1,\dots,p_N$; lag $t$
- displacement: $d_{i,t}=\|p_{i+t}-p_i\|$
- mean squared displacement: \[ \mathrm{MSD}(t)=\frac{1}{N-t}\sum_{i=1}^{N-t}\! \|p_{i+t}-p_i\|^2 \]
motion regimes:
- normal diffusion: $\mathrm{MSD}(t)=c\,D\,t$
- directed + diffusion: $\mathrm{MSD}(t)=c\,D\,t+v^2 t^2$
- anomalous: $\mathrm{MSD}(t)=c\,D\,t^\alpha$ with $0<\alpha<1$ (subdiffusive)
- constants: $c=4$ in 2D, $c=6$ in 3D