Computer Vision

I think taking a course in a subject that you are interested in is never particularly a bad thing.

This page will exclusively contain what I was taught in COMP9517 for a short while. That is just how it will be.

However, eventually we will morph out of this into the textbook, and subsequently out of that too.

For now, I just want a document to type in that I enjoy typing in!

Notes

in each of the below topics, there are a number of algorithms and facts that I must know for the final exam.

Image Formation

  • pinhole camera model
  • projective geometry

    • vanishing lines and points
    • mathematics: \[x' = -x \frac{f}{z} \] \[ y' = -y \frac{f}{z} \]
  • colour spaces:

    • RGB: red green blue

      • strongly correlated channels
    • HSV: hue saturation value

      • confounded channels
    • YCbCr: stand for ??

      • fast to compute, good for compression. the modern day standard
      • Y = luminance
      • Cb = blue colour difference
      • Cr = red colour difference
    • L*a*b*: differences in luminance are more perceptually uniform

      • L* = lightness
  • quantisation: digitises the image intensity / amplitude values

Image Processing

spatial domain

point operations

\[T: \mathbb{R} \rightarrow \mathbb{R}\quad g(x,y) =T(f(x,y))\]

  • contrast stretching
  • intensity thresholding

    • automatic: otsu, isodata, multilevel
  • intensity inversion
  • log transformation
  • power transformation
  • piecewise linear transformation
  • piecewise contrast stretching
  • gray-level slicing
  • bit-plane slicing
  • histogram of pixel values
  • histogram-based thresholding "triangle"
  • histogram equalisation

    • continuous; discrete
    • constrained
  • histogram matching

    • continuous, discrete
  • arithmetic and logical operations
  • averaging
neighourhood operations

\[T: \mathbb{R^n} \rightarrow \mathbb{R}\quad g(x,y) = T(f(x,y),f(x+1,y),f(x-1,y),...\]

  • convolution

    • linear, shift-invariant
    • properties:

      • commutativity \[f_1 * f_2 = f_2 * f_1 \]
      • associativity \[f_1 * (f_2 * f_3) = (f_1 * f_2) * f_3 \]
      • distributivity \[f_1 * (f_2 * f_3) = f_1 * f_2 + f_1 * f_3\]
      • multiplicativity \[a(f_1*f_2) = (a f_1) * f_2 = f_1 * (a f_2) \]
      • derivation \[(f_1 * f_2)' = f_1'*f_2 = f_1*f_2' \]
      • theorem \[f_1 * f_2 \iff \hat{f_1} \hat{f_2}\] convolution in spatial domain amounts to multiplication in spectral domain
  • spatial filtering
  • linear shift-invariant operations
  • border problem

    • padding: add more pixels with value 0
    • clamping: repeat all border pixel values indefinitely
    • wrapping: copy pixel values from opposite sides
    • mirroring: reflect pixel values across borders
filtering methods
  • uniform filter

    • smoothing
  • gaussian filter

    • separable and circularly symmetric; the only such filter
    • optimal joint localisation in spatial and frequency domain
    • fourier transform (ft henceforth) is also a Gaussian
    • n-fold convolution of any low-pass filter converges to a Gaussian
    • infinitely smooth, so infinite derivatives exist
    • good at keeping small objects (better than median). it is a smoothing filter.
  • median filter

    • order-statistic filter
    • sorts, then takes median
    • can eliminate salt and pepper noise (which are just isolated intensity spikes)
    • nonlinear filter
    • better than gaussian at removing small objects
  • smoothing

    • image blurring, noise reduction
  • differentiation

    • forward, backward, central difference (finite differences because images are discrete)
  • separability

    • improves computation efficiency
    • examples: uniform, prewitt, sobel, gauss
  • pooling

    • max / min / average
    • makes image smaller
    • combines filtering and downsampling in one operation
image enhancement
  • sharpening

    • subtract Gaussian filtered from image, then add the produced "high-frequencies" back into the image.
    • can also use the laplacean: \(\nabla^2 f = f_{xx} + f_{yy} \) by subtracting it from the original image: \[f(x,y) - \nabla^2 f(x,y) \]
  • unsharp masking

    • ?
  • gradient vector & magnitude

    • \[\nabla f(x,y) = [f_x(x,y), f_y(x,y)]^T \]
    • \[||\nabla f(x,y) || = \sqrt{f_x^2(x,y),f_y^2(x,y)} \]
  • edge detection

    • use laplacean or intensity gradient

transform domain

  • high frequency -> rapidly changing intensities across pixels
  • low frequency -> large scale image structures
  • we process images in the frequency domain by first applying the Fourier transform
Fourier Transform
  • interpretations:

    • frequencies correspond to patterns
    • $F(0,0)$ is the total intensity over all pixels of the image
    • noise (typically) corresponds to fluctuations in the highest frequencies
  • notation:

    • $f(x)$ is the spatial input function
    • $F(u)$ is the Fourier transform
    • $e^{i\omega x} = \cos(\omega x) + i\sin(\omega x) $
    • $\omega = 2\pi u$ is radial frequency
    • $u$ is spatial frequency
  • forward fourier transform \[F(u) = \int^\infty_{-\infty} f(x)\; e^{\displaystyle -i 2\pi u x}\,\mathrm{d}x\]
  • inverse fourier transform \[f(x) = \int^\infty_{-\infty} F(u)\; e^{\displaystyle i 2\pi u x}\,\mathrm{d}u\]
  • properties:
Property Spatial Frequency
Superposition $f_1(x) + f_2(x)$ $F_1(u) + F_2(u)$
Translation $f(x-\Delta x)$ $F(u)e^{-i 2\pi u\Delta x}$
Convolution $f(x)*h(x)$ $F(u)H(u)$
Correlation $f(x) \otimes h(x)$ $F(u)H^*(u)$
Multiplication $f(x)h(x)$ $F(u)*H(u)$
Scaling $f(ax)$ $F(u/a)/a$
Differentiation $f^{(n)}(x)$ $(i2\pi u)^n F(u)$
  • 2D:

    • forward fourier transform \[F(u,v) = \int^\infty_{-\infty}\int^\infty_{-\infty} f(x,y)\; e^{\displaystyle -i 2\pi (ux+vy)}\;\mathrm{d}x\,\mathrm{d}y\]
    • inverse fourier transform \[f(x,y) = \int^\infty_{-\infty}\int^\infty_{-\infty} F(u,v)\; e^{\displaystyle -i 2\pi (ux+vy)}\;\mathrm{d}u\,\mathrm{d}v\]
    • $f \leftrightarrow F$: fourier transform pair
    • $F = R + i I$: real plus imaginary part
    • $|F| = \sqrt{R^2 + I^2}$: Magnitude
    • $\phi = \arctan(\frac{I}{R})$: Phase
  • Discrete:

    • forward \[F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\;e^{\displaystyle -i 2 \pi (\frac{ux}{M} + \frac{vy}{N})} \] for $u=0... M-1$ and $v = 0... N -1$
    • inverse \[f(x,y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u,v)\;e^{\displaystyle i 2\pi (\frac{ux}{M} + \frac{vy}{N})} \] for $x=0... M-1$ and $y = 0... N -1$
filtering
  • procedure:

    1. multiply input image $f(x,y)$ by $(-1)^{x+y}$ to ensure centering $F(u,v)$
    2. compute the transform $F(u,v)$ from image $f(x,y)$ using 2D DFT
    3. multiply $F(u,v)$ by a centred filter $H(u,v)$ to obtain result $G(u,v)$
    4. compute the inverse 2D DFT of $G(u,v)$ to obtain the spatial result $g(x,y)$
    5. take the real component of $g(x,y)$ (imaginary component is zero)
    6. multiply the result by $(-1)^{x+y}$ to remove the pattern introduced in step 1^^
convolution theorem (how does this relate to convolution?)
  • filtering in the frequency domain can be computationally more efficient
  • more intuitive in freq dom. i.e:

    • low-pass = keep low frequencies, but attenuate high frequencies
    • high-pass = keep high freq, reduce low freq
    • band-pass = keep frequencies in a given band. attenuate the rest
    • take inverse to get the corresponding spatial filter
  • notch filtering = opposite of band-pass; attenuates a given range.
  • difference of Gaussians is a high-pass
  • gaussian filter = low-pass
  • image pyramids is for multi-resolution
  • approximation = ?
  • reconstruction = ?

Feature Representation

  • image features are vectors that are a compact representation of images. i.e. blobs, edges, corners, etc.
  • more efficient and robust way to represent images. also useful for further processing: object detection, image segmentation, classification, retrieval, stitching and object tracking.
  • note that pixel values are highly redundant and victim to light intensity, colour, angle changes, camera orientation
  • we wish for features to be: reproducible, salient and compact (aka robust, descriptive, efficient)

colour features

  • colour is easy to compute
  • invariant to image scaling, translation, rotation
colour histogram

(TODO add image demo?)

  • represent the global distribution of pixel colours in an image
  • step 1: construct a histogram for each colour channel (R, G, B)
  • step 2: concatenate the histogrames (vectors) of all channels as the final feature vector.
colour moments
  • moments based representation of colour distributions
  • gives a feature vector of only 9 elements (for RGB)
  • lower representation capability than above histogram

texture features

  • visual characteristics and appearance of objects
  • a powerful discriminating feature for identifying visual patterns
  • encodes properties of structural homogeneity beyond colour or intensity
haralick features
  • array of statistical descriptors of image patterns
  • captures spatial relationship between neighbouring pixels
  • step 1: construct the gray-level co-occurence matrix (GLCM) - representing the frequency of peixel intensity pairs occurring at a specific offset and direction
  • step 2: compute the Haralick feature descriptors from the GLCM - that summarises texture information (how pixel intensities are spatially related)
  • often used in practice due to their simplicity and interpretability.
glcm method
  • TODO not sure of best explanation
haralick descriptors
  • TODO seems tacky and long.
local binary patterns
  • describe the spatial structure of local image texture

algorithm:

  • divide the image into cells of $N\times N$ pixels ($N=16$ or $N=32$)
  • compare each pixel in a given cell to each of its 8 neighbouring pixels
  • if the neighbour's value is greater than or equal to the centre, write 1; otherwise write 0.
  • this gives an 8-digit binary pattern per pixel, representing a value in the range $0...255$
  • count the number of times each 8-digit binary number occurs in the cell
  • this gives a 256-bin histogram (also known as the LBP feature vector)
  • combine the histograms of all cells of the given image
  • this gives the image-level LBP feature descriptor

TODO insert example

  • LBP can be multiresolution and rotation-invariant

    • in the case of multiresolution, you vary the distance between the centre pixel and neighbouring pixels and vary the number of neighbouring pixels
    • for rotation-invariance: vary the way of constructing the 8-digit binary number by performing bitwise shift to derive the smallest number

      • note: not all patterns have 8-shifted variants (i.e. 11001100 has only 4)
      • reduces LPB feature dimension from 256 to 36
scale-invariant feature transform
  • describes texture in a localised region around a keypoint
  • invariant to scaling, rotation, shift.
  • robust to affine distortion and illumination changes
algorithm
  • Scale-Space Extrema Detection: find maxima/minima in DoG images across scales
  • Keypoint Localisation: discard low-contrast keypoints and eliminate edge responses
  • Orientation Assignment: achieve rotation invariance by orientation assignment (make histogram of local gradient vectors)
  • Keypoint Descriptor: compute gradient orientation histograms
descriptor matching

nearest neighbour distance ratio (NNDR) \[NNDR = \frac{d_1}{d_2} = \frac{||D_a -D_B||}{||D_A-D_C||}\]

  • distance $d_1$ is to the first nearest neighbour
  • distance $d_2$ is to the second nearest neighbour
  • nearest neighbours in 128D feature space
  • reject matches with NNDR > 0.8
  • translation and rotation are rigid transformations
  • scaling, affine, perspective are nonrigid transformations
  • TODO optional: add matrices
fitting and alignment
  • Least Squares (LS) fitting of corresponding keypoints $(x_i,x_i')$

    • find parameters $p$ that minimise the squared error E \[E = \sum_i ||T(\mathbf{x}_\mathbf{i}); \mathbf{p) - \mathbf{x}_\mathbf{i}'}||^2\]
  • RANdom SAmple Consensus (RANSAC) fitting

    • least-squares is hampered by outliers
    • better use a subset of the data and check inlier agreement
    • RANSAC does this in an iterative way to find the optimum
    • algorithm:

      1. sample (randomly) the number of points required to fit the model
      2. solve for the model parameters using the samples
      3. score by the fraction of inliers within a preset threshold of the model
      • repeat 1-3 until the best model is found with high confidence
  • TODO practise problem on solving the "transformation" given matched points A and B.

feature encoding

  • Bag-of-Words (BoW) takes variable number of local image features and encodes them into a fixed-dimensional histogram. it works in general, but for SIFT too.
  • Algorithm:

    • extract local SIFT keypoint descriptors from training images
    • create the "vocabulary" from the set of SIFT keypoint descriptors

      • use K-means clustering

        • partitions the training data into $k$ categories
        • algorithm:
    • initialise: $k$ cluster centres (randomly)
    • iterate:

      1. assign data (feature vectors) to the closest cluster (Euclidean distance)
      2. update cluster centres as the mean of the data samples in each cluster
    • terminate:

      • when converged or the number of iterations reaches the maximum
      • this vocabulary represents the categories of local descriptors
      • cluster centres are the "visual words" in this "vocabulary" used to represent an image
      • each local feature descriptor is assigned to one visual word with the smallest distance
      • compute the number of local image feature descriptors assigned to each visual word
      • concatenate the numbers into a vector which is the "BoW" representation of the image
  • local features (that BoW takes in) can be LBP, SURF, BRIEF, ORB

    • BoW in turn can be replaced with VLAD, Fisher Vector

shape features

  • essential characteristic of material objects
  • typically extracted after image segmentation
  • can be used to identify and classify objects
  • challenges

    • invariant to rigid transformations
    • tolerant to non-rigid deformations
basic shape features
  • net area; principal axes; convex area
  • convexity, concavity; convex hull ; convex deficiency (set difference between the convex hull and the object)
  • compactness; circularity

    • ^inversely related
    • compactness: ratio of the area of a circle with the same perimeter as the object to the area of the object
    • circularity: ratio of $4\pi$ times the area of an object to the second power of its perimeter ($4\pi A/p^=1$ for a circle)
  • elongation; eccentricity

    • elongation: ratio between the length and width of the object's bounding box
    • eccentricity: ratio of the length of the minor axis to the length of the major axis
boundary descriptors (TODO)
  • chain code descriptor
  • local curvature descriptor
  • global curvature descriptors:

    • total bending energy $B=\oint_C K^2\; \mathrm{d}s$
    • total absolute curvature $K=\oint_C |K(s)|\;\mathrm{d}s$
  • radial distance descriptor
shape context
  • is a point-wise local feature descriptor

    • pick $n$ points $p_i$ on the contour of a shape
    • for each point, create a radial coordinate system centred at this point and compute a histogram $h_i$ based on the relative coordinates of the other $n-1$ points
    • this is the shape context of $p_i$
  • e.g. shape matching ALGORITHM TODO; TODO graphic
histogram of oriented gradients (HoG)
  • describes the distributions of gradient orientations in localised areas
  • does not require initial segmentation
  • algorithm:

    1. calculate the gradient vector at each pixel

      • gradient magnitude
      • gradient orientation
    2. construct the gradient histogram of all pixels in a cell

      • divide orientations into $N$ bins (typically $N=9$ bins evenly splitting 180 degrees)
      • assign the gradient magnitude of each pixel to the bin corresponding to its orientation
    3. generate detection-window level HOG descriptor

      • concatenate cell histograms
      • block-normalise cell histograms

sliding window ? TODO

  • use-case: detecting humans in images.

Pattern Recognition

  • automatically recognise patterns and regularities in data.

    • object recognition; text classif; speech recognition; event detection; recommender systems
  • the different learning paradigms:

    • supervised learning
    • unsupervised learning
    • semi-supervised learning

      • uses labelled and unlabelled data
    • weakly supervised learning

      • noisy / limited / imprecise supervision signals in learning

concepts

  • objects: identifiable physical entities of which images are taken
  • regions: correspond to objects after image segmentation
  • classes: disjoint subsets of objects sharing common features
  • labels: associated with objects and indicate to which class they belong
  • classification:process of assigning labels to ojbects based on features
  • classifiers: algorithms / methods performing the classication task
  • patterns: are regularities in object features and are used by classifiers
  • pre-processing: aims to enhance images for further processing
  • feature extraction: reduces the data by measuring certain properties
  • feature descriptors: aims to keep only the most descriptive features
  • models: are (mathematical or statistical) descriptions of classes
  • training samples: objects with known labels used to build models
  • cost: consequence of making an incorrect decision / assignment
  • decision boundary: demarcation between regions in feature space

supervised learning

classification
nearest class mean
  • like k-means but without the iteration
  • would depend strongly on initialised parameters
  • algorithm:

    • training: given training sample pairs ${(x_1,y_1),(x_2,y_2),...,(x_n,y_n)}$, the centrodi for each class $k$ is obtained as \[\mu_k = \frac{1}{|c_k|}\sum_{x_i\in c_k}x_i\]
    • testing: each unknown object with feature vector $x$ is classified at class $k$ if $x$ is closer to the centroid of class $k$ than to any other class centroid
  • works well when classes are compact and far from each other
  • cannot handle outliers and noisy data well
  • no good for complex classes (multimodal, non-spherical)
k-nearest neighbours
  • euclidean distance used for continuous variables
  • hamming distance for discrete variables
  • decision surfaces are non-linear
  • cons:

    • slow for big datasets (non-parametric)
    • needs homogeneous feature types and scales
    • number of variables $>> \implies$ curse of dimensionality
    • hyper-parameter tuning of $k$ can be annoying
bayesian decision theory
  • introduces prior knowledge
  • assigns class to which it most likely belongs based on observed features
  • assume the following to be known:

    • prior probability $p(c_i)$ for each class $c_i$
    • class conditional distribution $p((x|c_i)$
  • compute the posterior probability $p(c_i|x)$ as follows:

    • if all the classes are disjoint, by Bayes Rule, the posterior probabilities are given by \[p(c_i|x) = \frac{p(x|c_i)p(c_i)}{p(x)}\] \[\boxed{\begin{align}p(x,c_i) &= p(x|c_i)p(c_i) = p(c_i|x)p(x) \\ p(x) &= \sum_j p(x,c_j) = \sum_j p(x|c_j) p(c_j)}\]
  • the bayesian decision rule is: $c=\mathrm{arg}\max_i (p(x|c_i))$

    • which is equivalent to \[c=\mathrm{arg}\max_i (p(x|c_i)p(c_i))\] \[c(c_i|x) = \frac{p(x|c_i)p(c_i)}{p(x)}\propto p(x|c_i)p(c_i)\]
  • unusually, in the running example: uniform priced fish $\implies$ we should maximise the posterior probability and prices unequal $\implies$ we should minimise the loss.
  • pros:

    • simple and efficient
    • considers uncertainties
    • permits new information updates
  • cons:

    • struggles with complex data relationships
    • choice of priors can be subjective
decision trees
  • nominal data = categorical
  • entropy: of a set of events $y = \set{y_1,y_2,...,y_n}$ is: \[H(y) = \sum_{i=1}^n -p(y_i) \log_2(p(y_i))\] where $p(y_i)$ is the probability of event $y_i$

    • may be viewed as the average uncertainty of the information source

      • $H=0$ means source has no uncertainty
      • $H>0$ if the source information is uncertain
  • algorithm:

    1. select a feature to place at the node
    2. make one branch for each possible value (nominal) or range(numerical)
    3. for each branch node, repeat steps 1 and 2 using only those samples that actually reach the branch
    4. when all samples at a node have the same classifiaciton (or label), stop growing that part of the tree
  • information gain: \[IG(S,F) = H(S) - H(S|f)\]

    • \[IG(S,f) = \mathrm{Entropy}(S) - \sum_{f_a\in \mathrm{values}(f)}\frac{|S_{fa}|}{S}\mathrm{Entropy}(S_{fa})\]
    • use the feature with highest information gain to split on
  • pros:

    • interpretable
    • numerical, categorical data
    • robust to outliers and missing values
    • feature selection (helps determine the more important features)
  • cons:

    • overfitting
    • greedy algorithm
ensemble learning, random forests
  • training:

    1. let $N$ be number of training instances and $M$ the number of features
    2. sample $N$ instances at random with replacement from the original data
    3. at each node select $m << M$ features at random and split on the best feature
    4. grow each tree to the largest extent possible (no pruning)
    5. repeat $B$ times. keep the value of $m$ constant during the forest growing
  • testing:

    1. push a new sample down a tree and assign the label of the terminal node it ends up in
    2. iterate over all trees in the ensemble to get $B$ predictions for the sample
    3. report the majority vote of all trees as the random forest prediction
  • error rate depends on 2 factors:

    1. correlation between any two trees in the forest.

      • increased correlation increases the forest error rate; uncorrelated trees lead to better generalisation
    2. strength of each individual tree in the forest

      • strong tree has low error rate
      • increasing the strength of individual trees decreases the forest error rate
  • selecting parameter $m$:

    • reducing $m$ reduces both the correlation and the strength
    • increasing $m$ increases both the correlation and the strength
    • somewhere in between is an "optimal" range
  • pros:

    • high accuracy
    • efficient and effective on large datasets
    • can handle thousands of input features without feature selection
    • handles missing values
  • cons:

    • less interpretable than an individual decision tree
linear classification
  • \[f(x) = W^T x+ b\]
support vector machines
  • maximise margin - the distance to the closest sample
  • primal (hard-margin) \[\begin{equation} \min_{\mathbf{w},b} \frac{1}{2} ||\mathbf{w}||^2_2\\ \text{s.t.}\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad \forall i \end{equation}\]
  • distance between a point to a hyperplane \[d = \frac{|\mathbf{w}^T\mathbf{x}'+b|}{||\mathbf{w}||_2}\]
  • soft-margin primal \[\begin{equation} \min_{\mathbf{w},b} \frac{1}{2} ||\mathbf{w}||^2_2 + C \sum_i \xi_i\\ \text{s.t.}\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1-\xi_i, \quad \forall i \\ \xi_i \geq 0 \end{equation}\]

    • small $C$: more tolerance on miss-classified samples for larger margin
    • large $C$: focus on avoiding mistakes at the expense of smaller margin
    • $C$ to infinity means going back to the hard margin SVM
    • still a quadratic programming optimisation problem
  • pros:

    • effective in high-dimensional feature spaces
    • effective when number of features is larger than the training data size
    • good whenn classes are separable
    • works well when data is sparse
    • can be extended to non-linear with the kernel trick
  • cons:

    • does not perform well for overlapping classes
    • hyperparameter tuning
multiclass classification
  • one versus rest: builds one classifier for one class versus the rest and assigns a test sample to the class that has the highest confidence score
  • one versus one: builds one classifier for every pair of classes and assigns a test sample to the class that has the highest number of predictions
classification performance metrics
ROC curve
  • receiver operating curve
  • relates the false positive to the true positive rate
  • generally, false alarms go up with attempts to correctly detect higher percentages of known objects
  • AUC (area under the ROC) summarises the overall performance
Confusion Matrix
  • matrix whose entry $(i,j)$ records the number of times an object of class $i$ was classified as class $j$
  • diagonal entries indicate success
  • reproduce the binary classification matrix TODO
Precision vs. Recall
  • \[\text{accuracy} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} \quad \left ( \frac{\text{correct}}{\text{total}} \right)\]
  • Precision
  • Recall
  • F1 Score
  • others? TODO
regression
linear regression
  • \[f(x) = xW\]
least-squares regression
  • \[\text{RSS}(W) = \sum_{i=1}[y_i-f(x_i)]^2 = (Y-XW)^T(Y-XW)\]
  • $\hat{W} = \text{arg}\min_w \text{RSS}(W) = (X^TX)^{-1}X^TY$
regression performance evaluation

\[\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N (y_i - \hat{y_i})^2}\]

  • represents the standard deviation of the ppredicted values from the observed values

\[\text{MAE} = \frac{1}{N}\sum_{i=1}^N |y_i - \hat{y_i}|\]

  • represents the average of the absolute differences between the predicted and observed values
  • RMSE penalises big differences between predicted values and observed values more heavily.

\[R^2 = 1 - \frac{\sum_{i=1}^N(y_i-\hat{y_i})^2}{\sum_{i=1}^N (y_i - \bar{y})^2}\]

  • indicates how well the selected features explain the output variable
  • tends to always increase by adding extra features

normalisation: \[\frac{x-x_\text{min}}{x_\text{max} - x_\text{min}}\]

cross validation: data is split into $k$ subsets (folds) and at each iteration we keep one fold out for testing and use the rest for training.

  • this is repeated $K$ times until all folds have been used once as the test set.

Image Segmentation

basic segmentation

thresholding
  • works fine if regions have sufficiently different intensity distributions
  • problematic if regions have overlapping intensity distributions
k-means
  • may work if the number of clusters is known a priori
  • problematic if not
feature-based pixel classification
  • extract a patch around each pixel and compute its features
  • classify each pixel based on its features using a trained classifier

advanced segmentation

region splitting * and merging
  • recursively split the whole image into pieces based on local statistics
  • recursively merge pieces together (in a hierarchical fashion)
  • combine splitting and merging sequentially
connectivity
  • 4-connectivity in 2d
  • 8-connectivity in 2d
  • 3d: 6-connected, 18, 26
  • connected components labelling algorithm:

    • first pass:

      1. check each pixel (top left to bottom right)
      2. if an object pixel, check its neighbours ($N_4$ or $N_8$)
      3. if no neighbours have labels, assign a new label
      4. if neighbours do have labels, assign the smallest
      5. record label equivalences while assigning
    • second pass

      1. check each pixel
      2. replace each label with its smallest equivalent
      3. all background pixels default to the zero label
  • merging by region growing:

    • define a similarity measure
    • start from one seed pixel for the region
    • add neighbouring pixels to the region if they are similar
    • repeat previous step until no more pixels are similar
watershed segmentation
  • based on the analogy of immersion of a topographic surface
  • meyer's flooding algorithm:

    1. choose a set of markers to start the flooding

      • i.e. local minima. give each a different label
    2. put neighbouring pixels of each marker into a priority queue

      • a pixel with more similar gray value has higher priority
    3. pop the pixel with the highest priority level from the queue. if the neighbours of the popped pixel that have already been labelled all have the same label, then give the pixel that same label. put all non-labelled neighbours that have never been in the queue into the queue
    4. repeat step 3 until the queue is empty
  • preprocessing:

    • invert the image or compute edges if needed to get local minima markers
  • images often have many local minima, leading to heavy oversegmentation
  • preprocessing (image smoothing) may be needed to reduce false minima
  • postprocessing (basin merging) may be needed to reduce fragmentation
  • many different implementations and pre/postprocessing criteria exist
  • it is possible to start from user-defined markers instead of local minima
maximally stable extremal regions
  • try multiple thresholds and analyse the shape of the connected components
  • select regions with virtually constant shape over many thresholds
mean shifting
  • seeks stationary points (peaks / modes) in a density function
  • attempts to find all possible cluster centers in feature space
  • does not require knowing the number of clusters a priori
  • is a variant of iterative steepest-ascent method
  • algorithm:

    1. initialise a random seed point $x$ and window $N$
    2. calculate the mean (centre of gravity) $m(x)$ within $N$ \[m(x)=\frac{\sum_{x_i\in N(x)} K(x_i-x)x_i}{\sum_{x_i\in N(x)} K(x_i-x)}\qquad K(x) = \exp(-|x|^2)\] \[\text{Mean (centre of gravity)}\qqquad\text{Kernel (weight function)}\]
    3. shift the search window to the mean
    4. repeat step 2 until convergence
  • advantages:

    • model-free (does not assume any prior shape on data clusters)
    • has just a single parameter (window size)
    • finds variable number of modes (clusters)
    • robust to outliers
  • limitations:

    • computationally expensive (need to shift many windows)
    • output depends on window size parameter value
    • window size (bandwidth) selection is not trivial
    • does not scale well with dimensionality of feature space
superpixel segmentation
  • improves efficiency:

    • group similar pixels into a superpixel
    • superpixels together are an oversegmentation of the image
    • ultimate segmentation (classification, mergingg) performed on superpixels
  • simple linear iterative clustering (SLIC)

    • popular superpixel generation algorithm
    • preserves object boundaries, is fast and memory efficient
    • algorithm:

      1. initialise cluster centres $C_j$ on pixel grid with step size $S$
      2. move $C_j$ to position in $3\times 3$ window with smallest gradient
      3. compute distance $D_{ij}$ for each pixel $i$ in $2S\times 2S$ window around $C_j$
      4. assign each pixel $i$ to the cluster $C_j$ with smallest distance $D_{ij}$
      5. recompute cluster centres as mean colour and position of pixels in $C_j$
      6. iterate (go to step 3) until the residual error is small \[\begin{align} D &= \sqrt{\frac{d_{\text{lab}^2}}{m^2} + \frac{d_{xy}^2}{S^2}}\quad\text{(weight $m$ controls influence of colour over spatial distance)}\\ d_\text{lab} &= \sqrt{(l_j-l_i)^2 + (a_j-a_i)^2 + (b_j-b_i)^2}\quad\text{(distance in CIELAB space)}\\ d_{xy}=\sqrt{(x_j-x_i)^2+(y_j-y_i)^2}\quad\text{(distance in pixel space)} \]
conditional random field
  • superpixels provide a basis for further segmentation

    • determine spatial relationship between the superpixels
    • compute similarities between superpixels
    • group superpixels to form larger segments
  • conditional random field (CRF) approach

    • a probabilistic graphical model that encodes the relationships between observations (i.e. superpixels) and constructs a consistent interpretation (i.e. a segmentation) for a set of data (i.e. an image)
  • nodes: superpixels (value based on features of superpixels)
  • edges: adjacency (value based on similarity between superpixels)
  • formulated as an energy minimisaion problem: \(E(s,c) = \sum_i \psi (s_i,c_i) \sum_{ij}\Psi (s_i, s_j)\)

    • unary potentials $\psi$:

      • data term based on graph node values
      • computes the cost of superpixel $s_i$ belonging to class $c_i$
      • a lower cost means a higher likelihood of $s_i$ belonging to $c_i$
      • can be obtained via superpixel classification
    • pairwise potentials $\Psi$:

      • smoothness term based on graph edge values
      • computes a cost of neighbourhood consistency
      • a cost is assigned if adjacent superpixels are assigned to different classes
      • higher similarity results in a lower cost (0 if assigned to the same class)
  • graph cutting by min-cut / max-flow algorithm
active contour segmentation (snakes method)
  • a contour-based approach to object segmentation
  • aims to locate object boundaries in images by curve fitting
  • represents the curve by set of control points and interpolation
  • iteratively moves the control points to fit the curve to the object
  • uses image, smoothness and user-guidance forces along curve
  • a.k.a snakes:

    • smoothly follows high intensity gradients at the object boundary
    • bridges areas of noise or missing gradients using smooth interpolation
  • missing gradient problem?
level-set segmentation
  • active contours / snakes are parametric models

    • explicit representations of the object boundaries
    • typically requires manual interaction to initialise the curve
    • it is challenging to change the topology of the curve as it evolves
    • curve reparametrisation may be required for big shape changes
  • level-set methods have become more popular alternatives

    • implicit representations of the object boundaries
    • boundaries defined by the zero-set of a higher dimensional function
    • level-set function evolves to make the zero-set fit and track objects
    • easily accommodates topological changes in object shape
    • computationally more demanding than active contours
  • representation:

    • define an initial 3D level-set function
    • zero-level plane represents the 2D shape
    • iteratively deform the function to fit the shape

evaluation

pixel classification
  • segmented object pixels: $S$
  • true object pixels: $T$
  • true positives: $TP = S \cup T$

    • pixels correctly segmented as object
  • true negatives: $TN = S^c\cup T^c$

    • pixels correctl segmented as background
  • false positives: $FP=S\cap T^c$

    • pixels incorrectly segmented as object
  • false negatives: $FN=S^c \cap T$

    • pixels incorrectly segmented as background
quantitative evaluation metrics
  • sensitivity (= true-positive rate)

    • fraction of the true object that is correctly segmented \[\text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}}\]
  • specificity (= true-negative rate)

    • fraction of the true background that is correctly segmented \[\text{TNR} = \frac{\text{TN}}{\text{TN}+\text{FP}}\]
  • precision (= positive predictive value)

    • fraction of the segmented object that is correctly segmented \[\text{P} = \frac{\text{TP}}{\text{TP}+\text{FP}}\]
  • F-measure (= harmonic mean of precision and recall) \[\text{F1} = \frac{2\text{RP}}{\text{R}+\text{P}}\]
  • recall (= sensitivity)

    • fraction of the true object that is correctly segmented \[\text{R} = \frac{\text{TP}}{\text{TP}+\text{FN}}\]
  • Jaccard similarity coefficient (JSC)

    • intersection over union (IoU) = correctly segmented fraction of the union of the segmented object and the true object \[\text{JSC} = \frac{S\cap T}{S\cup T} = \frac{\text{TP}}{\text{FP}+\text{TP}+\text{FN}}\]
  • Dice similarity coefficient (DSC)

    • correctly segmented fraction of the segmented object set joined with the true object set \[\text{DSC} = \frac{2|S\cap T|}{|S|\cup |T|} = \frac{2\text{TP}}{\text{FP}+2\text{TP}+\text{FN}}\]
receiver operating characteristics
  • plots the true-positive rate (sensitivity) versus the false positive rate (one minus the specificity) of a method as a function of its free parameters
  • auc

    • higher is better.
    • ranges from 0.0 to 1.0. can technically go negative, but maybe just invert your work at that point.

morphology

  • nonlinear, set‑theoretic post/pre‑processing to clean noise, split/merge objects, fill holes, extract shapes, measure distances. 

basic set ops

  • translation \(A_c = \{\,x + c \mid x \in A\,\}\)
  • reflection \(A^{r} = \{\, -x \mid x \in A\,\}\)
  • complement \(A^{c} = \{\,x \mid x \not\in A\,\}\)
  • union \(A \cup B = \{\,x \mid x \in A \text{ or } x \in B\,\}\)
  • intersection \(A \cap B = \{\,x \mid x \in A \text{ and } x \in B\,\}\)
  • difference \(A - B = \{\,x \mid x \in A \text{ and } x \not\in B\,\}\)
  • size \(|A|\) = number of elements

binary morphology (thresholded imgs)

core tools
  • dilation \(I \oplus S = \{\,x \mid (S^{r})_x \cap I \neq \varnothing\,\}\)
  • erosion \(I \ominus S = \{\,x \mid S_x \subseteq I\,\}\)
  • default struct elem: symmetric \(3 \times 3\); use decomposition / iteration for bigger ones
compound tweaks
  • opening \(I \circ S = (I \ominus S)\, \oplus S\) -> drops tiny foreground blobs
  • closing \(I \bullet S = (I \oplus S)\, \ominus S\) -> fills tiny background gaps
edges & outlines
  • morph gradient \((I \oplus S) - (I \ominus S)\) -> outer + inner edge
  • 1‑px outline \((I \oplus S) - I\)
reconstruction tricks
  • keep chosen objs: repeatedly dilate marker inside mask until no change
  • kill boundary objs: marker = border; reconstruct; subtract result
  • fill holes: work on complement, reconstruct from border, then take complement back
shape analysis
  • distance transform: count erosion steps per pixel
  • ultimate erosion: local maxima of dist map -> object centres
  • ultimate dilation (no‑merge) -> voronoi tessellation
  • skeleton: iterative thinning that keeps connectivity
n‑d notes
  • swap \(3 \times 3\) for \(3 \times 3 \times 3\) voxels, etc.; everything generalises

grey‑scale morphology (pre‑process)

umbra view
  • treat an \(n\)-d image as an \((n+1)\)-d binary volume under its surface
core ops (flat \(S\))
  • dilation \(I \oplus S = \max_{p \in S}\, I(x - p)\)
  • erosion \(I \ominus S = \min_{p \in S}\, I(x + p)\)
filters
  • opening / closing: size‑selective smoothing (remove bright / dark specks)
  • morph gradient \(D - E\); outer / inner variants give one‑sided edges
  • laplacian \(D + E - 2I\)
  • white top‑hat \(I - (I \circ S)\) (bright peaks)
  • black top‑hat \((I \bullet S) - I\) (dark wells)

cheat‑sheet

  • binary morphology (post‑seg): hole fill, contour, split touchers, shape & dist
  • grey morphology (pre‑seg): kill noise & shading, emphasise structures

todo

  • tikz: struct elem, opening/closing demo, voronoi, skeleton
  • org‑babel python: distance‑transform example

Deep Learning

stepping stones

  • linear model: \( \hat y = W x + b \)
  • multilayer perceptron adds hidden layers and nonlinear activation \( \phi \)
  • convolution shares weights in space, giving sparse connections and translation invariance
  • parameter count per conv: \( (k\_h \, k\_w \, c\_{in} + 1) \, c\_{out} \)

core cnn layers

  • conv: kernel \(k\_h \times k\_w\), stride \(s\), pad \(p\)

    • output height \(h\_{out} = 1 + \frac{h + 2p - k\_h}{s}\)
  • pooling: max or average, reduces resolution
  • batch norm: \( \hat x = (x-\mu)/\sqrt{\sigma^{2}+\epsilon} \) then scale and shift
  • activation: sigmoid, tanh, ReLU (default), leaky ReLU, ELU, GELU
  • dropout: zero units with prob \(1-p\) at train time
  • transposed conv and unpooling learnable upsampling
  • dilated conv increases receptive field without extra weights

data sets

  • mnist 28x28 digits, 60k train
  • cifar‑10 32x32 rgb, 50k train
  • imagenet 224x224 crops, 1.3m train, 1000 classes
  • coco detection / segmentation, 80 classes

landmark architectures

lenet‑5 1989
  • 2 conv + subsample then 3 fully connected
alexnet 2012
  • 5 conv + 3 fc, relu, lrn, dropout, heavy data aug
vgg‑16 2014
  • stacks of 3x3 conv, about 144m params
inception (googlenet) 2014
  • parallel 1x1, 3x3, 5x5, pool paths then concat
resnet 2015
  • residual block \(y = F(x) + x\), allows 50+ layers
senet 2017
  • squeeze‑excitation channel reweighting
densenet 2017
  • dense block concatenates all previous features
efficientnet 2019
  • compound scaling of depth, width, resolution
vision transformer 2020
  • image patches to tokens, pure self‑attention
convnext 2022
  • convnet with transformer design cues (patchify stem, layer norm)

training workflow

  • split data 70 train / 10 val / 20 test, avoid leakage
  • loss

    • classification: softmax cross‑entropy \(L = -\sum y\log p\)
    • detection: add smooth‑L1 bbox term
    • seg: dice or cross‑entropy
  • metrics

    • classification

      • accuracy \( (tp+tn)/(tp+tn+fp+fn) \)
      • precision \( tp/(tp+fp) \)
      • recall \( tp/(tp+fn) \)
      • f1 \( 2pr/(p+r) \)
    • segmentation (nested list)

      • dice \( 2|x\cap y|/(|x|+|y|) \)
      • iou (jaccard) \( |x\cap y|/|x\cup y| \)
      • mean iou = average over classes
    • detection

      • mAP, AP, precision‑recall curve
  • optimisers

    • sgd, momentum, nesterov, adagrad, adadelta, adam, rmsprop
  • schedules: step, cosine, one‑cycle
  • overfit cures

    • more data, early stop, data aug, weight decay, dropout, batch norm

transfer learning

  • load imagenet pretrained weights
  • freeze feature extractor, replace task head
  • fine tune later layers with small learning rate

object detection families

  • two stage: r‑cnn, fast, faster (adds rpn)
  • one stage: ssd, yolo, retinanet (focal loss)

segmentation architectures

  • fcn converts classification net to pixel output, uses skip fusion
  • u‑net encoder decoder with skip concat

    • variants: attention u‑net, resunet, transunet
  • mask r‑cnn adds mask branch on faster r‑cnn

fully convolutional means

  • all layers are convolutional (or up‑convolution) so network outputs a spatial map whose size scales with input instead of a fixed vector

practical checklist

  • verify layer dimensions with formula above
  • start from pretrained backbone, freeze then unfreeze
  • augment: crop, flip, rotate, color jitter, cutout
  • use mixed precision to save memory
  • accumulate gradients for larger effective batch
  • track train and val curves; stop when val loss stops improving

exam practice

  • key benefit of cnns over anns for images: automatically learn hierarchical features
  • transfer learning with cnns: use a pretrained model then fine tune it
  • purpose of convolutional layers: apply learned kernels (dot product) at every spatial location to extract local patterns

Motion and Tracking

homework

  • explain the transforms based on the slides.
  • Image Features: Shape Matching Algorithm; Descriptors