All images are classified to the same size

Simplification
- Black & white: channel = 1
Convolution
- 1st layer: Filter convolute with image → Feature map

- 2nd layer: Filter convolute with feature map

Pooling
After subsampling, the image looks the same

Max pooling
Leave the maximum value


Flatten → 1 dimension
can’t recognize scaling & rotation
Spatial transformation layer → deal with scaling & rotation
Spatial Transformer Layer
Invariant to scaling and rotation
Affine transform
[x′y′] = [kcosθsinθ−sinθkcosθ][xy]+ [ab]