CNN

CNN = Convolutional Neural Network
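It reduces the number of nodes fed to the dense layers (convolution and pooling shrink the image) and shares filter weights across positions, so far fewer parameters are needed
Tolerates small shifts in images (as pooling is used, a small shift often results in the same pooled output)
Takes advantage of local context: each filter gathers information from a small neighborhood of pixels
The matrix obtained as a result of the convolution operation is called an activation map (also known as a feature map)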
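Typically the ReLU activation function is applied to the activation map
Padding is usually added around the image border so the filter can cover edge pixels and the output does not shrink
Stride is the step size by which the filter moves as it scans through the image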
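
To make the shift tolerance noted above concrete, here is a minimal sketch in plain Python. The 4 x 4 activation maps and the helper name `max_pool_2x2` are made up for illustration; they are not from any library.

```python
def max_pool_2x2(grid):
    """2 x 2 max pooling with stride 2 on a list-of-lists grid (even side lengths assumed)."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]

# Hypothetical activation map with one strong response at row 0, column 0.
original = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The same response shifted one pixel to the right (still inside the same 2 x 2 window).
shifted = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

pooled_original = max_pool_2x2(original)
pooled_shifted = max_pool_2x2(shifted)
print(pooled_original)                    # [[9, 0], [0, 0]]
print(pooled_original == pooled_shifted)  # True
```

The tolerance is only partial: a shift that moves the response across a pooling window boundary does change the pooled output.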
Typical case:
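- Filter size of 2 or 3 (e.g. a 3 x 3 convolution filter, a 2 x 2 pooling window)
- Stride size of 2 (common for pooling; convolution filters more often use a stride of 1)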
- Max pooling
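
How filter size, padding, and stride together determine the output size can be sketched with the standard formula floor((n + 2p - f) / s) + 1. The function name `output_size` and the example values are illustrative assumptions, not part of the original notes.

```python
def output_size(n, f, p=0, s=1):
    """Side length after sliding an f-wide window over an n-wide input.

    n: input side length, f: filter (or pooling window) size,
    p: padding added on each side, s: stride.
    """
    return (n + 2 * p - f) // s + 1

same = output_size(224, f=3, p=1, s=1)    # 3 x 3 filter, padding 1, stride 1
halved = output_size(224, f=2, p=0, s=2)  # 2 x 2 window, no padding, stride 2
print(same, halved)  # 224 112
```

With padding 1, a 3 x 3 filter at stride 1 keeps the side length unchanged, while a 2 x 2 window at stride 2 halves it.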

Steps:

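The filter scans the image left to right, top to bottom
At each position, the dot product of the filter weights and the image pixel values under the filter is computed (element-wise multiplication, then sum)
Pooling then summarizes each local region of the activation map, keeping the strongest responses and reducing its size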
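
The steps above can be sketched in plain Python. This is a minimal illustration, not an efficient implementation; the 5 x 5 image, the 2 x 2 edge filter, and the function names are all made up for the example.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over the image (no padding), left to right, top to bottom."""
    k = len(kernel)
    out = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            # Dot product: element-wise multiplication, then sum.
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

def relu(grid):
    """Replace negative values with zero."""
    return [[max(0, v) for v in row] for row in grid]

def max_pool(grid, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    return [
        [max(grid[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, len(grid[0]) - size + 1, stride)]
        for r in range(0, len(grid) - size + 1, stride)
    ]

# Hypothetical image: a bright vertical stripe (1) on a dark background (0).
image = [[0, 0, 1, 1, 0] for _ in range(5)]
# Hypothetical filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 1],
          [-1, 1]]

activation_map = convolve2d(image, kernel)  # 4 x 4, each row is [0, 2, 0, -2]
pooled = max_pool(relu(activation_map))     # 2 x 2
print(activation_map[0])  # [0, 2, 0, -2]
print(pooled)             # [[2, 0], [2, 0]]
```

The positive response marks the left edge of the stripe; ReLU discards the negative response at the right edge, and pooling shrinks the 4 x 4 map to 2 x 2.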

Common Structure of Vision Models

(Filter -> Pooling) x N -> (Dense Network) x M -> Output Layer

Theoretically, a plain fully connected NN can achieve the same or better results than a CNN. The issue is that even one low-resolution image (224 x 224) has about 50k pixel features per channel (224 x 224 = 50,176, and three times that for RGB), and with dense connections the computation becomes so expensive that it is not feasible.
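
A rough weight count shows the scale of the problem. The hidden layer size (1000) and the number of filters (64) are hypothetical values chosen only for illustration, and biases are ignored.

```python
height, width = 224, 224
inputs = height * width                # pixel features for one grayscale channel
hidden_units = 1000                    # hypothetical size of the first dense layer
dense_weights = inputs * hidden_units  # every input connects to every hidden unit

filter_size = 3
num_filters = 64                       # hypothetical number of convolution filters
# Each filter reuses the same 3 x 3 weights at every image position.
conv_weights = filter_size * filter_size * num_filters

print(inputs)         # 50176
print(dense_weights)  # 50176000
print(conv_weights)   # 576
```

Under these assumptions a single dense layer needs about 50 million weights, while the convolution layer needs only a few hundred.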

So we use convolution filters and pooling to extract local features and reduce the dimension first, and then use the dense layers of an NN on the much smaller result to build the final model.
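
As a rough sketch of that reduction, the trace below follows the side length of a 224 x 224 input through N blocks. It assumes each block uses a padded convolution that keeps the size, followed by 2 x 2 max pooling with stride 2 that halves it; N = 5 is a hypothetical choice.

```python
side = 224
num_blocks = 5  # hypothetical N in (Filter -> Pooling) x N

sides = [side]
for _ in range(num_blocks):
    # Padded convolution keeps the side length; 2 x 2 pooling with stride 2 halves it.
    side = side // 2
    sides.append(side)

positions_per_map = side * side  # values per activation map passed on to the dense layers
print(sides)              # [224, 112, 56, 28, 14, 7]
print(positions_per_map)  # 49
```

Each activation map shrinks from 50,176 positions to 49, so the dense layers see a far smaller input, scaled by however many filters the last block uses.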
