An image is input directly to the network, and this is followed by several stages of convolution and pooling. The convolutional layers serve as feature extractors, and thus they learn the feature representations of their input images. The purpose of the pooling layers is to reduce the spatial resolution of the feature maps and thus achieve spatial invariance to input distortions and translations. Finally, the last fully connected layer outputs the class label.
The ResNet is mainly composed of the residual learning block, as shown in the following figure. ResNet revolutionized the CNN architectural race by introducing the concept of residual learning in CNNs and devised an efficient methodology for the training of deep networks. Similar to Highway Networks, it is also placed under the Multi-Path based CNN. Note that ResNet-18 is used as a baseline.