Basic architecture of two-stage detectors, which consists of a region proposal network that feeds region proposals into a classifier and a regressor.
The R-CNN detector consists of four modules. The first generates category-independent region proposals. The second extracts a fixed-length feature vector from each region proposal. The third is a set of class-specific linear SVMs that classify the objects in an image. The last is a bounding-box regressor for precise bounding-box prediction.
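The flow through the four modules can be sketched as a short pipeline. This is only an illustrative skeleton, not the original implementation: the function name rcnn_detect, the toy_* stand-ins, and the rule "keep a region only if its best SVM score is positive" are all assumptions made for the example, and the per-class non-maximum suppression that normally follows is omitted.

```python
def rcnn_detect(image, propose, extract, svms, regress):
    """Run the four R-CNN modules in order (hypothetical interface)."""
    detections = []
    for box in propose(image):                      # module 1: region proposals
        feat = extract(image, box)                  # module 2: fixed-length feature
        scores = {c: svm(feat) for c, svm in svms.items()}  # module 3: per-class SVMs
        label = max(scores, key=scores.get)
        if scores[label] <= 0:                      # assumed rule: treat as background
            continue
        refined = regress(label, feat, box)         # module 4: box regression
        detections.append((label, scores[label], refined))
    return detections


# Toy stand-ins so the sketch runs; real modules would be selective search,
# a CNN, trained linear SVMs, and trained regressors.
def toy_propose(image):
    return [(0, 0, 2, 2), (2, 2, 4, 4)]             # boxes as (x1, y1, x2, y2)


def toy_extract(image, box):
    x1, y1, x2, y2 = box
    pixels = [image[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return [sum(pixels) / len(pixels)]              # one-element "feature vector"


toy_svms = {
    "bright": lambda f: f[0] - 0.5,                 # linear score w*x + b
    "dark": lambda f: 0.5 - f[0],
}


def toy_regress(label, feat, box):
    return box                                      # identity: no refinement


image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
detections = rcnn_detect(image, toy_propose, toy_extract, toy_svms, toy_regress)
```

Passing the modules in as callables mirrors the point of the paragraph: each stage is separate and could be swapped independently of the others.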
In Faster R-CNN, a region proposal network (RPN) shares convolutional layers with the object detection network, which significantly reduces the cost of generating proposals. The RPN and the object classifier are trained jointly on these shared layers. The RPN acts as an attention mechanism: it proposes candidate bounding boxes across a wide range of scales and aspect ratios, which are then evaluated for object classification. In other words, the RPN tells the classifier where to look.
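To make "a wide range of scales and aspect ratios" concrete, the sketch below generates the reference boxes (anchors) that an RPN scores at a single feature-map location. The function name make_anchors is hypothetical, and the defaults of three scales and three aspect ratios are an assumption taken from the commonly cited Faster R-CNN configuration, not something stated in the text above.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors centred at (cx, cy) as (x1, y1, x2, y2) tuples.

    Each anchor has area scale**2 and height/width equal to ratio.
    The default scales and ratios are assumed values for illustration.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # width shrinks as the box gets taller
            h = s * math.sqrt(r)   # so that w * h stays equal to s * s
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(cx=400.0, cy=300.0)
for x1, y1, x2, y2 in anchors:
    print(f"w={x2 - x1:7.1f}  h={y2 - y1:7.1f}")
```

With these defaults every location yields nine anchors. The RPN then predicts, for each anchor, an objectness score and a box adjustment, and only the highest-scoring adjusted boxes are passed on to the classifier.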
Cascade R-CNN extends the classical two-stage structure to a multi-stage structure. Each stage is trained with a different, progressively higher IoU (Intersection over Union) threshold, so that every stage learns from high-quality positive samples. As a result, the bounding boxes are detected and refined more accurately. The structure of Cascade R-CNN is shown in the following figure, where I is the input; conv is the convolutional layers of the backbone network; B0 is the set of proposal boxes generated by the region proposal network; pool is the pooling layer; H1, H2 and H3 are the detection networks; C1, C2 and C3 are classifiers; and B1, B2 and B3 are bounding box regressors.
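The role of the per-stage IoU threshold can be illustrated with a small sketch that computes IoU and selects the positive samples each stage would train on. The thresholds 0.5, 0.6 and 0.7, the helper name positives_per_stage, and the example boxes are assumptions for illustration. The sketch also leaves out the box regression that each stage applies before handing its boxes to the next one, so it shows only the labelling step.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


STAGE_THRESHOLDS = (0.5, 0.6, 0.7)  # assumed values, one per stage


def positives_per_stage(proposals, gt_box, thresholds=STAGE_THRESHOLDS):
    """For each stage, keep the proposals whose IoU with gt_box meets its threshold."""
    return [[p for p in proposals if iou(p, gt_box) >= t] for t in thresholds]


gt_box = (0, 0, 100, 100)
proposals = [
    (5, 5, 105, 105),     # IoU about 0.82
    (10, 10, 110, 110),   # IoU about 0.68
    (15, 15, 115, 115),   # IoU about 0.57
    (30, 30, 130, 130),   # IoU about 0.32
]
stage_positives = positives_per_stage(proposals, gt_box)
counts = [len(p) for p in stage_positives]
```

On these fixed proposals the number of positives falls as the threshold rises. In the full cascade, the regressors B1 and B2 refine the boxes between stages, which is what keeps enough high-IoU positives available for the later, stricter stages.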