SSD is a one-stage object detection algorithm based on regression: it recasts the problem of object detection as a regression problem. The SSD network accepts input images resized to a resolution of 300×300 pixels.
The SSD architecture consists of a base network that serves as the backbone, with additional feature-extraction layers appended to it. As shown in the figure, the SSD network produces feature-map grids of several sizes: 19×19, 10×10, 5×5, 3×3, 2×2, and 1×1, ranging from fine to coarse. This multi-scale design lets the network make accurate predictions over a wide range of object scales. For example, the fine 19×19 grid produced by the earliest detection layer takes care of the smallest objects through its compact grid cells, whereas the 1×1 grid produced by the last layer is responsible for large objects that can take up essentially the entire image; the grids from the intermediate layers handle objects of sizes in between.
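The multi-scale grids above can be made concrete by counting the default boxes they generate. A minimal sketch, assuming 3 boxes per cell on the finest map and 6 on the rest (a common MobileNet-SSD configuration, not stated in the text):

```python
# Count default boxes across the feature-map grids listed above.
# boxes_per_cell values are an assumed configuration, for illustration.
grids = [19, 10, 5, 3, 2, 1]          # feature-map sizes, fine to coarse
boxes_per_cell = [3, 6, 6, 6, 6, 6]   # assumed default-box counts

per_map = [g * g * b for g, b in zip(grids, boxes_per_cell)]
total = sum(per_map)
print(per_map)  # [1083, 600, 150, 54, 24, 6]
print(total)    # 1917
```

Note how the finest grid dominates the box count even with fewer boxes per cell, which is why it carries most of the small-object predictions.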
The SSD network uses MobileNet V1 as the feature extractor, as shown in the figure. MobileNet is a family of efficient convolutional neural networks proposed by Google and is well suited to deployment on mobile platforms thanks to its balance between speed and accuracy. MobileNet V1 replaces standard convolutions with depthwise separable convolutions, which simplifies the network structure and greatly reduces computation. The table provides a detailed layer-by-layer summary of the MobileNet-SSD structure.
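The computational saving from depthwise separable convolutions can be sketched by comparing parameter counts: a standard k×k convolution costs k²·Cin·Cout parameters, while the depthwise-plus-pointwise factorization costs k²·Cin + Cin·Cout. The channel sizes below are illustrative assumptions, not values from the MobileNet paper:

```python
# Parameter counts: standard 3x3 convolution vs. depthwise separable
# (depthwise 3x3 filter per channel + pointwise 1x1 channel mixing).
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise

k, c_in, c_out = 3, 128, 256               # illustrative sizes
std = standard_conv_params(k, c_in, c_out)        # 294912
sep = depthwise_separable_params(k, c_in, c_out)  # 33920
print(std, sep, round(std / sep, 1))  # roughly 8.7x fewer parameters
```

The reduction ratio works out to 1/Cout + 1/k², so for 3×3 kernels the factorization approaches a 9× saving as the output channel count grows.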
SSDLite is similar to SSD but replaces the regular convolutions in the detection layers with depthwise separable convolutions, which makes it much faster than standard SSD and well suited for use on mobile devices. The SSDLite network uses MobileNet V2 as the feature extractor. MobileNet V2 is an improvement on MobileNet V1 and is likewise a lightweight convolutional neural network. MobileNet V2 uses a linear layer in place of the nonlinear ReLU activation after the channel count is reduced, because applying ReLU to such low-dimensional features would destroy feature information.
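The MobileNet V2 design described above can be sketched as an inverted-residual block, expressed here purely as parameter counts. The expansion factor t=6 and the channel sizes are illustrative assumptions, not exact values from the paper:

```python
# Sketch of a MobileNet V2 inverted-residual (bottleneck) block as
# parameter counts: expand -> depthwise -> linear projection.
def inverted_residual_params(c_in, c_out, t=6, k=3):
    c_mid = c_in * t               # expanded (wide) channel count
    expand = c_in * c_mid          # 1x1 expansion conv, followed by ReLU
    depthwise = k * k * c_mid      # k x k depthwise conv, followed by ReLU
    project = c_mid * c_out        # 1x1 projection conv, LINEAR: no ReLU
                                   # here, where channels shrink, to avoid
                                   # destroying low-dimensional features
    return expand + depthwise + project

print(inverted_residual_params(24, 24))  # 8208
```

The key point mirrors the text: the only place the nonlinearity is dropped is the projection back down to few channels, where ReLU would irreversibly zero out part of the feature.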