This research explores how making a neural network deeper (adding more layers) affects its ability to recognize and categorize large-scale images. Specifically, it looks at convolutional neural networks (ConvNets), a type of AI particularly good at processing visual information.

In simpler terms, think of a ConvNet as a multi-layered system that looks at an image and tries to understand a different aspect of it at each layer. For instance, the first few layers might identify edges and colors, while deeper layers might recognize complex patterns like shapes or even whole objects.

The unique aspect of this study is the use of very small filters, sized 3x3, in these layers. Filters are like tiny lenses that focus on specific parts of an image to extract features. The researchers found that using these small filters in a deep network (with 16 to 19 weight layers) significantly improves the network's accuracy in recognizing and categorizing images.

Their findings were so impactful that they led to their team winning first and second places in the localization and classification tracks, respectively, of the 2014 ImageNet Challenge, a major competition in computer vision.

Additionally, they demonstrated that their approach isn't just good for the specific images they trained on; it can be applied to other datasets and still achieve state-of-the-art results. The significance of this work is such that they made their best-performing models publicly available, encouraging further research in the field.

In summary, this study shows that deeper networks with small filters can be incredibly effective at understanding and recognizing images, a major step forward in the field of computer vision.
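To make the small-filter idea concrete, here is a minimal sketch of a VGG-style block in PyTorch (an illustration, not code from the paper; the channel sizes and input resolution are assumptions). Two stacked 3x3 convolutions cover the same 5x5 region of the input as a single 5x5 convolution, but with fewer parameters and an extra non-linearity in between:

```python
import torch
import torch.nn as nn

# A VGG-style block: two stacked 3x3 convolutions followed by max-pooling.
# Two 3x3 layers see a 5x5 region of the input (the receptive field of one
# 5x5 layer) but use 2 * (3*3*C*C) weights instead of 5*5*C*C, and gain an
# extra ReLU non-linearity between them.
def vgg_block(in_channels: int, out_channels: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),  # halve spatial resolution
    )

block = vgg_block(3, 64)             # e.g. RGB input -> 64 feature maps
x = torch.randn(1, 3, 224, 224)      # one 224x224 image (assumed input size)
print(block(x).shape)                # torch.Size([1, 64, 112, 112])
```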
The results are shown in Table 6. By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers). The resulting ensemble of 7 networks has 7.3% ILSVRC test error. After the submission, we considered an ensemble of only the two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0% using dense evaluation and 6.8% using combined dense and multi-crop evaluation. For reference, our best-performing single model achieves 7.1% error (model E, Table 5).

Table 6: Multiple ConvNet fusion results.

Combined ConvNet models | top-1 val error (%) | top-5 val error (%) | top-5 test error (%)

ILSVRC submission:
(D/256/224,256,288), (D/384/352,384,416), (D/[256;512]/256,384,512), (C/256/224,256,288), (C/384/352,384,416), (E/256/224,256,288), (E/384/352,384,416) | 24.7 | 7.5 | 7.3

post-submission:
(D/[256;512]/256,384,512), (E/[256;512]/256,384,512), dense eval. | 24.0 | 7.1 | 7.0
(D/[256;512]/256,384,512), (E/[256;512]/256,384,512), multi-crop | 23.9 | 7.2 | –
(D/[256;512]/256,384,512), (E/[256;512]/256,384,512), multi-crop & dense eval. | 23.7 | 6.8 | 6.8
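The fusion in Table 6 combines networks by averaging their soft-max class posteriors. A minimal sketch of that averaging step in PyTorch, assuming `models` is a list of trained networks that map a batch of images to class logits (the helper below is an illustration, not the paper's code):

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, images):
    """Fuse an ensemble by averaging soft-max class posteriors.

    models: a list of trained networks mapping images -> class logits.
    images: a batch tensor of shape (N, 3, H, W).
    Returns averaged class probabilities of shape (N, num_classes).
    """
    probs = None
    with torch.no_grad():
        for model in models:
            model.eval()
            p = F.softmax(model(images), dim=1)   # per-model posteriors
            probs = p if probs is None else probs + p
    return probs / len(models)

# Usage (hypothetical): model_d and model_e stand in for configurations D and E.
# fused = ensemble_predict([model_d, model_e], batch)
# top5 = fused.topk(5, dim=1).indices
```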
4.5 COMPARISON WITH THE STATE OF THE ART

Finally, we compare our results with the state of the art in Table 7. In the classification task of the ILSVRC-2014 challenge (Russakovsky et al., 2014), our VGG team secured the 2nd place with 7.3% test error using an ensemble of 7 models. After the submission, we decreased the error rate to 6.8% using an ensemble of 2 models.
As can be seen from Table 7, our very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it. This is remarkable, considering that our best result is achieved by combining just two models, significantly fewer than used in most ILSVRC submissions. In terms of single-net performance, our architecture achieves the best result (7.0% test error), outperforming a single GoogLeNet by 0.9%. Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.
Table 7: Comparison with the state of the art in ILSVRC classification. Our method is denoted as "VGG". Only the results obtained without outside training data are reported. A dash marks values not reported for that entry.

Method | top-1 val. error (%) | top-5 val. error (%) | top-5 test error (%)
VGG (2 nets, multi-crop & dense eval.) | 23.7 | 6.8 | 6.8
VGG (1 net, multi-crop & dense eval.) | 24.4 | 7.1 | 7.0
VGG (ILSVRC submission, 7 nets, dense eval.) | 24.7 | 7.5 | 7.3
GoogLeNet (Szegedy et al., 2014) (1 net) | – | 7.9 | –
GoogLeNet (Szegedy et al., 2014) (7 nets) | – | 6.7 | –
MSRA (He et al., 2014) (11 nets) | – | – | 8.1
MSRA (He et al., 2014) (1 net) | 27.9 | 9.1 | 9.1
Clarifai (Russakovsky et al., 2014) (multiple nets) | – | – | 11.7
Clarifai (Russakovsky et al., 2014) (1 net) | – | – | 12.5
Zeiler & Fergus (Zeiler & Fergus, 2013) (6 nets) | 36.0 | 14.7 | 14.8
Zeiler & Fergus (Zeiler & Fergus, 2013) (1 net) | 37.5 | 16.0 | 16.1
OverFeat (Sermanet et al., 2014) (7 nets) | 34.0 | 13.2 | 13.6
OverFeat (Sermanet et al., 2014) (1 net) | 35.7 | 14.2 | –
Krizhevsky et al. (Krizhevsky et al., 2012) (5 nets) | 38.1 | 16.4 | 16.4
Krizhevsky et al. (Krizhevsky et al., 2012) (1 net) | 40.7 | 18.2 | –
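For context on the "multi-crop & dense eval." rows in Tables 6 and 7: multi-crop evaluation classifies several crops of each test image and averages the resulting soft-max posteriors (the paper uses 50 crops per scale over 3 scales, i.e. 150 crops in total). The sketch below substitutes torchvision's simpler ten-crop transform as a stand-in for that scheme; `model` is assumed to be any trained classifier taking 224x224 inputs:

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from torchvision.transforms import functional as TF

# Multi-crop evaluation: classify several crops of one test image and average
# the soft-max posteriors. TenCrop (4 corners + centre, each with a horizontal
# flip) is a simplified stand-in for the paper's 150-crop scheme.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack([TF.to_tensor(c) for c in crops])),
])

def multicrop_predict(model, pil_image):
    crops = ten_crop(pil_image)            # shape: (10, 3, 224, 224)
    with torch.no_grad():
        probs = F.softmax(model(crops), dim=1)
    return probs.mean(dim=0)               # average posteriors over the crops
```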