Decompose the network structure into two networks F and G keeping a set of top layers T at the end. F and G are small and more advance network structures respectively. Thus F is cheap to execute with lower performance compared to G.
In order to reduce the whole computation and embrace both performance and computation gains provided by both networks, they suggest an incremental pass of input data through F to G.
Network F decides the salient regions on the input by using a gradient feedback and then these smaller regions are sent to network G to have better recognition performance.
Given an input image x, coarse network F is applied and then coarse representations of different regions of the given input is computed. These coarse representations are propagated to the top layers T and T computes the final output of the network which are the class predictions. An entropy measure is used to see that how each coerce representation effects the model's uncertainty leading that if a regions is salient then we expect to have large change of the uncertainty with respect to its representation.
We select top k input regions as salient by the hint of computed entropy changes then these regions are given to fine network G obtain finer representations. Eventually, we merge all the coarse, fine representations and give to top layers T again and get the final predictions.
At the training time, all networks and layers trained simultaneously. However, still one might decide to train each network F and G separately by using the same top layers T. Authors posits that the simultaneous training is useful to keep fine and coarse representations similar so that the final layers T do not struggle too much to learn from two difference representation distribution.
I only try to give the overlooked idea here, if you like to see more detail and dwell into formulas please see the paper.
My discussion: There are some other works using attention mechanisms to improve final performance. However, this work is limited with the small datasets and small spatial dimensions. I really like to see whether it is also usefule for large problems like ImageNet or even larger.
Another caveat is the datasets used for the expeirments are not so cluttered. Therefore, it is easy to detect salient regions, even with by some algrithmic techniques. Thus, still this method obscure to me in real life problems.