Here is G. Hinton's talk at MIT about the shortcomings of Convolutional Neural Networks and four basic arguments for addressing them.
I watched it while slightly distracted, so I need to go over it again. Still, these are the basic arguments G. Hinton proposed during the talk.
1. CNN + max pooling is not how the human brain handles visual information. Yes, it works in practice and gives the current state of the art, but viewpoint changes of the target objects in particular are still unsolved.
2. Apply equivariance instead of invariance. Instead of learning representations that are invariant to viewpoint changes, learn representations that change in correlation with the viewpoint changes.
3. In the space of CNN weight matrices, viewpoint changes are highly non-linear and therefore hard to learn. However, if we transfer instances into a space where viewpoint changes are globally linear, the problem becomes easier. (Computer graphics representations do this by using explicit pose coordinates.)
4. Route information to the right set of neurons instead of relying on unguided forward and backward passes. Define certain neuron groups (called capsules) that are receptive to particular data clusters in the instance space; each capsule contributes to the whole model in proportion to the given instance's membership in that capsule's cluster.
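
To make points 2 to 4 a bit more concrete for myself, here is a toy sketch in plain Python. It is my own illustration, not code from the talk: the function names (`max_pool`, `pool_with_pose`, `pose`, `memberships`) are made up, the 1D signal and 2D poses are stand-ins for real images, and the softmax over squared distances is only my assumption about what "membership in a cluster" could look like.

```python
import math

def max_pool(signal, window):
    """Non-overlapping 1D max pooling: keeps the strongest response, drops where it was."""
    return [max(signal[i:i + window]) for i in range(0, len(signal), window)]

def pool_with_pose(signal, window):
    """Equivariant variant: keep (strength, position) of the strongest response per window."""
    out = []
    for i in range(0, len(signal), window):
        chunk = signal[i:i + window]
        k = max(range(len(chunk)), key=chunk.__getitem__)
        out.append((chunk[k], i + k))
    return out

def matmul(a, b):
    """Plain matrix product for small nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def pose(angle, tx, ty):
    """2D pose as a 3x3 homogeneous matrix (rotation plus translation)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def memberships(x, centers, sharpness=1.0):
    """Soft membership of instance x in each cluster (assumed: softmax of -distance^2)."""
    scores = [-sharpness * sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Point 2: the same feature at two positions inside one pooling window.
a = [0, 9, 0, 0]
b = [0, 0, 9, 0]
invariant_a, invariant_b = max_pool(a, 4), max_pool(b, 4)  # identical outputs
equivariant_a = pool_with_pose(a, 4)  # position 1 is kept
equivariant_b = pool_with_pose(b, 4)  # position 2 is kept

# Point 3: with explicit pose coordinates, a viewpoint change is one matrix product.
face = pose(0.0, 2.0, 1.0)           # pose of the whole (hypothetical values)
nose_in_face = pose(0.0, 0.0, -0.5)  # part-to-whole relation, viewpoint independent
view = pose(math.pi / 2, 0.0, 0.0)   # viewpoint change
nose_before = matmul(face, nose_in_face)
nose_after = matmul(matmul(view, face), nose_in_face)

# Point 4: each capsule contributes according to the instance's cluster membership.
centers = [[0.0, 0.0], [4.0, 4.0]]   # one hypothetical cluster center per capsule
weights = memberships([0.5, 0.5], centers)

print(invariant_a, invariant_b)
print(equivariant_a, equivariant_b)
print(weights)
```

Max pooling returns the same value for both inputs, so the position is lost, while the `(strength, position)` pairs move together with the feature. In the pose example, `nose_after` matches `matmul(view, nose_before)` up to rounding, which is the sense in which the viewpoint change is linear once poses are explicit. The routing weights sum to 1 and nearly all of the weight goes to the capsule whose cluster center is closest.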