--- I am enjoying the sight of my paper title on the list of accepted ECCV14 papers :). Seeing the outcome of your work makes all the day-and-night effort worthwhile, REALLY!!! Before I start, I would like to thank my supervisor Pinar Duygulu for her great guidance. ---
In this post, I would like to summarize that work, since I believe a friendly blog post can sometimes be more expressive than a formal scientific article.
“ConceptMap: Mining noisy web data for concept learning” proposes a pipeline for learning a wide range of visual concepts just by defining a query to an image search engine. The idea is to query the service for a concept and download a large set of images, cluster the images while removing the irrelevant instances, and learn a model from each of the clusters. In the end, each concept is represented by the ensemble of these classifiers.
However, it is not that easy. The proposed pipeline needs to solve the following problems. First, not all gathered images are related to the concept you are searching for. Second, the same word can refer to concepts that are visually different; this is called Polysemy. Third, the models we learn from the queried images are mostly representative of a set of images that is statistically distinct from the images we use for testing. This problem is called Domain Shift or Domain Adaptation Gap. While ConceptMap (CMAP) delves into these problems, it also aims to construct a bottom-up hierarchy of visual concepts. That is, we start from simple color and texture concepts and push the boundaries toward more global concepts such as scene images by exploiting the correspondence between the low-level and the higher-level concepts.
At the core of CMAP lies a novel algorithm, an extension of the well-known Self-Organizing Map (a.k.a. Kohonen’s Map), that we call the Rectifying Self-Organizing Map (RSOM). We say “Rectifying” because it is able to detect spurious instances in the given data set while it clusters the data in the same way SOM does.
I won’t go into details and formulations, but to give some intuition: RSOM keeps track of unit activations as the learning evolves in order to separate salient units (clusters) from spurious ones. Despite its simplicity, it is still a powerful method for removing outliers and clustering instances. It also adds little computational burden over the original SOM, since it only needs the unit activations that are already computed in a normal SOM iteration. Although we use it for Vision problems, it is a domain-agnostic method that is perfectly applicable to other domains. I provide an RSOM library that can also be used for plain SOM clustering. The library includes batch and stochastic learning methods based on Scipy and Theano. The Scipy version is preferable for small problems, while the Theano version is much faster for large-scale problems thanks to GPU utilization.
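To make the intuition a bit more concrete, here is a minimal NumPy sketch of a SOM that counts how often each unit wins and then flags rarely activated units as spurious. It is not the formulation from the paper and not the API of the RSOM library: the function names, the 1-D unit grid, the fixed neighbourhood width and the simple share-based threshold are all assumptions made for illustration.

```python
import numpy as np


def train_som_with_activations(data, n_units=4, n_epochs=20, lr=0.5, sigma=1.0, seed=0):
    """Stochastic SOM on a 1-D grid that also records how often each unit wins.

    Hypothetical helper for illustration; not the paper's exact update rule.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the units from randomly chosen data points.
    units = data[rng.choice(len(data), size=n_units, replace=False)].copy()
    activations = np.zeros(n_units)
    grid = np.arange(n_units)
    for epoch in range(n_epochs):
        rate = lr * (1.0 - epoch / n_epochs)  # linearly decaying learning rate
        for x in data[rng.permutation(len(data))]:
            # The best matching unit is the one closest to the sample.
            winner = int(np.argmin(np.linalg.norm(units - x, axis=1)))
            activations[winner] += 1
            # Gaussian neighbourhood around the winner on the 1-D grid.
            influence = np.exp(-((grid - winner) ** 2) / (2.0 * sigma ** 2))
            units += rate * influence[:, None] * (x - units)
    return units, activations


def salient_units(activations, min_share=None):
    """Flag units whose share of wins reaches a threshold (assumed rule)."""
    share = activations / activations.sum()
    if min_share is None:
        min_share = 0.5 / len(activations)  # half of a uniform share
    return share >= min_share


# Toy data: two tight groups plus a few scattered outliers.
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),
    rng.normal(3.0, 0.1, size=(50, 2)),
    rng.uniform(-10.0, 10.0, size=(5, 2)),
])
units, activations = train_som_with_activations(data)
mask = salient_units(activations)
```

The point to notice is that `activations` comes for free from the winner search that SOM performs anyway, which is why the extra cost stays small.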
Let’s walk through the pipeline shown in the figure. We start by querying “RED” on an image search engine and collect many images. We extract random patches from all images (this stage differs across concept domains; for instance, we use only detected faces for face images and salient object regions for object concepts). We extract high-dimensional visual features from these patches, which benefits the simple linear models that we train at the end. Then we feed those patches to RSOM. RSOM clusters the data and removes outlier clusters, as well as outlier instances inside the salient clusters. In the end we have a set of coherent clusters. Each of them is representative of a sub-concept covered by the queried concept “RED” (for color concepts, these correspond to different color temperatures). Each cluster is used for training a simple linear classifier. We train a linear SVM with an L1 norm constraint in order to select a discriminative subset of dimensions from the long feature vectors. Finally, we have a set of classifiers, each of which is supposedly sensitive to a different modality of “RED”. At classification time, given a novel image, we run all classifiers over the image, and the maximum score among those classifiers is the prediction score for the “RED” concept.
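The last step, taking the maximum over the per-cluster classifiers, can be sketched in a few lines. The weights below are made-up numbers standing in for already trained linear classifiers, and `concept_score` is a hypothetical name; only the max-over-ensemble rule comes from the description above.

```python
import numpy as np


def concept_score(features, weights, biases):
    """Score one feature vector with every sub-concept classifier of a concept.

    weights has shape (n_classifiers, n_dims), biases has shape (n_classifiers,).
    Returns the maximum score and the index of the classifier that produced it.
    """
    scores = weights @ features + biases
    return float(scores.max()), int(scores.argmax())


# Three toy linear classifiers for one concept (illustrative values only).
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [-1.0, -1.0]])
biases = np.array([0.0, 0.0, 0.5])
features = np.array([0.2, 0.9])

score, best = concept_score(features, weights, biases)
```

For the training side, one way to get an L1-regularized linear SVM is scikit-learn's `LinearSVC(penalty="l1", dual=False)`; whether that matches the solver used in the paper is an assumption.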
So far we have learned color and texture concepts. Now it is time to go further up the hierarchy. At the next stage, we use those low-level concepts (attributes) to represent scene images. For this purpose, we run all classifiers of all concepts over the scene images in a sliding-window fashion. We treat each classifier as a word, reminiscent of the Bag of Words model, but instead of centroid distances we use classifier confidences as feature values. We also use a 2-level spatial pyramid with our particular weighting scheme to take locality into account. (Again, I avoid giving the formulation here.) Each scene image becomes a very long feature vector (num_of_concepts * num_classifiers_per_concept * SP_grids). Again we use a linear SVM with the L1 norm for classification.
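As a rough sketch of how such a vector could be assembled, the snippet below pools sliding-window classifier confidences over a 2-level spatial pyramid. Several things here are assumptions rather than the paper's method: the pyramid is taken to be the whole image plus a 2x2 grid (5 cells), max pooling is used inside each cell, and the paper's particular weighting scheme is left out. The name `pyramid_feature` is hypothetical.

```python
import numpy as np


def pyramid_feature(conf_maps, levels=2):
    """Pool classifier confidences over a spatial pyramid.

    conf_maps has shape (n_classifiers, H, W): one confidence per classifier
    and sliding-window position. Level l splits the map into 2**l x 2**l cells.
    """
    n_clf, h, w = conf_maps.shape
    parts = []
    for level in range(levels):
        cells = 2 ** level
        row_edges = np.linspace(0, h, cells + 1).astype(int)
        col_edges = np.linspace(0, w, cells + 1).astype(int)
        for i in range(cells):
            for j in range(cells):
                cell = conf_maps[:, row_edges[i]:row_edges[i + 1],
                                 col_edges[j]:col_edges[j + 1]]
                # Max pooling per classifier inside the cell (assumed choice).
                parts.append(cell.reshape(n_clf, -1).max(axis=1))
    return np.concatenate(parts)


# 3 concepts x 2 classifiers each = 6 confidence maps on a 4x4 window grid.
rng = np.random.default_rng(0)
conf_maps = rng.random((6, 4, 4))
feature = pyramid_feature(conf_maps)
```

With 6 classifiers and 5 pyramid cells the vector has 30 entries, which matches the num_of_concepts * num_classifiers_per_concept * SP_grids count above.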
Although we also report results of the original pipeline depicted in the figure for scene images, this hierarchical approach gives better results on the MIT-indoor and Scene-15 data sets.
CMAP is a fast and very generic method that can be used for various Vision tasks, as we demonstrate in the paper (attribute, scene, object, face). It gives compelling benchmark results that are better than those of computationally expensive state-of-the-art counterparts.
We tackle face recognition and object classification problems with similar pipelines in the paper. But this post is meant to give a simple intuition about the proposed system in a short space. Therefore, for details I refer you to the not-yet-published paper (which I can provide upon request) or the slides below. I also welcome any suggestion, comment or a word :). BEST !!