Here, I summarize our new method called FAME for learning face models from a noisy set of web images. I am studying this for my MS Thesis. As a little intro to my thesis, its title is “Mining Web Images for Concept Learning” and it introduces two new methods for automatic learning of visual concepts from noisy web images. The first proposed method is FAME, and the other, called ConceptMap, was presented here before and appeared at ECCV14 (self-promotion :)).
Before I start, I should disclaim that FAME is not a fully finished work and awaits your valuable comments. Please leave a comment about anything you find useful, ridiculous, awkward, or great.
In this work, we tackle the problem of learning face models from public face images collected from the Web by querying a particular person’s name. The collected images are called weakly labeled, since the query gives only a rough description of their content. The data is very noisy even after face detection, with false detections or several irrelevant faces of other people. The proposed method, FAME (Face Association through Model Evolution), prunes the data in an iterative manner, so that the face models associated with a name evolve. The idea is to quantify the representativeness and discriminativeness of each image against a vast amount of random images and to eliminate poor instances with respect to these qualities. In the end, the cleaned data is used to train classification models for face identification. We believe that FAME is a generic method that can be used in domains other than vision tasks, but we did not verify this claim in this work.
To put it more casually, the purpose is to query someone’s name on an image search engine like Google Image Search and then, without any human effort, filter out the irrelevant instances and train good-quality face models. The crux of the pipeline is the data pruning procedure in FAME. Let’s look at how FAME eliminates spurious instances through evolving models.
The idea is very simple and intuitive. Assume that we have a queried set of images of Adam Sandler, including some irrelevant images. We also have a vast number of random face images, again collected from the Web. The first thing FAME does is learn a linear model that separates the Sandler images (+ class) from the random images (- class). The distance to this first model’s hyperplane measures the discriminativeness of each Sandler image against the rest of the world (the random images). Based on this, we select the top Sandler images that are farthest from the hyperplane on the positive side. These are supposed to be the most discriminative (iconic) images of Sandler compared to a random image. Second, we train another linear model that separates the selected images (+ class) from the remaining Sandler images (- class). This second model measures the representativeness of each instance by examining it in relation to the iconic images selected by the first model. We then eliminate the instances that are most distant from the hyperplane on the negative side. We believe those are the images that differ from the iconic images and diverge from the class basis. This is just a single iteration of FAME, and we repeat this flow up to a desired number of iterations.
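To make the flow concrete, here is a minimal NumPy sketch of one FAME iteration under my own assumptions: a tiny gradient-descent logistic-regression trainer stands in for the linear models, the feature vectors are synthetic, and all names (`fame_iteration`, `n_iconic`, `n_drop`) are illustrative, not from the paper.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=200):
    """Tiny batch-gradient logistic regression; returns weights (bias folded in)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def signed_distance(X, w):
    """Signed distance of each sample to the hyperplane (positive side = + class)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w) / np.linalg.norm(w[:-1])

def fame_iteration(category, random_faces, n_iconic=10, n_drop=5):
    """One FAME step: pick iconic images via model 1, drop outliers via model 2."""
    # Model 1: category (+) vs. random faces (-); most-positive samples are iconic.
    X1 = np.vstack([category, random_faces])
    y1 = np.r_[np.ones(len(category)), np.zeros(len(random_faces))]
    d1 = signed_distance(category, train_logreg(X1, y1))
    iconic = np.argsort(-d1)[:n_iconic]

    # Model 2: iconic (+) vs. remaining category images (-);
    # the most-negative remaining images are eliminated as spurious.
    rest = np.setdiff1d(np.arange(len(category)), iconic)
    X2 = np.vstack([category[iconic], category[rest]])
    y2 = np.r_[np.ones(len(iconic)), np.zeros(len(rest))]
    d2 = signed_distance(category[rest], train_logreg(X2, y2))
    drop = rest[np.argsort(d2)[:n_drop]]           # farthest on the negative side
    return np.delete(category, drop, axis=0)
```

Running this repeatedly on the surviving `category` array gives the iterative pruning described above.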
I know, “desired number of iterations” sounds tricky. Therefore, we use an indicator to decide where to stop: the training accuracy of the first model as a measure of data quality. As we iterate, the training accuracy of that model steadily increases up to a certain limit. This is very intuitive, since as we eliminate spurious Sandler images, the separation between the random images and the remaining Sandler images becomes clearer. Hence, when we see a decrease or saturation of the accuracy, we stop the algorithm and use the remaining salient images to train a final face model. Sometimes the accuracy reaches 100% in very few iterations. In that case, we keep iterating until 10% of the category images have been eliminated, which is the overall elimination level we observe for all name categories.
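The stopping rule can be sketched as a small driver loop. The `step` callable interface below is hypothetical (my shorthand, not an interface from our code); it just has to return the first model’s training accuracy and the number of remaining category images after one pruning iteration.

```python
def run_fame(step, n_images, max_iters=50, min_eliminated=0.1):
    """Iterate FAME, monitoring model-1 training accuracy as a data-quality proxy.

    `step` is any callable returning (train_accuracy, n_remaining) for one
    pruning iteration. Stops when accuracy decreases or saturates; if accuracy
    reaches 100% early, keeps pruning until at least `min_eliminated` of the
    category images have been removed.
    """
    prev_acc, n_remaining = -1.0, n_images
    for _ in range(max_iters):
        acc, n_remaining = step()
        eliminated = 1.0 - n_remaining / n_images
        if acc >= 1.0 and eliminated >= min_eliminated:
            break                      # perfect separation and enough pruning
        if acc < 1.0 and acc <= prev_acc:
            break                      # accuracy decreased or saturated
        prev_acc = acc
    return n_remaining
```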
Another curious aspect of our method is the data representation. We rely on very high-dimensional representations of face images, since they are required to discern each category from the others despite the intra-category variation (view changes, visual differences). With such features, we are able to separate any category with 100% accuracy using an easy-to-train linear model (we verified this by training a linear model over all classes). A linear model also makes FAME iterations faster compared to more complex classifiers. Furthermore, we observe better results with simple logistic regression compared to a linear SVM. Perhaps this is because the SVM’s margin loss imposes very stringent constraints on the sample space that do not suit our setting.
Face images are represented by 40,000-dimensional feature vectors, using a simple but powerful method proposed by Adam Coates LINK. It is basically single-layer K-means quantization of visual words with 5-region (4 quadrants + image center) average spatial pooling (in their work they use only the 4 quadrants, but face images carry information at the center as well). If you are not able to run trendy and expensive multi-layer models, this simple K-means alternative works very well with little or no performance loss. You can see example filters learned by the method; there are many filters receptive to eyes and mouths. (Seems like magic!)
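To illustrate the flavor of the representation, here is a heavily scaled-down sketch of single-layer K-means encoding with 5-region pooling. Coates et al. also use ZCA whitening and a soft (triangle) activation, which I omit here; the hard assignment, tiny patch and centroid counts, and function names are my simplifications. (The real 40,000 dimensions come from many more centroids times the 5 pooling regions.)

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Plain Lloyd's algorithm on patch vectors; returns the centroid dictionary."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None]) ** 2).sum(-1)   # squared distances
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(0)
    return C

def encode(img, C, patch=6, stride=4):
    """Hard-assignment encoding with 5-region pooling (4 quadrants + center)."""
    h, w = img.shape
    k = len(C)
    feats = np.zeros((5, k))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            p = img[y:y + patch, x:x + patch].ravel()
            p = p - p.mean()                           # simple patch normalization
            j = ((C - p) ** 2).sum(1).argmin()         # nearest centroid
            cy, cx = y + patch / 2, x + patch / 2
            q = (cy >= h / 2) * 2 + (cx >= w / 2)      # which quadrant
            feats[q, j] += 1
            # extra center region covering the middle half of the image
            if h / 4 <= cy < 3 * h / 4 and w / 4 <= cx < 3 * w / 4:
                feats[4, j] += 1
    return feats.ravel()                               # 5 * k dimensional descriptor
```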
We train our final face models with an L1-norm SVM over the final pruned image collections, and the models are enhanced by the grafting algorithm proposed by Simon Perkins LINK. Basically, grafting selects important feature dimensions in a greedy manner with respect to their gradient information at each iteration. This way, we reduce the number of necessary feature dimensions and ease the system requirements of the final models.
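For intuition only, here is a toy grafting-style loop in NumPy. It greedily adds the feature whose loss-gradient magnitude exceeds the L1 penalty and then re-optimizes the active set. Note the simplifications: the real pipeline uses an L1-norm SVM, while this sketch uses logistic loss, and all names (`grafting_logreg`, `lam`) are mine.

```python
import numpy as np

def grafting_logreg(X, y, lam=0.05, max_features=20, inner_steps=200, lr=0.1):
    """Greedy grafting-style feature selection for an L1-regularized linear model.

    Repeatedly adds the inactive feature whose loss-gradient magnitude exceeds
    the L1 penalty `lam`, then re-optimizes the weights of the active set only.
    """
    n, d = X.shape
    w = np.zeros(d)
    active = []
    for _ in range(max_features):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / n                 # gradient of the logistic loss
        grad[active] = 0.0                       # only consider inactive features
        j = int(np.argmax(np.abs(grad)))
        if np.abs(grad[j]) <= lam:               # no feature is worth its L1 cost
            break
        active.append(j)
        Xa = X[:, active]                        # optimize active weights only
        wa = w[active]
        for _ in range(inner_steps):
            pa = 1.0 / (1.0 + np.exp(-(Xa @ wa)))
            g = Xa.T @ (pa - y) / n + lam * np.sign(wa)
            wa -= lr * g
        w[active] = wa
    return w, active
```

On data where only a few dimensions matter, the loop picks those dimensions and leaves the rest at zero, which is exactly the memory saving we are after in the final models.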
We tested our method and classification pipeline on the well-known face datasets PubFig83 and FAN-large (for more details, visit the project page LINK or refer to the paper). I won’t give the full result table, but here are the important numbers. Our training pipeline (filter learning + L1 SVM + grafting), without data cleaning and with models trained on the training partition of the dataset, classifies the 83 categories of PubFig83 with 90.75% accuracy, which is higher than the state of the art, to our knowledge, presented by Becker et al. with 85.9%.
If we instead train our models on noisy web images after FAME iterations, which is the real problem of this study, we classify PubFig83 with 79.3% and FAN-large ALL (a very hard dataset, so to say) with 67.1% accuracy. Notice that these models are trained only on the web images filtered by FAME. The real improvement yielded by FAME is observed in comparison to the baseline, which uses all the web images without any data filtering through the same training pipeline. The baseline gives 52.8% for PubFig83 and 52.7% for FAN-large. Hence, the improvement from FAME is very clear.
As a final comment, this work was submitted to BMVC14 but gently rejected, with ironic reviews, by 2 out of 3 reviewers. I also believe that this work is not yet ready for publication, so your comments are really precious for advancing the idea.