Yali Amit (U Chicago) amit@galton.uchicago.edu Model-based object classification In analyzing images with multiple objects an initial indexing algorithm will yield multiple candidate detections and poses, as described for example in the preceding talk on Coarse-to-fine indexing. This would also be the output of a bottom up segmentation or saliency detection algorithm. In both cases it is necessary to resolve ambiguities: choosing among several candidate objects at similar poses or choosing among several candidate multiple-object configurations. Simple statistical models for objects - based on conditional independence of the edge features given the pose - lend themselves easily to these tasks using likelihood ratio tests. Object models can be composed to produce models for multiple-object configurations. Rather complex pose parameters can also be naturally integrated in such models. We will describe these ideas in the context of the license plate example introduced in Geman's talk and in the context of classifying non-rigidly deforming shapes. -- Yali Amit email: amit@marx.uchicago.edu Department of Statistics Tel: 773 - 7022568 and Computer Science Fax: 773 - 7029810 University of Chicago URL: http://galton.uchicago.edu/~amit/ %---------------------------------------------------------------------- Kobus Barnard (Arizona) kobus@cs.arizona.edu Language structure and image understanding Kobus Barnard (joint work with Matthew Johnson and Keiji Yanai) Recent work suggests that significant benefit can be gained from considering image understanding as a problem of translating between visual representations (images) and semantic ones (associated text). Doing so makes the problem more concrete and performance easier to measure. Furthermore, we can take advantage of large, readily available, training sets consisting of images and associated text. 
However, using such naturally available training data has the problem that we do not have a strict correspondence between the language elements and visual attributes. Recent work has shown that with sufficient training data, much of this correspondence ambiguity can be resolved. We are investigating how to exploit natural language processing to further tighten the relationship between words and image attributes. For example, part-of-speech tagging is reasonably effective. Using this facility is interesting because nouns, adjectives, and prepositions tend to relate to visual attributes in different ways. Certain adjectives embody specific image region features, and this relation applies to multiple objects. If this relationship is known, it can help resolve the correspondence between words and image regions. More specifically, one should be able to simultaneously learn the meaning of words such as "red", and use natural language analysis to exploit the occurrence of the modifier "red" to help learn the meaning of "ball" from an image annotated with "red ball". Similarly, if we assume that certain simple prepositions reflect spatial relations, we should be able to simultaneously learn the meaning of those prepositions, and exploit that meaning to help learn the visual representation of nouns being spatially related. Currently we are conducting large-scale experiments to see if we can determine which adjectives can be classified as "visual adjectives", thereby being reasonable words to include in systems that automatically learn to link image regions to words. In a second effort we are testing whether simple adjectives can help learn the visual meanings of the nouns that they modify. A complementary research direction is using image data to help ground the meaning of associated text. In this domain we have developed a method for using images accompanying text to help disambiguate the word senses within that text.
For example, the word "bank" has a number of meanings including "financial institution" and that suggested by "river bank". To integrate image information we exploit a recently developed statistical model for linking visual features and words. Such a model can be used, for example, to link certain image region features to an outdoor meaning of "bank". Inference on visual features can then be used to help distinguish between word senses. A key point is that sense prediction can be much more accurate than general word prediction because the system only needs to distinguish between a limited number of choices---typically fewer than a dozen. Thus word sense disambiguation can be effective under conditions where word prediction performance may not be very good.
%----------------------------------------------------------------------
Chris Bishop (Microsoft) cmbishop@microsoft.com
Comparison of Generative and Discriminative Approaches to Object Recognition
Probabilistic approaches to object recognition can be broadly divided into generative and discriminative techniques, which have very different characteristics, with complementary strengths and weaknesses. In this talk we illustrate the relative merits of generative and discriminative approaches using a simple object classification task. We also discuss techniques for combining the strengths of these two approaches.
%----------------------------------------------------------------------
Joachim Buhmann jbuhmann@inf.ethz.ch
Title: How Far Do Purely Local Features Lead in Object Categorization?
Abstract: This contribution investigates the appropriateness of a collection of local scale invariant features (Lowe's SIFT features) for object categorization on a standard image dataset. This dataset is the Caltech image database that has also been used by [Fergus et al., CVPR'03].
Our approach yields recognition results which are slightly worse than, but comparable to, those of the method by [Fergus et al., CVPR'03] on the same dataset, although the latter system uses more complex constellations of local features. Our simple approach offers rapid recognition within a few seconds per image. Nevertheless, such purely local processing is obviously not appropriate for reaching the ultimate goal of general object categorization. However, these results offer a hint as to how we should scale the complexity of datasets within the challenge of "101 visual object classes" categorization (PASCAL challenge), since such a dataset should force the development of methods that will eventually lead to general categorization and not to recognition with specialized features. An appropriate scaling of a recognition task has forced different face recognition groups to zero in on a quite similar method, namely localization of PCA by Pentland and coworkers and the enhancement of the Gabor feature matching by Malsburg et al. Conceptually these two methods are much more similar than they appear at first glance.
Technical details of this contribution:
Learned object categories: Rear view of cars, frontal view of faces, side view of motorbikes, and side view of planes
Method: Purely local collection of SIFT features
Train set size: 800000 SIFT features
Validation set size: 400000 SIFT features
Test set size: 120 images
Method: regularized decision trees validated on a validation set.
Test error rate: ~16% (random categorization error would be 75%); compare this to the slightly less than 10% error rate of [Fergus et al., CVPR'03]
%----------------------------------------------------------------------
Stefan Carlsson (KTH) stefanc@nada.kth.se
Combinatorial Feature Selection
There are several reasons why visual features should be localized in the image. These include ease of computation as well as ability to handle occlusions.
On the other hand, in order to derive a sufficiently rich descriptor for classification, features have to be combined. The combination of local features for classification forms the basis of most modern approaches to visual recognition. In this work we will take this to the extreme in the sense that features will consist of just localized gradient directions. A combinatorial feature is a set of localized gradient directions scattered over the image. We will discuss ways of representing these combinatorial features as well as methods for computing them that avoid the explosion of complexity associated with combinatorial selections.
%----------------------------------------------------------------------
Tim Cootes (Manchester) Tim.Cootes@man.ac.uk
Statistical Appearance Models for Face Recognition
Human faces are an example of a class of objects in which each example exhibits significant variation in shape and appearance, but which is composed of a fixed number of sub-parts which have a similar configuration in every case. For such objects we can define landmark points on each example which imply a correspondence between different examples. We can then build statistical models of the shape by considering the relative positions of landmarks, and by warping into a common reference frame, can model the pattern of intensities across the object. Such combined models of shape and appearance have been found to be powerful for image interpretation. They are generative models, capable of synthesizing new examples similar to those in the training set. In this talk I will describe such models, and in particular demonstrate how they can be used for face recognition, presenting results on some of the several large databases of face images now available.
%----------------------------------------------------------------------
Chris Dance (Xerox) chris.dance@xrce.xerox.com
Advances in visual categorization with bags of keypoints.
We present recent advances in our method for handling generic visual categorization: the problem of identifying the object content of natural images while generalizing across variations inherent to the object class. This bag of keypoints method is based on vector quantization of affine invariant descriptors of image patches. We compare two alternative implementations using different classifiers: SVM and boosting. The main advantages of the method are that it is simple, computationally efficient and intrinsically invariant. Firstly we illustrate our past findings that the method is robust to background clutter and produces good categorization accuracy. Secondly, we demonstrate the improvements that may be obtained by incorporating geometric information within weak classifiers in a boosting framework. %---------------------------------------------------------------------- Pinar Duygulu duygulu@ceng.metu.edu.tr Recognition as Machine Translation: Labeling objects and faces using large image and video collections In the first part of the talk, I will present a new approach to the object recognition problem, motivated by the recent availability of large annotated image collections. This approach considers object recognition as the translation of image regions to words, similar to the translation of text from one language to another. The lexicon for the translation is learned from large annotated image collections, which consist of images that are associated with text. First, images are segmented into regions, then the regions are clustered in the feature space, categorizing the regions into a finite set of blobs. The correspondences between the blobs and the words are learned, using a method adapted from Statistical Machine Translation. Once learned, these correspondences can be used to predict words corresponding to particular image regions (region naming), or words associated with whole images (auto-annotation). 
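The lexicon-learning step described above can be illustrated with a minimal sketch. It assumes an EM procedure in the style of IBM Model 1 from statistical machine translation; the abstract does not say which translation model is used, and the function names `learn_lexicon` and `name_region`, the blob labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def learn_lexicon(pairs, iterations=20):
    """Estimate t[blob][word] = p(word | blob) with IBM-Model-1-style EM.

    pairs: list of (blobs, words) tuples, one per annotated image.
    This is an illustrative assumption, not the method of the talk.
    """
    blobs_all = {b for blobs, _ in pairs for b in blobs}
    words_all = {w for _, words in pairs for w in words}
    # Start from a uniform translation table.
    t = {b: {w: 1.0 / len(words_all) for w in words_all} for b in blobs_all}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: share each word among the blobs of the same image,
        # in proportion to the current translation probabilities.
        for blobs, words in pairs:
            for w in words:
                z = sum(t[b][w] for b in blobs)
                for b in blobs:
                    r = t[b][w] / z
                    count[b][w] += r
                    total[b] += r
        # M-step: renormalize the expected counts for each blob.
        for b in blobs_all:
            if total[b] > 0:
                for w in words_all:
                    t[b][w] = count[b][w] / total[b]
    return t


def name_region(t, blob):
    """Region naming: return the most probable word for a blob."""
    return max(t[blob], key=t[blob].get)


# Toy annotated collection: no image says which blob goes with which word.
pairs = [
    (["blob_blue", "blob_green"], ["sky", "grass"]),
    (["blob_blue", "blob_sand"], ["sky", "beach"]),
    (["blob_green", "blob_sand"], ["grass", "beach"]),
]
lexicon = learn_lexicon(pairs)
for blob in sorted(lexicon):
    print(blob, "->", name_region(lexicon, blob))
```

Even though no single image resolves the correspondence, co-occurrence across the collection is enough for EM to concentrate each blob's probability mass on one word.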
In the second part, I will discuss methods for extending the approach to label objects in videos. Then, I will present our recent work on automatic naming of people in news videos. This work considers face recognition on a large scale and associates face groups with extracted names, similar to the approach presented in the first part of the talk. The results will be presented on the TRECVID 2004 data set.
%----------------------------------------------------------------------
Boris Epshtein boris.epshtein@weizmann.ac.il
Classification by feature hierarchies
Boris Epshtein, Shimon Ullman Weizmann Institute of Science
In a number of recent classification schemes, image patches, or fragments, are successfully used as informative features for object detection and classification. The fragments are typically chosen during a training stage, according to the amount of information they deliver about the class being recognized. To perform classification, the features are searched for in the input images, followed by a decision stage based on the detected features. In the present work, we extend this approach by decomposing the fragments into sub-fragments in a hierarchical fashion. The presence of a fragment is then detected by the presence of its components. For example, a fragment depicting the eye region can be subdivided into sub-fragments depicting different eye parts, eyebrow etc., which are detected separately. A higher-level fragment can use different types of sub-features (e.g., different eyebrows), and allow a controlled amount of independent movement and other deformations of the sub-parts. There are two motivations for such a hierarchical construction. The first is dealing with variability during the matching process: the same motivation that applies to the classification of complete objects by image fragments applies also to the detection of object parts.
As we will demonstrate, the decomposition by our method increases the amount of information delivered by the fragments, improves the detection rate, and increases the tolerance for local distortions and illumination changes. The second motivation comes from the structure of the primate visual system, in which objects are represented in a hierarchy of increasingly complex features, ranging from local oriented features in V1, to complex shapes and partial or complete object views at high visual areas. I will describe an algorithm for automatically constructing a full feature hierarchy, which proceeds by learning from examples, in a top-down manner. The highest level, class-specific fragments, are extracted first, and then subdivided into simpler sub-fragments in a recursive manner. The decomposition at each level is performed by the application of the same algorithm, guided by the maximization of mutual information. The subdivision stops when the mutual information cannot be improved by further decomposition, producing a set of simple, atomic features. For different natural classes, the depth of the feature hierarchy is typically up to four levels. In addition to the decomposition of a feature into its sub-parts, the method also determines the optimal parameters, such as detection thresholds and the amount of allowed displacements of the sub-features. Finally, we demonstrate that fragments selected at the higher levels of the hierarchy are typically complex and class-specific, but the lower level fragments are simple, and often shared between different object classes. %---------------------------------------------------------------------- A Tutorial on Variational Message Passing and Its Applications for Learning Object Category Models Li Fei-Fei & John Winn Graphical models have become increasingly popular in the field of computer vision. Many applications have successfully used a range of inference techniques for learning in such models. 
Variational inference is a particularly useful technique to approximate posterior distributions in graphical models. For example, variational methods can be used to estimate the uncertainty in particular model parameters. Additionally, variational methods allow models to be compared using Bayesian model selection. In order to apply variational inference, it has previously been necessary to derive update equations by hand, which is both time consuming and error prone. This can be avoided by using a modular algorithm called Variational Message Passing (VMP) which allows variational inference to be applied by passing messages on the model graph. The form of the messages allows the posterior beliefs to be updated using straightforward local operations at each node. VMP can be applied to a very general class of models which can contain both discrete and continuous variables. Furthermore, the modular nature of the VMP algorithm allows it to be re-applied quickly following a change to the model. In this tutorial, we present a detailed explanation of the VMP algorithm, including a worked example. In particular, we show how VMP is applied to an object categorization model. Our object categorization model is based on the idea that much of the information of an object category is captured in the geometrical relationship and appearance of object parts. Constellations of these parts can be represented by probabilistic models to accommodate intra-class variations. While we learn a new object category using very limited training examples (fewer than 10 samples per class), we can incorporate useful "prior" knowledge about the world of objects into the probabilistic model. Using VMP, we update the prior models in the light of observations. We present experimental results using this model on 101 object categories. In addition, we also show that Bayesian probabilistic PCA can be added to the current graphical model. VMP proves to be handy and useful in such extensions.
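To give a flavor of the hand-derived updates that VMP is meant to automate, here is a minimal sketch of mean-field variational inference for a standard textbook model: a univariate Gaussian with unknown mean and precision under a Normal-Gamma prior, with the factorization q(mu) q(tau). This is not the tutorial's own worked example, and the function name `vb_gaussian` and the prior hyperparameter values are assumptions chosen for illustration.

```python
def vb_gaussian(x, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, iterations=50):
    """Mean-field updates for x_i ~ N(mu, 1/tau) with a Normal-Gamma prior.

    Prior (assumed for illustration): mu | tau ~ N(mu0, 1/(lam0*tau)),
    tau ~ Gamma(a0, b0).
    Returns parameters of q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n).
    """
    n = len(x)
    xbar = sum(x) / n
    # These two quantities do not depend on the other factor.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    e_tau = 1.0  # initial guess for E[tau]
    for _ in range(iterations):
        # Update q(mu) given the current expectation of tau.
        lam_n = (lam0 + n) * e_tau
        var_mu = 1.0 / lam_n
        # Update q(tau) given the current q(mu):
        # E[(x_i - mu)^2] = (x_i - mu_n)^2 + Var[mu].
        e_data = sum((xi - mu_n) ** 2 + var_mu for xi in x)
        e_prior = lam0 * ((mu_n - mu0) ** 2 + var_mu)
        b_n = b0 + 0.5 * (e_data + e_prior)
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n


data = [1.0, 2.0, 3.0, 4.0, 5.0]
mu_n, lam_n, a_n, b_n = vb_gaussian(data)
print("E[mu] =", mu_n, " E[tau] =", a_n / b_n)
```

Each update uses only expectations under the other factor, which is the kind of local computation that a message-passing formulation can perform node by node without re-deriving equations when the model changes.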
%---------------------------------------------------------------------- Vittorio Ferrari ferrari@vision.ee.ethz.ch Integrating multiple model views for Object Recognition In recent work a strategy has been outlined to actively explore images, in that a dense covering of affine invariant regions is created. In this way, the processes of recognition and object delineation (segmentation) are intensely intertwined. Here, we extend the approach to capture the relationships between multiple model views and exploit them to improve recognition performance. We propose an efficient algorithm for partitioning a set of region matches into groups lying on smooth surfaces (GAMs). During modeling, the model views are connected by a large number of region-tracks, each aggregating image regions of a single physical region across the views. At recognition time, GAMs are constructed matching a test image to each model view. The consistency of configurations of GAMs is measured by exploiting the model connections. The most consistent configuration, covering the object as completely as possible is found by a genetic algorithm. Introducing GAMs as an intermediate grouping level facilitates decision-making and improves discriminative power. As a complementary application, we introduce a novel GAM-based two-view filter and demonstrate its effectiveness in recovering correct matches in the presence of up to 99\% mismatches. %---------------------------------------------------------------------- Bill Freeman (MIT) billf@ai.mit.edu Sharing features for multi-class object detection We consider the problem of detecting a large number of different classes of objects in cluttered scenes. Traditional approaches require applying a battery of different classifiers to the image, at multiple locations and scales. This can be slow and can require a lot of training data, since each classifier requires the computation of many different image features. 
In particular, for independently trained detectors, the (run-time) computational complexity and the (training-time) sample complexity scale linearly with the number of classes to be detected. It seems unlikely that such an approach will scale up to allow recognition of hundreds or thousands of objects. We present a multi-class boosting procedure (joint boosting) that reduces the computational and sample complexity, by finding common features that can be shared across the classes (and/or views). The detectors for each class are trained jointly, rather than independently. For a given performance level, the total number of features required, and therefore the computational cost, is observed to scale approximately logarithmically with the number of classes. The jointly selected features tend to be edges and generic features typical of many natural structures rather than specific object parts. Those generic features generalize better and considerably reduce the computational cost of an algorithm for multi-class object detection.
%----------------------------------------------------------------------
Nando de Freitas (UBC) nando@cs.ubc.ca
Bayesian multiple instance learning models for object recognition.
In this work we develop new Bayesian multiple instance learning strategies to model images with captions. The resulting discriminative models allow us to deal with missing correspondences in order to predict words for image features. The representation is feature independent and we provide results using ncuts features and several local features. In fact, the model allows us to automatically choose the features that reduce the general object classification error. The model also allows for the inclusion of different types of a priori information and handles the problem of unbalanced datasets automatically. All the model parameters are set automatically with an efficient blocked MCMC sampler.
%----------------------------------------------------------------------
Don Geman (Johns Hopkins) geman@jhu.edu
HIERARCHICAL DESIGNS FOR OBJECT RECOGNITION
Donald Geman Dept. of Applied Mathematics and Statistics and Center for Imaging Science Johns Hopkins University
It is unlikely that scene interpretation, specifically detecting and localizing instances from multiple object classes, will yield directly to improved methods of inductive learning. Some global design is needed to confront the small-sample dilemma (amount of training data relative to the number of possible explanations) and to organize the search (e.g., ensure that intensive computation is restricted to genuinely ambiguous regions). Here, we focus on the computational process itself rather than directly on probability distributions (Bayesian inference) or decision boundaries (statistical learning). As an illustration, I will discuss a tree-structured representation of object indexing in which each node represents a subset of interpretations. Under mild statistical hypotheses, optimal strategies exhibit a steady progression from broad scope coupled with low power to high power coupled with dedication to specific explanations (joint work with G. Blanchard). I will also present several indexing experiments (with Y. Amit and X. Fan) involving reading license plates. Pruning an index to a final interpretation involves object models (see, e.g., Amit's talk) as well as more intense, and context-based, computation, again organized coarse-to-fine.
%----------------------------------------------------------------------
Geoff Hinton (Toronto) hinton@cs.toronto.edu
"How to invert computer graphics"
Geoffrey Hinton University of Toronto
To first order, computer graphics works and computer vision does not. This suggests that we should try to train a neural network to map from the output of a computer graphics program to its high-level input.
The main difficulty is that we do not have access to the high-level input required to generate the tens of thousands of real natural images that we would like to have in our training set. I shall show how we can learn the inverse mapping on real images without being told the high-level input of the graphics program and without knowing anything about how the graphics program works. %---------------------------------------------------------------------- David Hogg (Leeds) dch@comp.leeds.ac.uk Action classification for table top games In our recent work we have looked at the way that object usage in an activity can be used to refine an unsupervised classification - specifically to collapse the number of classes. The aim is to learn how to play simple table-top games involving an unknown set of objects by simply observing repeated instances of the game. Learning is in two phases. Firstly, unsupervised learning is used to discover the object categories in the domain. Typically this produces more categories than the number of semantically meaningful objects. Secondly, the activity is learnt using inductive logic. Emergent from the second phase is a number of equivalence classes of object categories that are semantically equivalent. %---------------------------------------------------------------------- Anthony Hoogs GE Research Recognition and Segmentation of Broad Categories of Scene Content John Kaufhold and Anthony Hoogs In many applications it is possible to define a set of scene element categories that is complete, in the sense that each pixel can be correctly labeled as belonging to one category. In such a setting, categorical recognition and segmentation may be combined to partition an image into regions labeled by the categories. This differs from traditional categorical object recognition in that the categories tend to be very broad, e.g. 
"man-made", "building" and "vegetation"; hence the recognition algorithm must cope with very large intra-class variability while finding salient appearance characteristics to distinguish the classes. Many current approaches rely on color and local texture recognition, and hence are limited in accuracy and their ability to generalize beyond the training set, particularly when the categories become broad. We have developed an approach that first estimates image structure through dense region segmentation, then performs recognition by classifying each region using perceptual features derived from the region graph. Classification is performed by boosting stumps. Specifically, for an n-class recognition problem, we train an adaboosted stumps classifier for each category versus all other categories. This produces an ensemble of (n^2+n)/2 adaboosted classifiers which are combined into a multiclass classifier in a MAP estimation framework. Using region-based features to recognize content in the entire scene has a few key advantages. Spatial scale is dynamically determined for each region based on image structure in the region segmentation, rather than in a fixed window or a search over a range of fixed size. The features are invariant to rotation, and have been shown to recognize texture effectively [PAMI '03]. More recently, we showed that region-based features in a learning framework can distinguish object boundaries from other edges, even in textured scenes [CVPR '04]. In general, using region-based perceptual features enables recognition and learning at a higher perceptual level than many other approaches. Hence it is possible to learn the salient perceptual characteristics that distinguish broad categories, instead of relying on color and local texture. Our approach is largely complementary to patch-based recognition methods as well. 
It can be viewed as a mid-level content recognizer--it avoids some of the problems associated with pixel-based and filter-based approaches, but still allows conglomerations of similar regions to aggregate without constraining their particular spatial relationships. This allows, among other advantages, arbitration between low-level spatially unconstrained relationships between image features (such as foliage) and more focused image content recognizers with more specific targets (such as motorbikes). Preliminary results are demonstrated on aerial video, broadcast news video and aerial images. For the aerial data the categories include "building", "road", and "parking lot", which are very similar in local appearance, as well as "tree" and "grass" which are amenable to traditional texture approaches. The method works in both grayscale and color, and has been extended to video using MLE across multiple frames.
%----------------------------------------------------------------------
Frederic Jurie
Scale-invariant Shape Features for Recognition of Object
We introduce a new class of distinguished regions based on detecting the most salient convex local arrangements of contours in the image. The regions are used in a similar way to the local interest points extracted from gray-level images, but they capture shape rather than texture. Local convexity is characterized by measuring the extent to which the detected image contours support circle or arc-like local structures at each position and scale in the image. The detected regions are invariant to scale changes and rotations, and robust against clutter, occlusions and spurious edge detections. Experimental results show very good performance for both shape matching and recognition of object categories.
%----------------------------------------------------------------------
Jan Koenderink (Utrecht) j.j.koenderink@phys.uu.nl
3D object congruence in human pictorial space
Jan J.
Koenderink, Universiteit Utrecht & Andrea J. van Doorn, Delft University of Technology
Perception of 3D shape from pictures is inherently ambiguous. Human observers use the group of ambiguities to implement their "beholder's share" in perception of 3D shape. As a result observers can be shown to commit "mental viewpoint changes" in the space defined by the ambiguities. Such changes can be interpreted as the group of congruences in human pictorial space. From Felix Klein's perspective (the Erlangen program) this defines the geometry of "pictorial space". It turns out to be a non-Euclidean space with one isotropic dimension. The result is important for human object recognition from pictures.
%----------------------------------------------------------------------
Christophe Laurent France Telecom R&D
Natural Image Classification Using Foveal Strings
Classically, image recognition approaches extract the image signature by considering the image content as a whole. Recently, some researchers have proposed an alternative aiming at designing the image signature from a limited number of salient regions. Consequently, the images are no longer represented by a global signature but instead by a sequence of local signatures, each one being computed in a salient zone. However, due to the absence of a natural ordering between salient zones, the image signature can no longer be seen as a vector, thus increasing the complexity of the image matching step. In this paper, we propose to build such an order by constructing a string from a set of salient zones. Each symbol in the string is then a local signature describing the image content in the neighborhood of a salient zone. Image matching can thus be performed by computing a string edit distance between two strings representing the images to be matched.
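The matching step can be sketched as a standard dynamic-programming edit distance in which the symbols are local signatures rather than characters. The sketch below is an assumption about how such a comparison might look: the abstract does not specify how the salient zones are ordered into a string or which symbol costs are used, so the `sub_cost` and `indel_cost` parameters and the toy signatures are hypothetical.

```python
def edit_distance(s, t, sub_cost=None, indel_cost=1.0):
    """Edit distance between two sequences of arbitrary symbols.

    sub_cost(a, b) is the cost of replacing symbol a by symbol b;
    by default it is 0 for equal symbols and 1 otherwise (Levenshtein).
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel_cost]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel_cost,         # delete a
                curr[j - 1] + indel_cost,     # insert b
                prev[j - 1] + sub_cost(a, b)  # substitute a by b
            ))
        prev = curr
    return prev[-1]


def euclidean(a, b):
    """Substitution cost between two signature vectors (illustrative choice)."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


# Plain character strings use the default unit costs.
print(edit_distance("kitten", "sitting"))

# Strings of (hypothetical) local signatures, one vector per salient zone.
image_a = [(0.0, 0.0), (1.0, 1.0)]
image_b = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(edit_distance(image_a, image_b, sub_cost=euclidean))
```

Using a continuous substitution cost lets similar but not identical signatures match cheaply, while insertions and deletions absorb salient zones that appear in only one of the two images.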
%---------------------------------------------------------------------- Svetlana Lazebnik slazebni@students.uiuc.edu Semi-Local Affine Parts for Object Recognition We propose an approach to object class recognition using multiple composite semi-local affine parts, or 2D configurations of local affine-covariant regions that are stable across a range of views of an object, and also across multiple instances of the same class. Unlike parts used in alternative approaches (points or individual regions), semi-local affine parts are not atomic, but are made up of multiple regions, each of which is associated with its own shape (ellipse) and intensity pattern. Making parts composite improves their expressiveness and distinctiveness, and provides added flexibility for representing complex 3D objects. The geometric layout of the regions within an individual part is assumed to be approximately affinely rigid. Because semi-local affine parts correspond to small, roughly planar patches on the surface of a 3D object, this 2D affine invariance is sufficient to handle 3D viewpoint changes and small deformations. On the other hand, the spatial configuration of multiple parts is modeled using loose co-occurrence or adjacency relations, providing stability against significant occlusions and large non-rigid deformations. We describe procedures for learning parts and relations in a weakly supervised fashion, i.e., based on small sets of unsegmented, cluttered training images, and show preliminary recognition results on real-world data. %---------------------------------------------------------------------- Yann LeCun (NYU) yann@cs.nyu.edu Loss Functions for Discriminative Training of Energy-Based Models with Applications to Object Recognition. Yann LeCun and Fu Jie Huang The Courant Institute, New York University Probabilistic graphical models model distributions over a set of variables by associating a (normalized) probability to each configuration of the variables. 
Ensuring proper normalization justifies many popular techniques, such as maximizing likelihood for inference and learning. However, many popular architectures, such as Markov Random Fields, Conditional Random Fields, and other non-tractable graphical models cannot be normalized implicitly, and must be normalized explicitly by dividing (un-normalized) likelihoods by an often-intractable partition function. By contrast, energy-based graphical models (EBGM) merely associate an energy to configurations of variables, eliminating the need for proper normalization. An EBGM is defined as a scalar energy function E(W,Y,Z,X), where W is a trainable parameter that parameterizes a space of possible energy functions, X is a vector of inputs (which are always observed), Y is a vector of variables to be predicted (i.e. "outputs" observed on the training set, but not on new examples), and Z is a vector of latent/hidden variables (which are never observed). One can view the energy function as a measure of "compatibility" between the values of Y, Z, and X. Making a decision (performing an inference) with an EBGM consists in comparing the energies associated with various configurations of Y and Z, and choosing the one with the smallest energy. Because of the absence of normalization, EBGMs can only be used for discrimination tasks that require "hard" decisions, where only the relative energies of the various configurations of Z and Y for a given X matter. Using EBGMs circumvents the requirement to normalize the models and compute partition functions that may be intractable. The EBGM paradigm is particularly appropriate for systems that are trained "end-to-end" to perform classifications. Specific examples include object recognition and mobile robot navigation: the robot must turn left or right when facing an obstacle. Computing a distribution over steering angles would be of little use in this context.
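The inference rule described above (compare the energies of candidate configurations of Y and Z and keep the smallest) can be sketched as follows. This is a minimal illustration over small discrete candidate sets; the function `toy_energy`, the label names, and the latent grid are hypothetical and are not taken from the abstract.

```python
from itertools import product

def infer(energy, w, x, y_candidates, z_candidates):
    """Return the (y, z) pair with the lowest energy E(w, y, z, x)."""
    return min(product(y_candidates, z_candidates),
               key=lambda yz: energy(w, yz[0], yz[1], x))

def toy_energy(w, y, z, x):
    """Hypothetical energy: squared mismatch between x and prototype w[y] shifted by z."""
    return (x - (w[y] + z)) ** 2

w = {"left": -1.0, "right": 1.0}   # hypothetical per-label prototypes
z_grid = [-0.25, 0.0, 0.25]        # hypothetical latent offsets
best_y, best_z = infer(toy_energy, w, 0.8, list(w), z_grid)
print(best_y, best_z)
```

Because the decision depends only on which configuration has the lowest energy, adding a constant to every energy leaves the result unchanged, which is why no partition function is needed for this kind of hard decision.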
EBGMs must be trained discriminatively to associate low energies to the desired configurations and higher energies to undesired configurations. A wide variety of loss functions can be used for this purpose. We give sufficient conditions that a loss function should satisfy so that its minimization will cause the system to approach the desired behavior. We show that traditional loss functions, such as the ones derived from the probabilistic framework, belong to a small class of loss functions that are particularly difficult to compute (let alone minimize). We give many specific examples of suitable loss functions that are considerably easier to handle than traditional ones. We show an application of these ideas to object recognition with the NORB dataset which contains 50 different uniform-colored toy objects under 18 azimuths, 9 elevations, and 6 lighting conditions. The objects are 10 instances from each of 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. Five instances of each category were used for training, and the other five for testing. An EBGM was used where input X is a stereo pair of 96x96-pixel images, output Y is the label of the object, and latent variable Z is 2-dimensional. The EBGM contains a convolutional network that maps images to points on a set of 2-D half-spheres embedded in 9-D spaces. There is one such half-sphere per category. The output category is the one for which the point produced by the convolutional net is closest to the corresponding half-sphere. The position on the half-sphere ends up representing the pose and other internal parameters of the objects. A contrastive loss function drives the convolutional network to output a point as close as possible to the half-sphere of the desired class, and as far as possible from the half-spheres of the undesired classes.
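The nearest-manifold decision and the idea of a contrastive loss can be illustrated with a much-simplified sketch. The assumptions here are substantial: each class manifold is a full unit sphere in 3-D rather than a half-sphere in 9-D, the energy is the distance from a point to that sphere, and the loss is a generic margin-based contrastive form chosen for this example, not the loss used by the authors. All names and centers are hypothetical.

```python
from math import dist

def energy(point, center, radius=1.0):
    """Distance from a point to the sphere with the given center and radius."""
    return abs(dist(point, center) - radius)

def classify(point, centers):
    """Pick the class whose sphere lies closest to the point."""
    return min(centers, key=lambda label: energy(point, centers[label]))

def contrastive_loss(point, centers, desired, margin=1.0):
    """Generic margin-based contrastive loss (an assumption of this sketch).

    Pulls the point toward the desired class sphere and pushes it away from
    every other class sphere until it is at least `margin` away.
    """
    pull = energy(point, centers[desired]) ** 2
    push = sum(max(0.0, margin - energy(point, c)) ** 2
               for label, c in centers.items() if label != desired)
    return pull + push

# Hypothetical class manifolds and a point produced by the network
centers = {"car": (0.0, 0.0, 0.0), "truck": (3.0, 0.0, 0.0)}
point = (1.0, 0.0, 0.0)
print(classify(point, centers))
print(contrastive_loss(point, centers, "car"))
print(contrastive_loss(point, centers, "truck"))
```

The point lies exactly on the "car" sphere and one unit away from the "truck" sphere, so the loss is zero when "car" is the desired label and positive when "truck" is.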
A 6-layer convolutional network trained with the traditional mean-square loss achieves 6.8% test error on this set, while the EBGM trained with the contrastive loss achieves 6.3% error. We may also briefly describe an application of EBGMs to region segmentation of microscopic images of animal embryos. %---------------------------------------------------------------------- Margarita Osadchy rita@osadchy.net Synergistic Face Detection and Pose Estimation with Energy-Based Models M. Osadchy, M. Miller, and Y. LeCun We describe a novel method that not only detects faces independently of their poses, but simultaneously estimates those poses. The system is highly reliable, runs at near real time (5 frames per second on standard hardware), and is robust against variations in yaw (-90 to +90 degrees), roll (-45 to +45 degrees), and pitch (-60 to +60 degrees). We build a trainable system that can map raw images to points in a low-dimensional space. In that space, we pre-define a face manifold that we parameterize by the pose. We train the system to map face images with known poses to the corresponding point on the manifold. We also train it to map non-face images to points far away from the manifold. Proximity to the manifold then tells us whether or not an image is a face, and projection to the manifold yields an estimate of the pose. Conceptually, we can view the pose parameter as a latent variable that can be inferred through an energy-minimization process. To train the machine we derive a new type of discriminative loss function that is tailored to such detection tasks. We employ a Convolutional Network as the basic architecture for the image-to-face-space mapping function. We test the resulting system, in a single configuration, on three standard data sets (one for frontal pose, one for rotated faces, and one for profiles) and find that its performance on each set is comparable to previous multi-view face detectors that can only handle one form of pose variation.
As far as we know, we are the first to publish results for a SINGLE detector on all three standard data sets. On the standard data sets, the estimates of yaw and in-plane rotation are within 15 degrees of manual estimates over 80% and 95% of the time, respectively (faces are not localized). We also show experimentally that the system's accuracy on both face detection and pose estimation is improved by training for the two tasks together. %---------------------------------------------------------------------- Pietro Perona (Caltech) perona@vision.caltech.edu A concise taxonomy of recognition An organized survey of recognition tasks. A review of the state of the art. A short list of interesting open problems. %---------------------------------------------------------------------- Jim Rehg (Georgia Tech) rehg@cc.gatech.edu Fast and Automatic Induction of Cascade Classifiers I will present some new results in learning cascade classifiers for object recognition problems. I will introduce a risk formulation for cascade learning which yields a fully-automatic algorithm for node training and enables a principled comparison of different feature selection algorithms and feature spaces. Results of a comprehensive empirical study will be presented. Some additional optimizations for fast induction will be described, along with preliminary results for hierarchical multi-class classification. This is joint work with Charlie Brubaker, Matt Mullin, Jie Sun, and Jianxin Wu. %---------------------------------------------------------------------- Bernt Schiele (ETH Zurich) schiele@inf.ethz.ch Scale-Invariant Object Categorization and Segmentation using a Scale-Adaptive Mean-Shift Search Algorithm We present a method for scale-invariant object categorization in real-world scenes. Following a common consensus in the field, we do not assume that a figure-ground segmentation is available prior to recognition.
However, in contrast to most standard approaches for object class recognition, our approach automatically segments the object as a result of the categorization. This combination of recognition and segmentation into one process is made possible by our use of an Implicit Shape Model, which integrates both into a common probabilistic framework. This model can be thought of as a non-parametric approach which can easily handle configurations of large numbers of object parts. In addition to the recognition and segmentation result, it also generates a per-pixel confidence measure specifying the area that supports a hypothesis and how much it can be trusted. We use this confidence to derive a natural extension of the approach to handle multiple objects in a scene and resolve ambiguities between overlapping hypotheses with a novel MDL-based criterion. Our results show that the proposed method outperforms state-of-the-art approaches on a large variety of different object categories while being robust to large scale changes. %---------------------------------------------------------------------- Bernhard Schoelkopf (Tuebingen) bernhard.schoelkopf@tuebingen.mpg.de Kernel Methods for Implicit Surface Modeling We describe methods for computing an implicit model of a hypersurface that is given only by a finite sampling. The methods work by mapping the sample points into a reproducing kernel Hilbert space and then determining regions in terms of hyperplanes. %---------------------------------------------------------------------- Tom Strat DARPA Object Recognition and Persistent Sensing The advent of persistent and ubiquitous sensing is an opportunity to automate image exploitation with reliability and accuracy that has never been possible before. It fosters a new paradigm for sensor exploitation that can help reduce the fog of war. 
Today's image analysis algorithms do not take advantage of persistent sensing, so new research in image understanding is needed to create the foundations upon which we can build the next generation of sensor exploitation systems. I will present some thoughts on how to do that, but will mostly lay out some challenges to the research community to devise ways to exploit a data-rich environment to solve heretofore unsolved problems in object recognition. %---------------------------------------------------------------------- Rahul Sukthankar (Intel) rahuls@cs.cmu.edu PCA-SIFT: Improving Matching Accuracy for Local Image Descriptors Stable local feature detection and representation is a fundamental component of many image registration and object recognition algorithms. Mikolajczyk and Schmid recently evaluated a variety of approaches and identified the SIFT algorithm as being the most resistant to common image deformations. Our work examines (and improves upon) the local image descriptor used by SIFT. Like SIFT, our descriptors encode the salient aspects of the image gradient in the feature point's neighborhood; however, instead of using SIFT's smoothed weighted histograms, we apply Principal Components Analysis (PCA) to the normalized gradient patch. Our experiments demonstrate that the PCA-based local descriptors are more distinctive, more robust to image deformations, and more compact than the standard SIFT representation. We have integrated PCA-SIFT into a near-duplicate/sub-image retrieval system. This system achieves near-perfect accuracy (100% precision at 99.85% recall) on the tests presented in Meng et al. (CVPR 2003), and consistently strong results on our own, significantly more challenging experiments. Query times are interactive even for collections of thousands of images.
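The core idea in the PCA-SIFT abstract, projecting a gradient vector onto a few principal directions to get a compact descriptor, can be sketched with the standard library alone. The sketch below estimates the principal directions by power iteration with deflation, a simple technique swapped in here to avoid external dependencies; a practical system would more likely use a numerical eigen-solver. The function names, the tiny 3-D training vectors, and the iteration count are assumptions for illustration, and the normalization of gradient patches is left to the caller.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def fit_pca(rows, k, iters=100):
    """Return (mean, top-k principal directions) via power iteration with deflation."""
    d = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    data = [[x - m for x, m in zip(r, mean)] for r in rows]
    components = []
    for _component in range(k):
        v = normalize([1.0 + j for j in range(d)])  # arbitrary deterministic start
        for _step in range(iters):
            # One power step: v <- normalize(data^T (data v))
            scores = [dot(r, v) for r in data]
            v = normalize([sum(s * r[j] for s, r in zip(scores, data))
                           for j in range(d)])
        components.append(v)
        # Deflation: remove the found direction so the next pass finds a new one
        data = [[x - dot(r, v) * c for x, c in zip(r, v)] for r in data]
    return mean, components

def project(vector, mean, components):
    """Compact descriptor: coordinates of (vector - mean) along each component."""
    centered = [x - m for x, m in zip(vector, mean)]
    return [dot(centered, c) for c in components]

# Hypothetical training vectors standing in for flattened gradient patches
training = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [-1.0, -2.0, 0.0]]
mean, components = fit_pca(training, k=1)
descriptor = project([2.0, 4.0, 0.0], mean, components)
print(components[0], descriptor)
```

In this toy data all variation lies along one direction, so a single component is enough to describe each vector; with real gradient patches the number of retained components is a design choice that trades compactness against distinctiveness.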
Joint work with Yan Ke (CMU) %---------------------------------------------------------------------- Michalis Titsias M.Titsias@sms.ed.ac.uk Fast Unsupervised Greedy Learning of Multiple Objects and Parts from Video Williams and Titsias (2004) have shown how to carry out unsupervised greedy learning of multiple objects from images (GLOMO), building on the work of Jojic and Frey (2001). In this paper we show that the earlier work on GLOMO can be greatly sped up for video sequence data by carrying out approximate tracking of the multiple objects in the scene. Our method is applied to raw image sequence data and extracts the objects one at a time. First, the moving background is learned, and moving objects are found at later stages. The algorithm recursively updates an appearance model of the tracked object so that possible occlusion of the object is taken into account, which makes tracking stable. %---------------------------------------------------------------------- Antonio Torralba torralba@ai.mit.edu Contextual models for multiclass object detection Antonio Torralba, Kevin P. Murphy, William T. Freeman We seek to detect and segment objects in images using both local image data and contextual information. Conditional random fields (CRFs) are a natural way to model correlation between labels (the outputs of classifiers), given an image as input. The main problem with CRFs is how to learn the correlation (graph) structure of the model. A 4-nearest neighbor grid structure, successful in low-level vision, will fail to capture important long-distance dependencies between whole regions and across classes. Here, we learn the graph structure by using boosting to select from a dictionary of connectivity templates (derived from labeled segmentations), which are combined together in an additive model. We also use boosting to learn the local evidence potentials of the model.
We show how contextual information from other objects can improve detection performance, both in terms of accuracy and speed, by using a computational cascade. We apply our system to detect stuff and things in office and street scenes. %---------------------------------------------------------------------- Bill Triggs (INRIA) William.Triggs@inrialpes.fr A Hierarchical Part-Based Model for Visual Object Categorization Guillaume Bouchard and Bill Triggs LEAR, GRAVIR-INRIA, 655 av. de l'Europe, 38330 Montbonnot, France We propose a hierarchical generative model for coding the geometry and appearance of visual object categories. The model is a collection of loosely connected parts containing more rigid assemblies of subparts. It is optimized for domains where there are relatively large numbers of somewhat informative subparts, such as the features returned by local feature methods from computer vision. The model is learned quickly by an E-M procedure. Some experiments on real images show its ability to fit complex natural object classes. %---------------------------------------------------------------------- Shimon Ullman (Weizmann Institute) shimon@wisdom.weizmann.ac.il Classification and recognition by a hierarchy of extended fragments In fragment-based classification, objects within a class are represented by common sub-structures, selected during training. Optimal fragments can be selected based on a criterion of maximizing the mutual information between the fragments and the class they represent. In this talk I will describe two extensions to the basic fragment-based scheme. First, to obtain invariant recognition of individual objects, the representation uses extended fragments, which are equivalence sets of fragments representing the same object part under different viewing conditions. Based on extended fragments, invariant object recognition can be obtained from a single view of a novel object.
Second, I will describe how the fragments can be constructed hierarchically, and show that features constructed hierarchically are considerably more informative and better for classification than similar features used as whole entities. %---------------------------------------------------------------------- Hirano Yutaka: Object recognition for a service robot at Toyota To grasp an object with a complex shape in a cluttered scene, many technologies must be developed, such as 1) object detection and localization in clutter, 2) separation of objects, and 3) object category recognition for handling general objects. We will give a brief overview of the development at Toyota and also present our expectations for future research. %----------------------------------------------------------------------