A deep residual network, built by stacking a sequence of residual blocks, is easy to train, because identity mappings skip residual branches and thus improve information flow. To further reduce the training difficulty, we present a simple network architecture, deep merge-and-run neural networks. The novelty lies in a modularized building block, merge-and-run block, which assembles residual branches in parallel through a merge-and-run mapping: Average the inputs of these residual branches (Merge), and add the average to the output of each residual branch as the input of the subsequent residual branch (Run), respectively. We show that the merge-and-run mapping is a linear idempotent function in which the transformation matrix is idempotent, and thus improves information flow, making training easy. In comparison to residual networks, our networks enjoy compelling advantages: they contain much shorter paths, and the width, i.e., the number of channels, is increased. We evaluate the performance on the standard recognition tasks. Our approach demonstrates consistent improvements over ResNets with the comparable setup, and achieves competitive results (e.g., 3.57% testing error on CIFAR-10, 19.00% on CIFAR-100, 1.51% on SVHN).
Localizing text in the wild is challenging in the situations of complicated geometric layout of the targets like random orientation and large aspect ratio. In this paper, we propose a geometry-aware modeling approach tailored for scene text representation with an end-to-end learning scheme. In our approach, a novel Instance Transformation Network (ITN) is presented to learn the geometry-aware representation encoding the unique geometric configurations of scene text instances with in-network transformation embedding, resulting in a robust and elegant framework to detect words or text lines at one pass. An end-to-end multi-task learning strategy with transformation regression, text/non-text classification and coordinate regression is adopted in the ITN. Experiments on the benchmark datasets demonstrate the effectiveness of the proposed approach in detecting scene text in various geometric configurations.
In this paper, we address the problem of person reidentification, which refers to associating the persons captured from different cameras. We propose a simple yet effective human part-aligned representation for handling the body part misalignment problem. Our approach decomposes the human body into regions (parts) which are discriminative for person matching, accordingly computes the representations over the regions, and aggregates the similarities computed between the corresponding regions of a pair of probe and gallery images as the overall matching score. Our formulation, inspired by attention models, is a deep neural network modeling the three steps together, which is learnt through minimizing the triplet loss function without requiring body part labeling information. Unlike most existing deep learning algorithms that learn a global or spatial partition-based local representation, our approach performs human body partition, and thus is more robust to pose changes and various human spatial distributions in the person bounding box. Our approach shows state-of-the-art results over standard datasets, Market-1501, CUHK03, CUHK01 and VIPeR.
In this paper, we propose an end-to-end deep correspondence structure learning (DCSL) approach to address the cross-camera person-matching problem in the person re-identification task. The proposed DCSL approach captures the intrinsic structural information on persons by learning a semantics aware image representation based on convolutional neural networks, which adaptively learns discriminative features for person identification. Furthermore, the proposed DCSL approach seeks to adaptively learn a hierarchical data-driven feature matching function which outputs the matching correspondence results between the learned semantics-aware image representations for a person pair. Finally, we set up a unified end-to-end deep learning scheme to jointly optimize the processes of semantics-aware image representation learning and cross-person correspondence structure learning, leading to more reliable and robust person re-identification results in complicated scenarios. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach against the state-of-the-art approaches.
A key problem in salient object detection is how to effectively model the semantic properties of salient objects in a data-driven manner. In this paper, we propose a multi-task deep saliency model based on a fully convolutional neural network (FCNN) with global input (whole raw images) and global output (whole saliency maps). In principle, the proposed saliency model takes a data-driven strategy for encoding the underlying saliency prior information, and then sets up a multi-task learning scheme for exploring the intrinsic correlations between saliency detection and semantic image segmentation. Through collaborative feature learning from such two correlated tasks, the shared fully convolutional layers produce effective features for object perception. Moreover, it is capable of capturing the semantic information on salient objects across different levels using the fully convolutional layers, which investigates the feature-sharing properties of salient object detection with great feature redundancy reduction. Finally, we present a graph Laplacian regularized nonlinear regression model for saliency refinement. Experimental results demonstrate the effectiveness of our approach in comparison with the state-of-the-art approaches.
As an important and challenging problem in computer vision and graphics, keypoint-based object tracking is typically formulated in a spatio-temporal statistical learning framework. However, most existing keypoint trackers are incapable of effectively modeling and balancing the following three aspects in a simultaneous manner: temporal model coherence across frames, spatial model consistency within frames, and discriminative feature construction. To address this issue, we propose a robust keypoint tracker based on spatio-temporal multi-task structured output optimization driven by discriminative metric learning. Consequently, temporal model coherence is characterized by multi-task structured keypoint model learning over several adjacent frames, while spatial model consistency is modeled by solving a geometric verification based structured learning problem. Discriminative feature construction is enabled by metric learning to ensure the intra-class compactness and inter-class separability. Finally, the above three modules are simultaneously optimized in a joint learning scheme. Experimental results have demonstrated the effectiveness of our tracker.