Education

  • Ph.D. 2016 - 2019

    Department of Electrical and Computer Engineering, Faculty of Engineering

    National University of Singapore, Singapore

  • Ph.D. 2014 - 2015

    School of Computer

    National University of Defense Technology, China

  • M.Eng. 2012 - 2014

    School of Computer

    National University of Defense Technology, China

  • B.Eng. 2008 - 2012

    School of Automation Science and Electrical Engineering

    Beihang University, China

Work Experience

  • 2019 - Present

    Assistant Professor

    Institute of North Electronic Equipment, Beijing, China

  • 2022 - 2023

    Visiting Scholar

    Peng Cheng Laboratory, Shenzhen, China

  • 2019 - 2020

    Rhino-Bird Visiting Scholar

    Tencent AI Lab, Shenzhen, China

  • 2018 - 2019

    "Texpert" Research Scientist

    FiT DeepSea AI Lab, Tencent, Shenzhen, China

  • 2016 - 2018

    Research Intern

    Core Technology Group, Learning & Vision, Panasonic R&D Center, Singapore

  • 2016 - 2017

    Graduate Assistant

    NUS Module: EE2024 Programming for Computer Interfaces

  • 2011 - 2012

    Research Intern

    China Aerospace Science and Industry Corporation, Beijing, China

Publications

  • 2023
    MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID
    Neural Architecture Search (NAS) has become increasingly appealing to the object re-identification (ReID) community, as task-specific architectures significantly improve retrieval performance. Previous works explore new optimization targets and search spaces for NAS ReID, yet they neglect the difference in training schemes between image classification and ReID. In this work, we propose a novel Twins Contrastive Mechanism (TCM) to provide more appropriate supervision for ReID architecture search. TCM reduces the category overlap between the training and validation data, and assists NAS in simulating real-world ReID training schemes. We then design a Multi-Scale Interaction (MSI) search space to search for rational interaction operations between multi-scale features. In addition, we introduce a Spatial Alignment Module (SAM) to further enhance attention consistency when confronted with images from different sources. Under the proposed NAS scheme, a specific architecture is automatically searched, named MSINet. Extensive experiments demonstrate that our method surpasses state-of-the-art ReID methods in both in-domain and cross-domain scenarios. Source code is available at https://github.com/vimar-gu/MSINet.

    Jianyang Gu, Kai Wang, Hao Luo, Chen Chen, Wei Jiang, Yuqiang Fang, Shanghang Zhang, Yang You, and Jian Zhao

    (Corresponding Author: Jian Zhao)

    CVPR 2023  PDF
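
    A minimal sketch of the data handling behind the Twins Contrastive Mechanism described above, assuming only what the abstract states: the identities used for training the supernet weights and those used for validating architecture parameters are kept disjoint, mirroring the category-disjoint evaluation protocol of ReID. All names are illustrative and not taken from the paper's code.

    ```python
    import random
    from typing import Dict, List, Tuple

    def twins_disjoint_split(samples_by_id: Dict[int, List[str]],
                             val_ratio: float = 0.5,
                             seed: int = 0) -> Tuple[List[str], List[str]]:
        """Split a ReID dataset so that the identities (categories) used for
        weight training and those used for architecture validation do not
        overlap, reducing category overlap as the abstract describes."""
        ids = sorted(samples_by_id)
        random.Random(seed).shuffle(ids)
        n_val = int(len(ids) * val_ratio)
        val_ids, train_ids = set(ids[:n_val]), set(ids[n_val:])
        train = [p for i in train_ids for p in samples_by_id[i]]
        val = [p for i in val_ids for p in samples_by_id[i]]
        return train, val

    # Usage (illustrative): alternate supernet weight updates on `train` with
    # architecture-parameter updates on `val`, so the search signal is computed
    # on identities never seen during weight training.
    ```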

  • 2023
    A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation
    Multi-person pose estimation generally follows top-down and bottom-up paradigms. The top-down paradigm detects all human boxes and then performs single-person pose estimation on each ROI. The bottom-up paradigm locates identity-free keypoints and then groups them into individuals. Both of them use an extra stage to build the relationship between human instances and their corresponding keypoints (e.g., human detection in a top-down manner or a grouping process in a bottom-up manner). The extra stage leads to a high computation cost and a redundant two-stage pipeline. To address the above issue, we introduce a fine-grained body representation method. Concretely, the human body is divided into several local parts and each part is represented by an adaptive point. The novel body representation is able to sufficiently encode the diverse pose information and effectively model the relationship between human instances and their corresponding keypoints in a single forward pass. With the proposed body representation, we further introduce a compact single-stage multi-person pose regression network, called AdaptivePose++, which is an extended version of the AAAI-22 paper AdaptivePose. During inference, our proposed network only needs a single-step decode operation to estimate the multi-person pose without complex post-processing and refinement. Without any bells and whistles, we achieve the most competitive performance on the representative 2D pose estimation benchmarks MS COCO and CrowdPose in terms of accuracy and speed. In particular, AdaptivePose++ outperforms the state-of-the-art SWAHR-W48 and CenterGroup-W48 by 3.2 AP and 1.4 AP on COCO mini-val with faster inference speed. Furthermore, the outstanding performance on the 3D pose estimation datasets MuCo-3DHP and MuPoTS-3D further demonstrates its effectiveness and generalizability in 3D scenes.

    Yabo Xiao, Xiaojuan Wang, Mingshu He, Lei Jin, Mei Song, and Jian Zhao

    Electronics  PDF, BibTeX

  • 2022
    Joint Coupled Representation and Homogeneous Reconstruction for Multi-Resolution Small Sample Face Recognition
    Off-the-shelf dictionary learning algorithms have achieved satisfactory results in small sample face recognition applications. However, the achieved results depend on the facial images obtained at a single resolution. In practice, the resolution of the images captured on the same target is different because of the different shooting equipment and different shooting distances. These images of the same category at different resolutions will pose a great challenge to these algorithms. In this paper, we propose a Joint Coupled Representation and Homogeneous Reconstruction (JCRHR) for multi-resolution small sample face recognition. In JCRHR, an analysis dictionary is introduced and combined with the synthetic dictionary for coupled representation learning, which better reveals the relationship between coding coefficients and samples. In addition, a coherence enhancement term is proposed to improve the coherent representation of the coding coefficients at different resolutions, which facilitates the reconstruction of the sample by its homogeneous atoms. Moreover, each sample at different resolutions is assigned a different coding coefficient in the multi-dictionary learning process, so that the learned dictionary is more in line with the actual situation. Furthermore, a regularization term based on the fractional norm is drawn into the dictionary coupled learning to remove the redundant information in the dictionary, which can reduce the negative impacts of the redundant information. Comprehensive results demonstrate that the proposed JCRHR method achieves better results than the state-of-the-art methods, on several small sample face databases.

    Xiaojin Fan, Mengmeng Liao, Jingfeng Xue, Hao Wu, Lei Jin, Jian Zhao, and Liehuang Zhu

    Neurocomputing  PDF, BibTeX

  • 2022
    TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning
    Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is typically represented by attribute descriptions shared between different classes, which act as strong priors for localizing object attributes that represent discriminative region features, enabling significant and sufficient visual-semantic interaction for advancing ZSL. Existing attention-based models learn only inferior region features in a single image by solely using unidirectional attention, which ignores the transferable and discriminative attribute localization of visual features for representing the key semantic knowledge for effective knowledge transfer in ZSL. In this paper, we propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for key semantic knowledge representations in ZSL. Specifically, TransZero++ employs an attribute→visual Transformer sub-net (AVT) and a visual→attribute Transformer sub-net (VAT) to learn attribute-based visual features and visual-based attribute features, respectively. By further introducing feature-level and prediction-level semantical collaborative losses, the two attribute-guided transformers teach each other to learn semantic-augmented visual embeddings for key semantic knowledge representations via semantical collaborative learning. Finally, the semantic-augmented visual embeddings learned by AVT and VAT are fused to conduct desirable visual-semantic interaction in cooperation with class semantic vectors for ZSL classification. Extensive experiments show that TransZero++ achieves new state-of-the-art results on three golden ZSL benchmarks and on the large-scale ImageNet dataset. The project website is available at: https://shiming-chen.github.io/TransZero-pp/TransZero-pp.html.

    Shiming Chen, Ziming Hong, Wenjin Hou, Guosen Xie, Yibing Song, Jian Zhao, Xinge You, Shuicheng Yan, and Ling Shao

    T-PAMI  PDF, BibTeX

  • 2022
    Rethinking Sampling Strategies for Unsupervised Person Re-identification
    Unsupervised person re-identification (re-ID) remains a challenging task. While extensive research has focused on the framework design and loss function, this paper shows that the sampling strategy plays an equally important role. We analyze the reasons for the performance differences between various sampling strategies under the same framework and loss function. We suggest that deteriorated over-fitting is an important factor causing poor performance, and enhancing statistical stability can rectify this problem. Inspired by that, a simple yet effective approach is proposed, termed group sampling, which gathers samples from the same class into groups. The model is thereby trained using normalized group samples, which helps alleviate the negative impact of individual samples. Group sampling updates the pipeline of pseudo-label generation by guaranteeing that samples are more efficiently classified into the correct classes. It regulates the representation learning process, enhancing statistical stability for feature representation in a progressive fashion. Extensive experiments on Market-1501, DukeMTMC-reID and MSMT17 show that group sampling achieves performance comparable to state-of-the-art methods and outperforms the current techniques under purely camera-agnostic settings. Code is available at https://github.com/ucas-vg/GroupSampling.

    Xumeng Han, Xuehui Yu, Guorong Li, Jian Zhao, Gang Pan, Qixiang Ye, Jianbin Jiao, and Zhenjun Han

    T-IP  PDF, BibTeX

  • 2022
    Waveform Level Adversarial Example Generation for Joint Attacks Against both Automatic Speaker Verification and Spoofing Countermeasures
    Adversarial examples crafted to deceive Automatic Speaker Verification (ASV) systems have attracted a lot of attention when studying the vulnerability of ASV. However, real-world ASV systems usually work together with spoofing countermeasures (CM) to exclude fake voices generated by text-to-speech (TTS) or voice conversion (VC). The deployment of CM would reduce the capability of the adversarial samples on deceiving ASV. Although additional perturbations against CM may be generated and put on the crafted adversarial examples against ASV to yield new adversarial examples against both ASV and CM, those additional perturbations would however hinder the examples’ adversarial effectiveness on ASV. In this paper, a novel joint approach is proposed to generate adversarial examples by considering attacking ASV and CM simultaneously. For any voice from TTS, VC or a real-world speaker, our crafted adversarial perturbations will turn its original labels on CM and speaker ID to bonafide and some target speaker ID, correspondingly. In our approach, a differentiable front-end is introduced to replace the conventional hand-crafted time–frequency feature extractor. Perturbations can thus be estimated by updating the gradients of the joint objective of ASV and CM on the waveform variables. The proposed method has demonstrated a 99.3% success rate on white-box logical access attacks to deceive ASV and CM simultaneously, which outperforms the baselines of 65.3% and 36.7%. Furthermore, transferability on black-box and physical settings has also been validated.

    Xingyu Zhang, Xiongwei Zhang, Wei Liu, Xia Zou, Meng Sun, and Jian Zhao

    EAAI  PDF, BibTeX
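
    A hedged PGD-style sketch of the joint attack idea described above: the raw waveform is perturbed by descending the gradient of a joint ASV + CM objective, which is possible because the front-end is differentiable. The model interfaces (`score`, `bonafide_score`) and the hyper-parameters are assumptions for illustration, not the paper's implementation.

    ```python
    import torch

    def joint_waveform_attack(wave, asv_model, cm_model, target_spk,
                              eps=0.002, alpha=5e-4, steps=50):
        """Perturb a waveform so the CM scores it as bonafide and the ASV scores
        it as the target speaker. Both models are assumed end-to-end
        differentiable; method names are hypothetical."""
        delta = torch.zeros_like(wave, requires_grad=True)
        for _ in range(steps):
            adv = wave + delta
            asv_loss = -asv_model.score(adv, target_spk)   # maximise target-speaker score
            cm_loss = -cm_model.bonafide_score(adv)        # maximise bonafide score
            loss = asv_loss + cm_loss                      # joint objective on the waveform
            loss.backward()
            with torch.no_grad():
                delta -= alpha * delta.grad.sign()         # gradient step on the joint loss
                delta.clamp_(-eps, eps)                    # keep the perturbation small
            delta.grad.zero_()
        return (wave + delta).detach()
    ```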

  • 2022
    3D-Guided Frontal Face Generation for Pose-Invariant Recognition
    Although deep learning techniques have achieved extraordinary accuracy in recognizing human faces, the pose variances of images captured in real-world scenarios still hinder reliable model application. To mitigate this gap, we propose to recognize faces via generating frontal face images with a 3D-Guided Deep Pose-Invariant Face Recognition Model (3D-PIM) consisting of a simulator and a refiner module. The simulator employs a 3D Morphable Model (3D MM) to fit the shape and appearance features and recover primary frontal images with less training data. The refiner further enhances the image realism on both global facial structure and local details with adversarial training, while keeping the discriminative identity information consistent with the original images. An Adaptive Weighting (AW) metric is then adopted to leverage the complementary information from recovered frontal faces and original profile faces and to obtain credible similarity scores for recognition. Extensive experiments verify the superiority of the proposed "recognition via generation" framework over state-of-the-arts.

    Hao Wu, Jianyang Gu, Xiaojin Fan, He Li, Lidong Xie, and Jian Zhao

    (Corresponding Author: Jian Zhao)

    T-IST  PDF, BibTeX

  • 2022
    Point-to-Box Network for Accurate Object Detection via Single Point Supervision

    Pengfei Chen, Xuehui Yu, Xumeng Han, Najmul Hassan, Kai Wang, Jiachen Li, Jian Zhao, Humphrey Shi, Zhenjun Han, and Qixiang Ye

    ECCV 2022  PDF, BibTeX, WeChat News

  • 2022
    Enforced Block Diagonal Subspace Clustering with Closed Form Solution
    Subspace clustering aims to fit each category of data points by learning an underlying subspace and then to conduct clustering according to the learned subspace. Ideally, the learned subspace is expected to be block diagonal such that the similarities between clusters are zeros. In this paper, we provide the explicit theoretical connection between spectral clustering and subspace clustering based on block diagonal representation. We propose Enforced Block Diagonal Subspace Clustering (EBDSC) and show that spectral clustering with the Radial Basis Function kernel can be regarded as EBDSC. Compared with existing subspace clustering methods, an analytical, nonnegative and symmetrical solution can be obtained by EBDSC. An important difference with respect to the existing ones is that our model is a more general case. EBDSC directly uses the obtained solution as the similarity matrix, which can avoid the complex computation of the optimization program. The solution obtained by the proposed method can then be used for the final clustering. Finally, we provide an experimental analysis to show the efficiency and effectiveness of our method on synthetic data and several benchmark data sets in terms of different metrics.

    Yalan Qin, Hanzhou Wu, Jian Zhao, and Guorui Feng

    (Corresponding Author: Jian Zhao)

    PR  PDF, BibTeX
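
    A short sketch of the closed-form pipeline suggested by the abstract above: the RBF kernel serves directly as an analytical, nonnegative and symmetric similarity matrix, which is then handed to spectral clustering. This illustrates the stated connection; it is not the authors' code.

    ```python
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.cluster import SpectralClustering

    def ebdsc_like_clustering(X: np.ndarray, n_clusters: int, gamma: float = 1.0):
        """Closed-form similarity (no iterative optimisation) followed by
        spectral clustering on the precomputed affinity matrix."""
        S = rbf_kernel(X, gamma=gamma)          # analytical, nonnegative, symmetric
        labels = SpectralClustering(
            n_clusters=n_clusters,
            affinity="precomputed").fit_predict(S)
        return labels
    ```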

  • 2022
    Semantic Compression Embedding for Generative Zero-Shot Learning
    Generative methods have been successfully applied in zero-shot learning (ZSL) by learning an implicit mapping to alleviate the visual-semantic domain gaps and synthesizing unseen samples to handle the data imbalance between seen and unseen classes. However, existing generative methods simply use visual features extracted from the pre-trained CNN backbone, which lack attribute-level semantic information. Thus, seen classes are indistinguishable and the knowledge transfer from seen to unseen classes is limited. To tackle this issue, we propose a novel Semantic Compression Embedding Guided Generation (SC-EGG) model, which cascades a semantic compression embedding network (SCEN) and an embedding guided generative network (EGGN). The SCEN extracts a group of attribute-level local features, which are further compressed into a new low-dimensional visual feature for each sample; thus a dense-semantic visual space is obtained. The EGGN learns a mapping from the class-level semantic space to the dense-semantic visual space, thus improving the discriminability of the synthesized dense-semantic unseen visual features. Extensive experiments on three benchmark datasets, i.e., CUB, SUN and AWA2, demonstrate the significant performance gains of SC-EGG over current state-of-the-art methods and its baselines.

    Ziming Hong, Shiming Chen, Guosen Xie, Wenhan Yang, Jian Zhao, Yuanjie Shao, Qinmu Peng, and Xinge You

    IJCAI 2022  PDF, BibTeX

  • 2022
    GrOD: Deep Learning with Gradients Orthogonal Decomposition for Knowledge Transfer, Distillation, and Adversarial Training
    Regularization that incorporates a linear combination of the empirical loss and explicit regularization terms as the loss function has been frequently used for many machine learning tasks. The explicit regularization term is designed in different forms, depending on the application. While regularized learning often boosts performance with higher accuracy and faster convergence, the regularization would sometimes hurt the empirical loss minimization and lead to poor performance. To deal with such issues, in this work we propose a novel strategy, namely Gradients Orthogonal Decomposition (GrOD), that improves the training procedure of regularized deep learning. Instead of linearly combining the gradients of the two terms, GrOD re-estimates, through orthogonal decomposition, a new direction for each iteration that does not hurt the empirical loss minimization while preserving the regularization effects. We have performed extensive experiments using GrOD to improve commonly used algorithms for transfer learning, knowledge distillation, and adversarial learning. The experimental results based on large datasets, including Caltech 256, MIT Indoor 67, CIFAR-10 and ImageNet, show significant improvement made by GrOD for all three algorithms in all cases.

    Haoyi Xiong, Ruosi Wan, Jian Zhao, Zeyu Chen, Xingjian Li, Zhanxing Zhu, and Jun Huan

    (Corresponding Author: Jian Zhao)

    T-KDD  PDF, BibTeX
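
    One possible reading of the GrOD idea, sketched below under stated assumptions: the regularizer gradient is orthogonally decomposed against the empirical-loss gradient, and the component that would hurt empirical loss minimization is dropped. The exact combination rule in the paper may differ.

    ```python
    import torch

    def grod_direction(g_emp: torch.Tensor, g_reg: torch.Tensor) -> torch.Tensor:
        """Illustrative orthogonal decomposition of the regularizer gradient
        against the empirical-loss gradient (not the authors' exact rule)."""
        ge = g_emp.flatten()
        gr = g_reg.flatten()
        denom = ge.dot(ge).clamp_min(1e-12)
        coef = gr.dot(ge) / denom          # projection coefficient of g_reg onto g_emp
        parallel = coef * ge
        orthogonal = gr - parallel
        if coef < 0:                       # regularizer opposes empirical-loss descent
            gr_new = orthogonal            # keep only the harmless, orthogonal part
        else:
            gr_new = gr
        return (ge + gr_new).view_as(g_emp)
    ```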

  • 2022
    MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning
    The key challenge of zero-shot learning (ZSL) is how to infer the latent semantic knowledge between visual and attribute features on seen classes, and thus achieve a desirable knowledge transfer to unseen classes. Prior works either simply align the global features of an image with its associated class semantic vector or utilize unidirectional attention to learn limited latent semantic representations, which cannot effectively discover the intrinsic semantic knowledge (e.g., attribute semantics) between visual and attribute features. To solve this dilemma, we propose a Mutually Semantic Distillation Network (MSDN), which progressively distills the intrinsic semantic representations between visual and attribute features for ZSL. MSDN incorporates an attribute→visual attention sub-net that learns attribute-based visual features, and a visual→attribute attention sub-net that learns visual-based attribute features. By further introducing a semantic distillation loss, the two mutual attention sub-nets are capable of learning collaboratively and teaching each other throughout the training process. The proposed MSDN yields significant improvements over strong baselines, leading to new state-of-the-art performance on three popular challenging benchmarks, i.e., CUB, SUN, and AWA2. Our code is available at: https://github.com/shimingchen/MSDN.

    Shiming Chen, Ziming Hong, Guosen Xie, Wenhan Yang, Qinmu Peng, Kai Wang, Jian Zhao, and Xinge You

    CVPR 2022  PDF, BibTeX, WeChat News
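
    A hedged sketch of a mutual (semantic) distillation term in the spirit of MSDN: the two attention sub-nets teach each other by pulling their score distributions together. The symmetric-KL form and the temperature are illustrative choices, not taken from the paper.

    ```python
    import torch
    import torch.nn.functional as F

    def mutual_distillation_loss(scores_a2v: torch.Tensor,
                                 scores_v2a: torch.Tensor,
                                 tau: float = 1.0) -> torch.Tensor:
        """Symmetric distillation between the attribute->visual and
        visual->attribute branches (a sketch, assuming per-class scores)."""
        p = F.log_softmax(scores_a2v / tau, dim=-1)
        q = F.log_softmax(scores_v2a / tau, dim=-1)
        kl_pq = F.kl_div(p, q.exp(), reduction="batchmean")  # pull branch A towards B
        kl_qp = F.kl_div(q, p.exp(), reduction="batchmean")  # and B towards A
        return 0.5 * (kl_pq + kl_qp)
    ```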

  • 2022
    Single-Stage is Enough: Multi-Person Absolute 3D Pose Estimation
    The existing multi-person absolute 3D pose estimation methods are mainly based on a two-stage paradigm, i.e., top-down or bottom-up, leading to redundant pipelines with high computation cost. We argue that it is more desirable to simplify such a two-stage paradigm into a single-stage one to promote both efficiency and performance. To this end, we present an efficient single-stage solution, the Decoupled Regression Model (DRM), with three distinct novelties. First, DRM introduces a new decoupled representation for 3D pose, which expresses the 2D pose in the image plane and the depth information of each 3D human instance via a 2D center point (center of visible keypoints) and a root point (denoted as pelvis), respectively. Second, to learn a better feature representation for human depth regression, DRM introduces a 2D Pose-guided Depth Query Module (PDQM) to extract features from the 2D pose regression branch, enabling the depth regression branch to perceive the scale information of instances. Third, DRM leverages a Decoupled Absolute Pose Loss (DAPL) to facilitate the absolute root depth and root-relative depth estimation, thus improving the accuracy of the absolute 3D pose. Comprehensive experiments on challenging benchmarks including MuPoTS-3D and Panoptic clearly verify the superiority of our framework, which outperforms the state-of-the-art bottom-up absolute 3D pose estimation methods.

    Lei Jin, Chenyang Xu, Xiaojuan Wang, Yabo Xiao, Yandong Guo, Xuecheng Nie, and Jian Zhao

    (Corresponding Author: Jian Zhao)

    CVPR 2022  PDF, Poster, BibTeX, CSIG News

  • 2022
    Toward High-Quality Face-Mask Occluded Restoration
    Face-mask occluded restoration aims to restore the masked region of a human face, which has attracted increasing attention in the context of the COVID-19 pandemic. One major challenge of this task is the large visual variance of masks in the real world. To solve it, we first construct a large-scale Face-mask Occluded Restoration (FMOR) dataset, which contains 5,500 unmasked images and 5,500 face-mask occluded images with various illuminations, and involves 1,100 subjects of different races, face orientations and mask types. Moreover, we propose a Face-Mask Occluded Detection and Restoration (FMODR) framework, which can detect face-mask regions with large visual variations and restore them to realistic human faces. In particular, our FMODR contains a self-adaptive contextual attention module specifically designed for this task, which is able to exploit the contextual information and correlations of adjacent pixels for achieving high realism of the restored faces; such correlations are often neglected in existing contextual attention models. Our framework achieves state-of-the-art results of face restoration on three datasets, including CelebA, AR and our FMOR datasets. Moreover, experimental results on the AR and FMOR datasets demonstrate that our framework can significantly improve masked face recognition and verification performance.

    Feihong Lu, Hang Chen, Kang Li, Qiliang Deng, Jian Zhao, Kaipeng Zhang, and Hong Han

    T-OMCCAP  PDF, BibTeX

  • 2022
    Grouping by Center: Predicting Centripetal Offsets for the Bottom-up Human Pose Estimation
    We introduce Grouping by Center, a novel grouping approach for bottom-up human pose estimation, which detects human joints first and then groups them. The grouping strategy is the critical factor for bottom-up pose estimation. To increase conciseness and accuracy, we propose to use the center of the body as a grouping clue. More concretely, we predict the offsets from the keypoints to the body centers. Keypoints whose shifted results align will be grouped as one person. However, the multi-scale variance of people can affect the prediction of the grouping clue, which has been neglected in previous research. To resolve the scale variance of the offsets, we put forward a Multiscale Translation Layer and an iterative refinement. Furthermore, we design a greedy grouping strategy with a dynamic threshold to account for the various scales of instances. Through a comprehensive comparison, our framework is validated to be effective and practical. We also report state-of-the-art performance for bottom-up multi-person pose estimation on the MS-COCO and CrowdPose datasets.

    Lei Jin, Xiaojuan Wang, Xuecheng Nie, Luoqi Liu, Yandong Guo, and Jian Zhao

    (Corresponding Author: Jian Zhao)

    T-MM  PDF, BibTeX, CSIG News
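
    A minimal sketch of the center-based grouping described above: each detected joint is shifted by its predicted joint-to-center offset and assigned to the nearest detected body center. The fixed distance threshold stands in for the paper's dynamic threshold; shapes and names are illustrative.

    ```python
    import numpy as np

    def group_by_center(keypoints: np.ndarray, offsets: np.ndarray,
                        centers: np.ndarray, max_dist: float = 32.0) -> np.ndarray:
        """keypoints: (K, 2) joint locations, offsets: (K, 2) predicted
        joint-to-center offsets, centers: (P, 2) detected person centers.
        Returns a (K,) array of person indices, or -1 when no center is close."""
        voted = keypoints + offsets                         # where each joint "votes" its center is
        d = np.linalg.norm(voted[:, None, :] - centers[None, :, :], axis=-1)  # (K, P)
        assign = d.argmin(axis=1)
        assign[d.min(axis=1) > max_dist] = -1               # reject joints far from every center
        return assign
    ```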

  • 2022
    Diverse Complementary Part Mining for Weakly Supervised Object Localization
    Weakly Supervised Object Localization (WSOL) aims to localize objects with only image-level labels, which has better scalability and practicability than fully supervised methods in the actual deployment. However, a common limitation for available techniques based on classification networks is that they only highlight the most discriminative part of the object, not the entire object. To alleviate this problem, we propose a novel end-to-end part discovery model (PDM) to learn multiple discriminative object parts in a unified network for accurate object localization and classification. The proposed PDM enjoys several merits. First, to the best of our knowledge, it is the first work to directly model diverse and robust object parts by exploiting part diversity, compactness, and importance jointly for WSOL. Second, three effective mechanisms including diversity, compactness, and importance learning mechanisms are designed to learn robust object parts. Therefore, our model can exploit complementary spatial information and local details from the learned object parts, which help to produce precise bounding boxes and discriminate different objects. Extensive experiments on two standard benchmarks demonstrate that our PDM performs favorably against state-of-the-art WSOL approaches.

    Meng Meng, Tianzhu Zhang, Wenfei Yang, Jian Zhao, Yongdong Zhang, and Feng Wu

    T-IP  PDF, BibTeX

  • 2021
    The 2nd Anti-UAV Workshop & Challenge: Methods and Results
    The 2nd Anti-UAV Workshop & Challenge aims to encourage research in developing novel and accurate methods for multi-scale object tracking. The Anti-UAV dataset was used for the Anti-UAV Challenge and is publicly released. There are two subsets in the dataset, i.e., the test-dev subset and the test-challenge subset. Both subsets consist of 140 thermal infrared video sequences, spanning multiple occurrences of multi-scale UAVs. Around 24 participating teams from around the globe competed in the 2nd Anti-UAV Challenge. In this paper, we provide a brief summary of the 2nd Anti-UAV Workshop & Challenge, including brief introductions to the top three methods. The submission leaderboard will be reopened for researchers that are interested in the Anti-UAV challenge. The benchmark dataset and other information can be found at: https://anti-uav.github.io/.

    Jian Zhao, Gang Wang, Jianan Li, Lei Jin, Nana Fan, Min Wang, Xiaojuan Wang, Ting Yong, Yafeng Deng, Yandong Guo, and Shiming Ge

    ArXiv  PDF, BibTeX, Qihoo 360 Summary News, CJIG Summary News, BSIG Summary News

  • 2021
    Dense Attentive Feature Enhancement for Salient Object Detection
    Attention mechanisms have been proven highly effective for salient object detection. Most previous works utilize attention as a self-gated module to reweigh the feature maps at different levels independently. However, they are limited to certain-level guidance and cannot satisfy the need of both accurately detecting intact objects and maintaining their detailed boundaries. In this paper, we build dense attention upon features from multiple levels simultaneously and propose a novel Dense Attentive Feature Enhancement (DAFE) module for efficient feature enhancement in saliency detection. DAFE stacks several attentional units and densely connects the attentive feature output of the current unit to all its subsequent units. This allows feature maps at deep units to absorb attentive information from shallow units, so that more discriminative information can be efficiently selected at the final output. Note that DAFE is plug-and-play and can be effortlessly inserted into any saliency or video saliency model for performance improvements. We further instantiate a highly effective Dense Attentive Feature Enhancement Network (DAFE-Net) for accurate salient object detection. DAFE-Net constructs DAFE over the aggregation feature that contains both semantics and saliency details, so that the entire salient objects and their boundaries can be well retained through dense attention. Extensive experiments demonstrate that the proposed DAFE module is highly effective, and the DAFE-Net performs favorably compared with state-of-the-art approaches.

    Zun Li, Congyan Lang, Liqian Liang, Jian Zhao, Songhe Feng, Qibin Hou, and Jiashi Feng

    T-CSVT  PDF, BibTeX
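
    A compact sketch of a densely connected stack of attention units in the spirit of DAFE, assuming a simple 1x1-conv sigmoid gate per unit (the actual unit design in the paper may differ): each unit sees the input feature plus all previous attentive outputs.

    ```python
    import torch
    import torch.nn as nn

    class DenseAttentiveEnhancement(nn.Module):
        """Stack of attention units with dense connections: unit i is fed the
        concatenation of the input feature and all earlier attentive outputs."""

        def __init__(self, channels: int, num_units: int = 3):
            super().__init__()
            self.units = nn.ModuleList()
            for i in range(num_units):
                in_ch = channels * (i + 1)          # input + i previous attentive outputs
                self.units.append(nn.Sequential(
                    nn.Conv2d(in_ch, channels, kernel_size=1),
                    nn.Sigmoid()))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            feats = [x]
            for unit in self.units:
                gate = unit(torch.cat(feats, dim=1))  # dense connection to all earlier outputs
                feats.append(x * gate)                # attentive reweighting of the input feature
            return feats[-1]
    ```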

  • 2021
    Multi-caption Text-to-Face Synthesis: Dataset and Algorithm
    Text-to-Face synthesis with multiple captions is still an important yet less addressed problem because of the lack of effective algorithms and large-scale datasets. We accordingly propose a Semantic Embedding and Attention (SEA-T2F) network that allows multiple captions as input to generate highly semantically related face images. With a novel Sentence Features Injection Module, SEA-T2F can integrate any number of captions into the network. In addition, an attention mechanism named Attention for Multiple Captions is proposed to fuse multiple word features and synthesize fine-grained details. Considering text-to-face generation is an ill-posed problem, we also introduce an attribute loss to guide the network to generate sentence-related attributes. Existing datasets for text-to-face are either too small or roughly generated according to attribute labels, which is not enough to train deep learning based methods to synthesize natural face images. Therefore, we build a large-scale dataset named CelebAText-HQ, in which each image is manually annotated with 10 captions. Extensive experiments demonstrate the effectiveness of our algorithm.

    Jianxin Sun, Qi Li, Weining Wang, Jian Zhao, and Zhenan Sun

    ACM MM 2021  PDF, BibTeX, CSIG News, BSIG News, WeChat News

  • 2021
    Seeing Crucial Parts: Vehicle Model Verification via A Discriminative Representation Model
    Widely used surveillance cameras have produced large amounts of street scene data, which contain one important but long-neglected object: the vehicle. Here we focus on the challenging problem of vehicle model verification. Most previous works employ global features (e.g., fully-connected features) to perform vehicle-level deep metric learning (e.g., with a triplet-based network). However, we argue that it is worthwhile to investigate the distinctiveness of local features and consider vehicle-part-level metric learning that reduces the intra-class variance as much as possible. In this paper, we introduce a simple yet powerful deep model, i.e., the enforced intra-class alignment network (EIA-Net), which learns a more discriminative image representation by localizing key vehicle parts and jointly incorporating two distance metrics: a vehicle-level embedding and a vehicle-part-sensitive embedding. For learning features, we propose an effective feature extraction module composed of two components: a Regional Proposal Network (RPN)-based network and a Part-based CNN. The RPN-based network is used to define key vehicle regions and aggregate local features on these regions, while the Part-based CNN offers supplementary global features for the RPN-based network. The fused features learned by the feature extraction module are cast into the deep metric learning module. In particular, we derive an enforced intra-class alignment loss (EIAL) by re-utilizing key vehicle part information to further reduce intra-class variance. Furthermore, we modify the coupled cluster loss (CCL) to model the vehicle-level embedding by enlarging the inter-class variance while reducing the intra-class variance. Extensive experiments over the benchmark datasets VehicleID and CompCars have shown that the proposed EIA-Net significantly outperforms state-of-the-art approaches for vehicle model verification. Furthermore, we also conduct comprehensive experiments on vehicle Re-ID datasets, i.e., VehicleID and VeRi776, to validate the generalization ability of our proposed method.

    Liqian Liang, Congyan Lang, Zun Li, Jian Zhao, Tao Wang, and Songhe Feng

    T-OMCCAP  PDF, BibTeX

  • 2021
    Face.evoLVe: A High-Performance Face Recognition Library
    While face recognition has drawn much attention, a large number of algorithms and models have been proposed with applications to daily life, such as authentication for mobile payments, etc. Recently, deep learning methods have dominated the field of face recognition, with advantages over conventional approaches and even human perception. Despite the popular adoption of deep learning-based methods in the field, researchers and engineers frequently need to reproduce existing algorithms with unified implementations (i.e., the identical deep learning framework with standard implementations of operators and trainers) and compare the performance of face recognition methods under fair settings (i.e., the same set of evaluation metrics and preparation of datasets with tricks on/off), so as to ensure the reproducibility of experiments. To this end, we develop face.evoLVe, a comprehensive library that collects and implements a wide range of popular deep learning-based methods for face recognition. First of all, face.evoLVe is composed of key components that cover the full process of face analytics, including face alignment, data processing, various backbones, losses, and alternatives with bags of tricks for improving performance. In addition, face.evoLVe supports multi-GPU training on top of different deep learning platforms, such as PyTorch and PaddlePaddle, which allows researchers to work on both large-scale datasets with millions of images and low-shot counterparts with limited well-annotated data. More importantly, along with face.evoLVe, images before & after alignment in the common benchmark datasets are released, with source code and trained models provided. All these efforts lower the technical burden of reproducing existing methods for comparison, so that users of our library can focus on developing advanced approaches more efficiently. Last but not least, face.evoLVe is well designed and vibrantly evolving, so that new face recognition approaches can be easily plugged into our framework. Note that we have used face.evoLVe to participate in a number of face recognition competitions and secured the first place. The version that supports PyTorch is publicly available at https://github.com/ZhaoJ9014/face.evoLVe.PyTorch and the PaddlePaddle version is available at https://github.com/ZhaoJ9014/face.evoLVe.PyTorch/tree/master/paddle. Face.evoLVe has been widely used for face analytics, receiving 2.4K stars and 622 forks.

    Qingzhong Wang, Pengfei Zhang, Haoyi Xiong, and Jian Zhao

    (Corresponding Author: Jian Zhao)

    Neurocomputing  PDF, BibTeX
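
    A conceptual sketch, in plain PyTorch rather than the face.evoLVe API, of the backbone-plus-metric-head structure the library is described to provide: an aligned face is mapped to a normalized embedding and scored against normalized class weights, on top of which a margin-based loss such as ArcFace would be applied. All names here are illustrative.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EmbeddingPipeline(nn.Module):
        """Backbone -> embedding -> cosine logits; not the face.evoLVe API."""

        def __init__(self, backbone: nn.Module, emb_dim: int, num_ids: int):
            super().__init__()
            self.backbone = backbone                       # any CNN mapping images to (N, emb_dim)
            self.weight = nn.Parameter(torch.randn(num_ids, emb_dim))

        def forward(self, aligned_faces: torch.Tensor) -> torch.Tensor:
            emb = F.normalize(self.backbone(aligned_faces), dim=1)
            logits = F.linear(emb, F.normalize(self.weight, dim=1))  # cosine similarities
            return logits  # a margin (e.g. ArcFace-style) would be applied before softmax
    ```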

  • 2021
    Group Sampling for Unsupervised Person Re-identification
    Unsupervised person re-identification (re-ID) remains a challenging task, where the classifier and feature representation could be easily misled by the noisy pseudo labels towards deteriorated over-fitting. In this paper, we propose a simple yet effective approach, termed Group Sampling, to alleviate the negative impact of noisy pseudo labels within unsupervised person re-ID models. The idea behind Group Sampling is that it can gather a group of samples from the same class in the same mini-batch, such that the model is trained upon group-normalized samples while alleviating the effect of a single sample. Group Sampling updates the pipeline of pseudo label generation by guaranteeing the samples to be better divided into the correct classes. Group Sampling regularizes classifier training and representation learning, leading to the statistical stability of feature representation in a progressive fashion. Qualitative and quantitative experiments on Market-1501, DukeMTMC-reID, and MSMT17 show that Group Sampling improves the state-of-the-art by up to 2.2%∼6.1%. Code is available at https://github.com/wavinflaghxm/GroupSampling.

    Xumeng Han, Xuehui Yu, Nan Jiang, Guorong Li, Jian Zhao, Qixiang Ye, and Zhenjun Han

    Under Review  PDF, BibTeX
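
    A minimal sketch of the group-sampling idea described above: indices sharing the same pseudo label are gathered into fixed-size groups, and mini-batches are built from whole groups rather than isolated samples. Details such as the handling of leftovers and un-clustered outliers are assumptions.

    ```python
    import random
    from collections import defaultdict
    from typing import Dict, List, Sequence

    def group_sampling(pseudo_labels: Sequence[int], group_size: int = 4,
                       seed: int = 0) -> List[List[int]]:
        """Gather sample indices of the same pseudo class into groups of
        `group_size`; concatenate consecutive groups to form mini-batches."""
        rng = random.Random(seed)
        by_class: Dict[int, List[int]] = defaultdict(list)
        for idx, lbl in enumerate(pseudo_labels):
            if lbl >= 0:                       # convention here: -1 marks un-clustered outliers
                by_class[lbl].append(idx)
        groups = []
        for indices in by_class.values():
            rng.shuffle(indices)
            for i in range(0, len(indices), group_size):
                chunk = indices[i:i + group_size]
                if len(chunk) == group_size:   # drop incomplete groups for simplicity
                    groups.append(chunk)
        rng.shuffle(groups)
        return groups
    ```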

  • 2021
    Image-to-Video Generation via 3D Facial Dynamics
    We present a versatile model, FaceAnime, for various video generation tasks from still images. Video generation from a single face image is an interesting problem and usually tackled by utilizing Generative Adversarial Networks (GANs) to integrate information from the input face image and a sequence of sparse facial landmarks. However, the generated face images usually suffer from quality loss, image distortion, identity change, and expression mismatching due to the weak representation capacity of the facial landmarks. In this paper, we propose to “imagine” a face video from a single face image according to the reconstructed 3D face dynamics, aiming to generate a realistic and identity-preserving face video, with precisely predicted pose and facial expression. The 3D dynamics reveal changes of the facial expression and motion, and can serve as a strong prior knowledge for guiding highly realistic face video generation. In particular, we explore face video prediction and exploit a well-designed 3D dynamic prediction network to predict a 3D dynamic sequence for a single face image. The 3D dynamics are then further rendered by the sparse texture mapping algorithm to recover structural details and sparse textures for generating face frames. Our model is versatile for various AR/VR and entertainment applications, such as face video retargeting and face video prediction. Superior experimental results have well demonstrated its effectiveness in generating high-fidelity, identity-preserving, and visually pleasant face video clips from a single source face image.

    Xiaoguang Tu, Yingtian Zou, Jian Zhao, Wenjie Ai, Jian Dong, Yuan Yao, Zhikang Wang, Guodong Guo, Zhifeng Li, Wei Liu, and Jiashi Feng

    T-CSVT  PDF, BibTeX, BSIG News, WeChat News1, WeChat News2

  • 2021
    Joint Face Image Restoration and Frontalization for Recognition
    In real-world scenarios, many factors may harm face recognition performance, e.g., large pose, bad illumination, low resolution, blur and noise. To address these challenges, previous efforts usually first restore the low-quality faces to high-quality ones and then perform face recognition. However, most of these methods are stage-wise, which is sub-optimal and deviates from reality. In this paper, we address all these challenges jointly for unconstrained face recognition. We propose a Multi-Degradation Face Restoration (MDFR) model to restore frontalized high-quality faces from the given low-quality ones under arbitrary facial poses, with three distinct novelties. First, MDFR is a well-designed encoder-decoder architecture which extracts feature representation from an input face image with arbitrary low-quality factors and restores it to a high-quality counterpart. Second, MDFR introduces a pose residual learning strategy along with a 3D-based Pose Normalization Module (PNM), which can perceive the pose gap between the input initial pose and its real-frontal pose to guide the face frontalization. Finally, MDFR can generate frontalized high-quality face images by a single unified network, showing a strong capability of preserving face identity. Qualitative and quantitative experiments on both controlled and in-the-wild benchmarks demonstrate the superiority of MDFR over state-of-the-art methods on both face frontalization and face restoration.

    Xiaoguang Tu, Jian Zhao, Qiankun Liu, Wenjie Ai, Guodong Guo, Zhifeng Li, Wei Liu, and Jiashi Feng

    (Corresponding Author: Jian Zhao)

    T-CSVT  PDF, BibTeX, CSIG News, BSIG News

  • 2021
    Robust Video-based Person Re-Identification by Hierarchical Mining
    Video-based person re-identification (Re-ID) aims at retrieving the person through the video sequences across non-overlapping cameras. Some characteristics of pedestrians are not consecutive across frames due to the variations of viewpoints, postures, and occlusions over time. However, existing methods ignore such data peculiarity and the networks tend to only learn those salient consecutive characteristics among frames in video sequences. As a result, the learned representations fail to cover all the characteristics of pedestrians, thus lacking integrity and discrimination. To tackle this problem, we present a novel deep architecture termed Hierarchical Mining Network (HMN), which mines as many of the pedestrians' characteristics as possible by referring to the temporal and intra-class knowledge. It consists of a novel Attentive Temporal Module (ATM) and a Dynamic Supervising Branch (DSB), with a Balancing Triplet Loss (BTL) assisting the training. The proposed ATM, with pedestrian perceiving capacity, is capable of evaluating each activation of features through temporal analysis, so that the temporally scattered characteristics of pedestrians can be better aggregated and the contaminated ones can be eliminated. Then, the DSB along with the BTL further enhances the integrity of representations by multiple supervision. Specifically, the DSB perceives the diversities of intra-class samples in each mini-batch and generates targeted supervising signals for them, in which process the BTL guarantees the signals with smaller intra-class variations and larger inter-class variations. Comprehensive experiments on two video-based datasets, i.e., MARS and DukeMTMC-VideoReID, demonstrate the contribution of each component and the superiority of the proposed HMN over the state-of-the-arts. Benchmarking our model on three popular image-based datasets, i.e., Market1501, DukeMTMC-reID, and MSMT17, additionally verifies the promising generalizability of the proposed DSB and BTL.

    Zhikang Wang, Lihuo He, Xiaoguang Tu, Jian Zhao, Xinbo Gao, Shengmei Shen, and Jiashi Feng

    T-CSVT  PDF, BibTeX

  • 2021
    Images Inpainting via Structure Guidance
    Aiming at the problem of obvious visual artifacts in the content generated by a coarse network with little prior knowledge, a two-stage image inpainting method based on an edge structure generator is proposed. The edge structure generator performs feature learning on the edges and color-smoothed information of the input image, and generates the missing structural content so as to guide the refinement network to reconstruct high-quality semantic images. The proposed method has been tested on public benchmark datasets such as Paris Street-View. The experimental results show that the proposed approach can complete images with a mask rate of 50%. In terms of the quantitative evaluation indicators (PSNR, SSIM, L1 and L2 errors), it surpasses current high-performing image inpainting algorithms such as EC, GC and SF. In particular, when the mask rate is 0%-20%, the PSNR reaches 33.40 dB, an increase of 2.37-6.57 dB over other methods, and the SSIM is increased by 0.006-0.138. Meanwhile, the completed images show clearer texture and higher visual quality.

    Kai Hu, Jian Zhao, Yu Liu, Yukai Niu, and Gang Ji

    (Corresponding Author: Jian Zhao)

    Journal of BUAA  PDF, BibTeX

  • 2021
    Anti-UAV: A Large Multi-Modal Benchmark for UAV Tracking
    Unmanned Aerial Vehicles (UAVs) offer many applications in both commerce and recreation. Therefore, perceiving the status of UAVs is crucially important. In this paper, we consider the task of tracking UAVs, providing rich information such as location and trajectory. To facilitate research on this topic, we introduce a new benchmark, referred to as Anti-UAV, opening up a promising direction for long-distance UAV tracking, with more than 300 video pairs containing over 580k manually annotated bounding boxes. Furthermore, the advancement of addressing research challenges in Anti-UAV can help the design of anti-UAV systems, leading to better surveillance of UAVs. Accordingly, a simple yet effective approach named dual-flow semantic consistency (DFSC) is proposed for UAV tracking. Modulated by the semantic flow across video sequences, the tracker learns more robust class-level semantic information and obtains more discriminative instance-level features. Experiments show the significant performance gain of our proposed approach over state-of-the-art trackers, and the challenging aspects of Anti-UAV. The Anti-UAV benchmark and the code of the proposed approach will be publicly available at https://github.com/ucas-vg/Anti-UAV.

    Nan Jiang, Kuiran Wang, Xiaoke Peng, Xuehui Yu, Qiang Wang, Junliang Xing, Guorong Li, Jian Zhao, and Zhenjun Han

    (Corresponding Author: Jian Zhao)

    T-MM  PDF, BibTeX, CSIG News, BSIG News, WeChat News

  • 2021
    Effective Fusion Factor in FPN for Tiny Object Detection
    FPN-based detectors have made significant progress in general object detection, e.g., MS COCO and PASCAL VOC. However, these detectors fail in certain application scenarios, e.g., tiny object detection. In this paper, we argue that the top-down connections between adjacent layers in FPN bring two-sided influences to tiny object detection, not only positive ones. We propose a novel concept, the fusion factor, to control the information that deep layers deliver to shallow layers, for adapting FPN to tiny object detection. After a series of experiments and analyses, we explore how to estimate an effective value of the fusion factor for a particular dataset by a statistical method. The estimation depends on the number of objects distributed in each layer. Comprehensive experiments are conducted on tiny object detection datasets, e.g., TinyPerson and Tiny CityPersons. Our results show that when configuring FPN with a proper fusion factor, the network achieves significant performance gains over the baseline on tiny object detection datasets. Code and models will be released.

    Yuqi Gong, Xuehui Yu, Yao Ding, Xiaoke Peng, Jian Zhao, and Zhenjun Han

    WACV 2021  PDF, BibTeX
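
    A short sketch of an FPN top-down pass with the fusion factor described above: a scalar `alpha` attenuates how much information each deeper level passes down to the next shallower one (alpha = 1 recovers the standard FPN sum). Layer naming and the nearest-neighbour upsampling are assumptions.

    ```python
    import torch
    import torch.nn.functional as F
    from typing import List

    def top_down_fusion(laterals: List[torch.Tensor], alpha: float = 0.5) -> List[torch.Tensor]:
        """`laterals` are the 1x1-conv lateral features ordered from shallow to
        deep; returns fused pyramid features in the same order."""
        outs = [laterals[-1]]                                  # deepest level is kept unchanged
        for feat in reversed(laterals[:-1]):
            upsampled = F.interpolate(outs[0], size=feat.shape[-2:], mode="nearest")
            outs.insert(0, feat + alpha * upsampled)           # attenuated top-down signal
        return outs
    ```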

  • 2020
    The 1st Tiny Object Detection Challenge: Methods and Results
    The 1st Tiny Object Detection (TOD) Challenge aims to encourage research in developing novel and accurate methods for tiny object detection in images which have wide views, with a current focus on tiny person detection. The TinyPerson dataset was used for the TOD Challenge and is publicly released. It has 1610 images and 72651 box-level annotations. Around 36 participating teams from around the globe competed in the 1st TOD Challenge. In this paper, we provide a brief summary of the 1st TOD Challenge, including brief introductions to the top three methods. The submission leaderboard will be reopened for researchers that are interested in the TOD challenge. The benchmark dataset and other information can be found at: https://github.com/ucas-vg/TinyBenchmark.

    Xuehui Yu, Zhenjun Han, Yuqi Gong, Nan Jiang, Jian Zhao, Qixiang Ye, Jie Chen, Yuan Feng, Bin Zhang, Xiaodi Wang, Ying Xin, Jingwei Liu, Mingyuan Mao, Sheng Xu, Baochang Zhang, Shumin Han, Cheng Gao, Wei Tang, Lizuo Jin, Mingbo Hong, Yuchao Yang, Shuiwang Li, Huan Luo, Qijun Zhao, and Humphrey Shi

    ArXiv  PDF, BibTeX

  • 2020
    Multi-Human Parsing With a Graph-based Generative Adversarial Model
    Human parsing is an important task in human-centric image understanding in computer vision and multimedia systems. However, most existing works on human parsing mainly tackle the single-person scenario, which deviates from real-world applications where multiple persons are present simultaneously with interaction and occlusion. To address such a challenging multi-human parsing problem, we introduce a novel multi-human parsing model named MH-Parser, which uses a graph-based generative adversarial model to address the challenges of close person interaction and occlusion in multi-human parsing. To validate the effectiveness of the new model, we collect a new dataset named Multi-Human Parsing (MHP), which contains multiple persons with intensive person interaction and entanglement. Experiments on the new MHP dataset and existing datasets demonstrate that the proposed method is effective in addressing the multi-human parsing problem compared with existing solutions in the literature.

    Jianshu Li, Jian Zhao, Congyan Lang, Yidong Li, Yunchao Wei, Guodong Guo, Terence Sim, Shuicheng Yan, and Jiashi Feng

    (The first two authors are with equal contributions.)

    T-OMCCAP  PDF, BibTeX

  • 2020
    Towards Age-Invariant Face Recognition
    Despite the remarkable progress in face recognition related technologies, reliably recognizing faces across ages remains a big challenge. The appearance of a human face changes substantially over time, resulting in significant intra-class variations. As opposed to current techniques for age-invariant face recognition, which either directly extract age-invariant features for recognition, or first synthesize a face that matches target age before feature extraction, we argue that it is more desirable to perform both tasks jointly so that they can leverage each other. To this end, we propose a deep Age-Invariant Model (AIM) for face recognition in the wild with three distinct novelties. First, AIM presents a novel unified deep architecture jointly performing cross-age face synthesis and recognition in a mutual boosting way. Second, AIM achieves continuous face rejuvenation/aging with remarkable photorealistic and identity-preserving properties, avoiding the requirement of paired data and the true age of testing samples. Third, effective and novel training strategies are developed for end-to-end learning of the whole deep architecture, which generates powerful age-invariant face representations explicitly disentangled from the age variation. Moreover, we construct a new large-scale Cross-Age Face Recognition (CAFR) benchmark dataset to facilitate existing efforts and push the frontiers of age-invariant face recognition research. Extensive experiments on both our CAFR dataset and several other cross-age datasets (MORPH, CACD, and FG-NET) demonstrate the superiority of the proposed AIM model over the state-of-the-arts. Benchmarking our model on the popular unconstrained face recognition datasets YTF and IJB-C additionally verifies its promising generalization ability in recognizing faces in the wild.

    Jian Zhao, Shuicheng Yan, and Jiashi Feng

    (ESI Highly Cited Paper)

    T-PAMI  PDF, BibTeX

  • 2020
    Unsupervised Domain Adaptation with Noise Resistible Mutual-Training for Person Re-identification
    Unsupervised domain adaptation (UDA) in the task of person re-identification (re-ID) is highly challenging due to large domain divergence and no class overlap between domains. Pseudo-label based self-training is one of the representative techniques to address UDA. However, label noise caused by unsupervised clustering is always a problem for self-training methods. To suppress noise in pseudo-labels, this paper proposes a Noise Resistible Mutual-Training (NRMT) method, which maintains two networks during training to perform collaborative clustering and mutual instance selection. On one hand, collaborative clustering eases the fitting to noisy instances by allowing the two networks to use pseudo-labels provided by each other as additional supervision. On the other hand, mutual instance selection further selects reliable and informative instances for training according to the peer-confidence and relationship disagreement of the networks. Extensive experiments demonstrate that the proposed method outperforms the state-of-the-art UDA methods for person re-ID.

    Fang Zhao, Shengcai Liao, Guosen Xie, Jian Zhao, Kaihao Zhang, and Ling Shao

    ECCV 2020  PDF, BibTeX

  • 2020
    Fine-Grained Facial Expression Recognition in the Wild
    Over the past decades, research on facial expression recognition has been restricted to six basic expressions (anger, fear, disgust, happiness, sadness and surprise). However, these six words cannot fully describe the richness and diversity of human emotions. To enhance the recognition capabilities of computers, in this paper, we focus on fine-grained facial expression recognition in the wild and build a brand new benchmark, FG-Emotions, to push the research frontiers on this topic, which extends the original six classes to thirty-three more elaborate classes. Our FG-Emotions contains 10,371 images and 1,491 video clips annotated with corresponding fine-grained facial expression categories and landmarks. FG-Emotions also provides several features (e.g., LBP features and dense trajectory features) to facilitate related research. Moreover, on top of FG-Emotions, we propose a new end-to-end Multi-Scale Action Unit (AU)-based Network (MSAU-Net) for image-based facial expression recognition, which learns a more powerful facial representation by directly focusing on locating facial action units and utilizing a "zoom in" operation to aggregate distinctive local features. For video-based recognition, we further extend the MSAU-Net to a two-stream model (TMSAU-Net) by adding a module with an attention mechanism and a temporal stream branch to jointly learn spatial and temporal features. (T)MSAU-Net consistently outperforms existing state-of-the-art solutions on our FG-Emotions and several other datasets, and serves as a strong baseline to drive future research towards fine-grained facial expression recognition in the wild.

    Liqian Liang, Congyan Lang, Yidong Li, Songhe Feng, and Jian Zhao

    T-IFS  PDF, BibTeX

  • 2020
    Learning Generalizable and Identity-Discriminative Representations for Face Anti-Spoofing
    Face anti-spoofing aims to detect presentation attacks on face recognition based authentication systems. It has drawn growing attention due to the high security demand. The widely adopted CNN-based methods usually recognize spoofing faces well when the training and testing spoofing samples display similar patterns, but their performance drops drastically on spoofing faces with novel patterns or from unseen scenes, leading to poor generalization performance. Furthermore, almost all current methods treat face anti-spoofing as a step prior to face recognition, which prolongs the response time and makes face authentication inefficient. In this paper, we try to boost the generalizability and applicability of face anti-spoofing methods by designing a new Generalizable Face Authentication CNN (GFA-CNN) model with three novelties. First, GFA-CNN introduces a simple yet effective Total Pairwise Confusion (TPC) loss for CNN training, which properly balances the contributions of all spoofing patterns for recognizing spoofing faces. Second, it incorporates a Fast Domain Adaptation (FDA) component to alleviate the negative effect brought by domain variation. Third, it deploys Filter Diversification Learning (FDL) to make the learned representations more adaptable to new scenes. Besides, the proposed GFA-CNN works in a multi-task manner: it performs face anti-spoofing and face recognition simultaneously. Experimental results on five popular face anti-spoofing and face recognition benchmarks show that GFA-CNN significantly outperforms previous face anti-spoofing methods on the cross-test protocol and also well preserves the identity information of input face images.

    Xiaoguang Tu, Zheng Ma, Jian Zhao, Guodong Du, Mei Xie, and Jiashi Feng

    T-IST  PDF, BibTeX
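
    A minimal sketch of the multi-task setup mentioned in the abstract above: a shared backbone with a spoof-detection head and an identity head. This is a generic illustration under assumed names, not the paper's TPC/FDA/FDL components.

    ```python
    import torch
    import torch.nn as nn

    class TwoHeadFaceNet(nn.Module):
        """Shared feature extractor with two task heads: live-vs-spoof and identity."""
        def __init__(self, backbone: nn.Module, feat_dim: int, num_ids: int):
            super().__init__()
            self.backbone = backbone                     # any CNN that outputs (N, feat_dim)
            self.spoof_head = nn.Linear(feat_dim, 2)     # live vs. spoof
            self.id_head = nn.Linear(feat_dim, num_ids)  # face identity

        def forward(self, x):
            feat = self.backbone(x)
            return self.spoof_head(feat), self.id_head(feat)

    # Joint training would simply sum the two cross-entropy losses:
    # loss = ce(spoof_logits, spoof_labels) + ce(id_logits, id_labels)
    ```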

  • 2020
    3D Face Reconstruction from A Single Image Assisted by 2D Face Images in the Wild
    3D face reconstruction from a single image is an important task in many multimedia applications. Recent works typically learn a CNN-based 3D face model that regresses coefficients of a 3D Morphable Model (3DMM) from 2D images to perform 3D face reconstruction. However, the shortage of training data with 3D annotations considerably limits the performance of these methods. To alleviate this issue, we propose a novel 2D-Assisted Learning (2DAL) method that can effectively use “in the wild” 2D face images with noisy landmark information to substantially improve 3D face model learning. Specifically, taking the sparse 2D facial landmark heatmaps as additional information, 2DAL introduces four novel self-supervision schemes that view the 2D landmark and 3D landmark prediction as a self-mapping process, including landmark self-prediction consistency for 2D and 3D faces respectively, cycle-consistency over the 2D landmark prediction, and self-critic over the predicted 3DMM coefficients based on landmark prediction. Using these four self-supervision schemes, 2DAL significantly relieves the demand for conventional paired 2D-to-3D annotations and gives much higher-quality 3D face models without requiring any additional 3D annotations. Experiments on the AFLW2000-3D, AFLW-LFPA and Florence benchmarks show that our method outperforms state-of-the-arts for both 3D face reconstruction and dense face alignment by a large margin.

    Xiaoguang Tu, Jian Zhao, Mei Xie, Zihang Jiang, Akshaya Balamurugan, Yao Luo, Yang Zhao, Lingxiao He, Zheng Ma, and Jiashi Feng

    T-MM  PDF, BibTeX, Code

  • 2020
    Learning to Detect Head Movement in Unconstrained Remote Gaze Estimation in the Wild
    Unconstrained remote gaze estimation remains challenging mostly due to its vulnerability to large variability in head pose. Prior solutions struggle to maintain reliable accuracy in unconstrained remote gaze tracking. Among them, appearance-based solutions demonstrate tremendous potential for improving gaze accuracy. However, existing works still suffer from head movement and are not robust enough to handle real-world scenarios. In particular, most of them study gaze estimation under controlled scenarios where the collected datasets cover only limited ranges of both head pose and gaze, which introduces further bias. In this paper, we propose novel end-to-end appearance-based gaze estimation methods that more robustly incorporate different levels of head-pose representations into gaze estimation. Our method generalizes to real-world scenarios with low image quality, different lightings, and scenarios where direct head-pose information is not available. To better demonstrate the advantage of our methods, we further propose a new benchmark dataset with the richest distribution of head-gaze combinations, reflecting real-world scenarios. Extensive evaluations on several public datasets and our own dataset demonstrate that our method consistently outperforms the state-of-the-art by a significant margin.

    Zhecan Wang, Jian Zhao, Cheng Lu, Han Huang, Fan Yang, Lianji Li, and Yandong Guo

    WACV 2020  PDF, BibTeX

  • 2019
    Recognizing Profile Faces by Imagining Frontal View
    Extreme pose variation is one of the key obstacles to accurate face recognition in practice. Compared with current techniques for pose-invariant face recognition, which either expect pose invariance from hand-crafted features or data-driven deep learning solutions, or first normalize profile face images to frontal pose before feature extraction, we argue that it is more desirable to perform both tasks jointly to allow them to benefit from each other. To this end, we propose a Pose-Invariant Model (PIM) for face recognition in the wild, with three distinct novelties. First, PIM is a novel and unified deep architecture, containing a Face Frontalization sub-Net (FFN) and a Discriminative Learning sub-Net (DLN), which are jointly learned from end to end. Second, FFN is a well-designed dual-path Generative Adversarial Network (GAN) which simultaneously perceives global structures and local details, incorporating an unsupervised cross-domain adversarial training and a meta-learning (“learning to learn”) strategy using a siamese discriminator with dynamic convolution for high-fidelity and identity-preserving frontal view synthesis. Third, DLN is a generic Convolutional Neural Network (CNN) for face recognition with our enforced cross-entropy optimization strategy for learning discriminative yet generalized feature representations with large intra-class affinity and inter-class separability. Qualitative and quantitative experiments on both controlled and in-the-wild benchmark datasets demonstrate the superiority of the proposed model over the state-of-the-arts.

    Jian Zhao, Junliang Xing, Lin Xiong, Shuicheng Yan, and Jiashi Feng

    IJCV  PDF, BibTeX

  • 2019
    Cross-Resolution Face Recognition via Prior-Aided Face Hallucination and Residual Knowledge Distillation
    Recent deep learning based face recognition methods have achieved great performance, but it still remains challenging to recognize very low-resolution query faces, e.g., 28×28 pixels, when the CCTV camera is far from the captured subject. Such very low-resolution faces lose most of the identity details available at normal resolution in a gallery, making it hard to find the corresponding faces therein. To this end, we propose a Resolution Invariant Model (RIM) for addressing such cross-resolution face recognition problems, with three distinct novelties. First, RIM is a novel and unified deep architecture, containing a Face Hallucination sub-Net (FHN) and a Heterogeneous Recognition sub-Net (HRN), which are jointly learned end to end. Second, FHN is a well-designed tri-path Generative Adversarial Network (GAN) which simultaneously perceives facial structure and geometry prior information, i.e., landmark heatmaps and parsing maps, incorporated with an unsupervised cross-domain adversarial training strategy to super-resolve the very low-resolution query image to an 8× larger one without requiring them to be well aligned. Third, HRN is a generic Convolutional Neural Network (CNN) for heterogeneous face recognition with our proposed residual knowledge distillation strategy for learning discriminative yet generalized feature representations. Quantitative and qualitative experiments on several benchmarks demonstrate the superiority of the proposed model over the state-of-the-arts. Codes and models are available at https://github.com/HyoKong/Cross-Resolution-Face-Recognition.

    Hanyang Kong, Jian Zhao, Xiaoguang Tu, Junliang Xing, Shengmei Shen, and Jiashi Feng

    ArXiv  PDF, BibTeX, Code

  • 2019
    Fine-Grained Multi-Human Parsing
    Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily on visually understanding humans in crowded scenes, which underpins applications such as group behavior analysis, person re-identification, e-commerce, media editing, video surveillance, autonomous driving and virtual reality. To perform well, models need to comprehensively perceive the semantic information and the differences between instances in a multi-human image, which is recently defined as the multi-human parsing task. In this paper, we first present a new large-scale database "Multi-Human Parsing (MHP v2.0)" for algorithm development and evaluation to advance the research on understanding humans in crowded scenes. MHP v2.0 contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels and 16 dense pose key point labels, involving 2-26 persons per image captured in real-world scenes with various viewpoints, poses, occlusions, interactions and backgrounds. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network (GAN)-like sub-nets, respectively performing semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP and several other datasets, including MHP v1.0, PASCAL-Person-Part and Buffy, and serves as a strong baseline to shed light on generic instance-level semantic part prediction and drive future research on multi-human parsing. With the above innovations and contributions, we have organized the CVPR 2018 Workshop on Visual Understanding of Humans in Crowd Scene (VUHCS 2018) and the Fine-Grained Multi-Human Parsing and Pose Estimation Challenge. These contributions together significantly benefit the community. Code and pre-trained models are available at https://github.com/ZhaoJ9014/Multi-Human-Parsing_MHP.

    Jian Zhao, Jianshu Li, Hengzhu Liu, Shuicheng Yan, and Jiashi Feng

    IJCV   PDF, BibTeX, MHP Dataset v2.0 & v1.0, annotation tools, and source codes for NAN and evaluation metrics Download

  • 2019
    Multi-Prototype Networks for Unconstrained Set-based Face Recognition
    In this paper, we study the challenging unconstrained set-based face recognition problem, where each subject face is instantiated by a set of media (images and videos) instead of a single image. Naively aggregating information from all the media within a set would suffer from the large intra-set variance caused by heterogeneous factors (e.g., varying media modalities, poses and illuminations) and fail to learn discriminative face representations. A novel Multi-Prototype Network (MPNet) model is thus proposed to learn multiple prototype face representations adaptively from the media sets. Each learned prototype is representative of the subject face under certain conditions in terms of pose, illumination and media modality. Instead of hand-crafting the set partition for prototype learning, MPNet introduces a Dense SubGraph (DSG) learning sub-net that implicitly untangles inconsistent media and learns a number of representative prototypes. Qualitative and quantitative experiments clearly demonstrate the superiority of the proposed model over state-of-the-arts.

    Jian Zhao, Jianshu Li, Xiaoguang Tu, Fang Zhao, Yuan Xin, Junliang Xing, Hengzhu Liu, Shuicheng Yan, and Jiashi Feng

    IJCAI 2019 (Oral)  PDF, BibTeX, Foundation of Tencent Face Scan Payment

  • 2019
    Task Relation Networks
    Multi-task learning is popular in machine learning and computer vision. In multi-task learning, properly modeling task relations is important for boosting the performance of jointly learned tasks. Task covariance modeling has been successfully used to model the relations of tasks but is limited to homogeneous multi-task learning. In this paper, we propose a feature-based task relation modeling approach, suitable for both homogeneous and heterogeneous multi-task learning. First, we propose a new metric to quantify the relations between tasks. Based on this quantitative metric, we then develop the task relation layer, which can be combined with any deep learning architecture to form task relation networks that fully exploit the relations of different tasks in an online fashion. Benefiting from the task relation layer, task relation networks can better leverage the mutual information in the data. Extensive experiments on computer vision tasks demonstrate that the proposed task relation networks are effective in improving performance in both homogeneous and heterogeneous multi-task learning settings.

    Jianshu Li, Pan Zhou, Yunpeng Chen, Jian Zhao, Sujoy Roy, Shuicheng Yan, Jiashi Feng, and Terence Sim

    WACV 2019  PDF, BibTeX

  • 2019
    Look Across Elapse: Disentangled Representation Learning and Photorealistic Cross-Age Face Synthesis for Age-Invariant Face Recognition
    Despite the remarkable progress in face recognition related technologies, reliably recognizing faces across ages still remains a big challenge. The appearance of a human face changes substantially over time, resulting in significant intra-class variations. As opposed to current techniques for age-invariant face recognition, which either directly extract age-invariant features for recognition, or first synthesize a face that matches the target age before feature extraction, we argue that it is more desirable to perform both tasks jointly so that they can leverage each other. To this end, we propose a deep Age-Invariant Model (AIM) for face recognition in the wild with three distinct novelties. First, AIM presents a novel unified deep architecture jointly performing cross-age face synthesis and recognition in a mutually boosting way. Second, AIM achieves continuous face rejuvenation/aging with remarkable photorealistic and identity-preserving properties, avoiding the requirement of paired data and the true age of testing samples. Third, we develop effective and novel training strategies for learning the whole deep architecture end to end, which generates powerful age-invariant face representations explicitly disentangled from the age variation. Moreover, we propose a new large-scale Cross-Age Face Recognition (CAFR) benchmark dataset to facilitate existing efforts and push the frontiers of age-invariant face recognition research. Extensive experiments on both our CAFR and several other cross-age datasets (MORPH, CACD and FG-NET) demonstrate the superiority of the proposed AIM model over the state-of-the-arts. Benchmarking our model on one of the most popular unconstrained face recognition datasets, IJB-C, additionally verifies the promising generalizability of AIM in recognizing faces in the wild.

    Jian Zhao, Yu Cheng, Yi Cheng, Yang Yang, Haochong Lan, Fang Zhao, Lin Xiong, Yan Xu, Jianshu Li, Sugiri Pranata, Shengmei Shen, Junliang Xing, Hengzhu Liu, Shuicheng Yan, and Jiashi Feng

    AAAI 2019 (Oral)  PDF, BibTeX, Code

  • 2018
    Object Relation Detection Based on One-shot Learning
    Detecting the relations among objects, such as "cat on sofa" and "person ride horse", is a crucial task in image understanding, and beneficial to bridging the semantic gap between images and natural language. Despite the remarkable progress of deep learning in the detection and recognition of individual objects, it is still a challenging task to localize and recognize the relations between objects due to the complex combinatorial nature of various kinds of object relations. Inspired by recent advances in one-shot learning, we propose a simple yet effective Semantics Induced Learner (SIL) model for solving this challenging task. Learning in a one-shot manner can enable a detection model to adapt to a huge number of object relations with diverse appearance effectively and robustly. In addition, SIL combines bottom-up and top-down attention mechanisms, therefore enabling attention at both the visual and semantic levels favorably. Within our proposed model, the bottom-up mechanism, which is based on Faster R-CNN, proposes object regions, and the top-down mechanism selects and integrates visual features according to semantic information. Experiments demonstrate the effectiveness of our framework over other state-of-the-art methods on two large-scale datasets for object relation detection.

    Li Zhou, Jian Zhao, Jianshu Li, Li Yuan, and Jiashi Feng

    ArXiv  PDF, BibTeX

  • 2018
    3D-Aided Dual-Agent GANs for Unconstrained Face Recognition
    Synthesizing realistic profile faces is beneficial for more efficiently training deep pose-invariant models for large-scale unconstrained face recognition, by augmenting the number of samples with extreme poses and avoiding costly annotation work. However, learning from synthetic faces may not achieve the desired performance due to the discrepancy between the distributions of synthetic and real face images. To narrow this gap, we propose a Dual-Agent Generative Adversarial Network (DA-GAN) model, which can improve the realism of a face simulator's output using unlabeled real faces while preserving the identity information during the realism refinement. The dual agents are specially designed for distinguishing real vs. fake images and identities simultaneously. In particular, we employ an off-the-shelf 3D face model as a simulator to generate profile face images with varying poses. DA-GAN leverages a fully convolutional network as the generator to generate high-resolution images and an auto-encoder as the discriminator with the dual agents. Besides the novel architecture, we make several key modifications to the standard GAN to preserve pose, texture as well as identity, and to stabilize the training process: (i) a pose perception loss; (ii) an identity perception loss; (iii) an adversarial loss with a boundary equilibrium regularization term (a sketch of how such terms might be combined follows this entry). Experimental results show that DA-GAN not only achieves outstanding perceptual results but also significantly outperforms state-of-the-arts on the large-scale and challenging NIST IJB-A and CFP unconstrained face recognition benchmarks. In addition, the proposed DA-GAN is also a promising new approach for solving generic transfer learning problems more effectively. DA-GAN is the foundation of our winning entry to the NIST IJB-A face recognition competition, in which we secured 1st place on both the verification and identification tracks.

    Jian Zhao, Lin Xiong, Jianshu Li, Junliang Xing, Shuicheng Yan, and Jiashi Feng

    T-PAMI   PDF, BibTeX
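
    Below is a minimal, hedged sketch of combining the three generator-side loss terms named in the abstract (pose perception, identity perception, and a BEGAN-style adversarial term). The feature extractors pose_net and id_net and the auto-encoder discriminator disc are hypothetical stand-ins, and the weighting is illustrative rather than the paper's.

    ```python
    import torch.nn.functional as F

    def generator_loss(refined, synthetic, pose_net, id_net, disc,
                       w_pose=1.0, w_id=1.0, w_adv=0.1):
        """Illustrative combination of the three loss terms listed in the abstract.

        refined   -- generator output (refined synthetic face), tensor (N, C, H, W)
        synthetic -- raw simulator output the refinement started from
        pose_net  -- hypothetical pretrained pose feature extractor
        id_net    -- hypothetical pretrained identity feature extractor
        disc      -- auto-encoder style discriminator (BEGAN-like)
        """
        # Pose perception: keep pose features of the refined face close to the input.
        l_pose = F.l1_loss(pose_net(refined), pose_net(synthetic))
        # Identity perception: preserve identity features during realism refinement.
        l_id = F.l1_loss(id_net(refined), id_net(synthetic))
        # Adversarial term for an auto-encoder discriminator: the generator tries to
        # make its output easy for the discriminator to reconstruct.
        l_adv = F.l1_loss(disc(refined), refined)
        return w_pose * l_pose + w_id * l_id + w_adv * l_adv
    ```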

  • 2018
    Dynamic Conditional Networks for Few-Shot Learning
    This paper proposes a novel Dynamic Conditional Convolutional Network (DCCN) to handle conditional few-shot learning, i.e., only a few training samples are available for each condition. DCCN consists of dual subnets: DyConvNet contains a dynamic convolutional layer with a bank of basis filters; CondiNet predicts a set of adaptive weights from conditional inputs to linearly combine the basis filters. In this manner, a specific convolutional kernel can be dynamically obtained for each conditional input (a minimal sketch of this filter combination follows this entry). The filter bank is shared across all conditions, so only a low-dimensional weight vector needs to be learned. This significantly facilitates parameter learning across different conditions when training data are limited. We evaluate DCCN on four tasks which can be formulated as conditional model learning, including specific object counting, multi-modal image classification, phrase grounding and identity-based face generation. Extensive experiments demonstrate the superiority of the proposed model in the conditional few-shot learning setting.

    Fang Zhao, Jian Zhao, Shuicheng Yan, and Jiashi Feng

    (The first two authors are with equal contributions.)

    ECCV 2018  PDF, BibTeX, Poster, Code
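
    A minimal sketch of the dynamic filter combination described above: condition-dependent weights linearly mix a shared bank of basis filters into a single convolution kernel. The module and its names are illustrative assumptions, not the released DCCN code.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicConv2d(nn.Module):
        """Condition-dependent convolution: a small predictor maps the conditional
        input to weights that linearly combine a shared bank of basis filters."""
        def __init__(self, in_ch, out_ch, kernel_size, num_basis, cond_dim):
            super().__init__()
            # Shared bank of basis filters, reused for every condition.
            self.basis = nn.Parameter(
                0.01 * torch.randn(num_basis, out_ch, in_ch, kernel_size, kernel_size))
            # Predictor of combination weights from the conditional input (CondiNet-like).
            self.weight_net = nn.Linear(cond_dim, num_basis)

        def forward(self, x, cond):
            # x: (1, in_ch, H, W); cond: (cond_dim,) describing one condition.
            w = torch.softmax(self.weight_net(cond), dim=-1)        # (num_basis,)
            kernel = torch.einsum('b,bocij->ocij', w, self.basis)   # combined kernel
            return F.conv2d(x, kernel, padding=self.basis.shape[-1] // 2)

    # Example: a 3x3 dynamic convolution with 8 basis filters and a 32-d condition.
    layer = DynamicConv2d(in_ch=3, out_ch=16, kernel_size=3, num_basis=8, cond_dim=32)
    out = layer(torch.randn(1, 3, 64, 64), torch.randn(32))
    ```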

  • 2018
    Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing
    Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily on visually understanding humans in crowded scenes, such as group behavior analysis, person re-identification and autonomous driving, etc. To this end, models need to comprehensively perceive the semantic information and the differences between instances in a multi-human image, which is recently defined as the multi-human parsing task. In this paper, we present a new large-scale database “Multi-Human Parsing (MHP)” for algorithm development and evaluation, and advances the state-of-the-art in understanding humans in crowded scenes. MHP contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels, involving 2-26 persons per image and captured in real-world scenes from various viewpoints, poses, occlusion, interactions and background. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network (GAN)-like sub-nets, respectively performing semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP and several other datasets, and serves as a strong baseline to drive the future research for multi-human parsing.

    Jian Zhao, Jianshu Li, Yu Cheng, Li Zhou, Terence Sim, Shuicheng Yan, and Jiashi Feng

    ACM MM 2018 (Best Student Paper)  PDF, BibTeX, WeChat News, MHP Dataset v2.0 & v1.0, annotation tools, and source codes for NAN and evaluation metrics Download

  • 2018
    Multi-Human Parsing Machines
    Human parsing is an important task in human-centric analysis. Despite the remarkable progress in single-human parsing, the more realistic case of multi-human parsing remains challenging in terms of the data and the model. Compared with the considerable number of available single-human parsing datasets, the datasets for multi-human parsing are very limited in number mainly due to the huge annotation effort required. Besides the data challenge to multi-human parsing, the persons in real-world scenarios are often entangled with each other due to close interaction and body occlusion, making it difficult to distinguish body parts from different person instances. In this paper we propose the Multi-Human Parsing Machines (MHPM), which contains an MHP Montage model and an MHP Solver, to address both challenges in multi-human parsing. Specifically, the MHP Montage model in MHPM generates realistic images with multiple persons together with the parsing labels. It intelligently composes single persons onto background scene images while maintaining the structural information between persons and the scene. The generated images can be used to train better multi-human parsing algorithms. On the other hand, the MHP Solver in MHPM solves the bottleneck of distinguishing multiple entangled persons with close interaction. It employs a Group-Individual Push and Pull (GIPP) loss function, which can effectively separate persons with close interaction. We experimentally show that the proposed MHPM can achieve state-of-the-art performance on the multi-human parsing benchmark and the person individualization benchmark, which distinguishes closely entangled person instances.

    Jianshu Li, Jian Zhao, Yunpeng Chen, Sujoy Roy, Shuicheng Yan, Jiashi Feng, and Terence Sim

    ACM MM 2018  PDF, BibTeX

  • 2018
    3D-Aided Deep Pose-Invariant Face Recognition
    Learning from synthetic faces, though perhaps appealing for high data efficiency, may not bring satisfactory performance due to the distribution discrepancy of the synthetic and real face images. To mitigate this gap, we propose a 3D-Aided Deep Pose-Invariant Face Recognition Model (3D-PIM), which automatically recovers realistic frontal faces from arbitrary poses through a 3D face model in a novel way. Specifically, 3D-PIM incorporates a simulator with the aid of a 3D Morphable Model (3DMM) to obtain shape and appearance priors for accelerating face normalization learning, requiring less training data. It further leverages a global-local Generative Adversarial Network (GAN) with multiple critical improvements as a refiner to enhance the realism of both global structures and local details of the face simulator's output using unlabeled real data only, while preserving the identity information. Qualitative and quantitative experiments on both controlled and in-the-wild benchmarks clearly demonstrate the superiority of the proposed model over state-of-the-arts.

    Jian Zhao, Lin Xiong, Yu Cheng, Yi Cheng, Jianshu Li, Li Zhou, Yan Xu, Karlekar Jayashree, Sugiri Pranata, Shengmei Shen, Junliang Xing, Shuicheng Yan, and Jiashi Feng

    IJCAI 2018 (Oral)  PDF, BibTeX, Foundation of Panasonic FacePRO (YouTube News1, News2)

  • 2018
    Towards Pose Invariant Face Recognition in the Wild
    Pose variation is one key challenge in face recognition. As opposed to current techniques for pose invariant face recognition, which either directly extract pose invariant features for recognition, or first normalize profile face images to frontal pose before feature extraction, we argue that it is more desirable to perform both tasks jointly to allow them to benefit from each other. To this end, we propose a Pose Invariant Model (PIM) for face recognition in the wild, with three distinct novelties. First, PIM is a novel and unified deep architecture, containing a Face Frontalization sub-Net (FFN) and a Discriminative Learning sub-Net (DLN), which are jointly learned from end to end. Second, FFN is a well-designed dual-path Generative Adversarial Network (GAN) which simultaneously perceives global structures and local details, incorporated with an unsupervised cross-domain adversarial training and a "learning to learn" strategy for high-fidelity and identity-preserving frontal view synthesis. Third, DLN is a generic Convolutional Neural Network (CNN) for face recognition with our enforced cross-entropy optimization strategy for learning discriminative yet generalized feature representation. Qualitative and quantitative experiments on both controlled and in-the-wild benchmarks demonstrate the superiority of the proposed model over the state-of-the-arts.

    Jian Zhao, Yu Cheng, Yan Xu, Lin Xiong, Jianshu Li, Fang Zhao, Karlekar Jayashree, Sugiri Pranata, Shengmei Shen, Junliang Xing, Shuicheng Yan, and Jiashi Feng

    CVPR 2018  PDF, Poster, BibTeX, Foundation of Panasonic FacePRO (YouTube News1, News2)

  • 2018
    Weakly Supervised Phrase Localization with Multi-Scale Anchored Transformer Network
    In this paper, we propose a novel weakly supervised model, Multi-scale Anchored Transformer Network (MATN), to accurately localize free-form textual phrases with only image-level supervision. The proposed MATN takes region proposals as localization anchors, and learns a multi-scale correspondence network to continuously search for phrase regions referring to the anchors. In this way, MATN can exploit useful cues from these anchors to reliably reason about locations of the regions described by the phrases given only image-level supervision. Through differentiable sampling on image spatial feature maps, MATN introduces a novel training objective to simultaneously minimize a contrastive reconstruction loss between different phrases from a single image and a set of triplet losses among multiple images with similar phrases. Superior to existing region proposal based methods, MATN searches for the optimal bounding box over the entire feature map instead of selecting a sub-optimal one from discrete region proposals. We evaluate MATN on the Flickr30K Entities and ReferItGame datasets. The experimental results show that MATN significantly outperforms the state-of-the-art methods.

    Fang Zhao, Jianshu Li, Jian Zhao, and Jiashi Feng

    CVPR 2018  PDF, BibTeX

  • 2017
    Dual-Agent GANs for Photorealistic and Identity Preserving Profile Face Synthesis
    Synthesizing realistic profile faces is promising for more efficiently training deep pose-invariant models for large-scale unconstrained face recognition, by populating samples with extreme poses and avoiding tedious annotations. However, learning from synthetic faces may not achieve the desired performance due to the discrepancy between the distributions of synthetic and real face images. To narrow this gap, we propose a Dual-Agent Generative Adversarial Network (DA-GAN) model, which can improve the realism of a face simulator's output using unlabeled real faces, while preserving the identity information during the realism refinement. The dual agents are specifically designed for distinguishing real vs. fake images and identities simultaneously. In particular, we employ an off-the-shelf 3D face model as a simulator to generate profile face images with varying poses. DA-GAN leverages a fully convolutional network as the generator to generate high-resolution images and an auto-encoder as the discriminator with the dual agents. Besides the novel architecture, we make several key modifications to the standard GAN to preserve pose, texture and identity, and to stabilize the training process: (i) a pose perception loss; (ii) an identity perception loss; (iii) an adversarial loss with a boundary equilibrium regularization term. Experimental results show that DA-GAN not only presents compelling perceptual results but also significantly outperforms state-of-the-arts on the large-scale and challenging NIST IJB-A unconstrained face recognition benchmark. In addition, the proposed DA-GAN is also promising as a new approach for solving generic transfer learning problems more effectively. DA-GAN is the foundation of our submissions to the NIST IJB-A 2017 face recognition competition, where we won 1st place on both the verification and identification tracks.

    Jian Zhao, Lin Xiong, Karlekar Jayashree, Jianshu Li, Fang Zhao, Zhecan Wang, Sugiri Pranata, Shengmei Shen, Shuicheng Yan, and Jiashi Feng

    NeurIPS 2017 PDF, Poster, BibTeX, Foundation of Panasonic FacePRO (YouTube News1, News2)

  • 2017
    Robust LSTM-Autoencoders for Face De-Occlusion in the Wild
    Face recognition techniques have developed significantly in recent years. However, recognizing faces with partial occlusion is still challenging for existing face recognizers, yet such capability is heavily desired in real-world applications concerning surveillance and security. Although much research effort has been devoted to developing face de-occlusion methods, most of them can only work well under constrained conditions, such as when all faces come from a pre-defined closed set of subjects. In this paper, we propose a robust LSTM-Autoencoders (RLA) model to effectively restore partially occluded faces even in the wild. The RLA model consists of two LSTM components, which aim at occlusion-robust face encoding and recurrent occlusion removal, respectively. The first one, named the multi-scale spatial LSTM encoder, reads facial patches of various scales sequentially to output a latent representation; occlusion-robustness is achieved owing to the fact that the influence of occlusion falls only upon some of the patches. Receiving the representation learned by the encoder, the LSTM decoder with a dual-channel architecture reconstructs the overall face and detects occlusion simultaneously, and by virtue of the LSTM it breaks down the task of face de-occlusion into restoring the occluded part step by step. Moreover, to minimize identity information loss and guarantee face recognition accuracy over recovered faces, we introduce an identity-preserving adversarial training scheme to further improve RLA. Extensive experiments on both synthetic and real datasets of faces with occlusion clearly demonstrate the effectiveness of our proposed RLA in removing different types of facial occlusion at various locations. The proposed method also provides a significantly larger performance gain than other de-occlusion methods in promoting recognition performance over partially occluded faces.

    Fang Zhao, Jiashi Feng, Jian Zhao, Wenhan Yang, and Shuicheng Yan

    T-IP  PDF, BibTeX

  • 2017
    Conditional Dual-Agent GANs for Photorealistic and Annotation Preserving Image Synthesis
    In this paper, we propose a novel Conditional Dual-Agent GAN (CDA-GAN) for photorealistic and annotation-preserving image synthesis, which significantly benefits the learning of Deep Convolutional Neural Networks (DCNNs). Instead of merely distinguishing real or fake, the proposed dual agents of the Discriminator are able to preserve both realism and annotation information simultaneously through a standard adversarial loss and an annotation perception loss. During training, the Generator is conditioned on the desired image features learned by a pre-trained CNN sharing the same architecture as the Discriminator yet different weights. Thus, CDA-GAN is flexible in terms of scalability and able to generate photorealistic images with well-preserved annotation information for learning DCNNs in specific domains. We perform detailed experiments to verify the effectiveness of CDA-GAN, which outperforms other state-of-the-arts on the MNIST digit classification dataset and the IJB-A face recognition dataset.

    Zhecan Wang, Jian Zhao, Yu Cheng, Shengtao Xiao, Jianshu Li, Fang Zhao, Jiashi Feng, and Ashraf Kassim

    (The first two authors are with equal contributions.)

    BMVC 2017 FaceHUB Workshop (Oral) PDF, BibTeX

  • 2017
    High Performance Large Scale Face Recognition with Multi-Cognition Softmax and Feature Retrieval
    To solve the large-scale face recognition problem, a Multi-Cognition Softmax Model (MCSM) is proposed in this paper, which distributes training data to several cognition units via a data shuffling strategy. Here a cognition unit is introduced as a group of independent softmax models, designed to increase diversity over a single softmax model and thereby boost the performance of the model ensemble (a generic probability-averaging sketch follows this entry). Meanwhile, a template-based Feature Retrieval (FR) module is adopted to improve the performance of MCSM through a specific voting scheme. Moreover, a one-shot learning method is applied to an extra 600K collected identities, since each of these identities has only one image. Finally, testing images with lower scores from MCSM and FR are assigned new labels with higher scores by merging the one-shot learning results. Our solution ranks first in both settings of the final evaluation and outperforms other teams by a large margin.

    Yan Xu, Yu Cheng, Jian Zhao, Zhecan Wang, Lin Xiong, Karlekar Jayashree, Hajime Tamura, Tomoyuki Kagaya, Sugiri Pranata, Shengmei Shen, Jiashi Feng, and Junliang Xing

    ICCV 2017 MS-Celeb-1M Workshop (Oral)  PDF, BibTeX
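
    As a rough illustration of ensembling several independently trained softmax models, the snippet below simply averages their class probabilities and picks the top identity; this is a generic scheme for intuition only, not the paper's specific voting or feature-retrieval procedure.

    ```python
    import torch

    def ensemble_predict(logits_list, weights=None):
        """Fuse per-model logits by (optionally weighted) probability averaging.

        logits_list -- list of tensors, each (N, num_identities), one per softmax model
        weights     -- optional list of floats, one per model
        """
        probs = [torch.softmax(l, dim=-1) for l in logits_list]
        if weights is None:
            weights = [1.0 / len(probs)] * len(probs)
        fused = sum(w * p for w, p in zip(weights, probs))
        scores, identities = fused.max(dim=-1)   # confidence and predicted identity
        return identities, scores
    ```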

  • 2017
    Know You at One Glance: A Compact Vector Representation for Low-Shot Learning
    In this paper, we propose an enforced Softmax optimization approach which is able to improve the model's representational capacity by producing a “compact vector representation” for effectively solving the challenging low-shot learning face recognition problem. Compact vector representations are significantly helpful for overcoming the underlying multi-modality variations and keeping the primary key features as close to the mean face of the identity as possible in the high-dimensional feature space. Therefore, the gallery facial representations become more robust under various situations, leading to an overall performance improvement for low-shot learning. Comprehensive evaluations on the MNIST, LFW, and the challenging MS-Celeb-1M Low-Shot Learning Face Recognition benchmark datasets clearly demonstrate the superiority of our proposed method over state-of-the-arts.

    Yu Cheng, Jian Zhao, Zhecan Wang, Yan Xu, Karlekar Jayashree, Shengmei Shen, and Jiashi Feng

    (The first two authors are with equal contributions.)

    ICCV 2017 MS-Celeb-1M Workshop (Oral)  PDF, BibTeX

  • 2017
    Integrated Face Analytics Networks through Cross-Dataset Hybrid Training
    Face analytics benefits many multimedia applications. It consists of several tasks and most existing approaches generally treat these tasks independently, which limits their deployment in real scenarios. In this paper we propose an integrated Face Analytics Network (iFAN), which is able to perform multiple tasks jointly for face analytics with a novel carefully designed network architecture to fully facilitate the informative interaction among different tasks. The proposed integrated network explicitly models the interactions between tasks so that the correlations between tasks can be fully exploited for performance boost. In addition, to solve the bottleneck of the absence of datasets with comprehensive training data for various tasks, we propose a novel cross-dataset hybrid training strategy. It allows "plug-in and play" of multiple datasets annotated for different tasks without the requirement of a fully labeled common dataset for all the tasks. We experimentally show that the proposed iFAN achieves state-of-the-art performance on multiple face analytics tasks using a single integrated model. Specifically, iFAN achieves an overall F-score of 91.15% on the Helen dataset for face parsing, a normalized mean error of 5.81% on the MTFL dataset for facial landmark localization and an accuracy of 45.73% on the BNU dataset for emotion recognition with a single model.

    Jianshu Li, Shengtao Xiao, Fang Zhao, Jian Zhao, Jianan Li, Jiashi Feng, Shuicheng Yan, and Terence Sim

    ACM MM 2017 (Oral)   PDF, BibTeX

  • 2017
    Multi-Human Parsing in the Wild
    Human parsing is attracting increasing research attention. In this work, we aim to push the frontier of human parsing by introducing the problem of multi-human parsing in the wild. Existing works on human parsing mainly tackle single-person scenarios, which deviates from real-world applications where multiple persons are present simultaneously with interaction and occlusion. To address the multi-human parsing problem, we introduce a new multi-human parsing (MHP) dataset and a novel multi-human parsing model named MH-Parser. The MHP dataset contains multiple persons captured in real-world scenes with pixel-level fine-grained semantic annotations in an instance-aware setting. The MH-Parser generates global parsing maps and person instance masks simultaneously in a bottom-up fashion with the help of a new Graph-GAN model. We envision that the MHP dataset will serve as a valuable data resource to develop new multi-human parsing models, and the MH-Parser offers a strong baseline to drive future research for multi-human parsing in the wild.

    Jianshu Li, Jian Zhao, Yunchao Wei, Congyan Lang, Yidong Li, Terence Sim, Shuicheng Yan, and Jiashi Feng

    (The first two authors are with equal contributions.)

    ArXiv  WeChat News, PDF, BibTeX, MHP Dataset v1.0 Download

  • 2017
    Self-Supervised Neural Aggregation Networks for Human Parsing
    In this paper, we present a Self-Supervised Neural Aggregation Network (SS-NAN) for human parsing. SS-NAN adaptively learns to aggregate the multi-scale features at each pixel "address". In order to further improve the feature discriminative capacity, a self-supervised joint loss is adopted as an auxiliary learning strategy, which imposes human joint structures into parsing results without resorting to extra supervision. The proposed SS-NAN is end-to-end trainable and can be integrated into any advanced neural network to help aggregate features according to their importance at different positions and scales and to incorporate rich high-level knowledge regarding human joint structures from a global perspective, which in turn improves the parsing results. Comprehensive evaluations on the recent Look into Person (LIP) and the PASCAL-Person-Part benchmark datasets demonstrate the significant superiority of our method over other state-of-the-arts.

    Jian Zhao, Jianshu Li, Xuecheng Nie, Yunpeng Chen, Zhecan Wang, Shuicheng Yan, and Jiashi Feng

    CVPR 2017 Visual Understanding of Human in Crowd Scene Workshop (Oral)  PDF, BibTeX, Code

  • 2017
    Estimation of Affective Level in the Wild with Multiple Memory Networks
    This paper presents our solution to the "affect in the wild" challenge, which aims to estimate the affective level, i.e., the valence and arousal values, of every frame in a video. A carefully designed deep convolutional neural network (a variation of residual network) for affective level estimation of facial expressions is first implemented as a baseline. Next, we use multiple memory networks to model the temporal relations between frames. Finally, ensemble models are used to combine the predictions from the multiple memory networks. Our proposed solution outperforms the baseline model by 10.62% in terms of mean squared error (MSE).

    Jianshu Li, Yunpeng Chen, Shengtao Xiao, Jian Zhao, Sujoy Roy, Jiashi Feng, Shuicheng Yan, and Terence Sim

    CVPR 2017 Faces in-the-wild Workshop (Oral)  PDF, BibTeX

  • 2017
    A Good Practice Towards Top Performance of Face Recognition: Transferred Deep Feature Fusion
    Unconstrained face recognition performance evaluations have traditionally focused on the Labeled Faces in the Wild (LFW) dataset for imagery and the YouTube Faces (YTF) dataset for videos over the last couple of years. Spectacular progress in this field has resulted in saturated verification and identification accuracies on those benchmark datasets. In this paper, we propose a unified learning framework named Transferred Deep Feature Fusion, targeting the new IARPA Janus Benchmark A (IJB-A) face recognition dataset released by the NIST face challenge. The IJB-A dataset includes real-world unconstrained faces from 500 subjects with full pose and illumination variations, which are much harder than the LFW and YTF datasets. Inspired by transfer learning, we train two advanced deep convolutional neural networks (DCNNs) with two different large datasets in the source domain, respectively. By exploring the complementarity of the two distinct DCNNs, deep feature fusion is utilized after feature extraction in the target domain. Then, template-specific linear SVMs are adopted to enhance the discrimination of the framework (a minimal fusion-plus-SVM sketch follows this entry). Finally, multiple matching scores corresponding to different templates are merged as the final results. This simple unified framework outperforms the state-of-the-art by a wide margin on the IJB-A dataset. Based on the proposed approach, we have submitted our IJB-A results to the National Institute of Standards and Technology (NIST) for official evaluation.

    Lin Xiong, Jayashree Karlekar, Jian Zhao, Jiashi Feng, and Shengmei Shen

    (The first three authors are with equal contributions.)

    ArXiv  PDF, BibTeX, Foundation of Panasonic FacePRO (YouTube News1, News2)
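
    A minimal sketch, under assumed inputs, of the two steps named in the abstract: concatenating L2-normalized features from two networks, then training a linear SVM per gallery template. The function names and the choice of scikit-learn's LinearSVC are illustrative, not the paper's exact setup.

    ```python
    import numpy as np
    from sklearn.svm import LinearSVC

    def fuse_features(feat_a, feat_b):
        """Late fusion: L2-normalize each network's features, then concatenate."""
        def l2(f):
            return f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)
        return np.hstack([l2(feat_a), l2(feat_b)])

    def train_template_svm(template_feats, negative_feats, C=10.0):
        """Train a one-vs-rest linear SVM for a single gallery template."""
        X = np.vstack([template_feats, negative_feats])
        y = np.hstack([np.ones(len(template_feats)), np.zeros(len(negative_feats))])
        return LinearSVC(C=C).fit(X, y)

    # Usage: fused = fuse_features(dcnn1_feats, dcnn2_feats); svm = train_template_svm(...)
    # Probe scores for that template would then come from svm.decision_function(...).
    ```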

  • 2017
    Marginalized CNN: Learning Deep Invariant Representations
    Training a deep neural network usually requires sufficient annotated samples. The scarcity of supervision samples in practice thus becomes the major bottleneck for the performance of the network. In this work, we propose a principled method to circumvent this difficulty by marginalizing all possible transformations over samples, termed the Marginalized Convolutional Neural Network (mCNN). mCNN implicitly considers infinitely many transformed copies of the training data in every training epoch and is therefore able to learn representations invariant to transformations in an end-to-end way. We prove that such marginalization can be understood as a classic CNN with a special form of regularization and thus is efficient to implement (an explicit-sampling contrast is sketched after this entry). Experimental results on the MNIST and affNIST digit datasets demonstrate that mCNN can match or outperform the original CNN with much fewer training samples. Moreover, mCNN also performs well for face recognition on the recently released large-scale MS-Celeb-1M dataset and outperforms state-of-the-arts. Compared with traditional CNNs which use data augmentation to improve their performance, the computational cost of mCNN is reduced by a factor of 25.

    Jian Zhao, Jianshu Li, Fang Zhao, Shuicheng Yan, and Jiashi Feng

    BMVC 2017 PDF, BibTeX
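
    For intuition, the snippet below shows the explicit (Monte Carlo) alternative that mCNN avoids: averaging the loss over several randomly transformed copies of each batch. The paper instead derives a closed-form regularizer, so this is a contrast for illustration only, with assumed transform choices.

    ```python
    import torchvision.transforms as T

    def sampled_marginal_loss(model, images, labels, criterion, num_samples=4):
        """Average the training loss over randomly transformed copies of the batch.

        This explicit sampling approximates marginalizing over transformations;
        mCNN replaces it with a closed-form regularization term.
        """
        augment = T.RandomAffine(degrees=15, translate=(0.1, 0.1))
        total = 0.0
        for _ in range(num_samples):
            total = total + criterion(model(augment(images)), labels)  # one transform per pass
        return total / num_samples
    ```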

  • 2016
    Robust Face Recognition with Deep Multi-View Representation Learning
    This paper describes our proposed method targeting the MSR Image Recognition Challenge MS-Celeb-1M. The challenge is to recognize one million celebrities from their face images captured in the real world. The challenge provides a large-scale dataset crawled from the Web, which contains a large number of celebrities with many images for each subject. Given a new testing image, the challenge requires an identity for the image and the corresponding confidence score. To complete the challenge, we propose a two-stage approach consisting of data cleaning and multi-view deep representation learning. The data cleaning can effectively reduce the noise level of the training data and thus improves the performance of deep learning based face recognition models. The multi-view representation learning enables the learned face representations to be more specific and discriminative. Thus the difficulties of recognizing faces out of a huge number of subjects are substantially relieved. Our proposed method achieves a coverage of 46.1% at 95% precision on the random set and a coverage of 33.0% at 95% precision on the hard set of this challenge.

    Jianshu Li, Jian Zhao, Fang Zhao, Hao Liu, Jing Li, Shengmei Shen, Jiashi Feng, and Terence Sim

    ACM MM 2016 (Oral)  PDF, BibTeX

  • 2015
    BE-SIFT: A More Brief and Efficient SIFT Image Matching Algorithm for Computer Vision

    Jian Zhao, Hengzhu Liu, Yiliu Feng, Shandong Yuan, and Wanzeng Cai

    IEEE PICOM2015  PDF, BibTeX

  • 2014
    Realization and Design of A Pilot Assist Decision-Making System Based on Speech Recognition

    Jian Zhao, Hengzhu Liu, Xucan Chen, and Zhengfa Liang

    AIAA2014  PDF, BibTeX

  • 2014
    A New Efficient Key Technology for Space Telemetry Wireless Data Link: The Low-Complexity SC-CPM SC-FDE Algorithm

    Jian Zhao, Hengzhu Liu, Xucan Chen, Botao Zhang, and Li Zhou

    ICT2014  Link, BibTeX

  • 2014
    A New Technology for MIMO Detection: The μ Quantum Genetic Sphere Decoding Algorithm

    Jian Zhao, Hengzhu Liu, Xucan Chen, and Ting Chen

    ACA2014  Link, BibTeX

  • 2014
    Research on A Kind of Optimization Scheme of MIMO-OFDM Sphere Equalization Technology for Unmanned Aerial Vehicle Wireless Image Transmission Data Link System

    Jian Zhao, Hengzhu Liu, Xucan Chen, and Shandong Yuan

    ACA2014  Link, BibTeX

    Design and Implementation for A New Kind of Extensible Digital Communication Simulation System Based on Matlab

    Jian Zhao, Hengzhu Liu, Xucan Chen, Botao Zhang, and Ting Chen

    Journal of Northeastern University

Ph.D. Dissertation

  • DEEP LEARNING FOR HUMAN-CENTRIC IMAGE ANALYSIS: FROM FACE RECOGNITION TO HUMAN PARSING.

    National University of Singapore, Singapore, 2019. Link, BibTeX

Special Mention

  • I was invited by CAAI to deliver a talk "Towards Unconstrained Human-centric Intelligent Perception and Deep Understanding" on 29th October 2022 (Link).
  • I was invited by BSIG to deliver a talk "Towards Unconstrained Human-centric Intelligent Perception and Deep Understanding" on 9th September 2022 (Link).
  • I was invited by Tsinghua AIR Webinar to deliver a talk "Towards Unconstrained Image/Video Deep Understanding" on 19th July 2022 (Link).
  • I was invited by CSIG Webinar to deliver a talk "Towards Unconstrained Image/Video Deep Understanding" on 28th June 2022 (Link).
  • Officially Interviewed by CSIG. (Link)
  • Officially Interviewed by Beijing Association for Science and Technology, due to a series of contributions on "Unconstrained Image/Video Deep Understanding" (BAST Official Interview1, BAST Official Interview2, BAST Official Interview3).
  • Baidu PaddlePaddle officially merged face.evoLVe to better facilitate cutting-edge research and applications on facial analytics and human-centric multimedia understanding (Official Announcement).
  • I have co-organized a VALSE Tutorial with Assoc. Prof. Xingxing Wei on the topic of "Deep Synthesis and Detection in Adversarial Environments" on 08/09/2021 (Link).
  • I have co-organized a VALSE Webinar with Dr. Wenguan Wang and Prof. Zheng Wang on the topic of "Human-Centric Vision Techniques" on 13/01/2021 (Link).
  • I was invited by Qihoo 360 to attend the "2020 Shanghai Digital Innovation Conference" on 05/12/2020 as a panel guest of the "Opportunities and Challenges AI Brings to Cyberspace" session.
  • I have co-organized a VALSE Webinar with Prof. Chang Xu on the topic of "Visual Generation and Synthesis" on 23/09/2020 (Link).
  • I was invited by Jiang Men to attend the Jiang Men ECCV 2020 online meetup on 30/08/2020 as a panel guest of the "Surge Forward, Young Generation: Transitioning from Ph.D. to Assistant Professor" session. Video, Review
  • I was invited by Prof. Jimin Xiao at Xi'an Jiaotong-Liverpool University, Suzhou, China to deliver a talk on "Deep Learning for Human-Centric Image Analysis: From Face Recognition to Human Parsing" on 31/07/2020.
  • I have co-organized a VALSE Webinar with Prof. Shiguang Shan on the topic of "Face-based Human Understanding: beyond Face Recognition" on 25/03/2020 (Link1, Link2).
  • I was invited by Prof. Congyan Lang at Beijing Jiaotong University, Beijing, China to deliver a talk on "Deep Learning for Human-Centric Image Analysis: From Face Recognition to Human Parsing" on 24/10/2019.
  • I was invited by Dr. Jingtuo Liu at Baidu, Beijing, China to deliver a talk on "Deep Learning for Human-Centric Image Analysis: From Face Recognition to Human Parsing" on 05/09/2019.
  • I was invited by Prof. Ran He at Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China to deliver a talk on "Deep Learning for Human-Centric Image Analysis: From Face Recognition to Human Parsing" on 29/08/2019.
  • I was invited by Huawei Noah's Ark Lab to deliver a talk on "Understanding Humans in Visual Scenes" on 13th June 2019.
  • I was invited by Peng Cheng Laboratory (PCL) to attend the 2019 Overseas Young Scientist Forum at Shenzhen China during 30/03/2019-01/04/2019 and deliver a talk on "Deep Learning for Human-Centric Image Analysis: From Face Recognition to Human Parsing".
  • I delivered a spotlight talk on "Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing" at VALSE 2019.
  • I was invited by CoLab, School of Computer, Beihang University to deliver a talk on "Deep Learning for Human-Centric Image Analysis: From Face Recognition to Human Parsing" on 23rd March 2019.
  • I was invited by Tencent Deep Sea AI Lab to deliver a talk "Margin-based Representation Learning, Residual Knowledge Distillation and Prior-Aided Super Resolution" on 1st March 2019.
  • I was invited by UBTECH to deliver a talk "Deep Learning for Human-Centric Image Understanding" on 8th January 2019.
  • I was invited by OmniVision to deliver a talk "Facial Analytics" on 16th November 2018.
  • I was invited by Jiang Men to deliver a talk "Deep Learning for Human-Centric Image Understanding" on 30th August 2018 (Link, Poster, Summary).
  • I was invited by VALSE Webinar to deliver a talk "Deep Learning for Human-Centric Image Understanding" on 22nd August 2018 (Link, Summary).
  • I represented our group at the "Launch of NUS' new Vision, Mission and Values" at the University Cultural Centre on 15th August 2018, and presented our recent work on Facial Analytics to NUS President Prof. Tan Eng Chye. NUS Instagram, NUS News, Gallery

Selected Awards

Open Positions

  • Several research fellow, master's/Ph.D. student, engineer, and research assistant positions are available. Interested candidates with a strong publication record are welcome to email me.
  • I cannot host visiting foreign students due to administrative restrictions.

Contact

Email: zhaojian90 (at) u (dot) nus (dot) edu OR zhaojian9014 (at) gmail (dot) com

Phone: (86) 010 6630 5363 OR (86) 183 0135 5501

Address: 226 North Fourth Ring Road, Haidian District, Beijing, China 100191

Modified: 02 March 2023