  • Ph.D. 2015 - 2019 (Expected)

    Department of Electrical and Computer Engineering, Faculty of Engineering

    National University of Singapore

  • Ph.D. 2014 - 2015

    School of Computer

    National University of Defense Technology, China

  • M.Eng. 2012 - 2014

    School of Computer

    National University of Defense Technology, China

  • B.Sc. 2008 - 2012

    School of Automation Science and Electrical Engineering

    Beihang University, China

Work Experience

  • 2016 - 2017

    Graduate Assistant


  • 2016 - 2017

    Research Intern

    Core Technology Group, Learning & Vision, Panasonic R&D Center Singapore

  • 2011 - 2012

    Software Engineer

    China Aerospace Science and Industry Corporation (CASIC)

Publications

  • 2018
    Towards Pose Invariant Face Recognition in the Wild
    Pose variation is one key challenge in face recognition. As opposed to current techniques for pose invariant face recognition, which either directly extract pose invariant features for recognition, or first normalize profile face images to frontal pose before feature extraction, we argue that it is more desirable to perform both tasks jointly to allow them to benefit from each other. To this end, we propose a Pose Invariant Model (PIM) for face recognition in the wild, with three distinct novelties. First, PIM is a novel and unified deep architecture, containing a Face Frontalization sub-Net (FFN) and a Discriminative Learning sub-Net (DLN), which are jointly learned end to end. Second, FFN is a well-designed dual-path Generative Adversarial Network (GAN) which simultaneously perceives global structures and local details, and incorporates unsupervised cross-domain adversarial training and a "learning to learn" strategy for high-fidelity and identity-preserving frontal view synthesis. Third, DLN is a generic Convolutional Neural Network (CNN) for face recognition with our enforced cross-entropy optimization strategy for learning discriminative yet generalized feature representations. Qualitative and quantitative experiments on both controlled and in-the-wild benchmarks demonstrate the superiority of the proposed model over the state of the art.

    Jian Zhao, Yu Cheng, Yan Xu, Lin Xiong, Jianshu Li, Fang Zhao, Karlekar Jayashree, Sugiri Pranata, Shengmei Shen, Junliang Xing, Shuicheng Yan, and Jiashi Feng

    CVPR 2018  Foundation of Panasonic FacePRO (YouTube News1, News2)

  • 2018
    Weakly Supervised Phrase Localization with Multi-Scale Anchored Transformer Network
    In this paper, we propose a novel weakly supervised model, Multi-scale Anchored Transformer Network (MATN), to accurately localize free-form textual phrases with only image-level supervision. The proposed MATN takes region proposals as localization anchors, and learns a multi-scale correspondence network to continuously search for phrase regions referring to the anchors. In this way, MATN can exploit useful cues from these anchors to reliably reason about locations of the regions described by the phrases given only image-level supervision. Through differentiable sampling on image spatial feature maps, MATN introduces a novel training objective to simultaneously minimize a contrastive reconstruction loss between different phrases from a single image and a set of triplet losses among multiple images with similar phrases. Superior to existing region proposal based methods, MATN searches for the optimal bounding box over the entire feature map instead of selecting a sub-optimal one from discrete region proposals. We evaluate MATN on the Flickr30K Entities and ReferItGame datasets. The experimental results show that MATN significantly outperforms the state-of-the-art methods.

    Fang Zhao, Jianshu Li, Jian Zhao, and Jiashi Feng

    CVPR 2018 

  • 2017
    Dual-Agent GANs for Photorealistic and Identity Preserving Profile Face Synthesis
    Synthesizing realistic profile faces is promising for more efficiently training deep pose-invariant models for large-scale unconstrained face recognition, by populating samples with extreme poses and avoiding tedious annotations. However, learning from synthetic faces may not achieve the desired performance due to the discrepancy between distributions of the synthetic and real face images. To narrow this gap, we propose a Dual-Agent Generative Adversarial Network (DA-GAN) model, which can improve the realism of a face simulator's output using unlabeled real faces, while preserving the identity information during the realism refinement. The dual agents are specifically designed for distinguishing real vs. fake and identities simultaneously. In particular, we employ an off-the-shelf 3D face model as a simulator to generate profile face images with varying poses. DA-GAN leverages a fully convolutional network as the generator to generate high-resolution images and an auto-encoder as the discriminator with the dual agents. Besides the novel architecture, we make several key modifications to the standard GAN to preserve pose and texture, preserve identity, and stabilize the training process: (i) a pose perception loss; (ii) an identity perception loss; (iii) an adversarial loss with a boundary equilibrium regularization term. Experimental results show that DA-GAN not only presents compelling perceptual results but also significantly outperforms the state of the art on the large-scale and challenging NIST IJB-A unconstrained face recognition benchmark. In addition, the proposed DA-GAN is also promising as a new approach for solving generic transfer learning problems more effectively. DA-GAN is the foundation of our submissions to the NIST IJB-A 2017 face recognition competitions, where we won first place on both the verification and identification tracks.

    Jian Zhao, Lin Xiong, Karlekar Jayashree, Jianshu Li, Fang Zhao, Zhecan Wang, Sugiri Pranata, Shengmei Shen, Shuicheng Yan, and Jiashi Feng

    NIPS 2017 PDF, Poster, BibTeX, Foundation of Panasonic FacePRO (YouTube News1, News2)

  • 2017
    Robust LSTM-Autoencoders for Face De-Occlusion in the Wild
    Face recognition techniques have developed significantly in recent years. However, recognizing faces with partial occlusion, a capability in high demand for real-world surveillance and security applications, is still challenging for existing face recognizers. Although much research effort has been devoted to developing face de-occlusion methods, most of them work well only under constrained conditions, such as when all faces come from a pre-defined closed set of subjects. In this paper, we propose a robust LSTM-Autoencoders (RLA) model to effectively restore partially occluded faces even in the wild. The RLA model consists of two LSTM components, which aim at occlusion-robust face encoding and recurrent occlusion removal respectively. The first one, named multi-scale spatial LSTM encoder, reads facial patches of various scales sequentially to output a latent representation, and occlusion-robustness is achieved owing to the fact that occlusion affects only some of the patches. Receiving the representation learned by the encoder, the LSTM decoder with a dual channel architecture reconstructs the overall face and detects occlusion simultaneously, and by virtue of the LSTM, the decoder breaks down the task of face de-occlusion into restoring the occluded part step by step. Moreover, to minimize identity information loss and guarantee face recognition accuracy over recovered faces, we introduce an identity-preserving adversarial training scheme to further improve RLA. Extensive experiments on both synthetic and real data sets of faces with occlusion clearly demonstrate the effectiveness of our proposed RLA in removing different types of facial occlusion at various locations. The proposed method also provides a significantly larger performance gain than other de-occlusion methods in promoting recognition performance over partially occluded faces.

    Fang Zhao, Jiashi Feng, Jian Zhao, Wenhan Yang, and Shuicheng Yan

    IEEE Transactions on Image Processing  Link, BibTeX

  • 2017
    Conditional Dual-Agent GANs for Photorealistic and Annotation Preserving Image Synthesis
    In this paper, we propose a novel Conditional Dual-Agent GAN (CDA-GAN) for photorealistic and annotation preserving image synthesis, which significantly benefits the learning of Deep Convolutional Neural Networks (DCNNs). Instead of merely distinguishing real from fake, the proposed dual agents of the Discriminator are able to preserve both realism and annotation information simultaneously through a standard adversarial loss and an annotation perception loss. During training, the Generator is conditioned on the desired image features learned by a pre-trained CNN sharing the same architecture as the Discriminator but with different weights. Thus, CDA-GAN is flexible in terms of scalability and able to generate photorealistic images with well preserved annotation information for learning DCNNs in specific domains. We perform detailed experiments to verify the effectiveness of CDA-GAN, which outperforms other state-of-the-art methods on the MNIST digit classification dataset and the IJB-A face recognition dataset.

    Zhecan Wang, Jian Zhao, Yu Cheng, Shengtao Xiao, Jianshu Li, Fang Zhao, Jiashi Feng, and Ashraf Kassim

    (The first two authors are with equal contributions.)

    BMVC 2017 FaceHUB Workshop (Oral) PDF, BibTeX

  • 2017
    High Performance Large Scale Face Recognition with Multi-Cognition Softmax and Feature Retrieval
    To solve the large-scale face recognition problem, this paper proposes a Multi-Cognition Softmax Model (MCSM) that distributes training data to several cognition units by a data shuffling strategy. Here we define a cognition unit as a group of independent softmax models, designed to increase the diversity of a single softmax model and thereby boost the performance of the model ensemble. Meanwhile, a template-based Feature Retrieval (FR) module is adopted to improve the performance of MCSM by a specific voting scheme. Moreover, a one-shot learning method is applied to an extra 600K collected identities, because each identity has only one image. Finally, testing images with lower scores from MCSM and FR are assigned new labels with higher scores by merging the one-shot learning results. Our solution ranks first in both settings of the final evaluation and outperforms other teams by a large margin.

    Yan Xu, Yu Cheng, Jian Zhao, Zhecan Wang, Lin Xiong, Karlekar Jayashree, Hajime Tamura, Tomoyuki Kagaya, Sugiri Pranata, Shengmei Shen, Jiashi Feng, and Junliang Xing

    ICCV 2017 MS-Celeb-1M Workshop (Oral)  PDF, BibTeX

  • 2017
    Know You at One Glance: A Compact Vector Representation for Low-Shot Learning
    In this paper, we propose an enforced Softmax optimization approach which is able to improve the model's representational capacity by producing a "compact vector representation" for effectively solving the challenging low-shot learning face recognition problem. Compact vector representations help significantly to overcome the underlying multi-modality variations and keep the primary key features as close to the mean face of the identity as possible in the high-dimensional feature space. Therefore, the gallery facial representations become more robust under various situations, leading to an overall performance improvement for low-shot learning. Comprehensive evaluations on the MNIST, LFW, and the challenging MS-Celeb-1M Low-Shot Learning Face Recognition benchmark datasets clearly demonstrate the superiority of our proposed method over the state of the art.

    Yu Cheng, Jian Zhao, Zhecan Wang, Yan Xu, Karlekar Jayashree, Shengmei Shen, and Jiashi Feng

    (The first two authors are with equal contributions.)

    ICCV 2017 MS-Celeb-1M Workshop (Oral)  PDF, BibTeX

  • 2017
    Integrated Face Analytics Networks through Cross-Dataset Hybrid Training
    Face analytics benefits many multimedia applications. It consists of several tasks and most existing approaches generally treat these tasks independently, which limits their deployment in real scenarios. In this paper we propose an integrated Face Analytics Network (iFAN), which is able to perform multiple tasks jointly for face analytics with a novel carefully designed network architecture to fully facilitate the informative interaction among different tasks. The proposed integrated network explicitly models the interactions between tasks so that the correlations between tasks can be fully exploited for performance boost. In addition, to solve the bottleneck of the absence of datasets with comprehensive training data for various tasks, we propose a novel cross-dataset hybrid training strategy. It allows "plug-in and play" of multiple datasets annotated for different tasks without the requirement of a fully labeled common dataset for all the tasks. We experimentally show that the proposed iFAN achieves state-of-the-art performance on multiple face analytics tasks using a single integrated model. Specifically, iFAN achieves an overall F-score of 91.15% on the Helen dataset for face parsing, a normalized mean error of 5.81% on the MTFL dataset for facial landmark localization and an accuracy of 45.73% on the BNU dataset for emotion recognition with a single model.

    Jianshu Li, Shengtao Xiao, Fang Zhao, Jian Zhao, Jianan Li, Jiashi Feng, Shuicheng Yan, and Terence Sim

    ACM MM 2017 (Oral)   PDF, BibTeX

  • 2017
    Multiple-Human Parsing in the Wild
    Human parsing is attracting increasing research attention. In this work, we aim to push the frontier of human parsing by introducing the problem of multi-human parsing in the wild. Existing works on human parsing mainly tackle single-person scenarios, which deviates from real-world applications where multiple persons are present simultaneously with interaction and occlusion. To address the multi-human parsing problem, we introduce a new multi-human parsing (MHP) dataset and a novel multi-human parsing model named MH-Parser. The MHP dataset contains multiple persons captured in real-world scenes with pixel-level fine-grained semantic annotations in an instance-aware setting. The MH-Parser generates global parsing maps and person instance masks simultaneously in a bottom-up fashion with the help of a new Graph-GAN model. We envision that the MHP dataset will serve as a valuable data resource to develop new multi-human parsing models, and the MH-Parser offers a strong baseline to drive future research for multi-human parsing in the wild.

    Jianshu Li, Jian Zhao, Yunchao Wei, Congyan Lang, Yidong Li, Terence Sim, Shuicheng Yan, and Jiashi Feng

    (The first two authors are with equal contributions.)

    Under review  WeChat News, PDF, BibTeX, MHP Dataset v1.0 Download

  • 2017
    Self-Supervised Neural Aggregation Networks for Human Parsing
    In this paper, we present a Self-Supervised Neural Aggregation Network (SS-NAN) for human parsing. SS-NAN adaptively learns to aggregate the multi-scale features at each pixel "address". In order to further improve the feature discriminative capacity, a self-supervised joint loss is adopted as an auxiliary learning strategy, which imposes human joint structures into parsing results without resorting to extra supervision. The proposed SS-NAN is end-to-end trainable. SS-NAN can be integrated into any advanced neural network to help aggregate features according to their importance at different positions and scales and to incorporate rich high-level knowledge regarding human joint structures from a global perspective, which in turn improves the parsing results. Comprehensive evaluations on the recent Look into Person (LIP) and the PASCAL-Person-Part benchmark datasets demonstrate the significant superiority of our method over other state-of-the-art methods.

    Jian Zhao, Jianshu Li, Xuecheng Nie, Yunpeng Chen, Zhecan Wang, Shuicheng Yan, and Jiashi Feng

    CVPR 2017 Workshop on Visual Understanding of Human in Crowd Scene (Oral)  PDF, BibTeX

  • 2017
    Estimation of Affective Level in the Wild with Multiple Memory Networks
    This paper presents our proposed solution to the "affect in the wild" challenge, which aims to estimate the affective level, i.e. the valence and arousal values, of every frame in a video. A carefully designed deep convolutional neural network (a variation of residual network) for affective level estimation of facial expressions is first implemented as a baseline. Next we use multiple memory networks to model the temporal relations between the frames. Finally ensemble models are used to combine the predictions from multiple memory networks. Our proposed solution outperforms the baseline model by 10.62% in terms of mean square error (MSE).

    Jianshu Li, Yunpeng Chen, Shengtao Xiao, Jian Zhao, Sujoy Roy, Jiashi Feng, Shuicheng Yan, and Terence Sim

    CVPR Faces in-the-wild 2017 Workshop (Oral)  PDF, BibTeX

  • 2017
    A Good Practice Towards Top Performance of Face Recognition: Transferred Deep Feature Fusion
    Unconstrained face recognition performance evaluations have traditionally focused on the Labeled Faces in the Wild (LFW) dataset for imagery and the YouTubeFaces (YTF) dataset for videos in the last couple of years. Spectacular progress in this field has resulted in a saturation of verification and identification accuracies on those benchmark datasets. In this paper, we propose a unified learning framework named transferred deep feature fusion, targeting the new IARPA Janus Benchmark A (IJB-A) face recognition dataset released by the NIST face challenge. The IJB-A dataset includes real-world unconstrained faces from 500 subjects with full pose and illumination variations, which are much harder than the LFW and YTF datasets. Inspired by transfer learning, we train two advanced deep convolutional neural networks (DCNNs) with two different large datasets in the source domain, respectively. By exploring the complementarity of the two distinct DCNNs, deep feature fusion is utilized after feature extraction in the target domain. Then, template-specific linear SVMs are adopted to enhance the discrimination of the framework. Finally, multiple matching scores corresponding to different templates are merged as the final results. This simple unified framework outperforms the state of the art by a wide margin on the IJB-A dataset. Based on the proposed approach, we have submitted our IJB-A results to the National Institute of Standards and Technology (NIST) for official evaluation.

    Lin Xiong, Jayashree Karlekar, Jian Zhao, Jiashi Feng, and Shengmei Shen

    (The first three authors are with equal contributions.)

    arXiv PDF, BibTeX, Foundation of Panasonic FacePRO (YouTube News1, News2)

  • 2017
    Marginalized CNN: Learning Deep Invariant Representations
    Training a deep neural network usually requires sufficient annotated samples. The scarcity of supervision samples in practice thus becomes the major bottleneck on performance of the network. In this work, we propose a principled method to circumvent this difficulty through marginalizing all the possible transformations over samples, termed Marginalized Convolutional Neural Network (mCNN). mCNN implicitly considers infinitely many transformed copies of the training data in every training epoch and therefore is able to learn representations invariant to transformation in an end-to-end way. We prove that such marginalization can be understood as a classic CNN with a special form of regularization and thus is efficient to implement. Experimental results on the MNIST and affNIST digit datasets demonstrate that mCNN can match or outperform the original CNN with far fewer training samples. Moreover, mCNN also performs well for face recognition on the recently released large-scale MS-Celeb-1M dataset and outperforms the state of the art. In addition, compared with traditional CNNs which use data augmentation to improve their performance, the computational cost of mCNN is reduced by a factor of 25.

    Jian Zhao, Jianshu Li, Fang Zhao, Shuicheng Yan, and Jiashi Feng

    BMVC 2017 PDF, BibTeX

  • 2016
    Robust Face Recognition with Deep Multi-View Representation Learning
    This paper describes our proposed method targeting the MSR Image Recognition Challenge MS-Celeb-1M. The challenge is to recognize one million celebrities from their face images captured in the real world. The challenge provides a large-scale dataset crawled from the Web, which contains a large number of celebrities with many images for each subject. Given a new testing image, the challenge requires an identity for the image and the corresponding confidence score. To complete the challenge, we propose a two-stage approach consisting of data cleaning and multi-view deep representation learning. The data cleaning can effectively reduce the noise level of training data and thus improves the performance of deep learning based face recognition models. The multi-view representation learning enables the learned face representations to be more specific and discriminative. Thus the difficulties of recognizing faces out of a huge number of subjects are substantially relieved. Our proposed method achieves a coverage of 46.1% at 95% precision on the random set and a coverage of 33.0% at 95% precision on the hard set of this challenge.

    Jianshu Li, Jian Zhao, Fang Zhao, Hao Liu, Jing Li, Shengmei Shen, Jiashi Feng, and Terence Sim

    ACM MM 2016  PDF, BibTeX

  • 2015
    BE-SIFT: A More Brief and Efficient SIFT Image Matching Algorithm for Computer Vision

    Jian Zhao, Hengzhu Liu, Yiliu Feng, Shandong Yuan, and Wanzeng Cai

    IEEE PICOM2015  PDF, BibTeX

  • 2014
    Realization and Design of A Pilot Assist Decision-Making System Based on Speech Recognition

    Jian Zhao, Hengzhu Liu, Xucan Chen, and Zhengfa Liang

    AIAA2014  PDF, BibTeX

    A New Efficient Key Technology for Space Telemetry Wireless Data Link: The Low-Complexity SC-CPM SC-FDE Algorithm

    Jian Zhao, Hengzhu Liu, Xucan Chen, Botao Zhang, and Li Zhou

    ICT2014  Link, BibTeX

    A New Technology for MIMO Detection: The μ Quantum Genetic Sphere Decoding Algorithm

    Jian Zhao, Hengzhu Liu, Xucan Chen, and Ting Chen

    ACA2014  Link, BibTeX

    Research on A Kind of Optimization Scheme of MIMO-OFDM Sphere Equalization Technology for Unmanned Aerial Vehicle Wireless Image Transmission Data Link System

    Jian Zhao, Hengzhu Liu, Xucan Chen, and Shandong Yuan

    ACA2014  Link, BibTeX

    Design and Implementation for A New Kind of Extensible Digital Communication Simulation System Based on Matlab

    Jian Zhao, Hengzhu Liu, Xucan Chen, Botao Zhang, and Ting Chen

    Journal of Northeastern University

Selected Awards

  • 2017 No.1 on ICCV 2017 MS-Celeb-1M Large-Scale Face Recognition Hard Set / Random Set / Low-Shot Learning Challenges, 1st author. WeChat News, NUS ECE News, Award Certificate for Track-1, Award Certificate for Track-2, Award Ceremony
  • 2017 No.2 on CVPR 2017 Visual Understanding of Humans in Crowd Scene & the 1st Look into Person (L.I.P) Challenges on Human Parsing and Pose Estimation, 1st author. Link, Award Certificate for Parsing, Award Certificate for Pose, Award Ceremony
  • 2017 No.1 on National Institute of Standards and Technology (NIST) IARPA Janus Benchmark A (IJB-A) Unconstrained Face Verification challenge and Identification challenge, 1st author. Official reports: Verification, Identification. WeChat News
  • 2017 No.1 on CVPR 2017 Faces in-the-wild challenge, 4th author. Link
  • 2016 No.3 on ACM MM 2016 MS-Celeb-1M Hard set challenge, 2nd author. Link
  • 2016 "Excellent Student Award", School of Computer, National University of Defense Technology (<10%)
  • 2015 "Excellent Student Award", School of Computer, National University of Defense Technology (<10%)   
  • 2014 "Excellent Graduate", National University of Defense Technology (<2%)
  • 2014 "Excellent Student Award", National University of Defense Technology (<2%). Award Certificate
  • 2014 "Guanghua Scholarship", National University of Defense Technology (<2%). Award Certificate
  • 2013 "Contribution prize" on Engineering Implementation of Tianhe-2 supercomputer (No.1 on Top500, Jun, 2013), National University of Defense Technology. Award Certificate
  • 2013 3rd prize on 13th "Great Wall Information Cup" competition, National University of Defense Technology
  • 2013 "Excellent Student Award", School of Computer, National University of Defense Technology (<10%). Award Certificate
  • 2013 1st prize on "Big Data Processing and Information Sub-Forum of the 6th Graduate Innovation Forum", Provincial Education Department of Hunan Province. Award Certificate
  • 2012 "Excellent Graduate", Beihang University (<2%). Award Certificate
  • 2012 2nd prize on 5th "Student Research Training Program (SRTP)", Beihang University. Award Certificate
  • 2011 "National Endeavor Scholarship", Central Government & Beijing Government of China
  • 2011 3rd prize on 21st "Feng Ru Cup" Competition, School of Automation Science and Electrical Engineering, Beihang University. Award Certificate
  • 2010 "SMC Scholarship", Beihang University



Phone: (65) 9610 7176

WeChat: Name Card

Address: Vision and Machine Learning Lab, E4-#08-24, 4 Engineering Drive 3, National University of Singapore, Singapore 117583

Modified: 20 February 2018