  • Ph.D. 2015 - 2019

    Department of Electrical and Computer Engineering, Faculty of Engineering

    National University of Singapore

  • Ph.D. 2014 - 2015

    School of Computer

    National University of Defense Technology, China

  • M.Eng. 2012 - 2014

    School of Computer

    National University of Defense Technology, China

  • B.Sc. 2008 - 2012

    School of Automation Science and Electrical Engineering

    Beihang University, China

Work Experience

  • 2015 - 2017

    Graduate Assistant


  • 2016 - 2017

    Research Intern

    Core Technology Group, Learning & Vision, Panasonic R&D Center Singapore

  • 2011 - 2012

    Software Engineer

    China Aerospace Science and Industry Corporation (CASIC)


  • 2017
    High Performance Large Scale Face Recognition with Multi-Cognition Softmax and Feature Retrieval
    To solve the large-scale face recognition problem, this paper proposes a Multi-Cognition Softmax Model (MCSM) that distributes training data to several cognition units by a data shuffling strategy. Here we introduce a cognition unit as a group of independent softmax models, designed to increase diversity over a single softmax model and thereby boost the performance of the model ensemble. Meanwhile, a template-based Feature Retrieval (FR) module is adopted to improve the performance of MCSM through a specific voting scheme. Moreover, a one-shot learning method is applied to an extra 600K collected identities, because each of these identities has only one image. Finally, testing images with lower scores from MCSM and FR are assigned new, higher-scoring labels by merging the one-shot learning results. Our solution ranks first in both settings of the final evaluation and outperforms other teams by a large margin.

    Yan Xu, Yu Cheng, Jian Zhao, Zhecan Wang, Lin Xiong, Karlekar Jayashree, Sugiri Pranata, Shengmei Shen, and Jiashi Feng

    ICCV 2017 MS-Celeb-1M Workshop 

  • 2017
    Know You at One Glance: A Compact Vector Representation for Low-Shot Learning
    In this paper, we propose an enforced Softmax optimization approach which is able to improve the model's representational capacity by producing a “compact vector representation” for effectively solving the challenging low-shot learning face recognition problem. Compact vector representations are significantly helpful to overcome the underlying multi-modality variations and remain the primary key features as close to the mean face of the identity as possible in the high-dimensional feature space. Therefore, the gallery facial representations become more robust under various situations, leading to the overall performance improvement for low-shot learning. Comprehensive evaluations on the MNIST, LFW, and the challenging MS-Celeb-1M Low-Shot Learning Face Recognition benchmark datasets clearly demonstrate the superiority of our proposed method over state-of-the-arts.

    Yu Cheng, Jian Zhao, Zhecan Wang, Yan Xu, Karlekar Jayashree, Shengmei Shen, and Jiashi Feng

    (The first two authors are with equal contributions.)

    ICCV 2017 MS-Celeb-1M Workshop PDF

  • 2017
    Integrated Face Analytics Networks through Cross-Dataset Hybrid Training
    Face analytics benefits many multimedia applications. It consists of several tasks and most existing approaches generally treat these tasks independently, which limits their deployment in real scenarios. In this paper we propose an integrated Face Analytics Network (iFAN), which is able to perform multiple tasks jointly for face analytics with a novel carefully designed network architecture to fully facilitate the informative interaction among different tasks. The proposed integrated network explicitly models the interactions between tasks so that the correlations between tasks can be fully exploited for performance boost. In addition, to solve the bottleneck of the absence of datasets with comprehensive training data for various tasks, we propose a novel cross-dataset hybrid training strategy. It allows "plug-in and play" of multiple datasets annotated for different tasks without the requirement of a fully labeled common dataset for all the tasks. We experimentally show that the proposed iFAN achieves state-of-the-art performance on multiple face analytics tasks using a single integrated model. Specifically, iFAN achieves an overall F-score of 91.15% on the Helen dataset for face parsing, a normalized mean error of 5.81% on the MTFL dataset for facial landmark localization and an accuracy of 45.73% on the BNU dataset for emotion recognition with a single model.

    Jianshu Li, Shengtao Xiao, Fang Zhao, Jian Zhao, Jianan Li, Jiashi Feng, Shuicheng Yan, and Terence Sim

    ACM MM17 

  • 2017
    Towards Real World Human Parsing: Multiple-Human Parsing in the Wild
    The recent progress of human parsing techniques has been largely driven by the availability of rich data resources. In this work, we demonstrate some critical discrepancies between the current benchmark datasets and the real world human parsing scenarios. For instance, all the human parsing datasets only contain one person per image, while usually multiple persons appear simultaneously in a realistic scene. It is more practically demanded to simultaneously parse multiple persons, which presents a greater challenge to modern human parsing methods. Unfortunately, absence of relevant data resources severely impedes the development of multiple-human parsing methods. To facilitate future human parsing research, we introduce the Multiple-Human Parsing (MHP) dataset, which contains multiple persons in a real world scene per single image. The MHP dataset contains various numbers of persons (from 2 to 16) per image with 18 semantic classes for each parsing annotation. Persons appearing in the MHP images present sufficient variations in pose, occlusion and interaction. To tackle the multiple-human parsing problem, we also propose a novel Multiple-Human Parser (MH-Parser), which considers both the global context and local cues for each person in the parsing process. The model is demonstrated to outperform the naive "detect-and-parse" approach by a large margin, which will serve as a solid baseline and help drive the future research in real world human parsing.

    Jianshu Li, Jian Zhao, Yunchao Wei, Congyan Lang, Yidong Li, and Jiashi Feng

    (The first two authors are with equal contributions.)

    NIPS 2017 (under review) PDF

  • 2017
    Self-Supervised Neural Aggregation Networks for Human Parsing
    In this paper, we present a Self-Supervised Neural Aggregation Network (SS-NAN) for human parsing. SS-NAN adaptively learns to aggregate the multi-scale features at each pixel "address". In order to further improve the feature discriminative capacity, a self-supervised joint loss is adopted as an auxiliary learning strategy, which imposes human joint structures into parsing results without resorting to extra supervision. The proposed SS-NAN is end-to-end trainable. SS-NAN can be integrated into any advanced neural networks to help aggregate features regarding the importance at different positions and scales and incorporate rich high-level knowledge regarding human joint structures from a global perspective, which in turn improve the parsing results. Comprehensive evaluations on the recent Look into Person (LIP) and the PASCAL-Person-Part benchmark datasets demonstrate the significant superiority of our method over other state-of-the-arts.

    Jian Zhao, Jianshu Li, Xuecheng Nie, Yunpeng Chen, Zhecan Wang, Shuicheng Yan, and Jiashi Feng

    CVPR 2017 Workshop on Visual Understanding of Human in Crowd Scene (Oral) PDF

  • 2017
    Estimation of Affective Level in the Wild with Multiple Memory Networks
    This paper presents the proposed solution to the "affect in the wild" challenge, which aims to estimate the affective level, i.e. the valence and arousal values, of every frame in a video. A carefully designed deep convolutional neural network (a variation of residual network) for affective level estimation of facial expressions is first implemented as a baseline. Next we use multiple memory networks to model the temporal relations between the frames. Finally ensemble models are used to combine the predictions from multiple memory networks. Our proposed solution outperforms the baseline model by a factor of 10.62% in terms of mean square error (MSE).

    Jianshu Li, Yunpeng Chen, Shengtao Xiao, Jian Zhao, Sujoy Roy, Jiashi Feng, Shuicheng Yan, and Terence Sim

    CVPR Faces in-the-wild 2017 Workshop (Oral) PDF

  • 2017
    A Good Practice Towards Top Performance of Face Recognition: Transferred Deep Feature Fusion
    Unconstrained face recognition performance evaluations have traditionally focused on the Labeled Faces in the Wild (LFW) dataset for imagery and the YouTubeFaces (YTF) dataset for videos in the last couple of years. Spectacular progress in this field has resulted in a saturation of verification and identification accuracies on those benchmark datasets. In this paper, we propose a unified learning framework named transferred deep feature fusion, targeting the new IARPA Janus Benchmark A (IJB-A) face recognition dataset released by the NIST face challenge. The IJB-A dataset includes real-world unconstrained faces from 500 subjects with full pose and illumination variations, which are much harder than the LFW and YTF datasets. Inspired by transfer learning, we train two advanced deep convolutional neural networks (DCNN) with two different large datasets in the source domain, respectively. By exploring the complementarity of the two distinct DCNNs, deep feature fusion is utilized after feature extraction in the target domain. Then, template-specific linear SVMs are adopted to enhance the discrimination of the framework. Finally, multiple matching scores corresponding to different templates are merged as the final results. This simple unified framework outperforms the state-of-the-art by a wide margin on the IJB-A dataset. Based on the proposed approach, we have submitted our IJB-A results to the National Institute of Standards and Technology (NIST) for official evaluation.

    Lin Xiong (Panasonic), Jayashree Karlekar (Panasonic), Jian Zhao (NUS), Jiashi Feng (NUS), and Shengmei Shen (Panasonic)

    (The first three authors are with equal contributions.)

    arXiv PDF

  • 2017
    Marginalized CNN: Learning Deep Invariant Representations
    Training a deep neural network usually requires sufficient annotated samples. The scarcity of supervision samples in practice thus becomes the major bottleneck on performance of the network. In this work, we propose a principled method to circumvent this difficulty through marginalizing all the possible transformations over samples, termed as Marginalized Convolutional Neural Network (mCNN). mCNN implicitly considers infinitely many transformed copies of the training data in every training epoch and therefore is able to learn representations invariant for transformation in an end-to-end way. We prove that such marginalization can be understood as a classic CNN with a special form of regularization and thus is efficient for implementation. Experimental results on the MNIST and affNIST digit number datasets demonstrate that mCNN can match or outperform the original CNN with much fewer training samples. Moreover, mCNN also performs well for face recognition on the recently released large-scale MS-Celeb-1M dataset and outperforms state-of-the-arts. In addition, compared with the traditional CNNs which use data augmentation to improve their performance, the computational cost of mCNN is reduced by a factor of 25.

    ZHAO Jian (NUS), LI Jianshu (NUS), ZHAO Fang (NUS), YAN Shuicheng (Qihoo/360 AI Institute & NUS), and FENG Jiashi (NUS)

    BMVC2017 PDF

  • 2017
    Multi-Prototype Networks for Unconstrained Set-based Face Recognition
    In this paper, we consider the challenging unconstrained set-based face recognition problem where each subject face is instantiated by a set of media (images and videos) instead of a single image. Traditional face recognition approaches based on single image may not perform well in such scenarios since they do not exploit media-set information effectively. But naively aggregating information from all the media within a set would suffer from the large intra-set variance caused by heterogeneous factors (e.g., varying media modalities, poses and illuminations) and fail to learn discriminative face representations. To address this challenging problem, we propose a novel Multi-Prototype Network (MPNet) model that adaptively learns multiple prototype face representations from sets of media. Each learned prototype is representative for the subject face under certain conditions in terms of pose, illumination and media modality. Instead of handcrafting the set partition for prototype learning, MPNet introduces a new Dense SubGraph (DSG) learning sub-net that implicitly untangles inconsistent media and learns a number of prototypes for unconstrained set-based face recognition. The proposed MPNet with the DSG sub-net is end-to-end trainable. Comprehensive evaluations on the challenging IJB-A and large-scale MS-Celeb-1M benchmark datasets clearly demonstrate the superiority of our proposed MPNet over state-of-the-arts.

    ZHAO Jian (NUS), ZHAO Jiaojiao (Northumbria University, UK), LI Jianshu (NUS), ZHAO Fang (NUS), Jayashree Karlekar (Panasonic), FENG Jiashi (NUS), and YAN Shuicheng (Qihoo/360 AI Institute & NUS)

    ICCV2017 (under review) 

  • 2017
    Weakly Supervised Phrase Localization with Multi-Scale Anchored Transformer Network
    Free-form textual phrase localization in images is extremely challenging when only image-level supervision is available. In this paper, we propose a novel weakly supervised localization model, namely Multi-scale Anchored Transformer Network (MATN) that can accurately localize textual phrases. Through taking region proposals of an image as localization anchors and computing multi-scale correspondence maps between a given phrase and the image spatial feature map, MATN learns to predict phrase location referring to the anchors. These anchors provide useful cues for MATN to reliably reason about the regions where objects most likely appear given only image-level supervision. MATN is trained by a novel strategy that simultaneously minimizes a contrastive reconstruction loss between different phrases from a single image and a set of triplet losses among multiple images with the similar phrases. Compared with other region proposal based methods, MATN searches for the optimal bounding box over the entire feature map instead of selecting a sub-optimal one from discrete region proposals and thus is more resistant to errors in the proposals. Besides, MATN explicitly leverages the shared knowledge across multiple images containing the similar objects and the discriminative information across different phrases from a single image in the learning process, which are absent in previous methods. We evaluate MATN on the Flickr30K Entities and ReferItGame datasets and the experimental results show that MATN significantly outperforms the state-of-the-art methods.

    ZHAO Fang (NUS), LI Jianshu (NUS), ZHAO Jian (NUS), FENG Jiashi (NUS), and YAN Shuicheng (Qihoo/360 AI Institute & NUS)

    ICCV2017 (under review) 

  • 2017
    Landmark Free Face Attribute Prediction
    Face attribute prediction in the wild is important for many facial analysis applications yet it is very challenging due to ubiquitous face variations. In this paper, we address face attribute prediction in the wild by proposing a novel method, lAndmark Free Face AttrIbute pRediction (AFFAIR). Unlike traditional face attribute prediction methods that require facial landmark detection and face alignment, AFFAIR uses an end-to-end learning pipeline to jointly learn spatial transformations and attribute localizations that optimize facial attribute prediction with no reliance on landmark annotations or pre-trained landmark detectors. AFFAIR achieves this through simultaneously 1) learning a global transformation which effectively alleviates negative effect of global face variation for the following attribute prediction tailored for each face, 2) locating the most relevant facial part for attribute prediction and 3) aggregating the global and local features for robust attribute prediction. Within AFFAIR, a new competitive learning strategy is developed that effectively enhances global transformation learning for better attribute prediction. We show that with zero information about landmarks, AFFAIR achieves state-of-the-art performance on three face attribute prediction benchmarks, which also simultaneously learns the face-level transformation and attribute-level localization within a unified framework.

    LI Jianshu (NUS), ZHAO Fang (NUS), ZHAO Jian (NUS), FENG Jiashi (NUS), YAN Shuicheng (Qihoo/360 AI Institute & NUS), and Terence Sim (NUS)

    ICCV2017 (under review) 

  • 2016
    Robust face recognition with deep multi-view representation learning
    This paper describes our proposed method targeting the MSR Image Recognition Challenge MS-Celeb-1M. The challenge is to recognize one million celebrities from their face images captured in the real world. The challenge provides a large scale dataset crawled from the Web, which contains a large number of celebrities with many images for each subject. Given a new testing image, the challenge requires an identity for the image and the corresponding confidence score. To complete the challenge, we propose a two-stage approach consisting of data cleaning and multi-view deep representation learning. The data cleaning can effectively reduce the noise level of training data and thus improves the performance of deep learning based face recognition models. The multi-view representation learning enables the learned face representations to be more specific and discriminative. Thus the difficulties of recognizing faces out of a huge number of subjects are substantially relieved. Our proposed method achieves a coverage of 46.1% at 95% precision on the random set and a coverage of 33.0% at 95% precision on the hard set of this challenge.

    LI Jianshu (NUS), ZHAO Jian (NUS), ZHAO Fang (NUS), LIU Hao (HeFei University of Technology), LI Jing (NUS), SHEN Shengmei (Panasonic), FENG Jiashi (NUS), and Terence Sim (NUS)

    ACM MM16 PDF BibTeX

  • 2016
    Robust LSTM-Autoencoders for Face De-Occlusion in the Wild
    Face recognition techniques have been developed significantly in recent years. However, recognizing faces with partial occlusion, which is highly desired in real-world applications concerning surveillance and security, is still challenging for existing face recognizers. Although much research effort has been devoted to developing face de-occlusion methods, most of them can only work well under constrained conditions, such as when all the faces are from a pre-defined closed set. In this paper, we propose a robust LSTM-Autoencoders (RLA) model to effectively restore partially occluded faces even in the wild. The RLA model consists of two LSTM components, which aim at occlusion-robust face encoding and recurrent occlusion removal respectively. The first one, named multi-scale spatial LSTM encoder, reads facial patches of various scales sequentially to output a latent representation, and occlusion-robustness is achieved owing to the fact that the influence of occlusion is only upon some of the patches. Receiving the representation learned by the encoder, the LSTM decoder with a dual channel architecture reconstructs the overall face and detects occlusion simultaneously, and by virtue of LSTM, the decoder breaks down the task of face de-occlusion into restoring the occluded part step by step. Moreover, to minimize identity information loss and guarantee face recognition accuracy over recovered faces, we introduce an identity-preserving adversarial training scheme to further improve RLA. Extensive experiments on both synthetic and real datasets of faces with occlusion clearly demonstrate the effectiveness of our proposed RLA in removing different types of facial occlusion at various locations. The proposed method also provides significantly larger performance gain than other de-occlusion methods in promoting recognition performance over partially-occluded faces.

    ZHAO Fang (NUS), FENG Jiashi (NUS), ZHAO Jian (NUS), YANG Wenhan (Peking University), and YAN Shuicheng (Qihoo/360 AI Institute & NUS)

    TIP (under review) PDF BibTeX

  • 2015
    BE-SIFT: A More Brief and Efficient SIFT Image Matching Algorithm for Computer Vision

    ZHAO Jian (NUDT), LIU Hengzhu (NUDT), FENG Yiliu (NUDT), YUAN Shandong (NUDT), and CAI Wanzeng (NUDT)


  • 2014

    Jian ZHAO (NUDT), Hengzhu LIU (NUDT), Xucan CHEN (NUDT), and Zhengfa LIANGANG (NUDT)



    Jian ZHAO (NUDT), Hengzhu LIU (NUDT), Xucan CHEN (NUDT), Botao ZHANG (NUDT), and Li ZHOU (NUDT)


    A New Technology for MIMO Detection: The μ Quantum Genetic Sphere Decoding Algorithm

    Jian ZHAO (NUDT), Hengzhu LIU (NUDT), Xucan CHEN (NUDT), and Ting CHEN (NUDT)


    Research on a kind of optimization scheme of MIMO-OFDM sphere equalization technology for unmanned aerial vehicle wireless image transmission data link system

    Jian ZHAO (NUDT), Hengzhu LIU (NUDT), Xucan CHEN (NUDT), and Shandong YUAN (NUDT)


    Design and Implementation for a New Kind of Extensible Digital Communication Simulation System Based on Matlab

    Jian ZHAO (NUDT), Hengzhu LIU (NUDT), Xucan CHEN (NUDT), Botao ZHANG (NUDT), and Ting CHEN (NUDT)


Selected Awards

  • 2016 "Excellent Student Award", School of Computer, National University of Defense Technology (<10%)
  • 2015 "Excellent Student Award", School of Computer, National University of Defense Technology (<10%)   
  • 2014 "Excellent Graduate", National University of Defense Technology (<2%)
  • 2014 "Excellent Student Award", National University of Defense Technology (<2%)
  • 2014 "Guanghua Scholarship", National University of Defense Technology (<2%)
  • 3rd prize in the 13th "Great Wall Information Cup" competition, National University of Defense Technology
  • 2013 "Excellent Student Award", School of Computer, National University of Defense Technology (<10%)
  • 1st prize in the "Big Data Processing and Information Sub-Forum of the 6th Graduate Innovation Forum", The Education Ministry of Hunan Province
  • 2012 "Excellent Graduate", Beihang University (<2%)
  • 2nd prize in the 5th "Student Research Training Program (SRTP)", Education Ministry of China
  • 2011 "National Endeavor Scholarship", Central Government & Beijing Government of China
  • 3rd prize in the 21st "FENG RU Cup" Competition, Beihang University
  • 2010 "SMC Scholarship", Beihang University



Phone: (65) 9610 7176

Address: Vision and Machine Learning Lab, E4-#08-24, 4 Engineering Drive 3, National University of Singapore, Singapore 117583

Modified: 19 July 2017