MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
NGUYEN THUY BINH
PERSON RE-IDENTIFICATION
IN A SURVEILLANCE CAMERA NETWORK
DOCTORAL DISSERTATION OF
ELECTRONICS ENGINEERING
Hanoi, 2020
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
NGUYEN THUY BINH
PERSON RE-IDENTIFICATION
IN A SURVEILLANCE CAMERA NETWORK
Major: Electronics Engineering
Code: 9520203
DOCTORAL DISSERTATION OF
ELECTRONICS ENGINEERING
SUPERVISORS:
1. Assoc. Prof. Pham Ngoc Nam
2. Assoc. Prof. Le Thi Lan
Hanoi, 2020
DECLARATION OF AUTHORSHIP
I, Nguyen Thuy Binh, declare that the thesis titled "Person re-identification in
a surveillance camera network" has been entirely composed by myself. I assure the
following points:
This work was done wholly or mainly while in candidature for a Ph.D. research
degree at Hanoi University of Science and Technology.
The work has not been submitted for any other degree or qualification at Hanoi
University of Science and Technology or any other institution.
Appropriate acknowledgement has been given within this thesis where reference has
been made to the published work of others.
The thesis submitted is my own, except where work done in collaboration has been
included. The collaborative contributions have been clearly indicated.
Hanoi, 24/11/2020
PhD Student
SUPERVISORS
ACKNOWLEDGEMENT
This dissertation was written during my doctoral course at the School of Electronics and
Telecommunications (SET) and the International Research Institute of Multimedia, Information, Communication and Applications (MICA), Hanoi University of Science and
Technology (HUST). I am grateful to all the people who have supported and encouraged
me in completing this study.
First, I would like to express my sincere gratitude to my advisors, Assoc. Prof. Pham
Ngoc Nam and Assoc. Prof. Le Thi Lan, for their effective guidance, their patience,
their continuous support and encouragement, and their immense knowledge.
I would like to express my gratitude to Dr. Vo Le Cuong and Dr. Ha Thi Thu Lan for
their help. I would like to thank all members of the School of Electronics and Telecommunications and the International Research Institute MICA at HUST, as well as all of
my colleagues in the Faculty of Electrical-Electronic Engineering, University of Transport
and Communications (UTC). They have always helped me in the research process and
given me helpful advice to overcome my difficulties. Moreover, attending scientific
conferences has always been a great experience, from which I have received many useful
comments.
During my Ph.D. course, I have received a great deal of support from the Management
Boards of the School of Electronics and Telecommunications, the MICA Institute, and the
Faculty of Electrical-Electronic Engineering. My sincere thanks to Assoc. Prof. Nguyen Huu
Thanh, Dr. Nguyen Viet Son, and Assoc. Prof. Nguyen Thanh Hai, who gave me a lot
of support and help. Without their precious support, it would have been impossible to
conduct this research. Thanks to my employer, the University of Transport and
Communications (UTC), for all the necessary support and encouragement during my
Ph.D. journey. I am also grateful to Vietnam’s Program 911 and the HUST and UTC
projects for their generous financial support. Special thanks to my family and relatives,
particularly my beloved husband and our children, for their never-ending support and
sacrifice.
Hanoi, 2020
Ph.D. Student
CONTENTS
DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
CONTENTS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1. LITERATURE REVIEW
1.1. Person ReID classifications
1.1.1. Single-shot versus Multi-shot
1.1.2. Closed-set versus Open-set person ReID
1.1.3. Supervised and unsupervised person ReID
1.2. Datasets and evaluation metrics
1.2.1. Datasets
1.2.2. Evaluation metrics
1.3. Feature extraction
1.3.1. Hand-designed features
1.3.2. Deep-learned features
1.4. Metric learning and person matching
1.4.1. Metric learning
1.4.2. Person matching
1.5. Fusion schemes for person ReID
1.6. Representative frame selection
1.7. Fully automated person ReID systems
1.8. Research on person ReID in Vietnam
CHAPTER 2. MULTI-SHOT PERSON RE-ID THROUGH REPRESENTATIVE FRAME SELECTION AND TEMPORAL FEATURE POOLING
2.1. Introduction
2.2. Proposed method
2.2.1. Overall framework
2.2.2. Representative image selection
2.2.3. Image-level feature extraction
2.2.4. Temporal feature pooling
2.2.5. Person matching
2.3. Experimental results
2.3.1. Evaluation of representative frame extraction and temporal feature pooling schemes
2.3.2. Quantitative evaluation of the trade-off between accuracy and computational time
2.3.3. Comparison with state-of-the-art methods
2.4. Conclusions and future work
CHAPTER 3. PERSON RE-ID PERFORMANCE IMPROVEMENT BASED ON FUSION SCHEMES
3.1. Introduction
3.2. Fusion schemes for the first setting of person ReID
3.2.1. Image-to-images person ReID
3.2.2. Images-to-images person ReID
3.2.3. Obtained results on the first setting
3.3. Fusion schemes for the second setting of person ReID
3.3.1. The proposed method
3.3.2. Obtained results on the second setting
3.4. Conclusions
CHAPTER 4. QUANTITATIVE EVALUATION OF AN END-TO-END PERSON REID PIPELINE
4.1. Introduction
4.2. An end-to-end person ReID pipeline
4.2.1. Pedestrian detection
4.2.2. Pedestrian tracking
4.2.3. Person ReID
4.3. GOG descriptor re-implementation
4.3.1. Comparison of the performance of two implementations
4.3.2. Analysis of the effect of GOG parameters
4.4. Evaluation of the performance of an end-to-end person ReID pipeline
4.4.1. The effect of human detection and segmentation on person ReID in the single-shot scenario
4.4.2. The effect of human detection and segmentation on person ReID in the multi-shot scenario
4.5. Conclusions and future work
PUBLICATIONS
Bibliography
ABBREVIATIONS
No. Abbreviation Meaning
1. ACF: Aggregate Channel Features
2. AIT: Austrian Institute of Technology
3. AMOC: Accumulative Motion Context
4. BOW: Bag of Words
5. CAR: Learning Compact Appearance Representation
6. CIE: The International Commission on Illumination
7. CFFM: Comprehensive Feature Fusion Mechanism
8. CMC: Cumulative Matching Characteristic
9. CNN: Convolutional Neural Network
10. CPM: Convolutional Pose Machines
11. CVPDL: Cross-view Projective Dictionary Learning
12. CVPR: Conference on Computer Vision and Pattern Recognition
13. DDLM: Discriminative Dictionary Learning Method
14. DDN: Deep Decompositional Network
15. DeepSORT: Deep learning Simple Online and Realtime Tracking
16. DFGP: Deep Feature Guided Pooling
17. DGM: Dynamic Graph Matching
18. DPM: Deformable Part-Based Model
19. ECCV: European Conference on Computer Vision
20. FAST 3D: Fast Adaptive Spatio-Temporal 3D
21. FEP: Flow Energy Profile
22. FNN: Feature Fusion Network
23. FPNN: Filter Pairing Neural Network
24. GOG: Gaussian of Gaussian
25. GRU: Gated Recurrent Unit
26. HOG: Histogram of Oriented Gradients
27. HUST: Hanoi University of Science and Technology
28. IBP: Indian Buffet Process
29. ICCV: International Conference on Computer Vision
30. ICIP: International Conference on Image Processing
31. IDE: ID-Discriminative Embedding
32. iLIDS-VID: Imagery Library for Intelligent Detection Systems
33. ILSVRC: ImageNet Large Scale Visual Recognition Challenge
34. ISR: Iterative Sparse Ranking
35. KCF: Kernelized Correlation Filter
36. KDES: Kernel DEScriptor
37. KISSME: Keep It Simple and Straightforward MEtric
38. kNN: k-Nearest Neighbour
39. KXQDA: Kernel Cross-view Quadratic Discriminative Analysis
40. LADF: Locally-Adaptive Decision Functions
41. LBP: Local Binary Pattern
42. LDA: Linear Discriminant Analysis
43. LDFV: Local Descriptors encoded by Fisher Vector
44. LMNN: Large Margin Nearest Neighbor
45. LMNN-R: Large Margin Nearest Neighbor with Rejection
46. LOMO: LOcal Maximal Occurrence
47. LSTM: Long Short-Term Memory
48. LSTMC: Long Short-Term Memory network with a Coupled gate
49. mAP: mean Average Precision
50. MAPR: Multimedia Analysis and Pattern Recognition
51. Mask R-CNN: Mask Region-based CNN
52. MCT: Multi-Camera Tracking
53. MCCNN: Multi-Channel CNN
54. MCML: Maximally Collapsing Metric Learning
55. MGCAM: Mask-Guided Contrastive Attention Model
56. ML: Machine Learning
57. MLAPG: Metric Learning by Accelerated Proximal Gradient
58. MLR: Metric Learning to Rank
59. MOT: Multiple Object Tracking
60. MSCR: Maximal Stable Color Region
61. MSVF: Maximally Stable Video Frame
62. MTMCT: Multi-Target Multi-Camera Tracking
63. Person ReID: Person Re-Identification
64. Pedparsing: Pedestrian Parsing
65. PPN: Pose Prediction Network
66. PRW: Person Re-identification in the Wild
67. QDA: Quadratic Discriminative Analysis
68. RAiD: Re-Identification Across indoor-outdoor Dataset
69. RAP: Richly Annotated Pedestrian
70. ResNet: Residual Neural Network
71. RHSP: Recurrent High-Structured Patches
72. RKHS: Reproducing Kernel Hilbert Space
73. RNN: Recurrent Neural Network
74. ROIs: Regions of Interest
75. SDALF: Symmetry-Driven Accumulation of Local Features
76. SCNCD: Salient Color Names based Color Descriptor
77. SCNN: Siamese Convolutional Neural Network
78. SIFT: Scale-Invariant Feature Transform
79. SILTP: Scale Invariant Local Ternary Pattern
80. SPD: Symmetric Positive Definite
81. SMP: Stepwise Metric Promotion
82. SORT: Simple Online and Realtime Tracking
83. SPIC: Signal Processing: Image Communication
84. SVM: Support Vector Machine
85. TAPR: Temporally Aligned Pooling Representation
86. TAUDL: Tracklet Association Unsupervised Deep Learning
87. TCSVT: Transactions on Circuits and Systems for Video Technology
88. TII: Transactions on Industrial Informatics
89. TPAMI: Transactions on Pattern Analysis and Machine Intelligence
90. TPDL: Top-push Distance Learning
91. Two-stream MR: Two-stream Multirate Recurrent Neural Network
92. UIT: University of Information Technology
93. UTAL: Unsupervised Tracklet Association Learning
94. VIPeR: View-point Invariant Pedestrian Recognition
95. VNU-HCM: Vietnam National University - Ho Chi Minh City
96. WH: Weighted color Histogram
97. WHOS: Weighted Histograms of Overlapping Stripes
98. WSC: Weight-based Sparse Coding
99. XQDA: Cross-view Quadratic Discriminative Analysis
100. YOLO: You Only Look Once
LIST OF TABLES
1.1 Benchmark datasets used in the thesis.
2.1 The matching rates (%) when applying different pooling methods on different color spaces in case of using four key frames on the PRID 2011 dataset. The two best results for each case are in bold.
2.2 The matching rates (%) when applying different pooling methods on different color spaces in case of using frames within a walking cycle on the PRID 2011 dataset. The two best results for each case are in bold.
2.3 The matching rates (%) when applying different pooling methods on different color spaces in case of using all frames on the PRID 2011 dataset. The two best results for each case are in bold.
2.4 The matching rates (%) when applying different pooling methods on different color spaces in case of using four key frames on the iLIDS-VID dataset. The two best results for each case are in bold.
2.5 The matching rates (%) when applying different pooling methods on different color spaces in case of using frames within a walking cycle on the iLIDS-VID dataset. The two best results for each case are in bold.
2.6 The matching rates (%) when applying different pooling methods on different color spaces in case of using all frames on the iLIDS-VID dataset. The two best results for each case are in bold.
2.7 Matching rates (%) at several important ranks when using four key frames, four random frames, and one random frame on the PRID-2011 and iLIDS-VID datasets.
2.8 Comparison of the three representative frame selection schemes in terms of rank-1 accuracy, computational time, and memory requirement on the PRID 2011 dataset.
2.9 Comparison between the proposed method and existing works on the PRID 2011 and iLIDS-VID datasets. The two best results are in bold.
3.1 Matching rates (%) in case of images-to-images on CAVIAR4REID (case B).
3.2 Matching rates (%) in case of images-to-images person ReID on the RAiD dataset.
3.3 Comparison of the best matching rates at rank-1 in the image-to-images case and those in the images-to-images one.
3.4 Comparison of the images-to-images and image-to-images schemes at rank-1. (*) indicates the results obtained by applying the proposed strategies over 10 random trials in case A of CAVIAR4REID.
3.5 Comparison between the proposed method and existing works on the PRID 2011 and iLIDS-VID datasets. The two best results are in bold.
4.1 Comparison of the proposed method with state-of-the-art methods on PRID 2011 (the two best results are in bold).
LIST OF FIGURES
1 The ranked list of gallery persons corresponding to the given query, based on the similarities between the query and each of the gallery persons.
2 An example of challenges caused by variations in a) illumination and b) view-point.
3 A person has multiple images captured in different camera views.
4 A fully automatic person ReID system consisting of three main stages: human detection, tracking, and re-identification.
1.1 Some important milestones for the person ReID problem [8]. Several approaches related to this thesis are bounded by red blocks.
1.2 An example of a) single-shot (image-based) and b) multi-shot (video-based) person ReID approaches.
1.3 The differences between a) closed-set and b) open-set person ReID. In closed-set person ReID, an individual appears in at least two camera views. Conversely, in open-set person ReID, a pedestrian might appear in only one camera view.
1.4 Two popular settings for the person ReID problem: a) the testing persons have appeared in the training set (represented by the same colors); b) persons in the training and testing sets are entirely different.
1.5 Camera layout for the PRID-2011 dataset [33].
1.6 iLIDS-VID is captured by five non-overlapping cameras [36].
1.7 Some images from the five datasets used in this thesis: a) VIPeR b) CAVIAR4REID c) RAiD d) PRID-2011 e) iLIDS-VID. For the first three datasets (VIPeR, CAVIAR4REID, and RAiD), images in the same column belong to the same person, while for the last two datasets, images in the same row represent the same person.
1.8 An example of CMC curves obtained with two methods: Method #1 and Method #2.
1.9 Proposed framework using a Siamese Convolutional Neural Network (SCNN) [54]: a) overall framework b) structure of a typical SCNN.
1.10 Structure of a) an inception block and b) a typical GoogLeNet [62].
1.11 Structure of a) a residual block and b) ResNet-50 [64].
1.12 ResNet architecture [64].
1.13 Example of metric learning.
1.14 Different strategies for a) early fusion and b) late fusion.
2.1 The proposed framework consists of four main steps: representative image selection, image-level feature extraction, temporal feature pooling, and person matching.
2.2 An example of a normal walking cycle of a pedestrian.
2.3 An example of the motion of pixels in two subsequent frames.
2.4 An example of computed Vx and Vy values on every frame in a given sequence of images. The blue and red dots present minimum and maximum values of Vx and Vy, respectively.
2.5 Representative frame selection. The first row describes an image sequence of a person; the second row indicates the related original FEP (blue curve) and the regulated FEP (red curve). A walking cycle and four key frames extracted from this cycle are shown in the third and last rows, respectively.
2.6 An example of a Gaussian filter with µ = 0.
2.7 a) Random walking cycles of some persons in the PRID-2011 dataset and b) four key frames in a walking cycle.
2.8 (a) A person image is divided into patches and regions; (b) pipeline for GOG feature extraction [49].
2.9 RGB color space.
2.10 CIE L*a*b* color space [127].
2.11 HSV color space [129].
2.12 Three different feature pooling techniques for person representation.
2.13 Distribution of ΩI and ΩE in one projected dimension [45].
2.14 Evaluation of the performance of GOG features on a) the PRID 2011 dataset and b) the iLIDS-VID dataset with three different representative frame selection schemes.
2.15 Matching rates when selecting four key frames or four random frames for person representation on a) PRID-2011 and b) iLIDS-VID.
2.16 The distribution of frames for each person in the PRID 2011 dataset for a) camera view A and b) camera view B.
3.1 Image-to-images person ReID scheme.
3.2 Extracting KDES features (best viewed in color).
3.3 An example of the effectiveness of GOG and ResNet features on different query persons.
3.4 Proposed framework for images-to-images person ReID without temporal linking requirement.
3.5 Evaluation of the performance of three chosen features (GOG, KDES, CNN) over 10 trials on (a) CAVIAR4REID case A, (b) CAVIAR4REID case B, and (c) RAiD datasets in the image-to-images case.
3.6 Comparison of the performance of the three fusion schemes when using two or three features over 10 trials on (a) CAVIAR4REID case A, (b) CAVIAR4REID case B, and (c) RAiD datasets in the image-to-images case.
3.7 CMC curves in case A of images-to-images person ReID on the CAVIAR4REID dataset.
3.8 An example result of the SvsM and MvsM scenarios. Each row in the SvsM scenario shows the first five ranked persons for each query image, obtained by using the image-to-images scheme on three features. The person in the red box is the true matched person.
3.9 The proposed method for video-based person ReID, combining the fusion scheme with a metric learning technique.
3.10 Matching rates with different fusion schemes on the PRID-2011 dataset with a) four key frames, b) frames within a walking cycle, and c) all frames.
3.11 Matching rates with different fusion schemes on the iLIDS-VID dataset with a) four key frames, b) frames within a walking cycle, and c) all frames.
3.12 Average weights for GOG and ResNet features on a random split of a) PRID-2011 and b) iLIDS-VID.
4.1 A full person ReID pipeline including person detection, segmentation, tracking, and person ReID steps.
4.2 An example of automatic person detection and segmentation results on the PRID 2011 dataset.
4.3 An overview of the ACF detector [109].
4.4 Fast feature pyramid in the ACF detector [109].
4.5 a) An input image is divided into a 7 × 7 grid of cells; b) the architecture of a YOLO detector [152].
4.6 The architecture of a) Faster R-CNN [150] and b) Mask R-CNN [111].
4.7 DDN architecture for pedestrian parsing [112].
4.8 a) ReID accuracy of the source code provided in [49] and that of the re-implementation, and b) computation time (in s) for each step in extracting the GOG feature from an image in C++.
4.9 The matching rates at rank-1 with different numbers of regions (N).
4.10 CMC curves on the VIPeR dataset when extracting GOG features with the optimal parameters.
4.11 CMC curves of three evaluated scenarios on the VIPeR dataset when applying the method proposed in Chapter 2.
4.12 Examples of results of a) segmentation and b), c) person ReID in all three cases of using the original images, manually segmented images, and automatically segmented images of two different persons in the VIPeR dataset.
4.13 CMC curves of three evaluated scenarios on the PRID 2011 dataset in the single-shot approach: (a) without segmentation and (b) with segmentation.
4.14 CMC curves of three evaluated scenarios on the PRID 2011 dataset when applying the method proposed in Chapter 2.
4.15 Examples of results of a) human detection and segmentation and b), c) person ReID in all three cases of using the original images, manually segmented images, and automatically segmented images of two different persons in the PRID-2011 dataset.
INTRODUCTION
Motivation
Person ReID is the task of associating cross-view images of the same person as
he/she moves through a non-overlapping camera network [1]. In recent years, along with
the development of surveillance camera systems, person re-identification (ReID) has
increasingly attracted the attention of the computer vision and pattern recognition
communities because of its promising applications in many areas, such as public safety
and security, human-robot interaction, and person retrieval. In the early years, person
ReID was considered a sub-task of Multi-Camera Tracking (MCT) [2]. The purpose of
MCT is to generate tracklets in every single field of view (FoV) and then associate
the tracklets that belong to the same pedestrian across different FoVs. In 2006, Gheissari
et al. [3] first considered person ReID as an independent task. In a certain respect,
person ReID and Multi-Target Multi-Camera Tracking (MTMCT) are close to each
other. However, the two problems differ fundamentally in their objectives and evaluation
metrics. While the objective of MTMCT is to determine the position of each pedestrian
over time from video streams taken by different cameras, person ReID tries to answer
the question: "Which gallery images belong to a certain probe person?" and returns a
list of the gallery persons sorted in descending order of similarity to the given query
person. Where MTMCT classifies a pair of images as co-identical or not, person ReID
ranks the gallery persons with respect to the given query person. Therefore, their
performance is evaluated by different metrics: classification error rates for MTMCT
and ranking performance for ReID. It is worth noting that in the case of an overlapping
camera network, the corresponding images of the same person can be found through
data association; this can be considered a person tracking problem, which is out of the
scope of this thesis. In the last decade, thanks to unremitting efforts, person ReID has
achieved numerous important milestones and many great results [4, 5, 6, 7, 8]; however,
it is still a challenging task and confronts various difficulties. These difficulties and
challenges will be presented in a later section. First of all, the mathematical formulation
of person ReID is given as follows.
Problem formulation
In person ReID, the dataset is divided into two sets: probe and gallery. Note that
the probe and gallery sets are captured by at least two non-overlapping camera fields
of view. Given a query person $Q_i$ and $N$ persons in the gallery $G_j$, where
$j = 1, \ldots, N$, $Q_i$ and $G_j$ are represented as follows:

$$Q_i = \left\{ q_i^{(l)},\; l = 1, \ldots, n_i \right\}, \qquad G_j = \left\{ g_j^{(k)},\; k = 1, \ldots, m_j \right\}, \qquad (0.1)$$
where $n_i$ and $m_j$ are the numbers of images of persons $Q_i$ and $G_j$,
respectively, and may differ from each other.
Depending on the number of images used for person representation, person ReID
can be categorized into single-shot, where a sole image is used, or multi-shot, where
several images are available.
The identity of the given query person $Q_i$ is determined as follows [9]:

$$j^{*} = \arg\min_{j} d\left(Q_i, G_j\right), \qquad (0.2)$$
where $d\left(Q_i, G_j\right)$ is defined as the distance between the given query
person $Q_i$ and a gallery person $G_j$. This distance can be calculated directly or
learned through a metric learning method. It is worth noting that, in another
definition, the similarity between two pedestrians is used instead of the distance. In
this case, the identity of the given query person $Q_i$ is determined as follows:
formulation (0.3)
j = arg max Sim (Q
, G Problem
),
Person Re-identification
∗
i
j
j
From Equations (0.2) and (0.3), person ReID can be defined as a matching problem: the returned result is the gallery person who has the smallest distance (or largest similarity) to the given query person. However, in order to evaluate the performance of a person ReID method, a ranked list of the gallery persons is provided. This list is ranked in ascending order of distance (or descending order of similarity) to the given query person. Figure 1 shows an example of the ranked list of gallery persons corresponding to the given query, based on the similarities between the query and each of the gallery persons.
Figure 1: The ranked list of gallery person corresponding to the given query based on the similarities
between the query and each of gallery ones.
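The matching rule of Equations (0.2) and (0.3) and the ranked list used for evaluation can be sketched as follows; Euclidean distance stands in here for whatever descriptor distance or learned metric a concrete method would use, and all identities and feature values are made up for illustration:

```python
import math

def euclidean(q, g):
    """Distance d(q, g) between two feature vectors (a stand-in metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, g)))

def rank_gallery(query, gallery):
    """Return gallery IDs sorted by ascending distance to the query.
    The top entry is the re-identification result of Eq. (0.2); the full
    ranked list is what evaluation protocols score."""
    scored = [(euclidean(query, feat), pid) for pid, feat in gallery.items()]
    scored.sort()  # ascending distance = descending similarity
    return [pid for _, pid in scored]

query = [0.2, 0.9, 0.1]
gallery = {"A": [0.9, 0.1, 0.4], "B": [0.25, 0.85, 0.15], "C": [0.5, 0.5, 0.5]}
ranked = rank_gallery(query, gallery)
print(ranked)  # ['B', 'C', 'A'] -- B is the rank-1 (closest) gallery person
```

Using a similarity function instead, as in Equation (0.3), only flips the sort order; the ranked list itself is unchanged.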
Challenges
There are many challenges in person ReID, which may derive from environmental and practical conditions. This section discusses three main challenges: (1) the strong variations in illumination, viewpoint, pose, etc.; (2) the large number of images for each person in a camera view and the large number of persons; and (3) the effect of human detection and tracking results.
• Firstly, the strong variations in illumination, viewpoint, and pose are general difficulties in any image processing problem. These factors can make the appearance discrepancies of the same person even larger than those between different persons. Consequently, one of the crucial tasks in person ReID is to build a visual descriptor for person representation that is also discriminative. Such a descriptor must highlight the characteristics of each individual and help to distinguish between different persons more easily. Figure 2 illustrates the variations in illumination and viewpoint: the colors of pedestrians' clothes change significantly due to these variations. Pairs of images in the same column present the same person, captured in two different camera views.
Figure 2: An example for challenges caused by variations in a) illumination b) view-point.
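As a toy illustration of why illumination changes trouble appearance descriptors, the sketch below computes a simple per-channel color histogram (a minimal stand-in, not one of the descriptors studied in this thesis) for the same "red shirt" under bright and dark lighting; the two histograms disagree even though the person is the same:

```python
def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` bins, count occurrences per bin,
    and normalize so images of different sizes are comparable."""
    hist = [0] * (bins * 3)
    for r, g, b in pixels:
        for c, v in enumerate((r, g, b)):
            idx = min(v * bins // 256, bins - 1)  # channel value -> bin index
            hist[c * bins + idx] += 1
    total = len(pixels)
    return [h / total for h in hist]

# Two tiny "images" of the same red shirt under different illumination:
bright = [(200, 40, 40), (210, 50, 45)]
dark = [(120, 20, 20), (110, 25, 30)]
# The red-channel mass moves from bin 3 (bright) to bin 1 (dark),
# so a naive color descriptor would no longer match the same person.
print(color_histogram(bright))
print(color_histogram(dark))
```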
• The second challenge is the large number of images for each person in a camera view and the number of persons in the examined datasets. The numbers of identities and images in the evaluated datasets have grown rapidly in recent years: the early datasets have only hundreds of identities and thousands of images, whereas the latest datasets contain thousands of identities and millions of images. This places a significant burden on memory capacity, execution speed, and computational complexity when solving the person ReID problem.
Figure 3 shows some images of the same person captured in different camera
views. Besides, the number of images for each person in the existing datasets
varies greatly. For example, in the PRID-2011 dataset, some persons have only 20 images, while others may have hundreds of images. This leads to an imbalance in person representation and also causes difficulties for person ReID.
Figure 3: A person has multiple images captured in different camera-views.
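One simple way to handle multi-shot sets of unequal size, such as the tracks in Figure 3, is to define the set-to-set distance d(Q_i, G_j) of Equation (0.2) as the minimum pairwise image distance; the sketch below assumes illustrative two-dimensional features:

```python
import math

def dist(a, b):
    """Euclidean distance between two image feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def set_distance(Qi, Gj):
    """d(Q_i, G_j) as the minimum pairwise image distance: one common, simple
    way to reduce two image sets of different sizes to a single score."""
    return min(dist(q, g) for q in Qi for g in Gj)

Qi = [[0.2, 0.9], [0.3, 0.8]]    # n_i = 2 probe images
Gj = [[0.25, 0.85], [0.9, 0.1]]  # m_j = 2 gallery images
print(round(set_distance(Qi, Gj), 3))  # 0.071
```

Averaging features over a track is another common choice; the minimum-distance rule is more tolerant of outlier images within a set.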
• The third challenge is the effect of human detection and tracking results. In a fully automatic surveillance system, person ReID is the last stage, whose inputs are the outcomes of the human detection and tracking stages, as illustrated in Fig. 4. The performance of these two previous stages greatly affects the overall performance. Most existing studies deal with human regions of interest (ROIs) that are manually detected and segmented with well-aligned bounding boxes. Nevertheless, in an automatic surveillance system, many problems and errors appear in human detection and tracking, such as false detections, ID switches, fragments, etc. Consequently, these errors might cause a reduction in ReID accuracy. Though the latest methodology-driven methods surpass human-level performance on several commonly used benchmark datasets, improving accuracy for application-driven ReID is still a non-trivial task.
Figure 4: A fully-automatic person ReID system consisting of three main stages: human detection,
tracking and re-identification.
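The three-stage pipeline of Figure 4 can be sketched with stub detection and tracking stages; every function and data value below is an illustrative stand-in (a real system would run an actual detector and tracker, and errors in those stages would propagate into the matching step exactly as described above):

```python
def detect(frame):
    """Stub human detector: returns the frame's (track_id, feature) boxes.
    A real system would run an object detector here."""
    return frame["boxes"]

def track(detections_per_frame):
    """Stub tracker: groups detections into per-person tracklets by their
    given track id. A real tracker associates boxes across frames itself."""
    tracks = {}
    for boxes in detections_per_frame:
        for tid, feat in boxes:
            tracks.setdefault(tid, []).append(feat)
    return tracks

def reid(track_feats, gallery):
    """Match a tracklet against the gallery by minimum feature distance."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(gallery,
               key=lambda pid: min(d(q, g) for q in track_feats
                                   for g in gallery[pid]))

frames = [{"boxes": [(0, [0.2, 0.9])]}, {"boxes": [(0, [0.3, 0.8])]}]
gallery = {"person_A": [[0.25, 0.85]], "person_B": [[0.9, 0.1]]}
tracks = track(detect(f) for f in frames)
print({tid: reid(feats, gallery) for tid, feats in tracks.items()})
# {0: 'person_A'}
```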
Based on the above analysis, person ReID is undoubtedly an interesting issue but chal-