MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
NGUYEN THUY BINH
PERSON RE-IDENTIFICATION
IN A SURVEILLANCE CAMERA NETWORK
DOCTORAL DISSERTATION OF
ELECTRONICS ENGINEERING
Hanoi, 2020
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
NGUYEN THUY BINH
PERSON RE-IDENTIFICATION
IN A SURVEILLANCE CAMERA NETWORK
Major: Electronics Engineering
Code: 9520203
DOCTORAL DISSERTATION OF
ELECTRONICS ENGINEERING
SUPERVISORS:
1. Assoc. Prof. Pham Ngoc Nam
2. Assoc. Prof. Le Thi Lan
Hanoi, 2020
DECLARATION OF AUTHORSHIP
I, Nguyen Thuy Binh, declare that the thesis titled "Person re-identification in
a surveillance camera network" has been entirely composed by myself. I assure the
following points:
This work was done wholly or mainly while in candidature for a Ph.D. research
degree at Hanoi University of Science and Technology.
The work has not been submitted for any other degree or qualification at Hanoi
University of Science and Technology or any other institution.
Appropriate acknowledgement has been given within this thesis where reference has
been made to the published work of others.
The thesis submitted is my own, except where work done in collaboration has been
included. The collaborative contributions have been clearly indicated.
Hanoi, 24/11/2020
PhD Student
SUPERVISORS
ACKNOWLEDGEMENT
This dissertation was written during my doctoral course at the School of Electronics and
Telecommunications (SET) and the International Research Institute of Multimedia, Information, Communication and Applications (MICA), Hanoi University of Science and
Technology (HUST). I am grateful to all the people who have supported and encouraged
me in completing this study.
First, I would like to express my sincere gratitude to my advisors, Assoc. Prof. Pham
Ngoc Nam and Assoc. Prof. Le Thi Lan, for their effective guidance, their patience,
their continuous support and encouragement, and their immense knowledge.
I would like to express my gratitude to Dr. Vo Le Cuong and Dr. Ha Thi Thu Lan for
their help. I would like to thank all members of the School of Electronics and Telecommunications and the International Research Institute MICA at HUST, as well as all of
my colleagues in the Faculty of Electrical-Electronic Engineering, University of Transport
and Communications (UTC). They have always helped me in the research process and
given me helpful advice to overcome my difficulties. Moreover, attending scientific
conferences has always been a great experience, from which I have received many useful
comments.
During my Ph.D. course, I have received a great deal of support from the Management
Boards of the School of Electronics and Telecommunications, the MICA Institute, and the
Faculty of Electrical-Electronic Engineering. My sincere thanks to Assoc. Prof. Nguyen Huu
Thanh, Dr. Nguyen Viet Son, and Assoc. Prof. Nguyen Thanh Hai, who gave me a lot
of support and help. Without their precious support, it would have been impossible to
conduct this research. Thanks to my employer, the University of Transport and
Communications (UTC), for all the necessary support and encouragement during my
Ph.D. journey. I am also grateful to Vietnam’s Program 911 and the HUST and UTC
projects for their generous financial support. Special thanks to my family and relatives,
particularly my beloved husband and our children, for their never-ending support and
sacrifice.
Hanoi, 2020
Ph.D. Student
CONTENTS
DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
CONTENTS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1. LITERATURE REVIEW
1.1. Person ReID classifications
1.1.1. Single-shot versus Multi-shot
1.1.2. Closed-set versus Open-set person ReID
1.1.3. Supervised and unsupervised person ReID
1.2. Datasets and evaluation metrics
1.2.1. Datasets
1.2.2. Evaluation metrics
1.3. Feature extraction
1.3.1. Hand-designed features
1.3.2. Deep-learned features
1.4. Metric learning and person matching
1.4.1. Metric learning
1.4.2. Person matching
1.5. Fusion schemes for person ReID
1.6. Representative frame selection
1.7. Fully automated person ReID systems
1.8. Research on person ReID in Vietnam
CHAPTER 2. MULTI-SHOT PERSON RE-ID THROUGH REPRESENTATIVE FRAME SELECTION AND TEMPORAL FEATURE POOLING
2.1. Introduction
2.2. Proposed method
2.2.1. Overall framework
2.2.2. Representative image selection
2.2.3. Image-level feature extraction
2.2.4. Temporal feature pooling
2.2.5. Person matching
2.3. Experimental results
2.3.1. Evaluation of representative frame extraction and temporal feature pooling schemes
2.3.2. Quantitative evaluation of the trade-off between accuracy and computational time
2.3.3. Comparison with state-of-the-art methods
2.4. Conclusions and future work
CHAPTER 3. PERSON RE-ID PERFORMANCE IMPROVEMENT BASED ON FUSION SCHEMES
3.1. Introduction
3.2. Fusion schemes for the first setting of person ReID
3.2.1. Image-to-images person ReID
3.2.2. Images-to-images person ReID
3.2.3. Obtained results on the first setting
3.3. Fusion schemes for the second setting of person ReID
3.3.1. The proposed method
3.3.2. Obtained results on the second setting
3.4. Conclusions
CHAPTER 4. QUANTITATIVE EVALUATION OF AN END-TO-END PERSON REID PIPELINE
4.1. Introduction
4.2. An end-to-end person ReID pipeline
4.2.1. Pedestrian detection
4.2.2. Pedestrian tracking
4.2.3. Person ReID
4.3. GOG descriptor re-implementation
4.3.1. Comparison of the performance of two implementations
4.3.2. Analysis of the effect of GOG parameters
4.4. Evaluation of the performance of an end-to-end person ReID pipeline
4.4.1. The effect of human detection and segmentation on person ReID in the single-shot scenario
4.4.2. The effect of human detection and segmentation on person ReID in the multi-shot scenario
4.5. Conclusions and future work
PUBLICATIONS
Bibliography
ABBREVIATIONS
No. Abbreviation Meaning
1. ACF: Aggregate Channel Features
2. AIT: Austrian Institute of Technology
3. AMOC: Accumulative Motion Context
4. BOW: Bag of Words
5. CAR: Learning Compact Appearance Representation
6. CIE: The International Commission on Illumination
7. CFFM: Comprehensive Feature Fusion Mechanism
8. CMC: Cumulative Matching Characteristic
9. CNN: Convolutional Neural Network
10. CPM: Convolutional Pose Machines
11. CVPDL: Cross-view Projective Dictionary Learning
12. CVPR: Conference on Computer Vision and Pattern Recognition
13. DDLM: Discriminative Dictionary Learning Method
14. DDN: Deep Decompositional Network
15. DeepSORT: Deep learning Simple Online and Realtime Tracking
16. DFGP: Deep Feature Guided Pooling
17. DGM: Dynamic Graph Matching
18. DPM: Deformable Part-Based Model
19. ECCV: European Conference on Computer Vision
20. FAST 3D: Fast Adaptive Spatio-Temporal 3D
21. FEP: Flow Energy Profile
22. FNN: Feature Fusion Network
23. FPNN: Filter Pairing Neural Network
24. GOG: Gaussian of Gaussian
25. GRU: Gated Recurrent Unit
26. HOG: Histogram of Oriented Gradients
27. HUST: Hanoi University of Science and Technology
28. IBP: Indian Buffet Process
29. ICCV: International Conference on Computer Vision
30. ICIP: International Conference on Image Processing
31. IDE: ID-Discriminative Embedding
32. iLIDS-VID: Imagery Library for Intelligent Detection Systems
33. ILSVRC: ImageNet Large Scale Visual Recognition Challenge
34. ISR: Iterative Sparse Ranking
35. KCF: Kernelized Correlation Filter
36. KDES: Kernel DEScriptor
37. KISSME: Keep It Simple and Straightforward MEtric
38. kNN: k-Nearest Neighbour
39. KXQDA: Kernel Cross-view Quadratic Discriminative Analysis
40. LADF: Locally-Adaptive Decision Functions
41. LBP: Local Binary Pattern
42. LDA: Linear Discriminant Analysis
43. LDFV: Local Descriptors encoded by Fisher Vector
44. LMNN: Large Margin Nearest Neighbor
45. LMNN-R: Large Margin Nearest Neighbor with Rejection
46. LOMO: LOcal Maximal Occurrence
47. LSTM: Long Short-Term Memory
48. LSTMC: Long Short-Term Memory network with a Coupled gate
49. mAP: mean Average Precision
50. MAPR: Multimedia Analysis and Pattern Recognition
51. Mask R-CNN: Mask Region-based CNN
52. MCT: Multi-Camera Tracking
53. MCCNN: Multi-Channel CNN
54. MCML: Maximally Collapsing Metric Learning
55. MGCAM: Mask-Guided Contrastive Attention Model
56. ML: Machine Learning
57. MLAPG: Metric Learning by Accelerated Proximal Gradient
58. MLR: Metric Learning to Rank
59. MOT: Multiple Object Tracking
60. MSCR: Maximal Stable Color Region
61. MSVF: Maximally Stable Video Frame
62. MTMCT: Multi-Target Multi-Camera Tracking
63. Person ReID: Person Re-Identification
64. Pedparsing: Pedestrian Parsing
65. PPN: Pose Prediction Network
66. PRW: Person Re-identification in the Wild
67. QDA: Quadratic Discriminative Analysis
68. RAiD: Re-Identification Across indoor-outdoor Dataset
69. RAP: Richly Annotated Pedestrian
70. ResNet: Residual Neural Network
71. RHSP: Recurrent High-Structured Patches
72. RKHS: Reproducing Kernel Hilbert Space
73. RNN: Recurrent Neural Network
74. ROIs: Regions of Interest
75. SDALF: Symmetry-Driven Accumulation of Local Features
76. SCNCD: Salient Color Names based Color Descriptor
77. SCNN: Siamese Convolutional Neural Network
78. SIFT: Scale-Invariant Feature Transform
79. SILTP: Scale Invariant Local Ternary Pattern
80. SPD: Symmetric Positive Definite
81. SMP: Stepwise Metric Promotion
82. SORT: Simple Online and Realtime Tracking
83. SPIC: Signal Processing: Image Communication
84. SVM: Support Vector Machine
85. TAPR: Temporally Aligned Pooling Representation
86. TAUDL: Tracklet Association Unsupervised Deep Learning
87. TCSVT: Transactions on Circuits and Systems for Video Technology
88. TII: Transactions on Industrial Informatics
89. TPAMI: Transactions on Pattern Analysis and Machine Intelligence
90. TPDL: Top-push Distance Learning
91. Two-stream MR: Two-stream Multirate Recurrent Neural Network
92. UIT: University of Information Technology
93. UTAL: Unsupervised Tracklet Association Learning
94. VIPeR: View-point Invariant Pedestrian Recognition
95. VNU-HCM: Vietnam National University - Ho Chi Minh City
96. WH: Weighted color Histogram
97. WHOS: Weighted Histograms of Overlapping Stripes
98. WSC: Weight-based Sparse Coding
99. XQDA: Cross-view Quadratic Discriminative Analysis
100. YOLO: You Only Look Once
LIST OF TABLES
1.1 Benchmark datasets used in the thesis.
2.1 The matching rates (%) when applying different pooling methods on different color spaces in case of using four key frames on the PRID 2011 dataset. The two best results for each case are in bold.
2.2 The matching rates (%) when applying different pooling methods on different color spaces in case of using frames within a walking cycle on the PRID 2011 dataset. The two best results for each case are in bold.
2.3 The matching rates (%) when applying different pooling methods on different color spaces in case of using all frames on the PRID 2011 dataset. The two best results for each case are in bold.
2.4 The matching rates (%) when applying different pooling methods on different color spaces in case of using four key frames on the iLIDS-VID dataset. The two best results for each case are in bold.
2.5 The matching rates (%) when applying different pooling methods on different color spaces in case of using frames within a walking cycle on the iLIDS-VID dataset. The two best results for each case are in bold.
2.6 The matching rates (%) when applying different pooling methods on different color spaces in case of using all frames on the iLIDS-VID dataset. The two best results for each case are in bold.
2.7 Matching rates (%) at several important ranks when using four key frames, four random frames, and one random frame on the PRID-2011 and iLIDS-VID datasets.
2.8 Comparison of the three representative frame selection schemes in terms of rank-1 accuracy, computational time, and memory requirement on the PRID 2011 dataset.
2.9 Comparison between the proposed method and existing works on the PRID 2011 and iLIDS-VID datasets. The two best results are in bold.
3.1 Matching rates (%) in case of images-to-images on CAVIAR4REID (case B).
3.2 Matching rates (%) in case of images-to-images person ReID on the RAiD dataset.
3.3 Comparison of the best matching rates at rank-1 in the image-to-images case and those in the images-to-images one.
3.4 Comparison of the images-to-images and image-to-images schemes at rank-1. (*) indicates the results obtained by applying the proposed strategies over 10 random trials in case A of CAVIAR4REID.
3.5 Comparison between the proposed method and existing works on the PRID 2011 and iLIDS-VID datasets. The two best results are in bold.
4.1 Comparison of the proposed method with state-of-the-art methods on PRID 2011 (the two best results are in bold).
LIST OF FIGURES
1 The ranked list of gallery persons corresponding to the given query, based on the similarities between the query and each of the gallery persons.
2 An example of challenges caused by variations in a) illumination and b) view-point.
3 A person has multiple images captured in different camera views.
4 A fully automatic person ReID system consisting of three main stages: human detection, tracking, and re-identification.
1.1 Some important milestones for the person ReID problem [8]. Several approaches related to this thesis are bounded by red blocks.
1.2 An example of a) single-shot (image-based) and b) multi-shot (video-based) person ReID approaches.
1.3 The differences between a) closed-set and b) open-set person ReID. In closed-set person ReID, an individual appears in at least two camera views. Conversely, in open-set person ReID, a pedestrian might appear in only one camera view.
1.4 Two popular settings for the person ReID problem: a) the testing persons have appeared in the training set (represented by the same colors); b) persons in the training and testing sets are entirely different.
1.5 Camera layout for the PRID-2011 dataset [33].
1.6 iLIDS-VID is captured by five non-overlapping cameras [36].
1.7 Some images from the five datasets used in this thesis: a) VIPeR b) CAVIAR4REID c) RAiD d) PRID-2011 e) iLIDS-VID. For the first three datasets (VIPeR, CAVIAR4REID, and RAiD), images in the same column belong to the same person, while for the last two datasets, images in the same row represent the same person.
1.8 An example of CMC curves obtained with two methods: Method #1 and Method #2.
1.9 Proposed framework using a Siamese Convolutional Neural Network (SCNN) [54]: a) overall framework b) structure of a typical SCNN.
1.10 Structure of a) an inception block and b) a typical GoogLeNet [62].
1.11 Structure of a) a residual block and b) ResNet-50 [64].
1.12 ResNet architecture [64].
1.13 Example of metric learning.
1.14 Different strategies for a) early fusion and b) late fusion.
2.1 The proposed framework consists of four main steps: representative image selection, image-level feature extraction, temporal feature pooling, and person matching.
2.2 An example of a normal walking cycle of a pedestrian.
2.3 An example of the motion of pixels in two subsequent frames.
2.4 An example of computed Vx and Vy values on every frame in a given sequence of images. The blue and red dots present minimum and maximum values of Vx and Vy, respectively.
2.5 Representative frame selection. The first row describes an image sequence of a person; the second row indicates the related original FEP (blue curve) and the regulated FEP (red curve). A walking cycle and four key frames extracted from this cycle are shown in the third and last rows, respectively.
2.6 An example of a Gaussian filter with µ = 0.
2.7 a) Random walking cycles of some persons in the PRID-2011 dataset and b) four key frames in a walking cycle.
2.8 (a) A person image is divided into patches and regions; (b) pipeline for GOG feature extraction [49].
2.9 RGB color space.
2.10 CIE L*a*b* color space [127].
2.11 HSV color space [129].
2.12 Three different feature pooling techniques for person representation.
2.13 Distribution of ΩI and ΩE in one projected dimension [45].
2.14 Evaluation of the performance of GOG features on a) the PRID 2011 dataset and b) the iLIDS-VID dataset with three different representative frame selection schemes.
2.15 Matching rates when selecting four key frames or four random frames for person representation on a) PRID-2011 and b) iLIDS-VID.
2.16 The distribution of frames for each person in the PRID 2011 dataset for a) camera view A and b) camera view B.
3.1 Image-to-images person ReID scheme.
3.2 Extracting KDES features (best viewed in color).
3.3 An example of the effectiveness of GOG and ResNet features on different query persons.
3.4 Proposed framework for images-to-images person ReID without temporal linking requirement.
3.5 Evaluation of the performance of three chosen features (GOG, KDES, CNN) over 10 trials on (a) CAVIAR4REID case A, (b) CAVIAR4REID case B, and (c) RAiD datasets in the image-to-images case.
3.6 Comparison of the performance of the three fusion schemes when using two or three features over 10 trials on (a) CAVIAR4REID case A, (b) CAVIAR4REID case B, and (c) RAiD datasets in the image-to-images case.
3.7 CMC curves in case A of images-to-images person ReID on the CAVIAR4REID dataset.
3.8 An example result of the SvsM and MvsM scenarios. Each row in the SvsM scenario shows the first five ranked persons for each query image, obtained by using the image-to-images scheme on three features. The person in the red box is the true matched person.
3.9 The proposed method for video-based person ReID, combining the fusion scheme with a metric learning technique.
3.10 Matching rates with different fusion schemes on the PRID-2011 dataset with a) four key frames, b) frames within a walking cycle, and c) all frames.
3.11 Matching rates with different fusion schemes on the iLIDS-VID dataset with a) four key frames, b) frames within a walking cycle, and c) all frames.
3.12 Average weights for GOG and ResNet features on a random split of a) PRID-2011 and b) iLIDS-VID.
4.1 A full person ReID pipeline including person detection, segmentation, tracking, and person ReID steps.
4.2 An example of automatic person detection and segmentation results on the PRID 2011 dataset.
4.3 An overview of the ACF detector [109].
4.4 Fast feature pyramid in the ACF detector [109].
4.5 a) An input image is divided into a 7 × 7 grid of cells; b) the architecture of a YOLO detector [152].
4.6 The architecture of a) Faster R-CNN [150] and b) Mask R-CNN [111].
4.7 DDN architecture for pedestrian parsing [112].
4.8 a) ReID accuracy of the source code provided in [49] and that of the re-implementation, and b) computation time (in s) for each step in extracting the GOG feature from an image in C++.
4.9 The matching rates at rank-1 with different numbers of regions (N).
4.10 CMC curves on the VIPeR dataset when extracting GOG features with the optimal parameters.
4.11 CMC curves of three evaluated scenarios on the VIPeR dataset when applying the method proposed in Chapter 2.
4.12 Examples of results of a) segmentation and b), c) person ReID in all three cases of using the original images, manually segmented images, and automatically segmented images of two different persons in the VIPeR dataset.
4.13 CMC curves of three evaluated scenarios on the PRID 2011 dataset in the single-shot approach: (a) without segmentation and (b) with segmentation.
4.14 CMC curves of three evaluated scenarios on the PRID 2011 dataset when applying the method proposed in Chapter 2.
4.15 Examples of results of a) human detection and segmentation and b), c) person ReID in all three cases of using the original images, manually segmented images, and automatically segmented images of two different persons in the PRID-2011 dataset.
INTRODUCTION
Motivation
Person ReID is the task of associating cross-view images of the same person as
he/she moves through a non-overlapping camera network [1]. In recent years, along with
the development of surveillance camera systems, person re-identification (ReID) has
increasingly attracted the attention of the computer vision and pattern recognition
communities because of its promising applications in many areas, such as public safety
and security, human-robot interaction, and person retrieval. In the early years, person
ReID was considered a sub-task of Multi-Camera Tracking (MCT) [2]. The purpose of
MCT is to generate tracklets in every single field of view (FoV) and then associate
the tracklets that belong to the same pedestrian across different FoVs. In 2006, Gheissari
et al. [3] first considered person ReID as an independent task. In a certain respect,
person ReID and Multi-Target Multi-Camera Tracking (MTMCT) are close to each
other. However, the two problems differ fundamentally in their objectives and evaluation
metrics. While the objective of MTMCT is to determine the position of each pedestrian
over time from video streams taken by different cameras, person ReID tries to answer
the question: "Which gallery images belong to a certain probe person?" and returns a
list of the gallery persons sorted in descending order of similarity to the given query
person. Where MTMCT classifies a pair of images as co-identical or not, person ReID
ranks the gallery persons with respect to the given query person. Therefore, their
performance is evaluated by different metrics: classification error rates for MTMCT
and ranking performance for ReID. It is worth noting that in the case of an overlapping
camera network, the corresponding images of the same person can be found through
data association; this can be considered a person tracking problem, which is out of the
scope of this thesis. In the last decade, thanks to unremitting efforts, person ReID has
achieved numerous important milestones and many great results [4, 5, 6, 7, 8]; however,
it is still a challenging task and confronts various difficulties. These difficulties and
challenges will be presented in a later section. First of all, the mathematical formulation
of person ReID is given as follows.
Problem formulation
In person ReID, the dataset is divided into two sets: probe and gallery. Note that
the probe and gallery sets are captured by at least two non-overlapping camera fields
of view. Given a query person $Q_i$ and $N$ persons in the gallery $G_j$, where
$j = 1, \ldots, N$, $Q_i$ and $G_j$ are represented as follows:

$$Q_i = \left\{ q_i^{(l)},\; l = 1, \ldots, n_i \right\}, \qquad G_j = \left\{ g_j^{(k)},\; k = 1, \ldots, m_j \right\}, \qquad (0.1)$$
where $n_i$ and $m_j$ are the numbers of images of persons $Q_i$ and $G_j$,
respectively, and may differ from each other.
Depending on the number of images used for person representation, person ReID
can be categorized into single-shot, where a sole image is used, or multi-shot, where
several images are available.
The identity of the given query person $Q_i$ is determined as follows [9]:

$$j^{*} = \arg\min_{j} d\left(Q_i, G_j\right), \qquad (0.2)$$
where $d\left(Q_i, G_j\right)$ is defined as the distance between the given query
person $Q_i$ and a gallery person $G_j$. This distance can be calculated directly or
learned through a metric learning method. It is worth noting that, in another
definition, the similarity between two pedestrians is used instead of the distance. In
this case, the identity of the given query person $Q_i$ is determined as follows:
formulation (0.3)
j = arg max Sim (Q
, G Problem
),
Person Re-identification
∗
i
j
j
From Equations (0.2) and (0.3), person ReID can be defined as a matching problem: the returned result is the gallery person who has the smallest distance (or largest similarity) to the given query person. However, in order to evaluate the performance of a person ReID method, a ranked list of the gallery persons is provided. This list is ranked in ascending order of distance (or descending order of similarity) to the given query person. Figure 1 shows an example of the ranked list of gallery persons corresponding to the given query, based on the similarities between the query and each of the gallery persons.
Figure 1: The ranked list of gallery person corresponding to the given query based on the similarities
between the query and each of gallery ones.
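The matching rule of Equations (0.2) and (0.3) and the ranked list used for evaluation can be sketched as follows; Euclidean distance stands in here for whatever descriptor distance or learned metric a concrete method would use, and all identities and feature values are made up for illustration:

```python
import math

def euclidean(q, g):
    """Distance d(q, g) between two feature vectors (a stand-in metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, g)))

def rank_gallery(query, gallery):
    """Return gallery IDs sorted by ascending distance to the query.
    The top entry is the re-identification result of Eq. (0.2); the full
    ranked list is what evaluation protocols score."""
    scored = [(euclidean(query, feat), pid) for pid, feat in gallery.items()]
    scored.sort()  # ascending distance = descending similarity
    return [pid for _, pid in scored]

query = [0.2, 0.9, 0.1]
gallery = {"A": [0.9, 0.1, 0.4], "B": [0.25, 0.85, 0.15], "C": [0.5, 0.5, 0.5]}
ranked = rank_gallery(query, gallery)
print(ranked)  # ['B', 'C', 'A'] -- B is the rank-1 (closest) gallery person
```

Using a similarity function instead, as in Equation (0.3), only flips the sort order; the ranked list itself is unchanged.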
Challenges
There are many challenges in person ReID, which may derive from environmental and practical conditions. This section discusses three main challenges: (1) the strong variations in illumination, viewpoint, pose, etc.; (2) the large number of images for each person in a camera view and the large number of persons; and (3) the effect of human detection and tracking results.
• Firstly, the strong variations in illumination, viewpoint, and pose are general difficulties in any image processing problem. These factors can make the appearance discrepancies of the same person even larger than those between different persons. Consequently, one of the crucial tasks in person ReID is to build a visual descriptor for person representation that is also discriminative. Such a descriptor must highlight the characteristics of each individual and help to distinguish between different persons more easily. Figure 2 illustrates the variations in illumination and viewpoint: the colors of pedestrians' clothes change significantly due to these variations. Pairs of images in the same column present the same person, captured in two different camera views.
Figure 2: An example for challenges caused by variations in a) illumination b) view-point.
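As a toy illustration of why illumination changes trouble appearance descriptors, the sketch below computes a simple per-channel color histogram (a minimal stand-in, not one of the descriptors studied in this thesis) for the same "red shirt" under bright and dark lighting; the two histograms disagree even though the person is the same:

```python
def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` bins, count occurrences per bin,
    and normalize so images of different sizes are comparable."""
    hist = [0] * (bins * 3)
    for r, g, b in pixels:
        for c, v in enumerate((r, g, b)):
            idx = min(v * bins // 256, bins - 1)  # channel value -> bin index
            hist[c * bins + idx] += 1
    total = len(pixels)
    return [h / total for h in hist]

# Two tiny "images" of the same red shirt under different illumination:
bright = [(200, 40, 40), (210, 50, 45)]
dark = [(120, 20, 20), (110, 25, 30)]
# The red-channel mass moves from bin 3 (bright) to bin 1 (dark),
# so a naive color descriptor would no longer match the same person.
print(color_histogram(bright))
print(color_histogram(dark))
```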
• The second challenge is the large number of images for each person in a camera view and the number of persons in the examined datasets. The numbers of identities and images in the evaluated datasets have grown rapidly in recent years: the early datasets have only hundreds of identities and thousands of images, whereas the latest datasets contain thousands of identities and millions of images. This places a significant burden on memory capacity, execution speed, and computational complexity when solving the person ReID problem.
Figure 3 shows some images of the same person captured in different camera
views. Besides, the number of images for each person in the existing datasets
varies greatly. For example, in the PRID-2011 dataset, some persons have only 20 images, while others may have hundreds of images. This leads to an imbalance in person representation and also causes difficulties for person ReID.
Figure 3: A person has multiple images captured in different camera-views.
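One simple way to handle multi-shot sets of unequal size, such as the tracks in Figure 3, is to define the set-to-set distance d(Q_i, G_j) of Equation (0.2) as the minimum pairwise image distance; the sketch below assumes illustrative two-dimensional features:

```python
import math

def dist(a, b):
    """Euclidean distance between two image feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def set_distance(Qi, Gj):
    """d(Q_i, G_j) as the minimum pairwise image distance: one common, simple
    way to reduce two image sets of different sizes to a single score."""
    return min(dist(q, g) for q in Qi for g in Gj)

Qi = [[0.2, 0.9], [0.3, 0.8]]    # n_i = 2 probe images
Gj = [[0.25, 0.85], [0.9, 0.1]]  # m_j = 2 gallery images
print(round(set_distance(Qi, Gj), 3))  # 0.071
```

Averaging features over a track is another common choice; the minimum-distance rule is more tolerant of outlier images within a set.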
• The third challenge is the effect of human detection and tracking results. In a fully automatic surveillance system, person ReID is the last stage, whose inputs are the outcomes of the human detection and tracking stages, as illustrated in Fig. 4. The performance of these two previous stages greatly affects the overall performance. Most existing studies deal with human regions of interest (ROIs) that are manually detected and segmented with well-aligned bounding boxes. Nevertheless, in an automatic surveillance system, many problems and errors appear in human detection and tracking, such as false detections, ID switches, fragments, etc. Consequently, these errors might cause a reduction in ReID accuracy. Though the latest methodology-driven methods surpass human-level performance on several commonly used benchmark datasets, improving accuracy for application-driven ReID is still a non-trivial task.
Figure 4: A fully-automatic person ReID system consisting of three main stages: human detection,
tracking and re-identification.
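The three-stage pipeline of Figure 4 can be sketched with stub detection and tracking stages; every function and data value below is an illustrative stand-in (a real system would run an actual detector and tracker, and errors in those stages would propagate into the matching step exactly as described above):

```python
def detect(frame):
    """Stub human detector: returns the frame's (track_id, feature) boxes.
    A real system would run an object detector here."""
    return frame["boxes"]

def track(detections_per_frame):
    """Stub tracker: groups detections into per-person tracklets by their
    given track id. A real tracker associates boxes across frames itself."""
    tracks = {}
    for boxes in detections_per_frame:
        for tid, feat in boxes:
            tracks.setdefault(tid, []).append(feat)
    return tracks

def reid(track_feats, gallery):
    """Match a tracklet against the gallery by minimum feature distance."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(gallery,
               key=lambda pid: min(d(q, g) for q in track_feats
                                   for g in gallery[pid]))

frames = [{"boxes": [(0, [0.2, 0.9])]}, {"boxes": [(0, [0.3, 0.8])]}]
gallery = {"person_A": [[0.25, 0.85]], "person_B": [[0.9, 0.1]]}
tracks = track(detect(f) for f in frames)
print({tid: reid(feats, gallery) for tid, feats in tracks.items()})
# {0: 'person_A'}
```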
Based on the above analysis, person ReID is undoubtedly an interesting issue but chal-