HO CHI MINH NATIONAL UNIVERSITY
HCM CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
——————– * ———————
BACHELOR DEGREE THESIS
TOWARD DATA-EFFICIENT MULTIPLE OBJECT
TRACKING
Council : Computer Science
Thesis Advisor : Dr. Nguyen Duc Dung
Reviewer : Dr. Nguyen Hua Phung
—o0o—
Students:
Phan Xuan Thanh Lam 1710163
Tran Ho Minh Thong 1710314
HO CHI MINH CITY, 07/2021
Declaration
We declare that the thesis entitled “TOWARD DATA-EFFICIENT MULTIPLE OBJECT
TRACKING” is our own work under the supervision of Dr. Nguyen Duc Dung.
We declare that the information reported here is the result of our own work, except
where references are made. The thesis has not been accepted for any degree and is not
being concurrently submitted in candidature for any other degree or diploma.
Abstract
Multiple object tracking (MOT) is the task of estimating the trajectories of several objects as
they move around a scene. MOT is an open and attractive research field with a broad extent
of categories and applications such as surveillance, sports analysis, human-computer interface,
biology, etc.
The difficulties of this problem lie in several challenges, such as frequent occlusions and intra-class and inter-class variations. Recently, deep learning MOT methods have confronted these challenges effectively and led to groundbreaking results; consequently, they are used in almost all state-of-the-art MOT algorithms. Despite their successes, deep learning MOT algorithms, like other deep learning-based algorithms, are data-hungry and require a large amount of labeled data to work. On the other hand, annotating MOT data usually consists of manually labeling the positions of objects on every video frame (with bounding boxes or segmentation) and assigning each object a single identity (ID), such that different objects have different IDs and the same object in different frames has the same ID. This makes annotating MOT data a very time-consuming task.
To address the data problem in deep learning MOT algorithms, in this thesis we propose a method that needs only the annotations of object positions. Experiments show that our method is competitive with the current state-of-the-art method, despite the lack of object ID labeling.
On the other hand, we found that current annotation tools, such as the Computer Vision Annotation Tool [77] and SuperAnnotate [78], are not well integrated with MOT models and also lack features necessary for MOT problems. Therefore, in this thesis we also develop a new annotation tool. It supports automatic labeling via our proposed MOT model. Moreover, our tool provides plenty of convenient features, which increase the automation of the labeling process, help control the accuracy and consistency of results, and improve the user experience.
To sum up, our main contributions in this thesis are twofold:
• Our first major contribution is an MOT algorithm competitive with state-of-the-art algorithms without the need for object ID labeling. This significantly reduces the cost of manually labeling data.
• As a second contribution, we also build an annotation tool. Our tool supports automatic annotation and many features that speed up the labeling process for MOT data.
Acknowledgments
We would like to thank Dr. Nguyen Duc Dung for guiding us to important publications
and for the stimulating questions on artificial intelligence. The meetings and conversations
were vital in inspiring us to think outside the box and from multiple perspectives, so as to
form a comprehensive and objective critique.
Contents

1 Introduction
   1.1 The Multiple Object Tracking Problem
   1.2 Introduction to MOT algorithms
   1.3 Labelling tool for Multiple Object Tracking
   1.4 Objective
   1.5 Thesis outline

2 Contrastive Learning & Object Detection
   2.1 Contrastive Learning
      2.1.1 Self-Supervised Representation Learning
      2.1.2 Contrastive Representation Learning
         2.1.2.1 Framework of contrastive learning
         2.1.2.2 SimCLR
         2.1.2.3 MoCo
         2.1.2.4 SwAV
         2.1.2.5 Barlow Twins
   2.2 Object Detection
      2.2.1 Two-stage Detectors
      2.2.2 One-stage Detectors

3 Related Work
   3.1 DeepSort
   3.2 JDE
   3.3 FairMot
   3.4 CSTrack
   3.5 MOT Metrics
      3.5.1 Classical metrics
      3.5.2 CLEAR MOT metrics
         3.5.2.1 ID scores

4 Proposed Algorithm
   4.1 Proposed architecture and algorithm
   4.2 Proposed Augmentation

5 Experiment
   5.1 Overview
      5.1.1 Experimental environment
      5.1.2 Datasets and Metrics
      5.1.3 Ablative Studies
         5.1.3.1 Robustness of proposed augmentation
         5.1.3.2 Results of different contrastive objectives
         5.1.3.3 Results of different batch sizes on SimCLR
         5.1.3.4 Results on MOTChallenge

6 Analysis of the annotation tool
   6.1 System objectives
   6.2 An analysis of system objectives
   6.3 System requirements
   6.4 Use case diagram

7 Design of the annotation tool
   7.1 Overall architecture
   7.2 Deployment
   7.3 Entity relationship
      7.3.1 Entity relationship diagram
      7.3.2 Entities explanation
      7.3.3 Deletion mechanism

8 Used technologies of the annotation tool
   8.1 Front-end module
      8.1.1 Angular
      8.1.2 Ag-grid
   8.2 Back-end module
      8.2.1 Django
      8.2.2 REST framework
      8.2.3 Simple JWT
      8.2.4 OpenCV
      8.2.5 Request
      8.2.6 Django cleanup
   8.3 Database module
      8.3.1 PostgreSQL
   8.4 File-storage module
      8.4.1 S3 service
   8.5 Artificial-intelligence module
      8.5.1 Google Colaboratory
   8.6 Docker

9 Implementation of the annotation tool
   9.1 Front-end module
      9.1.1 Boxes rendering problem
      9.1.2 Implementation of dynamic annotations
      9.1.3 Implementation of the custom event managers
      9.1.4 Implementation of the drawing new annotation feature
      9.1.5 Implementation of the interpolating feature
      9.1.6 Implementation of the filtering feature
   9.2 Back-end module
      9.2.1 Implementation of the multiple object tracking feature
      9.2.2 Implementation of the single object tracking feature
      9.2.3 Implementation of commission related features

10 System evaluation
   10.1 Evaluation
      10.1.1 Strengths
      10.1.2 Weaknesses
   10.2 Comparison to CVAT
   10.3 Development strategy
   10.4 Contribution

11 Conclusion
   11.1 Achievement
   11.2 Future Work
List of Tables

5.1 Training Data Statistics
5.2 Embedding performance of different augmentations
5.3 IDF1 and MOTA with and without our augmentation, best performance shown in bold
5.4 Tracking accuracy of different contrastive methods as well as the supervised method
5.5 True positive rate at different false accepted rates of different embedding and supervised methods
5.6 Performance of SimCLR at different batch sizes
5.7 Comparison of our method to the state-of-the-art method
10.1 Comparison with CVAT
List of Figures

1.1 An illustration of the output of an MOT algorithm [1]
1.2 Some applications of MOT
1.3 Workflow of MOT
2.1 Self-supervised approach of BERT [39]
2.2 Self-supervised learning by rotating the entire input images [45]
2.3 Positive and negative sampling pipeline of contrastive learning
2.4 Accuracy at different random resized crops
2.5 Performance of [51] on different batch sizes
2.6 A framework of SimCLR [51]
2.7 SimCLR [51] algorithm
2.8 MoCo [53] algorithm
2.9 A framework of Barlow Twins [54]
2.10 The architecture of R-CNN [61]
2.11 The architecture of Fast R-CNN [62]
2.12 An illustration of the Faster R-CNN model [56]
2.13 Workflow of YOLO [57]
2.14 Model architecture of SSD [58]
2.15 FPN fusion strategy [36]
2.16 Comparison between CenterNet and anchor-based methods
3.1 Overview of the CNN architecture used to extract embedding features of DeepSort [6]
3.2 DeepSort [6] matching algorithm
3.3 Architecture of the model used in JDE [7]
3.4 Feature sampling strategy of FairMot
3.5 FairMot architecture
3.6 Diagram of the cross-correlation network [9]
3.7 The details of the scale-aware attention network [9]
4.1 Proposed joint detection and contrastive learning
4.2 Copy-paste augmentation
4.3 Some mask image results from running Faster R-CNN on CrowdHuman
4.4 Mosaic augmentation
4.5 Some augmentations from the CrowdHuman and MOT17 datasets
5.1 Visualization of the discriminative ability of the embedding features of different contrastive methods
5.2 Example tracking results of our method on the test set of MOT17
6.1 The use case diagram of our annotation tool
7.1 The overall architecture of the annotation tool
7.2 An overall view of the deployment of the annotation tool
7.3 The deployment of the front-end module, the back-end module and the database module
7.4 The deployment of the file-storage module and the artificial-intelligence module
7.5 The database schema of the annotation tool
9.1 The automata of a dynamic annotation
10.1 The CVAT tool
Chapter 1
Introduction
1.1 The Multiple Object Tracking Problem
Object tracking is the task of estimating the trajectories of one or several objects as they
move around a scene, usually captured via a camera. To follow (track) objects through time, we
usually rely on appearance features (shape, color, texture, size, deep features, etc.) and motion
patterns (velocity, position, etc.). Object tracking can be categorized into single object tracking
(SOT) and multiple object tracking (MOT), based on the number of targets being tracked.
In single object tracking, a target of interest is usually given in the first frame, and our objective is to follow it throughout the video sequence. On the other hand, multiple object tracking aims to estimate the trajectories of multiple objects, which may belong to one or more categories such as pedestrians, cars, or animals, throughout a sequence (usually a video).
Unlike object detection, whose output is a collection of rectangular bounding boxes identified by their positions, heights, and widths, MOT algorithms also associate a target ID with each box. An example of the output of an MOT algorithm is illustrated in Fig. 1.1.
Figure 1.1: An illustration of the output of a MOT algorithm [1]
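Concretely, a tracker's per-frame output is commonly serialized as one (frame, ID, box) record per object. The sketch below follows the comma-separated field layout of the MOTChallenge text format; the class and helper names are our own illustrative choices, not part of any standard API.

```python
# Each tracker output row: frame index, target ID, and a bounding box
# given by its top-left corner plus width and height (MOTChallenge style).
from dataclasses import dataclass

@dataclass
class TrackedBox:
    frame: int      # 1-based frame index
    track_id: int   # identity shared by the same object across frames
    x: float        # bb_left
    y: float        # bb_top
    w: float        # bb_width
    h: float        # bb_height

def to_mot_line(b: TrackedBox) -> str:
    """Serialize one box in a comma-separated MOTChallenge-like layout."""
    return f"{b.frame},{b.track_id},{b.x:.1f},{b.y:.1f},{b.w:.1f},{b.h:.1f}"

# The same ID (7) on different frames marks the same physical object.
track = [TrackedBox(1, 7, 100.0, 50.0, 40.0, 80.0),
         TrackedBox(2, 7, 104.0, 51.0, 40.0, 80.0)]
lines = [to_mot_line(b) for b in track]
```

The constant ID across rows is exactly what distinguishes tracker output from plain detector output.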
MOT is an open and attractive research field. It has attracted considerable attention due to
its potential in many disciplines (figure 1.2 shows some applications of MOT), ranging from
surveillance to biology. Some applications of MOT are:
• Surveillance systems: They are deployed in cities to identify individuals or vehicles of
interest for traffic monitoring.
• Tracking body parts: It allows humans to interact with computers via gestures [2] .
• Tracking players in sport: Helps to analyze and interpret the game [3].
• Biomedical field: MOT is used to analyze the proliferative responses of stem cells for drug
discovery [4].
Figure 1.2: Some applications of MOT. Top left: human-computer interaction [2]; top right:
sports analysis [3]; bottom left: surveillance [70]; bottom right: biology [4]
Recently, thanks to progress in the deep learning (DL) field, especially convolutional neural networks (CNNs) and transformer models, the MOT problem has seen great success. In detail, most deep learning-based MOT approaches perform much better than traditional methods. Therefore, almost all current state-of-the-art MOT algorithms [7][8][9] are deep-learning-based. The strength of deep learning is its ability to automatically learn representations and extract complex and abstract features from the input. Since the development of AlexNet [5], deep learning has dominated the field, and thanks to that, new MOT methods achieve breakthrough results.
Despite the power of deep learning, MOT remains a challenging task. The main difficulties in MOT lie in the two following problems:
• Object detection: It is one of the fundamental problems of computer vision, and it usually plays a big role in MOT algorithms. To achieve good results, many MOT algorithms require accurately predicted bounding boxes from frame to frame, which is very challenging because of changes in pose, appearance, environment, etc. throughout the frames.
• Occlusions: We need to correctly re-identify an occluded object when it becomes visible again. This is also very challenging because of appearance and position changes after a long time. Moreover, multiple objects in the same frame can look quite similar, while the
same object in different frames can change its appearance rapidly. This makes identifying objects from frame to frame challenging.
Another disadvantage of deep learning-based MOT methods is that they usually require a large amount of labeled data to work. Since the development of the internet, obtaining raw data is not too hard, but hand-labeling that data is by far too expensive. To overcome this disadvantage, various unsupervised methods have been proposed to leverage such large amounts of unlabeled data; BERT [39] in NLP and SimCLR [51] in computer vision are excellent examples. Inspired by them, in this thesis we apply these unsupervised learning techniques to MOT. Although our method is not fully unsupervised, it can significantly reduce the labeling effort while remaining competitive with the current state-of-the-art methods.
1.2 Introduction to MOT algorithms
Most MOT algorithms today are based on the tracking-by-detection mechanism. In detail, a set of detections (bounding boxes around the objects of interest) is extracted from the video, frame by frame. After that, bounding boxes surrounding the same target (in different frames) are associated by assigning them a common ID. This step is carried out by a data association algorithm. Many MOT methods follow this mechanism [6][28][7][8][9]. Nowadays, modern detection frameworks [21][22][23][24] can ensure good detection quality. Therefore, the majority of current MOT methods focus on improving the association step. Moreover, many MOT datasets and challenges provide standard sets of detection results, and algorithms are benchmarked on how well they associate those detections.
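As a minimal illustration of tracking-by-detection (a toy baseline, not any specific published tracker), the sketch below matches each detection to an existing tracklet greedily by bounding-box overlap and spawns a new ID for anything unmatched; the function names and the IoU threshold are our own illustrative choices.

```python
# A deliberately simple tracking-by-detection loop: detections are matched
# to existing tracklets greedily by IoU; unmatched detections start new IDs.

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def track_frames(frames_detections, iou_thresh=0.3):
    """frames_detections: list of per-frame lists of (x, y, w, h) boxes.
    Returns per-frame lists of (track_id, box) pairs."""
    tracks = {}      # track_id -> last known box of that tracklet
    next_id = 0
    output = []
    for dets in frames_detections:
        assigned, used = [], set()
        for det in dets:
            # Greedily pick the best unmatched tracklet for this detection.
            best = max(
                (tid for tid in tracks if tid not in used),
                key=lambda tid: iou(tracks[tid], det),
                default=None,
            )
            if best is not None and iou(tracks[best], det) >= iou_thresh:
                tid = best
            else:
                tid, next_id = next_id, next_id + 1
            used.add(tid)
            tracks[tid] = det
            assigned.append((tid, det))
        output.append(assigned)
    return output
```

Real trackers replace the greedy IoU step with appearance features, motion models, and optimal matching, but the overall per-frame loop has this shape.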
Despite the huge variety of approaches to the MOT problem, the vast majority of them can be divided into the three following stages (summarized in Fig. 1.3), which may be merged, shared, or performed separately:
• Detection stage: An object detection algorithm analyzes each input frame and returns the
list of bounding boxes belonging to the object we want to track. In the context of MOT, a
bounding box is called a “detection”.
• Feature extraction stage: Each detection from the previous stage, as well as the tracklets (lists of detections belonging to the same object) from the previous frame, is represented, usually by a vector, through a feature extraction algorithm.
• Association stage: Based on the detections and the extracted features from the previous stages, tracklets (of the previous frame) are assigned to detections to form a set of new tracklets. A common association approach builds a similarity matrix between every pair of tracklets and detections, based on the similarity of their features and their positions. After that, a matching algorithm, like the Hungarian algorithm [25], is used to form the association. Some newer methods [26][27] use Graph Neural Networks (GNNs) to learn the association step instead of using a heuristic algorithm like the one above.
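The similarity matrix plus Hungarian matching can be sketched as follows, using SciPy's `linear_sum_assignment` for the optimal matching; the function name, the cosine-similarity choice, and the rejection threshold are our own illustrative choices, and real trackers would also factor in positions and motion.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embs, det_embs, sim_thresh=0.5):
    """Match tracklets to detections by cosine similarity.

    track_embs: (T, D) array of tracklet feature vectors.
    det_embs:   (N, D) array of detection feature vectors.
    Returns a list of (track_index, detection_index) matches.
    """
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    sim = t @ d.T                             # (T, N) similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    # Reject assignments whose similarity is too low to be plausible.
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if sim[r, c] >= sim_thresh]
```

Unmatched tracklets would then be kept alive or terminated, and unmatched detections would start new tracklets.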
Some early works [6][28] perform these three stages separately, which is very time-consuming. To overcome this drawback, newer methods [7][8] propose merging the first two stages to create a good tracker that can run in real time. Other recent works, like [26][27], view MOT as a graph-matching problem. For example, [26] builds a graph whose vertices are the detections in multiple frames, and treats the MOT problem as an edge classification problem. Meanwhile, [27] uses a GNN to build the affinity matrix instead of a heuristic algorithm.
Some works go beyond the three-step guideline. For example, [11] modifies CenterNet to take two adjacent frames as input, and returns both bounding boxes and the offset
Figure 1.3: Usual workflow of an MOT algorithm: given raw frames (1), an object detector is
run to obtain the bounding boxes of the objects (2). Then, for every detected object, different
features are computed (3). After that, an association step builds the similarity matrix based
on the results of the two previous steps (4). Finally, an association algorithm runs on that matrix to
assign a numerical ID to each object (5). Figure taken from [1].
between the two corresponding (positive) bounding boxes. Generally speaking, [11] belongs to the family of methods that use two or a list of frames for training, instead of just one. Such methods, like [11] and [12], are difficult and tricky to train, while their performance is not as good as that of the three-step ones.
Recently, following the success of transformer models on computer vision tasks such as classification [13] and detection [14], the MOT community has started adopting them [15][16]. These methods also train with multiple input frames (usually 2). Their idea is to take the previous frame's detection results d_{t-1}, combine them with learned object positional embeddings (similar to the decoder input of [14]), and jointly detect new objects, match them with the previous frames, initiate new tracks, and remove occluded or out-of-frame tracks.
On the other hand, although all the mentioned methods take different approaches, they all require the same kind of labeled data to train: the bounding boxes of the objects in each frame and the IDs assigned to those boxes. In this thesis, we propose a new MOT algorithm that requires only the bounding box annotations to train and is competitive with the current state-of-the-art algorithms.
1.3 Labelling tool for Multiple Object Tracking
In this thesis, we also expect our work to be rich in terms of academics and accessible to the masses. Our potential end-users may include other computer vision teams researching this topic, deep learning scientists who need high-quality training data, or even commercial users with particular purposes such as supervision, security, etc. Not all of them are interested in our model itself or capable of setting it up. This fact leads to the need for a user interface for the model, which hides the complexity of the model and allows end-users to make use of it. However, currently available tools are by far more oriented toward single object tracking models and have little or no compatibility with multiple object tracking
models. This is the reason for us to implement a new annotation tool.
Moreover, our tool will satisfy the following requirements:
• Utilize the capabilities of our multiple object tracking model;
• Provide features that enhance the user experience;
• Specialize in pedestrian data;
• Produce accurate results and offer suitable means for controlling accuracy.
1.4 Objective
Intending to solve the data problem in MOT, our main contributions in this thesis are:
• Propose and implement a novel data-efficient algorithm for solving the MOT problem, which is described in chapter 4;
• Develop a web-based application that embeds our model and provides convenient functions compared to available tools (described in chapter 9).
1.5 Thesis outline
The thesis is organized as follows:
• In chapter 1, we introduce what MOT is, its applications, and what the standard approach looks like. We also point out the current limitation in the labeling requirement, which inspires us to create a novel algorithm and a new annotation tool. Finally, we summarize the objectives of our thesis.
• In chapter 2, we introduce contrastive learning and object detection, the two core components of our proposed algorithm.
• In chapter 3, we survey the current methods and how we leverage them in our proposed MOT algorithm.
• In chapter 4, we present our approach, which requires much fewer annotations.
• In chapter 5, we perform experiments and ablation studies on our proposed algorithm and compare it with the current state of the art.
• In chapter 6, we analyze the system in terms of functionality and point out the functional and non-functional requirements.
• In chapter 7, we describe the overall architecture of the system: its modules and their interactions.
• In chapter 8, we list the technologies we used and explain why they were chosen.
• In chapter 9, we explain in detail the implementation of core functions on both the front-end and back-end sides.
• In chapter 10, we compare our tool with an available tool and then point out its strengths, weaknesses, development strategy, and contribution.
• In chapter 11, we summarize our work and state some future directions.
Chapter 2
Contrastive Learning & Object Detection
In this chapter, we briefly introduce and discuss contrastive learning and object detection algorithms, as they form the core components of our proposed data-efficient MOT method.
In general, we use object detection to detect all objects that need to be tracked in every frame. Meanwhile, contrastive learning is used to extract features from those objects without deciding which ID each object belongs to, which saves a lot of labeling time.
2.1 Contrastive Learning
2.1.1 Self-Supervised Representation Learning
Given a task and enough labels, supervised deep learning can solve the task very well. Usually, deep learning requires a decent amount of labels to achieve good performance. However, collecting manually labeled data is expensive and hard to scale. On the other hand, the amount of unlabeled data (e.g., free text, all the images on the Internet) is substantially larger than the limited number of human-labeled datasets. It would be a huge waste to discard this unlabeled data; as a result, deriving useful information or representations from it (i.e., unsupervised learning) is an attractive research field. To leverage the massive amount of unlabeled data, the idea is to get labels for free and train on unlabeled datasets in a supervised manner. For example, we can mask some parts of the input (a sequence of words, part of an image) and then force the model to predict or reconstruct the masked part. This approach is called self-supervised representation learning.
Figure 2.1: Self-supervised approach of BERT [39]
Recently, the task of learning representations from data without supervision has achieved impressive results, both in computer vision and natural language processing.
In the natural language processing field, BERT [39], introduced by Google in 2018, was a breakthrough. By learning representations from large-scale unlabeled text crawled from the Internet (via two auxiliary tasks: masked word prediction and next sentence prediction) and then fine-tuning on a small labeled dataset, it achieved state-of-the-art (SOTA) results. BERT beat the previous SOTA by 7% on the GLUE benchmark. The approach of BERT (unsupervised pre-training on unlabeled data, then fine-tuning on labeled data) has become the standard approach in the NLP field [40][42][41]. In 2020, GPT-3 [43] by OpenAI, by scaling the transformer model [44] up to 175 billion parameters, massively crawling text from the internet, and training on next-token prediction, achieved incredible performance on a wide range of tasks (question answering, translation, reading comprehension, ...) without fine-tuning (i.e., zero-shot learning) on those tasks.
In the computer vision field, many ideas have been proposed for self-supervised representation learning. The common workflow is to train a model on one or multiple pretext tasks with unlabeled images and then use an intermediate feature layer of this model to feed a logistic regression classifier on ImageNet classification. This procedure is called linear evaluation. The final classification accuracy quantifies how good the learned representation is.
Many pretext tasks have been proposed to tackle the problem of learning representations from unlabeled data. [45] uses rotation prediction as a pretext task. Each input image is first rotated by a random multiple of 90°, corresponding to [0°, 90°, 180°, 270°]. The model is trained to predict which rotation has been applied, so it is a 4-class classification problem. By learning to predict the rotation, the model has to recognize high-level parts of objects, such as heads, noses, and eyes, and the relative positions of these parts, rather than local patterns. As a result, the model is forced to learn semantic concepts of objects.
Figure 2.2: Self-supervised learning by rotating the entire input images.[45]
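The free labels for this pretext task can be generated as sketched below; the function name is our own illustrative choice and images are treated as plain arrays.

```python
# Rotation pretext task: each input is rotated by k * 90 degrees and the
# model must predict k, yielding free 4-class labels from raw images.
import numpy as np

def make_rotation_batch(images, rng):
    """Rotate each HxW image by a random multiple of 90 degrees.

    Returns the rotated images and the rotation class k in {0, 1, 2, 3}.
    """
    rotated, labels = [], []
    for img in images:
        k = int(rng.integers(0, 4))
        rotated.append(np.rot90(img, k))
        labels.append(k)
    return rotated, np.array(labels)
```

A classifier trained on these (rotated image, k) pairs never needs human annotations.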
Some other pretext tasks are colorization [46] (predict the image color in the CIELab* space), inpainting [47] (mask a part of an image and train a model to reconstruct it), jigsaw puzzles [48] (divide an image into multiple patches, shuffle them, and predict their original order), etc. Although they achieve some promising results, the performance of these methods on the ImageNet dataset, when feeding the intermediate feature layer to a logistic regression classifier, is below 60%, compared to the 80+% accuracy of supervised ones. Thus, self-supervised learning on images seems to be harder than self-supervised learning on text in NLP.
Recently, since the introduction of contrastive learning and the InfoNCE loss function to the self-supervised problem in [49], self-supervised representation learning on images has been revolutionized and achieves incredible results. Nowadays, most new self-supervised learning methods in the vision domain use a contrastive learning framework. The framework of contrastive learning and the methods employed on it are discussed in the next section.
2.1.2 Contrastive Representation Learning
The main idea of contrastive learning is to learn representations such that similar samples stay close to each other, while dissimilar ones are far apart. Contrastive learning can be applied in both supervised and unsupervised settings and has been shown to achieve good performance on a variety of vision and language tasks. In unsupervised settings, contrastive learning is one of the most powerful approaches in self-supervised learning; in the vision domain, it is the framework used in nearly every state-of-the-art method.
2.1.2.1 Framework of contrastive learning
As stated before, the goal of contrastive learning is to learn representations such that similar samples stay close and dissimilar ones are far apart. We can loosen the definitions of “classes” and “labels” in supervised learning to create positive and negative sample pairs out of unsupervised data. These are the key ingredients of contrastive learning. To create positive pairs, the common approach is to apply data augmentation to create different views of the original samples. The
negative pairs can be obtained by random sampling from the original dataset. The processes of sampling positive and negative pairs are illustrated in figure 2.3.
Figure 2.3: Positive and Negative sampling pipeline of constrastive learning
Mathematically speaking, given a sample x, let p_pos denote the positive distribution (i.e., a custom random augmentation) and p_neg the negative distribution (usually uniform sampling from the dataset). The objective of contrastive learning is to make the representations of x and x⁺ ∼ p_pos(·|x) close, while maximizing the distance to x⁻ ∼ p_neg(·|x). The common loss function for this objective is the contrastive (InfoNCE) loss:
achieve this objective is contrastive loss function:
exp( f (x0 )T f (x+ ))/τ
)]
−
0 T
i i=1
exp( f (x0 )T f (x+ ))/τ + ∑M
i=1 exp( f (x ) f (xi ))/τ
(2.1)
where f is the encoder network used to extract useful presentations from images.
The practice finds that the two key ingredients determining the success of contrastive learning are the data augmentation and the batch size (i.e number of negative samples). The augmentation needs to be “strong” enough to create “hard” positive pairs. However it should also
be not too hard, otherwise, we can get noised positive pairs. [50] has experimented on different
strengths of augmentation on the quality of learned representation. As a result, it finds that in
most cases, the performance graph has a reversed U shape. In detail, this means that neither too
“weak” nor too “strong” augmentation is not good. Figure 2.4 shows the linear evaluation of
performance after experimenting on different sizes of random-crop augmentation.
Lcontrast = Ex∼pdata ,x+ ,x0 ∼p pos (.|x),(x− )M
∼pneg (.|x) [−log(
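For a single anchor, the contrastive loss of Eq. 2.1 can be evaluated numerically as sketched below; the function and variable names are our own, and embeddings are L2-normalized so dot products act as cosine similarities.

```python
# Numeric sketch of the contrastive (InfoNCE) loss for a single anchor:
# one positive view and M negative samples, with temperature tau.
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """anchor, positive: (D,) embeddings; negatives: (M, D) embeddings."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos_logit = a @ p / tau
    neg_logits = n @ a / tau                  # (M,) similarities to anchor
    logits = np.concatenate([[pos_logit], neg_logits])
    # Loss is -log of the softmax probability assigned to the positive.
    logits -= logits.max()                    # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```

The loss shrinks as the positive view becomes more similar to the anchor than every negative, which is exactly the objective stated above.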