HO CHI MINH NATIONAL UNIVERSITY
HCM CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

BACHELOR DEGREE THESIS

TOWARD DATA-EFFICIENT MULTIPLE OBJECT TRACKING

Council: Computer Science
Thesis Advisor: Dr. Nguyen Duc Dung
Reviewer: Dr. Nguyen Hua Phung

Students:
Phan Xuan Thanh Lam 1710163
Tran Ho Minh Thong 1710314

HO CHI MINH CITY, 07/2021

Declaration

We declare that the thesis entitled "TOWARD DATA-EFFICIENT MULTIPLE OBJECT TRACKING" is our own work under the supervision of Dr. Nguyen Duc Dung. We declare that the information reported here is the result of our own work, except where references are made. The thesis has not been accepted for any degree and is not concurrently submitted in candidature for any other degree or diploma.

Abstract

Multiple object tracking (MOT) is the task of estimating the trajectories of several objects as they move around a scene. MOT is an open and attractive research field with a broad range of categories and applications such as surveillance, sports analysis, human-computer interfaces, and biology. The difficulty of the problem lies in several challenges, such as frequent occlusions and intra-class and inter-class variations. Recently, deep learning MOT methods have confronted these challenges effectively and led to groundbreaking results; they are now used in almost all state-of-the-art MOT algorithms. Despite their success, deep learning MOT algorithms, like other deep learning-based algorithms, are data-hungry and require a large amount of labeled data to work. Moreover, annotating MOT data usually consists of manually labeling the position of every object in every video frame (with bounding boxes or segmentation masks) and assigning each object a single identity (ID), such that different objects have different IDs and the same object in different frames keeps the same ID. This makes annotating MOT data very time-consuming. To address the data problem in deep learning MOT algorithms, in this thesis we propose a method that needs only the annotations of object positions. Experiments show that our method is competitive with the current state-of-the-art methods, despite the lack of object ID labeling. We also found that current annotation tools, such as the Computer Vision Annotation Tool [77] and SuperAnnotate [78], are not well integrated with MOT models and lack features necessary for MOT problems. Therefore, in this thesis we also develop a new annotation tool. It supports automatic labeling via our proposed MOT model and provides many convenient features that increase the automation of the labeling process, control the accuracy and consistency of the results, and improve the user experience. To sum up, our main contributions in this thesis are twofold:

• Our first major contribution is an MOT algorithm competitive with state-of-the-art algorithms without the need for object ID labeling. This significantly reduces the cost of manually labeling data.

• As a second contribution, we build an annotation tool. Our tool supports automatic annotation and many features that speed up the labeling process of MOT data.

Acknowledgments

We would like to thank Dr. Nguyen Duc Dung for guiding us to important publications and for the stimulating questions on artificial intelligence.
The meetings and conversations were vital in inspiring us to think outside the box and to consider multiple perspectives in forming a comprehensive and objective critique.

Contents

1 Introduction
  1.1 The Multiple Object Tracking Problem
  1.2 Introduction to MOT algorithms
  1.3 Labelling tool for Multiple Object Tracking
  1.4 Objective
  1.5 Thesis outline

2 Contrastive Learning & Object Detection
  2.1 Contrastive Learning
    2.1.1 Self-Supervised Representation Learning
    2.1.2 Contrastive Representation Learning
      2.1.2.1 Framework of contrastive learning
      2.1.2.2 SimCLR
      2.1.2.3 MoCo
      2.1.2.4 SwAV
      2.1.2.5 Barlow Twins
  2.2 Object Detection
    2.2.1 Two-stage Detectors
    2.2.2 One-stage Detectors

3 Related Work
  3.1 DeepSort
  3.2 JDE
  3.3 FairMOT
  3.4 CSTrack
  3.5 MOT Metric
    3.5.1 Classical metrics
    3.5.2 CLEAR MOT metrics
      3.5.2.1 ID scores

4 Proposed Algorithm
  4.1 Proposed architecture and algorithm
  4.2 Proposed augmentation

5 Experiment
  5.1 Overview
    5.1.1 Experimental environment
    5.1.2 Datasets and Metrics
    5.1.3 Ablative Studies
      5.1.3.1 Robustness of the proposed augmentation
      5.1.3.2 Results of different contrastive objectives
      5.1.3.3 Results of different batch sizes on SimCLR
      5.1.3.4 Results on MOTChallenge

6 Analysis of the annotation tool
  6.1 System objectives
  6.2 An analysis of system objectives
  6.3 System requirements
  6.4 Use case diagram

7 Design of the annotation tool
  7.1 Overall architecture
  7.2 Deployment
  7.3 Entity relationship
    7.3.1 Entity relationship diagram
    7.3.2 Entities explanation
    7.3.3 Deletion mechanism

8 Used technologies of the annotation tool
  8.1 Front-end module
    8.1.1 Angular
    8.1.2 Ag-grid
  8.2 Back-end module
    8.2.1 Django
    8.2.2 REST framework
    8.2.3 Simple JWT
    8.2.4 OpenCV
    8.2.5 Request
    8.2.6 Django cleanup
  8.3 Database module
    8.3.1 PostgreSQL
  8.4 File-storage module
    8.4.1 S3 service
  8.5 Artificial-intelligence module
    8.5.1 Google Colaboratory
  8.6 Docker

9 Implementation of the annotation tool
  9.1 Front-end module
    9.1.1 Boxes rendering problem
    9.1.2 Implementation of dynamic annotations
    9.1.3 Implementation of the custom event managers
    9.1.4 Implementation of the drawing new annotation feature
    9.1.5 Implementation of the interpolating feature
    9.1.6 Implementation of the filtering feature
  9.2 Back-end module
    9.2.1 Implementation of the multiple objects tracking feature
    9.2.2 Implementation of the single object tracking feature
    9.2.3 Implementation of commission related features

10 System evaluation
  10.1 Evaluation
    10.1.1 Strengths
    10.1.2 Weaknesses
  10.2 Comparison to CVAT
  10.3 Development strategy
  10.4 Contribution

11 Conclusion
  11.1 Achievement
  11.2 Future Work
List of Tables

5.1 Training data statistics
5.2 Embedding performance of different augmentations
5.3 IDF1 and MOTA with and without our augmentation, best performance shown in bold
5.4 Tracking accuracy of different contrastive methods as well as the supervised method
5.5 True positive rate at different false accepted rates of different embedding and supervised methods
5.6 Performance of SimCLR at different batch sizes
5.7 Comparison of our method to the state-of-the-art methods
10.1 Comparison with CVAT

List of Figures

1.1 An illustration of the output of a MOT algorithm [1]
1.2 Some applications of MOT
1.3 Workflow of MOT
2.1 Self-supervised approach of BERT [39]
2.2 Self-supervised learning by rotating the entire input images [45]
2.3 Positive and negative sampling pipeline of contrastive learning
2.4 Accuracy at different random resized crops
2.5 Performance of [51] at different batch sizes
2.6 A framework of SimCLR [51]
2.7 SimCLR [51] algorithm
2.8 MoCo [53] algorithm
2.9 A framework of Barlow Twins [54]
2.10 The architecture of R-CNN [61]
2.11 The architecture of Fast R-CNN [62]
2.12 An illustration of the Faster R-CNN model [56]
2.13 Workflow of YOLO [57]
2.14 Model architecture of SSD [58]
2.15 FPN fusion strategy [36]
2.16 Comparison between CenterNet and anchor-based methods
3.1 Overview of the CNN architecture used to extract embedding features in DeepSort [6]
3.2 DeepSort [6] matching algorithm
3.3 Architecture of the model used in JDE [7]
3.4 Feature sampling strategy of FairMOT
3.5 FairMOT architecture
3.6 Diagram of the cross-correlation network [9]
3.7 The details of the scale-aware attention network [9]
4.1 Proposed joint detection and contrastive learning
4.2 Copy-paste augmentation
4.3 Some mask image results from running Faster R-CNN on CrowdHuman
4.4 Mosaic augmentation
4.5 Some augmentations from the CrowdHuman and MOT17 datasets
5.1 Visualization of the discriminative ability of the embedding features of different contrastive methods
5.2 Example tracking results of our method on the test set of MOT17
6.1 The use case diagram of our annotation tool
7.1 The overall architecture of the annotation tool
7.2 An overall view of the deployment of the annotation tool
7.3 The deployment of the front-end module, the back-end module and the database module
7.4 The deployment of the file-storage module and the artificial-intelligence module
7.5 The database schema of the annotation tool
9.1 The automata of a dynamic annotation
10.1 The CVAT tool

Chapter 1

Introduction

1.1 The Multiple Object Tracking Problem

Object tracking is the task of estimating the trajectories of one or several objects as they move around a scene, usually captured by a camera. To follow (track) objects through time, we usually rely on appearance features (shape, color, texture, size, deep features, etc.) and motion patterns (velocity, position, etc.). Object tracking can be categorized into single object tracking (SOT) and multiple object tracking (MOT), based on the number of targets being tracked. In single object tracking, a target of interest is usually given in the first frame, and the objective is to follow it throughout the video sequence. Multiple object tracking, on the other hand, aims to estimate the trajectories of multiple objects, which belong to one or more categories such as pedestrians, cars, or animals, throughout a sequence (usually a video). Unlike object detection, whose output is a collection of rectangular bounding boxes identified by their positions, heights, and widths, MOT algorithms also associate a target ID with each box. An example of the output of an MOT algorithm is illustrated in Fig. 1.1.

Figure 1.1: An illustration of the output of a MOT algorithm [1]

MOT is an open and attractive research field. It has gained considerable attention due to its potential in many disciplines (figure 1.2 shows some applications of MOT), ranging from surveillance to biology. Some applications of MOT are:

• Surveillance systems: They are deployed in cities to identify individuals or vehicles of interest for traffic monitoring.
• Tracking body parts: It allows humans to interact with computers via gestures [2].
• Tracking players in sports: It helps to analyze and interpret the game [3].
• Biomedical field: MOT is used to analyze the proliferative responses of stem cells for drug discovery [4].

Figure 1.2: Some applications of MOT.
Top left: human-computer interaction [2]; top right: sports analysis [3]; bottom left: surveillance [70]; bottom right: biology [4]

Recently, thanks to progress in the deep learning (DL) field, especially convolutional neural networks (CNN) and transformer models, the MOT problem has seen great progress. In particular, most deep learning-based MOT approaches perform much better than traditional methods, and almost all current state-of-the-art MOT algorithms [7][8][9] are deep-learning-based. The strength of deep learning is its ability to automatically learn representations and extract complex and abstract features from the input. Since the development of AlexNet [5], deep learning has dominated the field; nearly all new MOT methods are deep-learning-based and, thanks to that, achieve breakthrough results.

Despite the power of deep learning, MOT remains a challenging task. The main difficulties in MOT lie in the two following problems:

• Object detection: This is one of the fundamental problems of computer vision and usually plays a big role in MOT algorithms. To achieve good results, many MOT algorithms require accurately predicted bounding boxes from frame to frame, which is very challenging because of changes in pose, appearance, environment, etc. throughout the frames.

• Occlusions: We need to correctly re-identify an occluded object when it becomes visible again. This is also very challenging because of appearance and position changes after a long time. Moreover, multiple objects in the same frame can look quite similar, while the same object in different frames can change its appearance rapidly. This makes identifying objects from frame to frame difficult.

Another disadvantage of deep learning-based MOT is that it usually requires a large amount of labeled data to work. Since the growth of the Internet, obtaining raw data is not too hard, but hand-labeling that data is by far too expensive. To overcome this disadvantage, various unsupervised methods have been proposed to leverage such large amounts of unlabeled data; BERT [39] in NLP and SimCLR [51] in computer vision are excellent examples. Inspired by them, in this thesis we apply such unsupervised learning techniques to MOT. Although our method is not fully unsupervised, it can significantly reduce the labeling effort while remaining competitive with the current state-of-the-art methods.

1.2 Introduction to MOT algorithms

Most MOT algorithms today are based on the tracking-by-detection mechanism. In detail, a set of detections (bounding boxes around the objects of interest) is extracted from the video, frame by frame. After that, bounding boxes surrounding the same target (in different frames) are associated by assigning them a common ID. This step is supported by data association algorithms. Many MOT methods follow this mechanism [6][28][7][8][9].

Nowadays, modern detection frameworks [21][22][23][24] can ensure good detection quality. Therefore, the majority of current MOT methods focus on improving the association step. Moreover, many MOT datasets and challenges provide standard sets of detection results, and algorithms are benchmarked on how well they associate those detections.

Despite the huge variety of approaches to the MOT problem, the vast majority of them can be divided into the three following stages (summarized in Fig.
1.3), which may be merged, shared, or performed separately:

• Detection stage: An object detection algorithm analyzes each input frame and returns the list of bounding boxes belonging to the objects we want to track. In the context of MOT, a bounding box is called a "detection".

• Feature extraction stage: Each detection from the previous stage, as well as the tracklets (lists of detections belonging to the same objects) from previous frames, is represented, usually by a vector, through a feature extraction algorithm.

• Association stage: Based on the detections and the extracted features from the previous stages, tracklets (from the previous frame) are assigned to detections to form a set of new tracklets. A commonly used association algorithm builds a similarity matrix between every pair of tracklets and detections, based on the similarity of the extracted features and on the positions of the tracklets and detections. After that, a matching algorithm such as the Hungarian algorithm [25] is used to form the association (a minimal sketch of this matching step is shown later in this section). Some newer methods [26][27] use a Graph Neural Network (GNN) to learn the association step instead of using a heuristic algorithm like the one above.

Figure 1.3: Usual workflow of an MOT algorithm: given raw frames (1), an object detector is run to obtain the bounding boxes of the objects (2). Then, for every detected object, different features are computed (3). After that, an association step builds the similarity matrix based on the results of the two previous steps (4). Finally, an association algorithm runs on that matrix to assign a numerical ID to each object (5). Figure taken from [1].

Some early works [6][28] perform these three stages separately, which is very time-consuming. To overcome this drawback, newer methods [7][8] propose merging the first two stages to create a good tracker that can run in real time. Other recent works, such as [26][27], view MOT as a graph-matching problem. For example, [26] builds a graph whose vertices are a set of detections in multiple frames and treats the MOT problem as an edge classification problem, while [27] uses a GNN to build the affinity matrix instead of using a heuristic algorithm.

Some works go beyond the three-step guideline. For example, [11] modifies CenterNet to take two adjacent frames as input and return both the bounding boxes and the offset between two corresponding bounding boxes. Generally speaking, [11] belongs to the family of methods that train on two or more frames instead of just one. Such methods, like [11] and [12], are difficult and tricky to train, while their performance is not as good as that of the three-step ones.

Recently, following the success of applying transformer models to computer vision tasks such as classification [13] and detection [14], the MOT community has started using them [15][16]. These methods are also trained with multiple-frame input (usually two frames). Their idea is to take the detection results of the previous frame d_{t-1} and combine them with learned object positional embeddings (similar to the decoder input of [14]) to detect new objects, match them with the previous frames, initiate new tracks, and remove occluded or out-of-frame tracks.

On the other hand, although all the mentioned methods take different approaches, they all require the same kind of labeled data to train: the bounding boxes of the objects in each frame and the IDs assigned to those boxes.
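The following is a minimal, illustrative sketch of the association stage described above, assuming that appearance embeddings and boxes are already available for the current detections and the existing tracklets. The function names and the way the two cost terms are weighted are our own choices for illustration, not the formulation of any particular tracker.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_embs, track_boxes, det_embs, det_boxes,
              appearance_weight=0.7, max_cost=0.8):
    """Match existing tracklets to new detections with the Hungarian algorithm.

    Each cost entry mixes appearance dissimilarity (cosine) and box overlap.
    Returns a list of (track_index, detection_index) matches.
    """
    n_t, n_d = len(track_boxes), len(det_boxes)
    cost = np.zeros((n_t, n_d))
    for i in range(n_t):
        for j in range(n_d):
            # Cosine similarity between the appearance embedding vectors.
            norm = np.linalg.norm(track_embs[i]) * np.linalg.norm(det_embs[j]) + 1e-9
            app_sim = float(np.dot(track_embs[i], det_embs[j]) / norm)
            motion_sim = iou(track_boxes[i], det_boxes[j])
            sim = appearance_weight * app_sim + (1 - appearance_weight) * motion_sim
            cost[i, j] = 1.0 - sim
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm [25]
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_cost]
```

Unmatched detections would then start new tracklets, and tracklets left unmatched for several frames would be terminated; those bookkeeping steps are omitted here.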
In this thesis, we propose a new MOT algorithm that requires only bounding box annotations to train and is competitive with the current state-of-the-art algorithms.

1.3 Labelling tool for Multiple Object Tracking

In this thesis, we also expect our work to be rich academically and accessible to a broad audience. Our potential end-users include other computer vision teams researching this topic, deep learning scientists who need high-quality training data, or even commercial users with particular purposes such as supervision or security. Not all of them are interested in our model itself or capable of setting it up. This fact leads to the need for a user interface for the model, which hides the complexity of the model and allows end-users to make use of it. However, currently available tools are far more geared toward single object tracking models and have little or no compatibility with multiple object tracking models. This is the reason for us to implement a new annotation tool. Moreover, our tool will satisfy the following requirements:

• Utilise the ability of our multiple object tracking model;
• Provide features that enhance the user experience;
• Be specialized for pedestrian data;
• Produce accurate results and provide suitable means for controlling accuracy.

1.4 Objective

Intending to solve the data problem in MOT, our main contributions in this thesis are:

• Propose and implement a novel data-efficient algorithm for solving the MOT problem, which will be described in chapter 4;
• Develop a web-based application for embedding our model and providing convenient functions compared to available tools (described in chapter 9).

1.5 Thesis outline

The thesis is organized as follows:

• In chapter 1, we introduce what MOT is, its applications, and what the standard approach looks like. We also point out the current limitation in the labeling requirement, which inspires us to create a novel algorithm and a new annotation tool. Finally, we summarize the objectives of our thesis.
• In chapter 2, we introduce contrastive learning and object detection, the two core components of our proposed algorithm.
• In chapter 3, we survey the current methods and how we leverage them in our proposed MOT algorithm.
• In chapter 4, we present our approach, which requires much fewer annotations.
• In chapter 5, we perform experiments and ablation studies on our proposed algorithm and compare it with the current state of the art.
• In chapter 6, we analyze the system in terms of functionality and point out functional and non-functional requirements.
• In chapter 7, we describe the overall architecture of the system: its modules and their interactions.
• In chapter 8, we list the technologies we use and explain why they are used.
• In chapter 9, we explain in detail the implementation of the core functions on both the front-end and back-end sides.
• In chapter 10, we compare our tool with an available tool and then point out its strengths, weaknesses, development strategy, and contribution.
• In chapter 11, we summarize our work and state some future directions.

Chapter 2

Contrastive Learning & Object Detection

In this chapter, we briefly introduce and discuss contrastive learning and object detection algorithms, as they form the core components of our proposed data-efficient MOT method. In general, we use object detection to detect all objects that need to be tracked in every frame.
Meanwhile, contrastive learning is used to extract features from those objects without deciding which ID each object belongs to, which saves a lot of labeling time.

2.1 Contrastive Learning

2.1.1 Self-Supervised Representation Learning

Given a task and enough labels, supervised learning with deep learning algorithms can solve the task very well. Usually, deep learning requires a decent amount of labels to achieve good performance. However, collecting manually labeled data is expensive and does not scale. On the other hand, the amount of unlabeled data (e.g., free text, all the images on the Internet) is substantially larger than the limited number of human-created labeled datasets. It would therefore be a huge waste if this unlabeled data were discarded. As a result, deriving useful information or representations from it (i.e., unsupervised learning) is an attractive research field.

To leverage the massive amount of unlabeled data, the idea is to get labels for free and train on unlabeled datasets in a supervised manner. For example, we can mask some parts of the input (a sequence of words, a part of an image) and then force the model to predict or reconstruct the masked part. This approach is called self-supervised representation learning.

Figure 2.1: Self-supervised approach of BERT [39]

Recently, the task of learning representations from data without supervision has achieved impressive results, both in computer vision and natural language processing. In natural language processing, BERT [39], introduced by Google in 2018, was a breakthrough. By learning representations from large-scale unlabeled text crawled from the Internet (via two auxiliary tasks: masked word prediction and next-sentence prediction), and then fine-tuning on a small labeled dataset, it achieved state-of-the-art (SOTA) results, beating the previous SOTA by 7% on the GLUE benchmark. The approach of BERT (unsupervised pre-training on unlabeled data and then fine-tuning on labeled data) has become the standard approach in the NLP field [40][42][41]. In 2020, GPT-3 [43] by OpenAI, by scaling a transformer-based model [44] up to 175 billion parameters, massively crawling text from the Internet, and training with the task of next-token prediction, achieved impressive performance on a wide range of tasks (question answering, translation, reading comprehension, etc.) without fine-tuning (i.e., zero-shot learning) on those tasks.

In the computer vision field, many ideas have been proposed for self-supervised representation learning. The common workflow is to train a model on one or multiple pretext tasks with unlabeled images and then use an intermediate feature layer of this model to feed a logistic regression classifier on ImageNet classification. This procedure is called linear evaluation. The final classification accuracy quantifies how good the learned representation is. Many pretext tasks have been proposed to tackle the problem of learning representations from unlabeled data. [45] uses rotation prediction as a pretext task: each input image is first rotated by a random multiple of 90°, corresponding to [0°, 90°, 180°, 270°], and the model is trained to predict which rotation has been applied, i.e., a 4-class classification problem. By learning to predict the rotation, the model has to learn to recognize high-level parts of objects, such as heads, noses, and eyes, and the relative positions of these parts, rather than local patterns.
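To make this pretext task concrete, the following is a minimal sketch of how the "free" rotation labels can be generated, assuming PyTorch is available. It is an illustration of the idea in [45], not the authors' original implementation.

```python
import torch

def make_rotation_batch(images: torch.Tensor):
    """Turn a batch of images (N, C, H, W) into a 4x larger batch of rotated
    copies plus pseudo-labels {0, 1, 2, 3} for 0°, 90°, 180°, 270°.
    No human annotation is needed: the labels come from the transform itself."""
    rotated, labels = [], []
    for k in range(4):  # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Usage: feed (x, y) to any image classifier with 4 output classes and
# train it with the usual cross-entropy loss.
x, y = make_rotation_batch(torch.randn(8, 3, 224, 224))
```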
In this way, the model is forced to learn semantic concepts of objects.

Figure 2.2: Self-supervised learning by rotating the entire input images [45]

Some other pretext tasks are colorization [46] (predict the image colors in the CIELab* space), inpainting [47] (mask a part of an image and train a model to reconstruct the masked part), jigsaw puzzles [48] (divide an image into multiple patches, shuffle them, and predict the original order of the patches), etc. Although they achieve some promising results, the performance of these methods on the ImageNet dataset, when feeding the intermediate feature layer to a logistic regression classifier, is below 60%, compared to over 80% for the supervised ones. Thus, self-supervised learning on images seems to be harder than self-supervised learning on text in NLP.

Recently, since the introduction of contrastive learning and the InfoNCE loss function to the self-supervised problem in [49], the task of self-supervised representation learning on images has been revolutionized and achieves impressive results. Nowadays, most new methods for self-supervised learning in the vision domain use a contrastive learning framework. The framework of contrastive learning and the methods built on it are discussed in the next section.

2.1.2 Contrastive Representation Learning

The main idea of contrastive learning is to learn representations such that similar samples stay close to each other, while dissimilar ones are far apart. Contrastive learning can be applied in both supervised and unsupervised settings and has been shown to achieve good performance on a variety of vision and language tasks. In unsupervised settings, contrastive learning is one of the most powerful approaches to self-supervised learning. In the vision domain, contrastive learning is the framework used in nearly every state-of-the-art method.

2.1.2.1 Framework of contrastive learning

As stated before, the goal of contrastive learning is to learn representations such that similar samples stay close and dissimilar ones are far apart. We can loosen the definition of "classes" and "labels" in supervised learning to create positive and negative sample pairs out of unsupervised data. These are the key ingredients of contrastive learning. To create positive pairs, the common approach is to apply data augmentation to create different views of the original samples. The negative pairs can be obtained by random sampling from the original dataset. The processes of sampling positive pairs and negative pairs are illustrated in figure 2.3.

Figure 2.3: Positive and negative sampling pipeline of contrastive learning

Mathematically speaking, given a sample x, a positive distribution p_pos (i.e., custom random augmentations) and a negative distribution p_neg (usually uniform sampling from the dataset), the objective of contrastive learning is to make the representations of x and x^+ ~ p_pos(·|x) close, while maximizing the distance to x^- ~ p_neg(·|x). The common loss function used to achieve this objective is the contrastive (InfoNCE) loss:

\mathcal{L}_{\mathrm{contrast}} = \mathbb{E}_{x \sim p_{\mathrm{data}},\; x', x^{+} \sim p_{\mathrm{pos}}(\cdot \mid x),\; \{x_{i}^{-}\}_{i=1}^{M} \sim p_{\mathrm{neg}}(\cdot \mid x)} \left[ -\log \frac{\exp\left(f(x')^{\top} f(x^{+}) / \tau\right)}{\exp\left(f(x')^{\top} f(x^{+}) / \tau\right) + \sum_{i=1}^{M} \exp\left(f(x')^{\top} f(x_{i}^{-}) / \tau\right)} \right]   (2.1)

where f is the encoder network used to extract useful representations from images and τ is a temperature hyperparameter. In practice, the two key ingredients determining the success of contrastive learning are the data augmentation and the batch size (i.e., the number of negative samples).
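As an illustration, the following is a minimal sketch of a batched version of the loss in Eq. (2.1), in the style used by SimCLR-like methods, where the other images in the batch serve as the negatives. It is a simplified sketch under our own assumptions, not the exact formulation of any specific paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    """Batched InfoNCE-style loss corresponding to Eq. (2.1).

    z1, z2: (N, D) embeddings f(x') and f(x+) of two augmented views of the
    same N images. For row i, the matching row in the other view is the
    positive; the remaining 2N - 2 embeddings in the batch act as negatives.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.t() / tau                               # cosine similarity / temperature
    n = z1.size(0)
    # Mask out self-similarity so a sample is never its own negative.
    sim.fill_diagonal_(float('-inf'))
    # Index of the positive for each row: row i pairs with row i + n (and vice versa).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Usage with a hypothetical encoder `f` and augmentation `augment`:
# z1, z2 = f(augment(images)), f(augment(images))
# loss = contrastive_loss(z1, z2)
```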
The augmentation needs to be "strong" enough to create "hard" positive pairs. However, it should not be too strong either; otherwise, we may get noisy positive pairs. [50] experimented with different strengths of augmentation and their effect on the quality of the learned representation, and found that in most cases the performance curve has a reversed-U shape. In other words, neither too "weak" nor too "strong" an augmentation is good. Figure 2.4 shows the linear evaluation performance obtained when experimenting with different sizes of the random-crop augmentation.
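To illustrate how augmentation strength can be exposed as a single knob, the following is a sketch of a contrastive augmentation pipeline with a tunable strength parameter, assuming torchvision is available. The particular transforms and the way `strength` scales them are our own illustrative choices, loosely following SimCLR-style pipelines rather than the exact recipe of any cited method.

```python
import torchvision.transforms as T

def contrastive_augment(size: int = 224, strength: float = 1.0) -> T.Compose:
    """Build an augmentation pipeline whose 'hardness' scales with `strength`.

    strength near 0 gives nearly identical views (too easy); very large values
    give heavily distorted views (noisy positives). Sweeping this value is one
    way to probe the reversed-U behaviour discussed above.
    """
    return T.Compose([
        # A smaller minimum crop scale means more aggressive cropping.
        T.RandomResizedCrop(size, scale=(max(0.05, 1.0 - 0.9 * strength), 1.0)),
        T.RandomHorizontalFlip(),
        T.RandomApply([T.ColorJitter(0.8 * strength, 0.8 * strength,
                                     0.8 * strength,
                                     min(0.5, 0.2 * strength))], p=0.8),
        T.RandomGrayscale(p=min(1.0, 0.2 * strength)),
        T.ToTensor(),
    ])

# Two independently augmented views of the same PIL image form a positive pair:
# aug = contrastive_augment(strength=1.0)
# view1, view2 = aug(img), aug(img)
```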