MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Khong Van Minh

COMBINATION OF APPEARANCE AND MOTION INFORMATION IN HUMAN ACTION REPRESENTATION USING CONVOLUTIONAL NEURAL NETWORK
(Kết hợp đặc trưng diện mạo và chuyển động trong biểu diễn hoạt động của người sử dụng mạng nơ ron tích chập)

Field of study: Information Systems
Master's thesis in Information Systems
Supervisor: PhD. Tran Thi Thanh Hai

Hanoi, 2018

SĐH.QT9.BM11

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

CONFIRMATION OF MASTER'S THESIS REVISION

Author of the thesis: Khong Van Minh
Thesis title: Combination of appearance and motion information in human action representation using convolutional neural network
Major: Information Systems
Student ID: CBC17021

The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting dated … with the following contents:
…

Date: …
Supervisor          Chairman of the committee          Author of the thesis

Abstract

In this thesis, I focus on solving the action recognition problem in video, that is, in a stack of consecutive frames. This problem plays an important role in surveillance systems, which are very common nowadays. There are two main families of solutions: hand-crafted features and features learned by deep learning. Both have pros and cons, and the solution I study belongs to the second category. Recently, advanced techniques relying on convolutional neural networks have produced impressive improvements over traditional techniques based on hand-crafted features. Besides, the literature also shows that using different streams of data helps to increase recognition performance. This thesis proposes a method that exploits both RGB and optical flow for human action recognition. Specifically, we deploy a two-stream convolutional neural network that takes as inputs the RGB stream and the optical flow computed from it. Each stream has the architecture of an existing 3D convolutional neural network (C3D), which has been shown to be compact but efficient for the task of action recognition from video. Each stream works independently; the two are then combined by early fusion or late fusion to output the recognition results. We show that the proposed two-stream 3D convolutional neural network (two-stream C3D) outperforms the one-stream C3D on the benchmark datasets UCF101 (from 82.79% to 89.11%) and HMDB51 (from 45.71% to 60.87%), as well as on CMDFALL (from 65.35% to 71.77%).

Acknowledgments

Firstly, I would like to express my deep gratitude to my supervisor, PhD. Tran Thi Thanh Hai, for supporting my research direction, which allowed me to explore new ideas in the field of computer vision and machine learning. I thank her for her supervision, encouragement, motivation, and support; her guidance helped me throughout the research work and the writing of this thesis.
I would like to acknowledge the International Research Institute MICA, HUST, for providing me with a great research environment. I wish to express my gratitude to the teachers of the Computer Vision department of MICA for giving me the opportunity to work and acquire great research experience. I would like to acknowledge the School of Information and Communication Technology for providing me with knowledge and the opportunity to study. I would like to thank my friends for supporting me in my studies. Last but not least, I would like to convey my deepest gratitude to my family for their support and sacrifices during my studies.

Contents

1 Introduction to Human Action Recognition
  1.1 Human Action Recognition problem
  1.2 Overview of human action recognition approaches
    1.2.1 Hand-crafted feature based methods
    1.2.2 Deep learning based methods
    1.2.3 Purpose of thesis
2 State-of-the-art on HAR using CNN
  2.1 Introduction to Convolutional Neural Networks
  2.2 2D Convolutional Neural Networks
  2.3 3D Convolutional Neural Networks
  2.4 Multistream Convolutional Neural Networks
3 Proposed method for HAR using multistream C3D
  3.1 General framework
  3.2 RGB stream
  3.3 Optical Flow stream
  3.4 Fusion of multistream 3D CNN
    3.4.1 Early fusion
    3.4.2 Late fusion
4 Experimental Results
  4.1 Datasets
    4.1.1 UCF101 dataset
    4.1.2 HMDB51 dataset
    4.1.3 CMDFALL dataset
  4.2 Experiment setup
  4.3 Single stream
  4.4 Multiple stream
5 Conclusion
  5.1 Pros and Cons
  5.2 Discussion

List of Figures

1-1 Human Action Recognition problem
1-2 Human Action Recognition phases
1-3 Hand-crafted feature based method for Human Action Recognition
1-4 Deep learning method for the Human Action Recognition problem
2-1 Main layers in Convolutional Neural Networks
2-2 Fusion techniques used in [1]
2-3 3D convolution operator
2-4 Two-stream architecture for Human Action Recognition in [2]
3-1 General framework for human action recognition
3-2 Early fusion method by concatenating two L2-normalized feature vectors
3-3 Late fusion by averaging class scores
4-1 The class labels in the UCF101 dataset
4-2 The class labels in the HMDB51 dataset
4-3 Experiment steps for each dataset
4-4 The steps of using C3D for the experiments
4-5 C3D clip and video prediction
4-6 Confusion matrix of the two-stream network on UCF101
4-7 Confusion matrix of the two-stream network on HMDB51
4-8 Confusion matrix of the two-stream network on CMDFALL
4-9 In HMDB51, the most confused action in the RGB stream is swing baseball; 60% of its videos are confused with throw
4-10 Classes of UCF101 that benefit most from the combination, compared to the RGB stream
4-11 Classes of HMDB51 that benefit most from the combination, compared to the RGB stream
4-12 Classes of CMDFALL that benefit most from the combination, compared to the RGB stream
4-13 Classes of UCF101 in which the RGB stream performs better
4-14 Classes of UCF101 in which the Flow stream performs better
4-15 Classes of HMDB51 in which the RGB stream performs better
4-16 Classes of HMDB51 in which the Flow stream performs better
4-17 Classes of CMDFALL in which the RGB stream performs better
4-18 Classes of CMDFALL in which the Flow stream performs better

Acronyms

3DCNN: 3D Convolutional Neural Networks
CNN: Convolutional Neural Networks
HAR: Human Action Recognition
HOG: Histogram of Gradients
MBH: Motion Boundary Histograms
SIFT: Scale-Invariant Feature Transform

List of Tables

2.1 Results of fusion techniques on the 200,000 videos of the Sport1M test set. Hit@k indicates the fraction of test samples that contained at least one of the ground-truth labels in the top k predictions [1]
2.2 C3D results on different tasks
2.3 Two-stream architecture mean accuracy (%) on the UCF101 and HMDB51 datasets
4.1 Class tree of the CMDFALL dataset
4.2 Accuracy of action recognition with single- and multiple-stream C3D (%)
4.3 Comparison results on two popular benchmark datasets (%)

Chapter 1
Introduction to Human Action Recognition

1.1 Human Action Recognition problem

Human action recognition is an important topic in the computer vision domain. It has many applications, such as surveillance systems in hospitals, abnormal activity detection in buildings (banks, airports, hotels), or human-machine interaction. There are various types of human activities. Depending on their complexity, we can categorize human activities into four levels: gestures, actions, interactions, and group activities.

∙ Gestures are elementary movements of a person's body part, and are the atomic components describing the meaningful motion of a person. Examples: "stretching an arm", "raising a leg", ...
∙ Actions are single-person activities that may be composed of multiple gestures organized temporally, such as "walking", "waving", and "punching".

∙ Interactions are human activities that involve two or more persons and/or objects. For example, "two persons fighting" is an interaction between two humans, while "drinking water" is an interaction between a human and an object.

∙ Group activities are the activities performed by conceptual groups composed of multiple persons and/or objects. Example: "a group of persons marching", ...

Figure 1-1: Human Action Recognition problem

In this thesis, we focus on human action recognition. The problem can be defined as follows.

∙ Input: a video or a sequence of consecutive frames that contains a human action.

∙ Output: the label of the action, which belongs to one of the predefined classes.

Human action recognition is a challenge for researchers in the computer vision domain because of noisy backgrounds, viewpoint changes, and the variety in how each person performs an action. Figure 1-1 illustrates the human action recognition problem.

Key components of a visual recognition system

Figure 1-2 illustrates the two phases of a recognition system.

∙ Training: learn from the training dataset to obtain the parameters of the recognition model.

∙ Recognition: use the model learned in the training phase to recognize new data.

Figure 1-2: Human Action Recognition phases

Each phase of the system has the following main components:

∙ Data preprocessing: convert the data to a form that is compatible with the model.

∙ Feature extraction: from the preprocessed data, extract features suitable for representing the human action. The features can be obtained by hand-crafted or deep learning techniques.

∙ Classification: use the features extracted in the previous step as input for training a classifier or for prediction.

∙ Recognition: new data goes through the preprocessing and feature extraction steps, then the trained classifier predicts its label.

Figure 1-3: Hand-crafted feature based method for Human Action Recognition

1.2 Overview of human action recognition approaches

1.2.1 Hand-crafted feature based methods

In this approach, human actions are represented by features manually designed by experienced researchers. Once extracted, the features are input to a generic trainable classifier for action recognition. The building blocks of the hand-crafted feature based approach are illustrated in Figure 1-3:

∙ Feature extraction: takes image or video pixels as input and outputs the features for that image or video.

∙ Classification: a classifier that takes the features as input and outputs a class label.

There are many types of hand-crafted features designed by experts to solve the human action recognition problem. Many classical image features have been generalized to videos, e.g., 3D-SIFT and HOG3D. Among local space-time features, dense trajectories have been shown to perform best on a variety of datasets. The main idea is to densely sample feature points in each frame and track them through the video based on optical flow. Multiple descriptors are computed along the trajectories of the feature points to capture shape, appearance, and motion information. Motion boundary histograms (MBH) give the best results among these descriptors. The idea of dense trajectories was extended by Wang and Schmid [3], who improved performance by taking camera motion into account and achieved state-of-the-art results among hand-crafted features. Despite its good performance, this method is computationally intensive.
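Since both dense trajectories and the method proposed later in this thesis rely on optical flow, a minimal sketch of computing dense flow may be helpful. It assumes OpenCV and a hypothetical input file video.avi, and uses the Farneback estimator; the thesis does not prescribe this particular method, so treat it as an illustration only.

```python
import cv2

cap = cv2.VideoCapture("video.avi")  # hypothetical input video

ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense flow: one 2D displacement vector (dx, dy) per pixel.
    # Args: prev, next, flow, pyr_scale, levels, winsize,
    #       iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flows.append(flow)  # shape (H, W, 2)
    prev_gray = gray

cap.release()
print(f"computed {len(flows)} flow fields")
```

Tracking a point across frames, as dense trajectories do, amounts to repeatedly displacing its position by the flow vector at that position.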
Figure 1-4: Deep learning method for the Human Action Recognition problem

1.2.2 Deep learning based methods

On the other hand, a learning-based representation approach, specifically deep learning, uses computational models with multiple processing layers to learn multiple levels of abstraction from data. It encompasses a set of methods that enable the machine to process data in raw form and automatically transform it into a representation suitable for classification; this is what we call trainable feature extractors. The transformation is handled at different layers, which are learned from raw data using a general-purpose learning procedure and do not need to be designed manually by experts. The performance of human action recognition methods depends mainly on an appropriate and efficient representation of the data.

Recently, deep learning has achieved very good results on image-based tasks [4]. These results inspire researchers to extend it to video classification, especially to solve the human action recognition problem. To deal with video input, the authors in [1] use a 2D CNN on individual frames and explore the temporal information by fusing information over the temporal dimension through the network. In [5] and [6], the authors use a 3D convolution operator to learn the temporal information. In [2], the authors decompose video into a spatial and a temporal part. Deep learning methods require a large amount of training data to achieve good results. In [1], the authors construct a large-scale dataset named Sport1M, which consists of 1 million videos downloaded from YouTube, annotated with 487 classes. Features learned from this dataset can be generic enough to transfer to other datasets such as UCF101 [7].

1.2.3 Purpose of thesis

In this thesis, we propose to improve an existing 3D convolutional network, specifically the C3D network [6]. Although C3D itself was designed with 3D kernels where one dimension is temporal, the filter size is only 3 × 3 × 3, which seems unable to represent long-term variation. Therefore, instead of using only one RGB stream, we deploy two streams (RGB and optical flow). Each stream goes through an independent C3D network, and the two are then combined at the fully-connected or score level. We evaluate the proposed method on popular challenging benchmark datasets (UCF101 and HMDB51) and on a dataset built by MICA (CMDFALL), and show how the two-stream C3D outperforms the original one-stream C3D.

The thesis is organized as follows. Chapter 2 presents the state of the art on Human Action Recognition using CNNs. Chapter 3 describes our proposed method using a 3D convolutional neural network for action recognition with a two-stream architecture. Chapter 4 reports and analyses the results on UCF101, HMDB51, and CMDFALL. Chapter 5 concludes and gives ideas for future work.

Chapter 2
State-of-the-art on HAR using CNN

2.1 Introduction to Convolutional Neural Networks

Convolutional Neural Networks (CNN) are biologically-inspired variants of multilayer perceptrons. They have been very effective in areas such as image recognition and classification. There are four main types of layers used to build ConvNet architectures: the convolutional layer, the non-linearity layer, the pooling layer, and the fully-connected layer. We stack these layers to form a full ConvNet architecture.
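As a concrete illustration of stacking these four layer types, the following is a minimal sketch, assuming PyTorch; the layer sizes, filter counts, and the ten output classes are illustrative choices, not the network used in this thesis.

```python
import torch
import torch.nn as nn

# Minimal ConvNet: Conv -> ReLU -> Pool, twice, then a fully-connected classifier.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
    nn.ReLU(),                                    # non-linearity layer
    nn.MaxPool2d(kernel_size=2, stride=2),        # pooling layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # fully-connected layer, 10 classes
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(model(x).shape)           # torch.Size([1, 10]): one score per class
```

Each 2x2 pooling halves the spatial resolution, so a 32x32 input reaches the classifier as a 32-channel 8x8 volume. The individual layers are described in detail below.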
Figure 2-1: Main layers in Convolutional Neural Networks

Convolutional layer

The Conv layer is the core building block of a convolutional network. The Conv layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on the first layer of a ConvNet might have size 5x5x3 (5 pixels in width and height, and 3 channels, the number of channels of an RGB image). During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at every position. As we slide the filter over the width and height of the input volume, we produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature, such as an edge of some orientation or a blotch of some color on the first layer. Each Conv layer has an entire set of filters, and each of them produces a separate 2-dimensional activation map. We stack these activation maps along the depth dimension to produce the output volume.

Non-linearity layer (ReLU)

An additional operation called ReLU is used after every convolution operation. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by: Output = max(0, Input). ReLU is an element-wise operation (applied per pixel) that replaces all negative values in the feature map by zero. The purpose of ReLU is to introduce non-linearity into the ConvNet, since most of the real-world data we would want the ConvNet to learn is non-linear. Other non-linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.

Pooling layer

The pooling layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice of the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation in this case takes a max over 4 numbers (a 2x2 region in some depth slice). The depth dimension remains unchanged.

Fully-connected layer

Neurons in a fully-connected layer have full connections to all activations in the previous layer, as in regular neural networks. Their activations can hence be computed as a matrix multiplication followed by a bias offset.
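A quick numeric check of the ReLU and max-pooling operations described above, again assuming PyTorch; the 4x4 single-channel input is arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[ 1., -2.,  3., -4.],
                    [-5.,  6., -7.,  8.],
                    [ 9., -1.,  2., -3.],
                    [-6.,  5., -8.,  7.]]]])  # shape (N=1, C=1, H=4, W=4)

r = F.relu(x)                                 # negatives clamped to zero, element-wise
p = F.max_pool2d(r, kernel_size=2, stride=2)  # one max per 2x2 region

print(p)  # tensor([[[[6., 8.], [9., 7.]]]]): 4x4 -> 2x2, 75% of activations discarded
```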
In this thesis, we focus on presenting related work on action recognition using CNN techniques. We categorize it into three groups: methods based on 2D convolutional neural networks, methods based on 3D convolutional neural networks, and methods using multiple streams.

2.2 2D Convolutional Neural Networks

Figure 2-2: Fusion techniques used in [1]

Recently, 2D convnets have successfully obtained very good results on image-based tasks [4]. Encouraged by these results, the authors in [1] study multiple approaches for extending CNNs to video input. As a baseline, they use a 2D CNN model operating on single frames to evaluate how much static appearance information contributes to the classification accuracy. To learn the information lying in the temporal domain and study how it influences performance, they use the fusion techniques shown in Figure 2-2:

∙ Early fusion: from the sequence of frames, they take T consecutive frames and fuse them immediately at the first convolutional layer, whose filters extend over time to size 11 × 11 × 3 × T. In this paper they use T = 10, which is approximately a third of a second of video.
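To make the early-fusion construction concrete, here is a minimal sketch, assuming PyTorch; the 170x170 crop size, 96 filters, and stride are illustrative guesses rather than the exact configuration of [1]. The T = 10 frames are stacked along the channel axis, so each first-layer filter effectively has size 11 × 11 × 3 × T.

```python
import torch
import torch.nn as nn

T = 10                                   # number of consecutive frames fused
frames = torch.randn(1, T, 3, 170, 170)  # (batch, T, RGB, H, W) clip
x = frames.reshape(1, T * 3, 170, 170)   # stack the frames along the channel axis

# Each of the 96 filters spans all 3*T input channels,
# i.e. it effectively has size 11 x 11 x 3 x T.
early_fusion_conv = nn.Conv2d(in_channels=3 * T, out_channels=96,
                              kernel_size=11, stride=3)

out = early_fusion_conv(x)
print(out.shape)  # torch.Size([1, 96, 54, 54])
```

Because all T frames enter the very first layer, temporal information is fused as early as possible, in contrast to late fusion, where separate towers process distant frames and are merged only at the fully-connected stage.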