Document: Detection of interesting events in movies using only the audio signal


DUBLIN CITY UNIVERSITY
SCHOOL OF ELECTRONIC ENGINEERING

Detection of Interesting Events in Movies using only the Audio signal

PHAM MINH LUAN NGUYEN

August 2009

MASTER OF ENGINEERING IN TELECOMMUNICATIONS

Supervised by Dr. Sean Marlow

Acknowledgements

I would like to thank my supervisor, Dr. Sean Marlow, for his extensive guidance, enthusiasm and commitment to this project. Thanks are also due to Dr. David Sadlier for providing movies and code, and to all other friends and colleagues for their contributions to this work.

Declaration

I hereby declare that, except where otherwise indicated, this document is entirely my own work and has not been submitted in whole or in part to any other university.

Signed: ......................................................................    Date: ...............................

Abstract

The rapid expansion of the movie industry is driving the need for efficient digital video indexing, browsing and playback systems. This report develops an automatic system that detects exciting events directly from the original movie using only the audio signal. Interesting events in movies are typically flagged by high audio amplitude, so detecting them from the audio amplitude is an efficient approach: it is fast, because audio features are computationally cheaper than visual features. The detected highlight events are then classified in order to evaluate the automatic system.
Contents

ACKNOWLEDGEMENTS
DECLARATION
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF GRAPHS
LIST OF TABLES
CHAPTER 1 - INTRODUCTION
1.1 RELATED WORK
1.1.1 Automatically Selecting Shots for Action Movie Trailers
1.1.2 Voice Processing for Automatic TV Sports Program Highlights Detection
1.1.3 Audio/visual analysis for high-speed TV advertisement detection from MPEG bitstream
1.2 EXCITING EVENT DETECTION IN MOVIE USING AUDIO SIGNAL
CHAPTER 2 – MPEG-1 AUDIO/VIDEO STANDARD
2.1 OVERVIEW
2.2 MPEG-1 LAYER 2 AUDIO
CHAPTER 3 – MOVIE HIGHLIGHT DETECTION
3.1 GETTING GROUND TRUTH
3.2 AUTOMATIC DETECTION
3.2.1 Getting Scale Factor
3.2.2 Audio amplitude threshold
CHAPTER 4 – RESULTS AND ANALYSIS
4.1 RESULTS
4.1.1 The average audio amplitude
4.1.2 The audio amplitude threshold time
4.1.3 Results and result tables
4.2 PRECISION AND RECALL
CHAPTER 5 - CONCLUSIONS AND FURTHER WORK
5.1 SYSTEM EVALUATION
5.2 FURTHER WORK
REFERENCES

List of Figures

FIGURE 2-1: ISO/MPEG-1 LAYER I/II ENCODER
FIGURE 2-2: STRUCTURE OF LAYER-II SUBBAND SAMPLES
FIGURE 2-3: THE DATA BITSTREAM STRUCTURE OF LAYER-II
FIGURE 3-1: MPEG-1 LAYER-II FREQUENCY SUBBANDS
FIGURE 3-2: VIDEO FRAME AUDIO LEVELS GENERATED FROM SCALEFACTORS CORRESPONDING TO TEMPORALLY ASSOCIATED AUDIO

List of Graphs

GRAPH 3-1: PER-FRAME AUDIO AMPLITUDE LEVEL FOR EXAMPLE MOVIE
GRAPH 3-2: PER-SECOND AUDIO AMPLITUDE LEVEL FOR EXAMPLE MOVIE
GRAPH 3-3: AUDIO AMPLITUDE PROFILE OF THE NIGHT AT THE MUSEUM 2
GRAPH 3-4: AUDIO AMPLITUDE DETECTION OF THE NIGHT AT THE MUSEUM 2
GRAPH 3-5: AUDIO AMPLITUDE DETECTION OF THE NIGHT AT THE MUSEUM 2 AND GROUND TRUTH (BLUE IS AUTOMATIC DETECTION. RED IS THE GROUND TRUTH)
GRAPH 3-6: AUDIO AMPLITUDE PROFILE OF THE KINGDOM
GRAPH 3-7: AUDIO AMPLITUDE DETECTION OF THE KINGDOM
GRAPH 3-8: AUDIO AMPLITUDE DETECTION OF THE KINGDOM AND GROUND TRUTH
GRAPH 3-9: AUDIO AMPLITUDE PROFILE OF THE LEGEND OF BUTCH AND SUNDANCE
GRAPH 3-10: AUDIO AMPLITUDE DETECTION OF THE LEGEND OF BUTCH AND SUNDANCE
GRAPH 3-11: COMPARE RESULT AUTOMATIC DETECTION AND GROUND TRUTH
GRAPH 3-12: AUDIO AMPLITUDE PROFILE (NIGHT AT THE MUSEUM 2 – ONE FRAME)
GRAPH 3-13: AUTOMATIC DETECTION AND GROUND TRUTH (NIGHT AT THE MUSEUM 2 – ONE FRAME)
GRAPH 3-14: AUDIO AMPLITUDE PROFILE (NIGHT AT THE MUSEUM 2 – TWO FRAMES)
GRAPH 3-15: AUTOMATIC DETECTION AND GROUND TRUTH (NIGHT AT THE MUSEUM 2 – TWO FRAMES)
GRAPH 3-16: AUDIO AMPLITUDE PROFILE (NIGHT AT THE MUSEUM 2 – TWO SECONDS)
GRAPH 3-17: AUTOMATIC DETECTION AND GROUND TRUTH (NIGHT AT THE MUSEUM 2 – TWO SECONDS)
GRAPH 3-18: AUDIO AMPLITUDE PROFILE (NIGHT AT THE MUSEUM 2 – FOUR SECONDS)
GRAPH 3-19: AUTOMATIC DETECTION AND GROUND TRUTH (NIGHT AT THE MUSEUM 2 – FOUR SECONDS)
GRAPH 3-20: AUDIO AMPLITUDE PROFILE (THE KINGDOM – ONE FRAME)
GRAPH 3-21: AUTOMATIC DETECTION AND GROUND TRUTH (THE KINGDOM – ONE FRAME)
GRAPH 3-22: AUDIO AMPLITUDE PROFILE (THE KINGDOM – TWO FRAMES)
GRAPH 3-23: AUTOMATIC DETECTION AND GROUND TRUTH (THE KINGDOM – TWO FRAMES)
GRAPH 3-24: AUDIO AMPLITUDE PROFILE (THE KINGDOM – TWO SECONDS)
GRAPH 3-25: AUTOMATIC DETECTION AND GROUND TRUTH (THE KINGDOM – TWO SECONDS)
GRAPH 3-26: AUDIO AMPLITUDE PROFILE (THE KINGDOM – FOUR SECONDS)
GRAPH 3-27: AUTOMATIC DETECTION AND GROUND TRUTH (THE KINGDOM – FOUR SECONDS)
GRAPH 3-28: AUDIO AMPLITUDE PROFILE (THE LEGEND OF BUTCH AND SUNDANCE – ONE FRAME)
GRAPH 3-29: AUTOMATIC DETECTION AND GROUND TRUTH (THE LEGEND OF BUTCH AND SUNDANCE – ONE FRAME)
GRAPH 3-30: AUDIO AMPLITUDE PROFILE (THE LEGEND OF BUTCH AND SUNDANCE – TWO FRAMES)
GRAPH 3-31: AUTOMATIC DETECTION AND GROUND TRUTH (THE LEGEND OF BUTCH AND SUNDANCE – TWO FRAMES)
GRAPH 3-32: AUDIO AMPLITUDE PROFILE (THE LEGEND OF BUTCH AND SUNDANCE – TWO SECONDS)
GRAPH 3-33: AUTOMATIC DETECTION AND GROUND TRUTH (THE LEGEND OF BUTCH AND SUNDANCE – TWO SECONDS)
GRAPH 3-34: AUDIO AMPLITUDE PROFILE (THE LEGEND OF BUTCH AND SUNDANCE – FOUR SECONDS)
GRAPH 3-35: AUTOMATIC DETECTION AND GROUND TRUTH (THE LEGEND OF BUTCH AND SUNDANCE – FOUR SECONDS)
List of Tables

TABLE 3-1: GROUND TRUTH OF NIGHT AT THE MUSEUM 2
TABLE 3-2: GROUND TRUTH OF THE KINGDOM
TABLE 3-3: GROUND TRUTH OF THE KINGDOM (CONTINUED)
TABLE 3-4: GROUND TRUTH OF THE LEGEND OF BUTCH AND SUNDANCE
TABLE 3-5: GROUND TRUTH OF THE LEGEND OF BUTCH AND SUNDANCE (CONTINUED)
TABLE 4-1: COMPARE RESULTS BETWEEN THE AUTOMATIC SYSTEM AND THE GROUND TRUTH
TABLE 4-2: POSSIBLE EXCITING EVENTS ARE DETECTED BY AUTOMATIC SYSTEM
TABLE 4-3: GROUND TRUTH EVENTS MISSED IN AUTOMATIC SYSTEM
TABLE 4-4: COMPARE RESULTS BETWEEN THE AUTOMATIC SYSTEM AND THE GROUND TRUTH
TABLE 4-5: POSSIBLE EXCITING EVENTS ARE DETECTED BY AUTOMATIC SYSTEM
TABLE 4-6: COMPARE RESULTS BETWEEN THE AUTOMATIC SYSTEM AND THE GROUND TRUTH
TABLE 4-7: POSSIBLE EXCITING EVENTS ARE DETECTED BY AUTOMATIC SYSTEM
TABLE 4-8: GROUND TRUTH EVENTS MISSED IN AUTOMATIC SYSTEM
TABLE 4-9: PRECISION AND RECALL VALUES FOR THREE MOVIES

Chapter 1 - Introduction

The growing availability of video content creates a strong requirement for efficient tools to manage and access multimedia data [3]. Considerable progress has been made in audio analysis of movie content, with automatic highlight detection being one of the targets of recent research.
Highlights are important because they provide the user with a short version of the movie that ideally contains all the information needed to understand the content. Hence, the user can quickly judge whether the movie is interesting or not.

Audio, which includes voice, music, and various kinds of environmental sounds, is an important type of media, and also a significant part of audiovisual data. As more and more digital audio databases come into use, people are realizing the importance of managing them effectively on the basis of audio content analysis. Audio segmentation and classification have applications in professional media production, audio archive management, commercial music usage, surveillance, and so on. Furthermore, audio content analysis may play a primary role in video annotation. Current approaches to video segmentation and indexing are mostly focused on the visual information. However, visual-based processing often leads to a far too fine segmentation of the audiovisual sequence with respect to the semantic meaning of the data. Integration of the diverse multimedia components (audio, visual, and textual information) will be essential in achieving a fully functional system for video parsing.

Existing research on content-based audio data management is very limited. There are in general four directions [6]. The first direction is audio segmentation and classification, where one basic problem is speech/music discrimination. The second direction is audio retrieval; one specific technique in content-based audio retrieval is query-by-humming. The third direction is audio analysis for video indexing. The fourth direction is the integration of audio and visual information for video segmentation and indexing.

1.1 Related work

1.1.1 Automatically Selecting Shots for Action Movie Trailers

Alan F. Smeaton, Bart Lehane, Noel E.
O’Connor, Conor Brady and Gary Craig of Dublin City University, Ireland, have researched the area of movie highlights [3]. Their study was based on the following principles:

• They use a shot boundary detection technique to generate the basic shot-based structure of a movie. Colour histograms have been demonstrated to be a highly accurate and efficient method of comparing images and detecting shot boundaries.
• The audio track of a movie is analysed to detect the presence of the following categories: speech, music, silence, speech with background music and other audio. Their rationale for using these audio categories is that music can be indicative of high, or low, points of a movie.
• For each shot they also detect two motion features: the motion intensity and the percentage of camera movement present. The motion intensity is an indicator of the amount of motion within each frame of video, and is determined by calculating the standard deviation of the motion vectors.

The features used to detect shots for trailers are shot length, motion intensity, and the amounts of camera movement, speech, music, silence, speech with background music and other audio present in each shot.

The performance of their shot selection was evaluated with the classic measures of precision and recall, where a set of shots selected using their trained approach was compared against the ground truth of shots which appear in the official movie trailer. Their SVM (support vector machine) approach selects shots in rank order based on their likelihood of inclusion in the original trailer, and the specific metric they use for evaluation is R-Precision [14]. Given a ranked list produced as the output of a system to be evaluated, R-Precision is defined as the precision at rank position R, where R is the number of documents or objects relevant to the query.
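The R-Precision definition above can be illustrated with a short sketch. This is only a minimal illustration of the metric as defined in the text, not the code used in [3]; the function name `r_precision` and the shot identifiers are hypothetical.

```python
def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant items.

    ranked_ids: items in the order the system ranked them (best first).
    relevant_ids: the ground-truth relevant items (e.g. shots in the trailer).
    """
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        raise ValueError("R-Precision is undefined when nothing is relevant")
    top_r = ranked_ids[:r]  # only the first R ranked items are inspected
    hits = sum(1 for item in top_r if item in relevant)
    return hits / r


# Hypothetical example: 5 ranked shots, 3 of which appear in the trailer.
ranked_shots = ["s7", "s2", "s9", "s4", "s1"]
trailer_shots = {"s2", "s4", "s7"}
score = r_precision(ranked_shots, trailer_shots)  # 2 of the top 3 are relevant
```

Because R equals the number of relevant items, precision and recall coincide at this rank, which is one reason the measure is convenient for ranked shot selection.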
When evaluating shot selection, they face the issue of how to evaluate sub-shot retrieval. One approach is to evaluate based on the proportion of frames from the original movie which appear in the trailer. This corresponds to the way gradual shot transitions are evaluated in TRECVid [13], using frame-precision and frame-recall, where the evaluation is in terms of the number of overlapped frames.

Evaluation of their approach to trailer shot selection was done using a leave-one-out k-fold cross validation. This is a technique used in information retrieval in which a dataset T is divided into training and testing subsets T1 and T2 (T = T1 + T2); training is done on T1 and testing on T2; then T is re-divided into different training and testing subsets T1′ and T2′, and the training and evaluation are repeated, a total of k times.

The results show several interesting aspects. Firstly, the consistently high results indicate that this approach of selecting shots for action movie trailers is both accurate and reliable. One possible danger with their results is that the accuracy could be biased by the use of automatic shot segmentation. A correct classification of a movie trailer shot occurs when the ground-truth trailer sub-shot occurs within the selected movie full-shot. Three event classes were chosen (exciting, dialogue and musical) that typically encapsulate all relevant portions of a movie. A range of low-level audiovisual features were extracted and finite state machines were used to detect the events.

1.1.2 Voice Processing for Automatic TV Sports Program Highlights Detection

This study was done by Seán Marlow, David A. Sadlier, Noel O’Connor and Noel Murphy of Dublin City University, Ireland [4]. It uses sports programmes provided by the Centre for Digital Video Processing at DCU.
The study focuses on the audio track to detect highlights in sports programmes. The authors use features of MPEG-1 Layer II audio together with characteristics of sports programme audio. In a sports programme the audio amplitude rises when an exciting event happens, e.g. a goal in a football match, a penalty offence or a red card offence. The authors therefore base highlight detection on the audio amplitude, obtained through the scalefactors in MPEG-1 Layer II audio. The principle is an audio amplitude threshold. The scalefactors are stripped from the audio bitstream and processed to obtain an amplitude level for each frame. The method detects three audio-amplitude frames that are higher than the amplitude threshold. The authors detect highlights with an audio amplitude threshold because it is a cheap and fast approach. The reported results show that almost all highlight events in the sports programmes were detected. The method was successful in locating both the presence of highlight events and the boundaries of the events. Their work is a preliminary investigation into the usefulness of pure audio analysis for summarisation of (limited types of) sports programmes. A further eight 10-minute summaries were generated from various other broadcast sports programmes. The content of the returned clips makes up the final summary. In a real scenario, automatic summarisation of such broadcasts would depend on some combination of an analysis of the closed captions (teletext) and analysis at the visual level.

1.1.3 Audio/visual analysis for high-speed TV advertisement detection from MPEG bitstream

This project is research by David A. Sadlier, Noel O’Connor, Sean Marlow and Noel Murphy [5]. The research is concerned with TV advertisements. A television programme is typically accompanied by beginning/end credits, with one or more ad-breaks somewhere in the middle.
To the user, these features of a programme would generally be regarded as an insignificant part of the material. Their study was based on the following principles:

• Black Video Frame Detection: a black video frame may be recognised by its luminance histogram, which would typically be characterised by having most of its ‘power’ at the bottom end of the pixel amplitude spectrum, corresponding to black or very dark pixels.
• Silent Video Frame Detection: a summation of the absolute values of all the individual audio samples corresponding to the temporal length of one video frame may be defined as the ‘audio level’ for that frame, i.e. for a video frame with relatively quiet audio, a low audio level would be expected. Thus, by thresholding this audio level, silent video frames (of intensity defined by the threshold) may be detected.

The authors report that a series of black/silent video frames may indicate the existence of an ad-break. However, they also use other features of the advertisement breaks: the length of the advertisement breaks and the number of frames between two advertisement breaks.

1.2 Exciting event detection in movie using audio signal

The related work above covers event detection in several settings. In the first case, events in movies were detected using audiovisual data [3]. In the second case, the audio signal was used to highlight events in sports programmes [4]. In the third case, audiovisual data were used to detect ad-breaks in a television programme [5]. However, none of them detects events in movies using only the audio signal. Using only the audio signal to highlight events in movies is the cheaper approach: it requires less computation time than the audiovisual method. In this document, we choose one feature of the audio signal to highlight events in movies: the audio amplitude.
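As a rough illustration of amplitude-based detection, the sketch below computes a per-frame ‘audio level’ as the sum of absolute sample values (the definition quoted in Section 1.1.3) and then reports runs of consecutive frames whose level exceeds a threshold. It is a toy sketch on raw samples, not the scalefactor-based method developed in this report; the function names, the threshold of 10 and the minimum run length of 3 frames are assumptions chosen for the example.

```python
def frame_audio_levels(samples, samples_per_frame):
    """Sum of absolute sample values for each frame-sized chunk of audio."""
    levels = []
    last_start = len(samples) - samples_per_frame
    for start in range(0, last_start + 1, samples_per_frame):
        chunk = samples[start:start + samples_per_frame]
        levels.append(sum(abs(s) for s in chunk))
    return levels


def loud_runs(levels, threshold, min_frames):
    """Return (first_frame, last_frame) for every run of at least
    min_frames consecutive frames whose level exceeds threshold."""
    runs = []
    start = None
    for i, level in enumerate(levels):
        if level > threshold:
            if start is None:
                start = i  # a new loud run begins here
        else:
            if start is not None and i - start >= min_frames:
                runs.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the signal.
    if start is not None and len(levels) - start >= min_frames:
        runs.append((start, len(levels) - 1))
    return runs


# Toy signal: 4 samples per "frame", 6 frames in total.
samples = [0, 1, -1, 0,  5, -6, 7, -5,  6, 6, -7, 5,
           8, -8, 6, -6,  0, 0, 1, 0,  9, -9, 9, -9]
levels = frame_audio_levels(samples, 4)
events = loud_runs(levels, threshold=10, min_frames=3)
```

In this toy signal frames 1 to 3 form one detected run, while the single loud frame at the end is ignored because it is shorter than the minimum run length.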
The audio amplitude in a movie is one indicator of exciting events. Exciting events usually occur with high audio amplitude; they may be gunshots, fights, crashes or explosions. So the audio amplitude may be helpful for highlighting these events.

Chapter 2 – MPEG-1 Audio/Video Standard

2.1 Overview

The Moving Picture Experts Group (MPEG) [15], which meets under the International Standards Organisation (ISO), generates international standards for digital video and audio compression. MPEG-1 is a standard in five parts:

1. ISO/IEC 11172-1:1993 addresses the problem of combining one or more data streams from the video and audio parts of the MPEG-1 standard with timing information to form a single stream, i.e. multiplexing and synchronisation of audio/video.
2. ISO/IEC 11172-2:1993 specifies a coded representation that can be used for compressing video sequences.
3. ISO/IEC 11172-3:1993 specifies a coded representation that can be used for compressing audio sequences, both mono and stereo.
4. ISO/IEC 11172-4:1995 specifies how tests can be designed to verify whether bitstreams and decoders meet the requirements specified in parts 1, 2 and 3.
5. ISO/IEC 11172-5:1998 is technically not a standard but a technical report. It gives a full software implementation of the first three parts of the MPEG-1 standard.

2.2 MPEG-1 layer 2 Audio

The MPEG-1 audio standard (ISO/IEC 11172-3) comprises a flexible hybrid coding technique that incorporates several methods including subband decomposition, filter-bank analysis, transform coding, entropy coding, dynamic bit allocation, nonuniform quantization, adaptive segmentation, and psychoacoustic analysis.
The MPEG-1 audio codec operates on 16-bit PCM input data at sample rates of 32, 44.1 and 48 kHz. Moreover, MPEG-1 offers separate modes for mono, stereo, dual independent mono and joint stereo. Available bit rates are 32 to 192 kb/s for mono and 64 to 384 kb/s for stereo. The MPEG-1 architecture contains three layers of increasing complexity, delay and output quality. Each higher layer incorporates functional blocks from the lower layers. The input signal is first decomposed into 32 critically subsampled subbands using a polyphase realization of a pseudo-QMF (PQMF) bank. The channels are equally spaced such that a 48-kHz input signal is split into 750-Hz subbands, with the subbands decimated 32:1. A 511th-order prototype filter was chosen such that the inherent overall PQMF distortion remains below the threshold of audibility. Moreover, the prototype filter was designed for high sidelobe attenuation (96 dB) to ensure that intraband aliasing remains negligible.

Figure 2-1: ISO/MPEG-1 layer I/II encoder. [2] (Blocks shown: input x(n); 32-channel PQMF analysis bank; block companding and quantization; FFT computation (L1: 512; L2: 1024); psychoacoustic signal analysis (SMR); dynamic bit allocation; quantizers; side info; data multiplexer.)

For the purposes of psychoacoustic analysis and determination of just noticeable distortion (JND) thresholds, a 512-point (Layer I) or 1024-point (Layer II) FFT is computed in parallel with the subband decomposition for each decimated block of 12 subband samples (384 input samples, i.e. 8 ms at 48 kHz). Next, the subbands are block companded (normalized by a scale factor) such that the maximum sample amplitude in each block is unity. An iterative bit allocation procedure then applies the JND thresholds to select an optimal quantizer from a predetermined set for each subband. Quantizers are selected such that both the masking and bit rate requirements are simultaneously satisfied.
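The figures quoted above (750-Hz subbands, 8-ms blocks) and the idea of block companding can be checked with a few lines of arithmetic. The sketch below is illustrative only: `compand_block` is a hypothetical helper that normalizes by the exact block maximum, whereas a real encoder picks the scale factor from a quantized look-up table.

```python
SAMPLE_RATE_HZ = 48_000
NUM_SUBBANDS = 32
BLOCK_LEN = 12  # subband samples per block, in each subband

# The usable bandwidth (half the sample rate) is split evenly over 32 subbands.
subband_width_hz = (SAMPLE_RATE_HZ / 2) / NUM_SUBBANDS

# 12 samples in each of 32 subbands correspond to 12 * 32 = 384 input samples.
block_duration_ms = 1000 * BLOCK_LEN * NUM_SUBBANDS / SAMPLE_RATE_HZ


def compand_block(block):
    """Normalize a block so that its largest absolute sample is 1.0.

    Returns (scale_factor, normalized_block). Simplification: the exact
    maximum is used instead of a value from a quantized scale factor table.
    """
    scale = max(abs(x) for x in block)
    if scale == 0:
        return 0.0, list(block)  # an all-zero block needs no scaling
    return scale, [x / scale for x in block]


# Hypothetical block of 12 subband samples.
block = [0.02, -0.10, 0.05, 0.25, -0.50, 0.40,
         0.10, -0.05, 0.00, 0.30, -0.20, 0.15]
scale, normalized = compand_block(block)
```

The scale factor recovered here (0.5 for this block) is the quantity that later serves as a cheap indicator of audio amplitude, since it tracks the largest sample in each block without requiring full decoding.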
In each subband, scale factors are quantized using 6 bits and quantizer selections are encoded using 4 bits.

MPEG-1 Audio specifies three layers. The different layers offer increasingly higher audio quality at slightly increased complexity. While Layers I and II share the basic structure of the encoding process, having their roots in an earlier algorithm known as MUSICAM, Layer III is substantially different. Layer I is the simplest layer and operates at data rates between 32 and 224 kb/s per channel. The preferred range of operation is above 128 kb/s. Layer I finds an application, for example, in the digital compact cassette (DCC) at 192 kb/s per channel. Layer II is of medium complexity and employs data rates between 32 and 192 kb/s per channel. At 128 kb/s per channel it provides very good audio quality.

The MPEG-1 Layer-II compression algorithm encodes audio signals as follows. The frequency spectrum of the audio signal, bandlimited to 20 kHz, is uniformly divided into 32 subbands. The subbands are assigned individual bit-allocations according to the audibility of quantisation noise within each subband. A psychoacoustic model of the ear analyses the audio signal and provides this information to the quantiser. Layer-II frames consist of 1152 samples: 3 groups of 12 samples from each of 32 subbands. A group of 12 samples gets a bit-allocation and, if this is non-zero, a scalefactor. Scalefactors are weights that scale groups of 12 samples such that they fully use the range of the quantiser. The scalefactor for such a group is determined by the next largest value (given in a look-up table) to the maximum of the absolute values of the 12 samples. Thus it provides an indication of the maximum power exhibited by any one of the 12 samples within the group.

Figure 2-2: Structure of Layer-II subband samples. [5] (Diagram: 32 subbands, granules of 12 samples, 1152 samples per frame.)

Figure 2-3: The data bitstream structure of Layer-II. [5] (Fields shown: Bit Allocation (2~4 bits); Scale Factor Select Information (2 bits); Scale Factor (6 bits); Samples (2~16 bits); Ancillary.)

Chapter 3 – Movie highlight detection

This study focuses on audio, especially audio amplitude. A movie contains many kinds of audio events, e.g. speech, music, speech with background music, and screams. Usually the audio amplitude does not change much if the event is just speech. An exciting event in a movie may be a gunshot, an explosion, a laugh or a scream. When an exciting event happens, the audio amplitude increases suddenly, e.g. for a gunshot or a loud voice.

3.1 Getting Ground Truth

When we get results from the automatic detection method, how do we know how well it performs? We need a table of the exciting events, and this table has to be compiled by hand. We call this the Ground Truth. To know exactly where events happen in a movie, we need to watch the movie and note the exciting events. We need to know when each exciting event happens and how long it lasts, so we write the event information in a table: the event time and the event length. This step is subjective, because an event that is exciting to us may not be exciting to someone else. To mitigate this, we can use the movie trailer as a reference when compiling the Ground Truth. The movie trailer is made manually to advertise the movie, so exciting events may appear in the trailer, but not all exciting events are in the trailer. We refer to the movie trailer only to judge how good the automatic method is. When compiling the Ground Truth, another problem is the length of the events.
For example, an event may be a gunshot combined with fighting and beating, so we either choose the main event or combine all of these events into one big event. In some cases the big event lasts a long time, so the automatic detection can return as many results within it as we want.

Event number | Event location in movie (hh.mm.ss - hh.mm.ss) | Location in seconds | Classified events | Length of event (seconds)
1 | 00.01.17 - 00.01.22 | 77 - 88 | Music and name of movie | 11
2 | 00.08.31 - 00.09.10 | 531 - 550 | Loud noise, scream, dump | 19
3 | 00.15.26 - 00.15.49 | 926 - 949 | Loud voice, scream | 23
4 | 00.24.20 - 00.24.40 | 1460 - 1480 | Loud noise | 20
5 | 00.26.20 - 00.27.41 | 1580 - 1661 | Buster, scream, drum-beat | 81
6 | 00.30.00 - 00.32.56 | 1800 - 1976 | Drum-beat, buster, cracker, wham, fighting, sound of spear flying | 176
7 | 00.35.00 - 00.35.36 | 2100 - 2136 | Scream, fighting | 36
8 | 00.49.00 - 00.49.20 | 2940 - 2960 | Sound of water flowing | 20
9 | 00.56.44 - 00.57.14 | 3404 - 3434 | Scream, squeak | 20
10 | 00.58.58 - 00.59.12 | 3538 - 3552 | Scream, yell, charivari | 14
11 | 01.01.56 - 01.02.09 | 3716 - 3729 | Scream, speech | 13
12 | 01.03.30 - 01.04.30 | 3810 - 3870 | Loud voice | 60
13 | 01.07.07 - 01.07.40 | 4027 - 4060 | Whirr, scream, music | 33
14 | 01.07.50 - 01.08.40 | 4070 - 4120 | Alarm, scream, shouting | 50
15 | 01.14.20 - 01.16.47 | 4460 - 4607 | Scream, drum-beat, crunch, clump, crash, footstep, loud noise | 147
16 | 01.17.36 - 01.17.57 | 4656 - 4677 | Trumpet-call, battle-cry | 21
17 | 01.19.58 - 01.20.13 | 4798 - 4813 | Beating, smack | 15
18 | 01.21.11 - 01.21.40 | 4871 - 4900 | Drum beating, fighting | 29
19 | 01.21.47 - 01.22.39 | 4907 - 4959 | Shouting, drum beating, fighting | 52
20 | 01.23.32 - 01.23.53 | 5012 - 5033 | Crash, beating, smack | 21
21 | 01.24.09 - 01.24.56 | 5049 - 5096 | Drumbeating, shouting | 47
22 | 01.31.49 - 01.32.09 | 5509 - 5529 | Roaring | 20

Table 3-1: Ground Truth of Night at the Museum 2
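The ground-truth table records each event both as hh.mm.ss timestamps and as offsets in seconds. A small helper such as the one sketched below can convert between the two notations and is one way to cross-check the entries when compiling such a table by hand. The function names `to_seconds` and `event_length` are hypothetical and are not part of the original work.

```python
def to_seconds(timestamp):
    """Convert an 'hh.mm.ss' timestamp (the notation of Table 3-1) to seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split("."))
    return hours * 3600 + minutes * 60 + seconds


def event_length(start, end):
    """Length of an event in seconds, given its 'hh.mm.ss' start and end."""
    return to_seconds(end) - to_seconds(start)


# Event 3 of Table 3-1: 00.15.26 - 00.15.49
start_s = to_seconds("00.15.26")
end_s = to_seconds("00.15.49")
length_s = event_length("00.15.26", "00.15.49")
```

For event 3 this gives offsets of 926 and 949 seconds and a length of 23 seconds, which agrees with the corresponding row of the table.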
