DUBLIN CITY UNIVERSITY
SCHOOL OF ELECTRONIC ENGINEERING
Detection of Interesting Events in Movies using
only the Audio signal
PHAM MINH LUAN NGUYEN
August 2009
MASTER OF ENGINEERING
IN
TELECOMMUNICATIONS
Supervised by Dr. Sean Marlow
Detection of Interesting Events in Movies using only the audio signal– PHAM MINH LUAN NGUYEN
Acknowledgements
I would like to thank my supervisor Dr. Sean Marlow for his extensive guidance, enthusiasm
and commitment to this project. Thanks are also due to Dr. David Sadlier for providing movies
and code, and to all other friends and colleagues for their contributions.
Declaration
I hereby declare that, except where otherwise indicated, this document is entirely my own
work and has not been submitted in whole or in part to any other university.
Signed: ......................................................................
Date: ...............................
Abstract
The imminent rapid expansion of the movie industry is driving the need for efficient digital
video indexing, browsing and playback systems. This report develops an automatic system that
detects exciting events directly from the original movie using only the audio signal.
Interesting events in movies are typically flagged by high audio amplitude, so detecting them
from the audio amplitude is efficient: it is fast because audio features are computationally
cheaper than visual features. The detected highlight events are then classified in order to
evaluate the automatic system.
Contents
ACKNOWLEDGEMENTS
DECLARATION
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF GRAPHS
LIST OF TABLES
CHAPTER 1 – INTRODUCTION
1.1 RELATED WORK
1.1.1 Automatically Selecting Shots for Action Movie Trailers
1.1.2 Voice Processing for Automatic TV Sports Program Highlights Detection
1.1.3 Audio/visual analysis for high-speed TV advertisement detection from MPEG bitstream
1.2 EXCITING EVENT DETECTION IN MOVIE USING AUDIO SIGNAL
CHAPTER 2 – MPEG-1 AUDIO/VIDEO STANDARD
2.1 OVERVIEW
2.2 MPEG-1 LAYER 2 AUDIO
CHAPTER 3 – MOVIE HIGHLIGHT DETECTION
3.1 GETTING GROUND TRUTH
3.2 AUTOMATIC DETECTION
3.2.1 Getting Scale Factor
3.2.2 Audio amplitude threshold
CHAPTER 4 – RESULTS AND ANALYSIS
4.1 RESULTS
4.1.1 The average audio amplitude
4.1.2 The audio amplitude threshold time
4.1.3 Results and result tables
4.2 PRECISION AND RECALL
CHAPTER 5 – CONCLUSIONS AND FURTHER WORK
5.1 SYSTEM EVALUATION
5.2 FURTHER WORK
REFERENCES
List of Figures
FIGURE 2-1: ISO/MPEG-1 LAYER I/II ENCODER
FIGURE 2-2: STRUCTURE OF LAYER-II SUBBAND SAMPLES
FIGURE 2-3: THE DATA BITSTREAM STRUCTURE OF LAYER-II
FIGURE 3-1: MPEG-1 LAYER-II FREQUENCY SUBBANDS
FIGURE 3-2: VIDEO FRAME AUDIO LEVELS GENERATED FROM SCALEFACTORS CORRESPONDING TO TEMPORALLY ASSOCIATED AUDIO
List of Graphs
GRAPH 3-1: PER-FRAME AUDIO AMPLITUDE LEVEL FOR EXAMPLE MOVIE
GRAPH 3-2: PER-SECOND AUDIO AMPLITUDE LEVEL FOR EXAMPLE MOVIE
GRAPH 3-3: AUDIO AMPLITUDE PROFILE OF THE NIGHT AT THE MUSEUM 2
GRAPH 3-4: AUDIO AMPLITUDE DETECTION OF THE NIGHT AT THE MUSEUM 2
GRAPH 3-5: AUDIO AMPLITUDE DETECTION OF THE NIGHT AT THE MUSEUM 2 AND GROUND TRUTH (BLUE IS AUTOMATIC DETECTION. RED IS THE GROUND TRUTH)
GRAPH 3-6: AUDIO AMPLITUDE PROFILE OF THE KINGDOM
GRAPH 3-7: AUDIO AMPLITUDE DETECTION OF THE KINGDOM
GRAPH 3-8: AUDIO AMPLITUDE DETECTION OF THE KINGDOM AND GROUND TRUTH
GRAPH 3-9: AUDIO AMPLITUDE PROFILE OF THE LEGEND OF BUTCH AND SUNDANCE
GRAPH 3-10: AUDIO AMPLITUDE DETECTION OF THE LEGEND OF BUTCH AND SUNDANCE
GRAPH 3-11: COMPARE RESULT AUTOMATIC DETECTION AND GROUND TRUTH
GRAPH 3-12: AUDIO AMPLITUDE PROFILE (NIGHT AT THE MUSEUM 2 – ONE FRAME)
GRAPH 3-13: AUTOMATIC DETECTION AND GROUND TRUTH (NIGHT AT THE MUSEUM 2 – ONE FRAME)
GRAPH 3-14: AUDIO AMPLITUDE PROFILE (NIGHT AT THE MUSEUM 2 – TWO FRAMES)
GRAPH 3-15: AUTOMATIC DETECTION AND GROUND TRUTH (NIGHT AT THE MUSEUM 2 – TWO FRAMES)
GRAPH 3-16: AUDIO AMPLITUDE PROFILE (NIGHT AT THE MUSEUM 2 – TWO SECONDS)
GRAPH 3-17: AUTOMATIC DETECTION AND GROUND TRUTH (NIGHT AT THE MUSEUM 2 – TWO SECONDS)
GRAPH 3-18: AUDIO AMPLITUDE PROFILE (NIGHT AT THE MUSEUM 2 – FOUR SECONDS)
GRAPH 3-19: AUTOMATIC DETECTION AND GROUND TRUTH (NIGHT AT THE MUSEUM 2 – FOUR SECONDS)
GRAPH 3-20: AUDIO AMPLITUDE PROFILE (THE KINGDOM – ONE FRAME)
GRAPH 3-21: AUTOMATIC DETECTION AND GROUND TRUTH (THE KINGDOM – ONE FRAME)
GRAPH 3-22: AUDIO AMPLITUDE PROFILE (THE KINGDOM – TWO FRAMES)
GRAPH 3-23: AUTOMATIC DETECTION AND GROUND TRUTH (THE KINGDOM – TWO FRAMES)
GRAPH 3-24: AUDIO AMPLITUDE PROFILE (THE KINGDOM – TWO SECONDS)
GRAPH 3-25: AUTOMATIC DETECTION AND GROUND TRUTH (THE KINGDOM – TWO SECONDS)
GRAPH 3-26: AUDIO AMPLITUDE PROFILE (THE KINGDOM – FOUR SECONDS)
GRAPH 3-27: AUTOMATIC DETECTION AND GROUND TRUTH (THE KINGDOM – FOUR SECONDS)
GRAPH 3-28: AUDIO AMPLITUDE PROFILE (THE LEGEND OF BUTCH AND SUNDANCE – ONE FRAME)
GRAPH 3-29: AUTOMATIC DETECTION AND GROUND TRUTH (THE LEGEND OF BUTCH AND SUNDANCE – ONE FRAME)
GRAPH 3-30: AUDIO AMPLITUDE PROFILE (THE LEGEND OF BUTCH AND SUNDANCE – TWO FRAMES)
GRAPH 3-31: AUTOMATIC DETECTION AND GROUND TRUTH (THE LEGEND OF BUTCH AND SUNDANCE – TWO FRAMES)
GRAPH 3-32: AUDIO AMPLITUDE PROFILE (THE LEGEND OF BUTCH AND SUNDANCE – TWO SECONDS)
GRAPH 3-33: AUTOMATIC DETECTION AND GROUND TRUTH (THE LEGEND OF BUTCH AND SUNDANCE – TWO SECONDS)
GRAPH 3-34: AUDIO AMPLITUDE PROFILE (THE LEGEND OF BUTCH AND SUNDANCE – FOUR SECONDS)
GRAPH 3-35: AUTOMATIC DETECTION AND GROUND TRUTH (THE LEGEND OF BUTCH AND SUNDANCE – FOUR SECONDS)
List of Tables
TABLE 3-1: GROUND TRUTH OF NIGHT AT THE MUSEUM 2
TABLE 3-2: GROUND TRUTH OF THE KINGDOM
TABLE 3-3: GROUND TRUTH OF THE KINGDOM (CONTINUE)
TABLE 3-4: GROUND TRUTH OF THE LEGEND OF BUTCH AND SUNDANCE
TABLE 3-5: GROUND TRUTH OF THE LEGEND OF BUTCH AND SUNDANCE (CONTINUE)
TABLE 4-1: COMPARE RESULTS BETWEEN THE AUTOMATIC SYSTEM AND THE GROUND TRUTH
TABLE 4-2: POSSIBLE EXCITING EVENTS ARE DETECTED BY AUTOMATIC SYSTEM
TABLE 4-3: GROUND TRUTH EVENTS MISSED IN AUTOMATIC SYSTEM
TABLE 4-4: COMPARE RESULTS BETWEEN THE AUTOMATIC SYSTEM AND THE GROUND TRUTH
TABLE 4-5: POSSIBLE EXCITING EVENTS ARE DETECTED BY AUTOMATIC SYSTEM
TABLE 4-6: COMPARE RESULTS BETWEEN THE AUTOMATIC SYSTEM AND THE GROUND TRUTH
TABLE 4-7: POSSIBLE EXCITING EVENTS ARE DETECTED BY AUTOMATIC SYSTEM
TABLE 4-8: GROUND TRUTH EVENTS MISSED IN AUTOMATIC SYSTEM
TABLE 4-9: PRECISION AND RECALL VALUES FOR THREE MOVIES
Chapter 1 – Introduction
The growing availability of video content creates a strong requirement for efficient tools to
manage or access multimedia data [3]. Considerable progress has been made in audio
analysis for movie content, with automatic highlight detection being one of the targets of
recent research. Highlight detection is important because highlights provide the user with a
short version of the movie that ideally contains all the information needed to understand the
content. Hence, the user may quickly judge whether the movie is interesting or not.
Audio, which includes voice, music, and various kinds of environmental sounds, is an
important type of media and a significant part of audiovisual data. As more and more digital
audio databases come into use, people are realizing the importance of managing them
effectively through audio content analysis.
Audio segmentation and classification have applications in professional media production,
audio archive management, commercial music usage, surveillance, and so on. Furthermore,
audio content analysis may play a primary role in video annotation. Current approaches to
video segmentation and indexing are mostly focused on the visual information. However,
visual-based processing often leads to a far too fine segmentation of the audiovisual
sequence. Integrating the diverse multimedia components (audio, visual, and textual
information) will be essential in achieving a fully functional system for video parsing.
Existing research on content-based audio data management is very limited. There are in
general four directions [6]. The first is audio segmentation and classification, where one
basic problem is speech/music discrimination. The second is audio retrieval; one specific
technique in content-based audio retrieval is query-by-humming. The third is audio analysis
for video indexing. The fourth is the integration of audio and visual information for video
segmentation and indexing.
1.1 Related work
1.1.1 Automatically Selecting Shots for Action Movie Trailers
Alan F. Smeaton, Bart Lehane, Noel E. O’Connor, Conor Brady and Gary Craig of Dublin
City University, Ireland, have researched the area of movie highlights [3]. Their study was
based on the following principles:
• They utilise a shot boundary detection technique in order to generate the basic shot-based
structure of a movie. Colour histograms have been demonstrated to be a highly accurate
and efficient method of comparing images and detecting shot boundaries.
• The audio track of a movie is analysed in order to detect the presence of the following
categories: speech, music, silence, speech with background music and other audio.
Their rationale for using these audio categories is that music can be indicative of high,
or low, points of a movie.
• For each shot they also detect two motion features: the motion intensity and the
percentage of camera movement present. The motion intensity is an indicator of the
amount of motion within each frame of video, and is determined by calculating the
standard deviation of the motion vectors.
The features used to detect the shots used in trailers are shot length, motion intensity,
and the amount of camera movement, speech, music, silence, speech with background music
and other audio present in each shot. The performance of their shot selection was evaluated
using the classic measures of precision and recall, where a set of shots selected using their
trained approach was compared against the ground truth of shots which appear in the official
movie trailer. Their approach uses an SVM (support vector machine) to select shots in rank
order based on their likelihood of inclusion in the original trailer, and the specific metric
they use for evaluation is R-Precision [14]. Given a ranked list produced as the output of a
system to be evaluated, R-Precision is defined as the precision at rank position R, where R
is the number of documents or objects relevant to the query.
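The R-Precision definition above can be illustrated with a minimal Python sketch. The function name `r_precision` and the shot identifiers are hypothetical and chosen only for illustration; this is not the implementation used in [3].

```python
def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant items."""
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        raise ValueError("R-Precision is undefined when nothing is relevant")
    top_r = ranked_ids[:r]  # the R highest-ranked items
    hits = sum(1 for item in top_r if item in relevant)
    return hits / r


# Hypothetical data: shots ranked by a classifier, and the shots that
# actually appear in the trailer (the ground truth).
ranked = ["s7", "s2", "s9", "s4", "s1"]
in_trailer = {"s2", "s4", "s7"}
score = r_precision(ranked, in_trailer)  # 2 of the top 3 shots are relevant
```

With three relevant shots, only the first three ranked shots are inspected, so a relevant shot placed at rank four does not contribute to the score.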
When evaluating shot selection they face the issue of how to evaluate sub-shot retrieval. One
approach they could take to address this is to evaluate based on the proportion of frames
from the original movie which appear in the trailer. This would correspond to the way
gradual shot transitions are evaluated in TRECVid [13], using frame-precision and
frame-recall, where the evaluation is in terms of the number of overlapped frames.
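As a rough illustration of evaluation by overlapped frames, the sketch below computes frame-precision and frame-recall from two lists of frame intervals. The interval format (start inclusive, end exclusive) and the function name are assumptions made for this example and are not taken from TRECVid [13].

```python
def frame_precision_recall(selected, reference):
    """Frame-level precision and recall for lists of (start, end) intervals.

    Intervals are frame numbers with the end excluded (an assumption of
    this sketch).
    """
    def to_frames(intervals):
        frames = set()
        for start, end in intervals:
            frames.update(range(start, end))
        return frames

    sel = to_frames(selected)
    ref = to_frames(reference)
    overlap = len(sel & ref)  # number of overlapped frames
    precision = overlap / len(sel) if sel else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical example: a selected shot of 100 frames that shares its
# last 50 frames with a 100-frame reference segment.
precision, recall = frame_precision_recall([(100, 200)], [(150, 250)])
```

Counting frames rather than whole shots gives partial credit when a selected shot only partly covers the reference segment.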
Evaluation of their approach to trailer shot selection was done using a leave-one-out k-fold
cross validation. This is a technique used in information retrieval in which a dataset, T, is
divided into training (T1) and testing (T2) subsets, T = T1 + T2. Training is done on T1 and
testing on T2; T is then re-divided into different training and testing subsets T1′ and T2′,
and the training and evaluation are repeated, a total of k times.
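The re-division of T into training and testing subsets can be sketched as follows. This is a generic k-fold split written for illustration; how [3] actually partitioned its data is not specified here, and the movie labels are hypothetical.

```python
def k_fold_splits(items, k):
    """Yield (training, testing) pairs so that each item is tested once."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    folds = [items[i::k] for i in range(k)]  # k disjoint subsets of T
    for i, testing in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, testing


# Hypothetical labels. With k equal to the number of items, every round
# leaves exactly one item out for testing (leave-one-out).
movies = ["m1", "m2", "m3", "m4"]
splits = list(k_fold_splits(movies, k=len(movies)))
```

Each round trains on T1 and tests on T2, and the k rounds together use every item for testing exactly once.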
The results show several interesting aspects. Firstly, the consistently high results indicate
that this approach to selecting shots for action movie trailers is both accurate and reliable.
One possible danger with their results is that the accuracy could be biased by the use of
automatic shot segmentation. A correct classification of a movie trailer shot occurs when the
ground-truth trailer sub-shot occurs within the selected movie full-shot.
Three event classes were chosen (exciting, dialogue and musical) that typically encapsulate
all relevant portions of a movie. A range of low-level audiovisual features was extracted
and finite state machines were used to detect the events.
1.1.2 Voice Processing for Automatic TV Sports Program Highlights Detection
This study was done by Seán Marlow, David A. Sadlier, Noel O’Connor and Noel Murphy of
Dublin City University, Ireland [4]. It uses sports programmes provided by the Centre for
Digital Video Processing at DCU and relies on the audio track alone to detect highlights.
The authors use some features of MPEG-1 Layer II audio together with a characteristic of the
audio in sports programmes: the audio amplitude is high when an exciting event happens, e.g.
a goal in a football match, a penalty offence or a red card offence. The authors therefore
detect highlights from the audio amplitude, which they obtain through the scale factors in
MPEG-1 Layer II audio. The principle is an audio amplitude threshold. The scale factors are
stripped from the audio and then processed to give an amplitude level for each frame. The
method looks for three frames whose audio amplitude is higher than the amplitude threshold.
The authors detect highlights with an audio amplitude threshold because it is a cheap, fast
approach. The method detected almost all of the highlight events in the sports programmes
and was successful in locating both the presence of highlight events and their boundaries.
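The threshold rule described above can be sketched in a few lines of Python. The sketch assumes that the frames above the threshold must be consecutive, which is our reading and is not stated explicitly for [4]; the function name and the amplitude values are hypothetical.

```python
def find_runs_above(levels, threshold, min_frames=3):
    """Return (start, end) index pairs, end excluded, for every run of at
    least min_frames consecutive levels above the threshold.

    The requirement that the frames be consecutive is an assumption.
    """
    runs = []
    start = None
    for i, level in enumerate(levels):
        if level > threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_frames:
                runs.append((start, i))
            start = None
    # Close a run that reaches the end of the sequence.
    if start is not None and len(levels) - start >= min_frames:
        runs.append((start, len(levels)))
    return runs


# Hypothetical per-frame amplitude levels.
levels = [1, 5, 6, 7, 2, 8, 9, 1, 6, 6, 6, 6]
runs = find_runs_above(levels, threshold=4)
```

The run boundaries give a first estimate of where each candidate event starts and ends; the two-frame burst at indices 5 and 6 is ignored because it is shorter than `min_frames`.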
Their work is a preliminary investigation into the usefulness of pure audio analysis for
summarisation of (limited types of) sports programmes. A further eight 10-minute
summaries were generated from various other broadcast sports programmes. The returned
clips make up the final summary.
In a real scenario, automatic summarisation of such broadcasts would depend on some
combination of an analysis of the closed captions (teletext) and analysis at the visual level.
1.1.3 Audio/visual analysis for high-speed TV advertisement detection from MPEG bitstream
This research was carried out by David A. Sadlier, Noel O’Connor, Sean Marlow and Noel
Murphy [5]. It is concerned with TV advertisements. A television programme is typically
accompanied by beginning/end credits, with one or more ad-breaks somewhere in the
middle. To the user, these features of a programme would generally be regarded as an
insignificant part of the material. Their study was based on the following principles:
• Black Video Frame Detection: a black video frame may be recognised by its luminance
histogram, which would typically be characterised by having most of its ‘power’ at the
bottom end of the pixel amplitude spectrum, corresponding to black or very dark pixels.
• Silent Video Frame Detection: a summation of the absolute values of all the individual
audio samples corresponding to the temporal length of one video frame may be defined
as the ‘audio level’ for that frame, i.e. for a video frame with relatively quiet audio, a
low audio level would be expected. Thus, by thresholding this audio level, silent video
frames (of intensity defined by the threshold) may be detected.
The authors report that a series of black/silent video frames may indicate the existence of
an ad-break. They also use further features of advertisement breaks: the length of each break
and the number of frames between two advertisement breaks.
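The silent-frame test described above can be written as a short sketch: the audio level of a video frame is the sum of the absolute values of its audio samples, and frames whose level falls below a threshold are marked silent. The sample values, the number of samples per frame and the threshold below are hypothetical and chosen only to keep the example small.

```python
def frame_audio_levels(samples, samples_per_frame):
    """Sum of absolute sample values over each whole video frame."""
    levels = []
    last_start = len(samples) - samples_per_frame
    for start in range(0, last_start + 1, samples_per_frame):
        frame = samples[start:start + samples_per_frame]
        levels.append(sum(abs(s) for s in frame))
    return levels


def silent_frames(levels, threshold):
    """Indices of the frames whose audio level is below the threshold."""
    return [i for i, level in enumerate(levels) if level < threshold]


# Hypothetical PCM samples: a quiet frame, a loud frame, a quiet frame.
samples = [0, 1, -1, 0, 900, -800, 700, -600, 2, 0, -1, 1]
levels = frame_audio_levels(samples, samples_per_frame=4)
silent = silent_frames(levels, threshold=10)
```

In practice the number of audio samples per video frame follows from the audio sample rate and the video frame rate, and the threshold sets how quiet a frame must be to count as silent.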
1.2 Exciting event detection in movie using audio signal
The related work above gives several case studies of event detection in video. In the first
case, events in movies are detected using audiovisual data [3]. In the second case, the audio
signal is used to highlight events in sports programmes [4]. In the third case, audiovisual
data is used to detect ad-breaks in television programmes [5]. However, none of them detects
events in movies using only the audio signal.
Using only the audio signal to highlight events in a movie is the cheaper approach, because
it needs much less computation time than a method based on audiovisual data. In this
document, we choose one feature of the audio signal to highlight events in movies: the audio
amplitude. The audio amplitude in a movie is one indicator of exciting events, which usually
occur with high audio amplitude. High-amplitude events may be gunshots, fights, crashes or
explosions, so the audio amplitude may be helpful for highlighting these events.
Chapter 2 – MPEG-1 Audio/Video Standard
2.1 Overview
The Moving Picture Experts Group (MPEG) [15], which meets under the International
Organization for Standardization (ISO), generates international standards for digital video
and audio compression. MPEG-1 is a standard in five parts:
1. ISO/IEC 11172-1:1993
This addresses the problem of combining one or more data streams from the video and audio
parts of the MPEG-1 standard with timing information to form a single stream, i.e.
multiplexing and synchronisation of audio/video.
2. ISO/IEC 11172-2:1993
This specifies a coded representation that can be used for compressing video sequences.
3. ISO/IEC 11172-3:1993
This specifies a coded representation that can be used for compressing audio sequences,
both mono and stereo.
4. ISO/IEC 11172-4:1995
This specifies how tests can be designed to verify whether bitstreams and decoders
meet the requirements specified in Parts 1, 2 and 3.
5. ISO/IEC 11172-5:1998
Technically not a standard, but a technical report. It gives a full software implementation
of the first three parts of the MPEG-1 standard.
2.2 MPEG-1 layer 2 Audio
The MPEG-1 audio standard (ISO/IEC 11172-3) comprises a flexible hybrid coding technique
that incorporates several methods including subband decomposition, filter-bank analysis,
transform coding, entropy coding, dynamic bit allocation, nonuniform quantization, adaptive
segmentation, and psychoacoustic analysis. The MPEG-1 audio codec operates on 16-bit PCM
input data at sample rates of 32, 44.1 and 48 kHz. Moreover, MPEG-1 offers separate
modes for mono, stereo, dual independent mono and joint stereo. Available bit rates are
32-192 kb/s for mono and 64-384 kb/s for stereo.
The MPEG-1 architecture contains three layers of increasing complexity, delay and output
quality. Each higher layer incorporates functional blocks from the lower layers. The input
signal is first decomposed into 32 critically subsampled subbands using a polyphase
realization of a pseudo-QMF (PQMF) bank. The channels are equally spaced such that a
48-kHz input signal is split into 750-Hz subbands, with the subbands decimated 32:1. A
511th-order prototype filter was chosen such that the inherent overall PQMF distortion
remains below the threshold of audibility. Moreover, the prototype filter was designed
for high sidelobe attenuation (96 dB) to ensure that intraband aliasing remains negligible.
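The 750-Hz figure follows from dividing the usable bandwidth (half the sample rate) equally among the 32 subbands. The short sketch below reproduces this arithmetic; the function name is ours and serves only as an illustration.

```python
def subband_edges(sample_rate_hz, num_subbands=32):
    """Lower and upper edge in Hz of each equally spaced subband."""
    nyquist = sample_rate_hz / 2      # usable bandwidth of the input
    width = nyquist / num_subbands    # every subband has the same width
    return [(i * width, (i + 1) * width) for i in range(num_subbands)]


edges = subband_edges(48_000)
width_hz = edges[0][1] - edges[0][0]  # 24000 / 32 = 750 Hz
```

At the other supported sample rates the subbands are narrower: 44.1 kHz gives about 689 Hz per subband and 32 kHz gives 500 Hz.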
[Figure 2-1 block diagram. Labelled blocks: input x(n); 32-channel PQMF analysis bank with
32:1 decimation; block companding and quantization (quantizers); FFT computation (L1: 512;
L2: 1024); psychoacoustic signal analysis (SMR); dynamic bit allocation; multiplexer
(data and side info).]
Figure 2-1: ISO/MPEG-1 layer I/II encoder. [2]
For the purposes of psychoacoustic analysis and determination of just noticeable distortion
(JND) thresholds, a 512 (layer I) or 1024 (layer II) point FFT is computed in parallel with
the subband decomposition for each block of 12 decimated subband samples (8 ms at 48 kHz).
Next, the subbands are block companded (normalized by a scale factor) such that the
maximum sample amplitude in each block is unity. An iterative bit allocation procedure then
applies the JND thresholds to select an optimal quantizer from a predetermined set for each
subband. Quantizers are selected such that both the masking and bit rate requirements are
simultaneously satisfied. In each subband, scale factors are quantized using 6 bits and
quantizer selections are encoded using 4 bits.
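Two points in this paragraph can be checked with a small sketch. First, 12 samples in each of 32 subbands correspond to 12 × 32 = 384 input samples, which last 384 / 48000 = 8 ms at 48 kHz. Second, block companding divides a block by a scale factor so that its largest absolute sample becomes one. The code below normalises by the exact block maximum, which is a simplification: a real encoder uses a quantized scale factor taken from a table. The function name and the sample values are hypothetical.

```python
def compand_block(block):
    """Normalise a block so that its largest absolute sample is 1.0.

    Simplification: the scale is the exact block maximum rather than a
    quantized scale factor from the standard's table.
    """
    scale = max(abs(s) for s in block)
    if scale == 0:
        return 0.0, list(block)  # an all-zero block needs no scaling
    return scale, [s / scale for s in block]


# Duration of one block: 12 samples in each of 32 subbands at 48 kHz.
block_ms = 12 * 32 / 48_000 * 1000

# Hypothetical block of 12 subband samples.
block = [0.02, -0.10, 0.05, 0.25, -0.5, 0.125, 0.0, 0.3, -0.2, 0.1, 0.05, -0.01]
scale, normalised = compand_block(block)
```

The decoder multiplies the dequantized samples by the same scale factor to restore the original amplitude, which is why the scale factor must be transmitted as side information.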
MPEG-1 Audio specifies three layers. The different layers offer increasingly higher audio
quality at slightly increased complexity. While Layers I and II share the basic structure of
the encoding process, having their roots in an earlier algorithm known as MUSICAM, Layer
III is substantially different.
Layer I is the simplest layer and operates at data rates between 32 and 224 kb/s per
channel. The preferred range of operation is above 128 kb/s. Layer I finds an application, for
example, in the digital compact cassette (DCC) at 192 kb/s per channel. Layer II is of
medium complexity and employs data rates between 32 and 192 kb/s per channel. At 128
kb/s per channel it provides very good audio quality.
The MPEG-1 Layer-II compression algorithm encodes audio signals as follows: the
frequency spectrum of the audio signal, bandlimited to 20 kHz, is uniformly divided into 32
subbands. The subbands are assigned individual bit-allocations according to the audibility of
quantisation noise within each subband. A psychoacoustic model of the ear analyses the
audio signal and provides this information to the quantiser.
Layer-II frames consist of 1152 samples: 3 groups of 12 samples from each of 32 subbands.
A group of 12 samples gets a bit-allocation and, if this is non-zero, a scalefactor.
Scalefactors are weights that scale groups of 12 samples such that they fully use the range of
the quantiser. The scalefactor for such a group is determined by the next largest value (given
in a look-up table) to the maximum of the absolute values of the 12 samples. Thus it
provides an indication of the maximum power exhibited by any one of the 12 samples
within the group.
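The scalefactor selection described above can be sketched as a table look-up. The table used here, 63 values of the form 2 · 2^(−i/3) in decreasing order, is our assumption about the look-up table and should be checked against ISO/IEC 11172-3; we also read "next largest value" as the smallest table entry that is not below the block maximum. The function name and the sample values are hypothetical.

```python
# Assumed look-up table: 63 entries, largest first (verify against the standard).
SCALEFACTORS = [2.0 * 2.0 ** (-i / 3.0) for i in range(63)]


def choose_scalefactor(block):
    """Return (index, value) of the smallest table entry >= max |sample|."""
    peak = max(abs(s) for s in block)
    index = 0  # falls back to the largest entry if the peak exceeds the table
    for i, value in enumerate(SCALEFACTORS):
        if value >= peak:
            index = i  # the table is descending, so later matches are smaller
        else:
            break
    return index, SCALEFACTORS[index]


# Hypothetical group of 12 subband samples with a peak magnitude of 0.7.
block = [0.1, -0.7, 0.3, 0.05, 0.6, -0.2, 0.0, 0.4, -0.5, 0.2, 0.1, -0.3]
index, value = choose_scalefactor(block)

# One Layer-II frame: 3 groups of 12 samples from each of 32 subbands.
samples_per_frame = 3 * 12 * 32
```

Because the table is ordered from largest to smallest, a lower index means a louder group of samples, so the index alone already gives a coarse amplitude measure without decoding the audio.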
[Figure 2-2 diagram. Labels: 32 subbands; granule (12 samples); granules; 1152 samples.]
Figure 2-2: Structure of Layer-II subband samples. [5]
[Figure 2-3 diagram. Labelled fields: Bit Allocation (2~4 bits); Scale Factor Select
Information (2 bits); Scale Factor (6 bits); Samples (2~16 bits); Ancillary.]
Figure 2-3: The data bitstream structure of Layer-II. [5]
Chapter 3 – Movie highlight detection
This study focuses on audio, and especially on audio amplitude. A movie contains many kinds
of audio events, e.g. speech, music, speech with background music, and screams. Usually, the
audio amplitude does not change much if the event is just speech. An exciting event in a
movie may be a gunshot, an explosion, a laugh or a scream. When an exciting event happens,
e.g. a gunshot or a loud voice, the audio amplitude increases suddenly.
3.1 Getting Ground Truth
Once we have results from the automatic detection method, we need a way to know how well it
performs. For this we need a table of the exciting events, which has to be compiled by hand.
We call this table the Ground Truth. To know exactly where the events happen in a movie, we
watch the movie and note the exciting events, recording in a table when each event happens
and how long it lasts: the event time and the event length. This step has a problem of
subjectivity, because an event that is exciting to us may not be exciting to someone else.
To reduce this problem, we can use the movie trailer as a guide to the exciting parts of the
movie when compiling the Ground Truth. The trailer is produced manually to advertise the
movie, so exciting events are likely to appear in it, although not every exciting event is in
the trailer. We therefore use the movie trailer only as a reference when judging how good
the automatic method is.
When compiling the Ground Truth, another problem is the length of the events. For example,
an event may be a gunshot combined with fighting and beating, so we must either choose the
main event or combine all of these events into one big event. In some cases the big event
lasts a long time, so the automatic detection has a better chance of producing a result
within it.
Event number | Event location in movie (hour.minute.second – hour.minute.second) | Location (second – second) | Classified events | Length of event (seconds)
-------------|--------------------------------------------------------------------|----------------------------|-------------------|--------------------------
1  | 00.01.17 – 00.01.22 | 77 – 88     | Music and name of movie | 11
2  | 00.08.31 – 00.09.10 | 531 – 550   | Loud noise, scream, dump | 19
3  | 00.15.26 – 00.15.49 | 926 – 949   | Loud voice, scream | 23
4  | 00.24.20 – 00.24.40 | 1460 – 1480 | Loud noise | 20
5  | 00.26.20 – 00.27.41 | 1580 – 1661 | Buster, scream, drum-beat | 81
6  | 00.30.00 – 00.32.56 | 1800 – 1976 | Drum-beat, buster, cracker, wham, fighting, sound of spear flying | 176
7  | 00.35.00 – 00.35.36 | 2100 – 2136 | Scream, fighting | 36
8  | 00.49.00 – 00.49.20 | 2940 – 2960 | Sound of water flowing | 20
9  | 00.56.44 – 00.57.14 | 3404 – 3434 | Scream, squeak | 20
10 | 00.58.58 – 00.59.12 | 3538 – 3552 | Scream, yell, charivari | 14
11 | 01.01.56 – 01.02.09 | 3716 – 3729 | Scream, speech | 13
12 | 01.03.30 – 01.04.30 | 3810 – 3870 | Loud voice | 60
13 | 01.07.07 – 01.07.40 | 4027 – 4060 | Whirr, scream, music | 33
14 | 01.07.50 – 01.08.40 | 4070 – 4120 | Alarm, scream, shouting | 50
15 | 01.14.20 – 01.16.47 | 4460 – 4607 | Scream, drum-beat, crunch, clump, crash, footstep, loud noise | 147
16 | 01.17.36 – 01.17.57 | 4656 – 4677 | Trumpet-call, battle-cry | 21
17 | 01.19.58 – 01.20.13 | 4798 – 4813 | Beating, smack | 15
18 | 01.21.11 – 01.21.40 | 4871 – 4900 | Drum beating, fighting | 29
19 | 01.21.47 – 01.22.39 | 4907 – 4959 | Shouting, drum beating, fighting | 52
20 | 01.23.32 – 01.23.53 | 5012 – 5033 | Crash, beating, smack | 21
21 | 01.24.09 – 01.24.56 | 5049 – 5096 | Drumbeating, shouting | 47
22 | 01.31.49 – 01.32.09 | 5509 – 5529 | Roaring | 20
Table 3-1: Ground Truth of Night at the Museum 2
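The table gives each event location both as a timestamp and as an offset in seconds from the start of the movie. The conversion between the two can be sketched as below, using event 3 as the worked example. The helper name `to_seconds` is hypothetical and was not part of the tools used for this study.

```python
def to_seconds(timestamp):
    """Convert an 'hh.mm.ss' timestamp to seconds from the start of the movie."""
    hours, minutes, seconds = (int(part) for part in timestamp.split("."))
    return hours * 3600 + minutes * 60 + seconds


# Event 3 of Table 3-1: 00.15.26 – 00.15.49
start = to_seconds("00.15.26")  # 15 * 60 + 26
end = to_seconds("00.15.49")    # 15 * 60 + 49
length = end - start            # event length in seconds
```

Working in seconds makes it straightforward to compare the Ground Truth intervals with the per-second output of the automatic detection.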