MINISTRY OF EDUCATION
AND TRAINING
VIETNAMESE ACADEMY
OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
————————————
NGUYEN VAN TRUONG
IMPROVING SOME ARTIFICIAL IMMUNE
ALGORITHMS FOR NETWORK INTRUSION
DETECTION
THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
IN MATHEMATICS
Hanoi - 2019
MINISTRY OF EDUCATION
AND TRAINING
VIETNAMESE ACADEMY
OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
————————————
NGUYEN VAN TRUONG
IMPROVING SOME ARTIFICIAL IMMUNE
ALGORITHMS FOR NETWORK INTRUSION
DETECTION
THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
IN MATHEMATICS
Major: Mathematical foundations for Informatics
Code: 62 46 01 10
Scientific supervisor:
1. Assoc. Prof., Dr. Nguyen Xuan Hoai
2. Assoc. Prof., Dr. Luong Chi Mai
Hanoi - 2019
Acknowledgments
First of all I would like to thank is my principal supervisor, Assoc. Prof.,
Dr. Nguyen Xuan Hoai for introducing me to the field of Artificial Immune System.
He guides me step by step through research activities such as seminar presentations,
paper writing, etc. His genius has been a constant source of help. I am intrigued
by his constructive criticism throughout my PhD. journey. I wish also to thank my
co-supervisor, Assoc. Prof., Dr. Luong Chi Mai. She is always very enthusiastic in
our discussion promising research questions. It is a pleasure and luxury for me to work
with her. This thesis could not have been possible without my supervisors’ support.
I gratefully acknowledge the support from Institute of Information Technology,
Vietnamese Academy of Science and Technology, and from Thai Nguyen University
of Education. I thank the financial support from the National Foundation for Science
and Technology Development (NAFOSTED), ASEAN-European Academic University
Network (ASEA-UNINET).
I thank M.Sc. Vu Duc Quang, M.Sc. Trinh Van Ha and M.Sc. Pham Dinh
Lam, my co-authors of published papers. I thank Assoc. Prof., Dr. Tran Quang
Anh and Dr. Nguyen Quang Uy for many helpful insights for my research. I thank
colleagues, especially my cool labmate Mr. Nguyen Tran Dinh Long, in IT Research
& Development Center, HaNoi University.
Finally, I thank my family for their endless love and steady support.
Certificate of Originality
I hereby declare that this submission is my own work under my scientific supervisors, Assoc. Prof., Dr. Nguyen Xuan Hoai, and Assoc. Prof., Dr. Luong Chi Mai. I
declare that, it contains no material previously published or written by another person,
except where due reference is made in the text of the thesis. In addition, I certify that
all my co-authors allow me to present our work in this thesis.
Hanoi, 2019
PhD. student
Nguyen Van Truong
i
Contents
List of Figures
List of Tables
v
vii
Notation and Abbreviation
INTRODUCTION
viii
1
Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2
Problem statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2
Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
1 BACKGROUND
1.1
5
Detection of Network Anomalies . . . . . . . . . . . . . . . . . . . . . .
5
1.1.1
Host-Based IDS . . . . . . . . . . . . . . . . . . . . . . . . . . .
6
1.1.2
Network-Based IDS . . . . . . . . . . . . . . . . . . . . . . . . .
6
1.1.3
Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
1.1.4
Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
1.2
A brief overview of human immune system . . . . . . . . . . . . . . . .
8
1.3
AIS for IDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
1.3.1
AIS model for IDS . . . . . . . . . . . . . . . . . . . . . . . . .
10
1.3.2
AIS features for IDS . . . . . . . . . . . . . . . . . . . . . . . .
11
Selection algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12
1.4.1
12
1.4
Negative Selection Algorithms . . . . . . . . . . . . . . . . . . .
ii
1.4.2
1.5
1.6
1.7
Positive Selection Algorithms . . . . . . . . . . . . . . . . . . .
15
Basic terms and definitions . . . . . . . . . . . . . . . . . . . . . . . . .
16
1.5.1
Strings, substrings and languages . . . . . . . . . . . . . . . . .
16
1.5.2
Prefix trees, prefix DAGs and automata . . . . . . . . . . . . .
17
1.5.3
Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18
1.5.4
Detection in r-chunk detector-based positive selection . . . . . .
20
1.5.5
Holes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21
1.5.6
Performance metrics . . . . . . . . . . . . . . . . . . . . . . . .
22
1.5.7
Ring representation of data . . . . . . . . . . . . . . . . . . . .
23
1.5.8
Frequency trees . . . . . . . . . . . . . . . . . . . . . . . . . . .
25
Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
1.6.1
The DARPA-Lincoln datasets . . . . . . . . . . . . . . . . . . .
27
1.6.2
UT dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
1.6.3
Netflow dataset . . . . . . . . . . . . . . . . . . . . . . . . . . .
28
1.6.4
Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
28
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
29
2 COMBINATION OF NEGATIVE SELECTION AND POSITIVE SELECTION
30
2.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
2.2
Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
2.3
New Positive-Negative Selection Algorithm . . . . . . . . . . . . . . . .
31
2.4
Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
39
2.5
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
40
3 GENERATION OF COMPACT DETECTOR SET
43
3.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
43
3.2
Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
44
3.3
New negative selection algorithm . . . . . . . . . . . . . . . . . . . . .
45
iii
3.3.1
Detectors set generation under rcbvl matching rule . . . . . . .
45
3.3.2
Detection under rcbvl matching rule . . . . . . . . . . . . . . .
48
3.4
Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
48
3.5
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
49
4 FAST SELECTION ALGORITHMS
51
4.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
51
4.2
Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
52
4.3
A fast negative selection algorithm based on r-chunk detector . . . . . .
52
4.4
A fast negative selection algorithm based on r-contiguous detector . . .
57
4.5
Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
62
4.6
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
5 APPLYING HYBRID ARTIFICIAL IMMUNE SYSTEM FOR NETWORK SECURITY
66
5.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
66
5.2
Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
67
5.3
Hybrid positive selection algorithm with chunk detectors . . . . . . . .
69
5.4
Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
70
5.4.1
Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
5.4.2
Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . .
71
5.4.3
Performance metrics and parameters . . . . . . . . . . . . . . .
72
5.4.4
Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
73
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
76
5.5
CONCLUSIONS
Contributions of this thesis
78
. . . . . . . . . . . . . . . . . . . . . . . . . . .
78
Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
79
Published works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
80
iv
BIBLIOGRAPHY
81
v
List of Figures
1.1
Classification of anomaly-based intrusion detection methods . . . . . .
7
1.2
Multi-layered protection and elimination architecture . . . . . . . . . .
9
1.3
Multi-layer AIS model for IDS . . . . . . . . . . . . . . . . . . . . . . .
10
1.4
Outline of a typical negative selection algorithm. . . . . . . . . . . . . .
13
1.5
Outline of a typical positive selection algorithm. . . . . . . . . . . . . .
15
1.6
Example of a prefix tree and a prefix DAG. . . . . . . . . . . . . . . . .
18
1.7
Existence of holes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
1.8
Negative selections with 3-chunk and 3-contiguous detectors. . . . . . .
23
1.9
A simple ring-based representation (b) of a string (a). . . . . . . . . . .
25
1.10 Frequency trees for all 3-chunk detectors. . . . . . . . . . . . . . . . . .
26
2.1
Binary tree representation of the detectors set generated from S. . . . .
33
2.2
Conversion of a positive tree to a negative one. . . . . . . . . . . . . . .
33
2.3
Diagram of the Detector Generation Algorithm. . . . . . . . . . . . . .
35
2.4
Diagram of the Positive-Negative Selection Algorithm. . . . . . . . . .
37
2.5
One node is reduced in a tree: a compact positive tree has 4 nodes (a)
and its conversion (a negative tree) has 3 node (b). . . . . . . . . . . .
38
2.6
Detection time of NSA and PNSA. . . . . . . . . . . . . . . . . . . . .
40
2.7
Nodes reduction on trees created by PNSA on Netflow dataset. . . . . .
41
2.8
Comparison of nodes reduction on Spambase dataset. . . . . . . . . . .
41
3.1
Diagram of a algorithm to generate perfect rcbvl detectors set. . . . . .
47
4.1
Diagram of the algorithm to generate positive r-chunk detectors set. . .
55
vi
4.2
A prefix DAG G and an automaton M . . . . . . . . . . . . . . . . . .
4.3
Diagram of the algorithm to generate negative r-contiguous detectors set. 61
4.4
An automaton represents 3-contiguous detectors set. . . . . . . . . . . .
4.5
Comparison of ratios of runtime of r-chunk detector-based NSA to runtime of Chunk-NSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6
57
62
63
Comparison of ratios of runtime of r-contiguous detector-based NSA to
runtime of Cont-NSA . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64
vii
List of Tables
1.1
Performance comparison of NSAs on linear strings and ring strings. . .
24
2.1
Comparison of memory and detection time reductions. . . . . . . . . .
39
2.2
Comparison of nodes generation on Netflow dataset. . . . . . . . . . . .
40
3.1
Data and parameters distribution for experiments and results comparison. 49
4.1
Comparison of our results with the runtimes of previously published
algorithms.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
53
4.2
Comparison of Chunk-NSA with r-chunk detector-based NSA. . . . . .
63
4.3
Comparison of proposed Cont-NSA with r-contiguous detector-based NSA. 64
5.1
Features for NIDS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
5.2
Distribution of flows and parameters for experiments. . . . . . . . . . .
73
5.3
Comparison between PSA2 and other algorithms. . . . . . . . . . . . .
74
5.4
Comparison between ring string-based PSA2 and linear string-based PSA2. 76
viii
Notation and Abbreviation
Notation
`
Length of data samples
Sr
|X|
Set of ring presentations of all strings in S
Cardinality of set X
Σ
An alphabet, a nonempty and finite set of symbols
Σk
Set of all strings of length k on alphabet Σ, where k is a
Σ∗
positive integer.
Set of all strings on alphabet Σ, including an empty string.
r
Matching threshold
Dpi
Set of all positive r-chunk detectors at position i.
Dni
CHUNKp (S, r)
Set of all negative r-chunk detectors at position i.
Set of all positive r-chunk detectors.
CHUNK(S, r)
Set of all negative r-chunk detectors.
CONT(S, r)
Set of all r-contiguous detectors.
L(X)
rcbvl
Set of all nonself strings detected by X.
r-contiguous bit with variable length.
Abbreviation
AIS
Artificial Immune System
ACC
ACO
Accuracy Rate
Ant Colony Optimization
ANIDS
Anomaly Network Intrusion Detection System
BBNN
Block-Based Neural Network
Chunk-NSA Chunk Detector-Based Negative Selection Algorithm
Cont-NSA
Contiguous Detector-Based Negative Selection Algorithm
DR
Detection Rate
DAG
Directed Acyclic Graph
FAR
GA
False Alarm Rate
Genetic Algorithm
HIS
Human Immune System
HIDS
Host Intrusion Detection System
IDS
Intrusion Detection System
ix
ML
Machine Learning
MLP
NIDS
Multilayer Perceptron
Network Intrusion Detection System
NS
Negative Selection
NSA
Negative Selection Algorithm
NSM
PNSA
Negative Selection Mutation
Positive-Negative Selection Algorithm
PSA
Positive Selection Algorithm
PSA2
Two-class Positive Selection Algorithm
PSO
PSOGSA
Particle Swarm Optimization
Particle Swarm Optimization-Gravitational Search Algorithm
RNSA
Real-valued NSA
SVM
Support Vector Machines
TCP
VNSA
Transmission Control Protocol
Variable length detector-based NSA
1
INTRODUCTION
Motivation
Internet users and computer networks are suffering from rapidly increasing number of attacks. In order to keep them safe, there is a need for effective security monitoring systems, such as Intrusion Detection Systems (IDS). However, intrusion detection
has to face a number of different problems such as large network traffic volumes, imbalanced data distribution, difficulties to realize decision boundaries between normal
and abnormal actions, and a requirement for continuous adaptation to a constantly
changing environment. As a result, many researchers have attempted to use different
types of approaches to build reliable intrusion detection system.
Computational intelligence techniques, known for their ability to adapt and to
exhibit fault tolerance, high computational speed and resilience against noisy information, are hopefully alternative methods to the problem.
One of the promising computational intelligence methods for intrusion detection
that have emerged recently are artificial immune systems (AIS) inspired by the biological immune system. Negative selection algorithm (NSA), a dominating model of AIS,
is widely used for intrusion detection systems (IDS) [55, 52]. Despite its successful
application, NSA has some weaknesses: 1-High false positive rate (false alarm rate)
and false negative rate, 2-High training and testing time, 3-Exponential relationship
between the size of the training data and the number of detectors possibly generated for
testing, 4-Changeable definitions of ”normal data” and ”abnormal data” in dynamic
network environment [55, 79, 92]. To overcome these limitations, trends of recent works
are to concentrate on complex structures of immune detectors, matching methods and
hybrid NSAs [11, 94, 52].
Following trends mentioned above, in this thesis we investigate the ability of
NSA to combine with other classification methods and propose more effective data
2
representations to improve some NSA’s weaknesses.
Scientific meaning of the thesis: to provide further background to improve performance of AIS-based computer security field in particular and IDS in general.
Reality meaning of the thesis: to assist computer security practicers or experts
implement their IDS with new features from AIS origin.
The major contributions of this research are: Propose a new representation of
data for better performance of IDS; Propose a combination of existing algorithms as
well as some statistical approaches in an uniform framework; Propose a complete and
non-redundant detector representation to archive optimal time and memory complexities.
Objectives
Since data representation is one of the factors that affect the training and testing
time, a compact and complete detector generation algorithm is investigated.
The thesis investigates optimal algorithms to generate detector set in AIS. They
help to reduce both training time and detecting time of AIS-based IDSs.
Also, it is regarded to propose and investigate an AIS-based IDS that can
promptly detect attacks, either if they are known or never seen before. The proposed
system makes use of AIS with statistics as analysis methods and flow-based network
traffic as experimental data.
Problem statements
Since the NSA has some limitations as listed in the first section, this thesis
concentrates on three problems:
1. The first problem is to find compact representations of data. Objectives of this
problem’s solution is not only to minimize memory storage but also to reduce
testing time.
2. The second problem is to propose algorithms that can reduce training time
and testing time in compared with all existing related algorithms.
3
3. The third problem is to improve detection performance with respect to reducing false alarm rates while keeping detection rate and accuracy rate as high as
possible.
Solutions of these problems can partly improve first three weaknesses as listed in the
first section. Regarding to the last NSAs’ weakness about changeable definitions of
”normal data” and ”abnormal data” in dynamic network environment, we consider it
as a risk in our proposed algorithm and left it for future work.
Logically, it is impossible to find an optimal algorithm that can both reduce time
and memory complexities and obtain best detection performance. These aspects are
always in conflict with each other. Thus, in each chapter, we will propose algorithms
to solve each problem quite independently.
The intrusion detection problem mentioned in this thesis can be informally
stated as:
Given a finite set S of network flows which labeled with self (normal) or nonself
(abnormal). The objective is to build classifying models on S that can label exactly an
unlabeled network flow s.
Outline of thesis
The first chapter introduces the background knowledge necessary to discuss the
algorithms proposed in following chapters. First, detection of network anomalies is
briefly introduced. Following that, the human immune system, artificial immune system, machine learning and their relevance are reviewed and discussed. Then, popular
datasets used for experiments in the thesis are examined. related works.
In Chapter 2, a combination method of selection algorithms is presented. The
proposed technique helps to reduce detectors storage generated in training phase. Testing time, an important measurement in IDS, will also be reduced as a direct consequence
of a smaller memory complexity. Tree structure is used in this chapter (and in Chapter
5) to improve time and memory complexities.
A complete and nonredundant detector set, also called perfect detectors set,
4
is necessary to archive acceptable self and nonself coverage of classifiers. A selection
algorithm to generate a perfect detectors set is investigated in Chapter 3. Each detector
in the set is a string concatenated from overlapping classical ones. Different from
approaches in the other chapters, discrete structure of string-based detectors in this
chapter are suitable for detection in distributed environment.
Chapter 4 includes two selection algorithms for fast training phase. The optimal
algorithms can generate a detectors set in linear time with respect to size of training
data. The experiment results and theoretical proof show that proposed algorithms
outperform all existing ones in term of training time. In term of detection time, the
first algorithm and the second one is linear and polynomial, respectively.
Chapter 5 mainly introduces a hybrid approach of positive selection algorithm
with statistics for more effective NIDS. Frequencies of self and nonself data (strings) are
contained in leaves of trees representing detectors. This information plays an important
role in improving performance of the proposed algorithms. The hybrid approach came
as a new positive selection algorithm for two-class classification that can be trained
with samples from both self and nonself data types.
5
Chapter 1
BACKGROUND
The human immune system (HIS) has successfully protected our bodies against
attacks from various harmful pathogens, such as bacteria, viruses, and parasites. It
distinguishes pathogens from self-tissue, and further eliminates these pathogens. This
provides a rich source of inspiration for computer security systems, especially intrusion
detection systems [92]. Hence, applying theoretical immunology and observed immune
functions, its principles, and its models to IDS has gradually developed into a new
research field, called artificial immune system (AIS).
How to apply remarkable features of HIS to archive scalable and robust IDS
is considered a researching gap in the field of computer security. In this chapter, we
introduce the background knowledge necessary to discuss the algorithms proposed in
following chapters that can partly fulfill the gap.
Firstly, a brief introduction to network anomaly detection is presented. We
then overview HIS. Next, immune selection algorithms, detectors, performance metrics
and their relevance are reviewed and discussed. Finally, some popular datasets are
examined.
1.1
Detection of Network Anomalies
The idea of intrusion detection is predicated on the belief that an intruder’s
behavior is noticeably different from that of a legitimate user and that many unauthorized actions are detectable [65]. Intrusion detection systems (IDSs) are deployed as a
second line of defense along with other preventive security mechanisms, such as user
6
authentication and access control. Based on its deployment, an IDS can act either as
a host-based or as a network-based IDS.
1.1.1
Host-Based IDS
A Host-Based IDS (HIDS) monitors and analyzes the internals of a computing
system. A HIDS may detect internal activity such as which program accesses what
resources and attempts illegitimate access, for example, an activity that modifies the
system password database. Similarly, a HIDS may look at the state of a system and
its stored information whether it is in RAM or in the file system or in log files or
elsewhere. Thus, one can think of a HIDS as an agent that monitors whether anything
or anyone internal or external has circumvented the security policy that the operating
system tries to enforce [12].
1.1.2
Network-Based IDS
A Network-Based IDS (NIDS) detects intrusions in network data. Intrusions
typically occur as anomalous patterns. Most techniques model the data in a sequential
fashion and detect anomalous subsequences. The primary reason for these anomalies
is the attacks launched by outside attackers who want to gain unauthorized access to
the network to steal information or to disrupt the network. In a typical setting, a
network is connected to the rest of the world through the Internet. The NIDS reads
all incoming packets or flows, trying to find suspicious patterns. For example, if a
large number of TCP connection requests to a very large number of different ports are
observed within a short time, one could assume that there is someone committing a
port scan at some of the computers in the network. Port scans mostly try to detect
incoming shell codes in the same manner that an ordinary intrusion detection system
does. In addition to inspecting the incoming traffic, a NIDS also provides valuable
information about intrusion from outgoing or local traffic. Some attacks might even be
staged from the inside of a monitored network or network segment; and therefore, not
regarded as incoming traffic at all. The data available for intrusion detection systems
can be at different levels of granularity, like packet level traces or Cisco netflow data.
7
The data is high dimensional, typically, with a mix of categorical as well as continuous
numeric attributes. Misuse-based NIDSs attempt to search for known intrusive patterns
while an anomaly-based intrusion detector searches for unusual patterns. Today, the
intrusion detection research is mostly concentrated on anomaly-based network intrusion
detection because it can detect both known and unknown attacks [12].
1.1.3
Methods
On the basis of the availability of prior knowledge, the detection mechanism
used, the mode of performance and the ability to detect attacks, existing anomaly
detection methods are categorized into six broad categories [41] as shown in Fig. 1.1.
This figure is adapted from [12].
Supervised
Learning
Parametric
Clustering
Unsupervised
Learning
Probabilistic
Learning
Anomaly
Detection
Non-Parametric
Association
Mining
Outlier mining
ANN based
Rough Set based
Fuzzy Logic
Soft
Computing
GA based & Ant Colony
Artificial Immune System
Knowledge
based
Combination
Learners
Rule based & Expert
System based
Ontology & Logic based
Ensemble based
Fusion based
Hybrid
Figure 1.1: Classification of anomaly-based intrusion detection methods
AIS is a fairly new research subfield of Computational intelligence. It was
considered as a system that acts intelligently: What it does is appropriate for its
circumstances and its goal; it is flexible to changing environments and changing goals;
it learns from experience; also it makes appropriate choices given perceptual limitations
and finite computation [68].
- Xem thêm -