MINISTRY OF EDUCATION
AND TRAINING
VIETNAMESE ACADEMY
OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY ||||||||||||
NGUYEN VAN TRUONG
IMPROVING SOME ARTIFICIAL IMMUNE ALGORITHMS FOR
NETWORK INTRUSION DETECTION
THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
IN MATHEMATICS
Hanoi - 2019
MINISTRY OF EDUCATION
VIETNAMESE ACADEMY
AND TRAINING
OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY ||||||||||||
NGUYEN VAN TRUONG
IMPROVING SOME ARTIFICIAL IMMUNE ALGORITHMS FOR
NETWORK INTRUSION DETECTION
THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
IN MATHEMATICS
Major: Mathematical foundations for Informatics
Code: 62 46 01 10
Scienti c supervisor:
1. Assoc. Prof., Dr. Nguyen Xuan Hoai
2. Assoc. Prof., Dr. Luong Chi Mai
Hanoi - 2019
Acknowledgments
First of all I would like to thank is my principal supervisor, Assoc. Prof., Dr.
Nguyen Xuan Hoai for introducing me to the eld of Arti cial Immune System. He
guides me step by step through research activities such as seminar presentations,
paper writing, etc. His genius has been a constant source of help. I am intrigued by
his constructive criticism throughout my PhD. journey. I wish also to thank my cosupervisor, Assoc. Prof., Dr. Luong Chi Mai. She is always very enthusiastic in our
discussion promising research questions. It is a pleasure and luxury for me to work
with her. This thesis could not have been possible without my supervisors’ support.
I gratefully acknowledge the support from Institute of Information
Technology, Vietnamese Academy of Science and Technology, and from Thai
Nguyen University of Education. I thank the nancial support from the National
Foundation for Science and Technology Development (NAFOSTED), ASEANEuropean Academic University Network (ASEA-UNINET).
I thank M.Sc. Vu Duc Quang, M.Sc. Trinh Van Ha and M.Sc. Pham Dinh
Lam, my co-authors of published papers. I thank Assoc. Prof., Dr. Tran Quang Anh
and Dr. Nguyen Quang Uy for many helpful insights for my research. I thank
colleagues, especially my cool labmate Mr. Nguyen Tran Dinh Long, in IT
Research & Development Center, HaNoi University.
Finally, I thank my family for their endless love and steady support.
Certi cate of Originality
I hereby declare that this submission is my own work under my scienti c
super-visors, Assoc. Prof., Dr. Nguyen Xuan Hoai, and Assoc. Prof., Dr. Luong Chi
Mai. I declare that, it contains no material previously published or written by
another person, except where due reference is made in the text of the thesis. In
addition, I certify that all my co-authors allow me to present our work in this thesis.
Hanoi, 2019
PhD. student
Nguyen Van Truong
i
Contents
List of Figures
List of Tables
Notation and Abbreviation
INTRODUCTION
Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Problem statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1 BACKGROUND
1.1 Detection of Network Anomalies . . . . . . . . . . . . . . . . . . . . . .
1.1.1
1.1.2
1.1.3
1.1.4
1.2
A brief overview of human immune sys
1.3
AIS for IDS . . . . . . . . . . . . . . . . . . . . .
1.3.1
1.3.2
1.4
Selection algorithms . . . . . . . . . . . . .
1.4.1
1.4.2
1.5
Basic terms and de nitions . . . . . . . . .
1.5.1
1.5.2
1.5.3
1.5.4
1.5.5
1.5.6
1.5.7
1.5.8
1.6
Datasets . . . . . . . . . . . . . . . . . . . . . . .
1.6.1
1.6.2
1.6.3
1.6.4
1.7
Summary . . . . . . . . . . . . . . . . . . . . . .
2 COMBINATION OF NEGATIVE SELECTION AND POSITIVE SELECTION
2.1
Introduction . . . . . . . . . . . . . . . . . . . . .
2.2
Related works . . . . . . . . . . . . . . . . . . .
2.3
New Positive-Negative Selection Algo
2.4
Experiments . . . . . . . . . . . . . . . . . . . .
2.5
Summary . . . . . . . . . . . . . . . . . . . . . .
3 GENERATION OF COMPACT DETECTOR SET
3.1
Introduction . . . . . . . . . . . . . . . . . . . . .
3.2
Related works . . . . . . . . . . . . . . . . . . .
3.3
New negative selection algorithm . . .
3.3.1
3.3.2
3.4
Experiments . . . . . . . . . . . . . . . . . . . .
3.5
Summary . . . . . . . . . . . . . . . . . . . . . .
4 FAST SELECTION ALGORITHMS
4.1
Introduction . . . . . . . . . . . . . . . . . . . . .
4.2
Related works . . . . . . . . . . . . . . . . . . .
4.3
A fast negative selection algorithm bas
4.4
A fast negative selection algorithm bas
4.5
Experiments . . . . . . . . . . . . . . . . . . . .
4.6
Summary . . . . . . . . . . . . . . . . . . . . . .
5 APPLYING HYBRID ARTIFICIAL IMMUNE SYSTEM FOR NETWORK SECURITY
5.1
Introduction . . . . . . . . . . . . . . . . . . . . .
5.2
Related works . . . . . . . . . . . . . . . . . . .
5.3
Hybrid positive selection algorithm with
5.4
Experiments . . . . . . . . . . . . . . . . . . . .
5.4.1
5.4.2
5.4.3
5.4.4
5.5
Summary . . . . . . . . . . . . . . . . . . . . . .
CONCLUSIONS
Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . .
Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Published works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
iv
BIBLIOGRAPHY
v
List of Figures
1.1
Classi cation of anomaly-based intrusion detection met
1.2
Multi-layered protection and elimination architecture . .
1.3
Multi-layer AIS model for IDS . . . . . . . . . . . . . . . . . . . .
1.4
Outline of a typical negative selection algorithm. . . . . .
1.5
Outline of a typical positive selection algorithm. . . . . . .
1.6
Example of a pre x tree and a pre x DAG. . . . . . . . . . .
1.7
Existence of holes. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.8
Negative selections with 3-chunk and 3-contiguous det
1.9
A simple ring-based representation (b) of a string (a). .
1.10
Frequency trees for all 3-chunk detectors. . . . . . . . . . .
2.1
Binary tree representation of the detectors set generate
2.2
Conversion of a positive tree to a negative one. . . . . . .
2.3
Diagram of the Detector Generation Algorithm. . . . . . .
2.4
Diagram of the Positive-Negative Selection Algorithm.
2.5
One node is reduced in a tree: a compact positive tree h
and its conversion (a negative tree) has 3 node (b). . .
2.6
Detection time of NSA and PNSA. . . . . . . . . . . . . . . . .
2.7
Nodes reduction on trees created by PNSA on Net ow
2.8
Comparison of nodes reduction on Spambase dataset.
3.1
Diagram of a algorithm to generate perfect rcbvl detect
4.1 Diagram of the algorithm to generate positive r-chunk detectors set. . . 55
vi
4.2 A pre x DAG G and an automaton M . . . . . . . . . . . . . . . . .
4.3 Diagram of the algorithm to generate negative r-contiguous d
4.4 An automaton represents 3-contiguous detectors set. . . . .
4.5 Comparison of ratios of runtime of r-chunk detector-based
time of Chunk-NSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6 Comparison of ratios of runtime of r-contiguous detector-ba
runtime of Cont-NSA . . . . . . . . . . . . . . . . . . . . . . . . . . . .
vii
List of Tables
1.1 Performance comparison of NSAs on linear strings and ring strings. . . 24
2.1 Comparison of memory and detection time reductions. . . . . . . . . .
39
2.2 Comparison of nodes generation on Net ow dataset. . . . . . . . . . . .
40
3.1 Data and parameters distribution for experiments and results comparison. 49
4.1 Comparison of our results with the runtimes of previously published
algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Comparison of Chunk-NSA with r-chunk detector-based NSA. . . . . .
63
4.3 Comparison of proposed Cont-NSA with r-contiguous detector-based NSA. 64
5.1 Features for NIDS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
5.2 Distribution of
73
ows and parameters for experiments. . . . . . . . . . .
5.3 Comparison between PSA2 and other algorithms. . . . . . . . . . . . .
74
5.4 Comparison between ring string-based PSA2 and linear string-based PSA2. 76
viii
Notation and Abbreviation
Notation
Dpi
Dni
Length of data samples
Set of ring presentations of all strings in S
Cardinality of set X
An alphabet, a nonempty and nite set of symbols
Set of all strings of length k on alphabet , where k is a
positive integer.
Set of all strings on alphabet , including an empty string.
Matching threshold
Set of all positive r-chunk detectors at position i.
Set of all negative r-chunk detectors at position i.
CONT(S; r)
L(X)
rcbvl
Set of all r-contiguous detectors.
Set of all nonself strings detected by X.
r-contiguous bit with variable length.
‘
Sr
jXj
k
r
Abbreviation
Cont-NSA
DR
DAG
FAR
GA
Arti cial Immune System
Accuracy Rate
Ant Colony Optimization
Anomaly Network Intrusion Detection System
Block-Based Neural Network
Chunk Detector-Based Negative Selection Algorithm
Contiguous Detector-Based Negative Selection Algorithm
Detection Rate
Directed Acyclic Graph
False Alarm Rate
Genetic Algorithm
HIDS
IDS
Host Intrusion Detection System
Intrusion Detection System
AIS
ACC
ACO
ANIDS
BBNN
Chunk-NSA
ix
ML
MLP
NIDS
NS
NSA
NSM
PNSA
PSA
PSA2
PSO
Machine Learning
Multilayer Perceptron
Network Intrusion Detection System
Negative Selection
Negative Selection Algorithm
Negative Selection Mutation
Positive-Negative Selection Algorithm
Positive Selection Algorithm
Two-class Positive Selection Algorithm
Particle Swarm Optimization
PSOGSA
Particle Swarm Optimization-Gravitational Search Algorithm
RNSA
SVM
TCP
VNSA
Real-valued NSA
Support Vector Machines
Transmission Control Protocol
Variable length detector-based NSA
1
INTRODUCTION
Motivation
Internet users and computer networks are su ering from rapidly increasing
num-ber of attacks. In order to keep them safe, there is a need for e ective security
monitor-ing systems, such as Intrusion Detection Systems (IDS). However, intrusion
detection has to face a number of di erent problems such as large network tra c
volumes, im-balanced data distribution, di culties to realize decision boundaries
between normal and abnormal actions, and a requirement for continuous adaptation to
a constantly changing environment. As a result, many researchers have attempted to
use di erent types of approaches to build reliable intrusion detection system.
Computational intelligence techniques, known for their ability to adapt and
to exhibit fault tolerance, high computational speed and resilience against noisy
informa-tion, are hopefully alternative methods to the problem.
One of the promising computational intelligence methods for intrusion detection
that have emerged recently are arti cial immune systems (AIS) inspired by the biological immune system. Negative selection algorithm (NSA), a dominating model of AIS,
is widely used for intrusion detection systems (IDS) [55, 52]. Despite its successful
application, NSA has some weaknesses: 1-High false positive rate (false alarm rate)
and false negative rate, 2-High training and testing time, 3-Exponential relationship
between the size of the training data and the number of detectors possibly generated
for testing, 4-Changeable de nitions of "normal data" and "abnormal data" in dynamic
network environment [55, 79, 92]. To overcome these limitations, trends of recent
works are to concentrate on complex structures of immune detectors, matching
methods and hybrid NSAs [11, 94, 52].
Following trends mentioned above, in this thesis we investigate the ability of
NSA to combine with other classi cation methods and propose more e ective data
2
representations to improve some NSA’s weaknesses.
Scienti c meaning of the thesis: to provide further background to improve
per-formance of AIS-based computer security eld in particular and IDS in general.
Reality meaning of the thesis: to assist computer security practicers or
experts implement their IDS with new features from AIS origin.
The major contributions of this research are: Propose a new representation
of data for better performance of IDS; Propose a combination of existing
algorithms as well as some statistical approaches in an uniform framework;
Propose a complete and non-redundant detector representation to archive optimal
time and memory complex-ities.
Objectives
Since data representation is one of the factors that a ect the training and
testing time, a compact and complete detector generation algorithm is investigated.
The thesis investigates optimal algorithms to generate detector set in AIS.
They help to reduce both training time and detecting time of AIS-based IDSs.
Also, it is regarded to propose and investigate an AIS-based IDS that can
promptly detect attacks, either if they are known or never seen before. The
proposed system makes use of AIS with statistics as analysis methods and owbased network tra c as experimental data.
Problem statements
Since the NSA has some limitations as listed in the rst section, this thesis
concentrates on three problems:
1. The rst problem is to nd compact representations of data. Objectives of
this problem’s solution is not only to minimize memory storage but also to
reduce testing time.
2. The second problem is to propose algorithms that can reduce training time
and testing time in compared with all existing related algorithms.
3
3. The third problem is to improve detection performance with respect to
reduc-ing false alarm rates while keeping detection rate and accuracy rate as
high as possible.
Solutions of these problems can partly improve rst three weaknesses as listed in
the rst section. Regarding to the last NSAs’ weakness about changeable de nitions
of "normal data" and "abnormal data" in dynamic network environment, we
consider it as a risk in our proposed algorithm and left it for future work.
Logically, it is impossible to nd an optimal algorithm that can both reduce
time and memory complexities and obtain best detection performance. These
aspects are always in con ict with each other. Thus, in each chapter, we will
propose algorithms to solve each problem quite independently.
The intrusion detection problem mentioned in this thesis can be informally
stated as:
Given a nite set S of network ows which labeled with self (normal) or
nonself (abnormal). The objective is to build classifying models on S that can label
exactly an unlabeled network ow s.
Outline of thesis
The rst chapter introduces the background knowledge necessary to discuss the
algorithms proposed in following chapters. First, detection of network anomalies is brie
y introduced. Following that, the human immune system, arti cial immune sys-tem,
machine learning and their relevance are reviewed and discussed. Then, popular
datasets used for experiments in the thesis are examined. related works.
In Chapter 2, a combination method of selection algorithms is presented.
The proposed technique helps to reduce detectors storage generated in training
phase. Test-ing time, an important measurement in IDS, will also be reduced as a
direct consequence of a smaller memory complexity. Tree structure is used in this
chapter (and in Chapter 5) to improve time and memory complexities.
A complete and nonredundant detector set, also called perfect detectors set,
4
is necessary to archive acceptable self and nonself coverage of classi ers. A
selection algorithm to generate a perfect detectors set is investigated in Chapter 3.
Each detector in the set is a string concatenated from overlapping classical ones.
Di erent from approaches in the other chapters, discrete structure of string-based
detectors in this chapter are suitable for detection in distributed environment.
Chapter 4 includes two selection algorithms for fast training phase. The
optimal algorithms can generate a detectors set in linear time with respect to size
of training data. The experiment results and theoretical proof show that proposed
algorithms outperform all existing ones in term of training time. In term of detection
time, the rst algorithm and the second one is linear and polynomial, respectively.
Chapter 5 mainly introduces a hybrid approach of positive selection algorithm
with statistics for more e ective NIDS. Frequencies of self and nonself data (strings)
are contained in leaves of trees representing detectors. This information plays an
important role in improving performance of the proposed algorithms. The hybrid
approach came as a new positive selection algorithm for two-class classi cation that
can be trained with samples from both self and nonself data types.
5
Chapter 1
BACKGROUND
The human immune system (HIS) has successfully protected our bodies
against attacks from various harmful pathogens, such as bacteria, viruses, and
parasites. It distinguishes pathogens from self-tissue, and further eliminates these
pathogens. This provides a rich source of inspiration for computer security systems,
especially intrusion detection systems [92]. Hence, applying theoretical immunology
and observed immune functions, its principles, and its models to IDS has gradually
developed into a new research eld, called arti cial immune system (AIS).
How to apply remarkable features of HIS to archive scalable and robust IDS
is considered a researching gap in the eld of computer security. In this chapter, we
introduce the background knowledge necessary to discuss the algorithms
proposed in following chapters that can partly ful ll the gap.
Firstly, a brief introduction to network anomaly detection is presented. We
then overview HIS. Next, immune selection algorithms, detectors, performance
metrics and their relevance are reviewed and discussed. Finally, some popular
datasets are examined.
1.1
Detection of Network Anomalies
The idea of intrusion detection is predicated on the belief that an intruder’s
behavior is noticeably di erent from that of a legitimate user and that many unauthorized actions are detectable [65]. Intrusion detection systems (IDSs) are deployed as a
second line of defense along with other preventive security mechanisms, such as user
6
authentication and access control. Based on its deployment, an IDS can act either
as a host-based or as a network-based IDS.
1.1.1
Host-Based IDS
A Host-Based IDS (HIDS) monitors and analyzes the internals of a
computing system. A HIDS may detect internal activity such as which program
accesses what resources and attempts illegitimate access, for example, an activity
that modi es the system password database. Similarly, a HIDS may look at the
state of a system and its stored information whether it is in RAM or in the le system
or in log les or elsewhere. Thus, one can think of a HIDS as an agent that monitors
whether anything or anyone internal or external has circumvented the security
policy that the operating system tries to enforce [12].
1.1.2
Network-Based IDS
A Network-Based IDS (NIDS) detects intrusions in network data. Intrusions
typically occur as anomalous patterns. Most techniques model the data in a sequential
fashion and detect anomalous subsequences. The primary reason for these
anomalies is the attacks launched by outside attackers who want to gain unauthorized
access to the network to steal information or to disrupt the network. In a typical setting,
a network is connected to the rest of the world through the Internet. The NIDS reads
all incoming packets or ows, trying to nd suspicious patterns. For example, if a large
number of TCP connection requests to a very large number of di erent ports are
observed within a short time, one could assume that there is someone committing a
port scan at some of the computers in the network. Port scans mostly try to detect
incoming shell codes in the same manner that an ordinary intrusion detection system
does. In addition to inspecting the incoming tra c, a NIDS also provides valuable
information about intrusion from outgoing or local tra c. Some attacks might even be
staged from the inside of a monitored network or network segment; and therefore, not
regarded as incoming tra c at all. The data available for intrusion detection systems
can be at di erent levels of granularity, like packet level traces or Cisco net ow data.
7
The data is high dimensional, typically, with a mix of categorical as well as continuous
numeric attributes. Misuse-based NIDSs attempt to search for known intrusive patterns
while an anomaly-based intrusion detector searches for unusual patterns. Today, the
intrusion detection research is mostly concentrated on anomaly-based network intrusion
detection because it can detect both known and unknown attacks [12].
1.1.3
Methods
On the basis of the availability of prior knowledge, the detection mechanism
used, the mode of performance and the ability to detect attacks, existing anomaly
detection methods are categorized into six broad categories [41] as shown in Fig.
1.1. This gure is adapted from [12].
Supervised
Learning
Unsupervised
Learning
Probabilistic
Learning
Anomaly
Detection
Soft
Computing
Knowledge
based
Combination
Learners
Figure 1.1: Classi cation of anomaly-based intrusion detection methods
- Xem thêm -