
Document: Machine learning approaches to cyber security


Description:

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH UNIVERSITY OF TECHNOLOGY
COMPUTER SCIENCE AND ENGINEERING FACULTY

GRADUATION THESIS
Machine Learning Approaches to Cyber Security

Department: Computer Science
Committee: Computer Science 3
Advisor: Prof. Nguyen Duc Thai
Reviewer: Prof. Nguyen Le Duy Lai
Students: Huynh Kien Van - 1552423
          Nguyen Duc Kien - 1552181

HO CHI MINH CITY, 12/2021

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF TECHNOLOGY
Faculty: Computer Science and Engineering
Department: Computer Systems and Networks

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

GRADUATION THESIS ASSIGNMENT
Note: Students must attach this sheet to the first page of the thesis report.

Full name: Huỳnh Kiến Văn - Student ID: 1552423
Full name: Nguyễn Đức Kiên - Student ID: 1552181
Major: Computer Science
Class:

1. Thesis title: Machine Learning Approaches for Cyber Security.
2. Tasks (requirements on content and initial data):
- Do research on Machine Learning and its applications.
- Do research on topics related to the application of Machine Learning to cyber security.
- Propose a way to create an IDS (Intrusion Detection System) using Machine Learning.
- Design the system described above.
- Implement the system using any programming language(s) and technologies, and show that they are suitable for the solution.
- Demonstrate the system to make sure it runs properly and correctly.
3. Date of assignment: 30/08/2021
4. Date of completion: 31/12/2021
5. Advisor: Dr. Nguyễn Đức Thái

The content and requirements of the thesis have been approved by the Department.
23 August 2021
HEAD OF DEPARTMENT (signature and full name): Dr. Nguyễn Đức Thái
MAIN ADVISOR (signature and full name): Dr. Nguyễn Đức Thái

FOR FACULTY AND DEPARTMENT USE:
Reviewer (preliminary grading): ________________________
Unit: _______________________________________
Defense date: ___________________________________
Final grade: _________________________________
Thesis archive location: ______________________________

28 December 2021
THESIS DEFENSE EVALUATION FORM
(For the advisor)

1. Students: Huỳnh Kiến Văn - Student ID: 1552423
             Nguyễn Đức Kiên - Student ID: 1552181
   Major (specialization): Computer Science
2. Topic: Machine Learning Approaches for Cyber Security.
3. Advisor: Nguyễn Đức Thái
4. Overview of the report:
   Number of pages:
   Number of chapters:
   Number of data tables:
   Number of figures:
   Number of references:
   Computational software:
   Artifacts (products):
5. Overview of the drawings:
   - Number of drawings:    A1 sheets:    A2 sheets:    Other sizes:
   - Hand-drawn drawings:    Computer-drawn drawings:
6. Main strengths of the thesis:
• The students completed the desired features of the thesis and demonstrated them.
• The students applied machine learning algorithms to analyze network traffic and cybersecurity data.
7. Main shortcomings of the thesis:
• Many parts of the report are short and lack justification.
• The students provided an evaluation of the results; however, the evaluation was too short and no discussion was presented.
8. Recommendation: Approved for defense [x]   Additions required before defense [ ]   Not approved for defense [ ]
9. Three questions the students must answer before the Committee:
10. Overall evaluation (in words: excellent, good, average):
    Score (Huỳnh Kiến Văn, Nguyễn Đức Kiên): 8.2/10, 7/10
Signature (full name): Nguyễn Đức Thái

27 December 2021
THESIS DEFENSE EVALUATION FORM
(For the advisor/reviewer)

1. Students: Huynh Kien Van - 1552423
             Nguyen Duc Kien - 1552181
   Major (specialization): Computer Science
2. Topic: MACHINE LEARNING APPROACHES TO CYBER SECURITY
3. Advisor/reviewer: Nguyễn Lê Duy Lai
4. Overview of the report:
   Number of pages: 40
   Number of chapters: 8
   Number of data tables:
   Number of figures: 10
   Number of references: 14
   Computational software:
   Artifacts (products):
5. Overview of the drawings:
   - Number of drawings:    A1 sheets:    A2 sheets:    Other sizes:
   - Hand-drawn drawings:    Computer-drawn drawings:
6. Main strengths of the thesis:
This dissertation addresses how to apply machine learning approaches to analyze live network traffic. The implemented Traffic validator is expected to classify incoming traffic as benign or malicious. The network traffic has already been filtered through a rule-based IDS such as Snort, and the model is an add-on to IDSs that aims to eliminate the false negatives of the rule-based IDS. The detection task is usually expected to be timed in milliseconds, as IDSs must respond quickly and without affecting the user experience.
7. Main shortcomings of the thesis:
However, the presented work is still limited, including its inability to support a complex prediction model. The title of the thesis is very broad, and the authors need to give a concise scope for the problems and the approach to solving them. The fundamentals of networks in Section 2.1 remain very elementary and may not be necessary in this context. The approach also has limitations; for example, encrypted packets are not processed by most intrusion detection devices.
8. Recommendation: Approved for defense / Additions required before defense / Not approved for defense
9. Questions the students must answer before the Committee:
a. Attackers can use techniques such as fragmentation, avoiding defaults, coordinated low-bandwidth attacks, address spoofing/proxying, and pattern change evasion to evade an IDS. What do you think of these techniques, and can your IDS with the integrated ML module be immune to them?
10. Overall evaluation (in words: excellent, good, average):
    Score: 8/10
Signature (full name): Nguyễn Lê Duy Lai

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING

BACHELOR OF ENGINEERING THESIS
MACHINE LEARNING APPROACHES TO CYBER SECURITY

COMPUTER SCIENCE COMMITTEE
Advisor: Prof. Nguyen Duc Thai
Examiner: Prof. Nguyen Le Duy Lai
Students: Huynh Kien Van - 1552423
          Nguyen Duc Kien - 1552181

Ho Chi Minh, 2021

Commitment
We guarantee that the work in this dissertation was completed in accordance with the University's regulations and that it has not been submitted to any other academic institution. The work is our own, unless otherwise stated in the text by a particular reference.

Acknowledgement
First and foremost, we would like to express our special thanks and gratitude to our supervisor, Professor Nguyen Duc Thai, for his never-ending grace. His guidance and valuable knowledge helped us throughout the writing of this thesis. Professor Nguyen Duc Thai also supported us with both his expertise and his encouragement as we worked on the thesis.
We also extend our gratitude to our families and friends, who have always been beside us in hard moments and encouraged us during this thesis and our university life.

Abstract
In this thesis, we propose a machine learning-based approach to analyzing live network traffic and detecting malicious flows. To increase accuracy and reduce false-negative cases, we apply deep learning. We build two RNN models, an LSTM and a GRU, to classify network traffic as malicious or normal. Technically, the RNN model runs in parallel with an IDS; the two results are combined, and the action to take is chosen according to a decision table. The dataset used in this thesis comes mainly from MTA-KDD'19, which was created by the project of the same name. To enrich our data, we also use the ISCX2012 [1] and USTC-TFC2016 [2] datasets, preprocessed following the strategy of the MTA-KDD'19 work.
Overall, the LSTM model performs better than the GRU model. In accuracy, the LSTM model even exceeds the MTA-KDD'19 work, which used a traditional neural network: 99.8% compared to 99.74%. In precision, the LSTM reaches 98.3% and the GRU reaches 99.5%. Because our goal is to eliminate false negatives, we also report the recall of the two models: 99.75% for the GRU model and 99.8% for the LSTM model.

Contents
List of Figures
List of Tables
List of Abbreviations
1 INTRODUCTION
    1.1 Overview
    1.2 Objective and scope
    1.3 Thesis structure
2 BACKGROUND KNOWLEDGE
    2.1 Fundamental of network
        2.1.1 Networking concept
        2.1.2 Reference models
    2.2 Intrusion Detection System
    2.3 Word Embedding
    2.4 Deep Neural Network
        2.4.1 Recurrent Neural Network
        2.4.2 Long Short Term Memory
3 LITERATURE REVIEW
    3.1 Deep learning-based approach in improvement signature of IDSs
    3.2 Deep learning-based approaches for classifying network traffic
4 PROPOSED APPROACHES
    4.1 Problem statements
    4.2 Proposed approach
    4.3 Design
5 DATASET
6 IMPLEMENTATION
    6.1 Data pre-processing
        6.1.1 Explaining features of MTA-KDD'19 dataset
        6.1.2 Data processing
    6.2 Prediction module
7 EXPERIMENTS
    7.1 Evaluation methods
        7.1.1 Data preparation
        7.1.2 Confusion matrix
        7.1.3 Accuracy
        7.1.4 Precision
    7.2 Model evaluation
8 CONCLUSION
Appendices
References

List of Figures
2.1 The OSI model
2.2 The TCP model
2.3 Convolution Neural Network Architecture
2.4 General Neural Network
2.5 LSTM
4.1 Traffic validator architecture
5.1 Formula
5.2 functions
6.1 Data processing task
6.2 The full network model

List of Tables
4.1 Decision table for IDS and Prediction model
5.1 Summary of benign and malware traffic in USTC-TFC2016 dataset
7.1 Confusion matrix with normalize
7.2 Model evaluation
7.3 Comparing models
List of Abbreviations
IDS: Intrusion Detection System
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
OSI: Open Systems Interconnection
TCP/IP: Transmission Control Protocol/Internet Protocol
UDP: User Datagram Protocol
HTTP: Hypertext Transfer Protocol
SMTP: Simple Mail Transfer Protocol
FTP: File Transfer Protocol
NIDS: Network Intrusion Detection System
HIDS: Host Intrusion Detection System
WAF: Web Application Firewall
GRU: Gated Recurrent Unit

1 INTRODUCTION

1.1 Overview
Many parts of the world have changed as a result of our increased reliance on technology. The more technology advances, the more valuable data and information become. Furthermore, as computer networks become more complex, the number of cyber security risks increases, with a wide range of sophistication, making it more difficult for professionals to detect and defend against the dangers posed by billions of network traffic flows. Cyber attacks can result in data breaches and significant financial losses. The World Economic Forum¹ (WEF) estimates the economic cost of cybercrime at $3 trillion worldwide in 2015, rising to $6 trillion globally in 2021.
Threat detection and prevention mostly depend on an Intrusion Detection System (IDS) or Intrusion Prevention System (IPS), which analyzes network traffic for signatures that match known cyber attacks. The line between Intrusion Detection and Intrusion Prevention Systems (IDS and IPS respectively) has become increasingly blurred.
Currently, signature-based IDSs are the more common kind, since they are reliable and used by many organizations. That being said, traditional signature-based IDSs are prone to false negatives (FN) and cannot identify novel attacks such as zero-day exploits², because they identify attacks based on known attack signatures. The false-negative state is the most serious and dangerous state: the IDS identifies an activity as acceptable when the activity is actually an attack³. In other words, a false negative occurs when the IDS fails to catch an attack.
Over the last few years, the strong development of machine learning has contributed greatly to automatic behavioral anomaly detection. In this thesis, we propose a machine learning approach to this field. By applying machine learning, we aim to build an intelligent system that classifies threats and threat actors, helps detect potential attacks, improves the domain of computer security, and provides better protection.

1.2 Objective and scope
In this thesis, we want to apply machine learning approaches to analyse millions of live network traffic flows. Automation through artificial intelligence lets specialists obtain a deeper understanding of each situation, which improves the performance and reliability of network security. With the Traffic validator, we expect to classify incoming traffic into benign and malicious classes. The network traffic has already been filtered through a rule-based IDS such as Snort, and our model is an add-on to IDSs that aims to eliminate the false negatives of the rule-based IDS.
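To make the false-negative problem concrete, here is a minimal Python sketch of a signature check followed by a second-stage check on the traffic the first stage let through. Everything in it is an illustrative assumption: the `Flow` record, the single `UNION SELECT` signature, the length-based `toy_model` standing in for a trained classifier, and the rule "flag a flow if either stage flags it". It is not the thesis's Traffic validator, and it does not reproduce the decision table used to combine the IDS and the model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Flow:
    payload: str     # simplified stand-in for a network flow (hypothetical)
    is_attack: bool  # ground-truth label, available only in a labelled test set


Detector = Callable[[Flow], bool]  # returns True when the flow is flagged

SIGNATURES = ["UNION SELECT"]  # hypothetical list of known attack signatures


def signature_ids(flow: Flow) -> bool:
    """Toy rule-based IDS: flag a flow only if it contains a known signature."""
    return any(sig in flow.payload for sig in SIGNATURES)


def toy_model(flow: Flow) -> bool:
    """Placeholder for a trained classifier: flags unusually long payloads."""
    return len(flow.payload) > 40


def with_second_stage(ids: Detector, model: Detector) -> Detector:
    """Assumed combination rule: the model re-checks only what the IDS passed."""
    def combined(flow: Flow) -> bool:
        # `or` short-circuits, so the model runs only when the IDS says "benign".
        return ids(flow) or model(flow)
    return combined


def count_false_negatives(flows: Iterable[Flow], detector: Detector) -> int:
    """A false negative is an attack that the detector reports as acceptable."""
    return sum(1 for f in flows if f.is_attack and not detector(f))


flows: List[Flow] = [
    Flow("GET /index.html", False),                             # benign
    Flow("GET /?id=1 UNION SELECT password FROM users", True),  # known signature
    Flow("POST /upload " + "A" * 64, True),                     # no signature
]

combined = with_second_stage(signature_ids, toy_model)
print(count_false_negatives(flows, signature_ids))  # 1: the attack without a signature
print(count_false_negatives(flows, combined))       # 0: the second stage catches it
```

The third flow carries no known signature, so the rule-based stage alone misses it; that miss is exactly the false-negative case the add-on model is meant to reduce.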
The detection task is expected to complete within milliseconds, as IDSs must respond quickly and without affecting the user experience.

¹ Available at https://reports.weforum.org
² A zero-day attack (also referred to as Day Zero) is an attack that exploits a potentially serious software security weakness that the vendor or developer may be unaware of.
³ Available at https://owasp.org/www-community/controls/Intrusion_Detection

1.3 Thesis structure
The remainder of this report, covering the research and the proposed approaches, is organised as follows:
Chapter 2 introduces the background knowledge of this thesis, including the fundamentals of networking and Intrusion Detection Systems; machine learning algorithms for clustering; recurrent deep neural network types and their layers; and the frameworks and libraries used in the system.
Chapter 3 reviews related work on machine learning approaches to anomalous traffic detection and dataset processing.
Chapter 4 presents the problems as well as the input and output of the malicious network traffic detector, and then proposes the design and architecture of the solution to each problem.
Chapter 5 describes the datasets used to train and evaluate the malicious network traffic module, with details of how the raw datasets are processed.

2 BACKGROUND KNOWLEDGE
2.1 Fundamental of network
To understand anomaly detection in networks, we must have a good understanding of basic network concepts. Therefore, the first part of this chapter discusses networking concepts and types of networks.

2.1.1 Networking concept
A network is a complex interacting system composed of many individual entities. Two or more computer systems that can send data to or receive data from each other through a medium that they share and access are said to be connected. The behavior of the individual entities contributes to the ensemble behavior of the entire network. In a computer network, there are generally three communicating entities:
• Users: Humans who perform various activities on the network, such as browsing Web pages and shopping.
• Hosts: Computers, each of which is identified by a unique address, such as an IP address.
• Processes: Instances of executable programs used in a client-server architecture. Client processes request a network service from one or more servers, and the server processes provide the requested services to the clients. An example of a client is a Web browser that requests pages from a Web server, which responds to the requests.

2.1.2 Reference models
To simplify network engineering, the whole networking concept is divided into multiple layers. Each layer is involved in some particular task and is independent of all other layers. Taken as a whole, however, almost all networking tasks depend on all of these layers. Layers share data between them, and they depend on each other only to take input and send output.

Layered architecture
In the layered architecture of a network model, one whole network process is divided into small tasks. Each small task is then assigned to a particular layer, which is dedicated to processing that task only. Every layer does only its specific work. The rules and conventions used in the conversation between layer n on one host and layer n on another are collectively referred to as the layer-n protocol.
A protocol represents an agreement as to how communication takes place between two parties. It is a set of rules governing the meaning and format of the packets or messages exchanged between two or more peer entities, as well as the actions taken when a message is transmitted or received, and in certain other situations. Computer networks use protocols extensively.
In a layered communication system, one layer of a host deals with the task done by, or to be done by, its peer layer at the same level on the remote host. The task is initiated either by the layer at the lowest level or by the layer at the topmost level. If the task is initiated by the topmost layer, it is passed on to the layer below for further processing. That layer does the same: it processes the task and passes it on to the next lower layer. If the task is initiated by the lowest layer, the reverse path is taken.
There are two well-known reference models: the Open Systems Interconnection (OSI) reference model and the Transmission Control Protocol/Internet Protocol (TCP/IP) reference model.

The ISO¹ OSI Reference Model
The Open Systems Interconnection model (OSI model²) is a conceptual model created by the International Organization for Standardization that characterizes and standardizes the communication functions of a telecommunication or computing system regardless of its underlying internal structure and technology. Its objective is to ensure that different communication systems can interoperate using standard communication protocols. The model divides the flow of data in a communication system into seven abstraction layers, from the physical transfer of bits across a communication channel up to the highest-level representation of data in a distributed application. Each intermediate layer provides a class of functionality to the layer above it and is served by the layer below it.
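The way a task travels down the layers on the sending host and back up on the receiving host can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a real protocol stack: the four-layer `LAYERS` list and the bracketed text tags standing in for protocol headers are hypothetical, and real headers carry addresses, lengths, and checksums rather than a layer name.

```python
# Simplified layer stack, listed from the topmost layer down (assumption).
LAYERS = ["application", "transport", "network", "data link"]


def send_down(message: str) -> str:
    """Sender: each layer wraps what it receives from the layer above."""
    frame = message
    for layer in LAYERS:
        frame = f"[{layer}]{frame}"  # prepend this layer's stand-in header
    return frame


def receive_up(frame: str) -> str:
    """Receiver: each layer removes its peer's header, lowest layer first."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]"
        if not frame.startswith(header):
            raise ValueError(f"expected a {layer} header")
        frame = frame[len(header):]
    return frame


wire = send_down("hello")
print(wire)              # [data link][network][transport][application]hello
print(receive_up(wire))  # hello
```

Each layer on the receiving side touches only the header written by its peer on the sending side, which is the sense in which a layer "deals with the task done by its peer layer" while relying on the layer below to deliver it.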
Standard communication protocols are used to implement these classes of functionality in software. The seven layers are:
• Application layer: This layer provides users access to the OSI environment and to distributed information services.
• Presentation layer: This layer provides independence to application processes from differences in data representation (syntax).
• Session layer: This layer provides the control structure for communication between applications. Establishment, management and termination of connections (sessions) between cooperating applications are its major responsibilities.
• Transport layer: This layer supports reliable and transparent transfer of data between two end points. It also supports end-to-end error recovery and flow control.
• Network layer: This layer provides upper layers independence from the data transmission and switching technologies used to connect systems. It is also responsible for the establishment, maintenance and termination of connections.
• Data link layer: This layer is responsible for reliably transferring information across the physical link. It transfers blocks (frames) with the necessary synchronization, error control and flow control.
• Physical layer: This layer is responsible for transmitting a stream of unstructured bits over the physical medium. It must deal with the mechanical, electrical, functional and procedural issues of accessing the physical medium.

¹ International Organization for Standardization
² Available at https://www.cloudflare.com/

Figure 2.1: The OSI reference model

TCP/IP³ Reference Model
The functions of the TCP/IP architecture are divided into five layers. The functions of layers 1 and 2 are supported by bridges, and layers 1 through 3 are implemented by routers.
• The application layer: The application layer is responsible for supporting network applications.
As computer and networking technologies change, the types of applications supported by this layer also change. Each application, such as file transmission, has its own specific module. The application layer includes many protocols for different purposes, such as the Hypertext Transfer Protocol (HTTP) for Web browsing, the Simple Mail Transfer Protocol (SMTP) for electronic mail, and the File Transfer Protocol (FTP) for file transfer.
• The transport layer: The transport layer carries application layer messages between the client and the server. Two transport protocols are available for this: TCP and UDP (User Datagram Protocol). TCP offers a reliable, connection-oriented service for delivering application layer messages to the destination. TCP also segments long messages into shorter segments and provides a congestion control mechanism that lets a source reduce its transmission rate when the network is congested. The UDP

³ Available at https://docs.oracle.com/cd/E19683-01/806-4075/ipov-10/index.html
