[ISI 2009 (IEEE Intelligence & Security Informatics 2009) -- Dallas, Texas - June 8-11, 2009]

Tutorial 1 Information Page
 

Tutorial 1: Data Mining for Security

This tutorial covers data mining techniques for malware detection and network intrusion detection, including the current state of the art. First, we present email worm detection using behavioral and statistical analysis. Second, we present techniques for detecting malicious executables using multi-level features. Third, we present techniques for detecting remote exploits using data mining. Fourth, we present network traffic mining based on data stream classification techniques. Fifth, we present stream classification techniques that cope with limited labeled training data. Finally, we present techniques for detecting entirely new classes of attack in stream data.

1) Email worm detection using behavioral and statistical analysis: Here we focus on applying data mining to detect email worms. We identify several features of benign and malicious emails through statistical and behavioral analysis of emails sent over a certain period of time. We then select the set of features that best distinguishes normal from viral emails using a two-level feature selection technique. In the first level, we apply Principal Component Analysis (PCA) to reduce the high dimensionality of the data and to find a projected, optimal set of attributes. In the second level, we apply the J48 decision tree algorithm to determine the relative importance of features based on information gain. We identify a subset of features, along with a set of classification rules, that performs better in detecting novel worms than either the original feature set or the PCA-reduced features.
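
The information-gain ranking used in the second level can be illustrated with a minimal sketch. This is not J48 itself, only the scoring idea it relies on; the feature names (has_attachment, many_recipients) and the four toy emails are hypothetical and do not come from the tutorial.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a discrete feature."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_features(rows, labels):
    """Return (feature names sorted by decreasing gain, gain per feature)."""
    names = list(rows[0].keys())
    gains = {n: information_gain([r[n] for r in rows], labels) for n in names}
    return sorted(gains, key=gains.get, reverse=True), gains

# Hypothetical toy data: one dict of discrete features per email.
rows = [
    {"has_attachment": 1, "many_recipients": 1},
    {"has_attachment": 1, "many_recipients": 0},
    {"has_attachment": 0, "many_recipients": 1},
    {"has_attachment": 0, "many_recipients": 0},
]
labels = ["worm", "worm", "normal", "normal"]

ranking, gains = rank_features(rows, labels)
```

In this toy set, has_attachment separates the two classes perfectly and many_recipients carries no information, so the first feature is ranked ahead of the second.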

2) Detecting malicious executables using multi-level features: We present a scalable, multi-level feature extraction technique to detect malicious executables. We propose a novel combination of three kinds of features at different levels of abstraction: binary n-grams, assembly instruction sequences, and Dynamic Link Library (DLL) function calls, extracted from binary executables, disassembled executables, and executable headers, respectively. We also propose an efficient and scalable feature extraction technique, and apply it to a large corpus of real benign and malicious executables. These features are extracted from the corpus and used to train a classifier, which achieves high accuracy and a low false positive rate in detecting malicious executables. Our model is compared against other feature-based approaches for malicious code detection, and is found to be more effective in terms of detection accuracy and false alarm rate.
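
Of the three feature types, binary n-grams are the simplest to illustrate. The following is a minimal sketch of n-gram extraction over raw bytes, assuming a sliding window of one byte and binary (present/absent) features; the function names, the sample bytes, and the tiny vocabulary are hypothetical, and the scalable extraction technique described above is not reproduced here.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count every length-n byte sequence in `data` (window slides by 1 byte)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def ngram_feature_vector(data: bytes, vocabulary, n: int):
    """Binary feature vector: 1 if the n-gram occurs in `data`, else 0."""
    present = byte_ngrams(data, n)
    return [1 if gram in present else 0 for gram in vocabulary]

# Hypothetical six-byte sample; real input would be the bytes of an executable.
sample = bytes.fromhex("4d5a90004d5a")
counts = byte_ngrams(sample, 2)

# Hypothetical vocabulary of selected 2-grams.
vocabulary = [b"\x4d\x5a", b"\xff\xff"]
vector = ngram_feature_vector(sample, vocabulary, 2)
```

A real corpus produces far more distinct n-grams than can be used directly, which is why a feature selection step normally follows extraction.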

3) Detecting Remote Exploits using Data Mining: We describe a Data Mining based Exploit code detector (DExtor) to protect network services. The main assumption of our work is that normal traffic into a network service contains only data, whereas exploit traffic contains code. Thus, the “exploit code detection” problem reduces to a “code detection” problem. DExtor is an application-layer attack blocker, which is deployed between a web service and its corresponding firewall. The system is first trained with real training data containing both exploit code and normal traffic. Training is performed by applying binary disassembly to the training data, extracting features, and training a classifier. Once trained, DExtor is deployed in the network to detect exploit code and protect the network service. We evaluate DExtor with a large collection of real exploit code and normal data. Our results show that DExtor can detect almost all exploit code with a negligible false alarm rate. We also compare DExtor with other published work and demonstrate its effectiveness.

4) Intrusion detection using network traffic mining: We show how network traffic mining can be mapped to a data stream mining problem, and propose an enhanced data stream classification technique to detect peer-to-peer botnet traffic. A botnet is a network of compromised hosts under the control of a single human attacker, called the botmaster. The botmaster uses the botnet for malicious activities such as spamming, phishing, DDoS attacks, and extortion. Botnet traffic can be viewed as stream data with two important properties: infinite length and concept drift. Stream data classification is therefore better suited to botnet detection than a simple classification technique. We propose a multi-partition, multi-chunk ensemble classifier to classify concept-drifting stream data. We have tested our technique on both botnet traffic and simulated data, and obtained better detection accuracy than other published work.
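
The general idea of a chunk-based ensemble for drifting streams can be sketched as follows. This is a deliberately simplified illustration, not the multi-partition, multi-chunk algorithm itself: it trains one toy nearest-centroid model per labeled chunk, keeps the models that score best on the newest chunk, and classifies by majority vote. The class names, the single numeric feature, and the data are all hypothetical.

```python
from collections import Counter

class CentroidClassifier:
    """Toy base learner: predicts the class whose mean vector is closest."""
    def __init__(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(self, x):
        def dist2(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=dist2)

class ChunkEnsemble:
    """Keeps the `max_models` models that do best on the newest labeled chunk."""
    def __init__(self, max_models=2):
        self.max_models = max_models
        self.models = []

    def predict(self, x):
        votes = Counter(m.predict(x) for m in self.models)
        return votes.most_common(1)[0][0]

    def update(self, X, y):
        # Simplification: the new model is scored on its own training chunk,
        # which flatters it; a careful version would hold out data instead.
        candidates = self.models + [CentroidClassifier(X, y)]
        def accuracy(m):
            return sum(m.predict(x) == t for x, t in zip(X, y)) / len(y)
        candidates.sort(key=accuracy, reverse=True)
        self.models = candidates[:self.max_models]

# Hypothetical stream: one feature per flow; bot traffic drifts from ~10 to ~4.
chunk1 = ([(0.0,), (1.0,), (10.0,), (11.0,)], ["benign", "benign", "bot", "bot"])
drifted = ([(0.0,), (1.0,), (4.0,), (5.0,)], ["benign", "benign", "bot", "bot"])

ensemble = ChunkEnsemble(max_models=2)
ensemble.update(*chunk1)
before = ensemble.predict((3.0,))   # judged with the pre-drift model only

ensemble.update(*drifted)
ensemble.update(*drifted)           # the stale model is now voted out
after = ensemble.predict((3.0,))
```

Because the ensemble holds a fixed number of models and discards the weakest ones, memory stays bounded on an unbounded stream and the vote tracks the most recent concept.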

5) A practical approach for network intrusion detection: We have already discussed how the network traffic mining problem can be mapped to data stream mining. Recent approaches to classifying data streams are based on supervised learning algorithms, which can be trained with labeled data only. Manual labeling of data is both costly and time consuming. Therefore, in a real streaming environment, where a large volume of data arrives at high speed, only a small fraction of the data can be labeled. Thus, only a limited number of instances will be available for training and updating the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by using both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data using the nearest neighbor algorithm. Empirical evaluation on both synthetic and real traffic, including botnet traffic, shows that our approach, using only a small amount of labeled data for training, outperforms state-of-the-art stream classification algorithms that use five times more labeled data.
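
To make the role of unlabeled data concrete, here is a minimal sketch that substitutes a plain cluster-then-label scheme for the semi-supervised clustering described above: all points are clustered with ordinary k-means, each cluster takes the majority label of its few labeled members, and a new instance receives the label of the nearest cluster centroid. It builds a single model rather than an ensemble, and the seeding rule, function names, and data are hypothetical.

```python
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_micro_clusters(points, labels, k, iterations=10):
    """Cluster labeled and unlabeled points together, then label each cluster.

    labels[i] is None for an unlabeled point. Returns (centroid, label) pairs.
    """
    centroids = [list(p) for p in points[:k]]  # naive seeding: first k points
    members = [[] for _ in range(k)]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            members[nearest].append(i)
        for c, idx in enumerate(members):
            if idx:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(points[i][d] for i in idx) / len(idx)
                                for d in range(len(points[0]))]
    clusters = []
    for c, idx in enumerate(members):
        votes = Counter(labels[i] for i in idx if labels[i] is not None)
        if votes:  # clusters without any labeled member are dropped
            clusters.append((centroids[c], votes.most_common(1)[0][0]))
    return clusters

def classify(x, clusters):
    """Label of the micro-cluster whose centroid is nearest to x."""
    return min(clusters, key=lambda mc: dist2(x, mc[0]))[1]

# Hypothetical chunk: six flows, only the first two carry labels.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels = ["benign", "bot", None, None, None, None]

clusters = build_micro_clusters(points, labels, k=2)
near_bot = classify((9, 9), clusters)
near_benign = classify((1, 1), clusters)
```

The four unlabeled points shape the centroids even though they contribute no votes, which is the sense in which unlabeled data helps when labels are scarce.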

6) Novel intrusion detection by mining network traffic: New kinds of intrusions often occur in real networks. These intrusions cannot be detected automatically by a traditional data stream classification technique, which assumes that the total number of classes (i.e., intrusions) in the stream is fixed. This assumption may not hold in a real streaming environment, where new intrusions may appear at any time. Traditional data stream classification techniques cannot recognize a novel kind of intrusion until its appearance is manually identified and labeled instances of that intrusion are presented to the learning algorithm for training. The problem becomes more challenging in the presence of concept drift, when the underlying data distributions evolve in the stream. We propose a novel and efficient technique that can automatically detect the emergence of a novel class in the presence of concept drift by quantifying cohesion among unlabeled test instances and separation of the test instances from training instances. Our approach is non-parametric, meaning that it does not assume any underlying distribution of the data.
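
The cohesion and separation idea can be illustrated with a silhouette-style score. The sketch below is an assumption-laden illustration rather than the proposed technique: cohesion is taken as the mean distance from a test instance to its q nearest fellow test instances, separation as the mean distance to its q nearest training instances, and a novel class is flagged when enough test instances are closer to each other than to the training data. The score formula, the parameters q and min_support, and the data are hypothetical.

```python
import math

def mean_knn_distance(x, others, q):
    """Mean Euclidean distance from x to its q nearest points in `others`."""
    dists = sorted(math.dist(x, o) for o in others)
    nearest = dists[:q]
    return sum(nearest) / len(nearest)

def novelty_score(x, test_neighbors, training, q=3):
    """Score in [-1, 1]; positive means x sits nearer to other test
    instances (cohesion) than to the training data (separation)."""
    a = mean_knn_distance(x, test_neighbors, q)  # cohesion: smaller is tighter
    b = mean_knn_distance(x, training, q)        # separation: larger is farther
    largest = max(a, b)
    return 0.0 if largest == 0 else (b - a) / largest

def looks_like_novel_class(test_points, training, q=3, min_support=3):
    """True if at least `min_support` test instances have a positive score."""
    positive = 0
    for i, x in enumerate(test_points):
        others = test_points[:i] + test_points[i + 1:]
        if others and novelty_score(x, others, training, q) > 0:
            positive += 1
    return positive >= min_support

# Hypothetical data: training instances of the known classes.
training = [(0, 0), (0, 1), (1, 0), (1, 1)]

# A tight group far from the training data versus scattered points.
tight_far_group = [(10, 10), (10, 11), (11, 10), (11, 11)]
scattered = [(0.5, 0.5), (5, -5), (-5, 5), (0.2, 0.8)]

novel = looks_like_novel_class(tight_far_group, training)
not_novel = looks_like_novel_class(scattered, training)
```

No distribution is fitted anywhere in this sketch; the decision depends only on pairwise distances, which is one way to read the non-parametric property mentioned above.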

Lecturers:
Latifur Khan and Mohammad Mehedy Masud
Department of Computer Science
University of Texas at Dallas
lkhan, mehedy@utdallas.edu
