Deep learning for network packet forensics using TensorFlow

TensorFlow is an open source library for machine learning that is most commonly used from Python. It performs numerical computation using dataflow graphs. This article looks at the use of TensorFlow as a forensic tool for classifying and predicting malware sourced from honeypots and honeynets.

Data mining and machine learning are key methods for extracting information from databases by analysing the hidden patterns in a huge repository of records. Classical methodology offers a number of algorithms for clustering, association rule mining, visualisation and classification, all of which yield meaningful information for predictive analysis.

Nowadays, a number of tools and technologies are available for the implementation of data mining and machine learning including WEKA, Tanagra, Orange, ELKI, KNIME, RapidMiner, R and many others which have Java, Python or C++ for back-end programming and customisation.

Machine learning has core tasks associated with classification and recognition, which are usually related to artificial intelligence. Generally, these operations are performed using some metaheuristic approach that searches a huge space of candidate solutions for a global optimum, or at least for an effective result.

There are a number of prominent soft computing approaches such as neural networks, fuzzy logic, support vector machines, swarm intelligence and metaheuristics. Metaheuristics, in turn, include many optimisation techniques such as Ant Colony Optimisation, Cuckoo Search, the Bees Algorithm and Particle Swarm Optimisation.

Despite the number of metaheuristic approaches and other effective soft computing algorithms, there are many applications for which a higher degree of accuracy and lower error rate is required.
For machine learning, artificial neural networks (ANN) or support vector machines (SVM) can be trained on the dataset at hand. ANN can be used for malware detection or classification, face recognition, and fingerprint or finger vein structure analysis, in which a previous dataset is used to train a model that then predicts or classifies further datasets. ANN based learning is fully dependent on the dataset used for training the model; if that dataset is not accurate, the accuracy of the predictive analysis will suffer.

To implement modelling and training with ANN, there are a number of open source tools available including SciLab, OpenNN, FANN, PyBrain and many others.

Figure 1: TensorFlow logo

Deep learning
Deep learning is a branch of machine learning built on algorithms with multi-layered processing and a higher degree of computation, aiming for greater accuracy and a lower error rate through deep graph based learning. It is also referred to as deep structured learning, deep machine learning or hierarchical learning.

Deep learning can be used for various real world applications including speech recognition, malware detection and classification, natural language processing, bioinformatics, computer vision and many others.

TensorFlow: A Python based open source software library for deep learning
TensorFlow (tensorflow.org) is a powerful open source software library for the implementation of deep learning. It is based on Python and C++ at the back-end, and uses dataflow graph based numerical computation to carry out multi-layered computations with higher accuracy and a lower error rate.
TensorFlow was developed by Google under its deep learning research project, Google Brain. It is Google's second generation system, following DistBelief. Google released TensorFlow as open source in November 2015, a move that has motivated research scholars, academicians and scientists to work on this powerful library.

TensorFlow can be installed with binary packages or authorised GitHub sources. Any one of the following methods can be used to install it:

  • Python Pip
  • Virtualenv
  • Anaconda
  • Docker Toolbox
  • Source based installation
Figure 2: The official TensorFlow page

Running and testing TensorFlow on the command line
At the terminal, the following instructions can be executed for testing (the Session API used here belongs to TensorFlow 1.x):

$ python
>>> import tensorflow as myTensorFlow
>>> MyMessage = myTensorFlow.constant('Hello To All')
>>> mysession = myTensorFlow.Session()
>>> print(mysession.run(MyMessage))
Hello To All
>>> x = myTensorFlow.constant(9)
>>> y = myTensorFlow.constant(2)
>>> print(mysession.run(x + y))
11
>>>
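
The session above builds its values first and computes them only when run() is called. The following is a minimal sketch of that build-then-run idea in plain Python. It is a toy model for illustration only: Node, constant and run are hypothetical names, not TensorFlow classes or functions, and nothing here reflects how TensorFlow is implemented internally.

```python
class Node:
    """One node of a toy dataflow graph (hypothetical, not a TensorFlow class)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op                  # "const" or "add"
        self.inputs = tuple(inputs)   # upstream nodes feeding this one
        self.value = value            # payload, used by "const" nodes only

    def __add__(self, other):
        # Adding two nodes records a new "add" node; nothing is computed yet.
        return Node("add", (self, other))


def constant(value):
    """Create a leaf node holding a fixed value."""
    return Node("const", value=value)


def run(node):
    """Evaluate a node by recursively evaluating the nodes it depends on."""
    if node.op == "const":
        return node.value
    if node.op == "add":
        left, right = (run(n) for n in node.inputs)
        return left + right
    raise ValueError("unknown operation: %r" % node.op)


x = constant(9)
y = constant(2)
total = x + y          # builds the graph only
result = run(total)    # evaluation happens here
print(result)
```

Separating graph construction from execution is what allows a library of this kind to inspect or optimise the whole computation before running it.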

Using TensorFlow for network packet analysis
For classification and prediction of malware from network packets, the dataset should be obtained from a trustworthy source such as honeypots, honeynets, data repositories or open data portals. Many anti-virus companies and research organisations release their datasets for R&D purposes, enabling research scholars and practitioners to implement and test their algorithms.

If the datasets of malware or network traffic are required, the following URLs can be used:

  • http://www.netresec.com/?page=PcapFiles
  • https://wiki.wireshark.org/SampleCaptures
  • https://snap.stanford.edu/data/#email
  • http://www.secrepo.com/

Network traffic is captured in PCAP format, which needs to be transformed into CSV (comma separated values) before it can be used for modelling. Snort IDS is used for this step: it reads the PCAP file and generates alert files for the traffic that matches its rules, and these alert files are then converted to CSV format.
To read a pcap file in Snort IDS, use either of the following commands:

$ snort -r mypcap.pcap
$ snort --pcap-single=mypcap.pcap
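
Before handing a capture to Snort, it can be useful to check that the file really is a readable PCAP. The sketch below assumes the classic pcap layout (a 24-byte global header followed by 16-byte record headers), not the newer pcapng format, and uses only the Python standard library. The function name summarise_pcap and the in-memory sample capture are made up for illustration.

```python
import io
import struct


def summarise_pcap(stream):
    """Count the packets in a classic (microsecond) pcap stream."""
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("too short for a pcap global header")
    magic = header[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"   # written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"   # written on a big-endian machine
    else:
        raise ValueError("not a classic pcap file")
    # version major/minor, timezone offset, sigfigs, snaplen, link type
    major, minor, _zone, _sigfigs, snaplen, linktype = struct.unpack(
        endian + "HHiIII", header[4:])
    packets = 0
    captured_bytes = 0
    while True:
        record = stream.read(16)
        if len(record) < 16:
            break  # end of file
        _sec, _usec, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            break  # truncated final packet
        packets += 1
        captured_bytes += incl_len
    return {"version": (major, minor), "snaplen": snaplen,
            "linktype": linktype, "packets": packets,
            "bytes": captured_bytes}


# A tiny hand-built capture with two dummy packets, for demonstration only.
sample = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
sample += struct.pack("<IIII", 0, 0, 3, 3) + b"abc"
sample += struct.pack("<IIII", 1, 0, 2, 2) + b"de"
summary = summarise_pcap(io.BytesIO(sample))
print(summary)
```

For a real capture, the same function can be given a file opened in binary mode, for example open('mypcap.pcap', 'rb').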

The alert file format generated by Snort IDS is as follows:

[**] [1:2016936:2] Suspicious inbound to Database port 1433 [**]
[Classification: Potentially Bad Traffic] [Priority: 2] 
07/07-10:10.817217 <IP>:<PORT> -> <IP>:<PORT>
TCP TTL:112 TOS:0x0 ID:256 IpLen:24 DgmLen:20
******S* Seq: 0x43EE1111  Ack: 0x0  Win: **** TcpLen: 20
[Xref => http://doc.url.net/2016936]
 
[**] [1:2016936:2] Suspicious inbound to Database port 1433 [**]
[Classification: Potentially Bad Traffic] [Priority: 2] 
07/07-10:10.838622 <IP>:<PORT> -> <IP>:<PORT>
TCP TTL:113 TOS:0x0 ID:256 IpLen:22 DgmLen:21
******S* Seq: 0x6D5F1111  Ack: 0x0  Win: **** TcpLen: 20
[Xref => http://doc.url.net/2016936]

From the Snort IDS generated alert file, the records classified as ‘Potentially Bad Traffic’ can be extracted and placed in a separate CSV file.
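
One possible way to carry out this extraction is sketched below using only the Python standard library. It assumes that the alert file follows the layout shown above, with alerts separated by blank lines. The function names, the choice of columns and the sample addresses (taken from the documentation address ranges) are illustrative assumptions, not part of Snort or of the original workflow.

```python
import csv
import io
import re

# Patterns for the fields of interest in the alert layout shown above.
CLASS_RE = re.compile(r"\[Classification:\s*([^\]]+)\]")
ADDR_RE = re.compile(r"(\S+)\s+->\s+(\S+)")
TTL_RE = re.compile(r"TTL:(\d+)")
IPLEN_RE = re.compile(r"IpLen:(\d+)")
DGMLEN_RE = re.compile(r"DgmLen:(\d+)")
FIELDS = ["Classification", "DGMLEN", "IPLEN", "TTL", "IP"]


def parse_alerts(text):
    """Yield one dict per alert; alerts are separated by blank lines."""
    for block in re.split(r"\n\s*\n", text.strip()):
        found = [rx.search(block) for rx in
                 (CLASS_RE, ADDR_RE, TTL_RE, IPLEN_RE, DGMLEN_RE)]
        if not all(found):
            continue  # skip blocks that do not match the expected layout
        cls, addr, ttl, iplen, dgmlen = found
        yield {
            "Classification": cls.group(1).strip(),
            "DGMLEN": int(dgmlen.group(1)),
            "IPLEN": int(iplen.group(1)),
            "TTL": int(ttl.group(1)),
            "IP": addr.group(1).rsplit(":", 1)[0],  # source address, port removed
        }


def alerts_to_csv(text, wanted="Potentially Bad Traffic"):
    """Return CSV text containing only the alerts with the wanted classification."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in parse_alerts(text):
        if row["Classification"] == wanted:
            writer.writerow(row)
    return out.getvalue()


# Made-up sample in the same layout as the alerts above.
SAMPLE = """\
[**] [1:2016936:2] Suspicious inbound to Database port 1433 [**]
[Classification: Potentially Bad Traffic] [Priority: 2]
07/07-10:10.817217 192.0.2.10:4444 -> 198.51.100.5:1433
TCP TTL:112 TOS:0x0 ID:256 IpLen:24 DgmLen:20
******S* Seq: 0x43EE1111  Ack: 0x0  Win: **** TcpLen: 20

[**] [1:1000001:1] Example informational alert [**]
[Classification: Misc activity] [Priority: 3]
07/07-10:11.000000 192.0.2.11:5555 -> 198.51.100.5:80
TCP TTL:64 TOS:0x0 ID:300 IpLen:20 DgmLen:40
"""
csv_text = alerts_to_csv(SAMPLE)
print(csv_text)
```

Only the first sample alert is written out, since the second carries a different classification.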
The resulting CSV file is then read into Python to train the model and to carry out predictive analysis of incoming network traffic. Using this approach, new traffic can be assessed for the probability of being malware.
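
The IP column in such a file holds dotted-quad strings, which a numeric classifier cannot consume directly. One crude option, shown here purely as an assumption about how the column might be prepared, is to map each IPv4 address to its 32-bit integer value with the standard ipaddress module. The helper names and sample rows are hypothetical, and other encodings may suit a particular dataset better.

```python
import ipaddress


def ip_to_int(text):
    """Map a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(text))


def encode_ip_column(rows, column="IP"):
    """Return copies of the rows with the given column converted to integers."""
    encoded = []
    for row in rows:
        new_row = dict(row)               # leave the caller's rows untouched
        new_row[column] = ip_to_int(row[column])
        encoded.append(new_row)
    return encoded


# Made-up rows using documentation address ranges.
rows = [{"IP": "192.0.2.10", "TTL": 112}, {"IP": "198.51.100.5", "TTL": 64}]
encoded = encode_ip_column(rows)
print(encoded)
```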

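>>> import pandas
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.model_selection import train_test_split
>>> dataset = pandas.read_csv('mydataset.csv')
>>> dataset.shape
(1000, 5)
>>> dataset.columns
Index([u'Classification', u'DGMLEN', u'IPLEN', u'TTL', u'IP'],dtype='object')
>>> # Target is the alert classification; the features must already be numeric
>>> y, X = dataset['Classification'], dataset[['DGMLEN', 'IPLEN', 'IP']].fillna(0)
>>> # Hold out 10 per cent of the rows for testing
>>> X_Modeltrain, X_Modeltest, y_Modeltrain, y_Modeltest = train_test_split(X, y, test_size=0.1, random_state=29)
>>> lr = LogisticRegression()
>>> lr.fit(X_Modeltrain, y_Modeltrain)
>>> print(accuracy_score(y_Modeltest, lr.predict(X_Modeltest)))
Output - 0.60183838
The accuracy score of the trained model can be evaluated in this way.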
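
To make the last two steps concrete, here is a minimal sketch in plain Python of what a hold-out split and an accuracy score compute. It illustrates the idea only and is not the library implementation used in the session above; holdout_split and accuracy are hypothetical names, and the labels are made up.

```python
import random


def holdout_split(rows, test_size=0.1, seed=29):
    """Shuffle the rows and set aside a fraction of them for testing."""
    rng = random.Random(seed)        # fixed seed gives a repeatable split
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_size)))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists differ in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


train_rows, test_rows = holdout_split(range(1000))
print(len(train_rows), len(test_rows))

# Made-up labels: three of the four predictions are correct.
truth = ["bad", "ok", "bad", "bad"]
predicted = ["bad", "ok", "ok", "bad"]
score = accuracy(truth, predicted)
print(score)
```

Because accuracy is measured on rows the model never saw during training, it gives an estimate of how the model may behave on new traffic.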
Figure 3: Live demo of TensorBoard

Visualisation of learning and graphs using TensorBoard
Learning behaviour can be visualised and analysed using TensorBoard, which is a suite of Web based applications. TensorBoard can visualise five types of data (images, audio, scalars, graphs and histograms), which makes for better and more effective analysis.

A live demo of TensorBoard is available at http://www.tensorflow.org/tensorboard/.

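To use TensorBoard, start it from the terminal with the tensorboard command, pointing its --logdir option at the directory that holds the training logs, and then open http://localhost:6006/ in a Web browser (6006 is the default port).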

The navigation tabs in the top right panel are used to switch between the different types of visualisation.

Using TensorFlow for research and development
TensorFlow can be used by research scholars and academicians to implement research proposals and projects that include deep learning. The results obtained with TensorFlow can be compared with those of other analytics and machine learning tools, so that researchers gain clear, first-hand experience.
