MAWILab

What is MAWILab?

MAWILab labels

MAWILab annotates traffic anomalies in the MAWI archive with four different labels: anomalous, suspicious, notice, and benign.

The label anomalous is assigned to all abnormal traffic and should be identified by any efficient anomaly detector.
The label suspicious is assigned to all traffic that is probably anomalous but not clearly identified by our method.
The label notice is assigned to all traffic that our method does not identify as anomalous but that has been reported by at least one anomaly detector. This traffic should not be identified by an anomaly detector; we do not label it as benign so that all the alarms reported by the combined detectors can be traced.
All other traffic is labeled benign because none of the anomaly detectors identified it.

Anomaly classification

For a better understanding of identified anomalies, MAWILab also employs two distinct anomaly classification techniques:

a simple heuristic based on port numbers, TCP flags and ICMP codes (presented in the original MAWILab research paper, CoNEXT 2010),
and a taxonomy of backbone traffic anomalies based on protocol headers and connection patterns (presented in Mazel et al., TRAC 2014, and http://www.fukuda-lab.org/mawilab/classification/index.html).

Simple heuristic

The heuristic inspects the port numbers, TCP flags and ICMP codes of anomalous traffic and assigns a code to each anomaly. A code lower than 500 means that the anomalous traffic uses well-known suspicious ports or contains an abnormally high number of packets with the SYN, RST or FIN flag set:

1:Sasser worm
2:Netbios attack
3:RPC attack
4:SMB attack
10:SYN attack
11:RST attack
12:FIN attack
20:Ping flood
51:FTP attack
52:SSH attack
53:HTTP attack
54:HTTPS attack
else:Other

If the value is between 500 and 900 it means the anomaly is seen on well known ports:

501:FTP traffic
502:SSH traffic
503:HTTP traffic
504:HTTPS traffic
else:Other

If the value is higher than 900 it means the anomaly is seen on unknown ports.

901:Unknown
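
The three code ranges above can be turned into a small lookup helper. The following is only a sketch built from the tables on this page: the function name and the range names ("suspicious_ports_or_flags", "well_known_ports", "unknown_ports") are hypothetical and are not part of MAWILab, and treating every code above 900 as "Unknown" is an assumption.

```python
from typing import Tuple

# Code tables copied from the lists above.
SUSPICIOUS_CODES = {
    1: "Sasser worm", 2: "Netbios attack", 3: "RPC attack", 4: "SMB attack",
    10: "SYN attack", 11: "RST attack", 12: "FIN attack",
    20: "Ping flood",
    51: "FTP attack", 52: "SSH attack", 53: "HTTP attack", 54: "HTTPS attack",
}
WELL_KNOWN_PORT_CODES = {
    501: "FTP traffic", 502: "SSH traffic",
    503: "HTTP traffic", 504: "HTTPS traffic",
}


def describe_heuristic(code: int) -> Tuple[str, str]:
    """Return (range name, description) for a heuristic code.

    The range names are hypothetical labels chosen for this sketch.
    """
    if code < 500:
        return ("suspicious_ports_or_flags", SUSPICIOUS_CODES.get(code, "Other"))
    if code <= 900:
        return ("well_known_ports", WELL_KNOWN_PORT_CODES.get(code, "Other"))
    # Assumption: any code above 900 is reported as "Unknown", like 901.
    return ("unknown_ports", "Unknown")
```

Codes that fall in a range but are not listed are reported as "Other", mirroring the "else:Other" entries in the tables.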

Anomaly taxonomy

Mazel et al. (TRAC 2014) presented a taxonomy that reveals the nature of backbone traffic anomalies. MAWILab takes advantage of this taxonomy to provide more insights into the identified anomalies. The taxonomy consists of more than one hundred labels and corresponding signatures to classify events identified in backbone traffic. The details of labels and signatures are available at http://www.fukuda-lab.org/mawilab/classification/ .

Since MAWILab v1.1, the plots depicting the byte and packet breakdown in the data set webpages (e.g. http://www.fukuda-lab.org/mawilab/v1.1/index.html) are also based on this taxonomy. Each class in the plots corresponds to labels with a certain prefix:

Unknown are labels starting with the prefixes "unk" and "empty"
Other are labels starting with the prefixes "ttl_error","hostout","netout", and "icmp_error"
HTTP are labels starting with the prefixes "alphflHTTP","ptmpHTTP","mptpHTTP","ptmplaHTTP" and "mptplaHTTP"
Multi. points are labels starting with the prefixes "ptmp","mptp" and "mptmp"
Alpha flow are labels starting with the prefixes "alphfl","malphfl","salphfl","point_to_point" and "heavy_hitter"
IPv6 tunneling are labels starting with the prefixes "ipv4gretun" and "ipv46tun"
Port scan are labels starting with the prefixes "posca" and "ptpposca"
Network scan ICMP are labels starting with the prefixes "ntscIC" and "dntscIC"
Network scan UDP are labels starting with the prefixes "ntscUDP" and "ptpposcaUDP"
Network scan TCP are labels starting with the prefixes "ntscACK","ntscSYN","sntscSYN","ntscTCP","ntscnull","ntscXmas","ntscFIN" and "dntscSYN"
DoS are labels starting with the prefixes "DoS","distributed_dos","ptpDoS","sptpDoS","DDoS" and "rflat"
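
The prefix rules above can be sketched as a lookup function. Several prefixes overlap (for example "ptmpHTTP" also starts with "ptmp", and "ptpposcaUDP" also starts with "ptpposca"), so this sketch assumes that the longest matching prefix wins; that tie-breaking rule is an assumption and is not stated in the list above. The function name is hypothetical.

```python
# Prefix lists copied from the list above, keyed by plot class.
PLOT_CLASSES = {
    "Unknown": ["unk", "empty"],
    "Other": ["ttl_error", "hostout", "netout", "icmp_error"],
    "HTTP": ["alphflHTTP", "ptmpHTTP", "mptpHTTP", "ptmplaHTTP", "mptplaHTTP"],
    "Multi. points": ["ptmp", "mptp", "mptmp"],
    "Alpha flow": ["alphfl", "malphfl", "salphfl", "point_to_point", "heavy_hitter"],
    "IPv6 tunneling": ["ipv4gretun", "ipv46tun"],
    "Port scan": ["posca", "ptpposca"],
    "Network scan ICMP": ["ntscIC", "dntscIC"],
    "Network scan UDP": ["ntscUDP", "ptpposcaUDP"],
    "Network scan TCP": ["ntscACK", "ntscSYN", "sntscSYN", "ntscTCP",
                         "ntscnull", "ntscXmas", "ntscFIN", "dntscSYN"],
    "DoS": ["DoS", "distributed_dos", "ptpDoS", "sptpDoS", "DDoS", "rflat"],
}

# Flatten to (prefix, class) pairs, longest prefix first (assumed tie-break).
_PREFIXES = sorted(
    ((prefix, cls) for cls, prefixes in PLOT_CLASSES.items() for prefix in prefixes),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def plot_class(taxonomy_label):
    """Return the plot class for a taxonomy label, or None if no prefix matches."""
    for prefix, cls in _PREFIXES:
        if taxonomy_label.startswith(prefix):
            return cls
    return None
```

Matching is case-sensitive because the listed prefixes mix cases (for example "DoS" and "distributed_dos").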

Files format

XML Schema (admd)

For each traffic trace of the MAWI archive the traffic annotation is provided in the form of an admd file. admd is a meta-data format and associated tools for the analysis of pcap data. More information on this format is available on this website: http://admd.sourceforge.net/

Here is a brief explanation of the structure of the xml files:

<admd:annotation>
  <algorithm>
      "MAWILab logging information"
  </algorithm>

  <analysis>
      "Analyst description" 
  </analysis>

  <dataset>
      "Link to the analyzed dataset"
  </dataset>

  <anomaly type="T" value="Dn,Da,C0,V,C1"> (see explanation below)
     <description>
        "Structure of the community reporting the anomaly (in dot language)"
     </description>

    <slice>
        <filter "Traffic features describing the anomaly: 
			destination IP 
			and/or source IP
			and/or destination port
			and/or source port">
     </slice>
     <from "timestamp of the start of the anomaly">
     <to "timestamp of the end of the anomaly">
  </anomaly>
</admd:annotation>

The type and value of the anomaly tag provide more details about the reported traffic:

T is the MAWILab label assigned to the anomaly; it is one of anomalous, suspicious, or notice.
Dn and Da are the distances from the anomaly to reference points in the reduced SCANN space. Dn is the distance to the reference point representing normal traffic, while Da is the distance to the reference point representing anomalous traffic.
C0 is the code assigned to the anomaly by the simple heuristic based on port numbers, TCP flags and ICMP codes.
V shows which detector, with which parameter tuning, found the anomaly. It is a vector of binary values: 0 means the detector did not report the traffic, whereas 1 means that the detector reported an alarm for the anomaly. There are four detectors (Hough, Gamma, KL, PCA), each used with three different parameter tunings (sensitive, optimal, conservative).
The first value of the vector corresponds to sensitive Hough, the second to optimal Hough, the third to conservative Hough, the fourth to sensitive Gamma, and so on. The order is Hough(sensitive,optimal,conservative), Gamma(sensitive,optimal,conservative), KL(sensitive,optimal,conservative), PCA(sensitive,optimal,conservative).
C1 is the category assigned to the anomaly using the taxonomy for backbone traffic anomalies.
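
The ordering of the detector vector V can be made explicit in code. The sketch below assumes that the 12 binary values have already been extracted from the value attribute as a sequence of 0/1 entries; how V is serialized inside the attribute is not specified here, so that extraction step is left out. The function name is hypothetical.

```python
DETECTORS = ("Hough", "Gamma", "KL", "PCA")
TUNINGS = ("sensitive", "optimal", "conservative")


def decode_detector_vector(bits):
    """Map 12 binary values to {(detector, tuning): reported?}.

    `bits` is any sequence of 12 entries convertible to 0 or 1, in the
    order Hough, Gamma, KL, PCA, each with sensitive, optimal, conservative.
    """
    values = [int(b) for b in bits]
    if len(values) != len(DETECTORS) * len(TUNINGS):
        raise ValueError("expected 12 values, got %d" % len(values))
    if any(v not in (0, 1) for v in values):
        raise ValueError("values must be 0 or 1")
    configurations = [(d, t) for d in DETECTORS for t in TUNINGS]
    return {config: bool(v) for config, v in zip(configurations, values)}
```

Counting the entries set to True gives the number of configurations (detector and parameter tuning) that reported the anomaly.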

CSV format

Since MAWILab v1.1, anomalies are also reported in CSV format. Each line in a CSV file consists of a 4-tuple describing the traffic characteristics (similar to filters in the admd format) and additional information such as the heuristic and taxonomy classification results. The actual order of the fields is given by the CSV file header:

anomalyID, srcIP, srcPort, dstIP, dstPort, taxonomy, heuristic, distance, nbDetectors, label

anomalyID is a unique anomaly identifier. Several lines in the CSV file can describe different sets of packets that belong to the same anomaly; the anomalyID field makes it possible to identify the lines that refer to the same anomaly.
srcIP is the source IP address of the identified anomalous traffic (optional).
srcPort is the source port of the identified anomalous traffic (optional).
dstIP is the destination IP address of the identified anomalous traffic (optional).
dstPort is the destination port of the identified anomalous traffic (optional).
taxonomy is the category assigned to the anomaly using the taxonomy for backbone traffic anomalies.
heuristic is the code assigned to the anomaly by the simple heuristic based on port numbers, TCP flags and ICMP codes.
distance is the difference Dn-Da, see XML Schema (admd).
nbDetectors is the number of configurations (detector and parameter tuning) that reported the anomaly.
label is the MAWILab label assigned to the anomaly; it is one of anomalous, suspicious, or notice.
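
Because several CSV lines can belong to one anomaly, a typical first step is to group lines by anomalyID. The sketch below uses only the Python standard library and the header shown above. The sample rows are invented for illustration (documentation IP addresses, made-up values), and representing an empty optional field as None is a choice made for this sketch.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; only the header line comes from this page.
SAMPLE_CSV = (
    "anomalyID, srcIP, srcPort, dstIP, dstPort, taxonomy, heuristic, distance, nbDetectors, label\n"
    "1, 192.0.2.10, , , , ntscSYN, 10, 1.5, 6, anomalous\n"
    "1, , , 198.51.100.7, 80, ntscSYN, 10, 1.5, 6, anomalous\n"
    "2, 192.0.2.99, , 203.0.113.5, 22, unk, 502, -0.2, 2, notice\n"
)


def group_by_anomaly(csv_file):
    """Return {anomalyID: [row dict, ...]} from an open CSV file object."""
    reader = csv.DictReader(csv_file, skipinitialspace=True)
    anomalies = defaultdict(list)
    for row in reader:
        # Strip whitespace; empty optional fields become None.
        clean = {
            key.strip(): (value.strip() or None) if value else None
            for key, value in row.items()
            if key is not None
        }
        anomalies[clean["anomalyID"]].append(clean)
    return dict(anomalies)


groups = group_by_anomaly(io.StringIO(SAMPLE_CSV))
```

With a real file, the io.StringIO object would be replaced by a file opened with open(path, newline="").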

How to read admd files?

C/C++

MAWILab is a collection of xml files describing the anomalies found in each trace of the MAWI archive. These xml files follow the admd XML Schema (admd.xsd), so they are compatible with the C/C++ tools available on the admd website: http://admd.sourceforge.net/.

Python

An easier way to manipulate the xml files from MAWILab is to use the Python API available at http://www.fukuda-lab.org/mawilab/tools/admd.py. This API loads an xml file (in the admd format) into a Python object and exports such an object to an xml file.

The script http://www.fukuda-lab.org/mawilab/tools/listAno.py is an example that uses this API. It reads an xml file and outputs the timestamps, IP addresses and port numbers corresponding to each anomaly identified in the file. The script should be placed in the same directory as the API and is executed as follows:

% wget http://www.fukuda-lab.org/mawilab/v1.0/2003/09/13/200309131400_anomalous_suspicious.xml
% python2 listAno.py 200309131400_anomalous_suspicious.xml
200309131400_anomalous_suspicious.xml
Anomaly 0, from 1063429217 to 1063430101
         145.154.79.170:None --> None:None
Anomaly 1, from 1063429201 to 1063430101
         200.35.112.72:None --> 212.35.181.89:None
         None:None --> 212.35.181.89:80
         None:None --> 212.35.181.89:None
Anomaly 2, from 1063429201 to 1063430101
         194.63.172.173:None --> 212.7.225.28:119
Anomaly 3, from 1063429411 to 1063430063
         200.82.177.214:None --> None:None
...

This example, along with the admd XML Schema (admd.xsd), is a good starting point for experimenting with MAWILab. Note that this library is not compatible with Python 3.0 and higher.
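
For readers on Python 3, the xml files can also be read with the standard library alone. The following is a minimal sketch and not a replacement for admd.py: it relies only on the anomaly element and its type and value attributes described in the XML Schema section, and it returns the attributes of each filter element as-is without interpreting them. The sample document, its namespace URI and the src_ip attribute name are placeholders invented for illustration; check admd.xsd for the real names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; namespace URI and filter attribute are placeholders.
SAMPLE_XML = """<admd:annotation xmlns:admd="urn:example:admd">
  <anomaly type="anomalous" value="Dn,Da,C0,V,C1">
    <slice>
      <filter src_ip="192.0.2.10"/>
    </slice>
  </anomaly>
</admd:annotation>"""


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def list_anomalies(xml_text):
    """Return one dict per anomaly element: type, value and raw filter attributes."""
    root = ET.fromstring(xml_text)
    anomalies = []
    for elem in root.iter():
        if local_name(elem.tag) != "anomaly":
            continue
        filters = [
            dict(child.attrib)
            for child in elem.iter()
            if local_name(child.tag) == "filter"
        ]
        anomalies.append({
            "type": elem.get("type"),
            "value": elem.get("value"),
            "filters": filters,
        })
    return anomalies


anomalies = list_anomalies(SAMPLE_XML)
```

Comparing local tag names rather than fully qualified ones keeps the sketch independent of whether the child elements carry the admd namespace.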

Under the hood

The annotations collected in MAWILab result from the combination of several anomaly detectors. The detectors are combined with a method based on singular value decomposition and graph theory. The detailed description of this combination method is available in the following publication: "MAWILab: Combining diverse anomaly detectors for automated anomaly labeling and performance benchmarking", R. Fontugne, P. Borgnat, P. Abry, K. Fukuda, in CoNEXT 2010.
