CIRCL Images AIL Dataset - Open Data at CIRCL

Introduction

CERTs such as CIRCL, as well as security teams, collect and process content such as images (mostly photos, screenshots of websites or screenshots of sandboxes). These datasets keep growing - on average, 10,000 screenshots of onion domain websites are scraped each day in AIL (Analysis Information Leak), an information-leak analysis framework - and analysts need to classify, search and correlate across all these images.

Automated tools can help them with this task. Comparatively little research on image matching and image classification seems to have been conducted specifically on website screenshots. However, the classification of this kind of picture still needs to be addressed.

Goal

Benchmarks of image-matching algorithms already exist and are highly informative, but none of them is delivered as a turnkey solution.

Our long-term objective is to build a generic library and services which can, at the very least, be easily integrated into Threat Intelligence tools such as AIL and MISP (Malware Information Sharing Platform). A quick-lookup mechanism for correlation would be a necessary part of this library. This page accompanies the release of datasets to support research efforts in this direction.

MISP is an open source software solution tool developed at CIRCL for collecting, storing, distributing and sharing cyber security indicators and threats about cyber security incidents analysis. AIL is also an open source modular framework developed at CIRCL to analyze potential information leaks from unstructured data sources or streams. It can be used, for example, for data leak prevention.

The dataset presented on this page is closely associated with two other projects: the evaluation framework Carl-Hauser and the open-source library Douglas-Quaid.

Problem Statement

Image correlation for security event correlation purposes is currently mostly manual. No open-source tool provides easy correlation of pictures, regardless of the underlying technology. Ideally, the extraction of links or correlations between these images could be fully automated. Even partial automation would reduce the burden of this task on security teams. Datasets are part of the foundation needed to build such a tool.

Our contribution to this problem is the provision of datasets to support research efforts in this direction.

Dataset description

circl-ail-dataset-01

This dataset is named circl-ail-dataset-01 and is composed of screenshots of onion websites scraped by AIL. It contains around 37,500 pictures to date.

Only one label classification (the direct DataTurks output) is provided along with the dataset. This classification is provided per part and will be improved and updated as further classification operations are completed.

Direct link:

  • Part 1: https://www.circl.lu/opendata/datasets/circl-ail-dataset-01/

[Figure: AIL]
[Figure: AIL dataset]

Data sources

The dataset presented on this page was collected with different tools. The screenshots' data source is a subset of onion domain websites scraped by AIL.

Processing on datasets

The content of each picture in each dataset was hashed to a human-readable name, providing a unified and readable naming convention for images. This was performed with a slightly modified version of Codenamize, a consistent, easier-to-remember codename generator. The byte content of each file is hashed and mapped to a list of words from a dictionary.

Collisions were handled by keeping track of the names already generated and temporarily appending bytes to each colliding file before re-hashing. Collisions were nonetheless rare: a human-readable hash of 3 adjectives (without a maximum number of characters) can generate up to 2 trillion combinations, which is more than sufficient to handle even 40,000 pictures without frequent collisions. Collisions were, however, easily met for near-identical pictures (typically all-white or all-dark pictures), but in that case their names can be swapped without any impact on the meaning of the dataset.
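The renaming script itself is not distributed with the dataset; the following minimal Python sketch only illustrates the principle. The word list, the choice of MD5 and the collision strategy below are simplified assumptions for illustration, not the actual Codenamize configuration:

import hashlib
import os

# Illustrative word list; Codenamize uses a much larger built-in dictionary.
ADJECTIVES = ["old", "wet", "sweet", "sturdy", "tricky", "evasive", "impossible", "quiet"]

def codename(data, words=3):
    """Map the bytes of a file to a deterministic, human-readable name."""
    digest = int(hashlib.md5(data).hexdigest(), 16)
    parts = []
    for _ in range(words):
        digest, index = divmod(digest, len(ADJECTIVES))
        parts.append(ADJECTIVES[index])
    return "-".join(parts)

def generate_names(folder):
    """Assign a '<codename>.png' name to every picture, resolving collisions
    by temporarily appending bytes to the content before re-hashing."""
    used, mapping = set(), {}
    for filename in sorted(os.listdir(folder)):
        with open(os.path.join(folder, filename), "rb") as f:
            content = f.read()
        name = codename(content)
        while name in used:          # collision: salt the content and retry
            content += b"\x00"
            name = codename(content)
        used.add(name)
        mapping[filename] = name + ".png"
    return mapping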

The dataset consists of a folder of pictures and a reference JSON file containing a mapping from file names to the MD5, SHA1 and SHA256 of each picture. This allows easy retrieval of which picture is which, in case picture names need to be modified.
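As a sketch of how such a reference file can be produced and queried (the JSON layout and function names below are illustrative assumptions, not the exact schema shipped with the dataset):

import hashlib
import json
import os

def build_reference(folder, out_path):
    """Write a JSON file mapping each picture name to its MD5/SHA1/SHA256."""
    reference = {}
    for filename in sorted(os.listdir(folder)):
        with open(os.path.join(folder, filename), "rb") as f:
            content = f.read()
        reference[filename] = {
            "md5": hashlib.md5(content).hexdigest(),
            "sha1": hashlib.sha1(content).hexdigest(),
            "sha256": hashlib.sha256(content).hexdigest(),
        }
    with open(out_path, "w") as f:
        json.dump(reference, f, indent=4)

def find_by_sha256(reference_path, sha256):
    """Retrieve the current file name of a picture from its SHA256."""
    with open(reference_path) as f:
        reference = json.load(f)
    for filename, hashes in reference.items():
        if hashes["sha256"] == sha256:
            return filename
    return None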

We manually reviewed the datasets, picture by picture, using a private instance of DataTurks, an open-source data annotation tool for teams, to perform the classification and the review. We removed pictures identified as containing personal information, such as sensitive e-mail addresses clearly displayed on screenshots, etc. We also manually removed pictures identified as containing harmful content, such as violent, offensive, obscene or otherwise undesirable pictures which may shock viewers.

We make reasonable efforts not to include anything in the dataset which may specifically identify an individual. This dataset is provided for research purposes. We remain available for any request; please refer to the contact information at the end of this page. Please note that each website behind each screenshot can be freely accessed by anyone with the relevant means.

Potential Use

These datasets can be used to create classifiers, which can then be used to automate processes. A few example applications (a minimal classifier sketch follows the list):

  • Automatically classify onion websites;
  • Correlate objects on pictures from crawled websites, mainly screenshots of hidden services (AIL use case);
  • Correlate website screenshots to cluster websites with a common topic together, for example to keep track of domain-name changes (Lookyloo use case);
  • Isolate and characterize outliers;
  • Extract statistics about crawled websites (per theme, per type, per content, per access allowance, …).
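As an illustration of the first use case, here is a minimal sketch of a screenshot classifier trained from the labels. The file name labels.json, the naive pixel features and the model choice are placeholder assumptions for illustration, not a recommended pipeline:

import json
import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def features(path, size=(64, 64)):
    """Very naive features: a downscaled grayscale pixel vector."""
    img = Image.open(path).convert("L").resize(size)
    return np.asarray(img, dtype=np.float32).ravel() / 255.0

# Hypothetical labels.json: [{"picture": "...", "labels": ["dark-web:topic=..."]}, ...]
with open("labels.json") as f:
    annotations = json.load(f)

X = np.stack([features("circl-ail-dataset-01/" + a["picture"]) for a in annotations])
y = [a["labels"][0] for a in annotations]   # keep only the first label for simplicity

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

In practice, features extracted from a pretrained convolutional network would perform far better than raw pixels, but the overall workflow stays the same.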

Detailed information about the dataset

Labels are used to classify each picture into one or more clusters. Labels differ depending on the dataset and the tool used to classify it.

AIL dataset

Pictures are labeled following the MISP "dark-web" taxonomy, part of the taxonomies used by MISP and other threat information sharing tools.

Labels are expressed as triplets 'namespace:predicate=value'. For example, 'dark-web:topic="hacking"' is one label of this taxonomy. Two labels were added, "error_page" and "other", which do not specifically belong to the dark-web taxonomy but cover any other case met in the dataset.

For the complete list of labels used, please see the following:

dark-web:topic="drugs-narcotics",
dark-web:topic="extremism",
dark-web:topic="finance",
dark-web:topic="cash-in",
dark-web:topic="cash-out",
dark-web:topic="hacking",
dark-web:topic="identification-credentials",
dark-web:topic="intellectual-property-copyright-materials",
dark-web:topic="pornography-adult",
dark-web:topic="pornography-child-exploitation",
dark-web:topic="pornography-illicit-or-illegal",
dark-web:topic="search-engine-index",
dark-web:topic="unclear",
dark-web:topic="violence",
dark-web:topic="weapons",
dark-web:topic="credit-card",
dark-web:topic="counteir-feit-materials",
dark-web:topic="gambling",
dark-web:topic="library",
dark-web:topic="other-not-illegal",
dark-web:topic="legitimate",
dark-web:topic="chat",
dark-web:topic="mixer",
dark-web:topic="mystery-box",
dark-web:topic="anonymizer",
dark-web:topic="vpn-provider",
dark-web:topic="email-provider",
dark-web:topic="escrow",
dark-web:topic="softwares",
dark-web:motivation="education-training",
dark-web:motivation="file-sharing",
dark-web:motivation="forum",
dark-web:motivation="wiki",
dark-web:motivation="hosting",
dark-web:motivation="general",
dark-web:motivation="information-sharing-reportage",
dark-web:motivation="marketplace-for-sale",
dark-web:motivation="recruitment-advocacy",
dark-web:motivation="system-placeholder",
dark-web:motivation="conspirationist",
dark-web:motivation="scam",
dark-web:motivation="hate-speech",
dark-web:motivation="religious",
dark-web:structure="incomplete",
dark-web:structure="captcha",
dark-web:structure="LoginForms",
dark-web:structure="police-notice",
dark-web:structure="test",
dark-web:structure="legal-statement",
error_page,
other,
The clustering file format for the DataTurks tool is a list of file names along with the labels to which they belong. Here follows a technical overview of this file format:

(...),
    {
        "picture": "tricky-sturdy-impossible-sweet.png",
        "labels": [
            'dark-web:structure="legal-statement"'
        ]
    },
    {
        "picture": "old-wet-evasive-influence.png",
        "labels": [
            'dark-web:motivation="scam"'
        ]
    },
(...)
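Assuming the clustering file is a JSON array of such objects (key names taken from the excerpt above), it can be loaded and inverted into per-label clusters along the following lines; this is a sketch, not official DataTurks or AIL tooling:

import json
from collections import defaultdict

# Hypothetical path; the excerpt above suggests a list of {"picture": ..., "labels": [...]} objects.
with open("circl-ail-dataset-01-labels.json") as f:
    entries = json.load(f)

clusters = defaultdict(list)
for entry in entries:
    for label in entry["labels"]:
        clusters[label].append(entry["picture"])

for label, pictures in sorted(clusters.items()):
    print(f"{label}: {len(pictures)} pictures")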

Future work

This leads to a list of possible future developments:

  • Extending the provided datasets to support research efforts;
  • Improving the provided classification;
  • Adding images extracted from the DOM; these pictures allow more specific matching.

Please note that the ground truth files provided with the current dataset, as well as the dataset itself, may evolve and be updated.

Even partial automation of screenshot classification would reduce the burden on security teams, and the data we provide is a step in this direction.

Contact information

If you have a complaint related to the dataset or the processing of it, please contact us. We aim to be transparent, not only about how we process the data but also about the rights linked to such information and processing.

You can contact us at circl.lu/contact/ for requests about the dataset itself, regarding elements of the dataset, or for extension requests. You can contact us at the same address or on GitHub for feedback about the benchmarking framework, the methodology, or relevant ideas and inquiries.

Cite

@Electronic{CIRCL-AILDS2019,
  author       = {Vincent Falconieri},
  month        = {07},
  year         = {2019},
  title        = {CIRCL Images AIL Dataset},
  organization = {CIRCL},
  address      = {CIRCL - Computer Incident Response Center Luxembourg, c/o "security made in Lëtzebuerg" (SMILE) g.i.e., 16, bd d'Avranches, L-1160 Luxembourg, Grand-Duchy of Luxembourg},
  url          = {https://www.circl.lu/opendata/circl-ail-dataset-01/},
  abstract     = {This dataset is named circl-ail-dataset-01 and is composed of Tor hidden services websites screenshots. Around 37000+ pictures are in this dataset to date.},
}

Revision

  • Version 1.0 - 2019-07-10 (initial release)