CIRCL Images Phishing Dataset - Open Data at CIRCL

Introduction

CERTs such as CIRCL and security teams collect and process content such as images (at large from photos, screenshots of websites or screenshots of sandboxes). Datasets become larger - e.g. on average 10000 screenshots of onion domains websites are scrapped each day in AIL - Analysis Information Leak framework, an analysis tool of information leak - and analysts need to classify, search and correlate through all the images.

Automatic tools can help them in this task. Less research about image matching and image classification seems to have been conducted exclusively on websites screenshots. However, a classification of this kind of pictures needs to be addressed.

Goal

Image-matching algorithms benchmarks already exist and are highly informative, but none is delivered turnkey.

Our long-term objective is to build a generic library and services which can at least be easily integrated in Threat Intelligence tools such as AIL, and MISP - Malware Information Sharing Platform. A quick-lookup mechanism for correlation would be necessary and part of this library. This paper includes the release of two datasets to support research effort in this direction.

MISP is an open source software solution tool developed at CIRCL for collecting, storing, distributing and sharing cyber security indicators and threats about cyber security incidents analysis. AIL is also an open source modular framework developed at CIRCL to analyze potential information leaks from unstructured data sources or streams. It can be used, for example, for data leak prevention.

The dataset presented on this page is strongly associated with other projects, which are an evaluation framework provided as Carl-Hauser and the open-source library provided as Douglas-Quaid.

Problem Statement

Image correlation for security event correlation purposes is nowadays mainly manual. No open-source tool provides easy correlation on pictures, without regard to the technology used. Ideally, the extraction of links or correlation between these images could be fully automated. Even partial automation would reduce the burden of this task on security teams. Datasets are part of the foundation needed to construct such tool.

Our contribution about this problem is the provision of datasets to support research effort in this direction.

Dataset description

circl-phishing-dataset-01

This dataset is named circl-phishing-dataset-01 and is composed of phishing websites screenshots. Around 460 pictures are in this dataset to date.

Three files are provided along with the dataset :

  • a label-classification (DataTurks direct output)
  • a second label-classification (VisJS transformed output)
  • a graph-based classification (VisJS direct output).

Direct link : https://www.circl.lu/opendata/datasets/circl-phishing-dataset-01/

phishing
phishing dataset

Data sources

Different tools collected the dataset presented on this page. The screenshots’s data source is a subset of public security events of proven or potential phishing cases, having attached screenshots, from MISP and from URLAbuse

Processing on datasets

Each picture’s content of each dataset was hashed to “humanly readable” name to allow a unified and readable reference system for image’s naming convention. This had been performed with a slightly modified version of Codenamize - Consistent easier-to-remember codenames generator. The bytes-content of each file is hashed and mapped to a list of words, from a dictionary.

Collision were handled by keeping track of which name has already been generated, and temporary adding bytes to each colliding file. However collision were still rare. A human-readable hash of 3 adjectives (without a maximum number of characters) can generates up to 2 trillion combinations, which is far sufficient to handle even 40 000 pictures without common collision occurrence. Collision were however easily met in case of similar pictures (typically, all white or all dark pictures) but then, their name can be swapped without incidence on the meaning of the dataset.

This dataset is a folder of pictures as well as a reference JSON, containing a mapping from file names to MD5, SHA1, SHA256 of each picture. This allows an easy retrieval of which picture is which, in case picture names need to be modified.

We manually reviewed datasets, picture by picture. We used a private instance of Dataturks - OpenSource Data Annotation tool for teams to perform classification and review the datasets. We removed datasets pictures which were identified as containing personal information such as sensitive e-mail address clearly displayed on screenshoots, … We also manually removed pictures which were identified as containing harmful content, such as violent, offensive, obscene or equivalent undesirable pictures which may shock anyone.

We makes reasonable effort not to display anything in the dataset which may specifically identify an individual. This dataset is provided for research purposes. We stay available for any request. Please refer to contact information at the end of this page. Please note that each website behind each screenshot can be freely accessed by one with relevant means.

Potential Use

These datasets can be used to create classifiers, which then can be used to automate processes. Few examples of application :

  • Automatically classify phishing website : matching screenshots from phishing-like website to known legitimate websites;
  • Correlate websites screenshots to cluster phishing-like websites together, to keep track of domain-name changes for example (Lookyloo usecase).
  • Isolate and characterize outliers
  • Extracting statistics about crawled websites (per theme, per type, per content, per access allowance …)

Detailed information of the dataset

Labels are used to classify each picture in one or more cluster. Labels are different depending on the dataset and the tool used to classify it.

Phishing dataset

The phishing dataset was clustered twice :

  • once with Dataturks (Collaborative classic labeling tool)
  • once with VisJS-Classificator (Collaborative graph-based labeling tool).

Distincts labels were used in both tools, as labels were created at classification time.

These labels were used with Dataturks :

Microsoft, KBC, Advenzia, ErrorPage, WellsFargo, Office, Outlook, Excel, Counterstrike, Steam, other, Scam, uncomplete, TaggingServer, CUFBank, CIRCL, Email, AmericanExpress, AckDupekm, BancoInter, N26, HSBC, OneDrive, WeTransfer, DropBox, SystemPlaceHolder, EmirateNBD, Webmail, MultiLogo, PayPal, SPankki, DHL, News, FrImpots, Netflix, FiBank, Sparbanken, OrangeWebmail, Google, ForeignLanguage, MISP, rackspace, MeuEmpretimo, Windows, NationalBank, PirateBay, ADBC, ABSA, GOOX, Amazon, WashingtonTrustBank, OtherBank, gambling, CIBC, Apple, Post

Clustering file format for Dataturks tool is a list of filename along with their labels, to which they belong. Here follows technical overview of this file format:

 1
 2(...),
 3    {
 4        "picture": "tricky-sturdy-impossible-sweet.png",
 5        "labels": [
 6            "News"
 7        ]
 8    },
 9    {
10        "picture": "old-wet-evasive-influence.png",
11        "labels": [
12            "N26"
13        ]
14    }, 
15(...)
16

These labels were used with VisJS :

BancoInter, Advanzia, N26, WeTransfer, microsoft, KBC, multilogo, OrangeWebmail, Netflix, Steam, CounterStrike, Outlook, OneDrive, unloaded, TaggingServer, CirclWebServer, ackouperm, dhl, fibank, paypal, EmiratesNBD, google, SPankki, ForeignLanguage, Office, Windows, Oldcircl, rackspace, DropBox, CarreBlue, WellsFargo, Post, FrImpots, news, AmericanExpress, bitcoin, android, errorMessage, uncomplete, hsbc, formular, misc 

Clustering file format for VisJS tool is a mapping from cluster name to the list of pictures which belongs to this cluster. Here follows technical overview of this file format:

 1
 2(...)
 3    {
 4        "cluster": "BancoInter",
 5        "members": [
 6            "ambitious-curious-picayune-shock.png",
 7            "awful-nasty-heartbreaking-appointment.png",
 8            "charming-defeated-voracious-beat.png",
 9            "devilish-unwritten-fast-wake.png",
10            "greasy-heartbreaking-married-race.png",
11            "learned-selfish-elfin-growth.png",
12            "magnificent-demonic-faint-recommendation.png",
13            "majestic-labored-better-pop.png",
14            "muddled-nice-hanging-piano.png",
15            "naughty-numberless-wasteful-responsibility.png",
16            "robust-drunk-rampant-trouble.png",
17            "wrathful-lyrical-alcoholic-county.png"
18        ]
19    }, 
20(...)
21

A graph export of VisJS is provided too, which can directly be loaded in VisJS Classificator to get the same display as shown in following picture.

 1  (...) 
 2  "clusters": [
 3    {
 4      "id": "a5e1baa2-aead-4164-9205-63f26f656d6f",
 5      "image": "anchor.png",
 6      "label": "BancoInter",
 7      "shape": "image",
 8      "group": "anchor",
 9      "members": [
10        20,
11        (... more members ...)
12      ]
13    }, 
14    (... more clusters ...)
15  ],
16  "nodes": [
17    {
18      "id": 0,
19      "image": "abashed-careless-ordinary-crew.png",
20      "shape": "image"
21    }, 
22    (... more nodes ...),
23    {
24      "id": 456,
25      "image": "zonked-silent-snobbish-review.png",
26      "shape": "image"
27    }
28  ],
29  "edges": [
30    {
31      "to": "a5e1baa2-aead-4164-9205-63f26f656d6f",
32      "from": 20
33    }, 
34    (... more edges / links ...)
35  ]

Use example

Image matching

This section is a glimpse to what can be achieved with such dataset. Results presented in the following are only preliminary results of Carl-Hauser operated on the previously presented Phishing dataset.

Structures presents on pictures can be easily detected with fuzzy-hashes approaches. Content and theme can be matched with some confidence with fuzzy-hashes or feature point approaches.

False structural in phishing dataset Good structural in phishing dataset

Usage examples of Phishing dataset to train an Image-Matching tool. Most left-hand picture is request picture, right pictures are best matched pictures. Small differences are presents in similar pictures, which are a challenge and a common noise that algorithms have to deal with, in a phishing visual-based clustering context.

Future work

This lead to a list of future possible developments :

  • Extending provided dataset to support research effort
  • Improve classification provided

Please note that ground truth files provided with current dataset as well as dataset themselves may evolve and be updated.

Even partial automation of screenshots classification would reduce the burden on security teams, and that the data we provide is a step further in this direction.

Contact information

If you have a complaint related to the dataset or the processing over it, please contact us. We aim to be transparent, not only about how we process but also about rights that are linked to such information and processing.

You can contact us at circl.lu/contact/ for request about the dataset itself, regarding elements of the dataset, or extension requests. You can contact us at same address or on github for feedback about the benchmarking framework, methodology or relevant ideas/inquiries.

Cite

@Electronic{CIRCL-PhishingDS2019, author = {Vincent Falconieri}, month = {07}, year = {2019}, title = {CIRCL Images Phishing Dataset}, organization = {CIRCL}, address = {CIRCL - Computer Incident Response Center Luxembourg c/o "security made in Lëtzebuerg" (SMILE) g.i.e. 16, bd d'Avranches L-1160 Luxembourg Grand-Duchy of Luxembourg}, url = {https://www.circl.lu/opendata/circl-phishing-dataset-01/}, abstract = {This dataset is named circl-phishing-dataset-01 and is composed of phishing websites screenshots. Around 460 pictures are in this dataset to date.}, }

Revision

  • Version 1.0 - 2019-07-04 (initial release)