Overview - MSc Student Internship Position - Analyse, visualise and export HTTP Archives (HAR)
Lookyloo started as a side project aiming to help internal teams in media organisations to have an overview of the content loaded on their websites. It is now used on a daily basis by CIRCL in order to analyse phishing and other malicious websites in the context of incident response.
In its current state, the system makes it relatively simple for an analyst to understand what is going on for a simple website, but it is a lot more complex for big websites, which often load massive amounts of content from a vast array of 3rd party services.
The two core goals of this internship are the following:
- Automate the analysis of complex websites
- Generate a relevant output that can be shared via MISP Threat Sharing
Automate the analysis of complex websites
Many big websites load over 1000 resources from hundreds of different domains in a few seconds when the page is opened for the first time. At this point, there is no simple method to understand precisely what these resources are, or whether a request is initiated by the website we’re browsing or by a 3rd party service. Understanding whether a specific request is related to user tracking would be a starting point, but it will also be important to detect malicious content served to users (cryptomining, exploits, or plain old malware).
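As a rough illustration of the first-party versus 3rd party question, the sketch below splits the requests of a HAR capture by comparing each request's host with the host of the visited page. The function names and the sample data are hypothetical, and the "last two labels" rule is a deliberate simplification (it is wrong for suffixes such as co.uk; a real implementation would rely on the Public Suffix List).

```python
from urllib.parse import urlsplit


def hostname(url: str) -> str:
    """Return the lowercased host of a URL, or an empty string if absent."""
    return (urlsplit(url).hostname or "").lower()


def naive_site(host: str) -> str:
    # Assumption: the last two labels identify the site.
    # This is NOT correct for multi-label suffixes such as "co.uk".
    return ".".join(host.split(".")[-2:])


def split_by_party(har: dict, page_url: str) -> dict:
    """Group the request URLs of a HAR capture into first and third party."""
    page_site = naive_site(hostname(page_url))
    result = {"first_party": [], "third_party": []}
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        same_site = naive_site(hostname(url)) == page_site
        result["first_party" if same_site else "third_party"].append(url)
    return result


# Hypothetical, heavily reduced HAR content (only the fields used above).
sample_har = {"log": {"entries": [
    {"request": {"url": "https://www.example.com/index.html"}},
    {"request": {"url": "https://static.example.com/app.js"}},
    {"request": {"url": "https://tracker.example.net/pixel.gif"}},
]}}

parties = split_by_party(sample_har, "https://www.example.com/")
```

Host comparison alone does not say whether a request is related to tracking or is malicious; it only narrows down which requests deserve a closer look.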
A strong focus will be put on reproducibility and on the comparison of results across sequential loads of the same website, especially from different sources and with different user agents.
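One simple way to compare two loads of the same website is to diff the sets of hosts contacted in each capture. The sketch below assumes two HAR captures already loaded as dictionaries; the function names and sample data are hypothetical.

```python
from urllib.parse import urlsplit


def requested_hosts(har: dict) -> set:
    """Collect the set of hosts contacted during one capture."""
    hosts = set()
    for entry in har["log"]["entries"]:
        host = urlsplit(entry["request"]["url"]).hostname
        if host:
            hosts.add(host)
    return hosts


def compare_captures(har_a: dict, har_b: dict) -> dict:
    """Report hosts seen in both captures and hosts unique to each one."""
    a, b = requested_hosts(har_a), requested_hosts(har_b)
    return {
        "common": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Hypothetical captures of the same page with two different user agents.
desktop = {"log": {"entries": [
    {"request": {"url": "https://www.example.com/"}},
    {"request": {"url": "https://cdn.example.net/lib.js"}},
]}}
mobile = {"log": {"entries": [
    {"request": {"url": "https://www.example.com/"}},
    {"request": {"url": "https://ads.example.org/banner.js"}},
]}}

diff = compare_captures(desktop, mobile)
```

Hosts that appear only for one source or one user agent are natural candidates for further investigation.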
The same research will be carried out on onion (Tor) websites, in order to detect potential attacks against users.
Share on MISP Threat Sharing
In a second phase, the findings need to be normalized and exported in MISP format so that they can be shared with partners to help them in their own analysis, and to allow correlation on the platform. This last part will be a standalone library that can be integrated in Lookyloo and in AIL Framework.
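To give an idea of what such an export could look like, the sketch below turns a list of URLs into a dictionary that loosely follows the MISP event JSON layout (an event with an info field and a list of attributes). This is a simplified approximation using only the standard library: the function name and sample data are hypothetical, a real export would more likely go through the PyMISP library, and the field names should be checked against the MISP core format.

```python
import json


def build_event(info: str, urls: list) -> dict:
    """Build a simplified, MISP-like event from a list of URLs."""
    attributes = [
        {
            "type": "url",
            "category": "Network activity",
            "value": url,
            # Assumption: findings are shared for context, not for detection.
            "to_ids": False,
        }
        for url in sorted(set(urls))  # deduplicate and keep a stable order
    ]
    return {"Event": {"info": info, "Attribute": attributes}}


event = build_event(
    "Lookyloo capture of https://www.example.com/",
    [
        "https://tracker.example.net/pixel.gif",
        "https://ads.example.org/banner.js",
        "https://tracker.example.net/pixel.gif",
    ],
)
exported = json.dumps(event, indent=2)
```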
Current status of the project
Lookyloo connects together a few tools in a consistent manner:
- Scrapy, a webcrawling framework (Python).
- Splash, a webservice used for rendering the website and generating the HTTP Archive (HAR) file (runs in a Docker container).
- ETE Toolkit, a Python framework for the analysis and visualization of (phylogenetic) trees (Python).
- d3JS, for the visualisation of the tree in the browser (JavaScript).
- ScrapySplashWrapper, a simplistic library relying on Scrapy to filter out the resources to open on the website to investigate. It then queries Splash, and formats and returns the data generated by it (Python 3.6+).
- har2tree, a library that generates an ETE Toolkit tree from the HAR file, and other data returned by Splash (Python 3.6+).
- Lookyloo glues all the parts together (Python 3.6+, JavaScript, CSS, HTML). Note that the webserver used is Flask.
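The idea of turning a HAR file into a tree can be sketched with the Referer request header alone: each request is attached to the URL that referred it. This is only an illustration with hypothetical function names and sample data, not the har2tree API, and har2tree itself relies on more than this single header.

```python
from collections import defaultdict


def referer_of(entry: dict):
    """Return the Referer header of a HAR entry, or None if it is missing."""
    for header in entry["request"]["headers"]:
        if header["name"].lower() == "referer":
            return header["value"]
    return None


def build_children_map(har: dict) -> dict:
    """Map each referring URL (None for root requests) to the URLs it loaded."""
    children = defaultdict(list)
    for entry in har["log"]["entries"]:
        children[referer_of(entry)].append(entry["request"]["url"])
    return dict(children)


def render_tree(children: dict, parent=None, depth=0, seen=None) -> list:
    """Return the tree as indented lines; 'seen' guards against referer loops."""
    seen = set() if seen is None else seen
    lines = []
    for url in children.get(parent, []):
        lines.append("  " * depth + url)
        if url not in seen:
            seen.add(url)
            lines.extend(render_tree(children, url, depth + 1, seen))
    return lines


def _entry(url: str, referer=None) -> dict:
    # Helper building a minimal HAR entry for the hypothetical sample below.
    headers = [{"name": "Referer", "value": referer}] if referer else []
    return {"request": {"url": url, "headers": headers}}


page = "https://www.example.com/"
css = "https://www.example.com/style.css"
sample_har = {"log": {"entries": [
    _entry(page),
    _entry(css, referer=page),
    _entry("https://fonts.example.net/font.woff2", referer=css),
    _entry("https://cdn.example.net/app.js", referer=page),
]}}

tree_lines = render_tree(build_children_map(sample_har))
```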
The current code is stable but needs a lot of improvements in order to support the required features.
Your task is to understand the code and interfaces to other services and bring the code to the next level.
Your work will be part of the daily activities of CIRCL and will be used by countless people doing lookups against our web service.
If this is a challenge you would like to accept, talk to us!
Qualification
- Must be an EU citizen with a valid work permit in Luxembourg
- Must be eligible for an MSc student internship in the field of information security and/or computer science
- Must have a high level of ethics due to the nature of the work
- Must be fluent in English, Unix, git, and Python. JavaScript and web development in general would be a plus.
- Contributions performed under this MSc internship will be released as free software
How to apply
The application package must include the following:
- A resume in ASCII text format
- A motivation letter explaining why you are interested in the internship
The package is to be sent to info(@)circl.lu indicating reference internship-lookyloo-01.
Application deadline
The deadline for the application is the 15th of March 2020. Applications received after the deadline will not be considered.
Classification of this document
TLP:WHITE information may be distributed without restriction, subject to copyright controls.