Phishalytics is a measurement infrastructure I built to research phishing and malware attacks on Twitter during my PhD at Royal Holloway, University of London. It is written in Python and stores data in a relational database (such as MySQL or PostgreSQL). See the design architecture page for details. Research outputs from Phishalytics have been published internationally and have won multiple awards (see publications).
The codebase for Phishalytics is available on GitHub (see code). My PhD thesis is summarised on the thesis page. Key research projects that used Phishalytics are described on the research projects page. My thesis and the research papers relating to Phishalytics can be accessed from the publications page.
Phishalytics is designed to perform the following core functions:
Users interact with Phishalytics over an SSH connection in a terminal window. The server-side interface uses GNU Screen. The screenshot below shows Phishalytics during one of our measurement studies. The layout consists of 18 windows: 16 small and 2 large. The two large windows display a development area and the system monitor (the htop command, showing CPU and RAM usage, top processes, etc.).
The 16 smaller windows in the screenshot, labelled s1 to s16, show the following:
s1: twitter_stream.py - Twitter filter stream (tweets containing URLs). Each character in this window represents the following:
s2: twitter_stream_sample.py - Twitter sample stream (same characters as above)
s3: update_gsb.py - Update our local copy of the Google Safe Browsing (GSB) blacklist
s4: update_phishtank_and_openphish.py - Update our local copies of the PhishTank (PT) and OpenPhish (OP) blacklists
s5: twitter_gsb_lookup_fast.py - Fast Google Safe Browsing tweeted URL lookup system
s6: twitter_gsb_lookup.py - Comprehensive Google Safe Browsing tweeted URL lookup system
s7: twitter_op_pt_lookup.py - Comprehensive OpenPhish and PhishTank tweeted URL lookup system
s8: twitter_op_pt_lookup_fast.py - Fast OpenPhish and PhishTank tweeted URL lookup system
s9: lookup_gsb_timestamps.py - GSB timestamp lookup system
s10: twitter_search_api_lookup.py - Twitter search API lookup system
s11: trending_hashtags.py - Retrieve and save current trending hashtags from Twitter API
s12: post_twitter_collection_processing.py - Post-collection processing of Twitter data (computes metadata such as redirection chains, number of URL hops, landing page URL, Levenshtein distance, whether trending hashtags were used, etc.)
s13: compare_gsb_updates.py - Calculate, update, and compare GSB sizes
s14: Not used in the present study
s15: status_monitor.py - Check that everything is functioning correctly and that all feeds are live; send error notification emails to the admin
s16: trending_hashtags_london.py - Currently trending hashtags on Twitter for London
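To make the post-collection metadata listed under s12 more concrete, here is a minimal Python sketch of how the number of URL hops, the landing page URL, and a Levenshtein distance could be derived from an already resolved redirection chain. It is an illustration under stated assumptions, not the code in post_twitter_collection_processing.py: the function names levenshtein and summarise_chain, the list-of-URLs input format, and the choice to compare the tweeted URL against the landing page URL are all hypothetical.

```python
from typing import Dict, List, Union


def levenshtein(a: str, b: str) -> int:
    """Edit distance using the standard dynamic-programming recurrence."""
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def summarise_chain(chain: List[str]) -> Dict[str, Union[int, str]]:
    """Summarise a redirection chain.

    Assumption: `chain` is a hypothetical, already resolved list of URLs,
    with the tweeted URL first and the landing page URL last.
    """
    if not chain:
        raise ValueError("redirection chain must contain at least one URL")
    tweeted_url, landing_url = chain[0], chain[-1]
    return {
        "num_hops": len(chain) - 1,  # each redirect counts as one hop
        "landing_url": landing_url,
        # Assumption: distance is measured between tweeted and landing URLs.
        "levenshtein": levenshtein(tweeted_url, landing_url),
    }


if __name__ == "__main__":
    example_chain = [
        "http://a.example/x",
        "http://b.example/x",
        "http://b.example/y",
    ]
    print(summarise_chain(example_chain))
```

The sketch keeps only two rows of the distance table in memory, which matters when many long URLs are processed in a batch. Resolving the chain itself (following HTTP redirects) is deliberately left out.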