
Jordan Hook

Software Developer


Tutorial
Posted: 28/04/2019
malware  dotnet core  C#  

One of my more popular projects on GitHub recently is the MalwareFinder application, which was written in C# .NET. The purpose of this project was to scrape various sources around the internet for live malware samples. Now, you may be wondering: why would anyone want to find viruses and download them to their computer? Well, firstly, they wouldn't download them to their own computer (at least I hope...), and secondly, in order to develop malware prevention technology one must be able to find malware. It can be a very long and inefficient process to manually search the depths of the dark web in order to locate only a few samples. Anyone who has analyzed malware before and has gone looking for samples knows exactly what I am talking about. The MalwareFinder application helps expedite this process by automatically downloading samples from various sources that provide free samples. For example, the original scraper would search the following websites:

  • Malc0de (http://malc0de.com)
  • VX Vault (http://vxvault.net)
  • URLQuery (http://urlquery.net)

These websites are well known malware repositories where security researchers share their malicious findings with each other. 

Recently, I've begun using a MacBook Pro more frequently to complete some of my web development tasks and thus have not had time to run the malware scraper for my own personal projects. I can't run the .NET version of the scraper on the Mac without Wine, and I can't run it on a Linux virtual machine without compiling it with Mono. The solution I have come up with to bypass these restrictions is to redesign the scraper in .NET Core and to deploy it via a Docker image. This will allow me to collect the samples within a virtualized environment and run the scraper from any machine that can deploy Docker containers; for example, my Raspberry Pi. 

Creating a new project

To get started, we first need to create a new .NET Core project. To do this, I will be using the dotnet CLI to create a new Console Application.

Jordans-MacBook-Pro:dotnet jhook$ mkdir scraper
Jordans-MacBook-Pro:dotnet jhook$ cd scraper
Jordans-MacBook-Pro:scraper jhook$ dotnet new console
The template "Console Application" was created successfully.

Processing post-creation actions...
Running 'dotnet restore' on /Users/jhook/Desktop/dotnet/scraper/scraper.csproj...
  Restoring packages for /Users/jhook/Desktop/dotnet/scraper/scraper.csproj...
  Generating MSBuild file /Users/jhook/Desktop/dotnet/scraper/obj/scraper.csproj.nuget.g.props.
  Generating MSBuild file /Users/jhook/Desktop/dotnet/scraper/obj/scraper.csproj.nuget.g.targets.
  Restore completed in 343.86 ms for /Users/jhook/Desktop/dotnet/scraper/scraper.csproj.

Restore succeeded.

Jordans-MacBook-Pro:scraper jhook$ ls
Program.cs     obj            scraper.csproj
Jordans-MacBook-Pro:scraper jhook$ code .

Now that we have our newly created project, we can begin the development of our new application. To start, I will be creating a new file called Downloader.cs. This file will store most of our project's code, as the main purpose of this application is to simply download and store malicious samples. Similar to the original project, we will be multi-threading this application in order to complete multiple tasks at the same time. To get started, let's create some constructors for our Downloader implementation. 

using System;
using System.IO;

namespace scraper
{
    class Downloader
    {
        private int maxThreads;
        private string outputDirectory;

        // Safe defaults
        public Downloader() {
            this.maxThreads = 2;
            this.outputDirectory = Path.Combine(Path.GetTempPath(), "scraper");

            this.init();
        }

        public Downloader(string outputDirectory, int threads = 2) {
            this.maxThreads = threads;
            this.outputDirectory = outputDirectory;

            // Ensure the output directory exists for this constructor as well
            this.init();
        }

        private void init() {
            if(!Directory.Exists(outputDirectory)) {
                Directory.CreateDirectory(outputDirectory);
            }
        }
    }
}

Two constructors are available: one with safe defaults, and another which accepts parameters that determine how many threads our downloader will use and where the samples will be stored. 
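
For example, a quick sketch of how either constructor might be called (the output path below is just an illustration):

// Safe defaults: 2 threads, samples stored under the temp directory
Downloader defaultDownloader = new Downloader();

// Custom: 4 download threads, samples stored under /tmp/samples
Downloader customDownloader = new Downloader("/tmp/samples", 4);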

The next step for our scraper is to determine how we will be processing the malware repositories. For this, we will be making some modifications that were not in the original .NET implementation by creating parsing classes to collect all of our samples upfront before attempting to download them. To complete this task we will be creating a new parent class called the SampleParser.

using System.Text.RegularExpressions;
using System.Collections.Generic;

abstract class SampleParser {
    public abstract List<string> parseSampleURLs(string data);

    public string HTTP(string url)
    {
        // add http to the string if required 
        if (!url.StartsWith("http://") && !url.StartsWith("https://"))
        {
            url = "http://" + url;
        }

        return url;
    }

    public List<string> GetAllBetween(string text, string start, string end)
    {
        List<string> matches = new List<string>();
        // create regex expression 
        string regex = string.Format("{0}(.*?){1}", Regex.Escape(start), Regex.Escape(end));

        foreach(Match m in Regex.Matches(text, regex, RegexOptions.Singleline)) {
            matches.Add(m.Groups[1].Value);
        }

        // return matches as a list 
        return matches;
    }
}

This parent class will provide us with the following: 

  • public abstract List<string> parseSampleURLs(string data)
    • A function that will need to be implemented by each parser; this function will return a list of samples scraped from each website as URLs
    • data is all of the data scraped from the repository page (this could be in any format)
  • public string HTTP(string url)
    • This function will return the given URL with HTTP formatting, prefixing http:// when no scheme is present
  • public List<string> GetAllBetween(string text, string start, string end)
    • This function will return the text found between two sets of strings. This is a useful function for reading URLs out of HTML and XML tags (see the sketch below)
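
To illustrate, here is a minimal sketch of a subclass exercising these helpers; the DemoParser class and the <li> markup are hypothetical and not part of the scraper:

class DemoParser : SampleParser {
    public override List<string> parseSampleURLs(string data) {
        List<string> results = new List<string>();

        // Pull everything found between <li> tags and normalize it with HTTP()
        foreach (string item in GetAllBetween(data, "<li>", "</li>")) {
            results.Add(HTTP(item.Trim()));
        }

        return results;
    }
}

For example, parseSampleURLs("<li>example.com/a.exe</li>") would return a list containing "http://example.com/a.exe".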

Now that we have our Parser base class, we will need an implementation for each malware repository. Below you will find an implementation for each one, depending on how we are able to extract the URLs from each source.

Malc0deParser
class Malc0deParser : SampleParser {
    public override List<string> parseSampleURLs(string data) {
        List<string> results = new List<string>();
        string[] split = data.Split('\n');
        foreach(string line in split)
        {
            // check the XML element for the url 
            if(line.StartsWith("URL:"))
            {
                // Extract the URL 
                var sample = line.Split(',')[0];
                sample = sample.Remove(0, sample.IndexOf(' ')).Trim(); 

                // validate length of url to make sure valid data is there 
                if(sample.Length > 6)
                {
                    // add http to the sample url if required 
                    sample = HTTP(sample);

                    results.Add(sample);
                }
            }
        }
        return results;
    }
}
VXVaultParser
class VXVaultParser : SampleParser {
    public override List<string> parseSampleURLs(string data) {
        List<string> results = new List<string>();
        string[] links = data.Split('\n');
        foreach (string link in links)
        {
            // validate data in string 
            if (link.Length > 6)
            {
                // add http if required 
                string sample = HTTP(link);

                // replace \r that may have downloaded as well from plain text file 
                sample = sample.Replace("\r", "");

                results.Add(sample);
            }
        }

        return results;
    }
}
URLQueryParser
class URLQueryParser : SampleParser {
    public override List<string> parseSampleURLs(string data) {
        List<string> results = new List<string>();

        // pull all links from html page 
        List<string> LinksFound = GetAllBetween(data, "<a", ">"); 

        // loop through each link 
        foreach(string link in LinksFound)
        {
            // sample links can be found in the title element 
            if(link.StartsWith("title="))
            {
                // get the element between the ' quotes (so the title value) 
                string sample = link.Split('\'')[1].Trim();

                // add http if required 
                sample = HTTP(sample); 

                results.Add(sample);
            }
        }

        return results;
    }
}

Now that we have our parsers, we can start developing our scraper function in the Downloader class. This function will be responsible for driving, or orchestrating, the order of operations that need to occur to download the malicious samples:

  1. Scrape target websites 
  2. Parse the scraped data and compile a list of samples
  3. Download individual samples to the output directory

using System.Collections.Generic;
using System.Net;
using System.Threading;

public void scrapeSamples() {
    // Declare our sources and parsers
    SampleParser[] parsers = { new Malc0deParser(), new VXVaultParser(), new URLQueryParser() };
    string[] sources = new string[] { "http://malc0de.com/rss", "http://vxvault.net/URL_List.php", "http://urlquery.net" };
    string sourceData = null;
    List<string> targetSamples = new List<string>();
    
    // Scrape all of the sources for samples to download
    using(WebClient wc = new WebClient()) {
        for(int i = 0; i < sources.Length; i++) {
            Console.WriteLine("Scraping for data from source : {0:0}", sources[i]);

            // RAW scrape from source
            sourceData = wc.DownloadString(sources[i]);

            // Parse raw data and append to our samples list
            targetSamples.AddRange(parsers[i].parseSampleURLs(sourceData));
        }
    }

    List<Thread> threadPool = new List<Thread>();

    // While there are samples to download OR samples currently downloading
    while(targetSamples.Count > 0 || threadPool.Count > 0) {

        // While there is room in the threadpool and more samples to download
        while(threadPool.Count < this.maxThreads && targetSamples.Count > 0) {
            var downloaderThread = new Thread(() => {
                // Code executed here will be running in a new thread
                string sample; 

                lock(targetSamples) {
                    // Ensure there is samples to download still
                    if(targetSamples.Count == 0) { return; }

                    // Get the next sample URL
                    sample = targetSamples[0];
                    targetSamples.RemoveAt(0);
                }
                
                // Extract the file name from the URL and determine the save file path
                string[] sampleParts = sample.Split('/');
                string targetPath = Path.Combine(this.outputDirectory, sampleParts[sampleParts.Length - 1]);

                try {
                    // Attempt to download the file
                    Console.WriteLine("Downloading {0:0} to {1:0}", sample, targetPath);

                    using(WebClient wc = new WebClient()) {
                        wc.DownloadFile(sample, targetPath);
                    }

                    // TODO: 
                    // Additional Processing can occur here
                    // Such as renaming file to SHA-1 hash? 
                    // Sorting by file type (Executables, scripts, etc)

                } catch (Exception e) {
                    Console.WriteLine(e.Message);
                }
            });

            // Add thread to pool and start
            threadPool.Add(downloaderThread);
            downloaderThread.Start();
        }

        // Remove any threads that have finished executing from the pool
        threadPool.RemoveAll(t => t.ThreadState == ThreadState.Stopped);

        // Brief pause so the loop does not busy-wait while downloads are in progress
        Thread.Sleep(100);
    }
}
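
The TODO in the snippet above mentions renaming each file to its SHA-1 hash, which is a common way to deduplicate samples. As a minimal sketch of what that could look like (the RenameToSha1 helper is hypothetical and not part of the original project):

using System.Security.Cryptography;

private static string RenameToSha1(string filePath) {
    byte[] hash;

    // Compute the SHA-1 hash of the downloaded file
    using(var sha1 = SHA1.Create())
    using(var stream = File.OpenRead(filePath)) {
        hash = sha1.ComputeHash(stream);
    }

    // Convert the hash to lowercase hex for use as a file name
    string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();

    // Rename the sample to its hash, keeping it in the same directory
    string newPath = Path.Combine(Path.GetDirectoryName(filePath), hex);
    File.Move(filePath, newPath);

    return newPath;
}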

All that is left to do at this point is to create an instance of the Downloader and call the scraper function.

using System;

namespace scraper
{
    class Program
    {
        static void Main(string[] args)
        {
            Downloader downloader = null;

            // Fall back to safe defaults unless enough arguments were supplied
            if(args.Length <= 3) {
                downloader = new Downloader();
            } else {
                // Echo the arguments for debugging purposes
                foreach(var arg in args) {
                    Console.WriteLine(arg);
                }

                // args[2] is the output directory, args[3] is the thread count
                downloader = new Downloader(args[2], int.Parse(args[3]));
            }

            downloader.scrapeSamples();
        }
    }
}

Once compiled, you will be able to scrape for malware samples via a CLI application. A second part to this series will be coming shortly, in which I will go over how to automate this task and run the application via Docker. 
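
For example, running the scraper with its safe defaults looks like the following (output abbreviated; the source list comes from the code above):

Jordans-MacBook-Pro:scraper jhook$ dotnet run
Scraping for data from source: http://malc0de.com/rss
Scraping for data from source: http://vxvault.net/URL_List.php
Scraping for data from source: http://urlquery.net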

Full Source (GitHub)