profile-pic

Jordan Hook

Software Developer


Tutorial
Posted: 09/03/2020
C#  malware  analysis  pattern scanning  hashing  

Before reading this article I strongly recommend taking a quick peak at my previous tutorials with regards to scraping for malware (links below). They will help get you started on your malware analysis journey by providing you with a free and simple tool to begin collecting samples to analyze. Please remember to be safe whilst following this tutorial series as you are provided with the ability to collect and analyze very real and very dangerous malicious payloads. If you have any questions about staying safe please feel to contact me for some pointers!

Introduction

Detecting malware is a continuously growing issue in today's world of electronics. What we see as tools to help us and to simplify our everyday lives, hackers see as a new possible vector they can exploit. IOT (Internet of things) devices such as smart TVs, thermostats, cell phones are all on the table when it comes to cyber-security. Whether it be by accident or on purpose people are always finding new ways to modify or break the security of our existing systems. For example, think of a game in which you may have played in the past where you had to learn a set of repeatable actions in order to complete a task or defeat an enemy. Whether you failed and tried again, or were just grinding out some repeatable tasks in the game you may begin to notice patterns. E.g. moving your character a specific way at a specific time in order to avoid an interaction with an enemy. Believe it or not, it's the same for hackers looking to break into your systems. Repeating the same tasks, programs, scripts, etc, and adding small variations each time until they finally find a way to break through and accomplish their goal. Because of this, detecting malware has become an increasingly difficult task to complete and we are almost always at least one step behind the attackers.

History Lesson

Detecting malware wasn't always so complicated. In the beginning some of the very first viruses were actually quite simple to detect. Running the exact same way, performing the exact same actions each time. We were able to detect and remove them simply by checking hard coded file paths.

For example:

C:\windows\virus.exe

However, as time progressed, so did hackers and their methods of infection. Through the use of configurations and randomization, hackers were able to have their malicious payloads install in a variety of different places on your system. Some of the infected files even tried to use your own operating system against you by locking or hiding the malware in protected folders. In order to combat these changes many anti-malware / anti-virus solutions began implementing a form of malware detection that is still used up until this day in some systems. This method is called hashing and refers to the method in which you calculate a unique sequence based on the contents of a file. This allowed us to quickly check every single file on your system and compare it against a database of known malicious hashes. This meant that no matter where the file was stored or what folder it hid in, the engine was able to detect the malware as long as the hash computed matched a hash in the database.

For example, let pretend we have a malicious file and the contents are as follows:

{ MALICIOUS PAYLOAD EXAMPLE }

We could hash this data using the MD5 algorithm to create the following signature:

d06322da8a92f57b06b18e4cd6cc8637

For a time, this worked. However, it wasn't long before hackers were able to figure out a way to avoid this method of detection. By design, when it comes to detecting malware, hashing has one fatal flaw: The unique fixed length sequences calculated by the algorithms whether it be MD5, SHA256, etc. can be easily changed by changing the contents of the file. For example, lets modify our payload example so that it says the following:

{ MALICIOUS PAYLOAD EXAMPLe }

The new MD5 hash required to detect it would be:

55c60f2f7a2878350cdf104691448b4f

Notice how the hash completely changed? This meant that hackers could simply edit their malicious payload such as a script or executable, changing only one single byte in the file and it would become undetectable by the known hash database. The band-aid styled fixed some engines tried to employ was hashing specific sections of the file in order to ignore what may have been changed by the hacker however, it was only a matter of time before hackers were able to figure out exactly how to change their file in order to avoid detection. Multi-engine malware scanners such as Virus Total, or even just hackers installing multiple scanners on a testing system allowed them to quickly test their files in order to avoid detection.

The Problem

Please note before reading anything further that I am not providing you with a tutorial on how to write your own anti-malware or anti-virus system nor do i recommend taking on such a large project alone. I am simply providing you with introductory knowledge that I have acquired overtime in hopes that you will be able to use it to better your understanding of malware, computer algorithms and data structures.

Since changing byte sequences had a catastrophic effect on our ability to use hashing algorithms to detect malware, we had to come up with a better way of detecting malware. The way we will be talking about in this article is pattern scanning or better known as traditional signature detection; a method that is still actively used to this day by many mainstream anti-malware engines. The idea behind pattern scanning is to utilize searching algorithms or pattern matching algorithms to find sequences of bytes or text within a set of data. If the data-set contains said sequence it can then be flagged as malicious. Many algorithms can be used to accomplish this task ranging from a simple linear search to a search tree data structure. For the purpose of this introductory article I will be using the linear search pattern.

Note, I am planning on providing articles in the future on additional pattern scanning algorithms such as boyer-moore, aho-corasick and how they perform.

The idea behind the linear search algorithm to start at the beginning of your data-set and check every element for a matching element in the desired pattern you are looking for. Lets use the following data-set as an example:

[1, 3, 5, 8, 9, 4, 1, 6, 6, 4, 0]

There are eleven elements in the set above. A linear search algorithm could be used to search or a sequence of 1 or more of the elements in a specific order. For example, we could search for the sequence:

[4, 1, 6]

This sequence starts at index 5 and continues to index 7. A linear search algorithm would find this sequence by comparing each index in the sequence against the first sequence in our pattern. This means it would compare 4 to 1, 3, 5, 8 and 9 until it finally found it's first match. It would then compare the next index of the array to the next index of our desired pattern, and continue doing this until we have a full match or until a difference is found; in which case the algorithm will carry on looking for it's next matching element starting at the beginning of the pattern.

What this might look like in code would be (C#):

using System;
					
public class Program
{
	public static void Main()
	{
		byte[] hay = new byte[] { 1, 3, 5, 8, 9, 4, 1, 6, 6, 4, 0 };
		byte[] needle = new byte[] {4, 1, 6};
		
		Console.WriteLine("We found a match for needle at: " + FindMatch(hay, needle));
	}
	
	// Returns -1 if no matchers
	// or 
	// Returns index of match if found 
	private static int FindMatch(byte[] haystack, byte[] needle)
	{
		// Can't find the needle if it's bigger than the haystack
		if (needle.Length > haystack.Length)
			return -1;
		
		int startIdx;
		
		// Loop through the haystack while it is still possible to find a match
		for (int i = 0; i < haystack.Length - needle.Length; i++)
		{
			// First byte of pattern matches current byte in haystack
			if (haystack[i] == needle[0])
			{
				// Compare the rest of the needle to corrusponding bytes in the haystack
				for (startIdx = 1; startIdx < needle.Length; startIdx++) 
				{
					if (needle[startIdx] != haystack[i + startIdx])
						break;
				}
				
				// If we looped through the whole pattern then weh ave a match
				if (startIdx == needle.Length)
					return i;
			}
		}
		
		return -1;
	}
}

 The above program outputs the following:

We found a match for needle at: 5

This form of pattern matching is much more powerful then file hashing as it allows you to look for specific pieces of a malicious file at a very granular level. For example, if you could look for a certain piece of the malicious script responsible for installing a payload or even analyze the binary data of a compiled executable. Looking for smaller more unique byte patterns is a lot harder to hide then simply changing the hash of your file. Because of this, pattern scanning is still a very powerful tool used to this day for a variety of purposes. Obviously, it's not powerful enough to be the only method of detecting malware however, some of the more advanced methods deserve full articles of their own.

Before we end this section, I wanted to quickly touch base with regards to what happens if a hacker was to simply change part of the data-set in which our pattern belongs to. Well... This is one of the reasons why pattern scanning isn't the only technology being used anymore and you are absolutely right! This could break out signature and we would no longer be able to detect said piece of malware. Similar to the hashing algorithms band-aid their are various approaches that have been used to help prevent this from occurring such as using wild cards in your signature. For example looking for the following pattern in our sample data-set:

[4, ?, 6]

Where '?' is a wild character that can be any byte or number of bytes before the pattern continues. Another band-aid like solution to extract multiple patterns from the same data-set in order to make it harder for the hacker to figure out exactly what part of their payload you are detecting. At the end of the day it all comes down to the quality of your pattern though. For example, did you extract a piece of compiled code responsible for dropping the payload or are you scanning for a modifiable attribute within the file?

Conclusion

We've barely touched the tip of the iceberg when it comes to detecting malware. I have some additional articles planned for the future with topics such as analyzing malware and extracting patterns, heuristic analysis, cryptography and even machine learning. If you have any questions, suggestions or concerns please feel free to drop or a comment below or to contact me and I'll follow up with you as soon as I can!