Author: Stephan Simon
The first post of this two-part blog covered a simple introduction to YARA, simple rules and the basic structure around them. If you haven’t already read through it, the post can be found here.
In this second part, we will cover the basics of creating more complex YARA rules based on code.
Hexadecimal Strings
One of the most common ways to use YARA is by creating a collection of strings found within the malware or files you are trying to detect. What if the malware somehow obfuscates them, though? What if the malware itself is obfuscated? You may still be able to use features like the PE module to check DLL imports and the import hash, but these methods may not be totally unique.
YARA can search using hexadecimal strings (referred to as “hex” from this point on). Instead of only searching for the hex equivalent of an ASCII string (e.g., “Hello” is “48 65 6c 6c 6f”), rules can be created to find patterns in the actual bytes that make up a target file. Hex strings are an extremely useful way for a malware analyst to detect code patterns found during an analysis.
Wildcards in Hex Strings
Hex strings also have their own useful feature: wildcards. YARA’s documentation describes wildcards as “placeholders that you can put into the string indicating that some bytes are unknown and they should match anything.” By using well-placed wildcards, hex rules can detect patterns in code even if some of the underlying bytes in the binary change. To use the word “Hello” again as an example, the pattern “48 65 ?? 6c 6f” means that YARA does not care what byte is in the third position of that string and should return a match if it finds the rest of the bytes.
Another nice wildcard feature in YARA is the “jump.” Jumps allow the author to specify that a range of bytes will be unknown. By using “48 65 [1-3] 6c 6f” as a string, the author is telling YARA that as long as “He” and “lo” are found with anywhere between one and three bytes between them, it’s a match. A jump can also be written as a single number in square brackets instead of a range, as a shortcut for typing out several wildcards.
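The wildcard and jump forms described above can be put side by side in one small rule. This is only a sketch: the rule name and string identifiers are made up for illustration and are not part of the Echelon rule built later in this post.

```yara
rule Hello_Hex_Sketch   // hypothetical rule name, for illustration only
{
    strings:
        $text  = "Hello"                 // plain ASCII text string
        $exact = { 48 65 6C 6C 6F }      // the same five bytes written as hex
        $wild  = { 48 65 ?? 6C 6F }      // third byte may be anything
        $jump  = { 48 65 [1-3] 6C 6F }   // one to three unknown bytes between "He" and "lo"
        $fixed = { 48 65 [2] 6C 6F }     // shortcut for 48 65 ?? ?? 6C 6F

    condition:
        any of them
}
```

A file containing “Hello” satisfies the first four strings. $fixed needs exactly two bytes between “He” and “lo”, so it would match something like “HeXXlo” instead.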
Interpreted Languages
Because this post is about YARA rules and not reverse engineering, a .NET binary with no obfuscation was chosen intentionally to make analysis simpler, even though it is not likely to be found in this state in the wild. Malware authors have used .NET in several recent samples, but the code is usually obfuscated to complicate analysis. Unlike applications written in compiled languages such as C or C++, which produce machine instructions, .NET applications compile to an intermediate language called Microsoft Intermediate Language (MSIL). Without heavy obfuscation tricking a decompiler, it’s very easy to recover source code from the MSIL in .NET applications.
As we look through the sample, anyone familiar with programming in a C-based language should be able to follow the process. Creating rules for .NET-based languages is largely the same as creating them for a native language, in that the analyst is just deciding which bytes represent unique or interesting segments of the code and should therefore be added to a hex string for pattern matching.
Unfortunately, there is a trade-off here. While it is easier to recover readable code, .NET binaries store a lot of metadata about the application. In native code, a function call refers to the function by its address (its location in the code). As the application is updated and recompiled, there is no guarantee that the location of a function will stay the same. The same is true of objects in .NET applications. Object references are everywhere in MSIL, meaning hex rules will (most likely) require many more wildcards. More wildcards can cause performance problems when scanning large numbers of files with YARA. Single fixed wildcards are faster than variable ranges of wildcard bytes. Wildcard nibbles (4 bits fixed and 4 bits that can match anything) make rules very slow to run.
Detecting the Echelon Malware Sample
This post will be covering Echelon, a stealer-type malware written for the .NET Framework. Specifically, we’ll use the malware sample with the SHA256 hash b52d4177277851b95c5cdf08bf2e3261c7ac80af449da00741c83bcf6c181d67.
The details of this sample can be viewed on VirusTotal: https://www.virustotal.com/gui/file/b52d4177277851b95c5cdf08bf2e3261c7ac80af449da00741c83bcf6c181d67/details
A copy of the malware sample can also be downloaded from Malware Bazaar: https://bazaar.abuse.ch/sample/b52d4177277851b95c5cdf08bf2e3261c7ac80af449da00741c83bcf6c181d67/
Reading Through Code
To view the source for Echelon, I’ll be using dnSpy. This tool allows us to view the source code of compiled .NET binaries in Visual Basic.NET (VB.NET), C#, or the compiled MSIL. No matter which language a binary was written in, a dropdown allows the analyst to read in the language of their choice. This blog post will be showing a combination of C# and the MSIL views.
Upon opening the sample, we can see a lot of classes (blue text) in the Assembly Explorer (left-hand side). Right-clicking on the “Echelon” namespace (yellow text) will bring up a few context options, including “Go to Entry Point,” which will take us to the Main function inside the Program class.
The language toggle can be found at the top of dnSpy. I find working in a split view with both the C# and MSIL viewable together very helpful.
When viewing the MSIL, dnSpy also shows other useful information, including the hexadecimal location, hex opcodes (blue box), opcode arguments (red box), and the name of the opcode (green box). If you are unfamiliar with any of the opcodes shown, simply click on them and dnSpy will open the MSDN documentation for that specific opcode in your default browser. Not all opcodes are a single byte, so be careful! In the screenshot below, opcode “ceq” is “FE 01” in hex, but the box is drawn over a single column for simplicity.
You may have noticed that the MSIL view placed some empty lines in the code. These empty lines follow branching statements in the code, making the MSIL view easier to follow. I’m going to choose to use some of these to break up the code into smaller hex strings in YARA, starting with lines 39-47 in the above screenshot. By clicking on opcodes and viewing the documentation, I can see that the majority of these are a single byte followed by a 4-byte reference. Since these references can change between builds, we aren’t interested in them right now. Using wildcard jumps to fill in for these dynamic bytes, we end up with something like this:
$main_1 = { 00 7E [4] 72 [4] 7E [4] 28 [4] 28 [4] 0B 07 2C }
In the very last line of MSIL, I’ve left out the last byte. The first byte, “2C,” is the opcode for branching if the value is false, while the following “2A” is the “target,” or offset in the code. This is another byte whose value could change between builds, so we aren’t interested in it. There is no need to use the entire line, and a trailing wildcard would only hurt performance. Before moving on, I’ll create a second hex string based on lines 49-61.
$main_2 = { 00 7E [4] 72 [4] 7E [4] 28 [4] 28 [4] 7E [4] 6F [4] 16 FE 01 0C 08 2C }
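To make the byte-to-opcode mapping easier to follow, the two strings can be laid out one instruction per line inside a throwaway rule. This is a sketch: the rule name is hypothetical, and the opcode names in the comments come from the standard MSIL opcode table rather than from the original screenshots. The bytes themselves are unchanged from $main_1 and $main_2 above.

```yara
rule Echelon_Main_Sketch   // hypothetical rule name, for illustration only
{
    strings:
        $main_1 = {
            00          // nop
            7E [4]      // ldsfld, followed by a 4-byte reference
            72 [4]      // ldstr, followed by a 4-byte reference
            7E [4]      // ldsfld
            28 [4]      // call
            28 [4]      // call
            0B          // stloc.1
            07          // ldloc.1
            2C          // brfalse.s (the branch target byte is left out)
        }
        $main_2 = {
            00          // nop
            7E [4]      // ldsfld
            72 [4]      // ldstr
            7E [4]      // ldsfld
            28 [4]      // call
            28 [4]      // call
            7E [4]      // ldsfld
            6F [4]      // callvirt
            16          // ldc.i4.0
            FE 01       // ceq (a two-byte opcode)
            0C          // stloc.2
            08          // ldloc.2
            2C          // brfalse.s (the branch target byte is left out)
        }

    condition:
        all of ($main_*)
}
```

Writing the strings this way costs nothing at scan time, since comments and line breaks inside a hex string are ignored, and it makes it much easier to revisit a rule months later.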
These two sections of code should be enough for this example. Static or hard-coded configurations are great candidates for code rules as well. Scrolling further down, we can see that Echelon stores its configuration inside the constructor for the Program class. While a string-based rule might trigger on hard-coded API keys or passwords, the format of a configuration probably doesn’t change much between binaries until the author makes bigger code changes.
This next hex string is for Echelon’s configuration, but it uses a simple regular-expression-like construct (alternatives, written as options separated by “|” inside parentheses) and a wildcard in a slightly different way. Be careful when using these expressions! They can have a big impact on performance when scanning. Inside the alternatives, standard wildcards (??) also have to be used in place of wildcard jumps.
$configuration = { 72 [4] 80 [4] 72 [4] 80 [4] 72 [4] 80 [4] 20 [4] 80 [4] (1? | 20 ?? ?? ?? ??) 8D [4] 25 16 72 [4] A2 }
I’ve chosen to use a wildcard at the nibble level because of the “ldc.i4.8” opcode (1E in hex). It’s a very simple opcode that pushes an 8 onto the stack. Several similar opcodes exist for pushing the numbers 0-7 as well. Reading a little more documentation would reveal to the analyst that the opcode that creates the string array (“newarr,” 8D in hex) reads a number from the stack to know how many elements the array contains. If the malware author adds more file extensions to the string array, this opcode would change while the rest of the code would look the same. If enough elements are added, the .NET compiler would use “ldc.i4” (20 in hex) instead, which takes a 32-bit integer as an argument. With a little extra effort, this hex string can potentially detect newer versions of Echelon’s configuration.
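The same configuration string can be annotated one instruction per line to show where the alternatives and the nibble wildcard sit. This is a sketch: the rule name is hypothetical, and the opcode names in the comments come from the standard MSIL opcode table. The bytes are identical to $configuration above.

```yara
rule Echelon_Config_Sketch   // hypothetical rule name, for illustration only
{
    strings:
        $configuration = {
            72 [4] 80 [4]       // ldstr + stsfld: first hard-coded string setting
            72 [4] 80 [4]       // ldstr + stsfld: second string setting
            72 [4] 80 [4]       // ldstr + stsfld: third string setting
            20 [4] 80 [4]       // ldc.i4 + stsfld: 32-bit integer setting
            (
                1?              // short form: ldc.i4.0 (16) through ldc.i4.8 (1E) fall in 10-1F
              |
                20 ?? ?? ?? ??  // long form: ldc.i4 with a 32-bit array length
            )
            8D [4]              // newarr: creates the array, reading its length from the stack
            25                  // dup
            16                  // ldc.i4.0: index of the first element
            72 [4]              // ldstr: first element of the string array
            A2                  // stelem.ref
        }

    condition:
        $configuration
}
```

Note that “1?” is broader than strictly necessary, since it accepts any byte from 10 to 1F. That is the price of expressing “one of several adjacent opcodes” with a single nibble wildcard.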
Piecing it Together
After identifying the sections of code to be used in a detection, it’s time to add them to a rule file. The completed rule has been uploaded to https://pastebin.com/eX2eF8qF for convenience.
Basic rules were covered in the first part of this blog, so we will skip right to the conditions of the rule after adding the three hex strings from earlier.
The first condition is extremely common to see in rules as it is looking for the traditional “MZ” header found in exe or dll files. It is asking for those specific bytes to be found at position 0 in the file.
Next up is the filesize keyword. As the word implies, this keyword limits matches to files of the specified size. The size does not have to be given in megabytes: a plain number is read as bytes, and the KB and MB suffixes are both available. See YARA’s documentation for more info.
Something that should have stuck out is the import “pe” statement at the top of this rule. YARA has many helpful modules that can be used to add features. In this case, we use the pe module in the condition to require that the binary import mscoree.dll, which is imported by all .NET binaries.
The last part of our condition is fairly simple as well. This statement says that if $configuration exists, this part of the condition should return true. If not, both of the strings beginning with $main_ are required to return true. Referring to $main_1 and $main_2 with a wildcard like $main_* is a convenient way to refer to specific groups of strings in one statement.
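Put together, the conditions described above could look roughly like the following. This is a sketch reconstructed from the prose, not a copy of the uploaded rule: the rule name, the meta fields, and the 5MB size limit are assumptions, while the three hex strings are the ones built earlier in this post.

```yara
import "pe"

rule Echelon_Stealer_Sketch   // hypothetical rule name
{
    meta:
        description = "Sketch of the Echelon rule described in this post"
        hash = "b52d4177277851b95c5cdf08bf2e3261c7ac80af449da00741c83bcf6c181d67"

    strings:
        $main_1 = { 00 7E [4] 72 [4] 7E [4] 28 [4] 28 [4] 0B 07 2C }
        $main_2 = { 00 7E [4] 72 [4] 7E [4] 28 [4] 28 [4] 7E [4] 6F [4] 16 FE 01 0C 08 2C }
        $configuration = { 72 [4] 80 [4] 72 [4] 80 [4] 72 [4] 80 [4] 20 [4] 80 [4] (1? | 20 ?? ?? ?? ??) 8D [4] 25 16 72 [4] A2 }

    condition:
        // "MZ" header at offset 0, read as a little-endian 16-bit integer
        uint16(0) == 0x5A4D and
        // size limit; 5MB is an assumed value, not taken from the original rule
        filesize < 5MB and
        // every .NET binary imports mscoree.dll
        pe.imports("mscoree.dll") and
        // either the configuration block, or both $main_ strings
        ($configuration or all of ($main_*))
}
```

Ordering the cheap checks (header, size, import) before the string logic keeps the condition easy to read from top to bottom: each line narrows the set of candidate files a little further.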
A Final Note
While the rule created for this post only detected Echelon in my small library of malware, it should not be considered a rule for production use! To speed up the process of creating this guide, very little analysis was done, and only on a single sample. When creating your own rules, a well-thought-out rule should take into account a much broader range of functionality drawn from multiple related samples. YARA rules also do not have to be exclusively string-based or hex-based. The Echelon sample used here had several unique strings that were worth adding to a rule.