An Introduction to Malware Hashes and Hash Functions
Malware hashes are found everywhere in our industry, and for good reason: they efficiently identify malware samples and standardize the exchange of information among researchers, to name a couple of use cases.
The topic of hash functions is an enormous and complex one. There are dozens of them, if not more, with significant variations in base computation methods, applications, security, and outputs. So, for time and sanity’s sake, we will only discuss hashes in the context of information security and cybersecurity.
What is a Hash Function?
A hash function is an algorithm that takes an input of any size and produces a fixed-size output. The output is known as a hash, hash code, hash sum, hash value, checksum, digital fingerprint, or message digest. A hash calculated for a malware file is a malware hash.
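Cryptographic hash functions are designed to work in one direction only, from an input of any size to a fixed-size output. Recovering an input from its hash is meant to be computationally infeasible rather than mathematically impossible. The output is also not truly unique: because the number of possible inputs is unlimited and the number of possible outputs is fixed, distinct inputs that share the same hash (collisions) must exist. A secure function makes such collisions infeasible to find in practice, and modifying only one bit of the input generates a completely different hash sum.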
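The two properties described above, a fixed-size output and a completely different hash after a one-bit change, can be observed directly. The following is a minimal sketch using Python's standard hashlib module with SHA-256; the helper names and the sample input are made up for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def flip_lowest_bit(data: bytes) -> bytes:
    """Return a copy of data with the lowest bit of its last byte flipped."""
    return data[:-1] + bytes([data[-1] ^ 0x01])


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Inputs of very different sizes produce digests of the same length.
tiny = sha256_hex(b"a")
large = sha256_hex(b"a" * 1_000_000)
print(len(tiny), len(large))  # both are 64 hex characters (256 bits)

# Changing a single input bit changes the digest beyond recognition.
sample = b"hypothetical malware sample"  # placeholder input, not a real sample
original = sha256_hex(sample)
modified = sha256_hex(flip_lowest_bit(sample))
print(original)
print(modified)
print(differing_bits(original, modified), "of 256 bits differ")
```

On average, about half of the output bits are expected to change, which is why two nearly identical files cannot be matched by comparing their cryptographic hashes.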
It Takes All Kinds
The most common hash functions are MD5, SHA-1, SHA-256, and SHA-512. Their main purpose in cybersecurity is to generate identifiers for their inputs, such as malware files, that are unique for practical purposes and can be cataloged, shared, or (re)searched with relative ease.
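As a rough illustration of how such identifiers are produced, the sketch below computes all four of the hashes named above for a single file in one pass, using Python's standard hashlib module. The function name file_hashes and the chunk size are assumptions made for this example.

```python
import hashlib

ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def file_hashes(path, chunk_size=65536):
    """Return a dict mapping each algorithm name to the file's hex digest.

    The file is read in chunks so that large samples do not have to fit
    in memory all at once.
    """
    hashers = {name: hashlib.new(name) for name in ALGORITHMS}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Calling file_hashes on any file returns hex strings of 32, 40, 64, and 128 characters for MD5, SHA-1, SHA-256, and SHA-512 respectively, regardless of the file's size.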
Other types of hash functions are used for granular identification, grouping, comparison, and analysis of malware. For example, fuzzy hashes were developed to identify files that share characteristics or have been modified only slightly. One common type of fuzzy hash is SSDEEP.
Are Hashes Really Secure?
It depends. Some hashes once believed unbreakable are now considered to be insecure. This can mean that it is possible to reverse the hash, generate a collision (find two different inputs that produce the same hash value), or otherwise manipulate the algorithm and/or its output.
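To make the idea of a collision concrete, here is a toy sketch. It deliberately weakens SHA-256 by keeping only the first 4 hex characters (16 bits) of the digest, then searches for two different inputs that share that shortened value. This is not an attack on SHA-256 itself, and real collision attacks on functions such as MD5 exploit structural weaknesses rather than brute force; the truncation length and input naming scheme are assumptions chosen so the search finishes quickly.

```python
import hashlib
from itertools import count


def truncated_hash(data: bytes, hex_chars: int = 4) -> str:
    """A deliberately weak toy hash: the first hex_chars of SHA-256."""
    return hashlib.sha256(data).hexdigest()[:hex_chars]


def find_collision(hex_chars: int = 4):
    """Return (input_a, input_b, shared_digest) for two distinct inputs."""
    seen = {}  # truncated digest -> first input that produced it
    for i in count():
        candidate = f"sample-{i}".encode()
        digest = truncated_hash(candidate, hex_chars)
        if digest in seen:
            return seen[digest], candidate, digest
        seen[digest] = candidate


first, second, shared = find_collision()
print(first, second, "both map to", shared)
```

With only 65,536 possible outputs, a collision typically appears after a few hundred attempts. Each additional output bit makes the search harder, which is why the output length and internal design of a hash function matter for its security.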
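Of the hashing functions previously mentioned, MD5 is no longer considered secure: practical collision attacks against it have been known since 2004. SHA-1 is also considered broken for collision resistance, as a practical collision was demonstrated in 2017. SHA-256 and SHA-512 have no known practical attacks, but they can still be the wrong choice depending on the intended use. For example, they are too fast to compute to be used on their own for storing passwords.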
For protecting data such as passwords, strong hashing algorithms are necessary. A common-sense best practice is to make sure that any hash function you are considering meets the security requirements of both your use case and your organization or industry.
How are Hashes Used?
Hash functions have many uses in cybersecurity and elsewhere. Their overall most popular uses are in the areas of data confidentiality and integrity as well as authentication and non-repudiation. The following characteristics make them ideal for the job:
1) They are not practically reversible
2) The output has a fixed length and is, for practical purposes, unique to its input
3) The output is usually far smaller than the original data it represents.
A few examples of uses that rely heavily on the features above include:
Hash table – A data structure that stores key-value pairs and uses the hash of each key to decide where the entry is stored. Because a lookup only needs to hash the key and check the matching location, rather than search through all of the data, retrieval stays fast even for large data sets.
File integrity – A hash of the current data is computed and then compared with the stored hash of the original data. If the two values match, the data has, for all practical purposes, not been modified.
Password security – User-created passwords are run through a hashing algorithm and the hash is stored instead of the plaintext version. This protects passwords in the event of unauthorized access. Any password entered for logging in is hashed and compared with the stored hash for verification purposes.
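The password workflow in the last item can be sketched as follows. This is an illustrative example rather than a recommended production design: it uses PBKDF2 from Python's standard hashlib module, and it adds two things the description above does not mention, a random per-password salt and a configurable iteration count. Both are assumptions of this sketch, as are the function names.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumed work factor; tune to your own requirements


def hash_password(password: str, iterations: int = ITERATIONS):
    """Return (salt, digest) to store instead of the plaintext password."""
    salt = os.urandom(16)  # random salt so equal passwords hash differently
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


def verify_password(password: str, salt: bytes, stored_digest: bytes,
                    iterations: int = ITERATIONS) -> bool:
    """Hash the login attempt the same way and compare with the stored hash."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    # compare_digest runs in constant time to avoid leaking timing information
    return hmac.compare_digest(candidate, stored_digest)
```

At registration, the salt and digest returned by hash_password are stored. At login, verify_password recomputes the digest from the entered password and the stored salt, so the plaintext never needs to be kept.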
Hashes in Cybersecurity
In the cybersecurity industry, hashes are primarily used to identify, share, and group malware samples. One of the first use cases for them was in antivirus (AV) software. AVs use a database of malware hashes as a sort of blocklist. During the scanning process, the blocklist is compared against the hashes calculated for the executable files on the system. A match indicates a malicious file is present.
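The scanning process described above can be approximated in a few lines. The sketch below is not how any particular AV product works; it simply hashes every file under a directory with SHA-256 and checks each hash against an in-memory set of known-bad hashes. The function names and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def scan_directory(root, blocklist):
    """Return the paths under root whose SHA-256 hash is on the blocklist."""
    # Normalize to lowercase so the comparison is case-insensitive.
    known_bad = {entry.lower() for entry in blocklist}
    matches = []
    for path in Path(root).rglob("*"):
        if path.is_file() and sha256_of_file(path) in known_bad:
            matches.append(path)
    return matches
```

Because the blocklist is held in a set, which is itself a hash table, each membership check takes roughly constant time no matter how many hashes the list contains. A real scanner would also need to handle unreadable files and other errors, which this sketch omits.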
A drawback to this detection method is that the list of known malware hashes is already huge and grows larger every day. This amount of data can easily overload the storage and processing capacity of personal computers, IDS/IPS and firewalls. A best practice when using blocklists of any kind is to make sure your threat intelligence is fresh, not full of inactive indicators or false positives. Quality over quantity!
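As the industry evolved, security tools began to additionally make use of heuristic/behavioral analysis to detect malware. Without this capability, polymorphic malware, for example, would go undetected: because it alters its own code as it spreads, each copy produces a different hash that will not match any entry in a blocklist.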
Beyond the AV
Malware hashes have several uses beyond signature-based detection tools. They standardize and simplify the exchange of IoCs (Indicators of Compromise) among researchers. (And we all know that our malware naming conventions are far from standardized!) The list of hashes shown in a VirusTotal search result probably looks familiar to you.
Just as AV software uses malware hashes to look for infected machines, so do threat hunters and SOC teams. In fact, being able to search for a hash value instead of looking for evidence of the malware itself saves a lot of time. Having quality malware hash data in a TIP or SIEM to assist with investigations is invaluable.
Researchers utilize hashes to analyze and compare malware, such as with fuzzy hashes that try to find similarities between samples. Machine learning models can use perceptual hash data from screenshots, for example, to learn how to recognize screenshots of web pages that have similar content, such as phishing sites.
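Unlike cryptographic hashes, perceptual hashes are built so that similar images yield similar hash values, which means two hashes can be compared bit by bit. A minimal sketch of such a comparison follows. It assumes the perceptual hashes are available as equal-length hexadecimal strings (64-bit values in the usage note below); the encoding, the function names, and the example values are all hypothetical.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the bits that differ between two equal-length hex hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def similarity(hash_a: str, hash_b: str) -> float:
    """Return the fraction of matching bits, from 0.0 to 1.0."""
    total_bits = len(hash_a) * 4  # each hex character encodes 4 bits
    return 1.0 - hamming_distance(hash_a, hash_b) / total_bits
```

For example, hamming_distance("ffff0000ffff0000", "ffff0000ffff0001") is 1, so those two made-up hashes would be treated as near-identical screenshots. The distance below which two screenshots count as a match is a tuning decision that depends on the perceptual hash algorithm in use.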
While hashes have limitations when used for perimeter-based malware detection, they are still extremely helpful. It’s all about layered security and not ever depending on one tool or type of indicator to protect your network. As for the exchange of IoCs, threat research, and machine learning, hashes have a multitude of applications. Some of these are still new and very promising.
If you want to learn more about hashes, there are many good resources out there. Some articles we encountered during our research are listed below:
- Calculate Your Own Hashes with CyberChef: https://gchq.github.io/CyberChef/
Looking for Quality Malware Hashes Data?
Malware Patrol offers the three hash feeds below. A free evaluation is available on request.
1) Malware Hashes Feed. Includes MD5, SHA-1, and SHA-256 hashes, as well as classification of verified active malware and ransomware samples.
2) Risk Indicators Feed. Composed of a variety of IoCs, including MD5, SHA-1, and SHA-256 hashes, email addresses, cryptocurrency addresses, and CVEs. The curated data in this feed is derived from our network of honeypots as well as trusted third-party sources.
3) Phishing Screenshots & Perceptual Hashes Feed. Malware Patrol collects phishing URLs from various sources – crawlers, emails, spam traps, and more – to ensure coverage of the most current campaigns. We take screenshots of the phishing pages and the corresponding perceptual hashes are calculated. These can later be compared with hashes of other screenshots to determine a match likelihood.