Datasets Enumeration and Description
In over 20 years, Spamhaus has produced quite a number of different datasets. Some are very broad in terms of possible usage, others are extremely specific to one purpose only, and quite often email is far from the only field they can be successfully used in.
Understanding where they come from, what they’re intended to solve and what one can expect to be included in them is the basic foundation to understanding what they can do for one’s needs.
Here’s a list of the various datasets published by Spamhaus and their descriptions, along with the return code(s) they are associated with. Each return code is represented both as an IPv4 address (as used by the DNSBL semantics) and as an integer (HTTP API semantics).
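The code pairs quoted throughout this page appear to follow a simple pattern: the integer form equals 1000 times (third octet + 1), plus the fourth octet. The sketch below encodes that pattern. It is inferred from the examples on this page rather than taken from a documented rule, and the function name is purely illustrative.

```python
import ipaddress


def dnsbl_to_api_code(return_code: str) -> int:
    """Convert a DNSBL return code such as '127.0.0.2' to its integer form.

    Assumption: the mapping is inferred from the pairs listed on this page
    (127.0.0.2 -> 1002, 127.0.1.2 -> 2002, 127.0.3.10 -> 4010).
    """
    octets = ipaddress.IPv4Address(return_code).packed
    if octets[0] != 127 or octets[1] != 0:
        raise ValueError(f"not a DNSBL return code: {return_code}")
    return (octets[2] + 1) * 1000 + octets[3]
```

Error codes such as 127.0.1.255 are reported differently over HTTP, so they should be handled before any such conversion.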
Associated with the return code 127.0.0.2 (1002), the SBL is a manually maintained list of abuse-related resources, not limited to SMTP emitters. Resources that can be listed in the SBL include, for example, web servers or DNS servers (sometimes even routers) providing service to abusive actors, either as a result of a compromise or because they’re dedicated to that purpose.
In general, outright blocking at the SMTP level of a source listed by the SBL is supposed to be safe in terms of false positives. Another usage with a fairly low false-positive rate is checking the IPs contained in the Received headers of the messages (so-called Deep-Header Parsing, or DHP from now on).
Due to the characteristics above, however, other uses are possible: for example, a sender whose domain is served by an SBL-listed DNS server has a non-trivial probability of being abusive too.
Similarly, if the message contains URLs resolving to SBL-listed addresses, there’s a reasonable chance the message is abusive. However, use of the SBL for these specific purposes is encouraged only within scoring systems, as a contribution to a decision taken upon multiple factors.
Because the SBL is a manually managed resource, it can be suboptimal when it comes to following fast-paced operations that constantly shift from one location to another. Therefore, in order to keep track of these operations (snowshoe and hailstorm sources are the first to come to mind), something else was needed to complement it. That is how the CSS dataset came to life, associated with its own return code 127.0.0.3 (1003).
CSS is a completely automated sublist, listing SMTP emitters associated with a low reputation or confirmed abuse. A listed resource can be either controlled by an abusive actor or a compromised host. Its usage should be limited to the sending IP, where it can be used to outright reject the delivery.
DROP/EDROP is an additional flag added to SBL listings, indicating that the resource is known to be controlled by a bad actor. This means that a query returning the code for DROP/EDROP (127.0.0.9 / 1009) will also always return the code for SBL listings. It indicates that the queried IP is part of IP resources assigned to known rogue entities, with bulletproof hosters and similar shady operators being a typical example.
It is strongly suggested to avoid any kind of interaction with entities listed by this dataset. This is by no means limited to SMTP: as a matter of fact Spamhaus has been inviting consumers to apply DROP/EDROP at the firewall or router level, dropping any traffic coming from (or going to) these network resources.
For this reason, DROP/EDROP is distributed in a number of different ways, in order to give users the largest possible flexibility.
The distinction between DROP and EDROP is:
if the network resource has been directly given by an RIR to the bad actors, it belongs to DROP
if the network resource has been delegated to the bad actor by an ISP, it belongs to EDROP
From a consumer perspective, the difference between the two sub-lists is negligible, and they can safely be treated as the same thing. As a result, not all the query methods provide this dataset in a way that makes it possible to distinguish between the two components.
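For consumers applying DROP/EDROP at the firewall or router level, the check boils down to testing whether an address falls inside one of the listed networks. The following is a minimal sketch of that membership test. The networks shown are documentation-only placeholder ranges, not real DROP/EDROP entries, and the names are hypothetical.

```python
import ipaddress

# Hypothetical sample entries: documentation-only ranges (RFC 5737),
# NOT real DROP/EDROP data.
SAMPLE_DROP_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_dropped(address: str, networks=SAMPLE_DROP_NETWORKS) -> bool:
    """Return True if the address falls inside any of the given networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in networks)
```

In practice the network list would be loaded from whichever distribution format of the dataset is in use, and DROP and EDROP entries can simply be merged into one list.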
It is widely known that compromised hosts (be they servers sitting in a datacenter, infected computers on someone’s desk, or vulnerable IoT devices in somebody’s cellar) are generally used to emit spam (among other bad deeds). XBL is a list of IPs that have recently been observed hosting compromised hosts. It can be composed of several independent contributions, each associated with its own return code in the range between 127.0.0.4 (1004) and 127.0.0.7 (1007). However, at the moment the only component is the XBL itself, hence only 127.0.0.4 (1004) is currently being returned.
The first suggested use for this dataset is to outright block SMTP deliveries coming from an IP listed by it.
Hosts can be compromised and used for abusive purposes even without actually emitting spam, however. For this reason other usages are possible: for example, if a URL contained in the message body points to an exploited webserver, there’s a non-trivial chance that the message itself is spam, pointing the recipient to an abusive URL that will redirect them to the spammer’s website or, in the worst case, download malware of some sort.
Using the XBL to check the IPs URLs point to is therefore possible and suggested, but only as part of a scoring system where this is one of the indicators taken into account.
Similarly, using DHP against the XBL is possible, but the chance of false positives can be quite significant, particularly in cases where the source is on a dynamically assigned address (meaning the sender inherited an IP that hosted a compromised system hours before) or in case of NAT (where one host is compromised but most others are not, yet all share the same public IP); therefore, it should only be used in a scoring system.
A side effect of having compromised hosts emitting spam is that one would end up seeing SMTP traffic reaching their MX from networks where no SMTP server is expected, most notably dynamic IP space used for residential connectivity pools.
This means that, even when an infected system is not yet known to the XBL, one can possibly identify it as an unwanted source based on where the traffic is originating.
This is, in the end, what led to the creation of the PBL. Strictly speaking, it’s not an abuse-related list: it’s a list of dynamic and low-security IP space. In general, it’s address space that should never host an SMTP server, therefore any SMTP connection coming from this IP space is almost certainly abusive.
Since every message has to originate somewhere, DHP against PBL makes no sense and is highly discouraged. On the other hand, scoring based on PBL for URLs is possible, although not particularly performant.
Two return codes are associated with this dataset, telling whether the nature of the listed subnet has been inferred by Spamhaus (127.0.0.11 / 1011) or indicated directly by the ISP responsible for the network (127.0.0.10 / 1010).
Some bots are known to perform authentication credential hijacking or bruteforcing. Knowing if the peer in an authentication session is amongst those can be a precious datapoint when the application is trying to decide if a client session is legit or abusive.
AuthBL is basically that: a collection of bots known to use stolen credentials or authentication bruteforce.
For the largest part, AuthBL is therefore a subset of the XBL, and it aims to help in any situation where credentials are in use and can be stolen, from SMTP-AUTH, to IMAP, to HTTP or other protocols that have nothing to do with email in the first place, like ssh or VoIP.
It’s associated with its own return code 127.0.0.20 / 1020.
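The IP-based return codes described so far can be collected into a small lookup table. The sketch below maps the A records of a single reply to dataset names. It accepts several codes at once because, as noted for DROP/EDROP, one query can return more than one. The labels and function name are illustrative choices, not Spamhaus terminology.

```python
# Fourth octet of a 127.0.0.x reply -> dataset, as listed in this section.
IP_DATASETS = {
    2: "SBL",
    3: "CSS",
    4: "XBL", 5: "XBL", 6: "XBL", 7: "XBL",
    9: "DROP/EDROP",
    10: "PBL (ISP-maintained)",
    11: "PBL (Spamhaus-inferred)",
    20: "AuthBL",
}


def datasets_for(return_codes):
    """Map the return codes of one reply to the set of datasets they denote."""
    found = set()
    for code in return_codes:
        prefix, _, last = code.rpartition(".")
        if prefix != "127.0.0" or not last.isdigit():
            raise ValueError(f"unexpected return code: {code}")
        # Codes not covered by this page are reported rather than guessed.
        found.add(IP_DATASETS.get(int(last), f"unknown ({code})"))
    return found
```

For instance, `datasets_for(["127.0.0.2", "127.0.0.9"])` yields both `"SBL"` and `"DROP/EDROP"`.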
IPs are not, by far, the only thing in a message that can be associated with a reputation.
DBL is a database of domains with a poor reputation, at least from the end-user perspective.
In truth, what the DBL effectively does is keep track of every domain seen on the Internet, compute a reputation “score” for each of them, and produce a list of those that
are above a certain threshold
have been observed active in the last X days
This list is what users would query.
Different return codes are used to tag the type of abuse the domain has been observed involved in whenever that information is available.
One thing that should be noted is that not all the records have the same meaning in terms of “badness”. Two separate sets of return codes are provided:
127.0.1.2-99 / 2002-2099 identify resources that are considered inherently bad or associated with a low reputation. In general, it means that the domain is “safe to block” according to Spamhaus data.
127.0.1.102-199 / 2102-2199 identify domains that, while not inherently “bad”, have been observed involved in abuse. Briefly referred to as “abused-legit”, the typical example is a domain that, due to a security issue, is currently serving malicious content. This second set of return codes is only suggested for use in scoring systems.
If queried for an IP, the DBL will return a positive reply with the return code 127.0.1.255: this should in every respect be treated as an error code, with the meaning “IP queries not supported”; in HTTP lookups, this error is conveyed by an HTTP code.
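The two DBL ranges and the error code can be told apart by the fourth octet alone. The following is a minimal sketch under that reading. The verdict labels ("block", "score", "error", "unknown") are illustrative names chosen here.

```python
def dbl_verdict(return_code: str) -> str:
    """Classify a DBL (127.0.1.x) return code by the ranges described above."""
    prefix, _, last = return_code.rpartition(".")
    if prefix != "127.0.1" or not last.isdigit():
        raise ValueError(f"not a DBL return code: {return_code}")
    octet = int(last)
    if octet == 255:
        return "error"    # an IP was queried: not a listing
    if 2 <= octet <= 99:
        return "block"    # inherently bad or low reputation
    if 102 <= octet <= 199:
        return "score"    # abused-legit: scoring systems only
    return "unknown"
```

Checking for the error code first matters: treating 127.0.1.255 as a listing would flag every IP that is mistakenly queried.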
Ok, but then: what if somebody starts using a domain that has just been registered, before it even acquires a reputation?!?
Unsurprisingly, that happens all the time and has for quite a while. The actual numbers may differ, but something all research on domain abuse agrees upon is that the vast majority of newly registered domains will only be used for some bad deed for some time, then scrapped and left unused until they expire.
As a result, the vast majority of newly-registered domains can’t be trusted, turning the lack of reputation into a strong reputation indicator by itself, in a way.
ZRD is a database of domains that have been observed for the first time in the last 24 hours and can therefore be treated with extreme prejudice.
The fourth octet of the return code (in the range between 127.0.2.2 / 3002 and 127.0.2.24 / 3024) indicates the time elapsed since the domain was first observed, in hours.
If queried for an IP, the ZRD will return a positive reply with the return code 127.0.2.255: this should in every respect be treated as an error code, with the meaning “IP queries not supported”; in HTTP lookups, this error is conveyed by an HTTP code.
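Reading the age of a domain out of a ZRD reply is then a matter of extracting the fourth octet. The sketch below takes the description literally (the octet value is the number of hours) and returns `None` for the error code. The function name is hypothetical.

```python
def zrd_age_hours(return_code: str):
    """Return hours since first observation, or None for the error code."""
    prefix, _, last = return_code.rpartition(".")
    if prefix != "127.0.2" or not last.isdigit():
        raise ValueError(f"not a ZRD return code: {return_code}")
    octet = int(last)
    if octet == 255:
        return None  # an IP was queried: not a listing
    if 2 <= octet <= 24:
        return octet
    raise ValueError(f"fourth octet outside the documented range: {octet}")
```

A scoring system could, for example, weigh a domain seen 2 hours ago more heavily than one seen 24 hours ago.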
When evaluating whether an email message is to be considered malicious, a great deal of the effort goes into assessing various elements of that message. These elements are “hooks” on which to attach “reputation” to the message.
All the datasets detailed above can be used to associate internet resources (i.e. IP addresses or domain names) to a reputation factor. This approach also can be extended to include other parts of an email message, namely content, in order to provide additional “reputation hooks”.
Hash Blocklist (HBL) focuses on content elements (tokens), such as email addresses or cryptocurrency wallets. Software analyses the incoming message, identifies and normalizes these content elements, and for each of them produces a cryptographic hash. This means that a known mathematical function is applied to the element and produces a string of bits with a fixed size, known as the hash. The function is thought to be impossible to invert, so that it is not feasible to obtain the original contents from the hash.
This method is industry-standard and has the dual advantage of reducing all contents to a fixed size, and avoiding the disclosure of the content in distributed data which may be disallowed by specific privacy laws.
The HBL list contains hashes, therefore the hash is queried against the list to determine whether the content element has a bad reputation.
For example, in order to verify whether the file with an SHA256 hash (represented in BASE32 as defined by RFC4648) of KADTR46EPIEQVM7C3GEODZCTXO2HUQSO34T3YYLVBCMPOAA3GSBA is known to HBL, one would query HBL for that string with the substring ._file appended.
The appended substring ._file specifies the context the hashed content belongs to.
Each context can be considered a separate sub-list, comprising hashes of a specific token type. The HBL can be extended by adding more contexts at any time.
The implementation generally relies on SHA256 hashes represented as BASE32 strings, so wherever “SHA256” is reported as the sublist content, its BASE32 representation is meant.
For some contexts, SHA1 hashes are supported too (represented as hexadecimal strings), to ensure compatibility with existing software.
Contexts currently implemented are:
SHA256 only. It contains file hashes of two types:
malicious: identified by the return code 127.0.3.10 / 4010 - meaning the queried file has been analyzed by Spamhaus Malware Labs and recognized as known malware.
suspicious: identified by the return code 127.0.3.15 / 4015 - meaning the queried file has been observed in spam and its nature makes it particularly suspicious. Despite Spamhaus Malware Labs having not yet confirmed its maliciousness, it still should be treated with extreme caution.
In order to allow users to test their implementation, a hash of the EICAR test file is always present in the dataset:
A DNS query asking for the TXT record returns a URL pointing to a lookup form, in case the listing needs to be re-evaluated, followed by the malware family in parentheses. In case of suspicious files, the malware family reported will be
$ dig +short txt E5NAEG57WZEJ4VGUOGEZ67NZ2FTD7RUV5QX6FIWEKOFKX5SR7UHQ._file.<key>.hbl.dq.spamhaus.net
"https://www.spamhaus.org/query/hash/e5naeg57wzej4vguogez67nz2ftd7ruv5qx6fiwekofkx5sr7uhq._file (EICAR_test_file)"
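The label used in the query above can be reproduced with standard library calls. This is a sketch, assuming the BASE32 string is used without its trailing `=` padding, as the 52-character hashes shown on this page suggest. The function names are illustrative, and `key` stands for the `<key>` placeholder in the example.

```python
import base64
import hashlib


def base32_label(digest: bytes) -> str:
    """Encode a raw digest as upper-case BASE32 (RFC 4648), padding removed."""
    return base64.b32encode(digest).decode("ascii").rstrip("=")


def file_hash_label(data: bytes) -> str:
    """SHA256 of the file contents, in the BASE32 form used for HBL queries."""
    return base32_label(hashlib.sha256(data).digest())


def hbl_file_query_name(digest: bytes, key: str) -> str:
    """Build the DNS name for a file-context lookup, following the dig example."""
    return f"{base32_label(digest)}._file.{key}.hbl.dq.spamhaus.net"
```

Feeding the well-known SHA256 digest of the EICAR test file (hex `275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f`) to `base32_label` gives the label shown in the `dig` example, which makes it a convenient self-check.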
SHA256 and SHA1. It contains the hashes of cryptowallet addresses (Bitcoin, Bitcoin Cash, Ethereum, Monero, Ripple, Litecoin) observed in spam campaigns.
Ethereum addresses (given their representation as a hexadecimal number) must be converted to lower case before being hashed, while all the other wallets’ canonical strings must be hashed in the form they have in the message.
The return code is 127.0.3.20 / 4020.
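The normalization rule for wallet addresses can be sketched as follows. Ethereum addresses are recognized here by their usual shape (`0x` followed by 40 hexadecimal digits); that detection rule and the function names are assumptions made for the example, not part of the HBL specification. Only the SHA256 variant is shown.

```python
import base64
import hashlib
import re

# Assumption: an Ethereum address is "0x" followed by 40 hex digits.
_ETH_ADDRESS = re.compile(r"0x[0-9a-fA-F]{40}")


def normalize_wallet(address: str) -> str:
    """Lower-case Ethereum addresses; leave any other wallet string as-is."""
    if _ETH_ADDRESS.fullmatch(address):
        return address.lower()
    return address


def wallet_hash(address: str) -> str:
    """SHA256 of the normalized address as unpadded BASE32."""
    digest = hashlib.sha256(normalize_wallet(address).encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=")
```

With this rule, two spellings of the same Ethereum address that differ only in letter case produce the same hash, while addresses of other currencies remain case-sensitive.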
In order to allow users to test their implementation, the following test entries are always in the dataset
|Currency||Wallet address||SHA256 Hash|
SHA256 and SHA1. It contains hashes of email addresses observed in spam, either as Sender, Reply-To or in the message body. Contact addresses used for 419 (“Nigerian Prince”) scams, or email contacts seen in aggressive marketing campaigns based on unsolicited messages, are typical examples of what this sublist aims to target.
Before applying the hashing function, email addresses are supposed to go through a minimal normalization process. This consists of the following steps:
lowercase the entire address
strip the left-hand part from tags and similar, removing anything following and including the first ‘+’ character in it
if the right-hand part is “googlemail.com”, replace it with “gmail.com”
if the right-hand part is “gmail.com”, strip all the dots from the left-hand part
Example code in perl (assuming an already-valid email address):
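    use Digest::SHA qw(sha256);
    use MIME::Base32 qw( RFC );

    # Lower-case the address and split it into local part and domain
    my ($Left, $Right) = split ('@', lc($Address));
    # Remove the tag: everything from the first '+' onwards
    $Left =~ s/\+.*$//;
    # Treat googlemail.com as gmail.com
    $Right = 'gmail.com' if ($Right eq 'googlemail.com');
    # Strip all dots from the local part for gmail.com
    $Left =~ s/\.//g if ($Right eq 'gmail.com');
    # SHA256 of the normalized address, encoded as BASE32
    $Hash = MIME::Base32::encode(sha256($Left.'@'.$Right));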
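For comparison, here is a sketch of the same normalization steps in Python. Like the Perl version, it assumes an already-valid address; the function names are illustrative, and the BASE32 padding is stripped to match the 52-character hashes shown on this page.

```python
import base64
import hashlib


def normalize_email(address: str) -> str:
    """Apply the four normalization steps listed above."""
    left, _, right = address.lower().rpartition("@")
    left = left.split("+", 1)[0]          # drop '+tag' and what follows
    if right == "googlemail.com":
        right = "gmail.com"
    if right == "gmail.com":
        left = left.replace(".", "")      # dots are stripped for gmail.com
    return f"{left}@{right}"


def email_hash(address: str) -> str:
    """SHA256 of the normalized address as unpadded BASE32."""
    digest = hashlib.sha256(normalize_email(address).encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=")
```

For example, `normalize_email("First.Last+news@GoogleMail.com")` returns `"firstlast@gmail.com"`, whereas dots are preserved for any other domain.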
The return code is 127.0.3.2 / 4002.
In order to allow users to test their implementation, the email address
firstname.lastname@example.org is always present in the dataset both as SHA256 (
F3PDGTMWU6LFIGDJC67YNIWRY5ZRM7ERLETNFO36QAEQPMBPW2DA) and SHA1 (