Dataset Enumeration and Description

Over more than 20 years, Spamhaus has ended up producing quite a number of different datasets. Some are very broad in terms of possible usage, others are extremely specific to a single purpose, and quite often email is far from the only field in which they can be successfully used.

Understanding where they come from, what they’re intended to solve, and what one can expect to find in them is the basic foundation for understanding what they can do for one’s needs.

Here is a list of the various datasets published by Spamhaus, each with a description and its associated return code(s). Return codes are represented both as an IPv4 address (the DNSBL semantics) and as an integer number (the HTTP API semantics).

SBL

Associated with the return code 127.0.0.2 (1002), the SBL is a manually maintained list of abuse-related resources, not necessarily of exclusively SMTP emitters. Resources that can be listed in the SBL are, for example, webservers or DNS servers (sometimes even routers) providing service to abusive actors, either as a result of a compromise or because they’re dedicated to that purpose.

In general, outright blocking at the SMTP level of a source listed by the SBL is supposed to be safe in terms of false positives. Another usage with a fairly low false-positive rate is checking the IPs contained in the Received headers of messages (so-called Deep-Header Parsing or, from now on, DHP).
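As a minimal sketch of how such a lookup works (the zone name <key>.sbl.dq.spamhaus.net is an assumption here, by analogy with the HBL zone shown later in this document), the query name is built by reversing the octets of the IP and prepending them to the zone:

# hypothetical helper: check an IPv4 address against a given sublist
# usage: check_ip 192.0.2.1 sbl
check_ip() {
    local rev
    rev=$(echo "$1" | awk -F. '{print $4"."$3"."$2"."$1}')
    dig +short "${rev}.<key>.$2.dq.spamhaus.net"
}

A listed IP produces one or more 127.0.0.x answers, while an unlisted one produces no answer at all (NXDOMAIN).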

Due to the characteristics above, however, other uses are possible: for example, a sender whose domain is served by an SBL-listed DNS server has a non-trivial probability of being abusive too.

Similarly, if the message contains URLs resolving to SBL-listed addresses, there’s a reasonable chance the message is abusive. However, use of the SBL for these specific purposes is encouraged only within scoring systems, as one contribution to a decision based on multiple factors.
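As a sketch of that URL check, reusing the hypothetical check_ip helper above ($urlhost stands for a hostname extracted from a URL in the message):

# one possible scoring input: do the A records of this URL's host hit the SBL?
for ip in $(dig +short A "$urlhost" | grep -E '^[0-9.]+$'); do
    [ -n "$(check_ip "$ip" sbl)" ] && echo "SBL hit on $ip: increase the message score"
done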

CSS

Because the SBL is a manually managed resource, it can be suboptimal when it comes to following fast-paced operations that constantly shift from one location to another. Therefore, in order to keep track of these operations (snowshoe and hailstorm sources are the first to come to mind), something else was needed to complement it. That is how the CSS dataset came to life, associated with its own return code 127.0.0.3 (1003).

It’s a completely automated sublist, listing SMTP emitters associated with a low reputation or confirmed abuse. This can mean either a resource controlled by an abusive actor or a compromised host. Its usage should be limited to checking the sending IP, and a listing can be used to outright reject the delivery.
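In Postfix, for example, this kind of outright rejection could be sketched with reject_rbl_client, filtering on the CSS return code (the zone name is again an assumption; adapt it to the query method actually in use):

# main.cf sketch: reject clients that the DNSBL answers with 127.0.0.3 (CSS)
# (other restrictions omitted)
smtpd_recipient_restrictions =
    permit_mynetworks,
    reject_rbl_client <key>.sbl.dq.spamhaus.net=127.0.0.3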

BCL

The Spamhaus Botnet Controller List (“BCL”) is a specialized advisory “drop all traffic” list consisting of single IPv4 or IPv6 addresses, used by cybercriminals to control infected computers (bots). BCL does not contain any subnets or CIDR prefixes shorter than /32 for IPv4 or /128 for IPv6.

If an IP address meets both of the following criteria:

  • The server hosted at the IP address is used to control computers that are infected with malware

  • The server hosted at the IP address is operated with malicious intent (in other words, the server is operated by cybercriminals for the exclusive purpose of hosting a botnet C&C server)

then it appears as listed by a DNSBL lookup, with its own return code 127.0.0.30 (1030).

If only the first condition is met, the IP would still make its way into the BCL, but it would be displayed only by an API lookup.

DROP/EDROP

It’s an additional flag added to SBL listings, indicating that the resource is known to be controlled by a bad actor; this means that a query returning the DROP/EDROP code (127.0.0.9 / 1009) will also always return the code for SBL listings. It indicates that the queried IP is part of IP resources assigned to known rogue entities, with bulletproof hosters and similar shady operators being a typical example.

It is strongly suggested to avoid any kind of interaction with entities listed by this dataset. This is by no means limited to SMTP: as a matter of fact, Spamhaus has long been inviting consumers to apply DROP/EDROP at the firewall or router level, dropping any traffic coming from (or going to) these network resources.

For this reason, DROP/EDROP is distributed in a number of different ways, in order to give users the greatest possible flexibility.
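As a sketch of the firewall-level approach (assuming the plain-text edition of the list published at https://www.spamhaus.org/drop/drop.txt, and an nftables set named spamhaus_drop of type ipv4_addr with the interval flag):

# load the DROP netblocks into an nftables set, stripping comments and blank lines
curl -s https://www.spamhaus.org/drop/drop.txt \
    | sed -e 's/;.*//' -e '/^[[:space:]]*$/d' \
    | while read -r net; do
        nft add element inet filter spamhaus_drop "{ $net }"
    done

A rule such as ip saddr @spamhaus_drop drop (and its daddr counterpart) then discards all traffic from and to the listed networks.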

The distinction between DROP and EDROP is:

  • if the network resource has been directly assigned by an RIR to the bad actors, it belongs to DROP

  • if the network resource has been delegated to the bad actor by an ISP, it belongs to EDROP

From any consumer perspective, the difference between the two sub-lists is negligible, and they can safely be treated as the same thing. As a result, not all the query methods provide this dataset in a way that allows discriminating between the two components.

XBL

It is widely known that compromised hosts (be they servers sitting in a datacenter, infected computers on someone’s desk, or vulnerable IoT devices in somebody’s cellar) are generally used to emit spam (among other bad deeds). XBL is a list of IPs where compromised hosts have recently been observed. It can be composed of several independent contributions, each associated with its own return code in the range between 127.0.0.4 (1004) and 127.0.0.7 (1007); at the moment, however, the only component is the XBL itself, hence only 127.0.0.4 (1004) is currently returned.

The first suggested use for this dataset is to outright block SMTP deliveries coming from an IP listed by it.

Hosts can be compromised and used for abusive purposes even without actually emitting spam, however. For this reason other usages are possible: for example, if a URL contained in the message body points to an exploited webserver, there’s a non-trivial chance that the message itself is spam, pointing the recipient to an abusive URL that will redirect them to the spammer’s website or, in the worst case, download malware of some sort.

Using the XBL to check the IPs that URLs point to is therefore possible and suggested, but only as part of a scoring system where this is one of the indicators taken into account.

Similarly, using DHP against the XBL is possible, but the chance of false positives can be quite significant, particularly when the source is on a dynamically assigned address (meaning the sender inherited an IP that hosted a compromised system hours before) or behind NAT (where one host is compromised but most others are not, yet all share the same public IP); therefore, it should only be used in a scoring system.
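A rough sketch of this kind of scored DHP check, reusing the hypothetical check_ip helper from the SBL section (the header extraction is deliberately naive and ignores folded headers):

# count XBL hits among the IPs appearing in Received headers
score=0
for ip in $(grep -i '^Received:' message.eml | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort -u); do
    [ -n "$(check_ip "$ip" xbl)" ] && score=$((score + 1))
done
echo "XBL hits in Received chain: $score"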

PBL

A side effect of having compromised hosts emitting spam is that one ends up seeing SMTP traffic reaching their MX from networks where no SMTP server is expected, most notably dynamic IP space used for residential connectivity pools.

This means that, even when an infected system is not yet known to the XBL, one can possibly identify it as an unwanted source based on where the traffic is originating.

This is, in the end, what led to the creation of the PBL. It is not, strictly speaking, an abuse-related list: it’s a list of dynamic and low-security IP space. In general, it’s address space that should never host an SMTP server, therefore any SMTP connection coming from this IP space is almost certainly abusive.

Since every message has to originate somewhere, DHP against the PBL makes no sense and is highly discouraged. On the other hand, scoring based on the PBL for URLs is possible, although not particularly effective.

Two return codes are associated with this dataset, telling whether the nature of the listed subnet has been inferred by Spamhaus (127.0.0.11 / 1011) or indicated directly by the ISP responsible for the network (127.0.0.10 / 1010).

AuthBL

Some bots are known to perform authentication credential hijacking or bruteforcing. Knowing whether the peer in an authentication session is among those can be a precious datapoint when the application is trying to decide if a client session is legitimate or abusive.

AuthBL is basically that: a collection of IPs of bots known to use stolen credentials or perform authentication bruteforcing.

For the most part, AuthBL is therefore a subset of the XBL, and it aims to help in any situation where credentials are in use and can be stolen: from SMTP-AUTH to IMAP and HTTP, or other protocols that have nothing to do with email in the first place, like SSH or VoIP.

It’s associated with its own return code 127.0.0.20 / 1020.
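A sketch of how an application could use it when deciding on a client session, again with the hypothetical check_ip helper (CLIENT_IP is assumed to hold the peer’s address):

# treat 127.0.0.20 as "known credential-abusing bot"
if check_ip "$CLIENT_IP" authbl | grep -q '^127\.0\.0\.20$'; then
    echo "client $CLIENT_IP is AuthBL-listed: reject or step up verification"
fi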

DBL

IPs are far from the only element of a message that can be associated with a reputation.

DBL is a database of domains with a poor reputation, at least from the end-user perspective.

In truth, what the DBL effectively does is keep track of, and compute, a reputation “score” for every domain seen on the Internet, producing a list of those that

  • are above a certain threshold

  • have been observed active in the last X days

This list is what users would query.

Different return codes are used to tag the type of abuse the domain has been observed to be involved in, whenever that information is available.

One thing to note is that not all records have the same meaning in terms of “badness”: two separate sets of return codes are provided:

  • 127.0.1.2-99 / 2002-2099 identify resources that are considered inherently bad or associated with a low reputation. In general, it means that the domain is “safe to block” according to Spamhaus data.

  • 127.0.1.102-199 / 2102-2199 identify domains that, while not inherently “bad”, have been observed involved in abuse. Briefly referred to as “abused-legit”, the typical example is a domain that, due to a security issue, is currently serving malicious content. This second set of return codes is only suggested for use in scoring systems.

If queried for an IP, the DBL will return a positive reply with the return code 127.0.1.255: this should in every respect be treated as an error code, with the meaning “IP queries not supported”; in HTTP lookups, this error is conveyed by an HTTP code.
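A sketch of a DBL lookup (both the zone name and the dbltest.com test entry are assumptions here, by analogy with the HBL zone and the hbltest.com entries shown later): the domain is queried in plain forward form, with no reversing involved.

$ dig +short dbltest.com.<key>.dbl.dq.spamhaus.net
127.0.1.2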

ZRD

OK, but then: what if somebody starts using a domain that has just been registered, before it even acquires a reputation?

Unsurprisingly, that happens all the time and has been happening for quite a while. The actual numbers may differ, but something all research on domain abuse agrees upon is that the vast majority of newly registered domains will only be used for some bad deed for a while, then scrapped and left unused until they expire.

As a result, the vast majority of newly registered domains can’t be trusted, which in a way turns the lack of reputation into a strong reputation indicator by itself.

ZRD is a database of domains that have been observed for the first time in the last 24 hours and can therefore be treated with extreme prejudice.

The fourth octet of the return code (in the range between 127.0.2.2 / 3002 and 127.0.2.24 / 3024) indicates the time elapsed since the domain’s first observation, in hours.
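A sketch of extracting that age (the zone name is an assumption, as above; example.tld stands for the domain being checked):

# hours since the domain was first observed; empty if the domain is not listed
age=$(dig +short example.tld.<key>.zrd.dq.spamhaus.net | cut -d. -f4)
[ -n "$age" ] && echo "domain first seen $age hours ago"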

If queried for an IP, the ZRD will return a positive reply with the return code 127.0.2.255: this should in every respect be treated as an error code, with the meaning “IP queries not supported”; in HTTP lookups, this error is conveyed by an HTTP code.

HBL

When evaluating whether an email message is to be considered malicious, a great deal of the effort goes into assessing various elements of that message. These elements are “hooks” on which to attach “reputation” to the message.

All the datasets detailed above can be used to associate internet resources (i.e. IP addresses or domain names) with a reputation factor. This approach can also be extended to other parts of an email message, namely content, in order to provide additional “reputation hooks”.

Hash Blocklist (HBL) focuses on content elements (tokens), such as email addresses or cryptocurrency wallets. Software analyses the incoming message, identifies and normalizes these content elements, and for each of them produces a cryptographic hash: a known mathematical function is applied to the element and produces a string of bits of fixed size, known as the hash. The function is designed to be computationally infeasible to invert, so that it is not feasible to obtain the original content from the hash.
This method is industry-standard and has the dual advantage of reducing all contents to a fixed size and of avoiding the disclosure of the content in distributed data, which may be disallowed by specific privacy laws.

The HBL contains hashes; therefore, the hash is queried against the list to determine whether the content element has a bad reputation.
For example, in order to verify whether the file with a SHA256 hash (represented in BASE32 as defined by RFC 4648) of KADTR46EPIEQVM7C3GEODZCTXO2HUQSO34T3YYLVBCMPOAA3GSBA is known to HBL, one would query HBL for the string KADTR46EPIEQVM7C3GEODZCTXO2HUQSO34T3YYLVBCMPOAA3GSBA._file.
The appended substring ._file specifies the context the hashed content belongs to.

Each context can be considered a separate sub-list: hashes of a specific token type. The HBL can be extended by adding more contexts at any time.

The implementation generally relies on SHA256 hashes represented as BASE32 strings, so wherever “SHA256” is reported as the sublist content, its BASE32 representation is intended.
For some contexts, SHA1 hashes (represented as hexadecimal strings) are supported too, to ensure compatibility with existing software.

Contexts currently implemented are:

File (_file)

SHA256 only. It contains file hashes of two types:

  • malicious: identified by the return code 127.0.3.10 / 4010 - meaning the queried file has been analyzed by Spamhaus Malware Labs and recognized as known malware.

  • suspicious: identified by the return code 127.0.3.15 / 4015 - meaning the queried file has been observed in spam and its nature makes it particularly suspicious. Although Spamhaus Malware Labs has not yet confirmed its maliciousness, it should still be treated with extreme caution.

In order to allow users to test their implementation, a hash of the EICAR test file is always present in the dataset: E5NAEG57WZEJ4VGUOGEZ67NZ2FTD7RUV5QX6FIWEKOFKX5SR7UHQ.

A DNS query for the TXT record returns a URL pointing to a lookup form (in case the listing needs to be re-evaluated), followed by the malware family in parentheses. For suspicious files, the malware family reported will be suspicious. Example:

$ dig +short txt E5NAEG57WZEJ4VGUOGEZ67NZ2FTD7RUV5QX6FIWEKOFKX5SR7UHQ._file.<key>.hbl.dq.spamhaus.net
"https://www.spamhaus.org/query/hash/e5naeg57wzej4vguogez67nz2ftd7ruv5qx6fiwekofkx5sr7uhq._file (EICAR_test_file)"

CryptoWallet (_cw)

SHA256 and SHA1. It contains the hashes of cryptowallet addresses (Bitcoin, Bitcoin Cash, Ethereum, Monero, Ripple, Litecoin) observed in spam campaigns.
Ethereum addresses (given their representation as a hexadecimal number) must be converted to lowercase before being hashed, while all the other wallets’ canonical strings must be hashed in the form they appear in the message.
The return code is 127.0.3.20 / 4020.
In order to allow users to test their implementation, the following test entries are always present in the dataset:

Currency Wallet address SHA256 Hash
BTC 1Gx3ZjJaHkXquhPzwYSFbVz1uSfdMGJY48 R4WIMMVSTRVIWLVVF3CMYQDRHR4AINEHEFNZNXXHZ62PCAJQKTNA
BCH bitcoincash:qre5at72qr6kthtty72nu5g52swpcpu2xungmtrj74 TV7QRQPGBKF4X3K4T5QYILRI3SP5CIWVIIOH25YUOGVOJ3SBTYNA
XRP rnJ5gQRETvwwwPiH5tZEtLUYZ5HUDakUR6 VG77WSCZ54FHY7JFDA4SRPJ4UBFJMD5LR7DQNH7ALYHGQMPLBNOQ
LTC LXXSYD1Qgyq7oBFcGFeTApt3JQN7cKLfDe E75IGJABXX2JHHNXTICYRMX6FMG3FN2WIJOWZK2KFGW5H6BODKPQ
ETH 0xa6136b765BC065554702a9A77A3C6C66Ab4905cE W7YYPNGRDFJ5LZ7IKFDAU42YTHBNQVWOXVVFI4C3KZ2X3HL2XCLA
XMR 41yyXHfaFqaHhur3kUSQtXKBsDZuXDbPwCSxVXNQvd5BByRZP6UhMbaYRPoxx8piSzQETNMMfMSaPLoNaVPwFYjmM4jnWD5 YODGEZCDG6FMZHPZTTVHYBE3RTKOIUI26HWDJHMUAQDRGJZCRTIA
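
These entries can be used to verify an implementation with the same kind of openssl pipeline shown for the Email context below; for example, hashing the BTC test wallet should print the SHA256 hash listed in the first row (the tr strips the “=” padding that base32 appends, since the listed hashes are unpadded):

echo -n "1Gx3ZjJaHkXquhPzwYSFbVz1uSfdMGJY48" | openssl dgst -sha256 -binary | base32 | tr -d '='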

Email (_email)

SHA256 and SHA1. It contains hashes of email addresses observed in spam, either as Sender, Reply-To, or in the message body. Contact addresses used for 419 (“Nigerian Prince”) scams, or email contacts seen in aggressive marketing campaigns based on unsolicited messages, are typical examples of what this sublist aims to target.
Before applying the hashing function, email addresses are supposed to go through a minimal normalization process, consisting of the following steps:

  • lowercase the entire address

  • strip tags and similar from the left-hand part, removing the first ‘+’ character and anything following it

  • if the right-hand part is “googlemail.com”, replace it with “gmail.com”

  • if the right-hand part is “gmail.com”, strip all the dots from the left-hand part

Example code in Perl (assuming an already-valid email address in $Address):

use Digest::SHA qw(sha256);
use MIME::Base32 ();

# lowercase the whole address, then split it into local part and domain
my ($Left, $Right) = split ('@', lc($Address));
# strip any tag: remove the first '+' and everything after it
$Left =~ s/\+.*$//;
# treat googlemail.com as gmail.com
$Right = 'gmail.com' if ($Right eq 'googlemail.com');
# gmail ignores dots in the local part
$Left =~ s/\.//g if ($Right eq 'gmail.com');
# hash the normalized address and encode the digest in BASE32
my $Hash = MIME::Base32::encode(sha256($Left.'@'.$Right));

The return code is 127.0.3.2 / 4002.

In order to allow users to test their implementation, the email address user@hbltest.com is always present in the dataset both as SHA256 (F3PDGTMWU6LFIGDJC67YNIWRY5ZRM7ERLETNFO36QAEQPMBPW2DA, base32 encoded) and SHA1 (ebcb8a93f4d4c80a83f7fc886fd2de97f0de4814, hex format).

Assuming openssl and GNU core utilities are installed:

echo -n "[email protected]" | openssl dgst -sha256 -binary | base32

or

echo -n "[email protected]" | openssl sha1

will return the correct result (the tr in the first command strips the trailing “=” padding emitted by base32, as the listed hashes are unpadded).

URL (_url)

SHA256 and SHA1. It contains the hashes of URLs observed in spam campaigns. URLs can be either in the body or in the headers of the email (e.g. in the List-Unsubscribe header). On larger sites used by many different users, it is not desirable to list the entire domain in the DBL; instead, specific URLs can be listed using this URL hashing scheme.

Normalizing the URL before hashing depends on the domain name, and is described below.

There are currently four slightly different algorithms to reduce a URL to its normalised form. The algorithms differ only in how much of the original URL they use for the hash, and in whether the URL is lowercased before hashing.

The different algorithms are described by a YAML file that is available for download from https://docs.spamhaus.com/download/URL_normalization.yaml

This YAML file consists of a list of dictionaries (currently four, one for each algorithm variation). Each dictionary consists of these fields:

  • name: the name of the algorithm, for documentation purposes only.

  • re: a regular expression that matches the path component of the URL. Everything that is matched by the regular expression is used in the final hash. The regular expressions used are standard Unix extended regular expressions, compatible with both PCRE and RE2, and they do not use backtracking. The regular expression must start matching at the beginning of the URL path.

  • chars: the set of characters from the path component that will be included in the hash. The first character not in the set marks the end of the hashed part of the URL. Note that the re field is a better way to match the characters that are part of the hash; you should only use the chars field if you do not want to use regular expressions in your implementation. The special string .* means to take all characters.

  • lowerhash: (optional, boolean) if present, and true, lowercase the path component of the URL before calculating the hash. Note that the host part is always lowercased.

  • domains: (optional) a list of domain names for which to apply this algorithm. If this entry is missing, it means the algorithm applies to all domains that don’t match any other rule.

An example YAML entry is:

- name: catchall
  re: ^$|^[?].*|^[#].*|[^#?]+
  chars: .*
  lowerhash: true

This is the “catchall” entry. The absence of the domains list signifies that all domains not matched by another rule will use this rule for the hash calculation. The lowerhash field indicates that the path component should be lowercased, and the regular expression matches either an empty path, a path that starts with a ? or a # followed by anything, or a path that stops at the first ? or #.

To calculate the hash, take these steps:

  • remove the protocol and initial slashes (like http://).

  • take the hostname part, make it lowercase. Remove any username/password specified. Leave the port, if present.

  • find the relevant algorithm based on the lowercase hostname (without port).

  • remove %-encoding in the path.

  • if lowerhash is true, lowercase the path component.

  • match the (optionally lowercased) path component of the URL with the regular expression that is specified in the re field of the given algorithm. If the regular expression does not match, there is no hash for the given URL.

  • concatenate the lowercase hostname (optionally including port), and the part of the URL path that was matched by the regular expression.

Example code in Perl, wrapped in a function. Note that the example assumes that the “catchall” algorithm above is used; implementing the YAML parsing and applying any of the four possible algorithms based on the given URL is beyond the scope of this documentation.

use Digest::SHA qw(sha256);
use MIME::Base32 ();

# usage, eg:
# my $hash = url_hash("http://catchall.hbltest.com/testdir1/testdir2/Test");

sub url_hash {
    my ($url) = @_;

    # match parts of the URL
    my ($proto, $hostpart, $path) = $url =~ m{
        ^(\w+://)?      # optional protocol
        ([^/]*)         # host part, including optional authentication and port
        (/.*)?          # path (may be absent)
    }x;
    return if !$hostpart;
    if ( defined($proto) and $proto !~ m{^(https?|ftp)://} ) {
        # not a hashable URL scheme
        return;
    }
    $path = '' unless defined $path;   # an absent path is hashed as an empty path
    $hostpart =~ s/^.*\@//;   # remove optional authentication
    my $host = lc $hostpart;
    my ($yamlkey) = $host =~ /^([^:]+)/;
    # undo encoding of %-encoded characters
    $path =~ s/%([0-9a-f]{2})/chr hex $1/gei;
    # the $path_re and $lowerhash are hardcoded here, normally taken from the YAML file.
    # The $yamlkey can be used as the key for a lookup
    my $path_re = qr{^$|^[?].*|^[#].*|[^#?]+};
    my $lowerhash = 1;
    my ($matched_path) = $path =~ /^($path_re)/
        or return;
    $matched_path = lc $matched_path if $lowerhash;
    # concatenate the lowercased host and the matched path, hash, and BASE32-encode
    return MIME::Base32::encode( sha256($host . $matched_path) );
}

Or, using Unix command-line scripting, assuming openssl and GNU core utilities are installed:

echo -n "www.hbltest.com/test" | openssl dgst -sha256 -binary | base32

or

echo -n "www.hbltest.com/test" | openssl sha1

will return the correct result (again with the “=” padding stripped from the base32 output).

In order to allow users to test their implementation, a number of test URLs are always present in the dataset: there are four test URLs, one for each algorithm specified in the normalization file above. This is a list of the test URLs and the hashes that are always present:

URL SHA1 Hash SHA256 Hash
www.hbltest.com/test 1944927f9a3e6ca8a6367aea29eee9087d449a64 JRBRNOUNTQWKLOKRDDX5DC65YHFFJ4ZLJ5EXP7D6TWHJTDR23PSA
short.hbltest.com/test 14faaf38d7b96b78b3ac7ef802287b9f6a6cd8be WL5VHDGVHOEPT5LGMFUZHI6TLZWYSCEMDXFX73RF3CQ7YOJPEAJQ
withqm.hbltest.com/openurl?lid=test ce0962bd61a6534ce83054c233d8c1b759f20777 EORVXYR6YPU2B54QOCEK4SS6XN3YXMHDQCFGHAEN4ZPMZ5QUSCVA
catchall.hbltest.com/testdir1/testdir2/Test 68a3efb846587649de4ac89ca48c3d1e3b98e99b Z3GPTQBSXPBLM7BMWMULIJAFD5BAKYW4AX5TYSB5XHTCL5X4NBGA

The return code is 127.0.3.30 / 4030.