The algorithm of the spamcalc script

Last updated: 2003-06-19

This is the algorithm of spamcalc 0.7.0.

Terminology:
- a field is the text between two dots, between the start of the hostname and the first dot, or between the last dot and the end of the hostname.
- the domain part is usually the last 2 fields of the host. Some countries have an enforced sub-tld structure (.co.uk, .net.au) and in that case the domain part is the last 3 fields of the host.
- the host part is the host minus the domain part.

First, the hostname is split into two parts: the domain part and the host part. Normally, the domain part is just the last two fields. However, some TLDs such as .uk require an extra field between the TLD and the domain name, and the script takes that into account by means of a data file that holds a list of those sub-tld domains.
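
The split can be sketched in Python as follows. This is only an illustration under stated assumptions, not the script's actual code: the name split_hostname and the SUB_TLDS set are hypothetical, and SUB_TLDS stands in for the data file of sub-tld domains, whose real contents are not shown here.

```python
# Hypothetical stand-in for the sub-tld data file; only the two
# examples named in the terminology above are included.
SUB_TLDS = {"co.uk", "net.au"}

def split_hostname(hostname, sub_tlds=SUB_TLDS):
    """Return (host_fields, domain_fields) for a hostname."""
    fields = hostname.lower().rstrip(".").split(".")
    # The domain part is the last 3 fields when the last 2 fields
    # form a known sub-tld, and the last 2 fields otherwise.
    size = 3 if ".".join(fields[-2:]) in sub_tlds else 2
    # A very short hostname cannot give more fields than it has.
    size = min(size, len(fields))
    return fields[:-size], fields[-size:]
```

For example, split_hostname("mail.example.co.uk") puts "mail" in the host part and "example", "co", "uk" in the domain part.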

Whitelist - if the hostname ends in one of the domains on the whitelist, the score of the host is set to zero and the rest of the calculation is skipped.

Blacklist - if the hostname ends in one of the domains on the blacklist, the base score is raised from zero to the score of that blacklist domain.
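
A minimal Python sketch of the whitelist and blacklist steps. The names ends_in_domain and base_score are hypothetical, and two details are assumptions that the description above does not settle: matching is done on whole fields (so "notbad.example" does not end in "bad.example"), and when several blacklist domains match, the highest score wins.

```python
def ends_in_domain(hostname, domain):
    """True if hostname is the domain itself or ends in '.domain'."""
    return hostname == domain or hostname.endswith("." + domain)

def base_score(hostname, whitelist, blacklist):
    """Return (score, skip_rest).

    whitelist is a set of domains; blacklist maps domain -> score.
    """
    hostname = hostname.lower()
    # Whitelist: score is zero and the rest of the calculation is skipped.
    if any(ends_in_domain(hostname, d) for d in whitelist):
        return 0, True
    # Blacklist: the base score is raised from zero to the domain's score.
    # Assumption: with several matching entries, use the highest score.
    scores = [s for d, s in blacklist.items() if ends_in_domain(hostname, d)]
    return (max(scores) if scores else 0), False
```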

Word - for each field in the host part, a lookup is done in the word list. If the field is in the word list, the field's word score is set to the score of that word.

Regexp - each field is matched against all regular expressions on the regexp list. If it matches any of those regexps, the regexp score of the field is set to the highest score among the regexps that matched.
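
The word and regexp lookups for a single field could look like this in Python. The list contents and scores below are invented for illustration; the real word and regexp lists are not shown here, and the function names are hypothetical.

```python
import re

# Hypothetical list contents and scores, for illustration only.
WORDS = {"free": 10, "cheap": 8}
REGEXPS = [(re.compile(r"^x+$"), 5), (re.compile(r"^xx"), 7)]

def word_score(field, words=WORDS):
    """Score of the field in the word list, or 0 if it is not listed."""
    return words.get(field, 0)

def regexp_score(field, regexps=REGEXPS):
    """Highest score among all regexps that match the field, or 0."""
    return max((score for rx, score in regexps if rx.search(field)), default=0)
```

With these sample lists, the field "xxx" matches both regexps and gets the higher score, 7.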

Total word and regexp - the word and regexp scores of all fields are added together. If a field has both a word and a regexp score, only the word score is used. If at least 2 fields have a score higher than zero, and at least one of those fields is at least 3 characters long, the total score is multiplied by the number of fields with a score higher than zero.
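
The combination rule above can be sketched in Python as follows. The function name and the tuple layout of the input are hypothetical; the rule itself follows the paragraph above.

```python
def total_word_regexp(fields):
    """fields: list of (name, word_score, regexp_score) tuples."""
    scored = []
    for name, word, regexp in fields:
        # A word score takes precedence over a regexp score.
        score = word if word > 0 else regexp
        if score > 0:
            scored.append((name, score))
    total = sum(score for _, score in scored)
    # Multiply when at least 2 fields scored and at least one of
    # those fields is 3 or more characters long.
    if len(scored) >= 2 and any(len(name) >= 3 for name, _ in scored):
        total *= len(scored)
    return total
```

For example, the fields ("free", 10, 5) and ("xx", 0, 5) contribute 10 and 5; two fields scored and "free" is longer than 2 characters, so the total is 15 times 2, which is 30.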

Number of fields - the number of fields in the host part is counted. Fields in the host part that equal one of the hierarchical fields (like home, ipv6, router, dial, ...) are not counted for the number of fields score. Then, a score for the number of fields is assigned, starting at 5 for 3 fields, and going up exponentially.
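
A Python sketch of the number of fields score. The starting value of 5 for 3 fields comes from the text; the growth factor of 2 per extra field is purely an assumption, since the text only says the score goes up exponentially. The HIERARCHICAL set holds just the four examples named above, not the real list, and the function name is hypothetical.

```python
# Only the examples named in the text; the real list is longer.
HIERARCHICAL = {"home", "ipv6", "router", "dial"}

def field_count_score(host_fields, base=5, growth=2):
    """Score for the number of non-hierarchical fields in the host part."""
    counted = [f for f in host_fields if f not in HIERARCHICAL]
    n = len(counted)
    if n < 3:
        return 0
    # Assumption: the score doubles for every field beyond the third.
    return base * growth ** (n - 3)
```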

Field length - any field with an exceptionally high number of characters in it will get a score.

Dashes - if a field contains at least one dash (-), the field is split on the dashes and each resulting sub-field is checked for word and regexp scores.
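
A small Python sketch of the dash handling. The function name is hypothetical, and score_field stands for whatever per-field scoring function is in use. How the sub-field scores are combined into the final result is not stated above, so the sketch only returns the individual scores.

```python
def score_sub_fields(field, score_field):
    """Return [(sub_field, score), ...] for a dashed field, else []."""
    if "-" not in field:
        return []
    # Empty parts (from leading, trailing or doubled dashes) are skipped.
    return [(part, score_field(part)) for part in field.split("-") if part]
```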

Looks like word - if the host part has at least 3 fields and all of them look like normal (human) words (of any language), a score is assigned. Whether a field looks like a word is decided by the ratio between consonants and vowels.
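
One possible way to express the consonant and vowel check in Python. The thresholds (a vowel share between 0.2 and 0.6), the minimum length of 3 letters, and the choice to treat "y" as a vowel are all assumptions made for this sketch; the text above does not give the actual ratio used, and the function names are hypothetical.

```python
VOWELS = set("aeiouy")  # assumption: "y" counts as a vowel

def looks_like_word(field, low=0.2, high=0.6):
    """Guess whether a field looks like a human word."""
    letters = [c for c in field.lower() if c.isalpha()]
    # Fields with digits or dashes, or very short fields, do not qualify.
    if len(letters) < 3 or len(letters) != len(field):
        return False
    ratio = sum(c in VOWELS for c in letters) / len(letters)
    return low <= ratio <= high

def host_looks_like_words(host_fields):
    """True if there are at least 3 fields and all look like words."""
    return len(host_fields) >= 3 and all(looks_like_word(f) for f in host_fields)
```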

Repeat - if a field is repeated in the host, a score for repetition will be set. The more times the field is repeated, the higher this score will be.
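
A Python sketch of the repeat check. The text only says the score grows with the number of repetitions, so the linear rule and the per_repeat value of 10 are assumptions, and the function name is hypothetical.

```python
from collections import Counter

def repeat_score(fields, per_repeat=10):
    """Score that grows with every repeated occurrence of a field."""
    counts = Counter(fields)
    # Count each occurrence beyond the first one.
    extra = sum(n - 1 for n in counts.values() if n > 1)
    # Assumption: a fixed amount is added per extra occurrence.
    return extra * per_repeat
```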

One letter - if there are 2 or more fields consisting of only a single character, this will get a score.

Instead of giving an example, I refer you to the online dnsspam calculator. Enter some hostnames and see how the calculation is done.