When analyzing data from Google Analytics, you may come across strange, suspicious traffic: a flood of visits with a 100% bounce rate, coming from referrers you have never heard of.
Most likely this is referrer spam, a technique used by some black-hat SEO sites to try to trick search engines. A more detailed description can be found in the following Wikipedia article.
The typical situation that you might find on your Analytics console is the following:
This is very annoying, because it buries the interesting data under a thousand fake visits.
Several solutions to the problem have been suggested, from WordPress plugins to filters applied directly in Google Analytics. Our approach is a little more technical and works at the Apache level.
The idea is to use mod_rewrite to identify requests whose referrer is a known spammer and reject them with a 403 Forbidden.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*thespammer2\.org.*$ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*thespammer1\.net.*$ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*thespammer\.com.*$ [NC]
RewriteRule ^(.*)$ - [F,L]
</IfModule>
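To see which referrers these conditions catch, the same patterns can be tried outside Apache. The following is a minimal sketch, assuming that Python's `re` module behaves like Apache's regex engine for patterns this simple; the helper name `is_blocked` and the sample referrers are made up for illustration.

```python
import re

# The three patterns from the RewriteCond lines; re.IGNORECASE mirrors the [NC] flag.
PATTERNS = [
    r"^http(s)?://(www\.)?.*thespammer2\.org.*$",
    r"^http(s)?://(www\.)?.*thespammer1\.net.*$",
    r"^http(s)?://(www\.)?.*thespammer\.com.*$",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]


def is_blocked(referrer):
    """Return True if any pattern matches, like conditions chained with [OR]."""
    return any(p.search(referrer) for p in COMPILED)


# Hypothetical referrers, for illustration only.
samples = [
    "http://thespammer.com/",
    "https://www.TheSpammer1.net/page",
    "http://legitsite.com/",
    "http://legitsite.com/?q=thespammer.com",
]
for ref in samples:
    print(ref, is_blocked(ref))
```

Because of the leading and trailing `.*`, the last sample is reported as blocked too: the spammer's domain only has to appear somewhere in the referrer, not necessarily as its host name.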
For a constantly updated list of spammers, you can use piwik/referrer-spam-blacklist, a public repository maintained on GitHub. Below you'll find a small template and a Python script that generates a configuration file for Apache.
Template to create .conf for Apache
This is a minimal template whose output is meant to be included from your Apache configuration. As an alternative, you can prepare templates for your own virtual hosts.
<IfModule mod_rewrite.c>
RewriteEngine On
$spammerList
RewriteRule ^(.*)$$ - [F,L]
</IfModule>
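The template relies on two features of Python's `string.Template`: `$spammerList` is a placeholder, and `$$` is the escape for a literal `$`. The short sketch below renders the template with a single hand-written condition standing in for the generated list; the variable names are illustrative, and the RewriteRule uses a plain hyphen as its substitution.

```python
from string import Template

# Same structure as the template above; "$$" is string.Template's escape for "$".
TEMPLATE_TEXT = """<IfModule mod_rewrite.c>
RewriteEngine On
$spammerList
RewriteRule ^(.*)$$ - [F,L]
</IfModule>
"""

# A single hand-written condition stands in for the generated list.
conditions = r"RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*thespammer\.com.*$ [NC]"

rendered = Template(TEMPLATE_TEXT).substitute(spammerList=conditions)
print(rendered)
```

Only the template text is parsed for `$` markers. The substituted value is inserted verbatim, so the `$` inside the condition needs no escaping.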
Script
The script fills the template with the data downloaded from the repository. It is written in Python and can easily be scheduled.
#!/usr/bin/env python3
from string import Template
from urllib.request import urlopen

SPAMMER_SOURCE = "https://raw.githubusercontent.com/piwik/referrer-spam-blacklist/master/spammers.txt"

# Load the Apache template that contains the $spammerList placeholder
with open('template', 'r') as f:
    template = Template(f.read())

# Download the list; skip blank lines, since an empty domain would match every referrer
raw = urlopen(SPAMMER_SOURCE).read().decode('utf-8')
spammers = [line.strip() for line in raw.splitlines() if line.strip()]

# "$$" is a literal "$" for string.Template; every condition except the last is chained with OR
RULE = r"RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*$domain.*$$ [NC,OR]"
RULE_TEMPLATE = Template(RULE)
LASTRULE = r"RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?.*$domain.*$$ [NC]"
LASTRULE_TEMPLATE = Template(LASTRULE)

formattedLines = []
for i, line in enumerate(spammers):
    # Escape dots so they match literally in the regular expression
    line = line.replace('.', r'\.')
    if i == len(spammers) - 1:
        formattedLines.append(LASTRULE_TEMPLATE.substitute(domain=line))
    else:
        formattedLines.append(RULE_TEMPLATE.substitute(domain=line))

output = template.substitute(spammerList="\n".join(formattedLines))
print(output)
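When the script runs on a schedule, a failed or interrupted run should not leave Apache with a half-written or empty file. One possible way to handle this, sketched below, is to write the generated text to a temporary file and rename it into place. The function name `write_config` is hypothetical, the target path is up to you, and reloading Apache afterwards is left out.

```python
import os
import tempfile


def write_config(text, path):
    """Write text to path via a temporary file in the same directory."""
    if not text.strip():
        # Refuse to replace a working configuration with an empty one.
        raise ValueError("refusing to write an empty configuration")
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # mkstemp creates the file readable by its owner only; widen the mode.
        os.chmod(tmp_path, 0o644)
        # os.replace swaps the file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this helper, the script could call `write_config(output, path)` instead of printing, so that the existing file is replaced only after the new one has been written completely.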
To verify that the configuration works, whichever protection method you have chosen, you can use wget.
Example of blocked call
Simulate an HTTP request whose referrer is a site on the spammer list; the response must be 403 Forbidden.
wget \
  --server-response \
  --spider \
  --referer='http://thespammer.com/' \
  https://www.opengate.biz
...
HTTP request sent, awaiting response...
HTTP/1.1 403 Forbidden
...
Example of successful call
Simulate an HTTP request whose referrer is a legitimate site (one that is not on the spammer list); the response must be 200 OK.
wget \
  --server-response \
  --spider \
  --referer='http://legitsite.com/' \
  https://www.opengate.biz
...
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
...
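The same two checks can be scripted, for instance to run them after each regeneration of the list. The following is a rough Python equivalent of the wget calls, assuming that a HEAD request with a `Referer` header is close enough to what `wget --spider` sends; the function names `build_request` and `status_for` are illustrative.

```python
import urllib.error
import urllib.request


def build_request(url, referrer):
    """Build a HEAD request carrying the given Referer header."""
    return urllib.request.Request(url, headers={"Referer": referrer}, method="HEAD")


def status_for(url, referrer, timeout=10):
    """Return the HTTP status code the server answers with."""
    try:
        with urllib.request.urlopen(build_request(url, referrer), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is what we want.
        return err.code
```

If the rules are active, `status_for("https://www.opengate.biz", "http://thespammer.com/")` should return 403, and the same call with `"http://legitsite.com/"` as the referrer should return 200.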