
WP Engine Hotfix: Preventing Spam and Bad Bot Traffic, Part II

In [Part I][1] of this WP Engine Hotfix, I discussed some of the theory behind WP Engine's visitor calculations and how end-users of WP Engine could benefit from taking charge of their traffic themselves. In this next part, I'll discuss ways to log your visitor traffic, scrub that traffic for blacklisted and abusive IPs (as well as employ a nifty contact form honeypot), and completely block access to your site by these harmful bots, scrapers, harvesters, and spammers that jack up your visitor count.

It's important to note that this tutorial is not WP Engine specific. You can employ these methods on any hosting environment in which you have access to Apache. If you're on Nginx, I'll cover how to block unwanted traffic in a different tutorial.

Update: I got a reply from Donovan once he saw this monstrosity of an article.

His solution is great if you have command-line SSH access to a Linux box, but since WP Engine users do not, the solution has to work within the confines of WP Engine's strict SFTP-only policy.

Step 1: Create a script that logs every single visitor to your site. #

Go to the root of your site and create a file called log_visitors.php. Copy and paste the following into it:

<?php

// Log Visitors
// This is a very basic way of logging users to a text file.

$remote_addr = $_SERVER['REMOTE_ADDR'];
$request_time = date('Y-m-d g:i a', $_SERVER['REQUEST_TIME']);
$request_uri = $_SERVER['REQUEST_URI'];
$user_agent = $_SERVER['HTTP_USER_AGENT'];

$visitor = "$request_time: $remote_addr -> \"$request_uri\" as $user_agent\r\n";

// Record EVERY visitor in the visitors.txt file
file_put_contents("visitors.txt", $visitor, FILE_APPEND);

?>

The file is very straightforward: every time this script is accessed, the requesting entity's IP address, requesting URL, user agent, and time of request will be appended on a single line to a text file.
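For reference, a single entry in visitors.txt will end up looking something like this (the IP and user agent below are made up for illustration):

2013-12-20 4:15 pm: 12.345.67.890 -> "/robots.txt" as Mozilla/5.0 (compatible; SomeBot/1.0)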

Now let's set up a call to this file every time someone visits our site.

Step 2: Modify index.php to include visitor logger #

Open up your index.php file in the root of your WordPress installation and make it look like the following:

<?php

// Include our visitor logging script
include("log_visitors.php");

/**
 * Front to the WordPress application. This file doesn't do anything, but loads
 * wp-blog-header.php which does and tells WordPress to load the theme.
 *
 * @package WordPress
 */

/**
 * Tells WordPress to load the WordPress theme and output it.
 *
 * @var bool
 */
define('WP_USE_THEMES', true);

/** Loads the WordPress Environment and Template */
require( dirname( __FILE__ ) . '/wp-blog-header.php' );

This causes our log_visitors.php script to run every time our WP install is accessed, which means every request to our site appends a new line of request information to the text file.

Step 3: Wait for 24-48 hours while your script logs visitors. #

You will want at least 24 hours' worth of data to account for all the scrapers and spam bots that hit your site on a rotating daily schedule. I would recommend you keep this bad boy running for 48 hours, and follow this tutorial once per week for an entire month to ensure that you find as many of these garbage traffic hoarders as possible.

For this tutorial (and to combat the immediate threat of spam bots that were targeting my site), I only waited 5 hours.

Step 4: Copy and paste our visitor logs into Excel. #

Since I like to visualize my data, I use Excel in order to sort through and delete duplicate entries in the visitor logs. Here's what my logs looked like when I opened them with a text editor after 5 hours of logging:

[Screenshot: 5 hours of visitor logging]

So what do we see here? Automated activity for sure. You can see some "nice" bots accessing the robots.txt file, and then you can see some not so nice bots continuously trying to access pages that I know don't exist anymore. Or you could see just a bunch of garbled text because you have no idea what this all means. That's okay. All the important stuff is in a nice line down the file toward the left (the IP addresses).

We need to pull out these IPs and get them into a list, so let's copy the contents of the file and paste them into Excel using Paste Special. We want to make sure spaces are treated as the delimiter so that each segment of every line goes into its own column. Here's what it looked like when I pasted into Excel:

[Screenshot: the log results pasted into Excel]

Starting to look a little more manageable? I hope so.

Now do the following to clean up this list:

  1. Select all the cells and go to Data->Sort, and sort by IP address (Column D for me).
  2. Go to your menu and find the Remove Duplicates button. We don't need any duplicate IP entries here for what we're about to do next. [Screenshot: remove duplicate IP entries]
  3. Get up and stretch your arms and legs. It's good to stretch every once in a while.

We're left with an Excel spreadsheet with a list of IP addresses all in one column and no duplicate entries.

Copy the entire column of IP addresses and paste them into a text editor, preferably one with regular expression search+replace capabilities (you'll see why later). So at the end of this step, you should have a text file with a list of IP addresses, one per line. Hold on to it, because we're coming back to it in Step 6.
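If you'd rather skip the Excel detour entirely, a rough PHP sketch like the following can pull the unique IPs straight out of visitors.txt. This isn't part of the original workflow, and it assumes the exact log format from Step 1:

<?php

// Pull the unique IP addresses out of visitors.txt (format from Step 1)
$ips = array();

foreach (file("visitors.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
	// Each line looks like: 2013-12-20 4:15 pm: 12.345.67.890 -> "/..." as UserAgent
	if (preg_match('/: (\d{1,3}(?:\.\d{1,3}){3}) ->/', $line, $match)) {
		$ips[$match[1]] = true; // array keys give us de-duplication for free
	}
}

// One IP per line, ready for Step 6
echo implode("\r\n", array_keys($ips)) . "\r\n";

?>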

Step 5: Scrub the IPs for blacklisted, abusive, etc addresses. #

With our list of IPs on the side, we'll now create a custom function to check the reputation (read: public blacklist status) of each of them and then report to us whether the IP address is good to go or should be banned to the depths of internet oblivion.

The way I did this was through a shell script, but it works just fine in your browser, too.

Let's jump right into some code. The following shall be called spamscrub.php:

<?php

// Let's count how many bad IPs we found
$counter = 0;

// We'll hold all the bad guys in an array
$denyfrom = array();

// The Almighty SpamScrub Function
// This is a very rudimentary (read: prototype but working) function
// that will check with CleanTalk.org to determine whether the IP
// address you passed to the function is publicly blacklisted on any of
// over a dozen public bad IP databases.
// The irony of this function is that we are scraping the results of
// the URL query in order to block bots that are scraping our site.

function spamscrub($addr) {

	// We want to use our counter and denyfrom array in here.
	global $counter;
	global $denyfrom;

	echo "Checking $addr... ";

	// We'll read in the contents of the cleantalk.org results page for
	// each IP check. We're basically loading up the page and then
	// checking whether it tells us that the IP was found in a blacklist
	// or not. Automated browsing, you could call it.
	// First we'll fopen the remote file.
	$handle = fopen("https://cleantalk.org/blacklists?record=$addr", "rb");

	// If the page couldn't be opened (e.g. allow_url_fopen is disabled),
	// don't keep looping on a dead handle.
	if ($handle === false) {
		echo "[ FAIL ]\r\n";
		return;
	}

	$contents = '';

	// While the file still has content...
	while (!feof($handle)) {
		// Read in 1,000 byte chunks at a time
		$contents .= fread($handle, 1000);

		// First check if it's clean
		if (strstr($contents, "not found in blacklists")) {
			echo "[  OK  ]\r\n";
			break; // Forget the rest of the file
		}

		// Then check if it's dirty
		if (strstr($contents, "has spam activity on")) {
			echo "[ SPAM ]\r\n";

			// Since this IP has been found on a blacklist,
			// let's add it to our denyfrom array.
			$denyfrom[$counter] = $addr;
			$counter++;
			break; // Forget the rest of the file
		}

		// If neither condition is met, the loop just grabs more $contents,
		// but the page always contains one of those phrases, so this is
		// just a safeguard.
	}

	fclose($handle);
}

// Placeholder for our calls to spamscrub()
// We're going to put a bunch of calls to this function during the next
// step. For now, let's just leave this here as a placeholder.

// Report back on the status of this script.
echo "Spam Scrub Complete.\r\nTotal spam/bad IPs: $counter\r\n";

// Print out a copy+paste-able list of "deny from" entries for our .htaccess
echo "Add the following to your .htaccess:\r\n\r\n";
echo "# BEGIN IP blocking from ".date("m/d/Y")." spam scrub\r\n";

foreach($denyfrom as $addr){
	echo "deny from $addr\r\n";
}

echo "# END ".date("m/d/Y")." spam scrub\r\n";

?>

Of course, we could manually check each and every IP address by going to cleantalk.org and typing them in one by one, or we could just automatically "browse" the results and do it this way.

You're going to want to remember where we put that placeholder in there, because we'll be coming back to it in a moment. Now, let's get those IP addresses ready to plug into our spamscrub.php file.

Step 6: Call our spam scrubber on each IP address. #

Go to the text file with our list of IPs, and be prepared for some copy + paste goodness. We're going to plug each and every one of those IPs into a call to our spamscrub() function and then place all those function calls where we had that placeholder text.

You might be wondering why we don't just pass a huge array to our spamscrub() function and then tailor the spam scrubbing goodness to loop through each array element. While that would make sense for normal variable and data manipulation, since we're fread()ing content from a remote URL it is better for everyone's sanity to break the work up into separate function calls, so our PHP process isn't left hanging while one giant loop completes.

The methodology is simple: for every IP address in our text file, we want to add spamscrub(" to the beginning and then "); to the end. So you can copy and paste the first part (the function call) to the beginning of each IP listing and then copy and paste the ending (the closing parenthesis and semicolon) to the end of all of the IPs, or you can use a regular expression to do all this for you.

To use a regular expression, you'll need something like Sublime Text or some other editor that allows you to search and replace using regular expressions. For Sublime (which is what I use), I would hit Ctrl+H (search + replace) and then type in the following to the search and replace fields:

[Screenshot: spamscrub search and replace in Sublime Text]
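In case the screenshot doesn't render, the two fields look roughly like this (a sketch; tweak the pattern to your editor's regex flavor):

Find:    ^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})$
Replace: spamscrub("$1");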

Click on "replace all" and every instance of your IP address xx.xxx.xx.xxx will turn into spamscrub("xx.xxx.xx.xxx"); .

We're almost done!

Step 7: Place these new calls to our function in spamscrub.php #

Copy and paste all these calls to our spamscrub() function into the spamscrub.php file where we had that placeholder comment. For example, the entire contents of my spamscrub.php file for this tutorial and at this step looks like this:

<?php

$counter = 0;

$denyfrom = array();

spamscrub("101.18.43.67");
spamscrub("103.24.73.26");
spamscrub("107.211.0.61");
spamscrub("65.28.184.81");
spamscrub("65.52.0.135");
spamscrub("65.55.24.237");
spamscrub("66.249.76.31");
spamscrub("66.249.84.100");
spamscrub("68.32.228.137");
spamscrub("69.171.247.114");
// 100 calls have been removed so that we're not beating a dead horse with this tutorial

function spamscrub($addr) {

	global $counter;
	global $denyfrom;

	echo "Checking $addr... ";

	// fread the remote file in chunks to speed up reading.
	// Besides, we don't need the whole page
	$handle = fopen("https://cleantalk.org/blacklists?record=$addr", "rb");

	// Don't keep looping on a dead handle if the page couldn't be opened
	if ($handle === false) {
		echo "[ FAIL ]\r\n";
		return;
	}

	$contents = '';
	while (!feof($handle)) {
		// Read in 1,000 byte chunks at a time
  		$contents .= fread($handle, 1000);

  		// First check if it's clean
  		if(strstr($contents, "not found in blacklists")){
			echo "[  OK  ]\r\n";
			break;
		}

  		if(strstr($contents, "has spam activity on")){
			echo "[ SPAM ]\r\n";
			$denyfrom[$counter] = $addr;
			$counter++;
			break;
		}

		// If neither condition is met, the loop just grabs more $contents
	}

	fclose($handle);

}

echo "Spam Scrub Complete.\r\nTotal spam/bad IPs: $counter\r\n";
echo "Add the following to your .htaccess:\r\n\r\n";
echo "# BEGIN IP blocking from ".date("m/d/Y")." spam scrub\r\n";

foreach($denyfrom as $addr){
	echo "deny from $addr\r\n";
}

echo "# END ".date("m/d/Y")." spam scrub\r\n";

?>

Save this file and run it. I chose to run it from the command line, but you're more than welcome to run it in a browser (though you might want to swap the \r\n line endings for HTML <br> tags instead).
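From the command line, it's just a matter of something like this (assuming the PHP CLI is installed on your machine and you're in the directory containing the script):

php spamscrub.php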

When you run it in your browser or from the command line, you'll get some output like this:

Checking 12.345.67.890... [ OK ]
Checking 12.345.67.890... [ OK ]
Checking 12.345.67.890... [ OK ]
Checking 12.345.67.890... [ OK ]
Checking 12.345.67.890... [ OK ]
Checking 12.345.67.890... [ SPAM ]
Checking 12.345.67.890... [ OK ]
Checking 12.345.67.890... [ OK ]
Checking 12.345.67.890... [ SPAM ]
Checking 12.345.67.890... [ SPAM ]
Spam Scrub Complete.
Total spam/bad IPs: 3
Add the following to your .htaccess:

# BEGIN IP blocking from 12/20/2013 spam scrub
deny from 12.345.67.890
deny from 09.876.54.321
deny from 54.321.67.890
# END 12/20/2013 spam scrub

So what we have here is a simple, copy-and-paste-able addition to our .htaccess file that blocks unwanted IPs belonging to harvesters, spammers, bad bots, and more, so we don't have to worry about them hogging up our resources.
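As a point of reference, on a typical Apache 2.2 setup the generated block usually sits alongside a standard allow/deny pair like this (a sketch using the sample output above; Apache 2.4 uses Require directives instead):

# Block everything on the spam scrub list, allow everyone else (Apache 2.2 syntax)
order allow,deny
allow from all

# BEGIN IP blocking from 12/20/2013 spam scrub
deny from 12.345.67.890
deny from 09.876.54.321
deny from 54.321.67.890
# END 12/20/2013 spam scrub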

If you are using WP Engine, keep reading. #

On December 18th, 2013, I engaged WP Engine's Christian Justo about implementing this WP Engine Hotfix and found out something interesting about WP Engine and the way they log visitors:

Blocking IP addresses via .htaccess will not prevent bad traffic from being counted against your visitors per month.

Essentially, WP Engine customers are powerless to take their traffic into their own hands and filter out all this garbage traffic. I was very clear in Part I of this Hotfix that WP Engine's users need to stop complaining about the traffic that reaches their site and start doing something about it (WP Engine is not your traffic filter, they're your hosting provider).

But how is anyone supposed to take matters into their own hands if the only method of IP denial (.htaccess) does nothing for their visitor metrics? (We have to assume with a high degree of confidence that 99.99% of people implementing IP denial are actively trying to stop WP Engine from counting garbage traffic toward their monthly visitor allotment.)

Christian at WP Engine offered the following:

So we can actually block traffic in a particular way in which if the request hits the server it will not count.

The only problem is that it can not be implemented through .htaccess. If we create a custom nginx rule to return a 444 to particular bots, the traffic will not count.

This is a good temporary solution for one or two accounts, but it will never work as a distributed solution. I know you can do it at the server block level in Nginx, but imagine all the support tickets if that becomes the norm once people realize that they can filter out bad bot and spam traffic.
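To make the 444 idea concrete, the rule being described is along these lines (a rough sketch with placeholder bot names, and something only WP Engine can add at the server block level, not something customers can drop in themselves):

# Inside the nginx server block: close the connection without a response
# (nginx's special 444 code) for requests whose user agent matches a bot list.
if ($http_user_agent ~* "(BadBotOne|BadBotTwo)") {
    return 444;
}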

I've been actively pursuing this with Senior Engineer Donovan Hernandez, and he's been very open to a lot of my ideas. I'm looking forward to turning this into a permanent solution for WP Engine and their customers (and hopefully getting a good word in with whoever is responsible for hiring over there). Hint hint!

I think WP Engine is ready to start reaching out and coming up with new ways to combat a negative social image that, from my end, seems to really degrade their word-of-mouth advertising.