The Problem with Anti-Spam Plugins

08 January 2014
sysadmin

Installing anti-spam plugins is only the first step toward mitigating garbage traffic. If we really want to tackle the issue of spam, we need to approach it from the server's perspective and thwart garbage requests before they're served. Is this possible? The short answer is: it depends.

WP Engine recently published an article entitled Staying One Step Ahead, in which they commented on an Incapsula report showing that more than half of a site's traffic generally comes from non-human visitors. In a previous article, I discussed why this occurs and how this type of traffic is in fact counted as "visitor" traffic by managed WordPress hosting providers. WP Engine customers' sites are protected behind a great firewall and monitoring system that helps prevent large-scale attacks on WP sites like the one from 2013. Where they, and frankly the rest of the internet, fall short is in mitigating garbage traffic at the server level in all cases, and not just in the case of targeted attacks. For this reason, active server-level measures should be taken to keep server resources from being tied up by garbage requests, long before the server is instructed to serve up a site.

The problem with anti-spam plugins is that garbage traffic is still being served up by the host. Akismet et al. do a great job separating the good comments from the bad ones, but what is really happening when spam comments are being filtered? Garbage traffic goes to your site, someone posts their garbage, and then they leave. Akismet, sniffing the comments and traffic for certain keywords and behavior, notices this, flags it, and tells you it's spam. This type of internal control is what I like to call "after-the-fact" in that it only works after the server has committed resources to serving up the site to the requester.

Current anti-spam plugins like Akismet are limited in this way; server resources are being committed to serve sites to garbage requests that only exist to spam your comments and scrape your site. In an ideal environment, traffic coming into the server would be filtered at the server or traffic balancer level in order to ensure server resources are not being queued by garbage requests.

Even plugins like this one that market themselves as "first lines of defense" against spam traffic still force the server to serve up the request to the requester. Customers who pay for managed hosting don't have access to server-side resources like the hosting engineers do (and sometimes they don't even realize that), so it's up to the host to make sure that traffic coming to their customers isn't just a bunch of gobbledygook.

One possible solution is to create a list of known bad bots based on the results of scrubbing logs against known bad IP lists. This technique, which is a type of detective control, is how I do it with my managed hosting service, and it takes the amount of traffic that my clients see from 4-5x what their Google Analytics reports down to about 1.5-2x. Note that any type of analytics software is going to report different numbers than others; that's just the way they work.
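To make the log-scrubbing idea concrete, here is a minimal sketch in Python. It is not the tooling I actually run. It assumes the access log puts the client IP first on each line (as the common and combined log formats do) and that the blocklist is a plain text list of IPs or CIDR ranges. The function names `load_blocklist` and `scrub_log` are made up for illustration.

```python
import ipaddress
from collections import Counter


def load_blocklist(lines):
    """Parse blocklist entries (single IPs or CIDR ranges) into networks."""
    networks = []
    for line in lines:
        entry = line.strip()
        # Skip blank lines and comment lines.
        if not entry or entry.startswith("#"):
            continue
        # A bare IP becomes a /32 (or /128) network.
        networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def scrub_log(log_lines, networks):
    """Count requests per client IP that fall inside a blocklisted network."""
    hits = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=1)
        if not parts:
            continue
        try:
            # Assumption: the first field of each log line is the client IP.
            ip = ipaddress.ip_address(parts[0])
        except ValueError:
            continue  # not a parseable IP, so ignore the line
        if any(ip in net for net in networks):
            hits[str(ip)] += 1
    return hits
```

Running `scrub_log` over yesterday's log gives a per-IP count of requests from listed addresses, which is the raw material for the bad-bot list. The linear scan over `networks` is fine for an offline job like this, but it is too slow to run on every live request.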

A limitation of this technique would be that you're preventing traffic after-the-fact. If sites are under attack or being bombarded because of a huge social media campaign, you're not doing what you sought to do in the first place, which was prevent traffic (especially spikes in traffic) from garbage being treated like good traffic.

Taking this into account, another possibility is to start compiling a list of known bad IPs and just immediately begin scrubbing traffic at the balancer level against it. This has the negative effect of slowing down your sites if 1) the list is extremely long, and/or 2) you run these checks in one big chunk on your web server. Traffic balancers, distributed web servers, and other high-availability methodologies will help to ensure that dynamically scrubbing traffic against a list of bad IPs won't have a negative effect on your server's performance.
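The cost of a long list depends a lot on how the lookup is done. As a rough sketch (again in Python, with a hypothetical `Blocklist` class rather than any particular balancer's API), grouping entries by prefix length and storing them in sets makes each check cost one set lookup per distinct prefix length instead of one comparison per list entry:

```python
import ipaddress


class Blocklist:
    """Blocklist of IPs and CIDR ranges with set-based lookups."""

    def __init__(self, entries):
        # Map (IP version, prefix length) -> set of network addresses as ints.
        self._by_prefix = {}
        for entry in entries:
            net = ipaddress.ip_network(entry.strip(), strict=False)
            key = (net.version, net.prefixlen)
            self._by_prefix.setdefault(key, set()).add(int(net.network_address))

    def is_blocked(self, client_ip):
        """Return True if client_ip falls inside any listed network."""
        ip = ipaddress.ip_address(client_ip)
        bits = ip.max_prefixlen  # 32 for IPv4, 128 for IPv6
        value = int(ip)
        for (version, prefixlen), addresses in self._by_prefix.items():
            if version != ip.version:
                continue
            # Keep only the top `prefixlen` bits, then do one set lookup.
            mask = ((1 << prefixlen) - 1) << (bits - prefixlen)
            if (value & mask) in addresses:
                return True
        return False


blocklist = Blocklist(["203.0.113.0/24", "198.51.100.7"])
print(blocklist.is_blocked("203.0.113.200"))
print(blocklist.is_blocked("192.0.2.1"))
```

With this layout, a list of a million single addresses still costs a single set lookup per request. Where the check actually runs (balancer, firewall, or web server) is a separate decision, and a malformed `client_ip` raises `ValueError`, which the caller would need to handle.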

There are caveats, of course, when one discusses how and why the server should block certain traffic.

* What is the threshold at which a server deems a comment or IP address "spam"? Is this an absolute, or is this variable?
* How can we accurately measure traffic in a way that combines "good" traffic and "bad" traffic such that internal controls are efficiently using resources and external stakeholders are being accurately updated?
* In the grand scheme of things, should a managed hosting provider really care about garbage traffic?

Going forward, it is imperative that we recognize just how grey the line is between a host holding its clients' hands and the clients taking accountability for the traffic that goes to their servers. In my mind, web hosts are paid to ensure their servers are run efficiently, and to me, cleaning out the traffic garbage is just part of the ballgame. Whether you agree with me or not, one thing is for certain: anti-spam plugins are only 50% of the answer, and if we truly want to combat bad traffic, we need to prevent our servers from serving content to bad IPs in the first place.