Distributed comment spam prevention

Freshness Warning
This blog post is over 19 years old. It's possible that the information you read below isn't current and the links no longer work.

Earlier I mentioned some ideas for preventing comment spam. Thanks to a TrackBack ping, I found out that Simon Willison had been discussing the same thing yesterday. I need to read Simon more often. This is the second time that I’ve been working on something only to find out that he’s doing something similar.

Simon’s offering a blacklist of domains that are used in his spam, and that gave me an idea. Combine a distributed blacklist with my distributed anti-spam concept. Sites could participate by sending the IP address, URL, and a digest of the comment body (an MD5 hash would work) to a central server or a cloud of servers. If the server saw that the same comment was being posted in multiple places within a short time period, it would send a ping to all participating sites. The ping would contain the IP address and URL of the spammer. The sites would then use this information to ban further comments from that site and IP. Ideally the ban would be temporary to minimize the impact of false positives, but that would be up to the site’s software.
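The server-side check described above can be sketched in a few lines. Everything here is an assumption for illustration (the threshold, the window, and the function names are invented; no such central server exists):

```python
import hashlib
import time
from collections import defaultdict

WINDOW_SECONDS = 600   # the "short time period" -- assumed value
SITE_THRESHOLD = 3     # distinct sites before the comment is treated as spam -- assumed

# digest of comment body -> list of (timestamp, reporting site)
sightings = defaultdict(list)

def record_report(ip, site, body, now=None):
    """Record one report; return the spammer's IP if the same body has now
    appeared on SITE_THRESHOLD distinct sites within WINDOW_SECONDS."""
    now = time.time() if now is None else now
    digest = hashlib.md5(body.encode("utf-8")).hexdigest()
    # Drop stale sightings, then record this one.
    recent = [(t, s) for t, s in sightings[digest] if now - t < WINDOW_SECONDS]
    recent.append((now, site))
    sightings[digest] = recent
    if len({s for _, s in recent}) >= SITE_THRESHOLD:
        return ip   # the caller would then ping all participating sites with this IP and URL
    return None
```

A real deployment would also need the "ping all participating sites" step and the temporary-ban expiry the post describes.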

Essentially, this would create an organic system that responds to wholesale comment spamming in real time. This wouldn’t solve the problem of someone posting an individual comment on a single site, but that’s not really the way spammers work. For spam to be effective, it needs enormous volume. And the only way to have that sort of posting volume is to automate it.

Simon Willison
September 4, 2003 7:19 AM

The problem with using an MD5 hash of the comment body is that spammers can get around it by adding a couple of random characters to each comment, causing almost identical comments to generate completely different hashes. They do this in emails already (the weird random characters you get in the subject lines of some spams). There's also a trust issue with a single server or cloud of servers - what if one of them starts maliciously flagging legitimate comments as spam? A real-time distributed system is a very interesting idea but there are quite a few kinks to iron out.
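Simon's point about hashes is easy to demonstrate: appending a single character to the body produces a completely unrelated MD5 digest (a sketch; the sample strings are invented):

```python
import hashlib

a = hashlib.md5(b"Buy cheap pills now").hexdigest()
b = hashlib.md5(b"Buy cheap pills now x").hexdigest()  # one appended character

# The two digests share no useful structure, so an exact-hash
# blacklist is trivially evaded by adding random characters.
print(a)
print(b)
```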

Adam Kalsey
September 4, 2003 9:18 AM

I agree that a message hash isn't perfect. The addition of a single character to a message would throw the hash off completely. The basic problem is that we need to identify comments from multiple sites that are identical. Not identical to a computer, but identical to a person. People recognize that the following two lines mean the same thing, but computers don't:

One, Two, Three, Four
0ne, t wo, three, F`o`u`r

I'll look into the fuzzy checksumming used by some of the anti-spam systems, but I don't think that using such a system directly is the way to go. What we're looking for here isn't content that someone has flagged as spam (as Razor and Cloudmark do), but content that is repeated across multiple sites.

The problem of innocent people being blacklisted due to false positives or maliciousness is mitigated by the fact that the blacklist is temporary. Sites implementing the blacklist would be expected to expire all bans within a few hours. Malicious blacklisting could be further reduced by making sure that the system is maintained by a trusted group, much like the RBL or ORBS lists are for email.

September 4, 2003 10:19 AM

It would be better to prevent robots from entering comments at all, so a human validator is my suggestion, like PayPal, vB, and eBay use: "Enter the digits from the box on the right," where the box is an obscured graphic. Random text can be added to a URL or comment to make it unique and defeat a blacklist-type system. It's also easy to get hold of a daily list of 200-400 open HTTP proxies, or to spam from ISP dialups with DHCP. Just thoughts. Anything to stop the spammers would slow them down... (cynic) and move them off to wikis and trackback... *sigh*

Adam Kalsey
September 4, 2003 11:20 AM

Those random text images are an accessibility nightmare. If you make it hard for machines to interact with your site, you are making it hard for screen readers as well.

David Beckemeyer
September 8, 2003 10:23 PM

Here's an idea I've implemented on my blog: http://www.toyz.org/mrblog/archives/00000059.html It is a simple CAPTCHA Turing Test for posters. It doesn't stop all spam, but it prevents spam robots from posting.

Nick Altmann
September 14, 2003 6:02 PM

Would this problem be moot if comments were kept in the poster's blog instead of on the commented page? Then a feature like Google's "backward links" could find related comments. The display could end up being the same (with the user agent pulling the comments into a single page), but it would shift the filtering burden (or privilege) to the user instead of the publisher.

Adam Kalsey
September 15, 2003 9:00 AM

That assumes that everyone who would like to comment has a blog. It also assumes that they want their blog to become a list of comments on other blogs.

Wolfgang Flamme
September 19, 2003 12:47 PM

Adam, why go for text content? We should:
a) aim at posted URLs
b) monitor the poster's IP activity

(a) will prevent backlink spam activity targeting search engines. (b) will prevent any excessive or automatic comment activity from a spammer (someone leaving 50 comments per day probably doesn't have that much to say).

Wolfgang
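Wolfgang's point (b) amounts to per-IP rate limiting, which can be sketched with a rolling window of timestamps. The threshold and names here are assumptions for illustration, not part of any real system:

```python
import time
from collections import defaultdict, deque

MAX_COMMENTS = 10   # comments allowed per IP per day -- assumed threshold
DAY_SECONDS = 86400

history = defaultdict(deque)   # ip -> timestamps of recent comments

def allow(ip, now=None):
    """Reject an IP that has already commented MAX_COMMENTS times in the last day."""
    now = time.time() if now is None else now
    timestamps = history[ip]
    # Discard comments older than the window.
    while timestamps and now - timestamps[0] > DAY_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_COMMENTS:
        return False
    timestamps.append(now)
    return True
```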

Mean Dean
October 6, 2003 2:45 AM

You've been blogged in a post of mine about how I was able to discern a pattern used by a particular comment spammer who afflicted my site 2x today. Perhaps we could combine technologies to thwart this putz? See the hyperlink associated with my name.

Mike Steinbaugh
October 10, 2003 10:00 AM

Adam, I think I have a solution. Include a checkbox in the comment form that says something like, "Are you human? (prevents comment spam)". Once the user checks the box, the comment will go through. This can be used as a short-term fix until Movable Type allows users to change the names of the form elements, which I think is the easiest fix. I totally agree that the random-digits approach is an accessibility nightmare and should be avoided. I think the delay-time idea is good, but very hard to implement since it would involve JavaScript for the time being, until Ben and Mena can make it server-side in MT. Just some thoughts... I'd love to get this fixed, though. My blog is starting to get lots of comment spam.

Adam Kalsey
October 10, 2003 10:12 AM

Bots would simply set the checkbox and submit it. There are all sorts of things you could do with JavaScript, if you're willing to require that the user have JavaScript before submitting a comment. For instance, you could have a checkbox that alters the value of a hidden field through JavaScript, then ignore or moderate any postings that don't have the correct value in the hidden field.
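On the server side, the hidden-field trick Adam describes reduces to a one-line check. The field name and token value here are invented for illustration; the page's JavaScript would be assumed to write the token into the hidden field when the visitor ticks the checkbox:

```python
# Value that client-side JavaScript writes into the hidden field when the
# visitor checks the box. A bot that never runs the JS submits the default.
EXPECTED_TOKEN = "human-ok"  # assumed value, set by hypothetical page JS

def should_moderate(form_fields):
    """Flag a submission whose hidden field was never rewritten by JavaScript."""
    return form_fields.get("js_token") != EXPECTED_TOKEN
```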

Trackback from random ruminations
October 11, 2003 9:21 AM

Comment Spam

Excerpt: I've been struck with comment spam three times in the last week. I don't know if this means that, suddenly, my blog has hit the radar screens of whatever search engine spammers use, or if I'm just lucky. Regardless, the first time it was mild, the seco...

November 7, 2003 9:01 AM

I've noticed a trick to get rid of comment noise when filtering. Spam with random characters must still allow the message to be read (otherwise the spam would have no impact), so they usually insert non-alphanumeric characters in the comment subject. Here's a small formula that I'd like to try out in your anti-spam blocker:

1) Perform an anti-l337 filter. A simple translation table will do the job. (The result must always be lowercase.)
2) Strip spaces and non-alphabetic characters.
3) Change two-character sequences to their phonetic equivalent (i.e. ph -> f). Simple translation tables also work here.
4) There you go. The message has been filtered and is ready for digest.

Example: Phr'33 v149r4 ' ph; 0r .U
Step 1 - Anti-1337 filter: > phr'ee viagra ' ph; or .u
Step 2 - Strip non-alphabetic chars: > phreeviagraphoru
Step 3 - Two-char phonetic replacement: > freeviagraforu

We could have a massive test and then perhaps, with some "scientific" research, build a database, who knows. Still, the content has been filtered and is ready for a keyword search. The keywords that can be found by any simple search routine are "free", "viagra", "for". The trick here is that spammers are cheapskates. They won't write artificial intelligence programs to fool spam filters. They will instead use simple translation tables. Therefore, simple translation tables can also be used to decrypt their subject fields.

About input forms, I find this one easy:
1) Use session cookies.
2) For each comment submission form, include a delay of 3 seconds before processing the submission.
3) Include a hidden random field. That will ensure that the same form will only be processed once.

This ensures that the spammer has to wait at least 3 seconds between submissions, which narrows the spammer's "damage zone": instead of the 20 form submissions that could be performed in three seconds, you only get one. Replies welcome :)
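The three filtering steps above can be sketched directly. The translation tables here only cover the characters in the example and are an assumption, not a complete anti-l337 or phonetic map:

```python
import re

# Step 1: anti-l337 translation table (partial, illustrative)
LEET = str.maketrans("013459", "oieasg")

# Step 3: two-character phonetic replacements (partial, illustrative)
PHONETIC = {"ph": "f"}

def normalize(text):
    text = text.lower().translate(LEET)   # step 1: de-l337 and lowercase
    text = re.sub(r"[^a-z]", "", text)    # step 2: strip non-alphabetic chars
    for seq, repl in PHONETIC.items():    # step 3: phonetic equivalents
        text = text.replace(seq, repl)
    return text

print(normalize("Phr'33 v149r4 ' ph; 0r .U"))  # -> freeviagraforu
```

The normalized string can then be fed to the MD5 digest from the original proposal, so trivially mutated copies of the same spam hash identically.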

Trackback from floating atoll
November 14, 2003 8:00 PM

A thousand monkeys filtering advertising

Excerpt: A common thread between the most effective forms of online advertising is the introduction of a hyperlink to a targeted user. In this respect, there is no difference between Google text ads, Orbitz pop-ups, and DoubleClick banner ads: for the advertise...

October 16, 2005 2:21 PM

I don't think an MD5 of the body would be useful. Even a tiny variation in the message would generate a different hash.


This discussion has been closed.
