
Distributed comment spam prevention

Freshness Warning
This blog post is over 20 years old. It's possible that the information you read below isn't current and the links no longer work.

Earlier I mentioned some ideas for preventing comment spam. Thanks to a TrackBack ping, I found out that Simon Willison had been discussing the same thing yesterday. I need to read Simon more often. This is the second time that I’ve been working on something only to find out that he’s doing something similar.

Simon’s offering a blacklist of domains that are used in his spam, and that gave me an idea: combine a distributed blacklist with my distributed anti-spam concept. Sites could participate by sending the IP address, URL, and a digest of the comment body (an MD5 hash would work) to a central server or a cloud of servers. If the server saw that the same comment was being posted in multiple places within a short time period, it would send a ping to all participating sites. The ping would contain the IP address and URL of the spammer. The sites would then use this information to ban further comments from that site and IP. Ideally the ban would be temporary to minimize the impact of false positives, but that would be up to the site’s software.
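As a rough illustration of the reporting flow described above, here is a minimal in-memory sketch in Python. The class name `SpamCoordinator`, the 10-minute window, and the three-site threshold are all assumptions made for the example; the post does not specify any of them, and a real service would also need networking, authentication, and persistence.

```python
import hashlib
from collections import defaultdict

WINDOW_SECONDS = 600   # assumed length of the "short time period"
SITE_THRESHOLD = 3     # assumed number of distinct sites before a ping goes out


def comment_digest(body):
    """MD5 hex digest of the comment body, as suggested in the post."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


class SpamCoordinator:
    """Hypothetical central server that collects reports from sites."""

    def __init__(self):
        # digest -> list of (timestamp, site, ip, url) reports
        self.reports = defaultdict(list)

    def report(self, site, ip, url, digest, now):
        """Record one report and return the (ip, url) pairs to ping out.

        An empty list means the digest has not yet been seen on enough
        distinct sites inside the time window.
        """
        recent = [r for r in self.reports[digest]
                  if now - r[0] <= WINDOW_SECONDS]
        recent.append((now, site, ip, url))
        self.reports[digest] = recent
        sites = {r[1] for r in recent}
        if len(sites) >= SITE_THRESHOLD:
            return sorted({(r[2], r[3]) for r in recent})
        return []
```

Each participating site would call `report` with its own name, the poster's IP, the URL left in the comment, and `comment_digest(body)`. A non-empty return value stands in for the ping that would be sent to all participating sites.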

Essentially, this would create an organic system that responds to wholesale comment spamming in real time. This wouldn’t solve the problem of someone posting an individual comment on a single site, but that’s not really the way spammers work. For spam to be effective, it needs enormous volume. And the only way to have that sort of posting volume is to automate it.

Simon Willison
September 4, 2003 7:19 AM

The problem with using an MD5 hash of the comment body is that spammers can get around it by adding a couple of random characters to each comment, causing almost identical comments to generate completely different hashes. They do this in emails already (the weird random characters you get in the subject lines of some spams). There's also a trust issue with a single server or crowd of servers - what if one of them starts maliciously flagging legitimate comments as spam? A real time distributed system is a very interesting idea but there are quite a few kinks to iron out.
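The weakness described here is easy to see in a few lines of Python. This is only a demonstration; the comment text and the appended characters are made up for the example.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 hex digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


original = "Great post! Visit http://spam.example for deals"
padded = original + " xq"   # two random characters appended by the spammer

digest_a = md5_hex(original)
digest_b = md5_hex(padded)

# Count how many of the 32 hex digits differ between the two digests.
differing = sum(1 for a, b in zip(digest_a, digest_b) if a != b)

print(digest_a)
print(digest_b)
print(f"{differing} of 32 hex digits differ")
```

The two comments read the same to a person, but the digests share no useful similarity, so an exact-match lookup on the hash would treat them as unrelated.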

Adam Kalsey
September 4, 2003 9:18 AM

I agree that a message hash isn't perfect. The addition of a single character to a message would throw the hash off completely. The basic problem is that we need to identify comments from multiple sites that are identical. Not identical to a computer, but identical to a person. People recognize that the following two lines mean the same thing, but computers don't:

One, Two, Three, Four
0ne, t wo, three, F`o`u`r

I'll look into the fuzzy checksumming of some of the anti-spam systems, but I don't think that using such a system directly is the way to go. What we're looking for here isn't content that someone has flagged as spam (as Razor and Cloudmark do), but content that is repeated across multiple sites.

The problem of innocent people being blacklisted due to false positives or maliciousness is mitigated by the fact that the blacklist is temporary. Sites implementing the blacklist would be expected to expire all bannings within a few hours. Malicious blacklisting could be reduced even further by making sure that the system is maintained by a trusted group, much as the RBL or ORBS are for email.
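One way a participating site might implement the temporary blacklist is sketched below in Python. The class name `TemporaryBanList` and the three-hour lifetime are assumptions for illustration; the comment only says bans should expire "within a few hours".

```python
import time

BAN_SECONDS = 3 * 60 * 60  # assumed value for "a few hours"


class TemporaryBanList:
    """Hypothetical per-site ban list whose entries expire on their own."""

    def __init__(self, ttl=BAN_SECONDS, clock=time.time):
        self.ttl = ttl
        self.clock = clock      # injectable so the class can be tested
        self.expires = {}       # key (IP or URL) -> expiry timestamp

    def ban(self, key):
        """Ban an IP address or URL for ttl seconds from now."""
        self.expires[key] = self.clock() + self.ttl

    def is_banned(self, key):
        """Return True while the ban is active; drop it once it lapses."""
        expiry = self.expires.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self.expires[key]
            return False
        return True
```

Because every entry carries its own expiry time, a false positive or a malicious report costs the affected poster a few hours at most, which is the mitigation argued for above.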

September 4, 2003 10:19 AM

It would be better to prevent robots entering comments at all, so a human validator is my suggestion, like PayPal, vB, and eBay use. "Enter the digits from the box on the right", where the box is an obscured graphic. Random text can be added to a URL or comment to make it unique to avoid a blacklist type system. It's easy to get hold of a daily list of 200-400 open HTTP proxies, or use ISP dialups with DHCP to spam with. ... Just thoughts. Anything to stop the spammers would slow them down ...(cynic) and move them off to wikis and trackback... *sigh*

Adam Kalsey
September 4, 2003 11:20 AM

Those random text images are an accessibility nightmare. If you make it hard for machines to interact with your site, you are making it hard for screen readers as well.

David Beckemeyer
September 8, 2003 10:23 PM

Here's an idea I've implemented on my blog: It is a simple CAPTCHA Turing Test for posters. It doesn't stop all spam, but it prevents spam robots from posting.

Nick Altmann
September 14, 2003 6:02 PM

Would this problem be moot if comments were kept in the poster's blog instead of on the commented page? Then a feature like Google's "backward links" could find related comments. The display could end up being the same (with the user agent pulling the comments into a single page), but it would shift the filtering burden (or privilege) to the user instead of the publisher.

Adam Kalsey
September 15, 2003 9:00 AM

That assumes that everyone who would like to comment has a blog. It also assumes that they want their blog to become a list of comments on other blogs.

Wolfgang Flamme
September 19, 2003 12:47 PM

Adam, why go for text content? We should
a) aim at posted URLs
b) monitor poster's IP activity

(a) will prevent backlink spam activity targeting search engines
(b) will prevent any excessive or automatic comment activity from a spammer (someone leaving 50 comments per day probably doesn't have that much to say)

Wolfgang
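Point (b) amounts to a per-IP rate limit. A minimal Python sketch follows, assuming a sliding 24-hour window and treating the 50-comments-per-day figure from the comment as a hard cap; the class name `IpActivityMonitor` is hypothetical.

```python
from collections import defaultdict, deque

DAY_SECONDS = 24 * 60 * 60
MAX_PER_DAY = 50   # figure from the comment, used here as a cap (assumption)


class IpActivityMonitor:
    """Hypothetical sliding-window counter of accepted comments per IP."""

    def __init__(self, limit=MAX_PER_DAY, window=DAY_SECONDS):
        self.limit = limit
        self.window = window
        self.seen = defaultdict(deque)  # ip -> timestamps of accepted comments

    def allow(self, ip, now):
        """Return True and record the comment if the IP is under its limit."""
        stamps = self.seen[ip]
        # Forget comments that have aged out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True
```

Rejected attempts are not recorded, so a poster who hits the cap regains access as the oldest accepted comments age out. The same structure could be keyed on posted URLs instead of IP addresses to cover point (a).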

Mean Dean
October 6, 2003 2:45 AM

You've been blogged in a post of mine about how I was able to discern a pattern used by a particular comment spammer who afflicted my site 2x today. Perhaps we could combine technologies to thwart this putz? See the hyperlink associated with my name.

Mike Steinbaugh
October 10, 2003 10:00 AM

Adam, I think I have a solution. Include a checkbox in the comment form that says something like, "Are you human? (protects against comment spam)". Then once the user checks the box, the comment will go through. This can be used as a short-term fix until Movable Type allows users to change the names of the form elements, which I think is the easiest fix. I totally agree that the random digits approach is an accessibility nightmare and should be avoided. I think the delay time idea is good, but very hard to implement since it would involve JavaScript for the time being until Ben and Mena can make it server side in MT. Just some thoughts...I'd love to get this fixed though. My blog is starting to get lots of comment spam.

Adam Kalsey
October 10, 2003 10:12 AM

Bots would simply set the checkbox and submit it. There are all sorts of things you could do with JavaScript, if you want to require that the user has JS before submitting a comment. For instance, you could have a checkbox that alters the value of a hidden field through JavaScript. Ignore or moderate any postings that don't have the correct value in the hidden field.
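The server-side half of that check could look something like the Python sketch below. The field name `js_check` and the expected value are hypothetical, and the page's JavaScript is assumed to copy that value into the hidden field when the checkbox is ticked.

```python
# Hypothetical field name and value. The page's JavaScript is assumed to
# write EXPECTED_VALUE into the hidden field when the checkbox is ticked.
HIDDEN_FIELD = "js_check"
EXPECTED_VALUE = "checked-by-script"


def classify_comment(form):
    """Decide what to do with a submitted comment form (a dict of fields).

    Returns "accept" when the hidden field holds the expected value and
    "moderate" otherwise, so a person can still review the posting.
    """
    if form.get(HIDDEN_FIELD) == EXPECTED_VALUE:
        return "accept"
    return "moderate"
```

Sending failures to moderation rather than discarding them keeps comments from visitors without JavaScript from being lost. A fixed value like this only stops bots that do not read the page's script.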

Trackback from random ruminations
October 11, 2003 9:21 AM

Comment Spam

Excerpt: I've been struck with comment spam three times in the last week. I don't know if this means that, suddenly, my blog has hit the radar screens of whatever search engine spammers use, or if I'm just lucky. Regardless, the first time it was mild, the seco...

November 7, 2003 9:01 AM

I've noticed a trick for getting rid of comment noise when filtering. Spam padded with random characters still has to be readable (otherwise the spam would have no impact), so spammers usually insert non-alphanumeric characters in the comment subject. Here's a small formula that I'd like to try out in your anti-spam blocker.

1) Perform an anti-l337 filter. A simple translation table will do the job. (The result must always be lowercase.)
2) Strip spaces and non-alphabetic characters.
3) Change 2-character sequences to their phonetic equivalent (e.g. ph -> f). Simple translation tables also work.
4) There you go. The message has been filtered and is ready for the digest.

Example: Phr'33 v149r4 ' ph; 0r .U
Step 1 - Anti-1337 filter: > phr'ee viagra ' ph; or .u
Step 2 - Strip non-alphabetic chars: > phreeviagraphoru
Step 3 - 2-char phonetic replacement: > freeviagraforu

We could have a massive test and then perhaps, with some "scientific" research, build a database, who knows. Still, the content has been filtered and is ready for a keyword search. The keywords that can be found by any simple search routine are "free", "viagra", "for".

The trick here is that spammers are cheapskates. They won't write artificial intelligence programs to fool spam filters. They will use simple translation tables instead. Therefore, simple translation tables can also be used to decrypt their subject fields.

About input forms, I find this one easy.

1) Use session cookies.
2) For each mail submit form, include a delay of 3 seconds before processing the submission.
3) Include the hidden random field. That will ensure that the same mail form will only be processed once.

This will ensure that the spammer will at least have to wait 3 seconds between mail submissions. This will narrow the spammer's "damage zone", i.e. from 20 form submissions that could be performed in three seconds, you only get one.

Replies welcome :)
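The three translation steps described in this comment can be sketched in Python as follows. The table contents are assumptions chosen to cover the commenter's example. A real filter would need larger tables, and the function name `normalize` is made up for the illustration.

```python
import re

# Step 1 table: leet digits mapped back to letters (assumed, minimal).
LEET_TABLE = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "9": "g"})

# Step 3 table: two-character sequences and their phonetic equivalents.
PHONETIC_PAIRS = {"ph": "f"}


def normalize(text):
    """Reduce a noisy spam string to a canonical lowercase form."""
    # Step 1: lowercase, then undo leet substitutions.
    step1 = text.lower().translate(LEET_TABLE)
    # Step 2: strip spaces and every other non-alphabetic character.
    step2 = re.sub(r"[^a-z]", "", step1)
    # Step 3: replace two-character sequences with phonetic equivalents.
    step3 = step2
    for pair, replacement in PHONETIC_PAIRS.items():
        step3 = step3.replace(pair, replacement)
    return step3


print(normalize("Phr'33 v149r4 ' ph; 0r .U"))  # prints freeviagraforu
```

The normalized string could then be searched for keywords or hashed, so that differently padded variants of the same message produce the same digest.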

Trackback from floating atoll
November 14, 2003 8:00 PM

A thousand monkeys filtering advertising

Excerpt: A common thread between the most effective forms of online advertising is the introduction of a hyperlink to a targeted user. In this respect, there is no difference between Google text ads, Orbitz pop-ups, and DoubleClick banner ads: for the advertise...

October 16, 2005 2:21 PM

I don't think an MD5 of the body would be useful. Even a tiny variation in the message would generate a different hash.

These are the last 15 comments. Read all 17 comments here.

This discussion has been closed.


© 1999-2024 Adam Kalsey.