Distributed comment spam prevention

Freshness Warning
This article is over 15 years old. It's possible that the information you read below isn't current.

Earlier I mentioned some ideas for preventing comment spam. Thanks to a TrackBack ping, I found out that Simon Willison had been discussing the same thing yesterday. I need to read Simon more often. This is the second time that I’ve been working on something only to find out that he’s doing something similar.

Simon’s offering a blacklist of domains that are used in his spam, and that gave me an idea: combine a distributed blacklist with my distributed anti-spam concept. Sites could participate by sending the IP address, URL, and a digest of the comment body (an MD5 hash would work) to a central server or a cloud of servers. If the server saw that the same comment was being posted in multiple places within a short time period, it would send a ping to all participating sites. The ping would contain the IP address and URL of the spammer. The sites would then use this information to ban further comments from that site and IP. Ideally the ban would be temporary to minimize the impact of false positives, but that would be up to the site’s software.
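
To make the idea concrete, here is a minimal sketch of what the central server's bookkeeping might look like. Everything named here is an assumption for illustration: the `SpamTracker` class, the 10-minute window, and the threshold of three distinct sites are not part of any existing system, and a real service would also need authentication and persistent storage.

```python
import hashlib
import time
from collections import defaultdict


def comment_digest(body: str) -> str:
    """Return the MD5 hex digest of a comment body."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


class SpamTracker:
    """Hypothetical central server: flags a digest reported by several sites in a short window."""

    def __init__(self, window_seconds: float = 600.0, site_threshold: int = 3):
        self.window_seconds = window_seconds  # assumed length of the "short time period"
        self.site_threshold = site_threshold  # assumed number of distinct sites needed
        self._reports = defaultdict(list)     # digest -> [(timestamp, site, ip, url)]

    def report(self, site, ip, url, digest, now=None):
        """Record one report. Return the (ip, url) pairs to ping out, or [] if not flagged."""
        now = time.time() if now is None else now
        # Keep only the reports that are still inside the time window.
        recent = [r for r in self._reports[digest] if now - r[0] <= self.window_seconds]
        recent.append((now, site, ip, url))
        self._reports[digest] = recent
        distinct_sites = {r[1] for r in recent}
        if len(distinct_sites) < self.site_threshold:
            return []
        return sorted({(r[2], r[3]) for r in recent})
```

A participating site would call `report()` for each new comment and, when the returned list is not empty, the server would ping that list out to every participant.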

Essentially, this would create an organic system that responds to wholesale comment spamming in real time. This wouldn’t solve the problem of someone posting an individual comment on a single site, but that’s not really the way spammers work. For spam to be effective, it needs enormous volume. And the only way to have that sort of posting volume is to automate it.

Simon Willison
September 4, 2003 7:19 AM

The problem with using an MD5 hash of the comment body is that spammers can get around it by adding a couple of random characters to each comment, causing almost identical comments to generate completely different hashes. They do this in emails already (the weird random characters you get in the subject lines of some spams). There's also a trust issue with a single server or crowd of servers - what if one of them starts maliciously flagging legitimate comments as spam? A real time distributed system is a very interesting idea but there are quite a few kinks to iron out.
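
The objection about random characters is easy to check. The short sketch below uses Python's standard `hashlib`; the sample comment text and the appended characters are made up for illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hex digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


original = "Visit my site for great deals"
padded = original + " xq"  # a couple of random characters appended by the spammer

print(md5_hex(original))
print(md5_hex(padded))
print(md5_hex(original) == md5_hex(padded))  # False: the two digests do not match
```

Because the digests have nothing in common, a server that compares exact hashes would treat the two near-identical comments as unrelated.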

Adam Kalsey
September 4, 2003 9:18 AM

I agree that a message hash isn't perfect. The addition of a single character to a message would throw the hash off completely. The basic problem is that we need to identify comments from multiple sites that are identical. Not identical to a computer, but identical to a person. People recognize that the following two lines mean the same thing, but computers don't:

One, Two, Three, Four
0ne, t wo, three, F`o`u`r

I'll look into the fuzzy checksumming of some of the anti-spam systems, but I don't think that using such a system directly is the way to go. What we're looking for here isn't content that someone has flagged as spam (as Razor and Cloudmark do), but content that is repeated across multiple sites.

The problem of innocent people being blacklisted due to false positives or maliciousness is mitigated by the fact that the blacklist is temporary. Sites implementing the blacklist would be expected to expire all bannings within a few hours. Malicious blacklisting could be reduced even further by making sure that the system is maintained by a trusted group, sort of how the RBL or ORBS are for email.
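
A temporary blacklist of this kind could be kept on each participating site with very little code. The sketch below is only an illustration: the `TemporaryBanList` name and the three-hour default are assumptions standing in for "within a few hours".

```python
import time


class TemporaryBanList:
    """Hypothetical per-site ban list whose entries expire on their own."""

    def __init__(self, ttl_seconds: float = 3 * 3600):
        self.ttl_seconds = ttl_seconds  # assumed ban lifetime
        self._expires = {}              # banned key (IP or URL) -> expiry timestamp

    def ban(self, key, now=None):
        """Ban an IP address or URL until ttl_seconds from now."""
        now = time.time() if now is None else now
        self._expires[key] = now + self.ttl_seconds

    def is_banned(self, key, now=None):
        """Return True while the ban is active; forget the entry once it has expired."""
        now = time.time() if now is None else now
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if now >= expiry:
            del self._expires[key]
            return False
        return True
```

Since every entry lapses without manual cleanup, a false positive costs an innocent commenter a few hours at most.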

Chris
September 4, 2003 10:19 AM

It would be better to prevent robots entering comments at all, so a human validator is my suggestion, like PayPal, vB, and eBay use. "Enter the digits from the box on the right", where the box is an obscured graphic. ...as random text can be added to a URL or comment to make it unique to avoid a blacklist type system. It's easy to get hold of a daily list of 200-400 open HTTP proxies, or use ISP dialups with DHCP to spam with. ... Just thoughts. Anything to stop the spammers would slow them down ...(cynic) and move them off to wikis and trackback... *sigh*

Adam Kalsey
September 4, 2003 11:20 AM

Those random text images are an accessibility nightmare. If you make it hard for machines to interact with your site, you are making it hard for screen readers as well.

David Beckemeyer
September 8, 2003 10:23 PM

Here's an idea I've implemented on my blog: http://www.toyz.org/mrblog/archives/00000059.html It is a simple CAPTCHA Turing Test for posters. It doesn't stop all spam, but it prevents spam robots from posting.

Nick Altmann
September 14, 2003 6:02 PM

Would this problem be moot if comments were kept in the poster's blog instead of on the commented page? Then a feature like Google's "backward links" could find related comments. The display could end up being the same (with the user agent pulling the comments into a single page), but it would shift the filtering burden (or privilege) to the user instead of the publisher.

Adam Kalsey
September 15, 2003 9:00 AM

That assumes that everyone who would like to comment has a blog. It also assumes that they want their blog to become a list of comments on other blogs.

Wolfgang Flamme
September 19, 2003 12:47 PM

Adam, why go for text content? We should

a) aim at posted URLs
b) monitor poster's IP activity

(a) will prevent backlink spam activity targeting search engines.
(b) will prevent any excessive or automatic comment activity from a spammer (someone leaving 50 comments per day probably doesn't have that much to say).

Wolfgang
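
Both suggestions can be sketched in a few lines. This is a rough illustration, not a tested filter: the URL pattern is deliberately simple, the `IpActivityMonitor` name is made up, and the default limit of 50 comments per day is taken from the figure in the comment above.

```python
import re
from collections import defaultdict, deque

# Deliberately simple pattern; real-world URL matching needs more care.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_urls(body: str) -> list:
    """Return the URLs posted in a comment body, in order of appearance."""
    return URL_PATTERN.findall(body)


class IpActivityMonitor:
    """Hypothetical sliding-window counter of comments per IP address."""

    def __init__(self, max_comments: int = 50, window_seconds: float = 86400.0):
        self.max_comments = max_comments
        self.window_seconds = window_seconds
        self._times = defaultdict(deque)  # ip -> timestamps of accepted comments

    def allow(self, ip: str, now: float) -> bool:
        """Return True and record the comment if this IP is under its limit."""
        times = self._times[ip]
        # Discard timestamps that have fallen out of the window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) >= self.max_comments:
            return False
        times.append(now)
        return True
```

The extracted URLs could be what sites report to a shared blacklist, while the monitor handles excessive posting locally.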

Mean Dean
October 6, 2003 2:45 AM

You've been blogged in a post of mine about how I was able to discern a pattern used by a particular comment spammer who afflicted my site 2x today. Perhaps we could combine technologies to thwart this putz? See the hyperlink associated with my name.

Mike Steinbaugh
October 10, 2003 10:00 AM

Adam, I think I have a solution. Include a checkbox in the comment form that says something like, "Are you human? (prevents against comment spam)". Then once the user checks the box, the comment will go through. This can be used as a short-term fix until Movable Type allows users to change the names of the form elements, which I think is the easiest fix. I totally agree that the random digits approach is an accessibility nightmare and should be avoided. I think the delay time idea is good, but very hard to implement since it would involve JavaScript for the time being until Ben and Mena can make it server side in MT. Just some thoughts...I'd love to get this fixed though. My blog is starting to get lots of comment spam.

Adam Kalsey
October 10, 2003 10:12 AM

Bots would simply set the checkbox and submit it. There are all sorts of things you could do with JavaScript, if you want to require that the user has JS before submitting a comment. For instance, you could have a checkbox that alters the value of a hidden field through JavaScript. Ignore or moderate any postings that don't have the correct value in the hidden field.
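
The server-side half of the hidden-field check might look something like the sketch below. It assumes the page embeds the expected value in its script so that ticking the checkbox can copy it into the hidden field. The field names `form_id` and `human_check`, the secret key, and the use of an HMAC to derive the value are all assumptions for illustration, not anything Movable Type provides.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-private-key"  # hypothetical per-site secret


def expected_field_value(form_id: str) -> str:
    """Value the page's JavaScript is assumed to write into the hidden field."""
    return hmac.new(SECRET_KEY, form_id.encode("utf-8"), hashlib.sha256).hexdigest()


def classify_submission(form: dict) -> str:
    """Return 'accept' if the hidden field holds the expected value, else 'moderate'."""
    form_id = form.get("form_id", "")
    submitted = form.get("human_check", "")
    expected = expected_field_value(form_id)
    # Compare as bytes so unexpected input cannot raise an error.
    if form_id and hmac.compare_digest(submitted.encode("utf-8"), expected.encode("utf-8")):
        return "accept"
    return "moderate"
```

A bot that posts the form without running the script leaves the hidden field at its default value and lands in moderation. A bot that does execute JavaScript would still get through.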

Trackback from random ruminations
October 11, 2003 9:21 AM

Comment Spam

Excerpt: I've been struck with comment spam three times in the last week. I don't know if this means that, suddenly, my blog has hit the radar screens of whatever search engine spammers use, or if I'm just lucky. Regardless, the first time it was mild, the seco...

Rick
November 7, 2003 9:01 AM

I've noticed a trick to get rid of comment noise when filtering. Spam with random characters still has to be readable (otherwise the spam would have no impact), so spammers usually insert non-alphanumeric characters in the comment subject. Here's a small formula that I'd like to try out in your anti-spam blocker.

1) Perform an anti-l337 filter. A simple translation table will do the job. (The result must always be lowercase.)
2) Strip spaces and non-alphabetic characters.
3) Change 2-character sequences for their phonetic equivalent (i.e. ph -> f). Simple translation tables also work.
4) There you go. The message has been filtered and is ready for digest.

Example: Phr'33 v149r4 ' ph; 0r .U

Step 1 - Anti-1337 filter:
> phr'ee viagra ' ph; or .u
Step 2 - Strip non alphanum chars:
> phreeviagraphoru
Step 3 - 2-char phonetic replacement:
> freeviagraforu

We could have a massive test and then perhaps, with some "scientific" research, build a database, who knows. Still, the content has been filtered and is ready for a keyword search. The keywords that can be found by any simple search routine are "free", "viagra", "for".

The trick here is that spammers are cheapskates. They won't write artificial intelligence programs to fool spam filters. They will use simple translation tables instead. Therefore, simple translation tables can also be used to decrypt their subject fields.

About input forms, I find this one easy.

1) Use session cookies.
2) For each mail submit form, include a delay of 3 seconds before processing the submission.
3) Include the hidden random field. That will ensure that the same mail form will only be processed once.

This will ensure that the spammer will at least have to wait 3 seconds between mail submissions. This will narrow the spammer's "damage zone", i.e. from 20 form submissions that could be performed in three seconds, you only get one.

Replies welcome :)
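
The filtering steps in this comment translate almost directly into code. The sketch below is a minimal illustration: the translation tables cover only the substitutions that appear in the example above (a real table would need many more entries), and the function names are made up.

```python
import re

# Anti-l337 table: only the digits used in the example above (an assumption).
LEET_TABLE = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "9": "g"})

# 2-character phonetic replacements: only the pair named in the comment.
PHONETIC_TABLE = {"ph": "f"}


def normalize(text: str) -> str:
    """Apply the three filtering steps and return the cleaned text."""
    # Step 1: undo l337 substitutions and force lowercase.
    text = text.translate(LEET_TABLE).lower()
    # Step 2: strip spaces and every other non-alphabetic character.
    text = re.sub(r"[^a-z]", "", text)
    # Step 3: replace 2-character sequences with their phonetic equivalent.
    for pair, replacement in PHONETIC_TABLE.items():
        text = text.replace(pair, replacement)
    return text


def find_keywords(text: str, keywords=("free", "viagra", "for")) -> list:
    """Return the keywords that occur in the normalized text."""
    normalized = normalize(text)
    return [word for word in keywords if word in normalized]


print(normalize("Phr'33 v149r4 ' ph; 0r .U"))      # freeviagraforu
print(find_keywords("Phr'33 v149r4 ' ph; 0r .U"))  # ['free', 'viagra', 'for']
```

The normalized string could also be what gets hashed for a digest, so that cosmetic variations of the same spam produce the same value. Plain substring search will also match inside ordinary words, so a keyword list like this would produce false positives on legitimate comments.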

Trackback from floating atoll
November 14, 2003 8:00 PM

A thousand monkeys filtering advertising

Excerpt: A common thread between the most effective forms of online advertising is the introduction of a hyperlink to a targeted user. In this respect, there is no difference between Google text ads, Orbitz pop-ups, and DoubleClick banner ads: for the advertise...

Kevin
October 16, 2005 2:21 PM

I don't think an MD5 of the body would be useful. Even a tiny variation in the message would generate a different hash.

These are the last 15 comments. Read all 17 comments here.

This discussion has been closed.

©1999-2018 Adam Kalsey.
Content management by Movable Type.