Distributed comment spam prevention

Earlier I mentioned some ideas for preventing comment spam. Thanks to a TrackBack ping, I found out that Simon Willison had been discussing the same thing yesterday. I need to read Simon more often. This is the second time that I’ve been working on something only to find out that he’s doing something similar.

Simon’s offering a blacklist of domains that appear in the spam he receives, and that gave me an idea: combine a distributed blacklist with my distributed anti-spam concept. Sites could participate by sending the IP address, URL, and a digest of the comment body (an MD5 hash would work) to a central server or a cloud of servers. If the server saw the same comment being posted in multiple places within a short time period, it would send a ping to all participating sites. The ping would contain the IP address and URL of the spammer. The sites would then use this information to ban further comments from that URL and IP. Ideally the ban would be temporary to minimize the impact of false positives, but that would be up to the site’s software.
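Here is a rough sketch of what the server side of that idea might look like. Everything in it is an assumption made for illustration: the `ReportServer` class, the ten-minute window, and the three-site threshold are hypothetical, and the sketch assumes reports arrive in time order. It only shows the counting logic, not the network pings.

```python
import hashlib
from collections import defaultdict, deque


def comment_digest(body: str) -> str:
    """MD5 hex digest of the comment body, as suggested in the post."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


class ReportServer:
    """Hypothetical central server that tracks digests reported by sites."""

    def __init__(self, window_seconds=600, site_threshold=3):
        self.window_seconds = window_seconds  # assumed "short time period"
        self.site_threshold = site_threshold  # assumed meaning of "multiple places"
        # digest -> deque of (timestamp, site) reports, oldest first
        self.reports = defaultdict(deque)

    def report(self, site, ip, url, digest, now):
        """Record one report; return a ban notice if the digest is spreading."""
        seen = self.reports[digest]
        seen.append((now, site))
        # Drop reports that have fallen out of the time window.
        while seen and now - seen[0][0] > self.window_seconds:
            seen.popleft()
        distinct_sites = {reporting_site for _, reporting_site in seen}
        if len(distinct_sites) >= self.site_threshold:
            # A real server would ping this to every participating site.
            return {"ip": ip, "url": url}
        return None
```

A participating site would call `report()` for each new comment and apply a temporary ban whenever a notice comes back. Counting distinct sites rather than raw reports keeps one busy site from triggering a ban by itself.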

Essentially, this would create an organic system that responds to wholesale comment spamming in real time. This wouldn’t solve the problem of someone posting an individual comment on a single site, but that’s not really the way spammers work. For spam to be effective, it needs enormous volume. And the only way to have that sort of posting volume is to automate it.

Adam Kalsey
October 10, 2003 10:12 AM

Bots would simply set the checkbox and submit it. There are all sorts of things you could do with JavaScript, if you want to require that the user has JS enabled before submitting a comment.

For instance, you could have a checkbox that alters the value of a hidden field through JavaScript. Ignore or moderate any postings that don’t have the correct value in the hidden field.
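The server-side half of that check could be as small as the sketch below. The field name `js_token` and the expected value are hypothetical, and the sketch assumes the page's JavaScript writes that value into the hidden field when the checkbox is ticked.

```python
# Hypothetical value that the page's JavaScript writes into the hidden field.
EXPECTED_TOKEN = "checked-by-script"


def classify_submission(form: dict) -> str:
    """Return "accept" if the hidden field holds the expected value, else "moderate"."""
    if form.get("js_token") == EXPECTED_TOKEN:
        return "accept"
    # Missing or wrong value: the poster probably did not run the JavaScript.
    return "moderate"
```

Sending failures to moderation instead of discarding them keeps real visitors without JavaScript from losing their comments.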

Trackback from random ruminations
October 11, 2003 9:21 AM

Comment Spam

Excerpt: I've been struck with comment spam three times in the last week. I don't know if this means that, suddenly, my blog has hit the radar screens of whatever search engine spammers use, or if I'm just lucky. Regardless, the first time it was mild, the seco...

Rick
November 7, 2003 9:01 AM

I’ve noticed a trick for getting rid of comment noise when filtering. The random characters that spammers insert still have to leave the message readable (otherwise the spam would have no impact), so they usually insert non-alphanumeric characters in the comment subject. Here’s a small formula that I’d like to see tried out in your anti-spam blocker.

1) Apply an anti-l337 filter. A simple translation table will do the job. (The result must always be lowercase.)

2) Strip spaces and non-alphabetic characters.

3) Replace two-character sequences with their phonetic equivalents (e.g. ph -> f). Simple translation tables also work here.

4) There you go. The message has been filtered and is ready for the digest.

Example:

Phr’33 v149r4 ’ ph; 0r .U

Step 1 - Anti 1337 filter:

phr’ee viagra ’ ph; or .u

Step 2 - Strip non alphanum chars:

phreeviagraphoru

Step 3 - 2-char Phonetic replacement:

freeviagraforu

We could run a large-scale test and then perhaps, with some “scientific” research, build a database, who knows. Either way, the content has been filtered and is ready for a keyword search. The keywords that any simple search routine can find are “free”, “viagra”, and “for”.

The trick here is that spammers are cheapskates. They won’t write artificial intelligence programs to fool spam filters. Instead they will use simple translation tables. Therefore, simple translation tables can also be used to decode their subject fields.
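The three filtering steps and the keyword search can be sketched in a few lines. The translation tables here are hypothetical and deliberately tiny, covering only the characters in the example above; a usable filter would need much larger ones.

```python
import re

# Hypothetical, minimal translation tables for illustration only.
LEET_TABLE = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "9": "g"})
PHONETIC_PAIRS = {"ph": "f"}


def normalize(text: str) -> str:
    # Step 1: undo l337 substitutions and lowercase the result.
    text = text.translate(LEET_TABLE).lower()
    # Step 2: strip spaces and every other non-alphabetic character.
    text = re.sub(r"[^a-z]", "", text)
    # Step 3: replace two-character sequences with a phonetic equivalent.
    for pair, sound in PHONETIC_PAIRS.items():
        text = text.replace(pair, sound)
    return text


def find_keywords(text: str, keywords: list) -> list:
    """Return the keywords that appear in the normalized text."""
    normalized = normalize(text)
    return [word for word in keywords if word in normalized]


print(normalize("Phr'33 v149r4 ' ph; 0r .U"))  # freeviagraforu
```

Because step 2 removes the spaces, the keyword search is a plain substring match, so short keywords such as “for” will also match inside longer innocent words.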

As for input forms, I find this approach easy:

1) Use session cookies.

2) For each mail submit form, include a delay of 3 seconds before processing the submission.

3) Include a hidden random field. That will ensure that the same mail form is only processed once.

This ensures that the spammer has to wait at least 3 seconds between mail submissions, which narrows the spammer’s “damage zone”: instead of the 20 form submissions that could be performed in three seconds, you only get one.
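A minimal sketch of those three steps, assuming the web framework already supplies a session id from the cookie. The `FormGuard` class and its method names are hypothetical. The sleep function is injectable so the delay can be replaced in tests.

```python
import secrets
import time


class FormGuard:
    """Hypothetical helper tying one-time form tokens to a session."""

    def __init__(self, delay_seconds=3.0, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self.sleep = sleep
        self.pending = {}  # session id -> set of tokens issued but not yet used

    def issue_token(self, session_id: str) -> str:
        """Create the random value to embed in the form's hidden field."""
        token = secrets.token_hex(16)
        self.pending.setdefault(session_id, set()).add(token)
        return token

    def submit(self, session_id: str, token: str) -> bool:
        """Process a submission once per issued token, after the delay."""
        tokens = self.pending.get(session_id, set())
        if token not in tokens:
            return False  # unknown session, forged token, or a replayed form
        tokens.remove(token)  # the same form can never be processed twice
        self.sleep(self.delay_seconds)  # the 3 second delay from step 2
        return True
```

Note that the delay only slows a spammer down if submissions from one session are handled one at a time; a server that handles them in parallel would also need a per-session lock or timestamp check.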

Replies welcome :)

Trackback from floating atoll
November 14, 2003 8:00 PM

A thousand monkeys filtering advertising

Excerpt: A common thread between the most effective forms of online advertising is the introduction of a hyperlink to a targeted user. In this respect, there is no difference between Google text ads, Orbitz pop-ups, and DoubleClick banner ads: for the advertise...

Kevin
October 16, 2005 2:21 PM

I don’t think an MD5 of the body would be useful. Even a tiny variation in the message would generate a different hash.
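A quick illustration of the point, using made-up comment text. The `normalized_md5` helper is an assumption added for comparison: it lowercases and collapses whitespace before hashing, which absorbs trivial variations but nothing more.

```python
import hashlib
import re


def md5_hex(body: str) -> str:
    return hashlib.md5(body.encode("utf-8")).hexdigest()


def normalized_md5(body: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    cleaned = re.sub(r"\s+", " ", body.strip().lower())
    return md5_hex(cleaned)


first = "Buy cheap pills at spam.example"
second = "Buy  cheap pills at SPAM.example"  # extra space, different case
third = "Buy cheap pills at spam.example now"  # one added word

print(md5_hex(first) == md5_hex(second))                # False
print(normalized_md5(first) == normalized_md5(second))  # True
print(normalized_md5(first) == normalized_md5(third))   # False
```

So normalizing before hashing helps with spacing and case, but any change to the words themselves still produces a different digest.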

©1999-2009 Adam Kalsey.