Comments

Comments on Distributed comment spam prevention

An organic system that responds to wholesale comment spamming in real time.

john September 4, 2003 5:55 AM

Wrap a web services engine around that and you make it very easy for others to use the list. Presto!

Matthew Walker September 4, 2003 6:43 AM

I don’t think you want to go the MD5 route, you’re probably better off leveraging existing software in the form of Razor/Pyzor/DCC which have already built this stuff for email. They incorporate stuff like fuzzy checksumming.

Simon Willison September 4, 2003 7:19 AM

The problem with using an MD5 hash of the comment body is that spammers can get around it by adding a couple of random characters to each comment, causing almost identical comments to generate completely different hashes. They do this in emails already (the weird random characters you get in the subject lines of some spams). There’s also a trust issue with a single server or crowd of servers - what if one of them starts maliciously flagging legitimate comments as spam? A real time distributed system is a very interesting idea but there are quite a few kinks to iron out.
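As a rough illustration of the point above, the sketch below hashes a comment body with Python's standard `hashlib` and then hashes the same body with a little padding appended. The sample comment text and the `body_hash` helper are made up for this example and are not part of any proposed system.

```python
import hashlib


def body_hash(comment: str) -> str:
    """Return the hex MD5 digest of a comment body (UTF-8 encoded)."""
    return hashlib.md5(comment.encode("utf-8")).hexdigest()


original = "Buy cheap pills at example.com"
padded = original + " x"  # a spammer appends a little random padding

hash_a = body_hash(original)
hash_b = body_hash(padded)

# Count the hex digits that happen to agree position by position.
matching = sum(1 for a, b in zip(hash_a, hash_b) if a == b)

print(hash_a)
print(hash_b)
print(f"{matching} of {len(hash_a)} hex digits match")
```

The two digests share no useful structure, so an exact-match lookup on the hash treats the padded comment as brand new.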

Adam Kalsey September 4, 2003 9:18 AM

I agree that a message hash isn’t perfect. The addition of a single character to a message would throw the hash off completely.

The basic problem is that we need to identify comments from multiple sites that are identical. Not identical to a computer, but identical to a person. People recognize that the following two lines mean the same thing, but computers don’t:

One, Two, Three, Four
0ne, t wo, three, Fou`r

I’ll look into the fuzzy checksumming of some of the anti-spam systems, but I don’t think that using such a system directly is the way to go. What we’re looking for here isn’t content that someone has flagged as spam (as Razor and Cloudmark do), but content that is repeated across multiple sites.
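One hedged way to picture "identical to a person" is a similarity ratio rather than an exact match. The sketch below uses `SequenceMatcher` from Python's standard `difflib`; this is a stand-in chosen for illustration, not the fuzzy checksumming that Razor, Pyzor, or DCC use. The `similarity` helper, the unrelated sample comment, and the 0.85 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two comments, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


clean = "One, Two, Three, Four"
mangled = "0ne, t wo, three, Fou`r"
unrelated = "Thanks for the great post, Adam!"

THRESHOLD = 0.85  # assumed cut-off; a real system would need to tune this

print(similarity(clean, mangled) >= THRESHOLD)    # near-duplicate
print(similarity(clean, unrelated) >= THRESHOLD)  # different comment
```

Pairwise comparison like this does not scale the way a hash lookup does, which is part of why a shared checksum is attractive in the first place.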

The problem of innocent people being blacklisted due to false positives or maliciousness is mitigated by the fact that the blacklist is temporary. Sites implementing the blacklist would be expected to expire all bannings within a few hours.

The risk of malicious blacklisting could be reduced even further by making sure that the system is maintained by a trusted group, much as the RBL or ORBS are for email.
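The temporary nature of the blacklist can be sketched as a table of expiry times. This is a minimal illustration, not a description of any real service: the `TemporaryBlacklist` class is hypothetical, and the three-hour default is only one reading of "a few hours".

```python
import time


class TemporaryBlacklist:
    """Hypothetical blacklist whose entries expire after a fixed lifetime."""

    def __init__(self, ttl_seconds=3 * 60 * 60, clock=time.time):
        self._ttl = ttl_seconds  # assumed lifetime of a ban
        self._clock = clock      # injectable so expiry can be tested
        self._expires = {}       # key (hash, URL, IP) -> expiry timestamp

    def ban(self, key):
        """Ban a key for the configured lifetime, starting now."""
        self._expires[key] = self._clock() + self._ttl

    def is_banned(self, key):
        """Return True while the ban is live; forget it once expired."""
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[key]
            return False
        return True


# Usage with a fake clock so the expiry is visible without waiting.
now = [0.0]
blacklist = TemporaryBlacklist(ttl_seconds=10, clock=lambda: now[0])
blacklist.ban("spammy-comment-hash")
print(blacklist.is_banned("spammy-comment-hash"))  # ban is live
now[0] = 11.0
print(blacklist.is_banned("spammy-comment-hash"))  # ban has expired
```

Because every entry dies on its own, a false positive or a malicious report does limited damage even if nobody notices it.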

Chris September 4, 2003 10:19 AM

It would be better to prevent robots from entering comments at all, so my suggestion is a human validator like the ones PayPal, vB and eBay use: “Enter the digits from the box on the right”, where the box is an obscured graphic.

…as random text can be added to a url or comment to make it unique to avoid a blacklist type system. It’s easy to get hold of a daily list of 200-400 open http proxies, or use ISP dialups with DHCP to spam with.

… Just thoughts. Anything to stop the spammers would slow them down …(cynic) and move them off to wikis and trackback… sigh

Adam Kalsey September 4, 2003 11:20 AM

Those random text images are an accessibility nightmare. If you make it hard for machines to interact with your site, you are making it hard for screen readers as well.

David Beckemeyer September 8, 2003 10:23 PM

Here’s an idea I’ve implemented on my blog: http://www.toyz.org/mrblog/archives/00000059.html

It is a simple CAPTCHA Turing Test for posters. It doesn’t stop all spam, but it prevents spam robots from posting.

Nick Altmann September 14, 2003 6:02 PM

Would this problem be moot if comments were kept in the poster’s blog instead of on the commented page? Then a feature like Google’s “backward links” could find related comments. The display could end up being the same (with the user agent pulling the comments into a single page), but it would shift the filtering burden (or privilege) to the user instead of the publisher.

Adam Kalsey September 15, 2003 9:00 AM

That assumes that everyone who would like to comment has a blog. It also assumes that they want their blog to become a list of comments on other blogs.

Wolfgang Flamme September 19, 2003 12:47 PM

Adam, why go for text content? We should a) aim at posted URLs b) monitor poster’s IP activity

(a) will prevent backlink spam activity targeting search engines

(b) will prevent any excessive or automatic comment activity from a spammer (someone leaving 50 comments per day probably doesn’t have that much to say)

Wolfgang
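Point (b) above could be approximated with a sliding-window counter per IP address. The sketch below is only an illustration: the `CommentRateMonitor` class is hypothetical, and the default of 50 comments per 24 hours simply reuses the figure from the comment as a threshold.

```python
from collections import defaultdict, deque


class CommentRateMonitor:
    """Hypothetical per-IP counter over a sliding time window."""

    def __init__(self, max_comments=50, window_seconds=24 * 60 * 60):
        self._max = max_comments
        self._window = window_seconds
        self._seen = defaultdict(deque)  # ip -> timestamps inside the window

    def record(self, ip, timestamp):
        """Record one comment; return True if the IP is still within the limit."""
        times = self._seen[ip]
        # Drop timestamps that have fallen out of the window.
        while times and timestamp - times[0] >= self._window:
            times.popleft()
        times.append(timestamp)
        return len(times) <= self._max


# Usage with a small limit so the effect is easy to see.
monitor = CommentRateMonitor(max_comments=3, window_seconds=60)
results = [monitor.record("192.0.2.1", t) for t in (0, 1, 2, 3)]
print(results)  # the fourth comment inside one minute exceeds the limit
```

As noted earlier in the thread, open proxies and dial-up address pools weaken any per-IP measure, so this would be one signal among several rather than a complete defense.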

Mean Dean October 6, 2003 2:45 AM

You’ve been blogged in a post of mine about how I was able to discern a pattern used by a particular comment spammer who afflicted my site 2x today.

Perhaps we could combine technologies to thwart this putz? See the hyperlink associated with my name.

Mike Steinbaugh October 10, 2003 10:00 AM

Adam, I think I have a solution.

Include a checkbox in the comment form that says something like, “Are you human? (prevents comment spam)”. Then once the user checks the box, the comment will go through. This can be used as a short term fix until Movable Type allows users to change the names of the form elements, which I think is the easiest fix. I totally agree that the random digits approach is an accessibility nightmare and should be avoided.

I think the delay time idea is good, but very hard to implement since it would involve JavaScript for the time being until Ben and Mena can make it server side in MT.

Just some thoughts…I’d love to get this fixed though. My blog is starting to get lots of comment spam.

Adam Kalsey October 10, 2003 10:12 AM

Bots would simply set the checkbox and submit it. There are all sorts of things you could do with JavaScript, if you want to require that the user has JS before submitting a comment.

For instance, you could have a checkbox that alters the value of a hidden field through JavaScript. Ignore or moderate any postings that don’t have the correct value in the hidden field.
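The server-side half of that idea might look like the sketch below. Everything named here is an assumption made for illustration: the `form_id` and `human_check` field names, the `SECRET_KEY`, and the choice of an HMAC as the "correct value". The page's JavaScript would be expected to copy the token into the hidden field when the checkbox is ticked.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret


def expected_token(form_id: str) -> str:
    """Value the page's JavaScript is expected to place in the hidden field."""
    return hmac.new(SECRET_KEY, form_id.encode("utf-8"), hashlib.sha256).hexdigest()


def triage(form: dict) -> str:
    """Return 'accept' when the hidden field is correct, else 'moderate'."""
    form_id = form.get("form_id", "")
    supplied = form.get("human_check", "")
    correct = expected_token(form_id)
    # Compare as bytes in constant time.
    if form_id and hmac.compare_digest(supplied.encode("utf-8"), correct.encode("utf-8")):
        return "accept"
    return "moderate"


good = {"form_id": "entry-42", "human_check": expected_token("entry-42")}
bot = {"form_id": "entry-42", "human_check": ""}
print(triage(good))
print(triage(bot))
```

A bot that parses the script and extracts the token would still get through, so this only raises the bar for robots that do not run or read JavaScript.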

Trackback from random ruminations
October 11, 2003 9:21 AM
Comment Spam
Excerpt: I've been struck with comment spam three times in the last week. I don't know if this means that, suddenly, my blog has hit the radar screens of whatever search engine spammers use, or if I'm just lucky. Regardless, the first time it was mild, the seco...

Rick November 7, 2003 9:01 AM

I’ve noticed a trick for getting rid of comment noise when filtering. Spam padded with random characters still has to be readable (otherwise the spam would have no impact), so spammers usually insert non-alphanumeric characters in the comment subject. Here’s a small formula that I’d like to try out in your anti-spam blocker.

1) Perform an anti-l337 filter. A simple translation table will do the job. (The result must always be lowercase.)

2) Strip spaces and non-alphabetic characters.

3) Replace 2-character sequences with their phonetic equivalent (e.g. ph -> f). Simple translation tables also work.

4) There you go. The message has been filtered and is ready for digest.

Example:

Phr’33 v149r4 ’ ph; 0r .U

Step 1 - Anti 1337 filter:

phr’ee viagra ’ ph; or .u

Step 2 - Strip spaces and non-alphabetic chars:

phreeviagraphoru

Step 3 - 2-char Phonetic replacement:

freeviagraforu

We could run a massive test and then perhaps, with some “scientific” research, build a database, who knows. Still, the content has been filtered and is ready for a keyword search. The keywords that any simple search routine can find are “free”, “viagra”, and “for”.

The trick here is that spammers are cheapskates. They won’t write artificial intelligence programs to fool spam filters. They will use simple translation tables instead. Therefore, simple translation tables can also be used to decrypt their subject fields.
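The three steps and the worked example can be sketched in a few lines of Python. The translation tables below are assumptions reverse-engineered from the example (0 to o, 1 to i, 3 to e, 4 to a, 9 to g, and ph to f); a usable filter would need much larger tables. The `normalize` and `find_keywords` names are made up for this sketch.

```python
import re

# Step 1 table: digit look-alikes mapped back to letters (assumed, from the example).
LEET_TABLE = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "9": "g"})

# Step 3 table: two-character sequences and their phonetic equivalents (assumed).
PHONETIC_PAIRS = {"ph": "f"}


def normalize(text: str) -> str:
    """Apply the anti-l337, strip, and phonetic steps in order."""
    step1 = text.lower().translate(LEET_TABLE)  # lowercase, undo l337
    step2 = re.sub(r"[^a-z]", "", step1)        # keep letters only
    step3 = step2
    for pair, replacement in PHONETIC_PAIRS.items():
        step3 = step3.replace(pair, replacement)
    return step3


def find_keywords(text, keywords):
    """Return the keywords that appear in the normalized text."""
    digest = normalize(text)
    return [kw for kw in keywords if kw in digest]


sample = "Phr’33 v149r4 ’ ph; 0r .U"
print(normalize(sample))  # freeviagraforu
print(find_keywords(sample, ["free", "viagra", "for", "casino"]))  # ['free', 'viagra', 'for']
```

Because spaces are stripped before the search, a keyword can also match across what used to be a word boundary, so short keywords are likely to produce false positives on legitimate comments.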

About input forms, I find this one easy.

1) Use session cookies.

2) For each mail submit form, include a delay of 3 seconds before processing the submission.

3) Include a hidden random field. That will ensure that the same form will only be processed once.

This will ensure that the spammer has to wait at least 3 seconds between mail submissions, which narrows the spammer’s “damage zone”. That is, instead of the 20 form submissions that could otherwise be performed in three seconds, you only get one.
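The three form measures can be combined in one small server-side guard. The sketch below is only one reading of the list above: it treats the 3-second delay as a minimum interval between accepted submissions from the same session, and it models the hidden random field as a single-use token. The `FormGuard` class and its method names are hypothetical.

```python
import secrets


class FormGuard:
    """Hypothetical guard: single-use form tokens plus a per-session delay."""

    def __init__(self, min_interval=3.0):
        self._min_interval = min_interval  # seconds between accepted posts
        self._pending = {}                 # token -> session id it was issued to
        self._last_accepted = {}           # session id -> time of last accepted post

    def issue_token(self, session_id):
        """Create the random value to embed in the form's hidden field."""
        token = secrets.token_hex(16)
        self._pending[token] = session_id
        return token

    def accept(self, session_id, token, now):
        """Return True only for a fresh token that also respects the delay."""
        if self._pending.get(token) != session_id:
            return False  # unknown, replayed, or foreign token
        del self._pending[token]  # each form is processed at most once
        last = self._last_accepted.get(session_id)
        if last is not None and now - last < self._min_interval:
            return False  # submitted too soon after the previous post
        self._last_accepted[session_id] = now
        return True


guard = FormGuard()
first = guard.issue_token("session-a")
print(guard.accept("session-a", first, now=0.0))   # fresh token
print(guard.accept("session-a", first, now=10.0))  # same token replayed
```

A spammer who discards the session cookie and fetches a new form for every post gets around the per-session delay, so the interval check would probably need to be keyed on something harder to reset as well.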

Replies welcome :)

Trackback from floating atoll
November 14, 2003 8:00 PM
A thousand monkeys filtering advertising
Excerpt: A common thread between the most effective forms of online advertising is the introduction of a hyperlink to a targeted user. In this respect, there is no difference between Google text ads, Orbitz pop-ups, and DoubleClick banner ads: for the advertise...

Kevin October 16, 2005 2:21 PM

I don’t think an MD5 of the body would be useful. Even a tiny variation in the message would generate a different hash.


©1999-2008 Adam Kalsey.
Content management by Movable Type.