Filtered Comments

Freshness Warning
This article is over 8 years old. It's possible that the information you read below isn't current.

A number of people have suggested using Bayes filters or other content analysis in order to defeat comment spam. The problem with this approach is that comment spam is a different animal than email spam. At this point in time (although it may change), the purpose behind comment spam is to add the spammer’s URL to your site. By sprinkling their links over hundreds or thousands of pages, they hope to increase their search engine rankings.

The messages themselves are always short, innocuous comments. Some of the ones I’ve seen make an attempt to appear to be a valid comment by copying verbatim the contents of someone else’s comment.

Since there is nothing spammy about the text of the comment, content filtering won’t do any good.

Martin Sutherland
October 10, 2003 6:52 PM

It sounds like the comments spam I get is probably not representative of comments spam in general. About half of the stuff I get has some kind of marketing message in the body, and I thought that would be enough to use as a basis for Bayesian filtering.

Even the other kind, which consists of a URL and an apparently innocuous null sentence, would contain a URL that has never before been seen on my site. As a message token, this would therefore be assigned a higher spam probability than the homepage of a frequent comment poster, whose messages have been categorized as ham.
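
To make the token argument concrete, here is a minimal sketch of per-token scoring, loosely modeled on the scheme Paul Graham describes for email. The function name, the `url:` token prefix, the counts, and the constants (0.4 for unseen tokens, clamping to 0.01 and 0.99) are illustrative assumptions, not taken from any particular filter.

```python
def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham,
                           unseen=0.4):
    """Estimate how spammy one token is from counts in labeled comments."""
    bad = spam_counts.get(token, 0)
    good = ham_counts.get(token, 0)
    if bad + good == 0:
        # Never seen before: no evidence either way, so use a fixed prior
        # (assumed value).
        return unseen
    # Fraction of spam / ham comments that contained the token.
    bad_rate = min(1.0, bad / max(n_spam, 1))
    good_rate = min(1.0, good / max(n_ham, 1))
    p = bad_rate / (bad_rate + good_rate)
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, p))


# Hypothetical training data: a regular commenter's homepage has appeared
# in 12 comments that were all classified as ham.
ham_counts = {"url:example.org/regular": 12}
spam_counts = {}

known = token_spam_probability("url:example.org/regular",
                               spam_counts, ham_counts, n_spam=50, n_ham=200)
fresh = token_spam_probability("url:example.net/new",
                               spam_counts, ham_counts, n_spam=50, n_ham=200)
```

Under these assumptions the never-seen URL scores higher than the regular's homepage, but it only lands on the prior, which is the "not an awful lot to go on" problem in practice.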

However, you’re right that in these cases, there isn’t an awful lot to go on. A Bayesian filter generally wouldn’t be able to make a definitive determination. In his article “So Far, So Good” (http://paulgraham.com/sofar.html), Paul Graham calls messages like this “spam from the future”, because spammers have figured out that this is a way to defeat (or at least sow doubt amongst) Bayesian filters.

You could take the Bayesian path one step beyond the message itself, and apply filtering to the text of the target URL, but that obviates the unique feature of tools like SpamBayes: the fact that you don’t need to consult any external references to identify spam. If you’re willing to consult an external source to confirm the spam content of a URL, why not consult a blacklist rather than the URL itself? The blacklist would probably be quicker. (Unless, of course, spammers started using customized URLs that were materially different for each person on their mailing/comments list. Ugh.)

Bayesian filtering works so well for email that I’d like to think there is a place for it in identifying comments spam. But it clearly isn’t enough in itself to be a complete solution. Does it have enough merit to be worth considering as part of a complete solution, or do other techniques cover the same ground just as (or more) effectively?

KO
October 11, 2003 6:28 AM

Yes, people have been leaving generic comments like “Nice post” and so on with spam URLs in the links on my blog (despite the fact that it hardly gets any traffic).

There doesn’t really seem to be any easy solution to this besides blacklisting. Isn’t there a limit to the number of different URLs spammers use? After all, they are promoting actual existing sites, and there are a finite number of links to those. Couldn’t the blacklist contain only the main URL for a spam site, and automatically find and blacklist all the alternate URLs which also point to it?
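
One partial way to do this is to blacklist the main domain and match any hostname under it. The sketch below assumes a hypothetical blacklist entry (`spam-example.test`) and made-up helper names. It only catches alternate URLs on the same domain, not separate domains that redirect to the same site.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist holding only the main domain of each spam site.
BLACKLISTED_DOMAINS = {"spam-example.test"}


def host_of(url):
    """Extract the lower-cased hostname from a URL typed into a comment form."""
    # urlsplit only recognizes the host part when '//' is present.
    if "//" not in url:
        url = "//" + url
    return (urlsplit(url).hostname or "").lower().rstrip(".")


def is_blacklisted(url, domains=BLACKLISTED_DOMAINS):
    """True if the URL's host is a blacklisted domain or one of its subdomains."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in domains)
```

The leading dot in the suffix check keeps an unrelated host such as `notspam-example.test` from matching by accident.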

kim
October 12, 2003 11:13 AM

I’ve been noticing that most of my spam has been coming from the 209.210.176.* block. They always resolve to pornfilter#.someISP or pfilter#.someISP. The URLs always contain some sort of porn. I agree with KO, I’ve only found blacklisting to work, but it’s becoming a pain.
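
Blocking by address range can be sketched with Python's standard `ipaddress` module. This assumes that 209.210.176.* means the /24 network 209.210.176.0/24. The function name and the decision to let unparseable addresses through are illustrative choices.

```python
import ipaddress

# The 209.210.176.* block written as a /24 network.
BLOCKED_NETWORKS = [ipaddress.ip_network("209.210.176.0/24")]


def is_blocked_ip(remote_addr, networks=BLOCKED_NETWORKS):
    """True if the commenter's address falls inside a blocked network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable address; leave the decision to other checks.
        return False
    return any(addr in net for net in networks)
```

A range block like this needs the same upkeep as a URL blacklist, since the spammer's addresses can change.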

MojoMark
October 13, 2003 10:36 AM

FYI: Lots more comment spam has been hitting blogs recently. Just saw lots of discussion at WindsOfChange.Net and some good stuff at ScriptyGoddess.

Shirley Kaiser
October 14, 2003 12:40 PM

Hi, Adam,

I came over to your site to see if you had any information on SimpleComments with Jay’s new filter, as there are some little quirks going on with numbers. You probably already know about Jay’s MT-Blacklist plugin, but just in case you can see what’s going on at Jay’s site about this at http://www.jayallen.org/journey/2003/10/mtblackliststopspam_now.

I agree with you that content-only filtering is off-the-mark. Filtering the URLs that the spammers are trying to leave at our sites can be quite helpful, however, as I’ve experienced firsthand.

Since Jay’s approach includes all the form fields, it covers wherever the spammers try to add their links.
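
The idea of checking every submitted field can be sketched roughly as follows. This is not MT-Blacklist's code. The field names, the pattern, and the function name are assumptions made for illustration.

```python
import re

# Hypothetical blacklist of patterns for known spam URLs.
BLACKLIST_PATTERNS = [re.compile(r"spam-example\.test", re.IGNORECASE)]


def matching_fields(form, patterns=BLACKLIST_PATTERNS):
    """Return the names of submitted fields that match any blacklist pattern."""
    return sorted(
        name for name, value in form.items()
        if any(p.search(value or "") for p in patterns)
    )


# A made-up comment submission with the link placed in the URL field.
comment = {
    "author": "Nice Guy",
    "email": "nobody@example.org",
    "url": "http://www.Spam-Example.test/",
    "text": "Nice post",
}
flagged = matching_fields(comment)
```

Because every field is scanned, moving the link from the URL field into the name or the comment body does not get it past the check.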

Of course filtering isn’t going to stop 100% of the spam, but I’ve seen it trickle to almost nothing since I implemented the filters at my own site. Not having the spam show up at all is a huge deterrent, and the volume of spam that does get through is far less time-consuming to manage.

Just like email spam, though, I suspect the comment spammers will work on ways around the filters. So we’ll see where it all goes. Too bad they don’t use their energy in far more positive ways. Then again, the sites they’re leaving for us are questionable, too.

Hopefully the search engines and directories will address this comment spam, too, as they do try to ward off tactics to fool them.

Additionally, I totally agree with you that removing spam immediately is critical. I’ve linked to your site about what you’ve written on this.

Thanks for all the terrific content at your site, Adam.

Chris Vance
October 16, 2003 2:41 PM

I just found a link to MT-Blacklist, but Shirley linked to it first. I initially found the link from Luke Hutteman’s blog (see http://www.hutteman.com/weblog/2003/10/14-132.html).

In an earlier entry (http://www.hutteman.com/weblog/2003/09/22-123.html), a commenter referred Luke to your “Comment Spam” article.

James Seng
October 17, 2003 5:09 AM

Actually, Bayesian filtering as proposed only needs about 10-15 significant words to work properly. And if you add IPs, URLs and hosts (even “Nice post” spam will have URLs), it gets pretty significant.

And that’s what I did with my MT-Bayesian.
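
For readers unfamiliar with the technique, the following is a minimal sketch of combining the most significant tokens, using the naive Bayes combination Paul Graham describes for email. It is not MT-Bayesian's code. The per-token probabilities, the `url:`, `ip:` and `host:` prefixes, and the function name are all hypothetical.

```python
import math


def combined_spam_probability(token_probs, top_n=15):
    """Combine per-token spam probabilities using the top_n most significant.

    'Significant' here means farthest from the neutral value 0.5.
    Probabilities are assumed to be clamped strictly between 0 and 1.
    """
    significant = sorted(token_probs.values(),
                         key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    if not significant:
        return 0.5
    spam = math.prod(significant)
    ham = math.prod(1.0 - p for p in significant)
    return spam / (spam + ham)


# Hypothetical per-token probabilities for a "Nice post" comment.
words_only = {"nice": 0.40, "post": 0.40}
with_metadata = {
    **words_only,
    "url:spam-example.test": 0.99,
    "ip:209.210.176.42": 0.95,
    "host:pfilter1.someisp.test": 0.90,
}

score_words = combined_spam_probability(words_only)
score_all = combined_spam_probability(with_metadata)
```

With these made-up numbers the words alone stay below 0.5, while adding the URL, IP, and host tokens pushes the combined score close to 1.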

mark
February 26, 2006 11:43 PM

I’m confused about just what a comment spammer is. If someone addresses the topic at hand, they should be able to link your article to their site, especially if the sites relate (blog site to a blog site, etc.).

This discussion has been closed.

©1999-2012 Adam Kalsey.
Content management by Movable Type.