Automatic Keywords

Nathan Jacobs wants MT to automatically suggest categories for the current entry.

The question is, how would it know? What criteria would determine the category? I played with the idea of a keyword generator for MT that would parse your entry text and create a keyword list. But how to come up with the keywords? Word frequency is the most obvious method, but generating keywords solely from word frequencies isn’t likely to produce acceptable results.

For example, take a look at a recent entry here, Open Letter to Barnes and Noble.

The entry is about their customer service mistake in sending me a marketing email when I had clearly asked not to receive them. Here’s the list of keywords generated by a word frequency analysis:

email, sent, unsubscribe, received, first, it’s, since, easier, list, now, they, why, preferences, mistake, understand, email preferences, marketing email

Here’s my hand-generated list of keywords:

barnes and noble, email, customers, open letter, privacy

Now obviously the automatic list could be improved somewhat by ignoring more words (it already ignores things like "the" and "and"), but there are still going to be limits to the automatic method. The subject of the entry is Barnes & Noble, but the words "Barnes and Noble" don’t appear very often in the text. So going by word frequency alone won’t cut it.
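To make the limitation concrete, here is a minimal sketch of a word frequency keyword generator in Python. It is not the script that produced the list above; the ignore list, the function name, and the sample text are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical ignore list; a usable one would be far longer.
IGNORE_WORDS = {"the", "and", "a", "to", "of", "i", "it", "they", "not", "me", "in", "is"}

def frequency_keywords(text, limit=5):
    """Return the most frequent words in text, skipping ignore words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in IGNORE_WORDS)
    return [word for word, _ in counts.most_common(limit)]

# Made-up sample text, loosely modeled on the entry described above.
sample = (
    "They sent me a marketing email. I asked them not to email me. "
    "The email came from Barnes and Noble."
)
print(frequency_keywords(sample, limit=3))
```

In this sample, "email" appears three times and "Barnes" only once, so the real subject of the text never rises to the top of the list.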

This is the problem that the big search engines had a few years ago. Keyword frequency isn’t always a good indicator of the subject of a page. Just because my page has the word "animation" in it repeatedly doesn’t mean it’s about cartoons. That’s why Google was such a big hit. They determined relevancy based on what other Web sites thought. If sites about cartoons link to me, then my site is probably about cartoons. So Google uses humans (Web site owners) to determine what my pages are about.

Perhaps someone can suggest a better algorithm for generating keywords?

Paul
November 27, 2002 4:19 AM

This is one of those problems that seems trivial but isn't. Controlled vocabularies, synonym rings, and the like are one solution. These probably don't make sense for a personal blog, as it would take longer to maintain the lists than to just assign categories manually. An alternative I was playing with for a while is to compare an entry's word frequencies with those of the entries already in each category and make suggestions based on that. My test scripts didn't give particularly useful results, but I think the problem may have been my algorithms rather than the idea.
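One way to sketch that last idea is to build a word frequency profile per category and pick the category whose profile is most similar to the new entry. The Python below is only an illustration, not Paul's test scripts: cosine similarity is a swapped-in choice of comparison, and the function names and sample data are hypothetical.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count lowercase words in a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def suggest_category(entry_text, entries_by_category):
    """entries_by_category maps a category name to a list of existing entry texts."""
    entry = word_counts(entry_text)
    scores = {}
    for category, texts in entries_by_category.items():
        profile = Counter()
        for text in texts:
            profile.update(word_counts(text))
        scores[category] = cosine_similarity(entry, profile)
    return max(scores, key=scores.get) if scores else None

# Made-up categories and entries for illustration.
existing = {
    "Email": ["spam email unsubscribe list", "email marketing preferences"],
    "Cartoons": ["animation cartoons drawing", "cartoon animation studio"],
}
print(suggest_category("I tried to unsubscribe from the marketing email list", existing))
```

With realistic entries, common words would dominate every profile, so an ignore list would probably be needed before the scores mean much.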

Scott Johnson
November 28, 2002 3:56 AM

I'd strongly suggest you look at Porter's Stemming Algorithm. It will come close to generating what you need. You might also want to check this comments form on IE 5.5. There are interesting wrapping issues (like I couldn't see the word interesting).

Adam Kalsey
November 28, 2002 2:01 PM

Word stemming would certainly improve the keyword list by combining words with the same root, but it doesn't solve the problem that sometimes the subject of a text isn't well represented by the keywords in the text. Synonyms are often used, and this makes a simple word frequency inaccurate. Here's a list of things that could improve automatic keyword generation:

  • Comprehensive list of ignore words. Build a list of words that the keyword generator won't include in the word list.
  • Phrase dictionary. Some words need to always be paired with others to create key phrases. In my Barnes & Noble example, the word Barnes shouldn't be in the keyword list alone. It should read "Barnes & Noble."
  • Keyword thesaurus. Words that mean the same thing should be accounted for. It should recognize that Barnes & Noble, Barnes and Noble, BN, and BN.com are all the same.
  • Controlled vocabulary. Always use the same terms for the same things. The problem with this approach is that using the same terms repeatedly isn't a good writing practice. Combining this with the keyword thesaurus would allow the writer to use a greater variety of words.
  • Word stemming. Run and running are the same thing as far as keywords are concerned. This technique would be most effective when combined with a thesaurus and the ignore words.
  • Keyword weighting. Establish a weighting system for keywords, much like many search engines do. Words in the entry's title have more significance than those in the body, and the closer a word is to the beginning of an entry, the higher the weight it receives. Other things like headers would be factored in as well.

I'm sure that other ideas from the disciplines of information architecture, search engine optimization, and search index creation could be applied to this problem.
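Several of the ideas above can be combined in a few lines of Python. This is a rough sketch under stated assumptions, not a finished generator: every table is a tiny hand-built example, the weights are invented, `crude_stem` is a deliberately naive stand-in and not Porter's algorithm, and only title-versus-body weighting is shown (position within the body is left out).

```python
import re
from collections import Counter

# All tables and weights below are hypothetical examples.
IGNORE_WORDS = {"the", "a", "to", "i", "me", "my", "from", "is", "an", "about"}
# Thesaurus: variant -> canonical keyword. Multi-word keys double as the phrase dictionary.
THESAURUS = {
    "barnes and noble": "barnes and noble",
    "barnes & noble": "barnes and noble",
    "bn.com": "barnes and noble",
    "bn": "barnes and noble",
}
TITLE_WEIGHT = 3  # assumed: a title hit counts three times a body hit
BODY_WEIGHT = 1

def crude_stem(word):
    """Naive suffix stripping, e.g. running -> run. Not Porter's algorithm."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]  # undo a doubled consonant: runn -> run
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def extract_terms(text):
    """Pull out known phrases first, then stem the remaining single words."""
    text = text.lower()
    terms = []
    # Longest variants first, so "bn.com" is consumed before "bn".
    for variant in sorted(THESAURUS, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        text, hits = re.subn(pattern, " ", text)
        terms.extend([THESAURUS[variant]] * hits)
    for word in re.findall(r"[a-z']+", text):
        if word not in IGNORE_WORDS:
            terms.append(crude_stem(word))
    return terms

def weighted_keywords(title, body, limit=5):
    """Score terms, counting title occurrences more heavily than body ones."""
    scores = Counter()
    for term in extract_terms(title):
        scores[term] += TITLE_WEIGHT
    for term in extract_terms(body):
        scores[term] += BODY_WEIGHT
    return [term for term, _ in scores.most_common(limit)]

# Made-up title and body for illustration.
print(weighted_keywords(
    "Open Letter to Barnes and Noble",
    "BN sent me a marketing email. I asked BN.com to stop emails.",
))
```

The phrase is matched before single words are counted, so "Barnes" never shows up alone, and the three spellings all feed one keyword.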

Adam Kalsey
November 28, 2002 2:09 PM

The textarea wrapping problem is an apparent IE bug. The textarea uses CSS to set its width at 100%. It fits perfectly on the screen, but as soon as you start typing in IE, the textarea widens to fill the entire width of the screen. I'm working on ideas for a solution, but I haven't come up with anything concrete yet. If anyone has a solution for this bug, I'd appreciate it.

Josh Santangelo
December 8, 2002 1:42 PM

This is a problem that I've been dealing with for the past few weeks, and there's a whole field of computer science dedicated to the analysis of unstructured data like this. I really need a keyword generator for a project of mine (tangent.cx), but I don't think I'm really qualified to take on the problem that no one seems to have really solved yet. I'll follow this site to see what you come up with.

Don
June 15, 2005 9:29 AM

I developed an algorithm for doing this that considers the title, the meta-data, any H1 headers, and the content. Each is weighted differently (Title = 20, Meta-data = 15, etc.), and for a well-structured, static page this gives good, solid, dependable results.
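A field-weighted scorer along these lines might look like the Python below. It is a guess at the general shape, not Don's actual algorithm: only the title and meta-data weights come from the comment, the H1 and content weights are assumed, and the function name and sample page are hypothetical.

```python
import re
from collections import Counter

FIELD_WEIGHTS = {
    "title": 20,    # from the comment above
    "meta": 15,     # from the comment above
    "h1": 10,       # assumed value
    "content": 1,   # assumed value
}

def score_page(fields, limit=5):
    """fields maps a field name to its text; returns (word, score) pairs."""
    scores = Counter()
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 0)  # unknown fields contribute nothing
        for word in re.findall(r"[a-z']+", text.lower()):
            scores[word] += weight
    return scores.most_common(limit)

# Made-up page for illustration.
page = {
    "title": "Cartoon Animation",
    "meta": "animation cartoons",
    "h1": "Animation basics",
    "content": "drawing frames for animation",
}
print(score_page(page, limit=3))
```

Here "animation" scores 20 + 15 + 10 + 1 = 46, well ahead of words that appear only in the body.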

©1999-2019 Adam Kalsey.