Content Management
Automatic Keywords
26 Nov 2002
Nathan Jacobs wants MT to automatically suggest categories for the current entry.
The question is, how would it know? What criteria would be used to determine the category? I played with the concept of creating a keyword generator for MT that would parse your entry text and create a keyword list. But how to come up with the keywords? Word frequency is the most likely method, but generating keywords solely from word frequencies isn’t likely to produce acceptable results.
For example, take a look at a recent entry here, Open Letter to Barnes and Noble.
The entry is about their customer service mistake in sending me a marketing email when I had clearly asked not to receive any. Here’s the list of keywords generated by a word frequency analysis:
email, sent, unsubscribe, received, first, it’s, since, easier, list, now, they, why, preferences, mistake, understand, email preferences, marketing email
Here’s my hand-generated list of keywords:
barnes and noble, email, customers, open letter, privacy
Now, the automatic list could be improved somewhat by ignoring more words (it already ignores things like "the" and "and"), but there are still going to be limits to the automatic method. The subject of the entry is Barnes & Noble, but the words "Barnes and Noble" don’t appear very often in the text. So going by word frequency alone won’t cut it.
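To make the limitation concrete, here is a minimal sketch of a word frequency keyword generator in Python. It is not the generator I experimented with for MT. The stop-word list, the function name `frequency_keywords`, and the sample text are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {
    "the", "and", "a", "an", "to", "of", "in", "i", "me", "my", "is",
    "was", "that", "it", "not", "for", "had", "when", "them", "their",
}

def frequency_keywords(text, limit=5):
    """Return the `limit` most frequent words in `text`, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(
        w for w in words if w not in STOP_WORDS and len(w) > 2
    )
    return [word for word, _ in counts.most_common(limit)]

# Invented sample text, loosely in the spirit of the entry discussed above.
sample = (
    "Barnes and Noble sent me a marketing email. "
    "I had asked not to receive email. "
    "I changed my email preferences, yet the email was sent anyway."
)

top_two = frequency_keywords(sample, limit=2)
print(top_two)
```

In this sample "email" and "sent" come out on top, while "barnes" and "noble" each appear only once and so don't rank. That is the same failure as in the real entry: the subject of the text is not the most repeated word in it.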
This is the problem that the big search engines had a few years ago. Keyword frequency isn’t always a good indicator of the subject of a page. Just because my page has the word "animation" in it repeatedly doesn’t mean it’s about cartoons. That’s why Google was such a big hit. They determined relevance based on what other Web sites thought. If sites about cartoons link to me, then my site is probably about cartoons. So Google uses humans (Web site owners) to determine what my pages are about.
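The "what do the sites linking to me think" idea can be sketched as a simple majority vote. This is a toy illustration only and certainly not Google's actual algorithm. The function name `topic_from_inbound_links`, the site names, and the topic labels are all hypothetical.

```python
from collections import Counter

def topic_from_inbound_links(page, links, known_topics):
    """Guess a page's topic by majority vote among the sites that link to it.

    links: list of (source, target) pairs.
    known_topics: dict mapping a source site to its human-assigned topic.
    Returns None when no linking site has a known topic.
    """
    votes = Counter(
        known_topics[source]
        for source, target in links
        if target == page and source in known_topics
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical link data: two cartoon sites and one software site link to me.
links = [
    ("toonfans.example", "mysite.example"),
    ("cartoon-news.example", "mysite.example"),
    ("flash-tools.example", "mysite.example"),
    ("toonfans.example", "other.example"),
]
known_topics = {
    "toonfans.example": "cartoons",
    "cartoon-news.example": "cartoons",
    "flash-tools.example": "software",
}

print(topic_from_inbound_links("mysite.example", links, known_topics))
```

Note that the text of my own page never enters into it. The guess comes entirely from what people running other sites chose to link to.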
Perhaps someone can suggest a better algorithm for generating keywords?
(Edit, Oct 2005) I released Tagyu to solve this exact problem. Tagyu analyzes the content, the context it’s in, and other factors, and generates a list of keywords. These keywords aren’t extracted from the content; instead, they are created by understanding how a human has classified similar text.
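One way to picture "how a human has classified similar text" is to borrow the keywords of the most similar hand-tagged example. The sketch below is only that, a picture. It does not describe how Tagyu actually works, and the similarity measure (word overlap), the function names, and the tagged examples are assumptions made up for illustration.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Word-overlap similarity between two sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest_keywords(text, tagged_examples):
    """Return the human-assigned keywords of the most similar example.

    tagged_examples: list of (example_text, keywords) pairs.
    """
    target = tokens(text)
    best = max(
        tagged_examples,
        key=lambda example: jaccard(target, tokens(example[0])),
        default=None,
    )
    return list(best[1]) if best else []

# Hypothetical hand-tagged examples.
tagged = [
    ("The store kept sending marketing email after I tried to unsubscribe",
     ["email", "privacy", "customers"]),
    ("The new animated film uses hand drawn cartoon characters",
     ["cartoons", "animation"]),
]

entry = "I asked not to receive marketing email but they sent it anyway"
print(suggest_keywords(entry, tagged))
```

Here the suggestion includes "privacy", a word that never appears in the entry itself. It comes from a person's earlier judgment about a similar text, not from counting words.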