Need someone to lead product management at your software company? I create software for people that create software and I'm looking for my next opportunity. Check out my resume and get in touch.

This is the blog of Adam Kalsey. Unusual depth and complexity. Rich, full body with a hint of nutty earthiness.

Content Management

Automatic Keywords

Freshness Warning
This blog post is over 21 years old. It's possible that the information you read below isn't current and the links no longer work.

Nathan Jacobs wants MT to automatically suggest categories for the current entry.

The question is, how would it know? What criteria would be used to determine the category? I played with the concept of creating a keyword generator for MT that would parse your entry text and create a keyword list. But how to come up with the keywords? Word frequency is the most likely method, and generating keywords solely on word frequencies isn’t likely to create acceptable results.

For example, take a look at a recent entry here, Open Letter to Barnes and Noble:

The entry is about their customer service mistake in sending me a marketing email when I had clearly asked not to receive them. Here’s the list of keywords generated by a word frequency analysis:

email, sent, unsubscribe, received, first, it’s, since, easier, list, now, they, why, preferences, mistake, understand, email preferences, marketing email

Here’s my hand generated list of keywords:

barnes and noble, email, customers, open letter, privacy

Now obviously the automatic list could be improved somewhat by ignoring certain words (it already ignores things like "the" and "and") but there are still going to be limits to the automatic method. The subject of the entry is Barnes & Noble, but the words "Barnes and Noble" don’t appear very often in the text. So going by word frequency alone obviously won’t cut it.

This is the problem that the big search engines had a few years ago. Keyword frequency isn’t always good indicator of the subject of a page. Just because my page has the word "animation" in it repeatedly doesn’t mean it’s about cartoons. That’s why Google was such a big hit. They determined relavancy based on what other Web sites thought. If sites about cartoons link to me, then my site is probably about cartoons. So Google uses humans—Web site owners—to determine what my pages are about.

Perhaps someone can suggest a better algorithm for generating keywords?

(Edit, Oct 2005) I released Tagyu to solve this exact problem, Tagyu analyzes the content, the context it’s in, and other factors and generates a list of keywords. These keywords aren’t extracted from the content, but instead they are created by understanding how a human has classified similar text.

Recently Written

Video calls using a networked camera (Sep 25)
A writeup of my network-powered conference call camera setup.
Roadmap Outcomes, not Features (Sep 4)
Drive success by roadmapping the outcomes you'll create instead of the features you'll deliver.
Different roadmaps for different folks (Sep 2)
The key to effective roadmapping? Different views for different needs.
Micromanaging and competence (Jul 2)
Providing feedback or instruction can be seen as micromanagement unless you provide context.
My productivity operating system (Jun 24)
A framework for super-charging productivity on the things that matter.
Great product managers own the outcomes (May 14)
Being a product manager means never having to say, "that's not my job."
Too Big To Fail (Apr 9)
When a company piles resources on a new product idea, it doesn't have room to fail. But failing is an important part of innovation. If you can't let it fail, it can't succeed.
Go small (Apr 4)
The strengths of a large organization are the opposite of what makes innovation work. Starting something new requires that you start with a small team.

Older...

What I'm Reading