
Automatic Keywords

Freshness Warning
This blog post is over 21 years old. It's possible that the information you read below isn't current and the links no longer work.

Nathan Jacobs wants MT to automatically suggest categories for the current entry.

The question is, how would it know? What criteria would be used to determine the category? I played with the concept of creating a keyword generator for MT that would parse your entry text and create a keyword list. But how to come up with the keywords? Word frequency is the most likely method, but generating keywords solely from word frequencies isn’t likely to produce acceptable results.

For example, take a look at a recent entry here, Open Letter to Barnes and Noble.

The entry is about their customer service mistake in sending me a marketing email when I had clearly asked not to receive them. Here’s the list of keywords generated by a word frequency analysis:

email, sent, unsubscribe, received, first, it’s, since, easier, list, now, they, why, preferences, mistake, understand, email preferences, marketing email

Here’s my hand generated list of keywords:

barnes and noble, email, customers, open letter, privacy

Now obviously the automatic list could be improved somewhat by ignoring certain words (it already ignores things like "the" and "and") but there are still going to be limits to the automatic method. The subject of the entry is Barnes & Noble, but the words "Barnes and Noble" don’t appear very often in the text. So going by word frequency alone obviously won’t cut it.
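To make the limitation concrete, here is a minimal sketch of a word frequency keyword generator of the kind described above. It is not the code that produced the list in this entry; the function name, the ignore list, and the sample text are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical ignore list; a usable one would be much longer.
IGNORE_WORDS = {"the", "and", "a", "an", "to", "of", "in", "i", "it", "is",
                "that", "not", "me", "my", "them", "had", "when", "they"}

def frequency_keywords(text, limit=5):
    """Return up to `limit` words, ranked by how often they occur in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in IGNORE_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]

# Made-up sample text in the spirit of the entry discussed above.
sample = (
    "They sent me a marketing email. I had asked not to receive email. "
    "The unsubscribe page said my email preferences were saved."
)
print(frequency_keywords(sample, limit=3))
```

In this sample the most frequent word is "email". The company the text is about is never named, so no amount of counting could put it in the list.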

This is the problem that the big search engines had a few years ago. Keyword frequency isn’t always a good indicator of the subject of a page. Just because my page has the word "animation" in it repeatedly doesn’t mean it’s about cartoons. That’s why Google was such a big hit. They determined relevancy based on what other Web sites thought. If sites about cartoons link to me, then my site is probably about cartoons. So Google uses humans (Web site owners) to determine what my pages are about.

Perhaps someone can suggest a better algorithm for generating keywords?

Paul
November 27, 2002 4:19 AM

This is one of those problems that seems trivial but isn't. Controlled vocabularies, synonym rings, and the like are one solution. These probably don't make sense for a personal blog, as it would take longer to maintain the lists than to just manually assign categories. An alternative I was playing with for a while is to compare an entry's word frequencies with those of the entries already in each category and make suggestions based on that. My test scripts didn't give particularly useful results, but I think the problem may have been my algorithms rather than the idea.
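One way the comparison described above could be sketched is shown here. The comment does not say which similarity measure was tried, so cosine similarity between word count vectors is an assumption, as are the function names and the sample categories.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Count lowercase alphabetic words in `text`."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def suggest_categories(entry_text, entries_by_category):
    """Rank categories by how closely their existing entries match the new entry."""
    entry = term_counts(entry_text)
    scores = {}
    for category, texts in entries_by_category.items():
        profile = Counter()
        for text in texts:
            profile.update(term_counts(text))
        scores[category] = cosine_similarity(entry, profile)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical categories and entries, for illustration only.
existing = {
    "privacy": ["spam email and unsubscribe lists", "email privacy policy"],
    "design": ["css layout tricks", "textarea width in css"],
}
print(suggest_categories("another marketing email I did not unsubscribe from", existing))
# → ['privacy', 'design']
```

This sketch does no ignore-word filtering, so on real entries common words would dominate the scores. That may be one reason simple test scripts give poor results.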

Scott Johnson
November 28, 2002 3:56 AM

I'd strongly suggest you look at Porter's Stemming Algorithm. It will come close to generating what you need. You might also want to check this comments form on IE 5.5. There are internesting wrapping issues (like I couldn't see the word interesting).

Adam Kalsey
November 28, 2002 2:01 PM

Word stemming would certainly improve the keyword list by combining words with the same root, but it doesn't solve the problem that sometimes the subject of a text isn't well represented by the keywords in the text. Synonyms are often used, and this makes a simple word frequency count inaccurate. Here's a list of things that could improve automatic keyword generation...

Comprehensive list of ignore words. Build a list of words that the keyword generator wouldn't include in the word list.

Phrase dictionary. Some words need to always be paired with others to create key phrases. In my Barnes & Noble example, the word Barnes shouldn't be in the keyword list alone. It should read "Barnes & Noble."

Keyword thesaurus. Words that mean the same thing should be accounted for. It should recognize that Barnes & Noble, Barnes and Noble, BN, and BN.com are all the same.

Controlled vocabulary. Always use the same terms for the same things. The problem with this approach is that using the same terms repeatedly isn't a good writing practice. Combining this with the keyword thesaurus would allow the writer to use a greater variety of words.

Word stemming. Run and running are the same thing as far as keywords are concerned. This technique would be most effective when combined with a thesaurus and the ignore words.

Keyword weighting. Establish a weighting system for keywords, much like many search engines do. Words in the entry's title have more significance than those in the body, and the closer a word is to the beginning of an entry, the higher weight it receives. Other things like headers would be factored in as well.

I'm sure that other ideas from the disciplines of information architecture, search engine optimization, and search index creation could be applied to this problem.
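Several of these ideas can be combined in a small normalization step that runs before counting. What follows is only a rough sketch: every table in it is a hypothetical example, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer such as Porter's, not an implementation of it.

```python
import re
from collections import Counter

# All tables below are made-up examples, not a real dictionary.
IGNORE_WORDS = {"the", "a", "i", "to", "my", "from", "at", "was", "so"}
PHRASES = ["barnes & noble", "barnes and noble"]   # phrase dictionary
THESAURUS = {                                      # variant -> preferred term
    "barnes and noble": "barnes & noble",
    "bn": "barnes & noble",
    "bn.com": "barnes & noble",
}

# Known phrases are tried first, so "barnes" never becomes a token on its own.
TOKEN_RE = re.compile(
    "|".join(re.escape(p) for p in PHRASES) + r"|[a-z0-9]+(?:\.[a-z0-9]+)*"
)

def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer, not Porter's."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant: "running" -> "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

def keywords(text, limit=5):
    """Count tokens after phrase matching, thesaurus lookup, filtering, and stemming."""
    counts = Counter()
    for token in TOKEN_RE.findall(text.lower()):
        token = THESAURUS.get(token, token)
        if token in IGNORE_WORDS:
            continue
        if token.isalpha():          # leave phrases and dotted names unstemmed
            token = crude_stem(token)
        counts[token] += 1
    return [word for word, _ in counts.most_common(limit)]

sample = "I ordered from Barnes and Noble. BN.com was running late, so I emailed BN."
print(keywords(sample, limit=1))
# → ['barnes & noble']
```

The three spellings of the company name are counted as one term, which is what lifts it to the top of the list. Weighting by title and position is left out of this sketch.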

Adam Kalsey
November 28, 2002 2:09 PM

The textarea wrapping problem is an apparent IE bug. The textarea uses CSS to set its width at 100%. It fits perfectly on the screen, but as soon as you start typing in IE, the textarea widens to fill the entire width of the screen. I'm working on ideas for a solution, but I haven't come up with anything concrete yet. If anyone has a solution for this bug, I'd appreciate it.

Josh Santangelo
December 8, 2002 1:42 PM

This is a problem that I've been dealing with for the past few weeks, and there's a whole field of computer science dedicated to the analysis of unstructured data like this. I really need a keyword generator for a project of mine (tangent.cx), but I don't think I'm really qualified to take on the problem that no one seems to have really solved yet. I'll follow this site to see what you come up with.

Don
June 15, 2005 9:29 AM

I developed an algorithm for doing this that considers the title, the meta-data, any H1 headers, and the content. Each is weighted differently (Title = 20, Meta-data = 15, etc.), and for a well-structured, static page this gives good, solid, dependable results.
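A field-weighted scorer along these lines might look like the sketch below. It is not Don's algorithm. Only the title and meta-data weights come from the comment; the H1 and body weights, the function name, and the sample page are placeholders chosen for illustration.

```python
import re
from collections import Counter

# Title and meta-data weights are quoted from the comment above.
# The H1 and body weights are NOT given there; these values are placeholders.
FIELD_WEIGHTS = {"title": 20, "meta": 15, "h1": 10, "body": 1}

def weighted_keywords(fields, limit=5):
    """Score each word by the summed weight of every field occurrence."""
    scores = Counter()
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 0)   # unknown fields contribute nothing
        for word in re.findall(r"[a-z]+", text.lower()):
            scores[word] += weight
    return [word for word, _ in scores.most_common(limit)]

# Made-up page content, for illustration only.
page = {
    "title": "Open Letter to Barnes and Noble",
    "meta": "barnes noble email privacy",
    "h1": "Open Letter",
    "body": "email email email email unsubscribe email preferences",
}
print(weighted_keywords(page, limit=2))
# → ['barnes', 'noble']
```

Here "email" appears five times in the body but still scores below words that appear once in both the title and the meta-data. The sketch has no ignore list, so title words like "to" and "and" also score highly and would need filtering.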


© 1999-2024 Adam Kalsey.