Automatic Keywords

Freshness Warning
This blog post is over 18 years old. It's possible that the information you read below isn't current and the links no longer work.

Nathan Jacobs wants MT to automatically suggest categories for the current entry.

The question is, how would it know? What criteria would be used to determine the category? I played with the concept of creating a keyword generator for MT that would parse your entry text and create a keyword list. But how to come up with the keywords? Word frequency is the most likely method, but generating keywords solely from word frequencies isn’t likely to create acceptable results.

For example, take a look at a recent entry here, Open Letter to Barnes and Noble.

The entry is about their customer service mistake in sending me a marketing email when I had clearly asked not to receive them. Here’s the list of keywords generated by a word frequency analysis:

email, sent, unsubscribe, received, first, it’s, since, easier, list, now, they, why, preferences, mistake, understand, email preferences, marketing email

Here’s my hand-generated list of keywords:

barnes and noble, email, customers, open letter, privacy

Now obviously the automatic list could be improved somewhat by ignoring certain words (it already ignores things like "the" and "and"), but there are still going to be limits to the automatic method. The subject of the entry is Barnes & Noble, but the words "Barnes and Noble" don’t appear very often in the text. So going by word frequency alone won’t cut it.
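
A word frequency generator along these lines might look like the minimal Python sketch below. The ignore list, the function name frequency_keywords, and the sample text are illustrative assumptions, not the code that produced the list above.

```python
import re
from collections import Counter

# Hypothetical ignore list; a real one would be far longer.
IGNORE_WORDS = {
    "the", "and", "a", "to", "of", "i", "it", "is", "in", "that",
    "they", "my", "me", "not", "when", "had",
}

def frequency_keywords(text, limit=5):
    """Return the most frequent words that are not on the ignore list."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in IGNORE_WORDS)
    return [word for word, _ in counts.most_common(limit)]

# Made-up sample text, loosely modeled on the entry described above.
sample = (
    "They sent me a marketing email. I had asked them not to email me. "
    "I tried to unsubscribe and they sent another email."
)
print(frequency_keywords(sample, limit=2))
```

Note that the sample never names the company, so no amount of counting would surface it as a keyword. That is the same limit described above.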

This is the problem that the big search engines had a few years ago. Keyword frequency isn’t always a good indicator of the subject of a page. Just because my page has the word "animation" in it repeatedly doesn’t mean it’s about cartoons. That’s why Google was such a big hit. They determined relevancy based on what other Web sites thought. If sites about cartoons link to me, then my site is probably about cartoons. So Google uses humans (Web site owners) to determine what my pages are about.

Perhaps someone can suggest a better algorithm for generating keywords?

Paul
November 27, 2002 4:19 AM

This is one of those problems that seems trivial but isn't. Controlled vocabularies, synonym rings, etc. are one solution. These probably don't make sense for a personal blog, as it would take longer to maintain the lists than to just manually assign categories. An alternative I was playing with for a while is to compare an entry's word frequencies with those of the entries already in a category and make suggestions based on that. My test scripts didn't give particularly useful results, but I think the problem may have been my algorithms rather than the idea.
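
One way to sketch the comparison described above is to build a word count profile for each category and rank categories by cosine similarity to the new entry. This is only an assumed approach, not the test scripts mentioned in the comment; the function names and sample categories are hypothetical.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count lowercase words in a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def suggest_categories(entry_text, categories, limit=3):
    """categories maps a category name to the entry texts already filed there."""
    entry = word_counts(entry_text)
    scores = {}
    for name, texts in categories.items():
        profile = Counter()
        for text in texts:
            profile.update(word_counts(text))
        scores[name] = cosine_similarity(entry, profile)
    return sorted(scores, key=scores.get, reverse=True)[:limit]

# Hypothetical categories with a few existing entries each.
categories = {
    "Email": ["unsubscribe from marketing email lists", "email privacy preferences"],
    "Design": ["css layout and textarea width bugs"],
}
print(suggest_categories("They sent a marketing email after I chose to unsubscribe", categories))
```

Raw counts favor common words, so filtering ignore words or weighting rare terms more heavily would probably be needed before the suggestions became useful.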

Scott Johnson
November 28, 2002 3:56 AM

I'd strongly suggest you look at Porter's Stemming Algorithm. It will come close to generating what you need. You might also want to check this comments form on IE 5.5. There are internesting wrapping issues (like I couldn't see the word interesting).

Adam Kalsey
November 28, 2002 2:01 PM

Word stemming would certainly improve the keyword list by combining words with the same root, but it doesn't solve the problem that sometimes the subject of a text isn't well represented by the keywords in the text. Synonyms are often used, and this makes a simple word frequency inaccurate.

Here's a list of things that could improve automatic keyword generation...

Comprehensive list of ignore words. Build a list of words that the keyword generator wouldn't include in the word list.

Phrase dictionary. Some words need to always be paired with others to create key phrases. In my Barnes & Noble example, the word Barnes shouldn't be in the keyword list alone. It should read "Barnes & Noble."

Keyword thesaurus. Words that mean the same thing should be accounted for. It should recognize that Barnes & Noble, Barnes and Noble, BN, and BN.com are all the same.

Controlled vocabulary. Always use the same terms for the same things. The problem with this approach is that using the same terms repeatedly isn't a good writing practice. Combining this with the keyword thesaurus would allow the writer to use a greater variety of words.

Word stemming. Run and running are the same thing as far as keywords are concerned. This technique would be most effective when combined with a thesaurus and the ignore words.

Keyword weighting. Establish a weighting system for keywords, much like many search engines do. Words in the entry's title have more significance than those in the body, and the closer a word is to the beginning of an entry, the higher weight it receives. Other things like headers would be factored in as well.

I'm sure that other ideas from the disciplines of information architecture, search engine optimization, and search index creation could be applied to this problem.
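
As a rough illustration of how the ignore words, phrase dictionary, and keyword thesaurus ideas above could fit together, here is a minimal Python sketch. Every list entry and function name in it is a made-up example, and stemming and weighting are left out to keep it short.

```python
import re
from collections import Counter

# All three tables are illustrative assumptions; real ones would be much larger.
IGNORE_WORDS = {"the", "and", "a", "an", "to", "of", "i", "it", "is", "in", "me", "my"}
THESAURUS = {  # variant -> canonical key phrase
    "barnes & noble": "barnes and noble",
    "bn.com": "barnes and noble",
    "bn": "barnes and noble",
}
PHRASES = ["barnes and noble", "open letter", "marketing email"]

def tokens(text):
    """Lowercase, fold synonyms, keep dictionary phrases whole, drop ignore words."""
    text = text.lower()
    # Replace longer variants first so "bn.com" is handled before "bn".
    for variant in sorted(THESAURUS, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        text = re.sub(pattern, THESAURUS[variant], text)
    # Join each known phrase with underscores so it survives as one token.
    for phrase in PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    words = re.findall(r"[a-z_']+", text)
    return [w.replace("_", " ") for w in words if w not in IGNORE_WORDS]

def normalized_keywords(text, limit=5):
    """Rank the normalized tokens by frequency."""
    return [word for word, _ in Counter(tokens(text)).most_common(limit)]

sample = (
    "Open letter to Barnes and Noble. BN.com sent me a marketing email. "
    "Barnes & Noble ignored my email preferences."
)
print(normalized_keywords(sample, limit=3))
```

With the three spellings folded into one phrase, the company name becomes the most frequent term in the sample, even though no single spelling appears more than once.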

Adam Kalsey
November 28, 2002 2:09 PM

The textarea wrapping problem is an apparent IE bug. The textarea uses CSS to set its width at 100%. It fits perfectly on the screen, but as soon as you start typing in IE, the textarea widens to fill the entire width of the screen. I'm working on ideas for a solution, but I haven't come up with anything concrete yet. If anyone has a solution for this bug, I'd appreciate it.

Josh Santangelo
December 8, 2002 1:42 PM

This is a problem that I've been dealing with for the past few weeks, and there's a whole field of computer science dedicated to the analysis of unstructured data like this. I really need a keyword generator for a project of mine (tangent.cx), but I don't think I'm really qualified to take on the problem that no one seems to have really solved yet. I'll follow this site to see what you come up with.

Don
June 15, 2005 9:29 AM

I developed an algorithm for doing this that considers the title, the meta-data, any H1 headers, and the content. Each is weighted differently (Title = 20, Meta-data = 15, etc.), and for a well-structured, static page this gives good, solid, dependable results.
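
A field-weighted scorer in the spirit of this comment could be sketched as follows. Only the title and meta-data weights come from the comment; the H1 and content weights, the function name, and the sample page are assumptions made for illustration, and this is not the commenter's actual algorithm.

```python
import re
from collections import Counter

# Title and meta-data weights are from the comment above. The H1 and
# content weights are not given there and are assumed for this sketch.
FIELD_WEIGHTS = {"title": 20, "meta": 15, "h1": 10, "content": 1}

def score_keywords(fields, limit=5):
    """fields maps a field name ("title", "meta", "h1", "content") to its text."""
    scores = Counter()
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 0)
        for word in re.findall(r"[a-z']+", text.lower()):
            scores[word] += weight
    return [word for word, _ in scores.most_common(limit)]

# Hypothetical page, already split into its fields.
page = {
    "title": "Barnes Noble",
    "meta": "email privacy",
    "h1": "Open letter",
    "content": "email email email sent",
}
print(score_keywords(page, limit=3))
```

Here a single appearance in the title outweighs several mentions in the content, which is the effect the weighting is meant to have.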

This discussion has been closed.

© 1999-2020 Adam Kalsey.