You mark that frame an 8 and you're entering a world of pain
Loosely Typed in Ohio

Designing tag-based search systems

Tagging that works

A large project we launched recently has me thinking a lot about search. Search on most commercial websites has historically been a kind of cruel joke on users, both because search is usually an afterthought, implemented as a rudimentary full-text database query, and because web search is hard.

In general, doing good web search requires either machine learning (see Google) or careful “results tuning” on the part of site designers. The machine learning approach only works with huge budgets and huge user bases, so for MemberHealth we used Lucene complemented by semantic analysis (woot!). We implemented that through a combination of page-specific “tags” and domain-specific keywords. For example, the site is for a major pharmaceutical insurance company, so we check each search query against a global list of prescription drugs and return “drug search application” results when the query matches.

Both of these efforts result in a kind of uber tagging system where results are served according to carefully designed matches made courtesy of painstaking content editing and a $100K text file.
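To make the two mechanisms described above a little more concrete, here is a minimal sketch in Python. It is not our production code: the drug list, the page tags, the URLs, and the `full_text_search` stand-in for Lucene are all hypothetical, made up purely for illustration.

```python
# Hypothetical data: none of these names or URLs come from the real site.
DRUG_NAMES = {"lipitor", "metformin", "zocor"}

# Editor-assigned tags per page (the "painstaking content editing" part).
PAGE_TAGS = {
    "/plans/coverage-gap": {"donut hole", "coverage gap"},
    "/help/mail-order": {"mail order", "home delivery"},
}


def full_text_search(query):
    """Stand-in for the Lucene full-text query; returns no hits in this sketch."""
    return []


def search(query):
    """Return result URLs: drug application first, then tagged pages, then full text."""
    normalized = query.lower().strip()
    results = []

    # Domain keyword match: any query word that is a known drug name.
    if any(word in DRUG_NAMES for word in normalized.split()):
        results.append("/apps/drug-search?q=" + normalized.replace(" ", "+"))

    # Page-tag match: any editor-assigned tag that appears in the query.
    for url, tags in PAGE_TAGS.items():
        if any(tag in normalized for tag in tags):
            results.append(url)

    # Append ordinary full-text hits, skipping anything already listed.
    for url in full_text_search(normalized):
        if url not in results:
            results.append(url)
    return results
```

The ordering is the design choice worth noticing: hand-tuned matches always outrank whatever the full-text engine returns, which is one simple way to do "results tuning" without any machine learning.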

Tagging that doesn’t

So, I’ve been using Delicious for some time now, and despite its annoying URI, I like how quick it is to post bookmarks. The problem with Delicious, though, is that it doesn’t work. Well, it works for sharing links with friends (see popular), but it doesn’t work for classifying documents for later retrieval. In fact, relying on a personal bookmark cache is sort of like using a Groundhog Day version of Excite, which searches only meta name="keywords" where content="Web20, ToRead, Apple, NataliePortman".

It could be that I’m an idiot, but I think the problem more likely stems from the fact that classifying documents is subjective, emotional, and too much work to do carefully when all I’m trying to do is read Slashdot. Moreover, making links useful takes more than just classification: I need to anticipate how I will think about them in the future, so that I can retrieve them. Two weeks ago I was thinking about The Office as being “hilarious,” “can’t miss,” and “smart,” and now I’m thinking about it as being “boring,” “go out instead,” and “insipid.”

In my Delicious account, I have 374 links and 218 distinct tags, for roughly 1.7 links per tag, but the median is much lower, closer to 1 link per tag. Of course, links can have more than one tag, but the ratio there is also about 1:1. So, we have essentially a simple directory-based filesystem. But with one file per directory. And with search that only operates upon directory names. Kind of like Windows Explorer.
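If you want to run the same numbers on your own bookmarks, a rough sketch like the one below would do it. The `bookmarks` dictionary is a tiny made-up sample, not my actual Delicious export, and `links_per_tag_stats` is just a name I picked for illustration.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample: URL -> list of tags (not a real Delicious export).
bookmarks = {
    "http://example.com/a": ["web20", "toread"],
    "http://example.com/b": ["apple"],
    "http://example.com/c": ["web20"],
    "http://example.com/d": ["web20", "python"],
}


def links_per_tag_stats(bookmarks):
    """Return (mean, median) of the number of links filed under each tag."""
    # Count each tag once per link, even if a link repeats a tag.
    counts = Counter(tag for tags in bookmarks.values() for tag in set(tags))
    per_tag = list(counts.values())
    return mean(per_tag), median(per_tag)


avg, mid = links_per_tag_stats(bookmarks)
print(f"mean links per tag: {avg}, median links per tag: {mid}")
```

A mean well above the median is the telltale sign: a couple of catch-all tags hold most of the links, and nearly every other tag is a directory with one file in it.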

Wrapping up

Users can’t self-classify documents. We’ve known that forever. Classification is particularly difficult when tag-based systems resemble directory-based systems where each document has one and only one tag. Designing systems that don’t rely upon smarter semantic search, trained by site designers, is nearly always a mistake.

Here are some things to re-consider when you design search for your web application:

  • Do I have only one search box on the site? Presenting multiple search boxes is confusing to users who don’t understand your implementation model. Make sure search looks in all data sources on the site, not just your bodyText database column. You need to search vertical applications and .pdf documents from one single text input. Remember that users don’t think about your application architecture: they see one search box and will use it to find everything on your site.
  • Does the language my users speak match the way my site is written? Consider adding page-tags to full-text search to make sure visitors find the right pages. In particular, think about layman-izing jargon and acronyms.
  • Are you doing the work of classifying search results or are your users? Asking users to bookmark, tag, save, navigate to, or otherwise go out of their way to find content is just wrong. Even if you are being passive and doing no classification of your own, the ordering of your results is still a way to help your users find their content.
  • Lastly, don’t forget about Google. Even if your site isn’t SEO’d, make sure you know what the page summaries look like in Google. In particular, look at page titles and summaries for common searches like: foo site:mydomain.com
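The "one search box" item in the list above is the easiest to sketch in code. This is only an illustration under loud assumptions: the three sources are in-memory dictionaries standing in for a page database, extracted .pdf text, and a vertical application, and every URL and string in them is invented.

```python
# Hypothetical in-memory stand-ins for three separate data sources.
PAGES = {
    "/about": "About our pharmacy plans",
    "/faq": "Frequently asked questions about claims",
}
PDFS = {"/docs/formulary.pdf": "Formulary: covered drugs and tiers"}
APPS = {"/apps/pharmacy-locator": "Find a pharmacy near you"}

SOURCES = {"page": PAGES, "pdf": PDFS, "app": APPS}


def search_all(query):
    """Run one query against every source and return (kind, url) pairs."""
    needle = query.lower()
    hits = []
    for kind, docs in SOURCES.items():
        for url, text in docs.items():
            # Naive substring match; a real site would query an index here.
            if needle in text.lower():
                hits.append((kind, url))
    return hits


print(search_all("pharmacy"))
```

The point is only that the fan-out happens behind the single text input, so the visitor never has to know that pages, documents, and applications live in different places.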
