Search: Separating the Wheat from the Chaff with Information Equity

Search on corporate intranets is difficult, often because algorithms based on PageRank don’t work particularly well. In short, PageRank is a “vote” by all the other pages on the Web about how important a page is. A popular document on the corporate intranet may have very few pages linking to it. Without PageRank, the search results are ordered largely by keyword frequency, metadata, and currency. This makes it almost impossible to find a given page with a popular or overloaded term in the title, such as Solaris 10, Cloud Computing, or Identity Management.

SunSpace, Sun’s internal enterprise wiki with an integrated document repository, is based around the notion of Community Equity. Each person and each document is assigned an equity value. A document’s Information Equity is mostly based on:

  1. Hits or downloads.
  2. Updates.
  3. Number of different people accessing or updating the page (very important in a collaborative wiki!).
  4. Currency. The equity decreases over time.
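
To make the four factors above concrete, here is a minimal sketch of how such a score could be combined. It is not the actual SunSpace formula: the weights, the exponential half-life decay, and the names `DocumentActivity` and `information_equity` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DocumentActivity:
    """Activity counters for one document (hypothetical structure)."""
    hits: int                        # page views or downloads
    updates: int                     # number of edits
    distinct_people: int             # different people who accessed or updated it
    days_since_last_activity: float  # used for the currency factor


def information_equity(doc: DocumentActivity,
                       w_hits: float = 1.0,
                       w_updates: float = 5.0,
                       w_people: float = 10.0,
                       half_life_days: float = 90.0) -> float:
    """Combine the activity counters into one score.

    The weights and the half-life are illustrative assumptions; the
    article only says that equity depends on these factors and
    decreases over time.
    """
    raw = (w_hits * doc.hits
           + w_updates * doc.updates
           + w_people * doc.distinct_people)
    # Assumed decay model: the score halves every `half_life_days`.
    decay = 0.5 ** (doc.days_since_last_activity / half_life_days)
    return raw * decay
```

With these assumed defaults, a document with 100 hits, 4 updates, and 3 distinct visitors scores 150 when fresh and 75 after 90 days without activity.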

So how does this fit into search … ?

SunSpace search has a three-tier architecture. The back end is a commercial search engine that indexes the content (wiki pages and uploaded documents) and delivers the search results. We spend a lot of time tuning this engine so that the optimal weighting is given to titles, URLs, keywords, and various metadata.

The middle tier is a set of feed readers that monitor all the updates on SunSpace. The feed readers create “stubfiles” for every document that contain all the metadata, which includes equity, tags, creator, and last updated date. The feed readers run continuously and notify the back end whenever a page is created or updated. A big advantage is that new pages and updates are added to the search index immediately, with no more waiting days until the crawler finds the new documents.
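
As a rough illustration of what a feed reader might produce, the sketch below turns one feed entry into a stub record holding the metadata fields named above. The real stubfile format is not described here, so the JSON serialization, the `Stub` class, the function names, and the entry keys (`url`, `creator`, `updated`, `equity`, `tags`) are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Stub:
    """Metadata kept alongside each indexed document (assumed layout)."""
    url: str
    creator: str
    last_updated: str  # ISO 8601 date string
    equity: float
    tags: List[str] = field(default_factory=list)


def stub_from_feed_entry(entry: dict) -> Stub:
    """Build a stub from one feed entry; the entry keys are assumptions."""
    return Stub(
        url=entry["url"],
        creator=entry["creator"],
        last_updated=entry["updated"],
        equity=float(entry.get("equity", 0.0)),
        # De-duplicate and sort tags so repeated updates yield stable stubs.
        tags=sorted(set(entry.get("tags", []))),
    )


def serialize_stub(stub: Stub) -> str:
    """Serialize the stub as JSON (an assumed on-disk format)."""
    return json.dumps(asdict(stub), sort_keys=True)
```

A feed reader built along these lines would write the serialized stub next to the document and then notify the back end that the page changed.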

The front end is where most of the action is. When a user types in a search query, it is submitted to the back end and 100 results are returned in XML format. The stubfile for each result is then read. The initial ordering of the results comes directly from the back end search engine, but the user is given the option to Sort by Information Equity or Sort by Date. We also display other relevant information, such as tag clouds and communities, for these 100 results. Each document has a creator, who is generally a primary contributor to the page. We assemble a list of the creators for the results, credit each creator with the Information Equity of the results they created, and display a list sorted by the sum of Information Equity.
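
The re-sorting and the creator list described above can be sketched in a few lines. This is only an illustration under assumptions: the `Result` fields and the function names are invented here, and dates are assumed to be ISO 8601 strings so that string order matches date order.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Result:
    """One search result joined with its stub metadata (assumed fields)."""
    title: str
    creator: str
    equity: float
    last_updated: str  # ISO 8601, so lexicographic order is date order


def sort_by_equity(results: List[Result]) -> List[Result]:
    """Highest Information Equity first."""
    return sorted(results, key=lambda r: r.equity, reverse=True)


def sort_by_date(results: List[Result]) -> List[Result]:
    """Most recently updated first."""
    return sorted(results, key=lambda r: r.last_updated, reverse=True)


def rank_creators(results: List[Result]) -> List[Tuple[str, float]]:
    """Credit each creator with the equity of their results, then rank."""
    totals: Dict[str, float] = defaultdict(float)
    for r in results:
        totals[r.creator] += r.equity
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Because the ranking is computed only over the results of the current query, the same person can rank first for one search and not appear at all for another.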

A few simple use cases:

  1. Find a new document. I am constantly creating documents on SunSpace, and generally am not careful with the titles. Two hours later, when I have forgotten the title and need to forward the URL to a colleague, I simply search for myself, then Sort by Date.
  2. Find a popular document. The second case refers back to the beginning of this article. Let’s say a colleague mentions a cool wiki page on Solaris 10 that’s all the buzz. I’d search for “Solaris 10”, then sort by Information Equity.
  3. Find the expert. I have a big presentation on Cloud Computing, and need to seek the advice of a knowledgeable colleague. I search for Cloud Computing, then refer to the right of the results page for the people with the most Information Equity for that particular search.