Thursday, November 13, 2008

Lucene Overview Part One: Creating the Index

Introduction

I've recently been working with the open source search engine Lucene. I'm no expert, but since I have just pored through some rather sparse documentation and migrated an application from a very old version of Lucene to the latest version, 2.4, I'm pretty clear on the big picture. The documentation for Lucene leaves a bit to the imagination, so I thought I'd take this opportunity to share a high level overview of Lucene while it's fresh in my mind.

If you find this page looking for introductory material to Lucene, good for you! That's what it's for. Don't expect best practices, production-ready code or advanced topics. You will find a clear introduction to the conceptual architecture of Lucene, with which you will be able to productively approach the FAQs and tutorials on the project web site. I'm using the Java implementation of Lucene, but all of this high-level stuff would apply equally to any of the other Lucene flavors.

The first thing you should understand is what Lucene actually does. Lucene only does two things really.
  1. It creates search indexes.
  2. It searches for content in those indexes.
An index is an efficiently navigable representation of whatever data you need to make searchable. Your data might be as simple as a set of Word documents in a content management system, or it might be records from a database, HTML pages, or any kind of data object in your system. It's up to you to decide what entities you want to make searchable. For our discussion, we'll assume that we are working with a set of Word documents.
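To make "efficiently navigable" a little more concrete, here is a minimal sketch of the two jobs in plain Java. It is a toy of my own, not Lucene's API or its index format, and the class name ToyIndex is made up for illustration. The point is only that the work happens once at indexing time, so that a search is a lookup rather than a scan over every text.

```java
import java.util.*;

// Toy inverted index: maps each lowercase word to the ids of the texts containing it.
// Illustration only; this is not Lucene's API or its on-disk format.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();
    private int nextId = 0;

    // Job 1: create the index by adding each piece of text. Returns the id assigned to it.
    int add(String text) {
        int id = nextId++;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<Integer>()).add(id);
            }
        }
        return id;
    }

    // Job 2: search the index. This is a map lookup, not a scan of the texts.
    Set<Integer> search(String word) {
        Set<Integer> hits = postings.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }
}
```

Adding "Quarterly sales report" and then "Sales meeting notes" assigns ids 0 and 1, and a search for "sales" finds both of them.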

Create the Index

So, step one is to create the index for our set of Word documents. To do this, we need to write some code that takes the information from the Word documents and turns it into a searchable index. The only way to do this is by brute force. We'll have to iterate over each of the Word documents, examining each and converting each into the pieces that Lucene needs to work with when it creates the index.

What are the pieces that Lucene needs to create the index? There are two.

  1. Documents
  2. Fields
These two abstractions are so key to Lucene that Lucene represents them with two top level Java classes, Document and Field. A Document, not to be confused with our actual Word documents, is a Java class that represents a searchable item in Lucene. By searchable item, we mean that a Document is the thing that you find when you search. It's up to you to create these Documents.

Lucky for us, it's a pretty clear step from an actual Word document to a Lucene Document. I think anyone would agree that it will be the Word documents that our users will want to find when they conduct a search. This makes our processing rather simple: we will create a single Lucene Document for each of our actual Word documents.

Create the Document and its Fields

But how do we do that? It's actually very easy. First, we make the Document object, with the new operator -- nothing more. But at this point the Document is meaningless. We now have to decide what Fields to add to the Document. This is the part where we have to think. A Document is made of any number of Fields, and each Field has a name and a value. That's all there is to it.

Two Fields are created almost universally by developers creating Lucene indexes. The most important Field will be the "content" Field. This is the Field that holds the content of the Word document for which we are creating the Lucene Document. Bear in mind, the name of the Field is entirely arbitrary, but most people call one of the Fields "content" and they stick the actual content of the real world searchable object, the Word document in our case, into the value of that Field. In essence, a Field is just a name/value pair.

Another very common Field that developers create is the "title" Field. This Field's value will be the title of the Word document. What other information about the Word document might we want to keep in our index? Other common Fields are things like "author", "creation_date", "keywords", etc. The identification of the Fields that you will need is entirely driven by your business requirements.

So, for each Word document that we want to make searchable, we will have to create a Lucene Document, with Fields such as those we outlined above. Once we have created the Document with those Fields, we then add it to the Lucene index writer and ask it to write our index. That's it! We now have a searchable index. This is true, but we may have glossed over a couple of Field details. Let's take a closer look at Fields.
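The loop described above can be sketched in plain Java. Everything here is a stand-in of my own: WordFile pretends the text has already been pulled out of the Word document, and ToyDocument and IndexBuilder are hypothetical names rather than Lucene's classes. The shape is what matters: one Document per Word document, a few name/value Fields on each, and then the Document is handed off.

```java
import java.util.*;

// Hypothetical stand-in for a Word document whose text has already been extracted.
class WordFile {
    final String title, author, body;
    WordFile(String title, String author, String body) {
        this.title = title;
        this.author = author;
        this.body = body;
    }
}

// Toy Document: a collection of name/value Fields (one value per name, for simplicity).
class ToyDocument {
    private final Map<String, String> fields = new LinkedHashMap<String, String>();
    void add(String name, String value) { fields.put(name, value); }
    String get(String name) { return fields.get(name); }
    Set<String> fieldNames() { return fields.keySet(); }
}

class IndexBuilder {
    // Brute force: visit every Word file and build exactly one toy Document for it.
    static List<ToyDocument> build(List<WordFile> files) {
        List<ToyDocument> docs = new ArrayList<ToyDocument>();
        for (WordFile f : files) {
            ToyDocument doc = new ToyDocument();
            doc.add("title", f.title);     // the Field names are arbitrary choices
            doc.add("author", f.author);
            doc.add("content", f.body);
            docs.add(doc);                 // with Lucene, the Document would go to the index writer here
        }
        return docs;
    }
}
```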

Field Details: Stored or Indexed?

A Field may be kept in the index in more than one way. The most obvious way, and perhaps the only way that you might at first suspect the existence of, is the searchable way. In our example, we fully expect that if the user types in a word that exists in the contents of one of the Word documents, then the search will return that Word document in the search results. To do this, Lucene must index that Field. The nomenclature is a bit confusing at first, but note that it is entirely possible to "store" a Field in the index without making it searchable. In other words, it's possible to "store" a Field but not "index" it. Why? You'll see shortly.

The first distinction that Lucene makes in how it keeps a Field in the index is whether the Field is stored or indexed. If we expect a match on a Field's value to cause the Document to be hit by the search, then we must index the Field. If we only store the Field, its value can't be reached by search queries. Why then store a Field? Simple: when we hit the Document via one of the indexed Fields, Lucene will return us the entire Document object. All stored Fields will be available on that Document object; Fields that were indexed but not stored will not be. An indexed Field is information used to find a Document; a stored Field is information returned with the Document. Two different things.

This means that while we might not make searches based upon the contents of a given Field, we might still be able to make use of that Field's value when the Document is returned by the search. The most obvious use case I can think of is a "url" Field for a web based Document. It makes no sense to search for the value of a URL, but you will definitely want to know the URL for the documents that your search returns. How else would your results page be able to steer the user to the hit page? This is a very important point: a stored Field's value will be available on the Document returned by a search, but only an indexed Field's value can actually be used as the target of a search.

Technically, stored Fields are kept within the Lucene index. But we must keep track of the fact that an indexed Field is different than a stored Field. Unfortunate nomenclature. This is why words matter. They can save on a lot of confusion.
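Here is a small plain Java sketch of that distinction. Again, this is a toy model of my own and not how Lucene is implemented; the class name StoredVsIndexed and its methods are made up. It keeps two separate structures, one used to find Documents and one used to hand values back, which is really all the stored/indexed split amounts to conceptually.

```java
import java.util.*;

// Toy model of "stored" versus "indexed". Hypothetical names, not Lucene's API.
class StoredVsIndexed {
    // Used to FIND documents: "field:value" -> ids of the documents containing it.
    private final Map<String, List<Integer>> terms = new HashMap<String, List<Integer>>();
    // Used to RETURN values: for each document id, the fields that were stored.
    private final List<Map<String, String>> storedValues = new ArrayList<Map<String, String>>();

    int addDocument() {
        storedValues.add(new HashMap<String, String>());
        return storedValues.size() - 1;
    }

    void addField(int docId, String name, String value, boolean store, boolean index) {
        if (store) {
            storedValues.get(docId).put(name, value);
        }
        if (index) {
            terms.computeIfAbsent(name + ":" + value, k -> new ArrayList<Integer>()).add(docId);
        }
    }

    // Only indexed fields can produce a hit.
    List<Integer> search(String name, String value) {
        List<Integer> hits = terms.get(name + ":" + value);
        return hits == null ? Collections.<Integer>emptyList() : hits;
    }

    // Only stored fields come back with the hit.
    Map<String, String> fetch(int docId) {
        return storedValues.get(docId);
    }
}
```

If a "url" Field is stored but not indexed, and a "keyword" Field is indexed but not stored, then searching on the keyword finds the document and the fetched document carries the URL. Searching on the URL finds nothing, and the keyword itself is not on the fetched document.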

Indexed Fields: Analyzed or Not Analyzed?

For the next wrinkle, we must point out that an indexed Field can be indexed in two different fashions. First, we can index the value of the Field in a single chunk. In other words, we might have a "phone number" Field. When we search for phone numbers, we need to match the entire value or nothing. This makes perfect sense. So, for a Field like phone number, we index the entire value ATOMICALLY into the Lucene index.

But let's consider the "content" Field of the Word document. Do we want the user to have to match that entire Field? Certainly not. We want the contents of the Word document to be broken down into searchable tokens. This process is known as analysis. We can start by throwing out all of the unimportant words like "a", "the", "and", etc. There are many other optimizations we can make, but the bottom line is that the value of a Field like "content" should be analyzed by Lucene. This produces a targeted lightweight index. This is how search becomes efficient and powerful.
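As a rough illustration of the difference, here is a toy analyzer in plain Java. It is a sketch under my own assumptions: the stop word list is invented for the example, the tokenizing rule is far cruder than what a real Lucene analyzer does, and ToyAnalyzer is not a Lucene class.

```java
import java.util.*;

// Toy analysis: hypothetical names and a made-up stop word list, not a Lucene analyzer.
class ToyAnalyzer {
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "the", "of"));

    // Analyzed: break the value into lowercase tokens and drop the unimportant words.
    static List<String> analyze(String value) {
        List<String> tokens = new ArrayList<String>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Not analyzed: the whole value is kept as one single searchable term.
    static List<String> atomic(String value) {
        return Collections.singletonList(value);
    }
}
```

With this toy, analyzing "The Cat and the Hat" leaves the two tokens "cat" and "hat", while a phone number like "555-1234" kept atomically stays one term. Run the same phone number through analyze and it splits into "555" and "1234", which is exactly why you would not analyze that kind of Field.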

In the APIs, this comes down to the fact that when we create a Field, we must specify
  1. Whether to STORE it or not
  2. Whether to INDEX it or not
    • If indexing, whether to ANALYZE it or not
Now you should be clear on the details of Fields. Importantly, we can both store and index a given Field. It's not an either/or choice.
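The three choices can be written down as a tiny plain Java class. This is a hypothetical sketch, not Lucene's API, and the combinations picked for each example Field are my own assumptions about what a typical application might want. In the Lucene 2.4 Java API, as far as I understand it, the same decisions are expressed through the Field.Store and Field.Index options passed when a Field is constructed.

```java
// Toy representation of the three per-Field choices. Hypothetical, not Lucene's API.
class FieldOptions {
    final boolean store, index, analyze;

    FieldOptions(boolean store, boolean index, boolean analyze) {
        if (analyze && !index) {
            // Analysis is a sub-choice of indexing, as in the list above.
            throw new IllegalArgumentException("analyze only applies to an indexed field");
        }
        if (!store && !index) {
            throw new IllegalArgumentException("a field that is neither stored nor indexed does nothing");
        }
        this.store = store;
        this.index = index;
        this.analyze = analyze;
    }

    // Example combinations (assumptions for illustration):
    static final FieldOptions CONTENT = new FieldOptions(false, true, true);  // searchable by word, not returned
    static final FieldOptions TITLE   = new FieldOptions(true, true, true);   // searchable by word and returned
    static final FieldOptions PHONE   = new FieldOptions(true, true, false);  // matched as a whole and returned
    static final FieldOptions URL     = new FieldOptions(true, false, false); // returned only, never searched
}
```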

Creating the Index

When we have added all the Documents to the index, we simply tell the index writer to create the index. From this point on we can search according to the indexed Fields for any of our Documents. Look for an upcoming entry that gives a high-level overview of searching for things in a Lucene index.

Parting Note

Recall that we said it would be simpler to assume that our target data was a set of Word documents. Now that we've finished, consider that your target data can be anything. In reality, it's the Lucene Documents that are searched. And you can create these from anything you want. They can, and frequently do, come from an aggregation of real world data objects. Again, what data will go into your Lucene Documents is up to your business requirements. It can be as simple as a one-to-one mapping of Word documents to Lucene Documents, or each Lucene Document can be the aggregate of a variety of database queries and anything else you might find lying around.

Happy indexing!

8 comments:

Pepe perez said...

Hi, First... congratulation for your post!! I'm a newbie on lucene!!
My documents would be just the name of a city and I was wondering If lucene could answer if the user misunterstood a character, like Valtimore instead of Baltimore.
Thanks! ;)

chadmichael said...

@maria

Sure, you can do all sorts of matching of the search terms to the index. This is handled through the search portions of the API, not the indexing. In particular, there's a variety of "Query" objects that can be used to query the index.

Look for an upcoming post on the basics of searching.

Anonymous said...

please tell how to convert word documents to lucene document format.pls give the source code by explanation......my email is swarajkp@gmail.com...pls help....

chadmichael said...

@swaraj

I've not done this yet, but I think there is a subproject of Lucene, or a 3rd party project, that can be used to index all sorts of problematic document formats from PDF to Word to whatever.

I think it might be called Nutch, but that project might be more than you need.

Check the lucene mailing lists.

chadmichael said...

@swaraj Actually, what you want is Tika. http://lucene.apache.org/tika/

Anonymous said...

I am new to Lucene, and thanks for this well written overview.

Anonymous said...

Great explaination.
Thanks!