Brian McCallister

Sat, 11 Sep 2004

Search is important! All too often search looks like where thing like '%that%'. Users know Google, and quite a few even know its query language at this point. So it is not just that we want to provide more functionality in search: users have come to expect it. Google seems simple, doesn't it?

Enter Lucene. I'll presume you've heard of it at least, if not used it. Lucene does full text indexing, and that is it. It does this really well. The beauty (well, one) is that you can index anything. In this case, I'll index an object being persisted by OJB. The key is to embed information required to retrieve the document being indexed.

Take a gander at a fairly simple Student class (this is from an app I am doing for my little brother, who is a professor of such terrible subjects as rock climbing and white water kayaking; don't get me started).

The primary use case for this application is for a student coop employee to find a student in the system, then find gear and check the gear out for the student. Finding the student is key, and that is best served by... searching! So we have a database record for each student, and want a convenient search facility which can search on name, student id (idNumber), phone number, even address. Lucene makes this a snap. To do it, we just store the id (the internal/pk id) in an unindexed field when we add a student in the StudentIndexer:

    public void add(final Student student) throws ServiceException {
        final Document doc = new Document();
        doc.add(Field.Text(NAME, student.getName()));
        doc.add(Field.Text(ID_NUMBER, student.getIdNumber()));
        doc.add(Field.Text(ADDRESS, student.getAddress()));
        doc.add(Field.Text(PHONE, student.getPhone()));
        doc.add(Field.UnIndexed(IDENTITY, student.getId().toString()));
        try {
            synchronized (mutex) {
                final IndexWriter writer = new IndexWriter(index, analyzer, false);
                writer.addDocument(doc);
                writer.optimize();
                writer.close();
            }
        }
        catch (IOException e) {
            throw new ServiceException("Unable to index student", e);
        }
    }

Notice the UnIndexed field on the Document? This tells Lucene to store the field with the record, but not to index it or search on it. When you retrieve the document you still get the field, though. Perfect place to stash the primary key.

When we look for the students, we don't want to get back Lucene Document instances, though; we want to go ahead and get the nice domain model instances of Student. What we'll do is query against the index, pull the pks out of all the hits, then select the domain objects using those pks (from the StudentIndex):

    public List findStudents(final String search) throws ServiceException {
        return this.findStudents(search, Integer.MAX_VALUE);
    }

    public List findStudents(final String search, final int numberOfResults) throws ServiceException {
        final Query query;
        try {
            query = QueryParser.parse(search, StudentIndexer.NAME, analyzer);
        }
        catch (ParseException e) {
            throw new ServiceException("Unable to make any sense of the query", e);
        }
        final ArrayList ids = new ArrayList();
        try {
            final IndexReader reader = IndexReader.open(index);
            final IndexSearcher searcher = new IndexSearcher(reader);
            final Hits hits = searcher.search(query);
            for (int i = 0; i != hits.length() && i != numberOfResults; ++i) {
                final Document doc = hits.doc(i);
                ids.add(new Integer(doc.getField(StudentIndexer.IDENTITY).stringValue()));
            }
            searcher.close();
            reader.close();
        }
        catch (IOException e) {
            throw new ServiceException("Error while reading student data from index", e);
        }
        final List students = dao.findStudentsWithIdsIn(ids);
        Collections.sort(students, new Comparator() {
            public int compare(final Object o1, final Object o2) {
                final Integer id_1 = ((Student) o1).getId();
                final Integer id_2 = ((Student) o2).getId();
                for (int i = 0; i != ids.size(); i++) {
                    final Integer integer = (Integer) ids.get(i);
                    if (integer.equals(id_1)) {
                        return -1;
                    }
                    if (integer.equals(id_2)) {
                        return 1;
                    }
                }
                return 0;
            }
        });
        return students;
    }

The findStudents(String, int): List method is a little bit more complex than I like, as it does several things: it queries the Lucene index, extracts the primary keys from the hits, queries for the students matching those pks (via the StudentDAO), and finally sorts the results (there is no way to specify this sort order in the database query, since it depends on the order of the hits from the Lucene query). With that, though, we support queries such as Tiffany, which is simple, or a more fun one, name: Aching phone: ???-1234 or what not. Go look at the Lucene query parser syntax. It is worth noting that the code above defaults to searching the name field when the query names no specific field. This seems to make sense to me =)
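The comparator above rescans the id list on every comparison. Here is a minimal sketch of another way to impose the hit order, assuming a JDK with generics and lambdas (Java 8 or later, so newer than the rest of the code here): build an id-to-rank map once, then sort by rank. HitOrder and sortByHitOrder are names made up for this sketch and are not part of the application.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: reorders loaded entities to match the order of ids from a search. */
final class HitOrder {

    private HitOrder() {}

    /**
     * Returns a new list holding the given entities sorted by the position of
     * their id in rankedIds. Entities whose id is not in rankedIds go last.
     */
    static <T, K> List<T> sortByHitOrder(List<K> rankedIds,
                                         List<T> entities,
                                         Function<T, K> idOf) {
        // Record the rank of each id once, so a comparison is two map lookups
        // instead of a scan over the whole id list.
        final Map<K, Integer> rank = new HashMap<>();
        for (int i = 0; i < rankedIds.size(); i++) {
            rank.putIfAbsent(rankedIds.get(i), i);
        }
        final List<T> sorted = new ArrayList<>(entities);
        sorted.sort(Comparator.comparingInt(
                (T e) -> rank.getOrDefault(idOf.apply(e), Integer.MAX_VALUE)));
        return sorted;
    }
}
```

With the types from this post, the call would look something like HitOrder.sortByHitOrder(ids, students, Student::getId), given a generified List of Student.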

If you look at the StudentIndex and StudentIndexer you will see there are also facilities for adding and removing documents from the Lucene index. This becomes important on any insert/update/delete operation. The update is the important one to catch, as you need to remove the old entry and insert a new one in the index. This is best done (in my opinion) via an aspect which picks these operations out. That is outside the scope of this article though ;-)
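Short of an aspect, the same bookkeeping can be sketched as a plain decorator around the DAO. This is not the aspect approach, just a hand-rolled stand-in for it, and everything in it is hypothetical: SearchIndex, Store and IndexedStore are invented for illustration and do not mirror the real StudentIndexer or StudentDAO. It assumes Java 8 or later.

```java
import java.util.function.Function;

/** Hypothetical minimal view of an index: add an entity, remove by primary key. */
interface SearchIndex<T> {
    void add(T entity);
    void remove(Integer id);
}

/** Hypothetical minimal DAO with the three write operations. */
interface Store<T> {
    void insert(T entity);
    void update(T entity);
    void delete(T entity);
}

/** Wraps a store so that every successful write is mirrored into the index. */
final class IndexedStore<T> implements Store<T> {
    private final Store<T> delegate;
    private final SearchIndex<T> index;
    private final Function<T, Integer> idOf;

    IndexedStore(Store<T> delegate, SearchIndex<T> index, Function<T, Integer> idOf) {
        this.delegate = delegate;
        this.index = index;
        this.idOf = idOf;
    }

    @Override
    public void insert(T entity) {
        delegate.insert(entity);
        index.add(entity);
    }

    @Override
    public void update(T entity) {
        delegate.update(entity);
        // An update is a remove of the stale entry followed by a fresh add.
        index.remove(idOf.apply(entity));
        index.add(entity);
    }

    @Override
    public void delete(T entity) {
        delegate.delete(entity);
        index.remove(idOf.apply(entity));
    }
}
```

The database write goes first, so a failed write throws before the index is touched. Keeping the two in step when the index write fails is a separate problem that this sketch does not address.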

For a larger application with more things being indexed (this just has two searchable domain types) I might generalize the search capability via a DocumentFactory such as:

    public class BeanDocumentFactory implements DocumentFactory {
        public Document build(Object entity) {
            final Document document = new Document();
            try {
                final BeanInfo info = Introspector.getBeanInfo(entity.getClass());
                final PropertyDescriptor[] props = info.getPropertyDescriptors();
                for (int i = 0; i != props.length; ++i) {
                    final PropertyDescriptor prop = props[i];
                    final String name = prop.getName();
                    final Method reader = prop.getReadMethod();
                    if (reader == null) {
                        // no getter for this property, so there is nothing to index
                        continue;
                    }
                    final Object value = reader.invoke(entity, new Object[]{});
                    final Field field = Field.Text(name, String.valueOf(value));
                    document.add(field);
                }
            }
            catch (Exception e) {
                throw new RuntimeException("Handle these in real application", e);
            }
            return document;
        }
    }

But I have not needed to generalize it for a real project yet =)
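To see what the introspection step picks up without involving Lucene at all, here is a small JDK-only sketch that collects the readable properties of a bean into a map of field names to string values. BeanFields and DemoBean are invented for this example. Two choices in it are mine rather than the factory's: it passes Object.class as the stop class so the class property from getClass() is left out, and it skips null values rather than storing the text "null". It assumes Java 7 or later.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: turns a bean's readable properties into name/value strings. */
final class BeanFields {

    private BeanFields() {}

    static Map<String, String> extract(Object entity) {
        final Map<String, String> fields = new LinkedHashMap<>();
        try {
            // Object.class as the stop class excludes the "class" property.
            final BeanInfo info = Introspector.getBeanInfo(entity.getClass(), Object.class);
            for (PropertyDescriptor prop : info.getPropertyDescriptors()) {
                final Method reader = prop.getReadMethod();
                if (reader == null) {
                    continue; // no getter, e.g. a write-only property
                }
                final Object value = reader.invoke(entity);
                if (value != null) {
                    fields.put(prop.getName(), String.valueOf(value));
                }
            }
        } catch (IntrospectionException | ReflectiveOperationException e) {
            throw new IllegalStateException("Unable to read bean properties", e);
        }
        return fields;
    }

    /** Made-up bean for the example: two readable properties and one write-only. */
    public static final class DemoBean {
        private final String name;
        private final String phone;
        private String pin;

        public DemoBean(String name, String phone) {
            this.name = name;
            this.phone = phone;
        }

        public String getName() { return name; }
        public String getPhone() { return phone; }
        public void setPin(String pin) { this.pin = pin; } // write-only on purpose
    }
}
```

Each entry in the returned map corresponds to one Field.Text(name, value) call in the factory above.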

Speaking of Lucene (which rocks), I am eagerly anticipating Erik Hatcher's new book, Lucene in Action. If it is anything like Erik and Steve Loughran's Java Development with Ant, Lucene will be a lucky project to have it in circulation.


Brian's Waste of Time