Search is important! All too often search looks like WHERE thing LIKE '%that%'. Users know Google, and quite a few even know its query language at this point. Quite apart from our wanting to provide more functionality in search, users have come to expect it. Google seems simple, doesn't it?
Enter Lucene. I'll presume you've heard of it at least, if not used it. Lucene does full text indexing, and that is it. It does this really well. The beauty (well, one) is that you can index anything. In this case, I'll index an object being persisted by OJB. The key is to embed information required to retrieve the document being indexed.
Take a gander at a fairly simple Student class (this is from an app I am doing for my little brother, who is a professor (of such terrible subjects as rock climbing and white water kayaking, don't get me started)).
The primary use case for this application is for a student coop employee to find a student in the system, then find gear and check the gear out for the student. Finding the student is key, and that is best served by... searching! So we have a database record for each student, and want a convenient search facility which can search based on name, student id (idNumber), phone number, even address. Lucene makes this a snap. To do it, we just store the id (internal/pk id) in an unindexed field when we add a student in the StudentIndexer:
public void add(final Student student) throws ServiceException {
    final Document doc = new Document();
    // Searchable fields
    doc.add(Field.Text(NAME, student.getName()));
    doc.add(Field.Text(ID_NUMBER, student.getIdNumber()));
    doc.add(Field.Text(ADDRESS, student.getAddress()));
    doc.add(Field.Text(PHONE, student.getPhone()));
    // Stored but not indexed: the primary key used to load the Student later
    doc.add(Field.UnIndexed(IDENTITY, student.getId().toString()));
    try {
        synchronized (mutex) {
            final IndexWriter writer = new IndexWriter(index, analyzer, false);
            writer.addDocument(doc);
            writer.optimize();
            writer.close();
        }
    }
    catch (IOException e) {
        throw new ServiceException("Unable to index student", e);
    }
}
Notice the UnIndexed field on the Document? This tells Lucene to store this field with the record, but not to index it or search on it. When you retrieve the document you will still get the field, though. Perfect place to stash the primary key.
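To make "stored but not indexed" concrete, here is a toy model in plain Java. It is not Lucene code and says nothing about how Lucene works internally; ToyIndex and its methods are names I made up for the illustration. The searchable text goes into an inverted index, while the stored key just rides along with the document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only, NOT Lucene: indexed text is tokenized into an inverted
// index; the stored key is kept with the document but never tokenized.
class ToyIndex {
    private final List<String> storedKeys = new ArrayList<>();          // doc number -> stored pk
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // token -> doc numbers

    // Index the searchable text and stash the primary key without indexing it.
    void add(String storedKey, String... indexedText) {
        int docNumber = storedKeys.size();
        storedKeys.add(storedKey);
        for (String text : indexedText) {
            for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, t -> new LinkedHashSet<>()).add(docNumber);
                }
            }
        }
    }

    // Return the stored keys of every document containing the term.
    List<String> search(String term) {
        Set<Integer> docs = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        List<String> keys = new ArrayList<>();
        for (int docNumber : docs) {
            keys.add(storedKeys.get(docNumber));
        }
        return keys;
    }
}
```

Searching for a name hands back the stored key, but searching for the key itself finds nothing, because it was never indexed. That is fine for lookups, and worth remembering if you ever want to find a document by its key.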
When we look for the students, we don't want to get back Lucene Document instances, though; we want to go ahead and get the nice domain model instances of Student. What we'll do is query against the index, pull all the pks for the hits out, then select for the domain objects using those pks (from the StudentIndex):
public List findStudents(final String search) throws ServiceException {
    return this.findStudents(search, Integer.MAX_VALUE);
}

public List findStudents(final String search, final int numberOfResults) throws ServiceException {
    final Query query;
    try {
        // Queries without an explicit field search the name field
        query = QueryParser.parse(search, StudentIndexer.NAME, analyzer);
    }
    catch (ParseException e) {
        throw new ServiceException("Unable to make any sense of the query", e);
    }
    // Primary keys of the hits, in the order Lucene returned them
    final ArrayList ids = new ArrayList();
    try {
        final IndexReader reader = IndexReader.open(index);
        final IndexSearcher searcher = new IndexSearcher(reader);
        final Hits hits = searcher.search(query);
        for (int i = 0; i != hits.length() && i != numberOfResults; ++i) {
            final Document doc = hits.doc(i);
            ids.add(new Integer(doc.getField(StudentIndexer.IDENTITY).stringValue()));
        }
        searcher.close();
        reader.close();
    }
    catch (IOException e) {
        throw new ServiceException("Error while reading student data from index", e);
    }
    final List students = dao.findStudentsWithIdsIn(ids);
    // Put the students back into the order of the hits
    Collections.sort(students, new Comparator() {
        public int compare(final Object o1, final Object o2) {
            final Integer id_1 = ((Student) o1).getId();
            final Integer id_2 = ((Student) o2).getId();
            // Whichever id shows up first in the hit list sorts first
            for (int i = 0; i != ids.size(); i++) {
                final Integer integer = (Integer) ids.get(i);
                if (integer.equals(id_1)) {
                    return -1;
                }
                if (integer.equals(id_2)) {
                    return 1;
                }
            }
            return 0;
        }
    });
    return students;
}
The findStudents(String, int): List method is a little more complex than I like, as it does several things: it queries the Lucene index, extracts the primary keys for the hits, queries for the students matching those pks (via the StudentDAO), and finally sorts the results (the query for the students cannot specify the sort order, since it depends on the order of the hits from the Lucene query). With that, though, we support queries such as Tiffany, which is simple, or a more fun one, name: Aching phone: ???-1234 or what not. Go look at the Lucene query parser syntax. It is worth noting that the above query defaults to searching on the name field if no specific field is specified. This seems to make sense to me =)
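The final sort step in findStudents scans the id list on every comparison. A possible alternative, sketched here in plain Java with generics (so newer than the code above), is to key the loaded objects by id and then walk the hit ids once. HitOrder and inHitOrder are hypothetical names, not part of the application.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class HitOrder {
    // Reorder loaded entities to match the order of the ids from the search hits.
    // Entities whose id is not among the hit ids are dropped; hit ids with no
    // loaded entity are skipped.
    static <T, K> List<T> inHitOrder(List<K> hitIds, List<T> entities, Function<T, K> idOf) {
        Map<K, T> byId = new HashMap<>();
        for (T entity : entities) {
            byId.put(idOf.apply(entity), entity);
        }
        List<T> ordered = new ArrayList<>();
        for (K id : hitIds) {
            T entity = byId.get(id);
            if (entity != null) {
                ordered.add(entity);
            }
        }
        return ordered;
    }
}
```

With Student it would be called along the lines of HitOrder.inHitOrder(ids, students, Student::getId). It also copes with a hit whose row has since vanished from the database, which simply drops out of the result.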
If you look at the StudentIndex and StudentIndexer you will see there are also facilities for adding and removing documents from the Lucene index. This becomes important on any insert/update/delete operation. The update is important to catch, as you need to remove the old entry and insert a new one in the index. This is best done (my opinion) via an aspect which picks these operations out. That is outside the scope of this article though ;-)
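If an aspect is more machinery than you want, a plain decorator can carry the same bookkeeping. This is only a sketch of the remove-then-add rule for updates: Indexer, Store and IndexingStore are hypothetical interfaces invented for the example, not the application's real DAO or OJB API.

```java
// Hypothetical stand-in for the add/remove facilities of an indexer.
interface Indexer<T> {
    void add(T entity);
    void remove(T entity);
}

// Hypothetical persistence interface; the real application goes through OJB.
interface Store<T> {
    void insert(T entity);
    void update(T entity);
    void delete(T entity);
}

// Wraps a store so that every write also touches the index.
class IndexingStore<T> implements Store<T> {
    private final Store<T> delegate;
    private final Indexer<T> indexer;

    IndexingStore(Store<T> delegate, Indexer<T> indexer) {
        this.delegate = delegate;
        this.indexer = indexer;
    }

    public void insert(T entity) {
        delegate.insert(entity);
        indexer.add(entity);
    }

    // An update replaces the index entry: drop the old one, add the new one.
    public void update(T entity) {
        delegate.update(entity);
        indexer.remove(entity);
        indexer.add(entity);
    }

    public void delete(T entity) {
        delegate.delete(entity);
        indexer.remove(entity);
    }
}
```

The sketch ignores transactions entirely; if the database write can roll back, the index change would need to wait for the commit.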
For a larger application with more things being indexed (this just has two searchable domain types) I might generalize the search capability via a DocumentFactory such as:
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BeanDocumentFactory implements DocumentFactory {
    public Document build(Object entity) {
        final Document document = new Document();
        try {
            final BeanInfo info = Introspector.getBeanInfo(entity.getClass());
            final PropertyDescriptor[] props = info.getPropertyDescriptors();
            // Index every readable bean property as a text field
            for (int i = 0; i != props.length; ++i) {
                final PropertyDescriptor prop = props[i];
                final String name = prop.getName();
                final Method reader = prop.getReadMethod();
                if (reader == null) {
                    continue; // write-only property, nothing to read
                }
                final Object value = reader.invoke(entity, new Object[]{});
                final Field field = Field.Text(name, String.valueOf(value));
                document.add(field);
            }
        }
        catch (Exception e) {
            throw new RuntimeException("Handle these in real application", e);
        }
        return document;
    }
}
But I have not needed to generalize it for a real project yet =)
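One wrinkle with a generic factory like the one above: the JavaBeans introspector also reports the class property (from getClass()), and null values end up indexed as the string "null". A sketch that only collects the name/value pairs, leaving the Lucene part out so it runs on its own; BeanProperties and describe are hypothetical names.

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

class BeanProperties {
    // Collect the readable, non-null bean properties as strings, keyed by name.
    static Map<String, String> describe(Object bean) throws Exception {
        Map<String, String> values = new TreeMap<>();
        // Object.class as the stop class leaves out getClass().
        BeanInfo info = Introspector.getBeanInfo(bean.getClass(), Object.class);
        for (PropertyDescriptor prop : info.getPropertyDescriptors()) {
            Method reader = prop.getReadMethod();
            if (reader == null) {
                continue; // write-only property
            }
            Object value = reader.invoke(bean);
            if (value != null) {
                values.put(prop.getName(), String.valueOf(value));
            }
        }
        return values;
    }
}
```

Each entry of the returned map could then become one field on the Document. Whether every property deserves indexing is a separate question; an explicit list of properties per class may well turn out cleaner.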
Speaking of Lucene (which rocks), I am eagerly anticipating Erik Hatcher's new book, Lucene in Action. If it is anything like Erik and Steve Loughran's Java Development with Ant, Lucene will be a lucky project to have it in circulation.
writebacks...
btw...
There is actually a bug in the code. The id field needs to be indexed for StudentIndex#remove to work as shown. There are multiple workarounds available -- left as exercise to reader ;-)
Lucene rocks
I use lucene... it's really very easy to use, powerful and fast. What more could you ask for.
slight correction
Our Ant book is called "Java Development with Ant", not "Ant in Action". I'm quite interested in representing object graphs and DOMs in Lucene - nice work! When I have more time I'll review what you've done, but at first glance it doesn't look like you're doing hierarchical indexing. How are you handling that type of thing, if I've missed it?
btw...
Sorry on the name mixup. I think of it as "The Ant Book" =) I don't do full graph indexing there, but certainly can. Will whip up something to do it =)
btw...
Fixed the Ant book name, sorry about that!
Re: graphs
I am not handling graphs in Lucene, really. It is a flat index on particular classes. I am not sure generalized graph indexing would be useful, but doing a straightforward "index everything and keep it up to date" is pretty easy, as would be searching based on parent or child only. Will play a bit and see what I can do. If I can find a useful-in-general algorithm better than simple bean property indexing, I will post it.
Re: graphs
Got something cool, will post soon, prolly tomorrow, just need to clean up some.
Bug?
Good stuff! Very useful and interesting. I think there may be a slight problem in findStudents, when specifying max results. As an example, imagine you specify numberOfResults=2 and have the following pks in the hits returned from Lucene: (2, 3, 1). findStudents will return a list with (Student:2, Student:3) in it, where the expected result should presumably be (Student:1, Student:2). I've been looking for an elegant solution to this problem for a while now. The best thing I've been able to come up with so far is to do an insertion sort for each pk read from the search engine hits, and then use sublist(0, numberOfResults) to get the list of pks for the domain objects. You have to process each hit, so the price for correctness will be performance. Also, this approach doesn't work if you want to also use an aggregate function as part of the domain object criteria. Anyone know of a better solution?
Same idea
I had the same idea some months ago, and the XML repository of all articles published on http://www.vnunet.com/ is indexed and retrieved using roughly the same technique you mentioned. What happens there is that specific XPaths are extracted from the XML documents, and serialized in the Lucene index. From this archive, then, the different articles are pulled out using simple term matches. Works great, and it's so much better than any SQL database.