metagear.de
A Sample Library Application, Featuring the Spring Framework, Lucene-Based Hibernate Search, and JavaServer Faces
August 19, 2009 · Robert Söding

Preface

This article and its sample application were written primarily to study the feasibility of using the Lucene full-text search engine, or a framework built on top of it, in a Spring and Hibernate application.
In contrast to my other recently written articles, this one was approached in an ad-hoc way. That is, I had not read the complete documentation before starting to code. Likewise, the article is intentionally kept brief.
As for Lucene integration with Hibernate and Spring, there are a number of frameworks around (see chapter Related Resources). While the Compass framework might be worth more than a second look, Hibernate Search was chosen for the simple reason that JBoss is an industry leader with Hibernate itself.
Readers are expected to already have a basic knowledge of the Spring framework, Hibernate, and (for that matter) JavaServer Faces (JSF).
Feedback is welcome and may be directed to .

Prerequisites

There are the following software requirements:
To test the sample application, ...
This should be all to consider.

Use Cases

Basically, a user can search the web (Google), download and add the search results to the library, search the library, view document details (including extracted plain text), and view an archived copy of the original media.

Search the Web and Add new Media to the Library

The following image shows the "Add Media" view:
Add Media View

Search the Library

The following image shows the "Search Library" view:
Search Library View

View Media Details and the Archived Document

The following image shows the "Show Media Details" view:
Show Media Details View

Application Layers

The JavaDocs are also available.

Database and Persistence

MySQL is used as the RDBMS (Relational Database Management System); most common databases would also do. The Hibernate O/R mapper (object-relational mapper) is used through the JPA (Java Persistence API) to persist and retrieve data.
For configuration details, see
See ORM with Hibernate for more information on that matter.

Data Access Objects (DAOs)

Data access is encapsulated within Data Access Objects, in this case, one for CRUD (Create-Read-Update-Delete) operations, and another one for more sophisticated search operations:
DAOs
Central APIs used include Spring's JpaTemplate and JpaCallback, Hibernate Search's FullTextEntityManager and FullTextEntityQuery, and Lucene's Query and Analyzer.
The DAOs are exposed as Spring beans; see chapter Dependency Injection for more information.

Model

Entities

The following image shows the entities and the MediaFactory:
Entities
The MediaFactory is used to create and populate a Media from a WebSearchResult instance. It downloads the corresponding document and uses Apache Tika to extract the document contents.
The entities, Media and their MetaData, are mapped to the database tables and fields, as well as Lucene index fields, using Hibernate, Hibernate Search, and JPA annotations.
Further information is available on these JPA, Hibernate, and Hibernate Search annotations. You may also want to read my previously written chapter on XML-based Hibernate Mappings.
The following image shows a Media's (constructors and) methods:
Media Methods

Value Objects

Value objects (also called transfer objects) are used to transfer specific information. For clarity, they may expose getter methods only.
The following image shows the value objects used in the application:
Value Objects
Additionally, there are the WebSearchCriteria and WebSearchResult value objects, in the de.metagear.util.web package.
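A getter-only value object might look like the following minimal sketch. The class name SearchResultValue and its fields are hypothetical and serve only as an illustration; they are not the sample application's actual value objects.

```java
import java.util.Date;

// Hypothetical getter-only value object; the name and fields are assumptions
// made for illustration, not the sample application's classes.
final class SearchResultValue {
    private final long id;
    private final String title;
    private final String mimeType;
    private final Date lastUpdated;

    SearchResultValue(long id, String title, String mimeType, Date lastUpdated) {
        this.id = id;
        this.title = title;
        this.mimeType = mimeType;
        // Defensive copy, because java.util.Date is mutable.
        this.lastUpdated = new Date(lastUpdated.getTime());
    }

    long getId() { return id; }

    String getTitle() { return title; }

    String getMimeType() { return mimeType; }

    // Returns a copy so that callers cannot change the stored date.
    Date getLastUpdated() { return new Date(lastUpdated.getTime()); }
}
```

Because there are no setters and the date is copied on the way in and out, an instance cannot change after construction, which makes it safe to hand from the service layer to the view.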

Service Layer

Any central business logic in the sample application is coordinated by Service Layer methods, which, in this case, mostly operate on the DAOs' methods. The following image shows the service layer structure:
Service Layer
The DataQueryService is used to retrieve data from the database and the Lucene index.
The WebInteractionService is used to search the web (utilizing the Google Search APIs) and to save the WebSearchResults retrieved. Implementations of both are exposed as Spring beans.
The DocumentServlet displays archived Media documents.

Controllers

The Controllers, implemented as JavaServer Faces Managed Beans, connect the service and view layer. See WEB-INF/faces-config.xml and the following image for an overview:
Controllers
We could have used the Spring MVC API; however, the application's complexity does not require it.
Commonly, these controllers provide JSF action methods, a BackingBean (containing static properties to be used in the JSF views) and a CommandBean (containing properties that are to be edited by the views).
The service layer's Spring beans are dependency-injected, which is configured in WEB-INF/faces-config.xml, where a SpringBeanFacesELResolver resolves the Spring beans' names to JSF.

Glue

Several portions of functionality can be used (and re-used) independently of concrete applications. In the sample application, these are organized in the de.metagear.util and de.metagear.library.util Java packages.

MediaParser

The MediaParser puts the Apache Tika APIs to work to extract contents (into plain text or HTML) as well as meta data from documents of a large number of document formats. Internally, Tika uses Apache POI, PDFBox, and various other libraries.

QueryTermsProcessor

The QueryTermsProcessor plays quite a central role in querying the Lucene index. It processes the search terms entered by the user (see Apache Lucene - Query Parser Syntax) as well as other query terms (e.g., the requested document format or language) and assigns them to Lucene index fields. In doing so, the QueryTermsProcessor also combines terms and term groups (using Lucene's AND, OR, and NOT operators) and nests them properly.
A formatted query passed to Lucene's QueryParser might resemble the following code snippet:
(
    (
        plainText:spring OR plainText:groovy
    )
    OR
    (
        title:spring OR title:groovy
    )
)
AND
(
    (
        languageCode:de
    )
    AND
    (
        mimeType:text/html
    )
)
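The nesting shown above can be produced with plain string handling. The following is only a rough sketch of the idea, not the QueryTermsProcessor's actual code; the class QueryStringSketch and its method names are hypothetical, and the sketch does not escape characters that are special to the Lucene query syntax.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating how nested query strings can be built.
// It is NOT the sample application's QueryTermsProcessor.
final class QueryStringSketch {

    // Joins already formatted clauses with an operator and wraps them in parentheses.
    static String group(String operator, List<String> clauses) {
        return "(" + String.join(" " + operator + " ", clauses) + ")";
    }

    // Builds "(field:term1 OR field:term2 ...)" for a single index field.
    static String fieldGroup(String field, List<String> terms) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            clauses.add(field + ":" + term);
        }
        return group("OR", clauses);
    }

    // Combines the user's terms over several text fields (OR) with
    // mandatory language and MIME type restrictions (AND).
    static String build(List<String> terms, List<String> textFields,
            String languageCode, String mimeType) {
        List<String> textGroups = new ArrayList<>();
        for (String field : textFields) {
            textGroups.add(fieldGroup(field, terms));
        }
        List<String> filters = new ArrayList<>();
        filters.add("(languageCode:" + languageCode + ")");
        filters.add("(mimeType:" + mimeType + ")");
        return group("OR", textGroups) + " AND " + group("AND", filters);
    }
}
```

Called with the terms "spring" and "groovy", the fields "plainText" and "title", the language code "de", and the MIME type "text/html", build(..) returns the query shown above on a single line.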

General-Purpose Libraries

The CollectionUtils, IoUtils and StringUtils classes provide low-level functionality.
The PaginationSupport class ("next page", "previous page", etc.) can be plugged into frameworks of any type.
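To illustrate what such framework-independent pagination logic involves, here is a minimal sketch. The class PaginationSketch is hypothetical and is not the application's PaginationSupport; only the idea of a start item and a number of items per page is taken from the search code shown later in this article.

```java
// Hypothetical pagination helper; an illustration only, not the
// sample application's PaginationSupport class.
final class PaginationSketch {
    private final int totalItems;
    private final int itemsPerPage;
    private int currentPage; // zero-based page index

    PaginationSketch(int totalItems, int itemsPerPage) {
        if (totalItems < 0 || itemsPerPage <= 0) {
            throw new IllegalArgumentException(
                    "totalItems must be >= 0 and itemsPerPage must be > 0");
        }
        this.totalItems = totalItems;
        this.itemsPerPage = itemsPerPage;
    }

    // Number of pages, computed as a ceiling division without floating point.
    int getNumOfPages() {
        return (totalItems + itemsPerPage - 1) / itemsPerPage;
    }

    // Index of the first item on the current page (usable as a query offset).
    int getStartItem() {
        return currentPage * itemsPerPage;
    }

    boolean hasNextPage() {
        return currentPage + 1 < getNumOfPages();
    }

    boolean hasPreviousPage() {
        return currentPage > 0;
    }

    // Moves forward one page; does nothing on the last page.
    void nextPage() {
        if (hasNextPage()) {
            currentPage++;
        }
    }

    // Moves back one page; does nothing on the first page.
    void previousPage() {
        if (hasPreviousPage()) {
            currentPage--;
        }
    }
}
```

Since the class depends on nothing but integers, it can back a JSF managed bean just as well as any other kind of controller.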

MySql5InnoDbDialectUTF8

The MySql5InnoDbDialectUTF8 Hibernate dialect causes Hibernate to create a database with a UTF-8 character set.

Java Reflection

Classes in the de.metagear.util.reflection package are used to manipulate Java bean properties, which saves a lot of code in classes using them.
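As a rough idea of how reflection can save such code, the following sketch reads and writes a bean property by name. The classes BeanPropertySketch and SampleBean are hypothetical and do not reproduce the de.metagear.util.reflection package; the sketch also assumes plain getXxx/setXxx naming and ignores boolean "is" getters.

```java
import java.lang.reflect.Method;

// Hypothetical reflection helper; an illustration only, not the
// classes of the de.metagear.util.reflection package.
final class BeanPropertySketch {

    // Reads property "name" by invoking the public getter getName().
    static Object getProperty(Object bean, String name) {
        try {
            Method getter = bean.getClass().getMethod("get" + capitalize(name));
            return getter.invoke(bean);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read property " + name, e);
        }
    }

    // Writes property "name" by invoking the first public one-argument setter setName(..).
    static void setProperty(Object bean, String name, Object value) {
        String setterName = "set" + capitalize(name);
        for (Method method : bean.getClass().getMethods()) {
            if (method.getName().equals(setterName) && method.getParameterCount() == 1) {
                try {
                    method.invoke(bean, value);
                    return;
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Cannot write property " + name, e);
                }
            }
        }
        throw new IllegalArgumentException("No setter found for property " + name);
    }

    private static String capitalize(String s) {
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }
}

// Hypothetical bean used only to demonstrate the helper above.
class SampleBean {
    private String title;

    public String getTitle() { return title; }

    public void setTitle(String title) { this.title = title; }
}
```

With this helper, BeanPropertySketch.setProperty(bean, "title", "Spring") has the same effect as bean.setTitle("Spring"), but the property name can be supplied at runtime.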

Web and Google Search

The following image shows parts of the WebSearch APIs and its GoogleSearch implementation:
WebSearch APIs
The GoogleSearch APIs are based on the *wonder* Google Search APIs (which is not of great scope, BTW).
Currently, Google returns only eight matches per query. This could be changed by obtaining a client key.

View

While JavaServer Faces 2.0 has been stable in my own experience (according to its automated tests, however, it is not yet as stable as JSF 1.2), JavaServer Faces 1.2 was chosen for the sample application. That way, the application can be extended with other JSF frameworks, which are typically not yet compatible with JSF 2.0.

Facelets

Facelets is used as the composition and templating framework. The main template is WEB-INF/templates/masterTemplate.jspx (I couldn't get Eclipse's code completion to work with ".xhtml"-extended files). See Facelets Resources for more information.

JavaServer Faces (JSF)

The JSF views' code is pretty straightforward.
For a more thorough discussion, see chapter Presentation Layer (on JSF 2.0) in my previously written JEE 6 article.
The sample application's JSF beans are session scoped. In a larger-scale application, one would reconsider which properties actually need to live in the resource-intensive session scope. See chapter Bean Declaration and Scope in the aforementioned article.

Testing

For the tests, the JUnit framework is used.
Currently, the basic libraries are covered relatively thoroughly, and there is decent coverage of the service methods. DAO and integration tests (including the web GUI) are missing.
The reason for the missing tests is, of course, that the sample application's scope does not include testing at all. Moreover, it is not foreseeable that the tests will need to be run repeatedly in the future.
Note that the service tests do populate the database; their automatic transaction rollback is currently switched off.

Content Analysis and Processing, and Indexing

Content Analysis and Processing

An Analyzer returns a TokenStream by applying one or more TokenFilters (accepting or discarding tokens) to a Tokenizer (splitting a character sequence into tokens).
The following images show the Analyzer, Tokenizer, and TokenFilter type hierarchies (with the optional library lucene-analyzers.jar installed).
Analyzers
Tokenizers
TokenFilters
The StandardAnalyzer (which executes if not otherwise specified) works with the StandardTokenizer, to which the filters StandardFilter, LowerCaseFilter and StopFilter are applied.
In human language texts, particularly, stop words and word stems are to be considered. On the other hand, if a text is expected to contain, for example, an ISBN number like "978-3-89864-465-5", it needs to be ensured that this entity is indexed as is, i.e., not split or discarded.
See also Analyse mit Lucene (in German) for an introduction to Lucene text analysis. See A Fast and Simple Stemming Algorithm for German Words for a strategy to treat word stem variations, so-called lemmas, in a unified way. See KEA - Keyphrase Extraction Algorithm for an advanced text extraction library (based on the Lucene APIs).
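The tokenizer-plus-filters pipeline, and the effect of the tokenizing rule on an ISBN, can be imitated with the standard library alone. The following is a simplified sketch that does not use or reproduce the Lucene classes; the class AnalysisSketch, its methods, and its small stop word list are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analyzer chain; it does NOT use the Lucene APIs.
final class AnalysisSketch {

    // A tiny, made-up stop word list for the demonstration.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "a", "of"));

    // "Tokenizer": splits the text at the given regular expression.
    static List<String> tokenize(String text, String splitRegex) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split(splitRegex)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // "Token filter" that lower-cases every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // "Token filter" that discards stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    // "Analyzer": applies the filters to the tokenizer's output.
    static List<String> analyze(String text, String splitRegex) {
        return removeStopWords(lowerCase(tokenize(text, splitRegex)));
    }
}
```

Analyzing "The ISBN is 978-3-89864-465-5" with the whitespace rule "\\s+" keeps the ISBN as the single token "978-3-89864-465-5", whereas splitting at every non-alphanumeric character ("[^\\p{L}\\p{N}]+") breaks it into "978", "3", "89864", "465", and "5". This is the kind of difference to check for when choosing an analyzer for such fields.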

Indexing

Note that, apart from the Hibernate Search APIs discussed in this chapter, Lucene provides its own index-related APIs, including the IndexReader and IndexWriter classes.

Index Creation

The sample application's entity properties are annotated with Hibernate Search's @Field annotation, as in the following snippet:
...
@Field(index = Index.TOKENIZED, store = Store.YES)
private String title;

...
@Field(index = Index.UN_TOKENIZED, store = Store.YES)
private String mimeType;
The @Field annotation causes the property value to be indexed in the corresponding Lucene field. The value can be tokenized, that is, split. The tokenizing behavior can be specified using the @AnalyzerDef, @TokenizerDef and @TokenFilterDef annotations (see chapter Content Analysis and Processing).
Given the @Field annotation is in place, Hibernate Search, by default, will automatically index the property values on JpaTemplate.persist(Object) and JpaTemplate.merge(Object).
Indexing will cause a write lock to be put into effect on the data store, and only one index writer can operate at a time. Therefore, strategies exist to defer the index creation. There are configuration settings for indexing after a given number of transactions or operations (see Tuning Lucene indexing performance). Additionally, Hibernate Search provides means to send index change requests to a JMS (Java Message Service) queue.
Automatic indexing can also be disabled (see Automatic indexing). Index changes can then be made manually by using the FullTextSession's <T> void index(T) and void flushToIndexes() methods (see Manual indexing).

Index Optimization

An optimization consolidates index files into one main file. Optimization can be set to be performed automatically in Hibernate Search's configuration (see Automatic Optimization) or manually, by invoking one of the overloaded optimize(..) methods of a SearchFactory, which, in turn, can be obtained from a FullTextSession (see Manual Optimization).

Index Search

A Lucene search in the sample application is implemented as follows:
public Collection<MediaSearchResultVO> getSearchResults(
        final MediaSearchCriteriaVO criteria) {
    return (Collection<MediaSearchResultVO>) jpaTemplate
            .execute(new JpaCallback() {

                @Override
                public Collection<MediaSearchResultVO> doInJpa(
                        EntityManager em) throws PersistenceException {

                    try {
                        String preProcessedSearchTerms = new QueryTermsProcessor()
                                .processQueryTerms(criteria,
                                        new String[] { "plainText", "title" },
                                        QueryOperator.OR);
                        Query query = new QueryParser("plainText",
                                new WhitespaceAnalyzer())
                                .parse(preProcessedSearchTerms);

                        FullTextEntityManager fullTextEntityManager = Search
                                .getFullTextEntityManager(em);
                        FullTextQuery fullTextQuery = fullTextEntityManager
                                .createFullTextQuery(query, Media.class);

                        fullTextQuery.setProjection(FullTextQuery.SCORE,
                                "id", "title", "mimeType", "languageCode",
                                "lastUpdated");

                        fullTextQuery.setResultTransformer(
                                new MediaSearchResultVoResultTransformer());

                        fullTextQuery.setFirstResult(criteria.getStartItem());
                        fullTextQuery.setMaxResults(criteria
                                .getNumOfItemsPerPage());

                        if (criteria.getOrderBy() != null) {
                            fullTextQuery.setSort(new Sort(criteria
                                    .getOrderBy()));
                        }

                        return (Collection<MediaSearchResultVO>) fullTextQuery
                                .getResultList();
                    }
                    catch (ParseException e) {
                        throw new MediaSearchException(e);
                    }
                }
            });
}
First of all, the application's QueryTermsProcessor pre-processes the search terms (see chapter QueryTermsProcessor).
Next, the Lucene QueryParser further processes the search terms string (which remains semantically equivalent afterwards) and returns a Query, using a WhitespaceAnalyzer in this case (see chapter Content Analysis and Processing).
A FullTextQuery instance is created and provided with a projection. First, this specifies and limits the fields to be returned; second, it causes Hibernate Search to query only the Lucene indexes rather than the database.
The sample application's MediaSearchResultVoResultTransformer transforms each result row's values to a MediaSearchResultVO instance (a Collection of which is ultimately returned).
Additionally, pagination and sort properties are set.
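The transformation of a projected row can be pictured as follows. This is a sketch under assumptions, not the code of MediaSearchResultVoResultTransformer: the class SearchRowSketch is hypothetical, and the value types (a Float score, a Long id, a Date for lastUpdated) are guesses. Only the order of the values follows the setProjection(..) call above. In Hibernate, such logic would typically live in a ResultTransformer's transformTuple(Object[], String[]) method.

```java
import java.util.Date;

// Hypothetical target object for one projected row; the value types are assumptions.
final class SearchRowSketch {
    private final float score;
    private final long id;
    private final String title;
    private final String mimeType;
    private final String languageCode;
    private final Date lastUpdated;

    private SearchRowSketch(float score, long id, String title,
            String mimeType, String languageCode, Date lastUpdated) {
        this.score = score;
        this.id = id;
        this.title = title;
        this.mimeType = mimeType;
        this.languageCode = languageCode;
        this.lastUpdated = lastUpdated;
    }

    // Maps one row, ordered as in the projection:
    // score, id, title, mimeType, languageCode, lastUpdated.
    static SearchRowSketch fromTuple(Object[] tuple) {
        if (tuple.length != 6) {
            throw new IllegalArgumentException(
                    "Expected 6 projected values but got " + tuple.length);
        }
        return new SearchRowSketch((Float) tuple[0], (Long) tuple[1],
                (String) tuple[2], (String) tuple[3], (String) tuple[4],
                (Date) tuple[5]);
    }

    float getScore() { return score; }

    long getId() { return id; }

    String getTitle() { return title; }

    String getMimeType() { return mimeType; }

    String getLanguageCode() { return languageCode; }

    Date getLastUpdated() { return lastUpdated; }
}
```

Because the values arrive as a bare Object[], the mapping depends entirely on the projection order; changing the setProjection(..) arguments requires changing the transformer accordingly.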

Miscellaneous

Luke - Lucene Index Toolbox

Luke is a simple, yet effective, tool to view, query, and manipulate Lucene indexes. See the following screenshot and the Luke Homepage for more information.
Luke - Lucene Index Toolbox

Design Patterns Applied in the Sample Application

For a discussion of several patterns, see also Patterns in Spring in my previously written article A Comprehensive Introduction into the Spring Framework.
Design patterns implemented in the sample application include

Resources

All links retrieved at the date of publication.

Hibernate

Hibernate Search

Related Resources

Lucene

Content Analysis

Content Extraction

Spring Framework

JavaServer Faces (JSF)

Facelets

Tools
