Google App Engine – Full Text Search with JDO – Revisited

Objective

This article will show you how to implement full text search in Google App Engine using JDO. I tried my hand at this a couple of months ago, but after watching Brett Slatkin’s presentation I decided to do it properly.

The Problem

In my first attempt I managed to get the search working, but after watching Brett Slatkin’s presentation I realized where the problem was. In short, deserializing a list of strings (which is our search index) is a very costly operation, but he presented a solution. Below you will find my implementation of that approach.

Data Model

For this example we will use the following data model: a Customer (name, contact, notes) has a list of Addresses and a list of Phones. We need to be able to find a customer by name, address, or phone.

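// Imports used by the model classes in this article
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.jdo.annotations.Element;
import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.IdentityType;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;
import com.google.appengine.api.datastore.Key;

@PersistenceCapable(identityType = IdentityType.APPLICATION, detachable="true")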
public class Customer {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Long id;

    @Persistent
    private String name;
    @Persistent
    private String contactName;
    @Persistent
    private String comments;

    @Persistent(mappedBy = "customer")
    @Element(dependent = "true")
    private List<Address> addresses = new ArrayList<Address>();

    @Persistent(mappedBy = "customer")
    @Element(dependent = "true")
    private List<Phone> phones = new ArrayList<Phone>();

    @Persistent(dependent="true")
    private CustomerIndex index;

    // getters and setters go here
}

@PersistenceCapable(identityType = IdentityType.APPLICATION, detachable="true")
public class Address {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key id;
    @Persistent
    private String type;
    @Persistent
    private String line1;
    @Persistent
    private String line2;
    @Persistent
    private String city;
    @Persistent
    private String state;
    @Persistent
    private String zip;
    @Persistent
    private Customer customer;

    // getters and setters go here
}

@PersistenceCapable(identityType = IdentityType.APPLICATION, detachable="true")
public class Phone {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key id;
    @Persistent
    private String type;
    @Persistent
    private String phone;
    @Persistent
    private Customer customer;

    // getters and setters go here
}

If you paid attention, you noticed that the Customer class has an interesting child called CustomerIndex. Here it is:

@PersistenceCapable(identityType = IdentityType.APPLICATION, detachable="true")
public class CustomerIndex {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key id;
    @Persistent
    private Set<String> index;

    // getters and setters go here
}

Search Approach

Here is the theory behind what we are going to do. Since deserializing List properties is a very costly operation (and we do not care about the data they hold anyway), we move the customer search index property into a child object. We run the search against this child and retrieve only the keys of the matching child objects. This way we do not incur the penalty of deserializing our search index (the search happens on the index). Once we have the child object keys, we load the parent objects those keys point to. We can do this because a child key is a composite key that always includes the parent key.
To make our search more usable we will use Lucene and its SnowballAnalyzer for word stemming.
Here is the method that gives us the Set of words. We use it to generate the index of searchable words as well as to tokenize search phrases.

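// Imports needed by this method (in addition to your logging library)
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

protected Set<String> getIndex( String input, int maxTokens ) {
  Set<String> returnSet = new HashSet<String>();
  try {
    // stopWords() is a helper that returns the set of words to ignore
    Analyzer analyzer = new SnowballAnalyzer( org.apache.lucene.util.Version.LUCENE_30, "English", stopWords() );
    TokenStream tokenStream = analyzer.tokenStream( "content", new StringReader(input) );
    // Collect stemmed terms until the input is exhausted or maxTokens is reached
    while ( tokenStream.incrementToken() && (returnSet.size() < maxTokens) ) {
      if ( tokenStream.hasAttribute( TermAttribute.class ) ) {
        TermAttribute attr = tokenStream.getAttribute( TermAttribute.class );
        logger.debug( attr.term() );
        returnSet.add( attr.term() );
      }
    }
  } catch ( Exception exc ) {
    // Log the failure and return whatever was collected so far
    logger.error(exc);
  }
  return returnSet;
}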

Here is our search method:

public List<Customer> searchCustomers( String search1, Long entityId ) throws IOException {
  PersistenceManager pm = PMF.getManager();

  // Stem the search phrase with the same analyzer used to build the index (at most 3 tokens)
  Set<String> search = getIndex(search1, 3);

  // Query the child index entities and select only their keys,
  // so the index set itself is never deserialized
  Query query = pm.newQuery("SELECT id FROM " + CustomerIndex.class.getName() );
  query.setFilter("index == param0");
  query.declareParameters("String param0");

  // Query that loads the parent Customer objects by key
  Query query2 = pm.newQuery(Customer.class);
  query2.setFilter("id == keyParam");
  query2.declareParameters("com.google.appengine.api.datastore.Key keyParam");

  List<Customer> custs = null;
  List<Key> keys;
  List<Key> parents = new ArrayList<Key>();

  try {
    keys = (List<Key>) query.execute( search );
    // Each child key embeds its parent's key, so no extra lookup is needed here
    for ( Key k : keys ) {
      parents.add( k.getParent() );
    }
    custs = (List<Customer>) query2.execute( parents );
    // Touch the child collections so they are loaded before the PersistenceManager closes
    for ( Customer cust : custs ) {
      for ( Address addr : cust.getAddresses() )
        logger.debug( addr.getId() );
      for ( Phone ph : cust.getPhones() )
        logger.debug( ph.getId() );
    }
  } catch ( Exception exc ) {
    logger.error(exc);
  } finally {
    query.closeAll();
    query2.closeAll();
    pm.close();
  }
  return custs;
}

You will notice that we walk the Address and Phone lists for each customer to load them from storage before the PersistenceManager is closed. We do that so we can ship them over the wire. The UI in this case is a Flex client, so we serialize the results to JSON.
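
The search method above binds the whole token set to a single declared parameter. As a possible variation (not the code used in this article), one could declare one parameter per token and AND the equality filters together. This relies on the datastore matching an equality filter on a multi-valued property when any of its values equals the operand, so several such filters together require every token to be present. The class and method names below are hypothetical, and the sketch only builds the filter and declaration strings:

```java
/** Hypothetical helper, not part of the original article. */
final class IndexFilterBuilder {

    /** Builds "index == p0 && index == p1 ..." with one clause per search token. */
    static String buildFilter(int tokenCount) {
        StringBuilder filter = new StringBuilder();
        for (int i = 0; i < tokenCount; i++) {
            if (i > 0) {
                filter.append(" && ");
            }
            filter.append("index == p").append(i);
        }
        return filter.toString();
    }

    /** Builds the matching declaration "String p0, String p1 ...". */
    static String buildParameterDeclaration(int tokenCount) {
        StringBuilder declaration = new StringBuilder();
        for (int i = 0; i < tokenCount; i++) {
            if (i > 0) {
                declaration.append(", ");
            }
            declaration.append("String p").append(i);
        }
        return declaration.toString();
    }
}
```

With these strings the query could be set up via query.setFilter(...) and query.declareParameters(...), and the tokens passed with query.executeWithArray(search.toArray()). I have not measured how this behaves against the datastore, so treat it as a starting point.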

Conclusion

Text search can be implemented in GAE and, to boot, it can be implemented efficiently. Just remember that before you store a Customer record you need to build out the CustomerIndex object with the set of words to search on. I simply concatenate all the properties into one string and let Lucene build the set by calling my getIndex().
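
The concatenation step might look roughly like the sketch below. The helper class and method names are hypothetical, and which fields to include (name, contact name, comments, address lines, phone numbers) is an assumption based on the data model above:

```java
import java.util.List;

/** Hypothetical helper, not part of the original article. */
final class IndexTextBuilder {

    /** Joins the non-empty field values into one space-separated string. */
    static String joinFields(List<String> fields) {
        StringBuilder text = new StringBuilder();
        for (String field : fields) {
            // Skip fields that were never filled in
            if (field == null || field.trim().isEmpty()) {
                continue;
            }
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(field.trim());
        }
        return text.toString();
    }
}
```

The resulting string would then be passed to getIndex() and the returned set stored on the CustomerIndex before the Customer is persisted. Note that the maxTokens argument used at indexing time would need to be large enough to cover all the words, unlike the limit of 3 used for search phrases.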
