Shaping up InterMine Search — My GSoC journey so far, Part 1

Arunan Sugunakumar
Published in InterMine · 3 min read · Jun 6, 2018


I am working with InterMine as part of Google Summer of Code 2018. InterMine is a platform which integrates data from different biological sources, providing a web-based user interface for querying and analysing the data. This is my first time in GSoC, and so far I am enjoying being part of the organization. Nearly four weeks have gone by in the program, and I would like to update the InterMine community on my progress.

As I explained above, InterMine is a data warehouse platform, which means it provides access to many biological data sources. Searching and querying through them is an essential feature of the platform. Currently it uses the Apache Lucene (v3.0.2) library to index the data so that searching can be done quickly. In the current implementation, the InterMine platform is tightly coupled with Lucene, and a user who wants to adapt the platform to their own needs is stuck with this older version of Lucene. My project is to separate out the search implementation and integrate it with a more advanced search engine such as Apache Solr.

In the early stage, my job was to evaluate different search engines such as Apache Solr and Elasticsearch and select a suitable candidate. I went with Apache Solr, because it has been around for a very long time and there is a strong open source community behind it. Of course, both Apache Solr and Elasticsearch use Apache Lucene underneath, but they have built many features on top of it. They run as a separate server and communicate with the actual application through REST clients. Thus, they provide the ability to scale and to avoid a single point of failure, regardless of our application's limitations.

So basically, there are two sides to my project: indexing and searching. The beauty of InterMine is that we do not need a static data model; the data can have any number of fields. So during the indexing process, the dynamic nature of the data has to be kept in mind. So far I have completed the data indexing procedure, and I was able to query the data in the Solr admin dashboard.
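Because the data model is not fixed, the index schema cannot enumerate every field up front. One feature Solr offers out of the box for this is dynamic field rules, which match incoming fields by naming pattern instead of requiring each one to be declared. As a rough sketch of what this could look like in a Solr schema (the field names and types below are illustrative, not InterMine's actual configuration):

```xml
<!-- Illustrative Solr schema snippet, not the real InterMine schema. -->
<!-- Any field whose name ends in "_t" is indexed as general text,
     without having been declared in advance. -->
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>

<!-- A catch-all field so a single keyword query can hit every text field. -->
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="*_t" dest="_text_"/>
```

With rules like these, documents with arbitrary field sets can be indexed as they come, which fits InterMine's dynamic data model.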

Query done on indexed data from the Solr admin dashboard

The good news is that even though we have separated out the search engine, there are no extra steps needed other than starting the Solr server. The same steps that are followed to set up an InterMine instance are sufficient to make this work. Earlier, InterMine used Ant to manage its build process. In the upcoming InterMine 2.0 version, the project is shifting to Gradle, and I am doing my upgrades against this new version. I am doing all the search engine integration in a manner that requires only a minimal amount of changes if a user wants to use another search engine such as Elasticsearch, or wants to upgrade to a newer version of Solr.

From the search perspective, I have done an initial implementation to handle keyword search queries.
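To give a feel for what a keyword search amounts to once Solr is in place: since Solr runs as a separate server, a search is ultimately just an HTTP request to the core's `/select` handler. A minimal sketch of building such a request with only the JDK (the core name `intermine` and the chosen parameters are my own illustrative assumptions, not the project's actual configuration):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SolrQueryBuilder {
    // Hypothetical base URL; "intermine" is an assumed core name.
    static final String SOLR_BASE = "http://localhost:8983/solr/intermine";

    /** Build a /select request URL for a simple keyword search. */
    static String keywordSearchUrl(String keyword, int rows) {
        // URL-encode the user's keywords (spaces become '+').
        String q = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        return SOLR_BASE + "/select?q=" + q + "&wt=json&rows=" + rows;
    }

    public static void main(String[] args) {
        System.out.println(keywordSearchUrl("protein kinase", 10));
        // → http://localhost:8983/solr/intermine/select?q=protein+kinase&wt=json&rows=10
    }
}
```

Fetching that URL returns the matching documents as JSON, which the application can then map back into its own result objects. (In practice a client library such as SolrJ would build and send these requests, but the underlying request looks like this.)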

Me after my first successful query

In the coming weeks, I will be implementing facets (a technique for accessing information organized according to a faceted classification system). I will also be working on creating a Docker image for the Solr instance with all the necessary configurations added. Anyone who wants to follow my work can go to GitHub and check out my branch: https://github.com/arunans23/intermine/tree/gradle-search


Software Engineer (Integration) @wso2, Graduated from University of Moratuwa (Computer Science & Eng) and GSOCer @intermineorg