A few months ago I came across the Free Law Project and its awesome dataset of court decisions from around the country. I think it is amazing that everyday people can have easy and free access to a searchable index of court decisions. In addition to the front-end search page and search API, the project lets more advanced users download the dataset through its bulk API. People are doing some pretty interesting things with this dataset. I even came across what looked like a decision prediction engine. I decided to use the dataset to create my own search index because it was the perfect opportunity to play around with Elasticsearch (a distributed search engine). Note that I have no intention of replacing the existing functionality provided on courtlistener.com. This is more of an academic exercise for me.
Getting and processing the data
The bulk API outputs a compressed file containing opinions in XML format. I started off by creating a bash script which accepts one of the compressed files produced by the bulk API, iterates over the opinions and gets them ready for import into the search index. Creating the script proved challenging at first because I did not have enough disk space available on my MBP to store and process the 12 GB file. I then realized that I have tons of space available to me in Azure storage, so I created an Azure Ubuntu VM with an attached disk to test my processing script.
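As an illustration of the processing step, here is a minimal Python sketch (my real script was bash, so this is a stand-in, not the script itself). It assumes the download is a .tar.gz holding one XML file per opinion, and it reads the archive as a stream so nothing has to be unpacked to disk. The index name `opinions`, the flat tag-to-text mapping and the sample document are all made up for the example; the real schema may well look different.

```python
import io
import json
import tarfile
import xml.etree.ElementTree as ET


def bulk_lines(archive, index="opinions"):
    """Yield Elasticsearch bulk-API lines for every XML file in a .tar.gz stream."""
    # "r|gz" reads the archive as a forward-only stream, so nothing is
    # unpacked to disk and only one opinion is parsed at a time.
    with tarfile.open(fileobj=archive, mode="r|gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".xml")):
                continue
            root = ET.parse(tar.extractfile(member)).getroot()
            # Hypothetical schema: flatten each child element to tag -> text.
            doc = {child.tag: (child.text or "").strip() for child in root}
            # Use the file name without directory or extension as the id.
            doc_id = member.name.rsplit("/", 1)[-1][: -len(".xml")]
            # The bulk format is an action line followed by the document line.
            yield json.dumps({"index": {"_index": index, "_id": doc_id}})
            yield json.dumps(doc)


def make_sample_archive():
    """Build a tiny in-memory archive so the sketch runs without the real data."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        xml = b"<opinion><court>scotus</court><text>Affirmed.</text></opinion>"
        info = tarfile.TarInfo("opinions/42.xml")
        info.size = len(xml)
        tar.addfile(info, io.BytesIO(xml))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for line in bulk_lines(make_sample_archive()):
        print(line)
```

With the real download you would pass `open(path, "rb")` instead of the sample archive and write the lines to a file, or send them to the index in batches.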
Choosing a platform
By the time I was done with the script I started thinking about how I would get Elasticsearch stood up on multiple nodes with discovery enabled. I came across the Elasticsearch Cloud Azure Plugin, which is maintained by Elasticsearch. This plugin uses the Azure Management API for discovery and the Azure Storage API for snapshots. I started implementing it but then I thought maybe it didn’t have to be this hard. Why should I have to mess with certificate stores and all that when the service runtime API provides all the info I need for discovery? This is when I decided to focus my efforts on getting this going in the Azure PaaS environment. I came up with a plan of attack to get things going:
- Get Java Installed
- Persistent Storage for data
- Persistent Storage for snapshots
- Configuring and running Elasticsearch
- A discovery plugin based on the runtime api
- Internal load balanced endpoints
At this point I was confident I could get everything done using Azure Worker roles.
Setting up the project
Getting Java installed seemed like a no-brainer. All I needed to do was create a startup script which downloaded and ran the installer. After a few bruises and a couple of hours into creating the startup script as a batch file, I decided it didn’t make sense to be battling with batch scripts when I had access to the full .NET Framework and a much more modern scripting engine in PowerShell. I also decided that I should include the Java installer in the project instead of downloading it, because it made more sense to take that hit when uploading the package than during startup. Once I had a working PowerShell script which ensured Java was installed, all I needed to do was configure and run Elasticsearch.
Azure PaaS is a mostly stateless environment, so applications must point to some external persistent storage. I was thinking that I could attach a VHD and mount it on my worker instances but eventually decided to try the new Azure File service with SMB support. It turns out that someone had already thought of this and released some code which made it very easy to mount Azure File storage shares as mapped drives in worker roles. With that, my persistent storage problem was solved. Elasticsearch could store its data on what would look like just another local drive. It is interesting to note that this is the only aspect of this solution I could not test on the Azure Emulator, so the code uses resource directories for storage when running in the emulator.
Configuring and Running Elasticsearch
Elasticsearch does not have an installer per se. Instead, you use platform-specific scripts to run it or install it as a service. So all I needed to do in the RoleEntryPoint was start the process and wait for it to exit. Also, when the RoleEntryPoint exits I can have it stop the process. The Elasticsearch configuration can be placed in either a .yml or a .json file. I didn’t find out about the .json option until afterwards, so I pulled in a project called YamlDotNet to programmatically write configuration values, such as the node name (instance id), that would only be available in the RoleEntryPoint. The deployment starts off with a base .yml, merges the base config with its own custom values in memory and writes the merged file to the final Elasticsearch config directory.
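The merge step is simple enough to sketch. The project does this in .NET with YamlDotNet; the Python below is only an illustration of the idea and takes the .json route so it needs nothing outside the standard library. The function names, the file name `elasticsearch.json`, the instance id and the drive letter are assumptions for the example. `cluster.name`, `node.name` and `path.data` are ordinary Elasticsearch settings.

```python
import json
import tempfile
from pathlib import Path


def merge_config(base, overrides):
    """Return a new dict: base settings with per-instance overrides on top."""
    merged = dict(base)       # copy, so the base config is left untouched
    merged.update(overrides)  # instance values win on conflicts
    return merged


def write_config(base_path, config_dir, overrides):
    """Read the base file, merge in memory, write the result to config_dir."""
    base = json.loads(Path(base_path).read_text())
    merged = merge_config(base, overrides)
    target = Path(config_dir) / "elasticsearch.json"  # assumed file name
    target.write_text(json.dumps(merged, indent=2))
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base_file = Path(tmp) / "base.json"
        base_file.write_text(
            json.dumps({"cluster.name": "opinions", "node.name": "placeholder"})
        )
        # Hypothetical instance values; in the role they come from the runtime.
        out = write_config(
            base_file, tmp, {"node.name": "ElasticRole_IN_0", "path.data": "P:\\data"}
        )
        print(out.read_text())
```

Keeping the merge in memory means the base file in the package is never modified, so every instance starts from the same defaults.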
A discovery plugin based on the runtime api
I thought the discovery plugin would be a walk in the park until I encountered a big problem. The runtime API for Java is backed by named pipes. Unfortunately, the named pipe is only available when using a ProgramEntryPoint in the service definition (I wish they would change that or at least make it configurable in the service definition). At this point I thought of moving the entire project to Eclipse or using a console application as a ProgramEntryPoint, but that took away my ability to simply run and debug in the emulator. Then I thought that if they were already using IPC for the Java service runtime, it should be good enough for my solution. All I had to do was set up a mini server in the RoleEntryPoint that the Java discovery plugin could communicate with. I did a proof of concept for both TCP and named pipes but eventually decided on named pipes because it was simpler. So with a working named pipes server answering all the questions about the state of the cloud service, and the persistent storage for both data and snapshots abstracted away, I was able to remove the dependency on the Azure API for Java in my plugin project.
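To give a feel for the mini server idea, here is a rough Python sketch of the TCP variant from my proof of concept (the project itself settled on named pipes, and the real code is .NET on one side and Java on the other). The one-line-question, one-line-JSON-answer protocol, the state keys and the addresses are invented for the example.

```python
import json
import socket
import socketserver
import threading

# Hypothetical snapshot of the cloud service state; the real role would read
# this from the service runtime rather than hard-coding it.
CLUSTER_STATE = {
    "nodes": ["10.0.0.4:9300", "10.0.0.5:9300"],
    "data_path": "P:\\data",
}


class DiscoveryHandler(socketserver.StreamRequestHandler):
    """Answer one question per connection: a key in, a JSON value out."""

    def handle(self):
        question = self.rfile.readline().decode("utf-8").strip()
        answer = CLUSTER_STATE.get(question)  # unknown keys yield null
        self.wfile.write((json.dumps(answer) + "\n").encode("utf-8"))


def start_server():
    """Start the server on a free loopback port and return it."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), DiscoveryHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(port, question):
    """What the plugin side would do: send a question, read the JSON reply."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        with conn.makefile("rw", encoding="utf-8") as stream:
            stream.write(question + "\n")
            stream.flush()
            return json.loads(stream.readline())


if __name__ == "__main__":
    server = start_server()
    try:
        print(ask(server.server_address[1], "nodes"))
    finally:
        server.shutdown()
        server.server_close()
```

The point of the design is that the plugin only needs to know how to ask questions over a local channel, so it carries no Azure dependency of its own.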
Internal load balanced endpoints
My goal for this solution is not to expose the Elasticsearch cluster publicly but rather to create a WebRole with a custom UI and/or another WorkerRole which exposes a custom API to interact with the Elasticsearch index. Looking at the service definition and reading around the internet, it seems that I can’t declare an internal load-balanced endpoint which can only be accessed by my public-facing roles. It seems like my only option is to write a thin layer which cycles through or returns random node endpoints from the Elasticsearch cluster.
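Such a thin layer could be as small as the following Python sketch. It is not code from the project; the class name and the addresses are placeholders, and a real version would also need to refresh the list when instances come and go.

```python
import itertools
import random


class NodePicker:
    """Hand out Elasticsearch node endpoints round-robin or at random."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._cycle = itertools.cycle(self._endpoints)

    def next(self):
        """Return the next endpoint, wrapping around at the end of the list."""
        return next(self._cycle)

    def random(self, rng=random):
        """Return a randomly chosen endpoint."""
        return rng.choice(self._endpoints)


if __name__ == "__main__":
    # Placeholder internal addresses for three cluster nodes.
    picker = NodePicker(["10.0.0.4:9200", "10.0.0.5:9200", "10.0.0.6:9200"])
    for _ in range(4):
        print(picker.next())
```

Round-robin spreads requests evenly, while the random variant needs no shared position and so is easier to use from several front-end instances at once.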
Elasticsearch is currently logging to a resource folder on the role, but I would really like to ship the logs somewhere. I am thinking that I can either hook up Elasticsearch logging to the diagnostic trace store, which gets shipped to Azure storage, or simply mount another Azure File share dedicated to logging. I also plan to read up on Logstash, but I am not sure what value it would provide to my solution.
Once I have all the configuration pieces in place, publishing this service will be super easy. With the click of a button (or two) I would have a private Elasticsearch cluster and a public-facing website/API for interacting with it. I already have a script which will allow me to bootstrap the data, but I have not thought of how to integrate it into the deployment workflow. The other challenge is scaling Elasticsearch. Scaling Elasticsearch is not as straightforward as adding more identical nodes, the way you would for something like a web application. You really have to put some thought into it. For this project I am thinking that the data, once bootstrapped, would be mostly read-only. Therefore, the requirements will differ from those of other clusters. For example, I can have a set of main data nodes on one WorkerRole and a set of replica-only nodes (I need to read up on this) in another WorkerRole configured with autoscaling.
Thanks for taking the time to read my random stuff. The code for the project inspired by this proof of concept can be found on GitHub: https://github.com/garvincasimir/Elasticsearch-Azure-PAAS