Elasticsearch In Azure PAAS [Worker Roles]

The Preamble
A few months ago I came across the Free Law Project and its awesome dataset of court decisions from around the country. I think it is amazing that everyday people can have easy and free access to a searchable index of court decisions. In addition to their front-end search page and search API, they allow more advanced users to download their dataset using their bulk API. People are doing some pretty interesting things with this dataset; I even came across what looked like a decision prediction engine. I decided to use the dataset to create my own search index because it was the perfect opportunity to play around with Elasticsearch (a distributed search index). Note that I have no intention of replacing the existing functionality provided on courtlistener.com. This is more of an academic exercise for me.

Getting and processing the data
The bulk API outputs a compressed file containing opinions in XML format. I started off by creating a bash script which accepts one of the compressed files produced by the bulk API, iterates over the opinions, and gets them ready for import into the search index. Creating the script proved challenging at first because I did not have enough disk space available on my MBP to store and process the 12 GB file. I then realized that I have tons of space available to me in Azure storage, so I created an Azure Ubuntu VM with an attached disk to test my processing script.

Choosing a platform
By the time I was done with the script, I started thinking about how I would get Elasticsearch stood up on multiple nodes with discovery enabled. I came across the Elasticsearch Cloud Azure Plugin, which is maintained by Elasticsearch. This plugin uses the Azure Management API for discovery and the Azure Storage API for snapshots. I started implementing it, but then I thought maybe it didn't have to be this hard. Why should I have to mess with certificate stores and all that when the Service Runtime API provides all the info I need for discovery? This is when I decided to focus my efforts on getting this going in the Azure PAAS environment. I came up with a plan of attack to get things going:

  • Get Java installed
  • Set up persistent storage for data
  • Set up persistent storage for snapshots
  • Configure and run Elasticsearch
  • Build a discovery plugin based on the runtime API
  • Set up internal load-balanced endpoints
  • Set up logging

At this point I was confident I could get everything done using Azure Worker roles.

Elasticsearch PAAS

Setting up the project
Getting Java installed was a no-brainer. All I needed to do was create a startup script which downloaded and ran the installer. After a few bruises and a couple of hours into creating the startup script as a batch file, I decided it didn't make sense to be battling with batch scripts when I had access to the full .NET Framework and a much more modern scripting engine in PowerShell. I also decided that I should include the Java installer in the project instead of downloading it, because it made more sense to take that hit when uploading the package instead of during startup. Once I had a working PowerShell script which ensured Java was installed, all I needed to do was configure and run Elasticsearch.

Elastic PAAS Solution

Persistent Storage
Azure PAAS is a mostly stateless environment, so applications must point to some external persistent storage. I was thinking that I could attach a VHD and mount it on my worker instances, but eventually decided to try the new Azure File service with SMB support. It turns out that someone had already thought of this and released some code which made it very easy to mount Azure File storage shares as mapped drives in worker roles. With that, my persistent storage problem was solved. Elasticsearch could store its data on what would look like just another local drive. It is interesting to note that this is the only aspect of the solution I could not test on the Azure Emulator, so the code uses resource directories for storage when running in the emulator.
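The path selection just described can be sketched as a tiny helper. In the real role the emulator flag would come from RoleEnvironment.IsEmulated and the resource directory from a declared local storage resource, so they are passed in as parameters here; the paths themselves are made up for illustration:

```csharp
public static class ElasticPaths
{
    // Pick where Elasticsearch keeps its data: a resource directory when
    // running in the Azure Emulator, otherwise the Azure File share that
    // was mounted as a mapped drive (e.g. Z:).
    public static string DataPath(bool isEmulated, string localResourceRoot, string mappedDrive)
    {
        return isEmulated
            ? localResourceRoot                       // emulator: role resource directory
            : mappedDrive + @"\elasticsearch\data";   // cloud: mapped Azure File share
    }
}
```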

Configuring and Running Elasticsearch
Elasticsearch does not have an installer per se. Instead, you use platform-specific scripts to run it or install it as a service. So all I needed to do in the RoleEntryPoint was start the process and wait for it to exit; when the RoleEntryPoint exits, I can have it stop the process. The Elasticsearch configuration can be placed in either a .yml or a .json file. I didn't find out about the .json option until later, so I pulled in a project called YamlDotNet to programmatically write configuration values, such as the node name (instance ID), that would only be available in the RoleEntryPoint. The deployment starts off with a base .yml, merges the base config with its own custom values in memory, and writes the merged file to the final Elasticsearch config directory.
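The merge step can be illustrated with plain dictionaries. The real code round-trips through YamlDotNet; the setting names below (cluster.name, node.name) are just standard Elasticsearch keys used as examples:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ElasticConfig
{
    // Overlay instance-specific values on the base config and emit flat
    // "key: value" YAML lines. Role-specific values win on conflict.
    public static string Merge(IDictionary<string, string> baseConfig,
                               IDictionary<string, string> overrides)
    {
        var merged = new Dictionary<string, string>(baseConfig);
        foreach (var kv in overrides)
            merged[kv.Key] = kv.Value;

        return string.Join("\n", merged.Select(kv => kv.Key + ": " + kv.Value));
    }
}
```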

A discovery plugin based on the runtime api
I thought the discovery plugin would be a walk in the park until I encountered a big problem. The Service Runtime API for Java is backed by a named pipe. Unfortunately, the named pipe is only available when using a ProgramEntryPoint in the service definition (I wish they would change that, or at least make it configurable in the service definition). At this point I thought of moving the entire project to Eclipse or using a console application as a ProgramEntryPoint, but that would take away my ability to simply run and debug in the emulator. Then I thought: if they were already using IPC for the Java service runtime, it should be good enough for my solution. All I had to do was set up a mini server in the RoleEntryPoint that the Java discovery plugin could communicate with. I did a proof of concept for both TCP and named pipes but eventually settled on named pipes because it was simpler. With a working named pipe server answering all the questions about the state of the cloud service, and the persistent storage for both data and snapshots abstracted away, I was able to remove the dependency on the Azure API for Java in my plugin project.
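Here is a rough sketch of the idea, under my own assumptions about the protocol: a single line-based request/response, with the pipe name and the "NODES" verb made up for illustration. In the real project the client side lives in the Java discovery plugin; it is written in C# here only so the round trip is self-contained:

```csharp
using System.IO;
using System.IO.Pipes;
using System.Threading.Tasks;

public static class DiscoveryPipe
{
    public const string PipeName = "ElasticDiscovery"; // hypothetical name

    // RoleEntryPoint side: answer one "NODES" request with a
    // comma-separated list of cluster node endpoints.
    public static async Task ServeOnceAsync(string[] nodeEndpoints)
    {
        using (var server = new NamedPipeServerStream(PipeName, PipeDirection.InOut, 1,
                   PipeTransmissionMode.Byte, PipeOptions.Asynchronous))
        {
            await server.WaitForConnectionAsync();
            using (var reader = new StreamReader(server))
            using (var writer = new StreamWriter(server) { AutoFlush = true })
            {
                var request = await reader.ReadLineAsync();
                await writer.WriteLineAsync(
                    request == "NODES" ? string.Join(",", nodeEndpoints) : "");
            }
        }
    }

    // Plugin side: ask the role host which nodes exist.
    public static async Task<string> QueryNodesAsync()
    {
        using (var client = new NamedPipeClientStream(".", PipeName, PipeDirection.InOut,
                   PipeOptions.Asynchronous))
        {
            await client.ConnectAsync(2000);
            using (var reader = new StreamReader(client))
            using (var writer = new StreamWriter(client) { AutoFlush = true })
            {
                await writer.WriteLineAsync("NODES");
                return await reader.ReadLineAsync();
            }
        }
    }
}
```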

Internal load balanced endpoints
My goal for this solution is not to expose the Elasticsearch cluster publicly, but rather to create a WebRole with a custom UI and/or another WorkerRole which exposes a custom API to interact with the Elasticsearch index. Looking at the service definition and reading around the internet, it seems that I can't declare an internal load-balanced endpoint which can only be accessed by my public-facing roles. It seems my only option is to write a thin layer which cycles through or returns random node endpoints from the Elasticsearch cluster.
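That thin layer could be as simple as a round-robin picker over the known node endpoints; a minimal sketch (the endpoint values are illustrative):

```csharp
using System.Threading;

public class NodePool
{
    private readonly string[] _endpoints;
    private int _next = -1;

    public NodePool(string[] endpoints)
    {
        _endpoints = endpoints;
    }

    // Hand out endpoints in rotation; Interlocked keeps this safe when
    // many web-role request threads call in concurrently.
    public string NextEndpoint()
    {
        int i = Interlocked.Increment(ref _next);
        return _endpoints[i % _endpoints.Length];
    }
}
```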

Logging
Elasticsearch is currently logging to a resource folder on the role, but I would really like to ship the logs somewhere. I am thinking that I can somehow hook Elasticsearch logging up to the diagnostic trace store, which gets shipped to Azure storage, or simply mount another Azure File share dedicated to logging. I also plan to read up on Logstash, but I am not sure what value it would provide to my solution.

Additional Thoughts
Once I have all the configuration pieces in place, publishing this service will be super easy. With the click of a button (or two) I would have a private Elasticsearch cluster and a public-facing website/API for interacting with it. I already have a script which will allow me to bootstrap the data, but I have not thought of how to integrate it into the deployment workflow. The other challenge is scaling Elasticsearch, which is not as straightforward as adding more identical nodes the way you would with a web application. You really have to put some thought into it. For this project, I am thinking that the data, once bootstrapped, would be mostly read-only, so the requirements will differ from other clusters. For example, I can have a set of main data nodes in one WorkerRole and a set of replica-only nodes (I need to read up on this) in another WorkerRole configured with autoscaling.

Thanks for taking the time to read my random stuff. The code for the project inspired by this proof of concept can be found on Github: https://github.com/garvincasimir/Elasticsearch-Azure-PAAS


Tutorial: MVC application using Azure ACS and forms authentication, Part 1

A few months ago I created a post which described a method for adding single sign-on services to an ASP.NET MVC website using Windows Azure ACS. The workflow involved allowing users to sign up using forms authentication, then associating their on-site identities with their Facebook, Google, Yahoo and Windows Live identities. This post is meant to be a more detailed follow-up which will take you through some other things not covered in my original post.

The tutorial steps are as follows:

  • Create an ASP.NET MVC 3 web application with forms authentication
  • Register a new user
  • Add an Azure ACS STS reference to the application
  • Alter web.config to allow both STS and forms authentication
  • Add a database table to store mappings between identities and user accounts
  • Create a custom class for extracting required claims
  • Create a special controller action which Azure ACS will call to sign users in
  • Create a hybrid login page with new functions
    • Login with forms
    • Login with ACS
    • Prompt user to login and complete association process

Here is the workflow from the original post:

Assumptions about audience

Please ensure you have the following before proceeding with this tutorial

  1. Windows Azure account
  2. Azure Access Control Services (ACS) working
  3. Windows Identity Foundation installed and working
  4. Basic understanding of ASP.NET membership
  5. Comfortable with ASP.NET MVC
  6. Visual Studio 2010
  7. SQL Express installed with the required permissions
  8. Experience with ACS/STS, either through your own projects or through tutorials such as those on acs.codeplex.com
Again, if you have not been able to get ACS working using other samples and tutorials, then it will be very difficult for you to follow this tutorial, as I will not go into detailed explanations about setting up ACS, installing Windows Identity Foundation, or ASP.NET membership.

Creating the Project

We will start by creating a new ASP.NET MVC 3 project. I called mine MVC3MixedAuthenticationSample.

Use the “Internet Application” template

Once the project is created add a reference to the Microsoft.IdentityModel assembly.

You should have a working Asp.net MVC 3 application at this point. The asp.net database will be created in the App_Data folder when we run the application.

Now run the project and register a new user. I called mine bob and gave him the password ‘password’.

Your new user should now be logged into the site. You can also use the ASP.NET configuration tool to add a new user (you must build the project first). If you get any errors while attempting to add a new user, you probably don't have the correct permissions to the App_Data folder in the project or to SQL Express. There are several things you can try to resolve this issue that I can mention later if needed.

Our next step will be to include the newly created database in our project. This file is hidden by default so click the second icon from the left below “Solution Explorer” to show hidden files.

The database should now appear in the list of databases on the server explorer. Right click on the Tables node to add a new table.

The new table will store the mappings between the acs identities and the local users.

Remember to enable auto increment for the primary key (IdentityID)

Save table and call it UserIdentity

Now we need data access to our new table and any other tables related to it. I decided to use Entity Framework for this, but you can use any ORM or plain ADO.NET if you like. Many people are reading this because they want to modify their existing projects, so they may be using something else for data access. However, the principles are the same.

Create a new entity data model and opt to generate the entity from a database.

The “ApplicationServices” connection should already be selected. If not, select it. If it is not there, I would say add it but it not being there might be a symptom of a bigger issue.

Add the UserIdentity table to the model. I have the aspnet_users table selected but it is not required since we will be using the asp.net membership api for everything related to local users.
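For reference, the entity generated for the mapping table will have roughly this shape; the property names below are assumptions based on the table we just created, not the exact designer output:

```csharp
// Hypothetical sketch of the UserIdentity mapping row. In the project the
// class is generated by the Entity Framework designer from the database.
public class UserIdentity
{
    public int IdentityID { get; set; }          // auto-increment primary key
    public string UserName { get; set; }         // local ASP.NET membership user
    public string IdentityProvider { get; set; } // e.g. "Facebook" (ACS identityprovider claim)
    public string IdentityValue { get; set; }    // ACS name identifier claim
}
```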

Now that the database and data access are set up, we can add a reference to the Azure ACS STS.

Please fill in the URL for the MVC application.

Select “Use an existing STS” and paste in the link to your ACS metadata. This can be found in your Azure control panel.

The following claims will be used to associate users with their identities

The wizard automatically disables forms authentication so the next step is to edit our web.config and turn forms authentication back on.

Open the web.config, find the attribute passiveRedirectEnabled, and set it to false.

<wsFederation passiveRedirectEnabled="false" issuer="https://[namespace].accesscontrol.windows.net/v2/wsfederation" realm="http://localhost:52119/" requireHttps="false" />

Find the following and comment it out:

<authentication mode="None" />

Find the following and uncomment it:

<!--<authentication mode="Forms"><forms loginUrl="~/Account/LogOn" timeout="2880" /></authentication>-->

Find the following and comment it out:

 <deny users="?" />

At this point you should be able to run the application and log in with the user you created at the beginning.

Next we will create a new class called IdentityClaim, which we will use to extract the claims from the STS response. I added mine to the Models folder. Please note the using statement for Microsoft.IdentityModel.Claims.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using Microsoft.IdentityModel.Claims;

namespace MVC3MixedAuthenticationSample.Models
{
    public class IdentityClaim
    {
        private string _identityProvider;
        private string _identityValue;
        public const string ACSProviderClaim = "http://schemas.microsoft.com/accesscontrolservice/2010/07/claims/identityprovider";

        public IdentityClaim(IClaimsIdentity identity)
        {
            if (identity != null)
            {
                foreach (var claim in identity.Claims)
                {
                    if (claim.ClaimType == ClaimTypes.NameIdentifier)
                        _identityValue = claim.Value;

                    if (claim.ClaimType == ACSProviderClaim)
                        _identityProvider = claim.Value;
                }
            }
        }

        public IdentityClaim()
        {
            _identityProvider = HttpContext.Current.Session["IdentityProvider"] as string;
            _identityValue = HttpContext.Current.Session["IdentityValue"] as string;
        }

        public bool HasIdentity
        {
            get { return !string.IsNullOrEmpty(_identityProvider) && !string.IsNullOrEmpty(_identityValue); }
        }

        public string IdentityProvider
        {
            get { return _identityProvider; }
        }

        public string IdentityValue
        {
            get { return _identityValue; }
        }

        internal void SaveToSession()
        {
            HttpContext.Current.Session["IdentityProvider"] = _identityProvider;
            HttpContext.Current.Session["IdentityValue"] = _identityValue;
        }

        public static void ClearSession()
        {
            // Remove the values stored by SaveToSession
            HttpContext.Current.Session.Remove("IdentityProvider");
            HttpContext.Current.Session.Remove("IdentityValue");
        }

        public static string ProviderNiceName(string identityProvider)
        {
            if (identityProvider.ToLower().Contains("windowslive"))
                return "Windows Live";

            if (identityProvider.ToLower().Contains("facebook"))
                return "Facebook";

            if (identityProvider.ToLower().Contains("yahoo"))
                return "Yahoo";

            if (identityProvider.ToLower().Contains("google"))
                return "Google";

            return identityProvider;
        }
    }
}

In my next post I will talk about creating the new controller action that will support Azure ACS sign-in, and modifying the generated LogOn page to include both the ACS buttons and the regular forms authentication option.