Building Analytics with Data Lake

Published On: Jun 9, 2016

As part of a project I am working on, I have been doing some work with Azure Data Lake. The solution needs to analyze large amounts of data generated by users and condense it into some interesting analytics. For the PoC we decided to look at two different directions for solving the problem. The first was to build a solution in Data Lake that would take all of the generated data and summarize it into something that would make some sense. The second was building an application to run the analytics.

At the moment I think the hardest part of working with the toolset is the lack of documentation. I'm sure this will improve over time; it's normal for pre-release software.

So... what was I attempting to do? The idea is simple: we wanted one of our applications to publish a message to an Event Hub to capture that an event happened. We are expecting several thousand of these messages over the course of an hour (not really too ambitious, but it's a place to start, and the key is making sure that we can scale when we need to). From there, web workers would take those messages off of the Event Hub and push them into table storage (cheap storage is always a good thing); then we would move them from table storage over to Data Lake and run some processes to aggregate the data. Nothing too complicated, and after working through all of the steps needed to make it happen we were successful. The whole process worked fairly well and would have done what we needed, and the pricing was much less than I expected for the demo. Even with all of the services spun up, I was still going to come in well under my $50 MSDN allocation.
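To make the flow concrete, here is a minimal sketch of the pipeline. This is a plain in-memory simulation, not the real Azure SDKs: the queue, the table store, and the aggregation function are stand-ins I made up for Event Hubs, Azure Table storage, and the Data Lake summarization job, just to show the shape of the data movement.

```python
from collections import defaultdict, deque

# Stand-in for the Event Hub: the application appends event messages here.
event_hub = deque()

def publish(user_id, event_type, hour):
    """The application publishes a message capturing that an event happened."""
    event_hub.append({"user": user_id, "type": event_type, "hour": hour})

# Stand-in for table storage: the web worker drains the hub into cheap storage.
table_storage = []

def worker_drain():
    """A web worker takes messages off the hub and pushes them to storage."""
    while event_hub:
        table_storage.append(event_hub.popleft())

def aggregate(rows):
    """Stand-in for the Data Lake job: count events per (hour, event type)."""
    summary = defaultdict(int)
    for row in rows:
        summary[(row["hour"], row["type"])] += 1
    return dict(summary)

publish("u1", "click", 9)
publish("u2", "click", 9)
publish("u1", "view", 10)
worker_drain()
print(aggregate(table_storage))  # {(9, 'click'): 2, (10, 'view'): 1}
```

In the real PoC each arrow in this sketch is a hop between services, which is where the scaling story lives: the hub absorbs bursts, the workers drain at their own pace, and the aggregation runs as a batch over whatever has landed in storage.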

The application direction ended up exactly where you would think it would: too much yak shaving to make it worthwhile, so there's nothing really interesting to talk about there.

In the end we decided not to go this way, because we concluded that part of the underlying architecture was the real problem and that fixing it would give us better analytics. That said, I have to say I was really surprised by the quality of the product that's out there, and I'm extremely interested to see where the final version ends up.