Data analytics has become a crucial requirement for every business today. The rate at which data is generated is growing exponentially, and almost all the decisions we make are based on the data we’ve gathered. Okay, I’d better stop here, for it’s impossible to summarize the importance of data in a single paragraph, or even a single article. Besides, this article is not about the importance of data, because that has already been established.
This is about the analytics platform we built and its design considerations.
What is the role of an analytics platform? Well, basically what it does is:
- Gathering data
- Processing data
- Storing the processed/derived data, so that we can later pull the required insights easily.
Here, the processing and storing are domain specific. That is, what to process and what to store is decided based on the business context.
Gathering and storing are straightforward steps, so what matters most is the processing and the approach we use for it. At a high level, there are two main approaches to processing data:
- Batch processing
- Stream processing
In batch processing, we collect data into batches, process each batch at periodic intervals, and store the processed data. In this approach we need to allocate fairly large compute resources periodically, since we run the processing on a large collection of data at once. Moreover, there is a delay before the end result becomes available, determined by the processing interval we choose.
In stream processing, we process the data as soon as it is available and store the processed data right away. In this approach we need to allocate fairly small compute resources, but we have to allocate them more frequently (depending on the rate at which data is generated). However, there is only a small delay here, and the end result is available to be consumed almost instantly.
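As a rough illustration of the difference (the event shape and the counting aggregation here are made up for this sketch, not our actual processing logic), the two approaches can be contrasted like this:

```python
from typing import Iterable


# Batch processing: accumulate events, then process the whole batch at once.
def process_batch(events: list[dict]) -> dict:
    """Run one large aggregation over a collected batch of events."""
    totals: dict[str, int] = {}
    for event in events:
        totals[event["type"]] = totals.get(event["type"], 0) + 1
    return totals


# Stream processing: fold each event into the stored aggregate as it arrives.
def process_stream(events: Iterable[dict], totals: dict) -> dict:
    """Incrementally update the running aggregate, one event at a time."""
    for event in events:
        totals[event["type"]] = totals.get(event["type"], 0) + 1
    return totals
```

The batch version needs all the events in memory at once and runs on a schedule, while the stream version touches only the current event plus a small running state, which is why it can run on much smaller (but more frequent) compute allocations.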
Of the two approaches, we opted for stream processing, mainly because we wanted to make reports available to users within a short time period and we wanted to trigger other automated actions based on the processed data. Cost was a concern as well. I’m not saying that stream processing is more cost-effective than batch processing; it really depends on how it is implemented. The latter part of this article describes how we reduced the cost with this approach.
When we talk about stream processing, the most important component of the system is the stream of data, and we need a reliable way to store and manage that stream so it can be easily consumed for processing. There are several solutions that provide data streams. After trying out a few PoCs (proofs of concept), we shortlisted Apache Kafka and AWS Kinesis.
I’m not directly comparing which one is better, because both perform well and satisfy the same requirements in most cases. However, with Apache Kafka we have the option either to deploy on-premises or to use it as a service from vendors like Confluent, the main contributor to the Apache Kafka project. AWS Kinesis only comes as a service from AWS. Considering the scale of our data generation and the operational cost, we decided to go with AWS Kinesis since it was more cost-effective for us.
Now that we have selected the heart of the analytics platform, the next question is how we produce (send) data to the stream and how we consume it to process the data and generate the insights we want. There were two approaches to implementing consumers that poll the stream and process data when it becomes available:
- Standalone service
- Lambda function
Here again, we made the decision mainly based on the cost of running each option and the expected traffic. In our case, Lambda was the winner.
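To give a feel for the consumer side, here is a minimal sketch of a Kinesis-triggered Lambda handler in Python. The base64-encoded record layout under `Records[].kinesis.data` is how Kinesis delivers payloads to Lambda; the payload fields and the `store_metric` helper are hypothetical stand-ins for our actual processing and storage logic:

```python
import base64
import json


def store_metric(metric: dict) -> None:
    """Placeholder for persisting the derived data (e.g. to MySQL)."""
    print(f"storing {metric}")


def handler(event: dict, context=None) -> int:
    """Decode each Kinesis record in the incoming batch and process it."""
    processed = 0
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded in record["kinesis"]["data"].
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        store_metric({"type": payload["type"], "value": payload["value"]})
        processed += 1
    return processed
```

With the event-source mapping in place, AWS invokes this handler with batches of records automatically, so there is no polling loop or checkpointing code to write ourselves.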
Finally, where do we store the processed data? This was straightforward: we selected a MySQL database, since we were already using MySQL in our application.
The following diagram shows the high-level architecture of our analytics platform.
In the ideal case, we would use multiple shards and streams to properly distribute different types of data payloads and process them accordingly. However, we initially used a simple filtering mechanism as a data pre-processing step in Lambda, so that each Lambda function filters out a certain subtype of the dataset to process, rather than putting all the processing onto a single function.
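That filtering step can be sketched like this (the `subtype` field name and the one-subtype-per-function convention are assumptions for illustration; in practice the owned subtype would likely come from configuration such as an environment variable):

```python
# Each deployed Lambda function is configured with the one subtype it owns.
SUBTYPE = "page_view"


def filter_payloads(payloads: list[dict]) -> list[dict]:
    """Keep only the payloads this particular function should process;
    everything else in the shared stream is ignored."""
    return [p for p in payloads if p.get("subtype") == SUBTYPE]
```

Since every function sees the whole stream but processes only its own slice, this trades some redundant reads for a much simpler stream layout than one shard or stream per data type.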
On an additional note, I must say that the Lambda integration with Kinesis is seamless and hassle-free. AWS has done a great job of providing a simple integration that hides all the overhead of writing a stream consumer.
This is not the final architecture of our analytics platform. As the data and the types of processing scale, we continuously update it to meet our requirements. Who knows, some time from now it may be completely different from the initial design. To embrace future changes, the important thing is to design the system with fewer dependencies and to keep refactoring the source code so that it stays up to date and maintainable. System components like producers, the stream, and consumers should be pluggable, so that we can change individual components without collapsing the whole system.