Humans to Machines – The Shift in Data Sources
Data has been growing exponentially. More data now streams over the wire than we can keep on disk, in terms of both value and volume. This data is created by everything we deal with on a daily basis. When humans were the dominant creators of data, there was naturally less data to deal with, and its value persisted for a longer period. That still holds true today for data created by humans.
However, humans are no longer the dominant creators of data. Machines, sensors, devices and the like took over a long time ago. They create data at such speed that roughly 90% of all the data created since the dawn of civilization was reportedly created in the last two years. This data tends to have a limited shelf life as far as value is concerned: its value decreases rapidly with time. If it is not processed as soon as possible, it may not be very useful for ongoing business and operations. Naturally, we need a different thought process and approach to deal with this data.
Why stream analytics is the key to future analytics
More and more of this data is streaming in from many different sources. If it is combined and analyzed, it could create huge value for users and businesses. At the same time, given the perishable nature of the data, it is imperative that it be analyzed and used as soon as it is created.
More and more use cases are emerging that must be tackled to push the boundaries and achieve newer goals. These use cases demand collection of data from different sources, joins across different layers, and correlation and processing across different domains, all in real time. The future of analysis is less about understanding “what happened” and more about “what’s happening or what may happen”.
Let’s analyze some of the use cases. Consider an e-commerce platform that is integrated with a real-time streaming platform. Using this integrated streaming analysis, it could combine and process different data in real time to figure out the intent or behavior of the user and present a personalized offer or content. This could significantly increase the conversion rate or reduce the erosion of customer engagement. It could also improve campaign management to yield better results for the same spend.
Think of a small or mid-size data center (DC), which typically has many different kinds of devices and machines, each generating large volumes of data every moment. Such a DC typically uses many different static tools for different kinds of data in different silos. These tools not only prevent the DC from having a single view of the entire data center but also work like BI tools, showing only what has already happened. Because of this, issues are not identified in a predictive or real-time manner, and firefighting becomes the norm. With a converged, integrated stream analytics platform, the DC could have a single view of the entire data center along with real-time monitoring of events and data, so that issues are caught before they create bigger problems. A security breach could be seen or predicted well before the damage is done. Analyzing and forecasting bandwidth usage in near real time could enable better resource planning and provisioning.
The entire IoT is based on the premise that everything can generate data and interact with other things in real time to achieve larger goals. This requires a real-time streaming analytics framework that can ingest all sorts of unstructured data from disparate sources, monitor it in real time, and take action as required after identifying either known patterns or anomalies.
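To make "identifying anomalies" in a stream concrete, here is a minimal sketch in Python. It is only an illustration, not a description of any particular platform: the function name `detect_anomalies`, the window size, the threshold and the sample readings are all assumptions made for this example. It flags a reading when it lies more than a chosen number of standard deviations away from the mean of the most recent readings.

```python
from collections import deque
from math import sqrt


def detect_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the mean of the recent window.

    `window` and `threshold` are illustrative defaults, not recommended values.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            variance = sum((x - mean) ** 2 for x in recent) / window
            std = sqrt(variance)
            # Skip the check when the window is perfectly flat (std == 0).
            if std > 0 and abs(value - mean) > threshold * std:
                yield i, value
        recent.append(value)


# Hypothetical temperature readings arriving one at a time from a sensor.
sensor_stream = [20.1, 20.3, 19.9, 20.0, 20.2, 35.0, 20.1, 20.0]
anomalies = list(detect_anomalies(sensor_stream))
print(anomalies)
```

Because the function is a generator over an iterable, it can be fed from a socket or message queue just as well as from a list. A real deployment would likely need a more robust statistic, since a single spike inflates the window's standard deviation for the readings that follow it.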
AI and predictive analytics require that data be collected and processed in real time; otherwise the impact of AI would be limited to understanding what happened. With the growth in data volume and types, it would be prudent not to rely solely on what has been learnt in hindsight. The demand will be for reacting to new things as they are seen or felt. We have also learnt from experience that a model trained on older data often struggles to handle newer data with acceptable accuracy. Here too, therefore, a real-time streaming platform becomes a required part rather than a good-to-have piece.
Limitations with existing tools or platforms
The options available in the market fall into two broad categories. One is the appliance model; the other is a series of open source tools that need to be assembled to create a platform. While the former costs several million dollars up front, the latter requires dozens of consultants working for several months to create a platform. Time to market, cost, ease of use and the lack of unified options are a few of the major drawbacks. However, there are bigger issues that neither option addresses when it comes to stream processing, and here we require a new approach to solve the problems. We can’t apply older tools to newer, future-looking problems. Otherwise, the result will remain a patchwork and will not scale to the needs of the hour.
Challenges with Stream Processing
Here are the basic high-level challenges when it comes to dealing with a stream of data and processing it in real time:
· Dealing with high volumes of unstructured data
· Avoiding multiple copies of the data across different layers
· Optimal flow of the data through the system
· Partitioning the application across resources
· Processing streaming data in real time
· Data storage
· Remaining predictive rather than being only a forensic or BI tool
· Ease of use
· Time to market
· Deployment model
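Two of the challenges above, partitioning across resources and processing streaming data in real time, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the helper names `partition_for` and `tumbling_window_counts`, the device names, and the choice of CRC32 as a stable hash. It shows one common approach, not the only one: route each key to a fixed partition, and aggregate events in fixed, non-overlapping ("tumbling") time windows.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition using a stable hash.

    The same key always maps to the same partition, so all events for one
    device can be handled by the same worker.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) over fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical events: (seconds since start, device name).
events = [
    (0, "router-1"), (3, "router-2"), (7, "router-1"),
    (12, "router-1"), (14, "router-2"), (21, "router-1"),
]
counts = tumbling_window_counts(events, window_seconds=10)
for (window_start, key), n in sorted(counts.items()):
    print(window_start, key, n, "-> partition", partition_for(key, 4))
```

This sketch keeps every window in memory for simplicity. A real stream processor would emit each window's result and discard its state once the window closes, and would also have to decide how to treat events that arrive late or out of order.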
There are a few solutions and workarounds for stream processing in the current landscape. However, these may not be the optimal approach for current and future needs. We need a scalable solution that can process streaming data with low latency and is also easy to use. This is the only way innovations in IoT, AI and ML will be ushered in the right direction.