The Big Data promise of better decisions from an ever greater variety of data comes with the high cost of data integration

[Figure: Lifecycle.png]

Big Data promises smarter, finer-grained decision-making using the large variety of data sources that now permeate our society. However, it is widely known that the bottleneck in the Predictive Analytics Lifecycle is integrating and preparing the data for machine learning. This “data wrangling” step involves both format and semantic mediation across different data sources.

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane effort of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
— New York Times technology article “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”

While many tools have emerged to automate this task, fully automated entity resolution and semantic mediation of data remain elusive, and the work stays labor intensive. The data integration cost is also recurring, because the work must be repeated whenever the data changes. Factors relevant to modeling today’s trends may be irrelevant tomorrow. Some data sources have seasonal or year-to-year variations. Old data becomes stale, and newly collected data is needed. These time dynamics vary across data sources, so the frequency of this recurring cost grows with the number of data sources. The integration cost itself also grows with each new data source, because even with a common data model, that model may need to be extended or modified to fully capture the relevant information from the new source. Furthermore, the predictive model must be completely re-learned for each cycle of the predictive analytics workflow; with more data sources, this model becomes larger and combinatorially more complex, increasing the time and cost to re-learn it.

All of this data integration cost is incurred before its value can be quantified. The resulting analysis may well reveal that some of the data sources provide diminishing performance gains that are not worth the cost of integrating them. Consequently, many executives now question whether the promised ROI of Big Data can ever be realized.

86% of executives say their organizations have been at best only somewhat effective at meeting the primary objective of their data and analytics programs, including more than one-quarter who say they’ve been ineffective.
— McKinsey Global Executive Survey: The need to lead in data and analytics (2016)

Collaborative Analytics empowers organizations to realize the promise of Big Data without incurring the recurring cost of data integration that grows with each new data source

In Collaborative Analytics, locally derived analytics at each data source are integrated into a predictive network model. Our Collaborative Analytics platform operates as a distributed system: a network of collaborating agents organically embedded with the data, each adapting local predictive models derived from that data and exposing them as microservices on the network. Starting from business goals encoded as decision queries of the data, data analysts search for and select the matching models that answer these queries across data sources to assemble a prediction network. Once the models are selected, the associated nodes collaborate automatically to organize and to configure the network, optimizing the resulting network’s decision performance.

[Figure: CA.png]
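
To make this workflow concrete, the sketch below shows the shape of the interaction; the class and function names are illustrative assumptions, not the platform’s actual API. An analyst encodes a business goal as a decision query, searches the agent catalog for local models that answer it, and hands the selected nodes to a prediction network that configures itself:

```python
from dataclasses import dataclass, field

@dataclass
class ModelAgent:
    """An agent embedded with one data source, exposing its local model."""
    source: str
    answers: set          # decision queries this local model can answer

@dataclass
class PredictionNetwork:
    nodes: list = field(default_factory=list)

    def configure(self):
        # Placeholder for the self-organizing step: in the platform, the
        # selected nodes would exchange compact messages to optimize the
        # network's decision performance.
        print(f"configuring network over {len(self.nodes)} nodes")

def find_matching_agents(agents, query):
    """Search the agent catalog for local models answering the query."""
    return [a for a in agents if query in a.answers]

# A hypothetical catalog of agents, each embedded with one data source.
agents = [
    ModelAgent("crm", {"churn_risk", "upsell_propensity"}),
    ModelAgent("web_logs", {"churn_risk"}),
    ModelAgent("billing", {"churn_risk", "payment_default"}),
]

# A business goal encoded as a decision query of the data.
network = PredictionNetwork(find_matching_agents(agents, "churn_risk"))
network.configure()   # selected nodes collaborate to organize themselves
```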

The key to Collaborative Analytics is that this self-organizing, self-configuring process of learning the prediction network is orders of magnitude faster than integrating the data to learn a centralized model. The collaborative learning process scales organically with the number of data sources and allows data analysts to explore, to quantify, and to compare different combinations of data sources, quickly identifying those that meet their business objectives. Furthermore, the process requires only sparse, compact communication between nodes and no global synchronization, making learning across an internet-wide network possible.
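
As a toy illustration of how compact those messages can be, consider the sketch below. It uses simple model stacking, which is far simpler than the coupled decision processes described in the next section, but it shows the communication pattern: each node trains only on its own data, and only low-dimensional score vectors, never raw training records, cross the network:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_source(n, shift):
    """Synthetic local data source: 5 features, binary label."""
    X = rng.normal(shift, 1.0, size=(n, 5))
    y = (X[:, :2].sum(axis=1) + rng.normal(0, 0.5, n) > 2 * shift).astype(int)
    return X, y

# Three nodes, each with its own private data; models are learned locally.
nodes = [LogisticRegression().fit(*make_source(500, s)) for s in (0.0, 0.5, 1.0)]

# A small set of shared validation queries is broadcast to the nodes.
X_val, y_val = make_source(200, 0.5)

# Each node replies only with its scores on those queries:
# 200 floats per node instead of 500 x 5 raw training records.
messages = np.column_stack([m.predict_proba(X_val)[:, 1] for m in nodes])

# A lightweight combiner is learned over the messages alone.
combiner = LogisticRegression().fit(messages, y_val)
print("team accuracy:", combiner.score(messages, y_val))
```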

The revolutionary algorithms behind Collaborative Analytics are modeled after organizational decision making, leveraging expertise and integrating insight across a network of predictive models to deliver better business intelligence, faster

In organizational decision making, decisions at each level of the organization are based on one’s own data together with summarizing information from subordinates, and the resulting team decision is made at the top. Unlike simple voting or meta-learning over independently derived decisions, the decision processes across the organization are coupled. That is, the decision made on the superior’s data depends both on the summarizing information sent by its subordinates and on knowledge of their decision processes; conversely, the information the subordinates choose to send depends on knowledge of their superior’s decision processes. In the context of Collaborative Analytics, these coupled decision processes translate into how each local model generates the summarizing information sent to its superior in the network, which is learned in a distributed manner through collaboration among the node agents to optimize the network’s prediction performance. The learned decision process at each node effectively reduces its local data to compact messages, signals that influence other nodes’ decision processes. A further benefit of Collaborative Analytics is that it preserves the privacy of the local data, because the meaning of these compact messages is obscure without knowledge of the entire network’s computation.
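
A minimal numerical sketch of coupled decision processes, in the spirit of classical team decision theory and greatly simplified relative to the platform’s algorithms: two subordinates each send a single bit chosen by a threshold, the superior fuses those bits with its own observation, and all thresholds are optimized jointly, since the best message a subordinate can send depends on how the superior will use it. The coupling weight and grid below are arbitrary choices for the toy:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

# Binary hypothesis: H = 1 shifts every node's observation by +1.
n = 10_000
H = rng.integers(0, 2, n)
obs = [H + rng.normal(0, 1, n) for _ in range(3)]  # 0, 1 subordinates; 2 superior

def team_accuracy(t0, t1, t2):
    """Subordinates send 1-bit messages; the superior fuses them with its data."""
    b0 = obs[0] > t0                 # compact message from subordinate 0
    b1 = obs[1] > t1                 # compact message from subordinate 1
    # Coupled fusion rule: corroborating bits lower the superior's threshold.
    decision = obs[2] > t2 - 0.6 * (b0.astype(float) + b1.astype(float))
    return (decision == (H == 1)).mean()

# Joint (coupled) optimization of all three thresholds on a coarse grid.
grid = np.linspace(-0.5, 1.5, 9)
best = max(product(grid, repeat=3), key=lambda t: team_accuracy(*t))
print("thresholds:", best, "accuracy:", round(team_accuracy(*best), 3))
```

Optimizing each threshold in isolation would miss this: a subordinate’s best message depends on the superior’s fusion rule, which is exactly the coupling the paragraph above describes.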

Collaborative Analytics brings the focus back to the business questions asked of the data, allowing data scientists to focus on building better models through domain expertise

Simply collecting and integrating data into a warehouse or lake and then letting machine learning sort it all out may sound attractive, but it is not enough to efficiently extract all the business value from the data. Predictive analytics uses models to answer business queries of the data. These models, learned from historical data, encapsulate the questions whose answers support specific business decisions or goals. The first step of the Predictive Analytics Lifecycle is to formulate queries of the data from business goals and to identify the data that support these queries. With each cycle, interpreting the previous model through domain expertise provides an understanding of which features of the data truly matter for answering these queries and, hence, the means to improve the model.

Collaborative Analytics brings the focus back to this first step because the platform manages the relationships between business goals/queries and the predictive models that answer them. This empowers data scientists to use their domain expertise to build, to validate, and to refine models efficiently with the data on hand, focused on providing decisive answers that support specific business goals. Data analysts can then use the platform to discover related business queries answered by other data sources and assemble prediction networks that integrate these models to augment the answers to their own business queries, delivering better business intelligence, faster.
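
One way to picture that relationship management, with purely illustrative names rather than the platform’s actual API: a registry links each business query to the local models that answer it, so an analyst can discover related queries answered by overlapping data sources and pull the corresponding models into a prediction network:

```python
from collections import defaultdict

# Hypothetical registry: business query -> {data source: model id}.
registry = defaultdict(dict)

def publish(query, source, model_id):
    """A node agent registers a local model against the query it answers."""
    registry[query][source] = model_id

def related_queries(query):
    """Find other queries answered by data sources that also answer `query`."""
    sources = set(registry[query])
    return {q for q, models in registry.items()
            if q != query and sources & set(models)}

publish("churn_risk", "crm", "crm-churn-v3")
publish("churn_risk", "web_logs", "web-churn-v1")
publish("upsell_propensity", "crm", "crm-upsell-v1")
publish("payment_default", "billing", "bill-default-v2")

print(related_queries("churn_risk"))   # -> {'upsell_propensity'}
print(registry["churn_risk"])          # models to assemble into a network
```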