Efficient Data Integration
The Objective
For various reasons, organizations may be committed to analytics strategies that call for integrating data across silos, most commonly into a “data lake”. These projects are initiated with a set of requirements, targeted business objectives and metrics, and cost/time budgets supporting an expected return-on-investment (ROI) level that justifies the project. Execution plans that deliver the most interim value the soonest, to help solidify the overall strategy, are highly desirable. The business goal is to navigate the complexity intelligently, keep costs under control, and make the investment in analytics pay.
The Challenge
The cost and complexity of integrating sources for centralized predictive modeling can be significant, and it is easily underestimated because so much depends on the specifics of the data available and its quality, as well as the type of business insights the data actually supports (as opposed to what business analysts may optimistically imagine!). The process of cleaning, preparing, and organizing a large integrated data set for analysis, for example on cloud-based analytics platforms such as Hadoop/Spark clusters, has costs that increase exponentially as more feature-bearing sources are added, until the clean-up becomes the dominant contributor to total turnaround time in the workflow. Larger and higher-variety data exacerbates the costs and technical challenges of this step, which practitioners often estimate consumes as much as 70% of the total effort and resources in the analytics workflow. The labor costs alone are painfully high: estimating just 3 person-months of effort per silo, at a burdened rate of $100/hr, gives roughly $50K per silo, and those costs scale not linearly but exponentially as more silos are combined. It can be difficult to cap the spending, because these projects tend to drag on, with goals shifting and the underlying data constantly evolving. Long analytics cycle times exacerbate the problem, pushing out time-to-results and making iteration and experimentation hard. In addition, there may be recurring downstream costs for governance, redundancy, and security/privacy of large integrated datasets.
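To make the cost arithmetic concrete, the back-of-the-envelope sketch below reproduces the roughly $50K-per-silo figure and projects how total labor cost might grow as silos are added. Every input is an illustrative assumption (160 working hours per person-month, the $100/hr burdened rate, and a super-linear growth exponent standing in for the compounding cross-silo clean-up described above), not a measured figure.

```python
# Illustrative back-of-the-envelope cost model for silo integration.
# All figures are assumptions: 3 person-months of effort per silo, ~160
# working hours per person-month, a $100/hr burdened rate, and a growth
# exponent > 1 to reflect cross-silo reconciliation and clean-up work.

HOURS_PER_MONTH = 160    # assumed working hours per person-month
RATE_PER_HOUR = 100      # assumed burdened labor rate, USD
MONTHS_PER_SILO = 3      # assumed base effort to clean and prepare one silo
GROWTH_EXPONENT = 1.5    # assumed super-linear scaling across silos

def integration_cost(num_silos: int) -> float:
    """Projected labor cost (USD) of integrating num_silos sources."""
    base_cost_per_silo = MONTHS_PER_SILO * HOURS_PER_MONTH * RATE_PER_HOUR  # ~$48K
    # Cross-silo reconciliation makes the total grow faster than linearly.
    return base_cost_per_silo * num_silos ** GROWTH_EXPONENT

if __name__ == "__main__":
    for n in (1, 3, 6):
        print(f"{n} silo(s): ~${integration_cost(n):,.0f}")
```

With these assumptions, one silo comes in near $48K, while six silos project to well over six times that amount; the exact exponent matters less than the qualitative point that per-silo costs compound as sources are added.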
A key challenge is that it is often unclear which data is most worth putting together in the context of answering specific business questions. In a scenario in which half a dozen sources are being combined, knowing in advance that integrating three of the six would yield an 80% solution would allow prioritizing the effort to deliver the most value to the business the fastest. Otherwise, there is no option but to pay up front to bring all the data together just to find out whether it was worth bringing together. And even if the test data looks promising, there is no guarantee that the value will hold up over time. It amounts to a bet, another gamble.
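One way to approach the "three of six" question before paying for full integration is to score candidate silo combinations on a small joined sample. The sketch below uses greedy forward selection with scikit-learn; the silo-to-feature mapping, the availability of a joinable labeled sample, and the AUC metric are all assumptions for illustration, and this is not a description of CA's internal mechanism.

```python
# Minimal sketch: rank silos by the incremental modeling value they add,
# assuming a small cross-silo sample has already been joined into a pandas
# DataFrame X (one block of columns per silo) with a label vector y.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def score_subset(X, y, columns):
    """Cross-validated AUC using only the features from the chosen silos."""
    model = GradientBoostingClassifier()
    return cross_val_score(model, X[columns], y, cv=5, scoring="roc_auc").mean()

def rank_silos(X, y, silo_features):
    """Greedy forward selection: each round, add the silo with the largest lift."""
    chosen_cols, history = [], []
    remaining = dict(silo_features)   # e.g. {"crm": [...], "billing": [...], ...}
    while remaining:
        scored = {s: score_subset(X, y, chosen_cols + cols)
                  for s, cols in remaining.items()}
        best = max(scored, key=scored.get)
        chosen_cols += remaining.pop(best)
        history.append((best, scored[best]))
    return history  # e.g. [("crm", 0.71), ("billing", 0.78), ("web", 0.80), ...]
```

Reading the resulting history shows where the marginal lift flattens out; that inflection point is the signal for which silos to integrate first and which to defer.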
Without some way to navigate the complexity, the investments can become so large that it becomes hard to make the analytics pay off.
The Collaborative Analytics Solution for Data Integration
Collaborative Analytics (CA) can support projects that integrate data for centralized predictive modeling by providing a front-end indexer function to help navigate the complexity. CA offers an agile framework that enables rapid iteration and experimentation with combinations of methods, along with performance projections to help refine strategy. Some specific use cases and associated benefits are:
- Identify which silos are most worth combining in the context of specific business questions, and project the likely performance when they are combined; this capability can be used to prioritize silo integration efforts (which ones will add the most value to combine first)
- Provide transitional and/or optimized query capability across silos while the data integration effort catches up; this helps to address the “time inconsistency” problem that arises with lags in data integration
- Augment an existing (centralized) predictive model with new sources without rebuilding the original model, thereby measuring the value of adding the new information as a by-product (see the sketch after this list)
- Continuously assess the value of sources and worthiness of expenditures required to integrate them as the sources change over time
- Provide redundant “at source” capability for robustness against outages or unavailability of the centralized system; support selective in situ processing of certain sources subject to proprietary, privacy, governance, or security restrictions
- Augment an existing model built off integrated data with additional sources for which the feature data cannot ever be accessed/integrated
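For the model-augmentation bullets above, one simple pattern is stacking: treat the existing model's score as a frozen input feature and train a small second-stage model on that score plus the new silo's features, so the original model is never rebuilt and the performance gap directly measures the new source's value. The sketch below illustrates that pattern under assumed inputs; `existing_model`, `X_original`, `X_new_silo`, and `y` are hypothetical placeholders, and this is not a description of CA's actual mechanism.

```python
# Minimal stacking sketch: augment an already-trained model with a new silo
# without retraining it, and measure the lift the new silo provides.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def augment_and_measure(existing_model, X_original, X_new_silo, y):
    # Frozen score from the existing (centralized) model; it is never refit.
    base_scores = existing_model.predict_proba(X_original)[:, 1]

    # Second-stage inputs: the frozen base score plus the new silo's features.
    X_stacked = np.column_stack([base_scores, X_new_silo])

    stacker = LogisticRegression(max_iter=1000)
    combined_auc = cross_val_score(stacker, X_stacked, y,
                                   cv=5, scoring="roc_auc").mean()
    base_auc = cross_val_score(LogisticRegression(max_iter=1000),
                               base_scores.reshape(-1, 1), y,
                               cv=5, scoring="roc_auc").mean()
    # The gap between the two is the measured incremental value of the new silo.
    return base_auc, combined_auc
```

The same pattern extends to the last bullet: if a new silo's raw feature data can never be accessed or integrated, the second-stage inputs can be scores computed in situ at that source rather than the raw features themselves.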
Serving as a Navigation Aid, CA can help make data integration projects more efficient and targeted, and improve the ROI.