Risk Analysis
The Objective
Automated decision making under uncertainty (risky decisions) arises in areas such as online credit approval, mortgage loan approval, electronic trading, medical diagnosis, and insurance policy issuance, to name a few. Data-driven analytics can support the decision process, reducing risk and establishing fair pricing, by building models from consumer profiles and histories. Leveraging multiple timely and relevant data sources can make decision models more accurate by tailoring them to specific individuals and current circumstances and by using the most recent data available (to overcome staleness). Decision models are commonly derived from large bodies of historical data, yet it is often precisely when deviations from that history occur, i.e., when change is happening and shocks are occurring, that the risk profile is also changing rapidly. Modeling systems that can recognize and adapt to that change will perform better; in addition to supporting spot decisions, they will also do a better job of continuously monitoring ongoing risk downstream. Superior modeling supports businesses that take on risk (e.g., underwriting) by reducing exposure and improving profitability, and at the same time increases the satisfaction of their customers, who receive better pricing and faster responses.
The Challenge
In a variety of application settings, the data sources that can be used to quantify risk are highly diverse and may comprise both structured and complex data. For example, a business's ability to pay back a loan is a function of its business condition, which could be assessed through the transactions in its bank statements, its monthly financial statements, overhead video imagery of its parking lots (customer traffic volume), the press its products are receiving in blogs and social media, trends within its market segment, the health of its competitors, etc.
The conventional approach to leveraging these diverse and distributed sources is to integrate the data for centralized machine learning. But approaches that pull, aggregate, and mediate large integrated datasets to learn a predictive model are directly in tension with user requirements for modeling agility and real-time answers based on the most current data. Because the data sources evolve at intrinsically different rates, problems of mixed staleness occur if the analytics workflow cannot keep pace. Ultimately the problems surface as reduced decision quality, high manpower and infrastructure costs, and the ongoing cost and effort of servicing the modeling effort. An additional complication is that bringing certain data together may violate customer privacy, or that certain sources that appear to be of high utility for the risk assessment cannot be accessed for integration because of proprietary concerns or regulations. Some sources may be available or useful only intermittently, compounding issues with missing or low-quality data.
With the conventional method, model learning can struggle to keep pace with data that is evolving and exhibiting statistical non-stationarities. If a centralized model is learned on integrated data, incremental updates based on new data in some silos cannot be made; an entirely new global model must be learned from scratch to replace the existing model, which is now based on stale assumptions. Likewise, if entirely new sources become available to the system, their data must be combined into the existing training set and a new model learned all over again.
Online prediction can struggle to provide real-time response because new test inputs arriving at the silos must be communicated to a central location to be fed to the model. Depending on how the model was learned and constructed, if some data is late when a decision needs to be made, the missing data must be imputed. Imputation can lead to a loss in performance and makes it harder to normalize results so that risk can be compared fairly across different scenarios.
The Collaborative Analytics Solution for Risk Analysis
Cross-silo risk analysis via Collaborative Analytics (CA) offers a number of advantages in such a dynamic, distributed big data setting. Because the costs and delays of data integration are avoided, the analysis can more easily make use of a wider variety of sources, explore different combinations to see which are most predictive, and include sources for which the raw data itself could never be obtained.
This enables an agile model learning capability that can keep pace with changes in the local data sources. The global analytic model is learned by automatically optimizing the combination of local models. Local models can be freely and continuously improved with higher-performance algorithms or parameter settings and connected to the system while it operates (no downtime), making incremental updates to the global model. This accelerates the model learning part of the analytics workflow, empowering data scientists to rapidly experiment and iterate. New sources that become available for testing can be connected to the system or removed at will.
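As a rough illustration of learning a global model by optimizing a combination of local models, the sketch below fits one weight per silo over the silos' locally computed risk scores. It is not CA's actual algorithm, which this document does not specify; it assumes a simple logistic combiner trained by gradient descent, and the names fit_combiner and combine are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p, eps=1e-6):
    # Clamp to avoid infinite log-odds at exactly 0 or 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fit_combiner(local_scores, labels, epochs=500, lr=0.5):
    """Assumed combiner: one weight per silo plus a bias, fit by
    gradient descent on log loss. Each row of local_scores holds the
    risk probabilities produced by the silos' own local models; the
    raw silo data is never seen here."""
    n_silos = len(local_scores[0])
    feats = [[logit(p) for p in row] for row in local_scores]
    w, b, n = [0.0] * n_silos, 0.0, len(labels)
    for _ in range(epochs):
        gw, gb = [0.0] * n_silos, 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_silos):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def combine(w, b, row):
    # Global risk score from one row of per-silo scores.
    return sigmoid(sum(wi * logit(p) for wi, p in zip(w, row)) + b)

# Toy data (made up): silo 0 is informative, silo 1 is close to noise.
scores = [[0.9, 0.5], [0.8, 0.4], [0.2, 0.6],
          [0.1, 0.5], [0.7, 0.5], [0.3, 0.4]]
labels = [1, 1, 0, 0, 1, 0]
w, b = fit_combiner(scores, labels)
```

In this sketch, swapping in an improved local model only changes that silo's column of scores, so only the small combiner needs to be re-fit rather than a model over the integrated raw data.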
Real-time online prediction (i.e., automated decision making) is more readily achievable because the feature data originating in the silos does not need to be communicated to a central location to feed the model; rather, it is processed in place, and only compact signaling statistics are communicated across the CA analytics network. The method supports a “sample when ready” mode and will re-optimize on the fly to make best use of the information that is available at the time the decision is requested. Because a rigorous quantitative calculus of uncertainty is employed, information is properly weighted and normalized by design. A “sample when needed” mode is also supported, in which additional sources that are costly to bring to bear are added only when and if they are needed to improve the fidelity of the risk analysis.
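The two sampling modes can be pictured with a small sketch. The document does not say which uncertainty calculus CA uses, so this example substitutes plain inverse-variance weighting as a stand-in: each silo reports only a mean and a variance (the "compact statistics"), silos that have not reported are skipped, and a costly extra source is requested only if the fused variance is still too high. SiloSignal, fuse, and needs_more_sources are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class SiloSignal:
    mean: float      # silo's local risk estimate
    variance: float  # silo's uncertainty about that estimate (> 0)

def fuse(signals: Sequence[Optional[SiloSignal]]) -> Tuple[float, float]:
    """Inverse-variance fusion over whichever silos are ready.
    A None entry means the silo has not reported yet ("sample when ready")."""
    ready = [s for s in signals if s is not None]
    if not ready:
        raise ValueError("no silo has reported yet")
    precision = sum(1.0 / s.variance for s in ready)
    mean = sum(s.mean / s.variance for s in ready) / precision
    return mean, 1.0 / precision

def needs_more_sources(variance: float, max_variance: float) -> bool:
    # "Sample when needed": ask for a costly source only if still too uncertain.
    return variance > max_variance

# Two silos are ready, one is late (made-up numbers).
mean2, var2 = fuse([SiloSignal(0.2, 0.01), None, SiloSignal(0.4, 0.04)])
# mean2 is about 0.24 and var2 about 0.008

if needs_more_sources(var2, max_variance=0.006):
    # Bring in the costly third source and re-fuse.
    mean3, var3 = fuse([SiloSignal(0.2, 0.01), SiloSignal(0.3, 0.02),
                        SiloSignal(0.4, 0.04)])
```

More certain silos automatically receive more weight, and no imputation is needed for the late silo because the fusion is recomputed from whatever is available.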
Companies in industries with a history of risk modeling, such as credit and insurance, most likely already have a risk modeling approach in place. A unique capability of CA is that it can augment existing models with additional sources of information, mined with the latest methods, without opening up the core model in use or even understanding its inner (proprietary) workings. This provides an extremely low-touch way to try combining additional information and assess the benefit without disrupting the current approach.
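One way to picture augmenting a model without opening it is to treat the existing model's output as a fixed offset in log-odds space and learn only a small correction from the new source. This is an assumed technique chosen for illustration, not a description of how CA does it; fit_offset_correction and augmented_score are hypothetical names, and the data is made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fit_offset_correction(base_probs, new_feature, labels, epochs=300, lr=0.1):
    """Fit w and b in sigmoid(logit(base_prob) + w * x + b).
    The existing model is used only through its output scores; its
    internals are never inspected or retrained."""
    offsets = [logit(p) for p in base_probs]
    w, b, n = 0.0, 0.0, len(labels)
    for _ in range(epochs):
        gw = gb = 0.0
        for o, x, y in zip(offsets, new_feature, labels):
            err = sigmoid(o + w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def augmented_score(base_prob, x, w, b):
    # Existing score adjusted by the signal from the additional source.
    return sigmoid(logit(base_prob) + w * x + b)

# The existing model is undecided on these cases; the new source separates them.
base_probs = [0.5, 0.5, 0.5, 0.5]
new_feature = [1.0, 0.8, -1.0, -0.7]
labels = [1, 1, 0, 0]
w, b = fit_offset_correction(base_probs, new_feature, labels)
```

If the learned w stays near zero on held-out data, the extra source adds little and can be dropped, leaving the existing model's scores unchanged.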
CA provides an agile cross-silo risk analysis framework that allows as many sources as possible to be combined for the most accurate assessments while keeping pace with dynamic, high-variety data.