Guest post by Scott Fischaber from Analytics Engines
Virtualization is one of the key technology drivers of this decade. It is ubiquitous in cloud computing and well known in terms of compute, storage, and networking. Perhaps less established is the area of data virtualization, where a virtual interface is used to provide a single point of access to different data assets which may be physically separated in different systems. This data virtualization layer provides transformation of data queries to the underlying source systems and collects, combines, and presents the results back to a user or application. It typically doesn’t store the data itself (although data can generally be cached) and instead only contains metadata about the source systems. This allows more agile information architectures to be created as data assets can be quickly added, removed, or transformed within the data virtualization layer without requiring major modifications to the source systems or a centralized data warehouse.
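The idea of a single access point that holds only metadata, fans a query out to physically separate systems, and combines the results can be sketched in a few lines of Python. Everything here is hypothetical and purely illustrative (the `VirtualLayer` and `Source` names, and the in-memory "systems", are inventions for this sketch, not any vendor's API):

```python
# Minimal illustrative sketch of a data virtualization layer.
# All names here are hypothetical; real products offer query planning,
# pushdown, security, and caching far beyond this toy example.

from typing import Callable, Dict, List


class Source:
    """Wraps a physical data store; the layer keeps only this metadata."""

    def __init__(self, name: str, fetch: Callable[[str], List[dict]]):
        self.name = name
        self.fetch = fetch  # translates/executes a query against the store


class VirtualLayer:
    """Single point of access that federates queries across sources."""

    def __init__(self):
        self.sources: Dict[str, Source] = {}

    def register(self, source: Source):
        # Adding a source is a metadata-only operation: no data is copied,
        # so sources can be added or removed without touching a warehouse.
        self.sources[source.name] = source

    def query(self, q: str) -> List[dict]:
        # Fan the query out to every source and combine the results.
        results: List[dict] = []
        for source in self.sources.values():
            results.extend(source.fetch(q))
        return results


# Two "physical" systems, simulated as in-memory lookups.
ehr = Source("ehr", lambda q: [{"patient": 1, "visits": 3}])
finance = Source("finance", lambda q: [{"patient": 1, "cost": 120.0}])

layer = VirtualLayer()
layer.register(ehr)
layer.register(finance)
combined = layer.query("SELECT *")  # one combined view over both systems
```

The point of the sketch is the shape of the architecture: the caller talks to one interface, while the data stays wherever it already lives.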
While originally seen as a solution for niche use cases, data virtualization is quickly gaining mainstream adoption as technology vendors such as SAP, Oracle, IBM, Red Hat, and PostgreSQL develop data virtualization features for their existing products, and new dedicated data virtualization solutions appear on the market from the likes of Denodo, Data Virtuality, and others. Deployment of data virtualization solutions has been steadily increasing, with 35% of organisations expected to adopt the technology by 2020 [1]. Primary drivers behind this uptake include the continued proliferation of data within (and external to) organizations that can be harnessed for analysis, coupled with the desire of organizations to become more data driven. This presents a challenge for IT teams to provide business intelligence platforms that respond to the needs of agile data analytics projects. Data virtualization technology is seen as one component of the solution.
Traditional data marts, data warehouses, and data lakes store all of an organization's data to be analysed. In the last case, this is largely raw data, which can be transformed and analysed within the data lake. This provides some flexibility from the data analytics point of view, as the exact transforms and structure required by your application or analysis do not need to be known prior to creating the data lake (referred to as extract-load-transform (ELT)). In the first two, data is extracted from source systems and periodically loaded into a new platform, which can then be used to create the necessary data views for an application or analytics project. Here, the structure of the data held within the warehouse is largely fixed and defined by the transformation process performed on the source data (referred to as extract-transform-load (ETL)). A lesser-known approach is transform-extract-load (TEL), where any data transformations occur within the source data store; this can simplify the extraction process, but requires operations on the source system which may not always be possible.
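The difference between the three approaches is simply where the transform step runs. A toy Python sketch makes the ordering concrete (the `extract`/`transform`/`load` functions are hypothetical stand-ins for real pipeline stages):

```python
# Illustrative sketch of where the transform runs in ETL, ELT, and TEL.
# These three functions are stand-ins for real pipeline stages.

def extract(store):     # read raw rows out of a source system
    return list(store)

def transform(rows):    # e.g. normalise units, rename fields
    return [{"value": r["v"] * 2} for r in rows]

def load(rows, target): # write rows into the destination store
    target.extend(rows)
    return target

source, warehouse, lake = [{"v": 1}, {"v": 2}], [], []

# ETL: transform between source and warehouse; the warehouse schema
# is fixed by the transformation chosen up front.
load(transform(extract(source)), warehouse)

# ELT: load raw data into the lake first; transform later, on demand,
# once the application's required structure is known.
load(extract(source), lake)
analysed = transform(lake)

# TEL: transform inside the source store itself, then extract and load.
# Simple extraction, but it needs write/compute access to the source.
transformed_in_source = transform(source)
staged = load(extract(transformed_in_source), [])
```

Same three operations in each case; only their order (and therefore where the work and the schema decisions land) changes.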
Data virtualization technology can be seen as complementary to all of the above approaches. Where data warehouses or data lakes already exist, the data virtualization platform can sit over these and integrate them with other systems. Breaking down these existing silos within an organisation, without necessarily moving the data to a central location (e.g. for compliance, storage, or legacy reasons), will be key to analytics initiatives going forward.
Data Virtualization Conceptual Diagram
Within the healthcare sector, the on-going digital transformation has created an explosion of data, expected to reach 2,314 exabytes in 2020, up from 153 exabytes in 2013 [2]. This growth is driven by the widespread adoption of Electronic Healthcare Records (EHRs) and Electronic Medical Records (EMRs), the digitalisation of imaging/sensor data, the expansion of IoT/wearables, data supporting personalised medicine initiatives (e.g. genomic data), etc. Successfully making use of this data within a clinical setting requires access and sharing within the organisation, and ideally combining it with existing systems covering finance, operations, compliance, clinical, and other departments to provide actionable information across the whole organization. However, this needs to be implemented in a way that ensures good data governance, access control, and privacy.
In the MIDAS project, we are investigating how we can integrate with existing healthcare IT systems and data silos to develop analytical models and visualizations that can support policy-makers in creating new policies or monitoring existing ones. As healthcare policy does not exist in a vacuum, it is often necessary not only to integrate data from existing healthcare systems, but also to associate this with other governmental or open data sets, which can help reveal larger patterns and trends not visible by looking at the data in isolation. Within Analytics Engines we are developing the data virtualization architecture to support this integration from a large number of source systems and external data sources. The data can be cached locally in the MIDAS platform to minimise the load on healthcare systems (at the expense of real-time data), or queries can be passed through to the source systems (although this may increase response times). Data access control is an important issue and is built into each layer of the platform. We also understand that while data virtualization can be a powerful tool, it is not the only solution, so we also support loading data locally into the platform or capturing and storing broadcast messages from healthcare systems for later analysis. Data transformations [3] are largely run between the source systems and the MIDAS platform, akin to the typical ETL model, although we support both ELT (transforming the data within the data virtualization layers) and TEL (transforming the data within the source systems) for specific use cases.
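The cache-versus-pass-through trade-off described above can be sketched as a small connector with a time-to-live cache. This is a generic illustration under assumed names (`CachingConnector`, `fetch`, `ttl_seconds`), not the MIDAS platform's actual API:

```python
# Sketch of the cache-vs-pass-through choice: cached reads reduce load
# on the source system at the cost of freshness; pass-through reads are
# always current but hit the source every time. Hypothetical names only.

import time


class CachingConnector:
    def __init__(self, fetch, ttl_seconds=60.0):
        self.fetch = fetch      # callable that queries the source system
        self.ttl = ttl_seconds  # how stale a cached answer may become
        self._cache = {}        # query -> (timestamp, rows)

    def query(self, q, pass_through=False):
        now = time.monotonic()
        if not pass_through and q in self._cache:
            ts, rows = self._cache[q]
            if now - ts < self.ttl:
                return rows     # cached: no load on the source system
        rows = self.fetch(q)    # pass-through: the query hits the source
        self._cache[q] = (now, rows)
        return rows


# Count how often the "source system" is actually queried.
calls = {"n": 0}

def fetch_from_source(q):
    calls["n"] += 1
    return [{"patients_seen": 42}]

conn = CachingConnector(fetch_from_source, ttl_seconds=60.0)
conn.query("q1")                     # cache miss: hits the source
conn.query("q1")                     # served from cache
conn.query("q1", pass_through=True)  # forces a fresh read
```

Here three queries generate only two reads against the source, which is the point of caching; the `pass_through` flag trades that saving for up-to-date data.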
A key aspect of the MIDAS project is how to run cross-border analysis without moving the data. As data can rarely be moved outside of a region, we are working on analytics models that can be run in a distributed manner across European countries without moving the underlying data. We are using data virtualization technology here to virtualize access to the different MIDAS platforms throughout Europe, providing a single point of access to run the analytics. This will enable us to build more detailed models that operate at a European-wide level rather than being trained for a single country.
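One standard pattern for this kind of distributed analysis is to have each site return only aggregate statistics, which are then combined centrally; raw records never leave their region. A minimal sketch, assuming a pooled mean as the statistic (the function names and the sample figures are invented for illustration, not MIDAS outputs):

```python
# Sketch of analysis without moving data: each national platform shares
# only aggregates (count, sum); a pooled mean is computed centrally.

def local_summary(values):
    # Runs inside each country's platform; only aggregates leave the site.
    return {"n": len(values), "total": sum(values)}

def pooled_mean(summaries):
    # Runs centrally, over the virtualized access point to all platforms.
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / n

site_a = [70, 72, 74]  # hypothetical per-country measurements
site_b = [68, 71]
result = pooled_mean([local_summary(site_a), local_summary(site_b)])  # 71.0
```

The same shape generalises to richer models (e.g. federated model training), but the principle is identical: computation travels to the data, and only summaries travel back.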
Within the MIDAS project, we want to provide policy-makers with tools that allow them to better demonstrate outcomes, evidence policy decisions, and measure KPIs. Data virtualization is key to managing the complexity of the underlying data sources while enabling an agile information architecture, supporting the descriptive and predictive analytics being developed in the project and built on pan-European data.