A lot of people are saying “data virtualization is the answer”, and for good reason – it removes much of the risk and complexity of unifying and utilising data from multiple systems. It makes it easy for business users to access complete datasets in a consumable way, and market-leading data virtualization platforms are well supported and easy to manage.
However, data virtualization thrives in environments with lots of databases; with the significant momentum behind cloud and SaaS (software as a service) business applications, on its own it is unlikely to meet the needs of organisations looking to modernise and leverage their data for business growth and innovation. In this blog I will discuss the advantages of data virtualization, the scenarios where it needs to be enhanced with additional capabilities, and why I recommend building a Unified Data & Integration Capability rather than focussing on one particular tool.
The Case for Data Virtualization
Database views have been with us for as long as I can remember working with relational databases (more than 20 years – I stopped counting after 20!). Traditionally, to get a single view of the customer you copy data in from all the source systems containing customer data, such as CRM and billing, store it in a staging area and then transform it to a target schema. But the more copies and transformations there are, the higher the chance of something going wrong: data getting out of sync, transformations being applied incorrectly, or data quality degrading. Data profiling, data quality and data governance capabilities were all introduced to clean up those issues by identifying where errors occurred, applying rules and so on. However, these capabilities are difficult and involved to implement, so data architects started applying the ‘lean principle’: instead of copying data, you keep it in its original raw format as much as possible and simply access a view of it via virtualization.
This approach has driven data virtualization’s momentum. Data virtualization minimises the need to copy or transform data by providing a view across source system data, as close to the source system as possible. This has many benefits, including abstraction from underlying systems, access to real-time data, lower storage costs and reduced development effort. Business users are able to access and analyse their data to make timely decisions, mitigate risk and improve business performance.
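To make the pattern concrete, below is a minimal Python sketch of a query-time “virtual” customer view: rather than copying CRM and billing records into a staging schema, the unified view is assembled on demand from the source systems. The connection strings, table names and join key are hypothetical and would differ per environment; a real virtualization platform would also push filters and joins down to the sources.

```python
# Minimal sketch of a "virtual" customer view: data is fetched and joined at
# query time instead of being copied into a staging area and transformed.
# Connection strings, table and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

crm = create_engine("postgresql://crm-host/crm")           # CRM source system
billing = create_engine("postgresql://billing-host/bill")  # billing source system

def customer_360() -> pd.DataFrame:
    """Assemble a unified customer view on demand, without persisting copies."""
    crm_df = pd.read_sql("SELECT customer_id, name, segment FROM customers", crm)
    bill_df = pd.read_sql(
        "SELECT customer_id, account_balance, last_invoice_date FROM accounts",
        billing,
    )
    # Join the live result sets in memory – the "view" exists only for this request.
    return crm_df.merge(bill_df, on="customer_id", how="left")
```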
Some Scenarios Where Data Virtualization Alone is Very Challenging
Data virtualization is a great pattern that definitely solves a lot of problems. But in our experience it needs to be complemented by other capabilities, such as enterprise application integration, storage, catalogues, data quality and ETL, to reach its full potential. Below I have outlined several scenarios we’ve encountered where data virtualization couldn’t fulfil end-to-end requirements, and the patterns we used to supplement it.
1. No Direct Access to Storage
Where the source system is a SaaS application, the vendor typically won’t allow you to connect via SQL. Instead they provide well-defined APIs that may not handle ‘bulk’ scenarios well, e.g. get all customers. Even more tricky is where the source system only provides a GUI (graphical user interface) – this is common with legacy systems. If you’re lucky there might be a report capability that can output a file to an SFTP server. Additional patterns that can be considered (see the sketch after this list):
- API connectors with low/no-code orchestration tools, often with some persistence sprinkled in
- For GUIs – RPA (robotic process automation) as an integration connector
- File to an SFTP server – SFTP orchestration and some persistence mechanism like a data lake
- Ability to expose APIs, orchestrate subscriptions and persist data in an appropriate event store
- A wide library of connectors, orchestration and an advanced API capability
- ‘Transient’ persistence mechanism for short term storage and sharing
- Scheduled orchestration
- API capability that persists the data
- Staging persistence
- ETL (Extract, Transform, Load) or CDC (Change Data Capture)
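As referenced above, here is a minimal sketch of the API connector pattern for a SaaS source with no direct storage access: it pages through a hypothetical “get all customers” REST endpoint and lands the records in a transient staging file so that downstream virtualization or ETL can work with them in bulk. The URL, auth header, paging parameters and file layout are illustrative assumptions only.

```python
# Sketch of an API connector with transient persistence, for SaaS sources that
# expose paged REST APIs rather than direct SQL access.
# The endpoint, credential, page size and staging path are hypothetical.
import json
import requests

BASE_URL = "https://example-saas.com/api/v1/customers"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}            # placeholder credential
STAGING_FILE = "staging/customers.jsonl"                 # transient persistence

def extract_customers(page_size: int = 200) -> int:
    """Page through the customers API and land records in a staging file."""
    written = 0
    page = 1
    with open(STAGING_FILE, "w", encoding="utf-8") as out:
        while True:
            resp = requests.get(
                BASE_URL,
                headers=HEADERS,
                params={"page": page, "per_page": page_size},
                timeout=30,
            )
            resp.raise_for_status()
            records = resp.json()  # assumes each page returns a JSON list
            if not records:        # an empty page signals the end of the dataset
                break
            for record in records:
                out.write(json.dumps(record) + "\n")
                written += 1
            page += 1
    return written

if __name__ == "__main__":
    print(f"Staged {extract_customers()} customer records")
```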
Building a Unified Data & Integration Capability
As outlined above, data virtualization can provide a lot of benefit by minimising data movement and transformation. But it cannot be used in isolation; it should be complemented by building maturity in other patterns such as a data warehouse and/or data lake solution, big data approaches, and enterprise application and data integration. Working together, the components of such an ecosystem help to solve a host of problems across data quality, data speed, data visibility, data exploration and complex analytics schemas. Our recommendation is to take a more holistic approach and make data virtualization one key pillar of your Unified Data & Integration Capability. Gartner refers to this approach as Data Fabric, where the main goal ‘is to deliver integrated and enriched data – at the right time, in the right shape, and to the right data consumer for supporting various data and analytics use cases of an organization.’ A Unified Capability provides many benefits, including:
- Economies of scale through reuse of overlapping patterns
- Improved quality, reliability and accuracy of data used by all initiatives
- Consistency through common models, governance and master data usage
- Assured data quality prior to triggering integration (see the sketch after this list)
- High-quality reporting through secure data access and querying, and graphical representation of the data
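As a small illustration of the data-quality point above, here is a sketch of a simple quality gate that validates records before they are handed to an integration pipeline. The rules, field names and threshold are hypothetical placeholders for whatever your Unified Capability defines.

```python
# Sketch of a data-quality gate applied before triggering integration.
# Field names and rules are hypothetical examples.
from typing import Callable

Rule = Callable[[dict], bool]

RULES: list[tuple[str, Rule]] = [
    ("customer_id present", lambda r: bool(r.get("customer_id"))),
    ("email looks valid",   lambda r: "@" in r.get("email", "")),
    ("balance is numeric",  lambda r: isinstance(r.get("account_balance"), (int, float))),
]

def quality_gate(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into those safe to integrate and those needing remediation."""
    passed, failed = [], []
    for record in records:
        broken = [name for name, rule in RULES if not rule(record)]
        (failed if broken else passed).append(record)
    return passed, failed

# Only clean records trigger downstream integration; the rest are quarantined.
clean, quarantined = quality_gate([
    {"customer_id": "C001", "email": "a@example.com", "account_balance": 120.5},
    {"customer_id": "", "email": "not-an-email", "account_balance": "n/a"},
])
print(f"{len(clean)} records ready, {len(quarantined)} quarantined")
```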