Why Data Virtualization is Only Part of the Solution


A lot of people are saying “data virtualization is the answer”, and for good reason – it removes much of the risk and complexity of unifying and utilising data from multiple systems. It makes it easy for business users to access complete datasets in a consumable way, and the market-leading data virtualization platforms are well supported and easy to manage. However, data virtualization thrives in an environment with plenty of directly accessible databases, and with the significant momentum around cloud and SaaS (software as a service) business applications, on its own it is unlikely to meet the needs of organisations looking to modernise and leverage their data for business growth and innovation. In this blog I will discuss the advantages of data virtualization, scenarios where it needs to be enhanced with additional capabilities, and why I recommend building a Unified Data & Integration Capability instead of focussing on one particular tool.

The Case for Data Virtualization

Database views have been with us for as long as I can remember working with relational databases (more than 20 years – I stopped counting after 20!). Traditionally, to get a single view of the customer you have to copy data in from all the source systems containing customer data (CRM, billing, etc.), store it in a staging area and then transform it into a target schema. But the more copies and transformations of data there are, the higher the chance something goes wrong: data gets out of sync, is transformed incorrectly, or ends up degraded in quality. Data profiling, data quality and data governance capabilities were all introduced to clean up those issues by identifying where errors occurred, applying rules and so on. However, these capabilities are difficult and involved, so data architects started applying the ‘lean principle’: instead of copying data, keep it in its original raw format as much as possible and just access a view of it via virtualization. This approach has driven data virtualization’s momentum.

Data virtualization minimises the need to copy or transform data by providing a view across source system data, as close to the source system as possible. This has many benefits, including abstraction from underlying systems, access to real-time data, lower storage costs and reduced development effort. Business users are able to access and analyse their data to make timely decisions, mitigate risk and improve business performance.
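To make the idea concrete, here is a minimal sketch of a view that spans two sources without copying their rows, using Python’s built-in sqlite3 and SQLite’s ATTACH. The database files, tables and columns (crm.db, billing.db, customers, invoices) are hypothetical stand-ins; a real data virtualization platform federates heterogeneous systems and adds security, caching and governance on top of the same basic principle.

```python
import sqlite3

# Hypothetical local databases standing in for two separate source systems.
crm = sqlite3.connect("crm.db")
crm.execute("CREATE TABLE IF NOT EXISTS customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
crm.commit()
crm.close()

billing = sqlite3.connect("billing.db")
billing.execute(
    "CREATE TABLE IF NOT EXISTS invoices (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
billing.commit()
billing.close()

# The 'virtual' part: one connection, one view, no copies of the rows.
con = sqlite3.connect("crm.db")
con.execute("ATTACH DATABASE 'billing.db' AS billing")

# A temporary view so it can span the attached database; consumers query
# customer_360 as if it were a single dataset while the data stays at source.
con.execute("""
    CREATE TEMP VIEW customer_360 AS
    SELECT c.customer_id, c.name, SUM(i.amount) AS lifetime_billed
    FROM customers c
    LEFT JOIN billing.invoices i ON i.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
""")

for row in con.execute("SELECT * FROM customer_360"):
    print(row)
```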

Some Scenarios Where Data Virtualization Alone is Very Challenging

Data virtualization is a great pattern that definitely solves a lot of problems. But in our experience it needs to be complemented by other capabilities such as enterprise application integration, storage, catalogues, data quality and ETL to reach its full potential. Below I have outlined several scenarios we’ve encountered where data virtualization couldn’t fulfil end-to-end requirements, and the patterns we used to supplement it.

1. No Direct Access to Storage

Where the source system is a SaaS application, the vendor typically won’t allow you to connect via SQL. Instead they provide well-defined APIs that may not handle ‘bulk’ scenarios well, e.g. “get all customers”. Even more tricky is where the source system only provides a GUI (graphical user interface) – this is common with legacy systems. If you’re lucky there might be a report capability that can output a file to an SFTP server. Additional patterns that can be considered (a sketch of the API connector pattern follows this list):
  • API connectors with low/no-code orchestration tools, often with some persistence sprinkled in
  • For GUIs – RPA (robotic process automation) as an integration connector
  • For files on an SFTP server – SFTP orchestration and some persistence mechanism such as a data lake
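As an illustration of the first pattern, here is a minimal sketch of an API connector with persistence sprinkled in, assuming a hypothetical SaaS endpoint (https://api.example-saas.com/v1/customers) with simple page-based pagination and a local folder standing in for a data lake landing zone. In practice this is usually assembled from a low/no-code integration platform’s connectors rather than hand-coded.

```python
import json
import pathlib

import requests

API_URL = "https://api.example-saas.com/v1/customers"   # hypothetical SaaS endpoint
LANDING_ZONE = pathlib.Path("landing/customers")         # stand-in for a data lake path
LANDING_ZONE.mkdir(parents=True, exist_ok=True)


def extract_customers(page_size: int = 200) -> None:
    """Page through the API and persist each page as raw JSON."""
    page = 1
    while True:
        resp = requests.get(
            API_URL,
            headers={"Authorization": "Bearer <token>"},   # placeholder credential
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        records = resp.json().get("data", [])
        if not records:
            break
        # Land the raw page untransformed so downstream views stay close to source.
        (LANDING_ZONE / f"customers_page_{page:05d}.json").write_text(json.dumps(records))
        page += 1


if __name__ == "__main__":
    extract_customers()
```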
2. Event Subscriptions

Some cloud providers expose bulk data via event subscriptions only, which cannot be virtualized. This includes both subscribing to individual events and emitting a request for data and then receiving a callback with the bulk data (or a file attachment containing it). Additional patterns that can be considered (sketched after this list):
  • Ability to expose APIs, orchestrate subscriptions and persist data in an appropriate event store 
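Here is a minimal sketch of that pattern: a small webhook endpoint that receives the provider’s bulk-data callbacks and appends them to a file-based event store. The route, payload shape and storage location are hypothetical stand-ins for a real subscription endpoint and an event store such as Kafka or a data lake.

```python
import json
import pathlib
import time

from flask import Flask, request

app = Flask(__name__)
EVENT_STORE = pathlib.Path("event_store/bulk_exports.jsonl")
EVENT_STORE.parent.mkdir(parents=True, exist_ok=True)


@app.route("/callbacks/bulk-export", methods=["POST"])
def receive_bulk_export():
    # The provider calls back here after we emit a request for bulk data.
    payload = request.get_json(force=True)
    record = {"received_at": time.time(), "payload": payload}
    # Append-only persistence, so the data can later be exposed to the
    # virtualization layer (or anything else) as a queryable source.
    with EVENT_STORE.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return {"status": "accepted"}, 202


if __name__ == "__main__":
    app.run(port=8080)
```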
3. Complex Orchestration

Data virtualization cannot be achieved where complex orchestration is required before or while accessing APIs, e.g. a complex custom authentication mechanism (key pair refreshes, authentication token reuse, device activation, subscriptions). Additional patterns that can be considered (a token re-use sketch follows this list):
  • A wide library of connectors, orchestration and an advanced API capability 
  • A ‘transient’ persistence mechanism for short-term storage and sharing
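As a sketch of the orchestration side, the snippet below caches and refreshes an authentication token before each API call, which is the kind of custom step a virtualization connector typically cannot express on its own. The token endpoint, data endpoint, field names and client-credentials flow are all hypothetical.

```python
import time

import requests

TOKEN_URL = "https://auth.example-saas.com/oauth/token"   # hypothetical auth endpoint
DATA_URL = "https://api.example-saas.com/v1/accounts"     # hypothetical data endpoint

_token_cache = {"value": None, "expires_at": 0.0}


def get_token() -> str:
    """Reuse a cached token and only refresh it shortly before it expires."""
    if _token_cache["value"] is None or time.time() > _token_cache["expires_at"] - 60:
        resp = requests.post(
            TOKEN_URL,
            data={
                "grant_type": "client_credentials",
                "client_id": "<client-id>",        # placeholder credentials
                "client_secret": "<client-secret>",
            },
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        _token_cache["value"] = body["access_token"]
        _token_cache["expires_at"] = time.time() + body.get("expires_in", 3600)
    return _token_cache["value"]


def fetch_accounts() -> list:
    # The authentication orchestration happens before the data call itself.
    resp = requests.get(DATA_URL, headers={"Authorization": f"Bearer {get_token()}"}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("data", [])


if __name__ == "__main__":
    print(len(fetch_accounts()), "accounts fetched")
```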
4. Historical Data Not Captured in the Source System

Some systems only store current-state data, not historical changes, which limits the ability to do any detailed analysis with data virtualization alone. Additional patterns that can be considered (a scheduled snapshot sketch follows this list):
  • Scheduled orchestration 
  • API capability that persists the data 
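A minimal sketch of the scheduled snapshot approach: a job that copies the source’s current state into a history table stamped with a snapshot date, building the history the source never kept. The table, columns and SQLite files are hypothetical; in practice the snapshots would usually land in a warehouse or data lake and the job would run from a scheduler.

```python
import datetime
import sqlite3

SOURCE_DB = "source_system.db"     # hypothetical current-state-only source
HISTORY_DB = "history_store.db"    # our own persistence layer


def snapshot_current_state() -> None:
    snapshot_date = datetime.date.today().isoformat()

    # Read today's current state from the source.
    src = sqlite3.connect(SOURCE_DB)
    rows = src.execute("SELECT customer_id, status, balance FROM accounts").fetchall()
    src.close()

    # Append it to a history table so change over time can be analysed later.
    hist = sqlite3.connect(HISTORY_DB)
    hist.execute(
        """CREATE TABLE IF NOT EXISTS accounts_history
           (snapshot_date TEXT, customer_id INTEGER, status TEXT, balance REAL)"""
    )
    hist.executemany(
        "INSERT INTO accounts_history VALUES (?, ?, ?, ?)",
        [(snapshot_date, *row) for row in rows],
    )
    hist.commit()
    hist.close()


if __name__ == "__main__":
    # Run daily from cron or an orchestration tool.
    snapshot_current_state()
```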
5. Source Systems Have Performance Constraints

When source transactional systems have performance issues or constraints, they cannot accommodate having data virtualization push query processing down onto them. Additional patterns that can be considered (an incremental extract sketch follows this list):
  • Staging persistence 
  • ETL (Extract, Transform, Load) or CDC (Change Data Capture)
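Below is a minimal sketch of a watermark-based incremental extract, a lightweight stand-in for full CDC, which moves only changed rows into a staging store so that reporting and virtual views stop loading the busy transactional source. The table names, updated_at column and SQLite files are hypothetical.

```python
import sqlite3

SOURCE_DB = "transactional_system.db"   # hypothetical constrained source
STAGING_DB = "staging.db"               # staging persistence we control


def incremental_extract() -> None:
    stg = sqlite3.connect(STAGING_DB)
    stg.execute(
        """CREATE TABLE IF NOT EXISTS orders_staging
           (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, updated_at TEXT)"""
    )
    # High-water mark: only pull rows changed since the last run.
    last_seen = stg.execute(
        "SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM orders_staging"
    ).fetchone()[0]

    src = sqlite3.connect(SOURCE_DB)
    changed = src.execute(
        "SELECT order_id, customer_id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_seen,),
    ).fetchall()
    src.close()

    # Upsert into staging; reports and virtual views query staging,
    # not the constrained transactional system.
    stg.executemany("INSERT OR REPLACE INTO orders_staging VALUES (?, ?, ?, ?)", changed)
    stg.commit()
    stg.close()


if __name__ == "__main__":
    incremental_extract()
```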
These use cases confirmed to me that data virtualization is best placed to help achieve business goals and objectives in an environment where other data and integration patterns are also present, providing an ecosystem of patterns. In such an environment it becomes another valuable tool in the data and integration toolbox, especially if you believe in evolutionary architectures, microservices and cloud-native approaches.

This thinking isn’t anything new. Most of the big data virtualization software and cloud vendors share the same architectural opinion, yet we still often encounter ‘silver bullet’ thinking, where data virtualization is identified as the only software product required to solve all of the business challenges.

Building a Unified Data & Integration Capability

As outlined above, data virtualization can provide a lot of benefit by minimising data movement and transformation. But it cannot be used in isolation; it should be complemented by building maturity in other patterns such as a data warehouse and/or data lake solution, big data approaches, and enterprise application and data integration. Such an ecosystem helps to solve a host of problems, including data quality, data speed, data visibility, data exploration and complex analytics schema challenges, by letting the components work together.

Our recommendation is to take a more holistic approach and make data virtualization one key pillar of your Unified Data & Integration Capability. Gartner refers to this approach as Data Fabric, where the main goal ‘is to deliver integrated and enriched data – at the right time, in the right shape, and to the right data consumer for supporting various data and analytics use cases of an organization.’ A Unified Capability provides many benefits, including:
  • Economies of scale through reuse of overlapping patterns
  • Improved quality, reliability and accuracy of data used by all initiatives
  • Consistency through common models, governance and master data usage
  • Assured data quality prior to triggering integration
  • High quality reporting through secure data access & querying and graphical representation of the data
There are obviously implications for implementation and run costs when planning to build this capability. A ‘pure play’ data virtualization strategy will likely have higher software costs (when looking at the leading vendors), potentially offset by a lower requirement for highly technical resources to manage it. But if your use cases require the more holistic capability of a Unified Data & Integration Platform (UDIP), the costs may be spread differently, and you will likely have some of the ‘additional pattern’ components in your technology environment already.

It is therefore essential to have a solid understanding of your current technology state and a prioritised view of your data requirements, so that you can build and mature your Unified Data & Integration Capability incrementally according to business priorities. As with any capability enhancement you will need to consider whether you have the right skills to support it; you may need to train or upskill your team and/or work with a partner or a managed service provider.

Whichever way you go, one thing is for sure: the future of competitive advantage, new business opportunities and streamlined operations lies largely in your data, so the sooner you can harness all of it the better.

Need help with your Data Strategy & Architecture? 
A data innovation workshop can help capture stakeholder pain points, crystallise your digital priorities and link technology enablers to deliver on strategy. Click here for more information or to register interest in a complimentary workshop with our team.

Author Details

Riaan Ingram
Riaan is a Principal Consultant with more than 20 years’ IT architecture, design and development experience. Riaan has a strong technical, hands-on background and specialises in developing and integrating enterprise business solutions. He has in-depth knowledge of integration and cloud patterns & technologies, and is an expert in the planning, design and implementation of API-first, event-driven, microservices and low/no-code integration approaches.
