Synthetic Data: Privacy vs Utility

Guest post by Dr Debbie Rankin and Dr Michaela Black from Ulster University

Synthetic data, also known as artificial data, is data simulated from real data using statistical models. It is designed to represent the population and preserve the accuracy of analyses, whilst avoiding any disclosure of real personal data, private data or ethically sensitive data.

The amount of health-related patient data being collected is ever increasing. This data is vital for life-saving research into the prevention and treatment of disease, for informing effective health care policies, for developing personalised medicine, and for supporting healthy lifestyles. Whilst patient data is invaluable, it is also sensitive, personally identifiable information. It is therefore of utmost importance that the privacy of individuals is protected.

As a result there are, understandably, huge barriers to accessing health care data for serious research. Data can be used in the positive ways described; however, it can also be used in ways that we may perceive to be negative. For example, we may not want to be identified as having a certain medical condition. In addition, what if our data is in some way used against us?

Synthetic data is one solution under investigation as a means to enable important health-related research to be undertaken, whilst also ensuring privacy. The healthcare industry has the potential to benefit significantly from synthetic data. Synthetic data created with similar statistical properties to the real data, and therefore representative of the real data, can allow accurate data analysis to be carried out without the concern of disclosing sensitive information to the public. Governance and confidentiality issues are avoided since real patient data is not disclosed.

In the MIDAS project we are investigating the synthesis of data using a variety of techniques and comparing their results. We will analyse the trade-off between data safety (disclosure risk) and data utility (based on information loss). When data is synthesised, we must minimise the risk of disclosing sensitive data, whilst also preserving the analytical validity, and hence the utility, of the data. To produce synthetic data we must accurately model the real data, preserving its patterns and relationships. We then sample from the model to build the synthetic data. We can validate the approach by running the same analysis on the real data and on the synthetic data and comparing the results. We will also investigate methods of evaluating and verifying that the synthetic data is safe to release to the public.
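To make the model, sample and validate loop concrete, here is a minimal sketch in Python. It is an illustration only, not the method used in MIDAS: the "real" data is a randomly generated stand-in (hypothetical age and blood pressure columns, not patient records), the model is a simple bivariate Gaussian, and all function and variable names are our own assumptions. Utility is judged by how closely the means and the correlation of the synthetic data match the real data, and a very crude disclosure proxy counts synthetic records that exactly reproduce a real one.

```python
import math
import random
import statistics


def fit_gaussian(pairs):
    """Fit a bivariate Gaussian: two means, two std devs, one correlation."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs) / (len(pairs) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample(model, n, rng):
    """Draw n synthetic records from the fitted model."""
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - model["rho"] ** 2))
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (model["rho"] * z1 + residual * z2)
        records.append((x, y))
    return records


rng = random.Random(42)

# Stand-in for the real data (hypothetical): age and a correlated
# blood pressure reading. These are generated values, not patient data.
real = []
for _ in range(2000):
    age = rng.gauss(55, 12)
    real.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

# 1. Model the real data.  2. Sample from the model.
real_model = fit_gaussian(real)
synthetic = sample(real_model, len(real), rng)

# 3. Validate utility: repeat the same summary analysis on the synthetic data.
synth_model = fit_gaussian(synthetic)
utility_gap = {k: abs(real_model[k] - synth_model[k]) for k in real_model}

# Crude disclosure proxy: synthetic records identical to a real record.
exact_matches = len(set(real) & set(synthetic))

print("utility gap per statistic:", utility_gap)
print("exact matches with real records:", exact_matches)
```

A small gap suggests that this particular analysis is preserved, but says nothing about analyses the model does not capture. Likewise, zero exact matches is only a first sanity check; it is not evidence that the data is safe to release, which needs a proper disclosure risk assessment.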

MIDAS hopes to demonstrate that synthetic data is a robust and convincing alternative to real data in health-related research. By providing empirical evidence of its feasibility, MIDAS aims to give data owners and users confidence in the analytical validity and data safety of synthetic data.

Within MIDAS this task is experimental in nature, in that we will work with members of our Policy Board, which includes governmental health care institutions, to determine the viability of releasing synthetic data as open data or by making it remotely available via a safe haven.

If it is possible, agreeable and legal, we will look to use these synthetic datasets for a number of activities. These include sharing data across countries to help inform policy-making decisions at a European level, and releasing open synthetic data to researchers and for student competitions, which would further extend the impact of the project.