Challenges of Data Anonymisation vs. Use of De-identified Data

Guest Post by Antti Tuomi-Nikula (THL) and Juha Pajula (VTT)

Of all the data in the world, personal health data are perhaps among the most sensitive. Medical records, laboratory test results, survey answers and their combinations are not something you would like anyone other than your doctor to see. According to section 3 of the Finnish Personal Data Act, the term personal data refers to any information relating to a private individual or to his/her personal characteristics or personal circumstances, where these are identifiable as concerning him/her or the members of his/her family or household. Using these data for research purposes, or for any purpose other than medical treatment, must therefore be done with extra care, with personal data protection as the top priority. In this text we describe some key elements of the data anonymisation protocol of the Finnish Social Science Data Archive (FSD), which we follow in the Finnish Pilot of the MIDAS project. The data in the Finnish Pilot are derived from primary sources containing highly sensitive information, such as mental health data.

The focus of the Finnish Pilot in the MIDAS project is “Preventive mental health and intoxicant problems of young people”. The project aims to show how efficient preventive policies with significant long-term impacts can be achieved. The legislation and the sensitivity of the data in the Finnish Pilot require the project to prepare a separate anonymised and secured dataset from each source. Only these preprocessed and secured datasets are delivered to the MIDAS system of the Finnish Pilot. In addition to THL, the other data providers for the MIDAS project are Oulu Cohort 1986 and the city of Oulu.

FSD points out that, from the point of view of research participants, processing personal data carries the risk that confidential information relating to them is revealed to outsiders (for instance, to people close to them, to employers or to authorities). Therefore personal data must be processed carefully and in a well-planned manner. Data protection must not be jeopardised, for example, by careless preservation or insecure digital transfers. Personal identification numbers, personal names, addresses and other unnecessary identifiers must be removed from data whenever possible. Identifying information used in data processing must be destroyed permanently when it is no longer needed for validating analyses and there are no longer any legal grounds for its preservation.
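The removal of direct identifiers described above can be illustrated with a minimal Python sketch. The field names (`personal_id`, `name`, `address`) and the sample record are hypothetical and only serve the illustration; this is not the actual FSD or THL procedure.

```python
# Hypothetical field names; a real dataset would define its own list
# of direct identifiers.
DIRECT_IDENTIFIERS = frozenset({"personal_id", "name", "address"})


def strip_direct_identifiers(records, identifier_fields=DIRECT_IDENTIFIERS):
    """Return copies of the records without the listed identifier fields."""
    return [
        {field: value for field, value in record.items()
         if field not in identifier_fields}
        for record in records
    ]


# Made-up example record, not real data.
sample = [
    {
        "personal_id": "ID-0001",
        "name": "Example Person",
        "address": "Example Street 1",
        "birth_year": 1986,
        "survey_score": 12,
    }
]

cleaned = strip_direct_identifiers(sample)
print(cleaned)  # [{'birth_year': 1986, 'survey_score': 12}]
```

The function returns new records and leaves the input untouched, so the original file still has to be destroyed separately once it is no longer needed. Dropping direct identifiers alone does not make data anonymous, because combinations of the remaining variables may still point to an individual.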

However strict the data protection rules may be, their effectiveness always depends on the people implementing them and on the processes the data go through during their lifespan. Human errors happen easily, as was the case at THL earlier this year, when the personal information of over 6000 people was unintentionally leaked online. This incident prompted THL to scrutinise its data handling guidelines and processes in order to minimise the technical possibilities for leaks and breaches.

Traditionally, the research community has presumed that all research data should be as detailed as possible, down to the micro level and direct identifiers. Less attention has been paid to the real need: what level of detail does the purpose of the study actually require? Could coarser, aggregated and de facto anonymised, de-identified data be enough in many cases to capture the phenomenon of interest? We think yes.
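As a rough illustration of what coarser, aggregated data can look like, the sketch below replaces exact ages with five-year bands and counts people per municipality and band. The field names, the band width and the sample records are assumptions made for this example, not part of the pilot's actual protocol.

```python
from collections import Counter


def age_band(age, width=5):
    """Map an exact age to a band label such as '15-19'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def aggregate(records, width=5):
    """Count records per (municipality, age band) instead of keeping rows."""
    return Counter(
        (record["municipality"], age_band(record["age"], width))
        for record in records
    )


# Made-up example records, not real data.
records = [
    {"municipality": "Oulu", "age": 16},
    {"municipality": "Oulu", "age": 18},
    {"municipality": "Oulu", "age": 21},
]

counts = aggregate(records)
for group, count in sorted(counts.items()):
    print(group, count)
```

The result keeps only group sizes, which is often enough for population-level questions. Very small groups can still single out a person, so an aggregated table of this kind would normally need further checks before release.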

In the Finnish Pilot of the MIDAS project, the anonymisation follows the basic anonymisation and de-identification concepts of FSD. The aim of the data preparation is to develop an anonymised dataset that no longer falls under the GDPR or the registry laws in Finland and can be treated as sensitive research data. Even though the data will be anonymised and secured, a dedicated data transfer agreement is applied to the resulting dataset, and THL will carry out separate security testing on the final data prior to the transfer to the MIDAS system.

The final dataset will not include any direct identifiers. During the development process, the main identifier that enables linking between different registers is a pseudorandom representation of the personal identification number, which will be removed from the dataset once all registers have been combined. After that, restoring the direct identifiers should be impossible with any reasonable method, and this will also be tested. The anonymisation should not have a significant effect on the overall efficiency of the Finnish Pilot, as the studies concentrate on policies that are made for populations, not individuals.
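The link-then-remove workflow described above can be sketched as follows. The text does not say how the pseudorandom representation is produced, so the keyed hash (HMAC-SHA256) used here is an assumption chosen for illustration, as are the field names, the secret key and the sample registers.

```python
import hashlib
import hmac


def pseudonymise(personal_id, secret_key):
    """Derive a linking key from an identifier with a keyed hash.

    HMAC-SHA256 is an assumption for this sketch, not the pilot's method.
    """
    digest = hmac.new(secret_key, personal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def link_and_drop_key(register_a, register_b, key_field="pseudo_id"):
    """Join two registers on the linking key, then remove the key."""
    index_b = {row[key_field]: row for row in register_b}
    linked = []
    for row_a in register_a:
        row_b = index_b.get(row_a[key_field])
        if row_b is None:
            continue  # no matching person in the second register
        merged = {**row_a, **row_b}
        del merged[key_field]  # the combined data keeps no linking key
        linked.append(merged)
    return linked


# Hypothetical key and made-up identifiers, not real data.
secret_key = b"example-key-held-by-the-data-controller"
register_a = [{"pseudo_id": pseudonymise("ID-0001", secret_key), "visits": 3}]
register_b = [{"pseudo_id": pseudonymise("ID-0001", secret_key), "score": 12}]

combined = link_and_drop_key(register_a, register_b)
print(combined)  # [{'visits': 3, 'score': 12}]
```

In this sketch the same person maps to the same linking key in both registers only when the same secret key is used, and the combined rows no longer carry the key at all. Whether re-identification is still possible from the remaining variables is a separate question, which is why testing of the final data matters.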