Privacy and health data mining

Guest post by Gorana Nikolic from KU Leuven

Modern data mining often involves not only large amounts of data, but also collaboration between several parties in building the algorithms. In most cases, such collaborations have to work without the data being shared between the parties involved. But how can they join forces without joining the data?

Privacy preserving data mining is a useful approach when several parties need to collaborate but none of them wants to reveal its data to the others. Healthcare is a common source of such problems. If caretakers from different levels, such as primary care and secondary care, want to use a patient’s records to answer a medical question, they would normally need to pool all their data together. This is rarely possible, because privacy policies and laws prohibit caretakers from revealing confidential patient records to one another. It is therefore necessary to find a solution that allows the parties to run a data analysis algorithm across all their datasets without pooling or revealing the actual data. Several privacy preserving data analysis methods make it possible to bypass this data pooling problem.

Privacy preserving data analysis methods can be divided into several groups, depending on the privacy objective and on whether they are applied to the input data, the chosen model or the model output. Some methods use perturbation and randomisation techniques that alter the data by adding noise, or apply additional cleaning to avoid unintended leaks of data. Other methods use secure multi-party computation, where cryptographic tools provide privacy between the parties involved but add extra computational cost. In both cases there is a trade-off: ensuring privacy costs something, either in the utility of the data or in computation.
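To make the multi-party idea more concrete, here is a minimal sketch of one classic building block, additive secret sharing, used to compute a joint sum. It is an illustration only, not the method used by any particular project: the function names, the modulus and the example patient counts are all assumptions, and it presumes at least two honest parties who follow the protocol.

```python
import secrets

# Assumed public modulus; it must be larger than any possible total.
MODULUS = 2**61 - 1

def make_shares(value, n_parties, modulus=MODULUS):
    """Split `value` into n_parties random shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]

def secure_sum(private_values, modulus=MODULUS):
    """Simulate the protocol: no party ever sees another party's raw value."""
    n = len(private_values)
    # Each party i splits its value and sends share j to party j.
    all_shares = [make_shares(v, n, modulus) for v in private_values]
    # Each party j adds the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # The published partial sums add up to the true total.
    return sum(partial_sums) % modulus

# Hypothetical example: three care providers each hold a private patient count.
counts = [120, 75, 310]
total = secure_sum(counts)
```

Each individual share looks like a random number, so a single share reveals nothing about the value it came from. The price is the extra messages and arithmetic, which is the computational cost mentioned above.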

Federated learning is a solution that lets the parties involved learn a shared model without centralising their data. In the approach recently proposed by Google^[1], the parties share learned parameters instead of data. These parameters are collected and secured on a server, which aggregates them and distributes the aggregated model back to the clients. At the end of the learning process each partner receives a finished model. Even though clients do not reveal their data, the shared parameters can sometimes leak information about the training set they were obtained from. This can be mitigated with differential privacy, an approach used for example by Apple^[2], in which noise is added on the client side before anything is shared.
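The round-trip described above can be sketched in a few lines. The following is a toy illustration under stated assumptions, not Google's or Apple's implementation: the model is a single weight `w` fitted so that y ≈ w * x, the function names and datasets are hypothetical, and the optional Gaussian noise only gestures at differential privacy. A real deployment would also bound each client's contribution and calibrate the noise to a formal privacy budget.

```python
import random

def local_update(w, data, lr=0.01, epochs=5):
    """One client's local training: gradient descent on squared error for y ≈ w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, client_datasets, noise_std=0.0, rng=random):
    """Clients train locally; the server only sees their (optionally noised) weights."""
    updates, sizes = [], []
    for data in client_datasets:
        w = local_update(global_w, data)
        # Optional client-side noise, added before the parameter leaves the client.
        w += rng.gauss(0.0, noise_std)
        updates.append(w)
        sizes.append(len(data))
    # Server-side aggregation: average weighted by each client's sample count.
    return sum(w * n for w, n in zip(updates, sizes)) / sum(sizes)

# Hypothetical private datasets held by two clients; both follow y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
```

Without noise, the shared weight converges towards 2, the slope underlying both datasets, even though neither dataset ever leaves its client. Raising `noise_std` makes each shared parameter less informative about the data behind it, at the cost of a less accurate aggregate.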

In the context of MIDAS, we want to explore the most recent advances in privacy preserving machine learning, namely federated learning and differential privacy. A high-level concept for how best to apply privacy preserving machine learning and data mining is being drafted between partners. Once ready, this approach will allow the synthesis of a common model, based on the parameters independently obtained by every participant, without compromising the sensitive medical data held by each party.

^[1] https://research.googleblog.com/2017/04/federated-learning-collaborative.html

^[2] https://images.apple.com/privacy/docs/Differential_Privacy_Overview.pdf