March 2, 2020
Just as data is now considered the most important currency in commerce, data privacy has become an equally important concern for organizations worldwide. With the rapid adoption of cloud computing, the industry has shifted to a privacy-first paradigm, strongly enforced by consumer demand, business rules, and a growing regulatory landscape (such as GDPR and CCPA). In 2019 alone, GDPR enforcement resulted in fines totaling approximately 440 million euros. Acknowledging these challenges, in January 2020 the National Institute of Standards and Technology (NIST) released a Privacy Framework to help enterprises more easily achieve their privacy objectives.
While many approaches have been proposed to protect data privacy, the most prevalent is the anonymization of personally identifiable information (PII). In its most common implementation, this process removes information that can be used to uniquely identify an individual. The problem, however, is that anonymized data can in many cases still be used to reveal personal information, or be cross-referenced with external sources to reconstruct the PII. Well-known examples include the re-identification of anonymized genomic and location data.
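To make the re-identification risk concrete, here is a minimal sketch of a linkage attack. All records, names, and field choices below are fabricated for illustration: the point is that quasi-identifiers surviving anonymization can be joined against a public dataset.

```python
# Toy linkage attack: an "anonymized" medical dataset (names removed) is
# re-identified by joining on quasi-identifiers (zip, birth_year, sex)
# that also appear in a public voter roll. All records are made up.

anonymized_medical = [
    {"zip": "02139", "birth_year": 1962, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02141", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
]

public_voter_roll = [
    {"name": "A. Smith", "zip": "02139", "birth_year": 1962, "sex": "F"},
    {"name": "B. Jones", "zip": "02146", "birth_year": 1980, "sex": "M"},
]

def quasi_id(record):
    """A combination of attributes that is often unique per person."""
    return (record["zip"], record["birth_year"], record["sex"])

# Index the public data by quasi-identifier, then join.
index = {quasi_id(r): r["name"] for r in public_voter_roll}
reidentified = [
    {"name": index[quasi_id(r)], "diagnosis": r["diagnosis"]}
    for r in anonymized_medical
    if quasi_id(r) in index
]
print(reidentified)  # the "anonymous" diabetes record links back to A. Smith
```

No names were ever present in the medical data, yet one sensitive record is fully re-identified.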
Consequently, what is lacking from these methods is a guarantee that user privacy is truly protected. To support a real transition, we need technologies that provide a mathematically proven level of privacy and security. Fortunately, the last few years have seen a wave of advances that have made this goal far more practical and achievable.
One of the first such technologies in our arsenal is Fully Homomorphic Encryption (FHE). As detailed in our previous blog, FHE is the first encryption technology that enables arbitrary computations to be performed on encrypted data without decrypting it. Consider any generic computational process as a function \(f\) which, given input data \(x\), produces a result \(y\); that is, \(f(x)=y\). FHE provides a quantum-grade guarantee that protects the privacy of \(x\) and \(y\) from the provider computing \(f\). But what about the privacy of \(f\) itself? A receiver who can view \(x\) and \(y\) may still be able to infer details about it.
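To see the core idea of computing on ciphertexts, here is a toy example using the textbook Paillier cryptosystem. Note this is only additively homomorphic (not full FHE) and uses tiny parameters; it is a sketch of the property, not a usable implementation.

```python
# Toy Paillier cryptosystem (additively homomorphic only, NOT full FHE):
# the provider combines ciphertexts and never sees x or y.
# Tiny primes for readability; never use parameters like these in practice.
import math
import random

p, q = 17, 19                        # toy primes
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Homomorphic addition: multiplying ciphertexts adds the plaintexts.
c = (encrypt(20) * encrypt(22)) % n2
print(decrypt(c))  # 42
```

The party holding only ciphertexts can produce an encryption of \(20 + 22\) without ever learning either operand; full FHE extends this to arbitrary computations.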
This brings us to Differential Privacy (DP), which provides not only a mathematically rigorous definition of privacy but also a framework for constructing private algorithms. Roughly speaking, given two datasets \(A\) and \(B\) that differ by a single record \(R\), a differentially private algorithm produces nearly indistinguishable outputs on \(A\) and \(B\), revealing almost nothing about \(R\). In other words, one can perform certain computations on a database without learning much about any individual. This is usually achieved by adding a controlled perturbation to the computational process. In recent years, new techniques have been developed to preserve the privacy of AI models trained on sensitive data, that is, to protect the privacy of \(f\) in our earlier example.
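The "controlled perturbation" can be made concrete with the classic Laplace mechanism. Below is a minimal sketch for a counting query; the dataset and function names are our own illustration. A count has sensitivity 1 (adding or removing one record changes it by at most 1), so noise drawn from Laplace with scale \(1/\epsilon\) yields \(\epsilon\)-differential privacy.

```python
# Minimal sketch of the Laplace mechanism for a counting query.
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) by inverse transform of a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon, rng=random):
    """Counting query with sensitivity 1, made epsilon-DP via Laplace noise."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical query: how many patients are over 60?
ages = [34, 67, 71, 45, 62, 58, 80]
noisy = private_count(ages, lambda a: a > 60, epsilon=1.0)
print(noisy)  # true count is 4; the reported value is 4 plus random noise
```

No single individual's presence or absence can be confidently inferred from the noisy answer, while averages over many queries remain useful.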
A related and fast-emerging technology is Federated Learning (FL). In this regime, the analytics provider would like to obtain data from multiple, often many, different data owners in order to learn better patterns and deliver better insights for each owner. The challenge, of course, is that such data is often safeguarded by strict privacy and security rules, such as in healthcare, so gaining access can be quite difficult. FL solves this by letting the computations run on-premise, within the perimeter of each data-owning entity. Specifically, to train an AI model, the provider sends the training code to each institution to be run locally, and only the updates to the model are shared back (instead of the original data). This way, private data never moves outside of its trusted boundaries.
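The aggregation step described above can be sketched as federated averaging (FedAvg). In this toy version, our own for illustration, a model is a list of floats and local training is stubbed out; real systems would run gradient descent on each institution's private data.

```python
# Minimal sketch of federated averaging (FedAvg): each institution trains
# locally and only model updates leave its perimeter. Models are plain
# lists of floats; local training is stubbed out for illustration.

def local_update(global_model, local_data):
    """Stub for local training: nudge each weight toward the local mean.
    A real system would run gradient descent on the private data here."""
    target = sum(local_data) / len(local_data)
    return [w + 0.1 * (target - w) for w in global_model]

def fed_avg(global_model, institutions):
    """Aggregate client updates, weighted by local dataset size."""
    total = sum(len(data) for data in institutions)
    updates = [(local_update(global_model, data), len(data))
               for data in institutions]
    return [
        sum(u[i] * size for u, size in updates) / total
        for i in range(len(global_model))
    ]

# Three hospitals hold private datasets of different sizes.
hospitals = [[1.0, 2.0], [3.0, 3.0, 3.0], [5.0]]
model = [0.0]
for _ in range(5):  # five federated rounds
    model = fed_avg(model, hospitals)
print(model)  # weights drift toward the weighted data mean; raw data never moves
```

Only the updated weights cross institutional boundaries; the raw values in `hospitals` stay where they are.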
While FL shows great promise, it offers no privacy for the provider’s AI model, since every institution holds a copy of it. This is not desirable if the model contains important IP. Instead, the provider can keep the model running in the cloud and use FHE + DP to protect the privacy of \(x\), \(y\), and \(f\). Another problem with FL is that the model updates shared with the provider may still contain details about the original data. If so, a malicious provider could exploit them to gain unintended knowledge. To prevent this information leakage, commercial implementations of FL often leverage a cryptographic technique called Secure Multiparty Computation (MPC). In general, MPC allows multiple parties to perform a joint computation without revealing each party’s data to the others. In the context of FL, the provider only needs the average of the local updates from each institution, which can be computed with MPC via a secure-aggregation protocol.
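The cancellation trick behind secure aggregation can be sketched in a few lines. This toy version, our own simplification, uses pairwise additive masks in a finite field; production protocols additionally handle client dropout and establish the pairwise secrets cryptographically.

```python
# Toy secure-aggregation sketch: each pair of institutions (i, j) agrees
# on a random mask; i adds it and j subtracts it, so all masks cancel in
# the sum. The provider sees only masked updates yet recovers the exact
# total. Real protocols add dropout handling and key agreement.
import random

P = 2 ** 61 - 1  # field modulus, large enough to hold the true sum

def mask_updates(updates, rng=random):
    n = len(updates)
    masked = list(updates)
    for i in range(n):
        for j in range(i + 1, n):
            s = rng.randrange(P)        # shared pairwise secret
            masked[i] = (masked[i] + s) % P
            masked[j] = (masked[j] - s) % P
    return masked

client_updates = [12, 7, 30]            # hypothetical scalar model updates
masked = mask_updates(client_updates)
aggregate = sum(masked) % P
print(aggregate)  # 49 == 12 + 7 + 30, though each masked value looks random
```

Each individual masked value is statistically independent of the client's true update, so the provider learns the aggregate and nothing else.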
It should be evident by now that building a privacy-preserving solution often requires combining several of these technologies. While in most cases FHE provides the highest level of privacy and security, depending on the problem at hand, the solution may require additional innovations in other areas. For more details on how Inferati’s technology can help, please contact us.