As society and business work together to control the spread of COVID-19, differential privacy has hit the headlines. Concerns that governments, technology firms, and cellphone carriers are using personal data to track movement and identify where COVID-19 may be spreading have raised the thorny issue of privacy.
While sharing data between institutions and communities can help boost AI innovation, this practice runs the risk of exposing sensitive information about the people involved. Private and secure data sharing is essential if the AI field is to succeed at scale.
Traditionally, a common practice has been to simply delete personally identifiable information (PII)—such as name, social insurance number, home address, or birth date—from data before sharing it. However, “scrubbing,” as this is often called, is no longer a reliable way to protect privacy, because the widespread proliferation of user-generated data has made it possible to reconstruct PII from scrubbed data. For example, in 2009 a Netflix user sued the company after her “anonymous” viewing history was shown to be de-anonymizable by cross-referencing other publicly available data sources, potentially revealing her sexual orientation.
Differentially private synthetic data generation presents an interesting solution to this problem. In a nutshell, differential privacy adds “noise” to sensitive data while preserving the statistical patterns that machine learning algorithms learn from, allowing data to be shared safely and innovation to proceed rapidly.
Differential privacy preserves the statistical properties of a data set—the patterns and trends that algorithms rely on to drive insights or automate processes—while obfuscating the underlying data themselves. The key idea behind differentially private data generation is to mask PII by adding statistical noise. The noisy synthetic data can be shared without compromising users’ privacy, yet still yields the same aggregate insights as the raw, noiseless data. Think of a doctor sharing trends and statistics about a patient base without ever revealing any individual patient’s details.
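To make the idea concrete, here is a minimal sketch of the Laplace mechanism, a classic way to add calibrated noise to a simple count query. The patient ages and the over-60 query are hypothetical, and production systems use vetted privacy libraries rather than hand-rolled noise:

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Differentially private count via the Laplace mechanism.

    A count has sensitivity 1: adding or removing one person changes
    the answer by at most 1, so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for x in data if predicate(x))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical patient ages; query: how many patients are over 60?
ages = [23, 45, 67, 71, 34, 62, 58, 80]
noisy_answer = laplace_count(ages, lambda a: a > 60, epsilon=1.0)
```

Any single noisy answer is off by a little, but repeated or aggregate analyses still recover the true statistical picture, which is exactly the property machine learning needs.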
Many major technology companies are already using differential privacy. For example, Google has applied differential privacy in the form of RAPPOR, a novel privacy technology that allows inferring statistics about populations while preserving the privacy of individual users. Apple also applies statistical noise to mask users’ individual data.
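RAPPOR builds on randomized response, a decades-old survey technique in which each user sometimes reports the truth and sometimes a coin flip. The sketch below shows only that core idea in simplified form; it is not Google's actual implementation:

```python
import random

def randomized_response(truth: bool) -> bool:
    # Flip a coin: heads, answer honestly; tails, flip again and
    # report that second coin. Every user gains plausible deniability,
    # because any single "yes" may just be the coin.
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_true_fraction(responses) -> float:
    # P(report yes) = 0.5 * p + 0.25, where p is the true fraction,
    # so the aggregate can be recovered by inverting that formula.
    p_yes = sum(responses) / len(responses)
    return 2 * p_yes - 0.5
```

No individual response can be trusted, yet the population-level estimate converges on the truth, which is how statistics about users can be inferred without learning about any one of them.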
Differential privacy is not a free lunch, however: adding noise makes ML algorithms less accurate, especially on smaller datasets. The remedy is scale, because differential privacy allows groups to join forces and safely leverage the combined size of their data to gain new insights. Consider a network of hospitals studying diabetes that wants to build early-diagnostic techniques from its collective patient records. Each hospital could analyze its own records independently; however, modern AI systems thrive on massive amounts of data, which in practice requires merging patient records at large scale. Differential privacy offers a way to achieve that through the sharing of synthetic data, creating a single, massive—but still privacy-preserving—dataset for scientists.
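The trade-off can be seen in a short sketch of a differentially private mean (the bounds, epsilon, and glucose readings are illustrative assumptions): the noise added to the sum has a fixed scale, so the error per record shrinks as more hospitals pool their data.

```python
import numpy as np

def private_mean(values, epsilon, lower, upper):
    # Clip each value so any one person shifts the sum by at most
    # (upper - lower); that bound sets the Laplace noise scale.
    clipped = np.clip(values, lower, upper)
    noisy_sum = clipped.sum() + np.random.laplace(0.0, (upper - lower) / epsilon)
    return noisy_sum / len(values)

# Hypothetical blood-glucose readings (mg/dL). The same noise is
# added either way, so the larger pooled dataset gets a far more
# accurate estimate for the same privacy guarantee.
one_hospital = np.random.uniform(70, 180, size=100)
all_hospitals = np.random.uniform(70, 180, size=100_000)
est_small = private_mean(one_hospital, epsilon=1.0, lower=70, upper=180)
est_large = private_mean(all_hospitals, epsilon=1.0, lower=70, upper=180)
```

This is why pooling matters: each hospital alone pays a noticeable accuracy cost for privacy, while the merged synthetic dataset pays almost none.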
While differential privacy is not a universal solution, it bridges the gap between the need for individual privacy and the need for statistical insights, opening the door to new possibilities.