Differential Privacy (DP) approaches are emerging as a key tool for ensuring people’s privacy is protected when training large machine learning models. In this article, we talk with Zhiqi Bu, Applied Research Scientist at Amazon AWS AI, about efforts to improve the efficiency of DP techniques and the trade-offs companies and researchers need to consider when training large models.

Why is Differential Privacy becoming an increasingly important field of research?

Companies and academics are starting to understand the harm of not having any privacy protection when deploying their models. Some of the more recent literature, for example, suggests that Large Language Models like ChatGPT can have severe memorization issues with their training data, meaning they could leak sensitive information if you give them the right adversarial prompt.

At the same time, I think organizations are trying to determine the best privacy techniques for their specific setting. There are multiple definitions of Differential Privacy (DP) – some protect the data while others protect the labels, for example – and DP can also be combined with non-DP privacy-preserving techniques like federated learning. Most organizations are now trying to understand their scenario and the trade-offs involved in each situation.

What kind of trade-offs are we talking about?

To evaluate any differential privacy method, we need to consider four factors: time efficiency, memory efficiency, parameter efficiency, and the potential accuracy gap between DP and non-DP training.

The reality is that differential privacy optimization with per-sample gradient clipping often requires much more computation than standard non-private optimization. Current methods like full fine-tuning can sometimes incur 100 times more memory cost or training time if implemented inefficiently, particularly for large models.
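To make the overhead concrete, here is a minimal sketch of one private gradient step with per-sample gradient clipping, in the style of standard DP-SGD. It is an illustration only, not the implementation discussed in the interview: the function names, the clipping threshold `max_norm`, and `noise_multiplier` are assumptions chosen for the example. The point to notice is that every sample's gradient has to be available separately before clipping, which is where the extra memory and time come from.

```python
import math
import random

def clip(grad, max_norm):
    """Scale one per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def private_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum them, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    # Noise scale is tied to the clipping threshold (hypothetical parameter names).
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]

# Three per-sample gradients for a two-parameter model (made-up numbers).
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
rng = random.Random(0)
print(private_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

A non-private optimizer would only ever need the summed gradient, so it never has to hold one gradient per sample.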

Existing work has demonstrated that high accuracy is possible under strong privacy constraints, but it requires significant computational overhead or modifications to the network architecture.

You and your colleagues have proposed Differentially Private Bias-Term only Fine-Tuning of Foundation Models (DP-BiTFiT) as a way to improve the accuracy and efficiency of DP. Can you tell us more about your research?

The algorithmic innovation really comes from the insight that biases are fundamentally different from weights. One big difference is quantity.

Our experiments on various foundation models – including GPT – have shown that the bias terms represent around 0.1% or less of the total parameters. This means that DP-BiTFiT, by optimizing only the biases, can be much more efficient: in fact, 1.5 times faster than full fine-tuning while retaining almost the same accuracy. Another difference is that the chain rule for the biases is different from that of the weights. Since the biases do not directly interact with the data input, DP-BiTFiT doesn’t need to store the activations during forward propagation, which means it almost eliminates the overhead created by traditional DP in terms of both time and space complexity.
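The chain-rule difference can be illustrated with a single linear layer, s = aW + b. The sketch below assumes one sample with a single input vector `activation` (a) and an already-computed output gradient `output_grad` (dL/ds); both names are hypothetical, and this is not code from the paper. The weight gradient is the outer product of the activation and the output gradient, so the activation must be kept from the forward pass. The bias gradient is just the output gradient, so its per-sample norm can be computed without the activation at all.

```python
import math

def linear_layer_grads(activation, output_grad):
    """Per-sample gradients for s = a @ W + b, given a and dL/ds."""
    # dL/dW[i][j] = a[i] * dL/ds[j]: needs the stored activation.
    weight_grad = [[a * g for g in output_grad] for a in activation]
    # dL/db[j] = dL/ds[j]: no activation involved.
    bias_grad = list(output_grad)
    return weight_grad, bias_grad

def bias_grad_norm(output_grad):
    """L2 norm of the per-sample bias gradient, from the output gradient alone."""
    return math.sqrt(sum(g * g for g in output_grad))

weight_grad, bias_grad = linear_layer_grads([1.0, 2.0], [3.0, 4.0])
print(weight_grad)
print(bias_grad, bias_grad_norm([3.0, 4.0]))
```

For sequence inputs the bias gradient would additionally be summed over positions, but it still depends only on the output gradients.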

Another benefit is that DP-BiTFiT is model agnostic. Other methods like traditional full fine-tuning and adapters generally require you to really understand the model to determine which parameters you want to optimize. That may require hundreds of lines of code to implement. DP-BiTFiT can be applied to any model, which means you don’t need to modify the network architecture. And it can often be achieved with just one line of code.
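As a rough, framework-free sketch of why the selection step can be so short, the example below picks trainable parameters purely by name from a hypothetical name-to-shape mapping for one transformer-style block. The parameter names and shapes are assumptions made up for illustration, not taken from any specific model; in a framework such as PyTorch, the analogous step would presumably iterate over the model's named parameters and leave gradients enabled only for those whose names end in "bias".

```python
import math

# Hypothetical parameter names and shapes for one transformer-style block.
shapes = {
    "attn.q.weight": (768, 768), "attn.q.bias": (768,),
    "attn.k.weight": (768, 768), "attn.k.bias": (768,),
    "attn.v.weight": (768, 768), "attn.v.bias": (768,),
    "attn.out.weight": (768, 768), "attn.out.bias": (768,),
    "ffn.up.weight": (768, 3072), "ffn.up.bias": (3072,),
    "ffn.down.weight": (3072, 768), "ffn.down.bias": (768,),
    "norm1.weight": (768,), "norm1.bias": (768,),
    "norm2.weight": (768,), "norm2.bias": (768,),
}

def select_bias_names(shapes):
    """The 'one line': keep only parameters whose name marks them as a bias."""
    return [name for name in shapes if name.endswith(".bias")]

def bias_fraction(shapes):
    """Share of all parameters that would be trained under bias-only tuning."""
    total = sum(math.prod(shape) for shape in shapes.values())
    trainable = sum(math.prod(shapes[name]) for name in select_bias_names(shapes))
    return trainable / total

print(select_bias_names(shapes))
print(f"trainable fraction: {bias_fraction(shapes):.4%}")
```

Because the rule depends only on parameter names, it needs no knowledge of how the layers are wired together, which is the sense in which the approach is model agnostic.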

How does that research influence the trade-offs we discussed earlier?

We wanted to develop a private optimization method with a similar level of accuracy and efficiency to standard optimization. On a wide range of tasks, DP-BiTFiT is as efficient as non-DP BiTFiT, and it is up to 30 times faster and uses up to 8 times less memory than DP full fine-tuning. Our paper proves that DP-BiTFiT allows the gap between DP optimization and non-DP optimization to become almost non-existent.

Simply put, DP-BiTFiT changes the trade-offs that need to be made because it makes DP nearly as accurate and efficient as non-DP optimization. And, since it is model agnostic, it can basically be plug-and-play.

The hope is that all researchers and practitioners who need to protect data privacy can benefit from this work, particularly those who want to use large models but have only limited computing resources. In particular, we see a trend in which larger pre-trained models give better accuracy with DP-BiTFiT, compared to non-DP training.

Why are you interested in the field of Differential Privacy?

First off, I think the development of trustworthy and socially responsible machine learning should be important to anyone who cares about more than just the accuracy of models. For researchers like myself, Differential Privacy stands out as a relatively young field of research with many opportunities to explore.

It is surprising to witness how differentially private deep learning has evolved from small models with 5 layers in 2015 to GPT-level models with billions of parameters today. I also believe that the techniques developed while building privacy-preserving machine learning can often be future-proof. It’s a great opportunity to have a big impact on all the models yet to come.

I think differential privacy research is a high priority. AWS also cares about efficiency across multiple dimensions: time efficiency, memory efficiency, communication efficiency, and parameter efficiency. These are all directly related to the applicability of any large model. DP optimization is an actively evolving field of research, and one that I believe companies and researchers should be emphasizing.

The paper referenced throughout the interview is ‘Differentially Private Bias-Term only Fine-tuning of Foundation Models’.