Over the past few months, a series of reinforcement learning applications have made their public debut. Like any promising debutante, these new applications have captured all eyes in the room. Their reception will likely set the tone for early adoption.

While deep learning applications have successfully integrated into multiple product categories, RL has had a slower initiation. The recent momentum behind RL-based commercialization has been propelled by research advancements that have naturally lent themselves to product ideas in specific sectors, like financial markets, health care and marketing. Once competition ignites, this early trickle is predicted to burst into a gushing pipeline.

But RL algorithms are not your standard, run-of-the-mill solutions, and it’s unwise to treat them as such. Most pressingly, they’re continual learning algorithms: the type of data they require, combined with their potential for industry disruption, demands that privacy techniques catch up with the privacy challenges these algorithms pose. One such technique is differential privacy.

### What is Differential Privacy?

The notion of privacy is intuitively difficult to translate into a technical definition. One standard definition around which the academic community has coalesced is a framework called differential privacy. Differential privacy centers on whether an individual’s participation in a dataset is discernible. An algorithm that acts on a dataset is called differentially private if an individual’s presence in, or removal from, that dataset has minimal impact on the algorithm’s output. Differential privacy is typically achieved by adding perturbation, or “noise,” during the algorithmic training process. The level and location of the noise are finely calibrated according to the degree of privacy and accuracy desired and the properties of the dataset and algorithm.
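For readers who prefer symbols, the informal definition above is usually stated as follows. The notation is not from this post and is assumed here for illustration: M is the randomized algorithm, D and D' are two datasets that differ in one individual’s data, S is any set of possible outputs, and ε ≥ 0 and δ ∈ [0, 1) are the privacy parameters.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
```

Smaller values of ε and δ make the two output distributions harder to tell apart, which is the sense in which one person’s participation has “minimal impact.”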

Standard differential privacy techniques work on fixed datasets that are already known to researchers. This prior knowledge allows researchers to decide how much noise they want to add to the dataset in order to protect the individual’s privacy. A standard example is compiling an aggregate statistic on how many people have done activity ‘x’, then setting the noise parameters so that the released result is statistically almost indistinguishable whether we keep or remove any individual from the dataset. But what happens when the data comes from a continuous state space, changes constantly, and the algorithm is continuously learning? For that, we need a new approach.
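The counting example above can be sketched with the Laplace mechanism, a textbook way to privatize a count. This is a minimal illustration on a fixed dataset, not the method from our paper; `private_count`, `laplace_noise`, `epsilon`, and the toy data are hypothetical names chosen for this sketch.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(flags, epsilon, rng):
    """Release how many entries are True, plus Laplace noise.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity of this query is 1 (assumed setup).
    """
    true_count = sum(1 for f in flags if f)
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
did_x = [True, False, True, True, False, True]  # toy data: who did activity 'x'

# Release the count with and without the last individual in the dataset.
with_person = private_count(did_x, epsilon=0.5, rng=rng)
without_person = private_count(did_x[:-1], epsilon=0.5, rng=rng)
print(with_person, without_person)
```

A smaller `epsilon` means a larger noise scale, so the two releases become harder to distinguish at the cost of a less accurate count.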

### Differential Privacy in Deep Reinforcement Learning

In our paper, Private Q-Learning with Functional Noise in Continuous Spaces, we focus on finding new avenues to address this complexity. We do this by taking general concepts of differential privacy, abstracting them, and applying them to a different space. So, instead of adding scalar noise to a vector, we focus on protecting the reward function, adding perturbations as it gets updated by the algorithm.

This step is important because a reward function reveals the value of actions and, therefore, the latent preferences of users. So, for example, when you click the thumbs-up button on a social media app, this action gets codified as a “reward” that informs the “policy” for what the algorithm should do next time it identifies a similar user in a similar state. Our approach protects the “why” (the motivation or intent) of the individual’s decision. It blocks individual preferences from being identified while still allowing for the abstraction of the policy. This protects the motivation for the reward instead of the outcome. We want to keep private the fact that the system has learned about your die-hard fandom for indie music, while still enabling the algorithm to build intelligence so it can personalize recommendations for different users.

### Show me the math

We applied privacy to a setting that generalizes to a variety of learning tasks, the Q-learning framework of reinforcement learning, where the objective was to maximize the action-value function. We used function approximators (e.g., a neural network) parameterized by θ to learn the optimal action-value function. In particular, we considered the continuous state space setting, where the action-value function Q(s, a) was assumed to be a set of *m* functions defined on the interval [0, 1] and, similarly, the reward was a set of *m* functions, each defined on the interval [0, 1].
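As a reference point, the standard (non-private) Q-learning objective for a parameterized action-value function is usually written as below. This is the textbook form rather than a formula quoted from our paper; the discount factor γ, the transition (s, a, r, s'), and the loss name L(θ) are notation assumed here.

```latex
L(\theta) = \mathbb{E}_{(s, a, r, s')} \Big[ \big( r + \gamma \max_{a'} Q_{\theta}(s', a') - Q_{\theta}(s, a) \big)^{2} \Big]
```

Minimizing this loss pushes Q_θ toward the optimal action-value function, and the greedy policy then picks the action with the largest Q_θ(s, a) in each state.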

Standard perturbation methods for ML models achieve differential privacy (DP) by adding noise to vectors: the input to the algorithm, the output of the algorithm, or the gradient vectors within the ML model. In our case, we aimed to protect the reward function, which can depend on high-dimensional context. With standard methods, the amount of noise to be added would grow without bound as the continuous state space is discretized more finely. Since we wanted to perturb the action-value functions, we added functional noise rather than the vector-valued noise of standard methods. This functional noise was a sample path of an appropriately parametrized Gaussian process and was added to the function released by our Q-learning algorithm. As in the standard methods, the noise was parametrized by the sensitivity of the query, which, for vector-valued noise, is the ℓ2 norm of the difference in output between two datasets that differ in one individual’s value. Here, since we considered reward functions that change in value according to the state of the environment (which has randomness), we used the notion of Mahalanobis distance for sensitivity, which captures the idea of a distance from one point to a set of sampled points.
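To make “functional noise” concrete, here is a rough sketch of drawing one Gaussian process sample path over the interval [0, 1] and adding it to a function. It is a simplification under stated assumptions, not our algorithm: the exponential kernel, the fixed evaluation grid, the stand-in `q_values` function, and the `noise_scale` and `length_scale` values are all hypothetical, and `noise_scale` is not calibrated to any sensitivity or privacy budget, so the sketch carries no privacy guarantee as written.

```python
import math
import random


def exp_kernel(x, y, length_scale):
    # Assumed covariance function; nearby states get correlated noise.
    return math.exp(-abs(x - y) / length_scale)


def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_gp_path(grid, noise_scale, length_scale, rng):
    """Draw one zero-mean Gaussian process sample path on the grid points."""
    n = len(grid)
    cov = [
        [
            noise_scale ** 2 * exp_kernel(grid[i], grid[j], length_scale)
            + (1e-12 if i == j else 0.0)  # tiny jitter for numerical stability
            for j in range(n)
        ]
        for i in range(n)
    ]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Correlate the independent normals: path = L z.
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


def q_values(s):
    # Hypothetical stand-in for a learned Q(s, a) for one fixed action.
    return 0.5 + 0.3 * math.sin(2.0 * math.pi * s)


rng = random.Random(0)
grid = [i / 20 for i in range(21)]  # states in [0, 1]
noise = sample_gp_path(grid, noise_scale=0.1, length_scale=0.25, rng=rng)
noisy_q = [q_values(s) + g for s, g in zip(grid, noise)]
```

The point of the sketch is that the perturbation is a single correlated function over the state space rather than independent noise per discretized state, which is why refining the grid does not require piling on more noise.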

### What this means for applications

Say Patient Zero is exhibiting medical symptoms and goes to see the doctor. The doctor gives Patient Zero Drug A (first state). Drug A doesn’t alleviate the symptoms, so now the doctor tries Drug B (second state). Patient Zero then moves into a third state, and so on, until the problem (the illness) is solved. Here, the agent can take a limited number of actions; the system observes the state (symptoms relieved or not relieved?), and based on that observation, the agent must decide what to do next. The algorithm observes the outcome and, depending on the results, the agent gets a reward or a punishment. The quality of the reward depends on the long-term goal it’s trying to achieve. Are you getting closer to the goal (symptom alleviation) or moving away from it (even sicker than before the drugs)?
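The loop described above (observe a state, take an action, receive a reward, update) can be sketched with a tabular Q-learning update. This is a toy, non-private illustration; the state names, action names, reward values, and the `alpha` and `gamma` settings are hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(state, action) toward the observed target."""
    best_next = max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Action values start at zero for every (state, action) pair.
q = {
    "symptomatic": {"drug_a": 0.0, "drug_b": 0.0},
    "recovered": {"drug_a": 0.0, "drug_b": 0.0},
}

# Drug A does not help: negative reward, and the patient stays symptomatic.
after_a = q_update(q, "symptomatic", "drug_a", reward=-1.0, next_state="symptomatic")

# Drug B relieves the symptoms: positive reward, and the patient recovers.
after_b = q_update(q, "symptomatic", "drug_b", reward=1.0, next_state="recovered")

# The greedy choice for a symptomatic patient is now the higher-valued action.
best_action = max(q["symptomatic"], key=q["symptomatic"].get)
```

After these two updates, the table favors Drug B for a symptomatic patient, which is how rewards end up encoding what worked for a particular individual.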

The privacy measures applied to RL in the past have mostly centered on protecting an individual’s movement (or itinerary) within a particular state. The policy, then, would be defined around why a user took a specific action in this state. This approach works well for the above scenario, where we’re protecting the user’s movement from state to state but not protecting a policy that can be extrapolated to many other users. It falls short, however, when applied to areas like marketing, with far more dynamic datasets and continual learning.

Differential privacy in deep RL is a more general and scalable technique, as it protects a higher-level model that captures behaviors rather than just limiting itself to a particular data point. This approach is important for the future as we move to continuous, online learning systems: by blocking individual preferences from being identified while allowing for the policy to be abstracted, we protect the motivation for the reward instead of the outcome. These kinds of safety guarantees are vital in order to make RL practical.