Last week, members of our research team attended the ICLR conference in Vancouver with a paper on Improving GAN Training via Binarized Representation Entropy (BRE) Regularization. Of course, while we were there we also took advantage of some of the interesting talks happening at the event. Here are a few trends and themes that stood out to us.

## On Adversarial Robustness

Seven out of eight defense papers were broken before the ICLR conference even started. In the attack-and-defense setting, this suggests that heuristic methods without any theoretical guarantee are less reliable than they may seem, since the arms race between attacks and heuristic defenses always seems to be won by the attacking side.

A heuristic defense is usually broken once it has been disclosed to the attacker. With that in mind, a robustness evaluation should assume the attacker already knows everything about the model and its defense. That is not always the case in practice, but the strongest test requires assuming the attacker has all the information beforehand.
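As a minimal sketch of what "the attacker knows everything" means in practice, the snippet below runs a one-step signed-gradient (FGSM-style) attack against a logistic-regression model whose weights the attacker reads directly. The model, the input, and the names `fgsm_white_box` and `logistic_loss` are hypothetical and chosen only for illustration; none of them come from the papers discussed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, b, x, y):
    """Cross-entropy loss of a logistic-regression model on one example (y in {0, 1})."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm_white_box(w, b, x, y, eps):
    """One signed-gradient step; the attacker uses the true w and b."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w  # exact gradient of the loss with respect to the input
    return x + eps * np.sign(grad_x)

# Hypothetical toy model and input, for illustration only.
w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([0.2, -0.4, 1.0])
y = 1

x_adv = fgsm_white_box(w, b, x, y, eps=0.3)
clean_loss = logistic_loss(w, b, x, y)
adv_loss = logistic_loss(w, b, x_adv, y)
print(clean_loss, adv_loss)  # the loss rises under the perturbation
```

Each input coordinate moves by at most `eps`, so the perturbation stays inside an L-infinity ball, which is one common way to state a threat model.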

**Trends:** The community is now thinking more about:

i.) **strengthening the theoretical intrinsic robustness of the neural network**. For instance, when someone builds and trains a model, they build a theoretical guarantee into it. The goal is a neural network that is provably robust under a certain threat model (for example, which types of attack the model can defend against).

ii.) **measuring the intrinsic robustness of the neural network**. Given a heuristically defended model, its theoretical robustness properties can be measured. If the network is proven robust, no attack can break it under that threat model. This gives a more accurate robustness guarantee than evaluating against existing attacks alone, because such an evaluation says little about the effectiveness of unknown attacks.

Both of these approaches provide a theoretical measure of robustness: if the measurement says the network is robust, it is truly robust under the stated threat model.
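To make the difference between a certificate and an attack-based evaluation concrete, here is a small sketch for the one case where the certificate is easy to compute exactly: a linear classifier under an L2 threat model. The certified radius is simply the distance from the input to the decision boundary. This is an illustrative assumption, not the method of any paper mentioned here; for deep networks, certificates are usually harder to obtain and are typically lower bounds. The names `certified_l2_radius` and `predict` are hypothetical.

```python
import numpy as np

def predict(w, b, x):
    """Binary prediction of a linear classifier."""
    return int(w @ x + b > 0)

def certified_l2_radius(w, b, x):
    """Distance from x to the decision boundary w.x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Hypothetical toy model and input, for illustration only.
w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([0.2, -0.4, 1.0])

radius = certified_l2_radius(w, b, x)
label = predict(w, b, x)

# No perturbation shorter than the radius can flip the label;
# random directions are sampled here only as a spot check.
rng = np.random.default_rng(0)
directions = rng.normal(size=(1000, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
inside = [predict(w, b, x + 0.99 * radius * d) for d in directions]

# Just outside the radius, the worst-case direction (straight toward
# the boundary) does flip the label.
toward_boundary = -np.sign(w @ x + b) * w / np.linalg.norm(w)
outside = predict(w, b, x + 1.01 * radius * toward_boundary)
```

The random spot check alone would look the same for a larger radius too, which is exactly why passing a set of known attacks is weaker evidence than a proof.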

**Main takeaway:** We’re still not as far ahead in this field as we think.

## Machine Learning in the Real World

The invited talk by **Suchi Saria** (“Augmenting Clinical Intelligence with Machine Intelligence”) gave an excellent example of the difficulties of applying machine learning algorithms to applications where supervised data is unavailable or highly skewed. In this case, Prof. Saria looked at patient survival rates based on historical data.

However, this data is heavily affected by the particulars of the patient, the interventions they receive, the doctors who attend to them, and numerous other factors that cannot be controlled. The talk also highlighted the potential for machine learning to create new tools for measuring and assessing diseases (in this case Parkinson’s) more naturally, and the power of such tools to improve both treatments and quality of life for people everywhere.

**Kristen Grauman** also gave a great invited talk (“Visual Learning With Unlabeled Video and Look-Around Policies”) about how intuitive ideas of exploration and inquisitiveness can be used to learn powerful new representations in largely unsupervised ways.

## Policy Optimization

In reinforcement learning, an agent receives a scalar reward from the environment without having any information about the reward function (i.e. the mechanism behind the reward it receives). The agent then tries to improve its policy by ascending the gradient of the expected cumulative reward with respect to the policy parameters. A policy gradient method is an algorithm designed to estimate this gradient so that the agent can improve its policy as it interacts with its environment.
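The gradient ascent described above can be sketched with the score-function (REINFORCE) estimator on a toy problem. The example assumes a softmax policy over a hypothetical three-armed bandit; `pull`, `true_means`, and `reinforce_gradient` are illustrative names, and the closed-form gradient is available only because the toy rewards are known.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reinforce_gradient(theta, pull, n_samples, rng):
    """Monte Carlo estimate of d E[reward] / d theta for a softmax policy."""
    pi = softmax(theta)
    actions = rng.choice(len(theta), size=n_samples, p=pi)
    rewards = pull(actions, rng)               # only scalar rewards are observed
    scores = np.eye(len(theta))[actions] - pi  # grad of log pi(a) w.r.t. theta
    return (rewards[:, None] * scores).mean(axis=0)

# Hypothetical three-armed bandit; the agent never reads true_means directly.
true_means = np.array([1.0, 2.0, 0.5])

def pull(actions, rng):
    return true_means[actions] + rng.normal(0.0, 0.1, size=actions.shape)

rng = np.random.default_rng(0)
theta = np.zeros(3)
estimate = reinforce_gradient(theta, pull, n_samples=200_000, rng=rng)

# Closed-form gradient, used here only to check the estimate.
pi = softmax(theta)
exact = pi * (true_means - pi @ true_means)

theta_new = theta + 0.5 * estimate  # one gradient-ascent step on the policy
```

The estimator needs only sampled actions and their rewards, which is what lets the agent improve without knowing the reward function. Its variance can be large, which motivates the variance-reduction work discussed next.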

There were a few notable papers about variance reduction for policy optimization at this year’s conference. **Cathy Wu**’s paper on Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines used a factorization assumption to break the policy gradient down into several terms, whose total variance was lower than that of the original gradient estimate. She also tuned the baseline function to reduce the variance further while keeping the estimator unbiased.
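The following toy calculation illustrates the general idea of a baseline that depends on the other action dimensions. It is a sketch under simplifying assumptions, not the paper’s implementation. The policy is assumed to factorize into two independent Bernoulli action dimensions, the reward table `R` is made up, and the moments are computed exactly by enumerating the four joint actions.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def estimator_moments(theta, R, baseline):
    """Exact mean and variance of the single-sample gradient estimate for theta[0].

    baseline(a2) is subtracted from the reward; it may depend on the other
    action dimension a2, but not on a1 itself.
    """
    p = sigmoid(theta)
    mean, second = 0.0, 0.0
    for a1, a2 in itertools.product((0, 1), repeat=2):
        prob = (p[0] if a1 else 1 - p[0]) * (p[1] if a2 else 1 - p[1])
        score = a1 - p[0]  # d/d theta[0] of log Bernoulli(a1; sigmoid(theta[0]))
        g = score * (R[a1, a2] - baseline(a2))
        mean += prob * g
        second += prob * g * g
    return mean, second - mean ** 2

# Hypothetical policy parameters and reward table, for illustration only.
theta = np.array([0.3, -0.5])
R = np.array([[1.0, 4.0], [2.0, 7.0]])
p = sigmoid(theta)
expected_reward = np.array([1 - p[0], p[0]]) @ R @ np.array([1 - p[1], p[1]])

no_baseline = estimator_moments(theta, R, lambda a2: 0.0)
constant = estimator_moments(theta, R, lambda a2: expected_reward)
# Action-dependent baseline: expected reward over a1, given the other dimension a2.
action_dep = estimator_moments(
    theta, R, lambda a2: (1 - p[0]) * R[0, a2] + p[0] * R[1, a2]
)

for name, (mean, var) in [("none", no_baseline), ("constant", constant),
                          ("action-dependent", action_dep)]:
    print(f"{name:>16}: mean={mean:.4f} var={var:.4f}")
```

All three estimators share the same mean, so none of the baselines introduces bias. In this particular example the variance falls from no baseline to a constant baseline to the action-dependent one, but that ordering is a property of this toy problem and is not guaranteed in general.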

**Jiajin Li**’s paper on Policy Optimization with Second-Order Advantage Information built on Wu’s work. Rather than simply decomposing across all the action dimensions, it chose the decomposition using information from the value function. This method achieved a trade-off between the strength of the model assumption and the degree of variance reduction.

**Hao Liu**’s paper on Action-dependent Control Variates for Policy Optimization via Stein’s Identity used an advanced control variates technique to develop an action-dependent baseline for policy gradients. The resulting estimator reduces variance and allows several ways to pick the baseline function. The team’s method relies on Stein’s Identity, a result from probability theory that connects the policy gradient with the value-learning-induced DDPG (deep deterministic policy gradient).

Finally, **George Tucker**’s paper on The Mirage of Action-Dependent Baselines in Reinforcement Learning tested the algorithms from the papers listed above on a synthetic environment in which the amount of variance reduction from each can be analyzed theoretically. The paper claims that the action-dependent baseline may not yield a significant improvement in this particular synthetic environment. Inspired by the theoretical analysis of the environment, the paper instead proposed a horizon-aware value function, which performs well for variance reduction.