Predicting the future is a fundamental task in human activity understanding. The complexity of this task comes from the fact that the future is uncertain (not to mention that humans are notoriously bad at predicting it!).
Given a fixed history of events and their corresponding times – like those shown below in Fig. 1 – multiple actions are possible in the future. In our CVPR paper of the same name, which we will be presenting this week in Long Beach, we propose a powerful generative approach that can effectively model the distribution over future actions.
To date, much of the work in this domain has focused on taking frame-level video data as input in order to predict the actions or activities that may occur in the immediate future. Time-series data often involves regularly spaced data points with interesting events occurring sparsely across time. We hypothesize that in order to model future events in such a scenario, it is beneficial to consider the history of sparse events alone (action categories and their temporal occurrence in the example of Fig. 1), instead of regularly spaced frame data. This approach also allows us to model high-level semantic meaning in the time-series data that can be difficult to discern from frame-level data.
More specifically, we are interested in modeling the distribution over the future action category and action timing given the past history of sparse events. For action timing, we aim to model the distribution over the interarrival time, which is the difference between the starting times of two consecutive actions.
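As a small illustration of the definition above, the sketch below turns a list of action start times into interarrival times. The function name and the example timestamps are hypothetical and chosen only for illustration; how the very first action's interarrival time is handled is not covered here.

```python
from typing import List


def interarrival_times(start_times: List[float]) -> List[float]:
    """Return the gaps between the start times of consecutive actions."""
    pairs = list(zip(start_times, start_times[1:]))
    if any(later < earlier for earlier, later in pairs):
        raise ValueError("start times must be sorted in non-decreasing order")
    return [later - earlier for earlier, later in pairs]


# Hypothetical start times (in seconds) of four consecutive actions.
starts = [0.0, 2.5, 4.0, 9.0]
gaps = interarrival_times(starts)
print(gaps)  # [2.5, 1.5, 5.0]
```

A sequence of N start times yields N - 1 interarrival times, one for each pair of consecutive actions.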
The contributions of this work center around the APP-VAE (Action Point Process VAE), a novel generative model for asynchronous time action sequences. Fig. 2 shows the overall structure of our proposed framework. We formulate our model within the variational autoencoder (VAE) paradigm, a powerful class of probabilistic models that facilitates generation and can model complex distributions. We present a novel form of VAE for action sequences under a point process approach. This approach has a number of advantages, including a probabilistic treatment of action sequences that allows for likelihood evaluation, generation, and anomaly detection.
Fig. 3 shows the architecture of our model. Overall, the input sequence of action categories and interarrival times is encoded using a recurrent VAE model. At each step, the model uses the history of actions to produce a distribution over latent codes $z_n$, a sample of which is then decoded into two probability distributions: one over the possible action categories and another over the interarrival time for the next action.
Since the true distribution over latent variables $z_n$ is intractable, we rely on a time-dependent posterior network $q_\phi(z_n | x_{1:n})$ that approximates it with a conditional Gaussian distribution $\mathcal{N}(\mu_{\phi_n}, \sigma^2_{\phi_n})$.
To prevent $z_n$ from just copying $x_n$, we force $q_\phi(z_n | x_{1:n})$ to be close to the prior distribution $p(z_n)$ using a KL-divergence term. Here, in order to consider the history of past actions in the generation phase, we learn a prior $p_\psi(z_n | x_{1:n-1})$ that varies across time and is a function of all past actions but not the current one.
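The KL term between the posterior and the learned prior has a closed form when both are diagonal Gaussians. The text above states the Gaussian form only for the posterior, so a Gaussian prior is an assumption in this sketch, as is the use of reparameterized sampling (the usual choice for VAEs). All function names and numeric values are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sample_latent(mu: Sequence[float], sigma: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one eps per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def gaussian_kl(mu_q: Sequence[float], sigma_q: Sequence[float],
                mu_p: Sequence[float], sigma_p: Sequence[float]) -> float:
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (math.log(sp / sq)
                  + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
                  - 0.5)
    return total


# Hypothetical posterior and prior parameters for a 2-dimensional latent code.
rng = random.Random(0)
z_n = sample_latent([0.0, 1.0], [1.0, 0.5], rng)
kl_same = gaussian_kl([0.0, 1.0], [1.0, 0.5], [0.0, 1.0], [1.0, 0.5])
kl_shifted = gaussian_kl([0.0], [1.0], [1.0], [1.0])
print(len(z_n), kl_same, kl_shifted)
```

The KL value is zero when posterior and prior coincide and grows as the posterior moves away from what the history alone would predict.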
The sequence model generates two probability distributions: i) a categorical distribution over the action categories; and ii) a temporal point process distribution over the interarrival times for the next action.
The distribution over action categories $a_n$ is modeled with a multinomial distribution when $a_n$ can only take a finite number of values: \begin{equation}
p^a_\theta(a_n=k | z_n) = p_k(z_n) \quad \text{and} \,\,\,\,
\sum_{k=1}^K{p_k(z_n)} =1 \label{eq:action}
\end{equation} where $p_k(z_n)$ is the probability of occurrence of action $k$, and $K$ is the total number of action categories.
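One common way to obtain probabilities $p_k(z_n)$ that satisfy the sum-to-one constraint is a softmax over $K$ unnormalized scores produced by a decoder. The sketch below assumes that choice; the function names and the example scores are hypothetical and not taken from the model's actual implementation.

```python
import math
import random
from typing import List, Sequence


def action_probabilities(logits: Sequence[float]) -> List[float]:
    """Softmax: map K unnormalized scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs: Sequence[float], rng: random.Random) -> int:
    """Draw an action index k with probability p_k."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical decoder scores for K = 3 action categories.
probs = action_probabilities([2.0, 0.5, -1.0])
action = sample_action(probs, random.Random(0))
print(probs, action)
```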
The interarrival time $\tau_n$ is assumed to follow an exponential distribution parameterized by $\lambda(z_n)$, similar to a standard temporal point process model:
\begin{equation}
\begin{aligned}
p^{\tau}_{\theta}(\tau_n | z_n) =
\begin{cases}
\lambda(z_n) e^{-\lambda(z_n)\tau_n} & \text{if}~~ \tau_n \geq 0 \\
0 & \text{if}~~ \tau_n<0
\end{cases}
\end{aligned} \label{eq:time}
\end{equation}
where $p^{\tau}_{\theta}(\tau_n | z_n)$ is a probability density function over the random variable $\tau_n$ and $\lambda(z_n)$ is the intensity of the process, which depends on the latent variable sample $z_n$. We train the model by optimizing the variational lower bound over the entire sequence comprising $N$ steps:
\begin{align}
\mathcal{L}_{\theta,\phi}(x_{1:N}) = \sum_{n=1}^N \Big( & \mathbb{E}_{q_\phi(z_n | x_{1:n})}[\log p_\theta(x_n | z_n)] \nonumber \\
& - D_{KL}\big(q_\phi(z_n | x_{1:n}) \,\|\, p_\psi(z_n | x_{1:n-1})\big) \Big)
\label{eq:loss}
\end{align}
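To make the bound concrete, the sketch below evaluates one term of the sum using a single latent sample in place of the expectation. It assumes that $\log p_\theta(x_n | z_n)$ splits into the action log-probability plus the interarrival-time log-density, i.e. that the two decoded distributions are independent given $z_n$. The function names and the numbers are hypothetical.

```python
import math
from typing import Sequence


def exponential_log_pdf(tau: float, rate: float) -> float:
    """Log-density of an exponential distribution with intensity `rate`."""
    if tau < 0:
        return float("-inf")  # the density is zero for negative times
    return math.log(rate) - rate * tau


def elbo_step(action: int, tau: float, action_probs: Sequence[float],
              rate: float, kl: float) -> float:
    """Single-sample estimate of one term of the lower bound."""
    log_p_action = math.log(action_probs[action])  # log p(a_n | z_n)
    log_p_time = exponential_log_pdf(tau, rate)    # log p(tau_n | z_n)
    return log_p_action + log_p_time - kl


# Hypothetical values for one step: observed action 0 after 1.0 seconds.
step = elbo_step(action=0, tau=1.0, action_probs=[0.5, 0.5], rate=1.0, kl=0.1)
# The bound for a whole sequence is the sum of its per-step terms.
sequence_bound = sum([step, step])
print(step, sequence_bound)
```

Training would maximize this quantity, so the reconstruction terms reward assigning high probability to the observed action and timing while the KL term keeps the posterior close to the history-conditioned prior.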
We empirically validate the efficacy of APP-VAE for modeling action sequences on the MultiTHUMOS and Breakfast datasets. Experiments show that our model is effective in capturing the uncertainty inherent in tasks such as action prediction and anomaly detection.
Fig. 4 shows examples of diverse future action sequences that are generated by APP-VAE given the history. For different provided histories, sampled sequences of actions are shown. We note that the overall duration and sequence of actions on the Breakfast dataset are reasonable. Variations, e.g. taking the juice squeezer before using it or adding salt and pepper before cooking eggs, are plausible alternatives generated by our model.
Fig. 5 visualizes a traversal on one of the latent codes for three different sequences by uniformly sampling one z dimension over [µ − 5σ, µ + 5σ] while fixing the others to their sampled values. As shown, this dimension correlates closely with the actions add saltnpepper, stirfry egg, and fry egg.
We further qualitatively examine the ability of the model to score the likelihood of individual test samples. We sort the test action sequences according to the average per-time-step likelihood, estimated by drawing samples from the approximate posterior distribution following the importance sampling approach. High-scoring sequences should be those that our model deems “normal”, while low-scoring sequences should be those that are unusual. Tab. 1 shows some examples of sequences with low and high likelihood on the MultiTHUMOS dataset. We note that a regular, structured sequence of actions such as jump, body roll, cliff diving for a diving action, or body contract, squat, clean and jerk for a weightlifting action, receives high likelihood. However, repeated hammer throws or golf swings with no set-up actions receive a low likelihood.
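The importance sampling estimate mentioned above can be sketched generically: draw several latent samples from the approximate posterior, weight each by the ratio of the joint density to the posterior density, and average the weights in log space. The exact estimator used in our experiments is not spelled out here, so the function names and inputs below are hypothetical, and the log-densities are assumed to be supplied by the trained model.

```python
import math
from typing import Sequence


def log_mean_exp(log_weights: Sequence[float]) -> float:
    """Numerically stable log of the mean of exp(log_weights)."""
    shift = max(log_weights)
    mean = sum(math.exp(w - shift) for w in log_weights) / len(log_weights)
    return shift + math.log(mean)


def importance_log_likelihood(log_joint: Sequence[float],
                              log_posterior: Sequence[float]) -> float:
    """Estimate log p(x) from S samples z_s ~ q(z | x).

    log_joint[s]     = log p(x | z_s) + log p(z_s)
    log_posterior[s] = log q(z_s | x)
    """
    log_weights = [lj - lq for lj, lq in zip(log_joint, log_posterior)]
    return log_mean_exp(log_weights)


# Hypothetical log-densities for S = 2 samples; the weights are 1 and 3.
estimate = importance_log_likelihood([math.log(1.0), math.log(3.0)], [0.0, 0.0])
print(estimate)  # approximately log(2)
```

Dividing such an estimate by the number of time steps gives a per-time-step score that can be compared across sequences of different lengths.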
Table 1 (below): Examples of test sequences with high and low likelihood according to our learned model:
Test sequences with high likelihood 



Test sequences with low likelihood 



We presented a novel probabilistic model for point process data: a variational autoencoder that captures uncertainty in action times and category labels. As a generative model, it can produce action sequences by sampling from a prior distribution, the parameters of which are updated based on neural networks that control the distributions over the next action type and its temporal occurrence. Our model can be used to analyze and model asynchronous data in a wide range of domains, such as social networks, earthquake events, and health informatics.