Learning high-quality representations for data from different modalities but with a shared underlying meaning has been a critical building block for information retrieval. Moreover, hard negative mining has been shown to be effective in forcing models to learn discriminative features.
In this paper, we present a new technique for hard negative mining for learning visual-semantic embeddings for cross-modal retrieval. We focus on selecting hard negative pairs that are sampled by an adversarial generator.
In settings with attention, our adversarial generator composes harder negatives through novel combinations of image regions across different images for a given caption. We find that this technique not only improves scores across the board for all R@K-based metrics, but is also significantly more sample efficient and converges in fewer iterations.
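For readers unfamiliar with the R@K (Recall at K) family of retrieval metrics, the following is a minimal sketch of how such a score is commonly computed. It is an illustration only, not the evaluation code used in this work. The function name `recall_at_k`, the similarity matrix layout, and the assumption that each query has exactly one correct candidate at the same index are all hypothetical choices made for the example.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose correct candidate ranks in the top k.

    Assumptions (illustrative, not taken from the paper):
    scores[i][j] is the similarity of query i to candidate j, and
    candidate i is the single correct match for query i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not scores:
        raise ValueError("scores must contain at least one query")
    hits = 0
    for i, row in enumerate(scores):
        # Zero-based rank: number of candidates scoring strictly higher
        # than the correct one (ties are resolved in the match's favour).
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(scores)


# Toy similarity matrix: three caption queries against three images.
scores = [
    [0.9, 0.1, 0.3],  # correct image ranked 1st
    [0.8, 0.5, 0.6],  # correct image ranked 3rd
    [0.2, 0.7, 0.4],  # correct image ranked 2nd
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(scores, k):.3f}")
```

Under these assumptions, a higher R@K means the correct item appears among the top K retrieved results for a larger share of queries. Datasets with several correct captions per image would need a per-query set of valid indices instead of the single diagonal match used here.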