Learning high-quality representations for data from different modalities that share an underlying meaning has been a critical building block for information retrieval. Moreover, hard negative mining has been shown to be effective in forcing models to learn discriminative features.
In this paper, we present a new technique for hard negative mining for learning visual-semantic embeddings for cross-modal retrieval. We focus on selecting hard negative pairs that are sampled by an adversarial generator.
In settings with attention, our adversarial generator composes harder negatives through novel combinations of image regions across different images for a given caption. We find improved scores across the board for all R@K-based metrics; moreover, this technique is significantly more sample efficient and converges in fewer iterations.
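To illustrate the role hard negatives play in this setting, below is a minimal sketch of a standard max-violation triplet loss for visual-semantic embeddings, where the hardest negative is simply the most similar non-matching item in the batch. This is the common baseline that adversarially generated negatives aim to improve upon; it is not the paper's adversarial generator itself, and the function and argument names are illustrative.

```python
import numpy as np

def hard_negative_triplet_loss(img_emb, cap_emb, margin=0.2):
    """Max-violation triplet loss over a batch.

    img_emb, cap_emb: (B, D) L2-normalized embeddings; row i of each
    forms a positive (image, caption) pair. For every positive pair we
    penalize only the hardest (most similar) in-batch negative.
    """
    sims = img_emb @ cap_emb.T            # (B, B) cosine similarities
    pos = np.diag(sims)                   # similarity of each positive pair
    # Mask the diagonal so positives cannot be chosen as negatives.
    mask = np.eye(len(sims), dtype=bool)
    sims_masked = np.where(mask, -np.inf, sims)
    # Hardest negative caption per image (row max) and
    # hardest negative image per caption (column max).
    hard_cap = sims_masked.max(axis=1)
    hard_img = sims_masked.max(axis=0)
    loss = (np.maximum(0.0, margin + hard_cap - pos) +
            np.maximum(0.0, margin + hard_img - pos))
    return loss.mean()
```

An adversarial generator replaces the in-batch `max` with sampled or composed negatives (e.g., mixing image regions across images), which can produce harder examples than any single batch element.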