Compositional Hard Negatives for Visual Semantic Embeddings via an Adversary

Learning high-quality representations for data from different modalities that share an underlying meaning has been a critical building block for information retrieval. Moreover, hard negative mining has been shown to be effective in forcing models to learn discriminative features.

In this paper, we present a new technique for hard negative mining for learning visual-semantic embeddings for cross-modal retrieval. We focus on selecting hard negative pairs that are sampled by an adversarial generator.
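For context, hard negative mining for visual-semantic embeddings is commonly written as a hinge loss over the hardest in-batch negatives. The sketch below illustrates that baseline idea only; it is not the adversarial generator presented here. The function name `hardest_negative_loss`, the margin value, and the convention that `scores[i][j]` holds the similarity between image `i` and caption `j` (matching pairs on the diagonal) are all assumptions made for illustration.

```python
def hardest_negative_loss(scores, margin=0.2):
    """Hinge loss using the hardest in-batch negative in each direction.

    scores[i][j] is the similarity of image i and caption j (assumed layout);
    diagonal entries are the matching pairs. Requires at least two pairs.
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        pos = scores[i][i]
        # Hardest negative caption for image i: best-scoring non-matching caption.
        hard_caption = max(scores[i][j] for j in range(n) if j != i)
        # Hardest negative image for caption i: best-scoring non-matching image.
        hard_image = max(scores[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hard_caption - pos)
        total += max(0.0, margin + hard_image - pos)
    return total / n


if __name__ == "__main__":
    batch_scores = [[0.5, 0.6],
                    [0.1, 0.9]]
    print(hardest_negative_loss(batch_scores))
```

An adversarially sampled negative would take the place of the `max` over in-batch candidates in such a loss; how the generator is trained is not specified in this abstract and is left out of the sketch.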

In settings with attention, our adversarial generator composes harder negatives through novel combinations of image regions across different images for a given caption. We find not only improved scores across the board for all R@K-based metrics, but also that this technique is significantly more sample efficient and converges in fewer iterations.
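As a reminder of what the R@K metrics measure, here is a minimal sketch of Recall@K for retrieval with one correct item per query. It assumes a square score matrix in which `scores[i][i]` is the correct match for query `i`; the function name `recall_at_k` and the strict tie-handling rule are illustrative choices, not taken from the paper.

```python
def recall_at_k(scores, k):
    """Fraction of queries whose correct item ranks in the top k.

    scores[i][j] is the similarity between query i and candidate j (assumed
    layout); the correct candidate for query i is assumed to be index i.
    """
    n = len(scores)
    hits = 0
    for i in range(n):
        # Zero-based rank of the correct item: how many candidates score strictly higher.
        rank = sum(1 for j in range(n) if scores[i][j] > scores[i][i])
        if rank < k:
            hits += 1
    return hits / n


if __name__ == "__main__":
    example = [[0.9, 0.1, 0.3],
               [0.8, 0.5, 0.2],
               [0.1, 0.2, 0.3]]
    print(recall_at_k(example, 1), recall_at_k(example, 2))
```

In the example matrix, the second query ranks its correct item second, so it counts as a hit at K=2 but not at K=1.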
