Arabic Image Captioning Using Pre-Training Of Deep Bidirectional Transformers

Jonathan Emami, Pierre Nugues, Ashraf Elnagar, Imad Afyouni

Demo/Poster Session 1 - Tuesday 07/19 16:30 EST
Paper Zoom Discord
Poster Code Data
Arabic Image Captioning Using Pre-Training Of Deep Bidirectional Transformers
Abstract: Image captioning is the process of automatically generating a textual description of an image. It has a wide range of applications, such as effective image search, auto archiving and even helping visually impaired people to see. English image captioning has seen a lot of development lately, while Arabic image captioning is lagging behind. In this work, we developed and evaluated several Arabic image captioning models with well-established metrics on a public image captioning benchmark. We initialized all models with transformers pre-trained on different Arabic corpora. After initialization, we fine-tuned them with image-caption pairs using a learning method called OSCAR. OSCAR uses object tags detected in images as anchor points to significantly ease the learning of image-text semantic alignments. In relation to the image captioning benchmark, our best performing model scored 0.39, 0.25, 0.15 and 0.092 with BLEU-1,2,3,4 respectively1 , an improvement over previously published scores of 0.33, 0.19, 0.11 and 0.057. Beside additional evaluation metrics, we complemented our scores with human evaluation on a sample of our output. Our experiments showed that training image captioning models with Arabic captions and English object tags is a working approach, but that a pure Arabic dataset, with Arabic object tags, would be preferable.