If you've ever delved into the world of digital art or explored the fascinating realm of machine learning, chances are you've come across the term "Generative Art." It's a captivating intersection of creativity and artificial intelligence, where algorithms transform textual prompts into beautiful visuals. Generative art has exploded in popularity recently, driven by prominent models such as DALL-E, Stable Diffusion, Craiyon, and Midjourney. However, this burgeoning field also brings a novel challenge tied to the inference cost of such models (time, resources, money): how do we predict the effectiveness of a textual prompt in generating relevant, aesthetically pleasing images? In this blog post, we will try to answer this question in simple terms. For more technical details, you can take a look at our paper: Prompt Performance Prediction for Generative IR.
Let's start!
To answer this question, we introduce the new task of "Prompt Performance Prediction" (PPP). PPP is closely related to the established concept of Query Performance Prediction (QPP) in traditional information retrieval systems. It aims to gauge the effectiveness of a prompt, predicting how well it will perform before we even see the generated images. It's like having a crystal ball that tells you whether your textual prompt will result in a stunning image or a less impressive one.
PPP's implications are widespread and impactful. For digital artists, it could be a game-changer, providing insights into the potential effectiveness of their prompts before they even hit 'generate.' This helps them refine their prompts, leading to better outcomes, and saves time (and money) that would have been spent on trial and error. For advertising and marketing professionals, it informs the creation of more engaging visual content. For developers of generative models, it offers valuable feedback for enhancing the models themselves.
In simple terms, PPP is about learning a function that predicts a prompt's performance from its features. Concretely, a machine learning model, which we'll call the performance predictor, learns from a dataset of prompts and their associated performance scores. The training process involves adjusting the model's internal parameters so that it becomes better and better at predicting performance scores from prompts. For example, imagine you're trying to teach a child to guess how much they will enjoy a new kind of food. You might tell them things like, "If you liked apples, you'll probably like pears because they're both sweet and crunchy" or "If you didn't like spinach, you might not like kale because they're both leafy greens". Over time, the child learns to make more accurate predictions based on these cues.
In the same way, our performance predictor learns from the dataset to recognize patterns or features in the prompts that are associated with higher or lower performance scores. Once trained, the performance predictor can then take a new, unseen prompt and estimate its performance score. Importantly, this is done without actually generating any images. Instead, the performance predictor relies solely on the features of the prompt itself. This makes PPP a much quicker and more efficient way to estimate prompt performance compared to actually running the generative model and evaluating the generated images.
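To make this concrete, here is a minimal sketch of what such a predictor could look like, assuming we already have a file of prompts paired with performance scores. The file name, the CLIP text encoder from sentence-transformers, and the ridge regression head are illustrative choices, not the exact setup from the paper:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical dataset of (prompt, performance score) pairs.
data = pd.read_csv("prompt_scores.csv")  # columns: "prompt", "score"

# Frozen CLIP text encoder used as the prompt feature extractor.
encoder = SentenceTransformer("clip-ViT-B-32")
features = encoder.encode(data["prompt"].tolist())

X_train, X_test, y_train, y_test = train_test_split(
    features, data["score"].values, test_size=0.2, random_state=0
)

# Lightweight regressor on top of the frozen features: this is the
# "performance predictor".
predictor = Ridge(alpha=1.0).fit(X_train, y_train)
print("R^2 on held-out prompts:", predictor.score(X_test, y_test))

# Scoring a brand-new prompt without generating a single image.
new_prompt = "a watercolor painting of a lighthouse at dawn"
print(predictor.predict(encoder.encode([new_prompt])))
```

The key point is the last two lines: once trained, the predictor only needs the text of a prompt to produce an estimated score.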
An essential part of this process is having the right data to learn from. Deep learning models generally need a lot of data to perform well, and in our case, we needed datasets containing triplets of prompts, generated images, and their corresponding performance scores. Since such datasets don't exist, we went ahead and created them, curating three distinct datasets based on different generative models: Midjourney, Stable Diffusion, and DALL-E 2.
To measure the performance of a prompt, we had to determine the relevance of the images it generates. For this, we considered factors like the aesthetic appeal, memorability, and compositionality of the images. Since we didn't have human judgments for this, we used pre-trained scoring models to estimate these qualities automatically. By aggregating the scores assigned to images generated from the same prompt, we constructed tuples of prompts and their corresponding relevance scores.
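Concretely, the aggregation step might look like the sketch below, where the column names, the example scores, and the simple mean aggregation are illustrative assumptions rather than the paper's exact pipeline:

```python
import pandas as pd

# One row per generated image, with scores produced by pre-trained
# scoring models (toy values for illustration).
images = pd.DataFrame({
    "prompt": ["a red fox in snow"] * 4 + ["abstract neon cityscape"] * 4,
    "aesthetic": [5.9, 6.1, 5.7, 6.0, 4.2, 4.5, 4.0, 4.3],
    "memorability": [0.71, 0.69, 0.73, 0.70, 0.62, 0.60, 0.64, 0.61],
})

# Aggregate all images generated from the same prompt into one label,
# yielding (prompt, performance score) tuples for training.
prompt_scores = (
    images.groupby("prompt")[["aesthetic", "memorability"]]
    .mean()
    .reset_index()
)
print(prompt_scores)
```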
To validate our approach, we compared various pre-trained textual feature extractors. Using a method called linear probe evaluation, we assessed their ability to predict prompt performance. Across extensive experiments, we observed a significant correlation between the predicted prompt performance scores and the ground-truth performance.
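As a rough illustration of the linear probe protocol, the sketch below fits a plain linear regressor on frozen prompt features and reports Pearson and Spearman correlations on held-out prompts. The random features and toy scores are only stand-ins for real embeddings and labels:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def linear_probe(features, scores, seed=0):
    """Fit a linear probe on frozen features; return correlations on held-out data."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, scores, test_size=0.2, random_state=seed
    )
    preds = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    return pearsonr(preds, y_te)[0], spearmanr(preds, y_te)[0]

# Toy stand-ins for real frozen text embeddings and per-prompt scores.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 512))
scores = features[:, 0] + 0.1 * rng.normal(size=200)

pearson, spearman = linear_probe(features, scores)
print(f"Pearson: {pearson:.3f}  Spearman: {spearman:.3f}")
```

Because the probe itself is kept deliberately simple, differences in correlation mostly reflect the quality of the underlying feature extractor rather than the probe.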
Let's pause for a moment to understand a crucial aspect of our research: the discrepancy in the shared representation space of CLIP (Contrastive Language-Image Pre-training). As a refresher, CLIP is designed to unify images and text within a shared embedding space, so that both modalities can be searched and compared interchangeably. However, studies have shown that despite this design goal, the representations of text and images learned by CLIP are not fully interchangeable. This discrepancy can lead to inconsistent predictions when applying the model to different tasks.
In simpler terms, imagine a class where an English speaker and a Spanish speaker are trying to understand each other. They both know a bit of the other's language, so they can understand and interact to a certain extent. But when they each explain complex concepts in their own language, the other person might misunderstand because they don't fully grasp its nuances. The same happens in CLIP: textual prompts and images are like the two different languages. They reside in separate 'subspaces', and this gap in understanding can lead to differences in results.

To examine this issue in detail, we ran two experiments. The first was a visual analysis, in which we applied Principal Component Analysis (PCA) to both prompt and image embeddings. It showed that the segregation between prompts and images mainly occurs along a single component, like a major language barrier in our English-Spanish class example.
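Here is a minimal sketch of that first experiment, with random vectors standing in for the real CLIP prompt and image embeddings (the offset between the two clouds mimics the modality gap):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-ins for CLIP prompt and image embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(loc=0.5, scale=1.0, size=(500, 512))
image_emb = rng.normal(loc=-0.5, scale=1.0, size=(500, 512))

# Project prompts and images into the same 2D PCA space.
projected = PCA(n_components=2).fit_transform(np.vstack([text_emb, image_emb]))
text_proj, image_proj = projected[:500], projected[500:]

# With real CLIP features, the two modalities separate almost entirely
# along the first principal component.
print("mean of 1st component (prompts):", text_proj[:, 0].mean())
print("mean of 1st component (images): ", image_proj[:, 0].mean())
```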
In our second experiment, we aimed to predict aesthetic scores for each dataset. But instead of using the image embeddings that had served to extract the ground truth, we used the prompt embeddings derived from the corresponding CLIP text encoder. The results showed a significant decrease in scores, confirming the presence of the "modality gap".
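A rough sketch of that second experiment, again with toy data standing in for real CLIP features: fit an aesthetic scorer on image embeddings (as when extracting the ground truth), then feed it the paired prompt embeddings and observe the drop in correlation.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge

# Toy stand-ins: "prompt" embeddings are a shifted, noisier view of the
# paired image embeddings, mimicking the modality gap.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(400, 512))
text_emb = image_emb + 1.0 + rng.normal(scale=2.0, size=(400, 512))
aesthetic = image_emb[:, 0]  # toy ground-truth aesthetic scores

# Scorer fitted on image embeddings, as when extracting the ground truth.
scorer = Ridge(alpha=1.0).fit(image_emb[:300], aesthetic[:300])

# Score the held-out pairs once per modality.
rho_image = spearmanr(scorer.predict(image_emb[300:]), aesthetic[300:])[0]
rho_text = spearmanr(scorer.predict(text_emb[300:]), aesthetic[300:])[0]
print(f"from image embeddings:  rho={rho_image:.2f}")
print(f"from prompt embeddings: rho={rho_text:.2f}")
```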
In the worlds of information retrieval and image generation, we've identified a novel task: Prompt Performance Prediction. Our work is just the beginning; as more research is carried out in this area, we hope to see further advances in generative information retrieval. Ultimately, this work lays the foundation for a proactive approach to retrieval in generative systems: by predicting the performance of prompts before any images are generated, we can guide the generation process towards more relevant, higher-quality outputs, improving the efficiency and effectiveness of the system as a whole.
To dig deeper into the subject, take a look at our technical report: Prompt Performance Prediction for Generative IR.
```bibtex
@article{bizzozzero2023prompt,
  title={Prompt Performance Prediction for Generative IR},
  author={Bizzozzero, Nicolas and Bendidi, Ihab and Risser-Maroix, Olivier},
  journal={arXiv preprint arXiv:2306.08915},
  year={2023}
}
```