Generative AI: What is it, actually?

(Title image based on pictures created with Stable Diffusion Web.)
What does Generative AI do?
Generative Artificial Intelligence (GenAI) uses a machine-learning model that can generate content. Popular modalities of GenAI content are:
- text: summaries, answers, novels, poems, computer code
- images: paintings, graphs, sequences of images (called videos), 3D models
- audio: speech, music
- molecules: proteins, drugs
Okay, but how does it work?
Until the early 2020s, almost every practical AI application used a so-called discriminative model: one trained to predict a label (e.g., cat) given an input (e.g., a specific image). More precisely, it was trained to estimate the probability $\mathbf{p}(y|x)$ that an input $x$ belongs to category $y$. Discriminative models dominated because, for any given discriminative problem, the corresponding generative problem is typically much harder to solve (and because they were sufficient for most applications).
Generative AI models the opposite: $\mathbf{p}(x|y)$, for example the probability of observing certain pixel values in an image given that it shows a cat. So while discriminative models assign a particular label to each input, generative models try to learn what the input itself looks like in general (its so-called distribution). This requires a (typically very large) number of observations - in our example, images of cats - and assumes that these observations were created according to some unknown distribution $\mathbf{p}_{\mathrm{data}}$. The goal is to build a generative model $\mathbf{p}_{\mathrm{model}}$ that mimics $\mathbf{p}_{\mathrm{data}}$.
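To make the distinction concrete, here is a minimal sketch (plain NumPy; the data and all parameters are made up, with a 1-D feature standing in for an image): it fits a tiny generative model $\mathbf{p}(x|y)$ per class and then recovers the discriminative $\mathbf{p}(y|x)$ from it via Bayes' rule.

```python
import numpy as np

# Hypothetical toy data: a 1-D feature x with a binary label y
# (think "cat" vs. "not cat", with x standing in for an image).
rng = np.random.default_rng(0)
x0 = rng.normal(loc=-2.0, scale=1.0, size=500)  # inputs with label y=0
x1 = rng.normal(loc=+2.0, scale=1.0, size=500)  # inputs with label y=1

# Generative view: model p(x|y) per class, here as a Gaussian whose
# parameters are estimated from the observations of that class.
mu0, sigma0 = x0.mean(), x0.std()
mu1, sigma1 = x1.mean(), x1.std()

def gaussian_pdf(x, mu, sigma):
    """Gaussian density - our stand-in for p_model(x|y)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# p(x|y=1): how plausible is the input x=1.5, given that the label is 1?
print(gaussian_pdf(1.5, mu1, sigma1))

# Discriminative answer p(y=1|x), recovered via Bayes' rule:
# p(y|x) = p(x|y) p(y) / sum over y' of p(x|y') p(y')
def p_y1_given_x(x, prior1=0.5):
    l0 = gaussian_pdf(x, mu0, sigma0) * (1.0 - prior1)
    l1 = gaussian_pdf(x, mu1, sigma1) * prior1
    return l1 / (l0 + l1)

print(p_y1_given_x(1.5))  # close to 1: x=1.5 is far more plausible under y=1
```

Note how the generative model has to describe the entire input distribution, while a discriminative model only ever needs a decision boundary - one hint at why the generative problem is the harder one.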
How to build a good generative model and generate content
Over the past decades, different algorithms for arriving at a generative model $\mathbf{p}_{\mathrm{model}}$ from training data (the observations) have been devised. I will leave the details to a later post; currently, autoregressive models using transformers are state-of-the-art for text, and diffusion-based models for images and speech (following the dominance of Generative Adversarial Networks, GANs, until around 2020). But how do you know if your model is any good?
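To give a flavor of the autoregressive idea (a deliberately tiny, made-up sketch: the vocabulary and probability table are invented, and a real text model replaces the lookup table with a learned transformer over a huge vocabulary):

```python
import numpy as np

# Autoregressive generation: p_model(x) is factored as
# p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... and text is sampled one
# token at a time. This toy model only conditions on the previous token.
vocab = ["the", "cat", "sat", "mat", "."]
# next_probs[i][j] = p(next token is vocab[j] | previous token is vocab[i])
next_probs = np.array([
    [0.0, 0.6, 0.0, 0.4, 0.0],  # after "the": mostly "cat", sometimes "mat"
    [0.0, 0.0, 1.0, 0.0, 0.0],  # after "cat": always "sat"
    [0.0, 0.0, 0.0, 0.0, 1.0],  # after "sat": end the sentence
    [0.0, 0.0, 0.0, 0.0, 1.0],  # after "mat": end the sentence
    [0.2, 0.2, 0.2, 0.2, 0.2],  # after ".": unused here
])

rng = np.random.default_rng(1)
tokens = ["the"]
while tokens[-1] != ".":
    p = next_probs[vocab.index(tokens[-1])]
    tokens.append(str(rng.choice(vocab, p=p)))  # sample the next token
print(" ".join(tokens))  # e.g., "the cat sat ."
```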
The short answer: if it allows you to generate good content. This is done using so-called sampling: the values $x$ are your content - remember, they encode the input, e.g., an image. Once we have a model $\mathbf{p}_{\mathrm{model}}(x)$, we can draw random samples of $x$: that is, pick values of $x$ with high $\mathbf{p}_{\mathrm{model}}(x)$ more often than values with low $\mathbf{p}_{\mathrm{model}}(x)$. For this to work well, $\mathbf{p}_{\mathrm{model}}(x)$ should be a close approximation of $\mathbf{p}_{\mathrm{data}}(x)$, and it should be easy to sample from - a process that can otherwise be intricate and slow.
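Here is a minimal illustration of that idea (the distribution is made up: a discretized bell curve over pixel-like values 0..255, sampled with the classic inverse-CDF method):

```python
import numpy as np

# A made-up p_model over pixel-like values 0..255: a discretized bell curve.
values = np.arange(256)
logits = -0.5 * ((values - 128) / 40.0) ** 2     # unnormalized log-probabilities
p_model = np.exp(logits) / np.exp(logits).sum()  # normalize to sum to 1

# Inverse-CDF sampling: map uniform noise through the cumulative distribution.
rng = np.random.default_rng(7)
cdf = np.cumsum(p_model)
cdf[-1] = 1.0                              # guard against floating-point round-off
u = rng.random(100_000)                    # uniform samples in [0, 1)
samples = values[np.searchsorted(cdf, u)]  # x ~ p_model

# High-probability values appear correspondingly more often: roughly 68% of
# samples land within one standard deviation (40) of the mode (128).
print((np.abs(samples - 128) < 40).mean())
```

For such a simple one-dimensional $\mathbf{p}_{\mathrm{model}}$ this is trivial; for images, $x$ has millions of dimensions and no explicit CDF is available, which is exactly why sampling procedures (such as the iterative denoising in diffusion models) can be intricate and slow.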
The rise of Generative AI
Why has Generative AI become so successful now? There are two parts to the answer:
- Why now: Only now have a) the necessary hardware resources (e.g., powerful GPUs with large memory) become available, and b) the hard problems of Generative AI been solved in many important cases, together with general improvements in Machine Learning such as better optimization algorithms and regularization techniques.
- Why at all: Generative AI is simply a more sophisticated and powerful form of AI, allowing a more complete understanding of data. For most researchers it is evident that generative modeling must be part of any solution for intelligence comparable to human intelligence.
In fact, neuroscientific theory suggests that our perception of reality is not a complex discriminative model that labels what we experience based on sensory input. Rather, it is a generative model that simulates our environment and tries to accurately predict its future state. Some theories even go as far as claiming that what we perceive as reality is actually the output of our internal generative model.