Qualitatively analysing language model / image generation improvements since ~2000

While we can plot graphs showing quantitative changes in language model and image generation performance over time (e.g. perplexity for language models), what do these numbers actually mean in terms of model capabilities? A collection of samples from language models across the last two decades could give a visceral sense of how much they have improved. The comparison could include, for each model, the best output out of 10 prompts, side-by-side completions of the same prompt, etc.
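For reference, here is a minimal sketch of the quantitative side, assuming we already have per-token natural-log probabilities from a model. The function name `perplexity` and the probability values are made up for illustration and are not measurements from any real model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean per-token natural-log probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical values: the same 4-token text scored by two models.
# The "older" model assigns each token probability 0.05, the "newer" one 0.25.
older_model = [math.log(0.05)] * 4
newer_model = [math.log(0.25)] * 4

print(perplexity(older_model))  # higher perplexity: the model is more "surprised"
print(perplexity(newer_model))  # lower perplexity: the model predicts the text better
```

A drop like this tells us the newer model predicts the text better, but not what its samples actually read like, which is the gap a collection of samples would fill.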

