SHIFT HAPPENS was a project created by Sarah Cole and Annis Joslin in partnership with Brighton Women’s Centre: a safe experimental space for twelve people making and sharing work together over a year. The open mic night on 7 September 2023 was part of the Community Takeover exhibition at Phoenix Art Space. Sarah asked if I could do something live with AI. I had a Stable Diffusion install I was already messing with, so yes.
The pipeline
Three stages on a laptop in the room, projector pointed at the wall behind the performers:
- Whisper ran on the microphone feed, transcribing in rolling chunks (≈10 seconds of audio at a time, with a 2-second overlap to catch words that crossed the boundary).
- GPT-3.5 (this was 2023; modern image models bake the language understanding in now) took the rolling transcript and produced a single Stable Diffusion prompt for the most recent chunk. The job there was less “describe what they said” and more “take what they said as a starting point and find the image”. Without that step the prompts read like search queries.
- Stable Diffusion rendered an image at 768×768. Roughly 6-8 seconds on the GPU I had at the time.
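The rolling-chunk step in the first stage can be sketched as below. This is a minimal illustration rather than the code that ran on the night: the 10-second chunk and 2-second overlap come from the description above, while the function name `chunk_windows` and the idea of working in plain start/end times (instead of a real audio buffer) are assumptions made for the example.

```python
from typing import Iterator, Tuple

CHUNK_SECONDS = 10.0    # from the post: roughly 10 s of audio per chunk
OVERLAP_SECONDS = 2.0   # from the post: 2 s overlap to catch boundary words


def chunk_windows(total_seconds: float,
                  chunk: float = CHUNK_SECONDS,
                  overlap: float = OVERLAP_SECONDS) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) times in seconds for overlapping audio chunks.

    Hypothetical helper: a live system would slice a microphone buffer
    at these boundaries and hand each slice to the transcriber.
    """
    if overlap >= chunk:
        raise ValueError("overlap must be shorter than the chunk")
    if total_seconds <= 0:
        return
    step = chunk - overlap  # each new chunk starts this long after the previous one
    start = 0.0
    while True:
        end = min(start + chunk, total_seconds)
        yield (start, end)
        if end >= total_seconds:
            break
        start += step


# Thirty seconds of speech becomes four chunks, each sharing 2 s with its neighbour.
windows = list(chunk_windows(30.0))
```

Because each window repeats the last two seconds of the one before it, a word that straddles a boundary appears whole in at least one chunk, at the cost of some duplicated text in the rolling transcript.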
Total latency end-to-end: somewhere around 20 seconds from spoken word to image on the wall. Too long for true real-time, fast enough to feel like the image was responding to the performer rather than illustrating a script.
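As a rough piece of arithmetic, the figures above imply how little of the 20 seconds was left for transcription and prompt writing. The chunk length, render time and total are from the post; the split is an inference, not a measurement, and it assumes the worst case of a word spoken just as a chunk opens, so it waits the full chunk before anything else can start.

```python
CHUNK_SECONDS = 10.0          # from the post: audio is processed in ~10 s chunks
RENDER_SECONDS = (6.0, 8.0)   # from the post: roughly 6-8 s per image
TOTAL_SECONDS = 20.0          # from the post: around 20 s, spoken word to wall


def remaining_budget(render_seconds: float,
                     total: float = TOTAL_SECONDS,
                     chunk: float = CHUNK_SECONDS) -> float:
    """Seconds left for transcription plus prompt writing (inferred, not measured).

    Assumes the word waits a whole chunk before processing begins.
    """
    return total - chunk - render_seconds


fast_render, slow_render = RENDER_SECONDS
budget_range = (remaining_budget(slow_render), remaining_budget(fast_render))
print(f"Transcription + prompt: about {budget_range[0]:.0f}-{budget_range[1]:.0f} s")
```

On those assumptions the chunk wait and the render account for most of the delay, which is why shorter chunks and a faster image model are the obvious levers.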
What was hard
Three things I’d do differently with what’s available now.
The 20-second latency was the dominant ugliness. By the time a performer’s first line had an image up, they were already on the third line. Modern multimodal models with built-in language understanding, paired with faster small image models (SDXL Turbo, or distilled Nitro-style models), could plausibly run word-by-word rather than chunk-by-chunk. The result would be lower per-image quality, but much more responsive and more abstract. That feels like a more honest fit for live performance.
Image quality vs latency was a real trade-off. Drop to lower-resolution images and you can render faster; you get more visual stream, less individual punch. Push for higher quality and you get punch but the audience watches a static frame for too long. I picked the middle and accepted that the middle was a compromise.
The third was content sensitivity. I’ll come back to that.
What worked in the room
Performers responded to it in two distinct ways. Some treated it as a collaborator and started shaping their delivery for the picture (longer pauses, more specific imagery, a half-step toward the visual). Others ignored it completely and let the system do whatever the system did. Both worked. The interesting bit was the audience: people enjoyed anticipating what image would come up, and then trying to interpret the more abstract ones the model produced when it didn’t quite know what to do with the words. That ambiguity was where the piece earned its keep.
The moment I was glad we’d planned for moderation
One of the poets read a piece that moved through menstruation and into deliberately uncomfortable territory around abuse. The audience was right with her; the room was holding. My internal panic at the desk was about what image was about to appear on the wall.
The model held. The safety filters caught what needed catching and returned a sensitive, abstract image rather than something that would have torn the room. That moderation step (which I’d left on, because I’d thought about exactly this scenario) was the thing standing between a careful performance and a horrible misfire. If you’re building anything that puts an AI image generator in front of a live audience, leave the moderation on. The fact that artists had not yet turned on AI as broadly as they have since 2024 also meant the room was open to the experiment in a way it might not be today.
What I took from it
Live AI art is mostly bad. The bar to clear is whether the system genuinely listens, or whether it’s a pre-rendered visual loop pretending to listen. The pipeline above genuinely listens, with a 20-second tax. That tax shapes everything about the piece, including whether you should use it for this kind of work at all.
If I returned to this pattern, I would experiment with word-by-word generation (lower fidelity, much higher responsiveness) and with a smaller, faster model. A version built with today’s models would feel meaningfully different on the same kit. The piece worked because the system was honest about its tempo and the performers were generous about adapting to it.