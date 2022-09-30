Everyone is obsessed with text-to-image artificial intelligence now, those seemingly magical prompts that turn your thoughts into images with one sentence and a mouse click. Surprisingly, however, you can harness their power in a completely unexpected way: image compression so good that it is almost indistinguishable from the original… but 155 times smaller. None of the current standards—like JPEG or WebP—can get anywhere near it.

Let me show you how impossibly good it is using this example of a llama named Anna:

Stable Diffusion’s compressed image (bottom left, 4.97 kilobytes) is almost indistinguishable from the original image (bottom right). Top left is JPEG (5.66 KB) and top right is WebP (6.74 KB). [Matthias Bühlmann]

The two images on top, JPEG and WebP, clearly show plenty of compression artifacts, with both using extreme size-reduction settings. Details are lost all over: the hair turns into pixelated noise, the color saturation is gone, the eyes look dead, and whole sections of the images just look like big blobs of flat color.

Now compare them to the two images below: the version compressed with Stability AI’s Stable Diffusion and the uncompressed original. Here, it’s very hard to see any loss of image quality. The hair is just fine, the color is perfect, and even the fine details and photo grain are all preserved. And still, the file is smaller than the output of either industry-standard compressor.

According to Matthias Bühlmann, the Swiss software engineer who developed this compression method, this image of Anna has never been published on the web, which rules out any bias from Stable Diffusion having seen it during training. If the image had been part of the training data, he says, the compressed file might have been even smaller.
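To put the file sizes quoted above in perspective, here is a minimal back-of-the-envelope sketch. The article does not state the image dimensions, so the 512 x 512 pixel, 3-bytes-per-pixel figure below is an assumption; it is used only because it is consistent with the "155 times smaller" claim.

```python
# Assumption (not stated in the article): the test image is 512 x 512 pixels,
# stored uncompressed with 3 color channels at 1 byte per channel.
WIDTH, HEIGHT, CHANNELS = 512, 512, 3
uncompressed_kb = WIDTH * HEIGHT * CHANNELS / 1024

# Compressed file sizes in kilobytes, as quoted in the comparison above.
sizes_kb = {"Stable Diffusion": 4.97, "JPEG": 5.66, "WebP": 6.74}

# Ratio of raw pixel data to compressed file size for each method.
ratios = {name: uncompressed_kb / kb for name, kb in sizes_kb.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}x smaller than the raw pixels")
```

Under that assumption, all three files are more than 100 times smaller than the raw pixels. The difference is in how much of the picture survives at that size.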

“Seeing things in clouds”

Why does it work so well? Conventional codecs encode an image with mathematical operations that discard detail until what remains is a smaller file that kind of looks like the original. This new compression method instead reconstructs the image from a compact, fuzzy, noise-like representation of it (not from a text prompt). In AI lingo, that is a “latent space representation.” It is a crude snapshot of the image, which in Anna’s case comes out like this:

Latent space representation of a photo, as represented by Stable Diffusion [Matthias Bühlmann]

In a Medium post describing his development, Bühlmann says all text-to-image synthesis AIs start with random Gaussian noise. Then they iterate on that noise, literally denoising it (that is, reducing the noise) more and more with every step until it looks just like your text description. “[The AI predicts] what it thinks it ‘sees’ in that noise,” Bühlmann describes, “similarly to how we sometimes see shapes and faces when looking at clouds.”

In very basic terms, Bühlmann’s Stable Diffusion-based compressor uses a neural network called a variational autoencoder (VAE) to create a noisy representation of the original image, which then guides Stable Diffusion toward recreating that image. Bühlmann explains the process with a superb analogy: “Say we have a highly skilled artist who possesses a photographic memory. We show them an image and then have them recreate it, and they can create an almost perfect copy just from their memory. The photographic memory of this artist is Stable Diffusion’s [latent space representation].”
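The idea described above (store a small representation, then refine noise step by step until it matches that representation) can be illustrated with a deliberately crude toy. This is not Bühlmann’s code and involves no neural network: the `encode` and `decode` functions are hypothetical stand-ins, the "image" is a 64-sample ramp, block averaging stands in for the VAE, and a simple nudge toward the stored values stands in for the learned denoiser.

```python
import random

def encode(signal, block=8):
    """Stand-in encoder: average each block, then quantize the mean to one byte."""
    latent = []
    for i in range(0, len(signal), block):
        chunk = signal[i:i + block]
        mean = sum(chunk) / len(chunk)
        latent.append(max(0, min(255, round(mean * 255))))
    return bytes(latent)

def decode(latent, length, block=8, steps=20, rate=0.3, seed=0):
    """Stand-in decoder: start from Gaussian noise and refine it step by step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.5, 0.5) for _ in range(length)]
    # Spread each stored byte back over its block to form the guide signal.
    guide = [latent[i // block] / 255 for i in range(length)]
    for _ in range(steps):
        # A real diffusion model predicts the noise with a neural network;
        # in this toy the "predicted noise" is just the gap to the guide.
        x = [xi - rate * (xi - gi) for xi, gi in zip(x, guide)]
    return x

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

signal = [i / 63 for i in range(64)]                 # smooth ramp from 0.0 to 1.0
latent = encode(signal)                              # 8 bytes instead of 64 samples
noisy_start = decode(latent, len(signal), steps=0)   # pure noise, no refinement
restored = decode(latent, len(signal), steps=20)     # noise refined toward the guide

print(len(signal), "samples ->", len(latent), "latent bytes")
print("error before refinement:", round(mean_abs_error(signal, noisy_start), 3))
print("error after refinement: ", round(mean_abs_error(signal, restored), 3))
```

The toy recovers only a blocky approximation, because its "decoder" knows nothing about what ramps usually look like. The real system differs precisely there: the decoder has learned what plausible images look like and fills in the missing detail from that knowledge.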

There’s just one problem

It seems like a dream come true: outstanding image quality, indistinguishable from the real thing, at a fraction of the size that industry compression standards can manage. You can easily imagine billions of downloaded terabytes saved each year using a method like this. However, as usually happens with any nascent technology, there’s always a big hairy but around the corner. If you look at the Stable Diffusion Anna image up close, you can actually start to see some changes.

Stable Diffusion’s compression on the left and center shows weird features added by the AI. [Matthias Bühlmann]

Observe the heart in the first two images, for example, both compressed with this technology at different rates, and compare it with the original image on the far right. You can see a glossy effect on the first and second hearts that is not there in the original. The problem is that Stable Diffusion is introducing details that don’t exist in the source image. In other words, the compressed image contains invented, or imagined, components. Another good example is this image of San Francisco:

This photo of San Francisco’s skyline (center) has different details in Stable Diffusion’s compressed image (right), which otherwise retains all the fine detail of the original. On the left, the JPEG version has obvious compression problems, especially visible in the sky’s gradient. [Matthias Bühlmann]

The JPEG on the left side is plagued with extreme compression artifacts. On the right side, however, the Stable Diffusion version appears to be rendered exactly like the original image in the center. Even the camera grain is preserved! But if you look into that green circle, you will see that, while the Stable Diffusion skyline may look just the same to your brain, in reality the AI compressor is making up some of the buildings. Your eyes are easily deceived because the result looks real. But obviously, some of it is not.

Bühlmann says that Stable Diffusion also has problems reconstructing faces and text, as seen in this example:

Stable Diffusion’s compression (center and right) has problems “remembering” both text and human faces. [Matthias Bühlmann]

Following his analogy of Stable Diffusion as an artist with a photographic memory, he points out that it is “not very good at remembering faces and also suffers from dyslexia.” Bühlmann tells me via chat that he believes these problems will be fixed in future versions of the technology: “I think it could absolutely be made more accurate, there is still a huge potential for further optimization, both in terms of compression factor and reducing the amount of semantic compression artifacts.”

He also thinks that the poor preservation of faces and text will resolve itself as the system gets more training. This is, after all, a crude first approach meant to demonstrate that the technology works.

Huge potential ahead

It’s a very promising and exciting first step with a lot of room for improvement, as Bühlmann says. He believes that machine learning technology like Stable Diffusion could have a big impact on the industry. “I could see a performance-optimized version of this Stable Diffusion based codec being used in some very specific image archival cases where petabytes of images must be stored efficiently,” he says. He also points to other applications, like video calls or remote control of large drones and vehicles. There, “bandwidth optimization and visual quality has a higher priority than the semantic fidelity of small details.” Bühlmann says he is absolutely certain that “ML-based methods will become popular compression methods very soon and some of them will find widespread adoption at some point.”