To my eye, the birds look entirely real. One features a fluffy yellow belly of feathers. Another, a forest green tail, with a long beak that looks perfect for hooking bugs out of tree bark. But these photos I’m looking at are complete fictions. They’re images that come straight from the imagination of Microsoft’s latest AI, dubbed AttnGAN. They were created by typing a sentence into the system, like “this bird is red and white with a very short beak.” AttnGAN then generated these highly realistic, 256-by-256-pixel photos of fictional birds from nothing.
“Four years ago, no one even believed such a thing could be done,” says Xiaodong He, the lead researcher on the project.
Indeed, for the past five years, He has been researching the relationship between images and words, training AIs to do all sorts of compelling tasks. First, he created an AI called CaptionBot that could use words to describe a photo–a bit of research that is now an accessibility feature helping vision-impaired users of Microsoft products. Then, he pushed that research further, creating an AI that could answer specific questions you might ask about photos.
Now, with AttnGAN, he’s “closed the loop.” In other words, Microsoft’s AIs can create images from mere words that another AI can then caption.
The name “AttnGAN” comes from how it was built–which is easy enough to understand, on a general level. Microsoft researchers pitted two AIs against one another (this is the “GAN” part, for Generative Adversarial Network). Both were trained on language and vast image sets, but one attempted to create images while the other critiqued them. This critique happened at three stages as the image was produced, from a very blurry initial sketch up to the full-fidelity final image. The ongoing competition improved AttnGAN enough to produce the images you see today.
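That three-stage, coarse-to-fine loop can be sketched in a few lines of code. This is purely illustrative: the `generate` and `criticise` functions below are stand-ins for the real neural networks, and the stage resolutions simply follow the article’s description (a blurry sketch refined up to 256 by 256 pixels).

```python
import random

# Toy sketch of AttnGAN's coarse-to-fine generator/critic loop.
# The "networks" here are placeholder functions, not real models.

STAGES = [64, 128, 256]  # output resolution at each refinement stage

def generate(noise, caption, size):
    """Stand-in generator: returns a fake 'image' as a flat pixel list."""
    random.seed(hash((noise, caption, size)) % (2**32))
    return [random.random() for _ in range(size * size)]

def criticise(image, caption):
    """Stand-in critic: a score in [0, 1] for how 'real' the image looks."""
    return sum(image) / len(image)  # placeholder realism score

def train_step(caption):
    noise = random.random()
    scores = []
    for size in STAGES:                          # coarse-to-fine stages
        fake = generate(noise, caption, size)
        scores.append(criticise(fake, caption))  # critique at every stage
    return scores  # in real training, these scores drive both networks' losses

scores = train_step("this bird is red and white with a very short beak")
```

In the real system, each stage’s critique pushes the generator to sharpen its output before the next, larger stage–the “ongoing competition” the researchers describe.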
These photos are often realistic, though relatively low resolution–and on top of the realism, they’re also highly specific in their customized detail. This is the “attention” part of “AttnGAN”: the AI fine-tunes very small regions of each image to match the verbal specifications. That means a bird, for instance, can have extremely specific features, like a blue beak, a yellow beak, a long beak, or a short beak. From resolution, to improvisation, to the inclusion of fine details, it’s all a lot more complex than Google’s generalized sketching AI. Even Adobe’s eerie image creation tools all start with actual photos, not a blank canvas.
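The attention idea itself is simple enough to sketch: for each small region of the image, weight the words of the sentence by how relevant they are to that region, then blend them into a per-region “context” that guides the details drawn there. The tiny hand-made vectors below are hypothetical; in the real model, word features come from a text encoder and region features from the generator.

```python
import math

# Minimal sketch of word-to-region attention (the "Attn" in AttnGAN).

def softmax(xs):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region, words):
    """Weight each word by its similarity to one image region, then
    return the attention-blended word context for that region."""
    sims = [sum(r * w for r, w in zip(region, word)) for word in words]
    weights = softmax(sims)
    dim = len(words[0])
    return [sum(weights[i] * words[i][d] for i in range(len(words)))
            for d in range(dim)]

# Hypothetical features for two words ("blue", "beak") and one
# beak-like image region: the region attends mostly to "beak".
words = [[1.0, 0.0], [0.2, 1.0]]
region = [0.1, 0.9]
context = attend(region, words)
```

Because every region gets its own blend of the sentence, a request for “a very short beak” can sharpen just the beak pixels without disturbing the rest of the bird.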
AttnGAN is something of a birdwatcher’s dream, able to generate countless bespoke birds in believable fashion. But bird photos are relatively predictable: Most show a bird perched on a branch in a tree–context that AttnGAN can easily improvise when you ask it to draw a bird. But ask AttnGAN to take these objects out of their context, and mix them with other objects, and things go wonky. “If there are complicated attributes or relationships of objects in the system, then the machine gets confused and draws something not as good as we hope,” says He.
A great example of AttnGAN’s limitations was discovered when researchers asked it to draw a surreal “red double decker bus is floating on top of a lake.” The resulting photo looks more like a blurry red and white boat. The context seems to have influenced the subject, muddling the two as one. Buses don’t drive on water! So AttnGAN drew a boat.
In another case, researchers asked for “an image of a girl eating a large slice of pizza.” The shape of the girl is actually excellent. But just about everything else is off in this invented portrait. It looks borderline cubist in its strange rendering.
“The machine still needs to learn a lot of common sense to draw a good picture of complicated objects,” He concludes. Indeed, in both cases of failure, AttnGAN seems to understand what’s being requested, but it lacks the fundamental world-to-object relationships to draw it convincingly. That logic is necessary to ground AttnGAN’s imagination. Even so, He isn’t deterred. He insists that in just a couple of years these AI models will vastly improve, and that faster computers loaded with more memory will let researchers make the final images larger and more detailed, too. Given his last half decade of progress, it’s hard to disagree.
Eventually, He believes that AttnGAN-style technology will change creative tools altogether. He imagines Bing image search inventing photos as needed–say, if you asked for a stop sign flying through the sky, and such a thing didn’t exist on iStockPhoto. But fast-forward a bit, and He sees the system generating images for artists, or room layouts for designers, that only need a little tweaking to be convincing. He even believes that one day in the foreseeable future such AIs will be able to translate scripts into turnkey animated films.
For now, however, the research is meant to blur the line between human thinking and machine thinking. “It’s so interesting. It’s a fundamental AI problem, ‘What is intelligence? What separates us from animals?'” He muses. “We know how to express ourselves, and we know how to read an image. [Duplicating] those kinds of things, to me, is a way toward [recreating] the general intelligence of a human.”