“Form is henceforth divorced from matter,” Oliver Wendell Holmes wrote of photography in 1859. I was reminded of Holmes’s sweeping pronouncement in talking with Behrooz Rezvani, the CEO of the curious, and more than a tad eerie, Seyyer. In the future, explains Rezvani, we might have a video presence that is divorced from our physical one. Read on to learn more about a possible panacea for overbearing parents, the ghost of Ronald Reagan, and Angelina’s possible future stint as a Siri replacement.
FAST COMPANY: What’s the idea behind Seyyer?
BEHROOZ REZVANI: The genesis of the whole thing is the idea to convert text to video. In order to convert text to video for a particular person, and have that person be animated saying certain things, you have to learn a lot about the way that person talks, and their facial expressions.
So you mean, I would write an email or text, and it would show up for my recipient as my face talking.
That’s the ultimate goal. If the recipient has a model of you on their phone, then when they receive the text, they can actually see you talking on it.
Why would you want that?
In many countries around the world, there’s a kind of bandwidth starvation. What if you want to share a magical moment with grandma, whose grandkids are texting from far away? Also, texting is now more popular than calling someone. A change really took place around 2008 or 2009.
I feel guilty all the time because my grandma doesn’t have email or receive texts, so we can only communicate by phone, but I never call anyone anymore.
To me that happened because my son would not answer my calls, but he would respond to my texts. That was my “Gee” moment. I thought, what does it take to actually hear his voice or see his face talking to me?
So the idea for your company comes from your son refusing to take your calls? This is stereotypically a Jewish mother problem.
There was an epiphany around both topics: being forced to use SMS by my son, and also the lack of bandwidth in the developing world.
The idea being: If we can’t have bandwidth for FaceTime, at least we’ll simulate it.
Right. Once the idea started taking shape, I realized that so many other things would become possible: texting, e-books, Twitter. Imagine people having their tweets come alive–so Anderson Cooper would be tweeting from some place, and all of a sudden his picture pops up on Twitter.
Of course, you can’t record an infinite number of videos with Anderson Cooper to transform anything he might write into video. So how does your tech work?
For full control, obviously you cannot record all of it–it becomes obscene. So we develop models. We record enough visual expressions of a person that we have a range of emotions and expressions available to use. Then we do the same thing for audio: The person talks for a certain amount of time, we record that, we build a model. For the vocabulary and expressions that are not there, we can interpolate them from past history.
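The approach Rezvani describes–record a limited set of expressions, then interpolate the ones that were never captured–resembles blendshape-style interpolation used in face animation. A minimal sketch of that idea, assuming each face pose is represented as a vector of landmark coordinates (all names and numbers here are hypothetical, not Seyyer’s actual model):

```python
import numpy as np

# Hypothetical recorded expressions: each pose is a flat vector of
# 2-D landmark coordinates (just 3 landmarks here for brevity).
neutral = np.array([0.0, 0.0, 1.0, 0.0, 0.5, 1.0])
smile = np.array([0.1, 0.1, 0.9, 0.1, 0.5, 1.2])
open_mouth = np.array([0.0, -0.2, 1.0, -0.2, 0.5, 1.5])

# Store each recorded expression as a delta from the neutral pose,
# as blendshape systems typically do.
deltas = np.stack([smile - neutral, open_mouth - neutral])

def synthesize(weights):
    """Blend recorded expressions to approximate an unseen pose."""
    return neutral + np.asarray(weights) @ deltas

# A half-smile with a slightly open mouth, never recorded directly,
# interpolated from the poses that were recorded:
pose = synthesize([0.5, 0.3])
```

With enough recorded poses, intermediate expressions fall inside the span of the deltas, which is one way a finite recording session can cover vocabulary and expressions "that are not there."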
At this point let’s have readers take a look at this video you recently released demonstrating your technology:
When I first saw this, I thought you just took a 22-second clip of Reagan and fiddled with his mouth. Did you do more than alter his mouth here?
Absolutely. We changed the whole face. Variations of the mouth are all connected to the rest of the face. There’s a lot more than a 22-second set of data. We looked at probably about 20 minutes of total video to extract a model from.
In behind-the-scenes features on Pixar movies, you see “wireframe” builds of the animation. Is there a wireframe here?
There is a wireframe–but not the kind you’re familiar with from cartoons. There are a huge number of possible ways the wireframe can move–for example, when the mouth opens, depending on the expression, the cheeks could be in a different place.
Essentially, this image of Reagan is a sort of model or puppet that you can manipulate in all sorts of ways.
Right. We’re not modifying it–we are generating the face from scratch.
Is this completely new technology?
To our knowledge–and we’ve talked to a number of experts–this is the first time. It’s actually quite complex and difficult.
Why’d you choose to demo on Reagan? Isn’t it partisan, and also kind of creepy?
I’ll take all the credit and the blame.
You should have reached out to the RNC and told them Reagan could have given the keynote.
We didn’t want to particularly take sides on either party.
How do you commercialize the technology in the near future?
One of the most interesting things for us is the advertising space. To a lot of advertising experts, personalization and video are important. If you have a brand–take for example the T-Mobile girl. It takes a lot of time and money to get these actors in front of the camera to shoot them. Now if you want to update the message about the brand, it may be expensive and impractical to get these guys in front of the camera again. So the question was, can we do this one-time shoot and generate any message dynamically?
The T-Mobile girl’s agent is screaming out in pain right now.
No, I think the T-Mobile girl would be extremely happy. She could continue to monetize her image, and her agent would get a cut of that too.
But licensing your likeness for infinite permutations–that’s kind of scary.
All these things are negotiated. I think they would not allow–I’m just making it up–using her image more than once a week, or not more than two times a quarter.
So do you have advertisers or brands interested?
In the past week I had some really exciting conversations. I don’t know where they’re gonna end up, but there were some really big names. In the next couple of months, we could come up with our exciting first application.
Where do you imagine this tech in five or 10 years?
I strongly believe that text-to-video will be a mainstay, but whether it will blossom into full bloom in five or 10 years, I don’t know. But applications include video texting, books read by their author, or by your favorite actor. Angelina Jolie could be the one setting up your schedule, instead of Siri.
That’s trading up. But wouldn’t it dilute Angelina’s brand?
I don’t know. Let’s say someone offers hundreds of millions of dollars…
Returning to the idea of having a text-to-video chat with a family member–won’t there be the issue of the Uncanny Valley?
I think the technology would get to the point where we wouldn’t be able to tell the difference.
But then if someone can hijack my mom’s likeness and impersonate her, isn’t that problematic?
With every new technology and paradigm, we have to deal with philosophical questions of how to prevent abuse. There are several ways we think we could do this: with a watermark on the audio or video, for instance. There are ways that you can assure that people know the difference, which one is real, and which one is not real.