The system is dubbed I2T, for “image to text,” and it’s a collection of clever computer vision algorithms that analyze the live video stream from an ordinary surveillance camera, the kind you might see watching a store or a busy city intersection.
The core of I2T is a vast database of images and objects that the algorithms consult when they’re trying to recognize objects in the video scenes: there are over two million images covering 500 categories of object. I2T grabs a video frame, works out what is background information and ignores it, then tries to recognize the objects left in the scene, before spitting out a semi-natural language description of what’s going on. For example, it’s smart enough to detect an object moving from one scene to another and can report a car jumping a red light at a traffic stop. It can even remember if a particular object leaves the scene and returns, which could help spot activity such as criminals casing a location before attempting a crime.
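To make those steps concrete, here is a minimal sketch in Python of the frame-to-text idea. It is not I2T’s actual code: the frames are tiny grids of brightness values, the background is removed by simple differencing against a reference frame, and the recognition step is skipped entirely, so the label passed to `describe` is a stand-in for whatever the image database would return. The function names and the threshold value are assumptions made for illustration.

```python
def foreground_mask(frame, background, threshold=30):
    """Mark pixels that differ from the reference background by more than threshold."""
    return [
        [abs(pixel - ref) > threshold for pixel, ref in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def bounding_box(mask):
    """Return (top, left, bottom, right) around the marked pixels, or None if there are none."""
    coords = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


def describe(label, box, timestamp):
    """Turn one detection into a line of text. The label is hypothetical here:
    a real system would get it from an object recognizer."""
    if box is None:
        return f"{timestamp}: nothing moving in view"
    top, left, bottom, right = box
    return f"{timestamp}: {label} at rows {top}-{bottom}, cols {left}-{right}"


# A 4x4 empty scene, then the same scene with a bright 2x2 object in the middle.
background = [[0] * 4 for _ in range(4)]
frame = [
    [0, 0, 0, 0],
    [0, 200, 200, 0],
    [0, 200, 200, 0],
    [0, 0, 0, 0],
]

box = bounding_box(foreground_mask(frame, background))
print(describe("car", box, "00:00:01"))
```

Real footage would need a background model that copes with changing light and weather, and the hard part, putting a name to the object, is exactly what the two-million-image database is for.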
The clever part is the text output, of course. Surveillance footage typically requires a pair of human eyes to monitor what’s going on, as machines aren’t that good at this task yet. And searching through a vast archive of video footage for a particular event usually requires some chump doing so manually. Being able to search for keywords in a text archive is a much simpler way to access the relevant moments in a camera’s surveillance history.
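Here is a rough idea of what that keyword search could look like once the footage has been turned into text, again in Python and again only a sketch. The log entries below are made up for illustration, and `build_index` and `search` are hypothetical names, not part of I2T.

```python
from collections import defaultdict


def build_index(events):
    """Map each lowercased word to the positions of the log entries that contain it."""
    index = defaultdict(set)
    for position, (_timestamp, text) in enumerate(events):
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(position)
    return index


def search(index, events, query):
    """Return the timestamps of entries that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return [events[position][0] for position in sorted(hits)]


# Invented example log: (timestamp, generated description).
events = [
    ("08:14:02", "A red car runs the red light."),
    ("08:15:10", "A person enters the store."),
    ("09:30:45", "A white van returns to the car park."),
]

index = build_index(events)
print(search(index, events, "car"))
print(search(index, events, "red light"))
```

Each hit is a timestamp, so whoever is reviewing the footage can jump straight to that moment in the video instead of watching hours of it.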
I2T’s database may be vast, but it’s not large enough (and the system’s not quite intelligent enough yet) to be extremely useful in a real-life situation. The technology does point to the future, though, where super-smart surveillance cams can analyze what’s going on themselves and spit out natural language descriptions of the events in real time, which can then be accessed through a regular search engine like Google. This may even have applications in less serious settings like YouTube videos, where Google’s already experimenting with automatic speech transcription technology for subtitles.