I'm CEO of a robotics company, and I believe AI has failed on many promises. Here's what comes next

Aside from drawing photo-realistic images and holding seemingly sentient conversations, AI has failed on many promises. The resulting rise in AI skepticism leaves us with a choice: We can become too cynical and watch from the sidelines as winners emerge, or find a way to filter noise and identify commercial breakthroughs early to participate in a historic economic opportunity.

There’s a simple framework for differentiating near-term reality from science fiction. We use the single most important measure of maturity in any technology: its ability to manage unforeseen events, commonly known as edge cases. As a technology hardens, it becomes more adept at handling increasingly infrequent edge cases and, as a result, gradually unlocks new applications.

Edge-case reliability is measured differently for different technologies. A cloud service’s uptime could be one way to assess reliability. For AI, a better measure would be its accuracy. When an AI fails to handle an edge case, it produces either a false positive or a false negative. Precision is the share of the AI’s positive calls that are correct, so it falls as false positives rise. Recall is the share of actual positives that the AI catches, so it falls as false negatives rise.
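
To make the two metrics concrete, here is a minimal sketch of how precision and recall follow from error counts. The counts are hypothetical numbers for an imagined spam filter, chosen only for illustration.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of positive predictions that were actually positive."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Share of actual positives that the model caught."""
    return true_pos / (true_pos + false_neg)


# Hypothetical evaluation counts (not from the article):
# 90 spam messages caught, 10 good messages wrongly flagged, 30 spam missed.
tp, fp, fn = 90, 10, 30

p = precision(tp, fp)  # hurt by false positives
r = recall(tp, fn)     # hurt by false negatives
print(f"precision={p:.2f} recall={r:.2f}")
```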

Here’s an important insight: Today’s AI can achieve very high performance if it is focused on either precision or recall. In other words, it optimizes one at the expense of the other (fewer false positives in exchange for more false negatives, or vice versa). But when it comes to achieving high performance on both simultaneously, AI models struggle. Solving this remains the holy grail of AI.
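
The trade-off can be illustrated with a toy classifier that flags an item whenever its score clears a threshold. This is only a sketch: the scores, labels, and thresholds below are invented, and real models are evaluated on far larger datasets.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when items scoring >= threshold are flagged."""
    tp = fp = fn = 0
    for score, is_positive in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_positive:
            tp += 1
        elif flagged and not is_positive:
            fp += 1
        elif not flagged and is_positive:
            fn += 1
    # Convention assumed here: an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical model scores and ground-truth labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.85, 0.65, 0.35):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, the strictest threshold yields perfect precision but misses half of the positives, while the loosest one catches every positive at the cost of more false alarms.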

Low-fidelity vs. high-fidelity AI

Based on the above, we can categorize AI into two classes: high-fidelity versus low-fidelity. An AI with either high precision or high recall is lo-fi. And one with both high precision and high recall is hi-fi. Today, AI models used in image recognition, content personalization, and spam filtering are lo-fi. Models required by robo-taxis, however, have to be hi-fi.

There are a few important insights about lo-fi and hi-fi AI worth noting:

Lo-fi works: Most algorithms today are designed to optimize for precision at the expense of recall or vice versa. For example, to avoid missing fraudulent credit card charges (minimizing false negatives), a model can be designed to aggressively flag charges with the slightest indication of fraud, thus increasing false positives.
Hi-fi = Sci-fi: Today, no commercial applications exist that are built on hi-fi AI. In fact, hi-fi AI may be decades away, as shown below.
Hi-fi is rarely needed: In many domains, smart product and business decisions could downgrade AI needs from hi-fi to lo-fi, with minimal/acceptable business impact. To do so, product leaders must understand the limits of AI and apply it in their design process.
Time-critical safety needs hi-fi: Time-sensitive safety decisions are one area where hi-fi AI is often needed. This is where many autonomous car use cases tend to be focused.
Lo-fi + humans = hi-fi: Safety use cases aside, it is often possible to achieve hi-fi performance by combining artificial and human intelligence. Products can be designed to incorporate human assistance at opportune moments, whether from the user or from support staff, to achieve their desired levels of both precision and recall.
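
One common way to combine the two, sketched below under stated assumptions, is to let the model act only when it is confident and to hand everything else to a person. The function name `decide`, the `ask_human` callback, and the 0.9 cutoff are all hypothetical and do not describe any particular product.

```python
from typing import Callable


def decide(
    model_label: str,
    confidence: float,
    ask_human: Callable[[str], str],
    min_confidence: float = 0.9,  # hypothetical cutoff, tuned per product
) -> str:
    """Accept the model's answer when confident, otherwise escalate to a person."""
    if confidence >= min_confidence:
        return model_label
    # Low confidence: a human reviews the model's suggestion and has the final say.
    return ask_human(model_label)


# Stand-in for a real reviewer; here the human always overrides with "red".
def reviewer(suggested_label: str) -> str:
    return "red"


print(decide("green", 0.97, reviewer))  # model is confident, its answer stands
print(decide("green", 0.55, reviewer))  # model is unsure, the human decides
```

The design choice is where to set the cutoff: a higher value sends more cases to humans, which raises cost but reduces the errors the model is allowed to make on its own.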

Quantifying AI’s fidelity

A popular metric for evaluating AI reliability is the F1 score, which is the harmonic mean of precision and recall and therefore accounts for both false positives and false negatives. An F1 of 100% represents a perfectly error-free AI that handles all edge cases. By our estimate, some of the best AI models today perform at a rate of 99%, though a score above 90% is generally considered high.

Let’s calculate the F1 score for two applications:

If Spotify plays songs you like 95% of the time (precision), but only surfaces half of the songs you like (recall of 50%), its F1 would be about 65%. This is an adequate score, because high precision makes for a great user experience and low user churn, whereas users do not notice low recall.
When a robo-taxi decides whether to cross at a traffic light, it is making a time-sensitive safety decision. Both blowing a red light (false negative) and unexpectedly braking at a green (false positive) have a high risk of collision. We devised a method to estimate the level of AI accuracy needed to achieve parity between autonomy and human drivers, taking into account current intersection collision rates and other factors. We estimate that a robo-taxi must achieve over 99.9999% precision and 99.9999% recall in detecting red lights in order to be on par with humans. That is an F1 of 99.9999%, or six nines.
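
The two scores above can be checked with the standard F1 formula, F1 = 2 × precision × recall / (precision + recall). The sketch below plugs in only the precision and recall figures quoted in the examples; the function and variable names are ours.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in (0, 1]."""
    return 2 * precision * recall / (precision + recall)


# Music recommendation: 95% precision, 50% recall.
spotify_f1 = f1_score(0.95, 0.50)

# Robo-taxi red-light detection: six nines on both metrics.
robotaxi_f1 = f1_score(0.999999, 0.999999)

print(f"music recommendation F1: {spotify_f1:.1%}")
print(f"robo-taxi F1:            {robotaxi_f1:.4%}")
```

Because the harmonic mean is pulled toward the lower of its two inputs, the music example lands near 65% even though its precision is 95%.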

It is clear from the above examples that an F1 of 65% is easily achievable by today’s AI, but how far away are we from an F1 of six nines?

A roadmap to hi-fi

As discussed earlier, maturity and market readiness for any technology are tied to how well it handles edge cases. For AI, the F1 score can be a useful approximation for maturity. Similarly, for previous waves of digital innovation such as web and cloud, we can use their uptime as a signal for maturity.

As a 30-year-old technology, the web is one of the most reliable digital experiences. The most mature sites such as Google and Gmail aim for 99.999% uptime (five nines), meaning the service is unavailable for little more than five minutes per year. This target is sometimes missed by a wide margin, as in YouTube’s 62-minute disruption in 2018 or Gmail’s six-hour outage in 2020.

At roughly half of the web’s age, the cloud is less reliable. Most services offered by Amazon Web Services (AWS) have an uptime SLA of 99.99%, or four nines. That allows about ten times as much downtime as Gmail’s target, but it is still very high.
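
The link between "nines" of uptime and permitted downtime is simple arithmetic, sketched below. It assumes a 365-day year and ignores how any specific SLA defines its measurement window.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Minutes per year a service may be down while still meeting the uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


five_nines = downtime_minutes_per_year(99.999)  # web-scale target quoted above
four_nines = downtime_minutes_per_year(99.99)   # typical cloud SLA quoted above

print(f"99.999% uptime allows {five_nines:.2f} minutes of downtime per year")
print(f"99.99%  uptime allows {four_nines:.2f} minutes of downtime per year")
```

Each additional nine cuts the downtime allowance by a factor of ten.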

A few observations:

It takes decades: The above examples show that it often takes decades to move up the edge-case maturity ladder.
Some use cases are particularly challenging: The extremely high level of edge-case performance needed by robo-taxis (six nines) exceeds even that of Gmail. Bear in mind that self-driving also runs on computers similar to cloud services. Yet the operational uptime required by robo-taxis must exceed what current web and cloud services achieve!
Narrow applications beat general purpose: Web applications are narrowly defined use cases for cloud services. As such, web services can achieve higher uptimes than cloud services, because the more generalized the technology, the more difficult it is to harden.

Case Study: Not all autonomy is created equal

Google engineers who left its self-driving car team to start their own companies had a common thesis: Narrowly defined applications of autonomy will be easier to commercialize than general self-driving. In 2017, Aurora was founded to move goods via long-haul trucks on highways. Around the same time, Nuro was founded to move goods in small cars and at slower speeds.

Our team also shared this thesis when we started off inside Postmates (also in 2017). Our focus has also been on moving goods but, contrary to others, we chose to leave cars behind and instead focus on smaller form robots that operate off the street: Autonomous Mobile Robots (AMRs). These are widely adopted in controlled environments such as factory floors and warehouses.

Consider red-light detection for delivery robots. While they should never cross on red given the risk of collision with vehicles, conservatively stopping on green introduces no safety risk. Therefore, a recall rate similar to that of robo-taxis (99.9999%) along with a modest precision (80%) would be adequate for this AI use case. This results in an F1 of roughly 89%, or about one nine, which is easy to achieve. By moving from street to sidewalk and from a full-size car to a small robot, the AI accuracy required drops from six nines to roughly one.
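
As a rough check on this comparison, the sketch below computes the delivery-robot F1 from the quoted precision and recall and converts scores into a "number of nines." The `nines` helper is our own shorthand, defined as the negative base-10 logarithm of the error rate, so 90% is one nine and 99.9999% is six.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in (0, 1]."""
    return 2 * precision * recall / (precision + recall)


def nines(score: float) -> float:
    """Express a score below 1.0 as a count of nines (0.9 -> 1, 0.999 -> 3)."""
    return -math.log10(1 - score)


# Delivery robot: near-perfect recall on red lights, modest precision.
robot_f1 = f1_score(0.80, 0.999999)

# Robo-taxi: six nines required on both metrics.
taxi_f1 = f1_score(0.999999, 0.999999)

print(f"delivery robot: F1={robot_f1:.1%}, about {nines(robot_f1):.1f} nines")
print(f"robo-taxi:      F1={taxi_f1:.4%}, about {nines(taxi_f1):.1f} nines")
```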

Robots are here

Delivery AMRs are the first application of urban autonomy to commercialize, while robo-taxis still await an unattainable hi-fi AI performance. The rate of progress in this industry, as well as our experience over the past five years, has strengthened our view that the best way to commercialize AI is to focus on narrower applications enabled by lo-fi AI, and use human intervention to achieve hi-fi performance when needed. In this model, lo-fi AI leads to early commercialization, and incremental improvements afterwards help drive business KPIs.

By targeting more forgiving use cases, businesses can use lo-fi AI to achieve commercial success early, while maintaining a realistic view of the multi-year timeline for achieving hi-fi capabilities. After all, sci-fi has no place in business planning.

Ali Kashani is the cofounder and CEO of Serve Robotics.
