Smart city boosters aspire to collect all kinds of data about how people navigate the streets—a prospect that could violate citizens’ privacy even if it might help planners build more efficient cities. That’s why Replica, a spin-off from the controversial Alphabet urban tech firm Sidewalk Labs, uses fake data to create a virtual world that mimics the way real people move through a metropolis. As fabricated as this “synthetic” data is, the company wants governments to use it to inform real-world policy decisions about where new bike lanes are built, what times roads are repaired, and how bus services reach people of color.
Replica has been piloted in Kansas City and sold for use to the state of Illinois. Now, despite its goal to guide policy in Portland, the company has not been fully transparent with municipal staff there about the real-world sources of its synthetic data or how its system works, even in a confidential “data disclosure” that Fast Company reviewed.
Replica has shown reluctance for over a year to give Portland Metro, the metropolitan district overseeing use of its system, sufficient information about its privacy protections. It has not provided Portland Metro with a full report showing the privacy audit the company points to as proof that its system is secure from reidentification of actual people.
“[Sidewalk Labs] isn’t willing to share the full privacy audit,” wrote Eliot Rose, Portland Metro’s technology strategist, in an email to agency staff in February 2019 that Fast Company reviewed. At the time, Sidewalk Labs was still Replica’s parent company. Now, after months of requesting up-to-date and comprehensive information detailing the system’s data sources and methodology, Rose and his team have received only an excerpt.
When asked about Portland’s demands for more comprehensive methodology and data source documentation, Replica CEO Nick Bowden tells Fast Company it is “inaccurate” to state that Portland has not yet received it. Bowden did not respond to a request to elaborate.
To assess the system’s privacy safeguards, Portland Metro expects to test its vulnerabilities by attempting to expose identities using an unusually detailed Replica data set, a proposition experts say has its own privacy risks.
Cities see value in algorithmic technologies that can inform planners of where to place a bus stop or make automatic decisions about which construction sites should be inspected. To help cities make these kinds of decisions, Replica uses machine learning to generate a synthetic population using a blend of data sources the firm says contain no traceable links to actual people. Compared to more traditional data privacy techniques that strip data of personal identifiers, “synthetic data is a great next step in terms of protecting privacy,” says Nathan Reitinger, a security and privacy attorney and PhD at the University of Maryland who has studied synthetic data.
However, as Replica and others aim to sell systems that impact government policy, some argue they should be more transparent about the nuts and bolts of technologies that inform or even replace human decisions.
Fears of black box tech sparked scrutiny in Toronto in 2018 when Replica was introduced as part of Sidewalk Labs’ much grander scheme intended to turn a waterfront neighborhood there into a high-tech urban utopia. Critics said Replica could result in “mass surveillance” and privacy invasions. Just six months after touting Replica in Toronto, its creators walked back plans to use the software there. Instead, representatives said they would focus on building the technology for use by U.S. cities, which Replica has continued to do as an independent company outside of the Alphabet umbrella.
Local governments’ struggle to understand what’s under the hood of algorithmic technology mirrors the experiences of everyday consumers who have little insight into how social media platforms, mobile devices, and smart home products gather, share, and sell data. When it comes to designing more efficient, equitable cities, synthetic data holds great promise—but understanding its provenance has become the next frontier in the urban privacy debate.
Demanding the data sources
Despite data privacy and policy hurdles, Portland representatives are excited by Replica’s potential to create more accessible transit and services. But before it trusts the system with that mission, Portland Metro wants more information.
“There are a lot of larger conversations happening in our region around bias in data and how data underrepresents communities of color and other underserved groups, and we need to be able to address those concerns with respect to Replica if we’re using it publicly. That’s part of the reason why we’re so focused on getting thorough and up-to-date documentation,” Rose tells Fast Company in an email.
Privacy experts agree. “Portland is right to dig in their heels and mandate the full audit report and technical documentation,” says Pam Dixon, executive director of World Privacy Forum, who has consulted for Portland’s city government on data privacy.
Portland’s use of Replica is months behind schedule as the city awaits the requested information. An early project schedule shows the city expected to have Replica’s system tested, validated, and ready to put to work over eight months ago in June 2019.
Portland Metro currently has only limited information about Replica’s data sources. Fast Company reviewed a four-page “data disclosure” document that Replica provided to Portland Metro labeled as “confidential and proprietary” that lists telecommunications companies and third-party app data aggregators among Replica’s synthetic population ingredients. That makes location data taken from people’s cellphones a foundational component of Replica’s synthetic data. But the document doesn’t name any specific data sources.
In the past, Replica has told clients that it uses data from Google, the mobility analytics firm Streetlight, and Safegraph, a provider of store visitor demographics, according to an email sent in 2018 from Bowden to Rose. Those sources, Bowden tells Fast Company, are no longer used. He did not provide names of other data sources currently used.
This is the first time these previous Replica data providers have been reported. The documents and emails mentioned in this article were obtained through a November 2019 Freedom of Information Act request.
Streetlight is a Replica competitor that also uses mobile location data, but does not build synthetic populations like Replica does. CEO Laura Schewel confirmed the company did provide Replica with the same sort of anonymized, aggregated data it provides to its other clients, but has not worked with the company for at least a year. “I feel very strongly that to fulfill the potential of big data . . . you do have to be transparent about your sources,” Schewel says. She says Streetlight uses data from Safegraph, location data provider Cuebiq, and Inrix, which provides truck fleet data. Streetlight cites methodology information and names some data sources on its website.
However, even Streetlight’s provider names offer little insight into where the data originally came from. The mobile location data sector is composed of layers upon layers of data sellers, most subject to nondisclosure agreements shielding them from naming the origins of the data they compile.
Safegraph did not respond to a request to comment for this story.
Replica’s data disclosure says it ensures its suppliers have obtained opt-in consent for location data collection. But that may not mean much: When you click to allow location tracking after first downloading an app, the data world considers that to be opt-in consent for disseminating that information through the location data ecosystem. There’s no telling where it might end up.
Confusion over Google data
Replica was born as the Model Lab product inside Alphabet’s Sidewalk Labs. At its public introduction in April 2018, Replica’s family ties to Google through Alphabet led some to assume that the Replica system derived its mobile location data through Google’s pervasive Android mobile operating system. Some worried the software product would link Google mobile data showing individuals’ movements and activities to the futuristic city infrastructure Sidewalk Labs still plans for Toronto.
Replica was established as a separate business in March 2019 and has generated funding from venture firms including Innovation Endeavors, where longtime Google CEO Eric Schmidt is a founding partner.
Despite Replica’s status as an independent company that is not owned by Alphabet, it is still dogged by concerns about possible use of Google or Alphabet data, if its confidential data disclosure is any indication. While it does not name data sources, the document makes a point of stating, “Replica does not source raw location data directly from Google or any other Alphabet companies.”
However, some of Replica’s statements regarding Google and Alphabet data are contradictory. Bowden told Rose via email in September 2018, “We actually use a combination of Streetlight, Safegraph, cell companies, consumer marketing data, Google data, and census data.”
Documents also show city clients were indeed under the impression Replica used Google location data. As reported by The Intercept, a March 2018 Illinois procurement document lists Google location data as one of Replica’s data sources. A Portland Metro intergovernmental agreement dated that September also includes “location data from Android phones and Google apps” as Replica sources.
But the company said otherwise just a few months later when asked in December 2018 by Oregon Public Broadcasting, which reported, “A spokesman for Sidewalk Labs says Google is not a data source for Replica.”
Bowden tells Fast Company that Streetlight, Safegraph, and Google do not supply data to Replica today. “We no longer use any Google services in creating Replicas,” he says, using the firm’s terminology for the synthetic individuals it builds.
While Replica’s data disclosure is vague, it does reveal a new clue. The document lists payment-processing companies among data sources. Purchase transaction data showing where, when, and what people buy has been available for years, and could help Replica produce far more robust synthetic profiles for its database. For instance, purchase data could help Replica estimate how much avatars in the system spent on ride sharing or cabs last month, or if they just bought an SUV.
Along with cellphone location data derived from unknown sources, Replica uses lots of data from its municipal clients to calibrate and validate its city-specific models. Portland has supplied ride-share and e-scooter trip data, household data from Oregon Household Activity and American Community Surveys, and data from Portland’s public transit agency Tri-Met that includes non-identifiable information such as passenger income, race, and ethnicity by route.
“If we increase bus service to East Portland, we can use Replica to assess whether people of color and low-income communities are taking advantage of that service,” Rose wrote to other government staff in a 2019 email labeled “talking points and guidance.”
Cities can also use Replica to gauge the impact of Uber and Lyft on traffic congestion or determine the best times to do road or sidewalk repairs. But Replica wants the system to be about much more; it wants to inform government policy decisions.
“Finally, and most important in my opinion, is we calibrate Replica to world conditions to ensure it can be used in policy making vs. just a source of additional data,” stated Bowden in an email sent to Rose.
As cities demand better transparency of data sources and privacy methods related to systems used to inform government decisions, they’re figuring out their own policies for tech use, data security, privacy, and community engagement around new tech.
“Cities around the world are increasingly grappling with complicated technical de-identification and data governance issues at a time when privacy risks are mounting,” says John Verdi at the Future of Privacy Forum, which consults with Portland and other cities. “It is essential that clear guidelines are set out for public review and debate.”
Portland’s risky privacy test
Balancing Replica’s city planning promise with its privacy concerns has led to some surprising data negotiations in Portland. During contract discussions back in March 2019, Replica’s Bowden told Portland Metro’s Rose he wanted to ensure Replica cannot be “used as a mechanism for attempting to specifically target a single individual.”
But Portland Metro wanted an exception that would allow the city to stress-test the system. “I thought that we had agreed to make an exception that allows Metro to attempt to ID people within the data during the validation phase so that we can verify that the data adequately protects people’s privacy,” stated Rose in the exchange.
Bowden wrote, “We can’t conceive of a scenario, with any combination of outside data, whereby this is possible.”
A summary of Replica’s third-party privacy audit featured in the data disclosure says Replica data has been manipulated to prevent reidentification. This includes altering data showing the types of activities people travel to and adding random noise to location and time data.
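Replica has not published how it applies that noise, so the following is only an illustrative sketch of the general idea, not the company's method. It assumes a hypothetical `Trip` record and a hypothetical `perturb_trip` helper, and it uses bounded uniform noise with made-up limits (about 0.002 degrees and 15 minutes) purely for demonstration.

```python
import random
from dataclasses import dataclass


@dataclass
class Trip:
    """Hypothetical trip record; field names are assumptions, not Replica's schema."""
    lat: float
    lon: float
    depart_minute: int  # minutes after midnight


def perturb_trip(trip, rng, max_offset_deg=0.002, max_offset_min=15):
    """Return a copy of the trip with bounded uniform noise added.

    The noise bounds are illustrative assumptions, not values from any audit.
    """
    return Trip(
        lat=trip.lat + rng.uniform(-max_offset_deg, max_offset_deg),
        lon=trip.lon + rng.uniform(-max_offset_deg, max_offset_deg),
        # Wrap around midnight so the result stays within a 24-hour day.
        depart_minute=(trip.depart_minute
                       + rng.randint(-max_offset_min, max_offset_min)) % 1440,
    )


rng = random.Random(0)  # fixed seed so the example is repeatable
original = Trip(lat=45.5231, lon=-122.6765, depart_minute=8 * 60 + 5)
noisy = perturb_trip(original, rng)
print(original)
print(noisy)
```

A sketch like this shows only the mechanics of blurring a record. Whether any given amount of noise is enough to prevent reidentification is the kind of question Portland Metro's validation testing is meant to answer.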
Ultimately, Portland Metro’s contract with Replica allows the exception. It stipulates that city officials cannot reverse engineer the system to identify people or personalize the data “except for the purposes of ensuring that Services and Content adequately safeguard residents’ privacy during validation testing.”
To perform its privacy test, Portland Metro expects to receive “disaggregate” Replica data featuring more detail than what’s in the aggregated data most users would access. It will show individual trips made by the synthetic population including home locations and trip origins, destinations, and times. Fewer than five people will be able to access the refined data. Bowden confirmed Replica provides this trip activity data to clients.
With this information, Portland Metro aims to ensure the data cannot be reidentified, and that Replica is not simply training its model to match the city’s pre-existing data.
“It’s a good thing the city is investigating and attempting to do their due diligence on [Replica],” says University of Maryland’s Reitinger. But without details from the company about how the system is built, he says it would be difficult for the city to gauge which types of potential risks the system poses.
At this point, Portland Metro is still waiting for full data source and methodology information from Replica before it decides how it will test its privacy safeguards. However, even after the city has concluded its accuracy and privacy testing, the agency hopes its small group will have continued access to that disaggregate database. The group wants to use it in city planning projects that require varying forms of data only available from the disaggregate data set.
Despite synthetic data’s inherent privacy-by-design setup, “it is still possible for cities to unravel that by poor uses of that data,” World Privacy Forum’s Dixon says. If the city performs its internal privacy test by adding real-world information, otherwise de-identified and synthetic replicas could be linked to actual people. That could risk misuse that could expose an individual, she says.
Without government policies for emerging technologies and data use, cities and their residents are at a disadvantage, Dixon says. “If there’s a villain here, it’s the transitional time we live in.”