AI Needs Your Data—and You Should Get Paid for It?
That is, if he could get his hands on enough data. Chang embarked on a journey that’s familiar to many medical researchers looking to dabble in machine learning. He started with his own patients, but that wasn’t nearly enough, since training AI algorithms can require thousands or even millions of data points. He filled out grants and appealed to collaborators at other universities. He went to donor registries, where people voluntarily bring their data for researchers to use. But pretty soon he hit a wall. The data he needed was tied up in complicated data-sharing rules. “I was basically begging for data,” Chang says.
Chang thinks he might soon have a workaround to the data problem: patients. He’s working with Dawn Song, a professor at the University of California, Berkeley, to create a secure way for patients to share their data with researchers. It relies on a cloud computing network from Oasis Labs, founded by Song, and is designed so that researchers never see the data, even when it’s used to train AI. To encourage participation, patients will get paid when their data is used.
That design has implications well beyond healthcare. In California, Governor Gavin Newsom recently proposed a so-called “data dividend” that would transfer wealth from the state’s tech firms to its residents, and US Senator Mark Warner (D-Virginia) has introduced a bill that would require firms to put a price tag on each user’s personal data. The approach rests on a growing belief that the tech industry’s power is rooted in its vast stores of user data. These initiatives would upset that system by declaring that your data is yours, and that companies should pay you to use it, whether it’s your genome or your Facebook ad clicks.
In practice, though, the idea of owning your data quickly starts looking a little … fuzzy. Unlike physical assets like your car or house, your data is shared willy-nilly around the web, merged with other sources and, increasingly, fed through a Russian doll of machine learning models. As the data transmutes form and changes hands, its value becomes anybody’s guess. Plus, the current way data is handled is bound to create conflicting incentives. The priorities I have for valuing my data (say, personal privacy) conflict directly with Facebook’s (fueling ad algorithms).
Song thinks that for data ownership to work, the whole system needs a rethink. Data needs to be controlled by users, but still usable to others. “We can help users to maintain control of their data and at the same time to enable data to be utilized in a privacy preserving way for machine learning models,” she says. Health research, Song says, is a good way to start testing those ideas, in part because people are already often paid to participate in clinical studies.
This month, Song and Chang are starting a trial of the system, which they call Kara, at Stanford. Kara uses a technique known as differential privacy, where the ingredients for training an AI system come together with limited visibility to all parties involved. Patients upload pictures of their medical data—say, an eye scan—and medical researchers like Chang submit the AI systems they need data to train. That’s all stored on Oasis’s blockchain-based platform, which encrypts and anonymizes the data. Because all the computations happen within that black box, the researchers never see the data they’re using. The technique also draws on Song’s prior research to help ensure that the software can’t be reverse-engineered after the fact to extract the data used to train it.
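For readers who want a feel for the math: in its textbook form, differential privacy means adding calibrated random noise to a result so that no single person's record can be inferred from it. The sketch below is a generic illustration of that idea (the Laplace mechanism applied to a simple count), not a description of how Kara or Oasis Labs actually work. The function names, the `epsilon` value, and the toy patient records are all assumptions made up for this example.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with rate 1/scale
    # follows a Laplace distribution centered at 0 with the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    # A counting query changes by at most 1 when one record is added or
    # removed, so its sensitivity is 1 and the Laplace scale is 1 / epsilon.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical toy data: each record flags whether a scan shows disease.
patients = [{"id": i, "has_disease": i % 4 == 0} for i in range(100)]

rng = random.Random(0)  # seeded only so the example is repeatable
noisy = private_count(patients, lambda r: r["has_disease"], epsilon=1.0, rng=rng)
print(f"noisy count: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy. A larger one means a more accurate answer that reveals more about the individuals behind it.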
Sounds nice in theory, but how do you incentivize people to actually snap pictures of their health records?
In a medical study that uses machine learning, there are lots of reasons why your data might be worth more or less than mine, says Zou. Sometimes it’s the quality of the data—a poor quality eye scan might do a disease-detection algorithm more harm than good. Or perhaps your scan displays signs of a rare disease that’s relevant to a study. Other factors are more nebulous. If you want your algorithm to work well on a general population, for example, you’ll want an equally diverse mix of people in your research. So, the Shapley value for someone from a group often left out of clinical studies—say, women of color—might be relatively high in some cases. White men, who are often overrepresented in datasets, could be valued less.
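The Shapley value comes from cooperative game theory: it scores each contributor by their average marginal contribution across every possible ordering of the group. A minimal sketch shows how an overrepresented group can end up with a lower score. It assumes a deliberately simple, made-up measure of usefulness (how many distinct groups a dataset covers) rather than the accuracy of a trained model. The contributor names, group labels, and the `coverage` function are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, utility):
    # Exact Shapley values: for each player, average the marginal gain
    # utility(S + player) - utility(S) over all subsets S of the others,
    # weighted by how often S precedes the player in a random ordering.
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values

# Hypothetical contributors and the group each one belongs to.
groups = {"ann": "group_x", "bob": "group_x", "cam": "group_y"}

def coverage(subset):
    # Toy utility: the number of distinct groups represented in the subset.
    return len({groups[p] for p in subset})

values = shapley_values(list(groups), coverage)
for name, value in values.items():
    print(name, round(value, 3))
```

In this toy setup the two contributors who share a group split its credit, landing at about 0.5 each, while the lone member of the second group scores about 1.0. The exact calculation enumerates every subset, so it is only practical for a handful of contributors; real datasets would need an approximation.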
Put it that way and things start to sound a little ethically hairy. It’s not uncommon for people to be paid differently in clinical research, says Govind Persad, a bioethicist at the University of Denver, especially if a study depends on bringing in hard-to-recruit subjects. But he cautions that the incentives need to be designed carefully. Patients will need to have a sense of what they’ll be paid so they don’t get low-balled, and receive solid justifications, grounded in valid research aims, for how their data was valued. What’s more challenging, Persad notes, is getting the data market to function as intended. That’s been a problem for all sorts of blockchain companies promising user-controlled marketplaces—everything from selling your DNA sequence to “decentralized” forms of eBay.