De-anonymizing

For example, contestants in Netflix's competition to improve its recommendation software received a training data set containing the movie preferences of more than 480,000 customers who had, as they say in the trade, been "de-identified." But as part of a privacy experiment, a pair of computer scientists at the University of Texas at Austin decided to see if it was possible to re-identify those unnamed movie fans.

By comparing the film preferences of some anonymous Netflix customers with personal profiles on imdb.com, the Internet movie database, the researchers said they easily re-identified some people because they had posted their e-mail addresses or other distinguishing information online.
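As a rough illustration of the comparison described above, one could score how closely an anonymous rating record agrees with each public profile and keep the best match. This is only a sketch under stated assumptions, not the researchers' actual method: the function names, the one-star tolerance, the minimum score, and all of the sample data are hypothetical.

```python
def overlap_score(anon_ratings, public_ratings, tolerance=1):
    """Count titles rated in both records whose ratings differ by at most `tolerance`."""
    shared = set(anon_ratings) & set(public_ratings)
    return sum(
        1 for title in shared
        if abs(anon_ratings[title] - public_ratings[title]) <= tolerance
    )


def best_match(anon_ratings, public_profiles, min_score=3):
    """Return the public profile that agrees most with the anonymous record.

    Returns None when no profile reaches `min_score` agreeing titles
    (an assumed threshold, chosen only for this example).
    """
    scores = {
        name: overlap_score(anon_ratings, ratings)
        for name, ratings in public_profiles.items()
    }
    if not scores:
        return None
    name = max(scores, key=scores.get)
    return name if scores[name] >= min_score else None


# Made-up data: one "de-identified" record and two public profiles.
anonymous_record = {"Film A": 5, "Film B": 1, "Film C": 4, "Film D": 2}
public_profiles = {
    "profile_a": {"Film A": 5, "Film B": 2, "Film C": 4, "Film E": 3},
    "profile_b": {"Film A": 1, "Film D": 5},
}
candidate = best_match(anonymous_record, public_profiles)
```

The more titles two records share, and the more unusual those titles are, the less likely the agreement is a coincidence. A realistic attack would weight rare films more heavily than this simple count does.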

Vitaly Shmatikov, an associate professor of computer science at the University of Texas at Austin and a co-author of the "de-anonymization" study, says the researchers were able to analyze users' public postings and connect them to their Netflix preferences -- including how a person may have rated films with controversial themes. Those are choices a person may or may not want to make public, Mr. Shmatikov said.

Steve Swasey, a Netflix spokesman, disputed the study's conclusions, saying the customers were not re-identifiable because Netflix had altered the data set before sending it to contestants.

"There is no way with certainty that anyone could link a Netflix member with the data Netflix has disclosed by linking it with any publicly available data," he said. "The anonymity of the information is comparable to the strictest federal standards for anonymizing personal health information."

The clinical information systems market in the United States has sales of $8 billion to $10 billion annually, and about 5 percent of that comes from data and analysis, according to estimates by George Hill, an analyst at Leerink Swann, a health care investment bank.

But by 2020, when a vast majority of American health providers are expected to have electronic health systems, the data mining component alone could generate sales of up to $5 billion, Mr. Hill said. Demand for the data is likely to be robust. Policy makers and hospitals will want to dig into it to analyze physician practices and glean information about patient health trends.

Big players like the Cerner Corporation, which maintains electronic health systems for 8,000 clients, including large hospitals and retail clinics, and smaller players like Practice Fusion, which offers its Web-based health record systems free to health care providers, say they make use of patient data collected from their clients.

A spokeswoman for Cerner, whose Web site promotes its "data mining of our vast warehouse of electronic health records," said the company shares de-identified patient data with researchers or drug companies looking for patients to participate in clinical trials. The patient records are "double scrubbed," she said, explaining that the company removes personal data like names and addresses before it runs a search using a numbered code for each patient.
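The scrubbing step the spokeswoman describes, removing names and addresses and substituting a numbered code, might look roughly like the sketch below. It is not Cerner's process: the field names, the `scrub` function, and the sample record are hypothetical, and the article does not say what the second "scrub" involves, so only the one described step is shown.

```python
import itertools

# Assumed set of direct identifiers; the article names only names and addresses.
DIRECT_IDENTIFIERS = {"name", "address"}


def scrub(records):
    """Strip direct identifiers and tag each record with a numbered patient code.

    Returns the scrubbed records plus a key table mapping each code back to
    the removed fields, which would have to be stored separately.
    """
    counter = itertools.count(1)
    key_table = {}
    scrubbed = []
    for record in records:
        code = next(counter)
        key_table[code] = {
            field: record[field] for field in DIRECT_IDENTIFIERS if field in record
        }
        clean = {
            field: value
            for field, value in record.items()
            if field not in DIRECT_IDENTIFIERS
        }
        clean["patient_code"] = code
        scrubbed.append(clean)
    return scrubbed, key_table


# Made-up record for illustration only.
patients = [
    {"name": "Patient One", "address": "1 Example St", "zip_code": "00001",
     "diagnosis": "condition X"},
]
scrubbed_records, key_table = scrub(patients)
```

Note that fields such as `zip_code` survive this kind of scrubbing. Those leftover details are what make re-identification possible when they can be matched against another data set.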

In 1997, for example, a researcher identified the medical records of William Weld, then the governor of Massachusetts, by matching birth dates, ZIP codes and gender in voter registration rolls against information published by the state's government insurance commission.
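The Weld example is what is usually called a linkage attack: two data sets that each look harmless are joined on the fields they have in common. A minimal sketch, assuming hypothetical field names and entirely made-up records (nothing here reproduces the 1997 data or the researcher's actual procedure):

```python
from collections import defaultdict

# Fields assumed to appear in both data sets.
QUASI_IDENTIFIERS = ("birth_date", "zip_code", "sex")


def link(voter_rolls, medical_records):
    """Pair a medical record with a voter when exactly one voter shares its
    birth date, ZIP code and sex."""
    index = defaultdict(list)
    for voter in voter_rolls:
        key = tuple(voter[field] for field in QUASI_IDENTIFIERS)
        index[key].append(voter["name"])

    matches = []
    for record in medical_records:
        key = tuple(record[field] for field in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # a unique combination identifies one person
            matches.append((names[0], record["diagnosis"]))
    return matches


# Made-up data: Voter A is unique, Voters B and C share all three fields.
voter_rolls = [
    {"name": "Voter A", "birth_date": "1950-01-01", "zip_code": "00001", "sex": "M"},
    {"name": "Voter B", "birth_date": "1960-02-02", "zip_code": "00002", "sex": "F"},
    {"name": "Voter C", "birth_date": "1960-02-02", "zip_code": "00002", "sex": "F"},
]
medical_records = [
    {"birth_date": "1950-01-01", "zip_code": "00001", "sex": "M", "diagnosis": "condition X"},
    {"birth_date": "1960-02-02", "zip_code": "00002", "sex": "F", "diagnosis": "condition Y"},
]
linked = link(voter_rolls, medical_records)
```

In this toy data only Voter A is re-identified, because Voters B and C are indistinguishable on the three shared fields. The attack works whenever a combination of ordinary details happens to be unique to one person.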

Slipstream
When 2+2 Equals a Privacy Question
By NATASHA SINGER
Published: October 18, 2009
Some privacy advocates wonder whether rules for electronic records offer enough protection.
