The Lowdown on Data Privacy and What Data Scientists Really Need to Know


 

While data privacy remains a hot topic, particularly with the ongoing battle between Apple and the US government, we thought it worth dissecting the subject in depth for a more comprehensive understanding. The notes below, from a presentation at the Chief Analytics Officer Forum Europe by Harry Powell, Head of Advanced Data Analytics at Barclays, offer timely insight into this all-important matter. What constitutes ‘data privacy’? Can it really be measured? If so, how? Read on.

 

The value of respecting people’s data

Data Scientists apply Big Data, pattern recognition and automation technologies to data to build and deliver information products. Some of the most valuable data is recorded at an individual level. This data is owned by the individual in the sense that they continue to have some rights with respect to that data, such as privacy and consent. Withdrawal of consent would inhibit future use. In addition, for a bank, there would be both regulatory and broad commercial implications of any data breach. So it is important to ensure that these rights are respected.

 

Defining ‘Data Privacy’

For Barclays, data privacy means that no individual or business should be uniquely identifiable from an analytical report, either on its own or in combination with other data to which the recipient of the report might have access.

It is not about access to or encryption of data. It is about ensuring that the information remains private even if it is hacked.


 

Tensions with Data Science

Privacy vs Information: Data science identifies patterns in data, and patterns in data can be used to identify individuals.

Agility and Momentum vs Security: Data science works by testing ideas iteratively. To do this requires fluid access to data. Conventionally, data security has relied on controlling and slowing flows of data to aid visibility.

Policies vs Process: Data scientists want to know exactly what they must do to comply. Security focuses on policies, which mostly offer guidelines rather than a concrete definition of compliance.

Data democratisation: Data is most productive when lots of people use it. It is easiest to control the privacy of data if it is only used by a small group of people.

Software Development vs Analytics vs Data Science: Conventionally there was software development, which had the freedom to build code in a development environment but had no access to real data; and there was analytics, which had real data but worked in restricted environments, e.g. RDBMS and SAS. Data science needs the freedom of software development but has to work on real data.

 

Measurement of Data Privacy

We need an agreed way to measure the level of privacy. This will lead to a more objective discussion of the necessity and efficacy of controls, which will (from a data science perspective) in turn lead to a faster process with pre-agreed thresholds.
There are three commonly discussed concepts of Privacy:

K-Anonymity
A release of data is said to have k-anonymity if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. Optimisation (K-anonymity with minimum information loss) is an NP-Hard problem.
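As a concrete illustration, measuring the k of a release reduces to finding the smallest group of records that share the same quasi-identifier values. The sketch below assumes a simple list-of-dicts dataset; the field names (`age_band`, `postcode`, `diagnosis`) are illustrative, not from the presentation.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    # Count how many records share each combination of quasi-identifier
    # values; k is the size of the smallest such group.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

rows = [
    {"age_band": "30-39", "postcode": "SW1", "diagnosis": "flu"},
    {"age_band": "30-39", "postcode": "SW1", "diagnosis": "cancer"},
    {"age_band": "40-49", "postcode": "N1",  "diagnosis": "flu"},
    {"age_band": "40-49", "postcode": "N1",  "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode": "N1",  "diagnosis": "cancer"},
]
print(k_anonymity(rows, ["age_band", "postcode"]))  # 2: the smallest cluster has two records
```

Note that this only measures k for a given release; choosing how to generalise fields so that k is large while information loss is minimal is the NP-hard part.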

L-Diversity
A cluster has l-diversity if the sensitive attribute is represented by at least l distinct values.

For anonymity, you want homogeneity within each cluster on the anonymised (quasi-identifier) fields, but heterogeneity within the cluster on the sensitive, non-anonymised fields. If, within a k-anonymised medical data set, every member of a cluster has ‘cancer’, then even though you don’t know which record belongs to an individual, you know that they have cancer. The attribute values do not need to be identical for this inference to work: imagine that in the same dataset each member of a cluster has one of four different kinds of cancer; you can still infer that the individual in question has cancer, even though you cannot be definitive about which kind.

Efficient k-anonymised, l-diverse data is hard to achieve in high-dimensional datasets because clusters become large.
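Measuring l follows the same pattern as measuring k: group records by quasi-identifier values, then count distinct sensitive values per group. A minimal sketch, reusing the same illustrative field names as above (not from the presentation):

```python
def l_diversity(records, quasi_identifiers, sensitive):
    # Group records by their quasi-identifier values and collect the
    # distinct sensitive values seen in each group; l is the smallest
    # such count across all groups.
    clusters = {}
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        clusters.setdefault(key, set()).add(r[sensitive])
    return min(len(values) for values in clusters.values())

rows = [
    {"age_band": "30-39", "postcode": "SW1", "diagnosis": "flu"},
    {"age_band": "30-39", "postcode": "SW1", "diagnosis": "cancer"},
    {"age_band": "40-49", "postcode": "N1",  "diagnosis": "flu"},
    {"age_band": "40-49", "postcode": "N1",  "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode": "N1",  "diagnosis": "cancer"},
]
print(l_diversity(rows, ["age_band", "postcode"], "diagnosis"))  # 2
```

An l of 1 would mean some cluster is completely homogeneous on the sensitive field, which is exactly the cancer example above.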

T-Closeness
Even an l-diverse cluster can leak information if the distribution of an attribute within the cluster is so atypical of the population distribution that a meaningful inference can be made about the value for an individual in the cluster. A cluster is said to have t-closeness if the distance between the distribution of a sensitive attribute in the cluster and the distribution of the attribute in the whole table is no more than a threshold t.
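The definition leaves the choice of distance open; the original t-closeness formulation uses the Earth Mover's Distance, which for an unordered categorical attribute with uniform ground distance coincides with total variation distance. The sketch below uses that simplification and reports the worst (largest) distance over all clusters; field names are illustrative.

```python
from collections import Counter

def t_closeness(records, quasi_identifiers, sensitive):
    # Distribution of the sensitive attribute over the whole table.
    overall = Counter(r[sensitive] for r in records)
    n = len(records)
    # Group the sensitive values by quasi-identifier cluster.
    clusters = {}
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        clusters.setdefault(key, []).append(r[sensitive])
    worst = 0.0
    for values in clusters.values():
        local = Counter(values)
        m = len(values)
        # Total variation distance between the cluster's distribution
        # and the table-wide distribution (a stand-in for the Earth
        # Mover's Distance of the original definition, valid when the
        # attribute is categorical with no ordering).
        d = 0.5 * sum(abs(local[v] / m - overall[v] / n) for v in overall)
        worst = max(worst, d)
    return worst

rows = [
    {"postcode": "SW1", "diagnosis": "flu"},
    {"postcode": "SW1", "diagnosis": "cancer"},
    {"postcode": "N1",  "diagnosis": "flu"},
    {"postcode": "N1",  "diagnosis": "flu"},
]
print(t_closeness(rows, ["postcode"], "diagnosis"))  # 0.25
```

A release satisfies t-closeness for a given threshold t when this worst-case distance is at most t.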

 

Data Scientists, Listen Up

What data science needs is a standardised way to measure (rather than implement) privacy. This should include:

  1. The risk of an individual being identified;
  2. The additional information required to de-anonymise an individual;
  3. The implications of de-anonymisation including risk of loss; and
  4. A measure of the informational loss due to anonymisation.

 

By Harry Powell:

 

Harry Powell leads the Advanced Data Analytics (ADA) team within the Analytics Centre of Excellence in Barclays’ Personal and Corporate Bank.

ADA is a world-class data science team which innovates, designs and builds applications that deliver, direct to customers, relevant analytical content that will help them make smart decisions to improve their lives.

The team is a combination of machine learning and Big Data specialists. Because the content is delivered directly to users without human oversight, there is an emphasis on software engineering and mathematics/statistics. ADA builds its products in Scala and Spark, which are type-safe, functional, object-oriented and performant.

 

This post was originally published on Data Digest. For more content related to big data, innovation and analytics, visit www.datadigestonline.com
