Pandas extension that measures privacy risk
This framework computes the re-identification risk of a dataset, assuming the data being shared can be loaded into a pandas DataFrame. It computes the following risk measures:
- marketer
- prosecutor
- Pitman
References:
- [http://www.ehealthinformation.ca](http://www.ehealthinformation.ca/wp-content/uploads/2014/08/2009-De-identification-PA-whitepaper1.pdf)
- [https://www.scb.se/contentassets](https://www.scb.se/contentassets/ff271eeeca694f47ae99b942de61df83/applying-pitmans-sampling-formula-to-microdata-disclosure-risk-assessment.pdf)
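
As a rough illustration of the marketer and prosecutor measures listed above, the sketch below uses their common textbook definitions based on equivalence classes (groups of rows that share the same quasi-identifier values). The function names are hypothetical and this is not necessarily how `pandas_risk` computes its results; the marketer formula also assumes the dataset is the whole population.

```python
import pandas as pd


def equivalence_class_sizes(df, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return df.groupby(quasi_identifiers).size()


def prosecutor_risk(df, quasi_identifiers):
    """Worst-case risk: 1 / size of the smallest equivalence class."""
    return 1.0 / equivalence_class_sizes(df, quasi_identifiers).min()


def marketer_risk(df, quasi_identifiers):
    """Expected share of re-identified records: number of classes / number of rows.

    Assumes the dataset itself is the whole population.
    """
    return len(equivalence_class_sizes(df, quasi_identifiers)) / len(df)


# Toy data: three equivalence classes of sizes 3, 2 and 1.
toy = pd.DataFrame({
    "x": [1, 1, 1, 2, 2, 3],
    "y": [5, 5, 5, 6, 6, 7],
})
print(prosecutor_risk(toy, ["x", "y"]))  # the smallest class has a single row
print(marketer_risk(toy, ["x", "y"]))    # 3 classes over 6 rows
```

A single unique row is enough to push the prosecutor measure to its maximum, while the marketer measure averages over the whole dataset.
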
This framework integrates with pandas (for now) as an extension. The risk measures it refers to are:
- Marketer risk
- Prosecutor risk
- Journalist risk
- Pitman risk
    import numpy as np
    import pandas as pd
    from pandas_risk import *  # makes the `risk` accessor available on DataFrames

    # Sample: 50 rows of small random integers in two columns
    mydf = pd.DataFrame({"x": np.random.choice(np.random.randint(1, 10), 50),
                         "y": np.random.choice(np.random.randint(1, 10), 50)})
    print(mydf.risk.evaluate())

    #
    # Computing journalist and Pitman risk
    # - Ensure the population size is much greater than the sample size
    # - Ensure the fields are identical in both sample and population
    #
    pop = pd.DataFrame({"x": np.random.choice(np.random.randint(1, 10), 150),
                        "y": np.random.choice(np.random.randint(1, 10), 150)})
    print(mydf.risk.evaluate(pop=pop))
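
The two conditions on the population (much larger than the sample, same fields) can be checked before calling `evaluate`. The helper below is a hypothetical sketch, not part of `pandas_risk`, and the `min_ratio` threshold is an arbitrary illustrative choice since "much greater" is not quantified here.

```python
import pandas as pd


def check_population(sample, pop, min_ratio=2):
    """Return a list of problems; an empty list means both checks pass.

    min_ratio is an illustrative threshold, not a library setting.
    """
    problems = []
    if set(sample.columns) != set(pop.columns):
        problems.append("sample and population have different fields")
    if len(pop) < min_ratio * len(sample):
        problems.append("population is not much larger than the sample")
    return problems


sample = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
population = pd.DataFrame({"x": [1, 2, 1, 2, 1, 2], "y": [3, 4, 3, 4, 3, 4]})

print(check_population(sample, population))              # both checks pass
print(check_population(sample, population.assign(q=0)))  # extra field "q" is reported
```
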
- Evaluate how sparse attributes are (the ratio of non-null values to rows)
- Have a smart way to drop attributes (based on the above, in a random policy search)
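
The sparsity ratio mentioned above (non-null values over rows) can be sketched in plain pandas. The function names and the 0.5 threshold are assumptions for illustration; a fixed threshold is a much simpler stand-in for the "smart" attribute-dropping policy described in the list, not an implementation of it.

```python
import numpy as np
import pandas as pd


def non_null_ratio(df):
    """Fraction of rows holding a non-null value, per column."""
    return df.notna().mean()


def drop_sparse(df, min_ratio=0.5):
    """Keep only columns whose non-null ratio is at least min_ratio (illustrative threshold)."""
    ratios = non_null_ratio(df)
    return df.loc[:, ratios >= min_ratio]


toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [1, np.nan, np.nan, np.nan],
})
print(non_null_ratio(toy))              # x is fully populated, y is 25% populated
print(list(drop_sparse(toy).columns))   # only "x" passes the 0.5 threshold
```
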
Basic examples that illustrate usage of the framework are in the notebook folder. The example is derived from
Dependencies:
- numpy
- pandas
Limitations:
@TODO:
- Add support for journalist risk