__init__.py

  1. """
  2. # Re-Identification Risk
  3. This framework computes re-identification risk of a dataset by extending pandas. It works like a pandas **add-on**
  4. The framework will compute the following risk measures: marketer, prosecutor, journalist and pitman risk.
  5. References for the risk measures can be found on
  6. - http://www.ehealthinformation.ca/wp-content/uploads/2014/08/2009-De-identification-PA-whitepaper1.pdf
  7. - https://www.scb.se/contentassets/ff271eeeca694f47ae99b942de61df83/applying-pitmans-sampling-formula-to-microdata-disclosure-risk-assessment.pdf
There are two modes available:

**explore:**
Here the assumption is that we are not sure which attributes will be disclosed; the framework generates random combinations of attributes and evaluates each of them, reporting all of the risk measures.

**evaluation:**
Here the assumption is that we know the set of attributes to be used, and we are interested in computing the associated risk.
### Four risk measures are computed:

- Marketer risk
- Prosecutor risk
- Journalist risk
- Pitman risk
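
The group-size intuition behind the first two measures can be sketched as follows. This is a minimal illustration, not the framework's own implementation; `marketer_prosecutor_risk` is a hypothetical helper named here for clarity:

```python
import pandas as pd

def marketer_prosecutor_risk(df, columns):
    # Records sharing the same values on `columns` form an equivalence class.
    group_sizes = df.groupby(columns).size()
    # Marketer risk: classes / records, i.e. the expected fraction of
    # records an attacker matching on these attributes re-identifies.
    marketer = len(group_sizes) / len(df)
    # Prosecutor risk: 1 / smallest class size, i.e. the worst-case
    # risk for a single targeted record in the sample.
    prosecutor = 1.0 / group_sizes.min()
    return marketer, prosecutor

df = pd.DataFrame({"x": [1, 1, 2, 2, 2, 3], "y": [0, 0, 0, 1, 1, 1]})
print(marketer_prosecutor_risk(df, ["x", "y"]))
```

Here the six records fall into four equivalence classes, two of which are singletons, so the marketer risk is 4/6 and the prosecutor risk is 1.0.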
### Usage:

Install this package using pip as follows:

Stable:

    pip install git+https://hiplab.mc.vanderbilt.edu/git/steve/deid-risk.git

Latest development (not fully tested):

    pip install git+https://hiplab.mc.vanderbilt.edu/git/steve/deid-risk.git@risk

The framework depends on pandas and numpy (for now). Below is a basic sample to get you started quickly.
    import numpy as np
    import pandas as pd
    import risk

    mydf = pd.DataFrame({"x": np.random.choice(np.random.randint(1, 10), 50),
                         "y": np.random.choice(np.random.randint(1, 10), 50),
                         "z": np.random.choice(np.random.randint(1, 10), 50),
                         "r": np.random.choice(np.random.randint(1, 10), 50)})
    print(mydf.risk.evaluate())

    #
    # Computing journalist and Pitman risk:
    # - Ensure the population size is much greater than the sample size
    # - Ensure the fields are identical in both sample and population
    #
    pop = pd.DataFrame({"x": np.random.choice(np.random.randint(1, 10), 150),
                        "y": np.random.choice(np.random.randint(1, 10), 150),
                        "z": np.random.choice(np.random.randint(1, 10), 150),
                        "r": np.random.choice(np.random.randint(1, 10), 150)})
    print(mydf.risk.evaluate(pop=pop))
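
To see why the population table matters, here is one common formulation of (maximum) journalist risk, sketched independently of the framework: one over the size of the smallest population equivalence class that also appears in the sample. The `journalist_risk` helper below is hypothetical and only illustrates the idea:

```python
import pandas as pd

def journalist_risk(sample, population, columns):
    # Size of each population equivalence class on `columns`.
    pop_sizes = population.groupby(columns).size().rename("pop_size")
    # Keep only the population classes actually present in the sample;
    # this is why both tables must share identical fields.
    sample_classes = sample[columns].drop_duplicates()
    matched = sample_classes.merge(pop_sizes.reset_index(),
                                   on=columns, how="inner")
    # Worst case: the rarest matching class in the population.
    return 1.0 / matched["pop_size"].min()

sample = pd.DataFrame({"x": [1, 1, 2], "y": [0, 0, 1]})
population = pd.DataFrame({"x": [1, 1, 1, 2, 2, 3], "y": [0, 0, 0, 1, 1, 1]})
print(journalist_risk(sample, population, ["x", "y"]))
```

With population class sizes of 3 and 2 for the two classes present in the sample, the smallest matching class has 2 members, giving a risk of 0.5.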
@TODO:

- Evaluate how sparse attributes are (the ratio of non-null values over rows)
- Have a smart way to drop attributes (based on the above, in random policy search)

Basic examples that illustrate usage of the framework are in the notebook folder. The example is derived from
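
The sparsity ratio mentioned in the first TODO item can be computed directly in pandas; this one-liner is a sketch of the idea, not code from the framework:

```python
import pandas as pd

# Fraction of non-null values per attribute (column).
df = pd.DataFrame({"a": [1, None, 3, 4], "b": [None, None, 1, 2]})
sparsity = df.notnull().mean()
print(sparsity)  # a -> 0.75, b -> 0.50
```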
  41. """
  42. from risk import deid