__init__.py 1.8 KB

"""
(c) 2019, Health Information Privacy Lab
Brad. Malin, Weiyi Xia, Steve L. Nyemba

This framework computes the re-identification risk of a dataset, assuming the data being shared can be loaded into a pandas DataFrame.
The framework computes the following risk measures:
    - marketer
    - prosecutor
    - journalist
    - pitman

References:
    https://www.scb.se/contentassets/ff271eeeca694f47ae99b942de61df83/applying-pitmans-sampling-formula-to-microdata-disclosure-risk-assessment.pdf

This framework integrates with pandas (for now) as an extension and can be used in two modes:
1. explore:
    Here the assumption is that we are not sure which attributes will be disclosed.
    The framework explores a variety of random attribute combinations and associates risk measures with every combination it comes up with.
2. evaluation:
    Here the assumption is that we are clear on the set of attributes to be used and we are interested in computing the associated risk.

Four risk measures are computed:
    - Marketer risk
    - Prosecutor risk
    - Journalist risk
    - Pitman risk
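For intuition, the marketer and prosecutor measures can be sketched directly with a pandas groupby. This is a minimal sketch under the commonly used definitions (marketer risk as the number of equivalence classes divided by the number of records, prosecutor risk as one over the smallest equivalence-class size). Whether the framework computes them in exactly this way is an assumption, and `sketch_sample_risk` is a hypothetical helper that is not part of pandas_risk.

```python
import pandas as pd

def sketch_sample_risk(df, cols=None):
    # Hypothetical helper, NOT part of pandas_risk: illustrates the
    # commonly used definitions of marketer and prosecutor risk.
    cols = list(df.columns) if cols is None else list(cols)
    # An equivalence class is the set of rows sharing the same values on cols.
    class_sizes = df.groupby(cols).size()
    return {
        # Number of equivalence classes divided by number of records.
        "marketer": len(class_sizes) / len(df),
        # Worst case: a record in the smallest equivalence class.
        "prosecutor": 1.0 / class_sizes.min(),
    }

toy = pd.DataFrame({"x": [1, 1, 2, 2, 3], "y": [0, 0, 1, 1, 1]})
risk = sketch_sample_risk(toy)
print(risk)
```

In the toy frame there are three equivalence classes over five rows, and one class holds a single row, so that row is unique on (x, y) and drives the prosecutor risk to its maximum.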
Usage:
    import numpy as np
    import pandas as pd
    from pandas_risk import *

    mydf = pd.DataFrame({"x": np.random.choice(np.random.randint(1, 10), 50), "y": np.random.choice(np.random.randint(1, 10), 50)})
    print(mydf.risk.evaluate())

    #
    # computing journalist and pitman
    #   - Ensure the population size is much greater than the sample size
    #   - Ensure the fields are identical in both sample and population
    #
    pop = pd.DataFrame({"x": np.random.choice(np.random.randint(1, 10), 150), "y": np.random.choice(np.random.randint(1, 10), 150), "q": np.random.choice(np.random.randint(1, 10), 150)})
    mydf.risk.evaluate(pop=pop)
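To illustrate why a population frame is needed, journalist risk can be sketched as one over the smallest population equivalence class among the classes that appear in the sample. This is a sketch under that common definition, not necessarily the framework's own computation. `sketch_journalist_risk` is a hypothetical helper, and restricting the comparison to the columns shared by both frames is an assumption about how mismatched fields might be handled.

```python
import pandas as pd

def sketch_journalist_risk(sample, pop):
    # Hypothetical helper, NOT part of pandas_risk.
    # Assumption: only columns present in both frames are compared.
    cols = [c for c in sample.columns if c in pop.columns]
    # Size of every equivalence class in the population.
    pop_sizes = pop.groupby(cols).size().reset_index(name="pop_size")
    # Distinct equivalence classes that occur in the sample.
    sample_classes = sample[cols].drop_duplicates()
    # Keep only population classes that are represented in the sample.
    matched = sample_classes.merge(pop_sizes, on=cols, how="inner")
    if matched.empty:
        return None
    return 1.0 / matched["pop_size"].min()

toy_sample = pd.DataFrame({"x": [1, 2], "y": [0, 1]})
toy_pop = pd.DataFrame({
    "x": [1, 1, 1, 2, 2, 3],
    "y": [0, 0, 0, 1, 1, 1],
    "q": [7, 8, 9, 7, 8, 9],
})
journalist = sketch_journalist_risk(toy_sample, toy_pop)
print(journalist)
```

The population class (3, 1) has a single member but does not occur in the sample, so it does not affect the result. The smallest matching population class has two members.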
@TODO:
    - Evaluate how sparse attributes are (the ratio of non-null values to rows)
    - Have a smart way to drop attributes (based on the above, in random policy search)
"""