Mathematics, risk, and messy survey data
Keywords: data, data de-identification, anonymization, anonymity, survey data
Research funder mandates, such as those from the U.S. National Science Foundation (2011), the Canadian Tri-Agency (draft, 2018), and the UK Economic and Social Research Council (2018), now often include requirements for data curation, including, where possible, data sharing in an approved archive. Data curators need to be prepared for the possibility that researchers who have not previously shared data will need assistance cleaning and depositing datasets so that they can meet these requirements and maintain funding. Data de-identification, or anonymization, is a major ethical concern when survey data is to be shared, and one that data professionals may find themselves ill-equipped to handle. This article provides an accessible and practical introduction to the theory and concepts behind data anonymization and risk assessment, describes two case studies that demonstrate how these methods were carried out on actual datasets requiring anonymization, and discusses some of the difficulties encountered. Much of the literature dealing with statistical risk assessment of anonymized data is abstract and aimed at computer scientists and mathematicians, while material aimed at practitioners often does not consider more recent developments in the theory of data anonymization. We hope this article will help bridge the gap.
Copyright (c) 2020 Kristi Anne Thompson, Carolyn Sullivan
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
"This license lets others remix, tweak, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms."