Tree Traversal to Achieve Generalization for Data De-identification

Jose Carlos Donaldo Hernandez, David Anastacio Garcia, Farshad Rabbani


Every day data is published of different types and
from various sources. Data de-identification protects the privacy
of most of this data before its publication. Over the recent years,
a technique proposed by Dr. Sweeney, known as kanonymization
as a means for privacy protection has gained
great popularity. There has been intensive research involving this
method and many alterations, in the hope to find an optimal
solution in real-time to the generalization problem. To achieve
either generalization or suppression, researchers have used
different types of heuristics, most of them being tree-based.
Although this is a heavily investigated area, there is no simple
method to prepare data for generalization; in theory, there are
infinite methods for data preparation and partitioning. In this
research, we first propose the use of commonly known algorithms
to prepare data and achieve generalization. We then introduce
the use of tree-based algorithms and tree traversal as the
mechanism to achieve data generalization. We further investigate
them, by comparing the quality of generalization sets obtained in
each traversal method, in the hope to determine which method is

Full Text:



K. Yang, X. Jia, K. Zhang, and others, “Privacy-Preserving Data

Publish-Subscribe Service on Cloud-based Platforms,” 2014.

L. Jiang, Proceedings of the 2011 International Conference on

Informatics, Cybernetics, and Computer Engineering (ICCE2011)

November 19-20, 2011, Melbourne, Australia: Volume 1: Intelligent

Control and Network Communication, vol. 110. Springer Science &

Business Media, 2011.

R. Turn, “Information privacy issues for the 1990s,” in Research in

Security and Privacy, 1990. Proceedings., 1990 IEEE Computer Society

Symposium on, 1990, pp. 394–400.

V. S. Iyengar, “Transforming data to satisfy privacy constraints,” in

Proceedings of the eighth ACM SIGKDD international conference on

Knowledge discovery and data mining, 2002, pp. 279–288.

J. Nin and J. Herranz, Privacy and Anonymity in Information

Management Systems. Springer, 2010.

A. Manta, “Literature survey on privacy preserving mechanisms for data publishing,” 2013.

A. Saroha, “Survey of k-Anonymity,” 2014.

C. Thomas and D. Thomas, “A Survey on Privacy Preservation in Data


O. Heffetz and K. Ligett, “Privacy and data-based research,” Journal of Economic Perspectives, vol. 28, no. 2, pp. 75–98, 2014.

J. Cao and P. Karras, “Publishing microdata with a robust privacy

guarantee,” Proceedings of the VLDB Endowment, vol. 5, no. 11, pp.

–1399, 2012.

S. L. Garfinkel, “De-identification of personal information,” NISTIR,

vol. 8053, pp. 1–46, 2015.

M. Callahan, “Us dhs handbook for safeguarding sensitive personally

identifiable information,” Washington, DC, 2012.

G. Navarro-Arribas and V. Torra, Advanced research in data privacy,

vol. 567. Springer, 2014.

F. Kohlmayer, F. Prasser, and K. A. Kuhn, “The cost of quality:

Implementing generalization and suppression for anonymizing

biomedical data with minimal information loss,” Journal of biomedical

informatics, vol. 58, pp. 37–48, 2015.

B. C. Fung, K. Wang, A. W.-C. Fu, and S. Y. Philip, Introduction to

privacy-preserving data publishing: Concepts and techniques. Chapman

and Hall/CRC, 2010.

B. C. Fung, K. Wang, and P. S. Yu, “Top-down specialization for

information and privacy preservation,” in Data Engineering, 2005.

ICDE 2005. Proceedings. 21st International Conference on, 2005, pp.


Z. Huang, “Extensions to the k-means algorithm for clustering large data sets with categorical values,” Data mining and knowledge discovery,

vol. 2, no. 3, pp. 283–304, 1998.

D. Riboni, L. Pareschi, and C. Bettini, “Preserving Privacy in Sequential Data Release against Background Knowledge Attacks,” arXiv preprint arXiv:1010.0924, 2010.

E. McCallister, Guide to protecting the confidentiality of personally

identifiable information. Diane Publishing, 2010.

L. Sweeney, “Achieving k-anonymity privacy protection using

generalization and suppression,” International Journal of Uncertainty,

Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 571–588, 2002.

M. Press, Microsoft Press computer dictionary: the comprehensive

standard for business, school, library, and home. Microsoft Pr, 1994.

K. El Emam and F. K. Dankar, “Protecting privacy using k-anonymity,” Journal of the American Medical Informatics Association, vol. 15, no. 5, pp. 627–637, 2008.

L. Sweeney, “k-anonymity: A model for protecting privacy,”

International Journal of Uncertainty, Fuzziness and Knowledge-Based

Systems, vol. 10, no. 05, pp. 557–570, 2002.

A. et al. Machanavajjhala, “l-diversity: Privacy beyond k-anonymity,”

ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1,

no. 1, p. 3, 2007.

P. Samarati, “Protecting respondents identities in microdata release,” IEEE transactions on Knowledge and Data Engineering, vol. 13, no. 6, pp. 1010–1027, 2001.

W. E. Yancey, W. E. Winkler, and R. H. Creecy, “Disclosure risk

assessment in perturbative microdata protection,” in Inference control in

statistical databases, Springer, 2002, pp. 135–152.

P. Samarati and L. Sweeney, “Generalizing data to provide anonymity when disclosing information,” in PODS, 1998, vol. 98, p. 188.

J.-W. Byun, A. Kamra, E. Bertino, and N. Li, “Efficient kanonymization using clustering techniques,” in International Conference

on Database Systems for Advanced Applications, 2007, pp. 188–200.

L. Willenborg and T. De Waal, Elements of statistical disclosure

control, vol. 155. Springer Science & Business Media, 2012.

S. Chawla, C. Dwork, F. McSherry, A. Smith, and H. Wee, “Toward

privacy in public databases,” in Theory of Cryptography Conference, 2005, pp. 363–385.

A. Meyerson and R. Williams, “On the complexity of optimal kanonymity,” in Proceedings of the twenty-third ACM SIGMODSIGACT-

SIGART symposium on Principles of database systems, 2004, pp. 223–228.

M. E. Kabir, H. Wang, and E. Bertino, “Efficient systematic clustering

method for k-anonymization,” Acta Informatica, vol. 48, no. 1, pp. 51– 66, 2011.

V. Ciriani, S. D. C. Di Vimercati, S. Foresti, and P. Samarati, “kanonymous data mining: A survey,” in Privacy-preserving data mining,

Springer, 2008, pp. 105–136.

W. K. Wong, N. Mamoulis, and D. W. L. Cheung, “Non-homogeneous

generalization in privacy preserving data publishing,” in Proceedings of

the 2010 ACM SIGMOD International Conference on Management of

data, 2010, pp. 747–758.

S. Pigeon, “Huffman coding,” Lossless Compression Handbook, pp. 79–100, 2003.

S. S. Skiena, The algorithm design manual: Text, vol. 1. Springer

Science & Business Media, 1998.

K. Mehlhorn and P. Sanders, Algorithms and data structures: The basic toolbox. Springer Science & Business Media, 2008.

D. A. Huffman, “A method for the construction of minimumredundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms. MIT press, 2009.

S. Martello, “Knapsack problems: algorithms and computer

implementations,” Wiley-Interscience series in discrete mathematics and

optimization, 1990.

E. Bertino, D. Lin, and W. Jiang, “A survey of quantification of privacy preserving data mining algorithms,” in Privacy-preserving data mining, Springer, 2008, pp. 183–205.

J.-L. Lin and M.-C. Wei, “An efficient clustering method for kanonymization,” in Proceedings of the 2008 international workshop on

Privacy and anonymity in information society, 2008, pp. 46–50.

R. J. Bayardo and R. Agrawal, “Data privacy through optimal kanonymization,” in Data Engineering, 2005. ICDE 2005. Proceedings.

st International Conference on, 2005, pp. 217–228.

A. Solanas, F. Sebé, and J. Domingo-Ferrer, “Micro-aggregation-based heuristics for p-sensitive k-anonymity: one step beyond,” in Proceedings of the 2008 international workshop on Privacy and anonymity in information society, 2008, pp. 61–69.

G. Ghinita, P. Karras, P. Kalnis, and N. Mamoulis, “Fast data

anonymization with low information loss,” in Proceedings of the 33rd

international conference on Very large data bases, 2007, pp. 758–769.

J. Soria-Comas, “Improving data utility in differential privacy and kanonymity,” arXiv preprint arXiv:1307.0966, 2013.

C. Fang and E.-C. Chang, “Information leakage in optimal anonymized and diversified data,” in International Workshop on Information Hiding, 2008, pp. 30–44.

K. B. Frikken and Y. Zhang, “Yet another privacy metric for publishing micro-data,” in Proceedings of the 7th ACM workshop on Privacy in the electronic society, 2008, pp. 117–122.

S. M. Morton, “An Improved Utility Driven Approach Towards kanonymity Using Data Constraint Rules,” 2013.

T. Li and N. Li, “Towards optimal k-anonymization,” Data &

Knowledge Engineering, vol. 65, no. 1, pp. 22–39, 2008.

G. T. Duncan and S. Mukherjee, “Optimal disclosure limitation strategy in statistical databases: Deterring tracker attacks through additive noise,” Journal of the American Statistical Association, vol. 95, no. 451, pp. 720–729, 2000.

“UC Irvine Machine Learning Repository: Adult Data Set:” [Online].

Available: [Accessed: 04- Jul-2016].

K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, “Incognito: Efficient

full-domain k-anonymity,” in Proceedings of the 2005 ACM SIGMOD

international conference on Management of data, 2005, pp. 49–60.


  • There are currently no refbacks.

Abava   MSU conference 2018

ISSN: 2307-8162