Datafly algorithm


The Datafly algorithm is an algorithm for providing anonymity in medical data. It was developed by Latanya Arvette Sweeney in 1997-98. Anonymization is achieved by automatically generalizing, substituting, inserting, and removing information as appropriate without losing many of the details found within the data. The method can be used on the fly in role-based security within an institution, and in batch mode for exporting data from an institution.
Organizations release and receive medical data with all explicit identifiers, such as name, removed, in the erroneous belief that patient confidentiality is maintained because the resulting data look anonymous. However, the remaining data can be used to re-identify individuals by linking or matching the data to other databases or by looking at unique characteristics found in the fields and records of the database itself.
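As a minimal illustration of such linking, the sketch below matches a toy "de-identified" medical table against a toy public list on the fields they share. All records, field names, and the function name `link_records` are hypothetical and invented for this example.

```python
# Hypothetical toy data: no real records are used here.
medical = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1970, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1970, "sex": "M", "diagnosis": "flu"},
]
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1965, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1970, "sex": "M"},
]


def link_records(released, external, shared_fields):
    """Return (name, record) pairs where an external entry matches
    exactly one released record on the shared fields."""
    matches = []
    for person in external:
        key = tuple(person[f] for f in shared_fields)
        hits = [r for r in released
                if tuple(r[f] for f in shared_fields) == key]
        if len(hits) == 1:  # a unique match re-identifies the record
            matches.append((person["name"], hits[0]))
    return matches


reidentified = link_records(medical, public, ["zip", "birth_year", "sex"])
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

In this toy data the first person matches exactly one released record and is re-identified, while the second matches two records and remains ambiguous.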
The Datafly algorithm has been criticized for trying to achieve anonymization by over-generalization. The algorithm selects the attribute with the greatest number of distinct values as the one to generalize first.
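The selection heuristic described above can be sketched as follows. This is an illustration under stated assumptions, not Sweeney's implementation: the table is represented as a list of dictionaries, and the names `frequency_list` and `attribute_to_generalize` are hypothetical.

```python
from collections import Counter


def frequency_list(table, quasi_identifier):
    """Count how often each combination of quasi-identifier values occurs."""
    return Counter(tuple(row[a] for a in quasi_identifier) for row in table)


def attribute_to_generalize(freq, quasi_identifier):
    """Pick the attribute with the most distinct values in the frequency list."""
    distinct = {
        attr: len({seq[i] for seq in freq})
        for i, attr in enumerate(quasi_identifier)
    }
    return max(distinct, key=distinct.get)


# Hypothetical toy table, invented for this example.
table = [
    {"zip": "02138", "birth_year": 1965, "sex": "F"},
    {"zip": "02139", "birth_year": 1965, "sex": "M"},
    {"zip": "02141", "birth_year": 1970, "sex": "F"},
    {"zip": "02142", "birth_year": 1970, "sex": "M"},
]
qi = ["zip", "birth_year", "sex"]
freq = frequency_list(table, qi)
chosen = attribute_to_generalize(freq, qi)
print(chosen)
```

In this toy table `zip` has four distinct values, more than any other attribute, so it is the one selected for generalization first.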

Core algorithm

An outline of the Datafly algorithm is presented below.
Input:
Private Table PT; quasi-identifier QI = (A1, ..., An); k-anonymity constraint k; domain generalization hierarchies DGHAi, where i = 1,...,n, with accompanying functions fAi; and loss, which is a limit on the percentage of tuples that can be suppressed. PT[id] is the set of unique identifiers or keys for each tuple.
Output:
MGT, a generalization of PT that enforces k-anonymity
Assumes: | PT | ≥ k, and loss * | PT | = k
algorithm Datafly:
// Construct a frequency list containing unique sequences of values across the quasi-identifier in PT,
// along with the number of occurrences of each sequence.