hash_names.Rd
This function uses the scrypt algorithm from libsodium to anonymise data, based on user-indicated data fields. Data fields are first concatenated, then each entry is hashed. The function can return either a full, detailed output or short labels ready to use as 'anonymised data'. Before being concatenated (using "_" as a separator) to form labels, inputs are modified using [clean_labels()].
hash_names(..., size = 6, full = TRUE, hashfun = "secure", salt = NULL, clean_labels = TRUE)
Argument | Description |
---|---|
... | Data fields to be hashed. |
size | The number of characters retained in the hash. |
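full | A logical indicating whether the full output should be returned (labels, short hashes, and full hashes, as shown in the examples); if `FALSE`, only the short hashes are returned. |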
hashfun | The hashing function to use. With "secure" (the default), [sodium::scrypt()] is used, which is secure but slow for large data sets. For fast hashing with no collisions, specify "fast" to use [sodium::sha256()], which is several orders of magnitude faster than [sodium::scrypt()]. You can also supply a hashing function that takes and returns a [raw][base::raw] vector of bytes that can be converted to character with [rawToChar()]. |
salt | An optional object that can be coerced to a character to be used to 'salt' the hashing algorithm (see details). Ignored if `NULL`. |
clean_labels | A logical indicating if labels of variables should be standardized; defaults to `TRUE` |
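
To make the raw-in, raw-out shape described for `hashfun` concrete, here is a minimal sketch that calls the two sodium functions named above directly. It assumes the sodium package is installed; `custom_hash` is a hypothetical name used only for illustration, and how `hash_names()` converts the returned bytes internally is not shown here.

```r
library(sodium)

# A label as it might look after cleaning and concatenation
input <- charToRaw("jane_doe")

# Both built-in options take a raw vector and return a raw vector
secure <- scrypt(input)   # slow by design
fast <- sha256(input)     # much faster

# Hypothetical custom function with the same raw-in, raw-out shape
custom_hash <- function(x) sha512(x)
custom <- custom_hash(input)

# Raw bytes can be rendered as a hexadecimal string for display
print(bin2hex(fast))
print(c(is.raw(secure), is.raw(fast), is.raw(custom)))
```

Any function with this shape should fit the description of the `hashfun` argument; whether a given function is appropriate for anonymisation is a separate question.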
The argument `salt` should be used for salting the algorithm, i.e. adding an extra input to the input fields (the 'salt') to change the resulting hash and prevent identification of individuals via pre-computed hash tables.
It is highly recommended to choose a secret, random salt in order to make it harder for an attacker to decode the hash.
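
The effect of salting can be illustrated with a small sketch built on [sodium::sha256()]. This is not the function's actual implementation: `sketch_hash` is a hypothetical helper, and appending the salt to the label with "_" is an assumption made for illustration only.

```r
library(sodium)

# Hypothetical helper, not part of the package: join the fields with "_",
# optionally append a salt, hash each label with SHA-256, and keep the
# first `size` hexadecimal characters.
sketch_hash <- function(..., salt = NULL, size = 6) {
  labels <- paste(..., sep = "_")
  if (!is.null(salt)) {
    # Assumed way of mixing in the salt, for illustration only
    labels <- paste(labels, salt, sep = "_")
  }
  hashes <- vapply(
    labels,
    function(x) bin2hex(sha256(charToRaw(x))),
    character(1),
    USE.NAMES = FALSE
  )
  substr(hashes, 1, size)
}

first_name <- c("jane", "joe")
last_name <- c("doe", "smith")

unsalted <- sketch_hash(first_name, last_name)
salted <- sketch_hash(first_name, last_name, salt = "foobar")

# The same inputs give different hashes once a salt is added
print(unsalted)
print(salted)
```

Because an attacker who does not know the salt cannot rebuild the exact input that was hashed, a table of hashes pre-computed from plausible names no longer matches the salted output.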
[clean_labels()], used to clean labels prior to hashing
[sodium::hash()] for available hashing functions.
first_name <- c("Jane", "Joe", "Raoul")
last_name <- c("Doe", "Smith", "Dupont")
age <- c(25, 69, 36)

# secure hashing
hash_names(first_name, last_name, age, hashfun = "secure")
#>             label hash_short
#> 1     jane_doe_25     6485f2
#> 2    joe_smith_69     ea1ccc
#> 3 raoul_dupont_36     f60676
#>                                                               hash
#> 1 6485f29654c5a9d55625cd6efeb96d569917e1c272790959ad3fa132c6d51648
#> 2 ea1cccce320aa45a0d694ea12c30ff6b4b52c67f69d58b23dad5441ea17c5807
#> 3 f60676d1c11ae5badc0e5ec4dfde06eaba817a78f3d54eb327a25df485ec1efd

# fast hashing
hash_names(first_name, last_name, age, size = 8, full = FALSE, hashfun = "fast")
#> [1] "f3bb3fb9" "1a62b1ef" "3f2a0e20"

## salting the hashing (more secure!)
hash_names(first_name, last_name) # unsalted - less secure
#>          label hash_short
#> 1     jane_doe     304b8f
#> 2    joe_smith     1b2f7a
#> 3 raoul_dupont     d46b41
#>                                                               hash
#> 1 304b8f9af58f2e4ed66ecb67a1a0e4f1608180f2961b09e99f109f79fb781932
#> 2 1b2f7a188e03fd13dc3a9937dc4e69245b3c5893a14813a244cb79366471b218
#> 3 d46b41ba61628e3413cb217cf65b0cc9b8547d002dd032bfff771802f281dec2

hash_names(first_name, last_name, salt = 123) # salted with an integer
#>          label hash_short
#> 1     jane_doe     7d82bc
#> 2    joe_smith     c7e089
#> 3 raoul_dupont     81af59
#>                                                               hash
#> 1 7d82bcc6922728b15c1695e54957c8435bea7658b21d3282cb39554073a4cfab
#> 2 c7e089a24d5ff4cf2608fe8a16d6dd5a4711fbf0d78247b67d18416c1bc4f1e1
#> 3 81af59670506b2f6402d9751aadb25ca8678b14183394ea6f1ec32d022b50d66

hash_names(first_name, last_name, salt = "foobar") # salted with a character
#>          label hash_short
#> 1     jane_doe     cd525d
#> 2    joe_smith     ad9dd4
#> 3 raoul_dupont     e5aba9
#>                                                               hash
#> 1 cd525deb528a070fe124e03922f12555b40f3b012ce09b79a97b256ce1b5b8fb
#> 2 ad9dd482fb90e7e8e50567b9e9664013b064fc9dbfa7f9646642bab85cbd8cf9
#> 3 e5aba9f7a4feed03c7dc66533d01dc6768236105daf2768c9c7769fee54735d1

## using a different hash algorithm if you want things to run faster
hash_names(first_name, last_name, hashfun = "fast") # use sha256 algorithm
#>          label hash_short
#> 1     jane_doe     a79b84
#> 2    joe_smith     661b7f
#> 3 raoul_dupont     66ec68
#>                                                               hash
#> 1 a79b84d8a9787704e9760eb81286676ef64ece85ea780a0793f3de8e698185f9
#> 2 661b7ffc27bc217bdd04a085b2cafe698d36496d0d3c372a89d4f77f0115ad8c
#> 3 66ec68882bd0f52e6861078067003c291290504f3f87298d091999668b5901cf