Finding personally identifiable information (PII) in a text document can be useful for a few reasons, but one use case I've come across many times is to help anonymize text in order to:
Share data with third parties.
Comply with regulatory requirements such as GDPR.
Replace PII with simulated data to be used as training data for machine learning and other exploratory analyses.
I'm going to try to automate the process of finding PII, and in this series of articles we'll explore some popular open-source tools and techniques for identifying different types of PII in our own data.
In the first part we found a way to find people's names in text. Let's see what other types of PII we can find.
Duckling is a Haskell library, open-sourced by Facebook, for parsing text into structured data. Duckling can help us find different types of information in text, including credit card numbers, email addresses, and phone numbers.
Don't worry if you're not one of the three people who know Haskell: we can use Duckling from any programming language.
Let's see how we can use Duckling in a language that doesn't require a speech about the harms of ***.
Install git, docker, and docker-compose
    git clone git@github.com:facebook/duckling.git
Make a Docker Compose file in the cloned duckling repo.
docker-compose.yml:
    version: '3'
    services:
      duckling:
        build:
          context: .
        ports:
          - "8000:8000"
Start Duckling as a Docker service:
    docker-compose up
The Duckling service is now available via HTTP API over port 8000 on our localhost. Let's start making some calls to the API and see what we get:
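    import requests

    text = 'my email address is [email protected] and my number is +1 (650) 123-4567 so call me maybe?'

    # Duckling's /parse endpoint reads form-encoded fields: the locale and the text to parse
    response = requests.post('http://localhost:8000/parse', data={'locale': 'en_US', 'text': text})
    entities = response.json()

    # Each entity carries its dimension ('dim') and the matched substring ('body')
    for entity in entities:
        print(entity['dim'] + ": " + entity['body'])
This will print the following: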
    email: [email protected]
    phone-number: +1 (650) 123-4567
Wonderful! Duckling found the email address and phone number in our text, confirming that this text contains PII. Now let's see how it handles credit card numbers:
    import requests

    text = 'last christmas i gave you my card 4111-1111-1111-1111 but the very next day you gave it away'

    # Same call as before: form-encoded locale and text
    response = requests.post('http://localhost:8000/parse', data={'locale': 'en_US', 'text': text})
    entities = response.json()

    for entity in entities:
        print(entity['dim'] + ": " + entity['body'])
Can't wait to see that sweet credit card number printed. Let's see what it prints:
    credit-card-number: 4111-1111-1111-1111
    phone-number: 4111-1111-1111-1111
Uh... It detects our number as both a phone number and a credit card number. I guess it's better to be safe than sorry.
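If the goal is anonymization, a double detection like this one may not matter much, since the span gets masked either way. Here is a minimal sketch of that idea. The `redact` helper and the `<DIM>` placeholder format are my own inventions for illustration, and the code assumes each entity carries `start` and `end` character offsets (with `end` exclusive) alongside `dim`. It only collapses detections that cover exactly the same span; partially overlapping spans would need extra handling.

```python
def redact(text, entities):
    """Replace every detected span in `text` with a <DIM> placeholder.

    `entities` is a list of dicts with 'start', 'end' and 'dim' keys
    (assumed shape of the parsed Duckling response; 'end' is exclusive).
    """
    # Keep one entry per (start, end) span; the first dimension seen wins.
    spans = {}
    for entity in entities:
        spans.setdefault((entity['start'], entity['end']), entity['dim'])

    # Replace from the end of the string so earlier offsets stay valid.
    for (start, end), dim in sorted(spans.items(), reverse=True):
        text = text[:start] + '<' + dim.upper() + '>' + text[end:]
    return text


# Hand-written entities that mimic the double detection above.
sample = 'card 4111-1111-1111-1111 ok'
detected = [
    {'dim': 'credit-card-number', 'start': 5, 'end': 24},
    {'dim': 'phone-number', 'start': 5, 'end': 24},
]
print(redact(sample, detected))  # card <CREDIT-CARD-NUMBER> ok
```

Which label wins for a duplicated span depends on the order of the entities, so a real pipeline would probably want an explicit priority between dimensions.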
Duckling can help us find other types of data, or "dimensions" in Duckling's language, so feel free to browse the project's GitHub page and see what else is available.
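If we only care about a few dimensions, the example HTTP server can, as far as I know, be told to look for just those through a `dims` form field containing a JSON-encoded list of dimension names. The sketch below builds such a request body. `build_parse_payload` is a hypothetical helper name, and the `dims` encoding is an assumption worth checking against the version of Duckling you are running.

```python
import json


def build_parse_payload(text, dims=None, locale='en_US'):
    """Build the form fields for a POST to Duckling's /parse endpoint."""
    payload = {'locale': locale, 'text': text}
    if dims:
        # Assumption: the server expects a JSON array of dimension names.
        payload['dims'] = json.dumps(list(dims))
    return payload


# Usage with the service from this article (needs Duckling running locally):
#   import requests
#   payload = build_parse_payload(text, dims=['email', 'phone-number'])
#   entities = requests.post('http://localhost:8000/parse', data=payload).json()
```

Restricting the dimensions this way would also sidestep detections we don't care about for PII purposes.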
We can now add to the list of PII types we're able to find: personal names, email addresses, phone numbers, and credit card numbers. There's room for improvement; for example, we could use the Luhn algorithm to check whether a number is a credit card number rather than a phone number. That's beyond the scope of this series, though, as everyone needs to build their own use case on top of the topics discussed here.
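For the curious, here is roughly what a Luhn check looks like. This is a standalone sketch, not something Duckling exposes, and `luhn_valid` is a name I made up. It strips everything that isn't a digit, doubles every second digit counting from the right, and checks that the total is divisible by 10.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore separators such as spaces, dashes, and parentheses.
    digits = [int(ch) for ch in number if '0' <= ch <= '9']
    if len(digits) < 2:
        return False

    total = 0
    # Walk from the rightmost digit; double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid('4111-1111-1111-1111'))  # True
print(luhn_valid('4111-1111-1111-1112'))  # False
```

A failed check is good evidence that a number is not a card number, but a pass proves little on its own: about one in ten arbitrary digit strings passes, so it makes sense to combine it with other signals such as the number of digits.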
In the following articles we will see how other tools perform and what other types of PII they can help us find.