Amazon Comprehend - detecting PII data (introduction part)

By Daniel Aniszkiewicz · 25 April, 2021

Introduction

In the previous blog post, we learned what PII data is, which scenario we need to solve, and which potential solutions could help. We decided that a cloud-based ML approach would be the best fit. In this blog post, I will focus on a theoretical description of the Amazon Comprehend service.

The goals of this post are:

get a high-level overview of Amazon Comprehend.
get familiar with what the service can offer in the case of PII data.
test the service through the AWS console.

General content

Amazon Comprehend is a cloud computing service of Amazon Web Services. As stated in the official documentation:

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to discover information in unstructured data. Instead of combing through documents, the process is simplified and unnoticed information is easier to understand.

The first developer guide dates from November 29, 2017. Since then, the service has been significantly improved, and new features have been added as well.

The service itself is very comprehensive, allowing for:

Key phrase extraction to identify key noun phrases used in the text.
Sentiment analysis for detecting the overall sentiment of the text (positive, negative, neutral, or mixed).
Entity recognition for detecting named entities that are automatically categorized based on the text provided.
Language detection for language identification (from over 100 supported languages).
PII detection for detecting sensitive data (which we will use in the next steps).
Comprehend Medical for extracting information from unstructured medical text.
Many other interesting features.

For custom data that AWS does not support out of the box, you can train your own custom model in Amazon Comprehend (a custom classifier or a custom entity recognizer) to detect it, and you don't need to be an ML expert.

Recently I wondered why the service is called Amazon Comprehend rather than AWS Comprehend. I found a pretty nice explanation on Stack Overflow:

The pattern is that utility services are prefixed with AWS, while standalone services are prefixed by “Amazon”. Services prefixed with AWS typically use other services.

If your data is not in English, please check the language support page in the Amazon Comprehend documentation to see whether the specific feature supports the language you need.

PII detection with Amazon Comprehend

Since September 17, 2020, Amazon Comprehend has supported working with PII data (which Amazon Comprehend calls PII entities). The service allows:

detecting PII data (Amazon Comprehend automatically identifies PII entities).
labeling detected data with PII entity types (a full list can be found in the PII Entity Types section of the documentation).
redacting PII entities (replacing the characters in PII entities with a character of your choice: !, #, $, %, &, *, or @).
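To make the masking idea concrete, here is a minimal local sketch of what redaction amounts to. It is not how Amazon Comprehend implements redaction internally; it only assumes that each detected entity comes with a `BeginOffset` and an exclusive `EndOffset` into the text. The `mask_pii` helper and the sample data are hypothetical and written by hand for illustration.

```python
from typing import Dict, List

# Mask characters listed above for PII redaction.
ALLOWED_MASK_CHARS = set("!#$%&*@")


def mask_pii(text: str, entities: List[Dict], mask_char: str = "*") -> str:
    """Replace every character inside each detected span with mask_char."""
    if mask_char not in ALLOWED_MASK_CHARS:
        raise ValueError(f"mask_char must be one of {sorted(ALLOWED_MASK_CHARS)}")
    chars = list(text)
    for entity in entities:
        # Assumption: EndOffset is exclusive, like a Python slice end.
        for i in range(entity["BeginOffset"], entity["EndOffset"]):
            chars[i] = mask_char
    return "".join(chars)


# Hand-written sample, not real service output.
sample_text = "My name is Jane Doe."
sample_entities = [
    {"Type": "NAME", "BeginOffset": 11, "EndOffset": 19, "Score": 0.99},
]

masked = mask_pii(sample_text, sample_entities, "#")
print(masked)
```

Note that the whole span is masked, including the space between the first and last name, because the entity covers it.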

If you're wondering how long detection takes, don't worry: it can happen in real time. You can detect PII entities with both real-time synchronous operations and batch asynchronous jobs, which are pretty useful for larger datasets. For asynchronous processing, your input data needs to be stored in an S3 bucket, and the output is written to an S3 bucket as well. Note that you must use an asynchronous job if you want to produce output with redacted PII entities.
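The difference between the two modes can be sketched with boto3, the AWS SDK for Python. This is only a rough illustration under a few assumptions: the bucket names, the IAM role ARN, and the job name are hypothetical placeholders, and the functions that actually call AWS need boto3 installed plus valid credentials, so they are defined here but not called.

```python
from typing import Any, Dict


def build_redaction_job_request(
    input_s3_uri: str,
    output_s3_uri: str,
    role_arn: str,
    mask_char: str = "*",
) -> Dict[str, Any]:
    """Build the parameters for an asynchronous PII redaction job."""
    return {
        "JobName": "pii-redaction-example",  # hypothetical job name
        "LanguageCode": "en",
        "Mode": "ONLY_REDACTION",  # use "ONLY_OFFSETS" to get locations only
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            "InputFormat": "ONE_DOC_PER_LINE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "RedactionConfig": {
            "PiiEntityTypes": ["ALL"],
            "MaskMode": "MASK",
            "MaskCharacter": mask_char,
        },
    }


def detect_pii_sync(text: str) -> list:
    """Real-time detection: returns entity types, scores, and offsets."""
    import boto3  # imported lazily; requires AWS credentials at call time

    client = boto3.client("comprehend")
    response = client.detect_pii_entities(Text=text, LanguageCode="en")
    return response["Entities"]


def start_redaction_job(request: Dict[str, Any]) -> str:
    """Batch processing: reads from S3, writes redacted output to S3."""
    import boto3  # imported lazily; requires AWS credentials at call time

    client = boto3.client("comprehend")
    response = client.start_pii_entities_detection_job(**request)
    return response["JobId"]


# Placeholder values, replace them with your own resources.
request = build_redaction_job_request(
    "s3://example-input-bucket/documents/",
    "s3://example-output-bucket/redacted/",
    "arn:aws:iam::123456789012:role/ExampleComprehendDataAccessRole",
)
print(request["Mode"], request["RedactionConfig"])
```

Keeping the request builder separate from the AWS call makes it easy to inspect the parameters before anything is sent to the service.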

For custom PII data types, you can train your own custom entity recognizer in Amazon Comprehend, using either annotations or entity lists.

Let's test the service.

1. Log in to the AWS console.
2. Navigate to the Amazon Comprehend service.
3. You will see the information page.
4. Simply click on Launch Amazon Comprehend.
5. Scroll down to the Input text section.

Let's insert some text with PII data. After submitting, you will see the Insights tab. Simply select the PII tab to see the PII detection results.

Within the Insights tab you will see:

Entity, the piece of the text you provided that was detected.
Type, the PII entity category that Amazon Comprehend assigned to the detected entity.
Confidence, the score that indicates how confident Amazon Comprehend is that the entity is PII data.
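The same three columns can be rebuilt from an API-style result. The sketch below assumes a response shaped like the output of the DetectPiiEntities operation, where each entity has `Type`, `Score`, `BeginOffset`, and `EndOffset`, and the entity text itself is recovered from the offsets. The sample response is written by hand for illustration and is not real service output; `summarize_entities` is a hypothetical helper.

```python
from typing import Any, Dict, List


def summarize_entities(text: str, response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a PII detection response into Entity / Type / Confidence rows."""
    rows = []
    for entity in response["Entities"]:
        rows.append(
            {
                # The response carries offsets, so we slice the original text.
                "entity": text[entity["BeginOffset"]:entity["EndOffset"]],
                "type": entity["Type"],
                "confidence": entity["Score"],
            }
        )
    return rows


# Hand-written sample, not real service output.
sample_text = "Contact Jane Doe at jane@example.com."
sample_response = {
    "Entities": [
        {"Score": 0.9991, "Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
        {"Score": 0.9987, "Type": "EMAIL", "BeginOffset": 20, "EndOffset": 36},
    ]
}

rows = summarize_entities(sample_text, sample_response)
for row in rows:
    print(f'{row["entity"]:<20} {row["type"]:<8} {row["confidence"]:.4f}')
```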

Underneath the results, you will see the application integration section, which shows, for example, how to use the feature via the API and which specific endpoint needs to be used. You will learn how to do that in the next part of this blog series.

Pricing

For PII data, the pricing depends on which endpoint you use: you can either only detect PII data, or detect and redact it. In both cases you pay per request to the Amazon Comprehend API. Requests are measured in units of 100 characters (1 unit = 100 characters), with a 3 unit (300 characters) minimum charge per request. It is worth checking the pricing examples on the pricing page (in our case, examples 7 and 8).
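The unit arithmetic can be sketched in a few lines. Two assumptions here: partial units are rounded up to the next full unit, and the price per unit is passed in as a parameter, because the actual rates should be taken from the official pricing page. The value used in the example call is only a placeholder.

```python
import math

UNIT_SIZE = 100  # characters per unit
MIN_UNITS = 3    # minimum charge per request


def billable_units(char_count: int) -> int:
    """Number of units charged for a single request."""
    # Assumption: a partially filled unit is billed as a full unit.
    return max(MIN_UNITS, math.ceil(char_count / UNIT_SIZE))


def request_cost(char_count: int, price_per_unit: float) -> float:
    """Cost of a single request for a given price per unit."""
    return billable_units(char_count) * price_per_unit


for size in (50, 300, 301, 1000):
    print(size, "characters ->", billable_units(size), "units")

# Placeholder price, check the pricing page for the real rate.
print(request_cost(301, price_per_unit=0.0001))
```

A 50-character request is still billed as 3 units because of the minimum charge, so many tiny requests can cost more than a few larger ones.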

Summary

We've learned what Amazon Comprehend is at a high level.
We've seen what possibilities Amazon Comprehend offers for PII detection.
We've played around with Amazon Comprehend via the AWS console.
We've become familiar with how pricing works.