Amazon Comprehend - detecting PII data (coding part)

By Daniel Aniszkiewicz · 3 May, 2021

Introduction

In the previous blog post, we covered the theory behind Amazon Comprehend: what it is and how it works with PII data. We also ran a few detections via the AWS console. This time, we will implement PII detection ourselves in a Rails application using the AWS SDK.

The goals of this post are:

  • get our hands dirty with the Amazon Comprehend API from code.
  • implement PII data detection in the application.

For the purposes of the next blog posts, I’ve created a basic Rails application that helps visualize what we're trying to achieve and gives us more practice with Amazon Comprehend. In the app, you can create a note via a form, and all notes are listed underneath. The repo can be found here.
We will reuse the repository for the next couple of blog posts (the README is filled in, so starting it locally should be no problem; you can also check the feature branches if you’re interested in a specific implementation).


General content

Let's start by adding the aws-sdk-comprehend gem to our Gemfile. The gem provides a client to make API requests to Amazon Comprehend.
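In the Gemfile this is a single line. No version constraint is assumed here; pin one if your project needs it:

```ruby
# Gemfile
gem 'aws-sdk-comprehend'
```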


After running bundle install, we will create a new service that handles PII detection within our application.

Let's create a Pii::DetectionService:
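A minimal sketch of such a service could look like the one below. The file path, the `client:` keyword (handy for injecting a stub in tests), the `AWS_REGION` environment variable and its fallback value are assumptions made for illustration, not necessarily what the workshop repository uses.

```ruby
# app/services/pii/detection_service.rb (hypothetical path)
# In a Rails app Bundler loads aws-sdk-comprehend for us; outside Rails,
# add `require 'aws-sdk-comprehend'` before using the default client.
module Pii
  class DetectionService
    attr_reader :note, :client, :detection_results

    # note   - the record whose content will be analyzed
    # client - optional Comprehend client; injectable so tests can pass a stub
    def initialize(note, client: nil)
      @note = note
      # The region is read from the environment instead of being hardcoded;
      # the 'us-east-1' fallback is only an illustrative default.
      @client = client || Aws::Comprehend::Client.new(
        region: ENV.fetch('AWS_REGION', 'us-east-1')
      )
      @detection_results = []
    end
  end
end
```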



Within the service constructor we initialize:


  • Note: the ActiveRecord object whose content we will analyze.
  • Comprehend client: responsible for making requests to the Amazon Comprehend service.
  • Detection results: where we gather all detection results in one place, so we can reuse them later.

Keep in mind that you need to specify the AWS region. The best approach is to read the region from your settings rather than hardcoding it.


AWS SDK

Let's spend a few more minutes on how the AWS SDK connects to AWS. To authenticate, we provide the credentials of an AWS account (access key ID and secret access key), either by passing them directly when initializing the Comprehend client:
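As a rough sketch, the explicit variant could be written like this. The helper name `comprehend_client_options` and the environment variable names are assumptions for the example; `region`, `access_key_id` and `secret_access_key` are regular client options in the AWS SDK for Ruby.

```ruby
# Builds the options hash for the Comprehend client from the environment,
# so the keys never end up hardcoded in the codebase.
# `env` defaults to ENV but accepts any Hash-like object (useful in tests).
def comprehend_client_options(env = ENV)
  {
    region: env.fetch('AWS_REGION', 'us-east-1'), # fallback is illustrative
    access_key_id: env.fetch('AWS_ACCESS_KEY_ID'),
    secret_access_key: env.fetch('AWS_SECRET_ACCESS_KEY')
  }
end

# Usage (requires the aws-sdk-comprehend gem to be loaded):
#   client = Aws::Comprehend::Client.new(**comprehend_client_options)
```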


Or the SDK will fetch our default keys from the ~/.aws/credentials file on your machine, which looks like this:
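A typical credentials file has one section per profile. The values below are placeholders, and the personal profile is included only as an example of a second, named profile:

```ini
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

[personal]
aws_access_key_id = YOUR_PERSONAL_ACCESS_KEY_ID
aws_secret_access_key = YOUR_PERSONAL_SECRET_ACCESS_KEY
```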



It's worth mentioning that the AWS SDK searches for credentials automatically, so you usually don't have to provide them explicitly. If the app runs on EC2 machines, the SDK retrieves temporary credentials from the EC2 instance profile. Other services have a similar mechanism.


An extra hint for working with different access keys: in the terminal, you can easily switch between aliases:
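One way to do this, assuming a profile named personal exists in ~/.aws/credentials, is the AWS_PROFILE environment variable, which the AWS SDKs and the AWS CLI read when picking a profile:

```shell
# Use the [personal] profile for every AWS call made from this shell session
export AWS_PROFILE=personal
```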



The above snippet will fetch keys from the personal alias. Easy as that :)


More information about configuring the AWS SDK for Ruby can be found here.


Detect PII Entities API

We will use the DetectPiiEntities endpoint of the Amazon Comprehend API. It inspects the input text for entities that contain PII data and returns information about them (more information about the endpoint here).



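We need to provide the text to analyze, as well as the language of the text. At the time of writing, the language code parameter lists en, es, fr, de, it, pt, ar, hi, ja, ko, zh, and zh-TW as valid values, although PII detection itself may support only a subset of them, so check the API reference before relying on a language other than English.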
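In Ruby, the call comes down to `detect_pii_entities` with the `text` and `language_code` parameters. The wrapper name `detect_pii` below is made up for this sketch, and the client is passed in so the snippet does not depend on how you build it:

```ruby
# Thin wrapper around the DetectPiiEntities call.
# client - an Aws::Comprehend::Client (or a stub responding to the same method)
# text   - the string to analyze
def detect_pii(client, text, language_code: 'en')
  client.detect_pii_entities(text: text, language_code: language_code)
end

# Usage (requires AWS credentials and the aws-sdk-comprehend gem):
#   client = Aws::Comprehend::Client.new(region: 'us-east-1')
#   response = detect_pii(client, 'My name is Paul')
```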


Important: the maximum length of the request text is 5000 bytes. For longer text, you need to either split it or use asynchronous batch job processing (we will discuss this topic in one of the next blog posts).
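Note that the limit is in bytes, not characters, so a quick pre-check should use `String#bytesize` rather than `length`. A small sketch, with the constant and method names invented for the example:

```ruby
# Limit quoted above for a single synchronous request.
MAX_PII_TEXT_BYTES = 5000

# True when the text can be sent in one request.
# bytesize counts bytes, so multi-byte characters (e.g. "é") count more than once.
def within_pii_size_limit?(text)
  text.bytesize <= MAX_PII_TEXT_BYTES
end

within_pii_size_limit?('a' * 5000) # fits exactly
within_pii_size_limit?('é' * 2501) # 2501 characters, but 5002 bytes in UTF-8
```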


The API responds in near real time. The response looks like this:



It’s a collection of PII entities identified in the input text. For each entity, the response provides the entity type, where the entity text begins and ends, and the level of confidence that Amazon Comprehend has in the detection.
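To make the structure more concrete, here is a sketch that turns such entities into the matching pieces of text. The `Entity` struct is only a stand-in for the SDK's entity objects (which expose `type`, `score`, `begin_offset` and `end_offset`), the sample values are invented, and the code assumes the offsets are character positions with an exclusive end:

```ruby
# Stand-in for the entity objects returned by the SDK, so the sketch runs without AWS.
Entity = Struct.new(:type, :score, :begin_offset, :end_offset, keyword_init: true)

# Maps each entity to its type, confidence score and the text it points at.
def summarize_entities(text, entities)
  entities.map do |entity|
    {
      type: entity.type,
      score: entity.score,
      value: text[entity.begin_offset...entity.end_offset]
    }
  end
end

text = 'Hello Paul'
entities = [Entity.new(type: 'NAME', score: 0.99, begin_offset: 6, end_offset: 10)]
summary = summarize_entities(text, entities)
# summary.first[:value] is "Paul"
```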


If you only need to check whether the text contains PII data, you can simply use contains_pii_entities. The main difference is in the response: it returns the labels of the identified PII entity types, without the offsets needed to locate the sensitive data:
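A possible yes/no check built on top of it is sketched below. The method name `pii_present?` is an assumption; the response's `labels` collection (each label has a `name` and a `score`) comes from the SDK:

```ruby
# Returns true when Comprehend reports at least one PII label for the text.
# client - an Aws::Comprehend::Client (or a stub responding to the same method)
def pii_present?(client, text, language_code: 'en')
  response = client.contains_pii_entities(text: text, language_code: language_code)
  response.labels.any?
end

# Usage (requires AWS credentials and the aws-sdk-comprehend gem):
#   pii_present?(Aws::Comprehend::Client.new(region: 'us-east-1'), 'My name is Paul')
```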



Let's improve our service a little bit:
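One way the improved service might look is sketched below; it is not necessarily identical to the code in the workshop repository. The `call` method, the `LANGUAGE_CODE` constant, the injectable `client:` keyword and the `AWS_REGION` fallback are assumptions for the example.

```ruby
# In a Rails app Bundler loads aws-sdk-comprehend; outside Rails,
# add `require 'aws-sdk-comprehend'` before using the default client.
module Pii
  class DetectionService
    LANGUAGE_CODE = 'en'

    attr_reader :note, :client, :detection_results

    def initialize(note, client: nil)
      @note = note
      @client = client || Aws::Comprehend::Client.new(
        region: ENV.fetch('AWS_REGION', 'us-east-1') # illustrative fallback
      )
      @detection_results = []
    end

    # Sends the note content to Comprehend and stores the detected entities
    # as plain hashes, so they can be reused later without the SDK types.
    def call
      response = client.detect_pii_entities(
        text: note.content,
        language_code: LANGUAGE_CODE
      )
      @detection_results = response.entities.map do |entity|
        {
          type: entity.type,
          score: entity.score,
          begin_offset: entity.begin_offset,
          end_offset: entity.end_offset
        }
      end
    end
  end
end
```

With a note at hand, usage would be along the lines of `Pii::DetectionService.new(note).call`, after which `detection_results` holds the same array.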



Common errors

It's also good to be familiar with the common errors you may hit while working with the Amazon Comprehend API. If you encounter an error like:


"Aws::Comprehend::Errors::AccessDeniedException (User: your-user is not authorized to perform: comprehend:DetectPiiEntities)"

it most likely means that your IAM user or role doesn't have the permissions needed to call the Comprehend API (here, comprehend:DetectPiiEntities).


The second issue, which we've already discussed, is the size of the analyzed content. If you send a string that contains more than 5000 bytes, you will see an error:

"Aws::Comprehend::Errors::TextSizeLimitExceededException (Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is #{your_content_size} bytes)"

You will find out how to solve it in one of the next blog posts.
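Until then, both errors quoted above can at least be rescued so that they don't crash the request. The sketch below uses the error classes exactly as they appear in the messages; the method name and the shape of the returned hash are assumptions for illustration:

```ruby
# Calls DetectPiiEntities and converts the two common errors into a result hash.
# client - an Aws::Comprehend::Client (or a stub responding to the same method)
def detect_pii_safely(client, text, language_code: 'en')
  response = client.detect_pii_entities(text: text, language_code: language_code)
  { ok: true, entities: response.entities }
rescue Aws::Comprehend::Errors::TextSizeLimitExceededException => e
  # The text is over the size limit: split it or use an async job instead.
  { ok: false, reason: :text_too_large, message: e.message }
rescue Aws::Comprehend::Errors::AccessDeniedException => e
  # The IAM user or role lacks the comprehend:DetectPiiEntities permission.
  { ok: false, reason: :access_denied, message: e.message }
end
```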


Summary

  • We've used the prepared Rails workshop application as the basis for this blog post.
  • We've learned about the AWS SDK Comprehend client for Ruby.
  • We've created a service to detect PII data and found out how to make the call, which params it needs, and what the structure of the response looks like.
  • We've found out about a couple of common errors.