Create your own language detector with Amazon Comprehend

By Daniel Aniszkiewicz · 21 August, 2021

Introduction

When we think about detecting which language a given text is written in, the first thing that comes to mind is Google Translate.

However, we don't have to use Google's service for this. AWS offers the same capability, which is convenient if we mainly work in that cloud.


In today's post, we will try to build, in about 30 minutes, a simple service that takes a text parameter and returns the language the text is written in, along with a confidence score for that detection.


The AWS service we will use is Amazon Comprehend. We have already used it in a few earlier posts to detect and redact sensitive data.


General content

Diagram of the service


[Image: detection-language]

You can find the repository here.


Amazon Comprehend will analyze the text (using machine learning) and return the result to us. We will use the Comprehend client from the AWS SDK to communicate with Amazon Comprehend. The implementation will be in Ruby, and we will use the Serverless Framework to create the infrastructure.


The full list of supported languages can be found here.


Let's start with the serverless.yml file:


Here we configure some basic provider settings, such as the cloud provider, region, Ruby runtime (only 2.7 was supported at the time of writing), and logs, as well as some Serverless Framework related items. We also add a plugin for working with Ruby.


Now let's focus on functions:



Here we need to configure our Lambda functions. For this project, we only need one function, which will check the language for us by calling the Comprehend API through the AWS SDK.


We need to pass the path to the handler and, in the environment variables, our region, as it is required when initializing the Amazon Comprehend client.


We want to invoke the Lambda through API Gateway, so we need to add an HTTP event for it and specify the path as well as the HTTP method.


Finally, we need to attach the Ruby layer to our function.


Now let's create the code needed to communicate with Amazon Comprehend:



First of all, we need to initialize the Comprehend client in order to use the Comprehend API.


In serverless.yml we passed the region as an environment variable, and this is where we finally use it.


We will use the detect_dominant_language method to detect the language.
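As a rough illustration of the steps above, here is a minimal sketch of such a service. The class name LanguageDetector, the REGION variable name, and the injectable client are assumptions made for this example and may differ from the code in the repository. The aws-sdk-comprehend gem is loaded lazily so the class can also be exercised with a stand-in client.

```ruby
# Thin wrapper around the Comprehend client (hypothetical class name).
class LanguageDetector
  # client: any object responding to detect_dominant_language(text:).
  # When omitted, a real Aws::Comprehend::Client is built.
  def initialize(client: nil)
    @client = client || default_client
  end

  # Returns one hash per candidate language found in the text.
  def call(text)
    response = @client.detect_dominant_language(text: text)
    response.languages.map do |language|
      { language_code: language.language_code, score: language.score }
    end
  end

  private

  def default_client
    # Required here so the class loads even where the gem is absent.
    require 'aws-sdk-comprehend'
    # REGION is an assumed name for the env variable set in serverless.yml.
    Aws::Comprehend::Client.new(region: ENV.fetch('REGION'))
  end
end
```

Inside the Lambda this would be called as LanguageDetector.new.call(text); passing a client explicitly is only meant for local experiments.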


Next, let's focus on the handler.



In the run function, which is our handler, we must first retrieve the text from the event body (sent by API Gateway).


We also need to check whether any text was actually passed (if not, we abort). Then we call our service to detect the language and return a response based on the outcome.


We use the transform_body method to parse the JSON and symbolize the keys from the event.
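The handler flow described above could be sketched roughly as follows. The names run and transform_body come from the text; the text body key, the build_response helper, and the detector keyword (a callable standing in for the Comprehend-backed service) are assumptions made so the example runs without AWS.

```ruby
require 'json'

# Parse the JSON body of an API Gateway proxy event and symbolize its keys.
def transform_body(event)
  JSON.parse(event['body'] || '{}', symbolize_names: true)
rescue JSON::ParserError
  {}
end

# Hypothetical helper: shape a response the way API Gateway expects it.
def build_response(status, payload)
  { statusCode: status, body: JSON.generate(payload) }
end

# detector: any callable taking the text and returning a hash. It is an
# assumed injection point; in a deployed function the default would be
# replaced by the service that talks to Amazon Comprehend.
def run(event:, context:, detector: ->(_text) { raise NotImplementedError, 'wire in the Comprehend service' })
  text = transform_body(event)[:text]

  # Abort early when no text was passed.
  if text.to_s.strip.empty?
    return build_response(400, { error: 'Missing "text" parameter' })
  end

  build_response(200, detector.call(text))
end
```

Keeping the detector injectable is a design choice for this sketch only: it lets the validation and response shaping be tried locally with a fake callable.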


Before testing the solution, we need to add IAM permissions to the serverless.yml file so that the Lambda is allowed to call Amazon Comprehend:




Now let's deploy to AWS:


Let's try to send a test request via Postman (first grab the API Gateway URL and paste it into Postman):
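If you prefer a script over Postman, the same request can be built with Ruby's standard library. This is only a sketch: the URL is a placeholder for your own API Gateway link, the text body key is an assumption, and the helper name build_detection_request is made up for this example.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build (but do not send) a POST request carrying the text to analyze.
def build_detection_request(url, text)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate({ text: text })
  request
end

# Placeholder URL: replace it with the link printed after deployment.
url = 'https://example.com/dev/detect-language'
request = build_detection_request(url, 'Bonjour tout le monde')

# To actually send it, uncomment the lines below:
# uri = URI.parse(url)
# response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
# puts response.body
```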



The response will be:



In the response we get language_code, the RFC 5646 language code for the dominant language, and score, the level of confidence that Amazon Comprehend has in the accuracy of the detection.
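Amazon Comprehend can return more than one candidate language for a text, each with its own score. A small sketch of how the most confident one could be picked is shown below; the helper name dominant_language and the min_score threshold are assumptions for illustration, and the input is the list of language_code and score pairs described above.

```ruby
# languages: array of hashes with :language_code and :score keys.
# Returns the highest-scoring entry, or nil when the list is empty
# or the best score is below min_score.
def dominant_language(languages, min_score: 0.0)
  best = languages.max_by { |language| language[:score] }
  return nil if best.nil? || best[:score] < min_score

  best
end

candidates = [
  { language_code: 'en', score: 0.62 },
  { language_code: 'de', score: 0.31 }
]

puts dominant_language(candidates).inspect
puts dominant_language(candidates, min_score: 0.9).inspect
```

A threshold like this is useful when a low-confidence guess should be treated as "unknown" rather than shown to the user.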


Let's improve our service a bit. First of all, we would like to know the name of the language, because the language code is not always self-explanatory:


We used a ready-made gem that returns the full name of the language for a given language code. We have also made a few small improvements to the service.
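To show the idea without depending on a particular gem, here is a minimal stand-in that maps a language code to a name with a hand-written table. The table, the LANGUAGE_NAMES constant, and the language_name helper are assumptions for illustration only; the gem used in the repository covers far more languages.

```ruby
# Tiny lookup table used instead of a gem, for illustration only.
LANGUAGE_NAMES = {
  'en' => 'English',
  'pl' => 'Polish',
  'de' => 'German',
  'fr' => 'French',
  'es' => 'Spanish',
  'zh' => 'Chinese'
}.freeze

# Accepts codes with a region or script suffix (for example "zh-TW")
# by looking up only the primary language subtag.
def language_name(code)
  primary = code.to_s.split('-').first.to_s.downcase
  LANGUAGE_NAMES.fetch(primary, 'Unknown')
end

puts language_name('pl')
puts language_name('zh-TW')
```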



The final Gemfile:



The full response from the service:

[Image: detection-language]

Summary

And that's it! Using the serverless approach, we were able to quickly create a service for detecting the language of a text. If you get stuck on any step, check the repository or send me a message. I'd be happy to help!