Using AWS Polly for Text-to-Speech

“Since 1987, when I got my first one, I’ve been wearing a clock around my neck 24/7. You feel me? 24/7.” –Flavor Flav

AWS recently announced at re:Invent a slew of new tools and services to be gobbled up by eager developers.  AWS calls their strategy one of creating “elementals”, meaning that they create the building blocks and it’s up to us to snap those blocks together in new and interesting ways.

One of the services introduced is AWS Polly, a text-to-speech synthesizer with its own simple API.  Now, Alexa Skills Kit has been around for a while and does text-to-speech pretty well, but it’s overkill if all you need is to generate a simple audio file from some text.  Plus, it’s not very easy to integrate into other platforms like a phone system or website.

Polly offers 47 different voices across 24 different dialects, and it’s charged on a per-character basis.  You get 1 million characters encoded for a mere $4, but that’s only after the 5 million characters of Free Tier.  In case you’re wondering, 1 million characters is almost 24 solid hours of speaking!
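As a rough back-of-the-envelope check, the pricing above can be turned into a tiny estimator.  This is only a sketch: the constants are the figures quoted in this post (not fetched from AWS), and the function name `estimatePollyCostUsd` is made up for illustration.

```javascript
"use strict";

// Figures quoted in this post; confirm against current AWS pricing before budgeting.
const FREE_TIER_CHARS = 5000000;    // characters covered by the Free Tier
const USD_PER_MILLION_CHARS = 4;    // price per 1 million characters after that

// Hypothetical helper: estimated charge for a given number of synthesized characters.
function estimatePollyCostUsd(totalChars) {
    const billableChars = Math.max(0, totalChars - FREE_TIER_CHARS);
    return (billableChars / 1000000) * USD_PER_MILLION_CHARS;
}

console.log(estimatePollyCostUsd(3000000)); // still inside the Free Tier
console.log(estimatePollyCostUsd(7500000)); // 2.5 million billable characters
```

A demo like this one, which speaks a few hundred characters per call, would need a great many calls before it left the Free Tier.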

So, I knew I wanted to build some demo using Polly, but just didn’t know what topic…  While the rest of the world was getting smashed on New Year’s Eve….well, I was too.  But I was also thinking about a topic for my Polly demo.  (Inspiration is indeed all around us, but it’s on us to throw a net on the little bastards when they come wandering by!)  I was watching one of the many New Year’s countdown shows on TV, staring at the giant clock in Times Square….then it hit me.

Maybe it was the booze or all the cocktail weenies I had been eating, but an idea came to me.   For some reason, that giant clock brought to mind one of the most unique individuals ever to grace pop culture–Flavor Flav!  “Flava” and Public Enemy were mainstays of my music collection as a middle-class white kid growing up in the suburbs–which I’m sure is quite horrific to Chuck D and the boys.  Nonetheless, I undertake this demo with the utmost respect for Public Enemy, their message and to honor one of the best Hype Men ever!

Let’s see how Polly takes on Flavor Flav and other Public Enemy works!  The code is over on our GitHub site for you to look at closer.

You can give it a try by calling +1-206-900-9686

 

Use Cases

As with all of my demos, there are some real potential use cases behind this one.

  1. Provide cheap and easy automated order or shipment tracking information to callers without complex or costly CTI or PBX integrations.
  2. Create passable (not perfect) multi-language translation for a video.  You’ll have to do the translation, but Polly can narrate that translation in the proper dialect.
  3. Automate customized outbound calling for community emergencies or school closures.  No need to record a message manually at the crack of dawn.  Or, along the same educational lines, automate student welfare calls–if a student doesn’t show up for school, automatically call the parents with an automated customized message.  [“Hello Mr. Smith, this is the Deer Valley School District calling about Timmy Smith.”]
  4. Create better accessible apps or services for those who are sight-impaired or who otherwise cannot read.  There is still a large portion of the global population who cannot read or write, so voice is the most natural way to communicate.

Objectives

  1. Serverless (no surprise here…got no time for patching OSs, sucka)
  2. Infinitely scalable
  3. Robust and redundant
  4. Accessible via phone (because I couldn’t figure out a better way to demo audio)
  5. Cheap

Design

The front-end is going to be the good old-fashioned telephone line.  This is kind of reminiscent of those info lines you could dial up back in the day for date and time or the Joke of the Day.  We’re going to build the Public Enemy Lyric Line.  For the phone service, I’m using my old favorite Twilio, as it is totally simple to create an endpoint to play an audio file.  I’ll have Twilio set up with an endpoint on the API Gateway, which in turn calls a Lambda function to do the work.  In that Lambda function, we’ll select the lyric at random, build the text and send it to the Polly API.  We’ll get that back as an MP3 and stash it on S3.  We’ll then return the URL path to that unique S3 file, which Twilio will play for us.

[Aside: I really wanted to serve the MP3 straight off the API Gateway, requiring no S3 storage, but I just couldn’t figure out how to get Lambda to properly hand a binary file back to the API Gateway.  The API Gateway does support binary now, but the examples are via the HTTP proxy method rather than straight Lambda.  I tried all manner of Base64 encoding between Lambda and the Response Integration but just couldn’t get it to convert back to a playable MP3.  If I figure it out, I’ll update this blog.  Likewise, if you know of a way, please let me know!]
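One approach that may get around this is Lambda proxy integration, where the function returns the body Base64-encoded and sets `isBase64Encoded`.  A minimal sketch of that response shape follows, assuming the API has `audio/mpeg` registered as a binary media type and the caller sends a matching `Accept` header.  `toBinaryProxyResponse` is a made-up helper name, and this should be treated as a starting point rather than a confirmed fix for the Twilio case.

```javascript
"use strict";

// Hypothetical helper: wrap raw MP3 bytes in the response shape that
// API Gateway's Lambda proxy integration expects for binary payloads.
function toBinaryProxyResponse(audioBuffer) {
    return {
        statusCode: 200,
        headers: { "Content-Type": "audio/mpeg" },
        body: audioBuffer.toString("base64"),
        isBase64Encoded: true
    };
}

// Stand-in bytes; in the real function this would be data.AudioStream from Polly.
const fakeAudio = Buffer.from([0x49, 0x44, 0x33, 0x04]);
const response = toBinaryProxyResponse(fakeAudio);
console.log(response.headers["Content-Type"], response.body.length);
```

In the handler, the idea would be to pass `toBinaryProxyResponse(data.AudioStream)` to the callback instead of uploading to S3.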

The Steps

  1. Create Lambda function to randomize and build text string
  2. Convert text to audio file via Polly, save audio file to S3 and return URL
  3. Create API Gateway to front-end our Lambda function
  4. Play dynamically generated audio file upon call to a phone number

Create Lambda Function

The Lambda function is very straightforward.  Just a basic Node.js function.  I did decide to use one third-party package for generating a UUID for the name of the MP3 file.  You’ll quickly see from the code that we have a little surprise waiting for about 10% of callers…

"use strict";

var AWS = require("aws-sdk");
const uuidV4 = require("uuid/v4");
const songLibrary = require("./lyrics");
const phraseBook = require("./phrases");

const lang = "en";

AWS.config.region = "us-east-1";

var polly = new AWS.Polly();
var s3 = new AWS.S3();

exports.handler = (event, context, callback) => {

    var phrases = phraseBook.phrases.find(phrase => phrase.lang === lang);
    var lyrics = songLibrary.lyrics[Math.floor(Math.random() * (songLibrary.lyrics.length))];

    var now = new Date();

    var script = ""
        + "<speak>"
        + "<p>"
        + phrases.greeting
        + "</p>"
        + "<p>"
        + phrases.currentTimePrefix
        + "<say-as interpret-as='time'>"
        + now.getUTCHours() + ":" + ("00" + now.getUTCMinutes()).substr(-2, 2)
        + "</say-as>"
        + phrases.currentTimePostfix
        + "</p>";

    if (lyrics.song === "Never Gonna Give You Up"){
        script +=
            "<p>"
            + phrases.prank
            + "</p>";
    } else {
        script +=
            "<p>"
            + phrases.songTag
            + " '" + lyrics.song + "' "
            + phrases.albumTag
            + " '" + lyrics.album + "'."
            + "</p>";
    }

    script +=
        "<p>"
        + lyrics.lyric
        + "</p>"
        + "<break time='2s'/>"
        + phrases.outro
        + "</speak>";

    var pollyParams = {
        OutputFormat: "mp3",
        Text: script,
        VoiceId: phrases.voiceId,
        SampleRate: "16000",
        TextType: "ssml"
    };

    polly.synthesizeSpeech(pollyParams, function (err, data) {
        if (err) callback(err);
        else {

            const mp3Key = phrases.s3KeyPrefix + uuidV4() + ".mp3";

            var s3Params = {
                Bucket: phrases.s3Bucket,
                Key: mp3Key,
                ACL: "public-read",
                Body: data.AudioStream,
                ContentType: "audio/mpeg"
            };

            s3.putObject(s3Params, function (err) {
                if (err) callback(err);
                else {
                    var twilioInstruction = "https://s3.amazonaws.com/" + phrases.s3Bucket + "/" + mp3Key;
                    callback(null, { "response": twilioInstruction });
                }
            });
        }
    });
};
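One thing the function above takes on faith is that the lyric and phrase text is safe to drop into SSML.  SSML is XML, so a stray `&` or `<` in a lyric would make the document invalid.  A minimal sketch of an escaping helper follows; `escapeSsml` and `paragraph` are hypothetical names, not part of the code above.

```javascript
"use strict";

// Hypothetical helper: escape the characters that are reserved in XML
// so arbitrary text can be embedded in an SSML document.
function escapeSsml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&apos;");
}

// Hypothetical helper: wrap escaped text in an SSML paragraph.
function paragraph(text) {
    return "<p>" + escapeSsml(text) + "</p>";
}

console.log(paragraph("Rock & roll <live>"));
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.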

Here’s the phrase book file.  Polly is multilingual, but it does not translate from one language to another.  You have to do that yourself.  As part of a good design, I’ve included a language key in the phrase book to accommodate other languages…but I doubt I’ll implement that.

"use strict";

const phrases = [
    {
        "lang": "en",
        "greeting": "Yeah boy. It's the Public Enemy lyric line.",
        "currentTimePrefix": "The current time is",
        "currentTimePostfix": "coordinated universal time.",
        "songTag":"An excerpt from a composition entitled",
        "albumTag":"from the album",
        "outro": "We hope you have enjoyed these words of wisdom. Thank you and good day. I said good day!",
        "prank": "Hahahaa! I tricked you!  You've been Rick Rolled!",
        "voiceId": "Brian",
        "s3Bucket": "<your bucket with Static Web Hosting here>",
        "s3KeyPrefix": "polly/"
    }
];

module.exports = { phrases: phrases };
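If more languages were ever added, the lookup in the handler would need to cope with a language that isn’t in the book.  Here is a sketch of one way to do that, falling back to English.  The `findPhrases` helper and the trimmed-down German entry (including the choice of the `Hans` voice and the placeholder greeting) are illustrative assumptions, not part of the project.

```javascript
"use strict";

// Trimmed-down phrase book for illustration; the "de" entry is hypothetical.
const phrases = [
    { lang: "en", voiceId: "Brian", greeting: "Yeah boy. It's the Public Enemy lyric line." },
    { lang: "de", voiceId: "Hans", greeting: "<your translated greeting here>" }
];

// Hypothetical helper: return the entry for `lang`, or the fallback language's entry.
function findPhrases(book, lang, fallbackLang) {
    return book.find(entry => entry.lang === lang)
        || book.find(entry => entry.lang === fallbackLang);
}

console.log(findPhrases(phrases, "de", "en").voiceId);
console.log(findPhrases(phrases, "fr", "en").voiceId);
```

Without a fallback, an unknown language would leave the handler's `phrases` variable undefined and the first property access would throw.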

And here is a snippet of our Song Lyrics file.

...
const lyrics = [
    {
        "song": "Never Gonna Give You Up",
        "album": "Never Gonna Give You Up",
        "lyric": "Never gonna give you up, never gonna let you down. Never gonna run around and desert you. Never gonna make you cry, never gonna say goodbye. Never gonna tell a lie and hurt you."
    },
    {
        "song": "Bring The Noise",
        "album": "It Takes A Nation Of Millions To Hold Us Back",
        "lyric": "Bass! How low can you go? Death row, what a brother know. Once again, back is the incredible rhyme animal, the incredible. D, Public Enemy number one. 5 Oh said, Freeze! and I got numb. Can I tell them that I really never had a gun? But it's the wax that the Terminator X spun"
    },
...
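The random pick in the handler is a uniform draw over this array, so every entry, the prank included, comes up with a probability of 1 divided by the number of entries.  Below is a small sketch that pulls the draw into its own function with an injectable random source, which makes it easy to test.  `pickLyric` and the two-entry sample library are assumptions for illustration.

```javascript
"use strict";

// Hypothetical helper: pick one entry uniformly at random.
// `random` defaults to Math.random and must return a number in [0, 1).
function pickLyric(lyrics, random = Math.random) {
    return lyrics[Math.floor(random() * lyrics.length)];
}

// Two-entry sample standing in for the real lyrics file.
const sampleLyrics = [
    { song: "Never Gonna Give You Up" },
    { song: "Bring The Noise" }
];

console.log(pickLyric(sampleLyrics).song);             // random pick
console.log(pickLyric(sampleLyrics, () => 0.99).song); // forced to the last entry
```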

We use a simple Lambda security role that lets the function call Polly and gives it access to the S3 bucket we’re going to use to serve up the MP3 files.  That’s well documented elsewhere, so I’m not showing it here.  Also, you’ll need to enable Static Web Hosting on your S3 bucket to allow Twilio to fetch the files.
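As a rough idea of what that role needs, here is a sketch that builds the policy document as a plain object.  The actions listed follow from what the function calls: `polly:SynthesizeSpeech` for the synthesis, and `s3:PutObject` plus `s3:PutObjectAcl` because the upload sets a `public-read` ACL.  The helper name and bucket name are placeholders, and a real role would also need the usual CloudWatch Logs permissions, so check it against your own setup.

```javascript
"use strict";

// Hypothetical helper: build a minimal IAM policy document for the Lambda role.
function buildLambdaPolicy(bucket, keyPrefix) {
    return {
        Version: "2012-10-17",
        Statement: [
            {
                Effect: "Allow",
                Action: ["polly:SynthesizeSpeech"],
                Resource: "*"
            },
            {
                Effect: "Allow",
                Action: ["s3:PutObject", "s3:PutObjectAcl"],
                // Limit writes to objects under the key prefix the function uses.
                Resource: "arn:aws:s3:::" + bucket + "/" + keyPrefix + "*"
            }
        ]
    };
}

// "my-polly-demo-bucket" is a placeholder bucket name.
console.log(JSON.stringify(buildLambdaPolicy("my-polly-demo-bucket", "polly/"), null, 2));
```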

Convert Text to Audio

This little snippet does the conversion and saves the file to S3.

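...
    var pollyParams = {
        OutputFormat: "mp3",
        Text: script,
        VoiceId: phrases.voiceId,
        SampleRate: "16000",
        TextType: "ssml"
    };

    polly.synthesizeSpeech(pollyParams, function (err, data) {
        if (err) callback(err);
        else {
            const mp3Key = phrases.s3KeyPrefix + uuidV4() + ".mp3";
            var s3Params = {
                Bucket: phrases.s3Bucket,
                Key: mp3Key,
                ACL: "public-read",
                Body: data.AudioStream,
                ContentType: "audio/mpeg"
            };
            s3.putObject(s3Params, function (err) {
                if (err) callback(err);
                else {
                    var twilioInstruction = "https://s3.amazonaws.com/" + phrases.s3Bucket + "/" + mp3Key;
                    callback(null, { "response": twilioInstruction });
                }
            });
        }
    });
...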

The parameters we’re sending in tell Polly what voice to use, what format the text is in (SSML) and what audio format we want back (MP3).  I chose “Brian” as the voice.  Not because Brian is my third-favorite character on Family Guy, but rather because it is, to my ear, the most realistic (non-robot) voice for English.  Plus, the voice sounds like a cross between David Attenborough and Alan Rickman and I can think of no better tone with which to honor the flowing lyrics of PE.
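The nested callbacks can also be written with promises.  The sketch below assumes the v2 `aws-sdk` clients used in this post, whose request objects expose `.promise()`, and a Node.js runtime with async/await support.  The clients are passed in as arguments so the example can run with stand-in objects instead of real AWS calls.  `synthesizeToS3`, the `settings` shape and the bucket name are hypothetical.

```javascript
"use strict";

// Hypothetical helper: synthesize SSML with Polly, store the MP3 on S3,
// and resolve with the public URL of the stored file.
async function synthesizeToS3(polly, s3, settings, ssml, fileName) {
    const speech = await polly.synthesizeSpeech({
        OutputFormat: "mp3",
        Text: ssml,
        TextType: "ssml",
        VoiceId: settings.voiceId,
        SampleRate: "16000"
    }).promise();

    const key = settings.s3KeyPrefix + fileName;
    await s3.putObject({
        Bucket: settings.s3Bucket,
        Key: key,
        ACL: "public-read",
        Body: speech.AudioStream,
        ContentType: "audio/mpeg"
    }).promise();

    return "https://s3.amazonaws.com/" + settings.s3Bucket + "/" + key;
}

// Stand-in clients that mimic the `.promise()` shape, for a dry run only.
const fakePolly = {
    synthesizeSpeech: () => ({ promise: async () => ({ AudioStream: Buffer.from("mp3") }) })
};
const fakeS3 = { putObject: () => ({ promise: async () => ({}) }) };
const settings = { voiceId: "Brian", s3Bucket: "example-bucket", s3KeyPrefix: "polly/" };

const pending = synthesizeToS3(fakePolly, fakeS3, settings, "<speak>Hi</speak>", "demo.mp3");
pending.then(url => console.log(url));
```

Passing real `new AWS.Polly()` and `new AWS.S3()` instances in place of the stand-ins would make the same function perform the actual calls.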

Create API Gateway

Ok, the hard part is done and we’re almost there.  We just need to lay down a simple API Gateway that hits the Lambda function and returns a formatted TwiML string back. TwiML is the Twilio Markup Language and is used to give instructions to Twilio upon an incoming call or SMS.

 

Notice the use of “application/xml” in the Content-Type body mapping.  This is important as the default is JSON and TwiML needs to have the XML content type.  We just take the response value from the Lambda return.
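For reference, the body Twilio needs back is a tiny XML document.  The mapping itself lives in the API Gateway console, but the shape of the output can be sketched in Node.js.  This assumes the standard TwiML `<Play>` verb, which plays the audio file at a URL; `buildTwiml` is a hypothetical helper and the URL is a placeholder.

```javascript
"use strict";

// Hypothetical helper: wrap an audio URL in a TwiML document that plays it.
function buildTwiml(audioUrl) {
    // Escape XML-reserved characters so the URL cannot break the document.
    const safeUrl = audioUrl
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
    return '<?xml version="1.0" encoding="UTF-8"?>'
        + "<Response><Play>" + safeUrl + "</Play></Response>";
}

// Placeholder URL in the same form the Lambda function returns.
console.log(buildTwiml("https://s3.amazonaws.com/example-bucket/polly/demo.mp3"));
```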

Here’s the test from the API Gateway console.  Notice the latency is pretty high.  I’ve seen anywhere between 2-4 seconds of latency for generating the text-to-speech.  Be sure to set the timeout on your Lambda function high enough to accommodate this.  It does not create an issue for the inbound caller, as the API Gateway call is made as soon as Twilio detects a call, which is way before you even hear it ring.  In my testing, it generally answers after one half ring.

Play the Dynamically Generated File to the Caller

It’s dead simple to route a phone call to the API Gateway…just insert the URL for the API Gateway endpoint, including any paths that you may have added.

And we’re done!

Give a call to your number and you should hear Polly reading to you.  Hit me up if you have any questions or comments!

Fight the Power!

 


Scott has been in the IT industry for 25 years, starting at the very entry-level as a PC repair tech while working his way through college. Over the years, he's had the opportunity to work across a variety of IT roles and industry verticals. For the past seven years, Scott has focused on Omni-Channel technologies and Innovation practices in the Retail industry.

If Scott ever made it onto Jeopardy, his dream categories would be "80's Metal Bands", "Craft Beers of the Northwest", "Kevin Smith Movies", "Jason Isbell Lyrics" and "American BBQ Styles".