Programmed my first Alexa skill: I was shocked by what I found!

Although I am pretty deeply entrenched in the Apple ecosystem, the recently-announced \$50 Dot was so inexpensive I could not resist checking it out. (Before I go further: I work for Microsoft, so take that into account as you see fit.)

Out of the box, the Echo is very easy to set up for basic queries ("Alexa, what's my latitude and longitude?" and so forth). The Echo has a relatively lo-fi speaker and the integration with Sonos (what Amazon calls an "Alexa Skill") is not yet available, so I haven't used it all that much.

But there's an API so you know I had to program something. My preferred solution for "computations in the cloud" is definitely Azure Functions written in F#, but for my first Alexa Skill I used Amazon Lambda running Python.

The first thing to focus on is that Alexa Skills are a separate service that can be programmed many ways, so there's always going to be a certain amount of integration overhead in the form of multiple tabs open, jumping back and forth between the Alexa Skills site and the Web server/service where you are handling the computation.

The Alexa Skills documentation is good, but there's a good number of parts and I think it's wise to write your first skill using Amazon Lambda, as I did. Amazon Lambda is often the default service in the documentation and there are often hyperlinks to the Lambda-specific page to do "X."


A Skill for Gravity

A friend was talking to me about riflery and astonishing me with the flight times he was talking about. Alexa failed to answer some basic questions about ballistics (Alexa seems to me less capable than Google Assistant, Cortana, or Siri at answering freeform questions), offering me the perfect simple use-case for my first skill.

Minimum viable query: "What is the speed of an object that has fallen for 1.5 seconds?"

SWAG achievable: "How long would it take for an object dropped from the height of the Empire State Building to fall to the ground on Mars?"

The nice thing about my minimal query is that it's both stateless and easy to answer with some math: all you need is the duration of the drop and a gravitational acceleration of -9.81 meters per second squared, since the velocity of an object dropped from rest is just acceleration times time. (Conversions from meters/second can come later.)
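To make the math concrete, here is a minimal sketch of the calculation on its own, before any Alexa plumbing gets involved. The function name `falling_velocity` is mine and purely illustrative, and the sketch ignores air resistance.

```python
G = -9.81  # meters per second squared; negative means "downward"


def falling_velocity(duration_s, g=G):
    """Velocity (m/s) of an object dropped from rest after duration_s seconds.

    Uses v = g * t and ignores air resistance.
    """
    return g * duration_s


if __name__ == "__main__":
    # The minimum viable query: an object that has fallen for 1.5 seconds
    print("%.1f meters per second" % falling_velocity(1.5))
```

The sign just records direction; a spoken answer would presumably report the magnitude.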

I followed the documentation on building an Alexa skill with a Lambda function to create an Alexa Skill named "Gravity." After naming, the next page of the Skill development site is "Interaction Model." This is where I was shocked to discover:

Alexa doesn't do natural language processing!

I ASS-U-ME'd that I would be receiving some programmatic structure that told me the "nominal subject" of the sentence was the noun speed and would allow me to search for a "prepositional modifier" whose "object" was the noun seconds and extract its modifier. That would allow me to recognize either of these sentences:

  • What is the speed of an object that has fallen for 1.5 seconds?; or
  • What's the velocity of an apple after 1.5 seconds?

Or any of a large number of other sentences. Foxtype will show you such parsing in action at this (fascinating) page.

But no! As you can see in the screenshot below, the mapping of a recognized sentence to a programmatic "intent" is nothing but a string template! You either have to anticipate every single supported structure or you have to use wildcards and roll your own. (Honestly, I imagine that it's not a long road before the wisest interaction model is Parse {utterance}.)

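[Screenshot: intents1 (the Interaction Model page, with a sample utterance containing the {duration} template)]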

To be clear: 'just' voice recognition is extraordinarily hard and doing it in ambient environmental noise is insane. It's only because Alexa already does this very, very hard task that it's surprising to me that they don't provide for some amount of the (also hard) task of parsing. The upside, of course, is that sound->utterance is decoupled from utterance->sentence. As far as I know, no one today provides "NLP as a Service" but it's easy to imagine. (Although latency... Nope, nope, staying on topic...)

Returning to the screenshot above, you can see that it contains the bracketed template {duration}. The matching value will be associated with the key duration in calls to the Lambda function. And, to be honest, it's a place where the Alexa Skills Kit does do some NLP.
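To illustrate what "nothing but a string template" means in practice, here is a rough sketch of template matching done with regular expressions. It is emphatically not how Alexa implements it (I don't know those details). The sample utterances, `SAMPLE_UTTERANCES`, `template_to_regex`, and `match_utterance` are all hypothetical names of my own.

```python
import re

# Hypothetical sample utterances, keyed by intent name. The real ones live in
# the Skill's Interaction Model; these are stand-ins for illustration only.
SAMPLE_UTTERANCES = {
    "FallingSpeedIntent": [
        "what is the speed of an object that has fallen for {duration} seconds",
        "what's the velocity of an apple after {duration} seconds",
    ],
}


def template_to_regex(template):
    """Turn 'fallen for {duration} seconds' into a regex with a named group."""
    # re.split with a capturing group alternates literal text and slot names
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)          # literal text
        else:
            pattern += "(?P<%s>.+?)" % part     # slot placeholder
    return re.compile("^" + pattern + "$", re.IGNORECASE)


def match_utterance(utterance):
    """Return (intent_name, slots) for the first matching template."""
    text = utterance.strip().rstrip("?.!")
    for intent_name, templates in SAMPLE_UTTERANCES.items():
        for template in templates:
            match = template_to_regex(template).match(text)
            if match:
                return intent_name, match.groupdict()
    return None, {}
```

Any sentence that doesn't line up with one of the templates simply falls through, which is exactly the limitation: every supported phrasing has to be anticipated up front.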

You can help Alexa by specifying the type of the variables in your template text. For instance, I specified the duration variable as a NUMBER. Alexa does use NLP to transform the utterances meaningfully -- so "one and a half" becomes "1.5" and so forth. I haven't really explored the extent of this -- does it turn "the Tuesday after New Year's Day" into a well-formed date and so forth?

Alexa packages session data relating to an ongoing conversation and intent data and performs an RPC-like call (I actually don't know the details) to the endpoint of your choice. In the case of Amazon Lambda, that's the Amazon Resource Name (ARN) of your function.

The data structures it passes look like this:

[code lang="javascript"]
{
  "session": {
    "sessionId": "SessionId.07dc1151-eb4e-4e12-98fa-64af3f59d82a",
    "application": {
      "applicationId": "amzn1.ask.skill.443f7cb5-ETC-dbecb288ff2d"
    },
    "attributes": {},
    "user": {
      "userId": "amzn1.ask.account.ETC"
    },
    "new": true
  },
  "request": {
    "type": "IntentRequest",
    "requestId": "EdwRequestId.13cf7a2b-0789-4244-879f-f4fae08f315f",
    "locale": "en-US",
    "timestamp": "2016-11-18T17:24:09Z",
    "intent": {
      "name": "FallingSpeedIntent",
      "slots": {
        "duration": {
          "name": "duration",
          "value": "1.5"
        }
      }
    }
  },
  "version": "1.0"
}
[/code]

The values in the session object relate to a conversation and the values in the request object belong to a specific intent -- in this case the FallingSpeedIntent with the duration argument set to "1.5".
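As a quick sketch of how that structure can be picked apart in Python, the snippet below pulls the intent name and slot values out of a trimmed copy of the JSON above. The helper name `extract_intent` is my own invention, not part of any Amazon library. Note that the slot value arrives as a string and has to be converted before doing math with it.

```python
import json

# A trimmed copy of the request shown above (session details omitted)
EVENT_JSON = """
{
  "session": {"sessionId": "SessionId.07dc1151-eb4e-4e12-98fa-64af3f59d82a",
              "attributes": {}, "new": true},
  "request": {
    "type": "IntentRequest",
    "intent": {
      "name": "FallingSpeedIntent",
      "slots": {"duration": {"name": "duration", "value": "1.5"}}
    }
  },
  "version": "1.0"
}
"""


def extract_intent(event):
    """Return (intent_name, {slot_name: value}) for an IntentRequest event."""
    request = event["request"]
    if request["type"] != "IntentRequest":
        return None, {}
    intent = request["intent"]
    # A slot can be present without a "value" key, hence .get()
    slots = {name: slot.get("value")
             for name, slot in intent.get("slots", {}).items()}
    return intent["name"], slots


if __name__ == "__main__":
    name, slots = extract_intent(json.loads(EVENT_JSON))
    print(name, float(slots["duration"]))
```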

On the Lambda side of things

Amazon Lambda has a template function called ColorIs that provides an easy starting point. It supports session data, which my Gravity skill doesn't require, so I actually ended up mostly deleting code (always my favorite thing). Given the JSON above, here's how I route the request to a specific function:

[code lang="python"]
def on_intent(intent_request, session):
    """ Called when the user specifies an intent for this skill """

    print("on_intent requestId=" + intent_request['requestId'] +
          ", sessionId=" + session['sessionId'])

    intent = intent_request['intent']
    intent_name = intent_request['intent']['name']

    # Dispatch to your skill's intent handlers
    if intent_name == "FallingSpeedIntent":
        return get_falling_speed(intent, session)
    else:
        # Fail loudly on an unknown intent rather than silently returning None
        raise ValueError("Invalid intent")

def get_falling_speed(intent, session):
    session_attributes = {}
    reprompt_text = None
    should_end_session = True

    g = -9.81  # meters per second squared (negative = downward)

    # A slot can arrive without a 'value' if no number was recognized
    if "value" in intent['slots'].get('duration', {}):
        duration = float(intent['slots']['duration']['value'])
        velocity = g * duration  # v = g * t for an object dropped from rest

        # Speak the magnitude; "falling" already conveys the direction
        speech_output = "At the end of " + str(duration) + " seconds, an object will be falling at " + \
            ('%.1f' % abs(velocity)) + " meters per second. " + \
            "Goodbye."
    else:
        speech_output = "Pretty fast I guess."

    return build_response(session_attributes, build_speechlet_response(
        intent['name'], speech_output, reprompt_text, should_end_session))
[/code]

(Boilerplate not shown)
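For readers who want a rough idea of what that boilerplate involves, here is a sketch of the two response helpers and an entry point. It is modeled on my recollection of the shape of Amazon's template, so treat the details as assumptions rather than the template's actual code. The welcome text and the `'Gravity'` card title in the launch branch are placeholders I made up. `on_intent` is the router shown above and is not redefined here.

```python
def build_speechlet_response(title, output, reprompt_text, should_end_session):
    """Wrap plain text in the speech/card/reprompt structure of a response."""
    return {
        'outputSpeech': {'type': 'PlainText', 'text': output},
        'card': {'type': 'Simple', 'title': title, 'content': output},
        'reprompt': {
            'outputSpeech': {'type': 'PlainText', 'text': reprompt_text}
        },
        'shouldEndSession': should_end_session,
    }


def build_response(session_attributes, speechlet_response):
    """Assemble the top-level response object."""
    return {
        'version': '1.0',
        'sessionAttributes': session_attributes,
        'response': speechlet_response,
    }


def lambda_handler(event, context):
    """Entry point: route the incoming request by its type."""
    request = event['request']
    if request['type'] == 'IntentRequest':
        # on_intent is the router from the listing above
        return on_intent(request, event['session'])
    if request['type'] == 'LaunchRequest':
        # Placeholder welcome text; keep the session open for a follow-up
        return build_response({}, build_speechlet_response(
            'Gravity', 'Ask me how fast something falls.',
            'How many seconds has it been falling?', False))
    return None  # e.g. SessionEndedRequest: nothing to say
```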