Building a Voice-Driven TV Remote - Part 4: Some Basic Alexa Commands

This is part four of the Building a Voice-Driven TV Remote series:

  1. Getting The Data
  2. Adding Search
  3. The Device API
  4. Some Basic Alexa Commands
  5. Adding a Listings Search Command
  6. Starting to Migrate from HTTP to MQTT
  7. Finishing the Migration from HTTP to MQTT
  8. Tracking Performance with Application Insights

The first three posts in this series were a lot of fun but ultimately they were all just setting the stage for being able to do cool things. It's now time to start tying these pieces together to light up some basic functionality with Alexa and my new living room API. For now I'm going to leave out the whole listings search piece and simply expose some remote control commands.

Alexa Skill

First I'll need to add a definition for a new Alexa skill in Amazon's Developer Console. For its invocation name I'm using tv, so that it can be invoked via Alexa, tell the TV to... and Alexa, ask the TV to.... You can get a little more flexibility with the invocation format if you have a published skill, but since there's currently no support for publishing private skills, this one will need to stay in development mode and stick to that format. It's certainly not the end of the world, though, since this is the format most Alexa skills adhere to.

Interaction Model

Next up is defining the rest of the interaction model for the skill. In the schema I'm just going to define one intent for now, to allow for directly running a remote control command:

{
  "intents": [
    {
      "intent": "DirectCommand",
      "slots": [
        {
          "name": "command",
          "type": "command_name"
        },
        {
          "name": "target",
          "type": "command_target"
        }
      ]
    }
  ]
}

In this schema I've also defined a couple of custom slots to capture data from the spoken command. The command_name slot will capture the actual control command, and is defined as:

pause | volume up | volume down | play | fast forward | mute

As I add support for more commands, I can simply append them to this list.

The command_target slot isn't going to be particularly useful in this post, but the idea is that it can help capture some extra context around the command, allowing for something like Alexa, tell the TV to pause the movie. This could end up being useful for making the skill handler a little more intelligent down the line.
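
The slot type just needs to be seeded with a few likely values; a handful of example values along these lines would do (illustrative only):

movie | show | tv | music | game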

Finally, I need to provide Amazon with some sample utterances for the skill:

DirectCommand {command}
DirectCommand {command} the {target}
DirectCommand {command} {target}

This tells Alexa to map these phrase structures to the DirectCommand intent, and how to map the pieces of each phrase into the slots so we can access them in code.
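
To make that concrete, when one of these utterances matches, the intent portion of the request that reaches the skill looks roughly like this (abridged, just for illustration):

{
  "request": {
    "type": "IntentRequest",
    "intent": {
      "name": "DirectCommand",
      "slots": {
        "command": { "name": "command", "value": "pause" },
        "target": { "name": "target", "value": "movie" }
      }
    }
  }
}

Before defining the rest of the skill configuration we'll need to set up a new Azure function to handle the skill, so let's do that now.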

Skill Handler

To handle the skill requests I created a new Azure Function named RemoteSkill that has HTTP input and output bindings:

{
  "bindings": [
    {
      "authLevel": "anonymous",
      "name": "req",
      "type": "httpTrigger",
      "direction": "in",
      "methods": [
        "post"
      ]
    },
    {
      "name": "res",
      "type": "http",
      "direction": "out"
    }
  ],
  "disabled": false
}

I'll also need to pull in a few dependencies in project.json:

{
  "frameworks": {
    "net46":{
      "dependencies": {
        "FSharp.Interop.Dynamic": "3.0.0",
        "AlexaSkillsKit.NET": "1.5.1",
        "FSharp.Data": "2.3.2"
      }
    }
  }
}

Here I'm pulling in a nice library called AlexaSkillsKit.NET that handles parsing Alexa requests and, more importantly, validating their signatures to make sure I only accept requests that actually come from Amazon.

Running Commands

In order to run any commands the function will need to talk to the Harmony API defined in the previous post. To facilitate that I created a file named commands.fsx that contains some helpers to do just that:

module Commands

open System
open FSharp.Data

type CommandsResponse = JsonProvider<""" {"commands":[{"name":"VolumeDown","slug":"volume-down","label":"Volume Down"}]} """>

let getCommand (label: string) =
    let url = sprintf "%s/commands" (Environment.GetEnvironmentVariable("HarmonyApiUrlBase"))
    let authHeader = "Authorization", (Environment.GetEnvironmentVariable("HarmonyApiKey"))

    // Fetch the commands available for the current activity and look for one
    // whose label matches the requested command, ignoring case
    Http.RequestString(url, headers = [authHeader])
    |> CommandsResponse.Parse
    |> fun res -> res.Commands
    |> Seq.tryFind (fun command -> command.Label.ToLowerInvariant() = label.ToLowerInvariant())

let executeCommand commandSlug =
    let url = sprintf "%s/commands/%s" (Environment.GetEnvironmentVariable("HarmonyApiUrlBase")) commandSlug
    let authHeader = "Authorization", (Environment.GetEnvironmentVariable("HarmonyApiKey"))

    // POST to the command's endpoint to actually run it via the Harmony API
    Http.RequestString(url,
                       headers = [authHeader],
                       httpMethod = "POST") |> ignore

The API exposes the list of commands available for the current activity, so getCommand tries to match the desired command against that list and returns it if there's a match. You can also query and execute commands for specific activities, but keeping it generic makes things easier for now. It also means that commands like "pause" and "play" just work regardless of which activity is running, since they're common to all of them.
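
As a quick sanity check, the two helpers compose nicely. Here's a minimal sketch of using them from another script, assuming the HarmonyApiUrlBase and HarmonyApiKey settings are configured for the function app:

#load "commands.fsx"

match Commands.getCommand "volume down" with
| Some(command) -> Commands.executeCommand command.Slug
| None -> printfn "That command isn't available for the current activity"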

Handling Alexa Intents

Now we can define the main function implementation. I'll start by pulling in our dependencies and defining a little helper for returning a spoken response from the skill:

#load "commands.fsx"
open System.Net.Http
open AlexaSkillsKit.Slu
open AlexaSkillsKit.Speechlet
open AlexaSkillsKit.UI

let buildResponse output shouldEndSession =
    SpeechletResponse(ShouldEndSession = shouldEndSession,
                      OutputSpeech = PlainTextOutputSpeech(Text = output))

Next, a couple small functions for handling the actual intents:

let handleDirectCommand (intent: Intent) =
    match (Commands.getCommand intent.Slots.["command"].Value) with
    | Some(command) ->
        Commands.executeCommand command.Slug
        buildResponse "OK" true
    | None -> buildResponse "Sorry, that command is not available right now" true

let handleIntent (intent: Intent) =
    match intent.Name with
    | "DirectCommand" -> handleDirectCommand intent
    | _ -> buildResponse "Sorry, I'm not sure how to do that" true

F#'s pattern matching once again makes it very elegant to parse and handle these intents. It's easy to imagine how this block would be extended to handle more types of intents as the system gets built out further.
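
For instance, if a new intent were added to the interaction model down the road (say, the listings search coming later in this series), handling it would just be one more case in the match. A rough sketch, with a hypothetical handleListingsSearch stub standing in for the real thing:

// Hypothetical placeholder for a future intent handler; the real version
// comes later in the series
let handleListingsSearch (intent: Intent) =
    buildResponse "Listings search isn't wired up yet" true

let handleIntent (intent: Intent) =
    match intent.Name with
    | "DirectCommand" -> handleDirectCommand intent
    | "ListingsSearch" -> handleListingsSearch intent
    | _ -> buildResponse "Sorry, I'm not sure how to do that" true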

With the handler defined, we now need to define an implementation of AlexaSkillsKit's Speechlet abstract class to hook into the pipeline there:

type RemoteSpeechlet(log: TraceWriter) =
    inherit Speechlet()

    override this.OnLaunch(request: LaunchRequest, session: Session): SpeechletResponse =
        sprintf "OnLaunch: request %s, session %s" request.RequestId session.SessionId |> log.Info
        buildResponse "" false

    override this.OnIntent(request: IntentRequest, session: Session): SpeechletResponse =
        sprintf "OnIntent: request %s, session %s" request.RequestId session.SessionId |> log.Info
        handleIntent request.Intent

    override this.OnSessionStarted(request: SessionStartedRequest, session: Session) =
        sprintf "OnSessionStarted: request %s, session %s" request.RequestId session.SessionId |> log.Info

    override this.OnSessionEnded(request: SessionEndedRequest, session: Session) =
        sprintf "OnSessionEnded: request %s, session %s" request.RequestId session.SessionId |> log.Info

While it can feel a bit dirty to mix OO in with our nice functional code, you can see here that we were able to keep the implementation very thin. It simply does some diagnostic logging and then delegates to the handlers defined earlier.

Now we can define the main Run method:

let Run(req: HttpRequestMessage, log: TraceWriter) =
    let speechlet = RemoteSpeechlet log
    speechlet.GetResponse req

Connect Alexa Skill to Function

With the function defined, we just need to point Alexa at it to start trying things out. Under Configuration, choose the option for an HTTPS endpoint instead of a Lambda ARN, and provide the URL of the RemoteSkill function. Under SSL Certificate you can choose the option that says your endpoint is a subdomain of a domain with a certificate, since Azure takes care of all of that for you.

Try It Out

Finally we can actually try this out!

Service Simulator

Amazon's developer console provides a nice little simulator where you can type in an utterance and see both the request it sends to your skill and the response that comes back. You can also choose to hear the response read aloud, so it's a great way to tinker with your skill without having to actually speak to your device over and over. For example, for an utterance of mute the movie, here's an example response:

{
  "version": "1.0",
  "response": {
    "outputSpeech": {
      "type": "PlainText",
      "text": "OK"
    },
    "shouldEndSession": true
  },
  "sessionAttributes": {
    "intentSequence": "DirectCommand",
    "command": "mute",
    "target": "movie"
  }
}

Try It For Real

That's fun and all, but how does it work in practice? I recorded a quick video of some initial tests in my living room:

Not bad for a first iteration! There's a lot more to do, but I must say it's pretty satisfying to actually have these pieces finally working together.


Next post in series: Adding a Listings Search Command