This article will be useful to anyone who wants to integrate the Alexa voice assistant into resource-constrained devices such as kettles or coffee makers.
For most people in the developed world, voice services such as Amazon Alexa, Apple's Siri, Google Assistant, and Microsoft's Cortana have become an integral part of life, making everyday tasks easier and more manageable. Voice control is practically everywhere, from smart speakers and smartphones to cars, and the current pace of technology development lets developers bring it into more and more new and interesting projects.
If you look at existing products, you will notice that they use fairly powerful processors based on the x86 or ARM Cortex-A architectures, which drives up their cost and power consumption. Meanwhile, the Internet of Things (IoT) era calls for new, more affordable voice assistants.
Unfortunately, full-fledged speech recognition is impossible on devices equipped only with microcontrollers; it requires far more powerful hardware. This is where the cloud comes in handy for any Internet-connected device.
Amazon, whose Amazon Web Services (AWS) is the world's largest cloud platform, sees great potential in voice assistants for consumer and industrial devices, and offers developers a powerful tool for this: Alexa Voice Service Integration for AWS IoT (AIA).
How it works
First of all, it's worth taking a look at the overall architecture of AIA. The Alexa functionality lives in the cloud, so we need a channel for exchanging audio streams and control commands. Your device's core functionality most likely already requires its own channels to the cloud; these are usually built on the MQTT protocol, in which case they can carry the Alexa data as well.
The communication interface between the device and the Alexa Voice Service (AVS) is established using a set of special MQTT topics. This interface must use the AWS IoT Core message broker, so you need to create a separate Thing for each IoT device. That said, if you're already using AWS IoT Core, you probably already have these Things and can reuse them. If not, just follow the documentation.
The overall protocol structure is a set of high-level topics under $aws/alexa/ais/v1/<clientId>:
- connection: this topic is used to control connecting your device to AVS and disconnecting it;
- capabilities: used to negotiate capabilities. In AIA terms, a capability is a predefined set of features that your device supports. There are only five at the moment, but new protocol versions may add more; they are described briefly below;
- directive: a topic for messages triggered by AVS;
- event: a topic for device-initiated messages;
- others: currently contains two sub-topics, microphone and speaker, dedicated to audio data transmission.
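The topic layout above can be sketched as a small helper that derives every topic name from the device's client ID. Note that the exact sub-topic paths here are an assumption based on the structure described in this article; the authoritative names are in the AIA documentation.

```python
def aia_topics(client_id: str) -> dict:
    """Return the high-level AIA topic names for one device (illustrative paths)."""
    root = f"$aws/alexa/ais/v1/{client_id}"
    subtopics = ["connection", "capabilities", "directive", "event",
                 "microphone", "speaker"]
    return {name: f"{root}/{name}" for name in subtopics}
```

Since every topic is scoped by the client ID, each Thing gets its own isolated set of channels on the broker.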
All data transmission except for the connection topic is encrypted with the AES-GCM symmetric cipher. This works on top of the TLS encryption that already protects the data channel itself.
To use Alexa Voice Service, you need to follow a certain procedure:
- First of all, you need to register a new device. This is done over a secure HTTP/1.1 connection. You must provide Alexa authentication information, namely the Login with Amazon (LWA) refresh token and the Alexa client ID, together with the encryption scheme the device supports and the AWS IoT Core connectivity information. The response contains the public key used to derive the symmetric encryption key, as well as the MQTT topic root.
- Next, you need to connect to the cloud via MQTT and subscribe to the topics listed above. The device can then join the Alexa Voice Service.
- Now you need to let AVS know which features your device supports. This is done by publishing their description to the “capabilities” topic, and it must happen immediately after connecting to AVS. As soon as you receive confirmation, everything is ready to go.
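The last step, announcing capabilities, boils down to publishing a JSON document. The payload shape below is purely illustrative; the field names are assumptions, so consult the AIA capabilities documentation for the real schema.

```python
import json

def build_capabilities_message(supported):
    """Build a capabilities announcement payload (illustrative schema,
    not the official AIA one)."""
    return json.dumps({
        "capabilities": [
            {"interface": name, "version": "1.0"} for name in supported
        ]
    })

# A device supporting all five current capabilities would announce:
payload = build_capabilities_message(
    ["System", "Clock", "Microphone", "Speaker", "Alerts"])
```

Whatever the exact schema, the key point is that the set of interfaces you publish here is what AVS will assume your device can do for the rest of the session.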
What your device needs
The AIA currently supports five capabilities:
- System. Covers essential functions such as encryption key updates, Alexa notification management, and error messages.
- Clock. This feature is used to synchronize local time with AIA.
- Microphone. Used to stream and control audio from the device’s microphone. To preserve user privacy, there are two directives that start and stop the transmission; it can be initiated either from the cloud or from the device, when the user speaks the wake word or presses a button. If a wake word activated the capture, it must be included in the audio data for verification in the cloud, so the device needs to keep an audio buffer of at least 500 ms. Regardless of the initiator, AVS requires the audio to be transmitted as single-channel, 16-bit linear PCM with a 16 kHz sampling rate; if your device has multiple microphones, additional audio processing may therefore be required.
- Speaker. This feature is used for audio output. Playback is controlled by the cloud, which sends the appropriate directives and audio data, although it can also be controlled locally using the device’s buttons. The data is encoded with the Opus audio codec, which is lightweight enough to run on embedded devices. The audio stream is interspersed with special markers; each time the device encounters such a marker, it should send a SpeakerMarkerEncountered event to notify AVS of the current playback position.
- Alerts. This feature allows your device to trigger reminders and alarms even when it is not connected to AVS, provided it has a built-in real-time clock (RTC).
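Returning to the Microphone capability: the 500 ms wake-word pre-roll mentioned above fits naturally in a ring buffer. At 16 kHz, 16-bit mono PCM, half a second is 16 000 bytes. A minimal sketch:

```python
from collections import deque

# 16 000 samples/s * 2 bytes/sample * 0.5 s = 16 000 bytes of pre-roll audio
PREROLL_BYTES = 16_000

class PreRollBuffer:
    """Fixed-size ring buffer holding the most recent microphone audio."""

    def __init__(self, capacity: int = PREROLL_BYTES):
        self._buf = deque(maxlen=capacity)

    def feed(self, pcm: bytes) -> None:
        # Oldest bytes are dropped automatically once the buffer is full.
        self._buf.extend(pcm)

    def snapshot(self) -> bytes:
        # Up to the last 500 ms of audio, oldest byte first.
        return bytes(self._buf)
```

When the wake-word engine fires, the device prepends `snapshot()` to the live stream so the cloud can re-verify the wake word before acting on the request.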