AGL Voice Agent / Assistant

Introduction

A gRPC-based voice agent designed for Automotive Grade Linux (AGL). This service leverages GStreamer, Vosk, Whisper AI, and Snips to process user voice commands seamlessly. It converts spoken words into text, extracts intents from these commands, and performs actions through the Kuksa interface. The voice agent is modular and extensible, allowing for the addition of new speech recognition and intent extraction models. Whisper AI can run in both online and offline modes and offers more accurate transcription than Vosk, though it may run slower in offline mode.

Note: RASA NLU is currently not available as it is not supported by Python 3.12, the version used in AGL.

Installation and Usage

Before diving into the detailed component documentation, let's first look at how to install and use the voice agent service. All of the features of the voice agent service are encapsulated in the meta-offline-voice-agent sub-layer, which can be found under the meta-agl-devel layer. You can build this sub-layer into the final image using the following commands:

Building for QEMU x86-64

$ source salmon/meta-agl/scripts/aglsetup.sh -m qemux86-64 -b build-salmon agl-demo agl-devel agl-offline-voice-agent
$ source agl-init-build-env
$ bitbake agl-ivi-demo-flutter
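
Once the build finishes, the image can be booted directly from the build environment with Yocto's runqemu wrapper; the exact options may vary with your host setup:

$ runqemu qemux86-64 kvm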

Building for Raspberry Pi 5

Whisper AI does not work on the emulator; therefore, building and running on a physical device like the Raspberry Pi 5 is recommended. You can set up the build environment specifically for Raspberry Pi 5 using:

$ source salmon/meta-agl/scripts/aglsetup.sh -m raspberrypi5 -b build-salmon agl-demo agl-devel agl-offline-voice-agent
$ source agl-init-build-env
$ bitbake agl-ivi-demo-flutter
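
After the build completes, the image is deployed under tmp/deploy/images/raspberrypi5/ in the build directory and can be written to an SD card. The exact image filename depends on your build configuration; a typical sequence looks roughly like this (replace /dev/sdX with your SD card device):

$ cd tmp/deploy/images/raspberrypi5/
$ xzcat agl-ivi-demo-flutter-raspberrypi5.wic.xz | sudo dd of=/dev/sdX bs=4M status=progress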


Note: The voice assistant client is already integrated into flutter-ics-homescreen, which is the default homescreen app for agl-ivi-demo-flutter. However, if you wish to build any other image, you can use the flutter-voiceassistant app, a standalone Flutter-based client for the voice assistant. You can add flutter-voiceassistant to your build by adding the following lines to the conf/local.conf file:

FEATURE_PACKAGES_agl-offline-voice-agent:append = " \
    flutter-voiceassistant \
"

The voice agent service will automatically start on startup with the default configuration located at /etc/default/voice-agent-config.ini.

The default configuration file looks like this:

[General]
base_audio_dir = /usr/share/nlu/commands/
vosk_model_path = /usr/share/vosk/vosk-model-small-en-us-0.15/
whisper_model_path = /usr/share/whisper/tiny.pt
whisper_cpp_path = /usr/bin/whisper-cpp
whisper_cpp_model_path = /usr/share/whisper-cpp/models/tiny.en.bin
wake_word_model_path = /usr/share/vosk/vosk-model-small-en-us-0.15/
snips_model_path = /usr/share/nlu/snips/model/
channels = 1
sample_rate = 16000
bits_per_sample = 16
wake_word = hey automotive
server_port = 51053
server_address = 127.0.0.1
rasa_model_path = /usr/share/nlu/rasa/models/
rasa_server_port = 51054
rasa_detached_mode = 1
base_log_dir = /usr/share/nlu/logs/
store_voice_commands = 0
online_mode = 1
online_mode_address = online-whisper-asr-service-address
online_mode_port = online-whisper-asr-service-port
online_mode_timeout = 15
mpd_ip = 127.0.0.1
mpd_port = 6600

[Kuksa]
ip = 127.0.0.1
port = 55555
protocol = grpc
insecure = 0
token =  /usr/lib/python3.12/site-packages/kuksa_certificates/jwt/super-admin.json.token
tls_server_name = Server

[VSS]
hostname = localhost
port = 55555
protocol = grpc
insecure = 0
token_filename = /etc/xdg/AGL/agl-vss-helper/agl-vss-helper.token
ca_cert_filename = /etc/kuksa-val/CA.pem
tls_server_name = Server

[Mapper]
intents_vss_map = /usr/share/nlu/mappings/intents_vss_map.json
vss_signals_spec = /usr/share/nlu/mappings/vss_signals_spec.json

Most of the above configuration variables are self-explanatory, but a few points are worth calling out.

If you want to change the default configuration, you can do so by creating a new configuration file and then passing it to the voice agent service using the --config flag. For example:

$ voiceagent-service run-server --config path/to/config.ini

One thing to note here is that all the directory paths in the configuration file should be absolute and always end with a /.
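
A typical workflow on the target is to copy the default file, adjust it, and point the service at the copy (the /home/root location below is just an example):

$ cp /etc/default/voice-agent-config.ini /home/root/voice-agent-config.ini
$ vi /home/root/voice-agent-config.ini    # e.g. change wake_word or online_mode
$ voiceagent-service run-server --config /home/root/voice-agent-config.ini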

High Level Architecture

Voice_Agent_Architecture

Components

Voice Agent Service

The voice agent service is a gRPC-based service that is responsible for converting spoken words into text, extracting intents from these commands, and performing actions through the Kuksa interface. The service is composed of four main components: Whisper AI, Vosk Kaldi, Snips, and RASA (the latter currently excluded from the build, as noted above).
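
Because the service listens for gRPC on server_address:server_port (127.0.0.1:51053 by default), any gRPC client can connect to it once stubs are generated from the service's .proto files. The module and RPC names in the sketch below are hypothetical placeholders, not the actual API:

# Hypothetical sketch: the stub module and RPC names are placeholders;
# generate the real ones from the voice agent's .proto definitions.
import grpc

channel = grpc.insecure_channel("127.0.0.1:51053")  # server_address:server_port
# stub = voice_agent_pb2_grpc.VoiceAgentServiceStub(channel)
# response = stub.RecognizeVoiceCommand(request)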

Whisper AI

Whisper AI is a versatile speech recognition model developed by OpenAI. It is designed to convert spoken words into text with high accuracy and supports multiple languages out of the box. Whisper AI offers various pre-trained models, each optimized for different hardware capabilities and performance requirements. The current voice agent service uses Whisper AI for speech recognition. In offline mode, Whisper AI processes speech locally on the device, providing an accurate transcription of user commands. While Whisper AI may be slightly slower than other models like Vosk, it offers higher accuracy, especially for complex and noisy inputs. It does not currently support wake word detection, so it is used alongside other tools like Vosk for a comprehensive voice assistant solution.
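
As a rough, self-contained illustration of offline transcription with the tiny checkpoint from the default configuration, using the openai-whisper Python package (the service itself may instead go through whisper.cpp, as the whisper_cpp_* settings suggest):

# Minimal offline transcription sketch with the openai-whisper package.
import whisper

# whisper_model_path from the default voice-agent configuration.
model = whisper.load_model("/usr/share/whisper/tiny.pt")
result = model.transcribe("command.wav", language="en", fp16=False)
print(result["text"])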

Vosk Kaldi

Vosk is a speech recognition toolkit based on Kaldi. It is used to convert spoken words into text and provides official pre-trained models for many popular languages; you can also train your own models with the toolkit. The current voice agent service requires two different models to run: one for wake-word detection and one for speech recognition. The wake-word detection model detects when the user says the wake word, which is "Hey Automotive" by default and can easily be changed in the config file. The speech recognition model converts the user's spoken words into text.

Note: The current wake word is set to "Hey Automotive," but you can change it to any wake word of your choice. However, if you plan to use a more specific wake word, you may need to train the Vosk wake word model yourself, as the pre-trained models may not be optimized for your specific wake word.
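
For reference, a minimal sketch of transcribing a 16 kHz mono WAV file (matching the sample_rate and channels values in the default config) with the Vosk Python API; the actual service feeds audio through GStreamer instead of reading WAV files:

# Minimal Vosk transcription sketch for a 16 kHz mono WAV file.
import json
import wave

from vosk import KaldiRecognizer, Model

model = Model("/usr/share/vosk/vosk-model-small-en-us-0.15")  # vosk_model_path
rec = KaldiRecognizer(model, 16000)

with wave.open("command.wav", "rb") as wav:
    while True:
        data = wav.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])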

Snips

Snips NLU (Natural Language Understanding) is a Python-based intent engine that extracts structured information from sentences written in natural language. The NLU engine first detects the user's intention (a.k.a. intent), then extracts the parameters (called slots) of the query. The developer can then use this to determine the appropriate action or response. Our voice agent service uses either Snips or RASA to extract intents from the user's spoken commands.

It is recommended to take a brief look at Snips Official Documentation to get a better understanding of how Snips works.
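
Assuming the AGL forks keep the upstream snips-nlu Python API, loading a trained model and parsing a command looks roughly like this:

# Rough sketch using the upstream snips-nlu Python API; the AGL forks
# (snips-sdk-agl / snips-inference-agl) are assumed to stay compatible with it.
from snips_nlu import SnipsNLUEngine

engine = SnipsNLUEngine.from_path("/usr/share/nlu/snips/model/")  # snips_model_path
parsing = engine.parse("turn on the fan")
print(parsing["intent"]["intentName"], parsing["slots"])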

Dataset Format

The Snips NLU engine uses a dataset to understand and recognize user intents. The dataset is structured into two files: intents.yaml, which defines the intents together with their slots and example utterances, and entities.yaml, which defines the entity values those slots can take.

To train the NLU Intent Engine model, a pre-processing step is required to convert the dataset into a format compatible with the Snips NLU engine. Once the model is trained, it can be used to parse user queries and extract the intent and relevant slots for further processing.
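
For illustration, entries in these files follow the standard Snips YAML dataset format; the intent and entity below are hypothetical examples, not part of the shipped dataset:

# intents.yaml (hypothetical example intent)
type: intent
name: adjustVolume
slots:
  - name: volumeLevel
    entity: snips/number
utterances:
  - set the volume to [volumeLevel](five)
  - turn the volume up to [volumeLevel](ten)

# entities.yaml (hypothetical example entity)
type: entity
name: fanSpeed
automatically_extensible: false
values:
  - low
  - medium
  - [high, maximum]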

Training

To train the NLU Intent Engine for your specific use case, you can modify the dataset files intents.yaml and entities.yaml to add new intents, slots, or entity values. If you modify intents.yaml or entities.yaml, you need to re-generate the dataset; for this you need to install the snips-sdk-agl module. This module is an extension of the original Snips NLU with upgraded Python support and is designed specifically for data pre-processing and training purposes only.

After installation run the following command to generate the updated dataset.json file:

$ snips-sdk generate-dataset en entities.yaml intents.yaml > dataset.json

Then run the following command to re-train the model:

$ snips-sdk train path/to/dataset.json path/to/model

Finally, you can use the snips-inference-agl module to process commands and extract the associated intents.

Usage

To set up and run the Snips NLU Intent Engine, follow these steps:

  1. Train your model by following the steps laid out earlier, or just clone a pre-existing model from here.

  2. Install and set up the snips-inference-agl module on your local machine. This module is an extension of the original Snips NLU with upgraded Python support and is specifically designed for inference purposes only.

  3. Once you have the snips-inference-agl module installed, you can load the pre-trained model located in the model/ folder. This model contains the trained data and parameters necessary for intent extraction. You can use the following command to process commands and extract the associated intents:

    $ snips-inference parse path/to/model -q "your command here"
    

Observations

RASA

Note: RASA is currently not included in the build as it is not supported by Python 3.12, the version used in AGL.

RASA is an open-source machine learning framework, based on Python and TensorFlow, for building contextual AI assistants and chatbots. In the voice agent it is used to extract intents from the user's spoken commands. The RASA NLU engine is trained on a dataset containing intents, entities, and sample utterances, and is then used to parse user queries and extract the intent and relevant entities for further processing.

It is recommended to take a brief look at RASA Official Documentation to get a better understanding of how RASA works.

Dataset Format

Rasa uses YAML as a unified and extendable way to manage all training data, including NLU data, stories and rules.

You can split the training data over any number of YAML files, and each file can contain any combination of NLU data, stories, and rules. The training data parser determines the training data type using top level keys.

NLU training data consists of example user utterances categorized by intent. Training examples can also include entities. Entities are structured pieces of information that can be extracted from a user's message. You can also add extra information such as regular expressions and lookup tables to your training data to help the model identify intents and entities correctly. Example dataset for check_balance intent:

nlu:
- intent: check_balance
  examples: |
    - What's my [credit](account) balance?
    - What's the balance on my [credit card account]{"entity":"account","value":"credit"}

- synonym: credit
  examples: |
    - credit card account
    - credit account

Training

To train the RASA NLU intent engine model, you need to curate a dataset for your specific use case. You can also use the RASA NLU Trainer to curate your dataset. Once your dataset is ready, you need to create a config.yml file, which contains the configuration for the RASA NLU engine. A sample config.yml file is given below:

language: en  # your 2-letter language code
assistant_id: 20230807-130137-kind-easement

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1

Now install RASA (v3.6.4) using the following command:

$ pip install rasa==3.6.4

Finally, you can use the following command to train the RASA NLU engine:

$ rasa train nlu --config config.yml --nlu path/to/dataset.yml --out path/to/model

Usage

To set up and run the RASA NLU Intent Engine, follow these steps:

  1. Train your model by following the steps laid out earlier, or just clone a pre-existing model from here.

  2. Once you have RASA (v3.6.4) installed, you can load the pre-trained model located in the model/ folder. This model contains the trained data and parameters necessary for intent extraction. You can use the following command to process commands and extract the associated intents:

    $ rasa shell --model path/to/model
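
Alternatively, since the rasa_detached_mode and rasa_server_port settings in the voice agent configuration imply running RASA as a separate server, the same model can be served over RASA's standard HTTP API and queried directly (the voice agent may manage this process itself):

$ rasa run --enable-api -m path/to/model -p 51054
$ curl -s -X POST http://localhost:51054/model/parse -d '{"text": "turn on the fan"}'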
    

Observations

Voice Assistant Client

flutter-ics-homescreen Integration

The voice assistant client is integrated into the flutter-ics-homescreen app, which can interact with the voice agent service to process user voice commands.

Flutter-ics-homescreen_1 Flutter-ics-homescreen_2

Voice Assistant App

The voice assistant app is a standalone, Flutter-based client designed for Automotive Grade Linux (AGL). It is responsible for interacting with the voice agent service for user voice command recognition, intent extraction, and command execution. It also receives the response from the voice agent service and displays it on the screen. Some app UI screenshots are attached below.

Voice_Agent_App_1 Voice_Agent_App_2