Prototype Architecture for Audio-Visual Interaction

The architecture consists of three components (cf. image):

  • The State Machine Frontend controls the user interface using high-fidelity graphics.
  • The Middleware Backend connects the State Machine Frontend with the Logical Backend.
  • The Logical Backend is based on Rhasspy 2.5, an open-source framework for offline voice assistant applications.
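
The division of labor between the three components can be illustrated with a minimal Python sketch. It is not the prototype's implementation: every class, method, and state name below is hypothetical, and the backend is a keyword-matching stub standing in for Rhasspy, whose API is not called here.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """Recognized intent with optional slots (hypothetical structure)."""
    name: str
    slots: dict = field(default_factory=dict)


class LogicalBackendStub:
    """Stand-in for the Rhasspy-based backend; plain keyword matching only."""

    def recognize(self, text: str) -> Intent:
        lowered = text.lower()
        if "portions" in lowered:
            return Intent("SetPortions")
        if "recipes" in lowered:
            return Intent("FindRecipes")
        return Intent("Unknown")


class StateMachineFrontend:
    """Holds the current UI state and switches it when an intent arrives."""

    # Assumed mapping from intent name to UI state; the state names are invented.
    TRANSITIONS = {
        "FindRecipes": "recipe_list",
        "SetPortions": "portion_selection",
    }

    def __init__(self) -> None:
        self.state = "idle"

    def handle(self, intent: Intent) -> str:
        # Unknown intents leave the current state unchanged.
        self.state = self.TRANSITIONS.get(intent.name, self.state)
        return self.state


class Middleware:
    """Passes user input to the backend and the resulting intent to the frontend."""

    def __init__(self, frontend: StateMachineFrontend, backend: LogicalBackendStub) -> None:
        self.frontend = frontend
        self.backend = backend

    def on_utterance(self, text: str) -> str:
        intent = self.backend.recognize(text)
        return self.frontend.handle(intent)


middleware = Middleware(StateMachineFrontend(), LogicalBackendStub())
print(middleware.on_utterance("What about pasta recipes today"))
```

Keeping the middleware as the only component that knows both sides means the frontend and the backend could, in principle, be replaced independently.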

Dataset for Natural Language Understanding

Training data

  • 1964 user utterances
  • 10724 running words

Test data

  • 839 user utterances
  • 4507 running words
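
Two derived figures follow directly from these counts: the average utterance length and the share of utterances held out for testing. The short Python sketch below computes them; the variable and function names are illustrative only.

```python
# Counts taken from the dataset description above.
train_utterances, train_words = 1964, 10724
test_utterances, test_words = 839, 4507


def words_per_utterance(words: int, utterances: int) -> float:
    """Average number of running words per user utterance."""
    return words / utterances


train_avg = words_per_utterance(train_words, train_utterances)
test_avg = words_per_utterance(test_words, test_utterances)

# Fraction of all utterances that belong to the test set.
test_share = test_utterances / (train_utterances + test_utterances)

print(f"train: {train_avg:.2f} words per utterance")
print(f"test:  {test_avg:.2f} words per utterance")
print(f"test share of utterances: {test_share:.1%}")
```

Both sets average roughly five and a half words per utterance, and the split is close to 70/30.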

Example Utterances with Annotations

1) Intent: FindRecipes, (slot_name, slot_value) = (ingredient, pasta)

"What about pasta recipes today"

2) Intent: RequestRecipeVariant, (slot_name, slot_value) = (recipe_type, vegetarian)

"Show me the vegetarian version of the recipe"

3) Intent: SetPortions, (slot_name, slot_value) = (amount, five)

"Select five portions please"
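
One way to hold such annotations in code is a small record per utterance. The Python sketch below is an assumption about a convenient representation, not the dataset's actual file format; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedUtterance:
    """One utterance with its intent and (slot_name, slot_value) pairs."""
    text: str
    intent: str
    slots: tuple


# The three annotated examples listed above.
EXAMPLES = [
    AnnotatedUtterance(
        "What about pasta recipes today",
        "FindRecipes",
        (("ingredient", "pasta"),),
    ),
    AnnotatedUtterance(
        "Show me the vegetarian version of the recipe",
        "RequestRecipeVariant",
        (("recipe_type", "vegetarian"),),
    ),
    AnnotatedUtterance(
        "Select five portions please",
        "SetPortions",
        (("amount", "five"),),
    ),
]


def slot_values_in_text(example: AnnotatedUtterance) -> bool:
    """Return True if every annotated slot value occurs as a word in the text."""
    words = example.text.lower().split()
    return all(value.lower() in words for _, value in example.slots)


for example in EXAMPLES:
    print(example.intent, dict(example.slots), slot_values_in_text(example))
```

A check like `slot_values_in_text` is a cheap sanity test for annotation typos, under the assumption that slot values are always copied verbatim from the utterance.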