Multi-modal recipe guidance

Multi-modal interaction options offer a high degree of accessibility, novelty and user-friendliness for smart kitchen devices.
As part of the UX design master project Natural User Interfaces in summer 2021, we present the design and prototype implementation of one such multi-modal interaction:
guided cooking with a smart kitchen appliance that accepts speech input in addition to touch gestures and responds with voice output. The appliance can therefore still be operated when the user's hands are dirty or occupied.

A paper written as part of the project was accepted at ICNLSP 2021: https://aclanthology.org/2021.icnlsp-1.30

The code for the prototype is available on GitHub at the following link: https://github.com/VoiceCookingAssistant/Audio-Visual-Cooking-Assistant

Prototype architecture for audio-visual interaction

The architecture consists of three components (cf. image):

The State Machine Frontend controls the user interface using high-fidelity graphics.
The Middleware Backend connects the State Machine Frontend with the Logical Backend.
The Logical Backend is based on Rhasspy 2.5, an open-source framework for offline voice assistant applications.
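
To illustrate how touch and speech can drive the same user interface state, here is a minimal sketch of a recipe state machine. It is not taken from the prototype: the intent names (`NextStep`, `PreviousStep`, `RepeatStep`), the gesture names, and the assumed shape of the recognized-intent dictionary (`{"intent": {"name": ...}}`) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical intent names; the actual prototype may use different ones.
NEXT, BACK, REPEAT = "NextStep", "PreviousStep", "RepeatStep"


@dataclass
class RecipeStateMachine:
    """Tracks the current recipe step, independent of the input modality."""
    steps: List[str]
    index: int = 0

    def handle(self, intent: str) -> str:
        """Apply an intent and return the step text to display and speak."""
        if intent == NEXT and self.index < len(self.steps) - 1:
            self.index += 1
        elif intent == BACK and self.index > 0:
            self.index -= 1
        # REPEAT and unknown intents leave the current step unchanged.
        return self.steps[self.index]


def intent_from_touch(gesture: str) -> str:
    """Map a touch gesture to an intent (gesture names are assumptions)."""
    mapping = {"swipe_left": NEXT, "swipe_right": BACK, "tap": REPEAT}
    return mapping.get(gesture, REPEAT)


def intent_from_voice(event: dict) -> str:
    """Read the intent name from a recognized-intent dictionary.

    Assumes the shape {"intent": {"name": "..."}}; falls back to REPEAT
    when no intent was recognized.
    """
    return event.get("intent", {}).get("name", REPEAT)


if __name__ == "__main__":
    machine = RecipeStateMachine(["Chop onions", "Fry onions", "Serve"])
    print(machine.handle(intent_from_touch("swipe_left")))
    print(machine.handle(intent_from_voice({"intent": {"name": "NextStep"}})))
```

Because both input paths are reduced to the same intent names before they reach the state machine, a middleware layer only needs to forward intents, and the frontend does not need to know whether a step change came from a gesture or from speech.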