Multimodal
Multimodal interaction gives the user multiple modes of interacting with a system: a multimodal interface provides several distinct channels for data input and output. In multimodal human-computer interaction (HCI), users communicate naturally with virtual and physical environments, with flexible input (speech, handwriting, gestures) and output (speech synthesis, graphics). Multimodal fusion combines the inputs from the different modalities and resolves ambiguities between them. https://en.wikipedia.org/wiki/Multimodal_interaction
- see also unified communication
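By way of illustration, here is a minimal late-fusion sketch (the recognizers, command names, and weights are hypothetical, not from any particular toolkit): each modality produces scored hypotheses, and a weighted combination resolves a command that is ambiguous from speech alone.

```python
# Minimal late-fusion sketch (illustrative): two hypothetical recognizers
# each return candidate commands with confidence scores, and fusion
# combines them to resolve an ambiguous spoken command.

def fuse_late(speech_hyps: dict[str, float], gesture_hyps: dict[str, float],
              w_speech: float = 0.6, w_gesture: float = 0.4) -> str:
    """Weighted late fusion: combine per-modality confidence scores
    and return the highest-scoring joint interpretation."""
    candidates = set(speech_hyps) | set(gesture_hyps)
    scores = {
        c: w_speech * speech_hyps.get(c, 0.0) + w_gesture * gesture_hyps.get(c, 0.0)
        for c in candidates
    }
    return max(scores, key=scores.get)

# "Put that there": speech alone is nearly ambiguous between two targets,
# but a pointing gesture strongly favours one of them.
speech = {"move_to_table": 0.48, "move_to_shelf": 0.47}
gesture = {"move_to_shelf": 0.90}
print(fuse_late(speech, gesture))  # -> "move_to_shelf"
```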
Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval,[1] text-to-image generation,[2] aesthetic ranking,[3] and image captioning.[4] Large multimodal models, such as Google Gemini and GPT-4o, have become increasingly popular since 2023, offering greater versatility and a broader understanding of real-world phenomena. https://en.wikipedia.org/wiki/Multimodal_learning
- see also multimedia
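As a concrete illustration, the sketch below shows one common way modalities are integrated in a deep model: encode each modality separately, project into a shared space, and predict from the concatenated (fused) representation. This is a minimal sketch assuming PyTorch and hypothetical pretrained encoders whose feature vectors are stand-ins here; it is not the architecture of Gemini, GPT-4o, or any specific model.

```python
import torch
import torch.nn as nn

class TinyMultimodalClassifier(nn.Module):
    """Toy fusion model: project image and text features into a shared
    space, concatenate them, and classify from the fused representation."""

    def __init__(self, img_dim=512, txt_dim=256, hidden=128, n_classes=10):
        super().__init__()
        # Per-modality projections into a shared space
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # Joint head operating on the concatenated (fused) representation
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, n_classes),
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.head(fused)

# Dummy batch: pretend these vectors came from an image encoder and a text encoder.
model = TinyMultimodalClassifier()
img = torch.randn(4, 512)
txt = torch.randn(4, 256)
logits = model(img, txt)
print(logits.shape)  # -> torch.Size([4, 10])
```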
other uses: https://en.wikipedia.org/wiki/Multimodal
