Multimodal machine learning (MMML)
Multimodal machine learning (MMML) is a vibrant multi-disciplinary research field that addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic, and visual messages. Beginning with early research on audio-visual speech recognition and continuing more recently with language & vision projects such as image and video captioning, this research field poses unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities. This course will teach fundamental mathematical concepts related to MMML, including multimodal alignment and fusion, heterogeneous representation learning, and multi-stream temporal modeling. We will also review recent papers describing state-of-the-art probabilistic models and computational algorithms for MMML and discuss current and upcoming challenges.
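As a concrete illustration of multimodal fusion, here is a minimal PyTorch sketch contrasting two common strategies: early fusion, which concatenates per-modality features before prediction, and late fusion, which combines per-modality predictions. The module names and feature dimensions below are illustrative assumptions, not taken from the course materials.

import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    # Concatenate modality features, then predict from the joint vector.
    def __init__(self, d_text=300, d_audio=74, d_visual=35, n_classes=2):
        super().__init__()
        self.head = nn.Linear(d_text + d_audio + d_visual, n_classes)

    def forward(self, text, audio, visual):
        joint = torch.cat([text, audio, visual], dim=-1)  # (batch, sum of dims)
        return self.head(joint)

class LateFusion(nn.Module):
    # Predict from each modality separately, then average the logits.
    def __init__(self, d_text=300, d_audio=74, d_visual=35, n_classes=2):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d, n_classes) for d in (d_text, d_audio, d_visual))

    def forward(self, text, audio, visual):
        logits = [h(x) for h, x in zip(self.heads, (text, audio, visual))]
        return torch.stack(logits).mean(dim=0)

Early fusion lets the model learn cross-modal feature interactions directly but couples the modalities at the input level; late fusion is more robust to a missing or noisy modality at the cost of modeling fewer interactions.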
The course will present the fundamental mathematical concepts in machine learning and deep learning relevant to the six main challenges in multimodal machine learning: (1) representation, (2) alignment, (3) reasoning, (4) generation, (5) transference, and (6) quantification. Topics include, but are not limited to, multimodal transformers, neuro-symbolic models, multimodal tensor fusion, mutual information, and multimodal graph networks. The course will also discuss many recent applications of MMML, including multimodal affect recognition, multimodal language grounding, and language-vision navigation.
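To make one of these topics concrete, multimodal tensor fusion can be sketched in a few lines: each modality vector is augmented with a constant 1, and the fused representation is the outer product of the augmented vectors, so that unimodal, bimodal, and trimodal interaction terms all appear explicitly. The snippet below is a minimal PyTorch sketch in the spirit of tensor fusion networks; the function name and dimensions are illustrative assumptions.

import torch

def tensor_fusion(text, audio, visual):
    # Append a constant 1 to each modality so the outer product retains
    # unimodal and bimodal terms alongside the trimodal interactions.
    pad = lambda x: torch.cat([x, torch.ones_like(x[:, :1])], dim=-1)
    t, a, v = pad(text), pad(audio), pad(visual)
    # Batched outer product over (B, Dt+1), (B, Da+1), (B, Dv+1).
    fused = torch.einsum('bi,bj,bk->bijk', t, a, v)
    return fused.flatten(start_dim=1)

# (8+1) * (5+1) * (3+1) = 216 fused features per example.
out = tensor_fusion(torch.randn(4, 8), torch.randn(4, 5), torch.randn(4, 3))
print(out.shape)  # torch.Size([4, 216])

Note that the fused dimension grows multiplicatively with the number of modalities, which is why low-rank factorizations of this product are often used in practice.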
Semesters
This course is offered every semester.
For advanced topics, see the follow-up course, 11-877 Advanced Topics in Multimodal Machine Learning.
The syllabus and materials are updated slightly each semester, primarily to incorporate newly emerging research topics, but the overall structure of the course remains the same. The instructors are:
With substantial assistance from Paul Pu Liang.