This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these include text, audio, images/videos and action taking.

Yingshan Chang

Catherine Cheng

Durvesh Malpure

Soham Tiwari

Aditya Veerubhotla

Kenneth Zheng

Piazza and Canvas

All course communication will happen via Piazza and Canvas. All videos will be posted to Canvas for offline viewing, though some aspects of the class/teaching will be interactive and missing from the Zoom sessions. Questions will be taken in person, not on chat.

Piazza: https://piazza.com/cmu/spring2023/11777

Assignments Timeline and Grading

The course is primarily project based, but there will be readings throughout the course which are only graded via participation.

Project Timeline and Assignments (70% total): (see links for more details)
Jan 30 Submit group members and name
Feb 10 R1 Dataset Proposal and Analysis (10%)
Mar 03 R2 Related Work and Model Proposal (15%)
Mar 31 R3 Baseline Analysis (15%)
Apr 25/27 Presentation (10%)
May 01 R4 Completed Report (20%)

Discussion Participation (20%, with up to 2% bonus):
We will have discussion posts for papers and lectures on Piazza. We ask you to make two thoughtful contributions per week (either asking questions or posting answers) on these posts. 11 weeks x 2 contributions = 22 pts, but we will grade out of 20 pts, allowing for 2 pts of extra credit. We will also give credit for asking questions in lectures.

Paper Summaries (10%):
Writing a three-sentence summary of the paper you read earns you 1 pt. The summary will be submitted in three text boxes: A. State the goal of the paper, B. Explain the key insight, C. State a key limitation or important extension. There will be 11 opportunities, so one bonus point can be earned (up to 11%). Paper summaries are due the following Tuesday night (1 week after being assigned).
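As a rough illustration of how the percentages above add up, here is a minimal sketch in Python. The function and variable names are hypothetical and this is not an official grade calculator. It assumes each credited discussion contribution and each accepted summary is worth one percentage point, and it does not model credit for questions asked in lecture or the two-per-week pacing.

```python
# Weights (in percentage points) copied from the project timeline above.
PROJECT_WEIGHTS = {
    "R1": 10,
    "R2": 15,
    "R3": 15,
    "presentation": 10,
    "R4": 20,
}

MAX_PARTICIPATION_POINTS = 22  # 11 weeks x 2 contributions, graded out of 20
MAX_SUMMARY_POINTS = 11        # 11 opportunities, graded out of 10


def course_total(project_fractions, contributions, summaries):
    """Estimate the course percentage under the assumptions stated above.

    project_fractions: dict mapping a project component to the fraction
        of its credit earned, between 0.0 and 1.0 (missing entries count as 0).
    contributions: number of credited discussion contributions.
    summaries: number of accepted paper summaries.
    """
    project = sum(
        weight * project_fractions.get(name, 0.0)
        for name, weight in PROJECT_WEIGHTS.items()
    )
    # Points beyond 20 (participation) or 10 (summaries) act as bonus,
    # but cannot exceed the number of available opportunities.
    participation = min(contributions, MAX_PARTICIPATION_POINTS)
    summary = min(summaries, MAX_SUMMARY_POINTS)
    return project + participation + summary


full_marks = {name: 1.0 for name in PROJECT_WEIGHTS}
print(course_total(full_marks, 22, 11))  # 103.0
```

Under these assumptions the maximum is 103%: 70% from the project, 22% from participation, and 11% from summaries.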

Submission Policies:

Tasks & Datasets

The course will be primarily centered on a few datasets/tasks to facilitate cross-team collaboration and technical assistance. If your team has a good reason to work on something else, please reach out so we can discuss it and put together a proposal.

Simulator Based
Room-Across-Room (Code): Multilingual Embodied Navigation
ALFRED (Code): Embodied instruction following with interaction

Question Answering & Captioning
TextVQA (Code): Text in images (referring expressions and reading)
WebQA (Code): Multihop Visual QA
VizWiz (VQA and Captioning): Visual models for blind users
Social-IQ (Code, Proj page): Video Question Answering focused on social interactions
NLVR2 (Code, Proj page): Complex reasoning about pairs of images

Multi-turn QA
CompGuessWhat?!: Visual Guessing Game and Attribute Prediction

Spoken Image Captions: A series of audio corpora and corresponding images for connecting audio directly to image regions.

TVQA: Video Question Answering Dataset
VATEX: Multilingual Video Captioning and Translation

Physical hardware / robots / sensors ...
What about physical hardware? robots? tasks not datasets? Let's talk.

Compute

Limited AWS compute credits will be made available to each student, so please consider both your interests and available compute resources when deciding on a dataset/project.


Tuesday Thursday
Jan 17: Course Structure
  • Research and technical challenges
  • Syllabus and requirements
Jan 19: Multimodal applications and datasets
  • Research tasks and datasets
  • Team projects
Jan 24: Unimodal representations (Vision)
  • CNNs
  • Residuals and Skip connections
Jan 26: Unimodal representations (Language)
  • Gating and LSTMs
  • Transformers
  • Form groups!
Jan 31: Representation
Feb 02: Representation
Feb 07: Alignment + Grounding
Feb 09: Multimodal Transformers
  • Report 1
Feb 14: Multimodal Reasoning
Feb 16: Multimodal Reasoning (cont)
Feb 21: Embodiment
Feb 23: RL, Logic, and Causality
Feb 28: Project Hours
Mar 02: Project Hours
  • Report 2
Mar 07: Spring Break!
Mar 09: Spring Break!
Mar 14: Embodiment (cont)
Mar 16: Quantification and Bias
Mar 21: Generation + Translation
Mar 23: Generation
Mar 28: Transference
Mar 30: Transference
  • Report 3
Apr 04: Project Hours
Apr 06: Project Hours
Apr 11: New research directions
  • Recent publications
Apr 13: Carnival (no class)
Apr 18: Guest Lecture
Apr 20: Guest Lecture
Apr 25: Final Presentations
Apr 27: Final Presentations