Fall 2024

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these modalities include text, audio, images/videos, and action taking.


Piyush Khanna
Head TA

OHs TBD
(location TBD)
piyushkh@andrew

Li-Wei Chen
TA

OHs TBD
(location TBD)
liweiche@andrew

Madhura Deshpande
TA

OHs TBD
(location TBD)
mvdeshpa@cs

Haoyang He
TA

OHs TBD
(location TBD)
hhe2@andrew

Lawrence Jang
TA

OHs TBD
(location TBD)
ljang@andrew

Siddhant Waghjale
TA

OHs TBD
(location TBD)
swaghjal@andrew


Piazza and Canvas

All course communication will happen via Piazza and Canvas. All videos will be posted to Canvas for offline viewing, though aspects of the class/teaching will be interactive and missing from the Zoom sessions. Questions will be taken in person, not on chat.

Piazza: https://piazza.com/cmu/fall2024/11777


Assignments Timeline and Grading

The course is primarily project-based, but there will be readings throughout the course and participation points.

Project Timeline and Assignments (70% total): (see links for more details)
Sep 13 Submit group members and name
Sep 19 R1 Dataset Proposal and Analysis (10%)
Oct 8 / 10 Midterm Presentation (10%)
Oct 10 R2 Baselines and Model Proposal (10%)
Nov 7 R3 Analysis of Baselines (10%)
Dec 3 / 5 Final Presentation (10%)
Dec 11 R4 Completed Report (20%)

Discussion Participation (20%, with up to 2% bonus):
We will have discussion posts for class sessions on Canvas. There are two types of sessions that we will ask for discussion in. There will be 23 course sessions that we'll grade participation for, but we will grade out of 20 points, allowing for 3 points of extra credit.

Paper Readings (10%):
These readings start generic and become specific. You are building the related work section of your project. We will begin by offering readings to choose from, and you will eventually be finding your own readings. You will submit annotations on the paper with your questions and notes, tied to particular parts of the paper. You can do this either by uploading a PDF with your notes (handwritten, or using a PDF comment function) or, if necessary, by uploading a text file. If you use a text file, you *must* have notes that refer to specific parts of the paper, e.g. Equation 2, Section 2.3, Conclusion. Paper summaries are due the following Monday night (1 week after being assigned).
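As a quick sanity check on the breakdown above, here is a minimal sketch that adds up the stated weights. The names `PROJECT_WEIGHTS`, `COURSE_WEIGHTS`, and `weighted_total` are hypothetical helpers for illustration, not part of any course tooling, and the discussion bonus is deliberately not modeled.

```python
# Project component weights, copied from the timeline above (percent of final grade).
PROJECT_WEIGHTS = {
    "R1 Dataset Proposal and Analysis": 10,
    "Midterm Presentation": 10,
    "R2 Baselines and Model Proposal": 10,
    "R3 Analysis of Baselines": 10,
    "Final Presentation": 10,
    "R4 Completed Report": 20,
}

# Top-level weights as stated on this page; the discussion bonus is not modeled.
COURSE_WEIGHTS = {
    "project": sum(PROJECT_WEIGHTS.values()),
    "discussion": 20,
    "readings": 10,
}


def weighted_total(fractions):
    """Hypothetical helper: combine per-component fractions (0.0 to 1.0) into a percent."""
    return sum(COURSE_WEIGHTS[name] * fractions[name] for name in COURSE_WEIGHTS)


if __name__ == "__main__":
    print("project weight:", COURSE_WEIGHTS["project"])
    print("total weight:", sum(COURSE_WEIGHTS.values()))
    print("example total:", weighted_total({"project": 0.9, "discussion": 1.0, "readings": 0.5}))
```

The project components should add up to the 70% stated in the timeline heading, and the three top-level components to 100%.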

Submission Policies:

Tasks & Datasets

The course is primarily centered on a project. We will have a list of seed tasks and datasets below (coming soon!).


Piyush: Visual question answering, uncertainty estimation and calibration in multimodal models, multimodal reasoning using code generation.
  • OK-VQA: Outside Knowledge Visual Question Answering
  • A-OKVQA: Augmented OK-VQA
  • V* Bench: Visual search question answering
  • MMCoQA: Conversational Question Answering over Text, Tables, and Images
  • xGQA: Cross-lingual Visual Question Answering
Li-Wei: Speech processing and generation
  • Video + Audio + Text Emotion Recognition
    • IEMOCAP: Interactive Emotional Dyadic Motion Capture
    • RAVDESS: Emotional Speech and Song
  • Multimodal ASR
  • SoundSpaces: Audio-Visual Navigation in 3D Environments
Madhura: Vision-language models, image/text modalities
Haoyang: Embodiment, Language+Vision
Lawrence: OS/Web Agents
  • WebArena: A Realistic Web Environment for Building Autonomous Agents
  • VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
  • VisualWebBench: Web page understanding and grounding
Siddhant: Visual Common Sense Reasoning, Multimodal Question Answering
  • Memecap: A Dataset for Captioning and Interpreting Memes
  • Sherlock: A dataset for visual abductive reasoning
  • SPIQA: A dataset for multimodal question answering on scientific papers

Physical hardware / robots / sensors ...
What about physical hardware? robots? tasks not datasets? Let's talk.


Compute

Limited AWS compute credits will be made available to each student, so please consider both your interests and available compute resources when deciding on a dataset/project.

Lectures

(tentative)
Tuesday / Thursday
Aug 27: Course Structure
  • Research and technical challenges
  • Syllabus and requirements
Aug 29: Multimodal applications and datasets
  • Research tasks and datasets
  • Team projects
Overview Readings, or Tasks and Data Readings:
  • Any of the papers linked in Tasks & Datasets or relevant to your idea for a project
Sep 3: Unimodal representations (Language)
Sep 5: Unimodal representations (Vision + Audio)
Readings:
Sep 10: Fusion
Sep 12: Fission
Readings:
Sep 17: Alignment + Grounding
Sep 19: Aligned Representations
  • Report 1
Sep 24: Project Hours
Sep 26: Project Hours
Oct 1: Transformers Part 1
Oct 3: Transformers Part 2
Oct 8: Midterm Presentations
Oct 10: Midterm Presentations
  • Report 2
Oct 15: Fall Break!
Oct 17: Fall Break!
Oct 22: Generation
Oct 24: Generation (cont.)
Oct 29: Embodiment
Oct 31: Embodiment (cont.)
Nov 5: Election Day (university holiday)
Nov 7: Agents, Guest Lecture: JY Koh
  • Report 3
Nov 12: Reasoning
Nov 14: Guest Lecture: So Yeon Tiffany Min
Nov 19: Guest Lecture: Lili Yu
Nov 21: Quantification + Bias
Nov 26: Interacting with (V)LMs
Nov 28: Thanksgiving (no class)
Dec 3: Final Presentations
Dec 5: Final Presentations