Spring 2024

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these include text, audio, images/videos and action taking.

Soham Dinesh Tiwari
Head TA

Wed 9:30-10:30am
(Zoom: Piazza pin)

Mehul Agarwal

Tues 4-5pm
(Wean 3110)

Yingshan Chang

Mon 10-11am
(Wean 3110)

Vidhi Jain

Wed 2:30-3:30pm
(Wean 3110)

Piyush Khanna

Mon 4-5pm
(Wean 3110)

Vanya Bannihatti Kumar

Mon 2:15-3:15pm
(GHC 5417)

Piazza and Canvas

All course communication will happen via Piazza and Canvas. All videos will be posted to Canvas for offline viewing, though some aspects of the class and teaching will be interactive and missing from the Zoom sessions. Questions will be taken in person, not in the chat.

Piazza: https://piazza.com/cmu/spring2024/11777

Assignments Timeline and Grading

The course is primarily project based, but there will be readings throughout the course and participation points.

Project Timeline and Assignments (70% total): (see links for more details)
Feb 02 Submit group members and name
Feb 08 R1 Dataset Proposal and Analysis (10%)
Mar 01 R2 Baselines and Model Proposal (15%)
Mar 28 R3 Analysis of Baselines (15%)
Apr 23/25 Presentation (10%)
Apr 26 R4 Completed Report (20%)

Discussion Participation (20%, with up to 2% bonus):
We will have discussion posts for papers and lectures on Piazza. We ask you to make two thoughtful contributions per week (either asking questions or posting answers) on these posts. 11 weeks x 2 contributions = 22 pts, but we will grade out of 20 points, allowing for 2 pts of extra credit. We will also give credit for asking questions in lectures, but you must post a note documenting your participation to Piazza after class so the TAs can tally it.
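
The participation arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the official grading script: the function names are hypothetical, and the per-week cap of two counted contributions is an assumption (the policy asks for two per week but does not say how extra posts or in-lecture questions are tallied).

```python
def participation_points(weekly_contributions, per_week_cap=2, max_points=22):
    """Sum counted contributions across weeks.

    Assumption: at most `per_week_cap` contributions count in any one week,
    and the total cannot exceed 11 weeks x 2 = 22 points.
    """
    earned = sum(min(count, per_week_cap) for count in weekly_contributions)
    return min(earned, max_points)


def participation_percent(points, graded_out_of=20, weight_percent=20.0):
    """Convert points into percentage points of the course grade.

    Graded out of 20 with a 20% weight, so 22 points yields 22%,
    which is the 20% component plus the 2% bonus.
    """
    return points * weight_percent / graded_out_of


# Two contributions in each of the 11 weeks.
full_term = participation_points([2] * 11)
print(full_term, participation_percent(full_term))

# Missing one week entirely still reaches the full 20%.
one_week_missed = participation_points([2] * 10 + [0])
print(one_week_missed, participation_percent(one_week_missed))
```

Under these assumptions, a student can miss one week of posts and still earn the full 20% participation component.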

Paper Readings (10%):
These readings start generic and become specific. You are building the related work section of your project. We will begin by offering readings to choose from and you will eventually be finding your own readings and answering the quiz questions to link them to the lecture material and to your project. Paper summaries are due the following Tuesday night (1 week after being assigned).

Submission Policies:

Tasks & Datasets

The course is primarily centered on a project. Below, the TAs have listed several tasks and datasets they are most interested in, spanning audio, embodiment, vision, and more. If your team has a good reason to work on something else, please reach out so we can discuss it and put together a proposal.

Soham
  • Ego4D: world's largest egocentric (first person) video ML dataset and benchmark suite
  • SoundSpaces: Audio-Visual Navigation in 3D Environments
  • OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation
  • StressID: a Multimodal Dataset for Stress Identification (note delays in acquiring data)
Mehul
  • WebArena: A Realistic Web Environment for Building Autonomous Agents
  • WebQA: Multimodal multihop reasoning
  • Room across Room: a multilingual dataset for Vision-and-Language Navigation
Yingshan
  • ShapeStacks
  • SugarCrepe: A benchmark for faithful vision-language compositionality evaluation
  • LayoutBench: Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
Vidhi
  • RT-X (components)
  • MUTEX: Learning Unified Policies from Multimodal Task Specifications
  • LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
Piyush
  • OK-VQA: Outside Knowledge Visual Question Answering
  • Whoops: A Vision-and-Language Benchmark of Synthetic and Compositional Images
  • MMCoQA: Conversational Question Answering over Text, Tables, and Images
Vanya
  • Visual Storytelling
  • VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

Physical hardware / robots / sensors ...
What about physical hardware, robots, or tasks rather than datasets? Let's talk.

Compute
Limited AWS compute credits will be made available to each student, so please consider both your interests and available compute resources when deciding on a dataset/project.


Tuesday Thursday
Jan 16: Course Structure
  • Research and technical challenges
  • Syllabus and requirements
Jan 18: Multimodal applications and datasets
  • Research tasks and datasets
  • Team projects
Overview Readings: Or Tasks and Data Readings:
  • Any of the papers linked in Tasks & Datasets or relevant to your idea for a project
Jan 23: Unimodal representations (Vision + Audio)
Jan 25: Unimodal representations (Language)
Jan 30: Fusion
Feb 01: Fission
Feb 06: Alignment + Grounding
Feb 08: Aligned Representations
  • Report 1
Feb 13: Transformers Part 1
Feb 15: Transformers Part 2
Feb 20: Reasoning
Feb 22: Project Hours
Feb 27: Project Hours
Feb 29: Quantification and Bias
  • Report 2
Mar 05: Spring Break!
Mar 07: Spring Break!
Mar 12: Embodiment
Mar 14: Embodiment (cont)
Mar 19: Generation + Translation
Mar 21: Generation
Mar 26: Transference
Mar 28: Transference
  • Report 3
Apr 02: Project Hours
Apr 04: Project Hours
Apr 09: New research directions
  • Recent publications
Apr 11: Carnival (no class)
Apr 16: Raymond Mooney
Apr 18: Guest Lecture
Apr 23: Final Presentations
Apr 25: Final Presentations