11-777 MultiModal Machine Learning

Fall 2024 Previous Projects

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these include text, audio, images/videos and action taking.

Time & Place: 9:30am - 11:00am on Tu/Th (Margaret Morrison A14)
Canvas: Lectures and additional details (coming soon)
Course questions and discussion (Piazza): https://piazza.com/cmu/fall2024/11777
Project Report Template (GitHub): https://github.com/cmu-mmml/11-777-template

Instructor

Daniel Fried

dfried@cs.cmu.edu

Instructor

Yonatan Bisk

ybisk@cs.cmu.edu

Head TA

Piyush Khanna

OHs TBD
(location TBD)
piyushkh@andrew

TA

Li-Wei Chen

OHs TBD
(location TBD)
liweiche@andrew

TA

Madhura Deshpande

OHs TBD
(location TBD)
mvdeshpa@cs

TA

Haoyang He

OHs TBD
(location TBD)
hhe2@andrew

TA

Lawrence Jang

OHs TBD
(location TBD)
ljang@andrew

TA

Siddhant Waghjale

OHs TBD
(location TBD)
swaghjal@andrew

Piazza and Canvas

All course communication will happen via Piazza and Canvas. All videos will be posted to Canvas for offline viewing though aspects of the class/teaching will be interactive and missing from the zoom sessions. Questions will be taken in person -- not on chat

Piazza: https://piazza.com/cmu/fall2024/11777

Assignments Timeline and Grading

The course is primarily project based, but there will be readings throughout the course and participation points.

Project Timeline and Assignments (70% total): (see links for more details)

Sep 13		Submit group members and name
Sep 19	R1	Dataset Proposal and Analysis	(10%)
Oct 8 / 10		Midterm Presentation	(10%)
Oct 10	R2	Baselines and Model Proposal	(10%)
Nov 7	R3	Analysis of Baselines	(10%)
Dec 3 / 5		Final Presentation	(10%)
Dec 11	R4	Completed Report	(20%)

Discussion Participation (20%, with up to 2% bonus):
We will have discussion posts for class sessions on canvas. There are two types of sessions that we will ask for discussion in:

Lectures. We will have a canvas discussion thread where you should ask a question, or give an answer. Questions should be thoughtful, and have citations (e.g. to papers, points in the lecture slides, or other resources). Please try to ask questions that you and others might actually want the answers to! We also will give credit for asking questions in lectures, but you must post a note in the discussion thread documenting your participation to canvas for TAs to accumulate after class.
Project presentations. We will pair you with a group, and ask you to give peer feedback (via email or canvas, TBD) on their project – making suggestions on the work they are doing, as given in the presentation.

There will be 23 course sessions that we’ll grade participation for, but we will grade out of 20 points allowing for 3pts of extra credit.

Paper Readings (10%):
These readings start generic and become specific. You are building the related work section of your project. We will begin by offering readings to choose from and you will eventually be finding your own readings. You will submit annotations on the paper with your questions and notes, tied to particular parts of the paper. You can either do this by uploading a pdf with your notes (handwritten, or using a pdf comment function); or if necessary by uploading a text file. If you use a text file, you *must* have notes that refer to specific parts of the paper, e.g. Equation 2, Section 2.3, Conclusion. Paper summaries are due the following Monday night (1 week after being assigned).

Submission Policies:

All deadlines are midnight EST (determined by Canvas submission)
Project reports are graded as a group (single PDF submission), while all other grades are individual.
Late days: Every team has a budget of 6 late days. They will be automatically calculated, after which 2% absolute is removed from max grade. You can use up to 3 of these late days on the Final Report. (updated 11/1)

Tasks & Datasets

The course is primarily centered on a project. We will have a list of seed tasks and datasets below (coming soon!).

Piyush	Visual question answering, uncertainty estimation and calibration in multimodal models, multimodal reasoning using code generation. OK-VQA: Outside Knowledge Visual Question Answering A-OKVQA: Augmented OK-VQA V* Bench: Visual search question answering MMCoQA Conversational Question Answering over Text, Tables, and Images xGQA Cross-lingual Visual Question Answering
Li-Wei	Speech processing and generation Video + Audio + Text Emotion Recognition IEMOCAP: Interactive Emotional Dyadic Motion Capture RAVDESS: Emotional Speech and Song Multimodal ASR Localized Narratives: Images with speech, text, and mouse traces How-2 instructional videos SoundSpaces: Audio-Visual Navigation in 3D Environments
Madhura	Vision-language models, image/text modalities SugarCrepe: compositional ability evaluation MagicBrush and Parts2Whole: controlled/conditional image generation and editing
Haoyang	Embodiment, Language+Vision Room-to-Room and Room-across-Room (potentially resource intensive) ALFRED and TEACh (also potentially resource intensive, require a simulator)
Lawrence	OS/Web Agents WebArena: A Realistic Web Environment for Building Autonomous Agents VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks VisualWebBench: Web page understanding and grounding
Siddhant	Visual Common Sense Reasoning, Multimodal Question Answering Memecap: A Dataset for Captioning and Interpreting Memes Sherlock: A dataset for visual abductive reasoning SPIQA: A dataset for multimodal question answering on scientific papers

Physical hardware / robots / sensors ...

What about physical hardware? robots? tasks not datasets? Let's talk.

Compute Limited AWS compute credits will be made available to each student, so please consider both your interests and available compute resources when deciding on a dataset/project.

Lectures

(tentative)

Tuesday	Thursday
Aug 27: Course Structure Research and technical challenges Syllabus and requirements	Aug 29: Multimodal applications and datasets Research tasks and datasets Team projects
Overview Readings: Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions Section 2 in particular Representation Learning: A Review and New Perspectives Sections 1-3, 6-8, 11 Experience Grounds Language Pragmatics in Language Grounding Or Tasks and Data Readings: Any of the papers linked in Tasks & Datasets or relevant to your idea for a project
Sep 3: Unimodal representations (Language)	Sep 5: Unimodal representations (Vision + Audio)
Readings: Visualizing and Understanding Recurrent Networks Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context Visualizing and Understanding Convolutional Networks Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Sep 10: Fusion	Sep 12: Fission
Readings: Every Picture Tells a Story: Generating Sentences from Images. Detecting Visual Text From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions Linearly Mapping from Image to Text Space
Sep 17: Alignment + Grounding	Sep 19: Aligned Representations Report 1

Sep 24: Project Hours	Sep 26: Project Hours

Oct 1: Transformers Part 1	Oct 3: Transformers Part 2

Oct 8: Midterm Presentations	Oct 10: Midterm Presentations Report 2

Oct 15: Fall Break!	Oct 17: Fall Break!

Oct 22: Generation	Oct 24: Generation (cont.)

Oct 29: Embodiment	Oct 31: Embodiment (cont)
Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight PixL2R: Guiding Reinforcement Learning Using Natural Language by Mapping Pixels to Rewards Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Nov 5: Election Day (university holiday)	Nov 7: Agents Guest Lecture: JY Koh Report 3

Nov 12: Reasoning	Nov 14: Guest Lecture: So Yeon Tiffany Min

Nov 19: Guest Lecture: Lili Yu	Nov 21: Quantification + Bias

Nov 26: Interacting with (V)LMs	Nov 28: Thanksgiving (no class)

Dec 3: Final Presentations	Dec 5: Final Presentations

11-777: MultiModal Machine Learning

Fall 2024 Previous Projects

Instructor

Instructor

Head TA

TA

TA

TA

TA

TA

Piazza and Canvas

Assignments Timeline and Grading

Tasks & Datasets

Physical hardware / robots / sensors ...

Lectures