11-777 MultiModal Machine Learning

Spring 2024 Previous Projects

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these include text, audio, images/videos and action taking.

Time & Place: 9:30am - 11:00am on Tu/Th (Porter Hall 100)
Canvas: Lectures and additional details (coming soon)
Course questions and discussion (Piazza): https://piazza.com/cmu/spring2023/11777
Project Report Template (GitHub): https://github.com/cmu-mmml/11-777-template

Instructor

Yonatan Bisk

ybisk@cs.cmu.edu

Head TA

Soham Dinesh Tiwari

Wed 9:30-10:30am
(Zoom: Piazza pin)
sohamdit@andrew

TA

Mehul Agarwal

Tues 4-5pm
(Wean 3110)
mehula@andrew

TA

Yingshan Chang

Mon 10-11am
(Weah 3110)
yingshac@andrew

TA

Vidhi Jain

Wed 2:30-3:30pm
(Wean 3110)
vidhij@andrew

TA

Piyush Khanna

Mon 4-5pm
(Wean 3110)
piyushkh@andrew

TA

Vanya Bannihatti Kumar

Mon 2:15-3:15pm
(GHC 5417)
vbanniha@andrew

Piazza and Canvas

All course communication will happen via Piazza and Canvas. All videos will be posted to Canvas for offline viewing though aspects of the class/teaching will be interactive and missing from the zoom sessions. Questions will be taken in person -- not on chat

Piazza: https://piazza.com/cmu/spring2024/11777

Assignments Timeline and Grading

The course is primarily project based, but there will be readings throughout the course and participation points.

Project Timeline and Assignments (70% total): (see links for more details)

Feb 2		Submit group members and name
Feb 08	R1	Dataset Proposal and Analysis	(10%)
Mar 1	R2	Baselines and Model Proposal	(15%)
Mar 28	R3	Analysis of Baselines	(15%)
Apr 23/25		Presentation	(10%)
Apr 26	R4	Completed Report	(20%)

Discussion Participation (20%, with up to 2% bonus):
We will have discussion posts for papers and lectures on Piazza. We ask you to make two thoughtful contributions per week (either asking questions or posting answers) on these posts. 11 weeks x 2 contributions = 22pts, but we will grade out of 20 points allowing for 2pts of extra credit. We also will give credit for asking questions in lectures, but you must post a note documenting your participation to piazza for TAs to accumulate after class..

Paper Readings (10%):
These readings start generic and become specific. You are building the related work section of your project. We will begin by offering readings to choose from and you will eventually be finding your own readings and answering the quiz questions to link them to the lecture material and to your project. Paper summaries are due the following Tuesday night (1 week after being assigned).

Submission Policies:

All deadlines are midnight EST (determined by Canvas submission)
Project reports are graded as a group (single PDF submission), while all other grades are individual.
Late days: Every team has a budget of 6 late days. They will be automatically calculated, after which 2% absolute is removed from max grade.

Tasks & Datasets

The course is primarily centered on a project. Below, the TAs have listed several they are most interested in across audio, embodiment, vision, etc. If your team has a good reason to work on something else, please reach out so we can discuss it and put together a proposal.

Soham	Ego4D world's largest egocentric (first person) video ML dataset and benchmark suite SoundSpaces Audio-Visual Navigation in 3D Environments OmniObject3D Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation StressID a Multimodal Dataset for Stress Identification (note delays in acquiring data)
Mehul	WebArena A Realistic Web Environment for Building Autonomous Agents WebQA Multimodal multihop reasoning Room across Room a multilingual dataset for Vision-and-Language Navigation
Yingshan	ShapeStacks SugarCrepe A benchmark for faithful vision-language compositionality evaluation LayoutBench Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
Vidhi	RT-X (components) MUTEX Learning Unified Policies from Multimodal Task Specifications LIBERO Benchmarking Knowledge Transfer for Lifelong Robot Learning
Piyush	OK-VQA Outside Knowledge Visual Question Answering Whoops A Vision-and-Language Benchmark of Synthetic and Compositional Images MMCoQA Conversational Question Answering over Text, Tables, and Images
Vanya	Visual Storytelling VATEX A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research.

Physical hardware / robots / sensors ...

What about physical hardware? robots? tasks not datasets? Let's talk.

Compute Limited AWS compute credits will be made available to each student, so please consider both your interests and available compute resources when deciding on a dataset/project.

Lectures

(tentative)

Tuesday	Thursday
Jan 16: Course Structure Research and technical challenges Syllabus and requirements	Jan 18: Multimodal applications and datasets Research tasks and datasets Team projects
Overview Readings: Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions Section 2 in particular Representation Learning: A Review and New Perspectives Sections 1-3, 6-8, 11 Experience Grounds Language Or Tasks and Data Readings: Any of the papers linked in Tasks & Datasets or relevant to your idea for a project
Jan 23: Unimodal representations (Vision + Audio)	Jan 25: Unimodal representations (Language)
Readings: Visualizing and Understanding Convolutional Networks Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Visualizing and Understanding Recurrent Networks Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context
Jan 30: Fusion	Feb 01: Fission
Readings: Every Picture Tells a Story: Generating Sentences from Images. Detecting Visual Text From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions Linearly Mapping from Image to Text Space
Feb 06: Alignment + Grounding	Feb 08: Aligned Representations Report 1

Feb 13: Transformers Part 1	Feb 15: Transformers Part 2

Feb 20: Reasoning	Feb 22: Project Hours

Feb 27: Project Hours	Feb 29: Quantification and Bias Report 2

Mar 05: Spring Break!	Mar 07: Spring Break!

Mar 12: Embodiment	Mar 14: Embodiment (cont)
Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight PixL2R: Guiding Reinforcement Learning Using Natural Language by Mapping Pixels to Rewards Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
Mar 19: Generation + Translation	Mar 21: Generation

Mar 26: Transference	Mar 28: Transference Report 3

Apr 02: Project Hours	Apr 04: Project Hours

Apr 09: New research directions Recent publications	Apr 11: Carnival (no class)

Apr 16: Raymond Mooney	Apr 18: Daniel Fried

Apr 23: Final Presentations	Apr 25: Final Presentations

11-777: MultiModal Machine Learning