Spring 2024
This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these include text, audio, images/videos and action taking.
- Time & Place: 9:30am - 11:00am on Tu/Th (Porter Hall 100)
- Canvas: Lectures and additional details (coming soon)
- Course questions and discussion (Piazza): https://piazza.com/cmu/spring2024/11777
- Project Report Template (GitHub): https://github.com/cmu-mmml/11-777-template
Piazza and Canvas
All course communication will happen via Piazza and Canvas. All videos will be posted to Canvas for offline viewing, though aspects of the class are interactive and will not be captured in the Zoom sessions. Questions will be taken in person, not over chat.
Piazza: https://piazza.com/cmu/spring2024/11777
Assignments Timeline and Grading
The course is primarily project based, but there will be readings throughout the course and participation points.

Project Timeline and Assignments (70% total): (see links for more details)
Date | Assignment | Weight
---|---|---
Feb 02 | Submit group members and name |
Feb 08 | R1: Dataset Proposal and Analysis | 10%
Mar 01 | R2: Baselines and Model Proposal | 15%
Mar 28 | R3: Analysis of Baselines | 15%
Apr 23/25 | Presentation | 10%
Apr 26 | R4: Completed Report | 20%
Discussion Participation (20%, with up to 2% bonus):
We will have discussion posts for papers and lectures on Piazza. We ask you to make two thoughtful contributions per week (either asking questions or posting answers) on these posts. 11 weeks x 2 contributions = 22 points, but we will grade out of 20 points, allowing for 2 points of extra credit. We also give credit for asking questions in lecture, but you must post a note documenting your participation to Piazza so the TAs can tally it after class.
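To make the arithmetic above concrete, here is a minimal sketch in Python (hypothetical names, not the official grading script) of how the 22 attainable points map onto the 20% participation component:

```python
# Minimal sketch of the participation arithmetic described above.
# Assumption: each thoughtful contribution is worth one point, and the
# component is graded out of 20 even though 22 points are attainable.

WEEKS = 11
CONTRIBUTIONS_PER_WEEK = 2
MAX_POINTS = WEEKS * CONTRIBUTIONS_PER_WEEK  # 22 attainable points
GRADED_OUT_OF = 20                           # component is worth 20% of the course grade

def participation_percent(contributions: int) -> float:
    """Course-grade percentage earned from discussion participation."""
    points = min(contributions, MAX_POINTS)    # cap raw points at 22
    return points * (20.0 / GRADED_OUT_OF)     # 20 pts -> 20%, 22 pts -> 22%

print(participation_percent(22))  # 22.0 (the full 20% plus the 2% bonus)
print(participation_percent(15))  # 15.0
```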
Paper Readings (10%):
The readings start generic and become more specific over the semester; you are effectively building the related-work section of your project. We will begin by offering readings to choose from, and you will eventually find your own readings and answer the quiz questions that link them to the lecture material and to your project. Paper summaries are due the following Tuesday night (one week after being assigned).
Submission Policies:
- All deadlines are midnight EST (determined by Canvas submission)
- Project reports are graded as a group (single PDF submission), while all other grades are individual.
- Late days: Every team has a budget of 6 late days. Late days are calculated automatically; once the budget is exhausted, 2% (absolute) is removed from the maximum achievable grade.
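As an illustration of the late-day policy, here is a short sketch assuming the 2% absolute deduction applies per late day beyond the six-day team budget (the official calculation is done by the course staff, and the names here are illustrative):

```python
# Illustrative sketch only, not the official policy implementation.
# Assumption: each late day beyond the 6-day team budget removes 2 percentage
# points (absolute) from the maximum achievable grade.

LATE_DAY_BUDGET = 6
PENALTY_PER_EXTRA_DAY = 2.0  # percentage points

def max_grade_after_lateness(total_late_days: int) -> float:
    """Maximum achievable course grade (in %) after late-day penalties."""
    extra_days = max(0, total_late_days - LATE_DAY_BUDGET)
    return max(0.0, 100.0 - PENALTY_PER_EXTRA_DAY * extra_days)

print(max_grade_after_lateness(6))  # 100.0 -- within budget
print(max_grade_after_lateness(8))  # 96.0  -- two days over budget
```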
Tasks & Datasets
The course is primarily centered on a project. Below, the TAs have listed several tasks and datasets they are most interested in across audio, embodiment, vision, etc. If your team has a good reason to work on something else, please reach out so we can discuss it and put together a proposal.
TA | Suggested tasks & datasets
---|---
Soham | Ego4D: world's largest egocentric (first-person) video ML dataset and benchmark suite; SoundSpaces: Audio-Visual Navigation in 3D Environments; OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation; StressID: a Multimodal Dataset for Stress Identification (note delays in acquiring data)
Mehul | WebArena: A Realistic Web Environment for Building Autonomous Agents; WebQA: Multimodal Multihop Reasoning; Room Across Room: a Multilingual Dataset for Vision-and-Language Navigation
Yingshan | ShapeStacks; SugarCrepe: A Benchmark for Faithful Vision-Language Compositionality Evaluation; LayoutBench: Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
Vidhi | RT-X (components); MUTEX: Learning Unified Policies from Multimodal Task Specifications; LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
Piyush | OK-VQA: Outside Knowledge Visual Question Answering; Whoops: A Vision-and-Language Benchmark of Synthetic and Compositional Images; MMCoQA: Conversational Question Answering over Text, Tables, and Images
Vanya | Visual Storytelling; VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Physical hardware / robots / sensors | What about physical hardware? Robots? Tasks rather than datasets? Let's talk.
Compute
Limited AWS compute credits will be made available to each student, so please consider both your interests and available compute resources when deciding on a dataset/project.
Lectures
(tentative)

Tuesday | Thursday
---|---
Jan 16: Course Structure | Jan 18: Multimodal applications and datasets
Jan 23: Unimodal representations (Vision + Audio) | Jan 25: Unimodal representations (Language)
Jan 30: Fusion | Feb 01: Fission
Feb 06: Alignment + Grounding | Feb 08: Aligned Representations
Feb 13: Transformers Part 1 | Feb 15: Transformers Part 2
Feb 20: Reasoning | Feb 22: Project Hours
Feb 27: Project Hours | Feb 29: Quantification and Bias
Mar 05: Spring Break! | Mar 07: Spring Break!
Mar 12: Embodiment | Mar 14: Embodiment (cont.)
Mar 19: Generation + Translation | Mar 21: Generation
Mar 26: Transference | Mar 28: Transference
Apr 02: Project Hours | Apr 04: Project Hours
Apr 09: New research directions | Apr 11: Carnival (no class)
Apr 16: Raymond Mooney | Apr 18: Daniel Fried
Apr 23: Final Presentations | Apr 25: Final Presentations