11-777 MultiModal Machine Learning

Spring 2023 Previous Projects

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. The modalities covered include text, audio, images/videos, and action taking.

Time & Place: 11:00am - 12:20pm on Tu/Th (Doherty Hall 2210)
Canvas: Lectures and additional details (coming soon)
Course questions and discussion (Piazza): https://piazza.com/cmu/spring2023/11777
Project Report Template (GitHub): https://github.com/cmu-mmml/11-777-template

Instructor

Yonatan Bisk

ybisk@cs.cmu.edu

Instructor

Daniel Fried

dfried@cs.cmu.edu

TA

Yingshan Chang

yingshac@andrew

TA

Catherine Cheng

yuncheng@cs

TA

Durvesh Malpure

durvesh@cmu

TA

Soham Tiwari

sohamdit@andrew

TA

Aditya Veerubhotla

adityasv@cs

TA

Kenneth Zheng

kzheng2@andrew

Piazza and Canvas

All course communication will happen via Piazza and Canvas. All videos will be posted to Canvas for offline viewing, though some aspects of the class/teaching will be interactive and therefore missing from the Zoom sessions. Questions will be taken in person, not on chat.

Piazza: https://piazza.com/cmu/spring2023/11777

Assignments Timeline and Grading

The course is primarily project-based, but there will be readings throughout the course, which are graded only via participation.

Project Timeline and Assignments (70% total): (see links for more details)

Jan 30		Submit group members (see Piazza)
Feb 14	R1	Dataset Proposal and Analysis	(10%)
Mar 03	R2	Related Work and Model Proposal	(15%)
Mar 31	R3	Baseline Analysis	(15%)
Apr 25/27		Presentation	(10%)
May 1	R4	Completed Report	(20%)
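
As a quick sanity check on the weights in the timeline above, the following Python sketch sums the graded project components. The dictionary name and its keys are illustrative labels copied from the timeline, not part of any course tooling.

```python
# Project component weights in percent, copied from the timeline above.
# The dictionary name is illustrative, not part of any course tooling.
PROJECT_WEIGHTS = {
    "R1 Dataset Proposal and Analysis": 10,
    "R2 Related Work and Model Proposal": 15,
    "R3 Baseline Analysis": 15,
    "Presentation": 10,
    "R4 Completed Report": 20,
}


def project_total(weights: dict[str, int]) -> int:
    """Return the summed weight of all project components, in percent."""
    return sum(weights.values())


if __name__ == "__main__":
    print(f"Project total: {project_total(PROJECT_WEIGHTS)}%")
```

The five graded items add up to the 70% stated for the project; the Jan 30 group submission carries no weight.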

Discussion Participation (20%, with up to 2% bonus):
We will have discussion posts for papers and lectures on Piazza. We ask you to make two thoughtful contributions per week (either asking questions or posting answers) on these posts. 11 weeks x 2 contributions = 22 pts, but we will grade out of 20 points, allowing for 2 pts of extra credit. We will also give credit for asking questions in lectures.
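
The participation arithmetic above can be sketched in Python as follows. This is only an illustration under stated assumptions: one point per contribution, at most two credited contributions per week, and no modeling of the extra credit for lecture questions. The function and constant names are hypothetical.

```python
WEEKS = 11                  # discussion weeks, from the text above
CONTRIBUTIONS_PER_WEEK = 2  # contributions requested per week
GRADED_OUT_OF = 20          # points that count as full marks


def participation_points(weekly_contributions: list[int]) -> int:
    """Count credited contributions, capping each week at two.

    Assumption: one point per contribution, and contributions beyond two
    in a week earn nothing extra. Lecture-question credit is not modeled.
    """
    if len(weekly_contributions) > WEEKS:
        raise ValueError("more entries than discussion weeks")
    return sum(
        min(max(count, 0), CONTRIBUTIONS_PER_WEEK)
        for count in weekly_contributions
    )


def participation_percent(points: int) -> float:
    """Convert points to percent of the course grade (20 points = 20%)."""
    return 20.0 * points / GRADED_OUT_OF


if __name__ == "__main__":
    points = participation_points([2] * WEEKS)
    print(points, participation_percent(points))
```

Under these assumptions, two contributions in every week gives 22 points, which is the 20% component plus the 2% bonus.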

Paper Summaries (10%):
Writing a three-sentence summary of the paper you read earns you 1 pt. The summary is submitted in three text boxes: A. state the goal of the paper, B. explain the key insight, C. state a key limitation or important extension. There will be 11 opportunities, so one bonus point can be earned (up to 11%). Paper summaries are due the following Tuesday night (1 week after being assigned).
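
Putting the three graded components together, a minimal Python sketch of the overall total might look like this. It assumes that bonus points add directly to the total and that the total is not capped at 100, neither of which the syllabus states explicitly. The function name and argument names are hypothetical.

```python
def course_total(project: float, participation_pts: int, summaries: int) -> float:
    """Sum the three components, each in percent of the course grade.

    project: percent earned on the project component, 0 to 70.
    participation_pts: discussion points, 0 to 22 (graded out of 20).
    summaries: number of credited paper summaries, 0 to 11 (graded out of 10).
    Assumption: bonus points add directly and the total is not capped at 100.
    """
    if not 0 <= project <= 70:
        raise ValueError("project must be between 0 and 70")
    if not 0 <= participation_pts <= 22:
        raise ValueError("participation_pts must be between 0 and 22")
    if not 0 <= summaries <= 11:
        raise ValueError("summaries must be between 0 and 11")
    return project + participation_pts + summaries


if __name__ == "__main__":
    print(course_total(70, 20, 10))  # full marks without any bonus
    print(course_total(70, 22, 11))  # full marks with both bonuses
```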

Submission Policies:

All deadlines are midnight EST (determined by Canvas submission)
Project reports are graded as a group (single PDF submission), while all other grades are individual.
Late days: Every team has a budget of 6 late days. Late days are calculated automatically; once the budget is used up, 2% absolute is removed from the max grade.
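
The late-day budget can be sketched in Python as below. The policy text does not say whether the 2% applies per additional late day or once, so the per-day reading used here is an assumption, exposed as a parameter. All names are hypothetical.

```python
LATE_DAY_BUDGET = 6  # late days available to each team, from the policy above


def days_over_budget(late_days_used: list[int]) -> int:
    """Return how many late days exceed the team's budget across all reports."""
    return max(0, sum(late_days_used) - LATE_DAY_BUDGET)


def max_grade_after_penalty(
    late_days_used: list[int], penalty_per_day: float = 2.0
) -> float:
    """Return the maximum attainable grade in percent.

    Assumption: the 2% absolute penalty applies to each late day beyond
    the budget. The syllabus does not spell this out.
    """
    penalty = penalty_per_day * days_over_budget(late_days_used)
    return max(0.0, 100.0 - penalty)


if __name__ == "__main__":
    # Hypothetical team using 3, 2, and 2 late days on three reports.
    print(days_over_budget([3, 2, 2]), max_grade_after_penalty([3, 2, 2]))
```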

Tasks & Datasets

The course will be primarily centered on a few datasets/tasks to facilitate cross-team collaboration and technical assistance. If your team has a good reason to work on something else, please reach out so we can discuss it and put together a proposal.

Simulator Based

Room-Across-Room	Code	Multilingual Embodied Navigation
ALFRED	Code	Embodied instruction following with interaction

Question Answering & Captioning

TextVQA	Code	Text in images (referring expressions and reading)
WebQA	Code	Multihop Visual QA
VizWiz	VQA and Captioning	Visual models for blind users
NLVR2	Code Proj page	Complex reasoning about pairs of images

Multi-turn QA

CompGuessWhat?!		Visual Guessing Game and Attribute Prediction

Audio

Spoken Image Captions		A series of audio corpora and corresponding images for connecting audio directly to image regions.

Video

TVQA		Video Question Answering Dataset
VATEX		Multilingual Video Captioning and Translation

Physical hardware / robots / sensors ...

What about physical hardware? robots? tasks not datasets? Let's talk.

Compute: Limited AWS compute credits will be made available to each student, so please consider both your interests and available compute resources when deciding on a dataset/project.

Lectures

(tentative)

Tuesday	Thursday
Jan 17: Course Structure; Research and technical challenges; Syllabus and requirements	Jan 19: Multimodal applications and datasets; Research tasks and datasets; Team projects
Overview Readings: Multimodal Machine Learning: A Survey and Taxonomy (Sections 1-4); Representation Learning: A Review and New Perspectives (Sections 1-3, 6-8, 11); Experience Grounds Language	Tasks and Data Readings: Any of the papers linked in Tasks & Datasets
Jan 24: Unimodal representations (Vision); CNNs; Residuals and Skip connections	Jan 26: Unimodal representations (Language); Gating and LSTMs; Transformers; Form groups!
Readings: Visualizing and Understanding Convolutional Networks; Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization; Visualizing and Understanding Recurrent Networks; Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context
Jan 31: Representation	Feb 02: Representation
Readings: Every Picture Tells a Story: Generating Sentences from Images; Detecting Visual Text; From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions; Linearly Mapping from Image to Text Space
Feb 07: Alignment + Grounding	Feb 09: Aligned and Attended
Readings: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention; Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection; BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Attention is All You Need (with tutorial code)
Feb 14: Multimodal Transformers; Report 1 (Tues Midnight)	Feb 16: Multimodal Transformers II
Readings: Multimodal Transformer for Unaligned Multimodal Language Sequences; Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks; Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs; ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Feb 21: Embodiment	Feb 23: RL, Logic, and Causality
Readings: Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight; PixL2R: Guiding Reinforcement Learning Using Natural Language by Mapping Pixels to Rewards; Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation; ProgPrompt: Generating Situated Robot Task Plans using Large Language Models
Feb 28: Project Hours	Mar 02: Project Hours; Report 2 (Fri Midnight)

Mar 07: Spring Break!	Mar 09: Spring Break!

Mar 14: Embodiment (cont)	Mar 16: Quantification and Bias
Readings: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances; CerealBar: Situated, Collaborative Natural Language Understanding; TEACh: Task-driven Embodied Agents that Chat; Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
Mar 21: Generation + Translation	Mar 23: Generation
Readings: Baby Talk: Understanding and Generating Simple Image Descriptions; Encoder-Agnostic Adaptation for Conditional Language Generation; Zero-Shot Text-to-Image Generation; Multimodal Few-Shot Learning with Frozen Language Models; Hierarchical Text-Conditional Image Generation with CLIP Latents
Mar 28: Transference	Mar 30: Guest Lecture by Maarten Sap on Social Bias; Report 3 (Fri Midnight)
Readings: DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Models; Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints; Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality; Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
Apr 04: Project Hours	Apr 06: Project Hours

Apr 11: Guest Lecture by Aida Nematzadeh	Apr 13: Carnival (no class)
Main papers: Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers; Probing Image-Language Transformers for Verb Understanding; Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization
Additional papers: Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers; Learning Transferable Visual Models From Natural Language Supervision; SimVLM: Simple Visual Language Model Pretraining with Weak Supervision; VALSE
Apr 18: New research directions; Recent publications	Apr 20: Guest Lecture by Alane Suhr
Readings: Neural Module Networks; Visually Grounded Reasoning across Languages and Cultures; The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue; A Corpus of Natural Language for Visual Reasoning; Executing Instructions in Situated Collaborative Interactions; Continual Learning for Instruction Following from Realtime Feedback
Apr 25: Final Presentations	Apr 27: Final Presentations
