11-777 MultiModal Machine Learning

Spring 2021 Previous Projects

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. The modalities covered include text, audio, images/video, and action taking.

Time & Place: 10:40am - 12:00pm on Zoom Tu/Th
Canvas: Lectures and additional details
Course questions and discussion: Slack
GitHub Template: https://github.com/ybisk/11-777-template

Instructor

Yonatan Bisk

ybisk@cs.cmu

Teaching Assistant

Torsten Wörtwein

twoertwe@cs.cmu

Teaching Assistant

Jielin Qiu

jielinq@andrew

Slack and Canvas

All course communication will happen via Slack and Canvas. All videos will be posted to Canvas for offline viewing, though aspects of the class and teaching will be interactive in the Zoom sessions.

Slack

#general-questions: For questions about lectures, the course, or help from others on class projects
#group-N-X: Each group should come up with a name and create their own private channel (invite the TAs and instructor). Use the same name for your GitHub fork and pin the link to the channel. Please also invite us to the GitHub repository. Example: #group-fun-vizwiz
#dataset-XYZ: Each core dataset will also have its own Slack channel that anyone can join (across groups) to ask for help on setup, preprocessing, and other issues that might arise.
Private Messages: If there is a question you would like to address to the instructors, please create a 4-person PM on Slack. Please check #general-questions first and post there when possible.

Assignments Timeline and Grading

The course is primarily project-based, but there will be readings throughout, which are graded only via participation.

Project Timeline and Assignments:

Feb 18		Group Formed and Dataset Chosen
Mar 04	R1	Task Definition and Data Analysis	(10%)
Mar 11	R2	Related Work and Background	(10%)
Mar 18	R3	Baselines, Metrics, and Empty Results Table	(10%)
Apr 01	R4	Analysis of Baselines	(10%)
Apr 22	R5	Proposed Approach	(10%)
May 11		Presentation	(10%)
May 13	R6	Completed Report	(20%)

Participation:
Participation in Class or Slack (20%)
Participation is evaluated as "actively asking/answering questions based on the lectures, readings, and/or assisting other teams with project issues". Concretely, this means that every novel question or helpful answer provided in Slack will count for 1%, up to a total of 20% of your grade.
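
The weights above can be checked with a short sketch. It is illustrative only, not the course's actual grading script: the function names and the format of `scores` (the fraction earned per deliverable) are assumptions, while the weights and the 20-point participation cap are taken from the timeline and the participation rule.

```python
# Weights in percentage points, copied from the project timeline above.
DELIVERABLE_WEIGHTS = {
    "R1": 10, "R2": 10, "R3": 10, "R4": 10, "R5": 10,
    "Presentation": 10, "R6": 20,
}
PARTICIPATION_CAP = 20  # 1 point per novel question or helpful answer, capped at 20


def participation_points(num_contributions: int) -> int:
    """Each qualifying contribution counts for 1 point, up to the cap."""
    return min(max(num_contributions, 0), PARTICIPATION_CAP)


def course_grade(scores: dict, num_contributions: int) -> float:
    """Combine deliverable scores and participation into a grade out of 100.

    `scores` maps a deliverable name to the fraction earned in [0, 1];
    this input format is hypothetical. Missing deliverables count as 0.
    """
    project = sum(
        weight * scores.get(name, 0.0)
        for name, weight in DELIVERABLE_WEIGHTS.items()
    )
    return project + participation_points(num_contributions)
```

For example, `course_grade({"R1": 0.5}, 3)` gives 5 points for R1 plus 3 participation points.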

Submission Policies:

All deadlines are midnight EST (determined by GitHub commit time)
Late days: Every team has a budget of 6 late days. Late days are calculated automatically from the git commit time. Once the budget is used up, 2% absolute is removed from the maximum grade.
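
A rough sketch of how the late-day accounting might work is given below. It is not the course's actual script, and it rests on two assumptions the policy does not state: any partial day late is rounded up to a full late day, and the 2% deduction applies once per late day beyond the budget. All function names are hypothetical.

```python
import math
from datetime import datetime, timedelta

LATE_DAY_BUDGET = 6            # per team, from the policy above
PENALTY_PER_EXTRA_DAY = 2.0    # ASSUMPTION: 2 points per late day beyond the budget


def late_days(deadline: datetime, commit_time: datetime) -> int:
    """Whole late days for one submission.

    ASSUMPTION: any partial day past the deadline counts as a full day.
    """
    delta = commit_time - deadline
    if delta <= timedelta(0):
        return 0
    return math.ceil(delta / timedelta(days=1))


def max_grade(days_used_before: int, days_late_now: int, full: float = 100.0) -> float:
    """Maximum achievable grade (in percent) for the current submission."""
    total = days_used_before + days_late_now
    over_budget = max(0, total - LATE_DAY_BUDGET)
    # Only over-budget days caused by this submission are penalized here.
    penalized = min(over_budget, days_late_now)
    return max(0.0, full - PENALTY_PER_EXTRA_DAY * penalized)
```

Under these assumptions, a team that has already used 5 late days and commits 2 days late goes 1 day over budget, so `max_grade(5, 2)` returns 98.0.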

Datasets

The course will be primarily centered on a few datasets/tasks to facilitate cross-team collaboration and technical assistance. If your team has a good reason to work on something not listed here, please reach out so we can discuss it and put together a proposal.

Paper	GitHub	Domain
ALFRED	Code Challenge	Embodied instruction following with interaction
MEmoR	Code Ask TA for data	Emotion in conversational context on The Big Bang Theory
Natural Language for Visual Reasoning	Code v1 Code v2	Visual reasoning about pairs of images and language descriptions. For NLVR2, look at the Contrastive Sets fold.
Room-Across-Room	Code	Embodied instruction following (with views and multilingual)
Social-IQ	Code Proj page	Video Question Answering focused on social interactions
VizWiz Challenge	Challenge	Image Captioning and Question Answering for the blind and visually impaired

Limited AWS and Google Cloud compute credits will be made available to each group, so please consider both your interests and available compute resources when deciding on a dataset/project.

Lectures

Tuesday	Thursday
Feb 2: Course Structure Research and technical challenges Syllabus and requirements	Feb 4: Multimodal applications and datasets Research tasks and datasets Team projects
Feb 9: Basics: "Deep learning" Language, Vision, Audio Loss functions and neural networks	Feb 11: Basics: Optimization Gradients and backprop Practical deep learning optimization
Feb 16: Unimodal representations (Vision) CNNs Residuals and Skip connections	Feb 18: Unimodal representations (Language) Gating and LSTMs Transformers Groups Formed and Dataset Chosen
Feb 23 -- NO CLASS --	Feb 25: Project Hours (Reports 1&2)
Mar 2: Multimodal & Coordinated Representations Auto-encoders CCA Multi-view Clustering	Mar 4: Alignment and Attention Explicit - Dynamic Time Warping Implicit -- Attention R1: Task Definition and Data Analysis
Mar 9: Alignment + Representation Self-attention Multimodal Transformers	Mar 11: Project Hours (Report 3) R2: Related Work and Background
Mar 16: Project Hours (Report 3)	Mar 18: Alignment + Representation (Cont) Self-attention models Multimodal Transformers R3: Baselines, Metrics, and Empty Results
Mar 23: Alignment + Translation Module networks Tree-based & Stack models	Mar 25: Embodiment Action as a modality
Mar 30: Reinforcement Learning Markov Decision Process Q learning and policy gradients	Apr 1: Multimodal RL Deep Q learning Multimodal applications R4: Analysis of Baselines
Apr 6: Project Hours (Report 5)	Apr 8: Project Hours (Report 5)
Apr 13: Fusion and co-learning Multi-kernel learning and fusion Few shot learning and co-learning	Apr 15: -- NO CLASS --
Apr 20: New research directions Recent approaches in MMML	Apr 22: Affective Computing (Torsten Wörtwein) R5: Proposed Approach
Apr 27: Project Hours (Final)	Apr 29: Project Hours (Final)
May 4: Guest Lecture (Mark Yatskar - UPenn) Bias and Structure in Vision-and-Language	May 6: Guest Lecture (Malihe Alikhani - Pitt) Coherence and Grounding in Multimodal Communication
May 11: Project Presentations (live)	May 13: -- NO CLASS -- R6: Final Reports
