Spring 2021

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these include text, audio, images/videos and action taking.

Yonatan Bisk



Torsten Wörtwein
Teaching Assistant



Jielin Qiu
Teaching Assistant



Slack and Canvas

All course communication will happen via Slack and Canvas. All videos will be posted to Canvas for offline viewing, though parts of the class will be interactive in the Zoom sessions.


Assignments Timeline and Grading

The course is primarily project-based. Readings are assigned throughout the course and are graded only through participation.

Project Timeline and Assignments:
Feb 18 Group Formed and Dataset Chosen
Mar 04 R1 Task Definition and Data Analysis (10%)
Mar 11 R2 Related Work and Background (10%)
Mar 18 R3 Baselines, Metrics, and Empty Results Table (10%)
Apr 01 R4 Analysis of Baselines (10%)
Apr 22 R5 Proposed Approach (10%)
May 11 Presentation (10%)
May 13 R6 Completed Report (20%)

Participation in Class or Slack (20%)
Participation is evaluated as "actively asking/answering questions based on the lectures, readings, and/or assisting other teams with project issues". Concretely, this means that every novel question or helpful answer provided in Slack will count for 1%, up to a total of 20% of your grade.
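
As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each project component is scored as a fraction between 0 and 1 and weighted linearly, and that participation is simply a count of qualifying contributions; the function names and the scoring scheme are hypothetical and are not the official grading procedure.

```python
from typing import Mapping

# Weights (percent of the final grade) copied from the timeline above.
WEIGHTS = {
    "R1": 10, "R2": 10, "R3": 10, "R4": 10, "R5": 10,
    "Presentation": 10, "R6": 20,
}
PARTICIPATION_CAP = 20  # percent; 1% per qualifying contribution


def participation_percent(contributions: int) -> int:
    """Each novel question or helpful answer counts for 1%, capped at 20%."""
    return min(max(contributions, 0), PARTICIPATION_CAP)


def final_grade(scores: Mapping[str, float], contributions: int) -> float:
    """Combine project scores and participation into a percentage.

    `scores` maps a component name to the fraction earned in [0, 1];
    this linear weighting is an assumption made for illustration.
    """
    project = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    return project + participation_percent(contributions)
```

For example, under these assumptions a team with full marks on every report and the presentation, and 25 qualifying Slack contributions, would reach 100%, because participation stops counting after the twentieth contribution.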

Submission Policies:


The course will be primarily centered on a few datasets/tasks to facilitate cross-team collaboration and technical assistance. If your team has a good reason to work on something not listed here, please reach out so we can discuss it and put together a proposal.
Paper | GitHub | Domain
 |  | Embodied instruction following with interaction
MEmoR | Code, Ask TA for data | Emotion in conversational context on The Big Bang Theory
Natural Language for Visual Reasoning | Code v1, Code v2 | Visual reasoning about pairs of images and language descriptions. For NLVR2, look at the Contrastive Sets fold.
Room-Across-Room | Code | Embodied instruction following (with views and multilingual)
Social-IQ | Code, Proj page | Video Question Answering focused on social interactions
VizWiz | Challenge, Challenge | Image Captioning and Question Answering for the blind and visually impaired
Limited AWS and Google Cloud compute credits will be made available to each group, so please consider both your interests and available compute resources when deciding on a dataset/project.


Tuesday Thursday
Feb 2: Course Structure
  • Research and technical challenges
  • Syllabus and requirements
Feb 4: Multimodal applications and datasets
  • Research tasks and datasets
  • Team projects
Feb 9: Basics: "Deep learning"
  • Language, Vision, Audio
  • Loss functions and neural networks
Feb 11: Basics: Optimization
  • Gradients and backprop
  • Practical deep learning optimization
Feb 16: Unimodal representations (Vision)
  • CNNs
  • Residuals and Skip connections
Feb 18: Unimodal representations (Language)
  • Gating and LSTMs
  • Transformers
  • Groups Formed and Dataset Chosen
Feb 23: -- NO CLASS --
Feb 25: Project Hours (Reports 1&2)

Mar 2: Multimodal & Coordinated Representations
  • Auto-encoders
  • CCA
  • Multi-view Clustering
Mar 4: Alignment and Attention
  • Explicit: Dynamic Time Warping
  • Implicit: Attention
  • R1: Task Definition and Data Analysis
Mar 9: Alignment + Representation
  • Self-attention
  • Multimodal Transformers
Mar 11: Project Hours (Report 3)
  • R2: Related Work and Background
Mar 16: Project Hours (Report 3)
Mar 18: Alignment + Representation (Cont.)
  • Self-attention models
  • Multimodal Transformers
  • R3: Baselines, Metrics, and Empty Results Table
Mar 23: Alignment + Translation
  • Module networks
  • Tree-based & Stack models
Mar 25: Embodiment
  • Action as a modality
Mar 30: Reinforcement Learning
  • Markov Decision Process
  • Q learning and policy gradients
Apr 1: Multimodal RL
  • Deep Q learning
  • Multimodal applications
  • R4: Analysis of Baselines
Apr 6: Project Hours (Report 5)
Apr 8: Project Hours (Report 5)
Apr 13: Fusion and co-learning
  • Multi-kernel learning and fusion
  • Few shot learning and co-learning
Apr 15: -- NO CLASS --
Apr 20: New research directions
  • Recent approaches in MMML
Apr 22: Affective Computing (Torsten Wörtwein)
  • R5: Proposed Approach
Apr 27: Project Hours (Final)

Apr 29: Project Hours (Final)
May 4: Guest Lecture (Mark Yatskar - UPenn)
  • Bias and Structure in Vision-and-Language
May 6: Guest Lecture (Malihe Alikhani - Pitt)
  • Coherence and Grounding in Multimodal Communication
May 11: Project Presentations (live)
May 13: -- NO CLASS --
  • R6: Final Reports