Lectures

CMMRS 2026 includes lectures from faculty at Cornell University, the University of Maryland, and the Max Planck Institutes. As in previous years, the lectures cover a variety of cutting-edge research topics in computer and information science to give attendees an opportunity to broaden their exposure to research across the discipline.


Ramani Duraiswami

Department of Computer Science and UMIACS, University of Maryland, College Park

Lecture 1: Towards Auditory General Intelligence: Building, Benchmarking, and Improving Large Audio-Language Models

Perception of audio events, music, and speech plays a fundamental role in how humans interact with the world. Large language models have absorbed vast amounts of knowledge from text, but they currently lag in auditory scene understanding, speech and non-speech communication, and music analysis—all central facets of human intelligence. How do we build AI systems that truly understand audio the way humans do?

In this lecture, I will introduce the emerging class of Large Audio-Language Models (LALMs)—systems that connect audio encoders to large language models so they can listen, reason, and respond to queries about what they hear. I will start with the basic architectural recipe: how do you take a pretrained language model and teach it to process audio? What design choices matter—cross-attention versus token prepending, frozen versus finetuned encoders, curriculum training strategies? I will then trace the evolution of our group’s Audio Flamingo family of models: from COMPA (ICLR 2024, spotlight) and GAMA (EMNLP 2024), through Audio Flamingo 2 (ICML 2025) and Audio Flamingo 3 (NeurIPS 2025, spotlight), to Music Flamingo, which introduced structured music understanding, and Audio Flamingo Next (AF-Next, submitted), which extends the framework to multi-talker speech, 30-minute audio, and temporal chain-of-thought reasoning. AF3 and AF-Next are among the leading fully open models for audio understanding, surpassing both open-weight and closed-source systems across over 20 benchmarks.

A recurring theme will be the tight coupling between building models and building benchmarks. Benchmarking has been crucial for language model development, yet such benchmarks for LALMs were initially absent. I will describe our creation of MMAU (ICLR 2025, spotlight), the first comprehensive benchmark for audio general intelligence and now widely used for evaluating LALMs, and its successor MMAU-Pro (AAAI 2026). I will discuss what makes audio evaluation particularly challenging compared to vision or language, and how building better benchmarks drives the cycle of model improvement.

I will also briefly describe our work extending Audio Flamingo to the audio-visual setting, where the model must jointly reason across what it hears and sees. Finally, I will close with a broader perspective: the recipe of building domain-specific encoders, pairing them with language models, and constructing rigorous benchmarks is not limited to audio. I will sketch how we are beginning to apply this same paradigm to genomics, building multimodal embeddings that bridge DNA sequences and biomedical text—suggesting a general methodology for advancing foundation models in scientific domains.

No background in audio signal processing or language models is assumed. The lecture is designed as a self-contained introduction to multimodal AI, using audio as the running example.

Lecture 2: Towards Physical AI: Differentiable Modeling and Neural Operators: When Physics Meets Machine Learning

An under-appreciated aspect of the deep learning revolution is the use of automatic differentiation and backpropagation on differentiable computational graphs. Before learning from data became the method of choice, scientists spent entire careers developing forward models that captured deep knowledge about the physical world—models based on mathematics, physics, biology, and acoustics. Making these forward models differentiable allows this accumulated scientific knowledge to be incorporated directly into deep learning architectures, enabling more efficient computational pipelines for tasks like parameter optimization, inverse problem solution, and learning explainable models, especially in domains where data is sparse.
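As a minimal sketch of the idea above, the following toy example makes a forward model differentiable and uses its gradient to solve a small inverse problem. Everything in it is an assumption made for illustration and is not taken from the lecture: the exponential-decay forward model, the hand-rolled `Dual` type (a tiny forward-mode automatic differentiation scheme), and names such as `recover_decay_rate`.

```python
import math


class Dual:
    """A number that carries its value and its derivative (forward-mode AD)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = _lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

    def __neg__(self):
        return Dual(-self.value, -self.deriv)


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def dual_exp(x):
    """exp with the chain rule applied to the derivative part."""
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)


def forward_model(k, t):
    """Hypothetical physical model: a signal decaying at rate k."""
    return dual_exp(-k * t)


def loss(k, times, observations):
    """Sum of squared differences between model output and observations."""
    total = Dual(0.0)
    for t, y in zip(times, observations):
        residual = forward_model(k, t) - y
        total = total + residual * residual
    return total


def recover_decay_rate(times, observations, k0=0.2, lr=0.02, steps=2000):
    """Inverse problem: find k by gradient descent through the forward model."""
    k = k0
    for _ in range(steps):
        # Seeding deriv=1.0 makes loss(...).deriv equal to d(loss)/dk.
        k -= lr * loss(Dual(k, 1.0), times, observations).deriv
    return k


# Synthetic observations generated with a "true" decay rate of 0.7.
times = [0.5 * i for i in range(1, 7)]
observations = [math.exp(-0.7 * t) for t in times]
estimated_rate = recover_decay_rate(times, observations)
```

A real pipeline would rely on an automatic differentiation framework rather than a hand-written `Dual` type. The sketch only shows that once the forward model is differentiable, recovering a physical parameter reduces to gradient descent on a loss.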

In this lecture, I will develop these ideas through a series of concrete examples drawn from our group’s recent work. I will begin with spatial audio—the problem of making sound on headphones perceptually realistic. This requires filtering audio through head-related transfer functions (HRTFs) and room impulse responses, which are traditionally long, expensive filters treated as fixed data. I will show how we made these systems differentiable end-to-end: approximating measured FIR filters with compact IIR models via gradient-based optimization, and synthesizing room reverberation through differentiable feedback delay networks that optimize directly against psychoacoustic metrics like clarity, definition, and reverberation time—achieving real-time binaural rendering on embedded hardware. I will also describe a differentiable multi-sphere scattering model for binaural hearing, implemented in JAX, that connects acoustic physics to gradient-based localization and tracking. Further examples include differentiable rendering via Gaussian splatting and a regularized signed-distance-function formulation (ViscoReg) inspired by continuum mechanics.
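To make the FIR-to-IIR step concrete, here is a toy sketch that fits a one-pole IIR filter to a longer FIR filter by gradient descent. It is not the group's method: the one-pole structure, the synthetic 32-tap target, the squared-error loss, and the names `one_pole_response` and `fit_one_pole` are all assumptions chosen to keep the example small.

```python
def one_pole_response(a, b, n_taps):
    """Impulse response of y[n] = b*x[n] + a*y[n-1], with sensitivities.

    Returns three lists: y[n], dy[n]/da and dy[n]/db. The sensitivities
    come from differentiating the recursion itself, step by step.
    """
    y_prev = dy_da_prev = dy_db_prev = 0.0
    ys, dy_das, dy_dbs = [], [], []
    for n in range(n_taps):
        x = 1.0 if n == 0 else 0.0        # unit impulse input
        y = b * x + a * y_prev
        dy_da = y_prev + a * dy_da_prev   # product rule on a * y[n-1]
        dy_db = x + a * dy_db_prev
        ys.append(y)
        dy_das.append(dy_da)
        dy_dbs.append(dy_db)
        y_prev, dy_da_prev, dy_db_prev = y, dy_da, dy_db
    return ys, dy_das, dy_dbs


def fit_one_pole(fir_taps, a=0.5, b=1.0, lr=0.01, steps=5000):
    """Gradient descent on the squared error between the two impulse responses."""
    for _ in range(steps):
        ys, dy_das, dy_dbs = one_pole_response(a, b, len(fir_taps))
        residuals = [y - h for y, h in zip(ys, fir_taps)]
        grad_a = 2.0 * sum(r * d for r, d in zip(residuals, dy_das))
        grad_b = 2.0 * sum(r * d for r, d in zip(residuals, dy_dbs))
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Hypothetical "measured" FIR filter: 32 taps of a decaying exponential.
fir_taps = [0.5 * 0.8 ** n for n in range(32)]
a_fit, b_fit = fit_one_pole(fir_taps)
```

The fitted filter replaces 32 stored taps with two coefficients. Measured HRTFs would need higher-order IIR sections and a perceptually motivated loss, which this sketch leaves out.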

From these applied examples, I will zoom out to a more general question: can we learn the solution operator of a partial differential equation, rather than solving it instance by instance? I will introduce neural operator learning and describe our work GAIA (Geometry Aware Integral Autoencoder), which learns to map between function spaces for both forward and inverse PDE problems on arbitrary geometries, from electrical impedance tomography to stress analysis on 3D mechanical components. The key ideas (integral transforms parameterized by neural networks, geometry tokenization via cross-attention, and a unified encoder-decoder architecture that handles forward and inverse problems in a single pass) will be developed from first principles.

Throughout the lecture, I will emphasize the recurring ideas that connect these projects: making physical models differentiable so they can be optimized with gradient descent; encoding known structure—integral equations, conservation laws, geometric symmetries—into network architectures; and the interplay between physics-based priors and data-driven learning. I will also briefly touch on our work on efficient linear attention mechanisms for Transformer architectures.

The lecture assumes basic familiarity with calculus and linear algebra but no prior exposure to PDEs, signal processing, or deep learning. Students will come away with an understanding of how classical scientific modeling and modern machine learning can powerfully reinforce each other.


David Van Horn

Department of Computer Science and UMIACS, University of Maryland, College Park

A Gradual Introduction to Programming Language Research

Gradually typed languages integrate static and dynamic typing disciplines, providing some of the safety guarantees and engineering benefits of static type systems while allowing seamless interoperation between statically and dynamically typed components at run time. Industrial adoption has been substantial: such languages have been developed by Microsoft, Google, and Meta, among others, and are now widely used in practice. Their adoption represents a notable example of successful technology transfer from the programming languages research community. At the same time, research on gradual typing remains highly active, with a steady presence at every major PL venue.

In this lecture series, I use gradual typing as a vehicle for introducing core methods in programming language research, including formal syntax, operational semantics, type systems, and associated proof techniques and tools such as proof assistants. After developing these ideas in a simplified setting, we turn to the foundations of gradual typing, examining its key semantic and metatheoretic results and highlighting open problems and directions for future work in both theory and practice.
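The interoperation between typed and untyped components can be illustrated with a small run-time cast, sketched below. This is an illustrative assumption, not the formal development used in the lectures and not the semantics of any particular language: the `cast` function, the `CastError` exception, and the encoding of a function type as an `(argument, result)` pair are all hypothetical.

```python
class CastError(TypeError):
    """A value crossing into typed code failed its run-time check."""


def cast(value, expected, label):
    """Check `value` against `expected` at the boundary named `label`.

    `expected` is either a Python class such as int or str, or a pair
    (arg_type, result_type) standing for a one-argument function type.
    """
    if isinstance(expected, tuple):
        arg_type, result_type = expected
        if not callable(value):
            raise CastError(f"{label}: expected a function, got {value!r}")

        # A function cannot be checked all at once, so it is wrapped:
        # each call checks the argument going in and the result coming out.
        def wrapped(arg):
            checked_arg = cast(arg, arg_type, f"{label}: argument")
            return cast(value(checked_arg), result_type, f"{label}: result")

        return wrapped
    if not isinstance(value, expected):
        raise CastError(f"{label}: expected {expected.__name__}, got {value!r}")
    return value


def untyped_double(x):
    """Dynamically typed code: works on ints, strings, lists, and so on."""
    return x + x


# Import the untyped function into "typed" code at type int -> int.
typed_double = cast(untyped_double, (int, int), "untyped_double")
answer = typed_double(21)  # 42
```

Calling `typed_double("ab")` raises `CastError` at the argument check, even though `untyped_double("ab")` on its own would run without complaint. The label plays the role of blame information: it records which boundary a failing value crossed.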


Jon Kleinberg

Departments of Computer Science and Information Science, Cornell University

Lecture 1: AI’s Models of the World, and Ours

Recent work on generative AI and large language models (LLMs) has addressed the simultaneous challenge of evaluating an AI system’s explicit behavior at one level and its implicit representations of the world at another. Such distinctions become crucial as people interact with powerful AI systems, where a mismatch between the system’s model of the world and our human models of the world can lead to situations in which the system has inadvertently ‘set us up to fail’ through our interaction with it. We explore these questions through the lens of generative AI, drawing on examples from game-playing, geographic navigation, and other complex tasks: When we train a model to win chess games, what happens when we pair it with a weaker partner who makes some of the moves? When we train a model to find shortest paths, what happens when it has to deal with unexpected detours? The picture we construct is further complicated by theoretical results indicating that successful generation can be achieved even by agents that are provably incapable of identifying the model they’re generating from. The talk will include joint work with Ashton Anderson, Karim Hamade, Reid McIlroy-Young, Siddhartha Sen, Justin Chen, Sendhil Mullainathan, Ashesh Rambachan, and Keyon Vafa.

Lecture 2: Revisiting the Behavioral Foundations of Algorithms

Many of the most widely used algorithms are fundamentally concerned with building models of human users from their observed behavior. Traditional approaches to this problem rely on an often unstated revealed-preference assumption: that choice reveals preference. Yet a long line of work in psychology and behavioral economics reveals the gaps that can open up between choice and preference, and experience with platform dynamics makes clear how such gaps can arise in some of the most basic online settings; for example, we might choose content to consume in the present and then later regret the time we spent on it. More generally, behavioral biases and inconsistent preferences make it highly challenging to appropriately interpret the user data that we observe. We discuss a set of models and algorithms that address this challenge through a process of “inversion”, in which an algorithm must try to infer goals and preferences that are not directly measured in the data. This talk will be based on joint work with Jens Ludwig, Sendhil Mullainathan, and Manish Raghavan.

Lecture 3: Formal Models of Language Generation

The emergence of large language models has prompted a surge of interest in theoretical models that might give us insight into both their successes and their shortcomings. We’ll give an overview of recent work in this direction, focusing on a surprising line of positive results that shows it is possible to give guarantees for language-generation algorithms even in the absence of any probabilistic assumptions, in a framework known as “language generation in the limit”. These results suggest interesting notions of breadth in language generation, attempting to formalize the idea that different algorithms for this problem might all meet the specification but differ significantly in their expressiveness, that is, in how richly they can generate from the underlying language. We also discuss strong contrasts with classical results on language identification, showing a strong sense in which language generation and language learning are fundamentally different as computational problems. The talk will be based on joint work with Sendhil Mullainathan and Fan Wei.


Lillian Lee

Department of Computer Science, Cornell University

Taking a turn for the better? Computational identification of crucial moments in consequential conversations

So much of human interaction occurs as conversations, and it is both fascinating and imperative to analyze them. Recently, my co-authors and I have sought to identify “key” moments in such exchanges.

(1) A “pivoting” moment corresponds to a *redirection* of the conversation introduced by one party that is accepted/followed by the other. We develop a probabilistic measure of how much an utterance immediately redirects the flow of the conversation, accounting for both the intention and the actual realization of such a change.

(2) In a *pivotal* moment, the conversation’s outcome hangs in the balance: how one responds can put the conversation on substantially diverging trajectories leading to significantly different results. We formalize this intuition by estimating the variance in expectation of outcome depending on what might be said next.
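One simple reading of "variance in expectation of outcome" can be sketched as follows. This is an assumption for illustration, not the authors' estimator: the function name `outcome_variance`, the representation of each possible next utterance as a `(probability, expected_outcome)` pair, and the numbers in the two example moments are all hypothetical.

```python
def outcome_variance(candidates):
    """Variance of the expected outcome across possible next utterances.

    `candidates` is a list of (probability, expected_outcome) pairs, one
    per thing that might be said next. Probabilities are normalized here,
    so they only need to be non-negative with a positive sum.
    """
    total = sum(p for p, _ in candidates)
    mean = sum(p * e for p, e in candidates) / total
    return sum(p * (e - mean) ** 2 for p, e in candidates) / total


# A moment where the next reply barely matters: outcomes are all alike.
calm_moment = [(0.3, 0.5), (0.7, 0.5)]

# A moment where the next reply sends the conversation in very different
# directions: expected outcomes diverge sharply.
pivotal_moment = [(0.5, 0.9), (0.5, 0.1)]

calm_score = outcome_variance(calm_moment)
pivotal_score = outcome_variance(pivotal_moment)
```

Under this reading, a moment scores high when plausible replies lead to very different expected outcomes, and near zero when the outcome is insensitive to what is said next.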

We find significant correlates of our measures in real human conversations on widely used platforms. For example, the patients in our longer-term mental-health-therapy data who redirected less in their first few sessions were significantly more likely to eventually express dissatisfaction with their therapist and terminate the relationship; and the staff responses in our crisis-counseling data had greater estimated impact on disengagement rates during pivotal moments than during non-pivotal ones.

Joint work with Vivian Nguyen, Cristian Danescu-Niculescu-Mizil, Thomas D. Hull, and Sang Min (Dave) Jung.


Wei-Chiu Ma

Department of Computer Science, Cornell University

Towards Physically Grounded Digital Twins and Beyond

Generative AI and foundation models have revolutionized numerous fields (e.g., vision, NLP), transforming our lives in many ways. However, their impact on robotics remains relatively limited compared to other domains. One critical hurdle preventing robotics from reaching the “GPT moment” is the lack of sufficient data. Unlike the abundant image and text data available on the web, real-world robotic data is much more scarce. Collecting this data is expensive, time-consuming, and, most importantly, presents significant safety concerns.

In this context, the automatic creation of realistic, interactable, and highly detailed virtual replicas of physical environments offers immense potential. By making digital twins look real and act real, we can use them as dynamic, virtual testbeds for training and evaluating robotic agents at scale. In this talk, I will share our recent progress in advancing digital twin construction and how it enables more robust policy learning. By building replicas that are not only visually and geometrically accurate but also physically grounded, robotic agents deployed in these mirror worlds can interact with their environments and leverage observations and feedback to learn decision-making policies that transfer seamlessly to their real-world counterparts — safely and at scale.


Rupak Majumdar

Max Planck Institute for Software Systems (MPI-SWS)

Software Engineering in the World of Agents

Coding agents have disrupted the way we build software and think about software engineering. Traditionally, software engineering processes were developed around the “scarce resource” of developer time. When software development costs go to (almost) zero, we have to re-evaluate many of our assumptions. In this lecture, I will discuss problems in software engineering in the world of agents by considering problems of information acquisition, attention, and trust. Where possible, I will show simple mathematical models that provide a clean vocabulary to formulate these problems. It is difficult to predict what the world will look like in August (we are merely in May now), but framing engineering problems using the lens of information acquisition should give us a vocabulary that is relevant to many other problems.


Abhilasha Ravichander

Max Planck Institute for Software Systems (MPI-SWS)

Trustworthy Large Language Models

Millions of everyday users interact with technologies built on generative AI, from voice assistants and search engines to chatbots. While these AI-based systems are increasingly integrated into modern life, they can also magnify risks, inequities, and dissatisfaction when providers deploy unreliable systems. A primary obstacle to greater reliability is the opacity of the underlying large language models: we lack a systematic understanding of how these models work, where critical vulnerabilities arise, why they occur, and how models must be redesigned to address them. In this tutorial, we will first provide foundational background on large language models. We will then describe research investigating when and how models acquire knowledge and capabilities, followed by efforts to build tools that enable greater data transparency. We will discuss why large language models produce incorrect knowledge, or hallucinate, and explore the fairness and bias concerns that emerge from these systems. Finally, we will consider the implications of these findings for building the next generation of responsible and trustworthy AI systems.