Week 1: Introduction
Monday (8/18): Course overview & research primer (Slides: cs8803_introduction)
Wednesday (8/20): Lecture – Applications of visual analytics (Slides: cs8803_applications)
[Optional] Real-Time Video Analytics: The Killer App for Edge Computing
Week 2: Video Query Optimization
Monday (8/25): Introduction to video query optimization (Slides: cs8803_query_opt.pdf)
Monday (8/25): BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics (Slides: cs8803_blazeit.pdf)
[Optional] TASTI: Semantic Indexes for Machine Learning-based Queries over Unstructured Data
Wednesday (8/27): FiGO: Fine-Grained Query Optimization in Video Analytics (Slides: cs8803_figo.pdf)
[Optional] Zeus: Efficiently Localizing Actions in Videos using Reinforcement Learning
Week 3: Video Query Optimization (continued)
Monday (9/1): Labor day (no class)
Wednesday (9/3): VIVA: Optimizing Video Analytics with Declarative Model Relationships (Slides: cs8803_VIVA.pdf)
[Optional] Zelda: Video Analytics using Vision-Language Models
Week 4: Project Proposal Presentations
Monday (9/8): First half of presentations
Wednesday (9/10): Second half of presentations
Friday (9/12): Project Proposal Report Due
Week 5: AI Inference Systems
Monday (9/15): Introduction to AI inference systems (Slides: cs8803_ai_inf_sys.pdf)
Monday (9/15): INFaaS: Automated Model-less Inference Serving (Slides: cs8803_infass.pdf)
[Optional] Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines
Wednesday (9/17): Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications (Slides: cs8803_orion.pdf)
[Optional] Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis
Week 6: AI Inference Systems (continued)
Monday (9/22): Orca: A Distributed Serving System for Transformer-Based Generative Models (Slides: cs8803_ORCA.pdf)
[Optional] ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
Wednesday (9/24): Guest lecture by Haoran Qiu [Microsoft Azure]
Abstract
Large multimodal models (LMMs) are rapidly advancing the frontier of AI by enabling joint understanding and generation across text, images, video, and audio. Yet deploying these models efficiently in production remains a major challenge due to their heterogeneous architectures, multi-stage pipelines, and highly variable request patterns. In this guest lecture, I will first provide a brief overview of the systems challenges in multimodal inference and generation, drawing on both production trace analysis at Azure and open-source LMM characterization. Building on these insights, I will present ModServe, our modular serving system for efficient LMM inference. ModServe decouples model stages for independent optimization and adaptive scaling, and introduces modality-aware scheduling to meet tail latency SLOs under dynamic workloads. On production-scale traces, ModServe improves throughput by 3.3–5.5× and reduces costs by up to 41%, while maintaining latency guarantees. I will conclude by discussing the broader implications of modular and adaptive design for efficient multimodal generation in datacenters.
[Optional] Towards Efficient Large Multimodal Model Serving
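To make the scheduling idea from the abstract concrete, here is a minimal Python sketch of modality-aware, deadline-driven dispatch. This is not ModServe's implementation: the per-modality cost table, the queue structure, and the least-slack policy below are all illustrative assumptions.

    import heapq
    import time

    # Hypothetical per-modality service-time estimates (seconds); a real
    # serving system would profile these online.
    EST_COST = {"text": 0.05, "image": 0.20, "video": 0.80}

    class ModalityAwareScheduler:
        """Toy scheduler with one queue per modality, so a burst of heavy
        video requests cannot starve text requests of their latency SLOs."""

        def __init__(self):
            self.queues = {m: [] for m in EST_COST}

        def submit(self, req_id, modality, slo_s):
            deadline = time.monotonic() + slo_s
            heapq.heappush(self.queues[modality], (deadline, req_id))

        def next_request(self):
            # Dispatch the request whose deadline leaves the least slack
            # after its estimated service time.
            now = time.monotonic()
            best, best_slack = None, float("inf")
            for modality, q in self.queues.items():
                if q:
                    deadline, _ = q[0]
                    slack = deadline - now - EST_COST[modality]
                    if slack < best_slack:
                        best, best_slack = modality, slack
            if best is None:
                return None
            _, req_id = heapq.heappop(self.queues[best])
            return req_id, best

    sched = ModalityAwareScheduler()
    sched.submit("r1", "video", slo_s=5.0)
    sched.submit("r2", "text", slo_s=0.2)
    print(sched.next_request())  # ('r2', 'text'): the tight text SLO wins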
Week 7: Video Processing Acceleration
Monday (9/29): Introduction to video processing acceleration (Slides: cs_8803_acceleration.pdf)
Monday (9/29): VCU: Warehouse-Scale Video Acceleration: Co-Design and Deployment in the Wild
[Optional] HyperCam: Low-Power Onboard Computer Vision for IoT Cameras
Wednesday (10/1): Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads (Slides: cs8803_excamera.pdf)
[Optional] From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers
Week 8: Processing on the Edge
Monday (10/6): Fall break (no class)
Wednesday (10/8): Introduction to edge processing (Slides: cs8803_processing_edge.pdf)
Wednesday (10/8): Small Language Models are the Future of Agentic AI (Slides: cs8803_SLM.pdf)
Week 9: Processing on the Edge (continued)
Monday (10/13): Legilimens: Performant Video Analytics on the System-on-Chip Edge (Slides: cs8803_Legilimens)
[Optional] Moonshine: Speech Recognition for Live Transcription and Voice Commands
Wednesday (10/15): Guest lecture by Tanmay Agrawal, Chinmay Agrawal, and Tristen Nollman [Plix]
Abstract
This lecture surveys how to turn body-worn cameras into end-to-end, outcome-driven products under real-world constraints. We first examine pragmatic strategies to overcome LLM context limits for long videos, favoring purpose-built representations over brute-force tokenization. We then map multi-persona product design (device user vs. platform operator) into an iterative, signal-driven development loop. Finally, we analyze edge–cloud trade-offs, including the surprising shift toward audio-first analysis to maximize reliability, trust, and value via distributed AI processing.
[Optional] Generative AI At The Cutting Edge
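As a rough back-of-the-envelope illustration of why the speakers favor purpose-built representations over brute-force tokenization for long videos, the Python sketch below compares token budgets. Every constant and helper here is hypothetical; only the orders of magnitude matter.

    # Hypothetical token budgets for one hour of body-camera video.
    FPS = 30
    TOKENS_PER_FRAME = 256        # rough order of magnitude for a vision encoder
    SEGMENT_SECONDS = 60
    TOKENS_PER_SUMMARY = 80       # short text summary per segment

    def brute_force_tokens(video_seconds):
        # Tokenize every frame: simple, but blows past any LLM context window.
        return video_seconds * FPS * TOKENS_PER_FRAME

    def purpose_built_tokens(video_seconds, keyframes_per_segment=4):
        # Sparse keyframes plus a per-segment text summary.
        segments = max(1, video_seconds // SEGMENT_SECONDS)
        return segments * (keyframes_per_segment * TOKENS_PER_FRAME + TOKENS_PER_SUMMARY)

    HOUR = 3600
    print(f"{brute_force_tokens(HOUR):,}")    # 27,648,000 tokens
    print(f"{purpose_built_tokens(HOUR):,}")  # 66,240 tokens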
Week 10: Project Midterm Presentations
Monday (10/20): First half of presentations
Wednesday (10/22): Second half of presentations
Week 11: Dataset Labeling and Ingestion Systems
Monday (10/27): Introduction to Dataset Labeling and Ingestion Systems (Slides: cs8803_dataset_ingestion.pdf)
Monday (10/27): Mixtera: A Data Plane for Foundation Model Training (Slides: cs8803_Mixtera.pptx)
[Optional] tf.data: a machine learning data processing framework
Wednesday (10/29): Guest lecture by Mark Zhao [University of Colorado Boulder]
Abstract
Scalable and efficient machine learning (ML) systems have been instrumental in fueling recent advancements in ML capabilities. However, further scaling these systems requires more than simply increasing the performance and quantity of accelerators such as GPUs. Modern ML deployments rely on complex pipelines composed of many diverse and interconnected systems beyond just accelerators.
In this talk, I will emphasize the importance of building scalable systems across the entire ML pipeline. In particular, I will first explore how to build scalable data storage and ingestion systems to manage massive datasets for large-scale ML training pipelines, including those at Meta. To meet growing ML data demands, these data systems must be optimized for performance and efficiency. I will next illustrate how to leverage synergistic optimizations across the training data pipeline to unlock performance and efficiency gains beyond what isolated system optimizations can achieve. However, effectively deploying these optimizations requires navigating a complex system design space. To address this, I will finally introduce cedar, a framework that automates these optimizations and orchestrates ML data processing for diverse training workloads. Together, these systems enable scaling the entire ML pipeline, not just the GPUs.
[Optional] Understanding data storage and ingestion for large-scale deep recommendation model training
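The sketch below illustrates the general idea behind automated data-pipeline optimization in the spirit of cedar: declare the pipeline as a chain of operators, then let an optimizer rewrite it (here, a single operator-fusion pass) before execution. The API is invented for illustration and is not cedar's actual interface.

    from typing import Callable, Iterable, List

    class Pipeline:
        """Toy declarative data pipeline: operators are recorded, not run,
        until iteration, which leaves room for an optimizer to rewrite them."""

        def __init__(self, source: Iterable):
            self.source = source
            self.ops: List[Callable] = []

        def map(self, fn: Callable) -> "Pipeline":
            self.ops.append(fn)
            return self

        def fuse(self) -> "Pipeline":
            # Fuse chained maps into one call per sample, cutting per-operator
            # overhead (one of many rewrites an automated planner could apply).
            ops = self.ops

            def fused(x):
                for op in ops:
                    x = op(x)
                return x

            self.ops = [fused]
            return self

        def __iter__(self):
            for x in self.source:
                for op in self.ops:
                    x = op(x)
                yield x

    pipe = Pipeline(range(4)).map(lambda x: x * 2).map(lambda x: x + 1).fuse()
    print(list(pipe))  # [1, 3, 5, 7]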
Week 12: VLMs and Multimodal Models
Monday (11/3): Introduction to VLMs (Slides: cs8803_VLMs.pdf)
Monday (11/3): CLIP: Learning Transferable Visual Models From Natural Language Supervision (Slides: cs8803_CLIP.pdf)
[Optional] MobileCLIP2: Improving Multi-Modal Reinforced Training
Wednesday (11/5): NVILA: Efficient Frontier Visual Language Models (Slides: cs8803_NVLIA.pptx)
[Optional] LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Week 13: Retrieval and Compound AI Systems
Monday (11/10): Introduction to retrieval and compound AI systems (Slides: cs8803_Retrieval.pdf)
Monday (11/10): RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving (Slides: cs8803_RAGO.pptx)
[Optional] Milvus: A Purpose-Built Vector Data Management System
Wednesday (11/12): Guest lecture by Gohar Irfan Chaudhry [MIT]
Abstract
Agentic workflows commonly coordinate multiple models and tools with complex control logic, and they are quickly becoming the dominant paradigm for AI applications. However, serving them remains inefficient with today’s frameworks. The key problem is that these frameworks expose workflows as opaque sequences of model and tool calls that tightly couple agent logic with model and hardware choices. Often, these workflow components are fragmented across different entities, preventing systems from reasoning about trade-offs across accuracy, latency, energy, and cost. This leads to resource waste and degraded service-level objectives (SLOs).
We present Murakkab, a resource-efficient serving system for agentic workflows. Murakkab introduces a declarative abstraction that decouples workflow specification from execution configuration. A profile-guided optimizer and adaptive runtime jointly manage the full stack: orchestrating workflow components, mapping them to models and hardware, and dynamically reconfiguring execution to satisfy user-defined SLOs. By exposing the internal structure of agentic workflows, Murakkab enables cross-layer optimization that existing frameworks and cloud schedulers cannot achieve. Our evaluation on diverse workflows shows that Murakkab reduces GPU usage by up to 2.8×, energy consumption by 3.7×, and cost by 4.3× while maintaining SLOs.
[Optional] The Shift from Models to Compound AI Systems
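To ground the abstraction, here is a toy Python sketch of a declarative workflow specification paired with a profile-guided planner that maps steps to models under an end-to-end latency SLO. The workflow format, model names, and profiles are all invented; Murakkab's real optimizer additionally reasons about hardware mapping, energy, and dynamic reconfiguration.

    # Declarative spec: what the workflow does, with no model or hardware choices.
    WORKFLOW = [
        {"step": "transcribe", "input": "audio"},
        {"step": "summarize", "input": "transcribe"},
    ]

    # Hypothetical profiles: (latency in seconds, dollar cost) per candidate model.
    PROFILES = {
        "transcribe": {"small-asr": (1.0, 0.001), "large-asr": (3.0, 0.004)},
        "summarize": {"small-llm": (0.5, 0.002), "large-llm": (2.0, 0.010)},
    }

    def plan(workflow, latency_slo_s):
        """Greedy planner: cheapest model per step whose cumulative latency
        still fits the end-to-end SLO."""
        chosen, total_latency = {}, 0.0
        for stage in workflow:
            candidates = sorted(PROFILES[stage["step"]].items(), key=lambda kv: kv[1][1])
            for name, (latency, _cost) in candidates:
                if total_latency + latency <= latency_slo_s:
                    chosen[stage["step"]] = name
                    total_latency += latency
                    break
            else:
                raise ValueError(f"SLO infeasible at step {stage['step']}")
        return chosen

    print(plan(WORKFLOW, latency_slo_s=2.0))
    # {'transcribe': 'small-asr', 'summarize': 'small-llm'}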
Week 14: Project Final Presentations
Monday (11/17): First third of presentations
Wednesday (11/19): Second third of presentations
Week 15: Project Final Presentations (continued)
Monday (11/24): Final third of presentations
Wednesday (11/26): Thanksgiving recess (no class)
Week 16: Wrap-Up
Monday (12/1): Course summary and feedback
Wednesday (12/3): Reading period (no class)
Friday (12/5): Project final report and software due at 5PM