Week 1: Introduction
Monday (8/18): Course overview & research primer (Slides: cs8803_introduction)
Wednesday (8/20): Lecture – Applications of visual analytics (Slides: cs8803_applications)
[Optional] Real-Time Video Analytics: The Killer App for Edge Computing
Week 2: Video Query Optimization
Monday (8/25): Introduction to video query optimization (Slides: cs8803_query_opt.pdf)
Monday (8/25): BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics (Slides: cs8803_blazeit.pdf)
[Optional] TASTI: Semantic Indexes for Machine Learning-based Queries over Unstructured Data
Wednesday (8/27): FiGO: Fine-Grained Query Optimization in Video Analytics (Slides: cs8803_figo.pdf)
[Optional] Zeus: Efficiently Localizing Actions in Videos using Reinforcement Learning
Week 3: Video Query Optimization (continued)
Monday (9/1): Labor day (no class)
Wednesday (9/3): VIVA: Optimizing Video Analytics with Declarative Model Relationships (Slides: cs8803_VIVA.pdf)
[Optional] Zelda: Video Analytics using Vision-Language Models
Week 4: Project Proposal Presentations
Monday (9/8): First half of presentations
Wednesday (9/10): Second half of presentations
Friday (9/12): Project Proposal Report Due
Week 5: AI Inference Systems
Monday (9/15): Introduction to AI inference systems (Slides: cs8803_ai_inf_sys.pdf)
Monday (9/15): INFaaS: Automated Model-less Inference Serving (Slides: cs8803_infass.pdf)
[Optional] Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines
Wednesday (9/17): Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications (Slides: cs8803_orion.pdf)
[Optional] Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis
Week 6: AI Inference Systems (continued)
Monday (9/22): Orca: A Distributed Serving System for Transformer-Based Generative Models (Slides: cs8803_ORCA.pdf)
[Optional] ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
Wednesday (9/24): Guest lecture by Haoran Qiu [Microsoft Azure]
Abstract
Large multimodal models (LMMs) are rapidly advancing the frontier of AI by enabling joint understanding and generation across text, images, video, and audio. Yet deploying these models efficiently in production remains a major challenge due to their heterogeneous architectures, multi-stage pipelines, and highly variable request patterns. In this guest lecture, I will first provide a brief overview of the systems challenges in multimodal inference and generation, drawing on both production trace analysis at Azure and open-source LMM characterization. Building on these insights, I will present ModServe, our modular serving system for efficient LMM inference. ModServe decouples model stages for independent optimization and adaptive scaling, and introduces modality-aware scheduling to meet tail latency SLOs under dynamic workloads. On production-scale traces, ModServe improves throughput by 3.3–5.5× and reduces costs by up to 41%, while maintaining latency guarantees. I will conclude by discussing the broader implications of modular and adaptive design for efficient multimodal generation in datacenters.
[Optional] Towards Efficient Large Multimodal Model Serving
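To make the scheduling idea from the abstract concrete, here is a minimal Python sketch of modality-aware, deadline-driven dispatch. This is not ModServe's implementation: the per-modality cost table, the queue structure, and the least-slack policy below are all illustrative assumptions.

    import heapq
    import time

    # Hypothetical per-modality service-time estimates (seconds); a real
    # serving system would profile these online.
    EST_COST = {"text": 0.05, "image": 0.20, "video": 0.80}

    class ModalityAwareScheduler:
        """Toy scheduler with one queue per modality, so a burst of heavy
        video requests cannot starve text requests of their latency SLOs."""

        def __init__(self):
            self.queues = {m: [] for m in EST_COST}

        def submit(self, req_id, modality, slo_s):
            deadline = time.monotonic() + slo_s
            heapq.heappush(self.queues[modality], (deadline, req_id))

        def next_request(self):
            # Dispatch the request whose deadline leaves the least slack
            # after its estimated service time.
            now = time.monotonic()
            best, best_slack = None, float("inf")
            for modality, q in self.queues.items():
                if q:
                    deadline, _ = q[0]
                    slack = deadline - now - EST_COST[modality]
                    if slack < best_slack:
                        best, best_slack = modality, slack
            if best is None:
                return None
            _, req_id = heapq.heappop(self.queues[best])
            return req_id, best

    sched = ModalityAwareScheduler()
    sched.submit("r1", "video", slo_s=5.0)
    sched.submit("r2", "text", slo_s=0.2)
    print(sched.next_request())  # ('r2', 'text'): the tight text SLO wins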
Week 7: Video Processing Acceleration
Monday (9/29): Introduction to video processing acceleration (Slides: cs_8803_acceleration.pdf)
Monday (9/29): VCU: Warehouse-Scale Video Acceleration: Co-Design and Deployment in the Wild
[Optional] HyperCam: Low-Power Onboard Computer Vision for IoT Cameras
Wednesday (10/1): Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads (Slides: cs8803_excamera.pdf)
[Optional] From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers
Week 8: Processing on the Edge
Monday (10/6): Fall break (no class)
Wednesday (10/8): Introduction to edge processing (Slides: cs8803_processing_edge.pdf)
Wednesday (10/8): Small Language Models are the Future of Agentic AI (Slides: cs8803_SLM.pdf)
Week 9: Processing on the Edge (continued)
Monday (10/13): Legilimens: Performant Video Analytics on the System-on-Chip Edge (Slides: cs8803_Legilimens)
[Optional] Moonshine: Speech Recognition for Live Transcription and Voice Commands
Wednesday (10/15): Guest lecture by Tanmay Agrawal, Chinmay Agrawal, and Tristen Nollman [Plix]
Abstract
This lecture surveys how to turn body-worn cameras into end-to-end, outcome-driven products under real-world constraints. We first examine pragmatic strategies to overcome LLM context limits for long videos, favoring purpose-built representations over brute-force tokenization. We then map multi-persona product design (device user vs. platform operator) into an iterative, signal-driven development loop. Finally, we analyze edge–cloud trade-offs, including the surprising shift toward audio-first analysis to maximize reliability, trust, and value via distributed AI processing.
[Optional] Generative AI At The Cutting Edge
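As a rough back-of-the-envelope illustration of why the speakers favor purpose-built representations over brute-force tokenization for long videos, the Python sketch below compares token budgets. Every constant and helper here is hypothetical; only the orders of magnitude matter.

    # Hypothetical token budgets for one hour of body-camera video.
    FPS = 30
    TOKENS_PER_FRAME = 256        # rough order of magnitude for a vision encoder
    SEGMENT_SECONDS = 60
    TOKENS_PER_SUMMARY = 80       # short text summary per segment

    def brute_force_tokens(video_seconds):
        # Tokenize every frame: simple, but blows past any LLM context window.
        return video_seconds * FPS * TOKENS_PER_FRAME

    def purpose_built_tokens(video_seconds, keyframes_per_segment=4):
        # Sparse keyframes plus a per-segment text summary.
        segments = max(1, video_seconds // SEGMENT_SECONDS)
        return segments * (keyframes_per_segment * TOKENS_PER_FRAME + TOKENS_PER_SUMMARY)

    HOUR = 3600
    print(f"{brute_force_tokens(HOUR):,}")    # 27,648,000 tokens
    print(f"{purpose_built_tokens(HOUR):,}")  # 66,240 tokens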
Week 10: Project Midterm Presentations
Monday (10/20): First half of presentations
Wednesday (10/22): Second half of presentations
Week 11: Dataset Labeling and Ingestion Systems
Monday (10/27): Introduction to Dataset Labeling and Ingestion Systems (Slides: cs8803_dataset_ingestion.pdf)
Monday (10/27): Mixtera: A Data Plane for Foundation Model Training (Slides: cs8803_Mixtera.pptx)
[Optional] tf.data: a machine learning data processing framework
Wednesday (10/29): Guest lecture by Mark Zhao [University of Colorado Boulder]
Abstract
Scalable and efficient machine learning (ML) systems have been instrumental in fueling recent advancements in ML capabilities. However, further scaling these systems requires more than simply increasing the performance and quantity of accelerators such as GPUs. Modern ML deployments rely on complex pipelines composed of many diverse and interconnected systems beyond just accelerators.
In this talk, I will emphasize the importance of building scalable systems across the entire ML pipeline. In particular, I will first explore how to build scalable data storage and ingestion systems to manage massive datasets for large-scale ML training pipelines, including those at Meta. To meet growing ML data demands, these data systems must be optimized for performance and efficiency. I will next illustrate how to leverage synergistic optimizations across the training data pipeline to unlock performance and efficiency gains beyond what isolated system optimizations can achieve. However, effectively deploying these optimizations requires navigating a complex system design space. To address this, I will finally introduce cedar, a framework that automates these optimizations and orchestrates ML data processing for diverse training workloads. Together, these systems enable scaling the entire ML pipeline, not just the GPUs.
[Optional] Understanding data storage and ingestion for large-scale deep recommendation model training
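The sketch below illustrates the general idea behind automated data-pipeline optimization in the spirit of cedar: declare the pipeline as a chain of operators, then let an optimizer rewrite it (here, a single operator-fusion pass) before execution. The API is invented for illustration and is not cedar's actual interface.

    from typing import Callable, Iterable, List

    class Pipeline:
        """Toy declarative data pipeline: operators are recorded, not run,
        until iteration, which leaves room for an optimizer to rewrite them."""

        def __init__(self, source: Iterable):
            self.source = source
            self.ops: List[Callable] = []

        def map(self, fn: Callable) -> "Pipeline":
            self.ops.append(fn)
            return self

        def fuse(self) -> "Pipeline":
            # Fuse chained maps into one call per sample, cutting per-operator
            # overhead (one of many rewrites an automated planner could apply).
            ops = self.ops

            def fused(x):
                for op in ops:
                    x = op(x)
                return x

            self.ops = [fused]
            return self

        def __iter__(self):
            for x in self.source:
                for op in self.ops:
                    x = op(x)
                yield x

    pipe = Pipeline(range(4)).map(lambda x: x * 2).map(lambda x: x + 1).fuse()
    print(list(pipe))  # [1, 3, 5, 7]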
Week 12: VLMs and Multimodal Models
Monday (11/3): Introduction to VLMs (Slides: cs8803_VLMs.pdf)
Monday (11/3): CLIP: Learning Transferable Visual Models From Natural Language Supervision (Slides: cs8803_CLIP.pdf)
[Optional] MobileCLIP2: Improving Multi-Modal Reinforced Training
Wednesday (11/5): NVILA: Efficient Frontier Visual Language Models (Slides: cs8803_NVLIA.pptx)
[Optional] LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Week 13: Retrieval and Compound AI Systems
Monday (11/10): Introduction to retrieval and compound AI systems (Slides: cs8803_Retrieval.pdf)
Monday (11/10): RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving (Slides: cs8803_RAGO.pptx)
[Optional] Milvus: A Purpose-Built Vector Data Management System
Wednesday (11/12): Guest lecture by Gohar Irfan Chaudhry [MIT]
Abstract
Agentic workflows commonly coordinate multiple models and tools with complex control logic, and they are quickly becoming the dominant paradigm for AI applications. However, serving them remains inefficient with today’s frameworks. The key problem is that these frameworks expose workflows as opaque sequences of model and tool calls that tightly couple agent logic with model and hardware choices. Often, these workflow components are fragmented across different entities, preventing systems from reasoning about trade-offs across accuracy, latency, energy, and cost. This leads to resource waste and degraded service-level objectives (SLOs).
We present Murakkab, a resource-efficient serving system for agentic workflows. Murakkab introduces a declarative abstraction that decouples workflow specification from execution configuration. A profile-guided optimizer and adaptive runtime jointly manage the full stack: orchestrating workflow components, mapping them to models and hardware, and dynamically reconfiguring execution to satisfy user-defined SLOs. By exposing the internal structure of agentic workflows, Murakkab enables cross-layer optimization that existing frameworks and cloud schedulers cannot achieve. Our evaluation on diverse workflows shows that Murakkab reduces GPU usage by up to 2.8×, energy consumption by 3.7×, and cost by 4.3× while maintaining SLOs.
[Optional] The Shift from Models to Compound AI Systems
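To ground the abstraction, here is a toy Python sketch of a declarative workflow specification paired with a profile-guided planner that maps steps to models under an end-to-end latency SLO. The workflow format, model names, and profiles are all invented; Murakkab's real optimizer additionally reasons about hardware mapping, energy, and dynamic reconfiguration.

    # Declarative spec: what the workflow does, with no model or hardware choices.
    WORKFLOW = [
        {"step": "transcribe", "input": "audio"},
        {"step": "summarize", "input": "transcribe"},
    ]

    # Hypothetical profiles: (latency in seconds, dollar cost) per candidate model.
    PROFILES = {
        "transcribe": {"small-asr": (1.0, 0.001), "large-asr": (3.0, 0.004)},
        "summarize": {"small-llm": (0.5, 0.002), "large-llm": (2.0, 0.010)},
    }

    def plan(workflow, latency_slo_s):
        """Greedy planner: cheapest model per step whose cumulative latency
        still fits the end-to-end SLO."""
        chosen, total_latency = {}, 0.0
        for stage in workflow:
            candidates = sorted(PROFILES[stage["step"]].items(), key=lambda kv: kv[1][1])
            for name, (latency, _cost) in candidates:
                if total_latency + latency <= latency_slo_s:
                    chosen[stage["step"]] = name
                    total_latency += latency
                    break
            else:
                raise ValueError(f"SLO infeasible at step {stage['step']}")
        return chosen

    print(plan(WORKFLOW, latency_slo_s=2.0))
    # {'transcribe': 'small-asr', 'summarize': 'small-llm'}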
Week 14: Project Final Presentations
Monday (11/17): First third of presentations
Wednesday (11/19): Second third of presentations
Week 15: Project Final Presentations (continued)
Monday (11/24): Final third of presentations
Wednesday (11/26): Thanksgiving recess (no class)
Week 16: Wrap-Up
Monday (12/1): Course summary and feedback
Wednesday (12/3): Reading period (no class)
Friday (12/5): Project final report and software due at 5PM