What you'll learn
Learn how Large Language Models (LLMs) generate text by repeatedly predicting the next token, and how techniques like KV caching can greatly speed up that generation loop (a sketch follows this list).
Write code to serve LLM applications efficiently, balancing the trade-off between a single user's output speed and total throughput when many users are served at once (see the batching sketch below).
Explore the fundamentals of Low Rank Adapters (LoRA) and see how Predibase builds its LoRAX framework inference server to serve many fine-tuned models at once (see the LoRA sketch below).
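
To make the first point concrete, here is a minimal sketch of next-token generation with a KV cache, written against Hugging Face transformers; GPT-2, the prompt, and the 20-token budget are placeholders, not the course's own code.

    # Greedy next-token generation with a KV cache (illustrative, not course code).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
    past_key_values = None  # the KV cache: keys/values of every token seen so far

    with torch.no_grad():
        for _ in range(20):
            # With a cache, only the newest token is fed through the model;
            # attention over earlier tokens reuses the cached keys and values.
            step_input = input_ids if past_key_values is None else input_ids[:, -1:]
            out = model(step_input, past_key_values=past_key_values, use_cache=True)
            past_key_values = out.past_key_values
            next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
            input_ids = torch.cat([input_ids, next_token], dim=-1)

    print(tokenizer.decode(input_ids[0]))

Without the cache, every step would re-run attention over the full prefix, which is where most of the speed-up comes from.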
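
For the serving trade-off, one simple lever is batching: running several users' prompts in a single forward pass raises total tokens per second, but each user now waits on the whole batch. A minimal sketch, again with GPT-2 and made-up prompts:

    # Batched generation for several users at once (illustrative prompts and model).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompts = ["The capital of France is", "My favorite color is", "Once upon a time"]
    batch = tokenizer(prompts, return_tensors="pt", padding=True)

    with torch.no_grad():
        out = model.generate(**batch, max_new_tokens=10,
                             pad_token_id=tokenizer.eos_token_id)

    for ids in out:
        print(tokenizer.decode(ids, skip_special_tokens=True))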
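
Finally, a Low Rank Adapter adds a small trainable update on top of a frozen pretrained weight. The sketch below shows the idea for a single linear layer; the class name and the rank/scaling values are illustrative, not Predibase's implementation. Because the large base weights stay frozen and shared, a server can hold many tiny adapters side by side, which is what makes serving many fine-tuned models at once feasible.

    # A LoRA-style linear layer: frozen base weight plus a trainable low-rank update.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)          # freeze the pretrained weights
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
            self.scale = alpha / r

        def forward(self, x):
            # y = base(x) + scale * x A^T B^T -- only A and B (r*(d_in+d_out) params) train
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(768, 768))
    y = layer(torch.randn(2, 768))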