
Reinforcement Learning in LLMs - Why and How
From imitation to optimization: when LLMs need RL, how verifiable rewards unlock reasoning, and a minimal GRPO playbook.
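For a concrete taste of the playbook, here is a minimal PyTorch sketch (my own, with illustrative names) of the group-relative advantage that lets GRPO drop the learned critic; the 0/1 rewards stand in for a verifiable checker.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each sampled completion is scored
    against its own group's mean/std, so no value network is needed.
    rewards: (num_prompts, group_size), one row of G rollouts per prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled completions each, 0/1 verifiable rewards
rewards = torch.tensor([[1., 0., 0., 1.],
                        [0., 0., 0., 1.]])
print(grpo_advantages(rewards))
```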

This note provides a high-level summary of progress in large language models (LLMs), covering major milestones from Transformers to ChatGPT. It serves as a fast-paced recap for readers who want to catch up on the field quickly.

Exponential-min and Gumbel-max tricks for reformulating sampling from a discrete distribution as an argmin or argmax over independent noise; this reparameterization is what enables differentiable relaxations of sampling.
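As a quick taste (my sketch, not code from the post), the Gumbel-max side of the trick in a few lines, with an empirical sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.3])
logits = np.log(probs)  # any constant shift of these works too

def gumbel_max(logits, rng):
    # argmax of logits + i.i.d. Gumbel(0, 1) noise ~ Categorical(softmax(logits))
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax(logits + g))

samples = [gumbel_max(logits, rng) for _ in range(100_000)]
print(np.bincount(samples) / len(samples))  # ≈ [0.2, 0.5, 0.3]
```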

Concentration of measure pushes Gaussian samples onto a thin shell—here's the intuition, the math, and why typicality matters for generative models.
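A tiny experiment (my sketch, assuming nothing beyond NumPy) makes the thin-shell claim tangible: the norm of a d-dimensional standard Gaussian concentrates around sqrt(d) while its spread stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10_000):
    x = rng.standard_normal((2_000, d))
    norms = np.linalg.norm(x, axis=1)
    # mean(norm) ≈ sqrt(d); std(norm) stays near 1/sqrt(2) regardless of d
    print(f"d={d:>6}  mean/sqrt(d)={norms.mean()/np.sqrt(d):.4f}  std={norms.std():.4f}")
```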

Music notation gives us a tidy grid of notes, but physics delivers a messy spectrum of vibrations. Here's why tuning is always a compromise.
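One number captures the compromise (a quick sketch, not from the post itself): twelve just 3:2 fifths overshoot seven 2:1 octaves by the Pythagorean comma, so no 12-note tuning can keep both intervals pure.

```python
# Twelve just fifths vs. seven octaves: the Pythagorean comma
twelve_fifths = (3 / 2) ** 12
seven_octaves = 2 ** 7
print(twelve_fifths / seven_octaves)  # ≈ 1.01364, so they can never coincide
```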

PPO made RLHF work; DPO made it simple. This post derives DPO from PPO, explains why it’s a supervised alternative (not RL), where it shines, and where RL/GRPO still helps.
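The punchline fits in a few lines; here is a hedged PyTorch sketch of the standard DPO objective (argument names are illustrative; each input is the summed token log-probability of one full response):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: a logistic (purely supervised) loss on the margin between
    policy-vs-reference log-ratios of chosen and rejected responses."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Toy batch of two preference pairs
loss = dpo_loss(torch.tensor([-5.0, -4.0]), torch.tensor([-6.0, -3.5]),
                torch.tensor([-5.5, -4.2]), torch.tensor([-5.8, -3.9]))
print(loss)
```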

You know how to differentiate through a function—but how do you differentiate through a sampling step? Two estimators: score‑function (REINFORCE) and pathwise (reparameterization); pathwise backpropagates through the sampling transform with lower variance.
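A side-by-side sketch (mine, assuming a Gaussian with a learnable mean and a toy quadratic objective) shows the two estimators agreeing in expectation while the pathwise one is far less noisy:

```python
import torch

torch.manual_seed(0)
mu = torch.tensor(0.5, requires_grad=True)
f = lambda x: (x - 2.0) ** 2  # objective: E_{x ~ N(mu, 1)}[f(x)]

# Pathwise / reparameterization: x = mu + eps, backprop through the transform
eps = torch.randn(100_000)
(g_path,) = torch.autograd.grad(f(mu + eps).mean(), mu)

# Score function / REINFORCE: grad = E[f(x) * d/dmu log p(x; mu)]
x = (mu + torch.randn(100_000)).detach()  # samples, no gradient path
logp = -0.5 * (x - mu) ** 2               # log N(x; mu, 1) up to a constant
(g_score,) = torch.autograd.grad((f(x) * logp).mean(), mu)

# True gradient is 2 * (mu - 2) = -3; the pathwise estimate is much tighter
print(g_path.item(), g_score.item())
```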

Can we speed up generation without changing the distribution? A small draft model proposes, the big model accepts/rejects—yielding exact samples, faster.
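The accept/reject core fits in a dozen lines (a sketch with toy next-token distributions; p plays the target model and q the draft):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # target model's next-token distribution
q = np.array([0.3, 0.5, 0.2])  # cheap draft model's distribution

def speculative_step(p, q, rng):
    """Draft proposes x ~ q; target accepts with prob min(1, p[x]/q[x]);
    on rejection, resample from the residual max(p - q, 0), renormalized.
    The returned token is distributed exactly according to p."""
    x = rng.choice(len(q), p=q)
    if rng.uniform() < min(1.0, p[x] / q[x]):
        return x
    residual = np.maximum(p - q, 0.0)
    return rng.choice(len(p), p=residual / residual.sum())

samples = [speculative_step(p, q, rng) for _ in range(100_000)]
print(np.bincount(samples, minlength=3) / len(samples))  # ≈ [0.6, 0.3, 0.1]
```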

Why that particular sigmoid in logistic regression? This short post shows how simple moment constraints lead to exponential families (MaxEnt chooses the model) and how MLE fits them.
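The key step of that derivation, restated here as a sketch: writing the Bernoulli likelihood in exponential-family form with natural parameter eta = w^T x forces the sigmoid.

```latex
% Bernoulli as an exponential family with natural parameter \eta = w^\top x:
p(y \mid x) = \exp\!\big(\eta y - \log(1 + e^{\eta})\big), \quad y \in \{0, 1\}
% so the positive-class probability must be the sigmoid:
p(y = 1 \mid x) = \frac{e^{\eta}}{1 + e^{\eta}} = \sigma(w^\top x)
```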

A living collection of advice from mentors, friends, and books.

A quick walk-through of Expectation-Maximization (EM) algorithm and its cousins.
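For orientation before the walk-through, a minimal EM loop (my sketch) for a two-component, unit-variance 1-D Gaussian mixture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

mu = np.array([-1.0, 1.0])  # initial means
pi = np.array([0.5, 0.5])   # initial mixing weights
for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from responsibility-weighted data
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)
print(mu, pi)  # ≈ [-2, 3] and [0.5, 0.5]
```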

This is the first post of what will hopefully become a series walking through diffusion models. It introduces the foundations, focusing on two foundational papers that much of the later work builds upon.

A living list of books, essays, and videos that help me keep perspective on life.



A quick riff on Hume's is–ought gap—why facts don't dictate values, and how the leap from 'is' to 'ought' rests on sentiment.

There has been a lot of confusing information about dopamine. I finally found a literature-review-style article, and here is what I learned.

A quick reference tying each hiragana and katakana character back to its Chinese origin, plus the patterns and mnemonics I lean on while studying.

A January dash to London to finally see Jay Chou live, with museums, parks, and good meals along the way.

This is a quick note discussing a few topics related to building LLM-powered products and applications, such as how to let LLMs use tools and become autonomous agents, how to incorporate domain adaptation, and how to handle production hurdles.

In this note, we'll take a look at how Auto-GPT works and discuss LLMs' ability to do explicit reasoning and to act as autonomous agents. We'll touch on a few related works such as WebGPT, Toolformer, and LangChain.

In October 2021, we spent two weeks traveling through Italy, visiting Rome, Cinque Terre, Florence, Tuscany, and Venice. It was our first trip to Italy, and we documented the journey with a report and photos.

Trip report (itinerary and photos) from our recent trip to southern Utah (Zion, Arches, Canyonlands and Bryce).

High-level summary notes on various recent results in language modeling, with minimal explanation.

A list of starter resources for Natural Language Processing (NLP), mostly with deep learning.

A short list of interview preparation resources for Data Scientists, Machine Learning Engineers, Machine Learning Scientists, Quant Developers and Quant Researchers.

A literature survey of recent papers on Neural Variational Inference (NVI) and its application in topic modeling.

A high-level summary of various generative models, including Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and their notable extensions and generalizations, such as f-GAN, Adversarial Variational Bayes (AVB), Wasserstein GAN, Wasserstein Auto-Encoder (WAE), and Cramer GAN.