STAT 3104 Applied Bayesian Analysis Fall 2025
Undergraduate course, Columbia University, 2025
- Email: fl2744@columbia.edu
- Office Hours: Mondays 1–2 PM, Uris 321 and Fridays 3–4 PM, Uris 319.
Quick Links
Communication
- CourseWorks (Canvas): This is where you can find all the official materials, submissions, and announcements.
- Please scroll down to see the comments I write about the homework, quizzes, and project. The comments are based on frequently asked questions from office hours and my own insights.
Student Check-In Form
If you would like to share how the semester is going, what’s working or not working for you, or what kind of support might help, please use the form below. I do care and I do read the responses.
📝 Click here to fill out the Student Check-In & Support Form
Everything you share is confidential and meant to help me support you better. Whether you’re doing great, struggling a little, or just want to share feedback — I’m here to listen.
Ways to Succeed in This Course
- Keep on track: consistency brings mastery. Don't skip lectures or homework.
- Come to office hours: they're for you, whether you're confused or just curious. Ask questions early; all questions are welcome. I will do my very best to help.
- You do not need to be a mathematician to succeed: don't be afraid if the equations look heavy at first. The focus is on understanding ideas, running code, and interpreting results, not on doing pages of proofs. The intuition matters more than the algebraic details.
Homework 3
Time flies. This is Week 8.
Some tips
ANOVA stands for Analysis of Variance. Despite the name, it's really about comparing group means; the "variance" part refers to how we measure the differences between those means.

Here's an example. Suppose we have 3 groups of students: Group A studies in the morning, Group B studies at night, and Group C doesn't study at all. You measure their exam scores and ask: do these groups have different true average scores, or could the differences be just random noise?

ANOVA separates the total variability in the data into two parts:

Total variability = Between-group variability + Within-group variability.

The idea is that if the between-group variation is large compared to the within-group variation, that is evidence that the group means really are different. In the Bayesian framework, instead of testing a null hypothesis, we compute the posterior distributions of the group means and their differences, so we can say things like: "There's a 95% posterior probability that Group A's mean is higher than Group B's mean."
It's useful to know that the posterior distribution is often difficult to compute analytically. This is why we introduced Monte Carlo methods: instead of computing the posterior exactly, we draw samples from it (in this course, via Markov chain Monte Carlo) and summarize the samples.
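The variability decomposition above can be checked numerically. Here is a tiny sketch with made-up scores (not the homework data), using only base R:

```r
# Toy illustration of the ANOVA decomposition (hypothetical scores, not the homework data)
scores <- c(85, 90, 88,  78, 82, 80,  70, 65, 68)
group  <- factor(rep(c("A", "B", "C"), each = 3))

grand_mean  <- mean(scores)
group_means <- tapply(scores, group, mean)  # mean of each group

# Within-group: spread of each observation around its own group mean
ss_within  <- sum((scores - group_means[group])^2)
# Between-group: spread of the group means around the grand mean
ss_between <- sum(table(group) * (group_means - grand_mean)^2)
ss_total   <- sum((scores - grand_mean)^2)

all.equal(ss_total, ss_within + ss_between)  # TRUE: the decomposition holds
```

Large `ss_between` relative to `ss_within` is exactly the "evidence that the group means are really different" described above.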
Problem 1:
- Part a: Even if you can quite easily see how many groups there are by clicking on the csv file on CourseWorks (you would see groups like 'A', 'B', etc.), please still write code to find out the answer. The group_by() and summarise() functions (or simply unique()) can be helpful. The answer to this question is easily generated by a chatbot; just make sure you understand what's going on.
- Part b: Here is a sanity check: for group A, the sample mean should be 90.000 and the posterior mean should be 90.575. Here's some more detail. First, do

```r
anovadata$GroupIndex <- as.numeric(as.factor(anovadata$Group))
```

Then your model should be
```r
ulam(
  alist(
    Y ~ dnorm(mu, sigma),
    mu <- a[GroupIndex],
    a[GroupIndex] ~ dnorm(FILL_IN_THE_PRIOR_MEAN, FILL_IN_THE_PRIOR_SD),  # dnorm takes a standard deviation
    sigma ~ dunif(FILL_IN_THE_LOWER_BOUND, FILL_IN_THE_UPPER_BOUND)       # dunif takes lower/upper bounds
  ),
  data = anovadata, chains = 4, iter = 2000, cores = 4
)
```
- Part c: Use the `HPDI()` function.
- Part d: A repetition of parts b and c.
Problem 2:
- Part a: We are given

```r
mp <- ulam(
  alist(
    a ~ dnorm(0, 1),
    b ~ dcauchy(0, 1)
  ),
  data = list(y = 1), iter = 10000, warmup = 200, chains = 2
)
```
Then you can simply do `post <- extract.samples(mp)` and run `summary()` and `hist()` on `post$a` and `post$b`.
- Run `traceplot(mp)`.
Problem 3:
- Describe the structure of the two histograms. Do we have symmetry? Skewness?
Homework 2
Homework 2 is mostly about coding.
Before we delve into model fitting, I would like to include a quick review of Bayesian linear regression, which is the topic of this week (Week 4 FYI).
Quick Review of Bayesian Linear Regression
When we learn linear regression in the classical (frequentist) sense, we imagine a straight line trying to best fit our data points. "Best fit the data points" means that the sum of squared errors between the line and the observed values is minimized.
- This line can be computed exactly using a formula, but the formula and its derivation require linear algebra and are beyond the scope of this course.
- In classical (frequentist) regression, once we compute the line, we treat the slope and intercept as fixed numbers.
- Bayesian regression takes a different viewpoint: Instead of saying “the slope is 2.1,” we say “given the data, we believe the slope is likely around 2.1, but it could also be 2.0 or 2.2 with some probability.” So in Bayesian regression, the slope and intercept aren’t fixed. They’re random variables with distributions that represent our uncertainty about them.
- In lecture, we learned that Ptolemy’s geocentric model wasn’t mechanistically true, but it was descriptively accurate: it gave good predictions. Regression works the same way: it doesn’t explain the “true mechanism” of the world, but it’s a flexible way to approximate relationships in data.
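To make the frequentist-vs-Bayesian contrast concrete, here is a minimal sketch with made-up data (not the homework's dataset). It uses grid approximation over the slope, assuming the noise sd is known, so no extra packages are needed; the prior sd of 50 mirrors the weakly informative choice mentioned in the homework:

```r
# Hypothetical data: a line through the origin plus noise (true slope 2.1)
set.seed(1)
x <- 1:20
y <- 2.1 * x + rnorm(20, sd = 3)

# Grid approximation of the posterior for the slope b, assuming noise sd = 3 is known
slope_grid <- seq(1.5, 2.7, length.out = 501)
log_lik    <- sapply(slope_grid, function(b) sum(dnorm(y, mean = b * x, sd = 3, log = TRUE)))
log_prior  <- dnorm(slope_grid, 0, 50, log = TRUE)   # weakly informative prior
post <- exp(log_lik + log_prior - max(log_lik + log_prior))
post <- post / sum(post)                             # normalize to a probability distribution

coef(lm(y ~ 0 + x))      # frequentist answer: one fixed number for the slope
sum(slope_grid * post)   # Bayesian answer: the mean of a whole posterior distribution
```

With such a diffuse prior the posterior mean lands essentially on top of the least-squares estimate; the Bayesian gain is that `post` also quantifies the uncertainty around it.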
Problem 1:
The first problem is a guided application of Bayesian linear regression to global temperature data.
- Download the dataset from the link in the homework. The first column is the year; the second is the change in global temperature.
- Part a: Fit a Bayesian linear regression. The question asks you to use priors with a large standard deviation, i.e. weakly informative ones (50 is a reasonable choice for the standard deviation in case you are struggling to choose one).
- Part b: State the posterior mean of the slope. Is it positive or negative? On average, there has been an increase/decrease of "slope" degrees Celsius per year.
- Part c: Provide a plot of the densities for the three parameters.
- Part d: Use Monte Carlo simulation. Calculate the proportion of posterior samples for which the slope is positive. Intuitively, this probability should be equal or close to 1 just by thinking about global warming.
- Part e: Change the variance of the prior for b to something very small (like 0.001). Do any of our conclusions change?
- Part f: What statistical arguments can we make? Here are two points you can make: is the posterior mean of the slope positive or negative? Does the conclusion hold regardless of whether we use an informative or a diffuse prior?
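As a generic illustration of the "proportion of positive samples" idea, here is a sketch where a made-up vector stands in for your posterior draws of the slope (the numbers are hypothetical, not from the homework's fitted model):

```r
# Stand-in for posterior draws of the slope (hypothetical mean and sd)
set.seed(42)
b_samples <- rnorm(1e4, mean = 0.007, sd = 0.001)

# Proportion of draws that are positive: a Monte Carlo estimate of P(slope > 0 | data)
mean(b_samples > 0)  # close to 1 here, since 0 is 7 sds below the mean
```

In your homework you would use the slope samples extracted from your fitted model instead of `b_samples`.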
Homework 1
Problem 1:
- The demo the professor presented at the end of the lecture on Sep. 9th is relevant.
- Generative AI is great at producing plotting code. Make sure you really understand what the code is doing, and how the grid approximation reflects Bayesian updating.
Problem 2:
- Pandas are cute 🐼
- Be clear what events we are considering.
- For part a, recall the law of total probability.
- For part b, Bayes’ rule is your tool. Observing a twin should strengthen your belief that the species is the one that gives birth to twins with higher probability, right? :)
- For part c, I would use Bayes’ rule.
- For part d, follow the hint. As new observations come in, the previous posterior becomes your current prior.
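The "posterior becomes the prior" idea in part d can be sketched generically. The numbers below are made up (two hypotheses and one repeatable observation E), not the homework's:

```r
# Two competing hypotheses with equal prior probability (hypothetical setup)
prior <- c(H1 = 0.5, H2 = 0.5)
# Hypothetical probability of observing event E under each hypothesis
lik_E <- c(H1 = 0.2, H2 = 0.5)

update <- function(prior, lik) {
  unnorm <- prior * lik   # numerator of Bayes' rule
  unnorm / sum(unnorm)    # denominator is the law of total probability
}

post1 <- update(prior, lik_E)  # posterior after seeing E once
post2 <- update(post1, lik_E)  # yesterday's posterior is today's prior
post1  # H2 now favored: 5/7
post2  # even more so: 25/29
```

Each observation of E shifts belief toward the hypothesis that makes E more likely, which is exactly the intuition behind parts b and d.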
FAQ and Trouble Shooting Bugs
R session aborted when using RStudio
It has been experimentally shown that even if your macOS version is something like 12.5, you should still download RStudio using the [macOS 13+ (macOS 13 and higher)] option on the official site.
