STAT 3104 Applied Bayesian Analysis Fall 2025
Undergraduate course, Columbia University, 2025
- Email: fl2744@columbia.edu
- Office Hours: Mondays 1–2 PM, Uris 321 and Fridays 3–4 PM, Uris 319.
Quick Links
Communication
- CourseWorks (Canvas): This is where you can find all the official materials, submissions, and announcements.
- Please scroll down to see the comments I write about the homework, quizzes, and project. The comments are based on frequently asked questions from office hours and my own insights.
Student Check-In Form
If you would like to share how the semester is going, what’s working or not working for you, or what kind of support might help, please use the form below. I do care and I do read the responses.
📝 Click here to fill out the Student Check-In & Support Form
Everything you share is confidential and meant to help me support you better. Whether you’re doing great, struggling a little, or just want to share feedback — I’m here to listen.
Ways to Succeed in This Course
- Keep on track: consistency brings mastery. Don't skip lectures or homework.
- Come to office hours: they're for you, whether you're confused or just curious. Ask questions early; all questions are welcome. I will do my very best to help.
- You do not need to be a mathematician to succeed: don't be afraid if the equations look heavy at first. The focus is on understanding ideas, running code, and interpreting results, not on doing pages of proofs. The intuition matters more than the algebraic details.
Homework 3
Time flies. This is Week 8.
Some tips
ANOVA stands for Analysis of Variance. Despite the name, it's really about comparing group means; the "variance" part refers to how we measure the differences between those means.

Here's an example. Suppose we have 3 groups of students: Group A studies in the morning, Group B studies at night, and Group C doesn't study at all. You measure their exam scores and ask: do these groups have different true average scores, or could the differences be just random noise?

ANOVA separates the total variability in the data into two parts:

Total variability = Between-group variability + Within-group variability.

The idea is that if the between-group variation is large compared to the within-group variation, that is evidence that the group means really are different. In the Bayesian framework, instead of testing a null hypothesis, we compute the posterior distributions of the group means and their differences, so we can say things like: "There's a 95% posterior probability that Group A's mean is higher than Group B's mean."
It's useful to know that the posterior distribution is often difficult to compute analytically. This is why we introduced Monte Carlo methods: instead of computing the posterior exactly, we draw samples from it (in this course, via Markov chain Monte Carlo) and summarize the samples.
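The variability decomposition above can be checked numerically. Here is a tiny sketch with made-up scores (not the homework data), using only base R:

```r
# Toy illustration of the ANOVA decomposition (hypothetical scores, not the homework data)
scores <- c(85, 90, 88,  78, 82, 80,  70, 65, 68)
group  <- factor(rep(c("A", "B", "C"), each = 3))

grand_mean  <- mean(scores)
group_means <- tapply(scores, group, mean)  # mean of each group

# Within-group: spread of each observation around its own group mean
ss_within  <- sum((scores - group_means[group])^2)
# Between-group: spread of the group means around the grand mean
ss_between <- sum(table(group) * (group_means - grand_mean)^2)
ss_total   <- sum((scores - grand_mean)^2)

all.equal(ss_total, ss_within + ss_between)  # TRUE: the decomposition holds
```

Large `ss_between` relative to `ss_within` is exactly the "evidence that the group means are really different" described above.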
Problem 1:
- Part a: Even if you can quite easily see how many groups there are by clicking on the csv file on CourseWorks (you would see groups like 'A', 'B', etc.), please still write code to find out the answer. The group_by() and summarise() functions (or simply unique()) can be helpful. The answer to this question is easily generated by a chatbot; just make sure you understand what's going on.
- Part b: Here is a sanity check: for group A, the sample mean should be 90.000 and the posterior mean should be 90.575. Here's some more detail. First, do

```r
anovadata$GroupIndex <- as.numeric(as.factor(anovadata$Group))
```

Then your model should be
```r
ulam(
  alist(
    Y ~ dnorm(mu, sigma),
    mu <- a[GroupIndex],
    a[GroupIndex] ~ dnorm(FILL_IN_THE_PRIOR_MEAN, FILL_IN_THE_PRIOR_SD),  # dnorm takes a standard deviation
    sigma ~ dunif(FILL_IN_THE_LOWER_BOUND, FILL_IN_THE_UPPER_BOUND)       # dunif takes lower/upper bounds
  ),
  data = anovadata, chains = 4, iter = 2000, cores = 4
)
```
- Part c: Use the `HPDI()` function.
- Part d: A repetition of parts b and c.
Problem 2:
- Part a: We are given

```r
mp <- ulam(
  alist(
    a ~ dnorm(0, 1),
    b ~ dcauchy(0, 1)
  ),
  data = list(y = 1), iter = 10000, warmup = 200, chains = 2
)
```
Then you can simply do `post <- extract.samples(mp)` and run `summary()` and `hist()` on `post$a` and `post$b`.
- Run `traceplot(mp)`.
Problem 3:
- Describe the structure of the two histograms. Do we have symmetry? Skewness?
Homework 2
Homework 2 is mostly about coding.
Before we delve into model fitting, I would like to include a quick review of Bayesian linear regression, which is the topic of this week (Week 4 FYI).
Quick Review of Bayesian Linear Regression
When we learn linear regression in the classical (frequentist) sense, we imagine a straight line trying to best fit our data points. "Best fit the data points" means that the sum of squared errors between the line and the observed values is minimized.
- This line can be computed exactly using a formula, but the formula and its derivation require linear algebra and are beyond the scope of this course.
- In classical (frequentist) regression, once we compute the line, we treat the slope and intercept as fixed numbers.
- Bayesian regression takes a different viewpoint: Instead of saying “the slope is 2.1,” we say “given the data, we believe the slope is likely around 2.1, but it could also be 2.0 or 2.2 with some probability.” So in Bayesian regression, the slope and intercept aren’t fixed. They’re random variables with distributions that represent our uncertainty about them.
- In lecture, we learned that Ptolemy’s geocentric model wasn’t mechanistically true, but it was descriptively accurate: it gave good predictions. Regression works the same way: it doesn’t explain the “true mechanism” of the world, but it’s a flexible way to approximate relationships in data.
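To make the frequentist-vs-Bayesian contrast concrete, here is a minimal sketch with made-up data (not the homework's dataset). It uses grid approximation over the slope, assuming the noise sd is known, so no extra packages are needed; the prior sd of 50 mirrors the weakly informative choice mentioned in the homework:

```r
# Hypothetical data: a line through the origin plus noise (true slope 2.1)
set.seed(1)
x <- 1:20
y <- 2.1 * x + rnorm(20, sd = 3)

# Grid approximation of the posterior for the slope b, assuming noise sd = 3 is known
slope_grid <- seq(1.5, 2.7, length.out = 501)
log_lik    <- sapply(slope_grid, function(b) sum(dnorm(y, mean = b * x, sd = 3, log = TRUE)))
log_prior  <- dnorm(slope_grid, 0, 50, log = TRUE)   # weakly informative prior
post <- exp(log_lik + log_prior - max(log_lik + log_prior))
post <- post / sum(post)                             # normalize to a probability distribution

coef(lm(y ~ 0 + x))      # frequentist answer: one fixed number for the slope
sum(slope_grid * post)   # Bayesian answer: the mean of a whole posterior distribution
```

With such a diffuse prior the posterior mean lands essentially on top of the least-squares estimate; the Bayesian gain is that `post` also quantifies the uncertainty around it.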
Problem 1:
The first problem is a guided application of Bayesian linear regression to global temperature data.
- Download the dataset from the link in the homework. The first column is the year; the second is the change in global temperature.
- Part a: Fit a Bayesian linear regression. The question asks you to use priors with a large standard deviation, i.e. weakly informative ones (50 is a reasonable choice for the standard deviation in case you are struggling to choose one).
- Part b: State the posterior mean of the slope. Is it positive or negative? On average, there has been an increase/decrease of "slope" degrees Celsius per year.
- Part c: Provide a plot of the densities for the three parameters.
- Part d: Use Monte Carlo simulation. Calculate the proportion of posterior samples for which the slope is positive. Intuitively, this probability should be equal or close to 1 just by thinking about global warming.
- Part e: Change the variance of the prior for b to something very small (like 0.001). Do any of our conclusions change?
- Part f: What statistical arguments can we make? Here are two points you can make: is the posterior mean of the slope positive or negative? Does the conclusion hold regardless of whether we use an informative or a diffuse prior?
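As a generic illustration of the "proportion of positive samples" idea, here is a sketch where a made-up vector stands in for your posterior draws of the slope (the numbers are hypothetical, not from the homework's fitted model):

```r
# Stand-in for posterior draws of the slope (hypothetical mean and sd)
set.seed(42)
b_samples <- rnorm(1e4, mean = 0.007, sd = 0.001)

# Proportion of draws that are positive: a Monte Carlo estimate of P(slope > 0 | data)
mean(b_samples > 0)  # close to 1 here, since 0 is 7 sds below the mean
```

In your homework you would use the slope samples extracted from your fitted model instead of `b_samples`.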
Homework 1
Problem 1:
- The demo the professor presented at the end of the lecture on Sep. 9th is relevant.
- Generative AI is great at producing plotting code. Make sure you really understand what the code is doing, and how the grid approximation reflects Bayesian updating.
Problem 2:
- Pandas are cute 🐼
- Be clear what events we are considering.
- For part a, recall the law of total probability.
- For part b, Bayes’ rule is your tool. Observing a twin should strengthen your belief that the species is the one that gives birth to twins with higher probability, right? :)
- For part c, I would use Bayes’ rule.
- For part d, follow the hint. As new observations come in, the previous posterior becomes your current prior.
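The "posterior becomes the prior" idea in part d can be sketched generically. The numbers below are made up (two hypotheses and one repeatable observation E), not the homework's:

```r
# Two competing hypotheses with equal prior probability (hypothetical setup)
prior <- c(H1 = 0.5, H2 = 0.5)
# Hypothetical probability of observing event E under each hypothesis
lik_E <- c(H1 = 0.2, H2 = 0.5)

update <- function(prior, lik) {
  unnorm <- prior * lik   # numerator of Bayes' rule
  unnorm / sum(unnorm)    # denominator is the law of total probability
}

post1 <- update(prior, lik_E)  # posterior after seeing E once
post2 <- update(post1, lik_E)  # yesterday's posterior is today's prior
post1  # H2 now favored: 5/7
post2  # even more so: 25/29
```

Each observation of E shifts belief toward the hypothesis that makes E more likely, which is exactly the intuition behind parts b and d.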
FAQ and Trouble Shooting Bugs
R session aborted when using RStudio
It has been experimentally shown that even if your macOS version is something like 12.5, you should still download RStudio using the [macOS 13+ (macOS 13 and higher)] option on the official site.
