Week 14:
Regression Analysis and Hypothesis Testing

Agenda

Linear Regression

Hypothesis testing

Linear Regression

The objective:
We want to understand the
relationship between two variables

The simple solution:
Assume the relationship is linear
and estimate the “average” trend

Linear regression: The intuition

Imagine you’re trying to figure out a relationship between two things.

For instance, how the number of hours you study affects your score on a test.

Intuitively, you might think that the more you study, the better your score.

Linear regression helps us quantify this relationship.

Equation of a line

Quick review: The equation of a line relating \(y\) to \(x\) is \[y = mx + b \]

Which variable represents the slope?

Which variable represents the intercept?

library(ggplot2, quietly = TRUE)
# Draw a straight line with slope 2 and intercept 2 (y = 2x + 2)
data.frame(x = seq.int(0, 10, 1),
           y = seq(2, 22, length.out = 11)) |>
  ggplot(aes(x, y)) +
    geom_line() +
    scale_y_continuous(breaks = seq.int(0, 24, 2), limits = c(0, 24)) +
    scale_x_continuous(breaks = seq.int(0, 10, 2)) +
    theme_bw(base_size = 15)

Studying and Test Scores

We observe hours studied and test scores

Always plot your data

Scatter plot

We observe hours studied and test scores

Always plot your data

What is the apparent relationship?

Slope? Intercept?

\[y=mx+b \]
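As a sketch, the scatter plot can be drawn with ggplot2; this assumes the observations live in a data frame named data with columns hours_studied and test_scores, matching the estimation step below:

# Scatter plot of the raw observations (assumes a data frame `data` with
# columns hours_studied and test_scores; ggplot2 loaded earlier)
ggplot(data, aes(hours_studied, test_scores)) +
  geom_point() +
  labs(x = "Hours Studied", y = "Test Score") +
  theme_bw(base_size = 15)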

How does OLS estimate regression coefficients?
Shiny App
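In short, OLS picks the intercept and slope that minimize the sum of squared residuals. A minimal hand-rolled sketch of the one-regressor closed form (the Shiny app demonstrates the same idea interactively):

# OLS with one regressor: minimize the sum of squared residuals.
# Closed form: slope = cov(x, y) / var(x);
#              intercept = mean(y) - slope * mean(x)
ols_fit <- function(x, y) {
  slope <- cov(x, y) / var(x)
  intercept <- mean(y) - slope * mean(x)
  c(intercept = intercept, slope = slope)
}

# Example (assumes the same `data` frame as above):
# ols_fit(data$hours_studied, data$test_scores)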

Estimate Regression

model1 <- lm(test_scores ~ hours_studied, data = data)

m_out <- modelsummary(model1,
                      stars = TRUE,
                      coef_rename = c("(Intercept)" = "Intercept",
                                      "hours_studied" = "Hours Studied"),
                      gof_map = c("nobs", "r.squared"))

m_out
                (1)
Intercept       40.389***
                (2.895)
Hours Studied   0.783***
                (0.090)
Num.Obs.        133
R2              0.364
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

True model vs estimates

\[\text{scores} = 40 + 0.8 \times \text{hours} + \varepsilon; \quad \varepsilon \sim N(0, 15)\]

                (1)
Intercept       40.389***
                (2.895)
Hours Studied   0.783***
                (0.090)
Num.Obs.        133
R2              0.364
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
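A quick way to see estimation error at work is to simulate the true model and re-fit it. A sketch, assuming the 15 in \(N(0, 15)\) is the standard deviation and that study hours are roughly uniform (the actual data-generating script is not shown in these slides):

# Simulate the stated true model, then re-estimate it
set.seed(14)
n <- 133                                  # matches Num.Obs. above
hours_studied <- runif(n, 0, 50)          # assumed range of study hours
test_scores <- 40 + 0.8 * hours_studied + rnorm(n, mean = 0, sd = 15)
lm(test_scores ~ hours_studied)           # estimates land near 40 and 0.8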

How well does our model fit?

\(R^2\) is a statistic that measures the proportion of variance in the dependent variable that is predictable from the independent variables

\[ R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} \]

The value of \(R^2\) ranges from 0 to 1:

  • 0 means that the model explains none of the variability of the response data around its mean.

  • 1 means that the model explains all the variability of the response data around its mean.
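To make the formula concrete, \(R^2\) can be computed by hand from the fitted model (using model1 and data from the estimation slide):

# R^2 by hand: one minus residual variation over total variation
y <- data$test_scores
1 - sum(residuals(model1)^2) / sum((y - mean(y))^2)   # should print 0.364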

Putting the coefficients on trial

Did we observe our estimate by chance or can we trust it?

The Hypotheses:

  • Null Hypothesis (H0): posits that the regression coefficient is equal to zero, i.e., the independent variable (e.g., hours studied) has no effect on the dependent variable (e.g., test scores).

  • Alternative Hypothesis (H1): posits that the coefficient is not zero, i.e., there is an effect, and the amount of studying does influence test scores.

Testing the hypothesis

We use a t-test to determine whether the regression coefficients are significantly different from zero.

The t-statistic is a ratio: the numerator is the difference between the estimated coefficient and zero (or another hypothesized value); the denominator is the standard error of the coefficient.

\[ t = \frac{\beta - 0}{se} = \frac{0.783 - 0}{0.09} = 8.7 \]
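The same arithmetic in R, using \(df = n - 2 = 131\) for a simple regression with 133 observations:

# t-statistic and two-sided p-value by hand
t_stat <- (0.783 - 0) / 0.090                       # = 8.7
2 * pt(abs(t_stat), df = 131, lower.tail = FALSE)   # roughly 1.2e-14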

Student’s t-distribution

The shaded area is the probability of observing a value at least this extreme by chance (the p-value)

Student’s t-distribution

Locating our t-statistic on the density of the t-distribution quantifies the probability of observing our result (or one more extreme) by chance (p-val = 1.199041e-14)

95% Confidence Interval

Reverse engineer a p-value of 0.05 to find the critical test statistic (1.96 for a two-sided test in large samples)

\[ 1.96 = \frac{\beta - 0}{se} \]

Then rearrange the t-statistic calculation to get the interval bounds

\[ \text{upper} = \beta + 1.96 \times se; \quad \text{lower} = \beta - 1.96 \times se \]
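In R, the interval can be computed by hand or read off the fitted model (confint uses the exact t critical value, slightly larger than 1.96):

# 95% CI for the Hours Studied coefficient
0.783 + c(-1.96, 1.96) * 0.090   # roughly (0.607, 0.959)
confint(model1, level = 0.95)    # same idea, from the fitted model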

Inference - back to the trial

The p-val = 0.00000000000001199041 suggests the probability of observing the relationship between studying and test score by chance is small.

Formally, we reject the null hypothesis H0 (no relationship) at \(\alpha = 0.05\)

Our test does not prove that H1 is correct, but it provides strong evidence against H0

Interpreting regression coefficients

                (1)
Intercept       40.389***
                (2.895)
Hours Studied   0.783***
                (0.090)
Num.Obs.        133
R2              0.364
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

A student who does not study is expected to score 40.389 (the intercept)

A student’s score is expected to improve by about 0.78 points per additional hour of studying (the slope)

How many hours should a student study to get 80 points or better?

Predicting new values

We can use the coefficient estimates to calculate expected scores for new students based on study time

\[\text{scores} = 40 + 0.78 \times \text{hours} \]
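In R, predict() does this for us, and rearranging the fitted equation answers the question from the previous slide (the example hours are arbitrary):

# Expected scores for hypothetical new students
predict(model1, newdata = data.frame(hours_studied = c(10, 25, 50)))

# Hours needed for an expected score of 80:
# (80 - 40.389) / 0.783 is roughly 50.6 hours
(80 - coef(model1)["(Intercept)"]) / coef(model1)["hours_studied"]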

Multivariate regression

You can add additional independent variables (aka regressors)

\[ y=\alpha + \beta_1 X_1 + \beta_2 X_2 + ... +\varepsilon \]

The \(\beta\)’s are estimated conditional on the other covariates, i.e., each captures the effect of its variable holding the others constant

Inference is similar
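A sketch of what this looks like in R; hours_slept is a hypothetical second covariate, not a variable from these slides:

# Multivariate regression: one additional (hypothetical) regressor
model2 <- lm(test_scores ~ hours_studied + hours_slept, data = data)
modelsummary(model2, stars = TRUE)   # t-tests on each coefficient, as before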

Summary

Regression is a tool to estimate relationships between variables (measurements)

When its assumptions are met, regression provides the best linear unbiased estimates of the relationship

Hypothesis testing helps us judge whether our estimates could have arisen by chance