How does OLS estimate regression coefficients?
True model vs estimates
\[scores=40 + 0.8 * hours + \varepsilon ;~~ \varepsilon \sim N(0,15)\]
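To make the comparison between the true model and the estimates concrete, here is a minimal sketch that simulates data from the model above and recovers the coefficients with the closed-form OLS formulas for one regressor. The function names, the seed, and the choice to draw hours uniformly between 0 and 50 are assumptions for illustration, as is reading the 15 in \(N(0,15)\) as a standard deviation.

```python
import random


def simulate_scores(n, intercept=40.0, slope=0.8, noise_sd=15.0, seed=1):
    """Draw n (hours, score) pairs from scores = intercept + slope * hours + noise.

    Assumption: hours are uniform on [0, 50] and noise_sd is a standard deviation.
    """
    rng = random.Random(seed)
    hours = [rng.uniform(0, 50) for _ in range(n)]
    scores = [intercept + slope * h + rng.gauss(0, noise_sd) for h in hours]
    return hours, scores


def ols_fit(x, y):
    """Closed-form OLS for a single regressor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by variance of x
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope


hours, scores = simulate_scores(133)
intercept_hat, slope_hat = ols_fit(hours, scores)
print(f"true model: 40 + 0.8 * hours")
print(f"estimate:   {intercept_hat:.3f} + {slope_hat:.3f} * hours")
```

The estimates differ from 40 and 0.8 only because of the random noise term; rerunning with another seed gives a slightly different line.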
| Term | Estimate (std. error) |
|---|---|
| Intercept | 40.389*** (2.895) |
| Hours Studied | 0.783*** (0.090) |
| Num.Obs. | 133 |
| R2 | 0.364 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Putting the coefficients on trial
Did we observe our estimate by chance or can we trust it?
The Hypotheses:
Null Hypothesis (H0): the regression coefficient is equal to zero (\(\beta = 0\)), i.e. the independent variable (e.g., hours studied) has no effect on the dependent variable (e.g., test scores).
Alternative Hypothesis (H1): the coefficient is not zero (\(\beta \neq 0\)), i.e. there is an effect, and the amount of studying does influence test scores.
Testing the hypothesis
We use a t-test to determine whether the regression coefficients are significantly different from zero.
The t-statistic is a ratio. The numerator is the difference between the estimated coefficient \(\hat{\beta}\) and its value under the null hypothesis (here zero). The denominator is the standard error of the coefficient.
\[ t = \frac{\hat{\beta} - 0}{se} = \frac{0.783 - 0}{0.090} = 8.7 \]
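The ratio can be sketched in code as follows. The function computes the slope, its standard error from the residuals, and the resulting t-statistic for a one-regressor model. The function name and the five-point dataset are hypothetical; the last lines simply redo the arithmetic with the values from the regression table.

```python
import math


def slope_t_statistic(x, y):
    """Return (slope, standard error, t-statistic) for a one-regressor OLS fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, divided by n - 2 degrees of freedom
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    residual_variance = rss / (n - 2)
    std_error = math.sqrt(residual_variance / sxx)
    # Null hypothesis value is zero, so t = (slope - 0) / se
    return slope, std_error, slope / std_error


# Hypothetical toy data, only to exercise the function
slope, se, t = slope_t_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"toy data: slope={slope:.3f}, se={se:.3f}, t={t:.3f}")

# Same ratio using the estimate and standard error reported in the table
t_from_table = (0.783 - 0) / 0.090
print(f"table values: t={t_from_table:.1f}")  # prints "table values: t=8.7"
```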
Student’s t-distribution
The shaded area is the p-value: the probability of observing a t-statistic at least as extreme as ours if the null hypothesis were true
Locating our t-statistic on the density of the t-distribution quantifies that probability (p-value = 1.199041e-14)
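A rough sketch of how such a p-value can be computed without a statistics library: integrate the tail of the t-density numerically. This version uses Simpson's rule after the substitution \(t = \sqrt{df}\tan\theta\), which turns the infinite tail into a finite interval. The function name is hypothetical, and the degrees of freedom (133 observations minus 2 estimated coefficients = 131) are an assumption, since the slides do not state them.

```python
import math


def t_two_sided_p_value(t_stat, df, steps=20000):
    """Two-sided p-value for Student's t via Simpson's rule (steps must be even).

    With t = sqrt(df) * tan(theta), the upper tail becomes
    C * integral of cos(theta)^(df - 1) from theta0 to pi/2,
    where C = Gamma((df + 1) / 2) / (sqrt(pi) * Gamma(df / 2)).
    """
    theta0 = math.atan(abs(t_stat) / math.sqrt(df))
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(math.pi))
    width = (math.pi / 2 - theta0) / steps
    total = 0.0
    for i in range(steps + 1):
        theta = theta0 + i * width
        # Simpson weights: 1 at the ends, then alternating 4 and 2
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        # max() guards against a tiny negative cosine from rounding at pi/2
        total += weight * max(0.0, math.cos(theta)) ** (df - 1)
    upper_tail = math.exp(log_norm) * total * width / 3
    return min(1.0, 2 * upper_tail)


# Assumed degrees of freedom: 133 observations - 2 coefficients
p_value = t_two_sided_p_value(8.7, df=131)
print(f"p-value = {p_value:.3e}")
```

In practice a statistics package would supply this probability directly; the sketch only shows that the p-value is the area in both tails beyond the observed t-statistic.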
95% Confidence Interval
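Reverse engineer the test: a two-sided p-value of 0.05 corresponds to a critical test statistic of about 1.96 (the large-sample value; the exact t critical value for our sample is slightly larger)
\[ \pm 1.96 = \frac{\hat{\beta} - \beta}{se} \]
Then rearrange for \(\beta\) to find the range of values we would not reject, where \(\hat{\beta}\) is our estimate
\[ upper = \hat{\beta} + 1.96*se ; ~~~ lower = \hat{\beta} - 1.96*se\]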
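As a worked example, the interval for the Hours Studied coefficient can be computed from the estimate and standard error in the regression table. The function name is hypothetical, and 1.96 is the large-sample critical value used on this slide.

```python
def confidence_interval(estimate, std_error, critical_value=1.96):
    """Return (lower, upper) bounds: estimate -/+ critical_value * std_error."""
    margin = critical_value * std_error
    return estimate - margin, estimate + margin


# Estimate and standard error for Hours Studied from the regression table
lower, upper = confidence_interval(0.783, 0.090)
print(f"95% CI for the slope: [{lower:.3f}, {upper:.3f}]")
# prints "95% CI for the slope: [0.607, 0.959]"
```

The interval excludes zero, which agrees with rejecting the null hypothesis at the 5% level.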
Inference - back to the trial
The p-value = 0.00000000000001199041 means that, if studying truly had no effect on test scores, a relationship at least this strong would almost never arise by chance.
Formally, we reject the null hypothesis H0 (no relationship) at the alpha = 0.05 significance level
Our test does not prove that H1 is correct, but it provides strong evidence
Interpreting regression coefficients
| Term | Estimate (std. error) |
|---|---|
| Intercept | 40.389*** (2.895) |
| Hours Studied | 0.783*** (0.090) |
| Num.Obs. | 133 |
| R2 | 0.364 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
A student who does not study is expected to get a score of 40.389
A student’s score is expected to increase by 0.78 points, on average, for each additional hour of studying
How many hours should a student study to get 80 points or better?
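One way to approach the question is to invert the fitted line and solve for hours. The sketch below uses the estimates from the table; the function name is hypothetical, and the answer is about the expected score, not a guarantee for any individual student.

```python
def hours_needed(target_score, intercept=40.389, slope=0.783):
    """Hours of study at which the expected score reaches target_score."""
    if slope <= 0:
        raise ValueError("slope must be positive to solve for hours")
    # Invert score = intercept + slope * hours; never return negative hours
    return max(0.0, (target_score - intercept) / slope)


print(f"{hours_needed(80):.1f} hours")  # prints "50.6 hours"
```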
Predicting new values
We can use the coefficient estimates to calculate expected scores from new students based on study time
\[scores=40 + 0.78 * hours \]
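A minimal sketch of the prediction step, using the unrounded estimates from the regression table. The function name and the example study times are hypothetical.

```python
def predict_score(hours, intercept=40.389, slope=0.783):
    """Expected test score for a student who studies the given number of hours."""
    return intercept + slope * hours


# Hypothetical new students with different study times
for hours in (0, 10, 25):
    print(f"{hours:>2} hours -> expected score {predict_score(hours):.1f}")
```

Predictions are most reliable for study times within the range seen in the data used to fit the model.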
Multivariate regression
You can add additional independent variables (aka regressors)
\[ y=\alpha + \beta_1 X_1 + \beta_2 X_2 + ... +\varepsilon \]
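Each \(\beta\) is conditional on the other covariates: \(\beta_1\) is the expected change in \(y\) for a one-unit change in \(X_1\), holding \(X_2\) and the other regressors constant
Inference is similar: each coefficient gets its own standard error, t-test, and confidence interval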
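With more than one regressor, the OLS estimates can be obtained by solving the normal equations \(X^\top X \, b = X^\top y\). The sketch below does this with plain Gaussian elimination so that it needs no external library. The function names and the toy data (two made-up regressors, built so that y = 1 + 2*x1 + 3*x2 exactly) are assumptions for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def ols_multiple(regressors, y):
    """OLS with an intercept; regressors is a list of rows [x1, x2, ...].

    Returns [alpha, beta_1, beta_2, ...].
    """
    design = [[1.0] + list(row) for row in regressors]  # prepend intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated as y = 1 + 2*x1 + 3*x2
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
alpha, beta_1, beta_2 = ols_multiple(rows, y)
print(f"alpha={alpha:.3f}, beta_1={beta_1:.3f}, beta_2={beta_2:.3f}")
```

Because the toy data contain no noise, the fit recovers the generating coefficients; with real data each estimate comes with its own standard error.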
Summary
Regression is a tool to estimate relationships between variables (measurements)
When its assumptions are met, OLS provides the best linear unbiased estimates of the relationship
Hypothesis testing helps us understand the quality of our estimates