Today in class, we discussed multiple linear regression and incorporating categorical variables into our linear models.

Multiple Linear Regression

For multiple linear regression, we extend our conversation from simple linear regression, where one predictor variable explains one response variable, to more than one predictor variable explaining a single response variable.

Our new formula now looks like this:

\[ Y_i= {\beta_0} + {\beta_1} x_1 + {\beta_2} x_2 + \cdots + {\beta_k} x_k + \epsilon_i \]

Where B0 is still our intercept: the value of Y when all k x’s are 0. Recall this interpretation is still only practical when xi = 0 for all i’s makes sense for our dataset.

Each Bi represents the change in Y when xi increases by 1, AND all other x’s are held fixed. This is a slightly different interpretation of our Beta values than in SLR, but the intuition is still the same.
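To make this concrete, here is a quick R sketch (using simulated data I made up, not a class dataset) where lm() fits a model with two predictors and the coefficients estimate B0, B1, and B2:

# Simulated data: y depends on two predictors
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(100)
fit <- lm(y ~ x1 + x2)
coef(fit)   # estimates of Beta_0, Beta_1, Beta_2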

Our point estimate for Y is similar to SLR as well:

\[ \hat Y_i= \hat{\beta_0} + \hat{\beta_1} x_1 + \hat{\beta_2} x_2 + \cdots + \hat{\beta_k} x_k \]

Since our different x’s vary with each sample, we are truly interested in the set of Betas that on average minimize our SSE (sum of squared residuals) across all samples. If we dive back into our Linear Algebra days, we can solve this problem using Matrix Algebra. (Note: the following pseudo-derivation is not necessary, but rather me just nerding out over math derivations that I actually understand :) )

So, our Multiple LR equation has three components:

1. x predictor variables, say k of them
2. Beta(hat) coefficients, k + 1 of them
3. Y(hat) for our response variable

If we take n iid samples, we can stack these components into vectors and matrices.

X becomes an (n, k+1) matrix with a 1 in the first column and the observed x’s for a given sample in each row, one row per sample.

y(bar) becomes an (n, 1) vector of the n response values for the n samples.

b(bar) becomes a (k+1, 1) vector of the Beta coefficients.

This gives us:

\[ \mathbf{X} \bar{b} = \bar{y} \]

And via the matrix algebra found in Appendix B.8, we can see that the b(bar) that minimizes the SSE is:

\[ \bar{b} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \bar{y} \]
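To convince myself this formula matches what lm() does, here is a quick sketch with made-up data, building the X matrix by hand:

# Made-up data just to check the matrix formula against lm()
set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(50)
X <- cbind(1, x1, x2)                      # (n, k+1) matrix with 1's in the first column
b <- solve(t(X) %*% X) %*% (t(X) %*% y)    # b = (X^T X)^{-1} X^T y
cbind(matrix_algebra = b, lm = coef(lm(y ~ x1 + x2)))   # the two columns match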

Mean Square Error / Standard Error

Similar to SLR, MSE and SE are the point estimates for the error variance and standard deviation, respectively. This time, since we estimate k+1 coefficients, we put n-(k+1) in the denominator to create an unbiased estimator.

\[ MSE = s^2 = SSE/(n-(k+1)) \] \[ SE = s = \sqrt{SSE/(n-(k+1))}\]

And our intuition here tells us that with a smaller MSE, the variation of our error is less, so our predictions of y will be more accurate.
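As a sanity check, here is a quick sketch (made-up data again) computing MSE and SE by hand and comparing SE to the residual standard error that summary() reports:

# MSE and SE by hand vs. the residual standard error from summary()
set.seed(2)
x1 <- rnorm(60); x2 <- rnorm(60)
y <- 4 + x1 + 2 * x2 + rnorm(60, sd = 1.5)
fit <- lm(y ~ x1 + x2)
n <- length(y); k <- 2
SSE <- sum(resid(fit)^2)
MSE <- SSE / (n - (k + 1))     # s^2
SE  <- sqrt(MSE)               # s
c(by_hand = SE, from_summary = summary(fit)$sigma)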

R-Squared and Adjusted R-Squared

For Multiple LR we introduce the concept of Adjusted R-Squared.

Recall that R-Squared is the coefficient of determination, which tells us what % of the variation in our response variable can be explained by our predictor variables.

Yet, when there are more predictor variables, R-Squared will naturally increase. So Adjusted R-Squared takes into account the number of predictor variables as a penalty against R-Squared:

\[ \bar R^2 = \left(R^2 - \frac{k}{n-1}\right)\left(\frac{n-1}{n-(k+1)}\right) \]

The first term reduces R-Squared as we add k more predictors, and the second term rescales so that Adjusted R-Squared still equals 1 when R-Squared truly = 1.
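Here is a small sketch (simulated data, with one predictor that is pure noise) checking this formula against the Adjusted R-Squared that summary() reports:

# Adjusted R-Squared from the formula vs. summary()
set.seed(3)
x1 <- rnorm(40); x2 <- rnorm(40); x3 <- rnorm(40)   # x3 is pure noise
y <- 1 + 2 * x1 - x2 + rnorm(40)
fit <- lm(y ~ x1 + x2 + x3)
n <- length(y); k <- 3
R2 <- summary(fit)$r.squared
adjR2 <- (R2 - k / (n - 1)) * ((n - 1) / (n - (k + 1)))
c(by_hand = adjR2, from_summary = summary(fit)$adj.r.squared)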

F Test for Multiple LR

Testing the significance of the regression relationship between y and ANY of the predictors

Null: None of our k predictor variables are significant. Alt: At least 1 predictor variable is significant.

\[ F\text{-stat} = \frac{\text{Explained Variation} / k}{\text{Unexplained Variation} / (n-(k+1))} \sim F_{k,\, n-(k+1)} \]

Where if the F-stat is large, it leads us to believe at least 1 predictor is significant.
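In R the overall F-stat shows up at the bottom of summary(lm(...)); here is a quick sketch (made-up data with k = 3 predictors, only one of which matters) pulling it out and recovering the P-Value with pf():

# Overall F test: grab the F-stat from summary() and recompute its p-value
set.seed(4)
x1 <- rnorm(50); x2 <- rnorm(50); x3 <- rnorm(50)
y <- 2 + x1 + rnorm(50)
fit <- lm(y ~ x1 + x2 + x3)
fstat <- summary(fit)$fstatistic    # value, numdf = k, dendf = n - (k + 1)
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)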

For our Degrees of Freedom (r1, r2):

-As r1 = k increases, the F-distribution stretches to the right: P-Value increases

-As r2 = (n-(k+1)) increases, the F-distribution compresses: P-Value increases

# Density of the F-distribution with df1 = 30 and df2 = 69
x = seq(0, 3, length = 1000)
plot(x, df(x = x, df1 = 30, df2 = 69), type = "l", xlab = "F", ylab = "density")

Indicator Functions

Finally, we discussed using indicator functions in linear models. Indicator functions are used for categorical variables that can be described as in or out of a category. We capture this binary notion with dummy variables by giving an x variable the value 1 if the sample is in the category and 0 if it is out.

This allows the corresponding Bi to capture how being in the category shifts our prediction of Y.

I found it interesting that if you have a categorical variable with multiple possible levels, you can create each level as its own indicator function. This even allows you to use one fewer indicator by treating one level as “All Other.”
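Here is a quick sketch of that idea with made-up data: R’s factor handling builds the 0/1 dummy columns automatically and drops one level to serve as the baseline (“All Other”):

# A 3-level categorical predictor: R creates 2 indicator columns,
# treating the remaining level ("other") as the baseline
set.seed(5)
group <- factor(sample(c("other", "treatA", "treatB"), 90, replace = TRUE))
y <- 5 + 2 * (group == "treatA") - 1 * (group == "treatB") + rnorm(90)
fit <- lm(y ~ group)
coef(fit)                  # intercept = baseline level; other coefficients are shifts
model.matrix(fit)[1:4, ]   # first few rows of the 0/1 indicator columns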