In class today, we learned about polynomial regression and multicollinearity. Polynomial regression is similar to the linear regression we learned in other chapters, but with additional terms that allow us to model relationships that are nonlinear yet still follow a pattern (examples include quadratic, cubic, and quartic). Multicollinearity is when two or more predictors in your regression equation are correlated with each other. This is something a typical statistician would want to avoid because it inflates the standard errors of the coefficients.

Women data:

Linear model:

mod <- lm(weight ~ height, data = women)
summary(mod)
## 
## Call:
## lm(formula = weight ~ height, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
resid <- resid(mod)
plot(resid ~ mod$fitted.values)  # residuals vs. fitted values for the linear model
abline(0, 0)                     # reference line at zero
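As a quick visual check (my own sketch, not part of the class code), plotting the raw data with the fitted line shows the points bending around the straight line, which is what the curved residual plot is picking up:

# Sketch: scatterplot of the women data with the linear fit overlaid
plot(weight ~ height, data = women)
abline(mod, col = "red")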

Quadratic model:

mod2 <- lm(weight ~ height + I(height^2), data = women)
summary(mod2)
## 
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16
resid2 <- resid(mod2)
plot(resid2 ~ mod2$fitted.values)  # residuals vs. fitted values for the quadratic model
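As a similar sketch (again my own addition), the quadratic fit can be drawn over the raw data to see how it follows the curvature that the straight line missed:

# Sketch: raw data with the quadratic fit overlaid
plot(weight ~ height, data = women)
lines(women$height, fitted(mod2), col = "blue")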

Compare them:

mod <- lm(weight ~ height, data = women)
mod2 <- lm(weight ~ height + I(height^2), data = women)
anova(mod, mod2)
## Analysis of Variance Table
## 
## Model 1: weight ~ height
## Model 2: weight ~ height + I(height^2)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     13 30.2333                                  
## 2     12  1.7701  1    28.463 192.96 9.322e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F = 192.96, p-value = 9.322e-09. Given this information, there is a significant difference between the linear and quadratic models, which shows that we need the polynomial term to represent these data accurately.
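As a supplement (not something we ran in class), AIC() gives another quick way to compare the two fits; the model with the lower AIC is preferred, and here that should be the quadratic model:

# Sketch: compare the linear and quadratic models by AIC (lower is better)
AIC(mod, mod2)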

Since we did not go into depth on the coding aspect of multicollinearity, I think it would be best to come up with some examples of the topic. Variables like region and wealth are reasonably correlated (maybe not above .9, but in certain situations they might be). Another example is red wine consumption and white wine consumption among alcoholics (these are highly correlated because people who drink wine typically drink both).
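As a rough sketch of what that code might look like (my own assumed example, not something from class), the variance inflation factor (VIF) is one common diagnostic: vif() from the car package reports how much each coefficient's variance is inflated by its correlation with the other predictors. The small simulation at the end uses made-up data to show the standard-error inflation directly:

# Sketch: VIF for the quadratic model above (requires the car package).
# height and height^2 are almost perfectly correlated, so both VIFs are huge.
library(car)
vif(mod2)

# Sketch with made-up data: adding a near-duplicate predictor inflates
# the standard errors even though the fit barely changes.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)       # x2 is nearly the same variable as x1
y  <- 2 * x1 + rnorm(100)
summary(lm(y ~ x1))$coefficients       # standard error of x1 alone
summary(lm(y ~ x1 + x2))$coefficients  # much larger standard errors for both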