In class today we covered sections 4.7 and 5.1.
In general, the idea of this topic is that if we fit a simple linear regression and it turns out not to be a good model, we can expand to a polynomial regression.
First, you start with a linear model (lm(y ~ x)). Next, plot the residuals against the fitted values. If that plot shows a clear trend, simple linear regression is not the best tool for you!
data(women)
attach(women)
womenmod <- lm(weight ~ height)
plot(womenmod$residuals ~ womenmod$fitted.values)
In this graph, you can see a clear parabola-like shape. This is a great indicator that we should use quadratic regression instead of linear. If the residuals instead looked random, that would be evidence that linear regression is sufficient.
So, if we create a quadratic regression model and plot its residuals against the fitted values, we will see that it is a much better fit.
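For reference, the quadratic model being fit below is \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon\), where \(y\) is weight and \(x\) is height.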
womenquadmod <- lm(weight ~ height + I(height^2))
plot(womenquadmod$residuals ~ womenquadmod$fitted.values)
You can see these residuals are much better and more random! This is a sign that quadratic regression is a better predictor than linear regression. To get more technical, you can look at the p-value for the t test on \(\beta_2\).
summary(womenquadmod)
##
## Call:
## lm(formula = weight ~ height + I(height^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
If the p-value for \(\beta_2\) is significant, and in this scenario it is, we can say that \(\beta_2\) is not equal to zero and the quadratic term is therefore useful. (Formally, the t test here checks \(H_0: \beta_2 = 0\) against \(H_a: \beta_2 \neq 0\).)
Finally, you can expand this topic to polynomial regression of any degree, as sketched below. This will be extremely useful in real world data analysis.
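For example, here is a minimal sketch of a cubic fit on the same data (the model names womencubmod and womencubmod2 are just illustrative choices):
womencubmod <- lm(weight ~ height + I(height^2) + I(height^3))
summary(womencubmod)
# poly() is an equivalent shortcut that builds the polynomial terms for you
# (orthogonal by default, so the coefficients differ but the fitted values match)
womencubmod2 <- lm(weight ~ poly(height, 3))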
In the second part of class we went over multicollinearity. In short, you do NOT want multicollinearity. The process to check for it is straightforward (see the sketch after this list):
1. Plot the data.
2. Visually inspect for correlation between the predictors.
3. If they look correlated, calculate the correlation coefficient. Multicollinearity is considered severe if |cor| > .9.
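Here is a minimal sketch of those three steps in R, using the built-in mtcars data purely as an example (the choice of predictor columns is illustrative):
# Steps 1 and 2: plot every pair of predictors and look for linear patterns
pairs(mtcars[, c("disp", "hp", "wt")])
# Step 3: compute the pairwise correlation coefficients
cor(mtcars[, c("disp", "hp", "wt")])
# Multicollinearity is considered severe if any |cor| exceeds .9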
Overall, this class period was super helpful for real world data analysis going forward.