Today we finished Ch. 4 and started to look a little at Ch. 5.

Polynomial Regression:

We first looked at polynomial regression, focusing on quadratic regression. A good way to tell whether quadratic regression might be necessary is to look at the residual plot. GOOD: there is no trend. BAD: there is some trend in the residual plot.

Let’s do this with the “women” data. First we will fit an ordinary linear regression of women’s weight on height and take a look at the residual plot.

data("women")
attach(women)
woman<-lm(weight~height)
library(car)
residualPlot(woman)

We can see a smiley-face type trend in the residual plot, so we will want to use polynomial regression to try to improve this.

Now, let’s fit a quadratic model of the form weight = b0 + b1*height + b2*height^2. We will create a new variable {xsq} to store our squared term for the purpose of using it in the model.

xsq<-height^2 #squared height term
womanq<-lm(weight~height+xsq) #quadratic model: weight on height and height^2
summary(womanq)
## 
## Call:
## lm(formula = weight ~ height + xsq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50941 -0.29611 -0.00941  0.28615  0.59706 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
## height       -7.34832    0.77769  -9.449 6.58e-07 ***
## xsq           0.08306    0.00598  13.891 9.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16
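
As a side note (just a sketch, not something from class): instead of creating {xsq} by hand, R’s {I()} function lets us put the squared term directly into the formula, and the fit should be identical. The name {womanq2} is made up for this example.

womanq2<-lm(weight~height+I(height^2)) #I() protects ^2 so it means squaring, not a formula operator
summary(womanq2) #coefficients should match womanq above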

Looking at the p-value for our squared term (9.32e-09), we see it is < .05, so we reject the null hypothesis that the coefficient on the squared term is zero. We can say that there is a significant quadratic relationship between height and weight.

Let’s take a look at our residual plot to see if it is improved with our new quadratic model!

residualPlot(womanq)

Though there is somewhat of a trend still, it does look improved.

Let’s plot this on our scatterplot.

x<-seq(from=58, to=72, by=.1) #creates a sequence over our range of heights (x's), increasing by .1
y<-coef(womanq)[1]+coef(womanq)[2]*x+coef(womanq)[3]*x^2 #creates predictions
plot(weight~height)
lines(x,y, lty=8,col=4) #plots our quadratic
abline(woman) #plots line from original simple linear regression.
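
Equivalently (just a sketch; {y2} is a made-up name), we could have gotten the same predictions with {predict()} instead of typing out the coefficients:

y2<-predict(womanq, newdata=data.frame(height=x, xsq=x^2)) #same values as y above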

We can see that the quadratic regression gives us a slightly better estimate... though in practical terms, it doesn’t make that much of a difference. It is important to think about statistical significance vs. practical significance when approaching real-world problems. We do not want to make our models messier than they have to be.

Polynomial regression can be helpful because, depending on the relationship in our data, it can often be more accurate than simple linear regression.

A little bit from Ch. 5

We started talking about section 5.1, which is about multicollinearity. Multicollinearity is when predictors are correlated with each other. We do not want this. One way to check for it is the {cor} command, which looks at pairwise correlations.
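
For example (a small sketch using the built-in {mtcars} data, not something we did in class), we can look at pairwise correlations among a few candidate predictors:

data("mtcars")
cor(mtcars[,c("disp","hp","wt")]) #pairwise correlation matrix; values near 1 or -1 are a red flag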

If we don’t want to rely on pairwise correlations alone, we can fit a linear model and look at the VIF, the variance inflation factor. For predictor j, VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on all of the other predictors.
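
Continuing the {mtcars} sketch, the {vif} function from the car package (loaded earlier) computes this for each predictor. A common rule of thumb (cutoffs vary) is that a VIF above 5 or 10 is a warning sign; {mfit} is just a made-up name here.

mfit<-lm(mpg~disp+hp+wt, data=mtcars) #model with correlated predictors
vif(mfit) #one VIF per predictor; large values mean that predictor is highly correlated with the others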

If there is too much correlation between predictors, the standard errors will be inflated. This could lead to CIs that are extra wide, unstable coefficient estimates, and difficulty in detecting significance.

Looking further into multicollinearity will help us determine whether our models are reliable.