Today in class we continued talking about Ch. 5. We focused on diagnostic plots, transformations, and outliers.
Diagnostic plots are what result when we use the plot() command on our model. Four plots are produced.
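As a minimal sketch of how to call them up (using R's built-in cars data as a stand-in model, not our class data): plot() on an lm object takes a which argument, and the default shows four of the diagnostics.

fit<-lm(dist~speed, data = cars) #stand-in model just for illustration
plot(fit, which = 1) #Residuals vs Fitted
plot(fit, which = 2) #Normal Q-Q
plot(fit, which = 3) #Scale-Location
plot(fit, which = 5) #Residuals vs Leverage (with Cook's distance contours)
plot(fit) #all four defaults at once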
This plot is the kind we made before by hand: it compares our residuals and fitted values. Ideally, we want to see no trend and equal spread in this plot. Things to look out for include curvature, heteroscedasticity, and outliers. The curvature is the most important thing to pay attention to: it indicates our “mean function” could be missing something. To fix this, we can try polynomial regression and/or an interaction term (a quick sketch follows). We could also try a transformation, which we will talk about later. There isn’t really a need to look at the other plots if the mean function is wrong; we need to fix this first.
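As a hedged sketch of what those fixes look like in R (y, x1, and x2 here are hypothetical variable names, not from our data):

mpoly<-lm(y~x1+I(x1^2)) #polynomial term: wrap the power in I() so ^ is arithmetic
minter<-lm(y~x1*x2) #interaction: x1*x2 expands to x1 + x2 + x1:x2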
The next plot is a QQ (normal quantile) plot, which we are also familiar with. It checks the assumption that the errors follow a normal distribution. This QQ plot uses studentized residuals (each residual divided by an estimate of its standard error, which accounts for leverage), so it may look a little bit different from the one we made by hand in the past, but apparently this is nothing to worry about! Remember from before: the closer the data points follow the line, the closer they are to normal.
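If we wanted to build this version by hand, a quick sketch (reusing the stand-in fit from the sketch above; rstudent() returns studentized residuals):

sres<-rstudent(fit) #studentized residuals
qqnorm(sres) #QQ plot against the normal distribution
qqline(sres) #reference line; points close to it look normal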
This one is the Scale-Location plot. It is similar to the first plot except it folds up the bottom half so all points are > 0: it plots the square root of the absolute standardized residuals against the fitted values. This is another way we can look for a trend in our residuals; we want to see no trend. The square-root scale also reduces the skewness of the plotted residuals.
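A sketch of making this plot by hand (again with the stand-in fit), which shows exactly what gets folded:

plot(fitted(fit), sqrt(abs(rstandard(fit)))) #sqrt of |standardized residuals| vs fitted
plot(fit, which = 3) #same plot straight from plot()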
Cook’s Distance measures the influence each data point has on the regression coefficients, Bhat. On the Residuals vs Leverage plot, we want all of our data points to fall within the red funnel-looking lines (the Cook’s distance contours). If a point falls outside them, that data point has too much influence on Bhat.
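We can also pull the distances out numerically; a minimal sketch (the 4/n cutoff here is just one common rule of thumb, not something from class):

d<-cooks.distance(fit) #one distance per observation
which(d > 4/length(d)) #flag possibly influential points (rule of thumb)
plot(fit, which = 5) #Residuals vs Leverage with the red contour lines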
library(alr3)
## Loading required package: car
data(brains)
attach(brains)
modbrain<-lm(BrainWt~BodyWt, data=brains)
plot(modbrain)
The first plot is so bad that we wouldn’t normally even look at the others, but we will just for the sake of learning. The most important part is having the right mean function, and we do not have a good one here!
Let’s try to improve now using transformations! There are many different types of transformations, but we only talked about a few, including: square root, log, and inverse.
modb2<-lm(sqrt(BrainWt)~sqrt(BodyWt))
plot(modb2)
modb3<-lm(log(BrainWt)~BodyWt)
plot(modb3)
#can create new variables
logbr<-log(BrainWt)
logbod<-log(BodyWt)
modb4<-lm(logbr~logbod, data = brains)
par(mfrow=c(2,2)) #cool way to see all the plots at one time!
plot(modb4) #best one!!
plot(logbr~logbod) # this one looks good!
#try inverse
invbr<-1/BrainWt
modb5<-lm(invbr~BodyWt)
par(mfrow=c(2,2))
plot(modb5)
#I think modb4 is still better.
modb4 (lm(logbr~logbod, data = brains)) looks like the best! It makes the mean function and the other plots look much better.
Some more data to try!
data("stopping")
attach(stopping)
modspeed<-lm(Distance~Speed, data = stopping)
plot(modspeed)
Not a bad starting place in comparison to the brains data, but let’s still try to see if we can make it better.
modsp1<-lm(sqrt(Distance)~Speed) #best
modsp2<-lm(sqrt(Distance)~sqrt(Speed)) #worse
modsp3<-lm(log(Distance)~Speed) #worse
modsp4<-lm(log(Distance)~log(Speed)) #worse
plot(modsp1)
with(stopping, plot(log(Distance), Speed, col = "red"))
#pretty good. This is the method to use if you are working with multiple data sets and don't want to attach them all.
Again, we found the best model by trying out a few of the different transformations. It is a good idea to plot the data first, before fitting the model; this will give you an idea of what kind of transformations might work best.
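For example, with the brains data from above, plotting first is what points us toward the log transformation:

plot(BrainWt~BodyWt, data = brains) #badly skewed cloud of points
plot(log(BrainWt)~log(BodyWt), data = brains) #roughly linear, so logs look promising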
We also talked some about outliers. Outliers can be hard to deal with because there is no black-and-white process to follow; how you handle them is very dependent on the situation and context of the problem.
Outliers in the x direction: these are not a big deal and can be left alone. It is important to remember that not all outliers will cause problems.
Outliers in the y direction: these are bad and will potentially throw off our regression line, so we need to pay close attention to them.
studentized residual = residual / SE(residual). If the absolute value of the studentized residual is greater than 2, then we need to pay EXTRA attention to this point.
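A minimal sketch of flagging those points in R, using the modspeed model from above:

tres<-rstudent(modspeed) #studentized residuals
which(abs(tres) > 2) #points that need EXTRA attention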
As I mentioned before, dealing with outliers is very much based on the context of your data. They cannot just be deleted.
Steps:
Make sure you have a good mean function before anything else!
Today’s lesson will help us better check our models and make them as accurate as we can. We now know how to read all of the plots the plot command produces when we pass in our model as an argument. We can also now use transformations to make these plots, and in turn our model, better. It will be interesting to compare how using polynomial regression or an interaction term affects all of the plots, rather than just the residual vs. fitted value one we looked at previously. It is also nice to have a little more knowledge about outliers, even though handling them varies depending on the problem. Something I did not typically think about before was that not all outliers are bad (like those in the x direction). It is always good to be aware of outliers and what they may mean in our data set.