To begin, we need to get our data set up. It is good to print a summary of your data so you can see the names of the variables and the range of each one!
data(women)
attach(women)
summary(women)
##      height         weight
##  Min.   :58.0   Min.   :115.0
##  1st Qu.:61.5   1st Qu.:124.5
##  Median :65.0   Median :135.0
##  Mean   :65.0   Mean   :136.7
##  3rd Qu.:68.5   3rd Qu.:148.0
##  Max.   :72.0   Max.   :164.0
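If you only want the variable names and the size of the data set, a quick check like the one below may be enough. This is a minimal sketch; the names `var_names` and `n_rows` are mine, and it reads the columns from the data frame directly instead of relying on attach().

```r
# Load the built-in data set
data(women)

var_names <- names(women)  # column names
n_rows <- nrow(women)      # number of observations

var_names
n_rows
str(women)                 # compact overview of types and first values
```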
Next we need to plot our data on a scatterplot. I put height on the horizontal axis as the explanatory variable because you really can’t change your height. We can see there is a strong, positive, roughly linear relationship between height and weight. I added axis labels and a title to make the plot easier to read.
plot(height,weight,xlab = "Women's Height (Inches)", ylab = "Women's Weight (Pounds)", main = "Women's Height vs Weights, Ages 30-39")
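To put a number on how strong the relationship looks, one option is the correlation coefficient. This is a small sketch, not part of the original analysis; the name `r` is just an illustrative variable.

```r
data(women)

# Pearson correlation between height and weight
r <- cor(women$height, women$weight)
r
```

A value close to 1 agrees with what the scatterplot shows, but it only measures how linear the relationship is overall. It does not tell us whether a straight line is the right shape.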
Now we are going to create a best-fit, or linear regression, line. We can do this in R with the lm() function. Its first argument is a formula: put your response variable, then a tilde (~), then your explanatory variable. You will see that when you print your regression model, it tells you your y-intercept and your slope. Next we are going to plot this line.
womenmod <- lm(weight ~ height)
womenmod
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept)       height
##      -87.52         3.45
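The printed coefficients define the line weight = intercept + slope × height. As a rough sketch of how you might use them, the code below pulls the coefficients out with coef() and predicts the weight for one height two ways. The names `fit`, `b`, `pred`, and `manual`, and the example height of 65 inches, are my own choices for illustration.

```r
data(women)
fit <- lm(weight ~ height, data = women)

# Named vector holding the intercept and the slope
b <- coef(fit)
b

# Predicted weight for a 65-inch woman, using predict()
pred <- predict(fit, newdata = data.frame(height = 65))

# The same prediction computed by hand from the coefficients
manual <- b[["(Intercept)"]] + b[["height"]] * 65

pred
manual
```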
Start by redrawing your scatterplot (you only need to do this if the plot is no longer open, because abline() adds a line to the current plot). At first glance, this seems to be a pretty good regression line. It fits the data very well!
plot(height,weight,xlab = "Women's Height (Inches)", ylab = "Women's Weight (Pounds)", main = "Women's Height vs Weights, Ages 30-39")
abline(womenmod)  # uses the model's exact intercept and slope instead of retyped, rounded values
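To see if this really is a good regression line, we are going to calculate the residuals (the observed weights minus the weights the line predicts). But just looking at a list of residuals isn’t super helpful. So we need to plot them and look! For our assumptions to hold true, the residuals should be roughly normally distributed, so this histogram should look approximately bell-shaped. It looks like it might be normal, but with only 15 points I am not sure. So, we should use a normal Q-Q plot, where normal residuals fall along a straight line (it is much easier to judge whether points follow a line than whether a histogram is bell-shaped).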
myresids <- womenmod$residuals
hist(myresids)
This is a normal Q-Q plot: the quantiles of the residuals plotted against the quantiles of a normal distribution. If the residuals are normal, the points should fall on the graphed line. They are pretty close, but we will investigate further just to be sure!
qqnorm(myresids)
qqline(myresids)
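If you would like a numeric check to go with the Q-Q plot, one possibility is the Shapiro-Wilk test from base R. This is an optional sketch that the original analysis does not use; `fit` and `sw` are illustrative names. With only 15 observations the test has little power, so treat the result as a rough guide.

```r
data(women)
fit <- lm(weight ~ height, data = women)

# Shapiro-Wilk test of normality on the residuals.
# A small p-value would suggest the residuals are not normal.
sw <- shapiro.test(resid(fit))
sw
```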
We are now going to plot the residuals against the heights. If our assumptions hold, these dots should be scattered randomly around zero with no pattern. Clearly they are not! The residuals are positive for the shortest and tallest women and negative in the middle, so the model is under-predicting the weight of short and tall women and over-predicting the weight of average-height women. This curved pattern is the evidence we need to say that a straight line is not a good fit for our data.
plot(womenmod$residuals ~ height)
abline(0,0)
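The pattern in the residual plot can also be checked with numbers. The sketch below splits the women into three height groups and averages the residuals in each one. The cut points (62 and 67 inches) and the names `fit`, `height_group`, and `group_means` are arbitrary choices of mine, just for illustration.

```r
data(women)
fit <- lm(weight ~ height, data = women)

# Three height bands: 58-62, 63-67, and 68-72 inches
height_group <- cut(women$height,
                    breaks = c(57, 62, 67, 72),
                    labels = c("short", "middle", "tall"))

# Average residual (observed minus predicted weight) in each band
group_means <- tapply(resid(fit), height_group, mean)
group_means
```

A positive average means the line sits below the observed weights in that band (under-prediction), and a negative average means it sits above them (over-prediction).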
Finally, and maybe something we should have started with, we can look at a summary of our model!
summary(womenmod)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.7333 -1.1333 -0.3833  0.7417  3.1167
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
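You do not have to read these numbers off the printout. As a small sketch, the summary object can be saved and its pieces pulled out by name; `fit` and `s` are illustrative names of my own.

```r
data(women)
fit <- lm(weight ~ height, data = women)
s <- summary(fit)

s$r.squared  # Multiple R-squared
s$sigma      # Residual standard error
coef(s)      # Coefficient table: estimates, standard errors, t values, p-values
```

Notice that R-squared is very high even though the residual plot shows a clear pattern, so a high R-squared on its own does not prove the line is the right model.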
Thanks for tuning in!