For this learning log, I chose the “women” data set. This data set gives the average heights and weights for American women aged 30-39.
data(women)
attach(women)
I am interested in the relationship between weight and {height}. {height} is recorded in inches, while weight is recorded in pounds.
I will use {height} as the explanatory variable and weight as the response, due to the fact that it’s generally easier to change weights than change height.
plot(height,weight,ylab="Weight (lbs)",xlab="Height (in)")
It can be seen from the scatterplot that there is a positive association between the two variables. The points seem to follow the trend pretty closely, and there seems to be an obvious linear relationship.
mymod<-lm(weight~height)
mymod
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept) height
## -87.52 3.45
The regression line is \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] Plugging in our numbers, \[\hat{y_i}= -87.52 + 3.45 x_i\]
For every inch increase in height, the average American woman’s weight is increased on average by 3.45 pounds.
We cannot interpret the intercept because we cannot reasonably set the explanatory variable to zero. This makes sense, as there’s no possible way that a woman’s height can be zero.
plot(height,weight,ylab="Weight (lbs)",xlab="Height (in)")
abline(-87.52,3.45)
Let’s predict the first record.
firstheight<-women[1,1]
firstactual<-women[1,2]
firstpredicted<-coef(mymod)%*%c(1,firstheight)
firstpredicted
## [,1]
## [1,] 112.5833
The prediction for the weight of an American woman with height 58 inches is 112.5833 pounds.
firstresid<-firstactual-firstpredicted
firstresid
## [,1]
## [1,] 2.416667
So our prediction was 2.4167 pounds too low.
myresids<-mymod$residuals
hist(myresids)
This histogram doesn’t seem to be normally distributed.
qqnorm(myresids)
qqline(myresids)
If the residuals were normally distributed, they’d follow a straight line. The points near the end of the line seem to be pretty far off. I’d say our assumption of normal distribution isn’t fulfilled.
Let’s look at the original plot again to check for homoscedasticity.
plot(height,weight,ylab="Weight (lbs)",xlab="Height (in)")
Although it seems to follow a straight line, we don’t have much data around each height.
Let’s look at the residuals plotted against the height.
plot(mymod$residuals ~ height)
abline(0,0)
It seems to me that there is an obvious curve, which points to heterscedasticity instead of homoscedasity, so the variance doesn’t seem to be constant.
summary(mymod)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
So the MSE is 1.525, with 13 degrees of freedom because there are 15 observations, minus 2 for the two terms with Beta.