We are going to look at the dataset called “women”, which contains the height and weights of a sample population of women.
data(women)
head(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
names(women)
## [1] "height" "weight"
attach(women)
After calling up the dataset, I plotted the data to get a visual representation of the sample. In this case, the weight is the response variable (y axis), and the height is the predictor variable (x axis).
plot(weight~height,ylab="weight(lbs)",xlab="height(in)",main="Women's Weights and Heights")
Looking at the scatterplot, it appears that the sample has a relatively clear positive linear relationship between height and weight. As the height increases, so does the weight.
Next, I added a regression line, or a line of best fit, that best represents the trend of the data. I used the lm function to give me the slope and intercept of the data, and plugged it into the abline function, which plots the line.
mymod<-lm(weight~height)
mymod
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept) height
## -87.52 3.45
plot(weight~height,ylab="weight(lbs)",xlab="height(in)",main="Women's Weights and Heights")
abline(-87.52,3.45)
The y-intercept of the line is -87.52, meaning that the weight of someone that is 0 inches tall is -87.52 pounds. This is physically impossible, and this is referred to as extrapolation since it is outside the cloud of data, and therefore should be avoided when making predictions. The slope of the line is 3.45, meaning that within the scope of the data provided, for every inch, there is an average increase in weight of 3.45 pounds.
I want to be able to predict the weight of a woman given her height. For this example, I want to predict the weight of a woman that is 67 inches tall.
pred67<-coef(mymod)%*%c(1,67)
pred67
## [,1]
## [1,] 143.6333
obs67<-women[10,2]
obs67
## [1] 142
The predicted weight of a 67 inch tall woman is about 143.63 pounds, whereas the observed weight from the sample is 142 pounds. The difference between the observed weight and the predicted weight is called the residual, which is:
res67<-obs67-pred67
res67
## [,1]
## [1,] -1.633333
We can calculate the residual at each data point to see how much each predicted value is different from the observed value. We do this because we want to make sure that the assumption that the residuals follow a normal distribution is accurate. To show this, I created a histogram of the residuals.
myresid <- mymod$residuals
hist(myresid,main="Residuals of Women's Weights")
It can be more difficult to check for a normal distribution in the form of a histogram. To make the process a bit easier, we can plot the quantiles of the residuals against the quantiles of a normal distribution (straight line). If the data points from the residuals follow this straight line, then the assumption is correct.
qqnorm(myresid)
qqline(myresid)
The points at the extremes seem to be substantially far off the normal line, so it cannot be assumed that the residuals follow a normal distribution.
Another assumption that is made in linear regression is homoscedasticity, or equal variances. To check this, I plot the residuals against the height, and compared it to to the line y=0. We are looking at the vertical spread to see if there is any obvious signs of heteroscedasticity.
plot(mymod$residuals~height)
abline(0,0)
The data points create a curve, which indicates that the variance is inconsistent, or the data shows heteroscedasticity.
In order to find the mean square error of the data, we can take a look at the summary of the model we created
summary(mymod)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
The resulting MSE is 1.525 with 13 degrees of freedom (which comes from 15 datapoints and 2 beta terms).