I chose to work with the data set containing women’s height and weight data. I will use height as the predictor variable and weight as the response variable.
First, I will start off by attaching the data set so that its columns can be referenced by name.
attach(women)
Now I can run my plot command and look at the scatter plot.
plot(height,weight)
This plot shows that the two variables have a strongly linear relationship. The plot looks good, but I want to fix the labels.
plot(height, weight, ylab = "Women's Weight",
xlab = "Women's Height",
main = "Women's Height and Weight Data")
Now that looks better and is easier to read.
Next I want to run my regression and add the fitted line to my plot.
model <- lm(weight~height)
model
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept)       height
##      -87.52         3.45
plot(height, weight, ylab = "Women's Weight",
xlab = "Women's Height",
main = "Women's Height and Weight Data")
abline(model)  # draws the fitted line using the model's unrounded coefficients
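The printed coefficients are rounded to two decimals. As a small sketch, the unrounded values can be read straight from the fitted object with coef(). The names `fit` and `cf` are my own for illustration, and I pass `data = women` so the snippet runs without attach().

```r
# Sketch: fit the same model without attach(), then read off the coefficients.
# `fit` and `cf` are illustrative names, not part of the analysis above.
fit <- lm(weight ~ height, data = women)

cf <- coef(fit)   # named numeric vector: "(Intercept)" and "height"
cf

# Rounding to two decimals should give back the values printed by the model.
round(cf, 2)
```

Using the stored coefficients instead of retyping them avoids small rounding differences in later calculations.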
Now I will look at a specific data point, the 3rd row, and see how close the predicted value is to the actual value. The difference between the two is the residual.
women[3, ]
##   height weight
## 3     60    120
-87.52+60*3.45
## [1] 119.48
120-119.48
## [1] 0.52
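The hand calculation uses the rounded coefficients. As a sketch, the same check can be done with predict() and residuals(), which work from the unrounded fit. The names `fit`, `row3`, `pred3`, and `res3` are just for illustration.

```r
# Sketch: predicted value and residual for the 3rd observation.
fit <- lm(weight ~ height, data = women)

row3  <- women[3, ]                      # height 60, weight 120
pred3 <- predict(fit, newdata = row3)    # predicted weight at that height
res3  <- row3$weight - pred3             # residual = actual - predicted

pred3
res3

# The model object already stores the same residual for observation 3.
residuals(fit)[3]
```

Both values should agree with the hand calculation to two decimal places.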
One of our assumptions is that the errors are normally distributed, so I will make a histogram of the residuals.
resid <- model$residuals
hist(resid)
This looks okay. If you count, you will see that there are 8 negative errors and 7 positive errors, so the residuals are centered around 0 reasonably well, but the right side is more dispersed than the left side.
I will bring up two more graphs to check the residuals.
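Rather than counting by eye, the signs can be tallied in R. This is a minimal sketch; `fit` and `e` are illustrative names, and the model is refit with `data = women` so the snippet stands alone.

```r
# Sketch: count negative and positive residuals and check their spread.
fit <- lm(weight ~ height, data = women)
e   <- residuals(fit)

sum(e < 0)   # number of negative residuals
sum(e > 0)   # number of positive residuals

mean(e)      # least-squares residuals average to zero, up to rounding error
range(e)     # smallest and largest residual
```

The range makes the asymmetry concrete: the largest positive residual is farther from zero than the largest negative one.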
qqnorm(resid)
qqline(resid)
plot(model$residuals ~ height)
abline(0,0)
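The first plot compares the residuals to a normal distribution: the closer the points fall to the reference line, the more reasonable the normality assumption. Its vertical axis also shows that the residuals are small, roughly -2 to 3 pounds. The second plot shows a potential problem, though, because there is a pattern in our residuals. The residuals should scatter randomly around zero with no relationship to height, but here they are positive at both ends of the height range and negative in the middle. That contradicts the independence assumption a bit and suggests some curvature that the straight line does not capture.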
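To look at the pattern in the residual plot more closely, here is a small sketch that lists each residual with its sign in order of height and then measures how long the runs of same-signed residuals are. The names `fit`, `e`, `s`, and `pattern` are my own.

```r
# Sketch: inspect the sign pattern of the residuals along the height axis.
fit <- lm(weight ~ height, data = women)
e   <- unname(residuals(fit))

s <- ifelse(e > 0, "+", "-")   # sign of each residual, ordered by height
pattern <- data.frame(height = women$height,
                      residual = round(e, 2),
                      sign = s)
pattern

# Lengths of consecutive runs of the same sign.
# Randomly scattered residuals would switch sign often, giving many short runs.
rle(s)$lengths
```

A few long runs, as opposed to many short ones, is what a systematic pattern looks like in numbers.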
Finally we want to check out the summary statistics of our model.
summary(model)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.7333 -1.1333 -0.3833  0.7417  3.1167
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
This is a nice way to see our residual statistics, our R-squared value, our coefficients, our degrees of freedom, and many other useful statistics.
As we can see, our R-squared value is very high (0.991) and our residual standard error is low (1.525 pounds), so the line fits the data well and the variables are strongly linearly related. The interpretation of our model is that for every one-inch increase in height, the predicted weight increases by 3.45 pounds.
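As a sketch of where these summary numbers come from, R-squared and the residual standard error can be rebuilt from the residuals, and the slope interpretation can be checked by predicting at two heights one inch apart. The names `fit`, `ss_res`, `ss_tot`, `r2`, `rse`, and `step` are illustrative, and the heights 65 and 66 are arbitrary choices.

```r
# Sketch: rebuild R-squared and the residual standard error by hand.
fit <- lm(weight ~ height, data = women)
e   <- residuals(fit)

ss_res <- sum(e^2)                                      # unexplained variation
ss_tot <- sum((women$weight - mean(women$weight))^2)    # total variation

r2 <- 1 - ss_res / ss_tot
r2
summary(fit)$r.squared     # should match r2

rse <- sqrt(ss_res / df.residual(fit))   # 13 residual degrees of freedom here
rse

# Slope check: predicted weights one inch apart differ by the slope.
step <- diff(predict(fit, newdata = data.frame(height = c(65, 66))))
step
```

Any pair of heights one inch apart gives the same difference, because the model is a straight line.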