Here is a description of some of the things I learned from the R guide on regression in class.
First, I loaded my data. Then I looked at the first few rows, checked the variable names, and attached the data frame so I could refer to the variables directly. This was good review.
data(women)
head(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
names(women)
## [1] "height" "weight"
attach(women)
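As a side note, attach is not the only way to reach the columns. Here is a minimal sketch of two alternatives, the $ operator and with(), that leave the search path alone. Neither is from the class guide.

```r
data(women)

# Pick a column out of the data frame directly
women$weight[1:3]           # 115 117 120

# Evaluate an expression with the columns of `women` in scope
with(women, mean(height))   # 65
```

Functions that take a formula, such as lm and plot, also accept a data argument, for example lm(weight ~ height, data = women), which avoids attach in the same way.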
Next, I created a scatterplot of the data to visualize it. I learned how to label the axes and add a title.
plot(height,weight,ylab="Weight (lbs)", xlab="Height (in)", main="US Women Average Heights and Weights aged 30-39")
It appears that as height increases, weight increases. We can use the {lm} command to get the intercept and slope of our regression line. I couldn’t remember how to plot the regression line, but was reminded to use the {abline} command.
mymod1<-lm(weight ~ height)
mymod1
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept) height
## -87.52 3.45
plot(height,weight,ylab="Weight (lbs)", xlab="Height (in)")
abline(mymod1) #uses the fitted intercept and slope instead of the rounded values
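The slope tells us that for each additional inch of height, we expect the average weight to increase by 3.45 lbs. The intercept, -87.52, is the predicted weight at a height of 0 in, which is far outside the cloud of data. We want to avoid extrapolation, so we don't interpret the intercept as a real weight; we just accept -87.52 as the y-intercept of our line of best fit.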
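A quick way to see how far a height of 0 is from the data is to look at the range of observed heights. This is only a small check I added, not part of the guide.

```r
data(women)

# Smallest and largest observed heights, in inches
range(women$height)   # 58 72
```

Predictions are only trustworthy for heights roughly between these two values.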
Now for some predicting. I want to predict the average weight for women who are 65 in tall.
mypred<-coef(mymod1)%*%c(1,65)
mypred
## [,1]
## [1,] 136.7333
I expect the average weight to be 136.7333 lbs. Next, I want to calculate the residual (observed - predicted) for this height. Row 8 of the data is the observation with a height of 65 in.
actual<-women[8, "weight"]
actual
## [1] 135
actual-mypred
## [,1]
## [1,] -1.733333
The residual is negative, so my prediction was 1.7333 pounds too high.
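The prediction and residual above can also be obtained with R's built-in predict and residuals functions, which avoids the matrix multiplication. This is a sketch under the assumption that the same model is refit with a data argument; the names fit and pred65 are made up for this example.

```r
data(women)

# Same model as mymod1, refit here so the chunk runs on its own
fit <- lm(weight ~ height, data = women)

# predict() wants the new heights in a data frame whose column
# name matches the predictor in the formula
pred65 <- predict(fit, newdata = data.frame(height = 65))
pred65              # about 136.7333

# residuals() gives observed minus fitted for every row;
# row 8 is the 65 in observation
residuals(fit)[8]   # about -1.7333
```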
Next, I learned how to check assumptions. First, check that the residuals are normally distributed. One option is to create a histogram. Another option is to create a Q-Q plot to see how closely the residuals follow a straight line.
myresids<-mymod1$residuals #for all residuals
qqnorm(myresids)
qqline(myresids)
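For reference, here is a rough sketch of the histogram option, plus a residuals-versus-fitted plot, which is a common way to look at whether the spread of the errors stays constant. The names fit and res and the axis labels are my own choices, not from the guide.

```r
data(women)

# Same model as mymod1, refit here so the chunk runs on its own
fit <- lm(weight ~ height, data = women)
res <- residuals(fit)

# Histogram of the residuals: look for a roughly bell-shaped pattern
hist(res, main = "Histogram of residuals", xlab = "Residual (lbs)")

# Residuals against fitted values: look for an even band around 0
plot(fitted(fit), res, xlab = "Fitted weight (lbs)", ylab = "Residual (lbs)")
abline(h = 0, lty = 2)   # dashed reference line at zero
```

With only 15 observations, both plots are rough guides rather than firm tests.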
In the Q-Q plot, the points aren’t all close to the straight line, so we might not be able to use regression to draw conclusions. We also want to check that the variance of the errors remains constant. A good way to do this is to check whether the spread of the points around the line stays about the same across the scatterplot. Again, we might be violating our assumptions.
The summary of mymod1 reports the residual standard error, which is the square root of the Mean Square Error, so here the Mean Square Error is about 1.525^2 = 2.33. This output is also useful because it lists the coefficients with their standard errors, t statistics, and p-values, along with the degrees of freedom.
summary(mymod1)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
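Since the summary prints the residual standard error rather than the Mean Square Error itself, here is a small sketch of getting the Mean Square Error two ways. The names fit, s, and mse are made up for this example.

```r
data(women)

# Same model as mymod1, refit here so the chunk runs on its own
fit <- lm(weight ~ height, data = women)

# Residual standard error as stored in the summary object
s <- summary(fit)$sigma
mse <- s^2
mse

# Same quantity by hand: residual sum of squares over residual df
sum(residuals(fit)^2) / df.residual(fit)
```

Both lines should print the same value, roughly 2.33 on 13 degrees of freedom.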