Learning Log Day 3: Regression

Here is a description of some of the things I learned from the R guide on regression in class.

First, I loaded my data. Then I looked at the first few rows, checked the variable names, and attached the data frame so I could refer to the variables directly. This was good review.

data(women)
head(women)

##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129

names(women)

## [1] "height" "weight"

attach(women)

Next, I created a scatterplot of the data to visualize it. I learned how to label the axes and add a title.

plot(height,weight,ylab="Weight (lbs)", xlab="Height (in)", main="US Women Average Heights and Weights aged 30-39")

It appears that as height increases, weight increases. We can also use the {lm} command to get the intercept and slope for our regression line. I couldn’t remember how to plot the regression line, but was reminded to use the {abline} command.

mymod1<-lm(weight ~ height)
mymod1

## 
## Call:
## lm(formula = weight ~ height)
## 
## Coefficients:
## (Intercept)       height  
##      -87.52         3.45
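
As a side note, the same fit can be obtained without attaching the data. This is only a sketch: the name `fit` is my own and not from the class guide, and it assumes the built-in `women` data set is available.

```r
# Fit the same model, passing the data frame explicitly instead of attaching it
data(women)
fit <- lm(weight ~ height, data = women)

# coef() returns a named vector holding the intercept and the slope
b <- coef(fit)
b["(Intercept)"]
b["height"]
```

Passing `data = women` keeps the variable names tied to one data frame, which helps if another object called `height` or `weight` exists in the workspace.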

plot(height,weight,ylab="Weight (lbs)", xlab="Height (in)")
abline(mymod1)  # draws the line using the fitted model's intercept and slope

For the slope, we expect that for each additional inch of height, the average weight increases by 3.45 lbs. The intercept corresponds to a height of 0 in, which is far outside the cloud of data (heights run from 58 to 72 in). We want to avoid extrapolation, so we don't interpret -87.52 as a real weight; we just accept it as the y-intercept of our line of best fit.

Now for some predicting. I want to predict the average weight for women who are 65 in tall.

mypred<-coef(mymod1)%*%c(1,65)
mypred

##          [,1]
## [1,] 136.7333
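
The matrix multiplication above is the intercept plus the slope times 65. I believe the same number can be obtained with the {predict} command, sketched below. The names `fit`, `new_obs`, and `pred_65` are my own, not from the class guide.

```r
# Refit the model with an explicit data argument so predict() can match column names
data(women)
fit <- lm(weight ~ height, data = women)

# newdata must be a data frame whose column name matches the predictor
new_obs <- data.frame(height = 65)
pred_65 <- predict(fit, newdata = new_obs)
pred_65

# Same calculation by hand: intercept + slope * 65
by_hand <- coef(fit)[1] + coef(fit)[2] * 65
by_hand
```

Note that {predict} needs the model to have been fit with named variables from a data frame; it returns a plain named number rather than a 1x1 matrix.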

I expect the average weight to be about 136.73 lbs. Next, I want to calculate the residual (observed - predicted) for this height. Row 8 of the data is the observation with a height of 65 in.

actual<-women[8, "weight"]
actual

## [1] 135

actual-mypred

##           [,1]
## [1,] -1.733333

The residual is negative, so my prediction was about 1.73 lbs too high.
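
To double check the residual, here is a small sketch that computes observed minus fitted for every row and compares it with what R stores in the model. The names `fit`, `res_manual`, and `res_builtin` are my own.

```r
data(women)
fit <- lm(weight ~ height, data = women)

# Residual = observed weight - fitted weight, for all 15 rows
res_manual <- women$weight - fitted(fit)

# residuals() extracts the same values from the fitted model
res_builtin <- residuals(fit)

# Compare the two versions, ignoring names
all.equal(unname(res_manual), unname(res_builtin))

# Residual for row 8 (height 65 in)
res_builtin[8]
```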

Next, I learned how to check assumptions. First, check that the residuals are normally distributed. One option is to create a histogram. Another option is to create a Q-Q plot to see how closely the residuals follow a straight line.

myresids<-mymod1$residuals  #for all residuals
qqnorm(myresids)
qqline(myresids)
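
For the histogram option, a minimal sketch might look like this. The names `fit` and `res` and the axis labels are my own choices. With only 15 residuals, the shape of the histogram is rough, so I would treat it as a loose check.

```r
data(women)
fit <- lm(weight ~ height, data = women)
res <- residuals(fit)

# Histogram of the residuals; look for a roughly symmetric, bell-like shape
hist(res, xlab = "Residual (lbs)", main = "Histogram of residuals")
```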

The points aren’t all close to the straight line, so the normality assumption may not hold and we should be cautious about using the regression to draw conclusions. We also want to check that the variance of the errors remains constant. A good way to do this is to check the spread of the points around the line on the scatterplot. Again, we might be violating our assumptions.

The output of summary for mymod1 gives the residual standard error, which is the square root of the Mean Square Error. This output is useful because it also lists the coefficients, their standard errors, t statistics, and p-values, along with the degrees of freedom.
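
For the constant variance check, another common view is to plot the residuals against the fitted values. This is only a sketch with my own names (`fit`, `fv`, `res`); it was not part of the class guide.

```r
data(women)
fit <- lm(weight ~ height, data = women)
fv <- fitted(fit)
res <- residuals(fit)

# Residuals vs. fitted values; an even band around 0 suggests constant variance,
# while a curve or funnel shape suggests a problem
plot(fv, res, xlab = "Fitted weight (lbs)", ylab = "Residual (lbs)")
abline(h = 0, lty = 2)  # dashed reference line at zero
```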

summary(mymod1)

## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
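
The summary prints the residual standard error rather than the Mean Square Error itself, so here is a sketch of how to get the MSE from it. The names `fit`, `s`, `rse`, `mse`, and `mse_manual` are my own.

```r
data(women)
fit <- lm(weight ~ height, data = women)
s <- summary(fit)

# s$sigma is the residual standard error shown in the printed summary
rse <- s$sigma

# MSE is the square of the residual standard error
mse <- rse^2

# Same thing by hand: sum of squared residuals divided by the residual degrees of freedom
mse_manual <- sum(residuals(fit)^2) / df.residual(fit)

c(rse = rse, mse = mse, mse_manual = mse_manual)
```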

Jill Wanner

February 6, 2018