Today I will be explaining how to work through regression, scatter plots, linear models, histograms, and more!
First, make sure you attach the women data set so R knows which data set to use.
attach(women)
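If you'd rather not attach anything, here is a minimal sketch of two other ways to reach the same columns. This is just an alternative I'm assuming some readers may prefer; the rest of this walkthrough still uses the attached names.

```r
# The women data set ships with base R, so nothing needs to be loaded.

# Option 1: pull a column out of the data set with $.
mean_height_dollar <- mean(women$height)

# Option 2: evaluate an expression "inside" the data set with with().
mean_height_with <- with(women, mean(height))

mean_height_dollar  # 65
mean_height_with    # 65
```

Both match the mean height of 65.0 that summary(women) reports.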
Starting small: first check out the summary, head, and structure of the data set of women’s heights and weights to get an idea of what it contains.
summary(women)
## height weight
## Min. :58.0 Min. :115.0
## 1st Qu.:61.5 1st Qu.:124.5
## Median :65.0 Median :135.0
## Mean :65.0 Mean :136.7
## 3rd Qu.:68.5 3rd Qu.:148.0
## Max. :72.0 Max. :164.0
head(women)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
str(women)
## 'data.frame': 15 obs. of 2 variables:
## $ height: num 58 59 60 61 62 63 64 65 66 67 ...
## $ weight: num 115 117 120 123 126 129 132 135 139 142 ...
Let’s look at a scatter plot for visual people (like me).
plot(height,weight)
That’s a very beautiful scatterplot! If you don’t like the default axis names and care about grammar, you can relabel the axes with capitalized names and units, and add a title.
plot(height,weight, ylab ="Weight (lb)",
xlab = "Height (inches)",
main = "Figure 1")
The trend seems to be a strong positive correlation: as height increases, weight increases as well. Let’s describe the relationship with a linear model using the {lm} command. Name the model whatever you’d like.
mymod <- lm(weight ~ height)
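As a quick aside, the strength of the correlation itself can be measured with the cor command. This is a small sketch using the built-in women data set directly; the name r is just my choice for illustration.

```r
# Pearson correlation between height and weight.
r <- cor(women$height, women$weight)

r    # about 0.995, so a very strong positive linear relationship
r^2  # squared correlation, the share of variation in weight explained by height
```

A value this close to 1 backs up what the scatter plot suggested.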
Call up your model.
mymod
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept) height
## -87.52 3.45
This intercept seems off because it is the predicted weight for someone 0 inches tall. That is extrapolating far outside the range of our data (heights from 58 to 72 inches), so we can’t interpret the intercept in any meaningful way.
So, the intercept is -87.52 and the slope is 3.45.
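To see what those two numbers mean, here is a small worked sketch that plugs a height into the fitted line by hand. The height of 65 inches is a value I picked for illustration, and the coefficients are the rounded ones from the output above.

```r
# Rounded coefficients copied from the model output.
intercept <- -87.52
slope <- 3.45

# Fitted line: weight = intercept + slope * height
height_in <- 65                            # illustrative height, inside the 58 to 72 inch range
pred_lb <- intercept + slope * height_in   # predicted weight in pounds

pred_lb  # 136.73
```

The slope says each extra inch of height goes with about 3.45 more pounds, which is much easier to interpret than the intercept.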
It might help to look at a regression line. You already created a regression model, so you just need the abline command to add the line to the scatterplot.
plot(height,weight,ylab ="Weight (lb)",xlab = "Height (inches)",main = "Figure 1")
abline(-87.52,3.45)
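Typing the rounded numbers works, but abline can also take the model itself. Here is a minimal sketch of that; I refit the model under the name fit (my own name, same model as mymod) so the example runs on its own.

```r
# Same model as before, refit with the data argument so no attach() is needed.
fit <- lm(weight ~ height, data = women)

plot(women$height, women$weight,
     ylab = "Weight (lb)",
     xlab = "Height (inches)",
     main = "Figure 1")

# abline() pulls the intercept and slope straight from the model,
# so there is no rounding and nothing to retype.
abline(fit)
```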
Great. Now let’s look up some values. Type the row number of the data set, then “weight”, to see the observed weight that goes with the height in that row.
women[4,"weight"]
## [1] 123
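That lookup gives the observed weight. If you want the weight the model predicts instead, the predict command does it. This is a sketch under my own assumptions: the heights 61 and 66.5 are made up for illustration, and fit is the same model as mymod, refit so the block stands alone.

```r
# Same model as before, refit with the data argument.
fit <- lm(weight ~ height, data = women)

# New heights to predict for; the column name must match the predictor, height.
new_heights <- data.frame(height = c(61, 66.5))

pred <- predict(fit, newdata = new_heights)
pred
```

The first prediction is for 61 inches, the same height as row 4, so you can compare it with the observed 123 lb. Both heights sit inside the 58 to 72 inch range of the data, so we are not extrapolating.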
Coefficients might be helpful for interpreting and are a bit clearer to read:
coef(mymod)
## (Intercept) height
## -87.51667 3.45000
You have to check model assumptions to make sure your linear regression is valid. The assumption we will use R for is that the errors have a normal distribution. First, find the residuals and then use a method like a histogram to display the distribution.
myresids <- mymod$residuals
hist(myresids)
That doesn’t look right… let’s try a qq plot to see how the residuals are distributed, and add a straight line to compare them to.
qqnorm(myresids)
qqline(myresids)
The points on the ends look like they are much further from the line than the points near the middle. This could be an issue with our data or the model.
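If you're curious what the qq plot is actually drawing, here is a rough sketch that rebuilds it by hand: the sorted residuals on the y axis against the quantiles a normal distribution would give for a sample of the same size. The names fit, res, and theo are mine, and the model is refit so the block runs on its own.

```r
# Same model as before, refit with the data argument.
fit <- lm(weight ~ height, data = women)
res <- resid(fit)

# Quantiles we would expect from a normal sample of this size.
theo <- qnorm(ppoints(length(res)))

# qqnorm() with plot.it = FALSE returns the plotted coordinates instead of drawing.
qq <- qqnorm(res, plot.it = FALSE)

# The x coordinates are exactly those theoretical quantiles.
same_x <- isTRUE(all.equal(sort(qq$x), theo))
same_x

# Drawing it by hand gives the same picture as qqnorm(res).
plot(theo, sort(res),
     xlab = "Theoretical Quantiles",
     ylab = "Sample Quantiles")
```

If the residuals were normal, the points would fall close to a straight line, which is why qqline is such a handy reference.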
Another check is plotting the residuals against the independent variable’s data.
plot(mymod$residuals ~ height)
abline(0,0)
We can definitely say the residuals don’t scatter randomly around the zero line as well as we’d like. Maybe gathering more data would help the model fit better.
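As a reminder of what is being plotted here, a residual is just the observed weight minus the weight the model fitted. A minimal sketch to check that by hand (fit and manual_resids are names I made up, and the model is refit so the block stands alone):

```r
# Same model as before, refit with the data argument.
fit <- lm(weight ~ height, data = women)

# Residual = observed value - fitted value.
manual_resids <- women$weight - fitted(fit)

# These match what R stores in the model.
resids_match <- isTRUE(all.equal(unname(manual_resids), unname(resid(fit))))
resids_match

# With an intercept in the model, the residuals sum to (essentially) zero.
sum(resid(fit))
```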
Looking at the mean square error is important, and I like it when R does the work for me rather than using the formula from our book.
summary(mymod)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
There’s so much information in this: the number of degrees of freedom, the quartiles of the residuals, the intercept and slope, the residual standard error, etc. It’s a great thing to have because it saves you time and eliminates much of the risk of human calculation error.
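The summary prints the residual standard error rather than the mean square error itself, but the MSE is just that number squared. Here is a short sketch that pulls both out of the summary and checks them against the by-hand formula. The names fit, s, and mse are mine, and the model is refit so the block runs on its own.

```r
# Same model as before, refit with the data argument.
fit <- lm(weight ~ height, data = women)
s <- summary(fit)

s$sigma      # residual standard error, about 1.525
s$r.squared  # Multiple R-squared, about 0.991

# Mean square error is the residual standard error squared.
mse <- s$sigma^2
mse

# The same value by hand: sum of squared residuals over the residual
# degrees of freedom (13 here, which is 15 observations minus 2 coefficients).
mse_by_hand <- sum(resid(fit)^2) / df.residual(fit)
mse_by_hand
```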