df <- read.csv("http://cknudson.com/data/Galton.csv")
head(df)
##   FamilyID FatherHeight MotherHeight Gender Height NumKids
## 1        1         78.5         67.0      M   73.2       4
## 2        1         78.5         67.0      F   69.2       4
## 3        1         78.5         67.0      F   69.0       4
## 4        1         78.5         67.0      F   69.0       4
## 5        2         75.5         66.5      M   73.5       4
## 6        2         75.5         66.5      M   72.5       4
names(df)
## [1] "FamilyID" "FatherHeight" "MotherHeight" "Gender"
## [5] "Height" "NumKids"
summary(df)
##     FamilyID    FatherHeight    MotherHeight   Gender      Height
##  185    : 15   Min.   :62.00   Min.   :58.00   F:433   Min.   :56.00
##  166    : 11   1st Qu.:68.00   1st Qu.:63.00   M:465   1st Qu.:64.00
##  66     : 11   Median :69.00   Median :64.00           Median :66.50
##  130    : 10   Mean   :69.23   Mean   :64.08           Mean   :66.76
##  136    : 10   3rd Qu.:71.00   3rd Qu.:65.50           3rd Qu.:69.70
##  140    : 10   Max.   :78.50   Max.   :70.50           Max.   :79.00
##  (Other):831
##     NumKids
##  Min.   : 1.000
##  1st Qu.: 4.000
##  Median : 6.000
##  Mean   : 6.136
##  3rd Qu.: 8.000
##  Max.   :15.000
##
plot(df$FatherHeight, df$Height)
model <- lm(df$Height~df$FatherHeight)
model
##
## Call:
## lm(formula = df$Height ~ df$FatherHeight)
##
## Coefficients:
##     (Intercept)  df$FatherHeight
##         39.1104           0.3994
Because we stored the fit under the name “model”, calling that name on its own prints the call and the estimated coefficients. From this output we can write the regression equation:
\[\text{predicted height} = 39.1104 + 0.3994 \times \text{observed father's height}\]
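As a quick check of the equation, the sketch below plugs a father's height into it. The coefficients are copied from the printed output above; the 70-inch father and the helper name `predict_height` are assumptions made only for illustration.

```r
# Coefficients copied from the printed lm() output (rounded as shown there)
intercept <- 39.1104
slope     <- 0.3994

# Hypothetical helper: apply the regression equation to a father's height in inches
predict_height <- function(father_height) {
  intercept + slope * father_height
}

# Assumed example input: a father who is 70 inches tall
predicted <- predict_height(70)
print(predicted)  # about 67.07 inches
```

One caveat on design: because the model was fit with `df$` inside the formula, `predict()` with a `newdata` argument will not pick up new father heights. Refitting as `lm(Height ~ FatherHeight, data = df)` gives the same coefficients and lets `predict()` work with new data.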
summary(model)
##
## Call:
## lm(formula = df$Height ~ df$FatherHeight)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.2683  -2.6689  -0.2092   2.6342  11.9329
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     39.11039    3.22706  12.120   <2e-16 ***
## df$FatherHeight  0.39938    0.04658   8.574   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.446 on 896 degrees of freedom
## Multiple R-squared: 0.07582, Adjusted R-squared: 0.07479
## F-statistic: 73.51 on 1 and 896 DF, p-value: < 2.2e-16
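The model explains only about 7.6% of the variation in Height (Multiple R-squared: 0.07582). Father’s height is a statistically significant predictor (p < 2e-16), but on its own it is a weak one.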
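To see where that percentage comes from, here is a minimal sketch that computes R-squared by hand as 1 - RSS/TSS and compares it with the value `summary()` reports. To stay self-contained it uses only the six rows printed by `head(df)` earlier, so its numbers will not match the full 898-row fit; the names `six`, `fit`, and `r2_manual` are illustrative.

```r
# The six rows shown by head(df), entered by hand (not the full data set)
six <- data.frame(
  FatherHeight = c(78.5, 78.5, 78.5, 78.5, 75.5, 75.5),
  Height       = c(73.2, 69.2, 69.0, 69.0, 73.5, 72.5)
)

# Same model form as above, written with the data argument instead of df$
fit <- lm(Height ~ FatherHeight, data = six)

# R-squared = 1 - (residual sum of squares / total sum of squares)
rss <- sum(resid(fit)^2)
tss <- sum((six$Height - mean(six$Height))^2)
r2_manual <- 1 - rss / tss

# The hand computation should agree with what summary() reports
print(all.equal(r2_manual, summary(fit)$r.squared))
```

On the full data, the same three lines applied to `model` and `df$Height` should reproduce the 0.07582 shown in the summary output.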
qqnorm(resid(model))
qqline(resid(model))
As the plot shows, the residuals follow the straight line reasonably well, so we will treat the normality assumption as met while noting possible issues. The points that fall off the line suggest either skewed data or, more likely, a few extreme values that do not fit a normal distribution well.
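One way to follow up on possible extreme values is to look at standardized residuals and flag the largest ones. The sketch below does this on the six rows printed by `head(df)`, purely so that it runs on its own. The cutoff of 2 is a common rule of thumb and an assumption here, not something the analysis above prescribes, and the names `six`, `fit`, `std_res`, and `flagged` are illustrative.

```r
# The six rows shown by head(df), entered by hand (not the full data set)
six <- data.frame(
  FatherHeight = c(78.5, 78.5, 78.5, 78.5, 75.5, 75.5),
  Height       = c(73.2, 69.2, 69.0, 69.0, 73.5, 72.5)
)
fit <- lm(Height ~ FatherHeight, data = six)

# Standardized residuals: each residual divided by its estimated standard deviation
std_res <- rstandard(fit)

# Assumed rule of thumb: flag observations whose standardized residual exceeds 2 in size
cutoff  <- 2
flagged <- which(abs(std_res) > cutoff)

# Row with the largest standardized residual in absolute value
most_extreme <- which.max(abs(std_res))

print(round(std_res, 2))
print(flagged)
print(most_extreme)
```

With the full data, the same calls would be applied to `model`, and the flagged rows would be the natural candidates for the points that drift away from the Q-Q line.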