df <- read.csv("http://cknudson.com/data/Galton.csv", stringsAsFactors = TRUE)
head(df)
##   FamilyID FatherHeight MotherHeight Gender Height NumKids
## 1        1         78.5         67.0      M   73.2       4
## 2        1         78.5         67.0      F   69.2       4
## 3        1         78.5         67.0      F   69.0       4
## 4        1         78.5         67.0      F   69.0       4
## 5        2         75.5         66.5      M   73.5       4
## 6        2         75.5         66.5      M   72.5       4
names(df)
## [1] "FamilyID"     "FatherHeight" "MotherHeight" "Gender"      
## [5] "Height"       "NumKids"
summary(df)
##     FamilyID    FatherHeight    MotherHeight   Gender      Height     
##  185    : 15   Min.   :62.00   Min.   :58.00   F:433   Min.   :56.00  
##  166    : 11   1st Qu.:68.00   1st Qu.:63.00   M:465   1st Qu.:64.00  
##  66     : 11   Median :69.00   Median :64.00           Median :66.50  
##  130    : 10   Mean   :69.23   Mean   :64.08           Mean   :66.76  
##  136    : 10   3rd Qu.:71.00   3rd Qu.:65.50           3rd Qu.:69.70  
##  140    : 10   Max.   :78.50   Max.   :70.50           Max.   :79.00  
##  (Other):831                                                          
##     NumKids      
##  Min.   : 1.000  
##  1st Qu.: 4.000  
##  Median : 6.000  
##  Mean   : 6.136  
##  3rd Qu.: 8.000  
##  Max.   :15.000  
## 

Visualization

plot(df$FatherHeight, df$Height)

Modeling

model <- lm(df$Height~df$FatherHeight)
model
## 
## Call:
## lm(formula = df$Height ~ df$FatherHeight)
## 
## Coefficients:
##     (Intercept)  df$FatherHeight  
##         39.1104           0.3994

The fitted model is stored under the name “model”, so calling its name in a command prints the coefficients. From this output we can write the regression equation:
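The coefficients can also be extracted programmatically rather than read off the printed output. The following is a minimal sketch, assuming a small made-up data frame called `toy` (not the Galton data) and the `data =` form of the formula, which gives shorter coefficient names than `df$...`; both choices are for illustration only.

```r
# Toy data for illustration only; these values are NOT from the Galton dataset.
toy <- data.frame(
  FatherHeight = c(65, 67, 69, 71, 73, 75),
  Height       = c(64, 67, 66, 69, 68, 71)
)

# Same kind of model as above, written with the data = argument.
fit <- lm(Height ~ FatherHeight, data = toy)

# coef() returns a named numeric vector of the estimates.
b <- coef(fit)
intercept <- b[["(Intercept)"]]
slope     <- b[["FatherHeight"]]

# Rebuild the fitted values directly from intercept + slope * x.
manual_fitted <- intercept + slope * toy$FatherHeight
```

Here `manual_fitted` should agree with `fitted(fit)`, which is one way to confirm that the equation has been read off correctly.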

\[\text{predicted height} = 39.1104 + 0.3994 \times \text{observed father's height}\]
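To see the equation in use, here is a small worked example. The coefficients are copied from the model output above; the helper name `predict_height` and the 70-inch father are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the printed model output above.
intercept <- 39.1104
slope     <- 0.3994

# Predicted child height (inches) for a given father's height (inches).
predict_height <- function(father_height) {
  intercept + slope * father_height
}

# A hypothetical father who is 70 inches tall.
predicted <- predict_height(70)
print(predicted)  # 67.0684
```

The arithmetic is 39.1104 + 0.3994 × 70 = 67.0684, so the model predicts a child of about 67.1 inches for a 70-inch father.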

summary(model)
## 
## Call:
## lm(formula = df$Height ~ df$FatherHeight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2683  -2.6689  -0.2092   2.6342  11.9329 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     39.11039    3.22706  12.120   <2e-16 ***
## df$FatherHeight  0.39938    0.04658   8.574   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.446 on 896 degrees of freedom
## Multiple R-squared:  0.07582,    Adjusted R-squared:  0.07479 
## F-statistic: 73.51 on 1 and 896 DF,  p-value: < 2.2e-16

The multiple R-squared is 0.07582, so the model explains only about 7.6% of the variation in height.
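The R-squared value is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this by hand on simulated data (not the Galton data); the variable names and simulation settings are assumptions made for illustration.

```r
# Simulated data for illustration only; NOT the Galton dataset.
set.seed(1)
father <- rnorm(100, mean = 69, sd = 2.5)
child  <- 39 + 0.4 * father + rnorm(100, sd = 3.4)

fit <- lm(child ~ father)

# Residual sum of squares: variation left unexplained by the line.
ss_res <- sum(resid(fit)^2)
# Total sum of squares: variation of the response around its mean.
ss_tot <- sum((child - mean(child))^2)

# R-squared computed by hand and as reported by summary().
r2_manual <- 1 - ss_res / ss_tot
r2_lm     <- summary(fit)$r.squared
```

The two values should match, which shows that R-squared measures the share of the response's variation that the fitted line accounts for.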

qqnorm(resid(model))
qqline(resid(model))

As the plot shows, the residuals follow the straight line reasonably well, so we will treat the normality assumption as met and note possible issues. Points that fall off the line suggest either skewed residuals or, more likely, extreme values that do not fit a normal distribution well.
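One way to follow up on possible extreme values is to list the observations with large standardized residuals. This is only a sketch on simulated data (not the Galton data); the cutoff of 2 is a common rule of thumb assumed here, not something taken from the analysis above.

```r
# Simulated data for illustration only; NOT the Galton dataset.
set.seed(2)
father <- rnorm(200, mean = 69, sd = 2.5)
child  <- 39 + 0.4 * father + rnorm(200, sd = 3.4)
fit <- lm(child ~ father)

# Standardized residuals: residuals scaled to roughly unit variance.
std_res <- rstandard(fit)

# Indices of observations beyond the assumed cutoff.
cutoff  <- 2
extreme <- which(abs(std_res) > cutoff)

# Collect the flagged observations next to their standardized residuals.
flagged <- data.frame(
  father  = father[extreme],
  child   = child[extreme],
  std_res = std_res[extreme]
)
```

Rows in `flagged` correspond to the points that would sit farthest from the line in the tails of the Q-Q plot.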