Linear regression, also known as least-squares regression, models a linear relationship between a continuous response variable (Y) and a continuous predictor variable (X). It is a special case of the general linear model and is fitted in R with the lm() function.
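
To make "least squares" concrete, here is a minimal sketch on simulated data (not Galton's data; the variables x and y and the true intercept and slope are made up for illustration). It checks that lm() returns the same estimates as the closed-form least-squares formulas for a single predictor.

```r
# Simulated data for illustration only (assumed true intercept 2, slope 0.5)
set.seed(1)
x = rnorm(100, mean = 10, sd = 2)
y = 2 + 0.5 * x + rnorm(100, sd = 1)

# Closed-form least-squares estimates for one predictor
slope_hand = cov(x, y) / var(x)
intercept_hand = mean(y) - slope_hand * mean(x)

# The same fit via lm()
fit = lm(y ~ x)
coef(fit)

# The two approaches agree up to floating-point error
all.equal(unname(coef(fit)), c(intercept_hand, slope_hand))
```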

The term "regression" was coined by Charles Darwin’s cousin, Francis Galton, so let’s use his data on parent and child heights. To keep things simple, we will use only mothers and daughters.

require(mosaic) # be sure to install and load the usual packages if you have not already
require(lattice)
require(datasets)
data(Galton)
GaltonGirls = subset(Galton, sex=="F", drop=TRUE) # here we are creating a new dataset that contains only the daughters

It is always helpful to look at a plot first:

xyplot(height~mother, data=GaltonGirls, xlab="Mother's height (in)", ylab="Daughter's height (in)")


Now make the regression model, add the line to the figure, and summarize the model:

MomDaughter.lm = lm(height~mother, data=GaltonGirls) # note the similarity of the model and plotting statements
ladd(panel.abline(MomDaughter.lm)) # this command adds the line to the plot


summary(MomDaughter.lm) # this command prints a basic summary of any statistical model
## 
## Call:
## lm(formula = height ~ mother, data = GaltonGirls)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.881 -1.545  0.098  1.445  6.772 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  43.1555     3.0571   14.12  < 2e-16 ***
## mother        0.3266     0.0476    6.86  2.4e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.25 on 431 degrees of freedom
## Multiple R-squared:  0.0984, Adjusted R-squared:  0.0963 
## F-statistic:   47 on 1 and 431 DF,  p-value: 2.42e-11

Looking at the coefficients, the estimated intercept term is 43.16 while the slope estimate (mother) is 0.33. Thus, the resulting linear model relating daughter height (Y) to mother height (X) is Y = 43.16 + 0.33*X.
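
As a quick illustration of how to read the fitted equation, the sketch below plugs a mother's height into it by hand. The coefficients are copied from the summary output above; the 64-inch mother is an arbitrary example value, not something taken from the data.

```r
# Coefficients copied from the summary() output above
b0 = 43.1555   # intercept
b1 = 0.3266    # slope for mother

mother_height = 64                            # example value in inches (arbitrary)
predicted_daughter = b0 + b1 * mother_height  # Y = b0 + b1 * X
predicted_daughter                            # about 64.06 inches
```

With the fitted model object available, predict(MomDaughter.lm, newdata = data.frame(mother = 64)) should give essentially the same number, differing only through rounding of the coefficients.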

Based on the R-squared value, the mother's height explains only about 10% of the variance in daughter heights, so there are clearly other sources of variation in height.
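
For a regression with a single predictor, R-squared is tied to the other numbers in the summary. The sketch below recovers it from the slope's t value and the residual degrees of freedom. The inputs are copied from the rounded output above, so small rounding differences from the reported 0.0984 are expected.

```r
# Values copied from the summary() output above
t_slope = 0.3266 / 0.0476   # estimate / std. error, about 6.86
F_stat = t_slope^2          # with one predictor, F equals t squared, about 47
df_resid = 431              # residual degrees of freedom

# With one predictor, R-squared = F / (F + residual df)
r_squared = F_stat / (F_stat + df_resid)
r_squared                   # about 0.098
```

The square root of this value, about 0.31, is the correlation between mother and daughter heights in this sample.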

To get 95% confidence intervals for the slope and intercept estimates, use confint():

confint(MomDaughter.lm)
##              2.5 %  97.5 %
## (Intercept) 37.147 49.1641
## mother       0.233  0.4201

Because the 95% confidence interval on the slope estimate (0.23 to 0.42) is entirely positive, and therefore excludes zero, we can be confident that there is in fact a positive relationship between mother and daughter heights.
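
To see where these intervals come from, the sketch below rebuilds them by hand as estimate ± critical t value × standard error. The estimates, standard errors, and degrees of freedom are copied from the summary output above, so the results should match confint() up to rounding.

```r
# Estimates and standard errors copied from the summary() output above
est = c(intercept = 43.1555, mother = 0.3266)
se = c(intercept = 3.0571, mother = 0.0476)
df_resid = 431

# Two-sided 95% interval uses the 97.5th percentile of the t distribution
t_crit = qt(0.975, df = df_resid)   # about 1.97

lower = est - t_crit * se
upper = est + t_crit * se
round(cbind(lower, upper), 3)
```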