Linear regression, also known as least-squares regression, models a linear relationship between a continuous response variable (Y) and a continuous predictor variable (X). As such, it is a special case of a general linear model, implemented using the lm() function.
Regression was invented by Charles Darwin’s cousin, Francis Galton, so let’s use his data on parent and child heights. Specifically, let’s keep things simple by just using mothers and daughters.
require(mosaic) # be sure to install and load the usual packages if you have not already
require(lattice)
require(datasets)
data(Galton)
GaltonGirls = subset(Galton, sex=="F", drop=TRUE) # here we are creating a new dataset that contains only the daughters
It is always helpful to look at a plot first:
xyplot(height~mother, data=GaltonGirls, xlab="Mother's height (in)", ylab="Daughter's height (in)")
Now make the regression model, add the line to the figure, and summarize the model:
MomDaughter.lm = lm(height~mother, data=GaltonGirls) # note the similarity of the model and plotting statements
ladd(panel.abline(MomDaughter.lm)) # this command adds the line to the plot
summary(MomDaughter.lm) # this command prints a basic summary of any statistical model
##
## Call:
## lm(formula = height ~ mother, data = GaltonGirls)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -6.881 -1.545  0.098  1.445  6.772
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  43.1555     3.0571   14.12  < 2e-16 ***
## mother        0.3266     0.0476    6.86  2.4e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.25 on 431 degrees of freedom
## Multiple R-squared: 0.0984, Adjusted R-squared: 0.0963
## F-statistic: 47 on 1 and 431 DF, p-value: 2.42e-11
Looking at the coefficients, the estimated intercept term is 43.16 while the slope estimate (mother) is 0.33. Thus, the resulting linear model relating daughter height (Y) to mother height (X) is Y = 43.16 + 0.33*X.
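As a rough illustration of how this equation is used, the sketch below plugs a mother's height into the fitted line. The coefficients are copied from the summary output above rather than re-estimated. The helper name predict_daughter and the 64-inch example height are hypothetical choices made only for illustration.

```r
# Coefficients copied from the summary() output above (not re-estimated here)
intercept = 43.1555
slope = 0.3266

# Hypothetical helper: predicted daughter height (in) for a given mother height (in)
predict_daughter = function(mother_height) {
  intercept + slope * mother_height
}

predict_daughter(64)   # a 64-inch mother, chosen arbitrarily

# With the fitted model in the workspace, the equivalent call would be
# predict(MomDaughter.lm, newdata = data.frame(mother = 64))
```

For a 64-inch mother this gives a predicted daughter height of about 64.1 inches. Keep in mind that the line describes the average daughter height for mothers of that height, not any individual daughter.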
Based on the R-squared value, the mother's height explains only about 10% of the variance in daughter heights, so there are clearly other sources of variation in height.
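To make the meaning of R-squared more concrete, here is a minimal sketch that computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. So that the block runs without any extra packages, it uses simulated stand-in data, not Galton's measurements. The sample size, the mean and spread of mother heights, and the object names sim and sim.lm are all assumptions made for illustration.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated stand-in data (NOT Galton's measurements); heights in inches.
# The sample size and the mean/sd of mother heights are assumed values.
mother = rnorm(200, mean = 64, sd = 2.3)
height = 43.16 + 0.33 * mother + rnorm(200, mean = 0, sd = 2.25)
sim = data.frame(mother = mother, height = height)

sim.lm = lm(height ~ mother, data = sim)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res = sum(residuals(sim.lm)^2)
ss_tot = sum((sim$height - mean(sim$height))^2)
r2_by_hand = 1 - ss_res / ss_tot

r2_by_hand                  # computed from the definition
summary(sim.lm)$r.squared   # value reported by summary(); the two should agree
```

The same two sums of squares could be computed for MomDaughter.lm and GaltonGirls to reproduce the Multiple R-squared value in the summary.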
If you want to know the 95% confidence intervals on the slope and intercept estimates, you can try this:
confint(MomDaughter.lm)
##              2.5 %  97.5 %
## (Intercept) 37.147 49.1641
## mother       0.233  0.4201
Because the 95% confidence interval on the slope estimate (0.23 to 0.42) is entirely positive and therefore excludes zero, we can be confident that there is in fact a positive relationship between mother and daughter heights.
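For readers curious where these limits come from, the sketch below rebuilds the slope interval by hand as the estimate plus or minus a critical t value times the standard error. The inputs are copied from the summary output above. The variable names are illustrative, and small differences from the confint() output are expected because the copied values are rounded.

```r
# Values copied from the summary() output above
estimate = 0.3266    # slope estimate for mother
std_error = 0.0476   # its standard error
df_resid = 431       # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit = qt(0.975, df = df_resid)

# Interval = estimate +/- critical value * standard error
slope_ci = estimate + c(-1, 1) * t_crit * std_error
slope_ci   # should be close to the confint() row for mother
```

The result should be roughly 0.23 to 0.42, matching the mother row reported by confint().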