Concepts

Today in class we covered several different concepts. The first was multiple linear regression, which we use when we want to include more than one predictor variable in the regression equation for our response. The general idea is that we estimate more parameters in our model so that it accounts for more predictor variables, which will hopefully make the model more accurate.
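As a minimal sketch of this idea, we could simulate a response from two made-up predictors and fit both at once (all variable names here are invented for illustration):

set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 3*x1 - 1*x2 + rnorm(100)   # true intercept 2, slopes 3 and -1
fit <- lm(y ~ x1 + x2)               # one estimated slope per predictor, plus an intercept
coef(fit)                            # estimates should land near 2, 3, and -1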

The second concept we learned today was indicator functions, or dummy variables. We use these to include categorical variables in our regression equation. Generally, we do this by splitting the categorical variable into its groups and estimating a parameter in the model for each group beyond the first. For example, if we had 3 categorical groups we would need 2 estimated parameters in our model to account for the effect of each group relative to the baseline group, as the sketch below shows.
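To see how R builds these indicators, we could look at the model matrix for a hypothetical 3-level factor (the name group and its levels are made up for illustration):

group <- factor(c("A", "A", "B", "B", "C", "C"))  # hypothetical 3-level categorical variable
model.matrix(~ group)  # R keeps "A" as the baseline and creates 2 dummy columns, groupB and groupC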

I will now go through an example which illustrates the key equations and functions that are needed when using these concepts. I will use Galton's data, which includes the height of a child in a family, the gender of that child, the father's height, and the mother's height, among other variables. We will model the child's height based on the height of their mother and their gender.

Galton <- read.csv("http://cknudson.com/data/Galton.csv")  # read in Galton's family height data
mod <- lm(Height ~ MotherHeight + Gender, data = Galton)   # regress child's height on mother's height and gender
summary(mod)
## 
## Call:
## lm(formula = Height ~ MotherHeight + Gender, data = Galton)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4036 -1.6024  0.1528  1.5890  9.4199 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.44952    2.20949   18.76   <2e-16 ***
## MotherHeight  0.35314    0.03439   10.27   <2e-16 ***
## GenderM       5.17669    0.15867   32.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.374 on 895 degrees of freedom
## Multiple R-squared:  0.5618, Adjusted R-squared:  0.5608 
## F-statistic: 573.7 on 2 and 895 DF,  p-value: < 2.2e-16

From the summary, we can see that R created an indicator GenderM, which is 1 if the child is male and 0 if female. Since our categorical variable has only 2 categories, we only need one indicator function.
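We can confirm this coding by inspecting the model matrix that R actually built for our fit:

head(model.matrix(mod))  # columns: (Intercept), MotherHeight, GenderM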

We can also see that the regression equation is PredictedHeight = 41.4495 + 0.3531 * MotherHeight + 5.1767 * I(male), where I(male) is the indicator that equals 1 for a male child and 0 for a female child.

We can then predict the height of a male whose mother’s height is 64 inches and a female whose mother’s height is 64 inches.

41.4495 + 0.3531*64 + 5.1767*1  # predicted height for a male child
## [1] 69.2246
41.4495 + 0.3531*64 + 5.1767*0  # predicted height for a female child
## [1] 64.0479

The predicted height for the male is 69.22 inches and is 64.05 inches for the female.
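As a sanity check, we could get the same predictions with R's predict() function rather than plugging in by hand (newkids is just an illustrative name; this assumes Gender is coded "M" and "F" in the data, as the GenderM coefficient suggests):

newkids <- data.frame(MotherHeight = c(64, 64), Gender = c("M", "F"))  # one male, one female
predict(mod, newdata = newkids)  # should match the hand calculations above, up to rounding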

In this case, we will not interpret the intercept because it does not make sense contextually (it would be the predicted height of a female child whose mother is 0 inches tall). However, the interpretation of the parameter for mother's height is that for every one-inch increase in mother's height, the child's predicted height increases by 0.3531 inches, holding gender fixed. The interpretation of the parameter for gender is that, if the child is male, his predicted height is 5.1767 inches greater than that of a female child whose mother has the same height.

We can also plot our data points along with the separate regression lines, one for males and another for females.

# coerce Gender to a factor so the points are colored by gender (females black, males red)
plot(Galton$MotherHeight, Galton$Height, col = factor(Galton$Gender))
abline(coef(mod)[1] + coef(mod)[3], coef(mod)[2], col = "red")  # male line: shifted intercept, same slope
abline(coef(mod)[1], coef(mod)[2])                              # female line: baseline intercept

We can see that the males' data is shown in red and the females' data in black, with fitted lines in their respective colors. In this way we can convince ourselves that the regression equation we created seems to describe our data accurately.
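If we wanted the plot to be self-explanatory, we could also add a legend; this is an optional extra, not part of the original plot:

legend("topleft", legend = c("Female", "Male"), col = c("black", "red"), pch = 1)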

Comparison to other topics

This topic is fairly straightforward since it is very similar to simple linear regression and follows many of the same principles. However, it is an important topic because you can very rarely model a response accurately with a single predictor. It is often necessary to add more predictors, and this is exactly what multiple linear regression allows us to do.