Lecture: Multiple Linear Regression

Slides for Week 5 are available here.

In this tutorial, you will learn how to:

  • run multiple regression models with continuous and dummy variables
  • identify and interpret coefficients, p-value, and goodness of fit measures
  • make decisions to select the best fit model

Basic Set Up and load all necessary libraries

options(scipen=999, digits = 3) # remove scientific notation

# install.packages("rsq")
library(rsq)

Step 0. Look at the Data Documentation Check the Documentation for Birthweight_Smoking in your project folder.

Step 1. Import a Dataset (+Save a Copy!)

Your filename.csv needs to be exact and indicate the file format, which is a .csv file (Comma Separated Values file)–a delimited text file that uses a comma to separate values. It is one type of data-friendly Excel spreadsheets besides the common .xlsx format)

df <- read.csv("birthweight_smoking.csv")

df1 <- df # df1 is now the spare copy shall we "mess up" our `df`!

# Now in your global environment, you should see 'df' and `df1` stored. 
# It has 3,000 observations (i.e., 3,000 rows) and 12 variables (i.e., 12 columns). 

Exploratory Data Analysis

For more exploratory data analysis, see previous weeks’ lab materials.

dim(df)
## [1] 3000   12
# Look at the summary statistics of the data frame!
summary(df)
##     nprevist     alcohol         tripre1         tripre2         tripre3     
##  Min.   : 0   Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 9   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:0.000  
##  Median :12   Median :0.000   Median :1.000   Median :0.000   Median :0.000  
##  Mean   :11   Mean   :0.019   Mean   :0.804   Mean   :0.153   Mean   :0.033  
##  3rd Qu.:13   3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :35   Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000  
##     tripre0      birthweight       smoker        unmarried          educ     
##  Min.   :0.00   Min.   : 425   Min.   :0.000   Min.   :0.000   Min.   : 0.0  
##  1st Qu.:0.00   1st Qu.:3062   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:12.0  
##  Median :0.00   Median :3420   Median :0.000   Median :0.000   Median :12.0  
##  Mean   :0.01   Mean   :3383   Mean   :0.194   Mean   :0.227   Mean   :12.9  
##  3rd Qu.:0.00   3rd Qu.:3750   3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:14.0  
##  Max.   :1.00   Max.   :5755   Max.   :1.000   Max.   :1.000   Max.   :17.0  
##       age           drinks     
##  Min.   :14.0   Min.   : 0.00  
##  1st Qu.:23.0   1st Qu.: 0.00  
##  Median :27.0   Median : 0.00  
##  Mean   :26.9   Mean   : 0.06  
##  3rd Qu.:31.0   3rd Qu.: 0.00  
##  Max.   :44.0   Max.   :21.00

Create a Smaller Data Frame

First, let’s create a new data frame with the variables we will use. We can subset our original data frame by selecting the columns pretty quickly. Specify the column names through this code: data[,c("variable1", "variable2")].

dataMini <- df[,c("birthweight","nprevist","age","educ","drinks","smoker","alcohol","unmarried")] 
# subset data frame by shortlisting named columns

na.omit(): Remove missing values

# remove NA (i.e., rows with missing values) first 
dataMini <- na.omit(dataMini)  
    # this dataset is clean, but in reality, data tend to have missing values

Multiple Linear Regression Models

Our first basic model with three IVs, namely smoker, alcohol and nprevist:

lm1 <- lm(birthweight ~ smoker + alcohol + nprevist, data = df) 
summary(lm1)
## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2733.5  -307.6    21.4   358.1  2192.7 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  3051.25      34.02   89.70 < 0.0000000000000002 ***
## smoker       -217.58      26.68   -8.16  0.00000000000000051 ***
## alcohol       -30.49      76.23   -0.40                 0.69    
## nprevist       34.07       2.85   11.93 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 570 on 2996 degrees of freedom
## Multiple R-squared:  0.0729, Adjusted R-squared:  0.0719 
## F-statistic: 78.5 on 3 and 2996 DF,  p-value: <0.0000000000000002

1. Number of Observations used in lm1

When working with datasets that have missing values in some variables, it’s important to check the number of observations in each regression model. The number may vary depending on which variables are included in the model. However, the birthweight data we’re working with here is very clean, so the number of observations is the same across all models. Nonetheless, it’s always a good practice to check the sample size in each regression you compute!

nobs(lm1)
## [1] 3000

2. The F-statistic

The F value is the ratio of the mean regression sum of squares divided by the mean error sum of squares.

It is the probability that the null hypothesis (H0) for the full model (i.e., all of the regression coefficients are jointly zero) is true. Null hypothesis (H0) is which we try to disprove or reject at the 0.05 level.

F-statistic for the first model is 78.5 and the p-value is close to 0. The model is significant at the 0.05 level, we can reject H0, meaning there is at least one significant coefficient in the model.


3. Adjusted R-Square Value

\(Adjusted R^2\) and \(R^2\) measure how close the data are to the fitted regression line. 0% indicates that the model explains none of the variability of the response data around its mean. How much of the variance in birthweight is explained by our independent variables?

Adjusted R-squared for lm1: 0.0719

It means that model lm1 (with smoker, alcohol and nprevist as independent variables) explains about 7.19% of the variance in birthweight.

# recall lm1 summary
summary(lm1)
## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2733.5  -307.6    21.4   358.1  2192.7 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  3051.25      34.02   89.70 < 0.0000000000000002 ***
## smoker       -217.58      26.68   -8.16  0.00000000000000051 ***
## alcohol       -30.49      76.23   -0.40                 0.69    
## nprevist       34.07       2.85   11.93 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 570 on 2996 degrees of freedom
## Multiple R-squared:  0.0729, Adjusted R-squared:  0.0719 
## F-statistic: 78.5 on 3 and 2996 DF,  p-value: <0.0000000000000002

Partial R-Squared Value

Partial R-Squared for each independent variable (for glm or lm models) The partial R-square (or coefficient of partial determination or a squared partial correlation) measures a fully partialled proportion of the variance in \(Y\) (the unique contribution) explained by \(Xi\) over the variance in \(Y\) that is not associated with any other predictors already included in the model.

partial.rsq for nprevist = 0.0453756

It means that nprevist explains about 4.54% of the variance in birthweight in lm1.

rsq.partial(lm1) # rsq library
## $adjustment
## [1] FALSE
## 
## $variable
## [1] "smoker"   "alcohol"  "nprevist"
## 
## $partial.rsq
## [1] 0.0217171 0.0000534 0.0453756
# Partial R-Squared for each independent variable

4. The Intercept

What is the average infant birthweight when smoker, alcohol and nprevist are zero?

The Constant is 3051.25 grams, meaning when all our variables are 0, our model predicts the average infant birthweight will be 3051.25 grams.

5. Parameter Estimates (beta coefficients and significance)

What are the coefficients for smoker, alcohol and nprevist? Are they statistically significant at the 0.05 level?

Model lm1

  • Based on the adjusted r-squared value, 7.19% of the variance in our dependent variable is explained by the first model.

  • Both smoking mothers (smoker) and the number of prenatal visit (nprevist) are significant predictors of infant birthweight, but not for mothers drinking alcohol during pregnancy (alcohol) at the 0.05 level (p < 0.05 for both variables).

  • smoker: Compared to children whose mothers were non-smokers, the average birthweight of children whose mothers were smokers is significantly lower by 217.58 grams on average at the 0.05 level, holding all other variables constant.

  • nprevist: A 1-unit increase in the number of prenatal visit is associated with an increase in the infant birthweight (DV) by 34.07 grams, holding other variables constant.

  • alcohol: The coefficient for alcohol is not statistically significant at the 0.05 level. (You may skip the coefficient interpretation when a variable is not statistically significant.)

Hierarchical Regression

Hierarchical regression is a way to show if variables of your interest explain a statistically significant amount of variance in your Dependent Variable (DV) after accounting for all other variables. This is a framework for model comparison rather than a statistical method. In this framework, you build several regression models by adding variables to a previous model at each step; later models always include smaller models in previous steps. In many cases, our interest is to determine whether newly added variables show a significant improvement in R2 (the proportion of explained variance in DV by the model). (Read more here)

Recall our first model

summary(lm1)
## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2733.5  -307.6    21.4   358.1  2192.7 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  3051.25      34.02   89.70 < 0.0000000000000002 ***
## smoker       -217.58      26.68   -8.16  0.00000000000000051 ***
## alcohol       -30.49      76.23   -0.40                 0.69    
## nprevist       34.07       2.85   11.93 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 570 on 2996 degrees of freedom
## Multiple R-squared:  0.0729, Adjusted R-squared:  0.0719 
## F-statistic: 78.5 on 3 and 2996 DF,  p-value: <0.0000000000000002

Our second model: adding unmarried to the model

lm2 <- lm(birthweight ~ smoker + alcohol + nprevist + unmarried, data = df) 

summary(lm2)
## 
## Call:
## lm(formula = birthweight ~ smoker + alcohol + nprevist + unmarried, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2798.8  -309.2    25.4   361.8  2363.7 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   3134.4       35.7   87.91 < 0.0000000000000002 ***
## smoker        -175.4       27.1   -6.47     0.00000000011275 ***
## alcohol        -21.1       75.6   -0.28                 0.78    
## nprevist        29.6        2.9   10.21 < 0.0000000000000002 ***
## unmarried     -187.1       26.0   -7.20     0.00000000000078 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 566 on 2995 degrees of freedom
## Multiple R-squared:  0.0886, Adjusted R-squared:  0.0874 
## F-statistic: 72.8 on 4 and 2995 DF,  p-value: <0.0000000000000002

Model Selection

Information criteria, unlike significance tests, are indices that allow for model comparison accounting for model complexity and sample size.

Smaller models are better and more efficient. Smaller values of AIC and BIC is better. In the case where AIC and BIC give conflicting results, we will use BIC as the criteria for a more conservative estimate.

Which model has the highest adjusted R-squared value and the lowest AIC and BIC values?

# add lm3 to the brackets below

AIC(lm1, lm2)
##     df   AIC
## lm1  5 46598
## lm2  6 46549
BIC(lm1, lm2)
##     df   BIC
## lm1  5 46628
## lm2  6 46585