The slides for the Lab 3 session are now available to review here.

Install and load the packages that will be used in this tutorial (rsq for partial R-squared, ppcor for partial correlations, and ggplot2 for plotting).

install.packages(c("rsq", "ppcor", "ggplot2"))
library(rsq)
library(ppcor)
library(ggplot2)

options(digits = 2) # results will round to two significant digits
good <- read.csv("good.csv")
attach(good) # This step is optional. If you decide not to attach,
             # you must either specify 'data = good' in your regression model
             # or prefix every variable name with 'good$'.

Recall what we have already done in Lab 2 here.

Create a New Data Frame for the Selected Variables

goodreg <- na.omit(good[, c("readss97","AGE97","HOME97","bthwht")])
# Results are the same as with the full 'good' data; keeping only these
# 4 of the 112 variables is just more convenient and efficient.
# na.omit() removes rows containing missing values (NA).

Exploratory Data Analysis

Examine Univariate Statistics

summary(goodreg) 
##     readss97       AGE97          HOME97         bthwht    
##  Min.   : 48   Min.   : 3.0   Min.   : 7.9   Min.   :0.00  
##  1st Qu.: 92   1st Qu.: 5.0   1st Qu.:18.1   1st Qu.:0.00  
##  Median :101   Median : 7.0   Median :20.5   Median :0.00  
##  Mean   :102   Mean   : 7.5   Mean   :20.2   Mean   :0.37  
##  3rd Qu.:112   3rd Qu.:10.0   3rd Qu.:22.5   3rd Qu.:1.00  
##  Max.   :166   Max.   :13.0   Max.   :27.0   Max.   :1.00
str(goodreg)
## 'data.frame':    1964 obs. of  4 variables:
##  $ readss97: num  81 112 93 116 143 ...
##  $ AGE97   : int  4 6 12 12 13 6 4 9 7 4 ...
##  $ HOME97  : num  15.1 24 19.1 21 23.1 13.5 16 18.6 16.6 16 ...
##  $ bthwht  : int  1 0 1 1 1 1 1 0 0 1 ...
##  - attr(*, "na.action")= 'omit' Named int  1 2 4 5 7 10 14 15 23 26 ...
##   ..- attr(*, "names")= chr  "1" "2" "4" "5" ...

Bar Chart for a Dummy Variable

bw.tab <- table(goodreg$bthwht) # frequency table
bw.tab
## 
##    0    1 
## 1238  726
prop.table(table(goodreg$bthwht)) # tables of proportions 
## 
##    0    1 
## 0.63 0.37
# or prop.table(bw.tab)
barplot(bw.tab,xlab="Birthweight", ylab="N", col=c("grey", "light blue"), 
        main = "Graph 1: Bar Chart of Birthweight",  ylim = c(0, 1500))

Histogram and Kernel Density Plot for a Continuous Variable

Instead of counting the number of data points per bin, R can plot probability densities. For instance, a kernel density plot produces a smooth curve estimating the probability density function of a continuous variable.

# Histogram
hist(goodreg$readss97, breaks = 50, col = "orange",
     main = "Graph 2.1: Histogram of Readss97",
     xlab = "Reading Scores in 1997")

# Histogram with density plot
ggplot(goodreg, aes(x=readss97)) + 
  labs(title = "Graph 2.2: Histogram and Kernel Density of Readss97") +
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666")  # alpha: color opacity
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# simplified version of density plot
r <- density(goodreg$readss97)
plot(r)

Your turn! Do the same for the other variables.


Examine Bivariate Statistics

Look at their correlations and partial correlations: cor(), cor.test(), and pcor().
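
Before running the pairwise tests below, you can scan the full correlation matrix in one call (a quick sketch; cor() works here because all four columns are numeric):

cor(goodreg) # matrix of pairwise Pearson correlations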

cor.test(goodreg$readss97,goodreg$AGE97)
## 
##  Pearson's product-moment correlation
## 
## data:  goodreg$readss97 and goodreg$AGE97
## t = 5, df = 2000, p-value = 2e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.074 0.161
## sample estimates:
##  cor 
## 0.12
cor.test(goodreg$readss97,goodreg$HOME97)
## 
##  Pearson's product-moment correlation
## 
## data:  goodreg$readss97 and goodreg$HOME97
## t = 20, df = 2000, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.34 0.41
## sample estimates:
##  cor 
## 0.38
cor.test(goodreg$readss97,goodreg$bthwht)
## 
##  Pearson's product-moment correlation
## 
## data:  goodreg$readss97 and goodreg$bthwht
## t = -4, df = 2000, p-value = 1e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.141 -0.054
## sample estimates:
##    cor 
## -0.098
# which IV has the largest correlation coefficient (r)?
pcor(goodreg, method = c("pearson")) 
## $estimate
##          readss97 AGE97 HOME97 bthwht
## readss97    1.000 0.075  0.351 -0.089
## AGE97       0.075 1.000  0.174  0.271
## HOME97      0.351 0.174  1.000 -0.095
## bthwht     -0.089 0.271 -0.095  1.000
## 
## $p.value
##          readss97   AGE97  HOME97  bthwht
## readss97  0.0e+00 9.3e-04 4.9e-58 7.7e-05
## AGE97     9.3e-04 0.0e+00 9.6e-15 1.9e-34
## HOME97    4.9e-58 9.6e-15 0.0e+00 2.4e-05
## bthwht    7.7e-05 1.9e-34 2.4e-05 0.0e+00
## 
## $statistic
##          readss97 AGE97 HOME97 bthwht
## readss97      0.0   3.3   16.6   -4.0
## AGE97         3.3   0.0    7.8   12.5
## HOME97       16.6   7.8    0.0   -4.2
## bthwht       -4.0  12.5   -4.2    0.0
## 
## $n
## [1] 1964
## 
## $gp
## [1] 2
## 
## $method
## [1] "pearson"
# pcor() runs without error because we removed all NAs with na.omit() above

3.1 Multiple Linear Regression Model

# The lm() function uses na.omit as its factory-default na.action
lm <- lm(readss97~AGE97 + HOME97 + bthwht, data = goodreg, na.action=na.omit)
# (Naming an object 'lm' shadows the base lm() function; a name like 'fit' is safer style.)
lm1 <- lm(goodreg$readss97~goodreg$AGE97 + goodreg$HOME97 + goodreg$bthwht)
lm2 <- lm(readss97~AGE97 + HOME97 + bthwht, data = goodreg)
# All three calls produce the same fitted model and output
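# A quick sanity check that the fits agree (a sketch; lm1's coefficient
# names differ because of the goodreg$ prefixes, so compare lm and lm2):
all.equal(coef(lm), coef(lm2)) # should return TRUE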

Use the summary() function to see your detailed regression output.

summary(lm)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.96  -9.52  -0.31   9.70  54.50 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   63.342      2.266   27.96  < 2e-16 ***
## AGE97          0.399      0.120    3.32  0.00093 ***
## HOME97         1.829      0.110   16.61  < 2e-16 ***
## bthwht        -2.848      0.719   -3.96  7.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.149 
## F-statistic:  116 on 3 and 1960 DF,  p-value: <2e-16
# What outputs can you see from the summary? 

3.2 Interpreting the Regression Model

Number of Observations and Sample Size (N)

# number of observations (i.e., rows) in your dataset
nrow(good)
## [1] 3563
# If you fit the model on the raw data (data = good), the regression summary
# reports how many observations were deleted due to missingness; here we
# pre-cleaned with na.omit(), so we compute that count ourselves below.
summary(lm)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.96  -9.52  -0.31   9.70  54.50 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   63.342      2.266   27.96  < 2e-16 ***
## AGE97          0.399      0.120    3.32  0.00093 ***
## HOME97         1.829      0.110   16.61  < 2e-16 ***
## bthwht        -2.848      0.719   -3.96  7.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.149 
## F-statistic:  116 on 3 and 1960 DF,  p-value: <2e-16
# 1599 observations were dropped by na.omit() due to missingness.
# Observations used = nrow(good) - observations deleted.
# Number of observations used in the regression model:
nobs(lm)
## [1] 1964
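# The deleted count can also be computed directly:
nrow(good) - nobs(lm) # 3563 - 1964 = 1599 rows dropped by na.omit()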

The F-statistic

The F value is the ratio of the mean regression sum of squares to the mean error sum of squares. Its associated p-value is the probability of observing an F statistic at least this large if the null hypothesis (H0) for the full model (i.e., that all of the regression coefficients are jointly zero) were true. The null hypothesis (H0) is what we try to disprove or reject, e.g., at the 0.05 level.

summary(lm)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.96  -9.52  -0.31   9.70  54.50 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   63.342      2.266   27.96  < 2e-16 ***
## AGE97          0.399      0.120    3.32  0.00093 ***
## HOME97         1.829      0.110   16.61  < 2e-16 ***
## bthwht        -2.848      0.719   -3.96  7.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.149 
## F-statistic:  116 on 3 and 1960 DF,  p-value: <2e-16
# It is significant at the .0001 level, so we can reject H0.
# This means that at least one coefficient in the model differs from zero.
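# To see where that line comes from, you can pull the F statistic out of the
# summary object and recompute its p-value yourself (a sketch using base R):
f <- summary(lm)$fstatistic # named vector: value = 116, numdf = 3, dendf = 1960
pf(f[1], f[2], f[3], lower.tail = FALSE) # matches the reported p-value (<2e-16)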

R-Squared Value (or coefficient of determination)

R^2 measures how close the data are to the fitted regression line: 0% indicates that the model explains none of the variability of the response data around its mean, and 100% indicates that it explains all of it.

# How much of the variance in readss97 is explained by our independent variables?
summary(lm)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.96  -9.52  -0.31   9.70  54.50 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   63.342      2.266   27.96  < 2e-16 ***
## AGE97          0.399      0.120    3.32  0.00093 ***
## HOME97         1.829      0.110   16.61  < 2e-16 ***
## bthwht        -2.848      0.719   -3.96  7.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.149 
## F-statistic:  116 on 3 and 1960 DF,  p-value: <2e-16
  # It is .15, meaning that our model (with predictors AGE97, HOME97, and
  # bthwht) explains about 15% of the variance in readss97.
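  # As a check on the definition, R-squared can be computed by hand
  # from the sums of squares (a sketch):
sse <- sum(residuals(lm)^2) # error (residual) sum of squares
sst <- sum((goodreg$readss97 - mean(goodreg$readss97))^2) # total sum of squares
1 - sse / sst # should match the Multiple R-squared of 0.151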

The Intercept

# What is the predicted average reading score when AGE97, HOME97, and bthwht are all zero?
summary(lm)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.96  -9.52  -0.31   9.70  54.50 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   63.342      2.266   27.96  < 2e-16 ***
## AGE97          0.399      0.120    3.32  0.00093 ***
## HOME97         1.829      0.110   16.61  < 2e-16 ***
## bthwht        -2.848      0.719   -3.96  7.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.149 
## F-statistic:  116 on 3 and 1960 DF,  p-value: <2e-16
  # The constant is 63.34: when all our variables are 0, the model predicts
  # an average reading score of 63.34. (Since AGE97's minimum is 3, this is
  # an extrapolation beyond the observed data.)
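  # Verify with predict() on a hypothetical all-zero case (a sketch;
  # no real child in the data has AGE97 = 0):
predict(lm, newdata = data.frame(AGE97 = 0, HOME97 = 0, bthwht = 0))
# returns the intercept, about 63.34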

Parameter Estimates

# What are the coefficients for AGE97, HOME97, and bthwht? Are they statistically significant?
summary(lm)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.96  -9.52  -0.31   9.70  54.50 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   63.342      2.266   27.96  < 2e-16 ***
## AGE97          0.399      0.120    3.32  0.00093 ***
## HOME97         1.829      0.110   16.61  < 2e-16 ***
## bthwht        -2.848      0.719   -3.96  7.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.149 
## F-statistic:  116 on 3 and 1960 DF,  p-value: <2e-16
  # e.g., children with low birth weight are predicted to have reading scores
  # almost 3 points lower than children without low birth weight,
  # holding all other variables constant.
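  # Interval estimates complement the point estimates; base R's confint()
  # gives 95% confidence intervals for each coefficient (a sketch):
confint(lm) # intervals excluding 0 correspond to coefficients with p < .05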

Partial R-Squared for Each Independent Variable (for glm or lm models)

The partial R-square (also called the coefficient of partial determination, or a squared partial correlation) measures the proportion of the variance in Y that is uniquely explained by Xi, out of the variance in Y left unexplained by the other predictors already included in the model.

rsq.partial(lm)
## $adjustment
## [1] FALSE
## 
## $variable
## [1] "AGE97"  "HOME97" "bthwht"
## 
## $partial.rsq
## [1] 0.0056 0.1233 0.0079
# Alternatively, you may use another package called "lmSupport"
  # install.packages("lmSupport")
  # library(lmSupport)
  # modelEffectSizes(lm)
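# To see where those numbers come from, the partial R-squared for HOME97
# can be reproduced by hand from nested models (a sketch):
full <- lm(readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg)
reduced <- lm(readss97 ~ AGE97 + bthwht, data = goodreg) # drop HOME97
sse_full <- sum(residuals(full)^2)
sse_reduced <- sum(residuals(reduced)^2)
(sse_reduced - sse_full) / sse_reduced # should match the 0.1233 above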

Standardized Coefficients

Because model coefficients are in different units, standardized coefficients help us compare effect sizes across multiple variables. A standardized coefficient measures the effect of a one-SD change in Xj, as opposed to a one-unit change in Xj.

By placing everything on a common metric of SDs, we get comparability across the βj's. In other words, it is an interpretation aid. Statistical significance does not change!

goodreg$bthwht <- as.numeric(goodreg$bthwht) # ensure numeric input for scale(); note that standardized coefficients do not strictly apply to dummy variables
linearM <- lm(scale(readss97)~scale(AGE97) + scale(HOME97) + scale(bthwht), 
              data = goodreg)
summary(linearM)
## 
## Call:
## lm(formula = scale(readss97) ~ scale(AGE97) + scale(HOME97) + 
##     scale(bthwht), data = goodreg)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.432 -0.594 -0.019  0.605  3.403 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.29e-16   2.08e-02    0.00  1.00000    
## scale(AGE97)   7.29e-02   2.20e-02    3.32  0.00093 ***
## scale(HOME97)  3.55e-01   2.14e-02   16.61  < 2e-16 ***
## scale(bthwht) -8.59e-02   2.17e-02   -3.96  7.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.92 on 1960 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.149 
## F-statistic:  116 on 3 and 1960 DF,  p-value: <2e-16
coefficients(linearM)
##   (Intercept)  scale(AGE97) scale(HOME97) scale(bthwht) 
##      -6.3e-16       7.3e-02       3.6e-01      -8.6e-02
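# The same values can be recovered by rescaling the unstandardized
# coefficients, since beta_std = b * SD(x) / SD(y) (a sketch):
coef(lm)["HOME97"] * sd(goodreg$HOME97) / sd(goodreg$readss97)
# should match the scale(HOME97) coefficient of about 0.355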

3.3 Hierarchical Regression

lm3 <- lm(goodreg$readss97~goodreg$AGE97 + goodreg$bthwht)
summary(lm3)
## 
## Call:
## lm(formula = goodreg$readss97 ~ goodreg$AGE97 + goodreg$bthwht)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -55.68  -9.99  -0.56  10.15  62.32 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      97.728      0.982   99.52  < 2e-16 ***
## goodreg$AGE97     0.827      0.125    6.59  5.7e-11 ***
## goodreg$bthwht   -4.469      0.761   -5.88  5.0e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16 on 1961 degrees of freedom
## Multiple R-squared:  0.031,  Adjusted R-squared:  0.03 
## F-statistic: 31.4 on 2 and 1961 DF,  p-value: 3.94e-14
summary(lm)
## 
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.96  -9.52  -0.31   9.70  54.50 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   63.342      2.266   27.96  < 2e-16 ***
## AGE97          0.399      0.120    3.32  0.00093 ***
## HOME97         1.829      0.110   16.61  < 2e-16 ***
## bthwht        -2.848      0.719   -3.96  7.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared:  0.151,  Adjusted R-squared:  0.149 
## F-statistic:  116 on 3 and 1960 DF,  p-value: <2e-16
# Run a series of models, adding variables and noting the change in R-squared.
# Comparing the model before and after you add independent variable(s)
#   examines the unique contribution of an independent variable above and
#   beyond the other variables' contributions, and the covariation/contribution
#   shared with other independent variables.
# Note changes in R-squared and the F-statistic; a formal test is sketched below.
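# One formal way to make that comparison is a nested-model F test (a sketch):
reduced <- lm(readss97 ~ AGE97 + bthwht, data = goodreg) # same fit as lm3
anova(reduced, lm) # a significant F means HOME97 adds explanatory power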

Plotting the Multiple Linear Regression Model

# Note: 'x = AGE97 + HOME97 + bthwht' plots readss97 against the SUM of the
# predictors (a composite index), since a 4-dimensional fit can't be drawn directly.
plot <- ggplot(goodreg, aes(x = AGE97 + HOME97 + bthwht, y = readss97, col = bthwht)) +
        labs(title = "Graph 3: Multiple Linear Regression of Reading Score in 1997", 
            subtitle = "Lab 3", 
            caption = "(based on the good dataset)") +
        geom_point(color='lightpink4', size=1.5) +
        geom_smooth(method="lm", colour="black", size=1) 
plot
plot

Refer to Lab 2 slides here and Lab 2 and 3 Notes in Canvas Folder.


Assignment 3 due at 12:59 pm February 18th, 2019 (Next Monday)

Use the dataset “statekffmultregression” in Canvas. It includes the following variables:

  • state;
  • the official poverty rate;
  • the proportion of the state that rates its health as fair or poor;
  • the percent of adults in the state that smoke cigarettes;
  • the proportion of people in the state without health insurance

1) Run a model in which you use unemployment rate to predict the proportion of adults that rate their physical health as fair or poor. Call this model 1 and ask R for the summary. What do you find? What are the Model and Adjusted R-squared values? Interpret the coefficient of the unemployment rate.

2) Run a model in which you use the official poverty rate, the unemployment rate, and the percent of people that smoke to predict the proportion of adults that rate their physical health as fair or poor. Call this model 2. What are the Model and Adjusted R-squared values? What are the coefficients, and which are statistically significant?

3) Run model 2 with standardized coefficients, and call it model 3. Interpret your coefficients now.
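
A starter sketch for loading the data (the file extension and object name below are placeholders; check the actual file and variable names in Canvas):

statekff <- read.csv("statekffmultregression.csv") # hypothetical path/extension
names(statekff) # confirm the real variable names before building your models
# Then build models 1, 2, and 3 with lm() and scale() as shown above.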