MSSP 897 Lab 4: Multiple Linear Regression With Categorical/Dummy Variables

The slides for Lab 4 session is now available to review here.

Run the packages that will be used for this tutorial.

library(rsq)
options(scipen=999, digits = 2) # avoid scientific notation, round to two significant digits

Import the good dataset into R and and assign the dataframe the variable name ‘good’

good <- read.csv("good.csv") 
# or, type the full path of the .csv file, remember using forward slash "/"

4.1 Constructing a Categorical Variable

Overview of Today’s Lab

Use the categorical variable PCBTHORD (birth order) to create a set of new categorical birth order variables
Convert the categorical variables into dichotomous dummy variables.
Run a regression model with new dummy variables
Run regression models with new dummy variables and additional continuious variables

Explore the categorical variable PCBTHORD (birth order)

summary(good$PCBTHORD)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2      98      51      99      99

hist(good$PCBTHORD, breaks = 100)

plot(good$PCBTHORD)

table(good$PCBTHORD)

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##  567  428  257  161  127   60   64   38   25   15   11    4   13    4    2 
##   98   99 
##    8 1779

Assign all current values for the variable PCBTHORD into one of four categories:

missing (for values of 98 or 99),
1 (for children born first),
2 (for children born 2nd),
3 (for children born 3rd) and
4 (for children born 4th and later)

Use ifelse statements telling R IF a certain condition is met (e.g. a variable equals a certain value or is in a certain range) ELSE R should do something else that you specify.

good$PCBTHORD <- ifelse(good$PCBTHORD == 98, NA, 
                        ifelse(good$PCBTHORD == 99, NA, 
                               ifelse(good$PCBTHORD == 1, 1, 
                                      ifelse(good$PCBTHORD == 2, 2, 
                                             ifelse(good$PCBTHORD == 3, 3, 
                                                    ifelse(good$PCBTHORD >= 4, 4, NA))))))

See how the variable PCBTHORD has been transformed:

hist(good$PCBTHORD, breaks = 100)

plot(good$PCBTHORD) # how many levels of value do we have now?

table(good$PCBTHORD) # check the frequency of each level of value

## 
##   1   2   3   4 
## 567 428 257 524

4.2 Convert a Categorical Variable to Multiple Dummy (Binary) Varibles

Dummy coding refers to the process of coding a categorical variable into dichotomous variables (assuming mutually exclusive group). In general, with k groups there will be k-1 coded variables.

E.g. If we have g1-g4, for g2, every observation in group 2 will be coded as 1 and 0 for all other groups it will be coded as zero. There is no g1 as it is not needed because g2-g4 have all of the information needed to determine which observation is in which group.

The recoded PCBTHORD variable has 4 levels of values (i.e., 4 groups) so we need to create 3 (i.e., 4-1) dummy variables, with first born (BOD1) as your reference group == “0”.

# BOD2 is the new variable that compares reading scores of second born children (BOD2=1) to first born children. 
good$BOD2 <- ifelse(good$PCBTHORD == 2, 1, 
                    ifelse(good$PCBTHORD == 1 | good$PCBTHORD > 2, 0, NA)) 
# Stay with numeric structure to use operators
# Coding: if the person was the 2nd born, then the value=1, if not the value =0 (for the rest who were not born 2nd). 
table(good$BOD2)

## 
##    0    1 
## 1348  428

#   BOD3 compares reading scores of third born children (BOD3=1) to first born children.
good$BOD3 <- ifelse(good$PCBTHORD == 3, 1, 
                    ifelse(good$PCBTHORD < 3 | good$PCBTHORD == 4, 0, NA))
# Coding: if the person was the 3rd born, then the value=1, if not the value =0 (for the rest who were not born 3rd). 
table(good$BOD3)

## 
##    0    1 
## 1519  257

# BOD4 compares reading scores of children born fourth or later (BOD4=1) to first born children

good$BOD4 <- ifelse(good$PCBTHORD == 4, 1, 
                    ifelse(good$PCBTHORD < 4, 0, NA))
# Coding: if the person was the 4th born, then the value=1, if not the value =0 (for the rest who were not born 4th). 

table(good$BOD4)

## 
##    0    1 
## 1252  524

4.3 Create A New Variable Using `RowMeans` of Other variables

Create new dependent variables, reading score in 2002 (“readss02”), which is the mean of letter word (“LWSS02”) and passage comprehension score (“PCSS02”) in 2002.

Combine a new data matrix with the two variables

readmean <- cbind(good$LWSS02,good$PCSS02)
colnames(readmean) <- c("LWSS02","PCSS02") # optional: rename column names

Create a new variable using rowMeans function

good$readss02 <- rowMeans(readmean, na.rm = FALSE)
summary(good$readss02)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      32      94     103     104     114     186    1026

hist(good$readss02)

4.4 Multiple Linear Regression with Dummy and Continuous Variables

With Dummy Variables Only

lm1 <- lm(good$readss02 ~ good$BOD2)
summary(lm1)

## 
## Call:
## lm(formula = good$readss02 ~ good$BOD2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71.55 -10.05  -0.55   9.95  81.45 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  103.047      0.511  201.73 <0.0000000000000002 ***
## good$BOD2      2.376      1.071    2.22               0.027 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16 on 1281 degrees of freedom
##   (2280 observations deleted due to missingness)
## Multiple R-squared:  0.00383,    Adjusted R-squared:  0.00305 
## F-statistic: 4.92 on 1 and 1281 DF,  p-value: 0.0267

rsq.partial(lm1)

## $adjustment
## [1] FALSE
## 
## $variable
## [1] "good$BOD2"
## 
## $partial.rsq
## [1] 0.0038

nobs(lm1)

## [1] 1283

AIC(lm1)

## [1] 10772

BIC(lm1)

## [1] 10788

lm2 <- lm(readss02~BOD2 + BOD3 + BOD4 , data=good) 
summary(lm2)

## 
## Call:
## lm(formula = readss02 ~ BOD2 + BOD3 + BOD4, data = good)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72.39  -9.92  -0.89  10.11  82.60 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  103.888      0.785  132.31 <0.0000000000000002 ***
## BOD2           1.535      1.225    1.25               0.210    
## BOD3          -0.328      1.421   -0.23               0.818    
## BOD4          -1.992      1.132   -1.76               0.079 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16 on 1279 degrees of freedom
##   (2280 observations deleted due to missingness)
## Multiple R-squared:  0.00641,    Adjusted R-squared:  0.00408 
## F-statistic: 2.75 on 3 and 1279 DF,  p-value: 0.0415

rsq.partial(lm2)

## $adjustment
## [1] FALSE
## 
## $variable
## [1] "BOD2" "BOD3" "BOD4"
## 
## $partial.rsq
## [1] 0.001226 0.000042 0.002414

nobs(lm2)

## [1] 1283

AIC(lm2)

## [1] 10773

BIC(lm2)

## [1] 10799

# Alternatively, try this for lm2: 
lm2.1 <- lm(good$readss02 ~ factor(good$PCBTHORD)) # reference category is?
summary(lm2.1)

## 
## Call:
## lm(formula = good$readss02 ~ factor(good$PCBTHORD))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72.39  -9.92  -0.89  10.11  82.60 
## 
## Coefficients:
##                        Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)             103.888      0.785  132.31 <0.0000000000000002 ***
## factor(good$PCBTHORD)2    1.535      1.225    1.25               0.210    
## factor(good$PCBTHORD)3   -0.328      1.421   -0.23               0.818    
## factor(good$PCBTHORD)4   -1.992      1.132   -1.76               0.079 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16 on 1279 degrees of freedom
##   (2280 observations deleted due to missingness)
## Multiple R-squared:  0.00641,    Adjusted R-squared:  0.00408 
## F-statistic: 2.75 on 3 and 1279 DF,  p-value: 0.0415

rsq.partial(lm2.1)

## $adjustment
## [1] FALSE
## 
## $variable
## [1] "factor(good$PCBTHORD)"
## 
## $partial.rsq
## [1] 0.0064

nobs(lm2.1)

## [1] 1283

With Dummy and Continuous Variables

lm3 <- lm(readss02 ~ BOD2 + BOD3 + BOD4 + AGE02 + readss97 , data=good) 
summary(lm3)

## 
## Call:
## lm(formula = readss02 ~ BOD2 + BOD3 + BOD4 + AGE02 + readss97, 
##     data = good)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.37  -7.09  -1.01   6.17  83.11 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   46.585      3.231   14.42 < 0.0000000000000002 ***
## BOD2           0.579      1.159    0.50                 0.62    
## BOD3          -0.105      1.309   -0.08                 0.94    
## BOD4          -0.469      1.038   -0.45                 0.65    
## AGE02         -1.002      0.143   -7.03      0.0000000000043 ***
## readss97       0.676      0.027   25.03 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 823 degrees of freedom
##   (2734 observations deleted due to missingness)
## Multiple R-squared:  0.443,  Adjusted R-squared:  0.44 
## F-statistic:  131 on 5 and 823 DF,  p-value: <0.0000000000000002

rsq.partial(lm3)

## $adjustment
## [1] FALSE
## 
## $variable
## [1] "BOD2"     "BOD3"     "BOD4"     "AGE02"    "readss97"
## 
## $partial.rsq
## [1] 0.0003026 0.0000078 0.0002479 0.0566554 0.4321772

nobs(lm3)

## [1] 829

AIC(lm3)

## [1] 6469

BIC(lm3)

## [1] 6502

# Alternatively, try this for lm3: 
lm3.1 <- lm(good$readss02 ~ factor(good$PCBTHORD) + good$AGE02 + good$readss97, data=good) 
summary(lm3.1)

## 
## Call:
## lm(formula = good$readss02 ~ factor(good$PCBTHORD) + good$AGE02 + 
##     good$readss97, data = good)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.37  -7.09  -1.01   6.17  83.11 
## 
## Coefficients:
##                        Estimate Std. Error t value             Pr(>|t|)
## (Intercept)              46.585      3.231   14.42 < 0.0000000000000002
## factor(good$PCBTHORD)2    0.579      1.159    0.50                 0.62
## factor(good$PCBTHORD)3   -0.105      1.309   -0.08                 0.94
## factor(good$PCBTHORD)4   -0.469      1.038   -0.45                 0.65
## good$AGE02               -1.002      0.143   -7.03      0.0000000000043
## good$readss97             0.676      0.027   25.03 < 0.0000000000000002
##                           
## (Intercept)            ***
## factor(good$PCBTHORD)2    
## factor(good$PCBTHORD)3    
## factor(good$PCBTHORD)4    
## good$AGE02             ***
## good$readss97          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 823 degrees of freedom
##   (2734 observations deleted due to missingness)
## Multiple R-squared:  0.443,  Adjusted R-squared:  0.44 
## F-statistic:  131 on 5 and 823 DF,  p-value: <0.0000000000000002

rsq.partial(lm3.1)

## $adjustment
## [1] FALSE
## 
## $variable
## [1] "factor(good$PCBTHORD)" "good$AGE02"            "good$readss97"        
## 
## $partial.rsq
## [1] 0.014 0.057 0.422

nobs(lm3.1)

## [1] 829

How do the sample size changes across the regression models?
How do the AIC and BIC change? Which model has the best fit?
How do you interpret their individual effects (parameter estimates)?
Examine the adjusted R^2, Partial R^2.

Policy Report 1 due February 26th, 2019 (Next Tuesday) at 11:59pm

Assignment 4 due February 25th, 2019 (Next Monday) at 12:59 pm

Using the statekffdata dataset, answer the following question:

Controlling for the official poverty rate, what is the impact of having a Democratic governor (compared to all other parties) on the uninsured rate? Describe the relationships between the uninsured rate and each of your two independent variables. Are they statistically significant?

Create a binary variable which equals 1 when the governor is a Democrat and a 0 otherwise.
Create a measure of uninsured (100 - the sum of the insured categories).
Run the regression.

MSSP 897 Lab 4: Multiple Linear Regression With Categorical/Dummy Variables

Viviana Wu

February 17, 2019

The slides for Lab 4 session is now available to review here.

4.1 Constructing a Categorical Variable

4.2 Convert a Categorical Variable to Multiple Dummy (Binary) Varibles

4.3 Create A New Variable Using `RowMeans` of Other variables

4.4 Multiple Linear Regression with Dummy and Continuous Variables

With Dummy Variables Only

With Dummy and Continuous Variables

Policy Report 1 due February 26th, 2019 (Next Tuesday) at 11:59pm

Assignment 4 due February 25th, 2019 (Next Monday) at 12:59 pm

MSSP 897 Lab 4: Multiple Linear Regression With Categorical/Dummy Variables

Viviana Wu

February 17, 2019

The slides for Lab 4 session is now available to review here.

4.1 Constructing a Categorical Variable

4.2 Convert a Categorical Variable to Multiple Dummy (Binary) Varibles

4.3 Create A New Variable Using RowMeans of Other variables

4.4 Multiple Linear Regression with Dummy and Continuous Variables

With Dummy Variables Only

With Dummy and Continuous Variables

Policy Report 1 due February 26th, 2019 (Next Tuesday) at 11:59pm

Assignment 4 due February 25th, 2019 (Next Monday) at 12:59 pm

4.3 Create A New Variable Using `RowMeans` of Other variables