The slides for Lab 4 session is now available to review here.


Run the packages that will be used for this tutorial.

library(rsq)
options(scipen=999, digits = 2) # avoid scientific notation, round to two significant digits

Import the good dataset into R and and assign the dataframe the variable name ‘good’

good <- read.csv("good.csv") 
# or, type the full path of the .csv file, remember using forward slash "/"

4.1 Constructing a Categorical Variable

Overview of Today’s Lab

  1. Use the categorical variable PCBTHORD (birth order) to create a set of new categorical birth order variables

  2. Convert the categorical variables into dichotomous dummy variables.

  3. Run a regression model with new dummy variables

  4. Run regression models with new dummy variables and additional continuious variables

Explore the categorical variable PCBTHORD (birth order)

summary(good$PCBTHORD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2      98      51      99      99
hist(good$PCBTHORD, breaks = 100)

plot(good$PCBTHORD)

table(good$PCBTHORD) 
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##  567  428  257  161  127   60   64   38   25   15   11    4   13    4    2 
##   98   99 
##    8 1779

Assign all current values for the variable PCBTHORD into one of four categories:

  • missing (for values of 98 or 99),
  • 1 (for children born first),
  • 2 (for children born 2nd),
  • 3 (for children born 3rd) and
  • 4 (for children born 4th and later)

Use ifelse statements telling R IF a certain condition is met (e.g. a variable equals a certain value or is in a certain range) ELSE R should do something else that you specify.

good$PCBTHORD <- ifelse(good$PCBTHORD == 98, NA, 
                        ifelse(good$PCBTHORD == 99, NA, 
                               ifelse(good$PCBTHORD == 1, 1, 
                                      ifelse(good$PCBTHORD == 2, 2, 
                                             ifelse(good$PCBTHORD == 3, 3, 
                                                    ifelse(good$PCBTHORD >= 4, 4, NA))))))

See how the variable PCBTHORD has been transformed:

hist(good$PCBTHORD, breaks = 100)

plot(good$PCBTHORD) # how many levels of value do we have now? 

table(good$PCBTHORD) # check the frequency of each level of value
## 
##   1   2   3   4 
## 567 428 257 524

4.2 Convert a Categorical Variable to Multiple Dummy (Binary) Varibles

Dummy coding refers to the process of coding a categorical variable into dichotomous variables (assuming mutually exclusive group). In general, with k groups there will be k-1 coded variables.

E.g. If we have g1-g4, for g2, every observation in group 2 will be coded as 1 and 0 for all other groups it will be coded as zero. There is no g1 as it is not needed because g2-g4 have all of the information needed to determine which observation is in which group.

The recoded PCBTHORD variable has 4 levels of values (i.e., 4 groups) so we need to create 3 (i.e., 4-1) dummy variables, with first born (BOD1) as your reference group == “0”.

# BOD2 is the new variable that compares reading scores of second born children (BOD2=1) to first born children. 
good$BOD2 <- ifelse(good$PCBTHORD == 2, 1, 
                    ifelse(good$PCBTHORD == 1 | good$PCBTHORD > 2, 0, NA)) 
# Stay with numeric structure to use operators
# Coding: if the person was the 2nd born, then the value=1, if not the value =0 (for the rest who were not born 2nd). 
table(good$BOD2)
## 
##    0    1 
## 1348  428
#   BOD3 compares reading scores of third born children (BOD3=1) to first born children.
good$BOD3 <- ifelse(good$PCBTHORD == 3, 1, 
                    ifelse(good$PCBTHORD < 3 | good$PCBTHORD == 4, 0, NA))
# Coding: if the person was the 3rd born, then the value=1, if not the value =0 (for the rest who were not born 3rd). 
table(good$BOD3)
## 
##    0    1 
## 1519  257
# BOD4 compares reading scores of children born fourth or later (BOD4=1) to first born children

good$BOD4 <- ifelse(good$PCBTHORD == 4, 1, 
                    ifelse(good$PCBTHORD < 4, 0, NA))
# Coding: if the person was the 4th born, then the value=1, if not the value =0 (for the rest who were not born 4th). 

table(good$BOD4)
## 
##    0    1 
## 1252  524

4.3 Create A New Variable Using RowMeans of Other variables

Create new dependent variables, reading score in 2002 (“readss02”), which is the mean of letter word (“LWSS02”) and passage comprehension score (“PCSS02”) in 2002.

Combine a new data matrix with the two variables

readmean <- cbind(good$LWSS02,good$PCSS02)
colnames(readmean) <- c("LWSS02","PCSS02") # optional: rename column names

Create a new variable using rowMeans function

good$readss02 <- rowMeans(readmean, na.rm = FALSE)
summary(good$readss02)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      32      94     103     104     114     186    1026
hist(good$readss02)


4.4 Multiple Linear Regression with Dummy and Continuous Variables

With Dummy Variables Only

lm1 <- lm(good$readss02 ~ good$BOD2)
summary(lm1)
## 
## Call:
## lm(formula = good$readss02 ~ good$BOD2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71.55 -10.05  -0.55   9.95  81.45 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  103.047      0.511  201.73 <0.0000000000000002 ***
## good$BOD2      2.376      1.071    2.22               0.027 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16 on 1281 degrees of freedom
##   (2280 observations deleted due to missingness)
## Multiple R-squared:  0.00383,    Adjusted R-squared:  0.00305 
## F-statistic: 4.92 on 1 and 1281 DF,  p-value: 0.0267
rsq.partial(lm1)
## $adjustment
## [1] FALSE
## 
## $variable
## [1] "good$BOD2"
## 
## $partial.rsq
## [1] 0.0038
nobs(lm1)
## [1] 1283
AIC(lm1)
## [1] 10772
BIC(lm1)
## [1] 10788
lm2 <- lm(readss02~BOD2 + BOD3 + BOD4 , data=good) 
summary(lm2)
## 
## Call:
## lm(formula = readss02 ~ BOD2 + BOD3 + BOD4, data = good)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72.39  -9.92  -0.89  10.11  82.60 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  103.888      0.785  132.31 <0.0000000000000002 ***
## BOD2           1.535      1.225    1.25               0.210    
## BOD3          -0.328      1.421   -0.23               0.818    
## BOD4          -1.992      1.132   -1.76               0.079 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16 on 1279 degrees of freedom
##   (2280 observations deleted due to missingness)
## Multiple R-squared:  0.00641,    Adjusted R-squared:  0.00408 
## F-statistic: 2.75 on 3 and 1279 DF,  p-value: 0.0415
rsq.partial(lm2)
## $adjustment
## [1] FALSE
## 
## $variable
## [1] "BOD2" "BOD3" "BOD4"
## 
## $partial.rsq
## [1] 0.001226 0.000042 0.002414
nobs(lm2)
## [1] 1283
AIC(lm2)
## [1] 10773
BIC(lm2)
## [1] 10799
# Alternatively, try this for lm2: 
lm2.1 <- lm(good$readss02 ~ factor(good$PCBTHORD)) # reference category is?
summary(lm2.1)
## 
## Call:
## lm(formula = good$readss02 ~ factor(good$PCBTHORD))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72.39  -9.92  -0.89  10.11  82.60 
## 
## Coefficients:
##                        Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)             103.888      0.785  132.31 <0.0000000000000002 ***
## factor(good$PCBTHORD)2    1.535      1.225    1.25               0.210    
## factor(good$PCBTHORD)3   -0.328      1.421   -0.23               0.818    
## factor(good$PCBTHORD)4   -1.992      1.132   -1.76               0.079 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16 on 1279 degrees of freedom
##   (2280 observations deleted due to missingness)
## Multiple R-squared:  0.00641,    Adjusted R-squared:  0.00408 
## F-statistic: 2.75 on 3 and 1279 DF,  p-value: 0.0415
rsq.partial(lm2.1)
## $adjustment
## [1] FALSE
## 
## $variable
## [1] "factor(good$PCBTHORD)"
## 
## $partial.rsq
## [1] 0.0064
nobs(lm2.1)
## [1] 1283

With Dummy and Continuous Variables

lm3 <- lm(readss02 ~ BOD2 + BOD3 + BOD4 + AGE02 + readss97 , data=good) 
summary(lm3)
## 
## Call:
## lm(formula = readss02 ~ BOD2 + BOD3 + BOD4 + AGE02 + readss97, 
##     data = good)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.37  -7.09  -1.01   6.17  83.11 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   46.585      3.231   14.42 < 0.0000000000000002 ***
## BOD2           0.579      1.159    0.50                 0.62    
## BOD3          -0.105      1.309   -0.08                 0.94    
## BOD4          -0.469      1.038   -0.45                 0.65    
## AGE02         -1.002      0.143   -7.03      0.0000000000043 ***
## readss97       0.676      0.027   25.03 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 823 degrees of freedom
##   (2734 observations deleted due to missingness)
## Multiple R-squared:  0.443,  Adjusted R-squared:  0.44 
## F-statistic:  131 on 5 and 823 DF,  p-value: <0.0000000000000002
rsq.partial(lm3)
## $adjustment
## [1] FALSE
## 
## $variable
## [1] "BOD2"     "BOD3"     "BOD4"     "AGE02"    "readss97"
## 
## $partial.rsq
## [1] 0.0003026 0.0000078 0.0002479 0.0566554 0.4321772
nobs(lm3)
## [1] 829
AIC(lm3) 
## [1] 6469
BIC(lm3)
## [1] 6502
# Alternatively, try this for lm3: 
lm3.1 <- lm(good$readss02 ~ factor(good$PCBTHORD) + good$AGE02 + good$readss97, data=good) 
summary(lm3.1)
## 
## Call:
## lm(formula = good$readss02 ~ factor(good$PCBTHORD) + good$AGE02 + 
##     good$readss97, data = good)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.37  -7.09  -1.01   6.17  83.11 
## 
## Coefficients:
##                        Estimate Std. Error t value             Pr(>|t|)
## (Intercept)              46.585      3.231   14.42 < 0.0000000000000002
## factor(good$PCBTHORD)2    0.579      1.159    0.50                 0.62
## factor(good$PCBTHORD)3   -0.105      1.309   -0.08                 0.94
## factor(good$PCBTHORD)4   -0.469      1.038   -0.45                 0.65
## good$AGE02               -1.002      0.143   -7.03      0.0000000000043
## good$readss97             0.676      0.027   25.03 < 0.0000000000000002
##                           
## (Intercept)            ***
## factor(good$PCBTHORD)2    
## factor(good$PCBTHORD)3    
## factor(good$PCBTHORD)4    
## good$AGE02             ***
## good$readss97          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 823 degrees of freedom
##   (2734 observations deleted due to missingness)
## Multiple R-squared:  0.443,  Adjusted R-squared:  0.44 
## F-statistic:  131 on 5 and 823 DF,  p-value: <0.0000000000000002
rsq.partial(lm3.1)
## $adjustment
## [1] FALSE
## 
## $variable
## [1] "factor(good$PCBTHORD)" "good$AGE02"            "good$readss97"        
## 
## $partial.rsq
## [1] 0.014 0.057 0.422
nobs(lm3.1)
## [1] 829
  • How do the sample size changes across the regression models?
  • How do the AIC and BIC change? Which model has the best fit?
  • How do you interpret their individual effects (parameter estimates)?
  • Examine the adjusted R^2, Partial R^2.

Policy Report 1 due February 26th, 2019 (Next Tuesday) at 11:59pm


Assignment 4 due February 25th, 2019 (Next Monday) at 12:59 pm

Using the statekffdata dataset, answer the following question:

Controlling for the official poverty rate, what is the impact of having a Democratic governor (compared to all other parties) on the uninsured rate? Describe the relationships between the uninsured rate and each of your two independent variables. Are they statistically significant?

  1. Create a binary variable which equals 1 when the governor is a Democrat and a 0 otherwise.

  2. Create a measure of uninsured (100 - the sum of the insured categories).

  3. Run the regression.