Run the packages that will be used for this tutorial.
library(rsq)
options(scipen=999, digits = 2) # avoid scientific notation, round to two significant digitsImport the good dataset into R and and assign the dataframe the variable name ‘good’
good <- read.csv("good.csv")
# or, type the full path of the .csv file, remember using forward slash "/"Overview of Today’s Lab
Use the categorical variable PCBTHORD (birth order) to create a set of new categorical birth order variables
Convert the categorical variables into dichotomous dummy variables.
Run a regression model with new dummy variables
Run regression models with new dummy variables and additional continuious variables
Explore the categorical variable PCBTHORD (birth order)
summary(good$PCBTHORD)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 2 98 51 99 99
hist(good$PCBTHORD, breaks = 100)plot(good$PCBTHORD)table(good$PCBTHORD) ##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 567 428 257 161 127 60 64 38 25 15 11 4 13 4 2
## 98 99
## 8 1779
Assign all current values for the variable PCBTHORD into one of four categories:
Use ifelse statements telling R IF a certain condition is met (e.g. a variable equals a certain value or is in a certain range) ELSE R should do something else that you specify.
good$PCBTHORD <- ifelse(good$PCBTHORD == 98, NA,
ifelse(good$PCBTHORD == 99, NA,
ifelse(good$PCBTHORD == 1, 1,
ifelse(good$PCBTHORD == 2, 2,
ifelse(good$PCBTHORD == 3, 3,
ifelse(good$PCBTHORD >= 4, 4, NA))))))See how the variable PCBTHORD has been transformed:
hist(good$PCBTHORD, breaks = 100)plot(good$PCBTHORD) # how many levels of value do we have now? table(good$PCBTHORD) # check the frequency of each level of value##
## 1 2 3 4
## 567 428 257 524
Dummy coding refers to the process of coding a categorical variable into dichotomous variables (assuming mutually exclusive group). In general, with k groups there will be k-1 coded variables.
E.g. If we have g1-g4, for g2, every observation in group 2 will be coded as 1 and 0 for all other groups it will be coded as zero. There is no g1 as it is not needed because g2-g4 have all of the information needed to determine which observation is in which group.
The recoded PCBTHORD variable has 4 levels of values (i.e., 4 groups) so we need to create 3 (i.e., 4-1) dummy variables, with first born (BOD1) as your reference group == “0”.
# BOD2 is the new variable that compares reading scores of second born children (BOD2=1) to first born children.
good$BOD2 <- ifelse(good$PCBTHORD == 2, 1,
ifelse(good$PCBTHORD == 1 | good$PCBTHORD > 2, 0, NA))
# Stay with numeric structure to use operators
# Coding: if the person was the 2nd born, then the value=1, if not the value =0 (for the rest who were not born 2nd).
table(good$BOD2)##
## 0 1
## 1348 428
# BOD3 compares reading scores of third born children (BOD3=1) to first born children.
good$BOD3 <- ifelse(good$PCBTHORD == 3, 1,
ifelse(good$PCBTHORD < 3 | good$PCBTHORD == 4, 0, NA))
# Coding: if the person was the 3rd born, then the value=1, if not the value =0 (for the rest who were not born 3rd).
table(good$BOD3)##
## 0 1
## 1519 257
# BOD4 compares reading scores of children born fourth or later (BOD4=1) to first born children
good$BOD4 <- ifelse(good$PCBTHORD == 4, 1,
ifelse(good$PCBTHORD < 4, 0, NA))
# Coding: if the person was the 4th born, then the value=1, if not the value =0 (for the rest who were not born 4th).
table(good$BOD4)##
## 0 1
## 1252 524
RowMeans of Other variablesCreate new dependent variables, reading score in 2002 (“readss02”), which is the mean of letter word (“LWSS02”) and passage comprehension score (“PCSS02”) in 2002.
Combine a new data matrix with the two variables
readmean <- cbind(good$LWSS02,good$PCSS02)
colnames(readmean) <- c("LWSS02","PCSS02") # optional: rename column namesCreate a new variable using rowMeans function
good$readss02 <- rowMeans(readmean, na.rm = FALSE)
summary(good$readss02)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 32 94 103 104 114 186 1026
hist(good$readss02)lm1 <- lm(good$readss02 ~ good$BOD2)
summary(lm1)##
## Call:
## lm(formula = good$readss02 ~ good$BOD2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.55 -10.05 -0.55 9.95 81.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.047 0.511 201.73 <0.0000000000000002 ***
## good$BOD2 2.376 1.071 2.22 0.027 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16 on 1281 degrees of freedom
## (2280 observations deleted due to missingness)
## Multiple R-squared: 0.00383, Adjusted R-squared: 0.00305
## F-statistic: 4.92 on 1 and 1281 DF, p-value: 0.0267
rsq.partial(lm1)## $adjustment
## [1] FALSE
##
## $variable
## [1] "good$BOD2"
##
## $partial.rsq
## [1] 0.0038
nobs(lm1)## [1] 1283
AIC(lm1)## [1] 10772
BIC(lm1)## [1] 10788
lm2 <- lm(readss02~BOD2 + BOD3 + BOD4 , data=good)
summary(lm2)##
## Call:
## lm(formula = readss02 ~ BOD2 + BOD3 + BOD4, data = good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.39 -9.92 -0.89 10.11 82.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.888 0.785 132.31 <0.0000000000000002 ***
## BOD2 1.535 1.225 1.25 0.210
## BOD3 -0.328 1.421 -0.23 0.818
## BOD4 -1.992 1.132 -1.76 0.079 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16 on 1279 degrees of freedom
## (2280 observations deleted due to missingness)
## Multiple R-squared: 0.00641, Adjusted R-squared: 0.00408
## F-statistic: 2.75 on 3 and 1279 DF, p-value: 0.0415
rsq.partial(lm2)## $adjustment
## [1] FALSE
##
## $variable
## [1] "BOD2" "BOD3" "BOD4"
##
## $partial.rsq
## [1] 0.001226 0.000042 0.002414
nobs(lm2)## [1] 1283
AIC(lm2)## [1] 10773
BIC(lm2)## [1] 10799
# Alternatively, try this for lm2:
lm2.1 <- lm(good$readss02 ~ factor(good$PCBTHORD)) # reference category is?
summary(lm2.1)##
## Call:
## lm(formula = good$readss02 ~ factor(good$PCBTHORD))
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.39 -9.92 -0.89 10.11 82.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.888 0.785 132.31 <0.0000000000000002 ***
## factor(good$PCBTHORD)2 1.535 1.225 1.25 0.210
## factor(good$PCBTHORD)3 -0.328 1.421 -0.23 0.818
## factor(good$PCBTHORD)4 -1.992 1.132 -1.76 0.079 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16 on 1279 degrees of freedom
## (2280 observations deleted due to missingness)
## Multiple R-squared: 0.00641, Adjusted R-squared: 0.00408
## F-statistic: 2.75 on 3 and 1279 DF, p-value: 0.0415
rsq.partial(lm2.1)## $adjustment
## [1] FALSE
##
## $variable
## [1] "factor(good$PCBTHORD)"
##
## $partial.rsq
## [1] 0.0064
nobs(lm2.1)## [1] 1283
lm3 <- lm(readss02 ~ BOD2 + BOD3 + BOD4 + AGE02 + readss97 , data=good)
summary(lm3)##
## Call:
## lm(formula = readss02 ~ BOD2 + BOD3 + BOD4 + AGE02 + readss97,
## data = good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.37 -7.09 -1.01 6.17 83.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.585 3.231 14.42 < 0.0000000000000002 ***
## BOD2 0.579 1.159 0.50 0.62
## BOD3 -0.105 1.309 -0.08 0.94
## BOD4 -0.469 1.038 -0.45 0.65
## AGE02 -1.002 0.143 -7.03 0.0000000000043 ***
## readss97 0.676 0.027 25.03 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 823 degrees of freedom
## (2734 observations deleted due to missingness)
## Multiple R-squared: 0.443, Adjusted R-squared: 0.44
## F-statistic: 131 on 5 and 823 DF, p-value: <0.0000000000000002
rsq.partial(lm3)## $adjustment
## [1] FALSE
##
## $variable
## [1] "BOD2" "BOD3" "BOD4" "AGE02" "readss97"
##
## $partial.rsq
## [1] 0.0003026 0.0000078 0.0002479 0.0566554 0.4321772
nobs(lm3)## [1] 829
AIC(lm3) ## [1] 6469
BIC(lm3)## [1] 6502
# Alternatively, try this for lm3:
lm3.1 <- lm(good$readss02 ~ factor(good$PCBTHORD) + good$AGE02 + good$readss97, data=good)
summary(lm3.1)##
## Call:
## lm(formula = good$readss02 ~ factor(good$PCBTHORD) + good$AGE02 +
## good$readss97, data = good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.37 -7.09 -1.01 6.17 83.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.585 3.231 14.42 < 0.0000000000000002
## factor(good$PCBTHORD)2 0.579 1.159 0.50 0.62
## factor(good$PCBTHORD)3 -0.105 1.309 -0.08 0.94
## factor(good$PCBTHORD)4 -0.469 1.038 -0.45 0.65
## good$AGE02 -1.002 0.143 -7.03 0.0000000000043
## good$readss97 0.676 0.027 25.03 < 0.0000000000000002
##
## (Intercept) ***
## factor(good$PCBTHORD)2
## factor(good$PCBTHORD)3
## factor(good$PCBTHORD)4
## good$AGE02 ***
## good$readss97 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 823 degrees of freedom
## (2734 observations deleted due to missingness)
## Multiple R-squared: 0.443, Adjusted R-squared: 0.44
## F-statistic: 131 on 5 and 823 DF, p-value: <0.0000000000000002
rsq.partial(lm3.1)## $adjustment
## [1] FALSE
##
## $variable
## [1] "factor(good$PCBTHORD)" "good$AGE02" "good$readss97"
##
## $partial.rsq
## [1] 0.014 0.057 0.422
nobs(lm3.1)## [1] 829
Using the statekffdata dataset, answer the following question:
Controlling for the official poverty rate, what is the impact of having a Democratic governor (compared to all other parties) on the uninsured rate? Describe the relationships between the uninsured rate and each of your two independent variables. Are they statistically significant?
Create a binary variable which equals 1 when the governor is a Democrat and a 0 otherwise.
Create a measure of uninsured (100 - the sum of the insured categories).
Run the regression.