Slides for Week 6 are available here.
In this tutorial, you will learn how to do the basic set-up (install and load all necessary libraries), recode and transform variables, and fit and compare regression models.
Check the CPS2015_Description.pdf in your project folder.
Each month the Bureau of Labor Statistics in the U.S. Department of Labor conducts the “Current Population Survey” (CPS), which provides data on labor force characteristics of the population, including the level of employment, unemployment, and earnings. The file CPS2015 contains the data for 2015 (from the 2016 surveys). These data are for full-time workers, defined as workers employed more than 35 hours per week for at least 48 weeks in the previous year. Data are provided for workers whose highest educational achievement is (1) a high school diploma, and (2) a bachelor’s degree.
Variables in the dataset:
Your filename.csv needs to be exact and indicate the file format, which here is a .csv file (Comma-Separated Values file): a delimited text file that uses a comma to separate values. It is one type of data-friendly spreadsheet format besides the common .xlsx Excel format.
Now, in your global environment, you should see cps and cps1 stored. The dataset has 7098 observations (i.e., 7098 rows) and 5 variables (i.e., 5 columns).
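The loading code is not shown above; here is a minimal sketch of how the data could be read in and inspected (the file name CPS2015.csv is an assumption based on the description file; how cps1 was created is not shown here).

# Read the CSV file into a data frame; the file name must match the file in your project folder
cps <- read.csv("CPS2015.csv")

# Summary statistics, structure, and correlations (output shown below)
summary(cps)
str(cps)
cor(cps[c("ahe", "bachelor", "female", "age")])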
## year ahe bachelor female age
## Min. :2015 Min. : 0.0 Min. :0.000 Min. :0.000 Min. :25.0
## 1st Qu.:2015 1st Qu.: 13.0 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:27.0
## Median :2015 Median : 18.3 Median :1.000 Median :0.000 Median :30.0
## Mean :2015 Mean : 21.2 Mean :0.526 Mean :0.417 Mean :29.6
## 3rd Qu.:2015 3rd Qu.: 26.4 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:32.0
## Max. :2015 Max. :105.8 Max. :1.000 Max. :1.000 Max. :34.0
## 'data.frame': 7098 obs. of 5 variables:
## $ year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ ahe : num 2.04 2.39 0 0 2.88 ...
## $ bachelor: int 1 1 0 0 1 0 1 1 0 0 ...
## $ female : int 0 1 0 0 0 1 1 0 1 0 ...
## $ age : int 34 33 25 28 27 30 25 29 25 29 ...
## ahe bachelor female age
## ahe 1.000 0.38034 -0.1122 0.13101
## bachelor 0.380 1.00000 0.1487 -0.00113
## female -0.112 0.14867 1.0000 -0.03184
## age 0.131 -0.00113 -0.0318 1.00000
ifelse() for Dummy and Conditional Coding
Dummy coding refers to the process of coding a categorical variable into dichotomous variables (assuming mutually exclusive groups). We can use the ifelse() function to recode and create a new variable based on certain conditions.
Here is the syntax of the ifelse() function: ifelse(logical_expression, a, b)
The arguments of ifelse() are:
1. logical_expression: the test you specify on a variable (e.g., age is greater than or equal to 30).
2. a: the value returned when the logical_expression is TRUE.
3. b: the value returned when the logical_expression is FALSE.
Review the operators for indicating one or multiple conditions for specifying the test (e.g., or, and, less than, greater than and equal to): See Lab 2 RPubs Note.
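For a single condition, a sketch of how the dummy below could be created (the variable name age30 is illustrative, not from the original):

# Check the distribution of age first
summary(cps$age)

# Code 1 if the worker is 30 or older, 0 otherwise
cps$age30 <- ifelse(cps$age >= 30, 1, 0)
table(cps$age30)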
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.0 27.0 30.0 29.6 32.0 34.0
## 0 1
## 3404 3694
Example for two conditions: create a variable coded 1 if the worker is 30 years old or older and holds a bachelor's degree as the highest educational level (meets both criteria), and coded 0 otherwise.
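A sketch of this two-condition recoding (the variable name age30_bach is illustrative):

# 1 if age >= 30 AND highest degree is a bachelor's; 0 otherwise
cps$age30_bach <- ifelse(cps$age >= 30 & cps$bachelor == 1, 1, 0)
table(cps$age30_bach)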
## 0 1
## 5158 1940
Example for either of two conditions: the worker is female or holds a bachelor's degree (meets either of the two criteria).
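A sketch of this either-or recoding (the variable name fem_or_bach is illustrative):

# 1 if female OR bachelor's degree; 0 otherwise
cps$fem_or_bach <- ifelse(cps$female == 1 | cps$bachelor == 1, 1, 0)
table(cps$fem_or_bach)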
## 0 1
## 2222 4876
In this class, we don’t have time to cover outliers. If you want to learn more about how to handle outliers, check out my past tutorial.
Centering is the process of subtracting the variable mean (average) from each of the values of that same variable; in other words, it is a linear rescaling of the variable. Centering is sometimes done before including variables as predictors in a regression model, generally for one or both of the following purposes: (a) to make the intercept value more interpretable, and (b) to reduce collinearity between two or more predictor variables that are subsequently multiplied to create an interaction term (product term), for example when estimating a moderated multiple linear regression model or a polynomial regression model.
Two steps: first create a new age variable centered around its mean (ageC) by deducting the mean from each value, then square it to create ageC2.
Step 1: Create ageC by deducting the mean from each value of age. The mean of age is:
## [1] 29.6
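A minimal sketch of the centering step (the value printed above is presumably mean(cps$age)):

# Step 1: center age by subtracting its mean; the new variable has mean 0
cps$ageC <- cps$age - mean(cps$age)
summary(cps$ageC)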
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.63 -2.63 0.37 0.00 2.37 4.37
Step 2: Square ageC to create ageC2.
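A sketch of the squaring step:

# Step 2: square the centered variable to create the quadratic term
cps$ageC2 <- cps$ageC^2
summary(cps$ageC2)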
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.14 1.88 5.61 8.27 13.18 21.44
## [1] -0.0674
Log transformation: cps$age and cps$ahe
Log transformation is a data transformation method that replaces each value x with log(x). A log-regression model is a regression equation in which one or more of the variables are linearized via a log transformation. It allows us to transform a complex nonlinear relationship into a simpler linear model that can be evaluated using direct and standard techniques.
The laws of logarithms
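As a quick refresher, the key laws are:
\[ \log(xy) = \log(x) + \log(y) \]
\[ \log(x/y) = \log(x) - \log(y) \]
\[ \log(x^k) = k\,\log(x) \]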
Data cleaning with ifelse()
For the variable: ahe
In the context of analyzing average hourly earnings (ahe), we need to ensure that the data contain no values of 0 (or negative values), because the logarithm of 0 is undefined (it diverges to negative infinity), and logarithms of negative numbers are undefined.
Using ifelse() to change variable values from 0 to 1
Since the minimum value of ahe is 0, we need to clean the data by replacing 0 with 1. To do this, we can use the ifelse() function in R. In the following code, if cps$ahe is equal to 0, cps$ahe1 will be assigned the value 1; if cps$ahe is not equal to 0, cps$ahe1 will be assigned the original value of cps$ahe. This way, we have replaced all 0 values in ahe with 1, which allows us to perform the logarithmic transformation without encountering undefined values.
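A sketch of that cleaning step, following the description above:

# Replace 0 with 1 so the log transformation is defined; keep all other values unchanged
cps$ahe1 <- ifelse(cps$ahe == 0, 1, cps$ahe)
summary(cps$ahe1)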
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 13.0 18.3 21.2 26.4 105.8
# Now, we can do a log transformation on `cps$ahe1`
cps$log_ahe1 <- log(cps$ahe1)
summary(cps$log_ahe1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.56 2.91 2.91 3.27 4.66
For the variable: age
In addition to cleaning the data for ahe, we can also perform a logarithmic transformation on the age variable. Unlike ahe, the age variable does not contain any observations coded as 0, so we don’t need to worry about undefined values when taking the logarithm.
Performing a logarithmic transformation on age can be useful in certain analyses, as it can help to reduce the impact of outliers and better approximate a normal distribution. To do this in R, we can use the log() function to create a new variable, log_age, for any subsequent analyses or modeling.
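A sketch of the transformation described above:

# Log-transform age (no zeros here, so no cleaning is needed)
cps$log_age <- log(cps$age)

summary(cps$age)
summary(cps$log_age)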
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.0 27.0 30.0 29.6 32.0 34.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.22 3.30 3.40 3.38 3.47 3.53
If you click and view the dataset in the global environment, you should be able to see these new variables created on the right end of the dataset (scroll or use the arrow keys to view the entire dataset) and confirm that the new variables have been added.
We can generate a scatterplot of the logged average hourly earnings (log_ahe1) against the centered age variable (x = ageC). Specifically, the poly(x, 2) argument specifies that a second-order polynomial (quadratic) function should be used to model the relationship between the variables.
# Scatterplot of log_ahe1 against ageC with a fitted quadratic trend line
library(ggplot2)
ggplot(cps, aes(x = ageC, y = log_ahe1)) + geom_point() +
  stat_smooth(se = FALSE, method = 'lm', formula = y ~ poly(x, 2)) +
  labs(title = "Non-linear Relationship between Age and Average Hourly Earnings")
Hierarchical regression is a way to show if variables of your interest explain a statistically significant amount of variance in your Dependent Variable (DV) after accounting for all other variables. This is a framework for model comparison rather than a statistical method. In this framework, you build several regression models by adding variables to a previous model at each step; later models always include smaller models in previous steps. In many cases, our interest is to determine whether newly added variables show a significant improvement in R2 (the proportion of explained variance in DV by the model). (Read more here)
\(Adjusted R^2\) measures how close the data are to the fitted regression line; 0% indicates that the model explains none of the variability of the response data around its mean. How much of the variance in the DV is explained by our independent variables? Based on the adjusted R-squared value, model lm1 accounts for approximately 2.86% of the variance in average hourly earnings.
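A sketch of the code behind this first model (the name lm1 matches the model-comparison code later in the tutorial; vif() is assumed to come from the car package):

# Model 1: regress average hourly earnings on sex and age
lm1 <- lm(ahe ~ female + age, data = cps)
summary(lm1)

# Adjusted R-squared and variance inflation factors (car package assumed)
summary(lm1)$adj.r.squared
car::vif(lm1)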
##
## Call:
## lm(formula = ahe ~ female + age, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.65 -8.12 -2.77 5.14 82.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.4117 1.4775 4.34 0.000014 ***
## female -2.6586 0.2879 -9.24 < 0.0000000000000002 ***
## age 0.5377 0.0493 10.90 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 7095 degrees of freedom
## Multiple R-squared: 0.0288, Adjusted R-squared: 0.0286
## F-statistic: 105 on 2 and 7095 DF, p-value: <0.0000000000000002
## [1] 0.0286
## female age
## 1 1
To learn more about heteroskedasticity-robust standard errors, see: https://www.r-econometrics.com/methods/hcrobusterrors/
Omitted variable bias arises when an important variable is left out of a model. It occurs when the effect of an omitted variable on the dependent variable is correlated with one or more of the independent variables included in the model. As a result, the estimated coefficients of the included variables become biased and can lead to incorrect conclusions. In the lecture example, if an omitted variable such as work experience is positively correlated with both education and income, then the estimated coefficient for education will be biased upwards. In other words, the model will overestimate the effect of education on income because it does not account for the effect of work experience.
Let's add bachelor to lm2, since education level can be an important factor in explaining earnings.
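A sketch of this second model (the name lm2 follows the text):

# Model 2: add bachelor to the earnings equation
lm2 <- lm(ahe ~ female + age + bachelor, data = cps)
summary(lm2)

# Adjusted R-squared and variance inflation factors (car package assumed)
summary(lm2)$adj.r.squared
car::vif(lm2)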
##
## Call:
## lm(formula = ahe ~ female + age + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.91 -6.65 -1.86 4.25 83.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0342 1.3548 1.5 0.13
## female -4.1424 0.2659 -15.6 <0.0000000000000002 ***
## age 0.5316 0.0451 11.8 <0.0000000000000002 ***
## bachelor 9.8471 0.2624 37.5 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7094 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.189
## F-statistic: 553 on 3 and 7094 DF, p-value: <0.0000000000000002
## [1] 0.189
## female age bachelor
## 1.02 1.00 1.02
We can take further steps to improve the model's fit to the data by tackling issues of nonlinearity and non-normality. This may involve transforming the variables or applying more advanced modeling techniques such as polynomial and log regression to model nonlinear relationships.
A linear regression with ageC and the quadratic term ageC2
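A sketch of this polynomial model (the name lm3 is inferred from the model-comparison code later):

# Model 3: quadratic regression using the centered age terms
lm3 <- lm(ahe ~ female + ageC + ageC2 + bachelor, data = cps)
summary(lm3)
summary(lm3)$adj.r.squared
car::vif(lm3)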
##
## Call:
## lm(formula = ahe ~ female + ageC + ageC2 + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.62 -6.66 -1.92 4.28 83.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.9913 0.2541 70.80 <0.0000000000000002 ***
## female -4.1389 0.2659 -15.56 <0.0000000000000002 ***
## ageC 0.5272 0.0452 11.67 <0.0000000000000002 ***
## ageC2 -0.0252 0.0177 -1.43 0.15
## bachelor 9.8488 0.2624 37.53 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7093 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.189
## F-statistic: 416 on 4 and 7093 DF, p-value: <0.0000000000000002
## [1] 0.189
## female ageC ageC2 bachelor
## 1.02 1.01 1.00 1.02
Log-transformed OLS regression models are estimated in the same way as regular OLS regression models, but the interpretation of the coefficients changes. When a variable is log-transformed, its coefficient is interpreted in (approximate) percentage-change terms rather than in raw units; exactly how depends on which variables are logged (see the rules of thumb after the list below).
Log-regression models fall into four categories:
linear model, which is the traditional linear model without making any log transformations, \[ Y = b_0 + b_1*x + \mu \]
linear-log model, where we transform the x-explanatory variables using logs, \[ Y = b_0 + b_1*log(x) + \mu \]
log-linear model, where we transform the y-dependent variable using logs; and \[ log(Y) = b_0 + b_1*x + \mu \]
log-log model, where both the y-dependent variable and the x-explanatory factors are transformed using logs. \[ log(Y) = b_0 + b_1*log(x) + \mu \]
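As a rough guide for reading the slope \(b_1\) in each case:
linear-log model: a 1% increase in x is associated with a change of about \(b_1/100\) units in Y;
log-linear model: a one-unit increase in x is associated with an approximate \(100 \cdot b_1\) percent change in Y (more precisely, \(100 \cdot (e^{b_1}-1)\) percent);
log-log model: \(b_1\) is an elasticity, the approximate percent change in Y for a 1% change in x.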
A linear-log model: regressing ahe on log_age
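A sketch of this linear-log model (the name lm4 is inferred from the model-comparison code later):

# Model 4 (linear-log): regress ahe on the log of age
lm4 <- lm(ahe ~ female + log_age + bachelor, data = cps)
summary(lm4)
summary(lm4)$adj.r.squared
car::vif(lm4)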
##
## Call:
## lm(formula = ahe ~ female + log_age + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.82 -6.66 -1.89 4.30 83.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35.240 4.484 -7.86 0.0000000000000045 ***
## female -4.141 0.266 -15.57 < 0.0000000000000002 ***
## log_age 15.669 1.323 11.85 < 0.0000000000000002 ***
## bachelor 9.848 0.262 37.53 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7094 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.189
## F-statistic: 554 on 3 and 7094 DF, p-value: <0.0000000000000002
## [1] 0.189
## female log_age bachelor
## 1.02 1.00 1.02
A log-linear model: regressing log_ahe1 on the quadratic age terms
Note the sign of ageC2 in the model: the sign is positive when the curve is convex and negative when the curve is concave. Learn more: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-the-sign-of-the-quadratic-term-in-a-polynomial-regression/
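A sketch of this log-linear model (the name lm5 is inferred from the model-comparison code later):

# Model 5 (log-linear): regress the logged earnings on the centered age terms
lm5 <- lm(log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps)
summary(lm5)
summary(lm5)$adj.r.squared
car::vif(lm5)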
##
## Call:
## lm(formula = log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7147 -0.2862 0.0129 0.3044 2.0602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.758755 0.011139 247.68 <0.0000000000000002 ***
## female -0.176939 0.011657 -15.18 <0.0000000000000002 ***
## ageC 0.023976 0.001980 12.11 <0.0000000000000002 ***
## ageC2 -0.001875 0.000776 -2.42 0.016 *
## bachelor 0.462164 0.011504 40.18 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 7093 degrees of freedom
## Multiple R-squared: 0.209, Adjusted R-squared: 0.208
## F-statistic: 467 on 4 and 7093 DF, p-value: <0.0000000000000002
## [1] 0.208
## female ageC ageC2 bachelor
## 1.02 1.01 1.00 1.02
A log-log model: regressing the DV log_ahe1 on the IV log_age
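A sketch of this log-log model (the name lm6 is inferred from the model-comparison code later):

# Model 6 (log-log): both the DV and age are log-transformed
lm6 <- lm(log_ahe1 ~ female + log_age + bachelor, data = cps)
summary(lm6)
summary(lm6)$adj.r.squared
car::vif(lm6)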
##
## Call:
## lm(formula = log_ahe1 ~ female + log_age + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7061 -0.2870 0.0105 0.3029 2.0631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3118 0.1966 1.59 0.11
## female -0.1771 0.0117 -15.19 <0.0000000000000002 ***
## log_age 0.7185 0.0580 12.39 <0.0000000000000002 ***
## bachelor 0.4621 0.0115 40.16 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 7094 degrees of freedom
## Multiple R-squared: 0.208, Adjusted R-squared: 0.208
## F-statistic: 621 on 3 and 7094 DF, p-value: <0.0000000000000002
## [1] 0.208
## female log_age bachelor
## 1.02 1.00 1.02
Information criteria, unlike significance tests, are indices that allow for model comparison while accounting for model complexity and sample size. More parsimonious models are considered better and more efficient, and smaller values of AIC and BIC are better.
Having said that, BIC is more conservative and penalizes model complexity more heavily, so it can serve as a useful tiebreaker for multiple regression models when the AIC values are inconclusive. Another consideration is theoretical: which model makes more sense based on the prior literature and the hypothesized relationships? You might opt for the model that makes better theoretical sense when two models have a similar fit.
We used AIC and BIC model selection to distinguish among a set of possible models describing the relationship between average hourly earnings, sex, holding a bachelor's degree, and age.
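A sketch of the comparison code that produces the tables below:

# Compare all six models; smaller values indicate a better trade-off between fit and complexity
AIC(lm1, lm2, lm3, lm4, lm5, lm6)
BIC(lm1, lm2, lm3, lm4, lm5, lm6)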
## df AIC
## lm1 4 55366
## lm2 5 54083
## lm3 6 54083
## lm4 5 54082
## lm5 6 9688
## lm6 5 9690
## df BIC
## lm1 4 55394
## lm2 5 54118
## lm3 6 54125
## lm4 5 54116
## lm5 6 9729
## lm6 5 9724
.rda (or .RData) is a file format used in R for saving and loading data and objects. It is a compressed and efficient way to store large datasets.
# export a csv copy
write.csv(cps, file = "cps_lab6.csv")
# export data in .rda format
save(cps, file = "cps_lab6.rda")
The Stargazer package
The Stargazer package in R is a popular tool designed for creating well-formatted tables that summarize the results of statistical models. It supports a wide range of model types from various packages beyond base R's statistical models, including lm (linear models) and glm (generalized linear models), among others. Stargazer tables can be exported in multiple formats, including HTML, LaTeX, and plain text, making it easier to integrate statistical analysis results into reports, research papers, and presentations. Full documentation: https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf
If you use the stargazer package in your research publications, please remember to include the following citation: Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
**Note: Open the tables from your folder using a Word document.**
cps_new <- (cps[c("ahe","age","bachelor","female")])
# install.packages("stargazer")
library(stargazer)
stargazer(cps_new, type= "html", out = "summary table.html",
title="Table 1. Summary Statistics", align=TRUE)
| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| ahe | 7,098 | 21.200 | 12.100 | 0.000 | 106.000 |
| age | 7,098 | 29.600 | 2.880 | 25 | 34 |
| bachelor | 7,098 | 0.526 | 0.499 | 0 | 1 |
| female | 7,098 | 0.417 | 0.493 | 0 | 1 |
**Note: Open the tables from your folder using a Word document.**
# Create a matrix by selecting some of the variables
cps_new$bachelor <- as.numeric(cps_new$bachelor)
# Remove categorical variables from your correlation matrix
# since correlation is only useful for continuous/numeric variables
cor.matrix1 <- cor(cps_new[c("ahe","age")])
stargazer(cor.matrix1, type= "html", out = "cor.table1.html",
title="Table 2. Correlation Matrix")
| | ahe | age |
|---|---|---|
| ahe | 1 | 0.131 |
| age | 0.131 | 1 |
**Note: Open the tables from your folder using a Word document.**
For more settings, see cheatsheet https://www.jakeruss.com/cheatsheets/stargazer/
# Store AIC and BIC in each model object so they appear in the stargazer table below
lm1$AIC <- AIC(lm1)
lm2$AIC <- AIC(lm2)
lm3$AIC <- AIC(lm3)
lm4$AIC <- AIC(lm4)
lm5$AIC <- AIC(lm5)
lm6$AIC <- AIC(lm6)
lm1$BIC <- BIC(lm1)
lm2$BIC <- BIC(lm2)
lm3$BIC <- BIC(lm3)
lm4$BIC <- BIC(lm4)
lm5$BIC <- BIC(lm5)
lm6$BIC <- BIC(lm6)
stargazer(lm1, lm2, lm3, lm4, lm5, lm6, type= "html", out = "regression.table.html",
dep.var.labels = "Average Hourly Earnings",
column.labels = c("Basic", "Bachelor", "Polynomial", "Linear-log", "Log-linear", "Log-Log"),
intercept.bottom = FALSE,
digits = 2 , digits.extra = 2,
single.row= FALSE,
keep.stat = c("n","adj.rsq","aic","bic"),
notes.append = FALSE,
star.char = c("+", "*", "**", "***"),
star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
notes = c("(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)"),
title="Table 3. Regression Results", align=TRUE, no.space=FALSE)
Dependent variable: Average Hourly Earnings (models 1 to 4) and log_ahe1 (models 5 and 6).

| | Basic (1) | Bachelor (2) | Polynomial (3) | Linear-log (4) | Log-linear (5) | Log-Log (6) |
|---|---|---|---|---|---|---|
| Constant | 6.41*** (1.48) | 2.03 (1.35) | 18.00*** (0.25) | -35.20*** (4.48) | 2.76*** (0.01) | 0.31 (0.20) |
| female | -2.66*** (0.29) | -4.14*** (0.27) | -4.14*** (0.27) | -4.14*** (0.27) | -0.18*** (0.01) | -0.18*** (0.01) |
| age | 0.54*** (0.05) | 0.53*** (0.05) | | | | |
| ageC | | | 0.53*** (0.05) | | 0.02*** (0.002) | |
| ageC2 | | | -0.03 (0.02) | | -0.002* (0.001) | |
| log_age | | | | 15.70*** (1.32) | | 0.72*** (0.06) |
| bachelor | | 9.85*** (0.26) | 9.85*** (0.26) | 9.85*** (0.26) | 0.46*** (0.01) | 0.46*** (0.01) |
| Observations | 7,098 | 7,098 | 7,098 | 7,098 | 7,098 | 7,098 |
| Adjusted R2 | 0.03 | 0.19 | 0.19 | 0.19 | 0.21 | 0.21 |
| Akaike Inf. Crit. | 55,366.00 | 54,083.00 | 54,083.00 | 54,082.00 | 9,688.00 | 9,690.00 |
| Bayesian Inf. Crit. | 55,394.00 | 54,118.00 | 54,125.00 | 54,116.00 | 9,729.00 | 9,724.00 |

Note: + p<0.1; * p<0.05; ** p<0.01; *** p<0.001