Loading Packages for Assignment
library(tidyverse)
library(knitr)
library(haven) #needed to pull data from file location
library(ggplot2)
library(broom)
library(skimr)
library(stargazer)
nlsw <- read_dta("nlsw88.dta") #pulling in data, using the haven package
glimpse(nlsw) #reviewing data
Rows: 2,246
Columns: 17
$ idcode <dbl> 1, 2, 3, 4, 6, 7, 9, 12, 13, 14, 15, 16, 18, 19, 20, 22, 23, 24, 25, 36,...
$ age <dbl> 37, 37, 42, 43, 42, 39, 37, 40, 40, 40, 39, 40, 40, 40, 39, 41, 42, 41, ...
$ race <dbl+lbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ married <dbl+lbl> 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1,...
$ never_married <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
$ grade <dbl> 12, 12, 12, 17, 12, 12, 12, 18, 14, 15, 16, 15, 15, 15, 15, 15, 15, 14, ...
$ collgrad <dbl+lbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0,...
$ south <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ smsa <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,...
$ c_city <dbl> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
$ industry <dbl+lbl> 5, 4, 4, 11, 4, 11, 5, 11, 11, 11, 11, 11, 6, 11, 11, 11, 11, ...
$ occupation <dbl+lbl> 6, 5, 3, 13, 6, 3, 2, 2, 3, 1, 1, 1, 5, 1, 1, 1, 1, ...
$ union <dbl+lbl> 1, 1, NA, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, ...
$ wage <dbl> 11.739125, 6.400963, 5.016723, 9.033813, 8.083731, 4.629630, 10.491142, ...
$ hours <dbl> 48, 40, 40, 42, 48, 30, 40, 45, 8, 50, 16, 40, 40, 40, 4, 32, 45, 24, 40...
$ ttl_exp <dbl> 10.333334, 13.621795, 17.730770, 13.211537, 17.820513, 7.326923, 19.0448...
$ tenure <dbl> 5.3333335, 5.2500000, 1.2500000, 1.7500000, 17.7500000, 2.2500000, 19.00...
R users:
a. rename the variable collgrad to college_grad,
b. rename the variable ttl_exp into total_work_exp, and
c. rename the variable age to age_years.
d. change collgrad into a factor variable using the mutate() and as_factor() functions
e. Use skim to check your work.
nlsw.clean <- nlsw %>%
  rename(
    college_grad = collgrad,   # rename collgrad to college_grad
    total_work_exp = ttl_exp,  # rename ttl_exp to total_work_exp
    age_years = age            # rename age to age_years
  ) %>%
  # add a factor version of college_grad using mutate() and as_factor()
  mutate(college_grad_fac = as_factor(college_grad))
glimpse(nlsw.clean)
Rows: 2,246
Columns: 18
$ idcode <dbl> 1, 2, 3, 4, 6, 7, 9, 12, 13, 14, 15, 16, 18, 19, 20, 22, 23, 24, 25, ...
$ age_years <dbl> 37, 37, 42, 43, 42, 39, 37, 40, 40, 40, 39, 40, 40, 40, 39, 41, 42, 4...
$ race <dbl+lbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ married <dbl+lbl> 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1,...
$ never_married <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
$ grade <dbl> 12, 12, 12, 17, 12, 12, 12, 18, 14, 15, 16, 15, 15, 15, 15, 15, 15, 1...
$ college_grad <dbl+lbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1,...
$ south <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ smsa <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,...
$ c_city <dbl> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
$ industry <dbl+lbl> 5, 4, 4, 11, 4, 11, 5, 11, 11, 11, 11, 11, 6, 11, 11, 11, 1...
$ occupation <dbl+lbl> 6, 5, 3, 13, 6, 3, 2, 2, 3, 1, 1, 1, 5, 1, 1, 1, ...
$ union <dbl+lbl> 1, 1, NA, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, ...
$ wage <dbl> 11.739125, 6.400963, 5.016723, 9.033813, 8.083731, 4.629630, 10.49114...
$ hours <dbl> 48, 40, 40, 42, 48, 30, 40, 45, 8, 50, 16, 40, 40, 40, 4, 32, 45, 24,...
$ total_work_exp <dbl> 10.333334, 13.621795, 17.730770, 13.211537, 17.820513, 7.326923, 19.0...
$ tenure <dbl> 5.3333335, 5.2500000, 1.2500000, 1.7500000, 17.7500000, 2.2500000, 19...
$ college_grad_fac <fct> not college grad, not college grad, not college grad, college grad, n...
skim(nlsw.clean) #Use skim to check your work.
Couldn't find skimmers for class: haven_labelled, vctrs_vctr, double, numeric; No user-defined `sfl` provided. Falling back to `character`.
-- Data Summary ------------------------
Values
Name nlsw.clean
Number of rows 2246
Number of columns 18
_______________________
Column type frequency:
character 7
factor 1
numeric 10
________________________
Group variables None
-- Variable type: character -----------------------------------------------------------------------
# A tibble: 7 x 8
skim_variable n_missing complete_rate min max empty n_unique whitespace
* <chr> <int> <dbl> <int> <int> <int> <int> <int>
1 race 0 1 1 1 0 3 0
2 married 0 1 1 1 0 2 0
3 college_grad 0 1 1 1 0 2 0
4 smsa 0 1 1 1 0 2 0
5 industry 14 0.994 1 2 0 12 0
6 occupation 9 0.996 1 2 0 13 0
7 union 368 0.836 1 1 0 2 0
-- Variable type: factor --------------------------------------------------------------------------
# A tibble: 1 x 6
skim_variable n_missing complete_rate ordered n_unique top_counts
* <chr> <int> <dbl> <lgl> <int> <chr>
1 college_grad_fac 0 1 FALSE 2 not: 1714, col: 532
-- Variable type: numeric -------------------------------------------------------------------------
# A tibble: 10 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 idcode 0 1 2613. 1481. 1 1366. 2614 3902. 5159 ▇▇▇▇▇
2 age_years 0 1 39.2 3.06 34 36 39 42 46 ▇▆▇▃▃
3 never_married 0 1 0.104 0.306 0 0 0 0 1 ▇▁▁▁▁
4 grade 2 0.999 13.1 2.52 0 12 12 15 18 ▁▁▁▇▃
5 south 0 1 0.419 0.494 0 0 0 1 1 ▇▁▁▁▆
6 c_city 0 1 0.292 0.455 0 0 0 1 1 ▇▁▁▁▃
7 wage 0 1 7.77 5.76 1.00 4.26 6.27 9.59 40.7 ▇▂▁▁▁
8 hours 4 0.998 37.2 10.5 1 35 40 40 80 ▁▂▇▁▁
9 total_work_exp 0 1 12.5 4.61 0.115 9.21 13.1 16.0 28.9 ▂▅▇▂▁
10 tenure 15 0.993 5.98 5.51 0 1.58 3.83 9.33 25.9 ▇▃▂▁▁
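The "Couldn't find skimmers" warning above appears because skim() has no summary functions for haven's labelled columns, so it reports them as character. One possible workaround, sketched here on a small made-up tibble rather than the real nlsw data, is to convert every labelled column to a factor before skimming. The object names toy and toy_fac and the values in them are hypothetical.

```r
library(dplyr)
library(haven)

# Hypothetical stand-in for nlsw: one labelled column, one plain numeric column
toy <- tibble(
  union = labelled(c(1, 0, NA, 1), labels = c(nonunion = 0, union = 1)),
  wage  = c(11.7, 6.4, 5.0, 9.0)
)

# Convert every haven_labelled column to a factor; other columns are untouched
toy_fac <- toy %>%
  mutate(across(where(is.labelled), as_factor))

sapply(toy_fac, class)
```

Applied to nlsw.clean, the same mutate() call should move the seven labelled variables out of the character block of the skim output and into the factor block.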
nlswreg1 <- lm(wage ~ total_work_exp, data = nlsw.clean)
#nlswreg1 <- lm(nlsw.clean$wage ~ nlsw.clean$total_work_exp) is another way to write this.
summary(nlswreg1)
Call:
lm(formula = wage ~ total_work_exp, data = nlsw.clean)
Residuals:
Min 1Q Median 3Q Max
-8.502 -2.911 -1.295 1.308 35.370
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.61249 0.33935 10.64 <2e-16 ***
total_work_exp 0.33143 0.02541 13.04 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.55 on 2244 degrees of freedom
Multiple R-squared: 0.07048, Adjusted R-squared: 0.07006
F-statistic: 170.1 on 1 and 2244 DF, p-value: < 2.2e-16
nlswreg2 <- lm(wage ~ total_work_exp + age_years, data = nlsw.clean)
summary(nlswreg2)
Call:
lm(formula = wage ~ total_work_exp + age_years, data = nlsw.clean)
Residuals:
Min 1Q Median 3Q Max
-8.107 -2.926 -1.267 1.337 35.050
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.64932 1.50566 5.745 1.05e-08 ***
total_work_exp 0.34233 0.02555 13.401 < 2e-16 ***
age_years -0.13213 0.03849 -3.433 0.000607 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.537 on 2243 degrees of freedom
Multiple R-squared: 0.07534, Adjusted R-squared: 0.07451
F-statistic: 91.37 on 2 and 2243 DF, p-value: < 2.2e-16
nlswreg3 <- lm(wage ~ total_work_exp + age_years + college_grad, data = nlsw.clean)
summary(nlswreg3)
Call:
lm(formula = wage ~ total_work_exp + age_years + college_grad,
data = nlsw.clean)
Residuals:
Min 1Q Median 3Q Max
-9.917 -2.689 -1.021 1.086 35.550
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.92690 1.46037 5.428 6.31e-08 ***
total_work_exp 0.30869 0.02491 12.391 < 2e-16 ***
age_years -0.12252 0.03731 -3.284 0.00104 **
college_grad 3.24134 0.26799 12.095 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.366 on 2242 degrees of freedom
Multiple R-squared: 0.132, Adjusted R-squared: 0.1308
F-statistic: 113.6 on 3 and 2242 DF, p-value: < 2.2e-16
nlswreg4 <- lm(wage ~ total_work_exp + age_years + college_grad + age_years:total_work_exp, data = nlsw.clean)
summary(nlswreg4)
Call:
lm(formula = wage ~ total_work_exp + age_years + college_grad +
age_years:total_work_exp, data = nlsw.clean)
Residuals:
Min 1Q Median 3Q Max
-9.535 -2.646 -1.027 1.064 35.940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.701472 4.151041 0.410 0.6819
total_work_exp 0.816992 0.318256 2.567 0.0103 *
age_years 0.034692 0.104979 0.330 0.7411
college_grad 3.229193 0.268005 12.049 <2e-16 ***
total_work_exp:age_years -0.012788 0.007982 -1.602 0.1093
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.364 on 2241 degrees of freedom
Multiple R-squared: 0.133, Adjusted R-squared: 0.1314
F-statistic: 85.92 on 4 and 2241 DF, p-value: < 2.2e-16
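With the interaction term in the model, the total_work_exp coefficient on its own is the experience slope at age zero, which is why it jumps to 0.82. The slope at a realistic age is the main effect plus the interaction coefficient times that age. A small worked example follows, using the coefficients printed above and the minimum, median, and maximum of age_years from the skim output. The variable names are illustrative only.

```r
# Coefficients copied from summary(nlswreg4)
b_exp <- 0.816992    # total_work_exp
b_int <- -0.012788   # total_work_exp:age_years

# Ages taken from the skim output for age_years (p0, p50, p100)
ages <- c(34, 39, 46)

# Implied wage change per extra year of experience at each age
slope_at_age <- b_exp + b_int * ages
round(slope_at_age, 3)
# 0.382 0.318 0.229
```

At the median age of 39 the implied slope (about 0.32) is close to the 0.31 estimated in the model without the interaction. Since the interaction term is not significant, the differences across ages should be read with caution.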
plot(nlsw.clean$total_work_exp, nlsw.clean$age_years)
summary.table <- stargazer(nlswreg1,
                           nlswreg2,
                           nlswreg3,
                           nlswreg4,
                           type = "text",
                           font.size = "normalsize",
                           digits = 2,
                           keep.stat = c("n", "rsq", "adj.rsq"),
                           out = "test.html",
                           star.cutoffs = c(.05, .01, .001))
length of NULL cannot be changed
number of rows of result is not a multiple of vector length (arg 2)
=========================================================
Dependent variable:
--------------------------------
wage
(1) (2) (3) (4)
---------------------------------------------------------
total_work_exp 0.33*** 0.34*** 0.31*** 0.82*
(0.03) (0.03) (0.02) (0.32)
age_years -0.13*** -0.12** 0.03
(0.04) (0.04) (0.10)
college_grad 3.24*** 3.23***
(0.27) (0.27)
total_work_exp:age_years -0.01
(0.01)
Constant 3.61*** 8.65*** 7.93*** 1.70
(0.34) (1.51) (1.46) (4.15)
---------------------------------------------------------
Observations 2,246 2,246 2,246 2,246
R2 0.07 0.08 0.13 0.13
Adjusted R2 0.07 0.07 0.13 0.13
=========================================================
Note: *p<0.05; **p<0.01; ***p<0.001
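Because each model in the table adds terms to the one before it, the models are nested and can also be compared directly with an F-test through anova(). The sketch below uses simulated data, not nlsw88, so that it runs on its own. The data frame sim, the seed, and the coefficients used to generate wage are all made up for illustration.

```r
set.seed(88)  # arbitrary seed for the simulated data
n <- 500

# Simulated stand-in with the same variable names as nlsw.clean
sim <- data.frame(
  total_work_exp = runif(n, 0, 29),
  age_years      = sample(34:46, n, replace = TRUE),
  college_grad   = rbinom(n, 1, 0.25)
)
sim$wage <- 4 + 0.3 * sim$total_work_exp + 3 * sim$college_grad + rnorm(n, sd = 5)

# Model 3 and model 4 analogues: m4 adds only the interaction term
m3 <- lm(wage ~ total_work_exp + age_years + college_grad, data = sim)
m4 <- update(m3, . ~ . + total_work_exp:age_years)

# F-test of whether the added term improves the fit
cmp <- anova(m3, m4)
cmp
```

When the larger model adds a single term, this F-test gives the same p-value as the t-test on that term in summary(), so on the real data anova(nlswreg3, nlswreg4) would be expected to agree with the 0.1093 reported for the interaction.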
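According to these models and this dataset, total work experience is a significant predictor of women's hourly wage: in model 1, each additional year of experience is associated with an hourly wage about 0.33 higher (p < 0.001). Taking age into account barely changes that estimate (0.34 in model 2). As additional predictor variables are added, age_years becomes less significant, going from p < 0.001 in model 2 to p < 0.01 in model 3 and to non-significant in model 4. When college_grad is introduced in model 3, the coefficients on total_work_exp (0.31) and age_years (-0.12) both shrink slightly, while college graduation is itself associated with an hourly wage about 3.24 higher. The interaction between experience and age in model 4 is not significant (p = 0.11).
Overall, these models are not great predictors of wage. A few variables are significant, but the model as a whole is weak: even the fullest specifications explain only about 13% of the variation in wage (adjusted R-squared of 0.13).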
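The adjusted R-squared quoted above can be reproduced by hand from the ordinary R-squared, the sample size, and the number of predictors, which shows how little the penalty matters with 2,246 observations. This is a small check using the rounded values printed by summary(nlswreg3). The helper name adj_r2 is made up for this example.

```r
# Adjusted R-squared: penalizes R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values from summary(nlswreg3): R-squared 0.132, 2246 rows, 3 predictors
adj_r2_m3 <- adj_r2(r2 = 0.132, n = 2246, p = 3)
round(adj_r2_m3, 4)
# 0.1308
```

The result matches the 0.1308 reported by summary(nlswreg3). Because the input R-squared is itself rounded, the same check on the other models may differ from the printed value in the last digit.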