#-100 = inapplicable -99 = no answer, -98 = do not know/cannot choose
# Question: Did any observations have the values of -100, -99 or -98 for the age variable?
# Answer: No, but I did have some values that defaulted to NAs so it's possible that R picked them up before I could.
# Check 'age' variable after recoding
summary(gss$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 34.00 48.00 49.18 64.00 89.00 208
# Preliminary check for 'hrs1' variable before recoding
summary(gss$hrs1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 37.00 40.00 40.18 45.00 89.00 1606
# Recode work hours using tidyverse package:
# From GSS online data explorer: -100 = inapplicable, -99 = No answer, -98 = Do not
# Know/Cannot Choose, -97 = Skipped on Web, 89 = 89+ hours
gss <- gss %>%
mutate(
workhrs = case_when(
hrs1 == 89 ~ NA, # Recoding to remove possible outliers and preserve linearity and. This is decision you will probably consider in your own data
hrs1 <= -97 ~ NA, # Recoding all negative values to NA
TRUE ~ hrs1
)
)
summary(gss$hrs1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 37.00 40.00 40.18 45.00 89.00 1606
summary(gss$workhrs)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 36.00 40.00 39.85 45.00 85.00 1619
The above are two code chunks represent effective ways to recode data. The first uses the logic of base R and the second the logic and tools of tidyverse and piping. Both work and you will develop a preference. While it can look more complicated, I recommend building your skills to recode using tidyverse.
# Rename the `conrinc` as a numeric variable "income"
gss <- gss %>%
mutate(income = as.numeric(conrinc))
# Recode the degree variable into a factor with labels corresponding to "Less_HS", "HS", "Assoc", "Bach", "Grad"
gss <- gss %>%
mutate(
degree = factor(degree,
levels = c(0,1,2,3,4),
labels = c("Less_HS", "HS", "Assoc", "Bach", "Grad")
)
)
# Recode the sex variable into a factor variable
gss <- gss %>%
mutate(
sex = factor(sex,
levels = c(1,2),
labels = c("male" , "female")
)
)
# Recoding race variable into descriptive character vector.
gss <- gss %>%
mutate(
race = factor(race,
levels = c(1,2,3),
labels = c("white" , "black ", "other")
)
)
# Recoding partyid variable into political affiliation categories.
gss <- gss %>%
mutate(
partyid = case_when(
partyid %in% c(0:2) ~ "Dem",
partyid %in% c(4:6) ~ "Rep",
partyid %in% c(3,7) ~ "Other"
)
)
gss <- gss %>%
select(-hrs1, -conrinc) # Removing the original work hours and weird income name from the dataset we will be working with
gss <- gss %>% drop_na() # Dropping any rows missing on any variable. The way this is implemented here works the same as keeping only the complete.cases
We’re going to use a sibling function,
datasummary_balance(). This function will give us the mean
and standard deviation for our continuous data, and below it, give us
the counts and percentages for our categorical data.
The formula is datasummary_balance( ~ 1, ...). The
“~ 1” tells it to create this double-table for all
variables in the data.
datasummary_balance( ~ 1,
title = "Sample Descriptive Statistics",
notes = "Data: 2022 General Social Survey",
data = gss)
## Warning: Labelled data are not supported by the `datasummary_balance()`
## function. This warning appears once per session.
| Mean | Std. Dev. | ||
|---|---|---|---|
| Data: 2022 General Social Survey | |||
| age | 43.6 | 14.0 | |
| workhrs | 40.2 | 13.2 | |
| income | 44796.6 | 40371.0 | |
| N | Pct. | ||
| degree | Less_HS | 89 | 5.9 |
| HS | 651 | 43.0 | |
| Assoc | 143 | 9.5 | |
| Bach | 385 | 25.4 | |
| Grad | 245 | 16.2 | |
| sex | male | 760 | 50.2 |
| female | 753 | 49.8 | |
| race | white | 1116 | 73.8 |
| black | 214 | 14.1 | |
| other | 183 | 12.1 | |
| partyid | Dem | 636 | 42.0 |
| Other | 373 | 24.7 | |
| Rep | 504 | 33.3 | |
Let’s start with a simple model that only include age, hours worked, and sex
ols1 <- lm(income ~ age + workhrs + sex,
data = gss)
summary(ols1)
##
## Call:
## lm(formula = income ~ age + workhrs + sex, data = gss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78254 -23112 -8846 10918 144698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3191.36 4719.56 -0.676 0.499
## age 554.82 69.45 7.989 0.00000000000000268 ***
## workhrs 737.92 74.53 9.901 < 0.0000000000000002 ***
## sexfemale -11857.33 1969.52 -6.020 0.00000000218093759 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37850 on 1509 degrees of freedom
## Multiple R-squared: 0.1226, Adjusted R-squared: 0.1209
## F-statistic: 70.31 on 3 and 1509 DF, p-value: < 0.00000000000000022
Looking at our model, we have a series of coefficient estimates, standard errors, t-values, and p-values for each of our independent variables. Interpret the coefficients.
## Next, we add race to the previous model and run another regression ## model, called ols2.
ols2 <- lm(income ~ age + workhrs + sex + race,
data = gss)
summary(ols2)
##
## Call:
## lm(formula = income ~ age + workhrs + sex + race, data = gss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78562 -22983 -8279 10077 143434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1150.39 4770.10 -0.241 0.809458
## age 540.18 69.50 7.772 0.0000000000000142 ***
## workhrs 743.61 74.24 10.016 < 0.0000000000000002 ***
## sexfemale -11319.93 1968.05 -5.752 0.0000000106714558 ***
## raceblack -10657.69 2824.48 -3.773 0.000167 ***
## raceother -3230.24 3020.18 -1.070 0.284992
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37700 on 1507 degrees of freedom
## Multiple R-squared: 0.131, Adjusted R-squared: 0.1281
## F-statistic: 45.43 on 5 and 1507 DF, p-value: < 0.00000000000000022
## Interpret the coefficients of race variables.
## Both blacks and other races are negatively associated with income, meaning that both groups typically earn less than white individuals. However, this is only statistically significant for blacks.
Another way to think about the above interpretation is by creating a
series of “dummy variables.” These operate where each variable is split
into a series of zero-one variables. For example, our variable
sex has two levels, “male” and “female.” Turning it into a
dummy would then create a new variable male where males
have the coding 1 and females have the coding
0. The fastdummies package has a function
dummy_cols() that makes this easy.
Let’s try this now, turning the variables sex,
race, partyid, and degree into a
series of dummies.
gss <- gss %>%
fastDummies::dummy_cols(select_columns = c("sex", "race", "partyid", "degree"))
names(gss)
## [1] "degree" "sex" "race" "age"
## [5] "partyid" "workhrs" "income" "sex_male"
## [9] "sex_female" "race_white" "race_black " "race_other"
## [13] "partyid_Dem" "partyid_Other" "partyid_Rep" "degree_Less_HS"
## [17] "degree_HS" "degree_Assoc" "degree_Bach" "degree_Grad"
Now, let’s try out running models with our new dummies. Note, too, that it doesn’t matter which dummy we leave out as long as one of them is left out. Changing the dummy will only change the intercept, not the slope.
ols_nodummy <- lm(income ~ age + workhrs + race + sex,
data = gss)
ols_dummy1 <- lm(income ~ age + workhrs +
race_white + race_other +
sex_female,
data = gss)
ols_dummy2 <- lm(income ~ age + workhrs +
race_white + race_other +
sex_male,
data = gss)
modelsummary(list(ols_nodummy, ols_dummy1, ols_dummy2),
stars = T, title = "Testing Dummy Variables",
gof_omit = "F|Log|IC|RMSE")
| (1) | (2) | (3) | |
|---|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |||
| (Intercept) | -1150.393 | -11808.087* | -23128.020*** |
| (4770.100) | (5308.898) | (5044.718) | |
| age | 540.179*** | 540.179*** | 540.179*** |
| (69.503) | (69.503) | (69.503) | |
| workhrs | 743.608*** | 743.608*** | 743.608*** |
| (74.243) | (74.243) | (74.243) | |
| raceblack | -10657.694*** | ||
| (2824.485) | |||
| raceother | -3230.241 | ||
| (3020.183) | |||
| sexfemale | -11319.932*** | ||
| (1968.054) | |||
| race_white | 10657.694*** | 10657.694*** | |
| (2824.485) | (2824.485) | ||
| race_other | 7427.453+ | 7427.453+ | |
| (3808.216) | (3808.216) | ||
| sex_female | -11319.932*** | ||
| (1968.054) | |||
| sex_male | 11319.932*** | ||
| (1968.054) | |||
| Num.Obs. | 1513 | 1513 | 1513 |
| R2 | 0.131 | 0.131 | 0.131 |
| R2 Adj. | 0.128 | 0.128 | 0.128 |
As we can see from these results, there is no difference in whether you use your own dummies or use the defaults. Just note, however, that R will drop whatever the first category is. Creating and using your own dummies gives you more control over this, but isn’t necessary.
If we’d like to change the “reference category,” AKA the category
that is dropped, one option is to turn the variable into a factor with
new levels (1 = White, 2 = Black, 3 = Other). Another is by putting in
ourselves which levels to include in the model: Including Black and
Other makes White the reference. Finally, another option is to use the
stats::relevel() function to choose a new base category.
(relevel(var, ref = new_reference_level))
If we look at our income variable, our value is very skewed. It’s
also in large dollar amounts that aren’t easy to manage. Let’s see what
happens if we made two transformations. First, let’s think of income in
thousands of dollars instead of in total dollar amounts. Then, let’s
also take the natural log of our variable so we arrive at something more
normally distributed. Finally, let’s see what happens if we do both.
(Note: by default, base::log() takes the natural log. If
you want to take the \(log_{10}\) or
\(log_2\), you should use
log10() or log2(). You can also use
log(x, base) for others.)
Taking the log of a term does two things.
gss <- gss %>%
mutate(logincome = log(income),
thouinc = income/1000,
logthouinc = log(income/1000)
)
origincplot <- ggplot(gss) + geom_density(aes(income))
thouincplot <- ggplot(gss) + geom_density(aes(thouinc))
logincplot <- ggplot(gss) + geom_density(aes(logincome))
logthouincplot <- ggplot(gss) + geom_density(aes(logthouinc))
ggpubr::ggarrange(origincplot, thouincplot, logincplot, logthouincplot)
Visualizing Transformed Income Variables
Now, let’s try a series of regressions to see what differences, if any, exist when we make these changes.
gss %>%
select(income, thouinc, logincome, logthouinc) %>%
datasummary_skim(type="numeric",title = "Summary Statistics of Transformed Income Variables")
| Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
|---|---|---|---|---|---|---|---|---|
| income | 26 | 0 | 44796.6 | 40371.0 | 336.0 | 36960.0 | 170912.6 | |
| thouinc | 26 | 0 | 44.8 | 40.4 | 0.3 | 37.0 | 170.9 | |
| logincome | 26 | 0 | 10.3 | 1.0 | 5.8 | 10.5 | 12.0 | |
| logthouinc | 26 | 0 | 3.4 | 1.0 | -1.1 | 3.6 | 5.1 |
ols_inc <- lm(income ~ age + workhrs + sex, data = gss)
ols_thou <- lm(thouinc ~ age + workhrs + sex, data = gss)
ols_log <- lm(logincome ~ age + workhrs + sex, data = gss)
ols_log_thou <- lm(logthouinc ~ age + workhrs + sex, data = gss)
modelsummary(list("Original" = ols_inc, "Inc/1k" = ols_thou,
"Log(Inc)" = ols_log, "Log(Inc/1k)" = ols_log_thou),
title = "Effect of Data Transformations: Logarithmic Terms",
stars = T, gof_omit = "F|RMSE|Log|IC")
| Original | Inc/1k | Log(Inc) | Log(Inc/1k) | |
|---|---|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||||
| (Intercept) | -3191.363 | -3.191 | 8.734*** | 1.826*** |
| (4719.555) | (4.720) | (0.119) | (0.119) | |
| age | 554.824*** | 0.555*** | 0.015*** | 0.015*** |
| (69.451) | (0.069) | (0.002) | (0.002) | |
| workhrs | 737.918*** | 0.738*** | 0.026*** | 0.026*** |
| (74.531) | (0.075) | (0.002) | (0.002) | |
| sexfemale | -11857.330*** | -11.857*** | -0.277*** | -0.277*** |
| (1969.525) | (1.970) | (0.050) | (0.050) | |
| Num.Obs. | 1513 | 1513 | 1513 | 1513 |
| R2 | 0.123 | 0.123 | 0.171 | 0.171 |
| R2 Adj. | 0.121 | 0.121 | 0.169 | 0.169 |
We can see here that scaling only shifts the direction of variables,
but has no effect on the final model statistics. Overall, the model with
the logged term (logincome) worked best (\(r^2 = .3\)) and is the most interpretable.
(Remember, an increase in the natural-logged amount equates to a 1%
increase in the raw amount. Scaling the logged amount doesn’t have any
effect since the increase in percent is stable across the board.)
Now, let’s try another transformation. This time, let’s estimate the number of year’s a person has been working by subtracting 18 years from their age. While we’re here, let’s also see what happens if we square the term.
We use a square term when we hypothesize a U shaped or inverse-U shaped relationship. That is, when we suspect the effect will be non-linear.
gss <- gss %>%
mutate(workexp = age - 18,
worksq = workexp^2)
## gss %>%
## select(age, workexp, worksq) %>%
## datasummary_skim(title = "Summary Statistics of Transformed Work Experience Variables")
## OK I played with this one for like an hour and couldn't get it to work; I keep getting an error code
Let’s now test out whether these two data transformations affect our models. Below I compare age and work experience (equal to \(age - 18\)). Lastly, I include a model with both work experience and work experience squared.
ols_age <- lm(logincome ~ age + workhrs + sex,
data = gss)
ols_work <- lm(logincome ~ workexp + workhrs + sex,
data = gss)
ols_work_worksq <- lm(logincome ~ workexp + worksq + workhrs + sex,
data = gss)
wrkcoefs <- c("age" = "Age", "workexp" = "Work Experience",
"worksq" = "Work Experience ^2", "sexMale" = "Male", "workhrs" = "Work Hours")
modelsummary(list(ols_age, ols_work, ols_work_worksq),
coef_map = wrkcoefs, stars = T, gof_omit = "F|RMSE|Log|IC",
title = "Effect of Data Transformations: Square Terms")
| (1) | (2) | (3) | |
|---|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |||
| Age | 0.015*** | ||
| (0.002) | |||
| Work Experience | 0.015*** | 0.074*** | |
| (0.002) | (0.007) | ||
| Work Experience ^2 | -0.001*** | ||
| (0.000) | |||
| Work Hours | 0.026*** | 0.026*** | 0.022*** |
| (0.002) | (0.002) | (0.002) | |
| Num.Obs. | 1513 | 1513 | 1513 |
| R2 | 0.171 | 0.171 | 0.215 |
| R2 Adj. | 0.169 | 0.169 | 0.213 |
Looking at our table, we can see that the simple shift of \(work\:experience = age - 18\) also had no real effect on model fit.
Meanwhile, adding the square term increased our model’s prediction power (via the \(r^2\)).
With these in mind, we can tell that shifting or dividing a variable doesn’t change the relationship; it only changes the amount. But changing the shape of the line (via taking the log) does change the relationship and therefore the income.
Okay, now, interpret the coefficients of work experience and work experience squared (from model ols_work_worksq). Do you think the results make sense? Explain.