#-100 = inapplicable -99 = no answer, -98 = do not know/cannot choose 
# Question: Did any observations have the values of -100, -99 or -98 for the age variable?
# Answer: No, but I did have some values that defaulted to NAs so it's possible that R picked them up before I could.

# Check 'age' variable after recoding
summary(gss$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   34.00   48.00   49.18   64.00   89.00     208
# Preliminary check for 'hrs1' variable before recoding
summary(gss$hrs1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   37.00   40.00   40.18   45.00   89.00    1606
# Recode work hours using tidyverse package: 
# From GSS online data explorer: -100 = inapplicable, -99 = No answer, -98 = Do not 
# Know/Cannot Choose, -97 = Skipped on Web, 89 = 89+ hours
gss <- gss %>%
  mutate(
    workhrs = case_when(
      hrs1 == 89 ~ NA, # Recoding to remove possible outliers and preserve linearity and. This is decision you will probably consider in your own data
      hrs1 <= -97 ~ NA, # Recoding all negative values to NA
      TRUE ~ hrs1
    )
  ) 


summary(gss$hrs1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   37.00   40.00   40.18   45.00   89.00    1606
summary(gss$workhrs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   36.00   40.00   39.85   45.00   85.00    1619

The above are two code chunks represent effective ways to recode data. The first uses the logic of base R and the second the logic and tools of tidyverse and piping. Both work and you will develop a preference. While it can look more complicated, I recommend building your skills to recode using tidyverse.

# Rename the `conrinc` as a numeric variable "income" 
gss <- gss %>% 
  mutate(income = as.numeric(conrinc))

# Recode the degree variable into a factor with labels corresponding to "Less_HS", "HS", "Assoc", "Bach", "Grad"
 gss <- gss %>% 
   mutate(
      degree = factor(degree,
                    levels = c(0,1,2,3,4),
                    labels = c("Less_HS", "HS", "Assoc", "Bach", "Grad")
    )
  )
 

# Recode the sex variable into a factor variable
gss <- gss %>%
  mutate(
      sex = factor(sex,
                    levels = c(1,2),
                    labels = c("male" , "female")
                    )
      )

# Recoding race variable into descriptive character vector.
gss <- gss %>%
  mutate(
      race = factor(race,
                    levels = c(1,2,3),
                    labels = c("white" , "black ", "other")
                    )
      )

  
# Recoding partyid variable into political affiliation categories.
gss <- gss %>%
  mutate(
      partyid = case_when(
      partyid %in% c(0:2) ~ "Dem", 
      partyid %in% c(4:6) ~ "Rep", 
      partyid %in% c(3,7) ~ "Other"  
    )
  )
gss <- gss %>% 
  select(-hrs1, -conrinc)  # Removing the original work hours and weird income name from the dataset we will be working with

gss <- gss %>% drop_na() # Dropping any rows missing on any variable. The way this is implemented here works the same as keeping only the complete.cases

Data Summary

We’re going to use a sibling function, datasummary_balance(). This function will give us the mean and standard deviation for our continuous data, and below it, give us the counts and percentages for our categorical data.

The formula is datasummary_balance( ~ 1, ...). The “~ 1” tells it to create this double-table for all variables in the data.

datasummary_balance( ~ 1,
                    title = "Sample Descriptive Statistics",
                    notes = "Data: 2022 General Social Survey",
                    data = gss)
## Warning: Labelled data are not supported by the `datasummary_balance()`
## function. This warning appears once per session.
Sample Descriptive Statistics
Mean Std. Dev.
Data: 2022 General Social Survey
age 43.6 14.0
workhrs 40.2 13.2
income 44796.6 40371.0
N Pct.
degree Less_HS 89 5.9
HS 651 43.0
Assoc 143 9.5
Bach 385 25.4
Grad 245 16.2
sex male 760 50.2
female 753 49.8
race white 1116 73.8
black 214 14.1
other 183 12.1
partyid Dem 636 42.0
Other 373 24.7
Rep 504 33.3

Regression Models

Model Specification

Let’s start with a simple model that only include age, hours worked, and sex

ols1 <- lm(income ~ age + workhrs + sex, 
                   data = gss)
summary(ols1)
## 
## Call:
## lm(formula = income ~ age + workhrs + sex, data = gss)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78254 -23112  -8846  10918 144698 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  -3191.36    4719.56  -0.676                0.499    
## age            554.82      69.45   7.989  0.00000000000000268 ***
## workhrs        737.92      74.53   9.901 < 0.0000000000000002 ***
## sexfemale   -11857.33    1969.52  -6.020  0.00000000218093759 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37850 on 1509 degrees of freedom
## Multiple R-squared:  0.1226, Adjusted R-squared:  0.1209 
## F-statistic: 70.31 on 3 and 1509 DF,  p-value: < 0.00000000000000022

Interpretation

Looking at our model, we have a series of coefficient estimates, standard errors, t-values, and p-values for each of our independent variables. Interpret the coefficients.

As age and hours worked increase, so does income. Contrarily, women make 11,857 less than men typically. All of these relationships are statisticallly significant.

## Next, we add race to the previous model and run another regression ## model, called ols2.

ols2 <- lm(income ~ age + workhrs + sex + race, 
                   data = gss)
summary(ols2)
## 
## Call:
## lm(formula = income ~ age + workhrs + sex + race, data = gss)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78562 -22983  -8279  10077 143434 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  -1150.39    4770.10  -0.241             0.809458    
## age            540.18      69.50   7.772   0.0000000000000142 ***
## workhrs        743.61      74.24  10.016 < 0.0000000000000002 ***
## sexfemale   -11319.93    1968.05  -5.752   0.0000000106714558 ***
## raceblack   -10657.69    2824.48  -3.773             0.000167 ***
## raceother    -3230.24    3020.18  -1.070             0.284992    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37700 on 1507 degrees of freedom
## Multiple R-squared:  0.131,  Adjusted R-squared:  0.1281 
## F-statistic: 45.43 on 5 and 1507 DF,  p-value: < 0.00000000000000022
## Interpret the coefficients of race variables. 

## Both blacks and other races are negatively associated with income, meaning that both groups typically earn less than white individuals. However, this is only statistically significant for blacks. 

Data Transformation Effects

Dummies vs Categories

Another way to think about the above interpretation is by creating a series of “dummy variables.” These operate where each variable is split into a series of zero-one variables. For example, our variable sex has two levels, “male” and “female.” Turning it into a dummy would then create a new variable male where males have the coding 1 and females have the coding 0. The fastdummies package has a function dummy_cols() that makes this easy.

Let’s try this now, turning the variables sex, race, partyid, and degree into a series of dummies.

gss <- gss %>% 
  fastDummies::dummy_cols(select_columns = c("sex", "race", "partyid", "degree"))

names(gss)
##  [1] "degree"         "sex"            "race"           "age"           
##  [5] "partyid"        "workhrs"        "income"         "sex_male"      
##  [9] "sex_female"     "race_white"     "race_black "    "race_other"    
## [13] "partyid_Dem"    "partyid_Other"  "partyid_Rep"    "degree_Less_HS"
## [17] "degree_HS"      "degree_Assoc"   "degree_Bach"    "degree_Grad"

Now, let’s try out running models with our new dummies. Note, too, that it doesn’t matter which dummy we leave out as long as one of them is left out. Changing the dummy will only change the intercept, not the slope.

ols_nodummy <- lm(income ~ age  + workhrs + race + sex,
                  data = gss)
ols_dummy1 <- lm(income ~ age + workhrs + 
                   race_white + race_other +
                   sex_female,
                 data = gss)
ols_dummy2 <- lm(income ~ age +  workhrs +  
                   race_white + race_other +
                   sex_male,
                   data = gss)

modelsummary(list(ols_nodummy, ols_dummy1, ols_dummy2),
             stars = T, title = "Testing Dummy Variables",
             gof_omit = "F|Log|IC|RMSE")
Testing Dummy Variables
(1) (2) (3)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) -1150.393 -11808.087* -23128.020***
(4770.100) (5308.898) (5044.718)
age 540.179*** 540.179*** 540.179***
(69.503) (69.503) (69.503)
workhrs 743.608*** 743.608*** 743.608***
(74.243) (74.243) (74.243)
raceblack -10657.694***
(2824.485)
raceother -3230.241
(3020.183)
sexfemale -11319.932***
(1968.054)
race_white 10657.694*** 10657.694***
(2824.485) (2824.485)
race_other 7427.453+ 7427.453+
(3808.216) (3808.216)
sex_female -11319.932***
(1968.054)
sex_male 11319.932***
(1968.054)
Num.Obs. 1513 1513 1513
R2 0.131 0.131 0.131
R2 Adj. 0.128 0.128 0.128

As we can see from these results, there is no difference in whether you use your own dummies or use the defaults. Just note, however, that R will drop whatever the first category is. Creating and using your own dummies gives you more control over this, but isn’t necessary.

If we’d like to change the “reference category,” AKA the category that is dropped, one option is to turn the variable into a factor with new levels (1 = White, 2 = Black, 3 = Other). Another is by putting in ourselves which levels to include in the model: Including Black and Other makes White the reference. Finally, another option is to use the stats::relevel() function to choose a new base category. (relevel(var, ref = new_reference_level))

Scaling and Non-Linearization

Income

If we look at our income variable, our value is very skewed. It’s also in large dollar amounts that aren’t easy to manage. Let’s see what happens if we made two transformations. First, let’s think of income in thousands of dollars instead of in total dollar amounts. Then, let’s also take the natural log of our variable so we arrive at something more normally distributed. Finally, let’s see what happens if we do both. (Note: by default, base::log() takes the natural log. If you want to take the \(log_{10}\) or \(log_2\), you should use log10() or log2(). You can also use log(x, base) for others.)

Taking the log of a term does two things.

  1. Methodologically, it gives us a distribution that is more normally distributed, improving the overall fit.
  2. Theoretically, we use it when we think the effect is increasing at a decreasing rate OR decreasing at a decreasing rate.
gss <- gss %>% 
  mutate(logincome = log(income),
         thouinc = income/1000,
         logthouinc = log(income/1000)
         )

origincplot <- ggplot(gss) + geom_density(aes(income))
thouincplot <- ggplot(gss) + geom_density(aes(thouinc))
logincplot <- ggplot(gss) + geom_density(aes(logincome))
logthouincplot <- ggplot(gss) + geom_density(aes(logthouinc))

ggpubr::ggarrange(origincplot, thouincplot, logincplot, logthouincplot)
Visualizing Transformed Income Variables

Visualizing Transformed Income Variables

Now, let’s try a series of regressions to see what differences, if any, exist when we make these changes.

gss %>% 
  select(income, thouinc, logincome, logthouinc) %>% 
  datasummary_skim(type="numeric",title = "Summary Statistics of Transformed Income Variables")
Unique Missing Pct. Mean SD Min Median Max Histogram
income 26 0 44796.6 40371.0 336.0 36960.0 170912.6
thouinc 26 0 44.8 40.4 0.3 37.0 170.9
logincome 26 0 10.3 1.0 5.8 10.5 12.0
logthouinc 26 0 3.4 1.0 -1.1 3.6 5.1
ols_inc <- lm(income ~ age +  workhrs + sex, data = gss)
ols_thou <- lm(thouinc ~ age +  workhrs + sex, data = gss)
ols_log <- lm(logincome ~ age + workhrs + sex, data = gss)
ols_log_thou <- lm(logthouinc ~ age + workhrs + sex, data = gss)

modelsummary(list("Original" = ols_inc, "Inc/1k" = ols_thou, 
                  "Log(Inc)" = ols_log, "Log(Inc/1k)" = ols_log_thou),
             title = "Effect of Data Transformations: Logarithmic Terms",
             stars = T, gof_omit = "F|RMSE|Log|IC")
Effect of Data Transformations: Logarithmic Terms
Original Inc/1k Log(Inc) Log(Inc/1k)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) -3191.363 -3.191 8.734*** 1.826***
(4719.555) (4.720) (0.119) (0.119)
age 554.824*** 0.555*** 0.015*** 0.015***
(69.451) (0.069) (0.002) (0.002)
workhrs 737.918*** 0.738*** 0.026*** 0.026***
(74.531) (0.075) (0.002) (0.002)
sexfemale -11857.330*** -11.857*** -0.277*** -0.277***
(1969.525) (1.970) (0.050) (0.050)
Num.Obs. 1513 1513 1513 1513
R2 0.123 0.123 0.171 0.171
R2 Adj. 0.121 0.121 0.169 0.169

We can see here that scaling only shifts the direction of variables, but has no effect on the final model statistics. Overall, the model with the logged term (logincome) worked best (\(r^2 = .3\)) and is the most interpretable. (Remember, an increase in the natural-logged amount equates to a 1% increase in the raw amount. Scaling the logged amount doesn’t have any effect since the increase in percent is stable across the board.)

Work Experience

Now, let’s try another transformation. This time, let’s estimate the number of year’s a person has been working by subtracting 18 years from their age. While we’re here, let’s also see what happens if we square the term.

We use a square term when we hypothesize a U shaped or inverse-U shaped relationship. That is, when we suspect the effect will be non-linear.

gss <- gss %>% 
  mutate(workexp = age - 18,
         worksq = workexp^2)


## gss %>% 
##   select(age, workexp, worksq) %>% 
##   datasummary_skim(title = "Summary Statistics of Transformed Work Experience Variables")

## OK I played with this one for like an hour and couldn't get it to work; I keep getting an error code

Let’s now test out whether these two data transformations affect our models. Below I compare age and work experience (equal to \(age - 18\)). Lastly, I include a model with both work experience and work experience squared.

ols_age <- lm(logincome ~ age + workhrs + sex,
                   data = gss)
ols_work <- lm(logincome ~ workexp + workhrs + sex,
                   data = gss)
ols_work_worksq <- lm(logincome ~ workexp + worksq + workhrs + sex,
                   data = gss)
wrkcoefs <- c("age" = "Age", "workexp" = "Work Experience",
              "worksq" = "Work Experience ^2", "sexMale" = "Male", "workhrs" = "Work Hours")
modelsummary(list(ols_age, ols_work,  ols_work_worksq),
             coef_map = wrkcoefs, stars = T, gof_omit = "F|RMSE|Log|IC",
             title = "Effect of Data Transformations: Square Terms")
Effect of Data Transformations: Square Terms
(1) (2) (3)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Age 0.015***
(0.002)
Work Experience 0.015*** 0.074***
(0.002) (0.007)
Work Experience ^2 -0.001***
(0.000)
Work Hours 0.026*** 0.026*** 0.022***
(0.002) (0.002) (0.002)
Num.Obs. 1513 1513 1513
R2 0.171 0.171 0.215
R2 Adj. 0.169 0.169 0.213

Looking at our table, we can see that the simple shift of \(work\:experience = age - 18\) also had no real effect on model fit.

Meanwhile, adding the square term increased our model’s prediction power (via the \(r^2\)).

With these in mind, we can tell that shifting or dividing a variable doesn’t change the relationship; it only changes the amount. But changing the shape of the line (via taking the log) does change the relationship and therefore the income.

Okay, now, interpret the coefficients of work experience and work experience squared (from model ols_work_worksq). Do you think the results make sense? Explain.

In the last model, we can interpret the work experience as: holding all else constant, for every 1 year increase in work experience, a worker likely sees around a 7% increase in income. Futhermore, this trend decreases slightly once peaking.