This loads the data into R and shows the first few rows and columns of the table:
library(readr)
data <- read_csv("Downloads/usa_00020.csv")
## Rows: 333775 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (27): YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER, STATEFIP, COUNTYICP...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
The tidyverse is an R library that helps to do many data manipulations. So this loads it.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ dplyr 1.0.9
## ✔ tibble 3.1.6 ✔ stringr 1.4.0
## ✔ tidyr 1.2.0 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The data set is too big. Let’s discard variables that you’re not going to use and keep the rest. Also, the census data has missing values (NAs, didn’t answer, etc.). So, the tidyverse/filter function makes sure the missing values are discarded also. The result is a leaner and cleaner data set.
data <- data %>%
select(YEAR, METRO, SEX, AGE, MARST, RACE,
HISPAN, EMPSTAT, INCTOT, INCWELFR, POVERTY) %>%
filter(METRO == 2) %>%
select(YEAR, SEX, AGE, MARST, RACE, HISPAN,
EMPSTAT, INCTOT, INCWELFR, POVERTY) %>%
filter(EMPSTAT > 0) %>%
filter(HISPAN < 5) %>%
filter(INCTOT > 0 & INCTOT < 9999999) %>%
filter(INCWELFR < 99999) %>%
filter(POVERTY < 99999)
head(data)
The library stargazer turns R output into better looking tables. This loads it into R.
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
Summary statistics of the cleaned data set.
stargazer(as.data.frame(data), type="text")
##
## ======================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------------
## YEAR 90,203 2,014.678 4.990 2,010 2,020
## SEX 90,203 1.529 0.499 1 2
## AGE 90,203 48.117 18.253 16 95
## MARST 90,203 3.487 2.256 1 6
## RACE 90,203 2.840 2.441 1 9
## HISPAN 90,203 0.663 1.364 0 4
## EMPSTAT 90,203 1.628 0.899 1 3
## INCTOT 90,203 51,522.850 76,448.210 1 1,318,000
## INCWELFR 90,203 113.769 982.272 0 30,000
## POVERTY 90,203 330.792 164.748 1 501
## ------------------------------------------------------
This reduces the variable MARST (marital status) to fewer categories:
data <- data %>%
mutate(MARST = replace(MARST, MARST == 1, 1)) %>%
mutate(MARST = replace(MARST, MARST == 2, 2)) %>%
mutate(MARST = replace(MARST, MARST == 3, 2)) %>%
mutate(MARST = replace(MARST, MARST == 4, 2)) %>%
mutate(MARST = replace(MARST, MARST == 5, 2)) %>%
mutate(MARST = replace(MARST, MARST == 6, 3))
This reduces the variable RACE to - 1 (white), - 2 (black), - 3 (native am & alaskan), - 4 (asian), and - 5 (other).
data <- data %>%
mutate(RACE = replace(RACE, RACE == 1, 1)) %>%
mutate(RACE = replace(RACE, RACE == 2, 2)) %>%
mutate(RACE = replace(RACE, RACE == 3, 3)) %>%
mutate(RACE = replace(RACE, RACE == 4, 4)) %>%
mutate(RACE = replace(RACE, RACE == 5, 4)) %>%
mutate(RACE = replace(RACE, RACE == 6, 4)) %>%
mutate(RACE = replace(RACE, RACE == 7, 4)) %>%
mutate(RACE = replace(RACE, RACE == 8, 4)) %>%
mutate(RACE = replace(RACE, RACE == 9, 5))
This creates a new variable POVI (poverty index) with values: - 0 (earns more than poverty-line income) and - 1 (earns poverty-line income or less).
data$POVI <- data$POVERTY
data <- data %>%
mutate(POVI = replace(POVI, POVI <= 100, 1)) %>%
mutate(POVI = replace(POVI, POVI > 100, 0))
Here’s the summary statistics of the transformed (simplified) data set.
stargazer(as.data.frame(data), type="text")
##
## ======================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------------
## YEAR 90,203 2,014.678 4.990 2,010 2,020
## SEX 90,203 1.529 0.499 1 2
## AGE 90,203 48.117 18.253 16 95
## MARST 90,203 1.963 0.879 1 3
## RACE 90,203 2.153 1.304 1 5
## HISPAN 90,203 0.663 1.364 0 4
## EMPSTAT 90,203 1.628 0.899 1 3
## INCTOT 90,203 51,522.850 76,448.210 1 1,318,000
## INCWELFR 90,203 113.769 982.272 0 30,000
## POVERTY 90,203 330.792 164.748 1 501
## POVI 90,203 0.120 0.325 0 1
## ------------------------------------------------------
These are some plots of the densities (approximate distributions) of the numeric variables or their logs: INCTOT, INCWELFR, and POVERTY. (The other variables are categorical or “factors.”)
Taking logs is required when one has a variable with a distribution that is too skewed (long right tails; so POVERTY is not a good candidate for this, because it has a long tail but on the left side.).
plot(density(data$POVERTY), main="POVERTY: Density plot")
plot(density(log(data$INCWELFR)), main="Log of INCWELFR: Density plot")
plot(density(log(data$INCTOT)), main="Log of INCTOT: Density plot")
These are some plots of numeric variables split by the values of factors (e.g. SEX, YEAR, MARSTAT, RACE). Just to show you what can be easily done with R by just copying, pasting, and tweaking the codes.
ggplot(data, aes(x = AGE)) +
geom_density() +
facet_grid(YEAR ~ .)
ggplot(data, aes(x = log(INCTOT))) +
geom_density() +
facet_grid(YEAR ~ .)
ggplot(data, aes(x = POVERTY)) +
geom_density() +
facet_grid(YEAR ~ .)
ggplot(data, aes(x = log(INCWELFR))) +
geom_density() +
facet_grid(YEAR ~ .)
## Warning: Removed 87476 rows containing non-finite values (stat_density).
Also as an example, see here a table with the descriptive statistics of one variable (e.g. INCTOT) but sliced by YEAR and SEX.
data %>%
group_by(YEAR, SEX) %>%
summarize(n=length(INCTOT),
mean = mean(INCTOT),
median = median(INCTOT),
min = min(INCTOT),
max = max(INCTOT))
## `summarise()` has grouped output by 'YEAR'. You can override using the
## `.groups` argument.
The regressions are next. There are four regressions:
reg <- lm(log(data$INCTOT) ~ factor(data$YEAR) + factor(data$SEX) +
data$AGE + factor(data$MARST) + factor(data$RACE) +
factor(data$HISPAN) + factor(data$EMPSTAT))
reg1 <- lm(data$INCTOT ~ factor(data$YEAR) + factor(data$SEX) +
data$AGE + factor(data$MARST) + factor(data$RACE) +
factor(data$HISPAN) + factor(data$EMPSTAT))
reg2 <- lm(data$POVI ~ factor(data$YEAR) + factor(data$SEX) +
data$AGE + factor(data$MARST) + factor(data$RACE) +
factor(data$HISPAN) + factor(data$EMPSTAT))
reg3 <- lm(log(data$POVERTY) ~ factor(data$YEAR) + factor(data$SEX) +
data$AGE + factor(data$MARST) + factor(data$RACE) +
factor(data$HISPAN) + factor(data$EMPSTAT))
Important note: When one uses the R function “factor” in the regression code, the values of the variables get split into one or more binary variables with the lowest value (usually 1, in the census data) recoded to zero and the alternative values of the factors then recoded as ones. This is key when interpreting the results.
The summaries of the regressions are here.
summary(reg)
##
## Call:
## lm(formula = log(data$INCTOT) ~ factor(data$YEAR) + factor(data$SEX) +
## data$AGE + factor(data$MARST) + factor(data$RACE) + factor(data$HISPAN) +
## factor(data$EMPSTAT))
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0211 -0.5483 0.0850 0.6583 4.3718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.1496334 0.0153924 659.391 <2e-16 ***
## factor(data$YEAR)2020 0.3069675 0.0073303 41.877 <2e-16 ***
## factor(data$SEX)2 -0.2041551 0.0073166 -27.903 <2e-16 ***
## data$AGE 0.0167590 0.0002677 62.604 <2e-16 ***
## factor(data$MARST)2 -0.1079756 0.0098510 -10.961 <2e-16 ***
## factor(data$MARST)3 -0.1536810 0.0091715 -16.756 <2e-16 ***
## factor(data$RACE)2 -0.3581921 0.0094161 -38.040 <2e-16 ***
## factor(data$RACE)3 -0.4767336 0.0509795 -9.351 <2e-16 ***
## factor(data$RACE)4 -0.3767021 0.0090078 -41.820 <2e-16 ***
## factor(data$RACE)5 -0.1162281 0.0495923 -2.344 0.0191 *
## factor(data$HISPAN)1 -0.3661611 0.0245883 -14.892 <2e-16 ***
## factor(data$HISPAN)2 -0.2018365 0.0149013 -13.545 <2e-16 ***
## factor(data$HISPAN)3 -0.0827552 0.0490151 -1.688 0.0913 .
## factor(data$HISPAN)4 -0.2740426 0.0117102 -23.402 <2e-16 ***
## factor(data$EMPSTAT)2 -1.1249463 0.0162357 -69.288 <2e-16 ***
## factor(data$EMPSTAT)3 -1.4353521 0.0095301 -150.613 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.079 on 90187 degrees of freedom
## Multiple R-squared: 0.2828, Adjusted R-squared: 0.2826
## F-statistic: 2370 on 15 and 90187 DF, p-value: < 2.2e-16
summary(reg1)
##
## Call:
## lm(formula = data$INCTOT ~ factor(data$YEAR) + factor(data$SEX) +
## data$AGE + factor(data$MARST) + factor(data$RACE) + factor(data$HISPAN) +
## factor(data$EMPSTAT))
##
## Residuals:
## Min 1Q Median 3Q Max
## -110338 -32260 -12006 11122 1222932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69766.80 1017.09 68.594 < 2e-16 ***
## factor(data$YEAR)2020 16171.11 484.37 33.386 < 2e-16 ***
## factor(data$SEX)2 -12138.08 483.46 -25.106 < 2e-16 ***
## data$AGE 375.34 17.69 21.219 < 2e-16 ***
## factor(data$MARST)2 -8759.03 650.93 -13.456 < 2e-16 ***
## factor(data$MARST)3 -11372.24 606.03 -18.765 < 2e-16 ***
## factor(data$RACE)2 -25840.03 622.19 -41.531 < 2e-16 ***
## factor(data$RACE)3 -25563.31 3368.58 -7.589 3.26e-14 ***
## factor(data$RACE)4 -21051.60 595.21 -35.368 < 2e-16 ***
## factor(data$RACE)5 -9376.51 3276.92 -2.861 0.00422 **
## factor(data$HISPAN)1 -24626.81 1624.73 -15.157 < 2e-16 ***
## factor(data$HISPAN)2 -14017.06 984.64 -14.236 < 2e-16 ***
## factor(data$HISPAN)3 -9270.11 3238.78 -2.862 0.00421 **
## factor(data$HISPAN)4 -18394.73 773.78 -23.773 < 2e-16 ***
## factor(data$EMPSTAT)2 -35004.11 1072.81 -32.628 < 2e-16 ***
## factor(data$EMPSTAT)3 -46661.51 629.72 -74.099 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71290 on 90187 degrees of freedom
## Multiple R-squared: 0.1305, Adjusted R-squared: 0.1303
## F-statistic: 902.1 on 15 and 90187 DF, p-value: < 2.2e-16
summary(reg2)
##
## Call:
## lm(formula = data$POVI ~ factor(data$YEAR) + factor(data$SEX) +
## data$AGE + factor(data$MARST) + factor(data$RACE) + factor(data$HISPAN) +
## factor(data$EMPSTAT))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45243 -0.15325 -0.08420 -0.01216 1.07808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.768e-02 4.426e-03 19.810 < 2e-16 ***
## factor(data$YEAR)2020 -2.827e-02 2.108e-03 -13.412 < 2e-16 ***
## factor(data$SEX)2 1.636e-02 2.104e-03 7.775 7.62e-15 ***
## data$AGE -1.964e-03 7.698e-05 -25.516 < 2e-16 ***
## factor(data$MARST)2 8.634e-02 2.833e-03 30.480 < 2e-16 ***
## factor(data$MARST)3 4.822e-02 2.637e-03 18.284 < 2e-16 ***
## factor(data$RACE)2 4.319e-02 2.708e-03 15.951 < 2e-16 ***
## factor(data$RACE)3 5.462e-02 1.466e-02 3.726 0.000195 ***
## factor(data$RACE)4 4.218e-02 2.590e-03 16.285 < 2e-16 ***
## factor(data$RACE)5 1.941e-02 1.426e-02 1.361 0.173399
## factor(data$HISPAN)1 5.594e-02 7.070e-03 7.912 2.56e-15 ***
## factor(data$HISPAN)2 8.638e-02 4.285e-03 20.160 < 2e-16 ***
## factor(data$HISPAN)3 2.416e-02 1.409e-02 1.714 0.086556 .
## factor(data$HISPAN)4 2.935e-02 3.367e-03 8.715 < 2e-16 ***
## factor(data$EMPSTAT)2 1.406e-01 4.669e-03 30.113 < 2e-16 ***
## factor(data$EMPSTAT)3 1.865e-01 2.740e-03 68.062 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3103 on 90187 degrees of freedom
## Multiple R-squared: 0.09073, Adjusted R-squared: 0.09058
## F-statistic: 599.9 on 15 and 90187 DF, p-value: < 2.2e-16
summary(reg3)
##
## Call:
## lm(formula = log(data$POVERTY) ~ factor(data$YEAR) + factor(data$SEX) +
## data$AGE + factor(data$MARST) + factor(data$RACE) + factor(data$HISPAN) +
## factor(data$EMPSTAT))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9639 -0.3243 0.1790 0.4776 1.7070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7342914 0.0107243 534.701 < 2e-16 ***
## factor(data$YEAR)2020 0.1320082 0.0051072 25.847 < 2e-16 ***
## factor(data$SEX)2 -0.0349660 0.0050977 -6.859 6.97e-12 ***
## data$AGE 0.0057093 0.0001865 30.611 < 2e-16 ***
## factor(data$MARST)2 -0.2842302 0.0068634 -41.412 < 2e-16 ***
## factor(data$MARST)3 -0.1821187 0.0063900 -28.500 < 2e-16 ***
## factor(data$RACE)2 -0.1936931 0.0065604 -29.524 < 2e-16 ***
## factor(data$RACE)3 -0.2498332 0.0355187 -7.034 2.02e-12 ***
## factor(data$RACE)4 -0.1959157 0.0062759 -31.217 < 2e-16 ***
## factor(data$RACE)5 -0.0251367 0.0345522 -0.727 0.4669
## factor(data$HISPAN)1 -0.3374446 0.0171313 -19.698 < 2e-16 ***
## factor(data$HISPAN)2 -0.2215148 0.0103821 -21.336 < 2e-16 ***
## factor(data$HISPAN)3 -0.0734981 0.0341500 -2.152 0.0314 *
## factor(data$HISPAN)4 -0.1711796 0.0081588 -20.981 < 2e-16 ***
## factor(data$EMPSTAT)2 -0.4999592 0.0113118 -44.198 < 2e-16 ***
## factor(data$EMPSTAT)3 -0.5821586 0.0066398 -87.677 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7517 on 90187 degrees of freedom
## Multiple R-squared: 0.1614, Adjusted R-squared: 0.1612
## F-statistic: 1157 on 15 and 90187 DF, p-value: < 2.2e-16
With stargazer, one gets a better looking table with all the regressions compared next to each other.
# summaries but better looking
stargazer(reg, reg1, reg2, reg3, type="text")
##
## ====================================================================================
## Dependent variable:
## ---------------------------------------------------
## INCTOT) INCTOT POVI POVERTY)
## (1) (2) (3) (4)
## ------------------------------------------------------------------------------------
## YEAR)2020 0.307*** 16,171.110*** -0.028*** 0.132***
## (0.007) (484.366) (0.002) (0.005)
##
## SEX)2 -0.204*** -12,138.080*** 0.016*** -0.035***
## (0.007) (483.464) (0.002) (0.005)
##
## AGE 0.017*** 375.343*** -0.002*** 0.006***
## (0.0003) (17.689) (0.0001) (0.0002)
##
## MARST)2 -0.108*** -8,759.026*** 0.086*** -0.284***
## (0.010) (650.926) (0.003) (0.007)
##
## MARST)3 -0.154*** -11,372.240*** 0.048*** -0.182***
## (0.009) (606.029) (0.003) (0.006)
##
## RACE)2 -0.358*** -25,840.030*** 0.043*** -0.194***
## (0.009) (622.189) (0.003) (0.007)
##
## RACE)3 -0.477*** -25,563.310*** 0.055*** -0.250***
## (0.051) (3,368.583) (0.015) (0.036)
##
## RACE)4 -0.377*** -21,051.600*** 0.042*** -0.196***
## (0.009) (595.208) (0.003) (0.006)
##
## RACE)5 -0.116** -9,376.510*** 0.019 -0.025
## (0.050) (3,276.924) (0.014) (0.035)
##
## HISPAN)1 -0.366*** -24,626.810*** 0.056*** -0.337***
## (0.025) (1,624.729) (0.007) (0.017)
##
## HISPAN)2 -0.202*** -14,017.060*** 0.086*** -0.222***
## (0.015) (984.636) (0.004) (0.010)
##
## HISPAN)3 -0.083* -9,270.113*** 0.024* -0.073**
## (0.049) (3,238.782) (0.014) (0.034)
##
## HISPAN)4 -0.274*** -18,394.730*** 0.029*** -0.171***
## (0.012) (773.777) (0.003) (0.008)
##
## EMPSTAT)2 -1.125*** -35,004.110*** 0.141*** -0.500***
## (0.016) (1,072.810) (0.005) (0.011)
##
## EMPSTAT)3 -1.435*** -46,661.510*** 0.187*** -0.582***
## (0.010) (629.720) (0.003) (0.007)
##
## Constant 10.150*** 69,766.800*** 0.088*** 5.734***
## (0.015) (1,017.090) (0.004) (0.011)
##
## ------------------------------------------------------------------------------------
## Observations 90,203 90,203 90,203 90,203
## R2 0.283 0.130 0.091 0.161
## Adjusted R2 0.283 0.130 0.091 0.161
## Residual Std. Error (df = 90187) 1.079 71,293.270 0.310 0.752
## F Statistic (df = 15; 90187) 2,370.350*** 902.061*** 599.926*** 1,157.068***
## ====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
With these results, you can run simulations for each given model. For example, using reg, a 30yo white woman in 2020, who is married with present husband and employed, is expected (i.e. on average) to earn the income below. Note that the values of the variables for this person are: YEAR=2020, SEX=1, AGE=30, MARSTAT=0, RACE=0, HISPAN=0, and EMPSTAT=0.
beta <- reg$coefficients
exp(beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 0*beta[7] +0*beta[8] +0*beta[9] +0*beta[10] +0*beta[11] +0*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16])
## (Intercept)
## 46873.93
Now, the expected INCTOT for 30yo black PRican woman in 2020, who is married with present husband and employed. Because the LHS variable is the log of INCTOT, one needs the exponential function in the computation below.
exp(beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 1*beta[7] + 0*beta[8] + 0*beta[9] + 0*beta[10] +0*beta[11] +1*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16])
## (Intercept)
## 26774.05
Now, using reg1 (with the LHS being INCTOT, not log). For these two women, the expected INCTOT would be as shown next. Taking logs prevents the average to be pulled up by a few high earners. So, this regression is likely to exaggerate the expected INCTOT of people.
beta <- reg1$coefficients
beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 0*beta[7] +0*beta[8] +0*beta[9] +0*beta[10] +0*beta[11] +0*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16]
## (Intercept)
## 85060.11
beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 1*beta[7] + 0*beta[8] + 0*beta[9] + 0*beta[10] +0*beta[11] +1*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16]
## (Intercept)
## 45203.03
With reg2, one needs to be careful. The LHS variable is POVI, a binary variable. So, one can interpret the predicted values of POVI given by the regression as probabilities of falling into poverty (POVI=1). With that proviso, one can use these results.
The following are simulations for the two women above.
beta <- reg2$coefficients
beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 0*beta[7] +0*beta[8] +0*beta[9] +0*beta[10] +0*beta[11] +0*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16]
## (Intercept)
## 0.01684336
beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 1*beta[7] + 0*beta[8] + 0*beta[9] + 0*beta[10] +0*beta[11] +1*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16]
## (Intercept)
## 0.1464157
Note that the probabilities are given in decimal form (to represent them as percentages, one need to multiply the number by 100 and append the “%” sign.)
There’s not even a 2% chance that the white woman would fall into POVI=1, while there is almost a 14% chance that the black woman will. Etc.
More can be done, but that’s it for now.