Data analysis for Sophia Gautier

This loads the data into R and shows the first few rows and columns of the table:

library(readr)
data <- read_csv("Downloads/usa_00020.csv")

## Rows: 333775 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (27): YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER, STATEFIP, COUNTYICP...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data)

The tidyverse is an R library that helps to do many data manipulations. So this loads it.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.5     ✔ dplyr   1.0.9
## ✔ tibble  3.1.6     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ purrr   0.3.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

The data set is too big. Let’s discard variables that you’re not going to use and keep the rest. Also, the census data has missing values (NAs, didn’t answer, etc.). So, the tidyverse/filter function makes sure the missing values are discarded also. The result is a leaner and cleaner data set.

data <- data %>% 
  select(YEAR, METRO, SEX, AGE, MARST, RACE,
         HISPAN, EMPSTAT, INCTOT, INCWELFR, POVERTY) %>%
  filter(METRO == 2) %>% 
  select(YEAR, SEX, AGE, MARST, RACE, HISPAN, 
         EMPSTAT, INCTOT, INCWELFR, POVERTY) %>%
  filter(EMPSTAT > 0) %>% 
  filter(HISPAN < 5) %>% 
  filter(INCTOT > 0 & INCTOT < 9999999) %>%
  filter(INCWELFR < 99999) %>%
  filter(POVERTY < 99999)
head(data)

The library stargazer turns R output into better looking tables. This loads it into R.

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

Summary statistics of the cleaned data set.

stargazer(as.data.frame(data), type="text")

## 
## ======================================================
## Statistic   N       Mean     St. Dev.   Min     Max   
## ------------------------------------------------------
## YEAR      90,203 2,014.678    4.990    2,010   2,020  
## SEX       90,203   1.529      0.499      1       2    
## AGE       90,203   48.117     18.253    16      95    
## MARST     90,203   3.487      2.256      1       6    
## RACE      90,203   2.840      2.441      1       9    
## HISPAN    90,203   0.663      1.364      0       4    
## EMPSTAT   90,203   1.628      0.899      1       3    
## INCTOT    90,203 51,522.850 76,448.210   1   1,318,000
## INCWELFR  90,203  113.769    982.272     0    30,000  
## POVERTY   90,203  330.792    164.748     1      501   
## ------------------------------------------------------

This reduces the variable MARST (marital status) to fewer categories:

1 (married with present spouse),
2 (married with absent spouse, separated, divorced, or widowed), and
3 (never married or single).

data <- data %>%
  mutate(MARST = replace(MARST, MARST == 1, 1)) %>%
  mutate(MARST = replace(MARST, MARST == 2, 2)) %>%
  mutate(MARST = replace(MARST, MARST == 3, 2)) %>%
  mutate(MARST = replace(MARST, MARST == 4, 2)) %>%
  mutate(MARST = replace(MARST, MARST == 5, 2)) %>%
  mutate(MARST = replace(MARST, MARST == 6, 3))

This reduces the variable RACE to - 1 (white), - 2 (black), - 3 (native am & alaskan), - 4 (asian), and - 5 (other).

data <- data %>%
  mutate(RACE = replace(RACE, RACE == 1, 1)) %>%
  mutate(RACE = replace(RACE, RACE == 2, 2)) %>%
  mutate(RACE = replace(RACE, RACE == 3, 3)) %>%
  mutate(RACE = replace(RACE, RACE == 4, 4)) %>%
  mutate(RACE = replace(RACE, RACE == 5, 4)) %>%
  mutate(RACE = replace(RACE, RACE == 6, 4)) %>%
  mutate(RACE = replace(RACE, RACE == 7, 4)) %>%
  mutate(RACE = replace(RACE, RACE == 8, 4)) %>%
  mutate(RACE = replace(RACE, RACE == 9, 5))

This creates a new variable POVI (poverty index) with values: - 0 (earns more than poverty-line income) and - 1 (earns poverty-line income or less).

data$POVI <- data$POVERTY
data <- data %>%
  mutate(POVI = replace(POVI, POVI <= 100, 1)) %>%
  mutate(POVI = replace(POVI, POVI > 100, 0))

Here’s the summary statistics of the transformed (simplified) data set.

stargazer(as.data.frame(data), type="text")

## 
## ======================================================
## Statistic   N       Mean     St. Dev.   Min     Max   
## ------------------------------------------------------
## YEAR      90,203 2,014.678    4.990    2,010   2,020  
## SEX       90,203   1.529      0.499      1       2    
## AGE       90,203   48.117     18.253    16      95    
## MARST     90,203   1.963      0.879      1       3    
## RACE      90,203   2.153      1.304      1       5    
## HISPAN    90,203   0.663      1.364      0       4    
## EMPSTAT   90,203   1.628      0.899      1       3    
## INCTOT    90,203 51,522.850 76,448.210   1   1,318,000
## INCWELFR  90,203  113.769    982.272     0    30,000  
## POVERTY   90,203  330.792    164.748     1      501   
## POVI      90,203   0.120      0.325      0       1    
## ------------------------------------------------------

These are some plots of the densities (approximate distributions) of the numeric variables or their logs: INCTOT, INCWELFR, and POVERTY. (The other variables are categorical or “factors.”)

Taking logs is required when one has a variable with a distribution that is too skewed (long right tails; so POVERTY is not a good candidate for this, because it has a long tail but on the left side.).

plot(density(data$POVERTY), main="POVERTY: Density plot")

plot(density(log(data$INCWELFR)), main="Log of INCWELFR: Density plot")

plot(density(log(data$INCTOT)), main="Log of INCTOT: Density plot")

These are some plots of numeric variables split by the values of factors (e.g. SEX, YEAR, MARSTAT, RACE). Just to show you what can be easily done with R by just copying, pasting, and tweaking the codes.

ggplot(data, aes(x = AGE)) +
  geom_density() +
  facet_grid(YEAR ~ .)

ggplot(data, aes(x = log(INCTOT))) +
  geom_density() +
  facet_grid(YEAR ~ .)

ggplot(data, aes(x = POVERTY)) +
  geom_density() +
  facet_grid(YEAR ~ .)

ggplot(data, aes(x = log(INCWELFR))) +
  geom_density() +
  facet_grid(YEAR ~ .)

## Warning: Removed 87476 rows containing non-finite values (stat_density).

Also as an example, see here a table with the descriptive statistics of one variable (e.g. INCTOT) but sliced by YEAR and SEX.

data %>% 
  group_by(YEAR, SEX) %>% 
  summarize(n=length(INCTOT),
            mean = mean(INCTOT),
            median = median(INCTOT),
            min = min(INCTOT),
            max = max(INCTOT))

## `summarise()` has grouped output by 'YEAR'. You can override using the
## `.groups` argument.

The regressions are next. There are four regressions:

reg: with the log of INCTOT as dependent variable, and with AGE and all the factors (YEAR, SEX, MARST, RACE, HISPAN, and EMPSTAT) as independent or explanatory variables;
reg1: with INCTOT (not the log) as dependent variable and the same explanatory variables as reg;
reg2: with POVI as dependent variable and all rest as dependent variables (not INCTOT, because INCTOT would obviously grab all the attention, since the POVERTY variable is constructed in accordance with the level of INCTOT and POVI is POVERTY but turned binary: YES, poor vs NO, not poor); and
reg3: with the log of POVERTY as dependent variable and the usual explanatory variables on the right hand side.

reg <- lm(log(data$INCTOT) ~ factor(data$YEAR) + factor(data$SEX) + 
            data$AGE + factor(data$MARST) + factor(data$RACE) + 
            factor(data$HISPAN) + factor(data$EMPSTAT))

reg1 <- lm(data$INCTOT ~ factor(data$YEAR) + factor(data$SEX) + 
            data$AGE + factor(data$MARST) + factor(data$RACE) + 
            factor(data$HISPAN) + factor(data$EMPSTAT))

reg2 <- lm(data$POVI ~ factor(data$YEAR) + factor(data$SEX) + 
             data$AGE + factor(data$MARST) + factor(data$RACE) + 
             factor(data$HISPAN) + factor(data$EMPSTAT))

reg3 <- lm(log(data$POVERTY) ~ factor(data$YEAR) + factor(data$SEX) + 
             data$AGE + factor(data$MARST) + factor(data$RACE) + 
             factor(data$HISPAN) + factor(data$EMPSTAT))

Important note: When one uses the R function “factor” in the regression code, the values of the variables get split into one or more binary variables with the lowest value (usually 1, in the census data) recoded to zero and the alternative values of the factors then recoded as ones. This is key when interpreting the results.

The summaries of the regressions are here.

summary(reg)

## 
## Call:
## lm(formula = log(data$INCTOT) ~ factor(data$YEAR) + factor(data$SEX) + 
##     data$AGE + factor(data$MARST) + factor(data$RACE) + factor(data$HISPAN) + 
##     factor(data$EMPSTAT))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0211  -0.5483   0.0850   0.6583   4.3718 
## 
## Coefficients:
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)           10.1496334  0.0153924  659.391   <2e-16 ***
## factor(data$YEAR)2020  0.3069675  0.0073303   41.877   <2e-16 ***
## factor(data$SEX)2     -0.2041551  0.0073166  -27.903   <2e-16 ***
## data$AGE               0.0167590  0.0002677   62.604   <2e-16 ***
## factor(data$MARST)2   -0.1079756  0.0098510  -10.961   <2e-16 ***
## factor(data$MARST)3   -0.1536810  0.0091715  -16.756   <2e-16 ***
## factor(data$RACE)2    -0.3581921  0.0094161  -38.040   <2e-16 ***
## factor(data$RACE)3    -0.4767336  0.0509795   -9.351   <2e-16 ***
## factor(data$RACE)4    -0.3767021  0.0090078  -41.820   <2e-16 ***
## factor(data$RACE)5    -0.1162281  0.0495923   -2.344   0.0191 *  
## factor(data$HISPAN)1  -0.3661611  0.0245883  -14.892   <2e-16 ***
## factor(data$HISPAN)2  -0.2018365  0.0149013  -13.545   <2e-16 ***
## factor(data$HISPAN)3  -0.0827552  0.0490151   -1.688   0.0913 .  
## factor(data$HISPAN)4  -0.2740426  0.0117102  -23.402   <2e-16 ***
## factor(data$EMPSTAT)2 -1.1249463  0.0162357  -69.288   <2e-16 ***
## factor(data$EMPSTAT)3 -1.4353521  0.0095301 -150.613   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.079 on 90187 degrees of freedom
## Multiple R-squared:  0.2828, Adjusted R-squared:  0.2826 
## F-statistic:  2370 on 15 and 90187 DF,  p-value: < 2.2e-16

summary(reg1)

## 
## Call:
## lm(formula = data$INCTOT ~ factor(data$YEAR) + factor(data$SEX) + 
##     data$AGE + factor(data$MARST) + factor(data$RACE) + factor(data$HISPAN) + 
##     factor(data$EMPSTAT))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110338  -32260  -12006   11122 1222932 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            69766.80    1017.09  68.594  < 2e-16 ***
## factor(data$YEAR)2020  16171.11     484.37  33.386  < 2e-16 ***
## factor(data$SEX)2     -12138.08     483.46 -25.106  < 2e-16 ***
## data$AGE                 375.34      17.69  21.219  < 2e-16 ***
## factor(data$MARST)2    -8759.03     650.93 -13.456  < 2e-16 ***
## factor(data$MARST)3   -11372.24     606.03 -18.765  < 2e-16 ***
## factor(data$RACE)2    -25840.03     622.19 -41.531  < 2e-16 ***
## factor(data$RACE)3    -25563.31    3368.58  -7.589 3.26e-14 ***
## factor(data$RACE)4    -21051.60     595.21 -35.368  < 2e-16 ***
## factor(data$RACE)5     -9376.51    3276.92  -2.861  0.00422 ** 
## factor(data$HISPAN)1  -24626.81    1624.73 -15.157  < 2e-16 ***
## factor(data$HISPAN)2  -14017.06     984.64 -14.236  < 2e-16 ***
## factor(data$HISPAN)3   -9270.11    3238.78  -2.862  0.00421 ** 
## factor(data$HISPAN)4  -18394.73     773.78 -23.773  < 2e-16 ***
## factor(data$EMPSTAT)2 -35004.11    1072.81 -32.628  < 2e-16 ***
## factor(data$EMPSTAT)3 -46661.51     629.72 -74.099  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71290 on 90187 degrees of freedom
## Multiple R-squared:  0.1305, Adjusted R-squared:  0.1303 
## F-statistic: 902.1 on 15 and 90187 DF,  p-value: < 2.2e-16

summary(reg2)

## 
## Call:
## lm(formula = data$POVI ~ factor(data$YEAR) + factor(data$SEX) + 
##     data$AGE + factor(data$MARST) + factor(data$RACE) + factor(data$HISPAN) + 
##     factor(data$EMPSTAT))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45243 -0.15325 -0.08420 -0.01216  1.07808 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            8.768e-02  4.426e-03  19.810  < 2e-16 ***
## factor(data$YEAR)2020 -2.827e-02  2.108e-03 -13.412  < 2e-16 ***
## factor(data$SEX)2      1.636e-02  2.104e-03   7.775 7.62e-15 ***
## data$AGE              -1.964e-03  7.698e-05 -25.516  < 2e-16 ***
## factor(data$MARST)2    8.634e-02  2.833e-03  30.480  < 2e-16 ***
## factor(data$MARST)3    4.822e-02  2.637e-03  18.284  < 2e-16 ***
## factor(data$RACE)2     4.319e-02  2.708e-03  15.951  < 2e-16 ***
## factor(data$RACE)3     5.462e-02  1.466e-02   3.726 0.000195 ***
## factor(data$RACE)4     4.218e-02  2.590e-03  16.285  < 2e-16 ***
## factor(data$RACE)5     1.941e-02  1.426e-02   1.361 0.173399    
## factor(data$HISPAN)1   5.594e-02  7.070e-03   7.912 2.56e-15 ***
## factor(data$HISPAN)2   8.638e-02  4.285e-03  20.160  < 2e-16 ***
## factor(data$HISPAN)3   2.416e-02  1.409e-02   1.714 0.086556 .  
## factor(data$HISPAN)4   2.935e-02  3.367e-03   8.715  < 2e-16 ***
## factor(data$EMPSTAT)2  1.406e-01  4.669e-03  30.113  < 2e-16 ***
## factor(data$EMPSTAT)3  1.865e-01  2.740e-03  68.062  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3103 on 90187 degrees of freedom
## Multiple R-squared:  0.09073,    Adjusted R-squared:  0.09058 
## F-statistic: 599.9 on 15 and 90187 DF,  p-value: < 2.2e-16

summary(reg3)

## 
## Call:
## lm(formula = log(data$POVERTY) ~ factor(data$YEAR) + factor(data$SEX) + 
##     data$AGE + factor(data$MARST) + factor(data$RACE) + factor(data$HISPAN) + 
##     factor(data$EMPSTAT))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9639 -0.3243  0.1790  0.4776  1.7070 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.7342914  0.0107243 534.701  < 2e-16 ***
## factor(data$YEAR)2020  0.1320082  0.0051072  25.847  < 2e-16 ***
## factor(data$SEX)2     -0.0349660  0.0050977  -6.859 6.97e-12 ***
## data$AGE               0.0057093  0.0001865  30.611  < 2e-16 ***
## factor(data$MARST)2   -0.2842302  0.0068634 -41.412  < 2e-16 ***
## factor(data$MARST)3   -0.1821187  0.0063900 -28.500  < 2e-16 ***
## factor(data$RACE)2    -0.1936931  0.0065604 -29.524  < 2e-16 ***
## factor(data$RACE)3    -0.2498332  0.0355187  -7.034 2.02e-12 ***
## factor(data$RACE)4    -0.1959157  0.0062759 -31.217  < 2e-16 ***
## factor(data$RACE)5    -0.0251367  0.0345522  -0.727   0.4669    
## factor(data$HISPAN)1  -0.3374446  0.0171313 -19.698  < 2e-16 ***
## factor(data$HISPAN)2  -0.2215148  0.0103821 -21.336  < 2e-16 ***
## factor(data$HISPAN)3  -0.0734981  0.0341500  -2.152   0.0314 *  
## factor(data$HISPAN)4  -0.1711796  0.0081588 -20.981  < 2e-16 ***
## factor(data$EMPSTAT)2 -0.4999592  0.0113118 -44.198  < 2e-16 ***
## factor(data$EMPSTAT)3 -0.5821586  0.0066398 -87.677  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7517 on 90187 degrees of freedom
## Multiple R-squared:  0.1614, Adjusted R-squared:  0.1612 
## F-statistic:  1157 on 15 and 90187 DF,  p-value: < 2.2e-16

With stargazer, one gets a better looking table with all the regressions compared next to each other.

# summaries but better looking
stargazer(reg, reg1, reg2, reg3, type="text")

## 
## ====================================================================================
##                                                  Dependent variable:                
##                                  ---------------------------------------------------
##                                    INCTOT)        INCTOT        POVI      POVERTY)  
##                                      (1)           (2)          (3)         (4)     
## ------------------------------------------------------------------------------------
## YEAR)2020                          0.307***   16,171.110***  -0.028***    0.132***  
##                                    (0.007)      (484.366)     (0.002)     (0.005)   
##                                                                                     
## SEX)2                             -0.204***   -12,138.080***  0.016***   -0.035***  
##                                    (0.007)      (483.464)     (0.002)     (0.005)   
##                                                                                     
## AGE                                0.017***     375.343***   -0.002***    0.006***  
##                                    (0.0003)      (17.689)     (0.0001)    (0.0002)  
##                                                                                     
## MARST)2                           -0.108***   -8,759.026***   0.086***   -0.284***  
##                                    (0.010)      (650.926)     (0.003)     (0.007)   
##                                                                                     
## MARST)3                           -0.154***   -11,372.240***  0.048***   -0.182***  
##                                    (0.009)      (606.029)     (0.003)     (0.006)   
##                                                                                     
## RACE)2                            -0.358***   -25,840.030***  0.043***   -0.194***  
##                                    (0.009)      (622.189)     (0.003)     (0.007)   
##                                                                                     
## RACE)3                            -0.477***   -25,563.310***  0.055***   -0.250***  
##                                    (0.051)     (3,368.583)    (0.015)     (0.036)   
##                                                                                     
## RACE)4                            -0.377***   -21,051.600***  0.042***   -0.196***  
##                                    (0.009)      (595.208)     (0.003)     (0.006)   
##                                                                                     
## RACE)5                             -0.116**   -9,376.510***    0.019       -0.025   
##                                    (0.050)     (3,276.924)    (0.014)     (0.035)   
##                                                                                     
## HISPAN)1                          -0.366***   -24,626.810***  0.056***   -0.337***  
##                                    (0.025)     (1,624.729)    (0.007)     (0.017)   
##                                                                                     
## HISPAN)2                          -0.202***   -14,017.060***  0.086***   -0.222***  
##                                    (0.015)      (984.636)     (0.004)     (0.010)   
##                                                                                     
## HISPAN)3                           -0.083*    -9,270.113***    0.024*     -0.073**  
##                                    (0.049)     (3,238.782)    (0.014)     (0.034)   
##                                                                                     
## HISPAN)4                          -0.274***   -18,394.730***  0.029***   -0.171***  
##                                    (0.012)      (773.777)     (0.003)     (0.008)   
##                                                                                     
## EMPSTAT)2                         -1.125***   -35,004.110***  0.141***   -0.500***  
##                                    (0.016)     (1,072.810)    (0.005)     (0.011)   
##                                                                                     
## EMPSTAT)3                         -1.435***   -46,661.510***  0.187***   -0.582***  
##                                    (0.010)      (629.720)     (0.003)     (0.007)   
##                                                                                     
## Constant                          10.150***   69,766.800***   0.088***    5.734***  
##                                    (0.015)     (1,017.090)    (0.004)     (0.011)   
##                                                                                     
## ------------------------------------------------------------------------------------
## Observations                        90,203        90,203       90,203      90,203   
## R2                                  0.283         0.130        0.091       0.161    
## Adjusted R2                         0.283         0.130        0.091       0.161    
## Residual Std. Error (df = 90187)    1.079       71,293.270     0.310       0.752    
## F Statistic (df = 15; 90187)     2,370.350***   902.061***   599.926*** 1,157.068***
## ====================================================================================
## Note:                                                    *p<0.1; **p<0.05; ***p<0.01

With these results, you can run simulations for each given model. For example, using reg, a 30yo white woman in 2020, who is married with present husband and employed, is expected (i.e. on average) to earn the income below. Note that the values of the variables for this person are: YEAR=2020, SEX=1, AGE=30, MARSTAT=0, RACE=0, HISPAN=0, and EMPSTAT=0.

beta <- reg$coefficients
exp(beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 0*beta[7] +0*beta[8] +0*beta[9] +0*beta[10] +0*beta[11] +0*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16])

## (Intercept) 
##    46873.93

Now, the expected INCTOT for 30yo black PRican woman in 2020, who is married with present husband and employed. Because the LHS variable is the log of INCTOT, one needs the exponential function in the computation below.

exp(beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 1*beta[7] + 0*beta[8] + 0*beta[9] + 0*beta[10] +0*beta[11] +1*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16])

## (Intercept) 
##    26774.05

Now, using reg1 (with the LHS being INCTOT, not log). For these two women, the expected INCTOT would be as shown next. Taking logs prevents the average to be pulled up by a few high earners. So, this regression is likely to exaggerate the expected INCTOT of people.

beta <- reg1$coefficients
beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 0*beta[7] +0*beta[8] +0*beta[9] +0*beta[10] +0*beta[11] +0*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16]

## (Intercept) 
##    85060.11

beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 1*beta[7] + 0*beta[8] + 0*beta[9] + 0*beta[10] +0*beta[11] +1*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16]

## (Intercept) 
##    45203.03

With reg2, one needs to be careful. The LHS variable is POVI, a binary variable. So, one can interpret the predicted values of POVI given by the regression as probabilities of falling into poverty (POVI=1). With that proviso, one can use these results.

The following are simulations for the two women above.

beta <- reg2$coefficients
beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 0*beta[7] +0*beta[8] +0*beta[9] +0*beta[10] +0*beta[11] +0*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16]

## (Intercept) 
##  0.01684336

beta[1] + 1*beta[2] + 1*beta[3] + 30*beta[4] + 0*beta[5] + 0*beta[6] + 1*beta[7] + 0*beta[8] + 0*beta[9] + 0*beta[10] +0*beta[11] +1*beta[12] +0*beta[13] +0*beta[14] +0*beta[15] +0*beta[16]

## (Intercept) 
##   0.1464157

Note that the probabilities are given in decimal form (to represent them as percentages, one need to multiply the number by 100 and append the “%” sign.)

There’s not even a 2% chance that the white woman would fall into POVI=1, while there is almost a 14% chance that the black woman will. Etc.

More can be done, but that’s it for now.

Data analysis for Sophia Gautier

Julio Huato