Create Variables

First I created a data frame from a large dataset of over 100,000 people in North Carolina in 2018. I read the data in with the ipumsr package, an R package built for data from www.ipums.org (information about the ipumsr package: https://cran.r-project.org/web/packages/ipumsr/vignettes/ipums.html). To limit the results, I narrowed the data to the counties containing the four largest cities: Charlotte, Raleigh, Greensboro, and Durham. I named the result “citydiff” and I will be using citydiff for the remainder of my analysis. I started this project by wondering whether men and women are employed at different rates.

I filtered all of the “N/A” data out of the variables, and labeled the numeric values to make more sense to the reader.
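The filtering and labeling code is not shown in this write-up. As a rough sketch of what such a step could look like, the base R snippet below recodes a small made-up data frame. It assumes the standard IPUMS USA codes (SEX: 1 = Male, 2 = Female; EMPSTAT: 0 = N/A, 1 = Employed, 2 = Unemployed, 3 = Not in labor force). The toy rows, the object names, and the choice to keep only employed and unemployed respondents are illustrative assumptions, not the code actually used to build citydiff.

```r
# Toy data standing in for the IPUMS extract (made-up rows, for illustration only)
toy <- data.frame(
  SEX     = c(1, 2, 2, 1, 2, 1),
  EMPSTAT = c(1, 1, 0, 2, 3, 1),
  AGE     = c(34, 51, 8, 27, 70, 45)
)

# Keep only respondents coded Employed (1) or Unemployed (2);
# this drops N/A (0) and, by assumption, Not in labor force (3)
toy_clean <- toy[toy$EMPSTAT %in% c(1, 2), ]

# Replace the numeric codes with readable labels
toy_clean$SEX <- factor(toy_clean$SEX,
                        levels = c(1, 2),
                        labels = c("Male", "Female"))
toy_clean$EMPSTAT <- factor(toy_clean$EMPSTAT,
                            levels = c(1, 2),
                            labels = c("Employed", "Unemployed"))

print(toy_clean)
```

Storing the labels as factors keeps the groups in a fixed order, which is convenient later when a function asks for an explicit group order.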

## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.

Exploratory Analysis of the Data

Next I completed an exploratory analysis of my data to understand it before forming a hypothesis. Here is a bar graph of employment status for men and women. I can tell that most people surveyed are employed, and the split looks about even between the sexes from the graph. I also see that there are slightly more women than men in the sample.

citybar <- citydiff %>% 
  group_by(EMPSTAT, SEX) %>% 
  count(EMPSTAT) %>%
  rename(Employment = EMPSTAT)

citybar %>% 
  ggplot() + 
  geom_bar(aes(x = SEX, y = n, fill = Employment), position = "stack", stat = "identity") + 
  scale_fill_manual(values = c("#8A97BF", "#F29F05")) + 
  theme_minimal()

I also created a table to numerically represent my results. From this I can see that the ratios look about the same, but I should still do a hypothesis test to find out.

citydiff %>% 
  group_by(SEX, EMPSTAT) %>% 
  count(EMPSTAT) %>% 
  rename(Count = n, Employed = EMPSTAT, Gender = SEX) %>% 
  kable()
Gender  Employed    Count
Female  Employed     7173
Female  Unemployed    243
Male    Employed     6864
Male    Unemployed    265

I also created a boxplot of the ages of the people surveyed, split by sex. The two age distributions are similar, which is good: if men and women cover about the same range of ages, I don’t need to worry as much about age affecting my results.

ggplot(citydiff) + 
  geom_boxplot(aes(AGE, SEX)) + 
  theme_minimal() + 
  theme(panel.background = element_rect(fill = "#8A97BF", color = "#D95276"))


Hypothesis:

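\(H_0\): There is no difference in employment status for North Carolinians based on sex (\(p_1 - p_2 = 0\), where \(p_1\) is the proportion of women who are employed and \(p_2\) is the proportion of men who are employed).

\(H_A\): There is some kind of difference in employment status for North Carolinians based on sex (\(p_1 - p_2 \neq 0\)).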

Find the Sample Statistic

psamp <- citydiff %>% 
  specify(formula = EMPSTAT ~ SEX, success = "Employed") %>% 
  calculate(stat = "diff in props", order = c("Female", "Male"))

Interpret Sample Statistic

Difference in P Hat (Female - Male) = 0.004405124

Because the difference is positive, the first proportion (women) is larger than the second (men). This means that in our sample, the proportion of women employed is higher than the proportion of men employed by 0.44 percentage points. This is a very small difference.
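The same number can be reproduced by hand from the four counts in the table of sex by employment status. A short base R check follows; the object names are illustrative, and only the counts come from the analysis above.

```r
# Counts copied from the table of sex by employment status
counts <- matrix(
  c(7173, 243,
    6864, 265),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Status = c("Employed", "Unemployed"))
)

# Proportion employed within each sex
p_hat <- counts[, "Employed"] / rowSums(counts)

# Difference in proportions, Female minus Male (same order as in calculate())
obs_diff <- unname(p_hat["Female"] - p_hat["Male"])

round(p_hat, 4)     # about 0.9672 for women and 0.9628 for men
round(obs_diff, 6)  # about 0.004405
```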


Null Distribution

I will be using a difference in proportions hypothesis test with the ‘infer’ package, because I am comparing the proportion employed between men and women. To build the null distribution, the sex labels are shuffled many times. For each shuffle, I find the proportion employed for both groups and subtract the ‘men’ proportion from the ‘women’ proportion, and the sign of the difference tells me which proportion is larger in that shuffle.

set.seed(3384)
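diff_null_dist <- citydiff %>%
  # Note: this filter limits the permutations to respondents over 50,
  # while psamp above was calculated from the full sample
  filter(AGE > 50) %>% 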
  specify(formula = EMPSTAT ~ SEX, success = "Employed") %>%  
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("Female", "Male"))

visualize(diff_null_dist) +
  shade_pvalue(obs_stat = psamp, direction = "both", color = "#D95276", fill = "#8A97BF") + 
  theme_minimal() + 
  ggtitle("Null Distribution")

diff_null_dist %>% 
  get_p_value(obs_stat = psamp, direction = "both")

Interpret P Value

p-value = 0.412

In a world where the null hypothesis is true (men and women are employed at the same rate), a sample statistic like this one would be quite likely to occur. There would be a 41.2% chance of getting sample proportions at least this far apart.
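To make the shuffling idea concrete, here is a minimal base R sketch of a permutation test rebuilt from the four counts in the table of sex by employment status. It is not the infer code above: the seed, the helper name diff_in_props, and the two-sided p-value definition (share of shuffled differences at least as large in absolute value) are my own assumptions. Because it uses all 14,545 people, while the infer code above restricts the permutations to AGE > 50, its p-value should not be expected to match 0.412.

```r
# Rebuild one row per person from the four counts in the table
sex    <- rep(c("Female", "Male"), times = c(7173 + 243, 6864 + 265))
status <- c(rep(c("Employed", "Unemployed"), times = c(7173, 243)),
            rep(c("Employed", "Unemployed"), times = c(6864, 265)))
employed <- status == "Employed"

# Difference in proportion employed, Female minus Male
diff_in_props <- function(employed, sex) {
  mean(employed[sex == "Female"]) - mean(employed[sex == "Male"])
}

obs_diff <- diff_in_props(employed, sex)

# Shuffle the sex labels 1000 times to break any link with employment
set.seed(1)  # arbitrary seed, used only to make this sketch reproducible
null_diffs <- replicate(1000, diff_in_props(employed, sample(sex)))

# Two-sided p-value: share of shuffled differences at least as extreme as observed
p_value <- mean(abs(null_diffs) >= abs(obs_diff))
p_value
```

Shuffling the labels keeps the overall employment rate and the group sizes fixed, so the only thing that varies between shuffles is how the employed people happen to be split between the two groups.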

Formal Conclusion

Because our p-value (0.412) is much greater than 0.05, we fail to reject our null hypothesis. There is insufficient evidence to conclude that there is a relationship between sex and employment status for North Carolinians. My next step is to create a confidence interval in order to see a range of plausible values for the difference in proportions, instead of one single estimate. This will give me more information about the relationship between sex and employment status.


Confidence Intervals

set.seed(2055)
diff_boot <- citydiff %>% 
  specify(formula = EMPSTAT ~ SEX, success = "Employed") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in props", order = c("Female", "Male"))

diff_boot %>%
  get_confidence_interval(type = "percentile", level = .9)
Lower Interval    Upper Interval
 -0.0006905008       0.009618421

CI: -0.0007 to 0.0096

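Interpretation: I am 90% confident that the difference in employment rates for North Carolinians (women minus men) is between about -0.0007 and 0.0096. Because the lower bound is negative and the upper bound is positive, the interval includes zero, so I cannot be 90% confident that the employment rate is higher for one sex than for the other.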
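As a rough cross-check on the bootstrap interval, a 90% interval can also be worked out with the usual normal approximation for a difference in proportions. This swaps in a different technique (a Wald interval rather than a percentile bootstrap), uses only the four counts from the table of sex by employment status, and the object names are illustrative.

```r
# Counts from the table of sex by employment status
n_f <- 7173 + 243; p_f <- 7173 / n_f  # women: group size and proportion employed
n_m <- 6864 + 265; p_m <- 6864 / n_m  # men: group size and proportion employed

# Standard error of the difference in proportions (unpooled)
se <- sqrt(p_f * (1 - p_f) / n_f + p_m * (1 - p_m) / n_m)

# 90% interval: estimate plus or minus the 95th normal percentile times the SE
z  <- qnorm(0.95)
ci <- (p_f - p_m) + c(-1, 1) * z * se

round(ci, 4)  # roughly -0.0006 to 0.0094
```

The two approaches give similar endpoints here, and both intervals include zero.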


Summary

Main Conclusions

To conclude, both my confidence interval and my hypothesis test found insufficient evidence to conclude that there is a relationship between employment status and sex in North Carolina. This does not mean that there is no relationship for sure, but at this time there’s not enough evidence to say so, and the confidence interval suggests that any difference is small.

Ramifications

For gender-focused groups, it would be powerful to be able to say that this data shows no evidence of a relationship between sex and employment, and to have the data to back this claim up.

Study Limits

One study limit is that the study was made up of only 14,545 observations. All of these people live in the counties containing the four largest cities in North Carolina: Charlotte, Raleigh, Greensboro, and Durham. The results may be different if the data were collected from both rural and urban areas. They may also be different if the study were expanded to include other states. It should also be noted that although I found no evidence of a difference in employment by sex, this study was limited to a binary recording of sex (male or female) and not gender identity. Since it is not recorded how the sex data was gathered, it could be from birth data, which may not reflect how many individuals identify. This study should also not be used to assume that there is no gender discrimination in the workplace, as discrimination can appear in many forms other than hiring.

Future Study Opportunities

It would be useful to study this again with a more comprehensive “sex” measurement that included nonbinary folks. It could also be interesting to see if the results would change if the sexes were also split up by age, i.e., whether people become more or less likely to be employed as they get older, and whether that differs by sex.