Harold Nelson
3/15/2020
## ── Attaching packages ───────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Let’s review where we left things before the break. We had observed that some cohorts of women had increased in size between 2007 and 2017 in spite of the natural tendency of a cohort to shrink over time as some members die. The only possible explanation for this phenomenon is that the cohorts had grown through immigration. That led us to examine the racial composition of a cohort at different points in time. The results are documented in https://rpubs.com/HaroldNelsonJr43/580552.
Nothing in our previous results pointed to what most people believe is the primary source of immigration into the US: Mexico and other countries to the south.
To pursue this line of thought, we need to focus on ethnicity as opposed to race. In official US statistics, “Hispanic” is an ethnicity, not a race, and ethnicity is a separate dimension from race. A person can be Hispanic and white, Hispanic and black, or Hispanic and of any other race.
The question we want to pursue now is how the ethnic composition of the cohort of women we examined changed between 2007 and 2017.
Go to https://wonder.cdc.gov/controller/datarequest/D66.
Request that the results be grouped by Mother’s Hispanic Origin, Age of Mother 9, and Year.
Below the grouping selection, check the box for Fertility Rate.
In Box 4, ask for “All Years”. “All” is already selected for everything else in Box 4.
In Box 6, request an export. Uncheck “Show Totals”.
Download the txt file and remove the extraneous columns in Excel. Rename the columns you keep to match what we did before, but use “Ethnicity” instead of “RACE”.
Finally, import this into RStudio as Nat0718_ETH.
Note that you will have to change this code to match the location of the downloaded file on your system. This code points to the downloads folder on my home system.
library(readr)
Nat0718_ETH <- read_delim("~/Downloads/Nat0718_ETH.txt",
"\t", escape_double = FALSE, trim_ws = TRUE)
## Parsed with column specification:
## cols(
## Ethnicity = col_character(),
## Age = col_character(),
## Year = col_double(),
## Births = col_double(),
## Fpop = col_character(),
## Rate = col_character()
## )
Do a summary and glimpse of the downloaded file and list the changes that need to be made.
## Ethnicity Age Year Births
## Length:324 Length:324 Min. :2007 Min. : 8
## Class :character Class :character 1st Qu.:2010 1st Qu.: 1354
## Mode :character Mode :character Median :2012 Median : 7273
## Mean :2012 Mean :148433
## 3rd Qu.:2015 3rd Qu.:205012
## Max. :2018 Max. :912230
## Fpop Rate
## Length:324 Length:324
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Observations: 324
## Variables: 6
## $ Ethnicity <chr> "2135-2", "2135-2", "2135-2", "2135-2", "2135-2", "2135-2",…
## $ Age <chr> "15", "15", "15", "15", "15", "15", "15", "15", "15", "15",…
## $ Year <dbl> 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,…
## $ Births <dbl> 2411, 2326, 2073, 1811, 1576, 1396, 1214, 1037, 986, 886, 7…
## $ Fpop <chr> "Not Available", "Not Available", "Not Available", "Not Ava…
## $ Rate <chr> "Not Available", "Not Available", "Not Available", "Not Ava…
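The calls that produce this kind of output are not shown above. A minimal sketch of what they might look like follows, run on a toy tibble rather than the real download. The name `toy` is introduced here for illustration only; its three rows are copied from the glimpse output above.

```r
library(tibble)
library(dplyr)

# Toy stand-in for Nat0718_ETH (hypothetical name `toy`);
# the values are the first three rows shown in the glimpse output.
toy <- tibble(
  Ethnicity = c("2135-2", "2135-2", "2135-2"),
  Age       = c("15", "15", "15"),
  Year      = c(2007, 2008, 2009),
  Births    = c(2411, 2326, 2073),
  Fpop      = rep("Not Available", 3),
  Rate      = rep("Not Available", 3)
)

summary(toy)   # character columns report only length, class, and mode
glimpse(toy)   # one line per column: type plus the first few values
```

For the real data, the same two calls would be applied to `Nat0718_ETH`.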
We can see that the following changes need to be made.
Ethnicity is coded in a meaningless way. Recode these values to “Hispanic” and “Non-Hispanic”.
Fpop and Rate have been imported as character variables because the string “Not Available” is present. Use as.numeric() to convert the available numeric values and mark the remainder as NA.
Use na.omit() in your dplyr pipeline to remove the observations marked NA.
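Nat0718_ETH = Nat0718_ETH %>%
  # Replace the CDC ethnicity codes with readable labels
  mutate(Ethnicity = recode(Ethnicity, "2135-2" = "Hispanic",
                            "2186-5" = "Non-Hispanic"),
         # "Not Available" cannot be parsed as a number, so it becomes NA
         Fpop = as.numeric(Fpop),
         Rate = as.numeric(Rate)) %>%
  # Drop every row that contains an NA
  na.omit()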
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## # A tibble: 6 x 6
## Ethnicity Age Year Births Fpop Rate
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Hispanic 15-19 2007 148563 1973663 75.3
## 2 Hispanic 15-19 2008 144914 2060092 70.3
## 3 Hispanic 15-19 2009 136263 2140836 63.6
## 4 Hispanic 15-19 2010 121798 2186082 55.7
## 5 Hispanic 15-19 2011 109660 2212656 49.6
## 6 Hispanic 15-19 2012 102722 2218259 46.3
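The two coercion warnings arise because as.numeric() cannot parse the string “Not Available”. One alternative, not used in the analysis above, is to tell readr at import time that this string stands for a missing value, so that Fpop and Rate arrive as numeric columns. The sketch below assumes a small temporary file in the same layout as the cleaned download. The names `tmp` and `demo` are hypothetical, and the rows are copied from the output shown earlier.

```r
library(readr)

# Hypothetical miniature of the cleaned download, written to a temp file.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Ethnicity\tAge\tYear\tBirths\tFpop\tRate",
  "2135-2\t15\t2007\t2411\tNot Available\tNot Available",
  "2135-2\t15-19\t2007\t148563\t1973663\t75.3",
  "2135-2\t15-19\t2008\t144914\t2060092\t70.3"
), tmp)

# Treat "Not Available" as NA while reading, alongside the usual defaults.
demo <- read_delim(tmp, delim = "\t",
                   na = c("", "NA", "Not Available"),
                   trim_ws = TRUE)
demo
```

With this approach the as.numeric() calls are unnecessary, but na.omit() is still needed to drop the rows that contain NA.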
Create and display a new dataframe Nat0718_ETH2 from Nat0718_ETH. This new dataframe should be analogous to the dataframe RCNat0718, which we created for the analysis of racial composition.
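Nat0718_ETH2 = Nat0718_ETH %>%
  # Follow one cohort: women aged 15-19 in 2007 are aged 25-29 in 2017
  filter((Year == 2007 & Age == "15-19") |
         (Year == 2017 & Age == "25-29")) %>%
  group_by(Year, Ethnicity) %>%
  summarize(Fpop = sum(Fpop)) %>%
  ungroup() %>%
  # Express the female population in millions
  mutate(Fpop = Fpop/1000000)
Nat0718_ETH2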
## # A tibble: 4 x 3
## Year Ethnicity Fpop
## <dbl> <chr> <dbl>
## 1 2007 Hispanic 1.97
## 2 2007 Non-Hispanic 8.76
## 3 2017 Hispanic 2.28
## 4 2017 Non-Hispanic 9.19
Create a new dataframe, both, analogous to the dataframe of the same name we created in the analysis of racial composition. Use pivot_wider.
Comment on the results.
both = Nat0718_ETH2 %>%
  pivot_wider(names_from = Year, values_from = Fpop)
colnames(both) = c("Ethnicity","Fpop","Fpop17")
both = both %>%
  mutate(dpop = Fpop17 - Fpop,
         pct_growth = dpop/Fpop,
         pct_contrib = dpop/sum(dpop))
both
## # A tibble: 2 x 6
## Ethnicity Fpop Fpop17 dpop pct_growth pct_contrib
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hispanic 1.97 2.28 0.308 0.156 0.417
## 2 Non-Hispanic 8.76 9.19 0.429 0.0490 0.583
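As a starting point for commenting on the results, the arithmetic behind the last three columns can be rechecked by hand. The sketch below uses the rounded cohort sizes (in millions) printed for Nat0718_ETH2, so the final digit may differ slightly from the table above. The name `both_check` is introduced here for illustration only.

```r
library(tibble)
library(dplyr)

# Rounded cohort sizes in millions, copied from the printed Nat0718_ETH2.
both_check <- tibble(
  Ethnicity = c("Hispanic", "Non-Hispanic"),
  Fpop      = c(1.97, 8.76),   # ages 15-19 in 2007
  Fpop17    = c(2.28, 9.19)    # ages 25-29 in 2017
) %>%
  mutate(dpop        = Fpop17 - Fpop,      # absolute growth of each group
         pct_growth  = dpop / Fpop,        # growth relative to the group's 2007 size
         pct_contrib = dpop / sum(dpop))   # share of the cohort's total growth

both_check
```

Note that pct_growth and pct_contrib are fractions, not percentages. The Hispanic part of the cohort grew by roughly 16 percent, against about 5 percent for the non-Hispanic part. Even so, non-Hispanic women account for the larger share of the total increase (about 58 percent), because that group was more than four times as large in 2007.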