Lecture March 16 2020

Harold Nelson

3/15/2020

Setup

library(tidyverse)
## ── Attaching packages ───────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Review

Let’s review where we left things before the break. We had observed that some cohorts of women had increased in size between 2007 and 2017 in spite of the natural tendency of a cohort to shrink over time as some members die. The only possible explanation for this phenomenon is that the cohorts had grown through immigration. That led us to examine the racial composition of a cohort at different points in time. The results are documented in https://rpubs.com/HaroldNelsonJr43/580552.

The Next Question

Nothing in our previous results pointed to what most people believe is the primary source of immigration into the US: Mexico and other countries to the south.

To pursue this line of thought, we need to focus on ethnicity as opposed to race. In official US statistics, “Hispanic” is not a race; it is an ethnicity. Ethnicity is a separate dimension from race. One can be Hispanic and white, Hispanic and black, or Hispanic and any other race.

The question we want to pursue now is how the ethnic composition of the cohort of women we examined changed between 2007 and 2017.

Task 1

Go to https://wonder.cdc.gov/controller/datarequest/D66.

Request that the results be grouped by Mother’s Hispanic Origin, Age of Mother 9, and Year.

Below the grouping selection, check the box for Fertility Rate.

In Box 4, ask for “All Years”. “All” is already selected for everything else in Box 4.

In Box 6, request an export. Uncheck “Show Totals”.

Download the txt file and remove the extraneous columns in Excel. Change the names of the columns you kept to match what we did before. Use “Ethnicity” instead of “RACE”.

Finally, import this into RStudio as Nat0718_ETH.

Solution

Note that you will have to change this code to match the location of the downloaded file on your system. This code points to the downloads folder on my home system.

library(readr)
Nat0718_ETH <- read_delim("~/Downloads/Nat0718_ETH.txt", 
    "\t", escape_double = FALSE, trim_ws = TRUE)
## Parsed with column specification:
## cols(
##   Ethnicity = col_character(),
##   Age = col_character(),
##   Year = col_double(),
##   Births = col_double(),
##   Fpop = col_character(),
##   Rate = col_character()
## )
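
The Excel cleanup in Task 1 could also be done in R. The sketch below is only illustrative: the header names in raw_txt are assumptions about what the WONDER export contains, not copied from the actual file, so compare them with your own download. The two data rows reuse values that appear later in this lecture.

```r
library(dplyr)

# Hypothetical two-row stand-in for the WONDER export. The header names
# are assumptions; check them against your own file before reusing this.
raw_txt <- paste(
  "Notes\tMother's Hispanic Origin Code\tAge of Mother 9 Code\tYear\tBirths\tFemale Population\tFertility Rate",
  "\t2135-2\t15-19\t2007\t148563\t1973663\t75.3",
  "\t2135-2\t15-19\t2008\t144914\t2060092\t70.3",
  sep = "\n"
)

# check.names = FALSE keeps the original headers, spaces and all.
raw <- read.delim(text = raw_txt, check.names = FALSE,
                  stringsAsFactors = FALSE)

# select() can keep and rename columns in one step; Notes is dropped.
Nat_demo <- raw %>%
  select(Ethnicity = `Mother's Hispanic Origin Code`,
         Age = `Age of Mother 9 Code`,
         Year, Births,
         Fpop = `Female Population`,
         Rate = `Fertility Rate`)

Nat_demo
```

Doing the renaming in code keeps a record of exactly which columns were kept, which the Excel route does not.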

Task 2

Do a summary and glimpse of the downloaded file and list the changes that need to be made.

Solution

summary(Nat0718_ETH)
##   Ethnicity             Age                 Year          Births      
##  Length:324         Length:324         Min.   :2007   Min.   :     8  
##  Class :character   Class :character   1st Qu.:2010   1st Qu.:  1354  
##  Mode  :character   Mode  :character   Median :2012   Median :  7273  
##                                        Mean   :2012   Mean   :148433  
##                                        3rd Qu.:2015   3rd Qu.:205012  
##                                        Max.   :2018   Max.   :912230  
##      Fpop               Rate          
##  Length:324         Length:324        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
glimpse(Nat0718_ETH)
## Observations: 324
## Variables: 6
## $ Ethnicity <chr> "2135-2", "2135-2", "2135-2", "2135-2", "2135-2", "2135-2",…
## $ Age       <chr> "15", "15", "15", "15", "15", "15", "15", "15", "15", "15",…
## $ Year      <dbl> 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,…
## $ Births    <dbl> 2411, 2326, 2073, 1811, 1576, 1396, 1214, 1037, 986, 886, 7…
## $ Fpop      <chr> "Not Available", "Not Available", "Not Available", "Not Ava…
## $ Rate      <chr> "Not Available", "Not Available", "Not Available", "Not Ava…
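
Because glimpse() shows only the first few values of each column, it can help to count the distinct values directly. The following is a minimal sketch on a made-up three-row tibble named demo (a hypothetical stand-in, not the real data); the same calls would apply to Nat0718_ETH.

```r
library(dplyr)

# Made-up stand-in using the two ethnicity codes recoded in this lecture.
demo <- tibble(
  Ethnicity = c("2135-2", "2135-2", "2186-5"),
  Fpop      = c("Not Available", "1973663", "Not Available")
)

# How many rows carry each ethnicity code?
code_counts <- demo %>% count(Ethnicity)
code_counts

# How many rows hold the placeholder string instead of a number?
n_unavailable <- sum(demo$Fpop == "Not Available")
n_unavailable
```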

Task 3

We can see that the following changes need to be made.

  1. Ethnicity is coded in a meaningless way. Recode these values to “Hispanic” and “Non-Hispanic”.

  2. Fpop and Rate have been imported as character variables because the string “Not Available” is present. Use as.numeric() to convert the available numeric values and mark the remainder as NA.

  3. Use na.omit() in your dplyr pipeline to remove the observations marked NA.

Solution

Nat0718_ETH = Nat0718_ETH %>% 
  mutate(Ethnicity = recode(Ethnicity, "2135-2" = "Hispanic",
                            "2186-5" = "Non-Hispanic"),
         Fpop = as.numeric(Fpop),
         Rate = as.numeric(Rate)) %>% 
  na.omit()
## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion
head(Nat0718_ETH)
## # A tibble: 6 x 6
##   Ethnicity Age    Year Births    Fpop  Rate
##   <chr>     <chr> <dbl>  <dbl>   <dbl> <dbl>
## 1 Hispanic  15-19  2007 148563 1973663  75.3
## 2 Hispanic  15-19  2008 144914 2060092  70.3
## 3 Hispanic  15-19  2009 136263 2140836  63.6
## 4 Hispanic  15-19  2010 121798 2186082  55.7
## 5 Hispanic  15-19  2011 109660 2212656  49.6
## 6 Hispanic  15-19  2012 102722 2218259  46.3
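
The two warnings in the solution are expected: as.numeric() returns NA for any string it cannot parse, and “Not Available” is such a string. One way to get the same result without warnings, sketched here on a made-up tibble named demo, is to convert the known placeholder to NA explicitly with dplyr's na_if() before the numeric conversion.

```r
library(dplyr)

# Made-up stand-in for the Rate column.
demo <- tibble(Rate = c("Not Available", "75.3", "70.3"))

clean <- demo %>%
  # Turn the placeholder into NA first, then convert; no coercion warning.
  mutate(Rate = as.numeric(na_if(Rate, "Not Available"))) %>%
  # Drop every row that has an NA in any column.
  na.omit()

clean
```

Note that na.omit() removes a row if any of its columns is NA, so it should come after all conversions are done.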

Task 4

Create and display a new dataframe Nat0718_ETH2 from Nat0718_ETH. This new dataframe should be analogous to the dataframe RCNat0718, which we created for the analysis of racial composition.

Solution

Nat0718_ETH2 = Nat0718_ETH %>% 
  filter((Year == 2007 & Age == "15-19") |
         (Year == 2017 & Age == "25-29")) %>% 
  group_by(Year,Ethnicity) %>% 
  summarize(Fpop = sum(Fpop)) %>% 
  ungroup() %>% 
  mutate(Fpop = Fpop/1000000)

Nat0718_ETH2
## # A tibble: 4 x 3
##    Year Ethnicity     Fpop
##   <dbl> <chr>        <dbl>
## 1  2007 Hispanic      1.97
## 2  2007 Non-Hispanic  8.76
## 3  2017 Hispanic      2.28
## 4  2017 Non-Hispanic  9.19
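
The filter pairs ages 15-19 in 2007 with ages 25-29 in 2017 because women who were 15-19 in 2007 are 25-29 ten years later, so both rows describe the same cohort. As a rough check, the sketch below retypes the rounded values printed above into a tibble named eth2 (a name chosen here for illustration) and computes the cohort total and each group's share per year. Because the inputs are rounded, the results are approximate.

```r
library(dplyr)

# Values copied from the rounded Nat0718_ETH2 printout (millions of women).
eth2 <- tibble(
  Year      = c(2007, 2007, 2017, 2017),
  Ethnicity = c("Hispanic", "Non-Hispanic", "Hispanic", "Non-Hispanic"),
  Fpop      = c(1.97, 8.76, 2.28, 9.19)
)

shares <- eth2 %>%
  group_by(Year) %>%
  mutate(Total = sum(Fpop),      # cohort size in that year
         Share = Fpop / Total) %>%  # each group's fraction of the cohort
  ungroup()

shares
```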

Task 5

Create a new dataframe, both, analogous to the dataframe of the same name we created in the analysis of racial composition. Use pivot_wider.

Comment on the results.

Solution

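# One row per ethnicity, with the 2007 and 2017 populations side by side
both = Nat0718_ETH2 %>% 
  pivot_wider(names_from = Year, values_from = Fpop)

# Note: the first column is labeled "Race" but holds the Ethnicity values;
# Fpop is the 2007 population and Fpop17 the 2017 population (in millions)
colnames(both) = c("Race","Fpop","Fpop17")

both = both %>% 
  mutate(dpop = Fpop17 - Fpop,             # change in cohort size
         pct_growth = dpop/Fpop,           # growth as a fraction of 2007 size
         pct_contrib = dpop/sum(dpop))     # share of the total change
both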
## # A tibble: 2 x 6
##   Race          Fpop Fpop17  dpop pct_growth pct_contrib
##   <chr>        <dbl>  <dbl> <dbl>      <dbl>       <dbl>
## 1 Hispanic      1.97   2.28 0.308     0.156        0.417
## 2 Non-Hispanic  8.76   9.19 0.429     0.0490       0.583
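
To help with the comment on the results, the arithmetic can be restated in percent. This sketch retypes the rounded Fpop and dpop values from the printout above into a tibble named both_demo (a hypothetical name), so the last digit may differ slightly from the output computed on the unrounded data.

```r
library(dplyr)

# Rounded values copied from the printout of both (millions of women).
both_demo <- tibble(
  Ethnicity = c("Hispanic", "Non-Hispanic"),
  Fpop07    = c(1.97, 8.76),
  dpop      = c(0.308, 0.429)
)

both_demo <- both_demo %>%
  mutate(growth_pct  = 100 * dpop / Fpop07,     # growth relative to 2007 size
         contrib_pct = 100 * dpop / sum(dpop))  # share of total cohort growth

both_demo
```

One possible reading: the Hispanic group started as the smaller part of the cohort but grew at roughly three times the rate of the Non-Hispanic group, and it accounts for about two fifths of the cohort's total growth.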