Boys and Girls

Harold Nelson

2023-02-13

Setup

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Get the Data

Use the Import Dataset control in RStudio. It generates a read_delim() call like the one below.

boys_and_girls_2021 <- read_delim("Natality, 2016-2021 expanded.txt", delim = "\t", escape_double = FALSE,  trim_ws = TRUE)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 121 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): Notes, State of Residence, State of Residence Code, Sex of Infant, ...
## dbl (1): Births
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
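
The warning above points to problems(). As a minimal sketch of how one might inspect parsing issues, the example below reads a small made-up tab-delimited string rather than the real file. The toy text and the names toy_text, toy, and issues are assumptions for illustration only.

```r
library(readr)

# Hypothetical toy input: the last line has fewer fields than the header,
# which is the kind of row that can trigger a parsing warning.
toy_text <- "state\tsex\tbirths\nAlabama\tF\t170911\nfooter note only\n"

# I() tells readr to treat the string as literal data instead of a file path.
toy <- read_delim(I(toy_text), delim = "\t", show_col_types = FALSE)

# problems() returns a data frame with one row per parsing issue
# (it may have zero rows if nothing went wrong).
issues <- problems(toy)
issues
```

For the data imported above, the equivalent call would be problems(boys_and_girls_2021).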

glimpse() and/or str()

glimpse(boys_and_girls_2021)
## Rows: 121
## Columns: 6
## $ Notes                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ `State of Residence`      <chr> "Alabama", "Alabama", "Alaska", "Alaska", "A…
## $ `State of Residence Code` <chr> "01", "01", "02", "02", "04", "04", "05", "0…
## $ `Sex of Infant`           <chr> "Female", "Male", "Female", "Male", "Female"…
## $ `Sex of Infant Code`      <chr> "F", "M", "F", "M", "F", "M", "F", "M", "F",…
## $ Births                    <dbl> 170911, 179258, 29296, 31102, 235677, 245676…

Use View()

What to do?

How do you clean up these variable names?

Solution

Use janitor::clean_names(). Do a little research!

Real Solution

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
boys_and_girls_2021 <- clean_names(boys_and_girls_2021)

glimpse(boys_and_girls_2021)
## Rows: 121
## Columns: 6
## $ notes                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ state_of_residence      <chr> "Alabama", "Alabama", "Alaska", "Alaska", "Ari…
## $ state_of_residence_code <chr> "01", "01", "02", "02", "04", "04", "05", "05"…
## $ sex_of_infant           <chr> "Female", "Male", "Female", "Male", "Female", …
## $ sex_of_infant_code      <chr> "F", "M", "F", "M", "F", "M", "F", "M", "F", "…
## $ births                  <dbl> 170911, 179258, 29296, 31102, 235677, 245676, …
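
To see what clean_names() does in isolation, here is a small sketch on a toy tibble that borrows three of the original column names. The object names toy and cleaned are made up for this example.

```r
library(tibble)
library(janitor)

# Toy tibble with the same kind of awkward column names as the imported data.
toy <- tibble(
  `State of Residence` = c("Alabama", "Alabama"),
  `Sex of Infant Code` = c("F", "M"),
  Births = c(170911, 179258)
)

# clean_names() lower-cases the names and replaces spaces with underscores,
# so backticks are no longer needed to refer to the columns.
cleaned <- clean_names(toy)
names(cleaned)
```

The resulting names match the ones shown by glimpse() above: state_of_residence, sex_of_infant_code, and births.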

Dump the NA values

Also keep only the columns needed and shorten their names.

Solution

boys_and_girls_2021 <- boys_and_girls_2021 %>% 
  select(state_of_residence, sex_of_infant_code, births) %>% 
  rename(state = state_of_residence,
         sex = sex_of_infant_code) %>% 
  drop_na()

glimpse(boys_and_girls_2021)
## Rows: 102
## Columns: 3
## $ state  <chr> "Alabama", "Alabama", "Alaska", "Alaska", "Arizona", "Arizona",…
## $ sex    <chr> "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F"…
## $ births <dbl> 170911, 179258, 29296, 31102, 235677, 245676, 107765, 112827, 1…
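
The order of the steps matters here: drop_na() with no arguments removes any row that has an NA in any column, and the notes column is NA for the data rows shown earlier. The sketch below illustrates this on a toy tibble. The third row is a made-up stand-in for a non-data row, and the object names are assumptions.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy data: two real-looking rows with notes = NA, plus one hypothetical
# non-data row where the state, sex, and births fields are missing.
toy <- tibble(
  notes  = c(NA, NA, "footnote text"),
  state  = c("Alabama", "Alabama", NA),
  sex    = c("F", "M", NA),
  births = c(170911, 179258, NA)
)

# Dropping NAs first removes every row, because each row has an NA somewhere.
all_dropped <- toy %>% drop_na()

# Selecting the needed columns first keeps the two complete rows.
kept <- toy %>%
  select(state, sex, births) %>%
  drop_na()

nrow(all_dropped)
nrow(kept)
```

An alternative with the same effect is to name the columns to check, as in drop_na(state, sex, births).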

Pivot

wider <- boys_and_girls_2021 %>% 
  pivot_wider(names_from = sex,
              values_from = births) %>% 
  mutate(ratio = M/(M + F))  # share of births that are male

summary(wider$ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5061  0.5107  0.5115  0.5116  0.5124  0.5197
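
As a check on what the pivot does, here is a small sketch using only the first four rows shown in the glimpse() output above (Alabama and Alaska). The object names long and wide are made up for this example.

```r
library(tibble)
library(dplyr)
library(tidyr)

# First four rows of the cleaned data, copied from the glimpse() output.
long <- tibble(
  state  = c("Alabama", "Alabama", "Alaska", "Alaska"),
  sex    = c("F", "M", "F", "M"),
  births = c(170911, 179258, 29296, 31102)
)

# pivot_wider() turns the two rows per state into one row with columns F and M,
# so the male share can be computed within each row.
wide <- long %>%
  pivot_wider(names_from = sex, values_from = births) %>%
  mutate(ratio = M / (M + F))

wide
```

This gives a ratio of about 0.512 for Alabama and about 0.515 for Alaska, both inside the range reported by summary().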