Project 1, Michael Simms

Topic: Births in DC from 2016 to 2021, Ordered by Mother’s Education Level

Essay, Part A: Dataset topic Introduction, Description of Variables (What I Plan to Explore)

Given my interest in population aging, the phenomenon in which a population’s median age rises as fertility rates decline and life expectancy increases, I chose to investigate the following dataset (source: CDC), which covers US births in all 50 states and the District of Columbia from 2016 to 2021. The dataset has 9 variables: 5 categorical variables (State, State Abbreviation, Gender, Education Level of Mother, and Education Level Code) and 4 quantitative variables (Year, Number of Births, Average Age of Mother (in Years), and Average Birth Weight (in Grams)). I cleaned the dataset with the functions tolower() and gsub(), which made all letters in the variable names lowercase and removed the spaces from the variable names, respectively. I also filtered the data to include only values from the District of Columbia, the dataset’s smallest geospatial jurisdiction and one whose geography and demographics I know rather well.

library(tidyverse)
library(patchwork)

#Here I am using the library() function to load the installed "tidyverse" package, as usual. I am also loading the installed "patchwork" package, which will be key for placing several graphs in the same figure.

Load the Data

setwd("~/MC Data Science/Data 110/Datasets")
us_births <- read_csv("us_births_2016_2021.CSV")

Rows: 5496 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): State, State Abbreviation, Gender, Education Level of Mother
dbl (5): Year, Education Level Code, Number of Births, Average Age of Mother...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#In this chunk I am loading the dataset from its appropriate working directory (folder), and using the read_csv() function

Clean Up the Data

names(us_births) <- tolower(names(us_births))
names(us_births) <- gsub(" ","",names(us_births))
str(us_births)

spc_tbl_ [5,496 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state                    : chr [1:5496] "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ stateabbreviation        : chr [1:5496] "AL" "AL" "AL" "AL" ...
 $ year                     : num [1:5496] 2016 2016 2016 2016 2016 ...
 $ gender                   : chr [1:5496] "F" "F" "F" "F" ...
 $ educationlevelofmother   : chr [1:5496] "8th grade or less" "9th through 12th grade with no diploma" "High school graduate or GED completed" "Some college credit, but not a degree" ...
 $ educationlevelcode       : num [1:5496] 1 2 3 4 5 6 7 8 -9 1 ...
 $ numberofbirths           : num [1:5496] 1052 3436 8777 6453 2227 ...
 $ averageageofmother(years): num [1:5496] 27.8 24.1 25.4 26.7 28.9 30.3 32 33.1 27.7 27.6 ...
 $ averagebirthweight(g)    : num [1:5496] 3117 3040 3080 3122 3174 ...
 - attr(*, "spec")=
  .. cols(
  ..   State = col_character(),
  ..   `State Abbreviation` = col_character(),
  ..   Year = col_double(),
  ..   Gender = col_character(),
  ..   `Education Level of Mother` = col_character(),
  ..   `Education Level Code` = col_double(),
  ..   `Number of Births` = col_double(),
  ..   `Average Age of Mother (years)` = col_double(),
  ..   `Average Birth Weight (g)` = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

#Here I used the function tolower() to make all of the variable names lowercase, and I used the function gsub() to remove the spaces in the variable names. I also used the function str() to view the dataset's structure.
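
#A minimal sketch of what the cleaning step above does to a few of the column names, run on a small character vector instead of the real data. The second variant (snake) is an assumption shown only for comparison and is not used in this project; it also removes the parentheses, so the resulting names would not need backticks.

```r
# Column names copied from the str() output above
raw_names <- c("State", "Average Age of Mother (years)", "Average Birth Weight (g)")

# Same approach as the cleaning step: lowercase, then drop the spaces
simple <- gsub(" ", "", tolower(raw_names))

# Alternative (an assumption, not part of this project): replace each run of
# non-alphanumeric characters with "_" and trim any trailing "_"
snake <- gsub("_+$", "", gsub("[^a-z0-9]+", "_", tolower(raw_names)))

print(simple)
print(snake)
```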

head(us_births)

# A tibble: 6 × 9
  state stateabbreviation  year gender educationlevelofmother educationlevelcode
  <chr> <chr>             <dbl> <chr>  <chr>                               <dbl>
1 Alab… AL                 2016 F      8th grade or less                       1
2 Alab… AL                 2016 F      9th through 12th grad…                  2
3 Alab… AL                 2016 F      High school graduate …                  3
4 Alab… AL                 2016 F      Some college credit, …                  4
5 Alab… AL                 2016 F      Associate degree (AA,…                  5
6 Alab… AL                 2016 F      Bachelor's degree (BA…                  6
# ℹ 3 more variables: numberofbirths <dbl>, `averageageofmother(years)` <dbl>,
#   `averagebirthweight(g)` <dbl>

#In this chunk I am using the function head() to view the first 6 rows of the dataset.

dc_births <- us_births |>
  filter(state == "District of Columbia")
dc_births

# A tibble: 108 × 9
   state                stateabbreviation  year gender educationlevelofmother   
   <chr>                <chr>             <dbl> <chr>  <chr>                    
 1 District of Columbia DC                 2016 F      8th grade or less        
 2 District of Columbia DC                 2016 F      9th through 12th grade w…
 3 District of Columbia DC                 2016 F      High school graduate or …
 4 District of Columbia DC                 2016 F      Some college credit, but…
 5 District of Columbia DC                 2016 F      Associate degree (AA, AS)
 6 District of Columbia DC                 2016 F      Bachelor's degree (BA, A…
 7 District of Columbia DC                 2016 F      Master's degree (MA, MS,…
 8 District of Columbia DC                 2016 F      Doctorate (PhD, EdD) or …
 9 District of Columbia DC                 2016 F      Unknown or Not Stated    
10 District of Columbia DC                 2016 M      8th grade or less        
# ℹ 98 more rows
# ℹ 4 more variables: educationlevelcode <dbl>, numberofbirths <dbl>,
#   `averageageofmother(years)` <dbl>, `averagebirthweight(g)` <dbl>

#In this chunk I am filtering the dataset to include only data from the District of Columbia.
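
#A quick sanity check on the 108 rows shown above, sketched under the assumption that every combination of year, gender, and education category appears exactly once for DC. The counts come from the printed output; the commented-out line is one possible way to run the check against the real data.

```r
# Counts read off the output above: 6 years, 2 genders, 9 education categories
years <- 2016:2021
genders <- c("F", "M")
n_education_categories <- 9

expected_rows <- length(years) * length(genders) * n_education_categories
print(expected_rows)

# Against the real data (assumes dc_births exists in the session):
# stopifnot(nrow(dc_births) == expected_rows)
```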

Creating Visualizations

#Plot 1:
p1 <- dc_births |>
  ggplot() +
  geom_bar(mapping = aes(x = year, y = numberofbirths, fill = educationlevelofmother), stat = "identity") +
  xlab("Year") +
  ylab("Number of Births") +
  ggtitle("DC Babies Born 2016-2021, Organized by Mother's Ed Level")
p1

#In this chunk I am creating a bar graph of the number of births per year, organized by the education level of the mother. The data and the legend, however, do not reflect the education levels in the correct order.

#The title also is very abbreviated, due to limited space.

#I therefore re-ordered the factors here, to reflect increasing levels of education.
dc_births$educationlevelofmother <- factor(
  dc_births$educationlevelofmother,
  levels = c(
    "8th grade or less",
    "9th through 12th grade with no diploma",
    "High school graduate or GED completed",
    "Some college credit, but not a degree",
    "Associate degree (AA, AS)",
    "Bachelor's degree (BA, AB, BS)",
    "Master's degree (MA, MS, MEng, MEd, MSW, MBA)",
    "Doctorate (PhD, EdD) or Professional Degree (MD, DDS, DVM, LLB, JD)",
    "Unknown or Not Stated"
  )
)
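
#A small illustration of why the first legend was out of order: by default factor() sorts its levels alphabetically, so the order has to be given explicitly. The three labels are taken from the dataset, but the vector itself is a toy example.

```r
ed <- c(
  "High school graduate or GED completed",
  "8th grade or less",
  "Associate degree (AA, AS)"
)

# Default: levels are sorted alphabetically, so "Associate degree" lands
# ahead of "High school graduate"
default_levels <- levels(factor(ed))

# Explicit levels: the order follows increasing education instead
ordered_levels <- levels(factor(ed, levels = c(
  "8th grade or less",
  "High school graduate or GED completed",
  "Associate degree (AA, AS)"
)))

print(default_levels)
print(ordered_levels)
```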

#Plot 2 (with the factors re-ordered):
p2 <- dc_births |>
  ggplot() +
  geom_bar(mapping = aes(x = year, y = numberofbirths, fill = educationlevelofmother), stat = "identity") +
  xlab("Year") +
  ylab("Number of Births") +
  ggtitle("DC Babies Born 2016-2021, Organized by Mother's Ed Level")
p2

#Given the previous two visualizations, I decided to filter the data into distinct subsets (each stored as its own tibble) to obtain a better sense of the data's interrelationships.

births_dc_mothers_with_college_degrees <- filter(
  dc_births,
  educationlevelofmother %in% c(
    "Associate degree (AA, AS)",
    "Bachelor's degree (BA, AB, BS)",
    "Master's degree (MA, MS, MEng, MEd, MSW, MBA)",
    "Doctorate (PhD, EdD) or Professional Degree (MD, DDS, DVM, LLB, JD)"
  )
)
births_dc_mothers_with_college_degrees

# A tibble: 48 × 9
   state                stateabbreviation  year gender educationlevelofmother   
   <chr>                <chr>             <dbl> <chr>  <fct>                    
 1 District of Columbia DC                 2016 F      Associate degree (AA, AS)
 2 District of Columbia DC                 2016 F      Bachelor's degree (BA, A…
 3 District of Columbia DC                 2016 F      Master's degree (MA, MS,…
 4 District of Columbia DC                 2016 F      Doctorate (PhD, EdD) or …
 5 District of Columbia DC                 2016 M      Associate degree (AA, AS)
 6 District of Columbia DC                 2016 M      Bachelor's degree (BA, A…
 7 District of Columbia DC                 2016 M      Master's degree (MA, MS,…
 8 District of Columbia DC                 2016 M      Doctorate (PhD, EdD) or …
 9 District of Columbia DC                 2017 F      Associate degree (AA, AS)
10 District of Columbia DC                 2017 F      Bachelor's degree (BA, A…
# ℹ 38 more rows
# ℹ 4 more variables: educationlevelcode <dbl>, numberofbirths <dbl>,
#   `averageageofmother(years)` <dbl>, `averagebirthweight(g)` <dbl>

births_dc_mothers_without_college_degrees <- filter(
  dc_births,
  educationlevelofmother %in% c(
    "8th grade or less",
    "9th through 12th grade with no diploma",
    "High school graduate or GED completed",
    "Some college credit, but not a degree"
  )
)
births_dc_mothers_without_college_degrees

# A tibble: 48 × 9
   state                stateabbreviation  year gender educationlevelofmother   
   <chr>                <chr>             <dbl> <chr>  <fct>                    
 1 District of Columbia DC                 2016 F      8th grade or less        
 2 District of Columbia DC                 2016 F      9th through 12th grade w…
 3 District of Columbia DC                 2016 F      High school graduate or …
 4 District of Columbia DC                 2016 F      Some college credit, but…
 5 District of Columbia DC                 2016 M      8th grade or less        
 6 District of Columbia DC                 2016 M      9th through 12th grade w…
 7 District of Columbia DC                 2016 M      High school graduate or …
 8 District of Columbia DC                 2016 M      Some college credit, but…
 9 District of Columbia DC                 2017 F      8th grade or less        
10 District of Columbia DC                 2017 F      9th through 12th grade w…
# ℹ 38 more rows
# ℹ 4 more variables: educationlevelcode <dbl>, numberofbirths <dbl>,
#   `averageageofmother(years)` <dbl>, `averagebirthweight(g)` <dbl>

births_dc_mothers_with_ed_unkown_or_not_stated <- dc_births |>
filter(educationlevelofmother == "Unknown or Not Stated")
births_dc_mothers_with_ed_unkown_or_not_stated

# A tibble: 12 × 9
   state                stateabbreviation  year gender educationlevelofmother
   <chr>                <chr>             <dbl> <chr>  <fct>                 
 1 District of Columbia DC                 2016 F      Unknown or Not Stated 
 2 District of Columbia DC                 2016 M      Unknown or Not Stated 
 3 District of Columbia DC                 2017 F      Unknown or Not Stated 
 4 District of Columbia DC                 2017 M      Unknown or Not Stated 
 5 District of Columbia DC                 2018 F      Unknown or Not Stated 
 6 District of Columbia DC                 2018 M      Unknown or Not Stated 
 7 District of Columbia DC                 2019 F      Unknown or Not Stated 
 8 District of Columbia DC                 2019 M      Unknown or Not Stated 
 9 District of Columbia DC                 2020 F      Unknown or Not Stated 
10 District of Columbia DC                 2020 M      Unknown or Not Stated 
11 District of Columbia DC                 2021 F      Unknown or Not Stated 
12 District of Columbia DC                 2021 M      Unknown or Not Stated 
# ℹ 4 more variables: educationlevelcode <dbl>, numberofbirths <dbl>,
#   `averageageofmother(years)` <dbl>, `averagebirthweight(g)` <dbl>
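
#One way to put a number on a trend in any of these subsets is to total the births per year and compare the first and last years. The sketch below uses a made-up data frame with the same column names; the values are illustrative only and are NOT taken from the CDC data.

```r
library(dplyr)

# Made-up numbers for illustration only (not values from the dataset)
toy_subset <- data.frame(
  year = c(2016, 2016, 2021, 2021),
  gender = c("F", "M", "F", "M"),
  numberofbirths = c(100, 110, 80, 90)
)

# Total births per year across genders (and across education levels,
# if the subset contains more than one)
yearly_totals <- toy_subset |>
  group_by(year) |>
  summarise(total_births = sum(numberofbirths), .groups = "drop")

# Percent change from the first year to the last year in the table
first_total <- yearly_totals$total_births[1]
last_total <- yearly_totals$total_births[nrow(yearly_totals)]
pct_change <- 100 * (last_total - first_total) / first_total

print(yearly_totals)
print(pct_change)
```

#The same pipeline could be applied to each of the three subsets above by swapping toy_subset for the subset's name.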

#Here are the bar graphs for each of the 3 subsets I created in the previous chunk. I decided against filling by education level here, since each bar graph already represents a range of education levels. I chose to fill by gender instead (on all 3 bar graphs, for the sake of added context).

p3 <- births_dc_mothers_with_college_degrees |>
  ggplot() +
  geom_bar(mapping = aes(x = year, y = numberofbirths, fill = gender), stat = "identity") +
  xlab("Year") +
  ylab("Number of Births") +
  coord_cartesian(ylim = c(0, 5000)) +
  ggtitle("DC Births, Mothers w/ College Degrees, 2016-2021") +
  labs(fill = "Gender of Child") +
   theme(axis.text = element_text(size = 5),
        plot.title = element_text(size = 10),
        plot.margin = margin(t = 3, r = 3, b = 1, l = 3)) +
  theme(axis.text.x = element_text(angle = 255))
  

p4 <- births_dc_mothers_without_college_degrees |>
  ggplot() +
  geom_bar(mapping = aes(x = year, y = numberofbirths, fill = gender), stat = "identity") +
  xlab("Year") +
  ylab("Number of Births") +
  coord_cartesian(ylim = c(0, 5000)) +
  ggtitle("DC Births, Mothers w/o College Degrees, 2016-2021") +
  labs(fill = "Gender of Child") +
   theme(axis.text = element_text(size = 5),
        plot.title = element_text(size = 10),
        plot.margin = margin(t = 3, r = 3, b = 1, l = 3)) +
  theme(axis.text.x = element_text(angle = 255))

p5 <-births_dc_mothers_with_ed_unkown_or_not_stated |>
  ggplot() +
  geom_bar(mapping = aes(x = year, y = numberofbirths, fill = gender), stat = "identity") +
  xlab("Year") +
  ylab("Number of Births") +
  coord_cartesian(ylim = c(0, 5000)) +
  ggtitle("DC Births, Unknown Ed of Mothers, 2016-2021") +
  labs(fill = "Gender of Child") +
   theme(axis.text = element_text(size = 5),
        plot.title = element_text(size = 10),
        plot.margin = margin(t = 3, r = 3, b = 1, l = 3)) +
  theme(axis.text.x = element_text(angle = 255))

#Plot 3:

p3 + p4 + p5

#Again, the titles of the plots are abbreviated, due to limited space.
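
#A possible way to free up room for titles, sketched with made-up data: patchwork's plot_annotation() can carry one shared title, and plot_layout(guides = "collect") merges the three identical legends into one. The helper make_panel() and the toy data frame are hypothetical names introduced only for this illustration, and geom_col() is used as the equivalent of geom_bar(stat = "identity").

```r
library(ggplot2)
library(patchwork)

# Made-up data for illustration only (not values from the CDC dataset)
toy <- data.frame(
  year = rep(c(2016, 2017), each = 2),
  gender = rep(c("F", "M"), times = 2),
  numberofbirths = c(10, 12, 9, 11)
)

# Hypothetical helper: one stacked bar chart per subset, with a short subtitle
make_panel <- function(data, subtitle) {
  ggplot(data) +
    geom_col(aes(x = year, y = numberofbirths, fill = gender)) +
    labs(x = "Year", y = "Number of Births",
         fill = "Gender of Child", subtitle = subtitle)
}

combined <- make_panel(toy, "With college degrees") +
  make_panel(toy, "Without college degrees") +
  make_panel(toy, "Education unknown") +
  plot_layout(ncol = 3, guides = "collect") +
  plot_annotation(title = "DC Births by Mother's Education Level, 2016-2021")

combined
```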

Essay, Part B: What Does the Visualization Represent? (Interesting Patterns or Surprises)

I entered this project aware of the widespread inequality of access to material resources in the US, particularly as this inequality manifests in urban centers. Having previously combed through data on health disparities, I have become more attuned to the interfaces between inequality and social dynamics. When I noticed, therefore, that the number of births in this dataset was relatively constant among women in DC with college degrees between 2016 and 2021 (and that over the same period births among women in DC without college degrees showed a marked downward trend), I became curious about why this could be. Perhaps economic pressures during (and just ahead of) the pandemic made women of lower income classes less willing to have children (and perhaps the relative economic stability of women with college degrees led to very little change in their birth rates over the same period). It is also possible that race, geography, and economic status in DC have intersected to affect infant mortality (it would be interesting to consider, for instance, whether this dataset counts only live births as “births,” or whether it also includes miscarriages). And concerning the category where mothers’ education level was either unknown or not stated, it is worth evaluating whether such results are statistically significant in this context.

Essay, Part C:

I wish I had wrestled a little longer with the parameters of the bar graphs, particularly in plot 3. More room for the titles and the graph axes would have made the visualizations easier to read.