1. Data Acquisition

Automate the download of 2 files:

  • “File Layout”: cc-est2020-alldata6.pdf dated 27-JUL-2021
  • “All States”: cc-est2020-alldata6.csv dated 27-JUL-2021
    • Note: This is a 135MB file, ~836K rows with 50 columns

This chunk of code does the following:

  • Checks if a folder called “datafiles” exists, if not, creates it
  • Checks if “cc-est2020-alldata6.pdf” exists, if not, downloads it from the given URL into the “datafiles” folder
  • Checks if “popSubset” is already loaded in the R session; if not, checks if an rdata file called “popSubset.rdata” exists
    • if yes, loads it into R
    • if not, downloads the larger data set from the given URL into the “datafiles” folder (unless the CSV is already there) and loads it into R

From there, a smaller data set called “popSubset” is created by removing incomplete rows and keeping only the 7 necessary variables (STNAME through TOT_FEMALE). It is then saved as “popSubset.rdata” so later runs can skip the large CSV.

library(readr)  # read_csv()
library(dplyr)  # %>%, select(), filter(), group_by(), summarise()

if(!dir.exists("datafiles")) {
  dir.create("datafiles")
}
if (!file.exists("./datafiles/cc-est2020-alldata6.pdf")) {
  url <- "https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2020/cc-est2020-alldata6.pdf"
  download.file(url,"./datafiles/cc-est2020-alldata6.pdf", mode = "wb")
}
if (!exists("popSubset")) {
  if (file.exists("./datafiles/popSubset.rdata")) {
    load("./datafiles/popSubset.rdata")
  } else {
    if (!file.exists("./datafiles/cc-est2020-alldata6.csv")) {
      url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/asrh/CC-EST2020-ALLDATA6.csv"
      download.file(url,"./datafiles/cc-est2020-alldata6.csv", mode = "wb")
    } 
    popAllData <- read_csv("./datafiles/cc-est2020-alldata6.csv")
    # Drop rows with any missing value, then keep the 7 columns STNAME..TOT_FEMALE
    popSubset <- popAllData[complete.cases(popAllData),] %>%
      select(STNAME:TOT_FEMALE)
    # Cache the subset so later runs can skip the large CSV
    save(popSubset, file="./datafiles/popSubset.rdata")
  }
}
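
The "check, then download" pattern appears twice above, so it could be factored into a small helper. The following is only a sketch: `fetch_if_missing` is a hypothetical name that is not part of the original code, and it assumes that binary mode is acceptable for both the PDF and the CSV. The demonstration uses a placeholder URL and a file that already exists, so nothing is downloaded.

```r
# Hypothetical helper (not in the original code): download `url` to `dest`
# only when `dest` is not already on disk. Returns `dest` invisibly.
fetch_if_missing <- function(url, dest) {
  dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Demonstration without network access: the file already exists,
# so download.file() is never called.
demo_file <- file.path(tempdir(), "datafiles", "demo.txt")
dir.create(dirname(demo_file), showWarnings = FALSE, recursive = TRUE)
writeLines("already here", demo_file)
result <- fetch_if_missing("https://example.invalid/demo.txt", demo_file)
```

With a helper like this, the PDF and CSV steps would each reduce to a single call with the matching URL and destination path.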

Resulting dimensions:

  • popSubset: 836000, 7

2. Data Split

The AGEGRP key includes one code (0) whose rows are the sum of all the other age groups. Split the data into two pieces, keeping all the years.

  • set1: rows that represent the total for all age groups, for all years
  • set2: rows that represent all age groups, but not the total, for all years

To create the two new data sets, I filtered by age group. set1 includes the rows in which AGEGRP equals 0, because the accompanying PDF shows that 0 indicates the total of all age groups. set2 includes the rows in which AGEGRP is greater than 0, which covers all the other age groups.

set1 <- popSubset %>%
  filter(AGEGRP == 0)
set2 <- popSubset %>%
  filter(AGEGRP > 0)

Resulting dimensions:

  • set1: 44000, 7
  • set2: 792000, 7
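
A quick way to confirm that the split loses no rows is to check that the two pieces add back up to the original, as the dimensions above do. The sketch below uses a tiny made-up data frame (`toy`, not the census data) in place of popSubset, and assumes AGEGRP contains no missing or negative values.

```r
library(dplyr)

# Toy stand-in for popSubset (made-up numbers): one county, one year,
# where AGEGRP 0 is the total of AGEGRP 1 and 2.
toy <- data.frame(
  STNAME  = "A",
  CTYNAME = "X County",
  YEAR    = 3,
  AGEGRP  = c(0, 1, 2),
  TOT_POP = c(30, 10, 20)
)

toy_set1 <- toy %>% filter(AGEGRP == 0)
toy_set2 <- toy %>% filter(AGEGRP > 0)

# The two filters are mutually exclusive and together cover every row.
rows_match <- nrow(toy_set1) + nrow(toy_set2) == nrow(toy)
```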

3. Aggregate across Age Groups

For set2, aggregate across all the age groups by summing them.

To aggregate across age groups, I grouped the rows by state name, county name, and year, then summed the population variables.

set2.agesSum <- set2 %>%
  group_by(STNAME,CTYNAME,YEAR) %>%
  summarise(TOT_POP = sum(TOT_POP), TOT_MALE = sum(TOT_MALE), TOT_FEMALE = sum(TOT_FEMALE))

Resulting dimensions:

  • set2.agesSum: 44000, 6

This new data set is similar to, but not identical to, set1. Rather than keeping the AGEGRP 0 total rows, it sums the populations of the remaining rows (AGEGRP 1-18) within each STNAME, CTYNAME, and YEAR. The result has only 6 variables, instead of the 7 of set1, because the AGEGRP column is collapsed by the summary and no longer appears.
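
One way to test the claim that the summed age groups reproduce the AGEGRP 0 totals is to compare the two results directly. The sketch below does this on a made-up data frame (`toy`), not on the census data; on the real data the same comparison would use set1 and set2.agesSum, and it assumes the published totals equal the sum of the age groups exactly.

```r
library(dplyr)

# Made-up numbers for two counties in one year; AGEGRP 0 holds the total.
toy <- data.frame(
  STNAME  = "A",
  CTYNAME = rep(c("X County", "Y County"), each = 3),
  YEAR    = 3,
  AGEGRP  = rep(c(0, 1, 2), times = 2),
  TOT_POP = c(30, 10, 20, 70, 30, 40)
)

# Totals as published (AGEGRP 0), sorted for comparison.
published <- toy %>%
  filter(AGEGRP == 0) %>%
  select(STNAME, CTYNAME, YEAR, TOT_POP) %>%
  arrange(STNAME, CTYNAME, YEAR)

# Totals recomputed by summing the individual age groups.
recomputed <- toy %>%
  filter(AGEGRP > 0) %>%
  group_by(STNAME, CTYNAME, YEAR) %>%
  summarise(TOT_POP = sum(TOT_POP), .groups = "drop") %>%
  arrange(STNAME, CTYNAME, YEAR)

totals_agree <- isTRUE(all.equal(published$TOT_POP, recomputed$TOT_POP))
```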

4. Aggregate across Years

Further reduce the data from Q3 by using only the average population across the yearly estimates from 2010-2019.

I used the PDF accompanying the data set to look up the key for the YEAR variable. The ten July estimates for 2010-2019 correspond to YEAR codes 3 through 12, so I filtered the YEAR column to keep only those values.
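
The mapping described above can be written down as a small lookup table. This sketch encodes only what the text states, namely that codes 3 through 12 are the July estimates for 2010 through 2019; `year_key` is a hypothetical helper table, and the remaining YEAR codes are deliberately left out because they are not described here.

```r
# Lookup for the July estimates only, as described in the text:
# YEAR code 3 -> 2010, ..., YEAR code 12 -> 2019.
year_key <- data.frame(
  YEAR = 3:12,
  calendar_year = 2010:2019
)

# Equivalent arithmetic for these ten codes: calendar year = code + 2007.
offset_ok <- all(year_key$YEAR + 2007 == year_key$calendar_year)
```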

set2.agesSumMean <- set2.agesSum %>%
  filter(YEAR %in% (3:12)) %>%
  group_by(STNAME,CTYNAME) %>%
  summarise(TOT_POP = mean(TOT_POP), TOT_MALE = mean(TOT_MALE), TOT_FEMALE = mean(TOT_FEMALE))

Resulting dimensions:

  • set2.agesSumMean: 3143, 5

5. Aggregate across Counties

Continuing with the data from Q4, find State Populations by aggregating with a sum of the population of each County.

I followed similar steps to Q3, but grouped by only state name.

set2.state <- set2.agesSumMean %>%
  group_by(STNAME) %>%
  summarise(TOT_POP = sum(TOT_POP), TOT_MALE = sum(TOT_MALE), TOT_FEMALE = sum(TOT_FEMALE))

Resulting dimensions:

  • set2.state: 51, 4
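
The steps so far average each county over the years and then sum the counties within a state. As a sanity check on that ordering, the sketch below shows on made-up numbers (`toy`, not the census data) that summing counties per year first and averaging afterwards gives the same state figure. This assumes every county has an estimate for every year; with missing county-years the two orders could differ.

```r
library(dplyr)

# Made-up populations: two counties in one state, two years each.
toy <- data.frame(
  STNAME  = "A",
  CTYNAME = rep(c("X County", "Y County"), each = 2),
  YEAR    = rep(c(3, 4), times = 2),
  TOT_POP = c(100, 110, 200, 220)
)

# Order used in this report: average over years per county, then sum counties.
mean_then_sum <- toy %>%
  group_by(STNAME, CTYNAME) %>%
  summarise(TOT_POP = mean(TOT_POP), .groups = "drop") %>%
  group_by(STNAME) %>%
  summarise(TOT_POP = sum(TOT_POP))

# Alternative order: sum counties per year, then average over years.
sum_then_mean <- toy %>%
  group_by(STNAME, YEAR) %>%
  summarise(TOT_POP = sum(TOT_POP), .groups = "drop") %>%
  group_by(STNAME) %>%
  summarise(TOT_POP = mean(TOT_POP))
```

Both orders give 315 for the toy state: (105 + 210) in the first case and the mean of 300 and 330 in the second.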

6. Aggregate across States

Now add up the total population attribute from the data in Q5. You should get a single number, which is the population of the entire country.

For this final step, I summarized the total population, resulting in a single number. I turned this into an integer and used the ‘prettyNum’ function to format it.

USPopulation <- set2.state %>%
  summarise(TOT_POP = sum(TOT_POP))
USPopulation <- as.integer(USPopulation$TOT_POP)
USPopulation <- prettyNum(USPopulation, big.mark = ",")
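
The formatting step above can be tried in isolation. The sketch below uses the value reported in this document as a plain number; note that `as.integer()` truncates toward zero rather than rounding, so wrapping the value in `round()` first is an option if rounding is preferred.

```r
# Illustrative value: the result reported in this document.
x <- 319333559

# Convert to integer, then insert thousands separators.
formatted <- prettyNum(as.integer(x), big.mark = ",")
# formatted is the character string "319,333,559"

# as.integer() truncates: 2.9 becomes 2, whereas round(2.9) gives 3.
truncated <- as.integer(2.9)
```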

I wanted to play around with inline CSS, so the final population below is presented in a blue box.

The average US population from 2010-2019 was 319,333,559.
This is in line with the data found at https://www.multpl.com/united-states-population/table/by-year.