Automate the download of 2 files
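This chunk of code creates a "datafiles" directory if one does not exist, then downloads the file-layout PDF and the CSV data set unless local copies are already present.
From there, a smaller data set called “popSubset” is created by removing incomplete rows and dropping all columns except the 7 necessary variables (STNAME through TOT_FEMALE). The subset is saved to disk so later runs can load it directly.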
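library(readr)  # read_csv()
library(dplyr)  # %>%, select(), filter(), group_by(), summarise()

# Create the download directory on the first run
if (!dir.exists("datafiles")) {
  dir.create("datafiles")
}
# Download the file-layout PDF (binary mode) unless a local copy exists
if (!file.exists("./datafiles/cc-est2020-alldata6.pdf")) {
  url <- "https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2020/cc-est2020-alldata6.pdf"
  download.file(url, "./datafiles/cc-est2020-alldata6.pdf", mode = "wb")
}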
# Rebuild popSubset only when it is neither in memory nor cached on disk
if (!exists("popSubset")) {
  if (file.exists("./datafiles/popSubset.rdata")) {
    load("./datafiles/popSubset.rdata")
  } else {
    if (!file.exists("./datafiles/cc-est2020-alldata6.csv")) {
      url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/asrh/CC-EST2020-ALLDATA6.csv"
      download.file(url, "./datafiles/cc-est2020-alldata6.csv")
    }
    popAllData <- read_csv("./datafiles/cc-est2020-alldata6.csv")
    # Drop rows with any missing value, then keep the 7 needed columns
    popSubset <- popAllData[complete.cases(popAllData), ] %>%
      select(STNAME:TOT_FEMALE)
    # Cache the subset so later runs can skip the download and parsing
    save(popSubset, file = "./datafiles/popSubset.rdata")
  }
}
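The download-if-missing pattern appears twice in the chunk above, so it could be wrapped in a small function. The sketch below is only an illustration: the name fetch_if_missing and its arguments are hypothetical, and using mode = "wb" for every file is an assumption. Because the destination path is passed once, the path that is checked and the path that is written cannot drift apart.

```r
# Hypothetical helper (not part of the original chunk): download `url` to
# `dest` only when `dest` does not already exist.
fetch_if_missing <- function(url, dest, mode = "wb") {
  dest_dir <- dirname(dest)
  if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
  }
  if (!file.exists(dest)) {
    download.file(url, dest, mode = mode)
  }
  # Return the path invisibly so the call can be chained or assigned
  invisible(dest)
}

# Example call, left commented out so the sketch runs without network access:
# fetch_if_missing(
#   "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/asrh/CC-EST2020-ALLDATA6.csv",
#   "./datafiles/cc-est2020-alldata6.csv"
# )
```

The same call with the PDF URL and path would cover the file-layout download.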
The key AGEGRP has an entry which is the sum of all the other age groups. Split the data into two pieces, keeping all the years.
To create the two new data sets, I filtered by age group. According to the accompanying PDF, AGEGRP 0 is the total of all age groups, so set1 keeps the rows in which AGEGRP equals 0. set2 keeps the rows in which AGEGRP is greater than 0, which are the individual age groups.
set1 <- popSubset %>%
  filter(AGEGRP == 0)
set2 <- popSubset %>%
  filter(AGEGRP > 0)
For set2, aggregate across all the age groups by summing them
To aggregate across age groups, I grouped the rows by state name, county name, and year, then summed the three population variables within each group.
set2.agesSum <- set2 %>%
  group_by(STNAME, CTYNAME, YEAR) %>%
  summarise(
    TOT_POP = sum(TOT_POP),
    TOT_MALE = sum(TOT_MALE),
    TOT_FEMALE = sum(TOT_FEMALE)
  )
This new data set is similar to, but not identical to, set1. Rather than keeping only the AGEGRP 0 rows, it combines the remaining rows (AGEGRP 1-18) and sums their populations. The result has only 6 variables instead of the 7 in set1, because AGEGRP is no longer a column once the age groups have been summed within each combination of STNAME, CTYNAME, and YEAR.
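One way to see how the summed age groups relate to the AGEGRP 0 rows is to join the two versions on their keys and compare the totals. The sketch below uses a made-up toy data frame rather than the census file, so the names toy, toy_total, toy_summed, and check are illustrative only, and whether the real data match exactly is not something this example shows.

```r
library(dplyr)

# Made-up toy data: two counties, one YEAR code, a total row (AGEGRP 0)
# plus two age-group rows each. The numbers are invented for illustration.
toy <- data.frame(
  STNAME     = "State A",
  CTYNAME    = rep(c("X County", "Y County"), each = 3),
  YEAR       = 3L,
  AGEGRP     = rep(0:2, times = 2),
  TOT_POP    = c(30, 10, 20, 70, 30, 40),
  TOT_MALE   = c(14, 5, 9, 33, 15, 18),
  TOT_FEMALE = c(16, 5, 11, 37, 15, 22),
  stringsAsFactors = FALSE
)

# Counterpart of set1: the precomputed totals, without the AGEGRP column
toy_total <- toy %>%
  filter(AGEGRP == 0) %>%
  select(-AGEGRP)

# Counterpart of set2.agesSum: the individual age groups summed per key
toy_summed <- toy %>%
  filter(AGEGRP > 0) %>%
  group_by(STNAME, CTYNAME, YEAR) %>%
  summarise(across(TOT_POP:TOT_FEMALE, sum), .groups = "drop")

# Line the two versions up by their keys and compare the totals
check <- inner_join(
  toy_total, toy_summed,
  by = c("STNAME", "CTYNAME", "YEAR"),
  suffix = c(".total", ".summed")
)
totals_match <- all(check$TOT_POP.total == check$TOT_POP.summed)
```

Applied to set1 and set2.agesSum, the same join would flag any county and year where the two totals disagree.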
Further reduce the data from Q3 by using only the average population across the yearly estimates from 2010-2019.
I used the PDF accompanying the data set to look up the key for the YEAR variable. To use only the ten July estimates for 2010-2019, I needed to filter the YEAR column to the values 3 through 12.
set2.agesSumMean <- set2.agesSum %>%
  filter(YEAR %in% 3:12) %>%
  group_by(STNAME, CTYNAME) %>%
  summarise(
    TOT_POP = mean(TOT_POP),
    TOT_MALE = mean(TOT_MALE),
    TOT_FEMALE = mean(TOT_FEMALE)
  )
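The filter above depends on reading the YEAR key correctly, so it can help to write the mapping down as a small lookup table. This is a sketch based only on the statement that codes 3 through 12 are the ten July estimates for 2010-2019; the names year_key and code_to_year are hypothetical, and codes outside that range are deliberately left unmapped.

```r
# Lookup implied by the filter: YEAR codes 3 through 12 are the ten
# July estimates for 2010 through 2019, in order.
year_key <- data.frame(
  YEAR          = 3:12,
  estimate_year = 2010:2019
)

# Hypothetical helper: translate a YEAR code into its calendar year.
# Codes that are not in the lookup come back as NA.
code_to_year <- function(code) {
  year_key$estimate_year[match(code, year_key$YEAR)]
}
```

Joining year_key onto the data before averaging would make the selected years visible in the output instead of leaving them as codes.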
Continuing with the data from Q4, find State Populations by aggregating with a sum of the population of each County.
I followed similar steps to Q3, but grouped by only state name.
set2.state <- set2.agesSumMean %>%
  group_by(STNAME) %>%
  summarise(
    TOT_POP = sum(TOT_POP),
    TOT_MALE = sum(TOT_MALE),
    TOT_FEMALE = sum(TOT_FEMALE)
  )
Now add up the total population attribute from the data in Q5; you should get a single number, which is the population of the entire country.
For this final step, I summarized the total population, resulting in a single number. I turned this into an integer and used the ‘prettyNum’ function to format it.
USPopulation <- set2.state %>%
  summarise(TOT_POP = sum(TOT_POP))
USPopulation <- as.integer(USPopulation$TOT_POP)
USPopulation <- prettyNum(USPopulation, big.mark = ",")
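Because the total is built from averages, it is generally not a whole number, and as.integer() truncates rather than rounds. The sketch below shows the difference on a made-up value (avg_pop is not the real result) and one alternative way to format it with base R's format(); it is meant as an option to consider, not a correction of the approach above.

```r
# Made-up value standing in for an averaged population total
avg_pop <- 1234567.89

truncated <- as.integer(avg_pop)  # drops the fractional part: 1234567
rounded   <- round(avg_pop)       # nearest whole number: 1234568

# scientific = FALSE keeps large values out of exponent notation
formatted <- format(rounded, big.mark = ",", scientific = FALSE)
# formatted is "1,234,568"
```

For a figure in the hundreds of millions the two results differ by at most one person, so the choice mainly matters for consistency.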
I wanted to play around with inline CSS, so the final population below is presented in a blue box.
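A box like this can be produced by wrapping the formatted number in a div with a style attribute. The sketch below is one possible way to build that HTML string in R; the colors, padding, and the names pop_label and box_html are assumptions rather than the styling actually used, and it assumes the document is knitted to HTML so that raw HTML is passed through.

```r
# Made-up placeholder for the formatted population string
pop_label <- "1,234,568"

# Build a div with inline CSS; the style values here are illustrative
box_html <- sprintf(
  '<div style="background-color: #d9edf7; border: 1px solid #31708f; padding: 10px;">%s</div>',
  pop_label
)
```

In an R Markdown document, the string could then be emitted with an inline R expression or from a chunk that uses results='asis' together with cat(box_html).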