\[\\[.1in]\]
Automate the download of two files. Using an if/else process (provided in prompt) download and save census data and reference key. Check if the object is not already loaded.
if (!file.exists("./datafiles/cc-est2019-alldata6.pdf")) {
url <- "https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2020/cc-est2020-alldata6.pdf"
download.file(url,"./datafiles/cc-est2020-alldata6.pdf", mode = "wb")
}
# Check if the object is not already loaded
if (!exists("popSubset5")){
if (file.exists("./datafiles/popSubset.rdata")){
load("./datafiles/popSubset.rdata")
} else {
if (!dir.exists("./datafiles")){dir.create("datafiles")}
if (!file.exists("./datafiles/cc-est2020-alldata6.csv")){
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/asrh/CC-EST2020-ALLDATA6.csv"
download.file(url,"./datafiles/cc-est2020-alldata6.csv")
}
popAllData <- readr::read_csv ("./datafiles/cc-est2020-alldata6.csv")
sum(is.na(popAllData)) #identify NA, found 1634
popAll_NoNA<- popAllData[complete.cases(popAllData), ] #remove rows with NA
dim(popAll_NoNA) #Check dimensions
# keep only the attributes
#"STNAME"= 4 ,"CTYNAME" = 5 ,"YEAR"= 6,"AGEGRP"=7,"TOT_POP"=8,"TOT_MALE" = 9,"TOT_FEMALE" = 10
# Subset only columns 4-10, keeping all rows
popSubset5 <- popAll_NoNA[ ,4:10]
save(popSubset5, file="./datafiles/popSubset.rdata")
}
}
The original dataset features 836038 rows and 50 columns popSubset5 features 836000 rows and 7 columns \[\\[.1in]\]
The key AGEGRP has an entry which is the sum of all the other age groups. Split the data into two pieces, keeping all the years. Set 1 should feature all rows that represent the total for all age groups (indicated by ‘0’ in the key). Set 2 should feature all age age groups (indicated by age groups greater than ‘0’) for all years.
set1 <- subset(popSubset5, AGEGRP == 0)
dim(set1)
## [1] 44000 7
set11 <- popSubset5 %>% filter(AGEGRP == 0)
dim (set11)
## [1] 44000 7
Set 1 features 44000 rows and 7 columns. \[\\[.1in]\]
set2 <- subset(popSubset5, AGEGRP != 0)
dim(set2)
## [1] 792000 7
set22 <- popSubset5 %>% filter(AGEGRP != 0)
dim (set22)
## [1] 792000 7
Set 2 features 792000 rows and 7 columns.
Question?
filter v subset - Both approaches seem to work. Is there a benefit to one approach over the other?
\[\\[.1in]\]
For set2, using group_by() and summarise(), aggregate across all the age groups by summing them.
set22_agesum <- set22 %>%
group_by (STNAME,CTYNAME,YEAR)%>%
summarize (Sum_Male = sum(TOT_MALE),
Sum_Female = sum(TOT_FEMALE),
Sum_Pop = sum(TOT_POP)
)
## `summarise()` has grouped output by 'STNAME', 'CTYNAME'. You can override using
## the `.groups` argument.
print(set22_agesum)
## # A tibble: 44,000 x 6
## # Groups: STNAME, CTYNAME [3,143]
## STNAME CTYNAME YEAR Sum_Male Sum_Female Sum_Pop
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama Autauga County 1 26569 28002 54571
## 2 Alabama Autauga County 2 26576 28006 54582
## 3 Alabama Autauga County 3 26667 28094 54761
## 4 Alabama Autauga County 4 26980 28249 55229
## 5 Alabama Autauga County 5 26830 28140 54970
## 6 Alabama Autauga County 6 26588 28159 54747
## 7 Alabama Autauga County 7 26804 28118 54922
## 8 Alabama Autauga County 8 26752 28151 54903
## 9 Alabama Autauga County 9 26999 28303 55302
## 10 Alabama Autauga County 10 27041 28407 55448
## # ... with 43,990 more rows
dim(set22_agesum)
## [1] 44000 6
By aggregating across age groups we reduce the data to 44000 rows and 6 columns.
Question
Does this reduced data set have the same dimensions as set1? Comment more than just yes/no. E.g. should they?
Answer
set22_agessum has 44000 obs. of 6 variables compared to set11, which has 44000 observations of 7 variables. The difference between the two data frames is the 2nd one does not have AGEGRP as a column. My thought is that the two data frames were intended to be different.
\[\\[.1in]\]
Summary of the notes I used to break down the problem:
set2.agesSumMean <- set2.agesSum %>% enter code group set 2 by STNAME,CTYNAME summarize the TOT_POP, TOT_FEMALE, TOT_MALE, but, this time filter by year Further reduce the data from question 3 by using only the average population across the yearly estimates from 2010-2019. NOTE This will require dropping more records from the 2010 Census and base. e.g. use only the ten July estimates.
The key for YEAR is as follows:
Applied this more simple example:
guzzlers2.MpgGearMean<- guzzlers2 %>%
filter(gear %in% (4:5)) %>%
group_by(cyl,carb) %>%
summarise(MeanMPG = mean(mpg)
)
set22_ageSumMean <- set22 %>%
filter (YEAR %in% (3:12)) %>%
group_by (STNAME, CTYNAME) %>%
summarize (Mean_M = mean(TOT_MALE),
Mean_F = mean(TOT_FEMALE),
Mean_P = mean(TOT_POP)
)
## `summarise()` has grouped output by 'STNAME'. You can override using the
## `.groups` argument.
print(set22_ageSumMean)
## # A tibble: 3,143 x 5
## # Groups: STNAME [51]
## STNAME CTYNAME Mean_M Mean_F Mean_P
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Alabama Autauga County 1493. 1571. 3064.
## 2 Alabama Baldwin County 5464. 5756. 11219.
## 3 Alabama Barbour County 774. 683. 1458.
## 4 Alabama Bibb County 672. 581. 1253.
## 5 Alabama Blount County 1580. 1621. 3201.
## 6 Alabama Bullock County 316. 266. 582.
## 7 Alabama Butler County 525. 600. 1125.
## 8 Alabama Calhoun County 3100. 3343. 6442.
## 9 Alabama Chambers County 900. 980. 1881.
## 10 Alabama Cherokee County 716. 726. 1441.
## # ... with 3,133 more rows
dim(set22_ageSumMean)
## [1] 3143 5
Data aggregated across years features 3143 rows and 5 columns
Above worked When I used the original data frame = set22 but didn’t work when I try to use the set22_agesum. Or, maybe I did this incorrectly? What am I missing? \[\\[.1in]\]
Continuing with the data from question 4, find State populations by aggregating with a sum of the population of each County.
set22_state <- set22 %>%
group_by(STNAME)%>%
summarize(Sum_County = sum(TOT_POP),
Sum_Co_M = sum(TOT_MALE),
Sum_Co_F = sum(TOT_FEMALE)
)
print(set22_state)
## # A tibble: 51 x 4
## STNAME Sum_County Sum_Co_M Sum_Co_F
## <chr> <dbl> <dbl> <dbl>
## 1 Alabama 67878412 32881261 34997151
## 2 Alaska 10208130 5327668 4880462
## 3 Arizona 95686011 47552687 48133324
## 4 Arkansas 41645819 20450084 21195735
## 5 California 539302676 268016418 271286258
## 6 Colorado 75722244 38045457 37676787
## 7 Connecticut 50105551 24421231 25684320
## 8 Delaware 13141186 6358390 6782796
## 9 District of Columbia 9279031 4397695 4881336
## 10 Florida 282207193 137963792 144243401
## # ... with 41 more rows
dim(set22_state)
## [1] 51 4
Data aggregated across counties features 51 rows (for each of the 50 states and Washington D.C.) and 4 columns.
Now add up the total population attribute from data in Q5 and you should get single number which is the population of entire country. Is this true? (provide a reference), Does this make sense? Note: Terry says the total US population is 319,333,559 within the template.
US_Pop <- set11%>%
filter(YEAR == 13)%>%
summarize(AllUS = sum(TOT_POP))
print (US_Pop)
## # A tibble: 1 x 1
## AllUS
## <dbl>
## 1 329398742
US_Pop2 <- set22%>%
filter(YEAR == 13)%>%
summarize(AllUS = sum(TOT_POP))
print (US_Pop2)
## # A tibble: 1 x 1
## AllUS
## <dbl>
## 1 329398742
When I summarize, the answer I get is 329398742
A google search says the population was 331,108,434 as of April, 2020. The variance of 1709692 is .0051% variance so I am declaring victory. Year # 13 = 4/1/2020 population estimate.
When I do a similar calculation on the set22 data, I get the same answer of 329398742. I think this check makes sense because the only difference between set11 and set22 is one was total and the other was more detail by age groups. We know that set11 = set22
End