DAV5303 HW

\[\\[.1in]\]

Q1 - Data Acquisition

Automate the download of two files. Using an if/else process (provided in prompt) download and save census data and reference key. Check if the object is not already loaded.

if (!file.exists("./datafiles/cc-est2019-alldata6.pdf")) {
  url <- "https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2020/cc-est2020-alldata6.pdf"
  download.file(url,"./datafiles/cc-est2020-alldata6.pdf", mode = "wb")
}

# Check if the object is not already loaded
if (!exists("popSubset5")){ 
  if (file.exists("./datafiles/popSubset.rdata")){
    load("./datafiles/popSubset.rdata")
  } else {
    if (!dir.exists("./datafiles")){dir.create("datafiles")} 
    if (!file.exists("./datafiles/cc-est2020-alldata6.csv")){
      url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/asrh/CC-EST2020-ALLDATA6.csv"
      download.file(url,"./datafiles/cc-est2020-alldata6.csv")
    }
    popAllData <- readr::read_csv ("./datafiles/cc-est2020-alldata6.csv")
    sum(is.na(popAllData))  #identify NA, found 1634
    popAll_NoNA<- popAllData[complete.cases(popAllData), ] #remove rows with NA
    dim(popAll_NoNA) #Check dimensions
   
    # keep only the attributes 
    #"STNAME"= 4 ,"CTYNAME" = 5 ,"YEAR"= 6,"AGEGRP"=7,"TOT_POP"=8,"TOT_MALE" = 9,"TOT_FEMALE" = 10
    # Subset only columns 4-10, keeping all rows
    popSubset5 <- popAll_NoNA[ ,4:10] 
    save(popSubset5, file="./datafiles/popSubset.rdata")
  }
}

The original dataset features 836038 rows and 50 columns popSubset5 features 836000 rows and 7 columns \[\\[.1in]\]

Q2 - Data Filter

The key AGEGRP has an entry which is the sum of all the other age groups. Split the data into two pieces, keeping all the years. Set 1 should feature all rows that represent the total for all age groups (indicated by ‘0’ in the key). Set 2 should feature all age age groups (indicated by age groups greater than ‘0’) for all years.

set1 <- subset(popSubset5, AGEGRP == 0)
dim(set1)

## [1] 44000     7

set11 <- popSubset5 %>% filter(AGEGRP == 0)
dim (set11)

## [1] 44000     7

Set 1 features 44000 rows and 7 columns. \[\\[.1in]\]

set2 <- subset(popSubset5, AGEGRP != 0)
dim(set2)

## [1] 792000      7

set22 <- popSubset5 %>% filter(AGEGRP != 0)
dim (set22)

## [1] 792000      7

Set 2 features 792000 rows and 7 columns.

Question?

filter v subset - Both approaches seem to work. Is there a benefit to one approach over the other?

\[\\[.1in]\]

Q3 - Aggregate Across Age Groups

For set2, using group_by() and summarise(), aggregate across all the age groups by summing them.

set22_agesum <- set22 %>% 
  group_by (STNAME,CTYNAME,YEAR)%>%
  summarize (Sum_Male = sum(TOT_MALE),
             Sum_Female = sum(TOT_FEMALE), 
             Sum_Pop = sum(TOT_POP)
             )

## `summarise()` has grouped output by 'STNAME', 'CTYNAME'. You can override using
## the `.groups` argument.

print(set22_agesum)

## # A tibble: 44,000 x 6
## # Groups:   STNAME, CTYNAME [3,143]
##    STNAME  CTYNAME         YEAR Sum_Male Sum_Female Sum_Pop
##    <chr>   <chr>          <dbl>    <dbl>      <dbl>   <dbl>
##  1 Alabama Autauga County     1    26569      28002   54571
##  2 Alabama Autauga County     2    26576      28006   54582
##  3 Alabama Autauga County     3    26667      28094   54761
##  4 Alabama Autauga County     4    26980      28249   55229
##  5 Alabama Autauga County     5    26830      28140   54970
##  6 Alabama Autauga County     6    26588      28159   54747
##  7 Alabama Autauga County     7    26804      28118   54922
##  8 Alabama Autauga County     8    26752      28151   54903
##  9 Alabama Autauga County     9    26999      28303   55302
## 10 Alabama Autauga County    10    27041      28407   55448
## # ... with 43,990 more rows

dim(set22_agesum)

## [1] 44000     6

By aggregating across age groups we reduce the data to 44000 rows and 6 columns.

Question

Does this reduced data set have the same dimensions as set1? Comment more than just yes/no. E.g. should they?

Answer

set22_agessum has 44000 obs. of 6 variables compared to set11, which has 44000 observations of 7 variables. The difference between the two data frames is the 2nd one does not have AGEGRP as a column. My thought is that the two data frames were intended to be different.
\[\\[.1in]\]

Q4 - Aggregate Across Years

Summary of the notes I used to break down the problem:

set2.agesSumMean <- set2.agesSum %>% enter code group set 2 by STNAME,CTYNAME summarize the TOT_POP, TOT_FEMALE, TOT_MALE, but, this time filter by year Further reduce the data from question 3 by using only the average population across the yearly estimates from 2010-2019. NOTE This will require dropping more records from the 2010 Census and base. e.g. use only the ten July estimates.

The key for YEAR is as follows:

1 = 4/1/2010 Census population
2 = 4/1/2010 population estimates base
3 = 7/1/2010 population estimate
4 = 7/1/2011 population estimate
5 = 7/1/2012 population estimate
6 = 7/1/2013 population estimate
7 = 7/1/2014 population estimate
8 = 7/1/2015 population estimate
9 = 7/1/2016 population estimate
10 = 7/1/2017 population estimate
11 = 7/1/2018 population estimate
12 = 7/1/2019 population estimate
13 = 4/1/2020 population estimate

Applied this more simple example:

guzzlers2.MpgGearMean<- guzzlers2 %>%

filter(gear %in% (4:5)) %>%

group_by(cyl,carb) %>%

summarise(MeanMPG = mean(mpg)

)

set22_ageSumMean <- set22 %>%
  filter (YEAR %in% (3:12)) %>%
  group_by (STNAME, CTYNAME) %>%
  summarize (Mean_M = mean(TOT_MALE),
             Mean_F = mean(TOT_FEMALE),
             Mean_P = mean(TOT_POP)
             )

## `summarise()` has grouped output by 'STNAME'. You can override using the
## `.groups` argument.

print(set22_ageSumMean)

## # A tibble: 3,143 x 5
## # Groups:   STNAME [51]
##    STNAME  CTYNAME         Mean_M Mean_F Mean_P
##    <chr>   <chr>            <dbl>  <dbl>  <dbl>
##  1 Alabama Autauga County   1493.  1571.  3064.
##  2 Alabama Baldwin County   5464.  5756. 11219.
##  3 Alabama Barbour County    774.   683.  1458.
##  4 Alabama Bibb County       672.   581.  1253.
##  5 Alabama Blount County    1580.  1621.  3201.
##  6 Alabama Bullock County    316.   266.   582.
##  7 Alabama Butler County     525.   600.  1125.
##  8 Alabama Calhoun County   3100.  3343.  6442.
##  9 Alabama Chambers County   900.   980.  1881.
## 10 Alabama Cherokee County   716.   726.  1441.
## # ... with 3,133 more rows

dim(set22_ageSumMean)

## [1] 3143    5

Data aggregated across years features 3143 rows and 5 columns

Question for Terry?

Above worked When I used the original data frame = set22 but didn’t work when I try to use the set22_agesum. Or, maybe I did this incorrectly? What am I missing? \[\\[.1in]\]

Q5 - Group By State

Continuing with the data from question 4, find State populations by aggregating with a sum of the population of each County.

set22_state <- set22 %>%
  group_by(STNAME)%>%
  summarize(Sum_County = sum(TOT_POP),
            Sum_Co_M = sum(TOT_MALE),
            Sum_Co_F = sum(TOT_FEMALE)
            )
print(set22_state)

## # A tibble: 51 x 4
##    STNAME               Sum_County  Sum_Co_M  Sum_Co_F
##    <chr>                     <dbl>     <dbl>     <dbl>
##  1 Alabama                67878412  32881261  34997151
##  2 Alaska                 10208130   5327668   4880462
##  3 Arizona                95686011  47552687  48133324
##  4 Arkansas               41645819  20450084  21195735
##  5 California            539302676 268016418 271286258
##  6 Colorado               75722244  38045457  37676787
##  7 Connecticut            50105551  24421231  25684320
##  8 Delaware               13141186   6358390   6782796
##  9 District of Columbia    9279031   4397695   4881336
## 10 Florida               282207193 137963792 144243401
## # ... with 41 more rows

dim(set22_state)

## [1] 51  4

Data aggregated across counties features 51 rows (for each of the 50 states and Washington D.C.) and 4 columns.

Q6 - Aggregate All States

Now add up the total population attribute from data in Q5 and you should get single number which is the population of entire country. Is this true? (provide a reference), Does this make sense? Note: Terry says the total US population is 319,333,559 within the template.

US_Pop <- set11%>%
  filter(YEAR == 13)%>%
  summarize(AllUS = sum(TOT_POP))

print (US_Pop)

## # A tibble: 1 x 1
##       AllUS
##       <dbl>
## 1 329398742

US_Pop2 <- set22%>%
  filter(YEAR == 13)%>%
  summarize(AllUS = sum(TOT_POP))

print (US_Pop2)

## # A tibble: 1 x 1
##       AllUS
##       <dbl>
## 1 329398742

When I summarize, the answer I get is 329398742

A google search says the population was 331,108,434 as of April, 2020. The variance of 1709692 is .0051% variance so I am declaring victory. Year # 13 = 4/1/2020 population estimate.

When I do a similar calculation on the set22 data, I get the same answer of 329398742. I think this check makes sense because the only difference between set11 and set22 is one was total and the other was more detail by age groups. We know that set11 = set22

End

DAV5303 HW_3

Lisa Coletti