library(tidyverse)
acs_2015 <- read_csv("Week 4/homework3/acs_2015_county_data_revised.csv")
How many rows and columns are in the data set?
dim(acs_2015)
## [1] 3142 35
class(acs_2015)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
This data set has 3142 rows and 35 columns, and is classified as multiple data types.
str(), most of the variables here appear to be correctly categorized as strings and numbers. However, variables like total_pop, men, women, and citizen could be changed to integers since they refer to numbers of people and not decimal values. I think that census_id should probably be changed to character variable since is is not a value that should be modified, as it refers to a unique id.str(acs_2015)
## tibble [3,142 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ census_id : num [1:3142] 1001 1003 1005 1007 1009 ...
## $ state : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ county : chr [1:3142] "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ total_pop : num [1:3142] 55221 195121 26932 22604 57710 ...
## $ men : num [1:3142] 26745 95314 14497 12073 28512 ...
## $ women : num [1:3142] 28476 99807 12435 10531 29198 ...
## $ hispanic : num [1:3142] 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
## $ white : num [1:3142] 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
## $ black : num [1:3142] 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
## $ native : num [1:3142] 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
## $ asian : num [1:3142] 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
## $ pacific : num [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
## $ citizen : num [1:3142] 40725 147695 20714 17495 42345 ...
## $ income : num [1:3142] 51281 50254 32964 38678 45813 ...
## $ income_per_cap: num [1:3142] 24974 27317 16824 18431 20532 ...
## $ poverty : num [1:3142] 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
## $ child_poverty : num [1:3142] 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
## $ professional : num [1:3142] 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
## $ service : num [1:3142] 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
## $ office : num [1:3142] 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
## $ construction : num [1:3142] 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
## $ production : num [1:3142] 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
## $ drive : num [1:3142] 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
## $ carpool : num [1:3142] 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
## $ transit : num [1:3142] 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
## $ walk : num [1:3142] 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
## $ other_transp : num [1:3142] 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
## $ work_at_home : num [1:3142] 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
## $ mean_commute : num [1:3142] 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
## $ employed : num [1:3142] 23986 85953 8597 8294 22189 ...
## $ private_work : num [1:3142] 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
## $ public_work : num [1:3142] 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
## $ self_employed : num [1:3142] 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
## $ family_work : num [1:3142] 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
## $ unemployment : num [1:3142] 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
## - attr(*, "spec")=
## .. cols(
## .. census_id = col_double(),
## .. state = col_character(),
## .. county = col_character(),
## .. total_pop = col_double(),
## .. men = col_double(),
## .. women = col_double(),
## .. hispanic = col_double(),
## .. white = col_double(),
## .. black = col_double(),
## .. native = col_double(),
## .. asian = col_double(),
## .. pacific = col_double(),
## .. citizen = col_double(),
## .. income = col_double(),
## .. income_per_cap = col_double(),
## .. poverty = col_double(),
## .. child_poverty = col_double(),
## .. professional = col_double(),
## .. service = col_double(),
## .. office = col_double(),
## .. construction = col_double(),
## .. production = col_double(),
## .. drive = col_double(),
## .. carpool = col_double(),
## .. transit = col_double(),
## .. walk = col_double(),
## .. other_transp = col_double(),
## .. work_at_home = col_double(),
## .. mean_commute = col_double(),
## .. employed = col_double(),
## .. private_work = col_double(),
## .. public_work = col_double(),
## .. self_employed = col_double(),
## .. family_work = col_double(),
## .. unemployment = col_double()
## .. )
Let’s fix those variables!
acs_2015$total_pop <- as.integer(acs_2015$total_pop)
acs_2015$men <- as.integer(acs_2015$men)
acs_2015$women <- as.integer(acs_2015$women)
acs_2015$citizen <- as.integer(acs_2015$citizen)
acs_2015$census_id <- as.character(acs_2015$census_id)
Now that the variable types have been changes, let’s see a glimpse at the data.
glimpse(acs_2015)
## Rows: 3,142
## Columns: 35
## $ census_id <chr> "1001", "1003", "1005", "1007", "1009", "1011", "1013"…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",…
## $ county <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bu…
## $ total_pop <int> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 1166…
## $ men <int> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274, …
## $ women <int> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 60374,…
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7.6,…
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3, …
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, 4.…
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0.4,…
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0.3,…
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
## $ citizen <int> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88612…
## $ income <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41703…
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21374…
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6, …
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2, …
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3, …
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5, …
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3, …
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5, 1…
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4, …
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1, …
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 12.…
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0.2,…
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1.1,…
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1.4,…
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1.9,…
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1, …
## $ employed <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, 13…
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1, …
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1, …
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4.1,…
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0.5,…
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9, …
colSums(is.na(acs_2015))
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 1 0
## poverty child_poverty professional service office
## 0 1 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
There is one missing value in income and one in child_poverty.
Let’s look at those rows of the data to determine if we should delete them altogether.
filter(acs_2015, is.na(income))
filter(acs_2015, is.na(child_poverty))
There does not seem to be a reason to delete these observations, so I will impute with mean.
acs_2015$income[is.na(acs_2015$income)] <- mean(acs_2015$income, na.rm = TRUE)
acs_2015$child_poverty[is.na(acs_2015$child_poverty)] <- mean(acs_2015$child_poverty, na.rm = TRUE)
QA check:
colSums(is.na(acs_2015))
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 0 0
## poverty child_poverty professional service office
## 0 0 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
summary(acs_2015)
It appears that employed is not in terms of percentage, so that will need to be handled.
acs_2015 <- mutate(acs_2015,
employed = round((employed/total_pop) * 100, digits = 1)
)
Next, let’s take a final look at a couple rows of our data.
head(acs_2015, 5)
I ran the first line of code just because I wanted to make sure it was working correctly, as I am still unfamiliar with the pipeline operator. The second line actually counts up the counties with more women than men by filtering the data.
acs_2015 %>% filter(women> men) %>% select(county, women, men)
acs_2015 %>% filter(women > men) %>% NROW
## [1] 1985
So, there are 1,985 counties with more women than men.
We see from the following that there are 2,420 counties with unemployment lower than 10 percent.
acs_2015 %>% filter(unemployment < 10) %>% NROW
## [1] 2420
dplyr::top_n(), we find the top 10 counties with highest mean commmute and display county name, mean commute, census ID, and state.acs_2015 %>% top_n(10, mean_commute) %>% arrange(desc(mean_commute)) %>% select(census_id, county, state, mean_commute)
First, we create a new variable that calculates the percentage of women for each county.
acs_2015 <- mutate(acs_2015,
percent_women = round((women/total_pop) * 100, digits = 1)
)
Next, let’s find the top 10 counties with the lowest percentages of women and see the census ID, county name, state, and percentage.
acs_2015 %>% top_n(-10, percent_women) %>% arrange(percent_women) %>% select(census_id, county, state, percent_women)
Here, we see 11 counties output because Pershing county and Garza county are tied at number 10 with 33.7% women.
We create the new variable below as desired.
acs_2015 <- mutate(acs_2015,
race_sum = hispanic + white + black + native + asian
)
acs_2015 %>% top_n(-10, race_sum) %>% arrange(race_sum) %>% select(census_id, county, state, race_sum)
acs_2015 %>% group_by(state) %>% summarize(avg_race = mean(race_sum)) %>% arrange(avg_race) %>% top_n(-1, avg_race)
Grouping by state and finding the mean of this race percentage tells us that Hawaii has the lowest sum. This makes sense because in part a we saw that the top 5 lowest counties were in Hawaii.
acs_2015 %>% filter(race_sum > 100) %>% NROW
## [1] 9
Based on the above output, there are 9 counties with a sum greater than 100%.
acs_2015 %>% filter(race_sum == 100) %>% NROW
## [1] 27
There are 27 states with a sum exactly equal to 100% when we filter our data with this constraint.
dplyr::min_rank() function to create a new variable called carpool_rank where the highest ranked county (rank = 1) is the county with the highest carpool value. Read the documentation carefully for the ranking function.Based on the documentation of min_rank() and rank(), this will by default ranke the lowest value as rank = 1. So, to reverse that we will use desc(carpool) within the min_rank() function.
acs_2015 <- mutate(acs_2015,
carpool_rank = min_rank(desc(carpool))
)
acs_2015 %>% select(census_id, county, state, carpool, carpool_rank ) %>% arrange(carpool_rank) %>% top_n(-5, carpool_rank)
acs_2015 %>% select(census_id, county, state, carpool, carpool_rank ) %>% arrange(carpool_rank) %>% top_n(-10, carpool_rank)
acs_2015 %>% select(census_id, county, state, carpool, carpool_rank ) %>% arrange(carpool_rank) %>% top_n(10, carpool_rank)
acs_2015 %>% group_by(state) %>% summarize(avg_carpool_rank = mean(carpool_rank)) %>% arrange(avg_carpool_rank) %>% top_n(-1, avg_carpool_rank)
acs_2015 %>% group_by(state) %>% summarize(avg_carpool_rank = mean(carpool_rank)) %>% arrange(avg_carpool_rank) %>% top_n(-5, avg_carpool_rank)