Import the data set using a Tidyverse function and NOT with a Base R function. How many rows and columns are in the data set?
The following code loads the tidyverse package, imports the ACS data set with the read_csv() function, and reports the number of rows and columns in the data set:
library(tidyverse)
acs_data <- read_csv("/Users/swethainuganti/Desktop/Cincinnati /1st Sem/Data Wrangling/week 4/Week 4/homework3/acs_2015_county_data_revised.csv")
dim(acs_data)
## [1] 3142 35
The ACS data set has 3142 rows (observations) and 35 columns (variables).
Do any data types need to be changed? Show any code to change variable types and show code/output for a command after you're finished.
The following code was used to see the data types of each variable:
str(acs_data)
## spec_tbl_df [3,142 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ census_id : num [1:3142] 1001 1003 1005 1007 1009 ...
## $ state : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ county : chr [1:3142] "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ total_pop : num [1:3142] 55221 195121 26932 22604 57710 ...
## $ men : num [1:3142] 26745 95314 14497 12073 28512 ...
## $ women : num [1:3142] 28476 99807 12435 10531 29198 ...
## $ hispanic : num [1:3142] 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
## $ white : num [1:3142] 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
## $ black : num [1:3142] 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
## $ native : num [1:3142] 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
## $ asian : num [1:3142] 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
## $ pacific : num [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
## $ citizen : num [1:3142] 40725 147695 20714 17495 42345 ...
## $ income : num [1:3142] 51281 50254 32964 38678 45813 ...
## $ income_per_cap: num [1:3142] 24974 27317 16824 18431 20532 ...
## $ poverty : num [1:3142] 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
## $ child_poverty : num [1:3142] 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
## $ professional : num [1:3142] 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
## $ service : num [1:3142] 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
## $ office : num [1:3142] 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
## $ construction : num [1:3142] 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
## $ production : num [1:3142] 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
## $ drive : num [1:3142] 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
## $ carpool : num [1:3142] 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
## $ transit : num [1:3142] 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
## $ walk : num [1:3142] 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
## $ other_transp : num [1:3142] 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
## $ work_at_home : num [1:3142] 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
## $ mean_commute : num [1:3142] 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
## $ employed : num [1:3142] 23986 85953 8597 8294 22189 ...
## $ private_work : num [1:3142] 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
## $ public_work : num [1:3142] 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
## $ self_employed : num [1:3142] 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
## $ family_work : num [1:3142] 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
## $ unemployment : num [1:3142] 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
## - attr(*, "spec")=
## .. cols(
## .. census_id = col_double(),
## .. state = col_character(),
## .. county = col_character(),
## .. total_pop = col_double(),
## .. men = col_double(),
## .. women = col_double(),
## .. hispanic = col_double(),
## .. white = col_double(),
## .. black = col_double(),
## .. native = col_double(),
## .. asian = col_double(),
## .. pacific = col_double(),
## .. citizen = col_double(),
## .. income = col_double(),
## .. income_per_cap = col_double(),
## .. poverty = col_double(),
## .. child_poverty = col_double(),
## .. professional = col_double(),
## .. service = col_double(),
## .. office = col_double(),
## .. construction = col_double(),
## .. production = col_double(),
## .. drive = col_double(),
## .. carpool = col_double(),
## .. transit = col_double(),
## .. walk = col_double(),
## .. other_transp = col_double(),
## .. work_at_home = col_double(),
## .. mean_commute = col_double(),
## .. employed = col_double(),
## .. private_work = col_double(),
## .. public_work = col_double(),
## .. self_employed = col_double(),
## .. family_work = col_double(),
## .. unemployment = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
After reviewing the data types and descriptions for each variable, two changes were made.
census_id has been converted to a character variable so it can serve as a label for each observation, since it will not be used in any arithmetic analysis.
Additionally, state has been converted to a factor because it is a categorical variable whose observations can be grouped by individual state.
# Convert census_id to a character identifier and state to a factor
acs_data$census_id <- as.character(acs_data$census_id)
acs_data$state <- as.factor(acs_data$state)
str(acs_data)
## spec_tbl_df [3,142 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ census_id : chr [1:3142] "1001" "1003" "1005" "1007" ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ county : chr [1:3142] "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ total_pop : num [1:3142] 55221 195121 26932 22604 57710 ...
## $ men : num [1:3142] 26745 95314 14497 12073 28512 ...
## $ women : num [1:3142] 28476 99807 12435 10531 29198 ...
## $ hispanic : num [1:3142] 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
## $ white : num [1:3142] 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
## $ black : num [1:3142] 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
## $ native : num [1:3142] 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
## $ asian : num [1:3142] 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
## $ pacific : num [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
## $ citizen : num [1:3142] 40725 147695 20714 17495 42345 ...
## $ income : num [1:3142] 51281 50254 32964 38678 45813 ...
## $ income_per_cap: num [1:3142] 24974 27317 16824 18431 20532 ...
## $ poverty : num [1:3142] 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
## $ child_poverty : num [1:3142] 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
## $ professional : num [1:3142] 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
## $ service : num [1:3142] 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
## $ office : num [1:3142] 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
## $ construction : num [1:3142] 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
## $ production : num [1:3142] 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
## $ drive : num [1:3142] 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
## $ carpool : num [1:3142] 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
## $ transit : num [1:3142] 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
## $ walk : num [1:3142] 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
## $ other_transp : num [1:3142] 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
## $ work_at_home : num [1:3142] 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
## $ mean_commute : num [1:3142] 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
## $ employed : num [1:3142] 23986 85953 8597 8294 22189 ...
## $ private_work : num [1:3142] 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
## $ public_work : num [1:3142] 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
## $ self_employed : num [1:3142] 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
## $ family_work : num [1:3142] 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
## $ unemployment : num [1:3142] 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
## - attr(*, "spec")=
## .. cols(
## .. census_id = col_double(),
## .. state = col_character(),
## .. county = col_character(),
## .. total_pop = col_double(),
## .. men = col_double(),
## .. women = col_double(),
## .. hispanic = col_double(),
## .. white = col_double(),
## .. black = col_double(),
## .. native = col_double(),
## .. asian = col_double(),
## .. pacific = col_double(),
## .. citizen = col_double(),
## .. income = col_double(),
## .. income_per_cap = col_double(),
## .. poverty = col_double(),
## .. child_poverty = col_double(),
## .. professional = col_double(),
## .. service = col_double(),
## .. office = col_double(),
## .. construction = col_double(),
## .. production = col_double(),
## .. drive = col_double(),
## .. carpool = col_double(),
## .. transit = col_double(),
## .. walk = col_double(),
## .. other_transp = col_double(),
## .. work_at_home = col_double(),
## .. mean_commute = col_double(),
## .. employed = col_double(),
## .. private_work = col_double(),
## .. public_work = col_double(),
## .. self_employed = col_double(),
## .. family_work = col_double(),
## .. unemployment = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
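The same two conversions can also be written as a single tidyverse-style mutate() call; a minimal sketch producing the same result as the base R assignments above:
# Equivalent tidyverse-style conversion of the two columns in one step
acs_data <- acs_data %>%
  mutate(
    census_id = as.character(census_id),
    state = as.factor(state)
  )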
Are there any missing values? How will you handle missing values? Will you impute a missing value with, for example, a mean or median value for the entire column, or will you remove the entire observation? Give a rationale for your decision and show any code/output to handle missing values.
sum(is.na(acs_data))
## [1] 2
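A quick per-column count confirms which fields hold the missing values; a sketch (assuming dplyr >= 1.0 for across() and where()):
# Count NAs per column and keep only the columns with at least one
acs_data %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  select(where(~ . > 0))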
There are only 2 missing values, in the income and child_poverty fields. We can simply remove these observations: there are so few of them that dropping them will not meaningfully affect the data. Another approach would be to impute each missing value with the mean of that column within the corresponding state, but given the small number of missing values we remove the observations instead.
#removing observations with missing values in columns "income" & "child_poverty"
acs_data <- acs_data %>% drop_na()
glimpse(acs_data)
## Rows: 3,140
## Columns: 35
## $ census_id <chr> "1001", "1003", "1005", "1007", "1009", "1011", "1013",…
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, A…
## $ county <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bul…
## $ total_pop <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 11664…
## $ men <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274, 1…
## $ women <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 60374, …
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7.6, …
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3, 9…
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, 4.8…
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0.4, …
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0.3, …
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ citizen <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88612,…
## $ income <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41703,…
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21374,…
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6, 1…
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2, 3…
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3, 2…
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5, 1…
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3, 1…
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5, 13…
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4, 2…
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1, 8…
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 12.1…
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0.2, …
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1.1, …
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1.4, …
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1.9, …
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1, 2…
## $ employed <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, 136…
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1, 7…
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1, 1…
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4.1, …
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0.5, …
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9, 9…
After removing the missing values, the row count drops by 2, from 3142 to 3140.
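For reference, the state-wise mean imputation described above (but not applied) could be sketched as follows; it would have to be run before dropping any rows, and it assumes the two NAs sit in income and child_poverty:
# Hypothetical alternative (not used here): replace each missing income or
# child_poverty value with the mean of that column within the same state
acs_data_imputed <- acs_data %>%
  group_by(state) %>%
  mutate(
    income = if_else(is.na(income), mean(income, na.rm = TRUE), income),
    child_poverty = if_else(is.na(child_poverty), mean(child_poverty, na.rm = TRUE), child_poverty)
  ) %>%
  ungroup()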
Use the summary() function to examine any unusual values. Are there any? If so, how will you handle these unusual values? Show any code/output to handle unusual values.
acs_data %>% summary()
## census_id state county total_pop
## Length:3140 Texas : 253 Length:3140 Min. : 267
## Class :character Georgia : 159 Class :character 1st Qu.: 11036
## Mode :character Virginia: 133 Mode :character Median : 25793
## Kentucky: 120 Mean : 100801
## Missouri: 115 3rd Qu.: 67620
## Kansas : 105 Max. :10038388
## (Other) :2255
## men women hispanic white
## Min. : 136 Min. : 131 Min. : 0.000 Min. : 0.90
## 1st Qu.: 5551 1st Qu.: 5488 1st Qu.: 1.900 1st Qu.:65.67
## Median : 12838 Median : 12916 Median : 3.700 Median :84.65
## Mean : 49597 Mean : 51204 Mean : 8.819 Mean :77.31
## 3rd Qu.: 33328 3rd Qu.: 34123 3rd Qu.: 9.000 3rd Qu.:93.33
## Max. :4945351 Max. :5093037 Max. :98.700 Max. :99.80
##
## black native asian pacific
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00000
## 1st Qu.: 0.600 1st Qu.: 0.100 1st Qu.: 0.200 1st Qu.: 0.00000
## Median : 2.100 Median : 0.300 Median : 0.500 Median : 0.00000
## Mean : 8.885 Mean : 1.763 Mean : 1.253 Mean : 0.07357
## 3rd Qu.:10.200 3rd Qu.: 0.600 3rd Qu.: 1.200 3rd Qu.: 0.00000
## Max. :85.900 Max. :92.100 Max. :41.600 Max. :11.10000
##
## citizen income income_per_cap poverty
## Min. : 199 Min. : 19328 Min. : 8292 Min. : 1.4
## 1st Qu.: 8276 1st Qu.: 38826 1st Qu.:20470 1st Qu.:12.0
## Median : 19454 Median : 45094 Median :23574 Median :16.0
## Mean : 70849 Mean : 46824 Mean :24331 Mean :16.7
## 3rd Qu.: 50795 3rd Qu.: 52248 3rd Qu.:27138 3rd Qu.:20.3
## Max. :6046749 Max. :123453 Max. :65600 Max. :53.3
##
## child_poverty professional service office
## Min. : 0.00 Min. :13.50 Min. : 5.00 Min. : 4.10
## 1st Qu.:16.10 1st Qu.:26.70 1st Qu.:15.90 1st Qu.:20.20
## Median :22.50 Median :30.00 Median :18.00 Median :22.40
## Mean :23.29 Mean :31.05 Mean :18.25 Mean :22.13
## 3rd Qu.:29.50 3rd Qu.:34.42 3rd Qu.:20.20 3rd Qu.:24.30
## Max. :72.30 Max. :74.00 Max. :36.60 Max. :35.40
##
## construction production drive carpool
## Min. : 1.70 Min. : 0.00 Min. : 5.2 Min. : 0.00
## 1st Qu.: 9.80 1st Qu.:11.50 1st Qu.:76.6 1st Qu.: 8.50
## Median :12.20 Median :15.40 Median :80.6 Median : 9.90
## Mean :12.75 Mean :15.82 Mean :79.1 Mean :10.33
## 3rd Qu.:15.00 3rd Qu.:19.40 3rd Qu.:83.6 3rd Qu.:11.90
## Max. :40.30 Max. :55.60 Max. :94.6 Max. :29.90
##
## transit walk other_transp work_at_home
## Min. : 0.0000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.1000 1st Qu.: 1.400 1st Qu.: 0.90 1st Qu.: 2.800
## Median : 0.4000 Median : 2.400 Median : 1.30 Median : 4.000
## Mean : 0.9681 Mean : 3.294 Mean : 1.61 Mean : 4.694
## 3rd Qu.: 0.8000 3rd Qu.: 4.000 3rd Qu.: 1.90 3rd Qu.: 5.700
## Max. :61.7000 Max. :71.200 Max. :39.10 Max. :37.200
##
## mean_commute employed private_work public_work
## Min. : 4.90 Min. : 166 Min. :29.50 Min. : 5.80
## 1st Qu.:19.30 1st Qu.: 4532 1st Qu.:70.90 1st Qu.:13.07
## Median :22.90 Median : 10657 Median :75.85 Median :16.10
## Mean :23.15 Mean : 46416 Mean :74.45 Mean :17.33
## 3rd Qu.:26.60 3rd Qu.: 29272 3rd Qu.:79.80 3rd Qu.:20.10
## Max. :44.00 Max. :4635465 Max. :88.30 Max. :66.20
##
## self_employed family_work unemployment
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 5.400 1st Qu.:0.1000 1st Qu.: 5.500
## Median : 6.900 Median :0.2000 Median : 7.500
## Mean : 7.922 Mean :0.2917 Mean : 7.815
## 3rd Qu.: 9.400 3rd Qu.:0.3000 3rd Qu.: 9.700
## Max. :36.600 Max. :9.8000 Max. :29.400
##
Looking at the summary, the mean and median are far apart for several fields, for example total_pop, men, women, hispanic, black, and citizen. This suggests there are outliers in the data and that these variables are strongly right-skewed rather than evenly distributed across counties. Because these extreme values appear to reflect genuinely large or unusual counties rather than data-entry errors, they are retained rather than removed or capped.
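As a rough check on how extreme these values are, a sketch using the common 1.5 × IQR boxplot rule on total_pop (the 1.5 × IQR threshold is an assumed convention here, not part of the assignment):
# Count counties whose total population lies more than 1.5 * IQR above Q3
q3 <- quantile(acs_data$total_pop, 0.75)
iqr <- IQR(acs_data$total_pop)
sum(acs_data$total_pop > q3 + 1.5 * iqr)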
How many counties have more women than men?
acs_data %>%
filter(women > men) %>%
nrow()
## [1] 1984
There are 1984 counties that have more women than men.
How many counties have an unemployment rate lower than 10%?
acs_data %>%
filter(unemployment < 10.0) %>%
nrow()
## [1] 2419
There are 2419 counties that have an unemployment rate lower than 10%.
What are the top 10 counties with the highest mean commute? Show the census ID, county name, state, and the mean_commute in your final answer (sorted by mean_commute).
The top 10 counties with the highest mean commute are shown below:
acs_data %>%
select(census_id,state,county,mean_commute) %>%
top_n(10, mean_commute)
## # A tibble: 10 × 4
## census_id state county mean_commute
## <chr> <fct> <chr> <dbl>
## 1 8093 Colorado Park 42.4
## 2 24017 Maryland Charles 42.8
## 3 36005 New York Bronx 43
## 4 36047 New York Kings 41.7
## 5 36081 New York Queens 42.6
## 6 36085 New York Richmond 42.6
## 7 42103 Pennsylvania Pike 44
## 8 51187 Virginia Warren 42.7
## 9 51193 Virginia Westmoreland 42.5
## 10 54015 West Virginia Clay 41.4
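Note that top_n() does not sort its output, so the table above is not ordered by mean_commute as the question asks. A sketch using slice_max() (dplyr >= 1.0) returns the ten largest values already sorted in descending order of mean_commute:
# slice_max() keeps the 10 largest mean_commute values, largest first
acs_data %>%
  select(census_id, state, county, mean_commute) %>%
  slice_max(mean_commute, n = 10)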
Create a new variable that calculates the percentage of women for each county and then find the top 10 counties with the lowest percentages. Show the census ID, county name, state, and the percentage in your final answer (sorted by ascending percentage).
acs_data %>%
mutate(percent_women = women*100/total_pop) %>%
select(census_id,county,state,percent_women) %>%
group_by(census_id,state,county) %>%
summarize(percent_women_total = sum(round(percent_women,2))) %>%
arrange(percent_women_total) %>% head(10)
## # A tibble: 10 × 4
## # Groups: census_id, state [10]
## census_id state county percent_women_total
## <chr> <fct> <chr> <dbl>
## 1 42053 Pennsylvania Forest 26.8
## 2 8011 Colorado Bent 31.4
## 3 51183 Virginia Sussex 31.5
## 4 13309 Georgia Wheeler 32.1
## 5 6035 California Lassen 33.2
## 6 48095 Texas Concho 33.3
## 7 13053 Georgia Chattahoochee 33.4
## 8 2013 Alaska Aleutians East Borough 33.5
## 9 22125 Louisiana West Feliciana 33.6
## 10 32027 Nevada Pershing 33.7
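Since each row of acs_data already describes a single county, the group_by()/summarize() step above is not strictly necessary; a leaner sketch that should give the same ten rows:
# One row per county, so mutate + arrange + head is sufficient
acs_data %>%
  mutate(percent_women = round(women / total_pop * 100, 2)) %>%
  select(census_id, state, county, percent_women) %>%
  arrange(percent_women) %>%
  head(10)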
Create a new variable that calculates the sum of all race percentage variables (these columns are the “hispanic”, “white”, “black”, “native”, “asian”, and “pacific” variables).
What are the top 10 counties with the lowest sum of these race percentage variables?
acs_data %>%
select(state,county,hispanic,white,black,native,asian,pacific) %>%
mutate(race_percent = hispanic+white+black+native+asian+pacific) -> acs_data_race_filtered
head(acs_data_race_filtered)
## # A tibble: 6 × 9
## state county hispanic white black native asian pacific race_percent
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama Autauga 2.6 75.8 18.5 0.4 1 0 98.3
## 2 Alabama Baldwin 4.5 83.1 9.5 0.6 0.7 0 98.4
## 3 Alabama Barbour 4.6 46.2 46.7 0.2 0.4 0 98.1
## 4 Alabama Bibb 2.2 74.5 21.4 0.4 0.1 0 98.6
## 5 Alabama Blount 8.6 87.9 1.5 0.3 0.1 0 98.4
## 6 Alabama Bullock 4.4 22.2 70.7 1.2 0.2 0 98.7
acs_data_race_filtered %>%
  group_by(state, county) %>%
  summarize(percent_total = sum(race_percent)) %>%
  arrange(percent_total)
## # A tibble: 3,140 × 3
## # Groups: state [51]
## state county percent_total
## <fct> <chr> <dbl>
## 1 Hawaii Hawaii 76.4
## 2 Hawaii Maui 79.2
## 3 Oklahoma Mayes 79.7
## 4 Hawaii Honolulu 81.5
## 5 Oklahoma Pontotoc 82.8
## 6 Tennessee Grundy 83
## 7 Alaska Yakutat City and Borough 83.4
## 8 Oklahoma Johnston 84
## 9 Hawaii Kauai 84.1
## 10 Oklahoma Alfalfa 85.1
## # … with 3,130 more rows
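To return only the ten requested counties rather than the entire sorted table, a sketch using slice_min() (dplyr >= 1.0; with_ties = FALSE guarantees exactly ten rows):
# Keep the 10 smallest race_percent sums, sorted ascending
acs_data_race_filtered %>%
  select(state, county, race_percent) %>%
  slice_min(race_percent, n = 10, with_ties = FALSE)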
Which state, on average, has the lowest sum of these race percentage variables?
acs_data_race_filtered %>%
  group_by(state) %>%
  summarize(avg_percent_total = mean(race_percent)) %>%
  arrange(avg_percent_total) %>%
  head(1)
## # A tibble: 1 × 2
## state avg_percent_total
## <fct> <dbl>
## 1 Hawaii 80.3
Hawaii, on average, has the lowest sum of these race percentage variables.
Do any counties have a sum greater than 100%?
acs_data_race_filtered %>%
  group_by(state, county) %>%
  summarize(percent_total = sum(race_percent)) %>%
  filter(percent_total > 100)
## # A tibble: 11 × 3
## # Groups: state [4]
## state county percent_total
## <fct> <chr> <dbl>
## 1 Mississippi Claiborne 100
## 2 Nebraska Gosper 100.
## 3 Nebraska Hooker 100.
## 4 Nebraska Nance 100.
## 5 Texas Bailey 100.
## 6 Texas Duval 100
## 7 Texas Edwards 100.
## 8 Texas Kenedy 100
## 9 Texas Kent 100
## 10 Texas Presidio 100
## 11 Utah Beaver 100
Yes, there are 11 counties across 4 states whose race percentages sum to more than 100%. (Several of the printed totals display as 100 because of rounding; each underlying sum is slightly above 100.)
How many states have a sum that equals exactly 100%?
acs_data_race_filtered %>%
group_by(state) %>%
summarize(state_percent_total = sum(race_percent)) %>%
filter(state_percent_total == 100)
## # A tibble: 0 × 2
## # … with 2 variables: state <fct>, state_percent_total <dbl>
No states have a sum of race percentages equal to exactly 100%.
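Exact equality on floating-point sums is fragile, so if the intent is “equal to 100 within a small tolerance”, dplyr::near() is the usual tool; a sketch with the same grouping as above:
# near() compares doubles with a small tolerance instead of exact ==
acs_data_race_filtered %>%
  group_by(state) %>%
  summarize(state_percent_total = sum(race_percent)) %>%
  filter(near(state_percent_total, 100))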
Using the carpool variable, use a ranking function to create a new variable called carpool_rank where the highest ranked county (rank = 1) is the county with the highest carpool value. Read the documentation carefully for the ranking function.
Using the carpool variable, a new variable carpool_rank has been created with min_rank(), where the highest ranked county (rank = 1) is the county with the highest carpool value.
# Rank counties by carpool value: the highest carpool value receives rank 1
acs_data$carpool_rank <- min_rank(-(acs_data$carpool))
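An equivalent, pipe-friendly form of the same assignment (a sketch using mutate() and desc()):
# min_rank(desc(carpool)) gives rank 1 to the highest carpool value;
# tied counties share the smallest available rank
acs_data <- acs_data %>%
  mutate(carpool_rank = min_rank(desc(carpool)))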
Find the 10 highest ranked counties for carpooling. Show the census ID, county name, state, carpool value, and carpool_rank in your final answer.
acs_data %>%
select(census_id, county, state, carpool, carpool_rank) %>%
arrange(carpool_rank) %>%
top_n(-10, wt=carpool_rank)
## # A tibble: 10 × 5
## census_id county state carpool carpool_rank
## <chr> <chr> <fct> <dbl> <int>
## 1 13061 Clay Georgia 29.9 1
## 2 18087 LaGrange Indiana 27 2
## 3 13165 Jenkins Georgia 25.3 3
## 4 5133 Sevier Arkansas 24.4 4
## 5 20175 Seward Kansas 23.4 5
## 6 48079 Cochran Texas 22.8 6
## 7 48247 Jim Hogg Texas 22.6 7
## 8 48393 Roberts Texas 22.4 8
## 9 39075 Holmes Ohio 21.8 9
## 10 21197 Powell Kentucky 21.6 10
The 10 highest ranked counties for carpooling are in the table above.
Find the 10 lowest ranked counties for carpooling. Show the same variables in your final answer.
# Re-rank with base rank() and ties.method = "max" so tied counties share the largest rank
(acs_data <- acs_data %>%
  mutate(carpool_rank = rank(desc(carpool), ties.method = "max")))
## # A tibble: 3,140 × 36
## census_id state county total_pop men women hispanic white black native
## <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alabama Autauga 55221 26745 28476 2.6 75.8 18.5 0.4
## 2 1003 Alabama Baldwin 195121 95314 99807 4.5 83.1 9.5 0.6
## 3 1005 Alabama Barbour 26932 14497 12435 4.6 46.2 46.7 0.2
## 4 1007 Alabama Bibb 22604 12073 10531 2.2 74.5 21.4 0.4
## 5 1009 Alabama Blount 57710 28512 29198 8.6 87.9 1.5 0.3
## 6 1011 Alabama Bullock 10678 5660 5018 4.4 22.2 70.7 1.2
## 7 1013 Alabama Butler 20354 9502 10852 1.2 53.3 43.8 0.1
## 8 1015 Alabama Calhoun 116648 56274 60374 3.5 73 20.3 0.2
## 9 1017 Alabama Chambers 34079 16258 17821 0.4 57.3 40.3 0.2
## 10 1019 Alabama Cherokee 26008 12975 13033 1.5 91.7 4.8 0.6
## # … with 3,130 more rows, and 26 more variables: asian <dbl>, pacific <dbl>,
## # citizen <dbl>, income <dbl>, income_per_cap <dbl>, poverty <dbl>,
## # child_poverty <dbl>, professional <dbl>, service <dbl>, office <dbl>,
## # construction <dbl>, production <dbl>, drive <dbl>, carpool <dbl>,
## # transit <dbl>, walk <dbl>, other_transp <dbl>, work_at_home <dbl>,
## # mean_commute <dbl>, employed <dbl>, private_work <dbl>, public_work <dbl>,
## # self_employed <dbl>, family_work <dbl>, unemployment <dbl>, …
acs_data %>%
select(census_id, county, state, carpool, carpool_rank) %>%
arrange(carpool_rank) %>%
top_n(10, wt=carpool_rank)
## # A tibble: 11 × 5
## census_id county state carpool carpool_rank
## <chr> <chr> <fct> <dbl> <int>
## 1 46069 Hyde South Dakota 2.8 3131
## 2 51720 Norton city Virginia 2.8 3131
## 3 30019 Daniels Montana 2.6 3133
## 4 31057 Dundy Nebraska 2.6 3133
## 5 13309 Wheeler Georgia 2.3 3135
## 6 38029 Emmons North Dakota 2.3 3135
## 7 36061 New York New York 1.9 3136
## 8 31183 Wheeler Nebraska 1.3 3137
## 9 48235 Irion Texas 0.9 3138
## 10 48261 Kenedy Texas 0 3140
## 11 48269 King Texas 0 3140
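The table above has 11 rows rather than 10 because Hyde and Norton city are tied at carpool = 2.8 and share rank 3131 under ties.method = "max", and top_n() keeps every row tied at the cutoff. If exactly ten rows are wanted, a sketch (with_ties = FALSE drops rows tied at the boundary):
# Exactly ten rows: smallest carpool values (i.e. lowest ranked counties) first
acs_data %>%
  select(census_id, county, state, carpool, carpool_rank) %>%
  slice_min(carpool, n = 10, with_ties = FALSE)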
On average, what state is the best ranked for carpooling?
acs_data %>%
group_by(state) %>%
summarise(avgrank = mean(carpool_rank)) %>%
arrange(avgrank) %>%
top_n(-1)
## # A tibble: 1 × 2
## state avgrank
## <fct> <dbl>
## 1 Hawaii 674.
Hawaii is the best ranked for carpooling.
What are the top 5 states for carpooling?
acs_data %>%
group_by(state) %>%
summarise(avgrank = mean(carpool_rank)) %>%
arrange(avgrank) %>%
top_n(-5)
## # A tibble: 5 × 2
## state avgrank
## <fct> <dbl>
## 1 Hawaii 674.
## 2 Arizona 1005.
## 3 Utah 1045.
## 4 Arkansas 1084.
## 5 Alaska 1108.
The top 5 states for carpooling are Hawaii, Arizona, Utah, Arkansas and Alaska.
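As a sanity check on the rank-based answer, one could also average the raw carpool percentages by state; a sketch (output not shown):
# Alternative view: mean carpool percentage per state, highest first
acs_data %>%
  group_by(state) %>%
  summarise(avg_carpool = mean(carpool)) %>%
  arrange(desc(avg_carpool)) %>%
  head(5)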