Import the data set using a Tidyverse function and NOT with a Base R function. How many rows and columns are in the data set?
The following code loads the tidyverse package, imports the ACS data set with the read_csv() function, and reports the number of rows and columns in the data set:
library(tidyverse)
acs_data <- read_csv("/Users/swethainuganti/Desktop/Cincinnati /1st Sem/Data Wrangling/week 4/Week 4/homework3/acs_2015_county_data_revised.csv")
dim(acs_data)
## [1] 3142 35
The ACS data set has 3142 rows (observations) and 35 columns (variables).
Do any data types need to be changed? Show any code to change variable types and show code/output for a command after you're finished.
The following code was used to see the data types of each variable:
str(acs_data)
## spec_tbl_df [3,142 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ census_id : num [1:3142] 1001 1003 1005 1007 1009 ...
## $ state : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ county : chr [1:3142] "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ total_pop : num [1:3142] 55221 195121 26932 22604 57710 ...
## $ men : num [1:3142] 26745 95314 14497 12073 28512 ...
## $ women : num [1:3142] 28476 99807 12435 10531 29198 ...
## $ hispanic : num [1:3142] 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
## $ white : num [1:3142] 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
## $ black : num [1:3142] 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
## $ native : num [1:3142] 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
## $ asian : num [1:3142] 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
## $ pacific : num [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
## $ citizen : num [1:3142] 40725 147695 20714 17495 42345 ...
## $ income : num [1:3142] 51281 50254 32964 38678 45813 ...
## $ income_per_cap: num [1:3142] 24974 27317 16824 18431 20532 ...
## $ poverty : num [1:3142] 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
## $ child_poverty : num [1:3142] 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
## $ professional : num [1:3142] 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
## $ service : num [1:3142] 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
## $ office : num [1:3142] 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
## $ construction : num [1:3142] 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
## $ production : num [1:3142] 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
## $ drive : num [1:3142] 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
## $ carpool : num [1:3142] 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
## $ transit : num [1:3142] 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
## $ walk : num [1:3142] 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
## $ other_transp : num [1:3142] 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
## $ work_at_home : num [1:3142] 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
## $ mean_commute : num [1:3142] 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
## $ employed : num [1:3142] 23986 85953 8597 8294 22189 ...
## $ private_work : num [1:3142] 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
## $ public_work : num [1:3142] 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
## $ self_employed : num [1:3142] 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
## $ family_work : num [1:3142] 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
## $ unemployment : num [1:3142] 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
## - attr(*, "spec")=
## .. cols(
## .. census_id = col_double(),
## .. state = col_character(),
## .. county = col_character(),
## .. total_pop = col_double(),
## .. men = col_double(),
## .. women = col_double(),
## .. hispanic = col_double(),
## .. white = col_double(),
## .. black = col_double(),
## .. native = col_double(),
## .. asian = col_double(),
## .. pacific = col_double(),
## .. citizen = col_double(),
## .. income = col_double(),
## .. income_per_cap = col_double(),
## .. poverty = col_double(),
## .. child_poverty = col_double(),
## .. professional = col_double(),
## .. service = col_double(),
## .. office = col_double(),
## .. construction = col_double(),
## .. production = col_double(),
## .. drive = col_double(),
## .. carpool = col_double(),
## .. transit = col_double(),
## .. walk = col_double(),
## .. other_transp = col_double(),
## .. work_at_home = col_double(),
## .. mean_commute = col_double(),
## .. employed = col_double(),
## .. private_work = col_double(),
## .. public_work = col_double(),
## .. self_employed = col_double(),
## .. family_work = col_double(),
## .. unemployment = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
After reviewing the data types and descriptions for each variable, two changes were made.
census_id has been converted to a character variable so it can serve as a label for each observation, since it will not be used in any arithmetic analysis.
Additionally, state has been converted to a factor because it is a categorical variable whose observations can be grouped by individual state.
# Convert census_id to a character identifier and state to a factor
acs_data$census_id <- as.character(acs_data$census_id)
acs_data$state <- as.factor(acs_data$state)
str(acs_data)
## spec_tbl_df [3,142 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ census_id : chr [1:3142] "1001" "1003" "1005" "1007" ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ county : chr [1:3142] "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ total_pop : num [1:3142] 55221 195121 26932 22604 57710 ...
## $ men : num [1:3142] 26745 95314 14497 12073 28512 ...
## $ women : num [1:3142] 28476 99807 12435 10531 29198 ...
## $ hispanic : num [1:3142] 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
## $ white : num [1:3142] 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
## $ black : num [1:3142] 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
## $ native : num [1:3142] 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
## $ asian : num [1:3142] 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
## $ pacific : num [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
## $ citizen : num [1:3142] 40725 147695 20714 17495 42345 ...
## $ income : num [1:3142] 51281 50254 32964 38678 45813 ...
## $ income_per_cap: num [1:3142] 24974 27317 16824 18431 20532 ...
## $ poverty : num [1:3142] 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
## $ child_poverty : num [1:3142] 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
## $ professional : num [1:3142] 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
## $ service : num [1:3142] 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
## $ office : num [1:3142] 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
## $ construction : num [1:3142] 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
## $ production : num [1:3142] 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
## $ drive : num [1:3142] 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
## $ carpool : num [1:3142] 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
## $ transit : num [1:3142] 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
## $ walk : num [1:3142] 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
## $ other_transp : num [1:3142] 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
## $ work_at_home : num [1:3142] 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
## $ mean_commute : num [1:3142] 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
## $ employed : num [1:3142] 23986 85953 8597 8294 22189 ...
## $ private_work : num [1:3142] 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
## $ public_work : num [1:3142] 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
## $ self_employed : num [1:3142] 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
## $ family_work : num [1:3142] 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
## $ unemployment : num [1:3142] 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
## - attr(*, "spec")=
## .. cols(
## .. census_id = col_double(),
## .. state = col_character(),
## .. county = col_character(),
## .. total_pop = col_double(),
## .. men = col_double(),
## .. women = col_double(),
## .. hispanic = col_double(),
## .. white = col_double(),
## .. black = col_double(),
## .. native = col_double(),
## .. asian = col_double(),
## .. pacific = col_double(),
## .. citizen = col_double(),
## .. income = col_double(),
## .. income_per_cap = col_double(),
## .. poverty = col_double(),
## .. child_poverty = col_double(),
## .. professional = col_double(),
## .. service = col_double(),
## .. office = col_double(),
## .. construction = col_double(),
## .. production = col_double(),
## .. drive = col_double(),
## .. carpool = col_double(),
## .. transit = col_double(),
## .. walk = col_double(),
## .. other_transp = col_double(),
## .. work_at_home = col_double(),
## .. mean_commute = col_double(),
## .. employed = col_double(),
## .. private_work = col_double(),
## .. public_work = col_double(),
## .. self_employed = col_double(),
## .. family_work = col_double(),
## .. unemployment = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
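The same two conversions can also be written as a single tidyverse-style mutate() call; a minimal sketch producing the same result as the base R assignments above:
# Equivalent tidyverse-style conversion of the two columns in one step
acs_data <- acs_data %>%
  mutate(
    census_id = as.character(census_id),
    state = as.factor(state)
  )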
Are there any missing values? How will you handle missing values? Will you impute a missing value with, for example, a mean or median value for the entire column, or will you remove the entire observation? Give a rationale for your decision and show any code/output to handle missing values.
sum(is.na(acs_data))
## [1] 2
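A quick per-column count confirms which fields hold the missing values; a sketch (assuming dplyr >= 1.0 for across() and where()):
# Count NAs per column and keep only the columns with at least one
acs_data %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  select(where(~ . > 0))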
There are only 2 missing values, in the income and child_poverty fields. We can simply remove these observations: there are so few of them that dropping them will not meaningfully affect the data. Another approach would be to impute each missing value with the mean of that column within the corresponding state, but given the small number of missing values we remove the observations instead.
#removing observations with missing values in columns "income" & "child_poverty"
acs_data <- acs_data %>% drop_na()
glimpse(acs_data)
## Rows: 3,140
## Columns: 35
## $ census_id <chr> "1001", "1003", "1005", "1007", "1009", "1011", "1013",…
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, A…
## $ county <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bul…
## $ total_pop <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 11664…
## $ men <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274, 1…
## $ women <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 60374, …
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7.6, …
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3, 9…
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, 4.8…
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0.4, …
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0.3, …
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ citizen <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88612,…
## $ income <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41703,…
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21374,…
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6, 1…
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2, 3…
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3, 2…
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5, 1…
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3, 1…
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5, 13…
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4, 2…
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1, 8…
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 12.1…
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0.2, …
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1.1, …
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1.4, …
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1.9, …
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1, 2…
## $ employed <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, 136…
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1, 7…
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1, 1…
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4.1, …
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0.5, …
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9, 9…
After removing the missing values, the row count drops by 2, from 3142 to 3140.
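For reference, the state-wise mean imputation described above (but not applied) could be sketched as follows; it would have to be run before dropping any rows, and it assumes the two NAs sit in income and child_poverty:
# Hypothetical alternative (not used here): replace each missing income or
# child_poverty value with the mean of that column within the same state
acs_data_imputed <- acs_data %>%
  group_by(state) %>%
  mutate(
    income = if_else(is.na(income), mean(income, na.rm = TRUE), income),
    child_poverty = if_else(is.na(child_poverty), mean(child_poverty, na.rm = TRUE), child_poverty)
  ) %>%
  ungroup()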
Use the summary() function to examine any unusual values. Are there any? If so, how will you handle these unusual values? Show any code/output to handle unusual values.
acs_data %>% summary()
## census_id state county total_pop
## Length:3140 Texas : 253 Length:3140 Min. : 267
## Class :character Georgia : 159 Class :character 1st Qu.: 11036
## Mode :character Virginia: 133 Mode :character Median : 25793
## Kentucky: 120 Mean : 100801
## Missouri: 115 3rd Qu.: 67620
## Kansas : 105 Max. :10038388
## (Other) :2255
## men women hispanic white
## Min. : 136 Min. : 131 Min. : 0.000 Min. : 0.90
## 1st Qu.: 5551 1st Qu.: 5488 1st Qu.: 1.900 1st Qu.:65.67
## Median : 12838 Median : 12916 Median : 3.700 Median :84.65
## Mean : 49597 Mean : 51204 Mean : 8.819 Mean :77.31
## 3rd Qu.: 33328 3rd Qu.: 34123 3rd Qu.: 9.000 3rd Qu.:93.33
## Max. :4945351 Max. :5093037 Max. :98.700 Max. :99.80
##
## black native asian pacific
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00000
## 1st Qu.: 0.600 1st Qu.: 0.100 1st Qu.: 0.200 1st Qu.: 0.00000
## Median : 2.100 Median : 0.300 Median : 0.500 Median : 0.00000
## Mean : 8.885 Mean : 1.763 Mean : 1.253 Mean : 0.07357
## 3rd Qu.:10.200 3rd Qu.: 0.600 3rd Qu.: 1.200 3rd Qu.: 0.00000
## Max. :85.900 Max. :92.100 Max. :41.600 Max. :11.10000
##
## citizen income income_per_cap poverty
## Min. : 199 Min. : 19328 Min. : 8292 Min. : 1.4
## 1st Qu.: 8276 1st Qu.: 38826 1st Qu.:20470 1st Qu.:12.0
## Median : 19454 Median : 45094 Median :23574 Median :16.0
## Mean : 70849 Mean : 46824 Mean :24331 Mean :16.7
## 3rd Qu.: 50795 3rd Qu.: 52248 3rd Qu.:27138 3rd Qu.:20.3
## Max. :6046749 Max. :123453 Max. :65600 Max. :53.3
##
## child_poverty professional service office
## Min. : 0.00 Min. :13.50 Min. : 5.00 Min. : 4.10
## 1st Qu.:16.10 1st Qu.:26.70 1st Qu.:15.90 1st Qu.:20.20
## Median :22.50 Median :30.00 Median :18.00 Median :22.40
## Mean :23.29 Mean :31.05 Mean :18.25 Mean :22.13
## 3rd Qu.:29.50 3rd Qu.:34.42 3rd Qu.:20.20 3rd Qu.:24.30
## Max. :72.30 Max. :74.00 Max. :36.60 Max. :35.40
##
## construction production drive carpool
## Min. : 1.70 Min. : 0.00 Min. : 5.2 Min. : 0.00
## 1st Qu.: 9.80 1st Qu.:11.50 1st Qu.:76.6 1st Qu.: 8.50
## Median :12.20 Median :15.40 Median :80.6 Median : 9.90
## Mean :12.75 Mean :15.82 Mean :79.1 Mean :10.33
## 3rd Qu.:15.00 3rd Qu.:19.40 3rd Qu.:83.6 3rd Qu.:11.90
## Max. :40.30 Max. :55.60 Max. :94.6 Max. :29.90
##
## transit walk other_transp work_at_home
## Min. : 0.0000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.1000 1st Qu.: 1.400 1st Qu.: 0.90 1st Qu.: 2.800
## Median : 0.4000 Median : 2.400 Median : 1.30 Median : 4.000
## Mean : 0.9681 Mean : 3.294 Mean : 1.61 Mean : 4.694
## 3rd Qu.: 0.8000 3rd Qu.: 4.000 3rd Qu.: 1.90 3rd Qu.: 5.700
## Max. :61.7000 Max. :71.200 Max. :39.10 Max. :37.200
##
## mean_commute employed private_work public_work
## Min. : 4.90 Min. : 166 Min. :29.50 Min. : 5.80
## 1st Qu.:19.30 1st Qu.: 4532 1st Qu.:70.90 1st Qu.:13.07
## Median :22.90 Median : 10657 Median :75.85 Median :16.10
## Mean :23.15 Mean : 46416 Mean :74.45 Mean :17.33
## 3rd Qu.:26.60 3rd Qu.: 29272 3rd Qu.:79.80 3rd Qu.:20.10
## Max. :44.00 Max. :4635465 Max. :88.30 Max. :66.20
##
## self_employed family_work unemployment
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 5.400 1st Qu.:0.1000 1st Qu.: 5.500
## Median : 6.900 Median :0.2000 Median : 7.500
## Mean : 7.922 Mean :0.2917 Mean : 7.815
## 3rd Qu.: 9.400 3rd Qu.:0.3000 3rd Qu.: 9.700
## Max. :36.600 Max. :9.8000 Max. :29.400
##
Looking at the summary, the mean and median are far apart for several fields, for example total_pop, men, women, hispanic, black, and citizen. This suggests there are outliers in the data and that these variables are strongly right-skewed rather than evenly distributed across counties. Because these extreme values appear to reflect genuinely large or unusual counties rather than data-entry errors, they are retained rather than removed or capped.
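As a rough check on how extreme these values are, a sketch using the common 1.5 × IQR boxplot rule on total_pop (the 1.5 × IQR threshold is an assumed convention here, not part of the assignment):
# Count counties whose total population lies more than 1.5 * IQR above Q3
q3 <- quantile(acs_data$total_pop, 0.75)
iqr <- IQR(acs_data$total_pop)
sum(acs_data$total_pop > q3 + 1.5 * iqr)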
How many counties have more women than men?
acs_data %>%
filter(women > men) %>%
nrow()
## [1] 1984
There are 1984 counties that have more women than men.
How many counties have an unemployment rate lower than 10%?
acs_data %>%
filter(unemployment < 10.0) %>%
nrow()
## [1] 2419
There are 2419 counties that have an unemployment rate lower than 10%.
What are the top 10 counties with the highest mean commute? Show the census ID, county name, state, and the mean_commute in your final answer (sorted by mean_commute).
The top 10 counties with the highest mean commute are shown below:
acs_data %>%
select(census_id,state,county,mean_commute) %>%
top_n(10, mean_commute)
## # A tibble: 10 × 4
## census_id state county mean_commute
## <chr> <fct> <chr> <dbl>
## 1 8093 Colorado Park 42.4
## 2 24017 Maryland Charles 42.8
## 3 36005 New York Bronx 43
## 4 36047 New York Kings 41.7
## 5 36081 New York Queens 42.6
## 6 36085 New York Richmond 42.6
## 7 42103 Pennsylvania Pike 44
## 8 51187 Virginia Warren 42.7
## 9 51193 Virginia Westmoreland 42.5
## 10 54015 West Virginia Clay 41.4
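Note that top_n() does not sort its output, so the table above is not ordered by mean_commute as the question asks. A sketch using slice_max() (dplyr >= 1.0) returns the ten largest values already sorted in descending order of mean_commute:
# slice_max() keeps the 10 largest mean_commute values, largest first
acs_data %>%
  select(census_id, state, county, mean_commute) %>%
  slice_max(mean_commute, n = 10)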
Create a new variable that calculates the percentage of women for each county and then find the top 10 counties with the lowest percentages. Show the census ID, county name, state, and the percentage in your final answer (sorted by ascending percentage).
acs_data %>%
mutate(percent_women = women*100/total_pop) %>%
select(census_id,county,state,percent_women) %>%
group_by(census_id,state,county) %>%
summarize(percent_women_total = sum(round(percent_women,2))) %>%
arrange(percent_women_total) %>% head(10)
## # A tibble: 10 × 4
## # Groups: census_id, state [10]
## census_id state county percent_women_total
## <chr> <fct> <chr> <dbl>
## 1 42053 Pennsylvania Forest 26.8
## 2 8011 Colorado Bent 31.4
## 3 51183 Virginia Sussex 31.5
## 4 13309 Georgia Wheeler 32.1
## 5 6035 California Lassen 33.2
## 6 48095 Texas Concho 33.3
## 7 13053 Georgia Chattahoochee 33.4
## 8 2013 Alaska Aleutians East Borough 33.5
## 9 22125 Louisiana West Feliciana 33.6
## 10 32027 Nevada Pershing 33.7
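Since each row of acs_data already describes a single county, the group_by()/summarize() step above is not strictly necessary; a leaner sketch that should give the same ten rows:
# One row per county, so mutate + arrange + head is sufficient
acs_data %>%
  mutate(percent_women = round(women / total_pop * 100, 2)) %>%
  select(census_id, state, county, percent_women) %>%
  arrange(percent_women) %>%
  head(10)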
Create a new variable that calculates the sum of all race percentage variables (these columns are the “hispanic”, “white”, “black”, “native”, “asian”, and “pacific” variables).
What are the top 10 counties with the lowest sum of these race percentage variables?
acs_data %>%
select(state,county,hispanic,white,black,native,asian,pacific) %>%
mutate(race_percent = hispanic+white+black+native+asian+pacific) -> acs_data_race_filtered
head(acs_data_race_filtered)
## # A tibble: 6 × 9
## state county hispanic white black native asian pacific race_percent
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama Autauga 2.6 75.8 18.5 0.4 1 0 98.3
## 2 Alabama Baldwin 4.5 83.1 9.5 0.6 0.7 0 98.4
## 3 Alabama Barbour 4.6 46.2 46.7 0.2 0.4 0 98.1
## 4 Alabama Bibb 2.2 74.5 21.4 0.4 0.1 0 98.6
## 5 Alabama Blount 8.6 87.9 1.5 0.3 0.1 0 98.4
## 6 Alabama Bullock 4.4 22.2 70.7 1.2 0.2 0 98.7
acs_data_race_filtered %>%
  group_by(state, county) %>%
  summarize(percent_total = sum(race_percent)) %>%
  arrange(percent_total)
## # A tibble: 3,140 × 3
## # Groups: state [51]
## state county percent_total
## <fct> <chr> <dbl>
## 1 Hawaii Hawaii 76.4
## 2 Hawaii Maui 79.2
## 3 Oklahoma Mayes 79.7
## 4 Hawaii Honolulu 81.5
## 5 Oklahoma Pontotoc 82.8
## 6 Tennessee Grundy 83
## 7 Alaska Yakutat City and Borough 83.4
## 8 Oklahoma Johnston 84
## 9 Hawaii Kauai 84.1
## 10 Oklahoma Alfalfa 85.1
## # … with 3,130 more rows
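To return only the ten requested counties rather than the entire sorted table, a sketch using slice_min() (dplyr >= 1.0; with_ties = FALSE guarantees exactly ten rows):
# Keep the 10 smallest race_percent sums, sorted ascending
acs_data_race_filtered %>%
  select(state, county, race_percent) %>%
  slice_min(race_percent, n = 10, with_ties = FALSE)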
Which state, on average, has the lowest sum of these race percentage variables?
acs_data_race_filtered %>%
  group_by(state) %>%
  summarize(avg_percent_total = mean(race_percent)) %>%
  arrange(avg_percent_total) %>%
  head(1)
## # A tibble: 1 × 2
## state avg_percent_total
## <fct> <dbl>
## 1 Hawaii 80.3
Hawaii, on average, has the lowest sum of these race percentage variables.
Do any counties have a sum greater than 100%?
acs_data_race_filtered %>%
  group_by(state, county) %>%
  summarize(percent_total = sum(race_percent)) %>%
  filter(percent_total > 100)
## # A tibble: 11 × 3
## # Groups: state [4]
## state county percent_total
## <fct> <chr> <dbl>
## 1 Mississippi Claiborne 100
## 2 Nebraska Gosper 100.
## 3 Nebraska Hooker 100.
## 4 Nebraska Nance 100.
## 5 Texas Bailey 100.
## 6 Texas Duval 100
## 7 Texas Edwards 100.
## 8 Texas Kenedy 100
## 9 Texas Kent 100
## 10 Texas Presidio 100
## 11 Utah Beaver 100
Yes, there are 11 counties across 4 states whose race percentages sum to more than 100%. (Several of the printed totals display as 100 because of rounding; each underlying sum is slightly above 100.)
How many states have a sum that equals exactly 100%?
acs_data_race_filtered %>%
group_by(state) %>%
summarize(state_percent_total = sum(race_percent)) %>%
filter(state_percent_total == 100)
## # A tibble: 0 × 2
## # … with 2 variables: state <fct>, state_percent_total <dbl>
No states have a sum of race percentages equal to exactly 100%.
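Exact equality on floating-point sums is fragile, so if the intent is “equal to 100 within a small tolerance”, dplyr::near() is the usual tool; a sketch with the same grouping as above:
# near() compares doubles with a small tolerance instead of exact ==
acs_data_race_filtered %>%
  group_by(state) %>%
  summarize(state_percent_total = sum(race_percent)) %>%
  filter(near(state_percent_total, 100))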
Using the carpool variable, use a ranking function to create a new variable called carpool_rank where the highest ranked county (rank = 1) is the county with the highest carpool value. Read the documentation carefully for the ranking function.
Using the carpool variable, a new variable carpool_rank has been created with min_rank(), where the highest ranked county (rank = 1) is the county with the highest carpool value.
# Rank counties by carpool value: the highest carpool value receives rank 1
acs_data$carpool_rank <- min_rank(-(acs_data$carpool))
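An equivalent, pipe-friendly form of the same assignment (a sketch using mutate() and desc()):
# min_rank(desc(carpool)) gives rank 1 to the highest carpool value;
# tied counties share the smallest available rank
acs_data <- acs_data %>%
  mutate(carpool_rank = min_rank(desc(carpool)))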
Find the 10 highest ranked counties for carpooling. Show the census ID, county name, state, carpool value, and carpool_rank in your final answer.
acs_data %>%
select(census_id, county, state, carpool, carpool_rank) %>%
arrange(carpool_rank) %>%
top_n(-10, wt=carpool_rank)
## # A tibble: 10 × 5
## census_id county state carpool carpool_rank
## <chr> <chr> <fct> <dbl> <int>
## 1 13061 Clay Georgia 29.9 1
## 2 18087 LaGrange Indiana 27 2
## 3 13165 Jenkins Georgia 25.3 3
## 4 5133 Sevier Arkansas 24.4 4
## 5 20175 Seward Kansas 23.4 5
## 6 48079 Cochran Texas 22.8 6
## 7 48247 Jim Hogg Texas 22.6 7
## 8 48393 Roberts Texas 22.4 8
## 9 39075 Holmes Ohio 21.8 9
## 10 21197 Powell Kentucky 21.6 10
The 10 highest ranked counties for carpooling are in the table above.
Find the 10 lowest ranked counties for carpooling. Show the same variables in your final answer.
# Re-rank with base rank() and ties.method = "max" so tied counties share the largest rank
(acs_data <- acs_data %>%
  mutate(carpool_rank = rank(desc(carpool), ties.method = "max")))
## # A tibble: 3,140 × 36
## census_id state county total_pop men women hispanic white black native
## <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alabama Autauga 55221 26745 28476 2.6 75.8 18.5 0.4
## 2 1003 Alabama Baldwin 195121 95314 99807 4.5 83.1 9.5 0.6
## 3 1005 Alabama Barbour 26932 14497 12435 4.6 46.2 46.7 0.2
## 4 1007 Alabama Bibb 22604 12073 10531 2.2 74.5 21.4 0.4
## 5 1009 Alabama Blount 57710 28512 29198 8.6 87.9 1.5 0.3
## 6 1011 Alabama Bullock 10678 5660 5018 4.4 22.2 70.7 1.2
## 7 1013 Alabama Butler 20354 9502 10852 1.2 53.3 43.8 0.1
## 8 1015 Alabama Calhoun 116648 56274 60374 3.5 73 20.3 0.2
## 9 1017 Alabama Chambers 34079 16258 17821 0.4 57.3 40.3 0.2
## 10 1019 Alabama Cherokee 26008 12975 13033 1.5 91.7 4.8 0.6
## # … with 3,130 more rows, and 26 more variables: asian <dbl>, pacific <dbl>,
## # citizen <dbl>, income <dbl>, income_per_cap <dbl>, poverty <dbl>,
## # child_poverty <dbl>, professional <dbl>, service <dbl>, office <dbl>,
## # construction <dbl>, production <dbl>, drive <dbl>, carpool <dbl>,
## # transit <dbl>, walk <dbl>, other_transp <dbl>, work_at_home <dbl>,
## # mean_commute <dbl>, employed <dbl>, private_work <dbl>, public_work <dbl>,
## # self_employed <dbl>, family_work <dbl>, unemployment <dbl>, …
acs_data %>%
select(census_id, county, state, carpool, carpool_rank) %>%
arrange(carpool_rank) %>%
top_n(10, wt=carpool_rank)
## # A tibble: 11 × 5
## census_id county state carpool carpool_rank
## <chr> <chr> <fct> <dbl> <int>
## 1 46069 Hyde South Dakota 2.8 3131
## 2 51720 Norton city Virginia 2.8 3131
## 3 30019 Daniels Montana 2.6 3133
## 4 31057 Dundy Nebraska 2.6 3133
## 5 13309 Wheeler Georgia 2.3 3135
## 6 38029 Emmons North Dakota 2.3 3135
## 7 36061 New York New York 1.9 3136
## 8 31183 Wheeler Nebraska 1.3 3137
## 9 48235 Irion Texas 0.9 3138
## 10 48261 Kenedy Texas 0 3140
## 11 48269 King Texas 0 3140
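The table above has 11 rows rather than 10 because Hyde and Norton city are tied at carpool = 2.8 and share rank 3131 under ties.method = "max", and top_n() keeps every row tied at the cutoff. If exactly ten rows are wanted, a sketch (with_ties = FALSE drops rows tied at the boundary):
# Exactly ten rows: smallest carpool values (i.e. lowest ranked counties) first
acs_data %>%
  select(census_id, county, state, carpool, carpool_rank) %>%
  slice_min(carpool, n = 10, with_ties = FALSE)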
On average, what state is the best ranked for carpooling?
acs_data %>%
group_by(state) %>%
summarise(avgrank = mean(carpool_rank)) %>%
arrange(avgrank) %>%
top_n(-1)
## # A tibble: 1 × 2
## state avgrank
## <fct> <dbl>
## 1 Hawaii 674.
Hawaii is the best ranked for carpooling.
What are the top 5 states for carpooling?
acs_data %>%
group_by(state) %>%
summarise(avgrank = mean(carpool_rank)) %>%
arrange(avgrank) %>%
top_n(-5)
## # A tibble: 5 × 2
## state avgrank
## <fct> <dbl>
## 1 Hawaii 674.
## 2 Arizona 1005.
## 3 Utah 1045.
## 4 Arkansas 1084.
## 5 Alaska 1108.
The top 5 states for carpooling are Hawaii, Arizona, Utah, Arkansas and Alaska.
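As a sanity check on the rank-based answer, one could also average the raw carpool percentages by state; a sketch (output not shown):
# Alternative view: mean carpool percentage per state, highest first
acs_data %>%
  group_by(state) %>%
  summarise(avg_carpool = mean(carpool)) %>%
  arrange(desc(avg_carpool)) %>%
  head(5)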