tidyverse function and NOT with a Base R function. How many rows and columns are in the data set?library(tidyverse)
ACS <- read_csv("acs_2015_county_data_revised.csv")
dim(ACS)
## [1] 3142 35
There are 3,142 rows with 35 columns in the data set.
str(ACS)
After viewing the structure of the data set, both county and state need to be changed to factor to better fit the data dictionary.
ACS$county <- as.factor(ACS$county)
ACS$state <- as.factor(ACS$state)
After changing the data types, the data set looks near perfect.
glimpse(ACS)
## Rows: 3,142
## Columns: 35
## $ census_id <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, 1…
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, A…
## $ county <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bullock, Butle…
## $ total_pop <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 11664…
## $ men <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274, 1…
## $ women <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 60374, …
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7.6, …
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3, 9…
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, 4.8…
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0.4, …
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0.3, …
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ citizen <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88612,…
## $ income <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41703,…
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21374,…
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6, 1…
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2, 3…
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3, 2…
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5, 1…
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3, 1…
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5, 13…
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4, 2…
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1, 8…
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 12.1…
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0.2, …
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1.1, …
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1.4, …
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1.9, …
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1, 2…
## $ employed <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, 136…
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1, 7…
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1, 1…
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4.1, …
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0.5, …
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9, 9…
Utilizing the glimpse() function is a transposed version of the print() function that allows us to view the columns down and the data across the page.
colSums(is.na(ACS))
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 1 0
## poverty child_poverty professional service office
## 0 1 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
It appears there are 2 missing values in the data set. Both child_poverty and income have one missing value each. From the observation above, I would first handle missing data by asking whoever is in charge of the observations if it is appropriate or not to remove data. If so, I would handle this by getting rid of the two observations or imputing them on the median as these won’t impact too much moving forward with the analysis.
ACS$child_poverty[is.na(ACS$child_poverty)] <- median(ACS$child_poverty, na.rm=TRUE)
ACS$income[is.na(ACS$income)] <- median(ACS$income, na.rm=TRUE)
colSums(is.na(ACS))
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 0 0
## poverty child_poverty professional service office
## 0 0 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
summary(ACS)
## census_id state county total_pop
## Min. : 1001 Texas : 254 Washington: 31 Min. : 85
## 1st Qu.:18178 Georgia : 159 Jefferson : 26 1st Qu.: 11028
## Median :29176 Virginia: 133 Franklin : 25 Median : 25768
## Mean :30384 Kentucky: 120 Jackson : 24 Mean : 100737
## 3rd Qu.:45080 Missouri: 115 Lincoln : 24 3rd Qu.: 67552
## Max. :56045 Kansas : 105 Madison : 20 Max. :10038388
## (Other) :2256 (Other) :2992
## men women hispanic white
## Min. : 42 Min. : 43 Min. : 0.000 Min. : 0.90
## 1st Qu.: 5546 1st Qu.: 5466 1st Qu.: 1.900 1st Qu.:65.60
## Median : 12826 Median : 12907 Median : 3.700 Median :84.60
## Mean : 49565 Mean : 51171 Mean : 8.826 Mean :77.28
## 3rd Qu.: 33319 3rd Qu.: 34122 3rd Qu.: 9.000 3rd Qu.:93.30
## Max. :4945351 Max. :5093037 Max. :98.700 Max. :99.80
##
## black native asian pacific
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00000
## 1st Qu.: 0.600 1st Qu.: 0.100 1st Qu.: 0.200 1st Qu.: 0.00000
## Median : 2.100 Median : 0.300 Median : 0.500 Median : 0.00000
## Mean : 8.879 Mean : 1.766 Mean : 1.258 Mean : 0.08475
## 3rd Qu.:10.175 3rd Qu.: 0.600 3rd Qu.: 1.200 3rd Qu.: 0.00000
## Max. :85.900 Max. :92.100 Max. :41.600 Max. :35.30000
##
## citizen income income_per_cap poverty
## Min. : 80 Min. : 19328 Min. : 8292 Min. : 1.4
## 1st Qu.: 8254 1st Qu.: 38826 1st Qu.:20471 1st Qu.:12.0
## Median : 19434 Median : 45111 Median :23577 Median :16.0
## Mean : 70804 Mean : 46830 Mean :24338 Mean :16.7
## 3rd Qu.: 50728 3rd Qu.: 52249 3rd Qu.:27138 3rd Qu.:20.3
## Max. :6046749 Max. :123453 Max. :65600 Max. :53.3
##
## child_poverty professional service office
## Min. : 0.00 Min. :13.50 Min. : 5.00 Min. : 4.10
## 1st Qu.:16.10 1st Qu.:26.70 1st Qu.:15.90 1st Qu.:20.20
## Median :22.50 Median :30.00 Median :18.00 Median :22.40
## Mean :23.29 Mean :31.04 Mean :18.26 Mean :22.13
## 3rd Qu.:29.48 3rd Qu.:34.40 3rd Qu.:20.20 3rd Qu.:24.30
## Max. :72.30 Max. :74.00 Max. :36.60 Max. :35.40
##
## construction production drive carpool
## Min. : 1.70 Min. : 0.00 Min. : 5.20 Min. : 0.00
## 1st Qu.: 9.80 1st Qu.:11.53 1st Qu.:76.60 1st Qu.: 8.50
## Median :12.20 Median :15.40 Median :80.60 Median : 9.90
## Mean :12.74 Mean :15.82 Mean :79.08 Mean :10.33
## 3rd Qu.:15.00 3rd Qu.:19.40 3rd Qu.:83.60 3rd Qu.:11.88
## Max. :40.30 Max. :55.60 Max. :94.60 Max. :29.90
##
## transit walk other_transp work_at_home
## Min. : 0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.1000 1st Qu.: 1.400 1st Qu.: 0.900 1st Qu.: 2.800
## Median : 0.4000 Median : 2.400 Median : 1.300 Median : 4.000
## Mean : 0.9675 Mean : 3.307 Mean : 1.614 Mean : 4.697
## 3rd Qu.: 0.8000 3rd Qu.: 4.000 3rd Qu.: 1.900 3rd Qu.: 5.700
## Max. :61.7000 Max. :71.200 Max. :39.100 Max. :37.200
##
## mean_commute employed private_work public_work
## Min. : 4.90 Min. : 62 Min. :25.00 Min. : 5.80
## 1st Qu.:19.30 1st Qu.: 4524 1st Qu.:70.90 1st Qu.:13.10
## Median :22.90 Median : 10644 Median :75.80 Median :16.10
## Mean :23.15 Mean : 46387 Mean :74.44 Mean :17.35
## 3rd Qu.:26.60 3rd Qu.: 29254 3rd Qu.:79.80 3rd Qu.:20.10
## Max. :44.00 Max. :4635465 Max. :88.30 Max. :66.20
##
## self_employed family_work unemployment
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 5.400 1st Qu.:0.1000 1st Qu.: 5.500
## Median : 6.900 Median :0.2000 Median : 7.500
## Mean : 7.921 Mean :0.2915 Mean : 7.815
## 3rd Qu.: 9.400 3rd Qu.:0.3000 3rd Qu.: 9.700
## Max. :36.600 Max. :9.8000 Max. :29.400
##
The data set contains skewed variables. For instance, men and women has such outstanding outliers; however, we are not able to remove any as we need more information about the values from these variables. Considering the genders come from a population, these numeric values can easily vary to a high degree, so we will leave the values for now. Something else to note is the variable employed. According to the dictionary, these values represent a percentage for ages 16+. Although, these values recorded are not percentages but perhaps a count of those employed. Below, we have modified the variable to accurately represent the definition of the variable while giving the summary statistics of the updated set.
ACS <- ACS %>% mutate(employed = employed/total_pop * 100)
glimpse(ACS)
## Rows: 3,142
## Columns: 35
## $ census_id <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, 1…
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, A…
## $ county <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bullock, Butle…
## $ total_pop <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 11664…
## $ men <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274, 1…
## $ women <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 60374, …
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7.6, …
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3, 9…
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, 4.8…
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0.4, …
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0.3, …
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ citizen <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88612,…
## $ income <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41703,…
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21374,…
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6, 1…
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2, 3…
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3, 2…
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5, 1…
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3, 1…
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5, 13…
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4, 2…
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1, 8…
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 12.1…
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0.2, …
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1.1, …
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1.4, …
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1.9, …
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1, 2…
## $ employed <dbl> 43.43637, 44.05113, 31.92113, 36.69262, 38.44914, 36.19…
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1, 7…
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1, 1…
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4.1, …
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0.5, …
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9, 9…
summary(ACS)
## census_id state county total_pop
## Min. : 1001 Texas : 254 Washington: 31 Min. : 85
## 1st Qu.:18178 Georgia : 159 Jefferson : 26 1st Qu.: 11028
## Median :29176 Virginia: 133 Franklin : 25 Median : 25768
## Mean :30384 Kentucky: 120 Jackson : 24 Mean : 100737
## 3rd Qu.:45080 Missouri: 115 Lincoln : 24 3rd Qu.: 67552
## Max. :56045 Kansas : 105 Madison : 20 Max. :10038388
## (Other) :2256 (Other) :2992
## men women hispanic white
## Min. : 42 Min. : 43 Min. : 0.000 Min. : 0.90
## 1st Qu.: 5546 1st Qu.: 5466 1st Qu.: 1.900 1st Qu.:65.60
## Median : 12826 Median : 12907 Median : 3.700 Median :84.60
## Mean : 49565 Mean : 51171 Mean : 8.826 Mean :77.28
## 3rd Qu.: 33319 3rd Qu.: 34122 3rd Qu.: 9.000 3rd Qu.:93.30
## Max. :4945351 Max. :5093037 Max. :98.700 Max. :99.80
##
## black native asian pacific
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00000
## 1st Qu.: 0.600 1st Qu.: 0.100 1st Qu.: 0.200 1st Qu.: 0.00000
## Median : 2.100 Median : 0.300 Median : 0.500 Median : 0.00000
## Mean : 8.879 Mean : 1.766 Mean : 1.258 Mean : 0.08475
## 3rd Qu.:10.175 3rd Qu.: 0.600 3rd Qu.: 1.200 3rd Qu.: 0.00000
## Max. :85.900 Max. :92.100 Max. :41.600 Max. :35.30000
##
## citizen income income_per_cap poverty
## Min. : 80 Min. : 19328 Min. : 8292 Min. : 1.4
## 1st Qu.: 8254 1st Qu.: 38826 1st Qu.:20471 1st Qu.:12.0
## Median : 19434 Median : 45111 Median :23577 Median :16.0
## Mean : 70804 Mean : 46830 Mean :24338 Mean :16.7
## 3rd Qu.: 50728 3rd Qu.: 52249 3rd Qu.:27138 3rd Qu.:20.3
## Max. :6046749 Max. :123453 Max. :65600 Max. :53.3
##
## child_poverty professional service office
## Min. : 0.00 Min. :13.50 Min. : 5.00 Min. : 4.10
## 1st Qu.:16.10 1st Qu.:26.70 1st Qu.:15.90 1st Qu.:20.20
## Median :22.50 Median :30.00 Median :18.00 Median :22.40
## Mean :23.29 Mean :31.04 Mean :18.26 Mean :22.13
## 3rd Qu.:29.48 3rd Qu.:34.40 3rd Qu.:20.20 3rd Qu.:24.30
## Max. :72.30 Max. :74.00 Max. :36.60 Max. :35.40
##
## construction production drive carpool
## Min. : 1.70 Min. : 0.00 Min. : 5.20 Min. : 0.00
## 1st Qu.: 9.80 1st Qu.:11.53 1st Qu.:76.60 1st Qu.: 8.50
## Median :12.20 Median :15.40 Median :80.60 Median : 9.90
## Mean :12.74 Mean :15.82 Mean :79.08 Mean :10.33
## 3rd Qu.:15.00 3rd Qu.:19.40 3rd Qu.:83.60 3rd Qu.:11.88
## Max. :40.30 Max. :55.60 Max. :94.60 Max. :29.90
##
## transit walk other_transp work_at_home
## Min. : 0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.1000 1st Qu.: 1.400 1st Qu.: 0.900 1st Qu.: 2.800
## Median : 0.4000 Median : 2.400 Median : 1.300 Median : 4.000
## Mean : 0.9675 Mean : 3.307 Mean : 1.614 Mean : 4.697
## 3rd Qu.: 0.8000 3rd Qu.: 4.000 3rd Qu.: 1.900 3rd Qu.: 5.700
## Max. :61.7000 Max. :71.200 Max. :39.100 Max. :37.200
##
## mean_commute employed private_work public_work
## Min. : 4.90 Min. :16.57 Min. :25.00 Min. : 5.80
## 1st Qu.:19.30 1st Qu.:39.01 1st Qu.:70.90 1st Qu.:13.10
## Median :22.90 Median :43.67 Median :75.80 Median :16.10
## Mean :23.15 Mean :43.37 Mean :74.44 Mean :17.35
## 3rd Qu.:26.60 3rd Qu.:48.20 3rd Qu.:79.80 3rd Qu.:20.10
## Max. :44.00 Max. :76.24 Max. :88.30 Max. :66.20
##
## self_employed family_work unemployment
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 5.400 1st Qu.:0.1000 1st Qu.: 5.500
## Median : 6.900 Median :0.2000 Median : 7.500
## Mean : 7.921 Mean :0.2915 Mean : 7.815
## 3rd Qu.: 9.400 3rd Qu.:0.3000 3rd Qu.: 9.700
## Max. :36.600 Max. :9.8000 Max. :29.400
##
ACS %>% mutate(women_county = women > men) %>%
summarize(Total_Women = sum(women_county))
## # A tibble: 1 × 1
## Total_Women
## <int>
## 1 1985
It appears there are 1,985 counties with more women than there are men.
ACS %>% mutate(unemployment_county = unemployment < 10) %>%
summarize(Total_Unemployment = sum(unemployment_county))
## # A tibble: 1 × 1
## Total_Unemployment
## <int>
## 1 2420
It appears there are 2,420 counties with an unemployment rate lower than 10%.
mean_commute to answer this question.ACS %>% arrange(desc(mean_commute)) %>%
select(census_id, county, state, mean_commute) %>%
top_n(n = 10, wt = mean_commute)
## # A tibble: 10 × 4
## census_id county state mean_commute
## <dbl> <fct> <fct> <dbl>
## 1 42103 Pike Pennsylvania 44
## 2 36005 Bronx New York 43
## 3 24017 Charles Maryland 42.8
## 4 51187 Warren Virginia 42.7
## 5 36081 Queens New York 42.6
## 6 36085 Richmond New York 42.6
## 7 51193 Westmoreland Virginia 42.5
## 8 8093 Park Colorado 42.4
## 9 36047 Kings New York 41.7
## 10 54015 Clay West Virginia 41.4
After leveraging the dplyr::top_n() function, we can see that the following counties in respective, descending order have the highest mean commute: Pike, Bronx, Charles, Warren, Queens, Richmond, Westmoreland, Park, Kings, and Clay.
ACS %>% mutate(women_percentage = women/total_pop * 100) %>%
arrange(women_percentage) %>%
select(census_id, county, women_percentage, state) %>%
top_n(n = -10, wt = women_percentage)
## # A tibble: 10 × 4
## census_id county women_percentage state
## <dbl> <fct> <dbl> <fct>
## 1 42053 Forest 26.8 Pennsylvania
## 2 8011 Bent 31.4 Colorado
## 3 51183 Sussex 31.5 Virginia
## 4 13309 Wheeler 32.1 Georgia
## 5 6035 Lassen 33.2 California
## 6 48095 Concho 33.3 Texas
## 7 13053 Chattahoochee 33.4 Georgia
## 8 2013 Aleutians East Borough 33.5 Alaska
## 9 22125 West Feliciana 33.6 Louisiana
## 10 32027 Pershing 33.7 Nevada
After leveraging the dplyr::top_n() function, the new variable women_percentage illustrates the percentage of women for each county. More specifically, the top 10 counties with the lowest percentages for women population. In ascending order, we have: Forest, Bent, Sussex, Wheeler, Lassen, Concho, Chattahoochee, Aleutians East Borough, West Feliciana, and Pershing.
ACS %>% mutate(race_sums = hispanic + white + black + native + asian + pacific) %>%
arrange(race_sums) %>%
select(census_id, county, race_sums, state,) %>%
top_n(n = -10, wt = race_sums)
## # A tibble: 10 × 4
## census_id county race_sums state
## <dbl> <fct> <dbl> <fct>
## 1 15001 Hawaii 76.4 Hawaii
## 2 15009 Maui 79.2 Hawaii
## 3 40097 Mayes 79.7 Oklahoma
## 4 15003 Honolulu 81.5 Hawaii
## 5 40123 Pontotoc 82.8 Oklahoma
## 6 47061 Grundy 83 Tennessee
## 7 2282 Yakutat City and Borough 83.4 Alaska
## 8 40069 Johnston 84 Oklahoma
## 9 15007 Kauai 84.1 Hawaii
## 10 40003 Alfalfa 85.1 Oklahoma
In terms of the counties with the lowest sum of race as a percentage, Hawaii has the lowest followed by Maui, Mayes, Honolulu, Pontotoc, Grundy, Yakutat City and Borough, Johnston, Kauai, and Alfalfa.
ACS %>% group_by(state) %>%
mutate(sum_race = hispanic + white + black + native + asian + pacific) %>%
summarize(mean_sum_race = mean(sum_race)) %>%
arrange(mean_sum_race) %>%
top_n(n =- 1, wt = mean_sum_race)
## # A tibble: 1 × 2
## state mean_sum_race
## <fct> <dbl>
## 1 Hawaii 84
In terms of which state, on average, has the lowest sum of race percentage would be Hawaii at 84%, which would make sense given the context of the state.
ACS_Race <- ACS %>% mutate(sum_race = hispanic + white + black + native + asian + pacific)
ACS_Race$sum_race <- format(round(ACS_Race$sum_race, 10), nsmall = 10)
ACS_Race %>% mutate(percentage_gt100 = sum_race > 100.01) %>%
filter(percentage_gt100 == TRUE) %>%
select(census_id, county, state, sum_race, percentage_gt100)
## # A tibble: 5 × 5
## census_id county state sum_race percentage_gt100
## <dbl> <fct> <fct> <chr> <lgl>
## 1 31073 Gosper Nebraska 100.1000000000 TRUE
## 2 31091 Hooker Nebraska 100.1000000000 TRUE
## 3 31125 Nance Nebraska 100.1000000000 TRUE
## 4 48017 Bailey Texas 100.1000000000 TRUE
## 5 48137 Edwards Texas 100.1000000000 TRUE
There appears to be five counties that have a recorded value of over 100% in terms of race percentage. They are: Gosper, Hooker, Nance, Bailey, and Edwards.
ACS_Race %>% mutate(percentage_100 = (sum_race < 100.1 & sum_race >=100.0)) %>%
filter(percentage_100 == TRUE) %>%
select(census_id, state, sum_race, percentage_100) %>%
distinct(state)
## # A tibble: 16 × 1
## state
## <fct>
## 1 Alabama
## 2 Georgia
## 3 Kansas
## 4 Kentucky
## 5 Mississippi
## 6 Montana
## 7 Nebraska
## 8 Nevada
## 9 New Mexico
## 10 North Carolina
## 11 North Dakota
## 12 Oregon
## 13 South Dakota
## 14 Texas
## 15 Utah
## 16 West Virginia
There are 16 states that sum to exactly 100%.
carpool variable,carpool_rank where the highest ranked county (rank = 1) is the county with the highest carpool value. Read the documentation carefully for the ranking function.ACS$carpool_rank <- min_rank(desc(ACS$carpool))
carpool_rank has been created as a new variable in the data set.
ACS %>% arrange(carpool_rank) %>%
select(census_id, county, state, carpool, carpool_rank) %>%
top_n(n = -10, wt = carpool_rank)
## # A tibble: 10 × 5
## census_id county state carpool carpool_rank
## <dbl> <fct> <fct> <dbl> <int>
## 1 13061 Clay Georgia 29.9 1
## 2 18087 LaGrange Indiana 27 2
## 3 13165 Jenkins Georgia 25.3 3
## 4 5133 Sevier Arkansas 24.4 4
## 5 20175 Seward Kansas 23.4 5
## 6 48079 Cochran Texas 22.8 6
## 7 48247 Jim Hogg Texas 22.6 7
## 8 48393 Roberts Texas 22.4 8
## 9 39075 Holmes Ohio 21.8 9
## 10 21197 Powell Kentucky 21.6 10
From this, we see that, in descending order, Clay, LaGrange, Jenkins, Sevier, Seward, Cochran, Jim Hogg, Roberts, Holmes, and Powell have the 10 highest ranked data for carpooling.
ACS %>% arrange(desc(carpool_rank)) %>%
select(census_id, county, state, carpool, carpool_rank) %>%
top_n(n = 10, wt = carpool_rank)
## # A tibble: 11 × 5
## census_id county state carpool carpool_rank
## <dbl> <fct> <fct> <dbl> <int>
## 1 48261 Kenedy Texas 0 3141
## 2 48269 King Texas 0 3141
## 3 48235 Irion Texas 0.9 3140
## 4 31183 Wheeler Nebraska 1.3 3139
## 5 36061 New York New York 1.9 3138
## 6 13309 Wheeler Georgia 2.3 3136
## 7 38029 Emmons North Dakota 2.3 3136
## 8 30019 Daniels Montana 2.6 3134
## 9 31057 Dundy Nebraska 2.6 3134
## 10 46069 Hyde South Dakota 2.8 3132
## 11 51720 Norton city Virginia 2.8 3132
Opposite from 10(b), we have the lowest ranked counties for carpooling, which are: Kenedy, King, Irion, Wheeler, New York, Wheeler, Emmons, Daniels, Dundy, Hyde, and Norton City.
ACS %>% group_by(state) %>%
summarize(state_mean_carpool = mean(carpool)) %>%
mutate(state_mean_rank = min_rank(desc(state_mean_carpool))) %>%
arrange(state_mean_rank) %>%
top_n(n = -1, wt = state_mean_rank)
## # A tibble: 1 × 3
## state state_mean_carpool state_mean_rank
## <fct> <dbl> <int>
## 1 Alaska 12.1 1
On average, we see that Alaska has the highest ranked percentage for carpooling.
ACS %>% group_by(state) %>%
summarize(state_mean_carpool = mean(carpool)) %>%
mutate(state_mean_rank = min_rank(desc(state_mean_carpool))) %>%
arrange(state_mean_rank) %>%
top_n(n = -5, wt = state_mean_rank)
## # A tibble: 5 × 3
## state state_mean_carpool state_mean_rank
## <fct> <dbl> <int>
## 1 Alaska 12.1 1
## 2 Arkansas 11.9 2
## 3 Utah 11.9 3
## 4 Texas 11.8 4
## 5 Nevada 11.7 5
The top five states for carpooling in descending order are: Alaska, Arkansas, Utah, Texas, and Nevada.