Wrangling HW 3

Importing Data and Data Cleaning

library(tidyverse)

Import the data set using a Tidyverse function.

acs_2015 <- read_csv("Week 4/homework3/acs_2015_county_data_revised.csv")

How many rows and columns are in the data set?

dim(acs_2015)

## [1] 3142   35

class(acs_2015)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

This data set has 3142 rows and 35 columns, and is classified as multiple data types.

From the output of str(), most of the variables here appear to be correctly categorized as strings and numbers. However, variables like total_pop, men, women, and citizen could be changed to integers since they refer to numbers of people and not decimal values. I think that census_id should probably be changed to character variable since is is not a value that should be modified, as it refers to a unique id.

str(acs_2015)

## tibble [3,142 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ census_id     : num [1:3142] 1001 1003 1005 1007 1009 ...
##  $ state         : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ county        : chr [1:3142] "Autauga" "Baldwin" "Barbour" "Bibb" ...
##  $ total_pop     : num [1:3142] 55221 195121 26932 22604 57710 ...
##  $ men           : num [1:3142] 26745 95314 14497 12073 28512 ...
##  $ women         : num [1:3142] 28476 99807 12435 10531 29198 ...
##  $ hispanic      : num [1:3142] 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
##  $ white         : num [1:3142] 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
##  $ black         : num [1:3142] 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
##  $ native        : num [1:3142] 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
##  $ asian         : num [1:3142] 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
##  $ pacific       : num [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
##  $ citizen       : num [1:3142] 40725 147695 20714 17495 42345 ...
##  $ income        : num [1:3142] 51281 50254 32964 38678 45813 ...
##  $ income_per_cap: num [1:3142] 24974 27317 16824 18431 20532 ...
##  $ poverty       : num [1:3142] 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
##  $ child_poverty : num [1:3142] 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
##  $ professional  : num [1:3142] 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
##  $ service       : num [1:3142] 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
##  $ office        : num [1:3142] 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
##  $ construction  : num [1:3142] 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
##  $ production    : num [1:3142] 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
##  $ drive         : num [1:3142] 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
##  $ carpool       : num [1:3142] 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
##  $ transit       : num [1:3142] 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
##  $ walk          : num [1:3142] 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
##  $ other_transp  : num [1:3142] 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
##  $ work_at_home  : num [1:3142] 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
##  $ mean_commute  : num [1:3142] 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
##  $ employed      : num [1:3142] 23986 85953 8597 8294 22189 ...
##  $ private_work  : num [1:3142] 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
##  $ public_work   : num [1:3142] 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
##  $ self_employed : num [1:3142] 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
##  $ family_work   : num [1:3142] 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
##  $ unemployment  : num [1:3142] 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   census_id = col_double(),
##   ..   state = col_character(),
##   ..   county = col_character(),
##   ..   total_pop = col_double(),
##   ..   men = col_double(),
##   ..   women = col_double(),
##   ..   hispanic = col_double(),
##   ..   white = col_double(),
##   ..   black = col_double(),
##   ..   native = col_double(),
##   ..   asian = col_double(),
##   ..   pacific = col_double(),
##   ..   citizen = col_double(),
##   ..   income = col_double(),
##   ..   income_per_cap = col_double(),
##   ..   poverty = col_double(),
##   ..   child_poverty = col_double(),
##   ..   professional = col_double(),
##   ..   service = col_double(),
##   ..   office = col_double(),
##   ..   construction = col_double(),
##   ..   production = col_double(),
##   ..   drive = col_double(),
##   ..   carpool = col_double(),
##   ..   transit = col_double(),
##   ..   walk = col_double(),
##   ..   other_transp = col_double(),
##   ..   work_at_home = col_double(),
##   ..   mean_commute = col_double(),
##   ..   employed = col_double(),
##   ..   private_work = col_double(),
##   ..   public_work = col_double(),
##   ..   self_employed = col_double(),
##   ..   family_work = col_double(),
##   ..   unemployment = col_double()
##   .. )

Let’s fix those variables!

acs_2015$total_pop <- as.integer(acs_2015$total_pop)
acs_2015$men <- as.integer(acs_2015$men)
acs_2015$women <- as.integer(acs_2015$women)
acs_2015$citizen <- as.integer(acs_2015$citizen)
acs_2015$census_id <- as.character(acs_2015$census_id)

Now that the variable types have been changes, let’s see a glimpse at the data.

glimpse(acs_2015)

## Rows: 3,142
## Columns: 35
## $ census_id      <chr> "1001", "1003", "1005", "1007", "1009", "1011", "1013"…
## $ state          <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",…
## $ county         <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bu…
## $ total_pop      <int> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 1166…
## $ men            <int> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274, …
## $ women          <int> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 60374,…
## $ hispanic       <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7.6,…
## $ white          <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3, …
## $ black          <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, 4.…
## $ native         <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0.4,…
## $ asian          <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0.3,…
## $ pacific        <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
## $ citizen        <int> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88612…
## $ income         <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41703…
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21374…
## $ poverty        <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6, …
## $ child_poverty  <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2, …
## $ professional   <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3, …
## $ service        <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5, …
## $ office         <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3, …
## $ construction   <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5, 1…
## $ production     <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4, …
## $ drive          <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1, …
## $ carpool        <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 12.…
## $ transit        <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0.2,…
## $ walk           <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1.1,…
## $ other_transp   <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1.4,…
## $ work_at_home   <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1.9,…
## $ mean_commute   <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1, …
## $ employed       <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, 13…
## $ private_work   <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1, …
## $ public_work    <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1, …
## $ self_employed  <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4.1,…
## $ family_work    <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0.5,…
## $ unemployment   <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9, …

Next, let’s check for missing values.

colSums(is.na(acs_2015))

##      census_id          state         county      total_pop            men 
##              0              0              0              0              0 
##          women       hispanic          white          black         native 
##              0              0              0              0              0 
##          asian        pacific        citizen         income income_per_cap 
##              0              0              0              1              0 
##        poverty  child_poverty   professional        service         office 
##              0              1              0              0              0 
##   construction     production          drive        carpool        transit 
##              0              0              0              0              0 
##           walk   other_transp   work_at_home   mean_commute       employed 
##              0              0              0              0              0 
##   private_work    public_work  self_employed    family_work   unemployment 
##              0              0              0              0              0

There is one missing value in income and one in child_poverty.

Let’s look at those rows of the data to determine if we should delete them altogether.

filter(acs_2015, is.na(income))

filter(acs_2015, is.na(child_poverty))

There does not seem to be a reason to delete these observations, so I will impute with mean.

acs_2015$income[is.na(acs_2015$income)] <- mean(acs_2015$income, na.rm = TRUE)
acs_2015$child_poverty[is.na(acs_2015$child_poverty)] <- mean(acs_2015$child_poverty, na.rm = TRUE)

QA check:

colSums(is.na(acs_2015))

##      census_id          state         county      total_pop            men 
##              0              0              0              0              0 
##          women       hispanic          white          black         native 
##              0              0              0              0              0 
##          asian        pacific        citizen         income income_per_cap 
##              0              0              0              0              0 
##        poverty  child_poverty   professional        service         office 
##              0              0              0              0              0 
##   construction     production          drive        carpool        transit 
##              0              0              0              0              0 
##           walk   other_transp   work_at_home   mean_commute       employed 
##              0              0              0              0              0 
##   private_work    public_work  self_employed    family_work   unemployment 
##              0              0              0              0              0

Let’s examine unusual values in our data.

summary(acs_2015)

It appears that employed is not in terms of percentage, so that will need to be handled.

acs_2015 <- mutate(acs_2015,
  employed = round((employed/total_pop) * 100, digits = 1)
)

Next, let’s take a final look at a couple rows of our data.

head(acs_2015, 5)

Data Manipulation and Insights

How many counties have more women than men?

I ran the first line of code just because I wanted to make sure it was working correctly, as I am still unfamiliar with the pipeline operator. The second line actually counts up the counties with more women than men by filtering the data.

acs_2015 %>% filter(women> men) %>% select(county, women, men)

acs_2015 %>% filter(women > men) %>% NROW

## [1] 1985

So, there are 1,985 counties with more women than men.

How many counties have an unemployment rate lower than 10%?

We see from the following that there are 2,420 counties with unemployment lower than 10 percent.

acs_2015 %>% filter(unemployment < 10) %>% NROW

## [1] 2420

Using dplyr::top_n(), we find the top 10 counties with highest mean commmute and display county name, mean commute, census ID, and state.

acs_2015  %>% top_n(10, mean_commute) %>% arrange(desc(mean_commute)) %>% select(census_id, county, state, mean_commute)

Create a new variable that calculates the percentage of women for each county and then find the top 10 counties with the lowest percentages. Show the census ID, county name, state, and the percentage in your final answer (sorted by ascending percentage).

First, we create a new variable that calculates the percentage of women for each county.

acs_2015 <- mutate(acs_2015,
  percent_women = round((women/total_pop) * 100, digits = 1)
)

Next, let’s find the top 10 counties with the lowest percentages of women and see the census ID, county name, state, and percentage.

acs_2015  %>% top_n(-10, percent_women) %>% arrange(percent_women) %>% select(census_id, county, state, percent_women)

Here, we see 11 counties output because Pershing county and Garza county are tied at number 10 with 33.7% women.

Create a new variable that calculates the sum of all race percentage variables (these columns are the “hispanic”, “white”, “black”, “native”, “asian”, and “pacific” variables).

We create the new variable below as desired.

acs_2015 <- mutate(acs_2015,
  race_sum = hispanic + white + black + native + asian
)

Here are the top 10 counties with the lowest sum of these race percentage variables.

acs_2015  %>% top_n(-10, race_sum) %>% arrange(race_sum) %>% select(census_id, county, state, race_sum)

Which state, on average, has the lowest sum of these race percentage variables?

acs_2015 %>% group_by(state) %>% summarize(avg_race = mean(race_sum)) %>% arrange(avg_race) %>% top_n(-1, avg_race)

Grouping by state and finding the mean of this race percentage tells us that Hawaii has the lowest sum. This makes sense because in part a we saw that the top 5 lowest counties were in Hawaii.

Do any counties have a sum greater than 100%?

acs_2015 %>% filter(race_sum > 100) %>% NROW

## [1] 9

Based on the above output, there are 9 counties with a sum greater than 100%.

How many states have a sum that equals exactly to 100%?

acs_2015 %>% filter(race_sum == 100) %>% NROW

## [1] 27

There are 27 states with a sum exactly equal to 100% when we filter our data with this constraint.

Using the carpool variable,

Use the dplyr::min_rank() function to create a new variable called carpool_rank where the highest ranked county (rank = 1) is the county with the highest carpool value. Read the documentation carefully for the ranking function.

Based on the documentation of min_rank() and rank(), this will by default ranke the lowest value as rank = 1. So, to reverse that we will use desc(carpool) within the min_rank() function.

acs_2015 <- mutate(acs_2015,
  carpool_rank = min_rank(desc(carpool))
)
acs_2015 %>% select(census_id, county, state, carpool, carpool_rank ) %>% arrange(carpool_rank) %>% top_n(-5, carpool_rank)

These are the 10 highest ranked counties for carpooling,showing the census ID, county name, state, carpool value, and carpool_rank below.

acs_2015 %>% select(census_id, county, state, carpool, carpool_rank ) %>% arrange(carpool_rank) %>% top_n(-10, carpool_rank)

These are the 10 lowest ranked counties for carpooling.

acs_2015 %>% select(census_id, county, state, carpool, carpool_rank ) %>% arrange(carpool_rank) %>% top_n(10, carpool_rank)

On average, Arizona is the best state for carpooling based on the following code.

acs_2015 %>% group_by(state) %>% summarize(avg_carpool_rank = mean(carpool_rank)) %>% arrange(avg_carpool_rank) %>% top_n(-1, avg_carpool_rank)

The top 5 states for carpooling are Arizona, Utah, Arkansas, Hawaii, and Alaska, based on their average rankings by state, as calculated below.

acs_2015 %>% group_by(state) %>% summarize(avg_carpool_rank = mean(carpool_rank)) %>% arrange(avg_carpool_rank) %>% top_n(-5, avg_carpool_rank)

Wrangling HW 3

Emily Thie

11/6/2020

Importing Data and Data Cleaning

Data Manipulation and Insights