Importing Data and Data Cleaning

1. Import the data set using a Tidyverse function and NOT with a Base R function. How many rows and columns are in the data set?

Code explanation: Load the tidyverse package so we can use the read_csv() function to import our data.

library(tidyverse)
data_ACS <- read_csv("acs_2015_county_data_revised.csv")

The dim() function tells us the size (rows × columns) of the data set. There are 3,142 rows and 35 columns in the data set.

dim(data_ACS)
## [1] 3142   35

2. Do any data types need to be changed? Show any code to change variable types and show code/output for a command after you’re finished.

Yes, I will need to change “state” and “county” to factors. I do not need to change any other variables, such as the percentages, because they are already correctly stored as numeric.

I then use the str() function to make sure each variable did in fact change to a factor.

data_ACS$state <- as.factor(data_ACS$state)
data_ACS$county <- as.factor(data_ACS$county)

str(data_ACS$state)
##  Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
str(data_ACS$county)
##  Factor w/ 1851 levels "Abbeville","Acadia",..: 82 89 100 149 164 225 235 246 293 315 ...
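
A tidyverse-style equivalent of the two as.factor() assignments above is a single mutate() with across() (assuming dplyr >= 1.0.0); this is just an alternative sketch, not what was run here.

data_ACS %>%
  mutate(across(c(state, county), as.factor)) # converts both columns to factors in one step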

3. Are there any missing values? How will you handle missing values? Will you impute a missing value with, for example, a mean or median value for the entire column, or will you remove the entire observation? Give a rationale for your decision and show any code/output to handle missing values.

Yes, there are 2 missing values in total: 1 missing income value and 1 missing child_poverty value. Because there are only 2 missing values and both are numeric, I will omit both observations entirely (if the missing values were factors, I would replace the NA with “not stated”).

colSums(is.na(data_ACS)) # tells me total NA in each column
##      census_id          state         county      total_pop            men 
##              0              0              0              0              0 
##          women       hispanic          white          black         native 
##              0              0              0              0              0 
##          asian        pacific        citizen         income income_per_cap 
##              0              0              0              1              0 
##        poverty  child_poverty   professional        service         office 
##              0              1              0              0              0 
##   construction     production          drive        carpool        transit 
##              0              0              0              0              0 
##           walk   other_transp   work_at_home   mean_commute       employed 
##              0              0              0              0              0 
##   private_work    public_work  self_employed    family_work   unemployment 
##              0              0              0              0              0
data_ACS <- na.omit(data_ACS) # removes observations with NA
colSums(is.na(data_ACS)) # checks total NA in each column again to make sure NA was removed
##      census_id          state         county      total_pop            men 
##              0              0              0              0              0 
##          women       hispanic          white          black         native 
##              0              0              0              0              0 
##          asian        pacific        citizen         income income_per_cap 
##              0              0              0              0              0 
##        poverty  child_poverty   professional        service         office 
##              0              0              0              0              0 
##   construction     production          drive        carpool        transit 
##              0              0              0              0              0 
##           walk   other_transp   work_at_home   mean_commute       employed 
##              0              0              0              0              0 
##   private_work    public_work  self_employed    family_work   unemployment 
##              0              0              0              0              0
dim(data_ACS) # used here to double check we now have 2 fewer rows
## [1] 3140   35
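
For reference, had I chosen imputation instead (the other option mentioned in the question), a minimal sketch would replace each NA with that column's median; the data_ACS_imputed name is only for this illustration, and this code was not applied to the data (it would be run on the data as imported, before na.omit()).

data_ACS_imputed <- data_ACS %>%
  mutate(income = if_else(is.na(income), median(income, na.rm = TRUE), income), # median-impute income
         child_poverty = if_else(is.na(child_poverty), median(child_poverty, na.rm = TRUE), child_poverty)) # median-impute child_poverty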

4. Use the summary() function to examine any unusual values. Are there any? If so, how will you handle these unusual values? Show any code/output to handle unusual values. (Note: for the sake of time, you do not need to create any visualizations or other statistical summaries for every variable; the summary() function will suffice for this homework.)

Using the summary() function, I do not see any unusual values.
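
For completeness, the check itself is presumably the single call below; its column-by-column output is long, so it is not reproduced here.

summary(data_ACS) # quartiles/min/max for numeric columns, level counts for the factor columns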

Data Manipulation and Insights

5. How many counties have more women than men?

1,984 counties have more women than men.

sum(data_ACS$women > data_ACS$men) # creates logical vector & sum tells me how often vector is TRUE
## [1] 1984

6. How many counties have an unemployment rate lower than 10%?

There are 2,419 counties with an unemployment rate under 10%.

sum(data_ACS$unemployment < 10, na.rm = TRUE) # logical vector where unemployment < 10% = TRUE & sum tells me how often it is TRUE
## [1] 2419

7. What are the top 10 counties with the highest mean commute? Show the census ID, county name, state, and the mean_commute in your final answer (sorted by mean_commute).

top10_commute <- data_ACS %>%
  select(census_id, county, state, mean_commute) %>% # selects relevant columns for output
  top_n(10, mean_commute) %>% # selects top 10 rows with highest mean_commute
  arrange(mean_commute) # sorts by mean_commute

top10_commute # gives output of above commands
## # A tibble: 10 × 4
##    census_id county       state         mean_commute
##        <dbl> <fct>        <fct>                <dbl>
##  1     54015 Clay         West Virginia         41.4
##  2     36047 Kings        New York              41.7
##  3      8093 Park         Colorado              42.4
##  4     51193 Westmoreland Virginia              42.5
##  5     36081 Queens       New York              42.6
##  6     36085 Richmond     New York              42.6
##  7     51187 Warren       Virginia              42.7
##  8     24017 Charles      Maryland              42.8
##  9     36005 Bronx        New York              43  
## 10     42103 Pike         Pennsylvania          44
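
An equivalent way to write this with current dplyr verbs (top_n() is superseded) is slice_max(); this sketch assumes dplyr >= 1.0.0, keeps the same ten counties, and sorts them with the longest commute first. The top10_commute_alt name is only for this sketch.

top10_commute_alt <- data_ACS %>%
  select(census_id, county, state, mean_commute) %>% # same columns as above
  slice_max(mean_commute, n = 10) %>% # 10 largest mean_commute values (ties kept)
  arrange(desc(mean_commute)) # longest commute first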

8. Create a new variable that calculates the percentage of women for each county and then find the top 10 counties with the lowest percentages. Show the census ID, county name, state, and the percentage in your final answer (sorted by ascending percentage).

data_ACS <- data_ACS %>% 
  mutate(percent_women = women / total_pop * 100) # mutate uses dplyr to create new variable

least_women <- data_ACS %>%
  select(census_id, county, state, percent_women) %>% # selects relevant columns for output
  slice_min(percent_women, n = 10) # pulls the 10 rows with the lowest values
 
least_women # gives output of above commands
## # A tibble: 10 × 4
##    census_id county                 state        percent_women
##        <dbl> <fct>                  <fct>                <dbl>
##  1     42053 Forest                 Pennsylvania          26.8
##  2      8011 Bent                   Colorado              31.4
##  3     51183 Sussex                 Virginia              31.5
##  4     13309 Wheeler                Georgia               32.1
##  5      6035 Lassen                 California            33.2
##  6     48095 Concho                 Texas                 33.3
##  7     13053 Chattahoochee          Georgia               33.4
##  8      2013 Aleutians East Borough Alaska                33.5
##  9     22125 West Feliciana         Louisiana             33.6
## 10     32027 Pershing               Nevada                33.7

9. Create a new variable that calculates the sum of all race percentage variables (these columns are the “hispanic”, “white”, “black”, “native”, “asian”, and “pacific” variables).

data_ACS <- data_ACS %>% 
  mutate(all_race = hispanic + white + black + native + asian + pacific) # mutate uses dplyr to create new variable
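
If more columns ever needed to be summed, the same calculation can be written without listing each term by hand, using rowSums() over across() (dplyr >= 1.0.0); this is an equivalent sketch, not what was run above.

data_ACS %>%
  mutate(all_race = rowSums(across(c(hispanic, white, black, native, asian, pacific)))) # row-wise sum of the selected race columns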

a. What are the top 10 counties with the lowest sum of these race percentage variables?

top10_race <- data_ACS %>%
  select(county, all_race) %>% # selects relevant columns for output
  slice_min(all_race, n = 10) # pulls the 10 lowest values

top10_race # gives output of above commands
## # A tibble: 10 × 2
##    county                   all_race
##    <fct>                       <dbl>
##  1 Hawaii                       76.4
##  2 Maui                         79.2
##  3 Mayes                        79.7
##  4 Honolulu                     81.5
##  5 Pontotoc                     82.8
##  6 Grundy                       83  
##  7 Yakutat City and Borough     83.4
##  8 Johnston                     84  
##  9 Kauai                        84.1
## 10 Alfalfa                      85.1

b. Which state, on average, has the lowest sum of these race percentage variables?

Hawaii has the lowest sum of these race percentage variables, on average.

lowest_race <- data_ACS %>%
  group_by(state) %>% # groups observations by state so I can then take the average by state
  summarise(avg_all_race = mean(all_race, na.rm = TRUE)) %>% # average of all_race by state
  slice_min(avg_all_race, n = 1) # selects the state with the lowest average all_race

lowest_race # gives output of above commands
## # A tibble: 1 × 2
##   state  avg_all_race
##   <fct>         <dbl>
## 1 Hawaii         80.3

c. Do any counties have a sum greater than 100%?

Yes, 11 counties have a sum greater than 100%.

county_100 <- data_ACS %>% 
  filter(all_race > 100) %>% # only keeps counties with an all_race sum > 100
  select(county) # selects the relevant column for output

county_100 # gives output of above commands
## # A tibble: 11 × 1
##    county   
##    <fct>    
##  1 Claiborne
##  2 Gosper   
##  3 Hooker   
##  4 Nance    
##  5 Bailey   
##  6 Duval    
##  7 Edwards  
##  8 Kenedy   
##  9 Kent     
## 10 Presidio 
## 11 Beaver

d. How many states have a sum that equals exactly 100%?

13 states contain at least one county whose sum equals exactly 100%.

states_equal_100 <- data_ACS %>%
  filter(all_race == 100) %>% # only keeps counties with an all_race sum = 100
  select(state) # selects the relevant column for output

unique(states_equal_100) # keeps only the unique states from the output above
## # A tibble: 13 × 1
##    state         
##    <fct>         
##  1 Alabama       
##  2 Georgia       
##  3 Kansas        
##  4 Kentucky      
##  5 Mississippi   
##  6 Montana       
##  7 Nebraska      
##  8 New Mexico    
##  9 North Carolina
## 10 North Dakota  
## 11 South Dakota  
## 12 Texas         
## 13 West Virginia
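
To get the count itself rather than reading it off the list above, an equivalent sketch uses n_distinct() to count the unique states directly.

data_ACS %>%
  filter(all_race == 100) %>% # counties whose race percentages sum to exactly 100
  summarise(n_states = n_distinct(state)) # number of distinct states containing such a county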

10. Using the carpool variable,

a. Use the dplyr::min_rank() function to create a new variable called carpool_rank where the highest ranked county (rank = 1) is the county with the highest carpool value. Read the documentation carefully for the ranking function.

data_ACS <- data_ACS %>% 
  mutate(carpool_rank = dplyr::min_rank(carpool)) # mutate uses dplyr to create a new variable; min_rank() ranks ascending, so rank 1 is the lowest carpool value
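
Because min_rank() ranks in ascending order, the largest rank numbers in the results below belong to the counties that carpool the most. If rank 1 should instead mark the highest carpool value, as the prompt defines it, the rank can be computed on the descending values; this sketch shows that alternative and is not what produced the ranks shown below.

data_ACS %>%
  mutate(carpool_rank = dplyr::min_rank(dplyr::desc(carpool))) # rank 1 = county with the highest carpool value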

b. Find the 10 highest ranked counties for carpooling. Show the census ID, county name, state, carpool value, and carpool_rank in your final answer.

top10_carpool <- data_ACS %>%
  select(census_id, county, state, carpool, carpool_rank) %>% # selects relevant columns for output
  top_n(10, carpool_rank) # keeps the 10 largest rank numbers, i.e. the highest carpool values under this ranking

top10_carpool # gives output of above commands
## # A tibble: 10 × 5
##    census_id county   state    carpool carpool_rank
##        <dbl> <fct>    <fct>      <dbl>        <int>
##  1      5133 Sevier   Arkansas    24.4         3137
##  2     13061 Clay     Georgia     29.9         3140
##  3     13165 Jenkins  Georgia     25.3         3138
##  4     18087 LaGrange Indiana     27           3139
##  5     20175 Seward   Kansas      23.4         3136
##  6     21197 Powell   Kentucky    21.6         3131
##  7     39075 Holmes   Ohio        21.8         3132
##  8     48079 Cochran  Texas       22.8         3135
##  9     48247 Jim Hogg Texas       22.6         3134
## 10     48393 Roberts  Texas       22.4         3133

c. Find the 10 lowest ranked counties for carpooling. Show the same variables in your final answer.

bottom10_carpool <- data_ACS %>%
  select(census_id, county, state, carpool, carpool_rank) %>% # selects relevant columns for output
  slice_min(carpool_rank, n = 10) # keeps the 10 smallest rank numbers, i.e. the lowest carpool values (ties kept, which is why 11 rows appear below)

bottom10_carpool # gives output of above commands
## # A tibble: 11 × 5
##    census_id county      state        carpool carpool_rank
##        <dbl> <fct>       <fct>          <dbl>        <int>
##  1     48261 Kenedy      Texas            0              1
##  2     48269 King        Texas            0              1
##  3     48235 Irion       Texas            0.9            3
##  4     31183 Wheeler     Nebraska         1.3            4
##  5     36061 New York    New York         1.9            5
##  6     13309 Wheeler     Georgia          2.3            6
##  7     38029 Emmons      North Dakota     2.3            6
##  8     30019 Daniels     Montana          2.6            8
##  9     31057 Dundy       Nebraska         2.6            8
## 10     46069 Hyde        South Dakota     2.8           10
## 11     51720 Norton city Virginia         2.8           10

d. On average, what state is the best ranked for carpooling?

Taking the lowest average carpool_rank as “best ranked”: Washington, DC comes out first, but it is not a state, so I take Massachusetts (the next lowest, shown below) to be the best ranked state on average.

best_carpool <- data_ACS %>%
  group_by(state) %>% # groups observations by state to then take average
  summarise(avg_carpool_rank = mean(carpool_rank, na.rm = TRUE)) %>% # average of carpool_rank by state
  slice_min(avg_carpool_rank, n = 1) # shows state with lowest avg carpool_rank

best_carpool # gives output of above commands
## # A tibble: 1 × 2
##   state                avg_carpool_rank
##   <fct>                           <dbl>
## 1 District of Columbia               72
best_carpool2 <- data_ACS %>%
  group_by(state) %>% # groups observations by state to then take average
  summarise(avg_carpool_rank = mean(carpool_rank, na.rm = TRUE)) %>% # average of carpool_rank by state
  slice_min(avg_carpool_rank, n = 2) # shows the 2 states with lowest avg carpool_rank

best_carpool2 # gives output of above commands
## # A tibble: 2 × 2
##   state                avg_carpool_rank
##   <fct>                           <dbl>
## 1 District of Columbia              72 
## 2 Massachusetts                    540.

e. What are the top 5 states for carpooling?

top5_carpool <- data_ACS %>%
  select(state, carpool) %>% # selects relevant columns for output
  top_n(5, carpool) # returns the 5 counties with the highest carpool values and their states

top5_carpool # gives output of above commands
## # A tibble: 5 × 2
##   state    carpool
##   <fct>      <dbl>
## 1 Arkansas    24.4
## 2 Georgia     29.9
## 3 Georgia     25.3
## 4 Indiana     27  
## 5 Kansas      23.4
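
The output above lists the states of the five individual counties with the highest carpool values, which is why Georgia appears twice. If “top 5 states” were instead read as the five states with the highest average county carpool share, a sketch would look like the following (results not shown here; the avg_carpool name is only for this sketch).

data_ACS %>%
  group_by(state) %>% # one row per state
  summarise(avg_carpool = mean(carpool)) %>% # average carpool share across each state's counties
  slice_max(avg_carpool, n = 5) # the five states with the highest average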