1. Import the data set using a Tidyverse function and NOT with a Base R function. How manyrows and columns are in the data set?
Code explanation: Library Tidyverse package so we can use the read_csv function to import our data.
library(tidyverse)
data_ACS <- read_csv("acs_2015_county_data_revised.csv")
The dims() function tells us the size (rowsxcolumns) of
the dataset. There are 3,142 rows and 35 columns in the dataset
## [1] 3142 35
2. Do any data types need changed? Show any code to change variable types and show code/output for a command after you’re finished.
Yes, I will need to change “State” and “County” to factors. I do not need to change any other variables, such as the percentages because they are already correctly stored as numerics.
I then use the structure function to make sure each variable did i fact change to factor.
data_ACS$state <- as.factor(data_ACS$state)
data_ACS$county <- as.factor(data_ACS$county)
str(data_ACS$state)
## Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
str(data_ACS$county)
## Factor w/ 1851 levels "Abbeville","Acadia",..: 82 89 100 149 164 225 235 246 293 315 ...
3. Are there any missing values? How will you handle missing values? Will you impute a missing value with, for example, a mean or median value for the entire column, or will you remove the entire observation? Give a rationale for your decision and show any code/output to handle missing values.
Yes, there is a total of 2 missing values. 1 missing Income value and 1 missing child_poverty value.Because there are only 2 missing values and the missing values are numeric, I will omit both observations entirely (If the missing values were factors, I would replace the NA with “not stated”).
colSums(is.na(data_ACS)) # tells me total NA in each column
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 1 0
## poverty child_poverty professional service office
## 0 1 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
data_ACS <- na.omit(data_ACS) # removes observations with NA
colSums(is.na(data_ACS)) # checks total NA in each column again to make sure NA was removed
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 0 0
## poverty child_poverty professional service office
## 0 0 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
dim(data_ACS) # used here to double check we now have 2 less rows
## [1] 3140 35
4.Use the function to examine any unusual values. Are there any? If so, how will you handle these unusual values? Show any code/output to handle unusual values. Notes: For the sake of time, you do not need to create any visualizations or other statistical summaries for every variable— the summary function will suffice for this homework).
Using the summary() function, I do not see any unusual
values.
5.How many counties have more women than men?
1,984 counties have more women than men
sum(data_ACS$women > data_ACS$men) # creates logical vector & sum tells me how often vector is TRUE
## [1] 1984
6.How many counties have an unemployment rate lower than 10%?
There are 2,419 counties with an unemployment rate under 10%
sum(data_ACS$unemployment < 10, na.rm = TRUE) # logical vector where unemployment < 10% = TRUE & sum tells me how often it is TRUE
## [1] 2419
7.What are the top 10 counties with the highest mean commute? Show the census ID, countyname, state, and the mean_commute in your final answer (sorted by mean_commute).
top10_commute <- data_ACS %>%
select(census_id, county, state, mean_commute) %>% # selects relevant Columns for output
top_n(10, mean_commute)%>% # selects top 10 rows with highest mean_commute
arrange(mean_commute) # sorts by mean_commute
top10_commute # gives output of above commands
## # A tibble: 10 × 4
## census_id county state mean_commute
## <dbl> <fct> <fct> <dbl>
## 1 54015 Clay West Virginia 41.4
## 2 36047 Kings New York 41.7
## 3 8093 Park Colorado 42.4
## 4 51193 Westmoreland Virginia 42.5
## 5 36081 Queens New York 42.6
## 6 36085 Richmond New York 42.6
## 7 51187 Warren Virginia 42.7
## 8 24017 Charles Maryland 42.8
## 9 36005 Bronx New York 43
## 10 42103 Pike Pennsylvania 44
8.Create a new variable that calculates the percentage of women for each county and then find the top 10 counties with the lowest percentages. Show the census ID, county name, state,and the percentage in your final answer (sorted by ascending percentage).
data_ACS <- data_ACS %>%
mutate(percent_women = women / total_pop * 100) # mutate uses dplyr to create new variable
least_women <- data_ACS %>%
select(census_id, county, state, percent_women) %>% # selects relevant columns for output
slice_min(percent_women,n = 10) # pulls 10 rows with lowest value
least_women # gives output of above commands
## # A tibble: 10 × 4
## census_id county state percent_women
## <dbl> <fct> <fct> <dbl>
## 1 42053 Forest Pennsylvania 26.8
## 2 8011 Bent Colorado 31.4
## 3 51183 Sussex Virginia 31.5
## 4 13309 Wheeler Georgia 32.1
## 5 6035 Lassen California 33.2
## 6 48095 Concho Texas 33.3
## 7 13053 Chattahoochee Georgia 33.4
## 8 2013 Aleutians East Borough Alaska 33.5
## 9 22125 West Feliciana Louisiana 33.6
## 10 32027 Pershing Nevada 33.7
9.Create a new variable that calculates the sum of all race percentage variables (these columns are the “hispanic”, “white”, “black”, “native”, “asian”, and “pacific” variables).
data_ACS <- data_ACS %>%
mutate(all_race = hispanic + white + black + native + asian + pacific) # mutate ueses dyplr to create new variable
a.What are the top 10 counties with the lowest sum of these race percentage variables?
top10_race <- data_ACS %>%
select(county, all_race) %>% # selects relevant columns for output
slice_min(all_race,n = 10) # pulls 10 lowests values
top10_race # gives output of above commands
## # A tibble: 10 × 2
## county all_race
## <fct> <dbl>
## 1 Hawaii 76.4
## 2 Maui 79.2
## 3 Mayes 79.7
## 4 Honolulu 81.5
## 5 Pontotoc 82.8
## 6 Grundy 83
## 7 Yakutat City and Borough 83.4
## 8 Johnston 84
## 9 Kauai 84.1
## 10 Alfalfa 85.1
b.Which state, on average, has the lowest sum of these race percentage variables?
Hawaii has the lowest sum of these race percentage variables, on average.
lowest_race <- data_ACS %>%
group_by(state) %>% # groups observations by state so I can then take average of state
summarise(avg_all_race = mean(all_race, na.rm = TRUE)) %>% # average of all_race by state
slice_min(avg_all_race, n = 1) # selescts state with lowest avg all_race
lowest_race # gives output of above commands
## # A tibble: 1 × 2
## state avg_all_race
## <fct> <dbl>
## 1 Hawaii 80.3
c.Do any counties have a sum greater than 100%?
Yes, 11 counties have a sum greater than 100
county_100 <- data_ACS %>%
filter(all_race > 100) %>% # only keeps counties with all_race sum > 100
select(county) # selects relevant coulmn name for output
county_100 # gives output of above commands
## # A tibble: 11 × 1
## county
## <fct>
## 1 Claiborne
## 2 Gosper
## 3 Hooker
## 4 Nance
## 5 Bailey
## 6 Duval
## 7 Edwards
## 8 Kenedy
## 9 Kent
## 10 Presidio
## 11 Beaver
d.How many states have a sum that equals exactly to 100%?
13 states have a sum equal to exactly 100%
states_equal_100 <- data_ACS %>%
filter(all_race == 100) %>% # only keeps counties with all_race sum = 100
select(state) # selects relevant coulmn name for output
unique(states_equal_100) # gives only unique output of above commands
## # A tibble: 13 × 1
## state
## <fct>
## 1 Alabama
## 2 Georgia
## 3 Kansas
## 4 Kentucky
## 5 Mississippi
## 6 Montana
## 7 Nebraska
## 8 New Mexico
## 9 North Carolina
## 10 North Dakota
## 11 South Dakota
## 12 Texas
## 13 West Virginia
10.Using the carpool variable,
a.Use the dplyr::min_rank() function to create a
new variable called carpool_rank where the highest ranked county (rank =
1) is the county with the highest carpool value. Read the documentation
carefully for the ranking function.
data_ACS <- data_ACS %>%
mutate(carpool_rank = dplyr::min_rank(carpool)) # mutate uses dyplr to create new variable & dplyr::min_rank assigns rank to each value
b.Find the 10 highest ranked counties for carpooling. Show the census ID, county name, state, carpool value, and carpool_rank in your final answer.
top10_carpool <- data_ACS %>%
select(census_id, county, state, carpool, carpool_rank) %>% # selects relevant columns for output
top_n(10, carpool_rank) # shows top 10 carpool_rank
top10_carpool # gives output of above commands
## # A tibble: 10 × 5
## census_id county state carpool carpool_rank
## <dbl> <fct> <fct> <dbl> <int>
## 1 5133 Sevier Arkansas 24.4 3137
## 2 13061 Clay Georgia 29.9 3140
## 3 13165 Jenkins Georgia 25.3 3138
## 4 18087 LaGrange Indiana 27 3139
## 5 20175 Seward Kansas 23.4 3136
## 6 21197 Powell Kentucky 21.6 3131
## 7 39075 Holmes Ohio 21.8 3132
## 8 48079 Cochran Texas 22.8 3135
## 9 48247 Jim Hogg Texas 22.6 3134
## 10 48393 Roberts Texas 22.4 3133
c.Find the 10 lowest ranked counties for carpooling. Show the same variables in your final answer.
bottom10_carpool <- data_ACS %>%
select(census_id, county, state, carpool, carpool_rank) %>% # selects relevant columns for output
slice_min(carpool_rank, n = 10) # shows bottom 10 carpool_rank
bottom10_carpool # gives output of above commands
## # A tibble: 11 × 5
## census_id county state carpool carpool_rank
## <dbl> <fct> <fct> <dbl> <int>
## 1 48261 Kenedy Texas 0 1
## 2 48269 King Texas 0 1
## 3 48235 Irion Texas 0.9 3
## 4 31183 Wheeler Nebraska 1.3 4
## 5 36061 New York New York 1.9 5
## 6 13309 Wheeler Georgia 2.3 6
## 7 38029 Emmons North Dakota 2.3 6
## 8 30019 Daniels Montana 2.6 8
## 9 31057 Dundy Nebraska 2.6 8
## 10 46069 Hyde South Dakota 2.8 10
## 11 51720 Norton city Virginia 2.8 10
d.On average, what state is the best ranked for carpooling?
On average, (although not a state) Washington, DC is the best ranked for carpooling….I think this means MA is the best ranked carpool STATE on avg.
best_carpool <- data_ACS %>%
group_by(state) %>% # groups observations by state to then take average
summarise(avg_carpool_rank = mean(carpool_rank, na.rm = TRUE)) %>% # average of carpool_rank by state
slice_min(avg_carpool_rank, n = 1) # shows state with lowest avg carpool_rank
best_carpool # gives output of above commands
## # A tibble: 1 × 2
## state avg_carpool_rank
## <fct> <dbl>
## 1 District of Columbia 72
best_carpool2 <- data_ACS %>%
group_by(state) %>% # groups observations by state to then take average
summarise(avg_carpool_rank = mean(carpool_rank, na.rm = TRUE)) %>% # average of carpool_rank by state
slice_min(avg_carpool_rank, n = 2) # shows the 2 states with lowest avg carpool_rank
best_carpool2 # gives output of above commands
## # A tibble: 2 × 2
## state avg_carpool_rank
## <fct> <dbl>
## 1 District of Columbia 72
## 2 Massachusetts 540.
e.What are the top 5 states for carpooling?
top5_carpool <- data_ACS %>%
select(state, carpool) %>% # selects relevant columns for output
top_n(5, carpool) # returns top 5 carpool values
top5_carpool # gives output of above commands
## # A tibble: 5 × 2
## state carpool
## <fct> <dbl>
## 1 Arkansas 24.4
## 2 Georgia 29.9
## 3 Georgia 25.3
## 4 Indiana 27
## 5 Kansas 23.4