Prerequisite Packages:

Question One:

Load the data set using Tidyverse:

acs_county_data <- read_csv("C:/Users/samc8/OneDrive - Xavier University/Data Wrangling/Week 5/acs_2015_county_data_revised.csv")

How many rows and columns are in the data set?

  • There are 3142 rows and 35 columns found in the data set.
dim(acs_county_data)
[1] 3142   35

Question Two:

Do any data types need changed? Show any code to change variable type and show code/ outputfor a “glimpse()” command after you’re finished.

  • Based on the information after running the “glimpse()” command, and reading the “Data Dictionary” supplementary file I do not think any data type needs to be changed. The only two non numeric data fields are “state” and “county” which are both characterized as a “character”. The remainder of the data is in the form of “dbl” which is a numerical value, where all data values for the remaining columns deal with some type of numerical measure such as totals, percents, etc.
glimpse(acs_county_data)

str(acs_county_data)

Question 3:

Are there any missing values?

There are two missing values:

  • 1 missing value in Income
  • 1 missing value in Child Poverty
colSums(is.na(acs_county_data))

Cleaning the data set:

  • After looking at the two categories, I feel that both income and child_poverty are needed in the analysis and should not removed. Since there is also only one observation missing from both categories, I think it would be best to mutate the data where I will implement a median function to add data to the missing data point.
acs_county_data_clean <- acs_county_data %>%
  mutate(
    income = replace_na(income, median(income, na.rm = TRUE)),
    child_poverty = replace_na(child_poverty, median(child_poverty, na.rm = TRUE))
)

Ensuring the data is cleaned:

acs_county_data_clean %>% summarise(missing_income = sum(is.na(income)), missing_child_poverty = sum(is.na(child_poverty)))

colSums(is.na(acs_county_data_clean))

Question Four:

Use the “Summary()” function to examine any unusual values:

  • Based on the results from the “Summary()” function there does not appear to be any any unusual values. While some values are large, they are plausible in respects to their category. For example the “Men” and “Women” Max. row is much larger than the rest of the information, however but it is taking into account for the total. Below is a boxplot of the “Number of Men” category, showing how it appears to be an outlier but it could be plausible.
summary(acs_county_data_clean)

boxplot(acs_county_data_clean$men, main = "Number of Men")
Boxplot: Number of Men
Boxplot: Number of Men

Question Five:

How many counties have more women than men?

  • There are 1,985 counties that have more women than men.
acs_county_data_clean %>% filter(women > men) %>% nrow()

[1] 1985

Question Six:

How many counties have an unemployment rate lower than 10%?

  • There are 2,420 counties that have an unemployment rate lower than 10%.
acs_county_data_clean %>% filter(unemployment < 10) %>% nrow()

[1] 2420

Question Seven:

What are the top 10 counties with the highest mean commute?

acs_county_data_clean %>%
select(census_id, county, state, mean_commute) %>%
top_n(10, mean_commute) %>% 
arrange(desc(mean_commute))

Question Eight:

Calculate the percentage of women for each county: and then find the top 10 counties with the lowest percentages

acs_women_pct <- acs_county_data_clean %>% 
mutate(pct_women = (women / total_pop) * 100)

Then find the top 10 counties with the lowest percentages:

acs_women_pct %>% 
select (census_id, county, state, pct_women) %>% 
arrange(pct_women) %>% 
slice(1:10)

Question Nine:

Create a new variable that measures sum of all race percentage variables:

acs_race_pct <- acs_county_data_clean %>% 
mutate(race_sum = hispanic + white + black + native + asian + pacific)

What are the top 10 counties with the lowest sum of these race percentage variables?

acs_race_pct %>% 
select(census_id, county, state, race_sum) %>% 
arrange(race_sum) %>% 
slice(1:10)

Which state, on average, has the lowest sum of these race percentage variables?

  • The state, on average, that has the lowest sum of these race percentage variables is Hawaii.
acs_race_pct %>% 
group_by(state) %>% 
summarise(avg_race_sum = mean(race_sum, na.rm = TRUE)) %>% 
arrange(avg_race_sum) %>% 
slice(1)

Do any counties have a sum greater than 100%?

  • There are 11 counties that have a sum greater than 100%.
acs_race_pct %>% 
filter(race_sum > 100) %>% 
nrow()

[1] 11

How many states have a sum that equals exactly to 100%?

  • There are 13 states that have a sum that equals exactly to 100%.
acs_race_pct %>% 
filter(race_sum == 100) %>% 
distinct(state) %>% 
nrow()

[1] 13

Question Ten:

Create a new variable called “carpool_rank”:

acs_carpool <- acs_county_data_clean %>% 
mutate(carpool_rank = min_rank(desc(carpool)))

FInid the 10 highest ranked counties for carpooling:

acs_carpool %>% 
select(census_id, county, state, carpool, carpool_rank) %>% 
arrange(carpool_rank) %>% 
slice(1:10)

Find the 10 lowest ranked counties for carpooling:

acs_carpool %>% 
arrange(desc(carpool_rank)) %>% 
select(census_id, county, state, carpool, carpool_rank) %>% 
slice(1:10)

On average, what state is the best ranked for carpooling?

  • On average, Arizona is the best state ranked for carpooling.
acs_carpool %>% 
group_by(state) %>% 
summarise(avg_rank = mean(carpool_rank, na.rm = TRUE)) %>% 
arrange(avg_rank) %>% 
slice(1)
  state   avg_rank
  <chr>      <dbl>
1 Arizona     971.

What are the top 5 states for carpooling?

acs_carpool %>% 
group_by(state) %>% 
summarise(avg_rank = mean(carpool_rank, na.rm = TRUE)) %>% 
arrange(avg_rank) %>% 
slice(1:5)
  state    avg_rank
  <chr>       <dbl>
1 Arizona      971.
2 Utah        1019.
3 Arkansas    1055.
4 Hawaii      1072.
5 Alaska      1087.