Question 1 Explanation, Code, and Output

First, I set my working directory to the folder containing acs_2015_county_data_revised.csv and load the Tidyverse package. I then import the data set using a Tidyverse function and name it county_data. Finally, I examine the structure of the data set and find that it has 3,142 rows and 35 columns. The code below sets my working directory and loads the Tidyverse package; the output that follows is displayed when the package is loaded.

# Using the below code, I set my working directory.
setwd("C:/Users/richa/Dropbox/My PC (DESKTOP-B9LT0L1)/Documents/Data Wrangling/Week 4/homework3")

# Using the below code, I load the Tidyverse package.
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

The code below imports the data set contained in acs_2015_county_data_revised.csv. The output that follows is displayed when the data set is imported.

# Using the below code, I import the data set using a Tidyverse function, and I name the data set.
county_data <- read_csv("acs_2015_county_data_revised.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   state = col_character(),
##   county = col_character()
## )
## i Use `spec()` for the full column specifications.

The code below displays the structure of the data set, and the output follows.

# Using the below code, I examine the structure of the data set.
str(county_data)

## tibble [3,142 x 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ census_id     : num [1:3142] 1001 1003 1005 1007 1009 ...
##  $ state         : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ county        : chr [1:3142] "Autauga" "Baldwin" "Barbour" "Bibb" ...
##  $ total_pop     : num [1:3142] 55221 195121 26932 22604 57710 ...
##  $ men           : num [1:3142] 26745 95314 14497 12073 28512 ...
##  $ women         : num [1:3142] 28476 99807 12435 10531 29198 ...
##  $ hispanic      : num [1:3142] 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
##  $ white         : num [1:3142] 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
##  $ black         : num [1:3142] 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
##  $ native        : num [1:3142] 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
##  $ asian         : num [1:3142] 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
##  $ pacific       : num [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
##  $ citizen       : num [1:3142] 40725 147695 20714 17495 42345 ...
##  $ income        : num [1:3142] 51281 50254 32964 38678 45813 ...
##  $ income_per_cap: num [1:3142] 24974 27317 16824 18431 20532 ...
##  $ poverty       : num [1:3142] 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
##  $ child_poverty : num [1:3142] 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
##  $ professional  : num [1:3142] 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
##  $ service       : num [1:3142] 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
##  $ office        : num [1:3142] 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
##  $ construction  : num [1:3142] 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
##  $ production    : num [1:3142] 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
##  $ drive         : num [1:3142] 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
##  $ carpool       : num [1:3142] 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
##  $ transit       : num [1:3142] 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
##  $ walk          : num [1:3142] 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
##  $ other_transp  : num [1:3142] 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
##  $ work_at_home  : num [1:3142] 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
##  $ mean_commute  : num [1:3142] 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
##  $ employed      : num [1:3142] 23986 85953 8597 8294 22189 ...
##  $ private_work  : num [1:3142] 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
##  $ public_work   : num [1:3142] 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
##  $ self_employed : num [1:3142] 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
##  $ family_work   : num [1:3142] 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
##  $ unemployment  : num [1:3142] 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   census_id = col_double(),
##   ..   state = col_character(),
##   ..   county = col_character(),
##   ..   total_pop = col_double(),
##   ..   men = col_double(),
##   ..   women = col_double(),
##   ..   hispanic = col_double(),
##   ..   white = col_double(),
##   ..   black = col_double(),
##   ..   native = col_double(),
##   ..   asian = col_double(),
##   ..   pacific = col_double(),
##   ..   citizen = col_double(),
##   ..   income = col_double(),
##   ..   income_per_cap = col_double(),
##   ..   poverty = col_double(),
##   ..   child_poverty = col_double(),
##   ..   professional = col_double(),
##   ..   service = col_double(),
##   ..   office = col_double(),
##   ..   construction = col_double(),
##   ..   production = col_double(),
##   ..   drive = col_double(),
##   ..   carpool = col_double(),
##   ..   transit = col_double(),
##   ..   walk = col_double(),
##   ..   other_transp = col_double(),
##   ..   work_at_home = col_double(),
##   ..   mean_commute = col_double(),
##   ..   employed = col_double(),
##   ..   private_work = col_double(),
##   ..   public_work = col_double(),
##   ..   self_employed = col_double(),
##   ..   family_work = col_double(),
##   ..   unemployment = col_double()
##   .. )
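
As a quick cross-check on the dimensions reported by str(), the row and column counts can also be read directly. The sketch below is a minimal illustration on a small data frame built from the first three rows shown above; toy_data is a hypothetical stand-in, not the full ACS file, and the same calls would apply to county_data.

```r
# toy_data is a hypothetical stand-in for county_data,
# built from the first three rows of the str() output.
toy_data <- data.frame(
  census_id = c(1001, 1003, 1005),
  state     = c("Alabama", "Alabama", "Alabama"),
  county    = c("Autauga", "Baldwin", "Barbour")
)

# dim() returns the number of rows followed by the number of columns.
toy_dims <- dim(toy_data)
print(toy_dims)

# nrow() and ncol() return each count separately.
n_rows <- nrow(toy_data)
n_cols <- ncol(toy_data)
```

Applied to county_data, dim() should agree with the 3,142 x 35 reported in the first line of the str() output.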

Question 2 Explanation, Code, and Output

The code below examines the structure of the data set to determine the type of each variable, and the output follows. The state and county variables both contain character strings, and the output confirms that each is assigned the character type. The variables hispanic, white, black, native, asian, pacific, poverty, child_poverty, professional, service, office, construction, production, drive, carpool, transit, walk, other_transp, work_at_home, mean_commute, private_work, public_work, self_employed, family_work, and unemployment each contain decimal numbers and hence should be of type double; the output shows that this is the case. The variables income and income_per_cap are also of type double. In the CSV file, these variables appear to contain only integers. However, an income could in principle be a decimal number involving cents, so we do not change the type of either variable. The variable employed also appears to contain only integers. According to the data dictionary, this variable should be a percentage, but the dictionary evidently defines it incorrectly: many values in the “employed” column are far above 100, so they cannot be percentages. The output shows that “employed” is of type double. Although its values appear to be integers, we leave the type as double, since that is the type the variable would need if the data dictionary were accurate.

# Using the below code, I examine the structure of the data set.
str(county_data)

## tibble [3,142 x 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ census_id     : num [1:3142] 1001 1003 1005 1007 1009 ...
##  $ state         : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ county        : chr [1:3142] "Autauga" "Baldwin" "Barbour" "Bibb" ...
##  $ total_pop     : num [1:3142] 55221 195121 26932 22604 57710 ...
##  $ men           : num [1:3142] 26745 95314 14497 12073 28512 ...
##  $ women         : num [1:3142] 28476 99807 12435 10531 29198 ...
##  $ hispanic      : num [1:3142] 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
##  $ white         : num [1:3142] 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
##  $ black         : num [1:3142] 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
##  $ native        : num [1:3142] 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
##  $ asian         : num [1:3142] 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
##  $ pacific       : num [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
##  $ citizen       : num [1:3142] 40725 147695 20714 17495 42345 ...
##  $ income        : num [1:3142] 51281 50254 32964 38678 45813 ...
##  $ income_per_cap: num [1:3142] 24974 27317 16824 18431 20532 ...
##  $ poverty       : num [1:3142] 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
##  $ child_poverty : num [1:3142] 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
##  $ professional  : num [1:3142] 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
##  $ service       : num [1:3142] 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
##  $ office        : num [1:3142] 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
##  $ construction  : num [1:3142] 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
##  $ production    : num [1:3142] 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
##  $ drive         : num [1:3142] 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
##  $ carpool       : num [1:3142] 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
##  $ transit       : num [1:3142] 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
##  $ walk          : num [1:3142] 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
##  $ other_transp  : num [1:3142] 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
##  $ work_at_home  : num [1:3142] 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
##  $ mean_commute  : num [1:3142] 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
##  $ employed      : num [1:3142] 23986 85953 8597 8294 22189 ...
##  $ private_work  : num [1:3142] 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
##  $ public_work   : num [1:3142] 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
##  $ self_employed : num [1:3142] 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
##  $ family_work   : num [1:3142] 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
##  $ unemployment  : num [1:3142] 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   census_id = col_double(),
##   ..   state = col_character(),
##   ..   county = col_character(),
##   ..   total_pop = col_double(),
##   ..   men = col_double(),
##   ..   women = col_double(),
##   ..   hispanic = col_double(),
##   ..   white = col_double(),
##   ..   black = col_double(),
##   ..   native = col_double(),
##   ..   asian = col_double(),
##   ..   pacific = col_double(),
##   ..   citizen = col_double(),
##   ..   income = col_double(),
##   ..   income_per_cap = col_double(),
##   ..   poverty = col_double(),
##   ..   child_poverty = col_double(),
##   ..   professional = col_double(),
##   ..   service = col_double(),
##   ..   office = col_double(),
##   ..   construction = col_double(),
##   ..   production = col_double(),
##   ..   drive = col_double(),
##   ..   carpool = col_double(),
##   ..   transit = col_double(),
##   ..   walk = col_double(),
##   ..   other_transp = col_double(),
##   ..   work_at_home = col_double(),
##   ..   mean_commute = col_double(),
##   ..   employed = col_double(),
##   ..   private_work = col_double(),
##   ..   public_work = col_double(),
##   ..   self_employed = col_double(),
##   ..   family_work = col_double(),
##   ..   unemployment = col_double()
##   .. )
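
To make the argument about “employed” concrete, the following minimal sketch uses the first three rows of the output above. It assumes, purely for illustration, that “employed” is a head count; under that assumption, dividing by total_pop gives the kind of value a true percentage column would contain. The name employed_pct is hypothetical and is not a column in the data set.

```r
# First three values of employed and total_pop, copied from the str() output.
employed  <- c(23986, 85953, 8597)
total_pop <- c(55221, 195121, 26932)

# A percentage cannot exceed 100, yet every one of these values does.
exceeds_100 <- employed > 100

# None of the values has a fractional part, which is consistent with a count.
is_whole <- employed == round(employed)

# Assumption: if employed is a count, this is the corresponding percentage.
employed_pct <- 100 * employed / total_pop
print(round(employed_pct, 1))
```

The derived values fall between 0 and 100, which is what the data dictionary's description would require.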

We now discuss the changes that we make to the variable types. The variable census_id is numeric. However, even when IDs are numbers, it is common practice to store them as character strings, because IDs are generally not used in mathematical calculations. Hence we change the type of census_id to character. The variables citizen, total_pop, men, and women are counts of people and hence integers. However, the above output shows that they are stored as generic numeric values (doubles) rather than as integers. Hence we change the type of citizen, total_pop, men, and women to integer. The code below makes all of the changes described in this paragraph. We then use the glimpse() function to check the result, and the output confirms that the variable types have been changed as intended.

# Using the below code, we change the type of census_id to character.
county_data$census_id <- as.character(county_data$census_id)

# Using the below code, we change the type of citizen to integer.
county_data$citizen <- as.integer(county_data$citizen)

# Using the below code, we change the type of total_pop to integer.
county_data$total_pop <- as.integer(county_data$total_pop)

# Using the below code, we change the type of men to integer.
county_data$men <- as.integer(county_data$men)

# Using the below code, we change the type of women to integer.
county_data$women <- as.integer(county_data$women)

# Using the below code, we gain a glimpse of the structure of the data.
glimpse(county_data)

## Rows: 3,142
## Columns: 35
## $ census_id      <chr> "1001", "1003", "1005", "1007", "1009", "1011", "101...
## $ state          <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama...
## $ county         <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "...
## $ total_pop      <int> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 11...
## $ men            <int> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274...
## $ women          <int> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 6037...
## $ hispanic       <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7....
## $ white          <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3...
## $ black          <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, ...
## $ native         <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0....
## $ asian          <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0....
## $ pacific        <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....
## $ citizen        <int> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 886...
## $ income         <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 417...
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 213...
## $ poverty        <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6...
## $ child_poverty  <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2...
## $ professional   <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3...
## $ service        <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5...
## $ office         <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3...
## $ construction   <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5,...
## $ production     <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4...
## $ drive          <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1...
## $ carpool        <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 1...
## $ transit        <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0....
## $ walk           <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1....
## $ other_transp   <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1....
## $ work_at_home   <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1....
## $ mean_commute   <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1...
## $ employed       <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, ...
## $ private_work   <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1...
## $ public_work    <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1...
## $ self_employed  <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4....
## $ family_work    <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0....
## $ unemployment   <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9...
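
Because as.integer() silently truncates any fractional part, it can be worth confirming that the count columns hold only whole numbers before converting them. The sketch below is one possible way to do this and to convert several columns in one step; toy_data and count_cols are hypothetical names, and the values are the first three rows shown above.

```r
# toy_data is a hypothetical stand-in for county_data.
toy_data <- data.frame(
  census_id = c(1001, 1003, 1005),
  total_pop = c(55221, 195121, 26932),
  men       = c(26745, 95314, 14497),
  women     = c(28476, 99807, 12435),
  citizen   = c(40725, 147695, 20714)
)

count_cols <- c("total_pop", "men", "women", "citizen")

# Confirm that no count column has a fractional part before converting,
# because as.integer() truncates toward zero without a warning.
all_whole <- all(sapply(toy_data[count_cols], function(x) all(x == round(x))))
stopifnot(all_whole)

# Convert all count columns in one step, then the ID column.
toy_data[count_cols] <- lapply(toy_data[count_cols], as.integer)
toy_data$census_id   <- as.character(toy_data$census_id)

# Collect the resulting class of each column.
col_types <- sapply(toy_data, class)
print(col_types)
```

This is equivalent in effect to the column-by-column assignments used above; it only adds the whole-number check and groups the conversions.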

Question 3 Explanation, Code, and Output

The code below displays the number of missing values in each variable, and the output follows.

# Using the below code, we find the number of missing values in each variable.
colSums(is.na(county_data))

##      census_id          state         county      total_pop            men 
##              0              0              0              0              0 
##          women       hispanic          white          black         native 
##              0              0              0              0              0 
##          asian        pacific        citizen         income income_per_cap 
##              0              0              0              1              0 
##        poverty  child_poverty   professional        service         office 
##              0              1              0              0              0 
##   construction     production          drive        carpool        transit 
##              0              0              0              0              0 
##           walk   other_transp   work_at_home   mean_commute       employed 
##              0              0              0              0              0 
##   private_work    public_work  self_employed    family_work   unemployment 
##              0              0              0              0              0

Based on the above output, the only variables with missing values are “income” and “child_poverty”, and each has exactly one missing value. Each observation in the data set represents a unique county. Hence if we removed the observations with missing values, our answers to questions 5 through 10 might be inaccurate. Take for example question 5, which asks for the number of counties having more women than men: if a removed county had more women than men, our count would be too low. Hence rather than removing observations, we impute the missing values. When the data contain several outliers, the mean can be less representative of the majority of the data than the median. Hence we impute each missing value with the median (rather than the mean) of the corresponding column. That is, we replace the missing value in “income” with the median income, and we replace the missing value in “child_poverty” with the median of “child_poverty”. We create a new data set called county_data_2 containing these replacements. The code below performs the replacements described in this paragraph.

# Using the below code, we create a new data set called county_data_2 which is identical to county_data.
county_data_2 <- county_data

# Using the below code, we replace the missing value for income with the median income.
county_data_2$income[which(is.na(county_data_2$income))] <- median(county_data$income, na.rm = TRUE)

# Using the below code, we replace the missing value for child_poverty with the median value for child_poverty.
county_data_2$child_poverty[which(is.na(county_data_2$child_poverty))] <- median(county_data$child_poverty, na.rm = TRUE)

The code below produces output confirming that county_data_2 no longer contains any missing values.

# Using the below code, we display the number of missing values in each variable.
colSums(is.na(county_data_2))

##      census_id          state         county      total_pop            men 
##              0              0              0              0              0 
##          women       hispanic          white          black         native 
##              0              0              0              0              0 
##          asian        pacific        citizen         income income_per_cap 
##              0              0              0              0              0 
##        poverty  child_poverty   professional        service         office 
##              0              0              0              0              0 
##   construction     production          drive        carpool        transit 
##              0              0              0              0              0 
##           walk   other_transp   work_at_home   mean_commute       employed 
##              0              0              0              0              0 
##   private_work    public_work  self_employed    family_work   unemployment 
##              0              0              0              0              0
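
The reason for preferring the median can be illustrated with a small made-up example. The sketch below is not the code used for the homework; impute_median is a hypothetical helper and the income values are invented so that one of them is far larger than the rest.

```r
# Hypothetical helper: replace NA entries of a numeric vector with its median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Made-up incomes with one missing value and one very large value.
income <- c(30000, 40000, 45000, NA, 50000, 500000)

# The mean is pulled upward by the single large value.
mean_income <- mean(income, na.rm = TRUE)

# The median stays near the bulk of the data.
median_income <- median(income, na.rm = TRUE)

# The missing entry is filled with the median; other entries are unchanged.
income_imputed <- impute_median(income)
print(income_imputed)
```

In this example the mean is 133000 while the median is 45000, so imputing with the mean would assign the missing county a value larger than four of the five observed values.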

Question 4 Explanation, Code, and Output

The output below summarizes the variable total_pop. The maximum value is far above the 3rd quartile, and the minimum value is also some distance below the 1st quartile. This is most likely because most counties have neither a very large nor a very small population. However, the minimum and maximum values seem reasonable, given that the USA has both counties with extremely high populations and rural counties with very low populations. For instance, the city where I grew up has a total population of only 150. Since these values do not appear unusual, we do not remove any outliers for this variable. The code below produces the summary.

# Using the below code, we display a summary for the variable total_pop.
summary(county_data_2$total_pop)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##       85    11028    25768   100737    67552 10038388
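
One common way to quantify how far the extremes lie from the quartiles is the 1.5 x IQR convention used for boxplot whiskers. This rule is not part of the original analysis; the sketch below simply applies it to the values printed in the summary above, which may be rounded.

```r
# Quartiles and extremes of total_pop as printed by summary() above
# (printed values may be rounded).
q1      <- 11028
q3      <- 67552
min_pop <- 85
max_pop <- 10038388

# Interquartile range and the conventional 1.5 * IQR fences.
iqr         <- q3 - q1          # 56524
upper_fence <- q3 + 1.5 * iqr   # 152338
lower_fence <- q1 - 1.5 * iqr   # -73758

# Compare the extremes with the fences.
max_is_flagged <- max_pop > upper_fence  # TRUE
min_is_flagged <- min_pop < lower_fence  # FALSE
```

Under this convention the maximum is flagged and the minimum is not. Being flagged only marks a value as unusually far from the quartiles; it does not imply that the value is an error, which is consistent with the decision to keep it.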

The output below summarizes the variable men. The maximum value is far above the 3rd quartile, which suggests that there may be outliers. This is most likely because most counties do not have a very large population. However, the maximum value of “men” seems reasonable, given that some counties in the USA have extremely high populations. Since these values do not appear unusual, we do not remove any outliers for this variable. The code below produces the summary.

# Using the below code, we display a summary for the variable men.
summary(county_data_2$men)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      42    5546   12826   49565   33319 4945351

The output below summarizes the variable women. The maximum value is far above the 3rd quartile, which suggests that there may be outliers. This is most likely because most counties do not have a very large population. However, the maximum value of “women” seems reasonable, given that some counties in the USA have extremely high populations. Since these values do not appear unusual, we do not remove any outliers for this variable. The code below produces the summary.

# Using the below code, we display a summary for the variable women.
summary(county_data_2$women)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      43    5466   12907   51171   34122 5093037

The output below summarizes the variable hispanic. The maximum value of 98.7 is much larger than the 3rd quartile of 9.0, which suggests that there may be outliers in this variable. However, the maximum does not exceed 100 and hence is reasonable as a percentage. The code below produces the summary.

# Using the below code, we display a summary of the variable hispanic.
summary(county_data_2$hispanic)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.900   3.700   8.826   9.000  98.700

The output below lists the row numbers of the observations with a value of hispanic greater than 95. There are three such observations, so we do not suspect a single value that was entered in error. Instead, we believe that the outliers arise simply because Hispanic individuals are a minority in most (but not all) counties in the USA. The code below produces the output.

# Using the below code, we display the row number of the observations having a value for hispanic greater than 95.
which(county_data_2$hispanic > 95)

## [1] 2685 2737 2763
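
Row numbers alone do not identify the counties involved. One possible way to inspect them is to use the row numbers to subset the identifying columns, as in the minimal sketch below. The data frame toy_data is a hypothetical stand-in for county_data_2, and its values are made up.

```r
# toy_data is a hypothetical stand-in for county_data_2; the values are made up.
toy_data <- data.frame(
  state    = c("State A", "State A", "State B", "State B"),
  county   = c("County 1", "County 2", "County 3", "County 4"),
  hispanic = c(2.6, 96.1, 4.5, 98.7)
)

# Row numbers of the observations above the threshold.
high_rows <- which(toy_data$hispanic > 95)

# The same rows with their identifying columns, for easier inspection.
high_counties <- toy_data[high_rows, c("state", "county", "hispanic")]
print(high_counties)
```

Seeing the state and county names makes it easier to judge whether an extreme percentage is plausible for that location.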

The output below summarizes the variable white. The minimum value of 0.9 is much smaller than the 1st quartile of 65.6, which suggests that there may be outliers in this variable. However, the minimum lies between 0 and 100, as we would expect for a percentage. The code below produces the summary.

# Using the below code, we display a summary for the variable white.
summary(county_data_2$white)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.90   65.60   84.60   77.28   93.30   99.80
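
The check that a percentage lies between 0 and 100 can also be applied to several columns at once instead of reading each summary separately. The sketch below is a minimal illustration; toy_data and pct_cols are hypothetical names, and the values are the first five rows of three columns from the str() output.

```r
# toy_data is a hypothetical stand-in for county_data_2.
toy_data <- data.frame(
  hispanic = c(2.6, 4.5, 4.6, 2.2, 8.6),
  white    = c(75.8, 83.1, 46.2, 74.5, 87.9),
  black    = c(18.5, 9.5, 46.7, 21.4, 1.5)
)
pct_cols <- c("hispanic", "white", "black")

# For each column, count the values that fall outside the interval [0, 100].
out_of_range <- sapply(
  toy_data[pct_cols],
  function(x) sum(x < 0 | x > 100, na.rm = TRUE)
)
print(out_of_range)
```

A count of zero for every column indicates that no value is impossible as a percentage, although it says nothing about whether individual values are accurate.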

The output below lists the row numbers of the observations with a value of white less than 5. There are five such observations, so we do not suspect a single value that was entered in error. Instead, we believe that the outliers arise simply because white individuals are a majority in most (but not all) counties in the USA. The code below produces the output.

# Using the below code, we display the row number of the observations having a value for white less than 5.
which(county_data_2$white < 5)

## [1]   82 2413 2685 2737 2763

The output below summarizes the variable black. The maximum value of 85.9 is much larger than the 3rd quartile of 10.175, which suggests that there may be outliers in this variable. However, the maximum lies between 0 and 100, as we would expect for a percentage. The code below produces the summary.

# Using the below code, we display a summary for the variable black.
summary(county_data_2$black)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.600   2.100   8.879  10.175  85.900

The output below lists the row numbers of the observations with a value of black greater than 80. There are five such observations, so we do not suspect a single value that was entered in error. Instead, we believe that the outliers arise simply because Black individuals are a minority in most (but not all) counties in the USA. The code below produces the output.

# Using the below code, we display the row number of the observations having a value for black greater than 80.
which(county_data_2$black > 80)

## [1]   32   44 1412 1427 1433

The output below summarizes the variable native. The maximum value of 92.1 is much larger than the 3rd quartile of 0.6, which suggests that there may be outliers in this variable. However, the maximum lies between 0 and 100, as we would expect for a percentage. The code below produces the summary.

# Using the below code, we display a summary for the variable native.
summary(county_data_2$native)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.100   0.300   1.766   0.600  92.100

The output below lists the row numbers of the observations with a value of native greater than 85. There are two such observations, so we do not suspect a single value that was entered in error. Instead, we believe that the outliers arise simply because Native individuals are a minority in most (but not all) counties in the USA. The code below produces the output.

# Using the below code, we display the row number of the observations having a value for native greater than 85.
which(county_data_2$native > 85)

## [1]   82 2413

The output below summarizes the variable asian. The maximum value of 41.6 is much larger than the 3rd quartile of 1.2, which suggests that there may be outliers in this variable. However, the maximum lies between 0 and 100, as we would expect for a percentage. The code below produces the summary.

# Using the below code, we display a summary for the variable asian.
summary(county_data_2$asian)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.200   0.500   1.258   1.200  41.600

The output below lists the row numbers of the observations with a value of asian greater than 35. There are two such observations, so we do not suspect a single value that was entered in error. Instead, we believe that the outliers arise simply because Asian individuals are a minority in most (but not all) counties in the USA. The code below produces the output.

# Using the below code, we display the row number of the observations having a value for asian greater than 35.
which(county_data_2$asian > 35)

## [1]  69 548

The output below summarizes the variable pacific. The maximum value of 35.3 is much larger than the 3rd quartile of 0, which suggests that there may be outliers in this variable. However, the maximum lies between 0 and 100, as we would expect for a percentage. The distance between the maximum and the 3rd quartile can also be explained by the fact that in most (but not all) counties, Pacific Islander individuals are a very small minority. The code below produces the summary.

# Using the below code, we display a summary for the variable pacific.
summary(county_data_2$pacific)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08475  0.00000 35.30000

The output below summarizes the variable citizen. The maximum value is far above the 3rd quartile, which suggests that there may be outliers. This is most likely because most counties do not have a very large population. However, the maximum value of “citizen” seems reasonable, given that some counties in the USA have very high populations. Since these values do not appear unusual, we do not remove any outliers for this variable. The code below produces the summary.

# Using the below code, we display a summary for the variable citizen.
summary(county_data_2$citizen)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      80    8254   19434   70804   50728 6046749

The output below summarizes the variable income. The maximum value is far above the 3rd quartile, which suggests that there may be outliers. This is most likely because only a small number of counties have a much higher cost of living (and also a higher median income) than other counties in the USA. The code below produces the summary.

# Using the below code, we display a summary for the variable income.
summary(county_data_2$income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19328   38826   45111   46830   52249  123453

The output below lists the row numbers of the observations with a value of income greater than 120000. There are two such observations, so we do not suspect a single value that was entered in error. Instead, we believe that the outliers arise simply because the median income in a small number of counties is much larger than in most other counties in the USA. Hence we do not remove or alter observations for the variable income. The code below produces the output.

# Using the below code, we display the row number of the observations having a median income greater than 120000.
which(county_data_2$income > 120000)

## [1] 2873 2926

The below output displays a summary of the variable income_per_cap. We see that the maximum value is far above the 3rd quartile, which suggests that there may be outliers. This is most likely because only a small number of counties have a much higher cost of living (and also a higher income per capita) than most other counties in the USA. However, the maximum value for income per capita seems reasonable, and so we do not remove or alter any outliers for this variable. The below code is used to display the below summary.

# Using the below code, we display a summary of the income per capita.
summary(county_data_2$income_per_cap)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8292   20471   23577   24338   27138   65600

The below output displays a summary for the variable poverty. The maximum value of 53.3 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 53.3 is appropriately between 0 and 100, as we would expect for a percentage. Also, the distance between the maximum value and the 3rd quartile can be explained by the fact that there are only a small number of counties where poverty exceeds 45%. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable poverty.
summary(county_data_2$poverty)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.4    12.0    16.0    16.7    20.3    53.3

The below output displays the row number of the observations having a value for poverty greater than 45. We see that there are multiple observations having considerably high values for poverty, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high poverty levels. Hence we do not remove or alter observations for the variable poverty. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for poverty greater than 45.
which(county_data_2$poverty > 45)

## [1] 1131 1433 2377 2409 2413 2422
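Row numbers alone do not show which counties are involved. As a minimal sketch, the row numbers returned by which() can be used to index the data and display the identifying columns. The data frame toy and the name high_rows are hypothetical stand-ins, and all of the values are made up.

```r
# Hypothetical toy data standing in for county_data_2; all values are made up.
toy <- data.frame(
  state   = c("A", "A", "B", "B"),
  county  = c("North", "South", "East", "West"),
  poverty = c(12.0, 47.5, 20.3, 51.0)
)

# which() returns the row numbers that satisfy the condition.
high_rows <- which(toy$poverty > 45)

# Indexing with those row numbers shows which counties they are.
toy[high_rows, c("state", "county", "poverty")]
```

The same pattern could be applied to any of the which() calls in this section to inspect the flagged counties by name.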

The below output displays a summary for the variable child_poverty. The maximum value of 72.3 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 72.3 is appropriately between 0 and 100, as we would expect for a percentage. Also, the distance between the maximum value and the 3rd quartile can be explained by the fact that there are only a small number of counties where the percentage of children living in poverty is greater than 65%. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable child_poverty.
summary(county_data_2$child_poverty)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   16.10   22.50   23.29   29.48   72.30

The below output displays the row number of the observations having a value for child_poverty greater than 65. We see that there are multiple observations having considerably high values for child_poverty, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high child poverty levels. Hence we do not remove or alter observations for the variable child_poverty. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for child_poverty greater than 65.
which(county_data_2$child_poverty > 65)

## [1]   32 1131 2409

The below output displays a summary for the variable professional. The maximum value of 74 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 74 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable professional.
summary(county_data_2$professional)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.50   26.70   30.00   31.04   34.40   74.00

The below output displays the row number of the observations having a value for professional greater than 65. We see that there are multiple observations having high values for professional, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of professional employment. Hence we do not remove or alter observations for the variable professional. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for professional greater than 65.
which(county_data_2$professional > 65)

## [1] 1811 2827 2926

The below output displays a summary for the variable service, and based on this output we see that there do not appear to be any extremely unusual outliers for the variable service. All values in the summary are appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable service.
summary(county_data_2$service)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   15.90   18.00   18.26   20.20   36.60
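The "between 0 and 100" check that is repeated for each percentage variable could also be run over several columns at once. The following is only a sketch: the data frame toy is hypothetical, its column names mirror three variables from this report, and its values are made up.

```r
# Hypothetical toy data; the column names mirror the report, the values are made up.
toy <- data.frame(
  service = c(15.9, 18.0, 36.6),
  office  = c(4.1, 22.4, 35.4),
  walk    = c(1.4, 2.4, 71.2)
)

# For each column, check that every value lies in the interval [0, 100].
in_range <- sapply(toy, function(x) all(x >= 0 & x <= 100, na.rm = TRUE))
in_range
```

A column that returns FALSE would contain at least one value that cannot be a valid percentage and would deserve a closer look.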

The below output displays a summary for the variable office. The minimum value of 4.1 seems to be much smaller than the 1st quartile. This suggests that there may be outliers in this variable. However, the minimum value of 4.1 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable office.
summary(county_data_2$office)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.10   20.20   22.40   22.13   24.30   35.40

The below output displays the row number of the observations having a value for office smaller than 10. We see that there are multiple observations having small values for office, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such small percentages of office employment. Hence we do not remove or alter observations for the variable office. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for office smaller than 10.
which(county_data_2$office < 10)

## [1]   68 1604 1657 1659 1752

The below output displays a summary for the variable construction. The maximum value of 40.3 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 40.3 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable construction.
summary(county_data_2$construction)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.70    9.80   12.20   12.74   15.00   40.30

The below output displays the row numbers of the observations having a value for construction greater than 35. We see that two observations have such high values for construction, and hence we do not suspect that there is one single value that was entered in error. Instead, we believe that the outliers arise merely because there are only a small number of counties with such high percentages of construction employment. Hence we do not remove or alter observations for the variable construction. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for construction greater than 35.
which(county_data_2$construction > 35)

## [1] 1755 2658

The below output displays a summary for the variable production. The maximum value of 55.6 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 55.6 is appropriately between 0 and 100, as we would expect for a percentage. Hence we do not expect that these unusual values were entered in error. Instead, we believe that there are just a small number of counties with such high percentages of production employment. Hence we do not alter or remove observations for the variable production. Indeed, removing observations may prevent us from obtaining accurate answers to questions 5 through 10. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable production.
summary(county_data_2$production)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.53   15.40   15.82   19.40   55.60

The below output displays a summary for the variable drive. The minimum value of 5.2 seems to be much smaller than the 1st quartile. This suggests that there may be outliers in this variable. However, the minimum value of 5.2 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable drive.
summary(county_data_2$drive)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.20   76.60   80.60   79.08   83.60   94.60

The below output displays the row numbers of the observations having a value for drive less than 10. We see that two observations have such low values for drive, and hence we do not suspect that there is one single value that was entered in error. Instead, we believe that the outliers arise merely because there are only a small number of counties with such low percentages of individuals who drive to work. Hence we do not remove or alter observations for the variable drive. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for drive less than 10.
which(county_data_2$drive < 10)

## [1]   82 1859

The below output displays a summary for the variable carpool. The maximum value of 29.9 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 29.9 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable carpool.
summary(county_data_2$carpool)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.50    9.90   10.33   11.88   29.90

The below output displays the row number of the observations having a value for carpool greater than 25. We see that there are several observations having high values for carpool, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who carpool to work. Hence we do not remove or alter observations for the variable carpool. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for carpool greater than 25.
which(county_data_2$carpool > 25)

## [1] 417 469 741

The below output displays a summary for the variable transit. The maximum value of 61.7 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 61.7 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable transit.
summary(county_data_2$transit)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1000  0.4000  0.9675  0.8000 61.7000

The below output displays the row number of the observations having a value for transit greater than 55. We see that there are several observations having high values for transit, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who use public transportation to get to work. Hence we do not remove or alter observations for the variable transit. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for transit greater than 55.
which(county_data_2$transit > 55)

## [1] 1831 1852 1859

The below output displays a summary for the variable walk. The maximum value of 71.2 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 71.2 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable walk.
summary(county_data_2$walk)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.400   2.400   3.307   4.000  71.200

The below output displays the row number of the observations having a value for walk greater than 60. We see that only observation 68 has a value for walk greater than 60. Hence we suspect that this may be an extremely unusual outlier, and so we will examine this observation further. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for walk greater than 60.
which(county_data_2$walk > 60)

## [1] 68

Because the value for walk in observation 68 seems to be unusual, we determine if the sum of the variables drive, carpool, transit, walk, other_transp, and work_at_home is equal to 100 in row 68. The below output indicates that, as we should expect, the sum is indeed equal to 100. This suggests that although the value for walk in observation 68 seems to be unusual, it was probably not entered incorrectly. Hence we do not alter or remove this unusual observation. Indeed removing observations could prevent us from obtaining accurate answers in questions 5 through 10. The below code is used to create the below output.

# Using the below code, we display the sum of the variables drive, carpool, transit, walk, other_transp, and work_at_home in row 68.
county_data_2$drive[68] + county_data_2$carpool[68] + county_data_2$transit[68] + county_data_2$walk[68] + county_data_2$other_transp[68] + county_data_2$work_at_home[68]

## [1] 100
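The same consistency check could be applied to every row at once rather than to row 68 alone. The following is a minimal sketch using rowSums(). The data frame toy is a hypothetical stand-in for the six commute columns, its values are made up, and the tolerance of 0.5 is an assumed value rather than one taken from the data dictionary.

```r
# Hypothetical toy data standing in for county_data_2; all values are made up.
toy <- data.frame(
  drive        = c(80.1, 5.2),
  carpool      = c(10.2, 3.1),
  transit      = c(0.5, 2.0),
  walk         = c(3.0, 71.2),
  other_transp = c(1.5, 8.5),
  work_at_home = c(4.7, 10.0)
)

# Sum the six commute shares in every row at once.
row_totals <- rowSums(toy)

# Compare against 100 with a tolerance (0.5 is an assumed value), because the
# shares are rounded and floating-point sums are rarely exactly 100.
close_to_100 <- abs(row_totals - 100) < 0.5
close_to_100
```

Comparing with a tolerance instead of == avoids false alarms caused by rounding in the published percentages.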

The below output displays a summary for the variable other_transp. The maximum value of 39.1 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 39.1 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable other_transp.
summary(county_data_2$other_transp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.900   1.300   1.614   1.900  39.100

The below output displays the row numbers of the observations having a value for other_transp greater than 30. We see that two observations have such high values for other_transp, and hence we do not suspect that there is one single value that was entered in error. Instead, we believe that the outliers arise merely because there are only a small number of counties with such high percentages of individuals who go to work via other means. Hence we do not remove or alter observations for the variable other_transp. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for other_transp greater than 30.
which(county_data_2$other_transp > 30)

## [1] 82 83

The below output displays a summary for the variable work_at_home. The maximum value of 37.2 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 37.2 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable work_at_home.
summary(county_data_2$work_at_home)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.800   4.000   4.697   5.700  37.200

The below output displays the row number of the observations having a value for work_at_home greater than 30. We see that there are several observations having such high values for work_at_home, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who work at home. Hence we do not remove or alter observations for the variable work_at_home. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for work_at_home greater than 30.
which(county_data_2$work_at_home > 30)

## [1]  413 1568 1706

The below output displays a summary for the variable mean_commute. The maximum value of 44 seems to be much larger than the 3rd quartile, and the minimum value of 4.9 seems to be considerably smaller than the 1st quartile. However, this could be because there are only a small number of counties in the USA where the commute is generally either extremely long or extremely short. Given this, the minimum and maximum values for mean_commute seem reasonable, and so we do not alter or remove any outliers for this variable. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable mean_commute.
summary(county_data_2$mean_commute)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.90   19.30   22.90   23.15   26.60   44.00

The below output displays a summary for the variable employed. According to the data dictionary, the variable employed is supposed to be a percentage. However, based on the values in the below summary, we believe that it is instead the number of people who are employed in each county. We note that the minimum value is quite a bit smaller than the 1st quartile, and the maximum value is quite a bit larger than the 3rd quartile. This is likely because only a small number of counties in the USA have extremely large populations, and only a small number of counties have very small populations. Given this, the values in the below summary seem reasonable and not unusual. Hence we do not alter or remove any outliers for this variable. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable employed.
summary(county_data_2$employed)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      62    4524   10644   46387   29254 4635465

The below output displays a summary for the variable private_work. The minimum value of 25 seems to be much smaller than the 1st quartile. This suggests that there may be outliers in this variable. However, the minimum value of 25 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable private_work.
summary(county_data_2$private_work)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   70.90   75.80   74.44   79.80   88.30

The below output displays the row numbers of the observations having a value for private_work less than 30. We see that two observations have such low values for private_work, and hence we do not suspect that there is one single value that was entered in error. Instead, we believe that the outliers arise merely because there are only a small number of counties with such low percentages of individuals who work in the private sector. Hence we do not remove or alter observations for the variable private_work. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for private_work less than 30.
which(county_data_2$private_work < 30)

## [1]  549 2413

The below output displays a summary for the variable public_work. The maximum value of 66.2 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 66.2 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable public_work.
summary(county_data_2$public_work)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.80   13.10   16.10   17.35   20.10   66.20

The below output displays the row number of the observations having a value for public_work greater than 60. We see that there are several observations having such high values for public_work, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who work in the public sector. Hence we do not remove or alter observations for the variable public_work. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for public_work greater than 60.
which(county_data_2$public_work > 60)

## [1]   96  549 2413

The below output displays a summary for the variable self_employed. The maximum value of 36.6 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 36.6 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable self_employed.
summary(county_data_2$self_employed)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.400   6.900   7.921   9.400  36.600

The below output displays the row number of the observations having a value for self_employed greater than 30. We see that there are several observations having such high values for self_employed, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who are self-employed. Hence we do not remove or alter observations for the variable self_employed. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for self_employed greater than 30.
which(county_data_2$self_employed > 30)

## [1] 1604 1615 1652 1706 2009

The below output displays a summary for the variable family_work. The maximum value of 9.8 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 9.8 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable family_work.
summary(county_data_2$family_work)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1000  0.2000  0.2915  0.3000  9.8000

The below output displays the row number of the observations having a value for family_work greater than 5. We see that there are multiple observations having such high values for family_work, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who are engaged in family work. Hence we do not remove or alter observations for the variable family_work. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for family_work greater than 5.
which(county_data_2$family_work > 5)

## [1]  239 1601 1624 1652

The below output displays a summary for the variable unemployment. The maximum value of 29.4 seems to be much larger than the 3rd quartile. This suggests that there may be outliers in this variable. However, the maximum value of 29.4 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.

# Using the below code, we display a summary of the variable unemployment.
summary(county_data_2$unemployment)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.500   7.500   7.815   9.700  29.400

The below output displays the row number of the observations having a value for unemployment greater than 25. We see that there are multiple observations having such high values for unemployment, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who are unemployed. Hence we do not remove or alter observations for the variable unemployment. The below code is used to create the below output.

# Using the below code, we display the row number of the observations having a value for unemployment greater than 25.
which(county_data_2$unemployment > 25)

## [1]   82  258 1428 1461 2377 2413 2422 2428

The below output displays a summary of the variable census_id. We see that there are 3142 observations of this variable, and it is of class character. The code that we used to create the below output is shown directly below.

# Using the below code, we display a summary of the variable census_id.
summary(county_data_2$census_id)

##    Length     Class      Mode 
##      3142 character character

The below output displays a summary of the variable state. We see that there are 3142 observations of this variable, and it is of class character. The code that we used to create the below output is shown directly below.

# Using the below code, we display a summary of the variable state.
summary(county_data_2$state)

##    Length     Class      Mode 
##      3142 character character

The below output displays a summary of the variable county. We see that there are 3142 observations of this variable, and it is of class character. The code that we used to create the below output is shown directly below.

# Using the below code, we display a summary of the variable county.
summary(county_data_2$county)

##    Length     Class      Mode 
##      3142 character character

In conclusion, we did find some outliers in the data. However, these outliers were not so unusual as to make us suspect that they were entered incorrectly. As a consequence, we choose not to alter or remove them. Indeed, altering or removing them could prevent us from obtaining correct answers to questions 5 through 10.

Question 5 Explanation, Code, and Output

Before answering this question, we double-check that each row in the data set represents a unique county (i.e., no county occurs twice in the data set). The below output displays the number of unique counties in the data set, and we obtain this number using the below code.

# Using the below code, we create a dataset containing only the variables state and county.
county_data_3 <- select(county_data_2, state, county)

# Using the below code, we output the number of unique counties contained in the data set that we created.
nrow(unique(county_data_3))

## [1] 3142

We compare the number of unique counties to the total number of rows in the data set using the below code. The below output indicates that there are 3142 observations in the data set. We note that this is the same as the number of unique counties.

# Using the below code, we find the number of observations in the data set.
nrow(county_data_2)

## [1] 3142
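The same uniqueness check could also be written in one step with base R's anyDuplicated(), which returns 0 when no row repeats and otherwise the index of the first repeated row. The following is only a sketch. The data frames toy and toy_dup are hypothetical, and their values are made up.

```r
# Hypothetical toy data; the county name "North" appears in two different
# states, which is why state and county are checked together.
toy <- data.frame(
  state  = c("A", "A", "B"),
  county = c("North", "South", "North")
)

# Returns 0 because no (state, county) pair repeats.
n_dup <- anyDuplicated(toy[c("state", "county")])

# Appending a repeated pair makes the check return the row of the first repeat.
toy_dup <- rbind(toy, data.frame(state = "A", county = "North"))
first_dup <- anyDuplicated(toy_dup[c("state", "county")])

n_dup      # 0
first_dup  # 4
```

Checking the state and county columns together matters because the same county name can occur in more than one state.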

We create a data set containing only the counties having more women than men. We view the head of this data set in the output below, and we see that it does indeed appear to contain the counties with more women than men. The code to perform the operations mentioned in this paragraph is displayed directly below.

# Using the below code, we create a data set called county_data_4 which contains the observations with more women than men.
county_data_4 <- filter(county_data_2, women > men)

# Using the below code, we view the head of county_data_4 in order to check that it appears to contain the observations with more women than men.
head(county_data_4)

## # A tibble: 6 x 35
##   census_id state county total_pop   men women hispanic white black native asian
##   <chr>     <chr> <chr>      <int> <int> <int>    <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 1001      Alab~ Autau~     55221 26745 28476      2.6  75.8  18.5    0.4   1  
## 2 1003      Alab~ Baldw~    195121 95314 99807      4.5  83.1   9.5    0.6   0.7
## 3 1009      Alab~ Blount     57710 28512 29198      8.6  87.9   1.5    0.3   0.1
## 4 1013      Alab~ Butler     20354  9502 10852      1.2  53.3  43.8    0.1   0.4
## 5 1015      Alab~ Calho~    116648 56274 60374      3.5  73    20.3    0.2   0.9
## 6 1017      Alab~ Chamb~     34079 16258 17821      0.4  57.3  40.3    0.2   0.8
## # ... with 24 more variables: pacific <dbl>, citizen <int>, income <dbl>,
## #   income_per_cap <dbl>, poverty <dbl>, child_poverty <dbl>,
## #   professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## #   production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## #   other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, employed <dbl>,
## #   private_work <dbl>, public_work <dbl>, self_employed <dbl>,
## #   family_work <dbl>, unemployment <dbl>

The below output indicates that there are 1985 counties with more women than men. We obtain this value using the below code.

# Using the below code, we find the number of counties with more women than men by counting the number of observations in county_data_4.
nrow(county_data_4)

## [1] 1985
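A count like this can also be obtained without creating an intermediate data set, because a logical comparison yields TRUE or FALSE for each row and sum() counts the TRUE values. The following is a minimal sketch on a hypothetical data frame toy whose values are made up.

```r
# Hypothetical toy data standing in for county_data_2; all values are made up.
toy <- data.frame(
  men   = c(100, 250, 80, 40),
  women = c(120, 240, 95, 40)
)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of rows
# where women strictly exceeds men (the tie in row 4 is not counted).
n_more_women <- sum(toy$women > toy$men)
n_more_women  # 2
```

The filter() approach used above has the advantage of keeping the matching rows available for inspection with head().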

Question 6 Explanation, Code, and Output

We create a data set containing only the counties having an unemployment rate lower than 10%. We view the head of this data set in the output below, and we see that it does indeed appear to contain the counties with low unemployment rates. The code to perform the operations mentioned in this paragraph is displayed directly below.

# Using the below code, we create a data set called county_data_5 which contains the counties having an unemployment rate lower than 10%.
county_data_5 <- filter(county_data_2, unemployment < 10)

# Using the below code, we view the head of county_data_5 in order to check that it appears to contain the counties with an unemployment rate lower than 10%.
head(county_data_5)

## # A tibble: 6 x 35
##   census_id state county total_pop   men women hispanic white black native asian
##   <chr>     <chr> <chr>      <int> <int> <int>    <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 1001      Alab~ Autau~     55221 26745 28476      2.6  75.8  18.5    0.4   1  
## 2 1003      Alab~ Baldw~    195121 95314 99807      4.5  83.1   9.5    0.6   0.7
## 3 1007      Alab~ Bibb       22604 12073 10531      2.2  74.5  21.4    0.4   0.1
## 4 1009      Alab~ Blount     57710 28512 29198      8.6  87.9   1.5    0.3   0.1
## 5 1017      Alab~ Chamb~     34079 16258 17821      0.4  57.3  40.3    0.2   0.8
## 6 1019      Alab~ Chero~     26008 12975 13033      1.5  91.7   4.8    0.6   0.3
## # ... with 24 more variables: pacific <dbl>, citizen <int>, income <dbl>,
## #   income_per_cap <dbl>, poverty <dbl>, child_poverty <dbl>,
## #   professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## #   production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## #   other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, employed <dbl>,
## #   private_work <dbl>, public_work <dbl>, self_employed <dbl>,
## #   family_work <dbl>, unemployment <dbl>

The below output indicates that there are 2420 counties having an unemployment rate lower than 10%. We obtain this value using the below code.

# Using the below code, we find the number of counties with an unemployment rate lower than 10% by counting the number of observations in county_data_5.
nrow(county_data_5)

## [1] 2420
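
As a cross-check on the counting approach used above, the following sketch applies the same filter() and nrow() steps to a small made-up tibble and compares the result with a direct logical sum. The toy_counties data set and its values are hypothetical and are not taken from the ACS data. Note that filter() drops rows where the condition evaluates to NA, so the direct sum needs na.rm = TRUE in order to agree with it.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration only.
toy_counties <- tibble(
  county = c("A", "B", "C", "D"),
  unemployment = c(7.6, 17.6, NA, 9.9)
)

# filter() keeps rows where the condition is TRUE and drops FALSE and NA rows.
low_unemployment <- filter(toy_counties, unemployment < 10)
count_via_filter <- nrow(low_unemployment)

# Counting the TRUE values directly; na.rm = TRUE skips the missing value.
count_via_sum <- sum(toy_counties$unemployment < 10, na.rm = TRUE)
```

If the two counts ever disagree on real data, missing values in the filtered column would be one thing to check.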

Question 7 Explanation, Code, and Output

Using the below code, we access the documentation for the dplyr::top_n() function, and in so doing, we learn how to use the function to answer this problem.

# We use the below code to access the documentation for the dplyr::top_n() function.
?dplyr::top_n()

## starting httpd help server ... done

The below output displays the top 10 counties with the highest mean commute (sorted by mean_commute). Below is the code that we used to create this output.

# Using the below code, we create a new data set called county_data_6 containing only the top 10 counties with the highest mean commute.
county_data_6 <- dplyr::top_n(county_data_2, 10, mean_commute)

# Using the below code, we arrange county_data_6 by mean commute.
county_data_6 <- arrange(county_data_6, desc(mean_commute))

# Using the below code, we display the census ID, the county name, the state, and the mean commute of the 10 counties in county_data_6.
select(county_data_6, census_id, county, state, mean_commute)

## # A tibble: 10 x 4
##    census_id county       state         mean_commute
##    <chr>     <chr>        <chr>                <dbl>
##  1 42103     Pike         Pennsylvania          44  
##  2 36005     Bronx        New York              43  
##  3 24017     Charles      Maryland              42.8
##  4 51187     Warren       Virginia              42.7
##  5 36081     Queens       New York              42.6
##  6 36085     Richmond     New York              42.6
##  7 51193     Westmoreland Virginia              42.5
##  8 8093      Park         Colorado              42.4
##  9 36047     Kings        New York              41.7
## 10 54015     Clay         West Virginia         41.4

Based on the above output, we see that the 10 counties with the longest mean_commute are Pike county in Pennsylvania, Bronx county in New York, Charles county in Maryland, Warren county in Virginia, Queens county in New York, Richmond county in New York, Westmoreland county in Virginia, Park county in Colorado, Kings county in New York, and Clay county in West Virginia.
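
As an aside, the dplyr documentation lists top_n() as superseded by slice_max() and slice_min() as of dplyr 1.0.0. The sketch below compares the two calls on a small made-up tibble. The toy_counties data set and its values are hypothetical and are not taken from the ACS data, and we arrange both results explicitly rather than relying on either function's output order.

```r
library(dplyr)

# Hypothetical toy data standing in for county_data_2.
toy_counties <- tibble(
  county = c("A", "B", "C", "D", "E"),
  mean_commute = c(20.5, 44.0, 31.2, 42.8, 25.0)
)

# The approach used in this homework: top_n() followed by arrange().
top_3_old <- top_n(toy_counties, 3, mean_commute) %>%
  arrange(desc(mean_commute))

# The newer equivalent: slice_max() with the ordering column and n.
top_3_new <- slice_max(toy_counties, mean_commute, n = 3) %>%
  arrange(desc(mean_commute))
```

Both objects should contain the same three rows, so either function can be used to answer this question.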

Question 8 Explanation, Code, and Output

We create a new data set containing all of the 35 variables in county_data_2 and also a new variable called women_percent. The variable women_percent displays the percentage of the population that are women. In the below output, we view the head of this new data set in order to check that the new variable was appropriately added, and we see that the new data set now indeed contains 36 variables (one additional variable beyond the 35 contained in county_data_2). The code used to perform the operations described in this paragraph is shown directly below.

# Using the below code, we create a new data set called county_data_7 which contains an additional variable indicating the percentage of women in each county.
county_data_7 <- mutate(county_data_2, women_percent = women / total_pop * 100)

# Using the below code, we view the head of the new data set to check that the new variable was appropriately added.
head(county_data_7)

## # A tibble: 6 x 36
##   census_id state county total_pop   men women hispanic white black native asian
##   <chr>     <chr> <chr>      <int> <int> <int>    <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 1001      Alab~ Autau~     55221 26745 28476      2.6  75.8  18.5    0.4   1  
## 2 1003      Alab~ Baldw~    195121 95314 99807      4.5  83.1   9.5    0.6   0.7
## 3 1005      Alab~ Barbo~     26932 14497 12435      4.6  46.2  46.7    0.2   0.4
## 4 1007      Alab~ Bibb       22604 12073 10531      2.2  74.5  21.4    0.4   0.1
## 5 1009      Alab~ Blount     57710 28512 29198      8.6  87.9   1.5    0.3   0.1
## 6 1011      Alab~ Bullo~     10678  5660  5018      4.4  22.2  70.7    1.2   0.2
## # ... with 25 more variables: pacific <dbl>, citizen <int>, income <dbl>,
## #   income_per_cap <dbl>, poverty <dbl>, child_poverty <dbl>,
## #   professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## #   production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## #   other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, employed <dbl>,
## #   private_work <dbl>, public_work <dbl>, self_employed <dbl>,
## #   family_work <dbl>, unemployment <dbl>, women_percent <dbl>

The below output displays the 10 counties having the lowest percentage of women (sorted in ascending percentage of women). Below is the code that we used to create this output.

# Using the below code, we create a new data set containing only the 10 counties having the lowest percentage of women.
county_data_8 <- dplyr::top_n(county_data_7, -10, women_percent)

# Using the below code, we arrange county_data_8 by the percentage of women.
county_data_8 <- arrange(county_data_8, women_percent)

# Using the below code, we display the census ID, the county name, the state, and the percentage of women of the 10 counties in county_data_8.
select(county_data_8, census_id, county, state, women_percent)

## # A tibble: 10 x 4
##    census_id county                 state        women_percent
##    <chr>     <chr>                  <chr>                <dbl>
##  1 42053     Forest                 Pennsylvania          26.8
##  2 8011      Bent                   Colorado              31.4
##  3 51183     Sussex                 Virginia              31.5
##  4 13309     Wheeler                Georgia               32.1
##  5 6035      Lassen                 California            33.2
##  6 48095     Concho                 Texas                 33.3
##  7 13053     Chattahoochee          Georgia               33.4
##  8 2013      Aleutians East Borough Alaska                33.5
##  9 22125     West Feliciana         Louisiana             33.6
## 10 32027     Pershing               Nevada                33.7

Based on the above output, we see that the 10 counties with the lowest percentage of women are Forest county in Pennsylvania, Bent county in Colorado, Sussex county in Virginia, Wheeler county in Georgia, Lassen county in California, Concho county in Texas, Chattahoochee county in Georgia, Aleutians East Borough in Alaska, West Feliciana in Louisiana, and Pershing county in Nevada.

Question 9 Explanation, Code, and Output

We create a new data set containing all of the 36 variables in county_data_7 and also a new variable called race_sum. The variable race_sum displays the sum of all race percentage variables (i.e., the sum of the “hispanic”, “white”, “black”, “native”, “asian”, and “pacific” variables). In the below output, we view the head of this new data set in order to check that the new variable was appropriately added, and we see that the new data set contains 37 variables (one more than that of county_data_7). The code used to perform the operations described in this paragraph is shown directly below.

# Using the below code, we create a new data set called county_data_9 which contains an additional variable indicating the sum of all race percentage variables.
county_data_9 <- mutate(county_data_7, race_sum = hispanic + white + black + native + asian + pacific)

# Using the below code, we view the head of the new data set to check that the new variable was appropriately added.
head(county_data_9)

## # A tibble: 6 x 37
##   census_id state county total_pop   men women hispanic white black native asian
##   <chr>     <chr> <chr>      <int> <int> <int>    <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 1001      Alab~ Autau~     55221 26745 28476      2.6  75.8  18.5    0.4   1  
## 2 1003      Alab~ Baldw~    195121 95314 99807      4.5  83.1   9.5    0.6   0.7
## 3 1005      Alab~ Barbo~     26932 14497 12435      4.6  46.2  46.7    0.2   0.4
## 4 1007      Alab~ Bibb       22604 12073 10531      2.2  74.5  21.4    0.4   0.1
## 5 1009      Alab~ Blount     57710 28512 29198      8.6  87.9   1.5    0.3   0.1
## 6 1011      Alab~ Bullo~     10678  5660  5018      4.4  22.2  70.7    1.2   0.2
## # ... with 26 more variables: pacific <dbl>, citizen <int>, income <dbl>,
## #   income_per_cap <dbl>, poverty <dbl>, child_poverty <dbl>,
## #   professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## #   production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## #   other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, employed <dbl>,
## #   private_work <dbl>, public_work <dbl>, self_employed <dbl>,
## #   family_work <dbl>, unemployment <dbl>, women_percent <dbl>, race_sum <dbl>

Part a

The below output displays the 10 counties having the lowest values for the new variable race_sum (sorted by ascending values for race_sum). Below is the code that we used to create this output.

# Using the below code, we create a new data set containing only the 10 counties having the lowest values for race_sum.
county_data_10 <- dplyr::top_n(county_data_9, -10, race_sum)

# Using the below code, we arrange county_data_10 by the values for race_sum.
county_data_10 <- arrange(county_data_10, race_sum)

# Using the below code, we display the census ID, the county name, the state, and the race_sum of the 10 counties in county_data_10.
select(county_data_10, census_id, county, state, race_sum)

## # A tibble: 10 x 4
##    census_id county                   state     race_sum
##    <chr>     <chr>                    <chr>        <dbl>
##  1 15001     Hawaii                   Hawaii        76.4
##  2 15009     Maui                     Hawaii        79.2
##  3 40097     Mayes                    Oklahoma      79.7
##  4 15003     Honolulu                 Hawaii        81.5
##  5 40123     Pontotoc                 Oklahoma      82.8
##  6 47061     Grundy                   Tennessee     83. 
##  7 2282      Yakutat City and Borough Alaska        83.4
##  8 40069     Johnston                 Oklahoma      84  
##  9 15007     Kauai                    Hawaii        84.1
## 10 40003     Alfalfa                  Oklahoma      85.1

Based on the above output, we see that the 10 counties with the lowest values for race_sum are Hawaii county in Hawaii, Maui county in Hawaii, Mayes county in Oklahoma, Honolulu county in Hawaii, Pontotoc county in Oklahoma, Grundy county in Tennessee, Yakutat City and Borough in Alaska, Johnston county in Oklahoma, Kauai county in Hawaii, and Alfalfa county in Oklahoma.

Part b

We calculate the average of the county values for race_sum in each state, and we display these values in the column avg_race_sum in the output below. The below output is sorted by ascending avg_race_sum. Hence the first entry shown below is the state with the lowest value for avg_race_sum. That is, the state of Hawaii has, on average, the lowest sum of the race percentage variables. The below output was created using the below code.

# Using the below code, we group the data in county_data_9 by state.
by_state <- group_by(county_data_9, state)

# Using the below code, we display the average of the county values for race_sum in each state (sorted by ascending avg_race_sum).
arrange(summarise(by_state, avg_race_sum = mean(race_sum)), avg_race_sum)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 51 x 2
##    state                avg_race_sum
##    <chr>                       <dbl>
##  1 Hawaii                       84  
##  2 Alaska                       92.7
##  3 Oklahoma                     92.8
##  4 Washington                   96.7
##  5 California                   96.9
##  6 Oregon                       97.1
##  7 Delaware                     97.3
##  8 Massachusetts                97.5
##  9 Maryland                     97.6
## 10 District of Columbia         97.6
## # ... with 41 more rows
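
The same group, summarise, and arrange steps can also be written as a single pipeline. The sketch below does this on a small made-up tibble and sets the .groups argument mentioned in the message above so that the message is not printed. The toy data set and its values are hypothetical and are not taken from the ACS data.

```r
library(dplyr)

# Hypothetical toy data standing in for county_data_9.
toy <- tibble(
  state = c("X", "X", "Y", "Y", "Y"),
  race_sum = c(98, 100, 80, 85, 90)
)

# Group by state, average race_sum within each state, and sort ascending.
# .groups = "drop" returns an ungrouped tibble without the summarise() message.
state_avgs <- toy %>%
  group_by(state) %>%
  summarise(avg_race_sum = mean(race_sum), .groups = "drop") %>%
  arrange(avg_race_sum)
```

Here state Y (average 85) is listed before state X (average 99), mirroring how the state with the lowest avg_race_sum appears first in the output above.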

Part c

The below output displays the 11 counties with values for race_sum exceeding 100 (sorted by descending race_sum). These 11 counties are Claiborne county in Mississippi, Gosper county in Nebraska, Hooker county in Nebraska, Nance county in Nebraska, Bailey county in Texas, Duval county in Texas, Edwards county in Texas, Kenedy county in Texas, Kent county in Texas, Presidio county in Texas, and Beaver county in Utah. We create the below output using the below code.

# Using the below code, we display the 11 counties with values for race_sum exceeding 100 (sorted by descending race_sum).
arrange(select(filter(county_data_9, race_sum > 100), census_id, county, state, race_sum), desc(race_sum))

## # A tibble: 11 x 4
##    census_id county    state       race_sum
##    <chr>     <chr>     <chr>          <dbl>
##  1 31073     Gosper    Nebraska        100.
##  2 31091     Hooker    Nebraska        100.
##  3 48017     Bailey    Texas           100.
##  4 48137     Edwards   Texas           100.
##  5 31125     Nance     Nebraska        100.
##  6 28021     Claiborne Mississippi     100.
##  7 48131     Duval     Texas           100.
##  8 48261     Kenedy    Texas           100.
##  9 48263     Kent      Texas           100.
## 10 48377     Presidio  Texas           100.
## 11 49001     Beaver    Utah            100.

Part d

The below output indicates that there are no states whose avg_race_sum is exactly equal to 100. We create this output using the below code.

# Using the below code, we display the states whose avg_race_sum is exactly equal to 100.
filter(summarise(by_state, avg_race_sum = mean(race_sum)), avg_race_sum == 100)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 0 x 2
## # ... with 2 variables: state <chr>, avg_race_sum <dbl>

The below output displays the counties that have a value for race_sum which is exactly equal to 100. In the below output, we see that the states of Alabama, Georgia, Kansas, Kentucky, Mississippi, Montana, Nebraska, New Mexico, North Carolina, North Dakota, South Dakota, Texas, and West Virginia each have at least one county with a value for race_sum that is exactly equal to 100. This is a total of 13 states having at least one county with a value for race_sum that is exactly equal to 100. We create the below output using the following code.

# Using the below code, we display the counties that have a value for race_sum which is exactly equal to 100.
select(filter(county_data_9, race_sum == 100), census_id, county, state, race_sum)

## # A tibble: 27 x 4
##    census_id county    state       race_sum
##    <chr>     <chr>     <chr>          <dbl>
##  1 1065      Hale      Alabama          100
##  2 1131      Wilcox    Alabama          100
##  3 13201     Miller    Georgia          100
##  4 13307     Webster   Georgia          100
##  5 20199     Wallace   Kansas           100
##  6 21031     Butler    Kentucky         100
##  7 28125     Sharkey   Mississippi      100
##  8 30019     Daniels   Montana          100
##  9 30069     Petroleum Montana          100
## 10 30109     Wibaux    Montana          100
## # ... with 17 more rows
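
One caveat about testing for exact equality: race_sum is a sum of doubles, and == compares doubles bit for bit, so a sum that is mathematically equal to a target value can still fail the test because of floating point rounding. The sketch below illustrates this on made-up numbers (the toy data set is hypothetical, and we do not claim that any county in the ACS data is affected). It uses dplyr::near(), which compares two numbers within a small tolerance.

```r
library(dplyr)

# Hypothetical toy data; both rows sum to 0.3 mathematically.
toy <- tibble(
  county = c("A", "B"),
  part_1 = c(0.1, 0.15),
  part_2 = c(0.2, 0.15)
) %>%
  mutate(total = part_1 + part_2)

# Exact comparison: 0.1 + 0.2 is not stored as exactly 0.3, so row A is missed.
exact <- filter(toy, total == 0.3)

# Tolerant comparison: near() treats values within a small tolerance as equal.
tolerant <- filter(toy, near(total, 0.3))
```

On real data, comparing the row counts from the two filters is a quick way to see whether rounding changes the answer.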

The below code is used to create the below output listing the 13 unique states which we previously mentioned have at least one county whose race_sum is exactly equal to 100.

# Using the below code, we display the 13 states which have at least one county whose race_sum is exactly equal to 100.
unique(select(filter(county_data_9, race_sum == 100), state))

## # A tibble: 13 x 1
##    state         
##    <chr>         
##  1 Alabama       
##  2 Georgia       
##  3 Kansas        
##  4 Kentucky      
##  5 Mississippi   
##  6 Montana       
##  7 Nebraska      
##  8 New Mexico    
##  9 North Carolina
## 10 North Dakota  
## 11 South Dakota  
## 12 Texas         
## 13 West Virginia

The below output indicates that there are 13 states which have at least one county whose race_sum exactly equals 100. The below code is used to create the below output.

# Using the below code, we calculate the number of states which have at least one county whose race_sum exactly equals 100.
nrow(unique(select(filter(county_data_9, race_sum == 100), state)))

## [1] 13
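
An alternative way to list and count the distinct states is to use dplyr's distinct() and n_distinct() functions instead of unique() and nrow(). The sketch below shows this on a small made-up tibble. The toy data set and its values are hypothetical and are not taken from the ACS data.

```r
library(dplyr)

# Hypothetical toy data standing in for county_data_9.
toy <- tibble(
  state = c("X", "X", "Y", "Z"),
  race_sum = c(100, 100, 100, 98.5)
)

# One row per state that has at least one county with race_sum equal to 100.
states_at_100 <- toy %>%
  filter(race_sum == 100) %>%
  distinct(state)

# The number of such states.
n_states <- n_distinct(states_at_100$state)
```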

Question 10 Explanation, Code, and Output

Part a

Using the below code, we access the documentation for the dplyr::min_rank() function, and in so doing, we learn how to use the function to answer this problem.

# We use the below code to access the documentation for the dplyr::min_rank() function.
?dplyr::min_rank()

Using the below code, we create a new variable called carpool_rank to rank the counties according to their carpool values.

# Using the below code, we create a new variable to rank the counties according to their carpool values. 
carpool_rank <- dplyr::min_rank(desc(county_data_9$carpool))
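
To illustrate how min_rank() treats ties, the sketch below ranks a short made-up vector of carpool values (the numbers are hypothetical). Wrapping the vector in desc() gives the largest value rank 1, tied values share the smallest rank they would occupy, and the rank after a tie is skipped.

```r
library(dplyr)

# Hypothetical carpool percentages; the two 8.8 values are tied.
carpool <- c(10.9, 8.8, 8.8, 14.9, 2.8)

# Rank in descending order: 14.9 is rank 1, 10.9 is rank 2,
# both 8.8 values get rank 3, and 2.8 gets rank 5 (rank 4 is skipped).
ranks <- min_rank(desc(carpool))
```

This is why counties with equal carpool values share the same carpool_rank.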

Part b

We create a new data set called county_data_11 which contains the variable carpool_rank and all 37 variables contained in county_data_9. In the output below, we then view the head of the new data set to see that we successfully included the variable carpool_rank, and we see that as desired, the new data set does indeed have 38 variables (just one more than county_data_9). The below code is used to perform the operations mentioned in this paragraph.

# Using the below code, we create a new data set containing carpool_rank and all of the variables in county_data_9.
county_data_11 <- data.frame(county_data_9, carpool_rank)

# Using the below code, we view the head of the new data set.
head(county_data_11)

##   census_id   state  county total_pop   men women hispanic white black native
## 1      1001 Alabama Autauga     55221 26745 28476      2.6  75.8  18.5    0.4
## 2      1003 Alabama Baldwin    195121 95314 99807      4.5  83.1   9.5    0.6
## 3      1005 Alabama Barbour     26932 14497 12435      4.6  46.2  46.7    0.2
## 4      1007 Alabama    Bibb     22604 12073 10531      2.2  74.5  21.4    0.4
## 5      1009 Alabama  Blount     57710 28512 29198      8.6  87.9   1.5    0.3
## 6      1011 Alabama Bullock     10678  5660  5018      4.4  22.2  70.7    1.2
##   asian pacific citizen income income_per_cap poverty child_poverty
## 1   1.0       0   40725  51281          24974    12.9          18.6
## 2   0.7       0  147695  50254          27317    13.4          19.2
## 3   0.4       0   20714  32964          16824    26.7          45.3
## 4   0.1       0   17495  38678          18431    16.8          27.9
## 5   0.1       0   42345  45813          20532    16.7          27.2
## 6   0.2       0    8057  31938          17580    24.6          38.4
##   professional service office construction production drive carpool transit
## 1         33.2    17.0   24.2          8.6       17.1  87.5     8.8     0.1
## 2         33.1    17.7   27.1         10.8       11.2  84.7     8.8     0.1
## 3         26.8    16.1   23.1         10.8       23.1  83.8    10.9     0.4
## 4         21.5    17.9   17.8         19.0       23.7  83.2    13.5     0.5
## 5         28.5    14.1   23.9         13.5       19.9  84.9    11.2     0.4
## 6         18.8    15.0   19.7         20.1       26.4  74.9    14.9     0.7
##   walk other_transp work_at_home mean_commute employed private_work public_work
## 1  0.5          1.3          1.8         26.5    23986         73.6        20.9
## 2  1.0          1.4          3.9         26.4    85953         81.5        12.3
## 3  1.8          1.5          1.6         24.1     8597         71.8        20.8
## 4  0.6          1.5          0.7         28.8     8294         76.8        16.1
## 5  0.9          0.4          2.3         34.9    22189         82.0        13.5
## 6  5.0          1.7          2.8         27.5     3865         79.5        15.1
##   self_employed family_work unemployment women_percent race_sum carpool_rank
## 1           5.5         0.0          7.6      51.56734     98.3         2157
## 2           5.8         0.4          7.5      51.15134     98.4         2157
## 3           7.3         0.1         17.6      46.17184     98.1         1103
## 4           6.7         0.4          8.3      46.58910     98.6          391
## 5           4.2         0.4          7.7      50.59435     98.4          986
## 6           5.4         0.0         18.0      46.99382     98.7          204
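
Note that data.frame() returns a base data frame rather than a tibble, which is why the above output is printed in full instead of in the abbreviated tibble format. As a sketch of one alternative, mutate() can add the rank column while keeping the tibble class. The toy data set below is hypothetical and is not taken from the ACS data.

```r
library(dplyr)

# Hypothetical toy data standing in for county_data_9.
toy <- tibble(
  county = c("A", "B", "C"),
  carpool = c(8.8, 14.9, 8.8)
)

# Add the rank as a new column; the result is still a tibble.
toy_ranked <- mutate(toy, carpool_rank = min_rank(desc(carpool)))
```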

In the below output, the 10 highest ranked counties for carpooling are displayed. We create this output using the below code.

# Using the below code, we display the 10 highest ranked counties for carpooling.
arrange(filter(select(county_data_11, census_id, county, state, carpool, carpool_rank), carpool_rank < 11), carpool_rank)

##    census_id   county    state carpool carpool_rank
## 1      13061     Clay  Georgia    29.9            1
## 2      18087 LaGrange  Indiana    27.0            2
## 3      13165  Jenkins  Georgia    25.3            3
## 4       5133   Sevier Arkansas    24.4            4
## 5      20175   Seward   Kansas    23.4            5
## 6      48079  Cochran    Texas    22.8            6
## 7      48247 Jim Hogg    Texas    22.6            7
## 8      48393  Roberts    Texas    22.4            8
## 9      39075   Holmes     Ohio    21.8            9
## 10     21197   Powell Kentucky    21.6           10

Based on the above output, we see that the 10 highest ranked counties for carpooling are Clay county in Georgia, LaGrange county in Indiana, Jenkins county in Georgia, Sevier county in Arkansas, Seward county in Kansas, Cochran county in Texas, Jim Hogg county in Texas, Roberts county in Texas, Holmes county in Ohio, and Powell county in Kentucky.

Part c

In the below output, the 10 lowest ranked counties for carpooling are displayed. Note that due to a tie for carpool ranking between Hyde county in South Dakota and Norton city in Virginia for the 10th lowest spot, there are actually 11 counties displayed below. We create this output using the below code.

# Using the below code, we display the 10 lowest ranked counties for carpooling. Note that 11 end up needing to be displayed due to a tie.
arrange(filter(select(county_data_11, census_id, county, state, carpool, carpool_rank), carpool_rank > nrow(county_data_11) - 11), desc(carpool_rank))

##    census_id      county        state carpool carpool_rank
## 1      48261      Kenedy        Texas     0.0         3141
## 2      48269        King        Texas     0.0         3141
## 3      48235       Irion        Texas     0.9         3140
## 4      31183     Wheeler     Nebraska     1.3         3139
## 5      36061    New York     New York     1.9         3138
## 6      13309     Wheeler      Georgia     2.3         3136
## 7      38029      Emmons North Dakota     2.3         3136
## 8      30019     Daniels      Montana     2.6         3134
## 9      31057       Dundy     Nebraska     2.6         3134
## 10     46069        Hyde South Dakota     2.8         3132
## 11     51720 Norton city     Virginia     2.8         3132

Based on the above output, we see that Kenedy county in Texas, King county in Texas, Irion county in Texas, Wheeler county in Nebraska, New York county in New York, Wheeler county in Georgia, Emmons county in North Dakota, Daniels county in Montana, Dundy county in Nebraska, Hyde county in South Dakota, and Norton city in Virginia are the lowest ranked counties for carpooling.
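
The effect of a tie at the cutoff can be reproduced on a small made-up tibble. The sketch below uses dplyr's slice_min(), whose with_ties argument controls whether rows tied with the last selected row are also returned. This is a different technique from the rank filter used above, and the toy data set and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; two counties are tied for the 4th lowest carpool value.
toy <- tibble(
  county = c("A", "B", "C", "D", "E", "F"),
  carpool = c(0.0, 0.0, 0.9, 1.3, 1.3, 5.0)
)

# Keeping ties returns 5 rows even though n = 4 was requested.
lowest_with_ties <- slice_min(toy, carpool, n = 4, with_ties = TRUE)

# Dropping ties returns exactly 4 rows.
lowest_no_ties <- slice_min(toy, carpool, n = 4, with_ties = FALSE)
```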

Part d

We calculate the average of the county values for carpool_rank in each state, and we display these values in the column avg_carpool_rank in the output below. The below output is sorted by ascending avg_carpool_rank. Hence the first entry shown below is the state with the best average carpool rank. That is, the state of Arizona has, on average, the best carpool ranking. The below output was created using the below code.

# Using the below code, we group the data in county_data_11 by state.
by_state_3 <- group_by(county_data_11, state)

# Using the below code, we display the average of the county values for carpool_rank in each state (sorted by ascending avg_carpool_rank).
arrange(summarise(by_state_3, avg_carpool_rank = mean(carpool_rank)), avg_carpool_rank)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 51 x 2
##    state      avg_carpool_rank
##    <chr>                 <dbl>
##  1 Arizona                971.
##  2 Utah                  1019.
##  3 Arkansas              1055.
##  4 Hawaii                1072.
##  5 Alaska                1087.
##  6 Nevada                1100.
##  7 Texas                 1106.
##  8 Wyoming               1107.
##  9 California            1122.
## 10 Missouri              1133.
## # ... with 41 more rows

Part e

In the below output, we display the top 5 states for carpooling. We create this output using the below code.

# Using the below code, we display the top 5 states for carpooling.
head(arrange(summarise(by_state_3, avg_carpool_rank = mean(carpool_rank)), avg_carpool_rank), 5)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 5 x 2
##   state    avg_carpool_rank
##   <chr>               <dbl>
## 1 Arizona              971.
## 2 Utah                1019.
## 3 Arkansas            1055.
## 4 Hawaii              1072.
## 5 Alaska              1087.

Based on the above output, we see that Arizona, Utah, Arkansas, Hawaii, and Alaska are the top 5 states for carpooling.

Homework 3

Abigail Richard

11/5/2020
