First, I set my working directory to the folder containing acs_2015_county_data_revised.csv and load the Tidyverse package. I then import the data set using a Tidyverse function and name it county_data. Finally, I examine the structure of the data set, which shows that there are 3142 rows and 35 columns. The code below sets my working directory and loads the Tidyverse package, and the startup messages displayed when the package is loaded follow the code.
# Using the below code, I set my working directory.
setwd("C:/Users/richa/Dropbox/My PC (DESKTOP-B9LT0L1)/Documents/Data Wrangling/Week 4/homework3")
# Using the below code, I load the Tidyverse package.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
In the code below, I import the data set contained in acs_2015_county_data_revised.csv. The column specification message that follows the code is displayed when the data set is imported.
# Using the below code, I import the data set using a Tidyverse function, and I name the data set.
county_data <- read_csv("acs_2015_county_data_revised.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## state = col_character(),
## county = col_character()
## )
## i Use `spec()` for the full column specifications.
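The column specification message above reports the types that read_csv() guessed. As a hedged aside, types can also be declared at import time through the col_types argument, for example to read an ID column as character and a count column as integer. The sketch below is not part of the original analysis: the temporary file, the object names toy and toy_classes, and the choice of columns are assumptions made for illustration, and the three rows reuse values that appear elsewhere in this report.

```r
library(readr)

# Hypothetical three-row extract written to a temporary file so that the
# example does not depend on the real csv file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "census_id,state,county,total_pop,hispanic",
  "1001,Alabama,Autauga,55221,2.6",
  "1003,Alabama,Baldwin,195121,4.5",
  "1005,Alabama,Barbour,26932,4.6"
), tmp)

# Declare two types explicitly and let readr guess the remaining columns.
toy <- read_csv(
  tmp,
  col_types = cols(
    census_id = col_character(),
    total_pop = col_integer(),
    .default  = col_guess()
  )
)

# Named character vector holding the class of each column.
toy_classes <- vapply(toy, function(x) class(x)[1], character(1))
print(toy_classes)
```

Declaring types this way keeps the type decisions next to the import call, at the cost of a longer read_csv() call.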
I use the code below to display the structure of the data set.
# Using the below code, I examine the structure of the data set.
str(county_data)
## tibble [3,142 x 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ census_id : num [1:3142] 1001 1003 1005 1007 1009 ...
## $ state : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ county : chr [1:3142] "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ total_pop : num [1:3142] 55221 195121 26932 22604 57710 ...
## $ men : num [1:3142] 26745 95314 14497 12073 28512 ...
## $ women : num [1:3142] 28476 99807 12435 10531 29198 ...
## $ hispanic : num [1:3142] 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
## $ white : num [1:3142] 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
## $ black : num [1:3142] 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
## $ native : num [1:3142] 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
## $ asian : num [1:3142] 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
## $ pacific : num [1:3142] 0 0 0 0 0 0 0 0 0 0 ...
## $ citizen : num [1:3142] 40725 147695 20714 17495 42345 ...
## $ income : num [1:3142] 51281 50254 32964 38678 45813 ...
## $ income_per_cap: num [1:3142] 24974 27317 16824 18431 20532 ...
## $ poverty : num [1:3142] 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
## $ child_poverty : num [1:3142] 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
## $ professional : num [1:3142] 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
## $ service : num [1:3142] 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
## $ office : num [1:3142] 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
## $ construction : num [1:3142] 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
## $ production : num [1:3142] 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
## $ drive : num [1:3142] 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
## $ carpool : num [1:3142] 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
## $ transit : num [1:3142] 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
## $ walk : num [1:3142] 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
## $ other_transp : num [1:3142] 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
## $ work_at_home : num [1:3142] 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
## $ mean_commute : num [1:3142] 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
## $ employed : num [1:3142] 23986 85953 8597 8294 22189 ...
## $ private_work : num [1:3142] 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
## $ public_work : num [1:3142] 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
## $ self_employed : num [1:3142] 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
## $ family_work : num [1:3142] 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
## $ unemployment : num [1:3142] 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
## - attr(*, "spec")=
## .. cols(
## .. census_id = col_double(),
## .. state = col_character(),
## .. county = col_character(),
## .. total_pop = col_double(),
## .. men = col_double(),
## .. women = col_double(),
## .. hispanic = col_double(),
## .. white = col_double(),
## .. black = col_double(),
## .. native = col_double(),
## .. asian = col_double(),
## .. pacific = col_double(),
## .. citizen = col_double(),
## .. income = col_double(),
## .. income_per_cap = col_double(),
## .. poverty = col_double(),
## .. child_poverty = col_double(),
## .. professional = col_double(),
## .. service = col_double(),
## .. office = col_double(),
## .. construction = col_double(),
## .. production = col_double(),
## .. drive = col_double(),
## .. carpool = col_double(),
## .. transit = col_double(),
## .. walk = col_double(),
## .. other_transp = col_double(),
## .. work_at_home = col_double(),
## .. mean_commute = col_double(),
## .. employed = col_double(),
## .. private_work = col_double(),
## .. public_work = col_double(),
## .. self_employed = col_double(),
## .. family_work = col_double(),
## .. unemployment = col_double()
## .. )
We examine the structure of the data set with str() in order to learn the type of each variable. The state and county variables both contain character strings, and the str() output confirms that each of these two variables is assigned the character type. The variables hispanic, white, black, native, asian, pacific, poverty, child_poverty, professional, service, office, construction, production, drive, carpool, transit, walk, other_transp, work_at_home, mean_commute, private_work, public_work, self_employed, family_work, and unemployment each contain decimal numbers, and hence should be assigned the numeric type double. The str() output shows that this is indeed the case. The variables income and income_per_cap are also assigned the numeric type double. In the csv file, these variables appear to contain only integers. However, it is possible that some income values are decimal numbers involving cents, and so we do not change the type of either income variable. The variable employed also appears to contain only integers, although according to the data dictionary it should be a percentage. The data dictionary evidently defines this variable incorrectly, since many values in the “employed” column are far above 100 and therefore cannot be percentages. The str() output shows that the type of “employed” is numeric double. Although “employed” appears to contain integers, we choose to leave its type as numeric double, since this is the type that would be required if the data dictionary were accurate.
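The two observations about employed (whole numbers only, and values above 100) can be checked with a short sketch. The vector below holds only the first five values of employed as printed by str(), so it illustrates the check rather than repeating it on the full column; the object names all_whole and exceeds_100 are assumptions made for illustration.

```r
# First five values of employed, copied from the str() output.
employed <- c(23986, 85953, 8597, 8294, 22189)

# TRUE when every value is a whole number, even though the storage type is double.
all_whole <- all(employed == round(employed))

# TRUE when at least one value is too large to be a percentage.
exceeds_100 <- any(employed > 100)

print(c(all_whole = all_whole, exceeds_100 = exceeds_100))
```

On the real data, the same two expressions could be applied to county_data$employed.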
We now discuss the changes that we make to the types of the variables. The variable census_id is of type numeric. However, even when IDs are numbers, it is common to give them a character type, because we are generally not interested in using IDs in a mathematical calculation or equation. Hence in the below code, we change the type of the variable census_id to character. The variables citizen, total_pop, men, and women each represent a count of people, and hence they are integers. However, from the above output, we see that they are stored as type double rather than being specifically designated as integers. Hence in the below code, we change the type of the variables citizen, total_pop, men, and women to integer. The below code makes all of the changes specified in this paragraph. We then use the glimpse() function to check that we successfully changed the variable types, and the below output confirms that the variable types have been changed as intended.
# Using the below code, we change the type of census_id to character.
county_data$census_id <- as.character(county_data$census_id)
# Using the below code, we change the type of citizen to integer.
county_data$citizen <- as.integer(county_data$citizen)
# Using the below code, we change the type of total_pop to integer.
county_data$total_pop <- as.integer(county_data$total_pop)
# Using the below code, we change the type of men to integer.
county_data$men <- as.integer(county_data$men)
# Using the below code, we change the type of women to integer.
county_data$women <- as.integer(county_data$women)
# Using the below code, we gain a glimpse of the structure of the data.
glimpse(county_data)
## Rows: 3,142
## Columns: 35
## $ census_id <chr> "1001", "1003", "1005", "1007", "1009", "1011", "101...
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama...
## $ county <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "...
## $ total_pop <int> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 11...
## $ men <int> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274...
## $ women <int> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 6037...
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7....
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3...
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, ...
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0....
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0....
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....
## $ citizen <int> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 886...
## $ income <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 417...
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 213...
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6...
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2...
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3...
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5...
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3...
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5,...
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4...
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1...
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 1...
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0....
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1....
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1....
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1....
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1...
## $ employed <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, ...
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1...
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1...
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4....
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0....
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9...
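The five separate assignments used above can also be written as a single dplyr pipeline. The following is a minimal sketch of that alternative, assuming dplyr 1.0.0 or later (for across()); the toy tibble and the name toy_typed are assumptions made for illustration, and the values are the first three rows shown in the output above.

```r
library(dplyr)

# Toy stand-in for county_data with the five columns whose types change.
toy <- tibble(
  census_id = c(1001, 1003, 1005),
  total_pop = c(55221, 195121, 26932),
  men       = c(26745, 95314, 14497),
  women     = c(28476, 99807, 12435),
  citizen   = c(40725, 147695, 20714)
)

# Convert the ID to character and the four count columns to integer in one step.
toy_typed <- toy %>%
  mutate(
    census_id = as.character(census_id),
    across(c(total_pop, men, women, citizen), as.integer)
  )

glimpse(toy_typed)
```

Both approaches produce the same types; the pipeline simply lists the count columns once instead of repeating the assignment for each one.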
Using the below code, we create the below output displaying the number of missing values in each variable.
# Using the below code, we find the number of missing values in each variable.
colSums(is.na(county_data))
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 1 0
## poverty child_poverty professional service office
## 0 1 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
Based on the above output, we see that the only variables with missing values are “income” and “child_poverty”. Each of these two variables has one missing value. We note that each observation in the data set represents a unique county. Hence if we remove observations, we may not obtain accurate answers to questions 5 through 10. Take for example question 5, which asks for the number of counties having more women than men. If we remove a county that has more women than men, our count will be too low. Hence rather than removing observations, we instead impute the missing values. If the data contain several outliers, the mean can be less representative of the majority of the data than the median. Hence we choose to impute each missing value with the median (rather than the mean) of the corresponding column. That is, we replace the missing value in the variable “income” with the median income, and we replace the missing value in the variable “child_poverty” with the median value for “child_poverty”. We create a new data set called county_data_2 containing the replacements for these missing values. Using the below code, we perform the replacements described in this paragraph.
# Using the below code, we create a new data set called county_data_2 which is identical to county_data.
county_data_2 <- county_data
# Using the below code, we replace the missing value for income with the median income.
county_data_2$income[which(is.na(county_data_2$income))] <- median(county_data$income, na.rm = TRUE)
# Using the below code, we replace the missing value for child_poverty with the median value for child_poverty.
county_data_2$child_poverty[which(is.na(county_data_2$child_poverty))] <- median(county_data$child_poverty, na.rm = TRUE)
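The reason for preferring the median over the mean when outliers are present can be illustrated with a small example. The five numbers below are hypothetical and are not taken from the data set; they only show how a single extreme value pulls the mean while leaving the median unchanged.

```r
# Hypothetical values: four similar numbers and one extreme value.
x <- c(10, 12, 14, 16, 1000)

x_mean <- mean(x)      # pulled far upward by the single extreme value
x_median <- median(x)  # stays at the middle value of the sorted data

print(c(mean = x_mean, median = x_median))
```

Here the mean is 210.4 while the median is 14, so the median is the better description of a typical value.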
Using the below code, we create the below output, which shows that county_data_2 no longer contains any missing values because the two missing values have been replaced.
# Using the below code, we display the number of missing values in each variable.
colSums(is.na(county_data_2))
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 0 0
## poverty child_poverty professional service office
## 0 0 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
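The same median imputation can be sketched as a single dplyr step that handles both columns at once. This is an alternative to the approach used above, not the code that produced the results in this report. It assumes dplyr 1.0.0 or later, and the toy tibble, the positions of the NA values, and the name toy_imputed are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in with one missing value in each column (positions are hypothetical).
toy <- tibble(
  income        = c(51281, NA, 32964, 38678, 45813),
  child_poverty = c(18.6, 19.2, NA, 27.9, 27.2)
)

# Replace each NA with the median of the non-missing values in its own column.
toy_imputed <- toy %>%
  mutate(across(
    c(income, child_poverty),
    ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE))
  ))

colSums(is.na(toy_imputed))
```

Because the median is computed inside across(), each column is imputed from its own values, and no column names need to be repeated.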
The below output displays a summary of the variable total_pop. We see that the maximum value is very far above the third quartile, and we see that the minimum value is also somewhat below the first quartile. This is most likely because most counties have neither an extremely large nor an extremely small population. However, the minimum and maximum values for the total population seem reasonable, given that there exist counties in the USA with extremely high populations and also rural counties with very low populations. For instance, the city where I grew up has a total population of only 150. Given that these values do not appear unusual, we do not remove any outliers for this variable. The below code is used to display the below summary.
# Using the below code, we display a summary for the variable total_pop.
summary(county_data_2$total_pop)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 85 11028 25768 100737 67552 10038388
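The informal judgment that a value is "far" from a quartile can be made concrete with the common 1.5 × IQR rule (Tukey's fences). This rule is not used elsewhere in this report; the sketch below is only an illustration, and it uses the quartiles as printed in the summary above, which may be rounded. The object names are assumptions made for illustration.

```r
# Quartiles, minimum, and maximum of total_pop as printed by summary().
q1 <- 11028
q3 <- 67552
pop_min <- 85
pop_max <- 10038388

# Interquartile range and the two fences of the 1.5 * IQR rule.
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

# A value is flagged when it lies outside the fences.
max_flagged <- pop_max > upper_fence
min_flagged <- pop_min < lower_fence

print(c(upper_fence = upper_fence, lower_fence = lower_fence))
print(c(max_flagged = max_flagged, min_flagged = min_flagged))
```

Under this rule the maximum is flagged as an outlier, while the minimum is not, because the lower fence is negative. Being flagged does not mean that a value is an error; as discussed above, very large county populations are plausible.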
The below output displays a summary of the variable men. We see that the maximum value is very far above the third quartile, and this suggests that there may be outliers. This is most likely because most counties do not have an extremely large population. However, the maximum value for the variable “men” seems reasonable, given that there exist counties in the USA with extremely high populations. Given that these values do not appear unusual, we do not remove any outliers for this variable. The below code is used to display the below summary.
# Using the below code, we display a summary for the variable men.
summary(county_data_2$men)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42 5546 12826 49565 33319 4945351
The below output displays a summary of the variable women. We see that the maximum value is very far above the third quartile, and this suggests that there may be outliers. This is most likely because most counties do not have an extremely large population. However, the maximum value for the variable “women” seems reasonable, given that there exist counties in the USA with extremely high populations. Given that these values do not appear unusual, we do not remove any outliers for this variable. The below code is used to display the below summary.
# Using the below code, we display a summary for the variable women.
summary(county_data_2$women)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43 5466 12907 51171 34122 5093037
The below output displays a summary for the variable hispanic. The maximum value of 98.7 is much larger than the third quartile. This suggests that there may be outliers in this variable. However, the maximum value of 98.7 is no more than 100, and hence is reasonable as a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable hispanic.
summary(county_data_2$hispanic)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.900 3.700 8.826 9.000 98.700
The below output displays the row number of the observations having a value for hispanic greater than 95. We see that there are multiple observations having extremely high values for hispanic, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because Hispanic individuals are a minority in most (but not all) counties in the USA. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for hispanic greater than 95.
which(county_data_2$hispanic > 95)
## [1] 2685 2737 2763
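Row numbers identify the extreme observations, but county names are often easier to interpret. The following is a minimal sketch of how the same kind of query could return names instead of row numbers with dplyr. The toy tibble contains only the first five Alabama counties shown earlier in this report, and the threshold of 4.5 is a hypothetical value chosen so that the toy data returns some rows; on the real data the threshold would be 95.

```r
library(dplyr)

# Toy stand-in for county_data_2 (first five rows shown earlier in this report).
toy <- tibble(
  state    = rep("Alabama", 5),
  county   = c("Autauga", "Baldwin", "Barbour", "Bibb", "Blount"),
  hispanic = c(2.6, 4.5, 4.6, 2.2, 8.6)
)

# Keep the rows above the (hypothetical) threshold and list them by name,
# highest percentage first.
high_hispanic <- toy %>%
  filter(hispanic > 4.5) %>%
  select(state, county, hispanic) %>%
  arrange(desc(hispanic))

print(high_hispanic)
```

The same pattern applies to the other variables examined in this section by changing the column name and the threshold.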
The below output displays a summary for the variable white. The minimum value of 0.9 is much smaller than the first quartile. This suggests that there may be outliers in this variable. However, the minimum value of 0.9 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary for the variable white.
summary(county_data_2$white)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.90 65.60 84.60 77.28 93.30 99.80
The below output displays the row number of the observations having a value for white less than 5. We see that there are multiple observations having extremely low values for white, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because white individuals are a majority in most (but not all) counties in the USA. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for white less than 5.
which(county_data_2$white < 5)
## [1] 82 2413 2685 2737 2763
The below output displays a summary for the variable black. The maximum value of 85.9 is much larger than the third quartile. This suggests that there may be outliers in this variable. However, the maximum value of 85.9 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary for the variable black.
summary(county_data_2$black)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.600 2.100 8.879 10.175 85.900
The below output displays the row number of the observations having a value for black greater than 80. We see that there are multiple observations having extremely high values for black, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because black individuals are a minority in most (but not all) counties in the USA. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for black greater than 80.
which(county_data_2$black > 80)
## [1] 32 44 1412 1427 1433
The below output displays a summary for the variable native. The maximum value of 92.1 is much larger than the third quartile. This suggests that there may be outliers in this variable. However, the maximum value of 92.1 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary for the variable native.
summary(county_data_2$native)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.100 0.300 1.766 0.600 92.100
The below output displays the row number of the observations having a value for native greater than 85. We see that there are two observations having extremely high values for native, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because native individuals are a minority in most (but not all) counties in the USA. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for native greater than 85.
which(county_data_2$native > 85)
## [1] 82 2413
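The row numbers reported so far can be compared across variables as a consistency check. The sketch below copies the row numbers from the outputs above and uses base R set functions to see whether every county with a very low value for white also has a very high value for hispanic or native. The object names are assumptions made for illustration.

```r
# Row numbers copied from the outputs above.
white_low     <- c(82, 2413, 2685, 2737, 2763)  # white < 5
hispanic_high <- c(2685, 2737, 2763)            # hispanic > 95
native_high   <- c(82, 2413)                    # native > 85

# Rows with a low white percentage that also appear in one of the other lists.
explained_by_native   <- intersect(white_low, native_high)
explained_by_hispanic <- intersect(white_low, hispanic_high)

# Rows with a low white percentage that appear in neither list.
unexplained <- setdiff(white_low, union(native_high, hispanic_high))

print(explained_by_native)
print(explained_by_hispanic)
print(length(unexplained))
```

All five rows are accounted for, which supports the view that these extreme values reflect the demographic makeup of the counties rather than data entry errors.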
The below output displays a summary for the variable asian. The maximum value of 41.6 is much larger than the third quartile. This suggests that there may be outliers in this variable. However, the maximum value of 41.6 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary for the variable asian.
summary(county_data_2$asian)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.200 0.500 1.258 1.200 41.600
The below output displays the row number of the observations having a value for asian greater than 35. We see that there are two observations having considerably high values for asian, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because Asian individuals are a minority in most (but not all) counties in the USA. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for asian greater than 35.
which(county_data_2$asian > 35)
## [1] 69 548
The below output displays a summary for the variable pacific. The maximum value of 35.3 is much larger than the third quartile. This suggests that there may be outliers in this variable. However, the maximum value of 35.3 is appropriately between 0 and 100, as we would expect for a percentage. Also, the distance between the maximum value and the third quartile can be explained by the fact that in most counties (but not all), Pacific individuals are a very small minority. The below code is used to produce the below output.
# Using the below code, we display a summary for the variable pacific.
summary(county_data_2$pacific)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08475 0.00000 35.30000
The below output displays a summary of the variable citizen. We see that the maximum value is very far above the third quartile, and this suggests that there may be outliers. This is most likely because most counties do not have an extremely large population. However, the maximum value for the variable “citizen” seems reasonable, given that there exist counties in the USA with very high populations. Given that these values do not appear unusual, we do not remove any outliers for this variable. The below code is used to display the below summary.
# Using the below code, we display a summary for the variable citizen.
summary(county_data_2$citizen)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 80 8254 19434 70804 50728 6046749
The below output displays a summary of the variable income. We see that the maximum value is far above the third quartile, and this suggests that there may be outliers. This is most likely because only a small number of counties have a much higher cost of living (and also a higher median income) than other counties in the USA. The below code is used to display the below summary.
# Using the below code, we display a summary for the variable income.
summary(county_data_2$income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19328 38826 45111 46830 52249 123453
The below output displays the row number of the observations having a value for income greater than 120000. We see that there are two observations having considerably high values for income, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because the median income in a small number of counties is much larger than in most other counties in the USA. Hence we do not remove or alter observations for the variable income. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a median income greater than 120000.
which(county_data_2$income > 120000)
## [1] 2873 2926
The below output displays a summary of the variable income_per_cap. We see that the maximum value is far above the third quartile, and this suggests that there may be outliers. This is most likely because only a small number of counties have a much higher cost of living (and also a higher income per capita) than most other counties in the USA. However, the maximum value for income per capita seems reasonable, and so we do not remove or alter any outliers for this variable. The below code is used to display the below summary.
# Using the below code, we display a summary of the income per capita.
summary(county_data_2$income_per_cap)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8292 20471 23577 24338 27138 65600
The below output displays a summary for the variable poverty. The maximum value of 53.3 is much larger than the third quartile. This suggests that there may be outliers in this variable. However, the maximum value of 53.3 is appropriately between 0 and 100, as we would expect for a percentage. Also, the distance between the maximum value and the third quartile can be explained by the fact that there are only a small number of counties where poverty exceeds 45%. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable poverty.
summary(county_data_2$poverty)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.4 12.0 16.0 16.7 20.3 53.3
The below output displays the row number of the observations having a value for poverty greater than 45. We see that there are multiple observations having considerably high values for poverty, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high poverty levels. Hence we do not remove or alter observations for the variable poverty. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for poverty greater than 45.
which(county_data_2$poverty > 45)
## [1] 1131 1433 2377 2409 2413 2422
The below output displays a summary for the variable child_poverty. The maximum value of 72.3 is much larger than the 3rd quartile (29.48), which suggests that there may be outliers in this variable. However, the maximum value of 72.3 is appropriately between 0 and 100, as we would expect for a percentage. The distance between the maximum value and the 3rd quartile can be explained by the fact that only a small number of counties have more than 65% of children living in poverty. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable child_poverty.
summary(county_data_2$child_poverty)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 16.10 22.50 23.29 29.48 72.30
The below output displays the row number of the observations having a value for child_poverty greater than 65. We see that there are multiple observations having considerably high values for child_poverty, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high child poverty levels. Hence we do not remove or alter observations for the variable child_poverty. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for child_poverty greater than 65.
which(county_data_2$child_poverty > 65)
## [1] 32 1131 2409
The below output displays a summary for the variable professional. The maximum value of 74 is much larger than the 3rd quartile (34.40), which suggests that there may be outliers in this variable. However, the maximum value of 74 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable professional.
summary(county_data_2$professional)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.50 26.70 30.00 31.04 34.40 74.00
The below output displays the row number of the observations having a value for professional greater than 65. We see that there are multiple observations having high values for professional, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of professional employment. Hence we do not remove or alter observations for the variable professional. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for professional greater than 65.
which(county_data_2$professional > 65)
## [1] 1811 2827 2926
The below output displays a summary for the variable service, and based on this output we see that there do not appear to be any extremely unusual outliers for the variable service. All values in the summary are appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable service.
summary(county_data_2$service)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 15.90 18.00 18.26 20.20 36.60
The below output displays a summary for the variable office. The minimum value of 4.1 is much smaller than the 1st quartile (20.20), which suggests that there may be outliers in this variable. However, the minimum value of 4.1 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable office.
summary(county_data_2$office)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.10 20.20 22.40 22.13 24.30 35.40
The below output displays the row number of the observations having a value for office smaller than 10. We see that there are multiple observations having small values for office, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such small percentages of office employment. Hence we do not remove or alter observations for the variable office. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for office smaller than 10.
which(county_data_2$office < 10)
## [1] 68 1604 1657 1659 1752
The below output displays a summary for the variable construction. The maximum value of 40.3 is much larger than the 3rd quartile (15.00), which suggests that there may be outliers in this variable. However, the maximum value of 40.3 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable construction.
summary(county_data_2$construction)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.70 9.80 12.20 12.74 15.00 40.30
The below output displays the row number of the observations having a value for construction greater than 35. We see that there are two observations having high values for construction, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers arise merely because only a small number of counties have such high percentages of construction employment. Hence we do not remove or alter observations for the variable construction. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for construction greater than 35.
which(county_data_2$construction > 35)
## [1] 1755 2658
The below output displays a summary for the variable production. The maximum value of 55.6 is much larger than the 3rd quartile (19.40), which suggests that there may be outliers in this variable. However, the maximum value of 55.6 is appropriately between 0 and 100, as we would expect for a percentage. Hence we do not expect that these unusual values were entered in error. Instead we believe that only a small number of counties have such high percentages of production employment, and so we do not alter or remove observations for the variable production. Indeed, removing observations may prevent us from obtaining accurate answers to questions 5 through 10. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable production.
summary(county_data_2$production)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.53 15.40 15.82 19.40 55.60
The below output displays a summary for the variable drive. The minimum value of 5.2 is much smaller than the 1st quartile (76.60), which suggests that there may be outliers in this variable. However, the minimum value of 5.2 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable drive.
summary(county_data_2$drive)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.20 76.60 80.60 79.08 83.60 94.60
The below output displays the row number of the observations having a value for drive less than 10. We see that there are two observations having low values for drive, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers arise merely because only a small number of counties have such low percentages of individuals who drive to work. Hence we do not remove or alter observations for the variable drive. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for drive less than 10.
which(county_data_2$drive < 10)
## [1] 82 1859
The below output displays a summary for the variable carpool. The maximum value of 29.9 is much larger than the 3rd quartile (11.88), which suggests that there may be outliers in this variable. However, the maximum value of 29.9 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable carpool.
summary(county_data_2$carpool)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 8.50 9.90 10.33 11.88 29.90
The below output displays the row number of the observations having a value for carpool greater than 25. We see that there are several observations having high values for carpool, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who carpool to work. Hence we do not remove or alter observations for the variable carpool. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for carpool greater than 25.
which(county_data_2$carpool > 25)
## [1] 417 469 741
The below output displays a summary for the variable transit. The maximum value of 61.7 is much larger than the 3rd quartile (0.8), which suggests that there may be outliers in this variable. However, the maximum value of 61.7 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable transit.
summary(county_data_2$transit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1000 0.4000 0.9675 0.8000 61.7000
The below output displays the row number of the observations having a value for transit greater than 55. We see that there are several observations having high values for transit, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who use public transportation to get to work. Hence we do not remove or alter observations for the variable transit. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for transit greater than 55.
which(county_data_2$transit > 55)
## [1] 1831 1852 1859
The below output displays a summary for the variable walk. The maximum value of 71.2 is much larger than the 3rd quartile (4.0), which suggests that there may be outliers in this variable. However, the maximum value of 71.2 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable walk.
summary(county_data_2$walk)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.400 2.400 3.307 4.000 71.200
The below output displays the row number of the observations having a value for walk greater than 60. We see that only observation 68 has a value for walk greater than 60. Hence we suspect that this may be an extremely unusual outlier, and so we will examine this observation further. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for walk greater than 60.
which(county_data_2$walk > 60)
## [1] 68
Because the value for walk in observation 68 seems to be unusual, we determine if the sum of the variables drive, carpool, transit, walk, other_transp, and work_at_home is equal to 100 in row 68. The below output indicates that, as we should expect, the sum is indeed equal to 100. This suggests that although the value for walk in observation 68 seems to be unusual, it was probably not entered incorrectly. Hence we do not alter or remove this unusual observation. Indeed removing observations could prevent us from obtaining accurate answers in questions 5 through 10. The below code is used to create the below output.
# Using the below code, we display the sum of the variables drive, carpool, transit, walk, other_transp, and work_at_home in row 68.
county_data_2$drive[68] + county_data_2$carpool[68] + county_data_2$transit[68] + county_data_2$walk[68] + county_data_2$other_transp[68] + county_data_2$work_at_home[68]
## [1] 100
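The same consistency check could be applied to every row at once rather than to row 68 alone. The following is a minimal sketch on a made-up three-row data frame (the object names commute, mode_total, and off_rows, and the 0.5 tolerance for rounding, are assumptions chosen for illustration); it uses base R's rowSums() to total the six commute-mode columns.

```r
# Toy commute-mode shares (percentages) for three hypothetical counties.
commute <- data.frame(
  drive        = c(80.0,  5.2, 70.0),
  carpool      = c(10.0,  3.0, 10.0),
  transit      = c( 1.0,  0.5,  5.0),
  walk         = c( 2.0, 71.2,  5.0),
  other_transp = c( 2.0, 15.0,  3.0),
  work_at_home = c( 5.0,  5.1,  5.0)
)

# Total share across the six commute modes for each row.
mode_total <- rowSums(commute)

# Rows whose total differs from 100 by more than an assumed rounding tolerance.
off_rows <- which(abs(mode_total - 100) > 0.5)
off_rows
```

Rows returned by such a check would be the ones worth inspecting by hand; a total close to 100 does not prove that each individual share is correct.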
The below output displays a summary for the variable other_transp. The maximum value of 39.1 is much larger than the 3rd quartile (1.9), which suggests that there may be outliers in this variable. However, the maximum value of 39.1 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable other_transp.
summary(county_data_2$other_transp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.900 1.300 1.614 1.900 39.100
The below output displays the row number of the observations having a value for other_transp greater than 30. We see that there are two observations having such high values for other_transp, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers arise merely because only a small number of counties have such high percentages of individuals who go to work via other means. Hence we do not remove or alter observations for the variable other_transp. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for other_transp greater than 30.
which(county_data_2$other_transp > 30)
## [1] 82 83
The below output displays a summary for the variable work_at_home. The maximum value of 37.2 is much larger than the 3rd quartile (5.7), which suggests that there may be outliers in this variable. However, the maximum value of 37.2 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable work_at_home.
summary(county_data_2$work_at_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.800 4.000 4.697 5.700 37.200
The below output displays the row number of the observations having a value for work_at_home greater than 30. We see that there are several observations having such high values for work_at_home, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who work at home. Hence we do not remove or alter observations for the variable work_at_home. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for work_at_home greater than 30.
which(county_data_2$work_at_home > 30)
## [1] 413 1568 1706
The below output displays a summary for the variable mean_commute. The maximum value of 44 is much larger than the 3rd quartile, and the minimum value of 4.9 is considerably smaller than the 1st quartile. However, this could be because only a small number of counties in the USA have a commute that is generally either extremely long or extremely short. Given this, the minimum and maximum values for mean_commute seem reasonable, and so we do not alter or remove any observations for this variable. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable mean_commute.
summary(county_data_2$mean_commute)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.90 19.30 22.90 23.15 26.60 44.00
The below output displays a summary for the variable employed. According to the data dictionary, the variable employed is supposed to be a percentage. However, based on the values in the below summary, we believe that it is instead the number of people who are employed in each county. The minimum value is far below the 1st quartile, and the maximum value is far above the 3rd quartile. This is likely because a small number of counties in the USA have very large populations and a small number have very small populations. Given this, the values in the below summary seem reasonable, and so we do not alter or remove any observations for this variable. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable employed.
summary(county_data_2$employed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 62 4524 10644 46387 29254 4635465
The below output displays a summary for the variable private_work. The minimum value of 25 is much smaller than the 1st quartile (70.90), which suggests that there may be outliers in this variable. However, the minimum value of 25 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable private_work.
summary(county_data_2$private_work)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 70.90 75.80 74.44 79.80 88.30
The below output displays the row number of the observations having a value for private_work less than 30. We see that there are two observations having such low values for private_work, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers arise merely because only a small number of counties have such low percentages of individuals who work in the private sector. Hence we do not remove or alter observations for the variable private_work. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for private_work less than 30.
which(county_data_2$private_work < 30)
## [1] 549 2413
The below output displays a summary for the variable public_work. The maximum value of 66.2 is much larger than the 3rd quartile (20.10), which suggests that there may be outliers in this variable. However, the maximum value of 66.2 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable public_work.
summary(county_data_2$public_work)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.80 13.10 16.10 17.35 20.10 66.20
The below output displays the row number of the observations having a value for public_work greater than 60. We see that there are several observations having such high values for public_work, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who work in the public sector. Hence we do not remove or alter observations for the variable public_work. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for public_work greater than 60.
which(county_data_2$public_work > 60)
## [1] 96 549 2413
The below output displays a summary for the variable self_employed. The maximum value of 36.6 is much larger than the 3rd quartile (9.4), which suggests that there may be outliers in this variable. However, the maximum value of 36.6 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable self_employed.
summary(county_data_2$self_employed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.400 6.900 7.921 9.400 36.600
The below output displays the row number of the observations having a value for self_employed greater than 30. We see that there are several observations having such high values for self_employed, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who are self-employed. Hence we do not remove or alter observations for the variable self_employed. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for self_employed greater than 30.
which(county_data_2$self_employed > 30)
## [1] 1604 1615 1652 1706 2009
The below output displays a summary for the variable family_work. The maximum value of 9.8 is much larger than the 3rd quartile (0.3), which suggests that there may be outliers in this variable. However, the maximum value of 9.8 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable family_work.
summary(county_data_2$family_work)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1000 0.2000 0.2915 0.3000 9.8000
The below output displays the row number of the observations having a value for family_work greater than 5. We see that there are multiple observations having such high values for family_work, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who are engaged in family work. Hence we do not remove or alter observations for the variable family_work. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for family_work greater than 5.
which(county_data_2$family_work > 5)
## [1] 239 1601 1624 1652
The below output displays a summary for the variable unemployment. The maximum value of 29.4 is much larger than the 3rd quartile (9.7), which suggests that there may be outliers in this variable. However, the maximum value of 29.4 is appropriately between 0 and 100, as we would expect for a percentage. The below code is used to produce the below output.
# Using the below code, we display a summary of the variable unemployment.
summary(county_data_2$unemployment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.500 7.500 7.815 9.700 29.400
The below output displays the row number of the observations having a value for unemployment greater than 25. We see that there are multiple observations having such high values for unemployment, and hence we do not suspect that there is one single value that was entered in error. Instead we believe that the outliers are caused merely because there are only a small number of counties with such high percentages of individuals who are unemployed. Hence we do not remove or alter observations for the variable unemployment. The below code is used to create the below output.
# Using the below code, we display the row number of the observations having a value for unemployment greater than 25.
which(county_data_2$unemployment > 25)
## [1] 82 258 1428 1461 2377 2413 2422 2428
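The range check applied above to one percentage variable at a time (every value between 0 and 100) could also be run across several columns in one pass. The sketch below uses a small made-up data frame; the names pct, in_range, and ranges are hypothetical, and the three columns merely stand in for the percentage variables of the real data set.

```r
# Toy percentage columns (values chosen for illustration only).
pct <- data.frame(
  poverty      = c(1.4, 16.0, 53.3),
  unemployment = c(0.0,  7.5, 29.4),
  transit      = c(0.0,  0.4, 61.7)
)

# TRUE for each column whose values all lie in [0, 100].
in_range <- sapply(pct, function(x) all(x >= 0 & x <= 100, na.rm = TRUE))
in_range

# Minimum (row 1) and maximum (row 2) of each column, side by side.
ranges <- sapply(pct, range)
ranges
```

This only confirms that the values are valid percentages; it does not replace looking at each summary for values that are valid but implausible.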
The below output displays a summary of the variable census_id. We see that there are 3142 observations of this variable, and it is of class character. The code that we used to create the below output is shown directly below.
# Using the below code, we display a summary of the variable census_id.
summary(county_data_2$census_id)
## Length Class Mode
## 3142 character character
The below output displays a summary of the variable state. We see that there are 3142 observations of this variable, and it is of class character. The code that we used to create the below output is shown directly below.
# Using the below code, we display a summary of the variable state.
summary(county_data_2$state)
## Length Class Mode
## 3142 character character
The below output displays a summary of the variable county. We see that there are 3142 observations of this variable, and it is of class character. The code that we used to create the below output is shown directly below.
# Using the below code, we display a summary of the variable county.
summary(county_data_2$county)
## Length Class Mode
## 3142 character character
In conclusion, we did find some outliers in the data. However, these outliers were not so unusual as to make us suspect that they were entered incorrectly. As a consequence, we choose not to alter or remove them. Indeed, altering or removing them could prevent us from obtaining correct answers to questions 5 through 10.
Before answering this question, we double-check that each row in the data set represents a unique county (i.e., no county occurs twice in the data set). The below output displays the number of unique counties in the data set, and we obtain this number using the below code.
# Using the below code, we create a dataset containing only the variables state and county.
county_data_3 <- select(county_data_2, state, county)
# Using the below code, we output the number of unique counties contained in the data set that we created.
nrow(unique(county_data_3))
## [1] 3142
We compare the number of unique counties to the total number of rows in the data set using the below code. The below output indicates that there are 3142 observations in the data set. We note that this is the same as the number of unique counties.
# Using the below code, we find the number of observations in the data set.
nrow(county_data_2)
## [1] 3142
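An equivalent way to check for repeated counties is to look for duplicated state and county pairs directly. The sketch below runs on a made-up four-row data frame (counties, n_dupes, and with_dupe are hypothetical names) and uses the base R functions duplicated() and anyDuplicated().

```r
# Toy state/county pairs (illustrative only).
counties <- data.frame(
  state  = c("Alabama", "Alabama", "Virginia", "Maryland"),
  county = c("Clay", "Pike", "Warren", "Charles"),
  stringsAsFactors = FALSE
)

# Number of rows that repeat an earlier state/county pair (0 means all unique).
n_dupes <- sum(duplicated(counties[, c("state", "county")]))
n_dupes

# Appending a copy of the first row makes the check report a duplicate.
with_dupe <- rbind(counties, counties[1, ])
anyDuplicated(with_dupe)  # index of the first duplicated row, or 0 if none
```

Checking the pair rather than county alone matters because the same county name can occur in more than one state.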
We create a data set containing only the counties having more women than men. We view the head of this data set in the output below, and we see that it does indeed appear to contain the counties with more women than men. The code to perform the operations mentioned in this paragraph is displayed directly below.
# Using the below code, we create a data set called county_data_4 which contains the observations with more women than men.
county_data_4 <- filter(county_data_2, women > men)
# Using the below code, we view the head of county_data_4 in order to check that it appears to contain the observations with more women than men.
head(county_data_4)
## # A tibble: 6 x 35
## census_id state county total_pop men women hispanic white black native asian
## <chr> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alab~ Autau~ 55221 26745 28476 2.6 75.8 18.5 0.4 1
## 2 1003 Alab~ Baldw~ 195121 95314 99807 4.5 83.1 9.5 0.6 0.7
## 3 1009 Alab~ Blount 57710 28512 29198 8.6 87.9 1.5 0.3 0.1
## 4 1013 Alab~ Butler 20354 9502 10852 1.2 53.3 43.8 0.1 0.4
## 5 1015 Alab~ Calho~ 116648 56274 60374 3.5 73 20.3 0.2 0.9
## 6 1017 Alab~ Chamb~ 34079 16258 17821 0.4 57.3 40.3 0.2 0.8
## # ... with 24 more variables: pacific <dbl>, citizen <int>, income <dbl>,
## # income_per_cap <dbl>, poverty <dbl>, child_poverty <dbl>,
## # professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## # production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## # other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, employed <dbl>,
## # private_work <dbl>, public_work <dbl>, self_employed <dbl>,
## # family_work <dbl>, unemployment <dbl>
The below output indicates that there are 1985 counties with more women than men. We obtain this value using the below code.
# Using the below code, we find the number of counties with more women than men by counting the number of observations in county_data_4.
nrow(county_data_4)
## [1] 1985
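The count can also be obtained without creating an intermediate data set, because summing a logical vector counts its TRUE values. This is a minimal sketch on made-up numbers; pop and n_more_women are hypothetical names.

```r
# Toy population counts for three hypothetical counties.
pop <- data.frame(
  men   = c(100, 250, 90),
  women = c(120, 240, 95)
)

# TRUE counts as 1 and FALSE as 0, so the sum is the number of rows
# in which women outnumber men. na.rm = TRUE skips any missing comparisons.
n_more_women <- sum(pop$women > pop$men, na.rm = TRUE)
n_more_women
```

Filtering first, as done above, has the advantage that the matching rows can be inspected with head() before they are counted.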
We create a data set containing only the counties having an unemployment rate lower than 10%. We view the head of this data set in the output below, and we see that it does indeed appear to contain the counties with low unemployment rates. The code to perform the operations mentioned in this paragraph is displayed directly below.
# Using the below code, we create a data set called county_data_5 which contains the counties having an unemployment rate lower than 10%.
county_data_5 <- filter(county_data_2, unemployment < 10)
# Using the below code, we view the head of county_data_5 in order to check that it appears to contain the counties with an unemployment rate lower than 10%.
head(county_data_5)
## # A tibble: 6 x 35
## census_id state county total_pop men women hispanic white black native asian
## <chr> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alab~ Autau~ 55221 26745 28476 2.6 75.8 18.5 0.4 1
## 2 1003 Alab~ Baldw~ 195121 95314 99807 4.5 83.1 9.5 0.6 0.7
## 3 1007 Alab~ Bibb 22604 12073 10531 2.2 74.5 21.4 0.4 0.1
## 4 1009 Alab~ Blount 57710 28512 29198 8.6 87.9 1.5 0.3 0.1
## 5 1017 Alab~ Chamb~ 34079 16258 17821 0.4 57.3 40.3 0.2 0.8
## 6 1019 Alab~ Chero~ 26008 12975 13033 1.5 91.7 4.8 0.6 0.3
## # ... with 24 more variables: pacific <dbl>, citizen <int>, income <dbl>,
## # income_per_cap <dbl>, poverty <dbl>, child_poverty <dbl>,
## # professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## # production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## # other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, employed <dbl>,
## # private_work <dbl>, public_work <dbl>, self_employed <dbl>,
## # family_work <dbl>, unemployment <dbl>
The below output indicates that there are 2420 counties having an unemployment rate lower than 10%. We obtain this value using the below code.
# Using the below code, we find the number of counties with an unemployment rate lower than 10% by counting the number of observations in county_data_5.
nrow(county_data_5)
## [1] 2420
Using the below code, we access the documentation for the dplyr::top_n() function in order to learn how to use it to answer this question.
# We use the below code to access the documentation for the dplyr::top_n() function.
?dplyr::top_n()
## starting httpd help server ... done
The below output displays the top 10 counties with the highest mean commute (sorted by mean_commute). Below is the code that we used to create this output.
# Using the below code, we create a new data set called county_data_6 containing only the top 10 counties with the highest mean commute.
county_data_6 <- dplyr::top_n(county_data_2, 10, mean_commute)
# Using the below code, we arrange county_data_6 by mean commute.
county_data_6 <- arrange(county_data_6, desc(mean_commute))
# Using the below code, we display the census ID, the county name, the state, and the mean commute of the 10 counties in county_data_6.
select(county_data_6, census_id, county, state, mean_commute)
## # A tibble: 10 x 4
## census_id county state mean_commute
## <chr> <chr> <chr> <dbl>
## 1 42103 Pike Pennsylvania 44
## 2 36005 Bronx New York 43
## 3 24017 Charles Maryland 42.8
## 4 51187 Warren Virginia 42.7
## 5 36081 Queens New York 42.6
## 6 36085 Richmond New York 42.6
## 7 51193 Westmoreland Virginia 42.5
## 8 8093 Park Colorado 42.4
## 9 36047 Kings New York 41.7
## 10 54015 Clay West Virginia 41.4
Based on the above output, we see that the 10 counties with the longest mean_commute are Pike county in Pennsylvania, Bronx county in New York, Charles county in Maryland, Warren county in Virginia, Queens county in New York, Richmond county in New York, Westmoreland county in Virginia, Park county in Colorado, Kings county in New York, and Clay county in West Virginia.
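As of dplyr 1.0.0, top_n() is marked as superseded in favour of slice_max() and slice_min(). The sketch below shows how a "top n by a column" selection might look with slice_max() on a made-up five-row tibble; commutes and top_2 are hypothetical names, and the call assumes a dplyr version of at least 1.0.0.

```r
library(dplyr)

# Toy mean commute times for five hypothetical counties.
commutes <- tibble(
  county       = c("A", "B", "C", "D", "E"),
  mean_commute = c(22.9, 44.0, 19.3, 42.8, 26.6)
)

# Keep the 2 rows with the largest mean_commute, then sort them explicitly.
top_2 <- commutes %>%
  slice_max(order_by = mean_commute, n = 2) %>%
  arrange(desc(mean_commute))
top_2
```

Like top_n(), slice_max() keeps ties by default (with_ties = TRUE), so the result can contain more than n rows when values are tied at the cutoff.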
We create a new data set containing all of the 35 variables in county_data_2 and also a new variable called women_percent. The variable women_percent displays the percentage of the population that are women. In the below output, we view the head of this new data set in order to check that the new variable was appropriately added, and we see that the new data set now indeed contains 36 variables (one additional variable beyond the 35 contained in county_data_2). The code used to perform the operations described in this paragraph is shown directly below.
# Using the below code, we create a new data set called county_data_7 which contains an additional variable indicating the percentage of women in each county.
county_data_7 <- mutate(county_data_2, women_percent = women / total_pop * 100)
# Using the below code, we view the head of the new data set to check that the new variable was appropriately added.
head(county_data_7)
## # A tibble: 6 x 36
## census_id state county total_pop men women hispanic white black native asian
## <chr> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alab~ Autau~ 55221 26745 28476 2.6 75.8 18.5 0.4 1
## 2 1003 Alab~ Baldw~ 195121 95314 99807 4.5 83.1 9.5 0.6 0.7
## 3 1005 Alab~ Barbo~ 26932 14497 12435 4.6 46.2 46.7 0.2 0.4
## 4 1007 Alab~ Bibb 22604 12073 10531 2.2 74.5 21.4 0.4 0.1
## 5 1009 Alab~ Blount 57710 28512 29198 8.6 87.9 1.5 0.3 0.1
## 6 1011 Alab~ Bullo~ 10678 5660 5018 4.4 22.2 70.7 1.2 0.2
## # ... with 25 more variables: pacific <dbl>, citizen <int>, income <dbl>,
## # income_per_cap <dbl>, poverty <dbl>, child_poverty <dbl>,
## # professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## # production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## # other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, employed <dbl>,
## # private_work <dbl>, public_work <dbl>, self_employed <dbl>,
## # family_work <dbl>, unemployment <dbl>, women_percent <dbl>
The below output displays the 10 counties having the lowest percentage of women (sorted in ascending percentage of women). Below is the code that we used to create this output.
# Using the below code, we create a new data set containing only the 10 counties having the lowest percentage of women.
county_data_8 <- dplyr::top_n(county_data_7, -10, women_percent)
# Using the below code, we arrange county_data_8 by the percentage of women.
county_data_8 <- arrange(county_data_8, women_percent)
# Using the below code, we display the census ID, the county name, the state, and the percentage of women of the 10 counties in county_data_8.
select(county_data_8, census_id, county, state, women_percent)
## # A tibble: 10 x 4
## census_id county state women_percent
## <chr> <chr> <chr> <dbl>
## 1 42053 Forest Pennsylvania 26.8
## 2 8011 Bent Colorado 31.4
## 3 51183 Sussex Virginia 31.5
## 4 13309 Wheeler Georgia 32.1
## 5 6035 Lassen California 33.2
## 6 48095 Concho Texas 33.3
## 7 13053 Chattahoochee Georgia 33.4
## 8 2013 Aleutians East Borough Alaska 33.5
## 9 22125 West Feliciana Louisiana 33.6
## 10 32027 Pershing Nevada 33.7
Based on the above output, we see that the 10 counties with the lowest percentage of women are Forest county in Pennsylvania, Bent county in Colorado, Sussex county in Virginia, Wheeler county in Georgia, Lassen county in California, Concho county in Texas, Chattahoochee county in Georgia, Aleutians East Borough in Alaska, West Feliciana in Louisiana, and Pershing county in Nevada.
We create a new data set containing all of the 36 variables in county_data_7 and also a new variable called race_sum. The variable race_sum displays the sum of all race percentage variables (i.e., the sum of the “hispanic”, “white”, “black”, “native”, “asian”, and “pacific” variables). In the below output, we view the head of this new data set in order to check that the new variable was appropriately added, and we see that the new data set contains 37 variables (one more than that of county_data_7). The code used to perform the operations described in this paragraph is shown directly below.
# Using the below code, we create a new data set called county_data_9 which contains an additional variable indicating the sum of all race percentage variables.
county_data_9 <- mutate(county_data_7, race_sum = hispanic + white + black + native + asian + pacific)
# Using the below code, we view the head of the new data set to check that the new variable was appropriately added.
head(county_data_9)
## # A tibble: 6 x 37
## census_id state county total_pop men women hispanic white black native asian
## <chr> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alab~ Autau~ 55221 26745 28476 2.6 75.8 18.5 0.4 1
## 2 1003 Alab~ Baldw~ 195121 95314 99807 4.5 83.1 9.5 0.6 0.7
## 3 1005 Alab~ Barbo~ 26932 14497 12435 4.6 46.2 46.7 0.2 0.4
## 4 1007 Alab~ Bibb 22604 12073 10531 2.2 74.5 21.4 0.4 0.1
## 5 1009 Alab~ Blount 57710 28512 29198 8.6 87.9 1.5 0.3 0.1
## 6 1011 Alab~ Bullo~ 10678 5660 5018 4.4 22.2 70.7 1.2 0.2
## # ... with 26 more variables: pacific <dbl>, citizen <int>, income <dbl>,
## # income_per_cap <dbl>, poverty <dbl>, child_poverty <dbl>,
## # professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## # production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## # other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, employed <dbl>,
## # private_work <dbl>, public_work <dbl>, self_employed <dbl>,
## # family_work <dbl>, unemployment <dbl>, women_percent <dbl>, race_sum <dbl>
The below output displays the 10 counties having the lowest values for the new variable race_sum (sorted by ascending values for race_sum). Below is the code that we used to create this output.
# Using the below code, we create a new data set containing only the 10 counties having the lowest values for race_sum.
county_data_10 <- dplyr::top_n(county_data_9, -10, race_sum)
# Using the below code, we arrange county_data_10 by the values for race_sum.
county_data_10 <- arrange(county_data_10, race_sum)
# Using the below code, we display the census ID, the county name, the state, and the race_sum of the 10 counties in county_data_10.
select(county_data_10, census_id, county, state, race_sum)
## # A tibble: 10 x 4
## census_id county state race_sum
## <chr> <chr> <chr> <dbl>
## 1 15001 Hawaii Hawaii 76.4
## 2 15009 Maui Hawaii 79.2
## 3 40097 Mayes Oklahoma 79.7
## 4 15003 Honolulu Hawaii 81.5
## 5 40123 Pontotoc Oklahoma 82.8
## 6 47061 Grundy Tennessee 83.
## 7 2282 Yakutat City and Borough Alaska 83.4
## 8 40069 Johnston Oklahoma 84
## 9 15007 Kauai Hawaii 84.1
## 10 40003 Alfalfa Oklahoma 85.1
Based on the above output, we see that the 10 counties with the lowest values for race_sum are Hawaii county in Hawaii, Maui county in Hawaii, Mayes county in Oklahoma, Honolulu county in Hawaii, Pontotoc county in Oklahoma, Grundy county in Tennessee, Yakutat City and Borough in Alaska, Johnston county in Oklahoma, Kauai county in Hawaii, and Alfalfa county in Oklahoma.
We calculate the average of the county values for race_sum in each state, and we display these values in the column avg_race_sum in the output below. The below output is sorted by ascending avg_race_sum. Hence the first entry shown below is the state with the lowest value for avg_race_sum. That is, the state of Hawaii has, on average, the lowest sum of the race percentage variables. The below output was created using the below code.
# Using the below code, we group the data in county_data_9 by state.
by_state <- group_by(county_data_9, state)
# Using the below code, we display the average of the county values for race_sum in each state (sorted by ascending avg_race_sum).
arrange(summarise(by_state, avg_race_sum = mean(race_sum)), avg_race_sum)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 51 x 2
## state avg_race_sum
## <chr> <dbl>
## 1 Hawaii 84
## 2 Alaska 92.7
## 3 Oklahoma 92.8
## 4 Washington 96.7
## 5 California 96.9
## 6 Oregon 97.1
## 7 Delaware 97.3
## 8 Massachusetts 97.5
## 9 Maryland 97.6
## 10 District of Columbia 97.6
## # ... with 41 more rows
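The message about ungrouping shown in the output above is informational. As a minimal sketch (not the code used to produce the output above), the same group-and-summarise steps can be written as a pipeline, and the `.groups` argument mentioned in the message can be set to "drop" so that the message is not printed. The toy data set and its values below are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for county_data_9.
toy <- tibble(
  state = c("X", "X", "Y", "Y"),
  race_sum = c(98, 100, 80, 90)
)

# Group by state, average race_sum within each state, explicitly drop
# the grouping, and sort by ascending average.
state_avgs <- toy %>%
  group_by(state) %>%
  summarise(avg_race_sum = mean(race_sum), .groups = "drop") %>%
  arrange(avg_race_sum)

print(state_avgs)
```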
The below output displays the 11 counties with values for race_sum exceeding 100 (sorted by descending race_sum). Because a tibble prints only three significant digits by default, each of these values is displayed as 100. even though it is greater than 100. These 11 counties are Claiborne county in Mississippi, Gosper county in Nebraska, Hooker county in Nebraska, Nance county in Nebraska, Bailey county in Texas, Duval county in Texas, Edwards county in Texas, Kenedy county in Texas, Kent county in Texas, Presidio county in Texas, and Beaver county in Utah. We create the below output using the below code.
# Using the below code, we display the 11 counties with values for race_sum exceeding 100 (sorted by descending race_sum).
arrange(select(filter(county_data_9, race_sum > 100), census_id, county, state, race_sum), desc(race_sum))
## # A tibble: 11 x 4
## census_id county state race_sum
## <chr> <chr> <chr> <dbl>
## 1 31073 Gosper Nebraska 100.
## 2 31091 Hooker Nebraska 100.
## 3 48017 Bailey Texas 100.
## 4 48137 Edwards Texas 100.
## 5 31125 Nance Nebraska 100.
## 6 28021 Claiborne Mississippi 100.
## 7 48131 Duval Texas 100.
## 8 48261 Kenedy Texas 100.
## 9 48263 Kent Texas 100.
## 10 48377 Presidio Texas 100.
## 11 49001 Beaver Utah 100.
The below output indicates that there are no states whose avg_race_sum is exactly equal to 100. We create this output using the below code.
# Using the below code, we display the states whose avg_race_sum is exactly equal to 100.
filter(summarise(by_state, avg_race_sum = mean(race_sum)), avg_race_sum == 100)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 0 x 2
## # ... with 2 variables: state <chr>, avg_race_sum <dbl>
The below output displays the counties that have a value for race_sum which is exactly equal to 100. In the below output, we see that the states of Alabama, Georgia, Kansas, Kentucky, Mississippi, Montana, Nebraska, New Mexico, North Carolina, North Dakota, South Dakota, Texas, and West Virginia each have at least one county with a value for race_sum that is exactly equal to 100. This is a total of 13 states having at least one county with a value for race_sum that is exactly equal to 100. We create the below output using the following code.
# Using the below code, we display the counties that have a value for race_sum which is exactly equal to 100.
select(filter(county_data_9, race_sum == 100), census_id, county, state, race_sum)
## # A tibble: 27 x 4
## census_id county state race_sum
## <chr> <chr> <chr> <dbl>
## 1 1065 Hale Alabama 100
## 2 1131 Wilcox Alabama 100
## 3 13201 Miller Georgia 100
## 4 13307 Webster Georgia 100
## 5 20199 Wallace Kansas 100
## 6 21031 Butler Kentucky 100
## 7 28125 Sharkey Mississippi 100
## 8 30019 Daniels Montana 100
## 9 30069 Petroleum Montana 100
## 10 30109 Wibaux Montana 100
## # ... with 17 more rows
The below code is used to create the below output listing the 13 unique states which we previously mentioned have at least one county whose race_sum is exactly equal to 100.
# Using the below code, we display the 13 states which have at least one county whose race_sum is exactly equal to 100.
unique(select(filter(county_data_9, race_sum == 100), state))
## # A tibble: 13 x 1
## state
## <chr>
## 1 Alabama
## 2 Georgia
## 3 Kansas
## 4 Kentucky
## 5 Mississippi
## 6 Montana
## 7 Nebraska
## 8 New Mexico
## 9 North Carolina
## 10 North Dakota
## 11 South Dakota
## 12 Texas
## 13 West Virginia
The below output indicates that there are 13 states which have at least one county whose race_sum exactly equals 100. The below code is used to create the below output.
# Using the below code, we calculate the number of states which have at least one county whose race_sum exactly equals 100.
nrow(unique(select(filter(county_data_9, race_sum == 100), state)))
## [1] 13
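One caveat about the exact comparisons above: race_sum is a sum of double-precision values, and such sums can differ from the mathematically expected result by a tiny rounding error, so a test with == may treat a value as different from 100 when the difference is only a floating-point artifact. The following is a minimal sketch of a tolerance-based comparison using dplyr::near(). It is an alternative under that assumption, not the method used above, and the toy data set and its values are hypothetical.

```r
library(dplyr)

# Classic illustration: the two sides differ by a tiny rounding error.
exact_match <- (0.1 + 0.2 == 0.3)
tolerant_match <- near(0.1 + 0.2, 0.3)

# Hypothetical toy data with three race percentage columns.
toy <- tibble(
  county = c("P", "Q", "R"),
  hispanic = c(0.1, 50.0, 10.0),
  white = c(0.2, 50.0, 80.0),
  black = c(99.7, 0.0, 9.0)
)
toy <- mutate(toy, race_sum = hispanic + white + black)

# near() treats values within a small tolerance of 100 as equal to 100.
close_to_100 <- filter(toy, near(race_sum, 100))
print(close_to_100)
```

Whether a tolerance-based comparison is appropriate depends on whether the question asks about the stored values or about the intended percentages.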
Using the below code, we access the documentation for the dplyr::min_rank() function, and in so doing, we learn how to use the function to answer this problem.
# We use the below code to access the documentation for the dplyr::min_rank() function.
?dplyr::min_rank
Using the below code, we create a new variable called carpool_rank to rank the counties according to their carpool values.
# Using the below code, we create a new variable to rank the counties according to their carpool values.
carpool_rank <- dplyr::min_rank(desc(county_data_9$carpool))
We create a new data set called county_data_11 which contains the variable carpool_rank and all 37 variables contained in county_data_9. In the output below, we then view the head of the new data set to see that we successfully included the variable carpool_rank, and we see that as desired, the new data set does indeed have 38 variables (just one more than county_data_9). The below code is used to perform the operations mentioned in this paragraph.
# Using the below code, we create a new data set containing carpool_rank and all of the variables in county_data_9.
county_data_11 <- data.frame(county_data_9, carpool_rank)
# Using the below code, we view the head of the new data set.
head(county_data_11)
## census_id state county total_pop men women hispanic white black native
## 1 1001 Alabama Autauga 55221 26745 28476 2.6 75.8 18.5 0.4
## 2 1003 Alabama Baldwin 195121 95314 99807 4.5 83.1 9.5 0.6
## 3 1005 Alabama Barbour 26932 14497 12435 4.6 46.2 46.7 0.2
## 4 1007 Alabama Bibb 22604 12073 10531 2.2 74.5 21.4 0.4
## 5 1009 Alabama Blount 57710 28512 29198 8.6 87.9 1.5 0.3
## 6 1011 Alabama Bullock 10678 5660 5018 4.4 22.2 70.7 1.2
## asian pacific citizen income income_per_cap poverty child_poverty
## 1 1.0 0 40725 51281 24974 12.9 18.6
## 2 0.7 0 147695 50254 27317 13.4 19.2
## 3 0.4 0 20714 32964 16824 26.7 45.3
## 4 0.1 0 17495 38678 18431 16.8 27.9
## 5 0.1 0 42345 45813 20532 16.7 27.2
## 6 0.2 0 8057 31938 17580 24.6 38.4
## professional service office construction production drive carpool transit
## 1 33.2 17.0 24.2 8.6 17.1 87.5 8.8 0.1
## 2 33.1 17.7 27.1 10.8 11.2 84.7 8.8 0.1
## 3 26.8 16.1 23.1 10.8 23.1 83.8 10.9 0.4
## 4 21.5 17.9 17.8 19.0 23.7 83.2 13.5 0.5
## 5 28.5 14.1 23.9 13.5 19.9 84.9 11.2 0.4
## 6 18.8 15.0 19.7 20.1 26.4 74.9 14.9 0.7
## walk other_transp work_at_home mean_commute employed private_work public_work
## 1 0.5 1.3 1.8 26.5 23986 73.6 20.9
## 2 1.0 1.4 3.9 26.4 85953 81.5 12.3
## 3 1.8 1.5 1.6 24.1 8597 71.8 20.8
## 4 0.6 1.5 0.7 28.8 8294 76.8 16.1
## 5 0.9 0.4 2.3 34.9 22189 82.0 13.5
## 6 5.0 1.7 2.8 27.5 3865 79.5 15.1
## self_employed family_work unemployment women_percent race_sum carpool_rank
## 1 5.5 0.0 7.6 51.56734 98.3 2157
## 2 5.8 0.4 7.5 51.15134 98.4 2157
## 3 7.3 0.1 17.6 46.17184 98.1 1103
## 4 6.7 0.4 8.3 46.58910 98.6 391
## 5 4.2 0.4 7.7 50.59435 98.4 986
## 6 5.4 0.0 18.0 46.99382 98.7 204
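Note that data.frame() converts the tibble into a base data frame, which is why the head above prints every column in full rather than in tibble format. As a minimal sketch of an alternative (not the code used above), the rank can be added with mutate(), which keeps the result a tibble and avoids the separate carpool_rank vector. The toy data set and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for county_data_9.
toy <- tibble(
  county = c("A", "B", "C", "D"),
  carpool = c(8.8, 13.5, 8.8, 10.9)
)

# desc() reverses the ordering so that the highest carpool value gets
# rank 1. min_rank() gives tied values the same (smallest) rank.
ranked <- mutate(toy, carpool_rank = min_rank(desc(carpool)))
print(ranked)
```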
In the below output, the 10 highest ranked counties for carpooling are displayed. We create this output using the below code.
# Using the below code, we display the 10 highest ranked counties for carpooling.
arrange(filter(select(county_data_11, census_id, county, state, carpool, carpool_rank), carpool_rank < 11), carpool_rank)
## census_id county state carpool carpool_rank
## 1 13061 Clay Georgia 29.9 1
## 2 18087 LaGrange Indiana 27.0 2
## 3 13165 Jenkins Georgia 25.3 3
## 4 5133 Sevier Arkansas 24.4 4
## 5 20175 Seward Kansas 23.4 5
## 6 48079 Cochran Texas 22.8 6
## 7 48247 Jim Hogg Texas 22.6 7
## 8 48393 Roberts Texas 22.4 8
## 9 39075 Holmes Ohio 21.8 9
## 10 21197 Powell Kentucky 21.6 10
Based on the above output, we see that the 10 highest ranked counties for carpooling are Clay county in Georgia, LaGrange county in Indiana, Jenkins county in Georgia, Sevier county in Arkansas, Seward county in Kansas, Cochran county in Texas, Jim Hogg county in Texas, Roberts county in Texas, Holmes county in Ohio, and Powell county in Kentucky.
In the below output, the 10 lowest ranked counties for carpooling are displayed. Note that due to a tie for carpool ranking between Hyde county in South Dakota and Norton city in Virginia for the 10th lowest spot, there are actually 11 counties displayed below. We create this output using the below code.
# Using the below code, we display the 10 lowest ranked counties for carpooling. Note that 11 end up needing to be displayed due to a tie.
arrange(filter(select(county_data_11, census_id, county, state, carpool, carpool_rank), carpool_rank > nrow(county_data_11) - 11), desc(carpool_rank))
## census_id county state carpool carpool_rank
## 1 48261 Kenedy Texas 0.0 3141
## 2 48269 King Texas 0.0 3141
## 3 48235 Irion Texas 0.9 3140
## 4 31183 Wheeler Nebraska 1.3 3139
## 5 36061 New York New York 1.9 3138
## 6 13309 Wheeler Georgia 2.3 3136
## 7 38029 Emmons North Dakota 2.3 3136
## 8 30019 Daniels Montana 2.6 3134
## 9 31057 Dundy Nebraska 2.6 3134
## 10 46069 Hyde South Dakota 2.8 3132
## 11 51720 Norton city Virginia 2.8 3132
Based on the above output, we see that Kenedy county in Texas, King county in Texas, Irion county in Texas, Wheeler county in Nebraska, New York county in New York, Wheeler county in Georgia, Emmons county in North Dakota, Daniels county in Montana, Dundy county in Nebraska, Hyde county in South Dakota, and Norton city in Virginia are the lowest ranked counties for carpooling.
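The tie handling described above can also be obtained without computing a rank threshold by hand. The following is a minimal sketch, assuming dplyr 1.0.0 or later, that uses slice_min(), whose with_ties argument defaults to TRUE and therefore keeps every row tied for the last position. It is an alternative, not the code used above, and the toy data set and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for county_data_11.
toy <- tibble(
  county = c("A", "B", "C", "D", "E"),
  carpool = c(0.0, 2.8, 5.1, 2.8, 0.9)
)

# Ask for the 3 lowest carpool values. B and D tie for third place,
# so both are kept and 4 rows are returned.
lowest_with_ties <- slice_min(toy, carpool, n = 3)

# Setting with_ties = FALSE returns exactly 3 rows instead.
lowest_no_ties <- slice_min(toy, carpool, n = 3, with_ties = FALSE)

print(lowest_with_ties)
print(lowest_no_ties)
```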
We calculate the average of the county values for carpool_rank in each state, and we display these values in the column avg_carpool_rank in the output below. Since a carpool_rank of 1 corresponds to the county with the highest carpool value, a lower average rank indicates more carpooling. The below output is sorted by ascending avg_carpool_rank. Hence the first entry shown below is the state with the best average carpool rank. That is, the state of Arizona has, on average, the best carpool ranking. The below output was created using the below code.
# Using the below code, we group the data in county_data_11 by state.
by_state_3 <- group_by(county_data_11, state)
# Using the below code, we display the average of the county values for carpool_rank in each state (sorted by ascending avg_carpool_rank).
arrange(summarise(by_state_3, avg_carpool_rank = mean(carpool_rank)), avg_carpool_rank)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 51 x 2
## state avg_carpool_rank
## <chr> <dbl>
## 1 Arizona 971.
## 2 Utah 1019.
## 3 Arkansas 1055.
## 4 Hawaii 1072.
## 5 Alaska 1087.
## 6 Nevada 1100.
## 7 Texas 1106.
## 8 Wyoming 1107.
## 9 California 1122.
## 10 Missouri 1133.
## # ... with 41 more rows
In the below output, we display the top 5 states for carpooling. We create this output using the below code.
# Using the below code, we display the top 5 states for carpooling.
head(arrange(summarise(by_state_3, avg_carpool_rank = mean(carpool_rank)), avg_carpool_rank), 5)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
## state avg_carpool_rank
## <chr> <dbl>
## 1 Arizona 971.
## 2 Utah 1019.
## 3 Arkansas 1055.
## 4 Hawaii 1072.
## 5 Alaska 1087.
Based on the above output, we see that Arizona, Utah, Arkansas, Hawaii, and Alaska are the top 5 states for carpooling.