Homework 3

Importing Data and Data Cleaning

Import the data set using a tidyverse function and NOT with a Base R function. How many rows and columns are in the data set?

library(tidyverse)
ACS <- read_csv("acs_2015_county_data_revised.csv")

dim(ACS)

## [1] 3142   35

There are 3,142 rows with 35 columns in the data set.

Do any data types need changed? Show any code to change variable types and show code/output for a glimpse() command after you’re finished.

str(ACS)

After viewing the structure of the data set, both county and state need to be changed to factor to better fit the data dictionary.

ACS$county <- as.factor(ACS$county)
ACS$state <- as.factor(ACS$state)

After changing the data types, the data set looks near perfect.

glimpse(ACS)

## Rows: 3,142
## Columns: 35
## $ census_id      <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, 1…
## $ state          <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, A…
## $ county         <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bullock, Butle…
## $ total_pop      <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 11664…
## $ men            <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274, 1…
## $ women          <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 60374, …
## $ hispanic       <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7.6, …
## $ white          <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3, 9…
## $ black          <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, 4.8…
## $ native         <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0.4, …
## $ asian          <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0.3, …
## $ pacific        <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ citizen        <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88612,…
## $ income         <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41703,…
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21374,…
## $ poverty        <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6, 1…
## $ child_poverty  <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2, 3…
## $ professional   <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3, 2…
## $ service        <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5, 1…
## $ office         <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3, 1…
## $ construction   <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5, 13…
## $ production     <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4, 2…
## $ drive          <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1, 8…
## $ carpool        <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 12.1…
## $ transit        <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0.2, …
## $ walk           <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1.1, …
## $ other_transp   <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1.4, …
## $ work_at_home   <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1.9, …
## $ mean_commute   <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1, 2…
## $ employed       <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, 136…
## $ private_work   <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1, 7…
## $ public_work    <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1, 1…
## $ self_employed  <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4.1, …
## $ family_work    <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0.5, …
## $ unemployment   <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9, 9…

Utilizing the glimpse() function is a transposed version of the print() function that allows us to view the columns down and the data across the page.

Are there any missing values? How will you handle missing values? Will you impute a missing value with, for example, a mean or median value for the entire column, or will you remove the entire observation? Give a rationale for your decision and show any code/output to handle missing values.

colSums(is.na(ACS))

##      census_id          state         county      total_pop            men 
##              0              0              0              0              0 
##          women       hispanic          white          black         native 
##              0              0              0              0              0 
##          asian        pacific        citizen         income income_per_cap 
##              0              0              0              1              0 
##        poverty  child_poverty   professional        service         office 
##              0              1              0              0              0 
##   construction     production          drive        carpool        transit 
##              0              0              0              0              0 
##           walk   other_transp   work_at_home   mean_commute       employed 
##              0              0              0              0              0 
##   private_work    public_work  self_employed    family_work   unemployment 
##              0              0              0              0              0

It appears there are 2 missing values in the data set. Both child_poverty and income have one missing value each. From the observation above, I would first handle missing data by asking whoever is in charge of the observations if it is appropriate or not to remove data. If so, I would handle this by getting rid of the two observations or imputing them on the median as these won’t impact too much moving forward with the analysis.

ACS$child_poverty[is.na(ACS$child_poverty)] <- median(ACS$child_poverty, na.rm=TRUE)
ACS$income[is.na(ACS$income)] <- median(ACS$income, na.rm=TRUE)
colSums(is.na(ACS))

##      census_id          state         county      total_pop            men 
##              0              0              0              0              0 
##          women       hispanic          white          black         native 
##              0              0              0              0              0 
##          asian        pacific        citizen         income income_per_cap 
##              0              0              0              0              0 
##        poverty  child_poverty   professional        service         office 
##              0              0              0              0              0 
##   construction     production          drive        carpool        transit 
##              0              0              0              0              0 
##           walk   other_transp   work_at_home   mean_commute       employed 
##              0              0              0              0              0 
##   private_work    public_work  self_employed    family_work   unemployment 
##              0              0              0              0              0

Use the summary() function to examine any unusual values. Are there any? If so, how will you handle these unusual values? Show any code/output to handle unusual values. Notes:

For the sake of time, you do not need to create any visualizations or other statistical summaries for every variable—the summary function will suffice for this homework).
You should read the data dictionary for this homework to understand the context behind each variable.

summary(ACS)

##    census_id          state             county       total_pop       
##  Min.   : 1001   Texas   : 254   Washington:  31   Min.   :      85  
##  1st Qu.:18178   Georgia : 159   Jefferson :  26   1st Qu.:   11028  
##  Median :29176   Virginia: 133   Franklin  :  25   Median :   25768  
##  Mean   :30384   Kentucky: 120   Jackson   :  24   Mean   :  100737  
##  3rd Qu.:45080   Missouri: 115   Lincoln   :  24   3rd Qu.:   67552  
##  Max.   :56045   Kansas  : 105   Madison   :  20   Max.   :10038388  
##                  (Other) :2256   (Other)   :2992                     
##       men              women            hispanic          white      
##  Min.   :     42   Min.   :     43   Min.   : 0.000   Min.   : 0.90  
##  1st Qu.:   5546   1st Qu.:   5466   1st Qu.: 1.900   1st Qu.:65.60  
##  Median :  12826   Median :  12907   Median : 3.700   Median :84.60  
##  Mean   :  49565   Mean   :  51171   Mean   : 8.826   Mean   :77.28  
##  3rd Qu.:  33319   3rd Qu.:  34122   3rd Qu.: 9.000   3rd Qu.:93.30  
##  Max.   :4945351   Max.   :5093037   Max.   :98.700   Max.   :99.80  
##                                                                      
##      black            native           asian           pacific        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00000  
##  1st Qu.: 0.600   1st Qu.: 0.100   1st Qu.: 0.200   1st Qu.: 0.00000  
##  Median : 2.100   Median : 0.300   Median : 0.500   Median : 0.00000  
##  Mean   : 8.879   Mean   : 1.766   Mean   : 1.258   Mean   : 0.08475  
##  3rd Qu.:10.175   3rd Qu.: 0.600   3rd Qu.: 1.200   3rd Qu.: 0.00000  
##  Max.   :85.900   Max.   :92.100   Max.   :41.600   Max.   :35.30000  
##                                                                       
##     citizen            income       income_per_cap     poverty    
##  Min.   :     80   Min.   : 19328   Min.   : 8292   Min.   : 1.4  
##  1st Qu.:   8254   1st Qu.: 38826   1st Qu.:20471   1st Qu.:12.0  
##  Median :  19434   Median : 45111   Median :23577   Median :16.0  
##  Mean   :  70804   Mean   : 46830   Mean   :24338   Mean   :16.7  
##  3rd Qu.:  50728   3rd Qu.: 52249   3rd Qu.:27138   3rd Qu.:20.3  
##  Max.   :6046749   Max.   :123453   Max.   :65600   Max.   :53.3  
##                                                                   
##  child_poverty    professional      service          office     
##  Min.   : 0.00   Min.   :13.50   Min.   : 5.00   Min.   : 4.10  
##  1st Qu.:16.10   1st Qu.:26.70   1st Qu.:15.90   1st Qu.:20.20  
##  Median :22.50   Median :30.00   Median :18.00   Median :22.40  
##  Mean   :23.29   Mean   :31.04   Mean   :18.26   Mean   :22.13  
##  3rd Qu.:29.48   3rd Qu.:34.40   3rd Qu.:20.20   3rd Qu.:24.30  
##  Max.   :72.30   Max.   :74.00   Max.   :36.60   Max.   :35.40  
##                                                                 
##   construction     production        drive          carpool     
##  Min.   : 1.70   Min.   : 0.00   Min.   : 5.20   Min.   : 0.00  
##  1st Qu.: 9.80   1st Qu.:11.53   1st Qu.:76.60   1st Qu.: 8.50  
##  Median :12.20   Median :15.40   Median :80.60   Median : 9.90  
##  Mean   :12.74   Mean   :15.82   Mean   :79.08   Mean   :10.33  
##  3rd Qu.:15.00   3rd Qu.:19.40   3rd Qu.:83.60   3rd Qu.:11.88  
##  Max.   :40.30   Max.   :55.60   Max.   :94.60   Max.   :29.90  
##                                                                 
##     transit             walk         other_transp     work_at_home   
##  Min.   : 0.0000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.1000   1st Qu.: 1.400   1st Qu.: 0.900   1st Qu.: 2.800  
##  Median : 0.4000   Median : 2.400   Median : 1.300   Median : 4.000  
##  Mean   : 0.9675   Mean   : 3.307   Mean   : 1.614   Mean   : 4.697  
##  3rd Qu.: 0.8000   3rd Qu.: 4.000   3rd Qu.: 1.900   3rd Qu.: 5.700  
##  Max.   :61.7000   Max.   :71.200   Max.   :39.100   Max.   :37.200  
##                                                                      
##   mean_commute      employed        private_work    public_work   
##  Min.   : 4.90   Min.   :     62   Min.   :25.00   Min.   : 5.80  
##  1st Qu.:19.30   1st Qu.:   4524   1st Qu.:70.90   1st Qu.:13.10  
##  Median :22.90   Median :  10644   Median :75.80   Median :16.10  
##  Mean   :23.15   Mean   :  46387   Mean   :74.44   Mean   :17.35  
##  3rd Qu.:26.60   3rd Qu.:  29254   3rd Qu.:79.80   3rd Qu.:20.10  
##  Max.   :44.00   Max.   :4635465   Max.   :88.30   Max.   :66.20  
##                                                                   
##  self_employed     family_work      unemployment   
##  Min.   : 0.000   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.: 5.400   1st Qu.:0.1000   1st Qu.: 5.500  
##  Median : 6.900   Median :0.2000   Median : 7.500  
##  Mean   : 7.921   Mean   :0.2915   Mean   : 7.815  
##  3rd Qu.: 9.400   3rd Qu.:0.3000   3rd Qu.: 9.700  
##  Max.   :36.600   Max.   :9.8000   Max.   :29.400  
##

The data set contains skewed variables. For instance, men and women has such outstanding outliers; however, we are not able to remove any as we need more information about the values from these variables. Considering the genders come from a population, these numeric values can easily vary to a high degree, so we will leave the values for now. Something else to note is the variable employed. According to the dictionary, these values represent a percentage for ages 16+. Although, these values recorded are not percentages but perhaps a count of those employed. Below, we have modified the variable to accurately represent the definition of the variable while giving the summary statistics of the updated set.

ACS <- ACS %>% mutate(employed = employed/total_pop * 100)
glimpse(ACS)

## Rows: 3,142
## Columns: 35
## $ census_id      <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, 1…
## $ state          <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, A…
## $ county         <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bullock, Butle…
## $ total_pop      <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 11664…
## $ men            <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274, 1…
## $ women          <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 60374, …
## $ hispanic       <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7.6, …
## $ white          <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3, 9…
## $ black          <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, 4.8…
## $ native         <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0.4, …
## $ asian          <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0.3, …
## $ pacific        <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ citizen        <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 88612,…
## $ income         <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 41703,…
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 21374,…
## $ poverty        <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6, 1…
## $ child_poverty  <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2, 3…
## $ professional   <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3, 2…
## $ service        <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5, 1…
## $ office         <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3, 1…
## $ construction   <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5, 13…
## $ production     <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4, 2…
## $ drive          <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1, 8…
## $ carpool        <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 12.1…
## $ transit        <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0.2, …
## $ walk           <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1.1, …
## $ other_transp   <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1.4, …
## $ work_at_home   <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1.9, …
## $ mean_commute   <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1, 2…
## $ employed       <dbl> 43.43637, 44.05113, 31.92113, 36.69262, 38.44914, 36.19…
## $ private_work   <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1, 7…
## $ public_work    <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1, 1…
## $ self_employed  <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4.1, …
## $ family_work    <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0.5, …
## $ unemployment   <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9, 9…

summary(ACS)

##    census_id          state             county       total_pop       
##  Min.   : 1001   Texas   : 254   Washington:  31   Min.   :      85  
##  1st Qu.:18178   Georgia : 159   Jefferson :  26   1st Qu.:   11028  
##  Median :29176   Virginia: 133   Franklin  :  25   Median :   25768  
##  Mean   :30384   Kentucky: 120   Jackson   :  24   Mean   :  100737  
##  3rd Qu.:45080   Missouri: 115   Lincoln   :  24   3rd Qu.:   67552  
##  Max.   :56045   Kansas  : 105   Madison   :  20   Max.   :10038388  
##                  (Other) :2256   (Other)   :2992                     
##       men              women            hispanic          white      
##  Min.   :     42   Min.   :     43   Min.   : 0.000   Min.   : 0.90  
##  1st Qu.:   5546   1st Qu.:   5466   1st Qu.: 1.900   1st Qu.:65.60  
##  Median :  12826   Median :  12907   Median : 3.700   Median :84.60  
##  Mean   :  49565   Mean   :  51171   Mean   : 8.826   Mean   :77.28  
##  3rd Qu.:  33319   3rd Qu.:  34122   3rd Qu.: 9.000   3rd Qu.:93.30  
##  Max.   :4945351   Max.   :5093037   Max.   :98.700   Max.   :99.80  
##                                                                      
##      black            native           asian           pacific        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00000  
##  1st Qu.: 0.600   1st Qu.: 0.100   1st Qu.: 0.200   1st Qu.: 0.00000  
##  Median : 2.100   Median : 0.300   Median : 0.500   Median : 0.00000  
##  Mean   : 8.879   Mean   : 1.766   Mean   : 1.258   Mean   : 0.08475  
##  3rd Qu.:10.175   3rd Qu.: 0.600   3rd Qu.: 1.200   3rd Qu.: 0.00000  
##  Max.   :85.900   Max.   :92.100   Max.   :41.600   Max.   :35.30000  
##                                                                       
##     citizen            income       income_per_cap     poverty    
##  Min.   :     80   Min.   : 19328   Min.   : 8292   Min.   : 1.4  
##  1st Qu.:   8254   1st Qu.: 38826   1st Qu.:20471   1st Qu.:12.0  
##  Median :  19434   Median : 45111   Median :23577   Median :16.0  
##  Mean   :  70804   Mean   : 46830   Mean   :24338   Mean   :16.7  
##  3rd Qu.:  50728   3rd Qu.: 52249   3rd Qu.:27138   3rd Qu.:20.3  
##  Max.   :6046749   Max.   :123453   Max.   :65600   Max.   :53.3  
##                                                                   
##  child_poverty    professional      service          office     
##  Min.   : 0.00   Min.   :13.50   Min.   : 5.00   Min.   : 4.10  
##  1st Qu.:16.10   1st Qu.:26.70   1st Qu.:15.90   1st Qu.:20.20  
##  Median :22.50   Median :30.00   Median :18.00   Median :22.40  
##  Mean   :23.29   Mean   :31.04   Mean   :18.26   Mean   :22.13  
##  3rd Qu.:29.48   3rd Qu.:34.40   3rd Qu.:20.20   3rd Qu.:24.30  
##  Max.   :72.30   Max.   :74.00   Max.   :36.60   Max.   :35.40  
##                                                                 
##   construction     production        drive          carpool     
##  Min.   : 1.70   Min.   : 0.00   Min.   : 5.20   Min.   : 0.00  
##  1st Qu.: 9.80   1st Qu.:11.53   1st Qu.:76.60   1st Qu.: 8.50  
##  Median :12.20   Median :15.40   Median :80.60   Median : 9.90  
##  Mean   :12.74   Mean   :15.82   Mean   :79.08   Mean   :10.33  
##  3rd Qu.:15.00   3rd Qu.:19.40   3rd Qu.:83.60   3rd Qu.:11.88  
##  Max.   :40.30   Max.   :55.60   Max.   :94.60   Max.   :29.90  
##                                                                 
##     transit             walk         other_transp     work_at_home   
##  Min.   : 0.0000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.1000   1st Qu.: 1.400   1st Qu.: 0.900   1st Qu.: 2.800  
##  Median : 0.4000   Median : 2.400   Median : 1.300   Median : 4.000  
##  Mean   : 0.9675   Mean   : 3.307   Mean   : 1.614   Mean   : 4.697  
##  3rd Qu.: 0.8000   3rd Qu.: 4.000   3rd Qu.: 1.900   3rd Qu.: 5.700  
##  Max.   :61.7000   Max.   :71.200   Max.   :39.100   Max.   :37.200  
##                                                                      
##   mean_commute      employed      private_work    public_work   
##  Min.   : 4.90   Min.   :16.57   Min.   :25.00   Min.   : 5.80  
##  1st Qu.:19.30   1st Qu.:39.01   1st Qu.:70.90   1st Qu.:13.10  
##  Median :22.90   Median :43.67   Median :75.80   Median :16.10  
##  Mean   :23.15   Mean   :43.37   Mean   :74.44   Mean   :17.35  
##  3rd Qu.:26.60   3rd Qu.:48.20   3rd Qu.:79.80   3rd Qu.:20.10  
##  Max.   :44.00   Max.   :76.24   Max.   :88.30   Max.   :66.20  
##                                                                 
##  self_employed     family_work      unemployment   
##  Min.   : 0.000   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.: 5.400   1st Qu.:0.1000   1st Qu.: 5.500  
##  Median : 6.900   Median :0.2000   Median : 7.500  
##  Mean   : 7.921   Mean   :0.2915   Mean   : 7.815  
##  3rd Qu.: 9.400   3rd Qu.:0.3000   3rd Qu.: 9.700  
##  Max.   :36.600   Max.   :9.8000   Max.   :29.400  
##

Data Manipulation and Insights

How many counties have more women than men?

ACS %>% mutate(women_county = women > men) %>%
  summarize(Total_Women = sum(women_county))

## # A tibble: 1 × 1
##   Total_Women
##         <int>
## 1        1985

It appears there are 1,985 counties with more women than there are men.

How many counties have an unemployment rate lower than 10%?

ACS %>% mutate(unemployment_county = unemployment < 10) %>%
  summarize(Total_Unemployment = sum(unemployment_county))

## # A tibble: 1 × 1
##   Total_Unemployment
##                <int>
## 1               2420

It appears there are 2,420 counties with an unemployment rate lower than 10%.

What are the top 10 counties with the highest mean commute? Show the census ID, county name, state, and the mean_commute in your final answer (sorted by mean_commute). Notes:

Use the variable mean_commute to answer this question.
Leverage the dplyr::top_n() function. Read the documentation for this function.

ACS  %>%  arrange(desc(mean_commute)) %>% 
          select(census_id, county, state,  mean_commute) %>% 
          top_n(n = 10, wt = mean_commute)

## # A tibble: 10 × 4
##    census_id county       state         mean_commute
##        <dbl> <fct>        <fct>                <dbl>
##  1     42103 Pike         Pennsylvania          44  
##  2     36005 Bronx        New York              43  
##  3     24017 Charles      Maryland              42.8
##  4     51187 Warren       Virginia              42.7
##  5     36081 Queens       New York              42.6
##  6     36085 Richmond     New York              42.6
##  7     51193 Westmoreland Virginia              42.5
##  8      8093 Park         Colorado              42.4
##  9     36047 Kings        New York              41.7
## 10     54015 Clay         West Virginia         41.4

After leveraging the dplyr::top_n() function, we can see that the following counties in respective, descending order have the highest mean commute: Pike, Bronx, Charles, Warren, Queens, Richmond, Westmoreland, Park, Kings, and Clay.

Create a new variable that calculates the percentage of women for each county and then find the top 10 counties with the lowest percentages. Show the census ID, county name, state, and the percentage in your final answer (sorted by ascending percentage).

ACS %>% mutate(women_percentage = women/total_pop * 100) %>% 
        arrange(women_percentage) %>% 
        select(census_id, county, women_percentage, state) %>% 
        top_n(n = -10, wt = women_percentage)

## # A tibble: 10 × 4
##    census_id county                 women_percentage state       
##        <dbl> <fct>                             <dbl> <fct>       
##  1     42053 Forest                             26.8 Pennsylvania
##  2      8011 Bent                               31.4 Colorado    
##  3     51183 Sussex                             31.5 Virginia    
##  4     13309 Wheeler                            32.1 Georgia     
##  5      6035 Lassen                             33.2 California  
##  6     48095 Concho                             33.3 Texas       
##  7     13053 Chattahoochee                      33.4 Georgia     
##  8      2013 Aleutians East Borough             33.5 Alaska      
##  9     22125 West Feliciana                     33.6 Louisiana   
## 10     32027 Pershing                           33.7 Nevada

After leveraging the dplyr::top_n() function, the new variable women_percentage illustrates the percentage of women for each county. More specifically, the top 10 counties with the lowest percentages for women population. In ascending order, we have: Forest, Bent, Sussex, Wheeler, Lassen, Concho, Chattahoochee, Aleutians East Borough, West Feliciana, and Pershing.

Create a new variable that calculates the sum of all race percentage variables (these columns are the “hispanic”, “white”, “black”, “native”, “asian”, and “pacific” variables).

What are the top 10 counties with the lowest sum of these race percentage variables?

ACS %>% mutate(race_sums = hispanic +  white + black + native + asian + pacific) %>% 
        arrange(race_sums) %>% 
        select(census_id, county, race_sums, state,) %>% 
        top_n(n = -10, wt = race_sums)

## # A tibble: 10 × 4
##    census_id county                   race_sums state    
##        <dbl> <fct>                        <dbl> <fct>    
##  1     15001 Hawaii                        76.4 Hawaii   
##  2     15009 Maui                          79.2 Hawaii   
##  3     40097 Mayes                         79.7 Oklahoma 
##  4     15003 Honolulu                      81.5 Hawaii   
##  5     40123 Pontotoc                      82.8 Oklahoma 
##  6     47061 Grundy                        83   Tennessee
##  7      2282 Yakutat City and Borough      83.4 Alaska   
##  8     40069 Johnston                      84   Oklahoma 
##  9     15007 Kauai                         84.1 Hawaii   
## 10     40003 Alfalfa                       85.1 Oklahoma

In terms of the counties with the lowest sum of race as a percentage, Hawaii has the lowest followed by Maui, Mayes, Honolulu, Pontotoc, Grundy, Yakutat City and Borough, Johnston, Kauai, and Alfalfa.

Which state, on average, has the lowest sum of these race percentage variables?

  ACS %>% group_by(state) %>% 
        mutate(sum_race = hispanic +  white + black + native + asian + pacific) %>% 
        summarize(mean_sum_race = mean(sum_race)) %>% 
        arrange(mean_sum_race) %>% 
        top_n(n =- 1, wt = mean_sum_race)

## # A tibble: 1 × 2
##   state  mean_sum_race
##   <fct>          <dbl>
## 1 Hawaii            84

In terms of which state, on average, has the lowest sum of race percentage would be Hawaii at 84%, which would make sense given the context of the state.

Do any counties have a sum greater than 100%?

ACS_Race <- ACS %>% mutate(sum_race = hispanic +  white + black + native + asian + pacific)
ACS_Race$sum_race <- format(round(ACS_Race$sum_race, 10), nsmall = 10)
ACS_Race %>%  mutate(percentage_gt100 = sum_race > 100.01) %>% 
              filter(percentage_gt100 == TRUE) %>% 
              select(census_id, county, state, sum_race, percentage_gt100)

## # A tibble: 5 × 5
##   census_id county  state    sum_race       percentage_gt100
##       <dbl> <fct>   <fct>    <chr>          <lgl>           
## 1     31073 Gosper  Nebraska 100.1000000000 TRUE            
## 2     31091 Hooker  Nebraska 100.1000000000 TRUE            
## 3     31125 Nance   Nebraska 100.1000000000 TRUE            
## 4     48017 Bailey  Texas    100.1000000000 TRUE            
## 5     48137 Edwards Texas    100.1000000000 TRUE

There appears to be five counties that have a recorded value of over 100% in terms of race percentage. They are: Gosper, Hooker, Nance, Bailey, and Edwards.

How many states have a sum that equals exactly to 100%?

ACS_Race %>%  mutate(percentage_100 = (sum_race < 100.1 & sum_race >=100.0)) %>% 
              filter(percentage_100 == TRUE) %>% 
              select(census_id, state, sum_race, percentage_100)  %>% 
              distinct(state)

## # A tibble: 16 × 1
##    state         
##    <fct>         
##  1 Alabama       
##  2 Georgia       
##  3 Kansas        
##  4 Kentucky      
##  5 Mississippi   
##  6 Montana       
##  7 Nebraska      
##  8 Nevada        
##  9 New Mexico    
## 10 North Carolina
## 11 North Dakota  
## 12 Oregon        
## 13 South Dakota  
## 14 Texas         
## 15 Utah          
## 16 West Virginia

There are 16 states that sum to exactly 100%.

Using the carpool variable,

Use the dplyr::top_n() function to create a new variable called carpool_rank where the highest ranked county (rank = 1) is the county with the highest carpool value. Read the documentation carefully for the ranking function.

ACS$carpool_rank <- min_rank(desc(ACS$carpool))

carpool_rank has been created as a new variable in the data set.

Find the 10 highest ranked counties for carpooling. Show the census ID, county name, state, carpool value, and carpool_rank in your final answer.

ACS %>% arrange(carpool_rank) %>%
        select(census_id, county, state, carpool, carpool_rank) %>%
        top_n(n = -10, wt = carpool_rank)

## # A tibble: 10 × 5
##    census_id county   state    carpool carpool_rank
##        <dbl> <fct>    <fct>      <dbl>        <int>
##  1     13061 Clay     Georgia     29.9            1
##  2     18087 LaGrange Indiana     27              2
##  3     13165 Jenkins  Georgia     25.3            3
##  4      5133 Sevier   Arkansas    24.4            4
##  5     20175 Seward   Kansas      23.4            5
##  6     48079 Cochran  Texas       22.8            6
##  7     48247 Jim Hogg Texas       22.6            7
##  8     48393 Roberts  Texas       22.4            8
##  9     39075 Holmes   Ohio        21.8            9
## 10     21197 Powell   Kentucky    21.6           10

From this, we see that, in descending order, Clay, LaGrange, Jenkins, Sevier, Seward, Cochran, Jim Hogg, Roberts, Holmes, and Powell have the 10 highest ranked data for carpooling.

Find the 10 lowest ranked counties for carpooling. Show the same variables in your final answer.

ACS %>% arrange(desc(carpool_rank)) %>%
        select(census_id, county, state, carpool, carpool_rank) %>%
        top_n(n = 10, wt = carpool_rank)

## # A tibble: 11 × 5
##    census_id county      state        carpool carpool_rank
##        <dbl> <fct>       <fct>          <dbl>        <int>
##  1     48261 Kenedy      Texas            0           3141
##  2     48269 King        Texas            0           3141
##  3     48235 Irion       Texas            0.9         3140
##  4     31183 Wheeler     Nebraska         1.3         3139
##  5     36061 New York    New York         1.9         3138
##  6     13309 Wheeler     Georgia          2.3         3136
##  7     38029 Emmons      North Dakota     2.3         3136
##  8     30019 Daniels     Montana          2.6         3134
##  9     31057 Dundy       Nebraska         2.6         3134
## 10     46069 Hyde        South Dakota     2.8         3132
## 11     51720 Norton city Virginia         2.8         3132

Opposite from 10(b), we have the lowest ranked counties for carpooling, which are: Kenedy, King, Irion, Wheeler, New York, Wheeler, Emmons, Daniels, Dundy, Hyde, and Norton City.

On average, what state is the best ranked for carpooling?

ACS %>%  group_by(state) %>% 
         summarize(state_mean_carpool = mean(carpool)) %>%
         mutate(state_mean_rank = min_rank(desc(state_mean_carpool))) %>% 
         arrange(state_mean_rank) %>% 
         top_n(n = -1, wt = state_mean_rank)

## # A tibble: 1 × 3
##   state  state_mean_carpool state_mean_rank
##   <fct>               <dbl>           <int>
## 1 Alaska               12.1               1

On average, we see that Alaska has the highest ranked percentage for carpooling.

What are the top 5 states for carpooling?

ACS %>%  group_by(state) %>% 
         summarize(state_mean_carpool = mean(carpool)) %>%
         mutate(state_mean_rank = min_rank(desc(state_mean_carpool))) %>% 
         arrange(state_mean_rank) %>% 
         top_n(n = -5, wt = state_mean_rank)

## # A tibble: 5 × 3
##   state    state_mean_carpool state_mean_rank
##   <fct>                 <dbl>           <int>
## 1 Alaska                 12.1               1
## 2 Arkansas               11.9               2
## 3 Utah                   11.9               3
## 4 Texas                  11.8               4
## 5 Nevada                 11.7               5

The top five states for carpooling in descending order are: Alaska, Arkansas, Utah, Texas, and Nevada.

Homework 3

Colton Dolezal

11/11/2021

Importing Data and Data Cleaning

Data Manipulation and Insights