To import the data set "acs_2015_county_data_revised.csv":

library(tidyverse)  # read_csv() comes from readr, which the tidyverse loads

countydata <- read_csv("~/R/Week 4/homework3/acs_2015_county_data_revised.csv")

Question 1

The data set has a total of 3,142 rows and 35 columns. One of my initial observations is that some columns are recorded as raw counts while others are recorded as percentages. This may make working with the data as a whole difficult, so I am debating whether to convert the count columns to percentages for consistency.
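
A quick way to confirm those dimensions directly from the imported data:

dim(countydata)   # should return 3142 35 (rows, columns)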

Question 2

glimpse(countydata)
## Rows: 3,142
## Columns: 35
## $ census_id      <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017...
## $ state          <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama...
## $ county         <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "...
## $ total_pop      <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 11...
## $ men            <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274...
## $ women          <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 6037...
## $ hispanic       <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7....
## $ white          <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3...
## $ black          <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, ...
## $ native         <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0....
## $ asian          <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0....
## $ pacific        <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....
## $ citizen        <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 886...
## $ income         <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 417...
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 213...
## $ poverty        <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6...
## $ child_poverty  <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2...
## $ professional   <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3...
## $ service        <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5...
## $ office         <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3...
## $ construction   <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5,...
## $ production     <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4...
## $ drive          <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1...
## $ carpool        <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 1...
## $ transit        <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0....
## $ walk           <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1....
## $ other_transp   <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1....
## $ work_at_home   <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1....
## $ mean_commute   <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1...
## $ employed       <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, ...
## $ private_work   <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1...
## $ public_work    <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1...
## $ self_employed  <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4....
## $ family_work    <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0....
## $ unemployment   <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9...

I think I should change census_id to a character value instead of a double (continuous numeric) since its role is merely to identify an area. I tried to do this multiple times using the following code but wasn’t successful.

as.character(as.numeric(countydata$census_id))
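
The conversion most likely had no effect because the result of as.character() was never assigned back to the data frame (the as.numeric() call is also redundant, since census_id is already numeric). A minimal sketch of how the conversion could be stored, not applied here since the later output still shows census_id as a double:

countydata <- countydata %>%
  mutate(census_id = as.character(census_id))   # overwrite the column with its character version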

Question 3

One of the first steps in cleaning the data was to identify NA values:

colSums(is.na(countydata))
##      census_id          state         county      total_pop            men 
##              0              0              0              0              0 
##          women       hispanic          white          black         native 
##              0              0              0              0              0 
##          asian        pacific        citizen         income income_per_cap 
##              0              0              0              1              0 
##        poverty  child_poverty   professional        service         office 
##              0              1              0              0              0 
##   construction     production          drive        carpool        transit 
##              0              0              0              0              0 
##           walk   other_transp   work_at_home   mean_commute       employed 
##              0              0              0              0              0 
##   private_work    public_work  self_employed    family_work   unemployment 
##              0              0              0              0              0

As you can see from the output above, there are very few missing values in this data set: only two NAs, one in the income column and one in the child_poverty column. It seems reasonable to replace each NA value with the mean of its column.

# compute the mean of child_poverty on a copy of the column and fill in its NA
is.na_replace_mean <- countydata$child_poverty
child_poverty_mean <- mean(is.na_replace_mean, na.rm = TRUE)
is.na_replace_mean[is.na(is.na_replace_mean)] <- child_poverty_mean

# repeat for income
is.na_replace_mean <- countydata$income
income_mean <- mean(is.na_replace_mean, na.rm = TRUE)
is.na_replace_mean[is.na(is.na_replace_mean)] <- income_mean
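
Note that the code above only modifies a copy of each column, so countydata itself is unchanged (which is why the summary below still reports one NA in income and one in child_poverty). A sketch of how the same mean imputation could be written back into the data, using tidyr::replace_na() from the tidyverse:

countydata <- countydata %>%
  replace_na(list(
    income        = mean(countydata$income, na.rm = TRUE),         # fill the one missing income
    child_poverty = mean(countydata$child_poverty, na.rm = TRUE)   # fill the one missing child_poverty
  ))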

Question 4

summary(countydata)
##    census_id        state              county            total_pop       
##  Min.   : 1001   Length:3142        Length:3142        Min.   :      85  
##  1st Qu.:18178   Class :character   Class :character   1st Qu.:   11028  
##  Median :29176   Mode  :character   Mode  :character   Median :   25768  
##  Mean   :30384                                         Mean   :  100737  
##  3rd Qu.:45081                                         3rd Qu.:   67552  
##  Max.   :56045                                         Max.   :10038388  
##                                                                          
##       men              women            hispanic          white      
##  Min.   :     42   Min.   :     43   Min.   : 0.000   Min.   : 0.90  
##  1st Qu.:   5546   1st Qu.:   5466   1st Qu.: 1.900   1st Qu.:65.60  
##  Median :  12826   Median :  12907   Median : 3.700   Median :84.60  
##  Mean   :  49565   Mean   :  51171   Mean   : 8.826   Mean   :77.28  
##  3rd Qu.:  33319   3rd Qu.:  34122   3rd Qu.: 9.000   3rd Qu.:93.30  
##  Max.   :4945351   Max.   :5093037   Max.   :98.700   Max.   :99.80  
##                                                                      
##      black            native           asian           pacific        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00000  
##  1st Qu.: 0.600   1st Qu.: 0.100   1st Qu.: 0.200   1st Qu.: 0.00000  
##  Median : 2.100   Median : 0.300   Median : 0.500   Median : 0.00000  
##  Mean   : 8.879   Mean   : 1.766   Mean   : 1.258   Mean   : 0.08475  
##  3rd Qu.:10.175   3rd Qu.: 0.600   3rd Qu.: 1.200   3rd Qu.: 0.00000  
##  Max.   :85.900   Max.   :92.100   Max.   :41.600   Max.   :35.30000  
##                                                                       
##     citizen            income       income_per_cap     poverty    
##  Min.   :     80   Min.   : 19328   Min.   : 8292   Min.   : 1.4  
##  1st Qu.:   8254   1st Qu.: 38826   1st Qu.:20471   1st Qu.:12.0  
##  Median :  19434   Median : 45111   Median :23577   Median :16.0  
##  Mean   :  70804   Mean   : 46830   Mean   :24338   Mean   :16.7  
##  3rd Qu.:  50728   3rd Qu.: 52250   3rd Qu.:27138   3rd Qu.:20.3  
##  Max.   :6046749   Max.   :123453   Max.   :65600   Max.   :53.3  
##                    NA's   :1                                      
##  child_poverty    professional      service          office     
##  Min.   : 0.00   Min.   :13.50   Min.   : 5.00   Min.   : 4.10  
##  1st Qu.:16.10   1st Qu.:26.70   1st Qu.:15.90   1st Qu.:20.20  
##  Median :22.50   Median :30.00   Median :18.00   Median :22.40  
##  Mean   :23.29   Mean   :31.04   Mean   :18.26   Mean   :22.13  
##  3rd Qu.:29.50   3rd Qu.:34.40   3rd Qu.:20.20   3rd Qu.:24.30  
##  Max.   :72.30   Max.   :74.00   Max.   :36.60   Max.   :35.40  
##  NA's   :1                                                      
##   construction     production        drive          carpool     
##  Min.   : 1.70   Min.   : 0.00   Min.   : 5.20   Min.   : 0.00  
##  1st Qu.: 9.80   1st Qu.:11.53   1st Qu.:76.60   1st Qu.: 8.50  
##  Median :12.20   Median :15.40   Median :80.60   Median : 9.90  
##  Mean   :12.74   Mean   :15.82   Mean   :79.08   Mean   :10.33  
##  3rd Qu.:15.00   3rd Qu.:19.40   3rd Qu.:83.60   3rd Qu.:11.88  
##  Max.   :40.30   Max.   :55.60   Max.   :94.60   Max.   :29.90  
##                                                                 
##     transit             walk         other_transp     work_at_home   
##  Min.   : 0.0000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.1000   1st Qu.: 1.400   1st Qu.: 0.900   1st Qu.: 2.800  
##  Median : 0.4000   Median : 2.400   Median : 1.300   Median : 4.000  
##  Mean   : 0.9675   Mean   : 3.307   Mean   : 1.614   Mean   : 4.697  
##  3rd Qu.: 0.8000   3rd Qu.: 4.000   3rd Qu.: 1.900   3rd Qu.: 5.700  
##  Max.   :61.7000   Max.   :71.200   Max.   :39.100   Max.   :37.200  
##                                                                      
##   mean_commute      employed        private_work    public_work   
##  Min.   : 4.90   Min.   :     62   Min.   :25.00   Min.   : 5.80  
##  1st Qu.:19.30   1st Qu.:   4524   1st Qu.:70.90   1st Qu.:13.10  
##  Median :22.90   Median :  10644   Median :75.80   Median :16.10  
##  Mean   :23.15   Mean   :  46387   Mean   :74.44   Mean   :17.35  
##  3rd Qu.:26.60   3rd Qu.:  29254   3rd Qu.:79.80   3rd Qu.:20.10  
##  Max.   :44.00   Max.   :4635465   Max.   :88.30   Max.   :66.20  
##                                                                   
##  self_employed     family_work      unemployment   
##  Min.   : 0.000   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.: 5.400   1st Qu.:0.1000   1st Qu.: 5.500  
##  Median : 6.900   Median :0.2000   Median : 7.500  
##  Mean   : 7.921   Mean   :0.2915   Mean   : 7.815  
##  3rd Qu.: 9.400   3rd Qu.:0.3000   3rd Qu.: 9.700  
##  Max.   :36.600   Max.   :9.8000   Max.   :29.400  
## 

I don’t see anything that stands out when looking at the summary of each variable. I was looking for odd values, such as a maximum of 100 in one of the occupation columns, since each of those columns should be one piece of a set of percentages that sums to about 100, but I didn’t find any.
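
A quick sanity check of that assumption, treating professional, service, office, construction, and production as the occupation columns that should sum to roughly 100 per county (their column means in the summary above already add up to about 100):

countydata %>%
  mutate(occupation_total = professional + service + office + construction + production) %>%
  summarise(min_total = min(occupation_total, na.rm = TRUE),   # smallest per-county sum
            max_total = max(occupation_total, na.rm = TRUE))   # largest per-county sum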

Question 5

There are 1,985 counties in which there are more women than men.
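
One way this count could be computed:

countydata %>%
  filter(women > men) %>%   # counties where women outnumber men
  nrow()                    # should return 1985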

Question 6

There are 2,420 counties with an unemployment rate less than 10%.
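
Similarly, for the unemployment count:

countydata %>%
  filter(unemployment < 10) %>%   # unemployment is stored as a percentage
  nrow()                          # should return 2420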

Question 7

The top 10 counties with the highest mean commute are Pike, Bronx, Charles, Warren, Queens, Richmond, Westmoreland, Park, Kings, and Clay:

MeanCommute <- select(countydata, census_id, county, state, mean_commute)
MeanCommute <- top_n(MeanCommute, 10, mean_commute)
arrange(MeanCommute, desc(mean_commute))
## # A tibble: 10 x 4
##    census_id county       state         mean_commute
##        <dbl> <chr>        <chr>                <dbl>
##  1     42103 Pike         Pennsylvania          44  
##  2     36005 Bronx        New York              43  
##  3     24017 Charles      Maryland              42.8
##  4     51187 Warren       Virginia              42.7
##  5     36081 Queens       New York              42.6
##  6     36085 Richmond     New York              42.6
##  7     51193 Westmoreland Virginia              42.5
##  8      8093 Park         Colorado              42.4
##  9     36047 Kings        New York              41.7
## 10     54015 Clay         West Virginia         41.4

Question 8

The ten counties with the lowest percentage of women are Forest, Bent, Sussex, Wheeler, Lassen, Concho, Chattahoochee, Aleutians East Borough, West Feliciana, and Pershing.

perctwomen <- countydata %>% mutate(Percent_Women = women / total_pop)
pwomen <- select(perctwomen, census_id, state, county, Percent_Women)
pwomen <- top_n(pwomen, -10, Percent_Women)
arrange(pwomen, Percent_Women)
## # A tibble: 10 x 4
##    census_id state        county                 Percent_Women
##        <dbl> <chr>        <chr>                          <dbl>
##  1     42053 Pennsylvania Forest                         0.268
##  2      8011 Colorado     Bent                           0.314
##  3     51183 Virginia     Sussex                         0.315
##  4     13309 Georgia      Wheeler                        0.321
##  5      6035 California   Lassen                         0.332
##  6     48095 Texas        Concho                         0.333
##  7     13053 Georgia      Chattahoochee                  0.334
##  8      2013 Alaska       Aleutians East Borough         0.335
##  9     22125 Louisiana    West Feliciana                 0.336
## 10     32027 Nevada       Pershing                       0.337

Question 9

  1. The ten counties with the lowest combined race percentage are Hawaii, Maui, Mayes, Honolulu, Pontotoc, Grundy, Yakutat City and Borough, Johnston, Kauai, and Alfalfa.
  2. The state with the lowest average combined percentage is Hawaii.
  3. No state has an average sum equal to exactly 100%.
  4. Five counties have a total sum over 100%.

racecounty <- countydata %>%
  rowwise() %>%
  mutate(
  Total_race = sum(c(white,black,asian,native,pacific,hispanic)))
topcountyrace2 <- select(racecounty, county, state, Total_race)
topcountyrace2 <- top_n(topcountyrace2, -10, Total_race)
arrange(topcountyrace2, Total_race)
## # A tibble: 3,142 x 3
## # Rowwise: 
##    county                   state     Total_race
##    <chr>                    <chr>          <dbl>
##  1 Hawaii                   Hawaii          76.4
##  2 Maui                     Hawaii          79.2
##  3 Mayes                    Oklahoma        79.7
##  4 Honolulu                 Hawaii          81.5
##  5 Pontotoc                 Oklahoma        82.8
##  6 Grundy                   Tennessee       83  
##  7 Yakutat City and Borough Alaska          83.4
##  8 Johnston                 Oklahoma        84  
##  9 Kauai                    Hawaii          84.1
## 10 Alfalfa                  Oklahoma        85.1
## # ... with 3,132 more rows
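
Note that the output above still reports 3,142 rows: because racecounty was built with rowwise(), top_n() treats every row as its own group and keeps everything, and the ten lowest totals only appear first because of the arrange() call. A sketch of one way to get an actual ten-row result, assuming dplyr >= 1.0.0 for slice_min():

racecounty %>%
  ungroup() %>%                         # drop the rowwise grouping first
  select(county, state, Total_race) %>%
  slice_min(Total_race, n = 10)         # keep only the ten smallest totals
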
topcountyrace3 <- topcountyrace2 %>% group_by(state) %>% summarise_at(vars(Total_race),funs(mean(.,na.rm=TRUE)))
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
arrange(topcountyrace3, Total_race)
## # A tibble: 51 x 2
##    state                Total_race
##    <chr>                     <dbl>
##  1 Hawaii                     84  
##  2 Alaska                     92.7
##  3 Oklahoma                   92.8
##  4 Washington                 96.7
##  5 California                 96.9
##  6 Oregon                     97.1
##  7 Delaware                   97.3
##  8 Massachusetts              97.5
##  9 Maryland                   97.6
## 10 District of Columbia       97.6
## # ... with 41 more rows
topcountyrace3 <- topcountyrace2 %>% group_by(state) %>% summarise_at(vars(Total_race),funs(mean(.,na.rm=TRUE)))
arrange(topcountyrace3, desc(Total_race))
## # A tibble: 51 x 2
##    state        Total_race
##    <chr>             <dbl>
##  1 Mississippi        99.2
##  2 Iowa               98.8
##  3 Texas              98.8
##  4 Nebraska           98.7
##  5 Utah               98.6
##  6 Indiana            98.6
##  7 Pennsylvania       98.6
##  8 Kentucky           98.6
##  9 Wisconsin          98.5
## 10 Alabama            98.5
## # ... with 41 more rows
length(which(topcountyrace2$Total_race > 100.00))
## [1] 5
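
Summing a logical vector gives the same count a bit more directly:

sum(topcountyrace2$Total_race > 100)   # counts the counties whose combined percentage exceeds 100
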
racecounty <- countydata %>%
  rowwise() %>%
  mutate(
  Total_race = sum(c(white,black,asian,native,pacific,hispanic)))
topcountyrace2 <- select(racecounty, county, state, Total_race)
top_n(topcountyrace2, -10, Total_race)
## # A tibble: 3,142 x 3
## # Rowwise: 
##    county   state   Total_race
##    <chr>    <chr>        <dbl>
##  1 Autauga  Alabama       98.3
##  2 Baldwin  Alabama       98.4
##  3 Barbour  Alabama       98.1
##  4 Bibb     Alabama       98.6
##  5 Blount   Alabama       98.4
##  6 Bullock  Alabama       98.7
##  7 Butler   Alabama       98.8
##  8 Calhoun  Alabama       97.9
##  9 Chambers Alabama       99  
## 10 Cherokee Alabama       98.9
## # ... with 3,132 more rows
arrange(topcountyrace2, desc(Total_race))
## # A tibble: 3,142 x 3
## # Rowwise: 
##    county  state    Total_race
##    <chr>   <chr>         <dbl>
##  1 Gosper  Nebraska       100.
##  2 Hooker  Nebraska       100.
##  3 Nance   Nebraska       100.
##  4 Bailey  Texas          100.
##  5 Edwards Texas          100.
##  6 Hale    Alabama        100 
##  7 Wilcox  Alabama        100 
##  8 Miller  Georgia        100 
##  9 Webster Georgia        100 
## 10 Wallace Kansas         100 
## # ... with 3,132 more rows

Question 10

carpoolrank <- countydata %>%
  mutate(Carpool_Rank = min_rank(desc(carpool)))   # rank 1 = highest carpool percentage
carpoolrank <- select(carpoolrank, census_id, county, state, carpool, Carpool_Rank)
carpoolrank <- arrange(carpoolrank, Carpool_Rank)
top_n(carpoolrank, -10, Carpool_Rank)
## # A tibble: 10 x 5
##    census_id county   state    carpool Carpool_Rank
##        <dbl> <chr>    <chr>      <dbl>        <int>
##  1     13061 Clay     Georgia     29.9            1
##  2     18087 LaGrange Indiana     27              2
##  3     13165 Jenkins  Georgia     25.3            3
##  4      5133 Sevier   Arkansas    24.4            4
##  5     20175 Seward   Kansas      23.4            5
##  6     48079 Cochran  Texas       22.8            6
##  7     48247 Jim Hogg Texas       22.6            7
##  8     48393 Roberts  Texas       22.4            8
##  9     39075 Holmes   Ohio        21.8            9
## 10     21197 Powell   Kentucky    21.6           10
top_n(carpoolrank, 10, Carpool_Rank)
## # A tibble: 11 x 5
##    census_id county      state        carpool Carpool_Rank
##        <dbl> <chr>       <chr>          <dbl>        <int>
##  1     46069 Hyde        South Dakota     2.8         3132
##  2     51720 Norton city Virginia         2.8         3132
##  3     30019 Daniels     Montana          2.6         3134
##  4     31057 Dundy       Nebraska         2.6         3134
##  5     13309 Wheeler     Georgia          2.3         3136
##  6     38029 Emmons      North Dakota     2.3         3136
##  7     36061 New York    New York         1.9         3138
##  8     31183 Wheeler     Nebraska         1.3         3139
##  9     48235 Irion       Texas            0.9         3140
## 10     48261 Kenedy      Texas            0           3141
## 11     48269 King        Texas            0           3141
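
The table above returns 11 rows because Kenedy and King tie at a carpool value of 0 and therefore share a rank. If exactly ten rows were wanted, slice_max() could drop the tie (assuming dplyr >= 1.0.0):

carpoolrank %>%
  slice_max(Carpool_Rank, n = 10, with_ties = FALSE)   # returns exactly ten rows even with ties
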
  1. On average, Alaska is ranked best for carpooling.

  2. The top five states are Alaska, Arkansas, Utah, Texas and Nevada.

carpoolrank2 <- carpoolrank %>%
  group_by(state) %>%
  summarise_at(vars(carpool), list(~ mean(., na.rm = TRUE)))   # per-state mean carpool percentage
carpoolrank2 <- mutate(carpoolrank2, Carpool_Rank = min_rank(desc(carpool)))
carpoolrank2 <- arrange(carpoolrank2, Carpool_Rank)
top_n(carpoolrank2, -5, Carpool_Rank)
## # A tibble: 5 x 3
##   state    carpool Carpool_Rank
##   <chr>      <dbl>        <int>
## 1 Alaska      12.1            1
## 2 Arkansas    11.9            2
## 3 Utah        11.9            3
## 4 Texas       11.8            4
## 5 Nevada      11.7            5