library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)

To import the data set "acs_2015_county_data_revised.csv":

countydata <- read_csv("R/Week 4/homework3/acs_2015_county_data_revised.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   state = col_character(),
##   county = col_character()
## )
## i Use `spec()` for the full column specifications.

Question 1

The data has 3,142 rows and 35 columns. My first observation is that some columns record raw counts while others record percentages. This mix may make the data harder to work with as a whole, so I am considering converting the count columns to percentages for consistency.
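As a rough sketch of what that conversion could look like, the snippet below checks the dimensions of a data frame and rescales count columns to percentages of the total population. It runs on a small toy tibble with invented values rather than on countydata itself, and it assumes that `men` and `women` are counts that sum to `total_pop`. The `_pct` column names are hypothetical.

```r
library(dplyr)

# Toy stand-in for countydata; every value here is invented for illustration.
toy <- tibble(
  county    = c("A", "B"),
  total_pop = c(1000, 200),
  men       = c(480, 110),   # assumed to be a count
  women     = c(520, 90),    # assumed to be a count
  hispanic  = c(12.5, 30.0)  # treated as already a percentage (an assumption)
)

# Number of rows, then number of columns.
dim(toy)

# Express the count columns as percentages of total_pop.
toy_pct <- toy %>%
  mutate(
    men_pct   = 100 * men / total_pop,
    women_pct = 100 * women / total_pop
  )

toy_pct
```

Writing the percentages to new columns, instead of overwriting the counts, keeps the original values available in case they are needed later.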

One of the first steps in cleaning the data was to identify NA values:

colSums(is.na(countydata))
##      census_id          state         county      total_pop            men 
##              0              0              0              0              0 
##          women       hispanic          white          black         native 
##              0              0              0              0              0 
##          asian        pacific        citizen         income income_per_cap 
##              0              0              0              1              0 
##        poverty  child_poverty   professional        service         office 
##              0              1              0              0              0 
##   construction     production          drive        carpool        transit 
##              0              0              0              0              0 
##           walk   other_transp   work_at_home   mean_commute       employed 
##              0              0              0              0              0 
##   private_work    public_work  self_employed    family_work   unemployment 
##              0              0              0              0              0

As the output above shows, there are very few NA (missing) values in this data set: only two, one in the income column and one in the child_poverty column. A reasonable fix for the income value is to replace it with the mean income of the counties in the same state. To locate the missing values in each of these columns:

which(is.na(countydata$income))
## [1] 2674
which(is.na(countydata$child_poverty))
## [1] 549
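The two `which()` calls can also be combined into one step. The sketch below uses base R's `complete.cases()` to pull out every row that has at least one missing value and keeps only the columns needed to identify it. It runs on a toy data frame with invented values, not on countydata.

```r
# Toy stand-in for countydata; every value here is invented for illustration.
toy <- data.frame(
  state         = c("Texas", "Texas", "Ohio"),
  county        = c("A", "B", "C"),
  income        = c(50000, NA, 60000),
  child_poverty = c(20.1, 15.0, NA),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE for rows without any NA, so negating it
# selects the rows that contain at least one missing value.
incomplete <- toy[!complete.cases(toy),
                  c("state", "county", "income", "child_poverty")]

incomplete
```

This shows the state and county of each incomplete row directly, so no separate lookup by row index is needed.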

To change the income value to the mean, I first need to identify the state of that particular county and then subset the data to compute the mean for that state.

countydata[2674, ]
## # A tibble: 1 x 35
##   census_id state county total_pop   men women hispanic white black native asian
##       <dbl> <chr> <chr>      <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1     48301 Texas Loving       117    74    43       35    41     0   12.8     0
## # ... with 24 more variables: pacific <dbl>, citizen <dbl>, income <dbl>,
## #   income_per_cap <dbl>, poverty <dbl>, child_poverty <dbl>,
## #   professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## #   production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## #   other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, employed <dbl>,
## #   private_work <dbl>, public_work <dbl>, self_employed <dbl>,
## #   family_work <dbl>, unemployment <dbl>

This identifies Texas as the state of the county with the missing income (Loving County), so I am going to subset all of the counties within Texas and summarize that subset to get the mean income.
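As a compact illustration of this idea, the sketch below first computes the mean income for one state and then fills each missing income with the mean of its own state in a single grouped step. It runs on a toy tibble with invented values. The grouped version is an alternative to subsetting by hand, not necessarily the approach used in this homework.

```r
library(dplyr)

# Toy stand-in for countydata; every value here is invented for illustration.
toy <- tibble(
  state  = c("Texas", "Texas", "Texas", "Ohio", "Ohio"),
  county = c("A", "B", "C", "D", "E"),
  income = c(40000, NA, 60000, 50000, 52000)
)

# Subset one state and summarize its mean income, ignoring the NA.
texas_mean <- toy %>%
  filter(state == "Texas") %>%
  summarise(mean_income = mean(income, na.rm = TRUE))

texas_mean

# Alternative: replace every missing income with the mean of its own state.
toy_filled <- toy %>%
  group_by(state) %>%
  mutate(income = if_else(is.na(income),
                          mean(income, na.rm = TRUE),
                          income)) %>%
  ungroup()

toy_filled
```

The `na.rm = TRUE` argument matters here: without it, the mean of a state that contains a missing value would itself be NA.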