library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
To import the data set "acs_2015_county_data_revised.csv":
countydata <- read_csv("R/Week 4/homework3/acs_2015_county_data_revised.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## state = col_character(),
## county = col_character()
## )
## i Use `spec()` for the full column specifications.
The data has 3,142 rows and 35 columns. One initial observation is that some columns record raw counts (for example, total_pop, men, and women) while others were entered as percentages. This inconsistency may make the data harder to work with as a whole, so I am debating whether to convert the count columns to percentages for consistency.
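As a rough illustration of what that conversion could look like, the sketch below expresses count columns as a share of total population. It runs on a small made-up tibble (`toy`) rather than on countydata, and the `_pct` suffix is a naming choice assumed here, not something in the data set.

```r
library(dplyr)
library(tibble)

# Toy stand-in for countydata; all values are invented for illustration.
toy <- tibble(
  county    = c("A", "B"),
  total_pop = c(1000, 250),
  men       = c(480, 130),
  women     = c(520, 120)
)

# Convert the count columns to percentages of total population,
# writing the results to new columns so the original counts are kept.
toy_pct <- toy %>%
  mutate(across(c(men, women), ~ .x / total_pop * 100, .names = "{.col}_pct"))

toy_pct
```

Keeping the original counts alongside the new columns makes it easy to check the conversion before deciding whether to drop them.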
One of the first steps in cleaning the data was to identify NA values:
colSums(is.na(countydata))
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 1 0
## poverty child_poverty professional service office
## 0 1 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
As the output above shows, there are very few NA (missing) values in this data set. There appear to be only two: one in the ‘income’ column and one in the ‘child_poverty’ column. It seems reasonable to replace the missing income value with the mean income of the counties in the same state. To locate the missing value in each of these columns:
which(is.na(countydata$income))
## [1] 2674
which(is.na(countydata$child_poverty))
## [1] 549
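The two lookups above can also be done in a single call when there are more columns to check. The sketch below is one possible approach, shown on a made-up data frame (`toy`) rather than on countydata; it uses `which()` with `arr.ind = TRUE` to get the row and column of every missing cell.

```r
# Toy data frame with two missing cells; values are invented for illustration.
toy <- data.frame(
  county        = c("A", "B", "C"),
  income        = c(50000, NA, 42000),
  child_poverty = c(NA, 18.2, 25.1)
)

# is.na() on a data frame gives a logical matrix; arr.ind = TRUE makes
# which() return one (row, col) index pair per missing cell.
na_pos <- which(is.na(toy), arr.ind = TRUE)

# Translate the column indices into column names for readability.
na_tbl <- data.frame(
  row    = unname(na_pos[, "row"]),
  column = names(toy)[na_pos[, "col"]]
)

na_tbl
```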
To replace the missing income value with the state mean, I first need to identify the state of that particular county and then subset the data to that state to compute the mean.
countydata[2674, ]
## # A tibble: 1 x 35
## census_id state county total_pop men women hispanic white black native asian
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 48301 Texas Loving 117 74 43 35 41 0 12.8 0
## # ... with 24 more variables: pacific <dbl>, citizen <dbl>, income <dbl>,
## # income_per_cap <dbl>, poverty <dbl>, child_poverty <dbl>,
## # professional <dbl>, service <dbl>, office <dbl>, construction <dbl>,
## # production <dbl>, drive <dbl>, carpool <dbl>, transit <dbl>, walk <dbl>,
## # other_transp <dbl>, work_at_home <dbl>, mean_commute <dbl>, employed <dbl>,
## # private_work <dbl>, public_work <dbl>, self_employed <dbl>,
## # family_work <dbl>, unemployment <dbl>
This identifies Texas as the state of the county with the missing income value (Loving County), so I am going to subset the data to the counties within Texas and summarize that subset to get a mean value of income.
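A minimal sketch of that subset-and-summarize step, assuming a dplyr workflow, is given here on a made-up tibble (`toy`) so that it runs on its own. The names `texas_mean` and `toy_filled` are hypothetical, and the incomes are invented, not taken from countydata.

```r
library(dplyr)
library(tibble)

# Toy stand-in for countydata; all values are invented for illustration.
toy <- tibble(
  state  = c("Texas", "Texas", "Texas", "Ohio"),
  county = c("X", "Y", "Z", "W"),
  income = c(40000, 60000, NA, 55000)
)

# Mean income across Texas counties, skipping the missing value.
texas_mean <- toy %>%
  filter(state == "Texas") %>%
  summarise(mean_income = mean(income, na.rm = TRUE)) %>%
  pull(mean_income)

# Fill only the missing Texas income with the state mean;
# every other value is left unchanged.
toy_filled <- toy %>%
  mutate(income = if_else(is.na(income) & state == "Texas", texas_mean, income))

toy_filled
```

Without `na.rm = TRUE`, `mean()` would return NA for any state that contains a missing value, which is why the argument is needed here.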