Q1 - Import the data set using a Tidyverse function and NOT with a Base R function. How many rows and columns are in the data set?
acs_country_data <- read_csv("C:/Users/Anjali Shalimar/Documents/acs_2015_county_data_revised.csv")
dim(acs_country_data)## [1] 3142 35
There are 3142 rows and 35 columns in the dataset
Q2 - Do any data types need changed? Show any code to change variable types and show code/output for a command after you’re finished.
We look at the structure of the dataset
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3142 obs. of 35 variables:
## $ census_id : num 1001 1003 1005 1007 1009 ...
## $ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ county : chr "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ total_pop : num 55221 195121 26932 22604 57710 ...
## $ men : num 26745 95314 14497 12073 28512 ...
## $ women : num 28476 99807 12435 10531 29198 ...
## $ hispanic : num 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
## $ white : num 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
## $ black : num 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
## $ native : num 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
## $ asian : num 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
## $ pacific : num 0 0 0 0 0 0 0 0 0 0 ...
## $ citizen : num 40725 147695 20714 17495 42345 ...
## $ income : num 51281 50254 32964 38678 45813 ...
## $ income_per_cap: num 24974 27317 16824 18431 20532 ...
## $ poverty : num 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
## $ child_poverty : num 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
## $ professional : num 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
## $ service : num 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
## $ office : num 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
## $ construction : num 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
## $ production : num 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
## $ drive : num 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
## $ carpool : num 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
## $ transit : num 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
## $ walk : num 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
## $ other_transp : num 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
## $ work_at_home : num 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
## $ mean_commute : num 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
## $ employed : num 23986 85953 8597 8294 22189 ...
## $ private_work : num 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
## $ public_work : num 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
## $ self_employed : num 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
## $ family_work : num 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
## $ unemployment : num 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
## - attr(*, "spec")=
## .. cols(
## .. census_id = col_double(),
## .. state = col_character(),
## .. county = col_character(),
## .. total_pop = col_double(),
## .. men = col_double(),
## .. women = col_double(),
## .. hispanic = col_double(),
## .. white = col_double(),
## .. black = col_double(),
## .. native = col_double(),
## .. asian = col_double(),
## .. pacific = col_double(),
## .. citizen = col_double(),
## .. income = col_double(),
## .. income_per_cap = col_double(),
## .. poverty = col_double(),
## .. child_poverty = col_double(),
## .. professional = col_double(),
## .. service = col_double(),
## .. office = col_double(),
## .. construction = col_double(),
## .. production = col_double(),
## .. drive = col_double(),
## .. carpool = col_double(),
## .. transit = col_double(),
## .. walk = col_double(),
## .. other_transp = col_double(),
## .. work_at_home = col_double(),
## .. mean_commute = col_double(),
## .. employed = col_double(),
## .. private_work = col_double(),
## .. public_work = col_double(),
## .. self_employed = col_double(),
## .. family_work = col_double(),
## .. unemployment = col_double()
## .. )
We see that the state and country have datatype as chr. We have modified this to factor to fit perfectly with our future modelling of the dataset.
acs_country_data$state <- as.factor(acs_country_data$state)
acs_country_data$county <- as.factor(acs_country_data$county)Rest all datatypes look perfect. We do not require to rename any attributes as they clearly demonstrate the meaning of the dataset.
## Observations: 3,142
## Variables: 35
## $ census_id <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017...
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama...
## $ county <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bullock, Bu...
## $ total_pop <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354, 11...
## $ men <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 56274...
## $ women <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, 6037...
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5, 7....
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, 57.3...
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40.3, ...
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6, 0....
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3, 0....
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....
## $ citizen <dbl> 40725, 147695, 20714, 17495, 42345, 8057, 15581, 886...
## $ income <dbl> 51281, 50254, 32964, 38678, 45813, 31938, 32229, 417...
## $ income_per_cap <dbl> 24974, 27317, 16824, 18431, 20532, 17580, 18390, 213...
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, 21.6...
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, 37.2...
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, 23.3...
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, 14.5...
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, 26.3...
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 11.5,...
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, 24.4...
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, 85.1...
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11.9, 1...
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2, 0....
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6, 1....
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7, 1....
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5, 1....
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, 25.1...
## $ employed <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, ...
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, 85.1...
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, 12.1...
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9, 4....
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5, 0....
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9, 7.9...
As specified we have use the function glimpse(). Its a combination of both dim(),str() and head() functions. It gives us the dimensions of the dataset, the structure as well as the first few observations of the datasets.
Q3 - Are there any missing values? How will you handle missing values? Will you impute a missing value with, for example, a mean or median value for the entire column, or will you remove the entire observation? Give a rationale for your decision and show any code/output to handle missing values.
## [1] 2
There are 2 missing values in the entire dataset.
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 1 0
## poverty child_poverty professional service office
## 0 1 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
From the above observation, there is one missing value in income and 1 in child_poverty. We can remove this two observations as this won’t impact much moving forward with data modelling.
acs_country_data <- acs_country_data %>% filter(is.na(income)==FALSE)
acs_country_data <- acs_country_data %>% filter(is.na(child_poverty)==FALSE)## [1] 0
## census_id state county total_pop men
## 0 0 0 0 0
## women hispanic white black native
## 0 0 0 0 0
## asian pacific citizen income income_per_cap
## 0 0 0 0 0
## poverty child_poverty professional service office
## 0 0 0 0 0
## construction production drive carpool transit
## 0 0 0 0 0
## walk other_transp work_at_home mean_commute employed
## 0 0 0 0 0
## private_work public_work self_employed family_work unemployment
## 0 0 0 0 0
If we verify we can see that the missing values are been handled.
Q4 - Summary of the Dataset
The dataset definitely contains skewed numeric variables. Though we are not removing the outlier values in variables such as population, ‘women’ , ‘men’ etc. at this point as we would need further information to ensure these values are infact data descrepancies. Since numeric variables such proportion of a particular race in a county can vary widely, we have currently not removed the extreme values from the dataset .
## census_id state county total_pop
## Min. : 1001 Texas : 253 Washington: 31 Min. : 267
## 1st Qu.:18179 Georgia : 159 Jefferson : 26 1st Qu.: 11036
## Median :29176 Virginia: 133 Franklin : 25 Median : 25793
## Mean :30383 Kentucky: 120 Jackson : 24 Mean : 100801
## 3rd Qu.:45080 Missouri: 115 Lincoln : 24 3rd Qu.: 67620
## Max. :56045 Kansas : 105 Madison : 20 Max. :10038388
## (Other) :2255 (Other) :2990
## men women hispanic white
## Min. : 136 Min. : 131 Min. : 0.000 Min. : 0.90
## 1st Qu.: 5551 1st Qu.: 5488 1st Qu.: 1.900 1st Qu.:65.67
## Median : 12838 Median : 12916 Median : 3.700 Median :84.65
## Mean : 49597 Mean : 51204 Mean : 8.819 Mean :77.31
## 3rd Qu.: 33328 3rd Qu.: 34123 3rd Qu.: 9.000 3rd Qu.:93.33
## Max. :4945351 Max. :5093037 Max. :98.700 Max. :99.80
##
## black native asian pacific
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00000
## 1st Qu.: 0.600 1st Qu.: 0.100 1st Qu.: 0.200 1st Qu.: 0.00000
## Median : 2.100 Median : 0.300 Median : 0.500 Median : 0.00000
## Mean : 8.885 Mean : 1.763 Mean : 1.253 Mean : 0.07357
## 3rd Qu.:10.200 3rd Qu.: 0.600 3rd Qu.: 1.200 3rd Qu.: 0.00000
## Max. :85.900 Max. :92.100 Max. :41.600 Max. :11.10000
##
## citizen income income_per_cap poverty
## Min. : 199 Min. : 19328 Min. : 8292 Min. : 1.4
## 1st Qu.: 8276 1st Qu.: 38826 1st Qu.:20470 1st Qu.:12.0
## Median : 19454 Median : 45095 Median :23575 Median :16.0
## Mean : 70849 Mean : 46824 Mean :24331 Mean :16.7
## 3rd Qu.: 50795 3rd Qu.: 52248 3rd Qu.:27138 3rd Qu.:20.3
## Max. :6046749 Max. :123453 Max. :65600 Max. :53.3
##
## child_poverty professional service office
## Min. : 0.00 Min. :13.50 Min. : 5.00 Min. : 4.10
## 1st Qu.:16.10 1st Qu.:26.70 1st Qu.:15.90 1st Qu.:20.20
## Median :22.50 Median :30.00 Median :18.00 Median :22.40
## Mean :23.29 Mean :31.05 Mean :18.25 Mean :22.13
## 3rd Qu.:29.50 3rd Qu.:34.42 3rd Qu.:20.20 3rd Qu.:24.30
## Max. :72.30 Max. :74.00 Max. :36.60 Max. :35.40
##
## construction production drive carpool
## Min. : 1.70 Min. : 0.00 Min. : 5.2 Min. : 0.00
## 1st Qu.: 9.80 1st Qu.:11.50 1st Qu.:76.6 1st Qu.: 8.50
## Median :12.20 Median :15.40 Median :80.6 Median : 9.90
## Mean :12.75 Mean :15.82 Mean :79.1 Mean :10.33
## 3rd Qu.:15.00 3rd Qu.:19.40 3rd Qu.:83.6 3rd Qu.:11.90
## Max. :40.30 Max. :55.60 Max. :94.6 Max. :29.90
##
## transit walk other_transp work_at_home
## Min. : 0.0000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.1000 1st Qu.: 1.400 1st Qu.: 0.90 1st Qu.: 2.800
## Median : 0.4000 Median : 2.400 Median : 1.30 Median : 4.000
## Mean : 0.9681 Mean : 3.294 Mean : 1.61 Mean : 4.694
## 3rd Qu.: 0.8000 3rd Qu.: 4.000 3rd Qu.: 1.90 3rd Qu.: 5.700
## Max. :61.7000 Max. :71.200 Max. :39.10 Max. :37.200
##
## mean_commute employed private_work public_work
## Min. : 4.90 Min. : 166 Min. :29.50 Min. : 5.80
## 1st Qu.:19.30 1st Qu.: 4532 1st Qu.:70.90 1st Qu.:13.07
## Median :22.90 Median : 10657 Median :75.85 Median :16.10
## Mean :23.15 Mean : 46416 Mean :74.45 Mean :17.33
## 3rd Qu.:26.60 3rd Qu.: 29272 3rd Qu.:79.80 3rd Qu.:20.10
## Max. :44.00 Max. :4635465 Max. :88.30 Max. :66.20
##
## self_employed family_work unemployment
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 5.400 1st Qu.:0.1000 1st Qu.: 5.500
## Median : 6.900 Median :0.2000 Median : 7.500
## Mean : 7.922 Mean :0.2917 Mean : 7.815
## 3rd Qu.: 9.400 3rd Qu.:0.3000 3rd Qu.: 9.700
## Max. :36.600 Max. :9.8000 Max. :29.400
##
Q5 - How many counties have more women than men
For this question the analysis should count a combination of county and state, as multiple states would have the same county name.There are 1984 county-state combinations where number of women are greater than men.
## [1] 1984
Q6 - How many counties have an unemployment rate lower than 10%?
There are 2419 county-state combinations with unemployment rate lower than 10%.
## [1] 2419
Q7 - What are the top 10 counties with the highest mean commute?
## Selecting by mean_commute
## # A tibble: 10 x 4
## census_id state county mean_commute
## <dbl> <fct> <fct> <dbl>
## 1 42103 Pennsylvania Pike 44
## 2 36005 New York Bronx 43
## 3 24017 Maryland Charles 42.8
## 4 51187 Virginia Warren 42.7
## 5 36081 New York Queens 42.6
## 6 36085 New York Richmond 42.6
## 7 51193 Virginia Westmoreland 42.5
## 8 8093 Colorado Park 42.4
## 9 36047 New York Kings 41.7
## 10 54015 West Virginia Clay 41.4
Q8- Top ten counties with lowest percentage of women
acs_country_data$percent_women=(acs_country_data$women/acs_country_data$total_pop)*100
acs_women_df<-acs_country_data[order(acs_country_data$percent_women),]
acs_women_df<-select(acs_women_df,1,2,3,36)
head(acs_women_df,10)## # A tibble: 10 x 4
## census_id state county percent_women
## <dbl> <fct> <fct> <dbl>
## 1 42053 Pennsylvania Forest 26.8
## 2 8011 Colorado Bent 31.4
## 3 51183 Virginia Sussex 31.5
## 4 13309 Georgia Wheeler 32.1
## 5 6035 California Lassen 33.2
## 6 48095 Texas Concho 33.3
## 7 13053 Georgia Chattahoochee 33.4
## 8 2013 Alaska Aleutians East Borough 33.5
## 9 22125 Louisiana West Feliciana 33.6
## 10 32027 Nevada Pershing 33.7
Q9- Sum of Race Percentages
Part A: Top Ten lowest race percentage sums
acs_country_data$percent_race=rowSums(acs_country_data[, c(7:12)])
acs_race_df<-acs_country_data[order(acs_country_data$percent_race),]
acs_race_df<-select(acs_race_df,3,37)
head(acs_race_df,10)## # A tibble: 10 x 2
## county percent_race
## <fct> <dbl>
## 1 Hawaii 76.4
## 2 Maui 79.2
## 3 Mayes 79.7
## 4 Honolulu 81.5
## 5 Pontotoc 82.8
## 6 Grundy 83
## 7 Yakutat City and Borough 83.4
## 8 Johnston 84
## 9 Kauai 84.1
## 10 Alfalfa 85.1
The top ten countries with teh lowest racal percentage sums are Hawaii, Maui, Mayes, Honolulu, Pontotoc, Grundy, Yakutat City and Borough, Johnston, Kauai and Alfalfa.
Part B: Top ten states with lowest race percentage sums
acs_race_state_df=group_by(acs_country_data, state) %>% summarize(average_race_percent = mean(percent_race))
acs_race_state_df=acs_race_state_df[order(acs_race_state_df$average_race_percent),]
head(acs_race_state_df,1)## # A tibble: 1 x 2
## state average_race_percent
## <fct> <dbl>
## 1 Hawaii 80.3
Hawaii has the lowest average racial percentage sum at 84%
Part C: counties with sum greater than 100%
## [1] 5
5 Entries have a racial percentage sum above 100%
Question 10: Carpool Rank
Part A: Generating Carpool Rank
Part B: Top 10 highest ranked countries for carpooling
## # A tibble: 10 x 5
## census_id state county carpool carpool_rank
## <dbl> <fct> <fct> <dbl> <int>
## 1 13061 Georgia Clay 29.9 1
## 2 18087 Indiana LaGrange 27 2
## 3 13165 Georgia Jenkins 25.3 3
## 4 5133 Arkansas Sevier 24.4 4
## 5 20175 Kansas Seward 23.4 5
## 6 48079 Texas Cochran 22.8 6
## 7 48247 Texas Jim Hogg 22.6 7
## 8 48393 Texas Roberts 22.4 8
## 9 39075 Ohio Holmes 21.8 9
## 10 21197 Kentucky Powell 21.6 10
Top 10 highest ranked countries for carpooling are Clay, LaGrange, Jenkins, Sevier, Seward, Cochran, Jim Hogg, Roberts, Holmes, and Powell
Part C: Top 10 lowest ranked Countries for Carpooling
## # A tibble: 10 x 5
## census_id state county carpool carpool_rank
## <dbl> <fct> <fct> <dbl> <int>
## 1 51720 Virginia Norton city 2.8 3130
## 2 30019 Montana Daniels 2.6 3132
## 3 31057 Nebraska Dundy 2.6 3132
## 4 13309 Georgia Wheeler 2.3 3134
## 5 38029 North Dakota Emmons 2.3 3134
## 6 36061 New York New York 1.9 3136
## 7 31183 Nebraska Wheeler 1.3 3137
## 8 48235 Texas Irion 0.9 3138
## 9 48261 Texas Kenedy 0 3139
## 10 48269 Texas King 0 3139
Top 10 lowest ranked countries for carpooling are Norton City, Daniels, Dundy, Wheeler, Emmons, New York, Wheeler, Irion, Kendedy, and King
Part D: Top Ranked State for carpooling
acs_carpool_state=group_by(acs_country_data, state) %>% summarize(average_carpool_rate = mean(carpool))
acs_carpool_state=acs_carpool_state[order(-acs_carpool_state$average_carpool_rate),]
head(acs_carpool_state,1)## # A tibble: 1 x 2
## state average_carpool_rate
## <fct> <dbl>
## 1 Hawaii 12.8
Number one state for highest average rate of carpooling is Alaska
Part E: Top 5 ranked states for carpooling
## # A tibble: 5 x 2
## state average_carpool_rate
## <fct> <dbl>
## 1 Hawaii 12.8
## 2 Alaska 12.1
## 3 Arkansas 11.9
## 4 Utah 11.9
## 5 Texas 11.8
The top 5 states for carpooling on average are Alsaka, Arkansas, Utah, Texas, and Nevada