Importing the tidyverse and reading the data into a data frame for the assignment.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
acsdata <- read.csv("acs_2015_county_data_revised.csv")
Counting the rows and columns:
acsdata %>%
  summarise(
    n_rows = n(),
    n_cols = ncol(.)
  )
## n_rows n_cols
## 1 3142 35
mean_commute is shown as numeric, but I only need a whole number of minutes for the mean; employed is shown as integer but needs to be numeric to keep it consistent with the rest of the data frame.
acsdata <- acsdata %>%
  mutate(
    employed = as.numeric(employed),
    mean_commute = as.integer(mean_commute)
  )
Now using glimpse to confirm the changed column types:
acsdata %>%
  select(mean_commute, employed) %>%
  glimpse()
## Rows: 3,142
## Columns: 2
## $ mean_commute <int> 26, 26, 24, 28, 34, 27, 24, 24, 25, 27, 32, 30, 23, 30, 3…
## $ employed <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, 13689…
Using summarise to check whether any NA values exist, then filter to find the exact rows:
acsdata %>%
  summarise(across(everything(), ~ sum(is.na(.))))
## census_id state county total_pop men women hispanic white black native asian
## 1 0 0 0 0 0 0 0 0 0 0 0
## pacific citizen income income_per_cap poverty child_poverty professional
## 1 0 0 1 0 0 1 0
## service office construction production drive carpool transit walk
## 1 0 0 0 0 0 0 0 0
## other_transp work_at_home mean_commute employed private_work public_work
## 1 0 0 0 0 0 0
## self_employed family_work unemployment
## 1 0 0 0
acsdata %>%
  filter(if_any(everything(), is.na))
## census_id state county total_pop men women hispanic white black native
## 1 15005 Hawaii Kalawao 85 42 43 4.7 37.6 0 0.0
## 2 48301 Texas Loving 117 74 43 35.0 41.0 0 12.8
## asian pacific citizen income income_per_cap poverty child_poverty
## 1 21.2 35.3 80 66250 46769 15.2 NA
## 2 0.0 0.0 98 NA 25877 28.9 0
## professional service office construction production drive carpool transit
## 1 31.3 23.4 21.9 4.7 18.8 34.3 7.5 0
## 2 22.6 22.6 32.3 4.8 17.7 64.5 11.3 0
## walk other_transp work_at_home mean_commute employed private_work public_work
## 1 40.3 14.9 3.0 9 64 25 64.1
## 2 8.1 0.0 16.1 25 62 71 27.4
## self_employed family_work unemployment
## 1 10.9 0 0.0
## 2 1.6 0 15.1
I’ve decided to leave the NAs alone. I do not think the mean or median functions would derive a good substitute for these NAs, and substituting 0 would skew my data with an outlier. Also, there are not enough NAs (one each in income and child_poverty) to materially affect my data.
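For comparison, this is a minimal sketch of what median imputation of those two cells would have looked like; it is not applied here, and the as.numeric wrapper is only a precaution against an integer/double type mismatch in replace_na.
acsdata %>%
  mutate(
    # replace the single NA in each column with that column's median
    income = replace_na(as.numeric(income), median(income, na.rm = TRUE)),
    child_poverty = replace_na(child_poverty, median(child_poverty, na.rm = TRUE))
  )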
summary(acsdata)
## census_id state county total_pop
## Min. : 1001 Length:3142 Length:3142 Min. : 85
## 1st Qu.:18178 Class :character Class :character 1st Qu.: 11028
## Median :29176 Mode :character Mode :character Median : 25768
## Mean :30384 Mean : 100737
## 3rd Qu.:45081 3rd Qu.: 67552
## Max. :56045 Max. :10038388
##
## men women hispanic white
## Min. : 42 Min. : 43 Min. : 0.000 Min. : 0.90
## 1st Qu.: 5546 1st Qu.: 5466 1st Qu.: 1.900 1st Qu.:65.60
## Median : 12826 Median : 12907 Median : 3.700 Median :84.60
## Mean : 49565 Mean : 51172 Mean : 8.826 Mean :77.28
## 3rd Qu.: 33319 3rd Qu.: 34122 3rd Qu.: 9.000 3rd Qu.:93.30
## Max. :4945351 Max. :5093037 Max. :98.700 Max. :99.80
##
## black native asian pacific
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00000
## 1st Qu.: 0.600 1st Qu.: 0.100 1st Qu.: 0.200 1st Qu.: 0.00000
## Median : 2.100 Median : 0.300 Median : 0.500 Median : 0.00000
## Mean : 8.879 Mean : 1.766 Mean : 1.258 Mean : 0.08475
## 3rd Qu.:10.175 3rd Qu.: 0.600 3rd Qu.: 1.200 3rd Qu.: 0.00000
## Max. :85.900 Max. :92.100 Max. :41.600 Max. :35.30000
##
## citizen income income_per_cap poverty
## Min. : 80 Min. : 19328 Min. : 8292 Min. : 1.4
## 1st Qu.: 8254 1st Qu.: 38826 1st Qu.:20471 1st Qu.:12.0
## Median : 19434 Median : 45111 Median :23577 Median :16.0
## Mean : 70804 Mean : 46830 Mean :24338 Mean :16.7
## 3rd Qu.: 50728 3rd Qu.: 52250 3rd Qu.:27138 3rd Qu.:20.3
## Max. :6046749 Max. :123453 Max. :65600 Max. :53.3
## NA's :1
## child_poverty professional service office
## Min. : 0.00 Min. :13.50 Min. : 5.00 Min. : 4.10
## 1st Qu.:16.10 1st Qu.:26.70 1st Qu.:15.90 1st Qu.:20.20
## Median :22.50 Median :30.00 Median :18.00 Median :22.40
## Mean :23.29 Mean :31.04 Mean :18.26 Mean :22.13
## 3rd Qu.:29.50 3rd Qu.:34.40 3rd Qu.:20.20 3rd Qu.:24.30
## Max. :72.30 Max. :74.00 Max. :36.60 Max. :35.40
## NA's :1
## construction production drive carpool
## Min. : 1.70 Min. : 0.00 Min. : 5.20 Min. : 0.00
## 1st Qu.: 9.80 1st Qu.:11.53 1st Qu.:76.60 1st Qu.: 8.50
## Median :12.20 Median :15.40 Median :80.60 Median : 9.90
## Mean :12.74 Mean :15.82 Mean :79.08 Mean :10.33
## 3rd Qu.:15.00 3rd Qu.:19.40 3rd Qu.:83.60 3rd Qu.:11.88
## Max. :40.30 Max. :55.60 Max. :94.60 Max. :29.90
##
## transit walk other_transp work_at_home
## Min. : 0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.1000 1st Qu.: 1.400 1st Qu.: 0.900 1st Qu.: 2.800
## Median : 0.4000 Median : 2.400 Median : 1.300 Median : 4.000
## Mean : 0.9675 Mean : 3.307 Mean : 1.614 Mean : 4.697
## 3rd Qu.: 0.8000 3rd Qu.: 4.000 3rd Qu.: 1.900 3rd Qu.: 5.700
## Max. :61.7000 Max. :71.200 Max. :39.100 Max. :37.200
##
## mean_commute employed private_work public_work
## Min. : 4.0 Min. : 62 Min. :25.00 Min. : 5.80
## 1st Qu.:19.0 1st Qu.: 4524 1st Qu.:70.90 1st Qu.:13.10
## Median :22.0 Median : 10644 Median :75.80 Median :16.10
## Mean :22.7 Mean : 46387 Mean :74.44 Mean :17.35
## 3rd Qu.:26.0 3rd Qu.: 29254 3rd Qu.:79.80 3rd Qu.:20.10
## Max. :44.0 Max. :4635465 Max. :88.30 Max. :66.20
##
## self_employed family_work unemployment
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 5.400 1st Qu.:0.1000 1st Qu.: 5.500
## Median : 6.900 Median :0.2000 Median : 7.500
## Mean : 7.921 Mean :0.2915 Mean : 7.815
## 3rd Qu.: 9.400 3rd Qu.:0.3000 3rd Qu.: 9.700
## Max. :36.600 Max. :9.8000 Max. :29.400
##
It is interesting that there are 0 percent values for some commuting options; for example, in some counties 0 percent of the population walks to work or uses public transportation. The same is true for some forms of work, like production. I do not know enough about how the data were collected to decide whether a 0 here is really an NA in disguise.
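As a quick sanity check on how widespread these zeros are, a sketch like the following would count the exact-zero observations in the columns mentioned above (not run here):
acsdata %>%
  # count counties reporting exactly 0 percent in each column
  summarise(across(c(transit, walk, production),
                   ~ sum(. == 0, na.rm = TRUE)))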
Below we can see how many counties have more women than men (1,985), more men than women (1,154), and equal numbers of men and women (3).
acsdata %>%
  summarise(
    men_greater = sum(men > women, na.rm = TRUE),
    women_greater = sum(women > men, na.rm = TRUE),
    equal = sum(men == women, na.rm = TRUE)
  )
## men_greater women_greater equal
## 1 1154 1985 3
Use the summarise function to create three summary columns, men_greater, women_greater, and equal, each using the sum function to count the rows where its namesake condition is true.
acsdata %>%
  filter(unemployment < 10) %>%
  summarise(count = n())
## count
## 1 2420
Filter the unemployment column for observations less than 10, then count how many remain.
acsdata %>%
  top_n(n = 10, wt = mean_commute) %>%
  arrange(desc(mean_commute)) %>%
  select(census_id, county, state, mean_commute)
## census_id county state mean_commute
## 1 42103 Pike Pennsylvania 44
## 2 36005 Bronx New York 43
## 3 8093 Park Colorado 42
## 4 24017 Charles Maryland 42
## 5 36081 Queens New York 42
## 6 36085 Richmond New York 42
## 7 51187 Warren Virginia 42
## 8 51193 Westmoreland Virginia 42
## 9 17013 Calhoun Illinois 41
## 10 28041 Greene Mississippi 41
## 11 36047 Kings New York 41
## 12 54015 Clay West Virginia 41
Retrieve the top 10 results in the mean_commute column, arrange them from highest to lowest, then select the columns that were asked for. Because top_n keeps ties, 12 rows are returned.
acsdata %>%
  mutate(women_percentage = women / total_pop) %>%
  slice_min(order_by = women_percentage, n = 10) %>%
  select(census_id, county, state, women_percentage)
## census_id county state women_percentage
## 1 42053 Forest Pennsylvania 0.2677747
## 2 8011 Bent Colorado 0.3136556
## 3 51183 Sussex Virginia 0.3147336
## 4 13309 Wheeler Georgia 0.3210156
## 5 6035 Lassen California 0.3316588
## 6 48095 Concho Texas 0.3328439
## 7 13053 Chattahoochee Georgia 0.3335572
## 8 2013 Aleutians East Borough Alaska 0.3347458
## 9 22125 West Feliciana Louisiana 0.3364904
## 10 32027 Pershing Nevada 0.3372508
Use the mutate function to divide women by total_pop, creating the women_percentage variable for each county. Then use the slice_min function to find the lowest 10 percentages, and the select function to show the desired columns.
acsdata <- acsdata %>%
  mutate(race_percentage = hispanic + white + black + native + asian + pacific)
Creating the race_percentage variable.
acsdata %>%
  slice_max(order_by = race_percentage, n = 10) %>%
  select(county)
## county
## 1 Gosper
## 2 Hooker
## 3 Bailey
## 4 Edwards
## 5 Nance
## 6 Claiborne
## 7 Duval
## 8 Kenedy
## 9 Kent
## 10 Presidio
## 11 Beaver
Returning the top 10 counties by race_percentage; slice_max keeps ties, which is why 11 counties appear.
acsdata %>%
  group_by(state) %>%
  summarise(avg_race_percentage = mean(race_percentage, na.rm = TRUE)) %>%
  arrange(avg_race_percentage) %>%
  slice_head(n = 10)
## # A tibble: 10 × 2
## state avg_race_percentage
## <chr> <dbl>
## 1 Hawaii 84
## 2 Alaska 92.7
## 3 Oklahoma 92.8
## 4 Washington 96.7
## 5 California 96.9
## 6 Oregon 97.1
## 7 Delaware 97.3
## 8 Massachusetts 97.5
## 9 Maryland 97.6
## 10 District of Columbia 97.6
Grouping observations by state, then using the mean function to compute an average race_percentage for each state, arranging them lowest to highest, and returning the first 10.
acsdata %>%
  filter(race_percentage > 100) %>%
  select(county)
## county
## 1 Claiborne
## 2 Gosper
## 3 Hooker
## 4 Nance
## 5 Bailey
## 6 Duval
## 7 Edwards
## 8 Kenedy
## 9 Kent
## 10 Presidio
## 11 Beaver
Show counties with percentages higher than 100% by using the filter function on the race_percentage column, then showing only the county column.
acsdata %>%
  group_by(state) %>%
  summarise(total_race_percentage = sum(race_percentage, na.rm = TRUE)) %>%
  filter(total_race_percentage == 100)
## # A tibble: 0 × 2
## # ℹ 2 variables: state <chr>, total_race_percentage <dbl>
Show states whose total is exactly 100 by using summarise to sum race_percentage per state (na.rm = TRUE drops any NAs) and then filtering for totals that equal exactly 100. THERE ARE NONE.
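Since the grouped sum adds race_percentage across every county in a state, it could only equal 100 for a one-county state whose single county sums to exactly 100, so a per-county check is the more direct test. A sketch, using near() to sidestep floating-point equality:
acsdata %>%
  # counties whose race percentages sum to exactly 100
  filter(near(race_percentage, 100)) %>%
  select(census_id, county, state, race_percentage)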
acsdata <- acsdata %>%
  mutate(carpool_rank = min_rank(desc(carpool)))
Using mutate to create a new variable in acsdata. Since min_rank ranks lowest to highest, the carpool column is wrapped in desc so that the highest carpool percentage receives rank 1.
acsdata %>%
  filter(carpool_rank <= 10) %>%
  select(census_id, county, state, carpool, carpool_rank) %>%
  arrange(carpool_rank)
## census_id county state carpool carpool_rank
## 1 13061 Clay Georgia 29.9 1
## 2 18087 LaGrange Indiana 27.0 2
## 3 13165 Jenkins Georgia 25.3 3
## 4 5133 Sevier Arkansas 24.4 4
## 5 20175 Seward Kansas 23.4 5
## 6 48079 Cochran Texas 22.8 6
## 7 48247 Jim Hogg Texas 22.6 7
## 8 48393 Roberts Texas 22.4 8
## 9 39075 Holmes Ohio 21.8 9
## 10 21197 Powell Kentucky 21.6 10
Used the filter function to show ranks less than or equal to 10, since 1 is the highest rank. Then used the select function to show the desired columns, and the arrange function to order by carpool_rank.
acsdata %>%
  arrange(desc(carpool_rank)) %>%
  slice_head(n = 10) %>%
  select(census_id, county, state, carpool, carpool_rank)
## census_id county state carpool carpool_rank
## 1 48261 Kenedy Texas 0.0 3141
## 2 48269 King Texas 0.0 3141
## 3 48235 Irion Texas 0.9 3140
## 4 31183 Wheeler Nebraska 1.3 3139
## 5 36061 New York New York 1.9 3138
## 6 13309 Wheeler Georgia 2.3 3136
## 7 38029 Emmons North Dakota 2.3 3136
## 8 30019 Daniels Montana 2.6 3134
## 9 31057 Dundy Nebraska 2.6 3134
## 10 46069 Hyde South Dakota 2.8 3132
Used the arrange function with desc to invert the order so the lowest-ranked counties come first. slice_head shows the top 10 rows in that order, and select shows only the desired columns.
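The repeated ranks in this output (3141 and 3136 each appear twice, while 3142 and 3137 never appear) are just how min_rank handles ties; a tiny demonstration:
# tied values share the minimum of their positions, and the ranks
# they would have occupied are skipped: this returns 1 2 2 4
min_rank(desc(c(5, 3, 3, 1)))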
acsdata %>%
  group_by(state) %>%
  summarise(avg_carpool_rank = mean(carpool_rank, na.rm = TRUE)) %>%
  arrange(avg_carpool_rank) %>%
  slice_head(n = 1)
## # A tibble: 1 × 2
## state avg_carpool_rank
## <chr> <dbl>
## 1 Arizona 971.
Grouping observations by state, then using the summarise function to average carpool_rank for every state, then the arrange function to put the lowest average rank first (which means the best ranking), with slice_head set to 1 to show only the top state.
acsdata %>%
  group_by(state) %>%
  summarise(avg_carpool_rank = mean(carpool_rank, na.rm = TRUE)) %>%
  arrange(avg_carpool_rank) %>%
  slice_head(n = 5)
## # A tibble: 5 × 2
## state avg_carpool_rank
## <chr> <dbl>
## 1 Arizona 971.
## 2 Utah 1019.
## 3 Arkansas 1055.
## 4 Hawaii 1072.
## 5 Alaska 1087.
Same as above, but slice_head now shows the top 5 states instead of only the first.