Importing the tidyverse and reading the data into a data frame for the assignment.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
acsdata <- read.csv("acs_2015_county_data_revised.csv")
Counting the rows and columns:
acsdata %>%
  summarise(
    n_rows = n(),
    n_cols = ncol(.)
  )
## n_rows n_cols
## 1 3142 35
mean_commute is shown as numeric, but I only need a whole number of minutes for the mean; employed is shown as integer but needs to be numeric to keep it consistent with the rest of the data frame.
acsdata <- acsdata %>%
  mutate(
    employed = as.numeric(employed),
    mean_commute = as.integer(mean_commute)
  )
Now using glimpse to confirm the changed column types:
acsdata %>%
  select(mean_commute, employed) %>%
  glimpse()
## Rows: 3,142
## Columns: 2
## $ mean_commute <int> 26, 26, 24, 28, 34, 27, 24, 24, 25, 27, 32, 30, 23, 30, 3…
## $ employed <dbl> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 47401, 13689…
Using summarise to check whether any NA values exist, then filter to find the exact rows:
acsdata %>%
  summarise(across(everything(), ~ sum(is.na(.))))
## census_id state county total_pop men women hispanic white black native asian
## 1 0 0 0 0 0 0 0 0 0 0 0
## pacific citizen income income_per_cap poverty child_poverty professional
## 1 0 0 1 0 0 1 0
## service office construction production drive carpool transit walk
## 1 0 0 0 0 0 0 0 0
## other_transp work_at_home mean_commute employed private_work public_work
## 1 0 0 0 0 0 0
## self_employed family_work unemployment
## 1 0 0 0
acsdata %>%
  filter(if_any(everything(), is.na))
## census_id state county total_pop men women hispanic white black native
## 1 15005 Hawaii Kalawao 85 42 43 4.7 37.6 0 0.0
## 2 48301 Texas Loving 117 74 43 35.0 41.0 0 12.8
## asian pacific citizen income income_per_cap poverty child_poverty
## 1 21.2 35.3 80 66250 46769 15.2 NA
## 2 0.0 0.0 98 NA 25877 28.9 0
## professional service office construction production drive carpool transit
## 1 31.3 23.4 21.9 4.7 18.8 34.3 7.5 0
## 2 22.6 22.6 32.3 4.8 17.7 64.5 11.3 0
## walk other_transp work_at_home mean_commute employed private_work public_work
## 1 40.3 14.9 3.0 9 64 25 64.1
## 2 8.1 0.0 16.1 25 62 71 27.4
## self_employed family_work unemployment
## 1 10.9 0 0.0
## 2 1.6 0 15.1
I’ve decided to leave the NAs alone. I do not think the mean or median functions would derive a good substitute for these NAs, and substituting 0 would skew my data with an outlier. Also, there are not enough NAs (one each in income and child_poverty) to materially affect my data.
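For comparison, this is a minimal sketch of what median imputation of those two cells would have looked like; it is not applied here, and the as.numeric wrapper is only a precaution against an integer/double type mismatch in replace_na.
acsdata %>%
  mutate(
    # replace the single NA in each column with that column's median
    income = replace_na(as.numeric(income), median(income, na.rm = TRUE)),
    child_poverty = replace_na(child_poverty, median(child_poverty, na.rm = TRUE))
  )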
summary(acsdata)
## census_id state county total_pop
## Min. : 1001 Length:3142 Length:3142 Min. : 85
## 1st Qu.:18178 Class :character Class :character 1st Qu.: 11028
## Median :29176 Mode :character Mode :character Median : 25768
## Mean :30384 Mean : 100737
## 3rd Qu.:45081 3rd Qu.: 67552
## Max. :56045 Max. :10038388
##
## men women hispanic white
## Min. : 42 Min. : 43 Min. : 0.000 Min. : 0.90
## 1st Qu.: 5546 1st Qu.: 5466 1st Qu.: 1.900 1st Qu.:65.60
## Median : 12826 Median : 12907 Median : 3.700 Median :84.60
## Mean : 49565 Mean : 51172 Mean : 8.826 Mean :77.28
## 3rd Qu.: 33319 3rd Qu.: 34122 3rd Qu.: 9.000 3rd Qu.:93.30
## Max. :4945351 Max. :5093037 Max. :98.700 Max. :99.80
##
## black native asian pacific
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00000
## 1st Qu.: 0.600 1st Qu.: 0.100 1st Qu.: 0.200 1st Qu.: 0.00000
## Median : 2.100 Median : 0.300 Median : 0.500 Median : 0.00000
## Mean : 8.879 Mean : 1.766 Mean : 1.258 Mean : 0.08475
## 3rd Qu.:10.175 3rd Qu.: 0.600 3rd Qu.: 1.200 3rd Qu.: 0.00000
## Max. :85.900 Max. :92.100 Max. :41.600 Max. :35.30000
##
## citizen income income_per_cap poverty
## Min. : 80 Min. : 19328 Min. : 8292 Min. : 1.4
## 1st Qu.: 8254 1st Qu.: 38826 1st Qu.:20471 1st Qu.:12.0
## Median : 19434 Median : 45111 Median :23577 Median :16.0
## Mean : 70804 Mean : 46830 Mean :24338 Mean :16.7
## 3rd Qu.: 50728 3rd Qu.: 52250 3rd Qu.:27138 3rd Qu.:20.3
## Max. :6046749 Max. :123453 Max. :65600 Max. :53.3
## NA's :1
## child_poverty professional service office
## Min. : 0.00 Min. :13.50 Min. : 5.00 Min. : 4.10
## 1st Qu.:16.10 1st Qu.:26.70 1st Qu.:15.90 1st Qu.:20.20
## Median :22.50 Median :30.00 Median :18.00 Median :22.40
## Mean :23.29 Mean :31.04 Mean :18.26 Mean :22.13
## 3rd Qu.:29.50 3rd Qu.:34.40 3rd Qu.:20.20 3rd Qu.:24.30
## Max. :72.30 Max. :74.00 Max. :36.60 Max. :35.40
## NA's :1
## construction production drive carpool
## Min. : 1.70 Min. : 0.00 Min. : 5.20 Min. : 0.00
## 1st Qu.: 9.80 1st Qu.:11.53 1st Qu.:76.60 1st Qu.: 8.50
## Median :12.20 Median :15.40 Median :80.60 Median : 9.90
## Mean :12.74 Mean :15.82 Mean :79.08 Mean :10.33
## 3rd Qu.:15.00 3rd Qu.:19.40 3rd Qu.:83.60 3rd Qu.:11.88
## Max. :40.30 Max. :55.60 Max. :94.60 Max. :29.90
##
## transit walk other_transp work_at_home
## Min. : 0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.1000 1st Qu.: 1.400 1st Qu.: 0.900 1st Qu.: 2.800
## Median : 0.4000 Median : 2.400 Median : 1.300 Median : 4.000
## Mean : 0.9675 Mean : 3.307 Mean : 1.614 Mean : 4.697
## 3rd Qu.: 0.8000 3rd Qu.: 4.000 3rd Qu.: 1.900 3rd Qu.: 5.700
## Max. :61.7000 Max. :71.200 Max. :39.100 Max. :37.200
##
## mean_commute employed private_work public_work
## Min. : 4.0 Min. : 62 Min. :25.00 Min. : 5.80
## 1st Qu.:19.0 1st Qu.: 4524 1st Qu.:70.90 1st Qu.:13.10
## Median :22.0 Median : 10644 Median :75.80 Median :16.10
## Mean :22.7 Mean : 46387 Mean :74.44 Mean :17.35
## 3rd Qu.:26.0 3rd Qu.: 29254 3rd Qu.:79.80 3rd Qu.:20.10
## Max. :44.0 Max. :4635465 Max. :88.30 Max. :66.20
##
## self_employed family_work unemployment
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 5.400 1st Qu.:0.1000 1st Qu.: 5.500
## Median : 6.900 Median :0.2000 Median : 7.500
## Mean : 7.921 Mean :0.2915 Mean : 7.815
## 3rd Qu.: 9.400 3rd Qu.:0.3000 3rd Qu.: 9.700
## Max. :36.600 Max. :9.8000 Max. :29.400
##
It is interesting that there are 0 percent values for some commuting options; for example, in some counties 0 percent of the population walks to work or uses public transportation. The same is true for some forms of work, like production. I do not know enough about how the data were collected to decide whether a 0 here is really an NA in disguise.
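As a quick sanity check on how widespread these zeros are, a sketch like the following would count the exact-zero observations in the columns mentioned above (not run here):
acsdata %>%
  # count counties reporting exactly 0 percent in each column
  summarise(across(c(transit, walk, production),
                   ~ sum(. == 0, na.rm = TRUE)))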
Below we can see how many counties have more women than men (1,985), more men than women (1,154), and equal numbers of men and women (3).
acsdata %>%
  summarise(
    men_greater = sum(men > women, na.rm = TRUE),
    women_greater = sum(women > men, na.rm = TRUE),
    equal = sum(men == women, na.rm = TRUE)
  )
## men_greater women_greater equal
## 1 1154 1985 3
Use the summarise function to create three summary columns, men_greater, women_greater, and equal, each using the sum function to count the rows where its namesake condition is true.
acsdata %>%
  filter(unemployment < 10) %>%
  summarise(count = n())
## count
## 1 2420
Filter the unemployment column for observations less than 10, then count how many remain.
acsdata %>%
  top_n(n = 10, wt = mean_commute) %>%
  arrange(desc(mean_commute)) %>%
  select(census_id, county, state, mean_commute)
## census_id county state mean_commute
## 1 42103 Pike Pennsylvania 44
## 2 36005 Bronx New York 43
## 3 8093 Park Colorado 42
## 4 24017 Charles Maryland 42
## 5 36081 Queens New York 42
## 6 36085 Richmond New York 42
## 7 51187 Warren Virginia 42
## 8 51193 Westmoreland Virginia 42
## 9 17013 Calhoun Illinois 41
## 10 28041 Greene Mississippi 41
## 11 36047 Kings New York 41
## 12 54015 Clay West Virginia 41
Retrieve the top 10 results in the mean_commute column, arrange them from highest to lowest, then select the columns that were asked for. Because top_n keeps ties, 12 rows are returned.
acsdata %>%
  mutate(women_percentage = women / total_pop) %>%
  slice_min(order_by = women_percentage, n = 10) %>%
  select(census_id, county, state, women_percentage)
## census_id county state women_percentage
## 1 42053 Forest Pennsylvania 0.2677747
## 2 8011 Bent Colorado 0.3136556
## 3 51183 Sussex Virginia 0.3147336
## 4 13309 Wheeler Georgia 0.3210156
## 5 6035 Lassen California 0.3316588
## 6 48095 Concho Texas 0.3328439
## 7 13053 Chattahoochee Georgia 0.3335572
## 8 2013 Aleutians East Borough Alaska 0.3347458
## 9 22125 West Feliciana Louisiana 0.3364904
## 10 32027 Pershing Nevada 0.3372508
Use the mutate function to divide women by total_pop, creating the women_percentage variable for each county. Then use the slice_min function to find the lowest 10 percentages, and the select function to show the desired columns.
acsdata <- acsdata %>%
  mutate(race_percentage = hispanic + white + black + native + asian + pacific)
Creating the race_percentage variable.
acsdata %>%
  slice_max(order_by = race_percentage, n = 10) %>%
  select(county)
## county
## 1 Gosper
## 2 Hooker
## 3 Bailey
## 4 Edwards
## 5 Nance
## 6 Claiborne
## 7 Duval
## 8 Kenedy
## 9 Kent
## 10 Presidio
## 11 Beaver
Returning the top 10 counties by race_percentage; slice_max keeps ties, which is why 11 counties appear.
acsdata %>%
  group_by(state) %>%
  summarise(avg_race_percentage = mean(race_percentage, na.rm = TRUE)) %>%
  arrange(avg_race_percentage) %>%
  slice_head(n = 10)
## # A tibble: 10 × 2
## state avg_race_percentage
## <chr> <dbl>
## 1 Hawaii 84
## 2 Alaska 92.7
## 3 Oklahoma 92.8
## 4 Washington 96.7
## 5 California 96.9
## 6 Oregon 97.1
## 7 Delaware 97.3
## 8 Massachusetts 97.5
## 9 Maryland 97.6
## 10 District of Columbia 97.6
Grouping observations by state, then using the mean function to compute an average race_percentage for each state, arranging them lowest to highest, and returning the first 10.
acsdata %>%
  filter(race_percentage > 100) %>%
  select(county)
## county
## 1 Claiborne
## 2 Gosper
## 3 Hooker
## 4 Nance
## 5 Bailey
## 6 Duval
## 7 Edwards
## 8 Kenedy
## 9 Kent
## 10 Presidio
## 11 Beaver
Show counties with percentages higher than 100% by using the filter function on the race_percentage column, then showing only the county column.
acsdata %>%
  group_by(state) %>%
  summarise(total_race_percentage = sum(race_percentage, na.rm = TRUE)) %>%
  filter(total_race_percentage == 100)
## # A tibble: 0 × 2
## # ℹ 2 variables: state <chr>, total_race_percentage <dbl>
Show states whose total is exactly 100 by using summarise to sum race_percentage per state (na.rm = TRUE drops any NAs) and then filtering for totals that equal exactly 100. THERE ARE NONE.
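Since the grouped sum adds race_percentage across every county in a state, it could only equal 100 for a one-county state whose single county sums to exactly 100, so a per-county check is the more direct test. A sketch, using near() to sidestep floating-point equality:
acsdata %>%
  # counties whose race percentages sum to exactly 100
  filter(near(race_percentage, 100)) %>%
  select(census_id, county, state, race_percentage)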
acsdata <- acsdata %>%
  mutate(carpool_rank = min_rank(desc(carpool)))
Using mutate to create a new variable in acsdata. Since min_rank ranks lowest to highest, the carpool column is wrapped in desc so that the highest carpool percentage receives rank 1.
acsdata %>%
  filter(carpool_rank <= 10) %>%
  select(census_id, county, state, carpool, carpool_rank) %>%
  arrange(carpool_rank)
## census_id county state carpool carpool_rank
## 1 13061 Clay Georgia 29.9 1
## 2 18087 LaGrange Indiana 27.0 2
## 3 13165 Jenkins Georgia 25.3 3
## 4 5133 Sevier Arkansas 24.4 4
## 5 20175 Seward Kansas 23.4 5
## 6 48079 Cochran Texas 22.8 6
## 7 48247 Jim Hogg Texas 22.6 7
## 8 48393 Roberts Texas 22.4 8
## 9 39075 Holmes Ohio 21.8 9
## 10 21197 Powell Kentucky 21.6 10
Used the filter function to show ranks less than or equal to 10, since 1 is the highest rank. Then used the select function to show the desired columns, and the arrange function to order by carpool_rank.
acsdata %>%
  arrange(desc(carpool_rank)) %>%
  slice_head(n = 10) %>%
  select(census_id, county, state, carpool, carpool_rank)
## census_id county state carpool carpool_rank
## 1 48261 Kenedy Texas 0.0 3141
## 2 48269 King Texas 0.0 3141
## 3 48235 Irion Texas 0.9 3140
## 4 31183 Wheeler Nebraska 1.3 3139
## 5 36061 New York New York 1.9 3138
## 6 13309 Wheeler Georgia 2.3 3136
## 7 38029 Emmons North Dakota 2.3 3136
## 8 30019 Daniels Montana 2.6 3134
## 9 31057 Dundy Nebraska 2.6 3134
## 10 46069 Hyde South Dakota 2.8 3132
Used the arrange function with desc to invert the order so the lowest-ranked counties come first. slice_head shows the top 10 rows in that order, and select shows only the desired columns.
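The repeated ranks in this output (3141 and 3136 each appear twice, while 3142 and 3137 never appear) are just how min_rank handles ties; a tiny demonstration:
# tied values share the minimum of their positions, and the ranks
# they would have occupied are skipped: this returns 1 2 2 4
min_rank(desc(c(5, 3, 3, 1)))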
acsdata %>%
  group_by(state) %>%
  summarise(avg_carpool_rank = mean(carpool_rank, na.rm = TRUE)) %>%
  arrange(avg_carpool_rank) %>%
  slice_head(n = 1)
## # A tibble: 1 × 2
## state avg_carpool_rank
## <chr> <dbl>
## 1 Arizona 971.
Grouping observations by state, then using the summarise function to average carpool_rank for every state, then the arrange function to put the lowest average rank first (which means the best ranking), with slice_head set to 1 to show only the top state.
acsdata %>%
  group_by(state) %>%
  summarise(avg_carpool_rank = mean(carpool_rank, na.rm = TRUE)) %>%
  arrange(avg_carpool_rank) %>%
  slice_head(n = 5)
## # A tibble: 5 × 2
## state avg_carpool_rank
## <chr> <dbl>
## 1 Arizona 971.
## 2 Utah 1019.
## 3 Arkansas 1055.
## 4 Hawaii 1072.
## 5 Alaska 1087.
Same as above, but slice_head now shows the top 5 states instead of only the first.