Challenge 1: Railroads

First I am going to pull in the data from my working directory using read_cvs and remain the file from ‘railroad_2012_clean_county’ to ‘railroad’

railroad <- read_csv("challenge_datasets/railroad_2012_clean_county.csv")

## Rows: 2930 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): state, county
## dbl (1): total_employees
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

After pulling in the data I want to see a preview of the table and the column names to understand what the data I am working with is and make sure it is pulled in correctly. First I am using head function to see the first 6 rows of the table. Next I am using colnames to just see the different column names without the data and if they make sense or perhaps need to be renamed. Lastly, I am using nrow to see how large the data set is.

I see with head function, it is a tibble that has 3 columns. Two of those columns are character columns and one column is numerical.

The colnames pulled the column names of state, county and total_employees. These column names are clear even without the data below it so won’t change any of the column names.

With nrow, I see there is 2,930 rows of data.

head(railroad)

## # A tibble: 6 × 3
##   state county               total_employees
##   <chr> <chr>                          <dbl>
## 1 AE    APO                                2
## 2 AK    ANCHORAGE                          7
## 3 AK    FAIRBANKS NORTH STAR               2
## 4 AK    JUNEAU                             3
## 5 AK    MATANUSKA-SUSITNA                  2
## 6 AK    SITKA                              1

colnames(railroad)

## [1] "state"           "county"          "total_employees"

nrow(railroad)

## [1] 2930

Group and Summarize Data

Mean, median & SD

Next is to look at the mean, median and sd of the number of employees to understand the distribution of the data set.

The mean (87.17) is substantially higher than the median (21), indicating that data Is skewed by outliers of counties with higher number of employees. The standard deviation is also large, indicating the data is more spread out.

railroad %>% 
  summarise(meanEmployees=mean(total_employees), 
            medianEmployees=median(total_employees), 
           sdEmployees=sd(total_employees))

## # A tibble: 1 × 3
##   meanEmployees medianEmployees sdEmployees
##           <dbl>           <dbl>       <dbl>
## 1          87.2              21        284.

Large and small counties

With the data having outliters and being spread out, I want to see the high and low end of the data. What counties have the most employees with the state grouped as well to know where the county is in the country.

To do this, I am going to pipe the data and arrange by total employees. When I use arrange to sort the data from smallest total employees to largest, I see by scrolling though the pages, that many counties have 1 employee. If I continue to scroll, I see this is the case for also counties with 2 or 3 employees. If I use the desc modifier on the arrange function to see the largest numbers first. I see Cook county in Illinois has 8,207 which is 3,972 employees more than the next county’s employees.

railroad %>%
  arrange(total_employees)

## # A tibble: 2,930 × 3
##    state county   total_employees
##    <chr> <chr>              <dbl>
##  1 AK    SITKA                  1
##  2 AL    BARBOUR                1
##  3 AL    HENRY                  1
##  4 AP    APO                    1
##  5 AR    NEWTON                 1
##  6 CA    MONO                   1
##  7 CO    BENT                   1
##  8 CO    CHEYENNE               1
##  9 CO    COSTILLA               1
## 10 CO    DOLORES                1
## # ℹ 2,920 more rows

railroad %>%
 arrange(desc(total_employees))

## # A tibble: 2,930 × 3
##    state county           total_employees
##    <chr> <chr>                      <dbl>
##  1 IL    COOK                        8207
##  2 TX    TARRANT                     4235
##  3 NE    DOUGLAS                     3797
##  4 NY    SUFFOLK                     3685
##  5 VA    INDEPENDENT CITY            3249
##  6 FL    DUVAL                       3073
##  7 CA    SAN BERNARDINO              2888
##  8 CA    LOS ANGELES                 2545
##  9 TX    HARRIS                      2535
## 10 NE    LINCOLN                     2289
## # ℹ 2,920 more rows

Filtering by state

With Illinois being the state with the highest number of employees, I want to filter by just Illinois to see what the data looks like.

I will pipe the data and then filter by ‘IL’ and arrange the total number of employees from largest to smallest using the desc modifier. I want to see how different the mean would be if this outlier was removed.

Again I filtered by ‘IL’ and then summarised both the mean and median. I did this again but used != to remove COOK from the equation. I did not anticipate the median to change much but wanted it there as reference. I see the mean decreased from 185.737 to 107.098. Although this decreased the mean, it is still quite far from the median at ~41 which means the data is still generally skewing to the higher numbers/right.

railroad %>% 
  filter(state == 'IL') %>% 
  arrange(desc(total_employees))

## # A tibble: 103 × 3
##    state county   total_employees
##    <chr> <chr>              <dbl>
##  1 IL    COOK                8207
##  2 IL    WILL                1784
##  3 IL    KNOX                 885
##  4 IL    DUPAGE               837
##  5 IL    KANE                 577
##  6 IL    ST CLAIR             495
##  7 IL    MADISON              427
##  8 IL    MACON                425
##  9 IL    LAKE                 340
## 10 IL    KANKAKEE             289
## # ℹ 93 more rows

railroad %>% 
  filter(state == 'IL') %>% 
  summarise(mean_value_IL = mean(total_employees), 
            median_value_IL = median(total_employees))

## # A tibble: 1 × 2
##   mean_value_IL median_value_IL
##           <dbl>           <dbl>
## 1          186.              42

railroad %>% 
  filter(state == 'IL' & county != 'COOK') %>%
  summarise(mean_value_IL = mean(total_employees), 
            median_value_IL = median(total_employees))

## # A tibble: 1 × 2
##   mean_value_IL median_value_IL
##           <dbl>           <dbl>
## 1          107.            41.5

State with the overall highest number of employees

Although Illinois has the county with the largest, what state has the most overall employees. To see this, I am going to pipe the data and then group by state. Then I will summarise the data and sum the total number of employees.

When viewing, I want to see the largest total_employee_state at the top so I will arrange the table with the desc modifier.

Texas has the most employees with 19,839 followed by Illinois with 19,131.

railroad %>%
  group_by(state) %>%
  summarise(total_employees_state = sum(total_employees)) %>%
  arrange(desc(total_employees_state))

## # A tibble: 53 × 2
##    state total_employees_state
##    <chr>                 <dbl>
##  1 TX                    19839
##  2 IL                    19131
##  3 NY                    17050
##  4 NE                    13176
##  5 CA                    13137
##  6 PA                    12769
##  7 OH                     9056
##  8 GA                     8605
##  9 IN                     8537
## 10 MO                     8419
## # ℹ 43 more rows

Challenge 1

Ciara Salmon

2023-12-20