Challenge 2 Instructions

challenge_2

railroads

faostat

hotel_bookings

Data wrangling: using group() and summarise()

Author

Sean Conway

Published

December 21, 2023

Code

library(here)
library(readr)
library(knitr)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

railroad*.csv or StateCounty2012.xls ⭐
FAOstat*.csv or birds.csv ⭐⭐⭐
hotel_bookings.csv ⭐⭐⭐⭐

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a brief, high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set. Describe what you find.

Solutions

Reading the Data

The working directory for RStudio has been set such that “railroad_2012_clean_county.csv” can be found at the root of the working directory using the setwd() method.

Code

railroad <- read_csv(here("railroad_2012_clean_county.csv"))
railroad

# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# ℹ 2,920 more rows

Data Description

High Level Description

The data set comprises of 2,930 rows with 3 columns.

Code

railroad

# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# ℹ 2,920 more rows

The data set has a total of 2 <chr> type columns and the remaining column is of the <dbl> type. The railroad variable contains the entire dataset.

How was the Data likely collected?

The dataset seems to provide a count of the total number of railroad employees in a particular county within multiple states. The dataset is pre-cleaned since no NA values are seen. The data is likely to have been collected using official/unofficial sources providing railroad employee count.

Grouped Summary Statistics

Custom Mode Function

Since R does not provide a built-in “mode” function, a custom function to do so is written below.

Code

custom_mode <- function(x) {
  tab <- table(x) # create a table out of the column
  modes <- as.numeric(names(tab[tab == max(tab)]))
  return(modes)
}

Grouping by State

The following query results in a tibble where the data is grouped by the state. The tibble is further arranged in descending order of the total_employees.

Code

railroad %>%
  group_by(state) %>%
  arrange(desc(total_employees))

# A tibble: 2,930 × 3
# Groups:   state [53]
   state county           total_employees
   <chr> <chr>                      <dbl>
 1 IL    COOK                        8207
 2 TX    TARRANT                     4235
 3 NE    DOUGLAS                     3797
 4 NY    SUFFOLK                     3685
 5 VA    INDEPENDENT CITY            3249
 6 FL    DUVAL                       3073
 7 CA    SAN BERNARDINO              2888
 8 CA    LOS ANGELES                 2545
 9 TX    HARRIS                      2535
10 NE    LINCOLN                     2289
# ℹ 2,920 more rows

From the above tibble, we observe that the “COOK” county in Illinois (IL) has the highest employee count over all counties.

Central Tendencies and Dispersion

The following query provides central tendencies and dispersion also grouped by the state column.

Code

railroad %>%
  group_by(state) %>%
  summarize(mean_count=mean(total_employees, na.rm=T),
            median_count=median(total_employees, na.rm=T),
            mode=custom_mode(total_employees),
            sd_count=sd(total_employees, na.rm=T),
            min_count=min(total_employees, na.rm=T),
            max_count=max(total_employees, na.rm=T),
            q1=quantile(total_employees, 0.25),
            q3=quantile(total_employees, 0.75))

# A tibble: 165 × 9
# Groups:   state [53]
   state mean_count median_count  mode sd_count min_count max_count    q1    q3
   <chr>      <dbl>        <dbl> <dbl>    <dbl>     <dbl>     <dbl> <dbl> <dbl>
 1 AE           2            2       2     NA           2         2   2     2  
 2 AK          17.2          2.5     2     34.8         1        88   2     6  
 3 AL          63.5         26       7    130.          1       990  10.5  57.5
 4 AL          63.5         26      11    130.          1       990  10.5  57.5
 5 AP           1            1       1     NA           1         1   1     1  
 6 AR          53.8         16.5     5    131.          1       972   7    40.8
 7 AZ         210.          94       3    228.          3       749  42.5 338. 
 8 AZ         210.          94      10    228.          3       749  42.5 338. 
 9 AZ         210.          94      18    228.          3       749  42.5 338. 
10 AZ         210.          94      37    228.          3       749  42.5 338. 
# ℹ 155 more rows

We observe a few NA values within the sd_count column, which is due to that state having only a single county within the dataset. Note, the custom mode function has been used to compute the mode in the above tibble.

Finding state with highest employees

The following query is used to find the state with the highest number of employees across all counties.

Code

railroad %>%
  group_by(state) %>%
  summarize(total_state_employees=sum(total_employees)) %>%
  arrange(desc(total_state_employees))

# A tibble: 53 × 2
   state total_state_employees
   <chr>                 <dbl>
 1 TX                    19839
 2 IL                    19131
 3 NY                    17050
 4 NE                    13176
 5 CA                    13137
 6 PA                    12769
 7 OH                     9056
 8 GA                     8605
 9 IN                     8537
10 MO                     8419
# ℹ 43 more rows

From the above tibble, we observe that Texas is the state with a total_state_employees count equaling 19839.

Grouping by County

The following query results in a tibble where the data is grouped by the state. The tibble is further arranged in descending order of the total_employees.

Code

railroad %>%
  group_by(county) %>%
  arrange(desc(total_employees)) %>%
  select(county, total_employees)

# A tibble: 2,930 × 2
# Groups:   county [1,709]
   county           total_employees
   <chr>                      <dbl>
 1 COOK                        8207
 2 TARRANT                     4235
 3 DOUGLAS                     3797
 4 SUFFOLK                     3685
 5 INDEPENDENT CITY            3249
 6 DUVAL                       3073
 7 SAN BERNARDINO              2888
 8 LOS ANGELES                 2545
 9 HARRIS                      2535
10 LINCOLN                     2289
# ℹ 2,920 more rows

As observed previously, the “COOK” county has the highest total_employees.

Central Tendencies and Dispersion

We can also find central tendencies and dispersion when grouped by county using the following query.

Code

railroad %>%
  group_by(county) %>%
  summarize(mean_count=mean(total_employees, na.rm=T),
            median_count=median(total_employees, na.rm=T),
            mode=custom_mode(total_employees),
            sd_count=sd(total_employees, na.rm=T),
            min_count=min(total_employees, na.rm=T),
            max_count=max(total_employees, na.rm=T),
            q1=quantile(total_employees, 0.25),
            q3=quantile(total_employees, 0.75))

# A tibble: 2,508 × 9
# Groups:   county [1,709]
   county    mean_count median_count  mode sd_count min_count max_count     q1
   <chr>          <dbl>        <dbl> <dbl>    <dbl>     <dbl>     <dbl>  <dbl>
 1 ABBEVILLE     124           124     124    NA          124       124 124   
 2 ACADIA         13            13      13    NA           13        13  13   
 3 ACCOMACK        4             4       4    NA            4         4   4   
 4 ADA            81            81      81    NA           81        81  81   
 5 ADAIR           7.25          3.5     1     9.32         1        21   1.75
 6 ADAIR           7.25          3.5     2     9.32         1        21   1.75
 7 ADAIR           7.25          3.5     5     9.32         1        21   1.75
 8 ADAIR           7.25          3.5    21     9.32         1        21   1.75
 9 ADAMS          73.2          19.5     2   155.           2       553   6   
10 ADDISON         8             8       8    NA            8         8   8   
# ℹ 2,498 more rows
# ℹ 1 more variable: q3 <dbl>

We observe a few NA values within the sd_count column, which is due to only a single county being of that name within the dataset. Note, the custom mode function has been used to compute the mode in the above tibble.

Exploratory Analysis

This section aims to perform miscellaneous data analysis for interesting observations within the dataset.

Total Employees

The following query results in the total employee count over the entire dataset.

Code

railroad %>%
  summarize(
            count=sum(total_employees),
            mean_count=mean(total_employees, na.rm=T),
            median_count=median(total_employees, na.rm=T),
            mode=custom_mode(total_employees),
            sd_count=sd(total_employees, na.rm=T),
            min_count=min(total_employees, na.rm=T),
            max_count=max(total_employees, na.rm=T),
            q1=quantile(total_employees, 0.25),
            q3=quantile(total_employees, 0.75))

# A tibble: 1 × 9
   count mean_count median_count  mode sd_count min_count max_count    q1    q3
   <dbl>      <dbl>        <dbl> <dbl>    <dbl>     <dbl>     <dbl> <dbl> <dbl>
1 255432       87.2           21     1     284.         1      8207     7    65

From the above tibble we see that there are a total of 255432 employees across all states and counties.

Filtering a single County

An obervation from the data set is that multiple states have counties with the same names. For example, “ADAMS” is a county in 12 states.

The below query filters all counties named “ADAMS”.

Code

railroad %>%
  filter(county == "ADAMS")

# A tibble: 12 × 3
   state county total_employees
   <chr> <chr>            <dbl>
 1 CO    ADAMS              553
 2 IA    ADAMS                7
 3 ID    ADAMS                2
 4 IL    ADAMS              116
 5 IN    ADAMS               11
 6 MS    ADAMS                3
 7 ND    ADAMS                2
 8 NE    ADAMS               77
 9 OH    ADAMS               24
10 PA    ADAMS               39
11 WA    ADAMS               15
12 WI    ADAMS               29

The following query provides a summary over this filter

Code

railroad %>%
  filter(county == "ADAMS") %>%
  summarize(mean_count=mean(total_employees, na.rm=T),
          median_count=median(total_employees, na.rm=T),
          mode=custom_mode(total_employees),
          sd_count=sd(total_employees, na.rm=T),
          min_count=min(total_employees, na.rm=T),
          max_count=max(total_employees, na.rm=T),
          q1=quantile(total_employees, 0.25),
          q3=quantile(total_employees, 0.75))

# A tibble: 1 × 8
  mean_count median_count  mode sd_count min_count max_count    q1    q3
       <dbl>        <dbl> <dbl>    <dbl>     <dbl>     <dbl> <dbl> <dbl>
1       73.2         19.5     2     155.         2       553     6  48.5

From the above tibble, we observe that even though q3 is 48.5, the mean is much larger 73.16667. This can be attributed to the “ADAMS” county in Colorado (CO), where the total employee count is 553.

Filtering a single State

The below query filters by state on “MA” and sorts by descending order of total_employees.

Code

railroad %>%
  filter(state == "MA") %>%
  arrange(desc(total_employees))

# A tibble: 12 × 3
   state county     total_employees
   <chr> <chr>                <dbl>
 1 MA    MIDDLESEX              673
 2 MA    SUFFOLK                558
 3 MA    PLYMOUTH               429
 4 MA    NORFOLK                386
 5 MA    ESSEX                  314
 6 MA    WORCESTER              310
 7 MA    BRISTOL                232
 8 MA    HAMPDEN                202
 9 MA    FRANKLIN               113
10 MA    HAMPSHIRE               68
11 MA    BERKSHIRE               50
12 MA    BARNSTABLE              44

We observe that the “MIDDLESEX” county has the highest number of total_employees within Massachusetts (MA).

The following query provides a summary over this filter

Code

railroad %>%
  filter(state == "MA") %>%
  group_by(state) %>%
  summarize(mean_count=mean(total_employees, na.rm=T),
          median_count=median(total_employees, na.rm=T),
          sd_count=sd(total_employees, na.rm=T),
          min_count=min(total_employees, na.rm=T),
          max_count=max(total_employees, na.rm=T),
          q1=quantile(total_employees, 0.25),
          q3=quantile(total_employees, 0.75))

# A tibble: 1 × 8
  state mean_count median_count sd_count min_count max_count    q1    q3
  <chr>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl> <dbl> <dbl>
1 MA          282.          271     204.        44       673  102.  397.

From the above tibble, we observe that the mean is much closer to the median, which signifies that the data is symmetrically distributed.