Code
library(here)
library(readr)
library(knitr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)library(here)
library(readr)
library(knitr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)Today’s challenge is to
Read in one (or more) of the following data sets, using the correct R package and command.
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a brief, high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set. Describe what you find.
The working directory for RStudio has been set such that “railroad_2012_clean_county.csv” can be found at the root of the working directory using the setwd() method.
railroad <- read_csv(here("railroad_2012_clean_county.csv"))
railroad# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# ℹ 2,920 more rows
The data set comprises of 2,930 rows with 3 columns.
railroad# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# ℹ 2,920 more rows
The data set has a total of 2 <chr> type columns and the remaining column is of the <dbl> type. The railroad variable contains the entire dataset.
The dataset seems to provide a count of the total number of railroad employees in a particular county within multiple states. The dataset is pre-cleaned since no NA values are seen. The data is likely to have been collected using official/unofficial sources providing railroad employee count.
Since R does not provide a built-in “mode” function, a custom function to do so is written below.
custom_mode <- function(x) {
tab <- table(x) # create a table out of the column
modes <- as.numeric(names(tab[tab == max(tab)]))
return(modes)
}The following query results in a tibble where the data is grouped by the state. The tibble is further arranged in descending order of the total_employees.
railroad %>%
group_by(state) %>%
arrange(desc(total_employees))# A tibble: 2,930 × 3
# Groups: state [53]
state county total_employees
<chr> <chr> <dbl>
1 IL COOK 8207
2 TX TARRANT 4235
3 NE DOUGLAS 3797
4 NY SUFFOLK 3685
5 VA INDEPENDENT CITY 3249
6 FL DUVAL 3073
7 CA SAN BERNARDINO 2888
8 CA LOS ANGELES 2545
9 TX HARRIS 2535
10 NE LINCOLN 2289
# ℹ 2,920 more rows
From the above tibble, we observe that the “COOK” county in Illinois (IL) has the highest employee count over all counties.
The following query provides central tendencies and dispersion also grouped by the state column.
railroad %>%
group_by(state) %>%
summarize(mean_count=mean(total_employees, na.rm=T),
median_count=median(total_employees, na.rm=T),
mode=custom_mode(total_employees),
sd_count=sd(total_employees, na.rm=T),
min_count=min(total_employees, na.rm=T),
max_count=max(total_employees, na.rm=T),
q1=quantile(total_employees, 0.25),
q3=quantile(total_employees, 0.75))# A tibble: 165 × 9
# Groups: state [53]
state mean_count median_count mode sd_count min_count max_count q1 q3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AE 2 2 2 NA 2 2 2 2
2 AK 17.2 2.5 2 34.8 1 88 2 6
3 AL 63.5 26 7 130. 1 990 10.5 57.5
4 AL 63.5 26 11 130. 1 990 10.5 57.5
5 AP 1 1 1 NA 1 1 1 1
6 AR 53.8 16.5 5 131. 1 972 7 40.8
7 AZ 210. 94 3 228. 3 749 42.5 338.
8 AZ 210. 94 10 228. 3 749 42.5 338.
9 AZ 210. 94 18 228. 3 749 42.5 338.
10 AZ 210. 94 37 228. 3 749 42.5 338.
# ℹ 155 more rows
We observe a few NA values within the sd_count column, which is due to that state having only a single county within the dataset. Note, the custom mode function has been used to compute the mode in the above tibble.
The following query is used to find the state with the highest number of employees across all counties.
railroad %>%
group_by(state) %>%
summarize(total_state_employees=sum(total_employees)) %>%
arrange(desc(total_state_employees))# A tibble: 53 × 2
state total_state_employees
<chr> <dbl>
1 TX 19839
2 IL 19131
3 NY 17050
4 NE 13176
5 CA 13137
6 PA 12769
7 OH 9056
8 GA 8605
9 IN 8537
10 MO 8419
# ℹ 43 more rows
From the above tibble, we observe that Texas is the state with a total_state_employees count equaling 19839.
The following query results in a tibble where the data is grouped by the state. The tibble is further arranged in descending order of the total_employees.
railroad %>%
group_by(county) %>%
arrange(desc(total_employees)) %>%
select(county, total_employees)# A tibble: 2,930 × 2
# Groups: county [1,709]
county total_employees
<chr> <dbl>
1 COOK 8207
2 TARRANT 4235
3 DOUGLAS 3797
4 SUFFOLK 3685
5 INDEPENDENT CITY 3249
6 DUVAL 3073
7 SAN BERNARDINO 2888
8 LOS ANGELES 2545
9 HARRIS 2535
10 LINCOLN 2289
# ℹ 2,920 more rows
As observed previously, the “COOK” county has the highest total_employees.
We can also find central tendencies and dispersion when grouped by county using the following query.
railroad %>%
group_by(county) %>%
summarize(mean_count=mean(total_employees, na.rm=T),
median_count=median(total_employees, na.rm=T),
mode=custom_mode(total_employees),
sd_count=sd(total_employees, na.rm=T),
min_count=min(total_employees, na.rm=T),
max_count=max(total_employees, na.rm=T),
q1=quantile(total_employees, 0.25),
q3=quantile(total_employees, 0.75))# A tibble: 2,508 × 9
# Groups: county [1,709]
county mean_count median_count mode sd_count min_count max_count q1
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABBEVILLE 124 124 124 NA 124 124 124
2 ACADIA 13 13 13 NA 13 13 13
3 ACCOMACK 4 4 4 NA 4 4 4
4 ADA 81 81 81 NA 81 81 81
5 ADAIR 7.25 3.5 1 9.32 1 21 1.75
6 ADAIR 7.25 3.5 2 9.32 1 21 1.75
7 ADAIR 7.25 3.5 5 9.32 1 21 1.75
8 ADAIR 7.25 3.5 21 9.32 1 21 1.75
9 ADAMS 73.2 19.5 2 155. 2 553 6
10 ADDISON 8 8 8 NA 8 8 8
# ℹ 2,498 more rows
# ℹ 1 more variable: q3 <dbl>
We observe a few NA values within the sd_count column, which is due to only a single county being of that name within the dataset. Note, the custom mode function has been used to compute the mode in the above tibble.
This section aims to perform miscellaneous data analysis for interesting observations within the dataset.
The following query results in the total employee count over the entire dataset.
railroad %>%
summarize(
count=sum(total_employees),
mean_count=mean(total_employees, na.rm=T),
median_count=median(total_employees, na.rm=T),
mode=custom_mode(total_employees),
sd_count=sd(total_employees, na.rm=T),
min_count=min(total_employees, na.rm=T),
max_count=max(total_employees, na.rm=T),
q1=quantile(total_employees, 0.25),
q3=quantile(total_employees, 0.75))# A tibble: 1 × 9
count mean_count median_count mode sd_count min_count max_count q1 q3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 255432 87.2 21 1 284. 1 8207 7 65
From the above tibble we see that there are a total of 255432 employees across all states and counties.
An obervation from the data set is that multiple states have counties with the same names. For example, “ADAMS” is a county in 12 states.
The below query filters all counties named “ADAMS”.
railroad %>%
filter(county == "ADAMS")# A tibble: 12 × 3
state county total_employees
<chr> <chr> <dbl>
1 CO ADAMS 553
2 IA ADAMS 7
3 ID ADAMS 2
4 IL ADAMS 116
5 IN ADAMS 11
6 MS ADAMS 3
7 ND ADAMS 2
8 NE ADAMS 77
9 OH ADAMS 24
10 PA ADAMS 39
11 WA ADAMS 15
12 WI ADAMS 29
The following query provides a summary over this filter
railroad %>%
filter(county == "ADAMS") %>%
summarize(mean_count=mean(total_employees, na.rm=T),
median_count=median(total_employees, na.rm=T),
mode=custom_mode(total_employees),
sd_count=sd(total_employees, na.rm=T),
min_count=min(total_employees, na.rm=T),
max_count=max(total_employees, na.rm=T),
q1=quantile(total_employees, 0.25),
q3=quantile(total_employees, 0.75))# A tibble: 1 × 8
mean_count median_count mode sd_count min_count max_count q1 q3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 73.2 19.5 2 155. 2 553 6 48.5
From the above tibble, we observe that even though q3 is 48.5, the mean is much larger 73.16667. This can be attributed to the “ADAMS” county in Colorado (CO), where the total employee count is 553.
The below query filters by state on “MA” and sorts by descending order of total_employees.
railroad %>%
filter(state == "MA") %>%
arrange(desc(total_employees))# A tibble: 12 × 3
state county total_employees
<chr> <chr> <dbl>
1 MA MIDDLESEX 673
2 MA SUFFOLK 558
3 MA PLYMOUTH 429
4 MA NORFOLK 386
5 MA ESSEX 314
6 MA WORCESTER 310
7 MA BRISTOL 232
8 MA HAMPDEN 202
9 MA FRANKLIN 113
10 MA HAMPSHIRE 68
11 MA BERKSHIRE 50
12 MA BARNSTABLE 44
We observe that the “MIDDLESEX” county has the highest number of total_employees within Massachusetts (MA).
The following query provides a summary over this filter
railroad %>%
filter(state == "MA") %>%
group_by(state) %>%
summarize(mean_count=mean(total_employees, na.rm=T),
median_count=median(total_employees, na.rm=T),
sd_count=sd(total_employees, na.rm=T),
min_count=min(total_employees, na.rm=T),
max_count=max(total_employees, na.rm=T),
q1=quantile(total_employees, 0.25),
q3=quantile(total_employees, 0.75))# A tibble: 1 × 8
state mean_count median_count sd_count min_count max_count q1 q3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MA 282. 271 204. 44 673 102. 397.
From the above tibble, we observe that the mean is much closer to the median, which signifies that the data is symmetrically distributed.