challenge1

Author

Jaden Busch

Reading in the Dataset

For this challenge, I chose the railroad_2012_clean_county.csv dataset. To read it in and preview the first 25 rows, I ran:

dataset <- read.csv("challenge_datasets/railroad_2012_clean_county.csv")
head(dataset, 25)
   state               county total_employees
1     AE                  APO               2
2     AK            ANCHORAGE               7
3     AK FAIRBANKS NORTH STAR               2
4     AK               JUNEAU               3
5     AK    MATANUSKA-SUSITNA               2
6     AK                SITKA               1
7     AK SKAGWAY MUNICIPALITY              88
8     AL              AUTAUGA             102
9     AL              BALDWIN             143
10    AL              BARBOUR               1
11    AL                 BIBB              25
12    AL               BLOUNT             154
13    AL              BULLOCK              13
14    AL               BUTLER              29
15    AL              CALHOUN              45
16    AL             CHAMBERS              13
17    AL             CHEROKEE               9
18    AL              CHILTON              72
19    AL              CHOCTAW               7
20    AL               CLARKE              26
21    AL                 CLAY              10
22    AL             CLEBURNE               7
23    AL               COFFEE              14
24    AL              COLBERT             199
25    AL              CONECUH              11

Describing the Data

My interpretation is that the data records railroad employees across the US, aggregated by county: each row gives a location and the number of employees there. The dataset has 3 columns. The first column, state, holds the two-letter abbreviation for a state, stored as a character string. The second column, county, names a county within that state and is also stored as a character string. The third column, total_employees, gives the total number of employees for that state/county pair, stored as an integer.
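The column types described above can be checked directly. The following is a minimal sketch, assuming a small hand-typed sample of rows copied from the head() output (the name `sample_rows` is illustrative and not part of the original analysis). On the full data, `dataset` would take its place. Note that read.csv only returns character columns by default in R 4.0.0 and later; older versions convert strings to factors unless stringsAsFactors = FALSE is set.

```r
# Small sample copied from the head() output above; it stands in for the
# full dataset so that this sketch runs on its own.
sample_rows <- data.frame(
  state = c("AE", "AK", "AK", "AL"),
  county = c("APO", "ANCHORAGE", "JUNEAU", "AUTAUGA"),
  total_employees = c(2L, 7L, 3L, 102L),
  stringsAsFactors = FALSE
)

# Class of each column, returned as a named character vector.
col_classes <- sapply(sample_rows, class)
print(col_classes)

# Number of rows and columns.
print(dim(sample_rows))
```

Calling str(sample_rows) gives the same information in a single compact listing.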

We can summarize the employee counts and find the locations with the fewest and most employees by computing the following:

mean(dataset$total_employees)
[1] 87.17816
sd(dataset$total_employees)
[1] 283.6359
dataset[which.min(dataset$total_employees), ] # minimum
  state county total_employees
6    AK  SITKA               1
dataset[which.max(dataset$total_employees), ] # maximum
    state county total_employees
659    IL   COOK            8207

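The mean number of employees is ~87.18 with a standard deviation of ~283.64, more than three times the mean, which points to a few very large counties. The minimum is 1, in Sitka, Alaska. Because which.min returns only the first match, other counties may tie for it (Barbour County, Alabama, also has 1 in the rows shown earlier). The maximum is 8207, in Cook County, Illinois.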
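Since which.min reports only the first row that attains the minimum, a logical filter is one way to list every tied row. This is a minimal sketch on a hand-typed sample of rows taken from the head() output (`sample_rows`, `tied_min`, and `first_min` are illustrative names, not part of the original analysis):

```r
# Sample rows copied from the head() output; Sitka and Barbour both have 1.
sample_rows <- data.frame(
  state = c("AK", "AK", "AK", "AL", "AL"),
  county = c("ANCHORAGE", "JUNEAU", "SITKA", "BARBOUR", "BIBB"),
  total_employees = c(7L, 3L, 1L, 1L, 25L),
  stringsAsFactors = FALSE
)

# Smallest employee count in the sample.
min_value <- min(sample_rows$total_employees)

# Every row that attains the minimum, not just the first one.
tied_min <- sample_rows[sample_rows$total_employees == min_value, ]
print(tied_min)

# For comparison: which.min() returns only the first matching row.
first_min <- sample_rows[which.min(sample_rows$total_employees), ]
print(first_min)
```

Replacing `sample_rows` with `dataset` should show how many counties share the minimum of 1 in the full data.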

We can also compute the number of times that each state appears in the dataframe:

state_counts <- table(dataset$state)
state_counts

 AE  AK  AL  AP  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
  1   6  67   1  72  15  55  57   8   1   3  67 152   3  99  36 103  92  95 119 
 LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH  OK  OR 
 63  12  24  16  78  86 115  78  53  94  49  89  10  21  29  12  61  88  73  33 
 PA  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
 65   5  46  52  91 221  25  92  14  39  69  53  22 

We can see each state and its corresponding number of occurrences, that is, the number of rows (counties) listed for it. Three codes occur only once (AE, AP, and DC), while Texas occurs the most, with 221 occurrences. There are also 53 distinct codes rather than the usual 50 states: the dataset treats AE (“Armed Forces Europe”), AP (“Armed Forces Pacific”), and DC (the District of Columbia) as states.
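These observations can also be pulled out of the counts programmatically rather than read off the printout. The sketch below is hedged: it uses a hand-typed subset of the counts shown above as a named vector (so `state_counts` here is a stand-in for the real table object, and `singletons`, `most_common`, and `extra_codes` are illustrative names). The same expressions should work on the output of table(dataset$state). R's built-in `state.abb` vector holds the 50 state abbreviations and does not include DC.

```r
# Hand-typed subset of the counts printed above; stands in for the table.
state_counts <- c(AE = 1, AK = 6, AL = 67, AP = 1, DC = 1, IL = 103, TX = 221)

# Codes that appear exactly once.
singletons <- names(state_counts)[state_counts == 1]
print(singletons)

# Code with the most occurrences.
most_common <- names(state_counts)[which.max(state_counts)]
print(most_common)

# Codes that are not among the 50 states in the built-in state.abb vector.
extra_codes <- setdiff(names(state_counts), state.abb)
print(extra_codes)
```

On the full table, length(state_counts) gives the number of distinct codes.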