Introduction:

For our week 5 discussion, we were to come forward with datasets exemplifying “untidy” data. Project 2 requires us to take 3 of the peer posted examples from Week 5, tidy the data, and then perform the analysis requested.

This portion will be focusing on Joe’s Train Casualties by State post.

Sources:

Packages Used:
- tidyverse

Let’s get a look a the data.

trainCasualties <- read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Projects/Project2%20-%20Untidy%20Data/Train%20Casualties%20by%20State.csv")
## New names:
## Rows: 54 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): TOTAL CASUALTIES BY STATE, JAN - DEC (2022 preliminary), ...2, ...3... num
## (3): ...7, ...8, ...9
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
trainCasualties
## # A tibble: 54 × 9
##    TOTAL CASUALTIES BY STATE, …¹ ...2  ...3  ...4  ...5  ...6   ...7  ...8  ...9
##    <chr>                         <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
##  1 <NA>                          <NA>  <NA>  <NA>  <NA>  <NA>     NA    NA    NA
##  2 <NA>                          Fatal <NA>  <NA>  <NA>  Nonf…    NA    NA    NA
##  3 <NA>                          2019  2020  2021  2022  2019   2020  2021  2022
##  4 Alabama                       15    13    21    20    101      79    82    75
##  5 Alaska                        .     1     .     1     53       30    24    34
##  6 Arizona                       16    18    27    15    74       52    42    50
##  7 Arkansas                      7     5     13    12    71       52    45    43
##  8 California                    182   174   177   219   587     416   416   413
##  9 Colorado                      18    13    13    13    112      90   105    81
## 10 Connecticut                   3     3     7     4     134      95   111   143
## # … with 44 more rows, and abbreviated variable name
## #   ¹​`TOTAL CASUALTIES BY STATE, JAN - DEC (2022 preliminary)`
trainCasualties <- trainCasualties[-54,] #Dropping the last column that sums up for totals

There are many gaps, but it looks like our goal will be to establish the following variables for each observation:
- Region(State) - Outcome(Fatal/Nonfatal) - Year

#Splitting the two categories of data up to divide and conquer

fatal <- trainCasualties[,1:5]
fatal
## # A tibble: 53 × 5
##    TOTAL CASUALTIES BY STATE, JAN - DEC (2022 prelimin…¹ ...2  ...3  ...4  ...5 
##    <chr>                                                 <chr> <chr> <chr> <chr>
##  1 <NA>                                                  <NA>  <NA>  <NA>  <NA> 
##  2 <NA>                                                  Fatal <NA>  <NA>  <NA> 
##  3 <NA>                                                  2019  2020  2021  2022 
##  4 Alabama                                               15    13    21    20   
##  5 Alaska                                                .     1     .     1    
##  6 Arizona                                               16    18    27    15   
##  7 Arkansas                                              7     5     13    12   
##  8 California                                            182   174   177   219  
##  9 Colorado                                              18    13    13    13   
## 10 Connecticut                                           3     3     7     4    
## # … with 43 more rows, and abbreviated variable name
## #   ¹​`TOTAL CASUALTIES BY STATE, JAN - DEC (2022 preliminary)`
nonfatal <- trainCasualties[,c(1,6:9)]
nonfatal
## # A tibble: 53 × 5
##    TOTAL CASUALTIES BY STATE, JAN - DEC (2022 prelimin…¹ ...6   ...7  ...8  ...9
##    <chr>                                                 <chr> <dbl> <dbl> <dbl>
##  1 <NA>                                                  <NA>     NA    NA    NA
##  2 <NA>                                                  Nonf…    NA    NA    NA
##  3 <NA>                                                  2019   2020  2021  2022
##  4 Alabama                                               101      79    82    75
##  5 Alaska                                                53       30    24    34
##  6 Arizona                                               74       52    42    50
##  7 Arkansas                                              71       52    45    43
##  8 California                                            587     416   416   413
##  9 Colorado                                              112      90   105    81
## 10 Connecticut                                           134      95   111   143
## # … with 43 more rows, and abbreviated variable name
## #   ¹​`TOTAL CASUALTIES BY STATE, JAN - DEC (2022 preliminary)`
fatal<-fatal[-(1:2),]
nonfatal<-nonfatal[-(1:2),]

#Pulling just the row of years so we can turn it into a vector. We want to use this vector to rename the columns to those years, so we can perform pivot_longer on them.
columns<-as.numeric(fatal[1,])
columns<-columns[-1]

colnames(fatal)[2:5] <- columns #Setting column names equal to the years
fatal <- fatal[-1,] #drop unnecessary row

#pivot longer
fatal <- fatal %>% 
  pivot_longer(c('2019','2020','2021','2022'), names_to='year',values_to = 'count')%>% mutate(outcome = 'fatal')

fatal
## # A tibble: 200 × 4
##    `TOTAL CASUALTIES BY STATE, JAN - DEC (2022 preliminary)` year  count outcome
##    <chr>                                                     <chr> <chr> <chr>  
##  1 Alabama                                                   2019  15    fatal  
##  2 Alabama                                                   2020  13    fatal  
##  3 Alabama                                                   2021  21    fatal  
##  4 Alabama                                                   2022  20    fatal  
##  5 Alaska                                                    2019  .     fatal  
##  6 Alaska                                                    2020  1     fatal  
##  7 Alaska                                                    2021  .     fatal  
##  8 Alaska                                                    2022  1     fatal  
##  9 Arizona                                                   2019  16    fatal  
## 10 Arizona                                                   2020  18    fatal  
## # … with 190 more rows
#Pulling just the row of years so we can turn it into a vector. We want to use this vector to rename the columns to those years, so we can perform pivot_longer on them.
columns<-as.numeric(nonfatal[1,])
columns<-columns[-1]

colnames(nonfatal)[2:5] <- columns #Setting column names equal to the years
nonfatal <- nonfatal[-1,] #drop unnecessary row
nonfatal[32,2] <- '1339' #one of the counts was '1,339' and the comma forced a chr column type
nonfatal$`2019` <- as.numeric(nonfatal$`2019`)

#pivot longer
nonfatal <- nonfatal %>% 
  pivot_longer(c('2019','2020','2021','2022'), names_to='year',values_to = 'count')%>%    mutate(outcome = 'nonfatal')
casualties <- rbind(fatal,nonfatal) #Comgine our rows
colnames(casualties)[1] <- 'state'  #Name our first column to state
casualties <- arrange(casualties,state) #order by state
casualties$count <- as.numeric(casualties$count)
## Warning: NAs introduced by coercion
casualties$count[is.na(casualties$count)]<- 0 #removing NAs with 0

casualties
## # A tibble: 400 × 4
##    state   year  count outcome 
##    <chr>   <chr> <dbl> <chr>   
##  1 Alabama 2019     15 fatal   
##  2 Alabama 2020     13 fatal   
##  3 Alabama 2021     21 fatal   
##  4 Alabama 2022     20 fatal   
##  5 Alabama 2019    101 nonfatal
##  6 Alabama 2020     79 nonfatal
##  7 Alabama 2021     82 nonfatal
##  8 Alabama 2022     75 nonfatal
##  9 Alaska  2019      0 fatal   
## 10 Alaska  2020      1 fatal   
## # … with 390 more rows

Although Joe did not mention any analysis in his post, I pulled a suggestion from Genesis’ reply.
Identifying trends in the data over time, such as whether the number of deaths and injuries has been increasing or decreasing in certain states or overall

Since our dataset wasn’t melted down to individual observations, I decided to just sum up the values and compare across years. Most of the below work is just filtering and renaming column names. We can see by checking the graph and the proportion value for each year that although the total number of incidents have decreased over the years, the proportion of fatalities have increased each time.

fatalities <- casualties %>% group_by(year) %>% filter(outcome == 'fatal') 
survivors <- casualties %>% group_by(year) %>% filter(outcome == 'nonfatal')


fatalitiesSum <- aggregate(fatalities$count,list(fatalities$year),FUN=sum)
surviveSum <- aggregate(survivors$count,list(survivors$year),FUN=sum)

colnames(fatalitiesSum)[1] <- 'year'
colnames(fatalitiesSum)[2] <- 'fatality'
colnames(surviveSum)[1] <- 'year'
colnames(surviveSum)[2] <- 'survive'

surviveSum <- surviveSum[,-1]


summed<- data.frame(fatalitiesSum,surviveSum)
colnames(summed)[1] <- 'year'
colnames(summed)[2] <- 'fatalities'
colnames(summed)[3] <- 'survived'


summed <- summed %>% mutate(totalIncidents = fatalities+survived)
summed <- summed %>% mutate(propFatalities = fatalities/totalIncidents)
summed
##   year fatalities survived totalIncidents propFatalities
## 1 2019        853     8037           8890     0.09595051
## 2 2020        727     5555           6282     0.11572748
## 3 2021        863     5846           6709     0.12863318
## 4 2022        978     6112           7090     0.13794076
g <- ggplot(data = summed, aes(x=year,y=totalIncidents,fill=year))
g+geom_bar(stat='identity')

g2 <- ggplot(data = summed, aes(x=year,y=propFatalities,fill=year))
g2+geom_bar(stat='identity')