Introduction:
For our week 5 discussion, we were to come forward with datasets exemplifying “untidy” data. Project 2 requires us to take 3 of the peer posted examples from Week 5, tidy the data, and then perform the analysis requested.
This portion will be focusing on Joe’s Train Casualties by State post.
Sources:
Packages Used:
- tidyverse
Let’s get a look a the data.
trainCasualties <- read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Projects/Project2%20-%20Untidy%20Data/Train%20Casualties%20by%20State.csv")
## New names:
## Rows: 54 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): TOTAL CASUALTIES BY STATE, JAN - DEC (2022 preliminary), ...2, ...3... num
## (3): ...7, ...8, ...9
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
trainCasualties
## # A tibble: 54 × 9
## TOTAL CASUALTIES BY STATE, …¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA
## 2 <NA> Fatal <NA> <NA> <NA> Nonf… NA NA NA
## 3 <NA> 2019 2020 2021 2022 2019 2020 2021 2022
## 4 Alabama 15 13 21 20 101 79 82 75
## 5 Alaska . 1 . 1 53 30 24 34
## 6 Arizona 16 18 27 15 74 52 42 50
## 7 Arkansas 7 5 13 12 71 52 45 43
## 8 California 182 174 177 219 587 416 416 413
## 9 Colorado 18 13 13 13 112 90 105 81
## 10 Connecticut 3 3 7 4 134 95 111 143
## # … with 44 more rows, and abbreviated variable name
## # ¹`TOTAL CASUALTIES BY STATE, JAN - DEC (2022 preliminary)`
trainCasualties <- trainCasualties[-54,] #Dropping the last column that sums up for totals
There are many gaps, but it looks like our goal will be to establish
the following variables for each observation:
- Region(State) - Outcome(Fatal/Nonfatal) - Year
#Splitting the two categories of data up to divide and conquer
fatal <- trainCasualties[,1:5]
fatal
## # A tibble: 53 × 5
## TOTAL CASUALTIES BY STATE, JAN - DEC (2022 prelimin…¹ ...2 ...3 ...4 ...5
## <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> Fatal <NA> <NA> <NA>
## 3 <NA> 2019 2020 2021 2022
## 4 Alabama 15 13 21 20
## 5 Alaska . 1 . 1
## 6 Arizona 16 18 27 15
## 7 Arkansas 7 5 13 12
## 8 California 182 174 177 219
## 9 Colorado 18 13 13 13
## 10 Connecticut 3 3 7 4
## # … with 43 more rows, and abbreviated variable name
## # ¹`TOTAL CASUALTIES BY STATE, JAN - DEC (2022 preliminary)`
nonfatal <- trainCasualties[,c(1,6:9)]
nonfatal
## # A tibble: 53 × 5
## TOTAL CASUALTIES BY STATE, JAN - DEC (2022 prelimin…¹ ...6 ...7 ...8 ...9
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 <NA> <NA> NA NA NA
## 2 <NA> Nonf… NA NA NA
## 3 <NA> 2019 2020 2021 2022
## 4 Alabama 101 79 82 75
## 5 Alaska 53 30 24 34
## 6 Arizona 74 52 42 50
## 7 Arkansas 71 52 45 43
## 8 California 587 416 416 413
## 9 Colorado 112 90 105 81
## 10 Connecticut 134 95 111 143
## # … with 43 more rows, and abbreviated variable name
## # ¹`TOTAL CASUALTIES BY STATE, JAN - DEC (2022 preliminary)`
fatal<-fatal[-(1:2),]
nonfatal<-nonfatal[-(1:2),]
#Pulling just the row of years so we can turn it into a vector. We want to use this vector to rename the columns to those years, so we can perform pivot_longer on them.
columns<-as.numeric(fatal[1,])
columns<-columns[-1]
colnames(fatal)[2:5] <- columns #Setting column names equal to the years
fatal <- fatal[-1,] #drop unnecessary row
#pivot longer
fatal <- fatal %>%
pivot_longer(c('2019','2020','2021','2022'), names_to='year',values_to = 'count')%>% mutate(outcome = 'fatal')
fatal
## # A tibble: 200 × 4
## `TOTAL CASUALTIES BY STATE, JAN - DEC (2022 preliminary)` year count outcome
## <chr> <chr> <chr> <chr>
## 1 Alabama 2019 15 fatal
## 2 Alabama 2020 13 fatal
## 3 Alabama 2021 21 fatal
## 4 Alabama 2022 20 fatal
## 5 Alaska 2019 . fatal
## 6 Alaska 2020 1 fatal
## 7 Alaska 2021 . fatal
## 8 Alaska 2022 1 fatal
## 9 Arizona 2019 16 fatal
## 10 Arizona 2020 18 fatal
## # … with 190 more rows
#Pulling just the row of years so we can turn it into a vector. We want to use this vector to rename the columns to those years, so we can perform pivot_longer on them.
columns<-as.numeric(nonfatal[1,])
columns<-columns[-1]
colnames(nonfatal)[2:5] <- columns #Setting column names equal to the years
nonfatal <- nonfatal[-1,] #drop unnecessary row
nonfatal[32,2] <- '1339' #one of the counts was '1,339' and the comma forced a chr column type
nonfatal$`2019` <- as.numeric(nonfatal$`2019`)
#pivot longer
nonfatal <- nonfatal %>%
pivot_longer(c('2019','2020','2021','2022'), names_to='year',values_to = 'count')%>% mutate(outcome = 'nonfatal')
casualties <- rbind(fatal,nonfatal) #Comgine our rows
colnames(casualties)[1] <- 'state' #Name our first column to state
casualties <- arrange(casualties,state) #order by state
casualties$count <- as.numeric(casualties$count)
## Warning: NAs introduced by coercion
casualties$count[is.na(casualties$count)]<- 0 #removing NAs with 0
casualties
## # A tibble: 400 × 4
## state year count outcome
## <chr> <chr> <dbl> <chr>
## 1 Alabama 2019 15 fatal
## 2 Alabama 2020 13 fatal
## 3 Alabama 2021 21 fatal
## 4 Alabama 2022 20 fatal
## 5 Alabama 2019 101 nonfatal
## 6 Alabama 2020 79 nonfatal
## 7 Alabama 2021 82 nonfatal
## 8 Alabama 2022 75 nonfatal
## 9 Alaska 2019 0 fatal
## 10 Alaska 2020 1 fatal
## # … with 390 more rows
Although Joe did not mention any analysis in his post, I pulled a
suggestion from Genesis’ reply.
Identifying trends in the data over time, such as whether the
number of deaths and injuries has been increasing or decreasing in
certain states or overall
Since our dataset wasn’t melted down to individual observations, I decided to just sum up the values and compare across years. Most of the below work is just filtering and renaming column names. We can see by checking the graph and the proportion value for each year that although the total number of incidents have decreased over the years, the proportion of fatalities have increased each time.
fatalities <- casualties %>% group_by(year) %>% filter(outcome == 'fatal')
survivors <- casualties %>% group_by(year) %>% filter(outcome == 'nonfatal')
fatalitiesSum <- aggregate(fatalities$count,list(fatalities$year),FUN=sum)
surviveSum <- aggregate(survivors$count,list(survivors$year),FUN=sum)
colnames(fatalitiesSum)[1] <- 'year'
colnames(fatalitiesSum)[2] <- 'fatality'
colnames(surviveSum)[1] <- 'year'
colnames(surviveSum)[2] <- 'survive'
surviveSum <- surviveSum[,-1]
summed<- data.frame(fatalitiesSum,surviveSum)
colnames(summed)[1] <- 'year'
colnames(summed)[2] <- 'fatalities'
colnames(summed)[3] <- 'survived'
summed <- summed %>% mutate(totalIncidents = fatalities+survived)
summed <- summed %>% mutate(propFatalities = fatalities/totalIncidents)
summed
## year fatalities survived totalIncidents propFatalities
## 1 2019 853 8037 8890 0.09595051
## 2 2020 727 5555 6282 0.11572748
## 3 2021 863 5846 6709 0.12863318
## 4 2022 978 6112 7090 0.13794076
g <- ggplot(data = summed, aes(x=year,y=totalIncidents,fill=year))
g+geom_bar(stat='identity')
g2 <- ggplot(data = summed, aes(x=year,y=propFatalities,fill=year))
g2+geom_bar(stat='identity')