We begin by loading the tidyverse (dplyr, the pipe, ggplot2, etc.) and reading the bzip2-compressed CSV file into R. Wrapping the file name in bzfile() lets read.csv() decompress the data as it reads.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.1
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.4.1
## Warning: package 'tidyr' was built under R version 3.4.1
## Warning: package 'readr' was built under R version 3.4.1
## Warning: package 'purrr' was built under R version 3.4.1
## Warning: package 'dplyr' was built under R version 3.4.1
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
storm_dat <- read.csv(bzfile("stormdat.csv.bz2"),
header=TRUE,
sep=",",
stringsAsFactors=FALSE)
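As a quick illustration of how bzfile() and read.csv() work together, here is a minimal round-trip sketch. The toy data frame and the temporary file are hypothetical stand-ins, not the NOAA file.

```r
# Hypothetical toy data, used only to demonstrate the round trip
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

# Write the toy data to a bzip2-compressed temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back the same way the storm data is read above
back <- read.csv(bzfile(tmp), header = TRUE, sep = ",",
                 stringsAsFactors = FALSE)
print(back)
```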
With the data stored in a data frame, we can inspect its overall structure.
str(storm_dat)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
I noticed the column named “PROPDMGEXP”, which holds a letter or digit giving the power of ten by which the matching PROPDMG value must be multiplied. For example, K = 1000 = 10^3. We therefore need to convert each code to its numeric multiplier so that the full dollar amount can be computed for summarizing and plotting. Any code not listed in the key, including an empty string, is treated as a multiplier of 1.
storm_dat$PROPDMGEXP <- toupper(storm_dat$PROPDMGEXP)
prop_exp_key <- c("\"\"" = 10^0,
"-" = 10^0,
"+" = 10^0,
"0" = 10^0,
"1" = 10^1,
"2" = 10^2,
"3" = 10^3,
"4" = 10^4,
"5" = 10^5,
"6" = 10^6,
"7" = 10^7,
"8" = 10^8,
"9" = 10^9,
"H" = 10^2,
"K" = 10^3,
"M" = 10^6,
"B" = 10^9)
storm_dat$PROPDMGEXP <- prop_exp_key[as.character(storm_dat$PROPDMGEXP)]
storm_dat$PROPDMGEXP[is.na(storm_dat$PROPDMGEXP)] <- 10^0
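To make the lookup step concrete, here is a small sketch of the same technique on toy values. The names dmg, code, key, and mult, and the numbers themselves, are made up for illustration and do not come from the storm data.

```r
# Hypothetical damage amounts and exponent codes
dmg  <- c(25, 2.5, 1.5, 3, 7)
code <- c("K", "k", "M", "", "?")

# Lookup table: names are exponent codes, values are multipliers
key <- c("H" = 10^2, "K" = 10^3, "M" = 10^6, "B" = 10^9)

# Indexing a named vector by a name it does not contain
# (here "" and "?") returns NA
mult <- unname(key[toupper(code)])

# Treat unknown or missing codes as a multiplier of 1
mult[is.na(mult)] <- 1

# Full dollar amounts
dollars <- dmg * mult
print(dollars)
```

The empty string and the "?" code both fall through to the is.na() replacement, which is why the key does not need an entry for every code that appears in the data.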
We can now summarize the data with dplyr, using select(), group_by(), summarise(), and the pipe. We keep only the columns of interest and group by event type (EVTYPE). I decided to focus on fatalities and property damage.
sub <- storm_dat %>% select(EVTYPE, STATE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, STATE__) %>%
group_by(EVTYPE) %>%
summarise(fat = sum(FATALITIES), prop = sum(PROPDMG*PROPDMGEXP))
## Warning: package 'bindrcpp' was built under R version 3.4.1
We can then pull fatalities and property damage into separate data frames for plotting. For each measure I keep the top 10 event types, ranked by total fatalities or total property damage.
fatal <- sub %>% select(EVTYPE, fat) %>% filter(fat > 1) %>% arrange(desc(fat)) %>% slice(1:10)
property <- sub %>% select(EVTYPE, prop) %>% filter(prop > 1) %>% arrange(desc(prop)) %>% slice(1:10)
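The group, summarize, and rank pattern used above can be sketched on a toy data frame. The data below is invented for illustration, and the example keeps the top 2 rather than the top 10.

```r
library(dplyr)

# Hypothetical toy events, not taken from the NOAA data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0),
  stringsAsFactors = FALSE
)

top2 <- toy %>%
  group_by(EVTYPE) %>%                 # one group per event type
  summarise(fat = sum(FATALITIES)) %>% # total fatalities per group
  filter(fat > 0) %>%                  # drop event types with no fatalities
  arrange(desc(fat)) %>%               # largest totals first
  slice(1:2)                           # keep the first two rows

print(top2)
```

Because arrange() sorts before slice() is applied, slice(1:2) here returns the two event types with the largest totals.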
The first chart shows the top 10 storm event types by number of fatalities. Tornadoes cause the highest number of fatalities, followed by excessive heat and flash flooding in 2nd and 3rd place. Flooding ranks 7th for deaths, but as the second chart shows, it was the most costly in terms of property damage.
(g1 <- ggplot(fatal, (aes(x = reorder(EVTYPE, -fat), y = fat))) +
geom_bar(stat="identity", aes(fill=EVTYPE), position="dodge") +
xlab("Event Type") + ylab("Total number of fatalities") +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
theme(legend.position = "none") +
ggtitle("Top 10 storm events by number of fatalities"))
However, when we look at economic impact via property damage, flooding causes the most damage. Hurricanes/typhoons and tornadoes also cause substantial property damage, ranking 2nd and 3rd most costly.
(g2 <- ggplot(property, (aes(x = reorder(EVTYPE, -prop), y = prop))) +
geom_bar(stat="identity", aes(fill=EVTYPE), position="dodge") +
xlab("Event Type") + ylab("Total property damage in dollars") +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
theme(legend.position = "none") +
ggtitle("Top 10 storm events by property damage"))
Tornadoes cause the most fatalities and the 3rd most property damage. Flooding results in the most property damage, but only the 7th most fatalities.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. We examined NOAA storm data to determine which types of storm events caused the most damage and fatalities in the US. This required us to aggregate data across 61 years and to combine the property damage figures with a second column indicating their power-of-ten exponent. Overall, tornadoes appear to be the most dangerous to human life, whereas flooding causes the most property damage.