The analysis answers two questions based on NOAA Storm Database. 1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? To address this question I determined which event types resulted in most direct and indirect casualties. The data contained many typos and input errors which were cleaned before analysis.
2.Across the United States, which types of events have the greatest economic consequences? To address this questions I’ve checked which events resulted in most property and crop damage. Units of measurement (B, M, K for billions, millions, and thousands, respectively, were taken into account).
The data was loaded from raw compressed csv.bz2 as follows. EVTYPE field was cleaned from the following errors (typos): 1. EVTYPE set to ‘THUNDERSTORM’ for all the entries with ‘THUNDERSTORM’ and ‘TSTM’ 2. set to ‘TORNADO’ for all the entries with ‘TORNADO’
I. Question 1
# read the data
data <- read.csv("StormData.csv.bz2")
# we'll use dplyr to filter and summarise:
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.2
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
# clean the data
data$EVTYPE[data$EVTYPE %like% "TORNADO"] <- "TORNADO"
data$EVTYPE[data$EVTYPE %like% "THUNDERSTORM"] <- "THUNDERSTORM"
data$EVTYPE[data$EVTYPE %like% "TSTM"] <- "THUNDERSTORM"
# group by EVTYPE and summarize by two fields (FATALITIES, INJURIES)
grouped_data <- group_by(data, EVTYPE)
# summarize, check and remove is any NAs
summ <- summarize(grouped_data, total = sum(FATALITIES, na.rm = TRUE) + sum(INJURIES, na.rm = TRUE))
## Warning: package 'bindrcpp' was built under R version 3.4.2
# output 10 event types with the highest fatalities + injuries:
arrange(summ, desc(total))[1:10,]
## # A tibble: 10 x 2
## EVTYPE total
## <fctr> <dbl>
## 1 TORNADO 97068
## 2 THUNDERSTORM 10273
## 3 EXCESSIVE HEAT 8428
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 WINTER STORM 1527
## 10 HIGH WIND 1385
# calculate the % of total for the most harmful EVTYPE:
arrange(summ, desc(total))[1,2]/sum(summ$total)
## total
## 1 0.6235378
# prepare the data for vizualization - top 5 EVTYPE and other:
viz <- arrange(summ, desc(total))[1:5,]
viz[6,] <- c('OTHER', sum(summ$total) - sum(viz$total))
viz$total <- as.numeric(viz$total)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
# build the barplot
ggplot(data=viz, aes(x=EVTYPE, y=total)) + ggtitle("Events most harmful to population") + labs(y = "Fatalities and Injuries", x = "Events") + geom_bar(stat="identity")
#converting from units to $:
data$PROPDMG[data$PROPDMGEXP %like% "B"] <- data$PROPDMG[data$PROPDMGEXP %like% "B"]*1000000000
data$PROPDMG[data$PROPDMGEXP %like% "M"] <- data$PROPDMG[data$PROPDMGEXP %like% "M"]*1000000
data$PROPDMG[data$PROPDMGEXP %like% "K"] <- data$PROPDMG[data$PROPDMGEXP %like% "K"]*1000
data$CROPDMG[data$CROPDMGEXP %like% "B"] <- data$CROPDMG[data$CROPDMGEXP %like% "B"]*1000000000
data$CROPDMG[data$CROPDMGEXP %like% "M"] <- data$CROPDMG[data$CROPDMGEXP %like% "M"]*1000000
data$CROPDMG[data$CROPDMGEXP %like% "K"] <- data$CROPDMG[data$CROPDMGEXP %like% "K"]*1000
# group by EVTYPE and summarize by two fields (PROPDMG, CROPDMG)
grouped_data <- group_by(ungroup(data), EVTYPE)
# summarize, check and remove is any NAs
summ <- summarize(grouped_data, total = sum(PROPDMG, na.rm = TRUE) + sum(CROPDMG, na.rm = TRUE))
# output 10 event types with the highest fatalities + injuries:
arrange(summ, desc(total))[1:10,]
## # A tibble: 10 x 2
## EVTYPE total
## <fctr> <dbl>
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 58999059560
## 4 STORM SURGE 43323541000
## 5 HAIL 18752904943
## 6 FLASH FLOOD 17562129167
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 THUNDERSTORM 12245295473
## 10 RIVER FLOOD 10148404500
# calculate the % of total for EVTYPE with the most damage:
arrange(summ, desc(total))[1,2]/sum(summ$total)
## total
## 1 0.31555
It was found that THUNDERSTORM has by far caused the most casualties in all the period for which the data is available (97,068 out of 155,673 or 62.4%). Flood, however, caused the most property and crop damage - USD 150.3B out of total USD 476.4B (31.6%).