Using the National Weather Service Storm Data, we ask which types of storm event cause the most public health and economic damage across the USA. Based on our findings, tornadoes are the main cause of deaths and injuries, floods are the main cause of property damage, and droughts are the main cause of crop damage. We also break those consequences down by US state for the period between 1950 and 2011.
First, we are going to load the dplyr and ggplot2 libraries, to help manage the dataframes and to plot our results.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Now we will download the file to our previously created data folder and load it into the StormData variable.
if(!file.exists("./data/stormdata.csv")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile="./data/stormdata.csv")
}
StormData <- read.csv("./data/stormdata.csv")
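The step above assumes the data folder already exists. A minimal sketch of a more defensive version follows; the helper name fetchStormData is hypothetical and not part of the original analysis. Note that the downloaded file is bzip2-compressed even though it is saved with a .csv name; read.csv should still read it, since R's file connections detect compressed input when opened for reading.

```r
# Hypothetical helper (not part of the original analysis): download the
# file only if it is missing, creating the target folder when needed.
fetchStormData <- function(url, destfile) {
  destDir <- dirname(destfile)
  if (!dir.exists(destDir)) {
    dir.create(destDir, recursive = TRUE)
  }
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile)
  }
  invisible(destfile)
}

# Usage with the URL and path from the text (not run here):
# fetchStormData(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "./data/stormdata.csv"
# )
```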
To determine which events are most harmful, we need to look at the variables FATALITIES, INJURIES, PROPDMG and CROPDMG. Examining the National Weather Service Storm Data Documentation, however, we notice that the variables PROPDMGEXP and CROPDMGEXP act as multipliers for their respective damage values. So we look at how often each value of those variables occurs:
table(StormData$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5      6
## 465934      1      8      5    216     25     13      4      4     28      4
##      7      8      B      h      H      K      m      M
##      5      1     40      1      6 424665      7  11330
table(StormData$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
We need to understand how these codes work. Thanks to Flying Disc, we came up with the function convertMultiplier, which determines the multiplier each code stands for.
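# Map a single PROPDMGEXP/CROPDMGEXP code to its numeric multiplier.
convertMultiplier <- function(x){
  if (x == "B"){
    return(1000000000)    # billions
  } else if (x=="M" | x=="m"){
    return(1000000)       # millions
  } else if (x=="K" | x=="k"){
    return(1000)          # thousands
  } else if (x=="H" | x=="h"){
    return(100)           # hundreds
  } else if (x %in% as.character(c(0:8))){
    return(10)            # digit codes 0 to 8
  } else {
    return(1)             # blank, "-", "?" and "+" leave the value unchanged
  }
}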
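As an illustration, the same mapping can be written as a named lookup vector, which works on a whole column at once instead of one code at a time. This is only a sketch: the names multipliers and convertMultiplierVec are hypothetical, and the toy data frame is made up to show the arithmetic.

```r
# Lookup table with the same multipliers as convertMultiplier.
multipliers <- c(B = 1e9, M = 1e6, m = 1e6, K = 1e3, k = 1e3, H = 100, h = 100)
multipliers[as.character(0:8)] <- 10

# Hypothetical vectorised variant: unknown codes (blank, "-", "?", "+")
# are not found in the table, come back as NA, and are replaced by 1.
convertMultiplierVec <- function(x) {
  out <- unname(multipliers[as.character(x)])
  out[is.na(out)] <- 1
  out
}

# Toy rows (made up) to show how a damage value is scaled by its code.
toy <- data.frame(PROPDMG = c(25, 2.5, 10, 7),
                  PROPDMGEXP = c("K", "M", "", "5"))
toy$DamageProperty <- toy$PROPDMG * convertMultiplierVec(toy$PROPDMGEXP)
```

Here 25 with code "K" becomes 25000 and 2.5 with code "M" becomes 2500000, while the blank code leaves 10 unchanged.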
Then, using that function, we will create the new variables: DamageProperty, DamageCrop and TotalDamage.
StormData$DamageProperty <- StormData$PROPDMG*as.numeric(lapply(StormData$PROPDMGEXP,FUN=convertMultiplier))
StormData$DamageCrop <- StormData$CROPDMG *as.numeric(lapply(StormData$CROPDMGEXP,FUN=convertMultiplier))
StormData$TotalDamage <- StormData$DamageCrop+StormData$DamageProperty
Now, we will create a new dataframe called StormDataEventsSummary to easily search for the events we need. We will group the StormData dataframe by event:
StormDataEventsSummary <- StormData %>% group_by(EVTYPE) %>%
summarise(Fatalities=sum(FATALITIES),Injuries=sum(INJURIES),
DamageProperty=sum(DamageProperty),DamageCrop=sum(DamageCrop),
TotalDamage=sum(TotalDamage),events=length(EVTYPE),.groups="drop")
head(StormDataEventsSummary)
## # A tibble: 6 x 7
## EVTYPE Fatalities Injuries DamageProperty DamageCrop TotalDamage events
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 " HIGH SUR~ 0 0 200000 0 200000 1
## 2 " COASTAL FL~ 0 0 0 0 0 1
## 3 " FLASH FLOO~ 0 0 50000 0 50000 1
## 4 " LIGHTNING" 0 0 0 0 0 1
## 5 " TSTM WIND" 0 0 8100000 0 8100000 4
## 6 " TSTM WIND ~ 0 0 8000 0 8000 1
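The first rows above show event names with leading spaces, which means that labels such as " TSTM WIND" and "TSTM WIND" are counted as different events. The analysis in this report keeps the labels as they are. As an optional cleaning step, one could normalise the names before grouping; the sketch below uses a made-up vector (the mixed-case entry is hypothetical) rather than the real column.

```r
# Made-up event labels imitating the leading-space variants seen above.
events <- c(" TSTM WIND", "TSTM WIND", " LIGHTNING", "LIGHTNING", "Lightning")

# Strip surrounding whitespace and unify case so variants collapse together.
cleanEvents <- toupper(trimws(events))

# Count how many raw labels fall under each cleaned label.
eventCounts <- table(cleanEvents)
```

Applying the same two calls to StormData$EVTYPE before the group_by step would merge such variants, at the cost of totals that differ from the ones reported here.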
With that data frame, we can look up the five events that caused the most fatalities:
top5Fatalities <- StormDataEventsSummary %>% arrange(desc(Fatalities)) %>% select(EVTYPE,Fatalities) %>% head(5)
head(top5Fatalities)
## # A tibble: 5 x 2
## EVTYPE Fatalities
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
The states that suffer the most from those events are:
StormData %>% group_by(STATE) %>% filter(EVTYPE %in% top5Fatalities$EVTYPE) %>%
summarize(Fatalities=sum(FATALITIES),.groups='drop') %>%
arrange(desc(Fatalities)) %>% head(5)
## # A tibble: 5 x 2
## STATE Fatalities
## <chr> <dbl>
## 1 IL 1215
## 2 TX 1067
## 3 MO 710
## 4 AL 684
## 5 PA 533
These are Illinois, Texas, Missouri, Alabama and Pennsylvania.
In the same way, the five events that caused the most injuries are:
top5Injuries <- StormDataEventsSummary %>% arrange(desc(Injuries)) %>% select(EVTYPE,Injuries) %>% head(5)
head(top5Injuries)
## # A tibble: 5 x 2
## EVTYPE Injuries
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
The states that suffer the most from those events are:
StormData %>% group_by(STATE) %>% filter(EVTYPE %in% top5Injuries$EVTYPE) %>%
summarize(Injuries=sum(INJURIES),.groups='drop') %>%
arrange(desc(Injuries)) %>% head(5)
## # A tibble: 5 x 2
## STATE Injuries
## <chr> <dbl>
## 1 TX 15290
## 2 AL 8442
## 3 MO 8174
## 4 MS 6504
## 5 AR 5390
These are Texas, Alabama, Missouri, Mississippi and Arkansas.
Now, for the events that cause the most property damage:
top5DamageProperty <- StormDataEventsSummary %>% arrange(desc(DamageProperty)) %>% select(EVTYPE,DamageProperty) %>% head(5)
head(top5DamageProperty)
## # A tibble: 5 x 2
## EVTYPE DamageProperty
## <chr> <dbl>
## 1 FLOOD 144657709807
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56937162900
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 16140815218
The states that suffer the most from those events are:
StormData %>% group_by(STATE) %>% filter(EVTYPE %in% top5DamageProperty$EVTYPE ) %>%
summarize(DamageProperty=sum(DamageProperty),.groups='drop') %>%
arrange(desc(DamageProperty)) %>% head(5)
## # A tibble: 5 x 2
## STATE DamageProperty
## <chr> <dbl>
## 1 CA 117127356965
## 2 LA 54735277990
## 3 FL 31036822693
## 4 MS 28665469630
## 5 AL 11357170060
These are California, Louisiana, Florida, Mississippi and Alabama.
And finally, the events that cause the most crop damage:
top5DamageCrop <- StormDataEventsSummary %>% arrange(desc(DamageCrop)) %>% select(EVTYPE,DamageCrop) %>% head(5)
head(top5DamageCrop)
## # A tibble: 5 x 2
## EVTYPE DamageCrop
## <chr> <dbl>
## 1 DROUGHT 13972566000
## 2 FLOOD 5661968450
## 3 RIVER FLOOD 5029459000
## 4 ICE STORM 5022113500
## 5 HAIL 3025954653
The states that suffer the most from those events are:
StormData %>% group_by(STATE) %>% filter(EVTYPE %in% top5DamageCrop$EVTYPE ) %>%
summarize(DamageCrop=sum(DamageCrop),.groups='drop') %>%
arrange(desc(DamageCrop)) %>% head(5)
## # A tibble: 5 x 2
## STATE DamageCrop
## <chr> <dbl>
## 1 TX 6915808600
## 2 IL 5330037600
## 3 MS 5016506000
## 4 IA 3893394450
## 5 NE 1490910650
These are Texas, Illinois, Mississippi, Iowa and Nebraska.
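The four state rankings above all follow the same filter, group, sum and sort pattern. As a sketch, that pattern could be wrapped in one helper; the function name topStatesFor, its arguments and the toy data are hypothetical and only illustrate the idea.

```r
library(dplyr)

# Hypothetical helper: rank states by the sum of one column, restricted
# to a given set of event types.
topStatesFor <- function(data, eventTypes, column, n = 5) {
  data %>%
    filter(EVTYPE %in% eventTypes) %>%
    group_by(STATE) %>%
    summarise(Total = sum(.data[[column]]), .groups = "drop") %>%
    arrange(desc(Total)) %>%
    head(n)
}

# Made-up rows to show the call; the HAIL row is filtered out.
toy <- data.frame(STATE = c("TX", "TX", "AL", "MO"),
                  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
                  FATALITIES = c(3, 1, 5, 2))
toyRanking <- topStatesFor(toy, c("TORNADO", "FLOOD"), "FATALITIES", n = 2)
```

With the real data, a call such as topStatesFor(StormData, top5Fatalities$EVTYPE, "FATALITIES") would be expected to reproduce the first ranking, with the summed column named Total instead of Fatalities.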
Tornadoes are the main cause of both fatalities (5,633 deaths) and injuries (91,346 injuries) in Storm Data. We now look at both variables at the state level. First, we need to tidy our data frame so that the type of harm (Fatality, Injury) becomes a single categorical variable for each state.
statesTornado <- StormData %>% group_by(STATE) %>% filter(EVTYPE=="TORNADO") %>% summarize(Fatalities=sum(FATALITIES),Injuries=sum(INJURIES),.groups='drop')
Fatalities <- statesTornado[,c(1,2)]
names(Fatalities) <- c("STATE","Quantity")
Injuries <- statesTornado[,c(1,3)]
names(Injuries) <- c("STATE","Quantity")
healthDamage <- rbind(Fatalities,Injuries)
healthDamage$type <- rep(c("Fatality","Injury"),each=nrow(statesTornado)) # one label per state row, no hard-coded count
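The same long format can also be produced in a single step with tidyr, which the original analysis does not load; it is used here as an assumption. The sketch runs on a made-up two-state table with the same columns as statesTornado.

```r
library(tidyr)

# Made-up stand-in for statesTornado: one row per state.
toyTornado <- data.frame(STATE = c("AL", "TX"),
                         Fatalities = c(5, 3),
                         Injuries = c(40, 25))

# Stack the two count columns into (type, Quantity) pairs per state.
healthLong <- pivot_longer(toyTornado,
                           cols = c(Fatalities, Injuries),
                           names_to = "type",
                           values_to = "Quantity")
```

One difference to keep in mind: the type column then holds the original column names ("Fatalities", "Injuries") rather than the singular labels used in the plot legend.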
Now we plot both fatalities and injuries by state:
g <- ggplot(healthDamage,aes(STATE,Quantity,fill=type))
g+geom_bar(stat="identity", position=position_dodge())+
scale_fill_manual(values=c('blue','orange'))+
theme(axis.text.x = element_text(angle=90))+
labs(x="State",y="Quantity",title="Injuries and Deaths caused by tornadoes in the US by State")
Deaths and injuries caused by tornadoes by state
Floods are the main cause of property damage in the US. Filtering our dataset for flood as the type of event, we have:
statesFlood<- StormData %>% group_by(STATE) %>% filter(EVTYPE=="FLOOD") %>%
summarize(DamageProperty=sum(DamageProperty),.groups='drop')
And plotting a bar plot of property damage (in billions) by state:
g <- ggplot(statesFlood,aes(STATE,DamageProperty/1000000000))
g+geom_bar(stat="identity", position=position_dodge())+
theme(axis.text.x = element_text(angle=90))+
labs(x="State",y="Damage (in Billions of Dollars)",title="Property Damage caused by flood in the US by State")
Property Damage caused by floods by state
Finally, droughts are the main cause of crop damage in the US. Filtering our dataset for drought as the type of event, we have:
statesDrought<- StormData %>% group_by(STATE) %>% filter(EVTYPE=="DROUGHT") %>%
summarize(DamageCrop=sum(DamageCrop),.groups='drop')
And plotting our result (in millions of dollars) by state:
g <- ggplot(statesDrought,aes(STATE,DamageCrop/1000000))
g+geom_bar(stat="identity", position=position_dodge())+
theme(axis.text.x = element_text(angle=90))+
labs(x="State",y="Damage to crops (in millions)",title="Crop Damage caused by droughts in the US by State")
Crop Damage caused by droughts by state
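The bar plots above list the states alphabetically. If the goal is to see at a glance which states are hit hardest, the bars can be sorted by value instead. The following is a sketch on made-up numbers (toyDrought is hypothetical), using reorder to sort the x axis and geom_col as a shorthand for geom_bar(stat="identity").

```r
library(ggplot2)

# Made-up crop damage totals, in dollars, for three states.
toyDrought <- data.frame(STATE = c("IA", "NE", "TX"),
                         DamageCrop = c(1.5e9, 0.5e9, 6.0e9))

# reorder() sorts states by decreasing damage before plotting.
toyDrought$STATE <- reorder(toyDrought$STATE, -toyDrought$DamageCrop)

p <- ggplot(toyDrought, aes(STATE, DamageCrop / 1000000)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "State", y = "Damage to crops (in millions)",
       title = "Crop damage by state, sorted (toy data)")
```

Printing p draws the plot; the same reorder call could be applied to statesDrought or statesFlood.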