Severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This report will explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to determine the type of events (Flood, Storm, Drought, etc) that cause the most economic losses and problems to public healts.
The data (NOAA database) was obtained from: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2.
if (!file.exists("./stormdata.bz2")){
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "./stormdata.bz2")}
filedate <- file.info("./stormdata.bz2")$mtime
To create this report the data was downloaded on: 2015-04-25 18:00:46
classes = c("numeric", "character", "character", "character", "NULL", "factor", "factor",
"factor", "NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL",
"NULL","NULL","NULL","NULL","NULL", "numeric", "numeric", "numeric", "factor",
"numeric", "factor", "NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL")
stormdata <- read.table("./stormdata.bz2", header=T, sep=",", stringsAsFactors = FALSE,
colClasses=classes)
summary(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTYNAME STATE EVTYPE
## JEFFERSON : 7840 TX : 83728 HAIL :288661
## WASHINGTON: 7603 KS : 53440 TSTM WIND :219940
## JACKSON : 6660 OK : 46802 THUNDERSTORM WIND: 82563
## FRANKLIN : 6256 MO : 35648 TORNADO : 60652
## LINCOLN : 5937 IA : 31069 FLASH FLOOD : 54277
## MADISON : 5632 NE : 30271 FLOOD : 25326
## (Other) :862369 (Other):621339 (Other) :170878
## FATALITIES INJURIES PROPDMG PROPDMGEXP
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00 :465934
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00 K :424665
## Median : 0.0000 Median : 0.0000 Median : 0.00 M : 11330
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06 0 : 216
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50 B : 40
## Max. :583.0000 Max. :1700.0000 Max. :5000.00 5 : 28
## (Other): 84
## CROPDMG CROPDMGEXP
## Min. : 0.000 :618413
## 1st Qu.: 0.000 K :281832
## Median : 0.000 M : 1994
## Mean : 1.527 k : 21
## 3rd Qu.: 0.000 0 : 19
## Max. :990.000 B : 9
## (Other): 9
The original dataset has 37 variables. However we only loaded the ones that are useful for our purposes.
The variables that were included in the dataset are:
Begin date (BGN_DATE): Represents the date and time at which the event started.
State (STATE): The state in which the event occured.
Event type (EVTYPE): Represents the type of event, according to the following list:
Fatalities (FATALITIES): Number of fatalities caused by the event.
Injuries (INJURIES): Number of injuries caused by the event.
Property damage amount (PROPDMG): The damage amount caused by the event in properties.
Property damage expression (PROPDMGEXP): An expression of the damage amount (H houndreds, K thousands, M Millions, B billions).
Crop damage amount (CROPDMG): The damage amount caused by the event in crops.
Crop damage expression (CROPDMGEXP): An expression of the damage amount (H houndreds, K thousands, M Millions, B billions).
For the purpose of this analysis, we performed the following transformations in the data frame: - Create one date field for the date the event. We didn’t take care of the time zone as we are only interested on the year in which the event occured.
Create a new field for the Property damage amount that stores the amount based on the Property damage amout multiplied by the appropiate factor depending on the Property damage expression.
Create a new field for the Crop damage amount that stores the amount based on the Crop damage amout multiplied by the appropiate factor depending on the Crop damage expression.
Any value in the amount expression that’s not H, K, M or B will be considered an error and will be ignored.
library(stringr)
stormdata$PROPDMGEXP <- toupper(stormdata$PROPDMGEXP)
stormdata$CROPDMGEXP <- toupper(stormdata$CROPDMGEXP)
stormdata$EVTYPE <- str_trim(toupper(stormdata$EVTYPE))
calculateAmount <- function(a,exp) {ifelse (exp=="H", a*100,
ifelse(exp=="K", a*1000,
ifelse(exp=="M",a*1000000,
ifelse(exp=="B",a*1000000000,a))))}
stormdata$PROPDMGAMT <- mapply(calculateAmount, stormdata$PROPDMG, stormdata$PROPDMGEXP)
stormdata$CROPDMGAMT <- mapply(calculateAmount, stormdata$CROPDMG, stormdata$CROPDMGEXP)
stormdata$begindate <- as.POSIXct(stormdata$BGN_DATE, tz ="GMT", format="%m/%d/%Y")
stormdata$year=format(stormdata$begindate, "%Y")
Our first impression on the event type field is that it’s manually input and any value can be entered (the types indicated in the documenation are not enforced). we decided to do some manual transformations on the data so that the events appear under the correct type.
stormdata$eventType <- stormdata$EVTYPE
stormdata$eventType[grepl("TSTM", stormdata$EVTYPE)
& !grepl("MARINE", stormdata$EVTYPE)] <- "THUNDERSTORM WIND"
stormdata$eventType[grepl("TSTM", stormdata$EVTYPE)
& grepl("MARINE", stormdata$EVTYPE)] <- "MARINE THUNDERSTORM WIND"
stormdata$eventType[grepl("HURRICAN", stormdata$EVTYPE)
| grepl("TYPHOON", stormdata$EVTYPE)] <- "HURRICANE (TYPHOON)"
stormdata$eventType[grepl("HAIL", stormdata$EVTYPE)] <- "HAIL"
stormdata$eventType[grepl("TORNADO", stormdata$EVTYPE)] <- "TORNADO"
stormdata$eventType[grepl("SURGE", stormdata$EVTYPE)
| grepl("TIDE", stormdata$EVTYPE)] <- "STORM SURGE/TIDE"
stormdata$eventType[grepl("FLASH", stormdata$EVTYPE)] <- "FLASH FLOOD"
stormdata$eventType[grepl("FLOOD", stormdata$EVTYPE) &
!grepl("FLASH", stormdata$EVTYPE) & !grepl("COASTAL", stormdata$EVTYPE)] <- "FLOOD"
stormdata$eventType[grepl("HEAT", stormdata$EVTYPE)] <- "HEAT"
According to the weather events documentation (http://www.ncdc.noaa.gov/stormevents/details.jsp), there is no record for some types of events prior to 1996. So we will filter out any event that occured prior to 1996 so that we can compare the same period of time for all the event types.
To start the analysis, we calculated which type of event caused the more damage for fatalities, injuries and economic losses (property losses + crop losses).
library(knitr)
library(dplyr)
options(knitr.table.format = 'html')
stormdataFilt <- filter(stormdata, year >= 1996)
summaryByTypeFilt <- summarise(group_by(stormdataFilt,eventType),
count=sum(!is.na(eventType)),
total_fatalities=sum(FATALITIES),
total_injuries=sum(INJURIES),
total_damage_loss=(sum(PROPDMGAMT)+sum(CROPDMGAMT))/1000000000)
kable(arrange(select(top_n(summaryByTypeFilt,5,total_fatalities), eventType, count, total_fatalities),
desc(total_fatalities)),
caption ="Top 5 Events that cause more fatalities")
| eventType | count | total_fatalities |
|---|---|---|
| HEAT | 2421 | 2036 |
| TORNADO | 23155 | 1511 |
| FLASH FLOOD | 51005 | 887 |
| LIGHTNING | 13204 | 651 |
| FLOOD | 24427 | 416 |
kable(arrange(select(top_n(summaryByTypeFilt,5,total_injuries), eventType, count, total_injuries),
desc(total_injuries)),
caption ="Top 5 Events that cause more injuries")
| eventType | count | total_injuries |
|---|---|---|
| TORNADO | 23155 | 20667 |
| HEAT | 2421 | 7683 |
| FLOOD | 24427 | 6760 |
| THUNDERSTORM WIND | 210137 | 5034 |
| LIGHTNING | 13204 | 4141 |
kable(arrange(select(top_n(summaryByTypeFilt,5,total_damage_loss), eventType, count, total_damage_loss),
desc(total_damage_loss)),
caption ="Top 5 Events that cause more economic losses (in billions of dollars)")
| eventType | count | total_damage_loss |
|---|---|---|
| FLOOD | 24427 | 149.09970 |
| HURRICANE (TYPHOON) | 271 | 87.06900 |
| STORM SURGE/TIDE | 680 | 47.84532 |
| TORNADO | 23155 | 24.90037 |
| HAIL | 209247 | 17.20109 |
We see that for fatalities and injuries the Top 5 event types are almost the same, but for economic losses the Top 5 event types are different. So we will analyze it separately.
Now let’s plot the fatalities, injuries and economic losses by type per year.
#Top fatalities plot
library(ggplot2)
library(gridExtra)
theme_set(theme_grey(base_size = 6))
summaryByTypeYearFilt <- summarise(group_by(stormdataFilt,eventType,year),
count=sum(!is.na(eventType)),
total_fatalities=sum(FATALITIES),
total_injuries=sum(INJURIES),
total_damage_loss=(sum(PROPDMGAMT)+sum(CROPDMGAMT))/1000000)
#Top fatalities plot
p1 <-ggplot(data=summaryByTypeYearFilt[
summaryByTypeYearFilt$eventType %in% top_n(summaryByTypeFilt,5,total_fatalities)$eventType,],
aes(x=year, y=total_fatalities, fill=eventType)) +
geom_bar(stat="identity") +
scale_x_discrete(breaks = c(1996, seq(from=2000,to=2011,by = 5)))+
labs(x="",y="Total fatalities")+
labs(title="Total fatalities per Event/Year")
#Top injuries plot
p2 <-ggplot(data=summaryByTypeYearFilt[
summaryByTypeYearFilt$eventType %in% top_n(summaryByTypeFilt,5,total_injuries)$eventType,],
aes(x=year, y=total_fatalities, fill=eventType)) +
geom_bar(stat="identity") +
scale_x_discrete(breaks = c(1996, seq(from=2000,to=2011,by = 5)))+
labs(x="",y="Total injuries")+
labs(title="Total injuries per Event/Year")
#Top economic losses
p3 <-ggplot(data=summaryByTypeYearFilt[
summaryByTypeYearFilt$eventType %in% top_n(summaryByTypeFilt,5,total_damage_loss)$eventType,],
aes(x=year, y=total_damage_loss, fill=eventType)) +
geom_bar(stat="identity") +
scale_x_discrete(breaks = seq(from=1950,to=2011,by = 10))+
labs(x="",y="Total (in millions of dollars)")+
labs(title="Total economic losses per Event/Year")
grid.arrange(p1, p2, p3, ncol=3)
For fatalities and injuries, we can see that FLOOD, FLASH FLOOD and LIGHTNING have about the same amount through the years. Hoewver, events like TORNADO and HEAT have some years with more fatalities or injuries, for example TORNADO events caused a high amount of injuries and fatalities on 2011.
For economic loss, HAIL and TORNADO events caused about the same amount of losses through the years. But TORNADO, HURRICANE and STORM SURGE had major events that elevated the amount of losses for some particular years.
Now we’d like to see which states are more affected by major weather events. We are going to calculate the top 10 states and events with the most injuries, fatalities and economic losses.
summaryByTypeState <- summarise(group_by(stormdataFilt,eventType_State=paste(STATE, " - " , eventType)),
count=sum(!is.na(eventType)),
total_fatalities=sum(FATALITIES),
total_injuries=sum(INJURIES),
total_damage_loss=(sum(PROPDMGAMT)+sum(CROPDMGAMT))/1000000)
kable(arrange(select(top_n(summaryByTypeState,10,total_fatalities), eventType_State, count, total_fatalities),
desc(total_fatalities)),
caption ="Top 10 States affected by Events with more fatalities")
| eventType_State | count | total_fatalities |
|---|---|---|
| IL - HEAT | 167 | 357 |
| AL - TORNADO | 1079 | 335 |
| PA - HEAT | 96 | 308 |
| TX - HEAT | 151 | 285 |
| MO - TORNADO | 944 | 233 |
| MO - HEAT | 186 | 195 |
| TN - TORNADO | 674 | 184 |
| TX - FLASH FLOOD | 7739 | 151 |
| FL - RIP CURRENT | 195 | 149 |
| CA - HEAT | 147 | 118 |
kable(arrange(select(top_n(summaryByTypeState,10,total_injuries), eventType_State, count, total_injuries),
desc(total_injuries)),
caption ="Top 10 States affected by Events with more injuries")
| eventType_State | count | total_injuries |
|---|---|---|
| TX - FLOOD | 881 | 6338 |
| MO - HEAT | 186 | 3539 |
| AL - TORNADO | 1079 | 3231 |
| TN - TORNADO | 674 | 2071 |
| MO - TORNADO | 944 | 2059 |
| OK - TORNADO | 1172 | 1672 |
| AR - TORNADO | 985 | 1410 |
| GA - TORNADO | 621 | 1191 |
| MS - TORNADO | 919 | 898 |
| FL - HURRICANE (TYPHOON) | 60 | 811 |
kable(arrange(select(top_n(summaryByTypeState,10,total_damage_loss), eventType_State, count, total_damage_loss),
desc(total_damage_loss)),
caption ="Top 10 States affected by Events with more economic losses (in millions of dollars)")
| eventType_State | count | total_damage_loss |
|---|---|---|
| CA - FLOOD | 400 | 117041.640 |
| LA - STORM SURGE/TIDE | 48 | 31827.987 |
| FL - HURRICANE (TYPHOON) | 60 | 29765.125 |
| LA - HURRICANE (TYPHOON) | 27 | 22506.022 |
| MS - HURRICANE (TYPHOON) | 16 | 15692.906 |
| MS - STORM SURGE/TIDE | 13 | 11266.790 |
| NC - HURRICANE (TYPHOON) | 40 | 6974.851 |
| TX - DROUGHT | 432 | 6722.465 |
| TX - TROPICAL STORM | 57 | 5495.333 |
| AL - TORNADO | 1079 | 5032.391 |
From the tables above, it’s possible to determine which states require more resources to prevent/proceed in case of certain types of events. We can see that heat in Ilinois and tornados in Alabama, have caused the most fatalities. Also, floods in Texas and heat in Missouri have caused the most injuries. And finally California floods have caused the greatest economic losses during this period.