This report shows the results of an analysis of the U. S. National Oceanic and Atmospheric Administration’s (NOAA) database. The aim of the research is to answer the following questions:
This report is divided in two sections. The first section, Data processing describes all steps taken from data loading to having a dataset ready for analysis. The second section, Results, describes the analysis undertaken to achieve the aims of the report.
Data processing starts with loading into R the dataset:
setwd("~/Dropbox/CourseraReproducible")
storm <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
This dataset has up to 902297 observations. To obtain a significant dataset, two filterings have been made:
To remove from the sample data prior to 1996, I use the lubridate package to get the year of the observation from the BGN_DATE variable, and then proceed to make the first filtering:
library(lubridate)
storm$BGN_DATE <- mdy_hms(storm$BGN_DATE)
storm$years <- year(storm$BGN_DATE)
storm <- storm[which(storm$years > 1995), ]
Afterwards, I proceed to remove the events with no casualties and no economic consequences:
filter <- (storm$INJURIES !=0) | (storm$FATALITIES!=0) | (storm$PROPDMG!=0) | (storm$CROPDMG!=0)
storm <- storm[which(filter), ]
After the two filterings, we obtain a dataset with 201318 observations.
An examination of the EVTYPE variable shows that there are a large number of different types of events recorded:
length(table(storm$EVTYPE))
## [1] 985
This is a large number of different events, considering that the codebook registers only 48 types of events. So several transformsations and groupings have been undertaken to reduce significantly the number of events:
storm$EVTYPE <- toupper(storm$EVTYPE)
storm$EVTYPE[which(grepl("ASTRONOMICAL HIGH TIDE", storm$EVTYPE))] <- "HIGH TIDE"
storm$EVTYPE[which(grepl("TIDAL FLOODING", storm$EVTYPE))] <- "HIGH TIDE"
storm$EVTYPE[which(grepl("GROUND BLIZZARD", storm$EVTYPE))] <- "BLIZZARD"
storm$EVTYPE[which(grepl("^BLIZZARD", storm$EVTYPE))] <- "BLIZZARD"
storm$EVTYPE[which(grepl("^COASTAL", storm$EVTYPE))] <- "COASTAL FLOOD"
storm$EVTYPE[which(grepl(" EROSION/CSTL FLOOD", storm$EVTYPE))] <- "COASTAL FLOOD"
storm$EVTYPE[which(grepl("EROSION/CSTL FLOOD", storm$EVTYPE))] <- "COASTAL FLOOD"
storm$EVTYPE[which(grepl("^COLD AIR", storm$EVTYPE))] <- "TORNADO"
storm$EVTYPE[which(grepl("^COLD", storm$EVTYPE))] <- "COLD/WIND CHILL"
storm$EVTYPE[which(grepl("^MUD", storm$EVTYPE))] <- "DEBRIS FLOW"
storm$EVTYPE[which(grepl("SLIDE", storm$EVTYPE))] <- "DEBRIS FLOW"
storm$EVTYPE[which(grepl("LANDSLUMP", storm$EVTYPE))] <- "DEBRIS FLOW"
storm$EVTYPE[which(grepl("^DROUGHT", storm$EVTYPE))] <- "DROUGHT"
storm$EVTYPE[which(grepl("^DUST DEVIL", storm$EVTYPE))] <- "DUST DEVIL"
storm$EVTYPE[which(grepl("^DUST STORM", storm$EVTYPE))] <- "DUST STORM"
storm$EVTYPE[which(grepl("BLOWING DUST", storm$EVTYPE))] <- "DUST STORM"
storm$EVTYPE[which(grepl("RECORD/EXCESSIVE HEAT", storm$EVTYPE))] <- c("EXCESSIVE HEAT")
storm$EVTYPE[which(grepl("EXTREME HEAT", storm$EVTYPE))] <- c("EXCESSIVE HEAT")
storm$EVTYPE[which(grepl("^EXTREME", storm$EVTYPE))] <- c("EXTREME COLD/WINDCHILL")
storm$EVTYPE[which(grepl("^FLASH", storm$EVTYPE))] <- "FLASH FLOOD"
storm$EVTYPE[which(grepl("^ FLASH", storm$EVTYPE))] <- "FLASH FLOOD"
storm$EVTYPE[which(grepl("^FLOOD", storm$EVTYPE))] <- "FLOOD"
storm$EVTYPE[which(grepl("BREAKUP FLOODING", storm$EVTYPE))] <- "FLOOD"
storm$EVTYPE[which(grepl("^ICE JAM FLOOD", storm$EVTYPE))] <- "FLOOD"
storm$EVTYPE[which(grepl("^RIVER", storm$EVTYPE))] <- "FLOOD"
storm$EVTYPE[which(grepl("FROST", storm$EVTYPE))] <- "FROST/FREEZE"
storm$EVTYPE[which(grepl("FREEZING FOG", storm$EVTYPE))] <- "aux"
storm$EVTYPE[which(grepl("FREEZ", storm$EVTYPE))] <- "FROST/FREEZE"
storm$EVTYPE[which(grepl("aux", storm$EVTYPE))] <- "FREEZING FOG"
storm$EVTYPE[which(grepl("SMALL HAIL", storm$EVTYPE))] <- "HAIL"
storm$EVTYPE[which(grepl("HEAT WAVE", storm$EVTYPE))] <- "EXCESSIVE HEAT"
storm$EVTYPE[which(grepl("RECORD HEAT", storm$EVTYPE))] <- "EXCESSIVE HEAT"
storm$EVTYPE[which(grepl("^HEAVY RAIN", storm$EVTYPE))] <- "HEAVY RAIN"
storm$EVTYPE[which(grepl("^HEAVY SNOW", storm$EVTYPE))] <- "HEAVY SNOW"
storm$EVTYPE[which(grepl("^SNOW", storm$EVTYPE))] <- "HEAVY SNOW"
storm$EVTYPE[which(grepl("EXCESSIVE SNOW", storm$EVTYPE))] <- "HEAVY SNOW"
storm$EVTYPE[which(grepl("SURF", storm$EVTYPE))] <- "HIGH SURF"
storm$EVTYPE[which(grepl("^LIGHTNING", storm$EVTYPE))] <- "LIGHTNING"
storm$EVTYPE[which(grepl("GUSTY", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("GRADIENT", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("^HIGH WIND", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("NON", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("NON-SEVERE WIND DAMAGE", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("^WIND", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("^STRONG WIND", storm$EVTYPE))] <- "STRONG WIND"
storm$EVTYPE[which(grepl("WHIRLWIND", storm$EVTYPE))] <- "HURRICANE"
storm$EVTYPE[which(grepl("HURRICANE/TYPHOON", storm$EVTYPE))] <- "HURRICANE"
storm$EVTYPE[which(grepl("^LAKE", storm$EVTYPE))] <- "LAKE-EFFECT SNOW"
storm$EVTYPE[which(grepl("RIP", storm$EVTYPE))] <- "RIP CURRENT"
storm$EVTYPE[which(grepl("STORM SURGE", storm$EVTYPE))] <- "STORM SURGE/TIDE"
storm$EVTYPE <- gsub("TSTM", "THUNDERSTORM", storm$EVTYPE)
storm$EVTYPE[which(grepl("^THUNDERSTORM", storm$EVTYPE))] <- "THUNDERSTORM WIND"
storm$EVTYPE[which(grepl("^ THUNDERSTORM", storm$EVTYPE))] <- "THUNDERSTORM WIND"
storm$EVTYPE[which(grepl("WINTER STORM", storm$EVTYPE))] <- "aux2"
storm$EVTYPE[which(grepl("^WINTE", storm$EVTYPE))] <- "WINTER WEATHER"
storm$EVTYPE[which(grepl("aux2", storm$EVTYPE))] <- "WINTER STORM"
storm$EVTYPE[which(grepl("DAM BREAK", storm$EVTYPE))] <- "FLOOD"
storm$EVTYPE[which(grepl("DOWNBURST", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("FALLING SNOW", storm$EVTYPE))] <- "AVALANCHE"
storm$EVTYPE[which(grepl("HIGH WATER", storm$EVTYPE))] <- "HIGH TIDE"
storm$EVTYPE[which(grepl("HIGH SEAS", storm$EVTYPE))] <- "HIGH TIDE"
storm$EVTYPE[which(grepl("COLD$", storm$EVTYPE))] <- "COLD/WIND CHILL"
storm$EVTYPE[which(grepl("FLD$", storm$EVTYPE))] <- "FLOOD"
storm$EVTYPE[which(grepl("RAIN$", storm$EVTYPE))] <- "RAIN"
storm$EVTYPE[which(grepl("THYPHOON", storm$EVTYPE))] <- "HURRICANE"
storm$EVTYPE[which(grepl("ICE ROADS", storm$EVTYPE))] <- "ICE ON ROAD"
storm$EVTYPE[which(grepl("ICY ROADS", storm$EVTYPE))] <- "ICE ON ROAD"
After the filtering, the number of EVTYPE different events has been reduced significantly:
length(table(storm$EVTYPE))
## [1] 80
The variables of the dataset including data about damage are:
FATALITIES and INJURIESvariables.PROPDMG and PROPDMGEXP. The second variable represents the order of magnitude of economic damage.CROPDMG and CROPDMGEXP, with a similar structure to the properties damage.Prior to data analysis, the order of magnitude of a particular event (the Napa flood) has been corrected. This correction has been made taking into account information obtained from the discussion forums:
storm$PROPDMGEXP <- toupper(storm$PROPDMGEXP)
storm$CROPDMGEXP <- toupper(storm$CROPDMGEXP)
storm$EVTYPE[which(storm$PROPDMGEXP=="B" & storm$PROPDMG==115)]
## [1] "FLOOD"
storm$PROPDMGEXP[which(storm$PROPDMGEXP=="B" & storm$PROPDMG==115)] <- "M"
The economic damage of each event is stored in the propdmg and cropdmg variables, obtained as follows:
storm$cropexp <- 1
storm$propexp <- 1
storm$cropexp[which(storm$CROPDMGEXP=="K")] <- 1000
storm$cropexp[which(storm$CROPDMGEXP=="M")] <- 1000000
storm$cropexp[which(storm$CROPDMGEXP=="B")] <- 1000000000
storm$propexp[which(storm$PROPDMGEXP=="K")] <- 1000
storm$propexp[which(storm$PROPDMGEXP=="M")] <- 1000000
storm$propexp[which(storm$PROPDMGEXP=="B")] <- 1000000000
storm$cropdmg <- storm$CROPDMG*storm$cropexp
storm$propdmg <- storm$PROPDMG*storm$propexp
To assess the impact of a given event, two effects have to be taken into account:
To analyze data, the dplyr package has been used, which is adequate for large datasets. Then, relevant effects have been computed for every kind of damage.
library(dplyr)
storm <- tbl_df(storm)
storm.evtype <- group_by(storm, EVTYPE)
storm.prop <- summarize(storm.evtype, length(STATE), sum(propdmg), max(propdmg))
colnames(storm.prop) <- c("EVTYPE", "Frequency", "Total.PROPDMG", "Max.PROPDMG")
storm.prop <- mutate(storm.prop, freq.PROP = Total.PROPDMG/ sum(Total.PROPDMG)*100)
storm.crop <- summarize(storm.evtype, length(STATE), sum(cropdmg), max(cropdmg))
colnames(storm.crop) <- c("EVTYPE", "Frequency", "Total.CROPDMG", "Max.CROPDMG")
storm.crop <- mutate(storm.crop, freq.crop = Total.CROPDMG/ sum(Total.CROPDMG)*100)
storm.fatalities <- summarize(storm.evtype, length(STATE), sum(FATALITIES), max(FATALITIES))
colnames(storm.fatalities) <- c("EVTYPE", "Frequency", "Total.FATAL", "Max.FATAL")
storm.fatalities <- mutate(storm.fatalities, freq.fatal = Total.FATAL/ sum(Total.FATAL)*100)
storm.injuries <- summarize(storm.evtype, length(STATE), sum(INJURIES), max(INJURIES))
colnames(storm.injuries) <- c("EVTYPE", "Frequency", "Total.INJURIES", "Max.INJURIES")
storm.injuries <- mutate(storm.injuries, freq.injuries = Total.INJURIES/ sum(Total.INJURIES)*100)
As for propdmg the most harmful events are:
arrange(storm.prop, desc (Total.PROPDMG))
## Source: local data frame [80 x 5]
##
## EVTYPE Frequency Total.PROPDMG Max.PROPDMG freq.PROP
## 1 HURRICANE 201 81118671010 1.693e+10 32.204950
## 2 STORM SURGE/TIDE 216 47834724000 3.130e+10 18.990880
## 3 FLOOD 10305 29245593200 3.000e+09 11.610803
## 4 TORNADO 12366 24616945710 2.800e+09 9.773182
## 5 FLASH FLOOD 19013 15222258910 1.000e+09 6.043394
## 6 HAIL 22690 14595213420 1.800e+09 5.794450
## 7 THUNDERSTORM WIND 105372 7913555880 7.500e+08 3.141763
## 8 TROPICAL STORM 410 7642475550 5.150e+09 3.034142
## 9 HIGH WIND 5531 5253421860 1.300e+09 2.085663
## 10 WILDFIRE 847 4758667000 1.040e+09 1.889240
## .. ... ... ... ... ...
arrange(storm.prop, desc (Max.PROPDMG))
## Source: local data frame [80 x 5]
##
## EVTYPE Frequency Total.PROPDMG Max.PROPDMG freq.PROP
## 1 STORM SURGE/TIDE 216 47834724000 3.130e+10 18.990880
## 2 HURRICANE 201 81118671010 1.693e+10 32.204950
## 3 TROPICAL STORM 410 7642475550 5.150e+09 3.034142
## 4 FLOOD 10305 29245593200 3.000e+09 11.610803
## 5 TORNADO 12366 24616945710 2.800e+09 9.773182
## 6 HAIL 22690 14595213420 1.800e+09 5.794450
## 7 WILD/FOREST FIRE 381 3001782500 1.500e+09 1.191739
## 8 HIGH WIND 5531 5253421860 1.300e+09 2.085663
## 9 WILDFIRE 847 4758667000 1.040e+09 1.889240
## 10 FLASH FLOOD 19013 15222258910 1.000e+09 6.043394
## .. ... ... ... ... ...
As for cropdmg the most harmful events are:
arrange(storm.crop, desc (Total.CROPDMG))
## Source: local data frame [80 x 5]
##
## EVTYPE Frequency Total.CROPDMG Max.CROPDMG freq.crop
## 1 DROUGHT 258 13367566000 1.00e+09 38.464795
## 2 HURRICANE 201 5349282800 1.51e+09 15.392411
## 3 FLOOD 10305 5013161500 5.00e+08 14.425231
## 4 HAIL 22690 2496822450 7.00e+07 7.184536
## 5 FROST/FREEZE 177 1368761000 2.86e+08 3.938571
## 6 FLASH FLOOD 19013 1334901700 2.00e+08 3.841142
## 7 EXTREME COLD/WINDCHILL 296 1326023000 5.96e+08 3.815594
## 8 THUNDERSTORM WIND 105372 1016942600 4.10e+07 2.926224
## 9 RAIN 1053 739919800 2.00e+08 2.129098
## 10 TROPICAL STORM 410 677711000 2.00e+08 1.950094
## .. ... ... ... ... ...
arrange(storm.crop, desc (Max.CROPDMG))
## Source: local data frame [80 x 5]
##
## EVTYPE Frequency Total.CROPDMG Max.CROPDMG freq.crop
## 1 HURRICANE 201 5349282800 1.510e+09 15.392411
## 2 DROUGHT 258 13367566000 1.000e+09 38.464795
## 3 EXTREME COLD/WINDCHILL 296 1326023000 5.960e+08 3.815594
## 4 FLOOD 10305 5013161500 5.000e+08 14.425231
## 5 EXCESSIVE HEAT 687 492402000 4.924e+08 1.416873
## 6 FROST/FREEZE 177 1368761000 2.860e+08 3.938571
## 7 FLASH FLOOD 19013 1334901700 2.000e+08 3.841142
## 8 RAIN 1053 739919800 2.000e+08 2.129098
## 9 TROPICAL STORM 410 677711000 2.000e+08 1.950094
## 10 HIGH WIND 5531 634071300 1.750e+08 1.824522
## .. ... ... ... ... ...
And the analysis for fatalities and injuries yields:
arrange(storm.fatalities, desc (Total.FATAL))
## Source: local data frame [80 x 5]
##
## EVTYPE Frequency Total.FATAL Max.FATAL freq.fatal
## 1 EXCESSIVE HEAT 687 1799 99 20.602382
## 2 TORNADO 12366 1511 158 17.304169
## 3 FLASH FLOOD 19013 887 20 10.158039
## 4 LIGHTNING 11152 651 4 7.455337
## 5 RIP CURRENT 603 542 6 6.207055
## 6 FLOOD 10305 444 11 5.084746
## 7 THUNDERSTORM WIND 105372 379 7 4.340357
## 8 HIGH WIND 5531 258 8 2.954650
## 9 EXTREME COLD/WINDCHILL 296 257 10 2.943197
## 10 HEAT 164 237 16 2.714155
## .. ... ... ... ... ...
arrange(storm.fatalities, desc (Max.FATAL))
## Source: local data frame [80 x 5]
##
## EVTYPE Frequency Total.FATAL Max.FATAL freq.fatal
## 1 TORNADO 12366 1511 158 17.3041686
## 2 EXCESSIVE HEAT 687 1799 99 20.6023820
## 3 TSUNAMI 14 33 32 0.3779203
## 4 TROPICAL STORM 410 57 22 0.6527714
## 5 FLASH FLOOD 19013 887 20 10.1580394
## 6 RAIN 1053 94 19 1.0765002
## 7 HEAT 164 237 16 2.7141548
## 8 HURRICANE 201 126 15 1.4429684
## 9 COLD/WIND CHILL 119 132 14 1.5116812
## 10 DEBRIS FLOW 201 43 14 0.4924416
## .. ... ... ... ... ...
arrange(storm.injuries, desc (Total.INJURIES))
## Source: local data frame [80 x 5]
##
## EVTYPE Frequency Total.INJURIES Max.INJURIES freq.injuries
## 1 TORNADO 12366 20667 1150 35.648124
## 2 FLOOD 10305 6838 800 11.794739
## 3 EXCESSIVE HEAT 687 6461 519 11.144459
## 4 THUNDERSTORM WIND 105372 5129 70 8.846917
## 5 LIGHTNING 11152 4141 51 7.142734
## 6 FLASH FLOOD 19013 1674 150 2.887451
## 7 HURRICANE 201 1321 780 2.278568
## 8 WINTER STORM 1460 1292 165 2.228547
## 9 HEAT 164 1222 223 2.107805
## 10 HIGH WIND 5531 1187 70 2.047434
## .. ... ... ... ... ...
arrange(storm.injuries, desc (Max.INJURIES))
## Source: local data frame [80 x 5]
##
## EVTYPE Frequency Total.INJURIES Max.INJURIES freq.injuries
## 1 TORNADO 12366 20667 1150 35.6481242
## 2 FLOOD 10305 6838 800 11.7947391
## 3 HURRICANE 201 1321 780 2.2785683
## 4 EXCESSIVE HEAT 687 6461 519 11.1444588
## 5 HEAT 164 1222 223 2.1078051
## 6 TROPICAL STORM 410 338 200 0.5830099
## 7 WINTER STORM 1460 1292 165 2.2285468
## 8 BLIZZARD 228 385 150 0.6640793
## 9 FLASH FLOOD 19013 1674 150 2.8874515
## 10 WINTER WEATHER 546 483 137 0.8331177
## .. ... ... ... ... ...
Looking at the results obtained, specially the ones concerning total damages, the most harmful weather events appeear to be:
HURRICANE, STORM SURGE/TIDE, FLOOD, TORNADO, DROUGTand HAILEXCESSIVE HEAT, TORNADO, FLASH FLOOD, LIGHTNING, FLOOD and THUNDERSTORM WINDThen, we can keep observations of these events to examine them more close. I will group observations by event and year:
harm.economic <- filter(storm, EVTYPE == "HURRICANE" | EVTYPE == "STORM SURGE/TIDE" | EVTYPE == "FLOOD" | EVTYPE == "TORNADO" | EVTYPE == "DROUGHT" | EVTYPE == "HAIL")
harm.human <- filter(storm, EVTYPE == "EXCESSIVE HEAT" | EVTYPE == "TORNADO" | EVTYPE == "FLASH FLOOD" | EVTYPE == "LIGHTNING" | EVTYPE == "FLOOD" | EVTYPE == "THUNDERSTORM WIND")
harm.economic <- group_by(harm.economic, EVTYPE, years)
harm.human <- group_by(harm.human, EVTYPE, years)
Then, we can obtain the number of events and the total number of economic and human casualties:
economic.graph <- summarize(harm.economic, length(STATE), sum(propdmg), sum(cropdmg))
colnames(economic.graph) <- c("EVTYPE", "year", "freq", "propdmg", "cropdmg")
human.graph <- summarize(harm.human, length(STATE), sum(FATALITIES), sum(INJURIES))
colnames(human.graph) <- c("EVTYPE", "year", "freq", "fatalities", "injuries")
The following graph shows the economic impact of the six most economically harmful weather events across the years. Economic impact has been plotted in logarythmic scale.
library(lattice)
xyplot(log10(propdmg) + log10(cropdmg) ~ year | EVTYPE, type="l", economic.graph, auto.key=TRUE)
Looking at the results, looks like tornadoes, floods and hail are the events that more regularly have high economic impact along the years. The other events, although can have also a high economic impact, appear less regularly.
As for human consequences, here is the evolution of fatalities along the years for the most harmful weather events:
xyplot(fatalities ~ year | EVTYPE, type="l", human.graph, auto.key=TRUE)
And here is a similar plot for injuries:
xyplot(injuries ~ year | EVTYPE, type="l", human.graph, auto.key=TRUE)
As for injuries and fatalities, there is a clear outlier which distorts results somewhat, which is the 2011 Joplin tornado:
joplin <- storm[which(storm$FATALITIES > 100), ]
joplin$FATALITIES
## [1] 158
joplin$INJURIES
## [1] 1150
In general, human impact looks like less predictable than economic impact. It must be noted, though, the effect of EXCESSIVE HEAT, (i.e., hetwaves), which calls for prevention actions regarding heatwaves.