Synopsis

This report shows the results of an analysis of the U. S. National Oceanic and Atmospheric Administration’s (NOAA) database. The aim of the research is to answer the following questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

This report is divided in two sections. The first section, Data processing describes all steps taken from data loading to having a dataset ready for analysis. The second section, Results, describes the analysis undertaken to achieve the aims of the report.

Data processing

Data loading

Data processing starts with loading into R the dataset:

setwd("~/Dropbox/CourseraReproducible")
storm <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))

Event filtering

This dataset has up to 902297 observations. To obtain a significant dataset, two filterings have been made:

  1. Deleting data prior to 1996. As has been stated in the discussion forum, looks like observations prior to 1996 have a record of economic consequences inconsistent with the rest of the dataset. Furthermore, the number of recorded events is smaller than in the post-1996 observations.
  2. Deleting observations without economic and personal consequences. Many of the observations have no economic consequences and no casualties, so some work can be saved if these observations are deleted.

To remove from the sample data prior to 1996, I use the lubridate package to get the year of the observation from the BGN_DATE variable, and then proceed to make the first filtering:

library(lubridate)
storm$BGN_DATE <- mdy_hms(storm$BGN_DATE)
storm$years <- year(storm$BGN_DATE)
storm <- storm[which(storm$years > 1995), ]

Afterwards, I proceed to remove the events with no casualties and no economic consequences:

filter <- (storm$INJURIES !=0) | (storm$FATALITIES!=0) | (storm$PROPDMG!=0) | (storm$CROPDMG!=0)

storm <- storm[which(filter), ]

After the two filterings, we obtain a dataset with 201318 observations.

Cleaning of the EVTYPE variable

An examination of the EVTYPE variable shows that there are a large number of different types of events recorded:

length(table(storm$EVTYPE))
## [1] 985

This is a large number of different events, considering that the codebook registers only 48 types of events. So several transformsations and groupings have been undertaken to reduce significantly the number of events:

storm$EVTYPE <- toupper(storm$EVTYPE)

storm$EVTYPE[which(grepl("ASTRONOMICAL HIGH TIDE", storm$EVTYPE))] <- "HIGH TIDE"
storm$EVTYPE[which(grepl("TIDAL FLOODING", storm$EVTYPE))] <- "HIGH TIDE"

storm$EVTYPE[which(grepl("GROUND BLIZZARD", storm$EVTYPE))] <- "BLIZZARD"
storm$EVTYPE[which(grepl("^BLIZZARD", storm$EVTYPE))] <- "BLIZZARD"

storm$EVTYPE[which(grepl("^COASTAL", storm$EVTYPE))] <- "COASTAL FLOOD"
storm$EVTYPE[which(grepl(" EROSION/CSTL FLOOD", storm$EVTYPE))] <- "COASTAL FLOOD"
storm$EVTYPE[which(grepl("EROSION/CSTL FLOOD", storm$EVTYPE))] <- "COASTAL FLOOD"

storm$EVTYPE[which(grepl("^COLD AIR", storm$EVTYPE))] <- "TORNADO"
storm$EVTYPE[which(grepl("^COLD", storm$EVTYPE))] <- "COLD/WIND CHILL"

storm$EVTYPE[which(grepl("^MUD", storm$EVTYPE))] <- "DEBRIS FLOW"
storm$EVTYPE[which(grepl("SLIDE", storm$EVTYPE))] <- "DEBRIS FLOW"
storm$EVTYPE[which(grepl("LANDSLUMP", storm$EVTYPE))] <- "DEBRIS FLOW"

storm$EVTYPE[which(grepl("^DROUGHT", storm$EVTYPE))] <- "DROUGHT"

storm$EVTYPE[which(grepl("^DUST DEVIL", storm$EVTYPE))] <- "DUST DEVIL"
storm$EVTYPE[which(grepl("^DUST STORM", storm$EVTYPE))] <- "DUST STORM"
storm$EVTYPE[which(grepl("BLOWING DUST", storm$EVTYPE))] <- "DUST STORM"

storm$EVTYPE[which(grepl("RECORD/EXCESSIVE HEAT", storm$EVTYPE))] <- c("EXCESSIVE HEAT")
storm$EVTYPE[which(grepl("EXTREME HEAT", storm$EVTYPE))] <- c("EXCESSIVE HEAT")

storm$EVTYPE[which(grepl("^EXTREME", storm$EVTYPE))] <- c("EXTREME COLD/WINDCHILL")

storm$EVTYPE[which(grepl("^FLASH", storm$EVTYPE))] <- "FLASH FLOOD"
storm$EVTYPE[which(grepl("^ FLASH", storm$EVTYPE))] <- "FLASH FLOOD"

storm$EVTYPE[which(grepl("^FLOOD", storm$EVTYPE))] <- "FLOOD"
storm$EVTYPE[which(grepl("BREAKUP FLOODING", storm$EVTYPE))] <- "FLOOD"
storm$EVTYPE[which(grepl("^ICE JAM FLOOD", storm$EVTYPE))] <- "FLOOD"
storm$EVTYPE[which(grepl("^RIVER", storm$EVTYPE))] <- "FLOOD"

storm$EVTYPE[which(grepl("FROST", storm$EVTYPE))] <- "FROST/FREEZE"
storm$EVTYPE[which(grepl("FREEZING FOG", storm$EVTYPE))] <- "aux"
storm$EVTYPE[which(grepl("FREEZ", storm$EVTYPE))] <- "FROST/FREEZE"
storm$EVTYPE[which(grepl("aux", storm$EVTYPE))] <- "FREEZING FOG"

storm$EVTYPE[which(grepl("SMALL HAIL", storm$EVTYPE))] <- "HAIL"

storm$EVTYPE[which(grepl("HEAT WAVE", storm$EVTYPE))] <- "EXCESSIVE HEAT"
storm$EVTYPE[which(grepl("RECORD HEAT", storm$EVTYPE))] <- "EXCESSIVE HEAT"

storm$EVTYPE[which(grepl("^HEAVY RAIN", storm$EVTYPE))] <- "HEAVY RAIN"

storm$EVTYPE[which(grepl("^HEAVY SNOW", storm$EVTYPE))] <- "HEAVY SNOW"
storm$EVTYPE[which(grepl("^SNOW", storm$EVTYPE))] <- "HEAVY SNOW"
storm$EVTYPE[which(grepl("EXCESSIVE SNOW", storm$EVTYPE))] <- "HEAVY SNOW"

storm$EVTYPE[which(grepl("SURF", storm$EVTYPE))] <- "HIGH SURF"

storm$EVTYPE[which(grepl("^LIGHTNING", storm$EVTYPE))] <- "LIGHTNING"

storm$EVTYPE[which(grepl("GUSTY", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("GRADIENT", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("^HIGH WIND", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("NON", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("NON-SEVERE WIND DAMAGE", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("^WIND", storm$EVTYPE))] <- "HIGH WIND"
storm$EVTYPE[which(grepl("^STRONG WIND", storm$EVTYPE))] <- "STRONG WIND"

storm$EVTYPE[which(grepl("WHIRLWIND", storm$EVTYPE))] <- "HURRICANE"
storm$EVTYPE[which(grepl("HURRICANE/TYPHOON", storm$EVTYPE))] <- "HURRICANE"

storm$EVTYPE[which(grepl("^LAKE", storm$EVTYPE))] <- "LAKE-EFFECT SNOW"

storm$EVTYPE[which(grepl("RIP", storm$EVTYPE))] <- "RIP CURRENT"

storm$EVTYPE[which(grepl("STORM SURGE", storm$EVTYPE))] <- "STORM SURGE/TIDE"

storm$EVTYPE <- gsub("TSTM", "THUNDERSTORM", storm$EVTYPE)
storm$EVTYPE[which(grepl("^THUNDERSTORM", storm$EVTYPE))] <- "THUNDERSTORM WIND"
storm$EVTYPE[which(grepl("^ THUNDERSTORM", storm$EVTYPE))] <- "THUNDERSTORM WIND"

storm$EVTYPE[which(grepl("WINTER STORM", storm$EVTYPE))] <- "aux2"
storm$EVTYPE[which(grepl("^WINTE", storm$EVTYPE))] <- "WINTER WEATHER"
storm$EVTYPE[which(grepl("aux2", storm$EVTYPE))] <- "WINTER STORM"

storm$EVTYPE[which(grepl("DAM BREAK", storm$EVTYPE))] <- "FLOOD"

storm$EVTYPE[which(grepl("DOWNBURST", storm$EVTYPE))] <- "HIGH WIND"

storm$EVTYPE[which(grepl("FALLING SNOW", storm$EVTYPE))] <- "AVALANCHE"
storm$EVTYPE[which(grepl("HIGH WATER", storm$EVTYPE))] <- "HIGH TIDE"
storm$EVTYPE[which(grepl("HIGH SEAS", storm$EVTYPE))] <- "HIGH TIDE"

storm$EVTYPE[which(grepl("COLD$", storm$EVTYPE))] <- "COLD/WIND CHILL"

storm$EVTYPE[which(grepl("FLD$", storm$EVTYPE))] <- "FLOOD"

storm$EVTYPE[which(grepl("RAIN$", storm$EVTYPE))] <- "RAIN"

storm$EVTYPE[which(grepl("THYPHOON", storm$EVTYPE))] <- "HURRICANE"

storm$EVTYPE[which(grepl("ICE ROADS", storm$EVTYPE))] <- "ICE ON ROAD"
storm$EVTYPE[which(grepl("ICY ROADS", storm$EVTYPE))] <- "ICE ON ROAD"

After the filtering, the number of EVTYPE different events has been reduced significantly:

length(table(storm$EVTYPE))
## [1] 80

Computing economic damage of each event

The variables of the dataset including data about damage are:

Prior to data analysis, the order of magnitude of a particular event (the Napa flood) has been corrected. This correction has been made taking into account information obtained from the discussion forums:

storm$PROPDMGEXP <- toupper(storm$PROPDMGEXP)
storm$CROPDMGEXP <- toupper(storm$CROPDMGEXP)

storm$EVTYPE[which(storm$PROPDMGEXP=="B" & storm$PROPDMG==115)]
## [1] "FLOOD"
storm$PROPDMGEXP[which(storm$PROPDMGEXP=="B" & storm$PROPDMG==115)] <- "M"

The economic damage of each event is stored in the propdmg and cropdmg variables, obtained as follows:

storm$cropexp <- 1
storm$propexp <- 1

storm$cropexp[which(storm$CROPDMGEXP=="K")] <- 1000
storm$cropexp[which(storm$CROPDMGEXP=="M")] <- 1000000
storm$cropexp[which(storm$CROPDMGEXP=="B")] <- 1000000000
storm$propexp[which(storm$PROPDMGEXP=="K")] <- 1000
storm$propexp[which(storm$PROPDMGEXP=="M")] <- 1000000
storm$propexp[which(storm$PROPDMGEXP=="B")] <- 1000000000

storm$cropdmg <- storm$CROPDMG*storm$cropexp
storm$propdmg <- storm$PROPDMG*storm$propexp

Results

To assess the impact of a given event, two effects have to be taken into account:

To analyze data, the dplyr package has been used, which is adequate for large datasets. Then, relevant effects have been computed for every kind of damage.

library(dplyr)

storm <- tbl_df(storm)

storm.evtype <- group_by(storm, EVTYPE)

storm.prop <- summarize(storm.evtype, length(STATE), sum(propdmg), max(propdmg))
colnames(storm.prop) <- c("EVTYPE", "Frequency", "Total.PROPDMG", "Max.PROPDMG")
storm.prop <- mutate(storm.prop, freq.PROP = Total.PROPDMG/ sum(Total.PROPDMG)*100)

storm.crop <- summarize(storm.evtype, length(STATE), sum(cropdmg), max(cropdmg))
colnames(storm.crop) <- c("EVTYPE", "Frequency", "Total.CROPDMG", "Max.CROPDMG")
storm.crop <- mutate(storm.crop, freq.crop = Total.CROPDMG/ sum(Total.CROPDMG)*100)

storm.fatalities <- summarize(storm.evtype, length(STATE), sum(FATALITIES), max(FATALITIES))
colnames(storm.fatalities) <- c("EVTYPE", "Frequency", "Total.FATAL", "Max.FATAL")
storm.fatalities <- mutate(storm.fatalities, freq.fatal = Total.FATAL/ sum(Total.FATAL)*100)

storm.injuries <- summarize(storm.evtype, length(STATE), sum(INJURIES), max(INJURIES))
colnames(storm.injuries) <- c("EVTYPE", "Frequency", "Total.INJURIES", "Max.INJURIES")
storm.injuries <- mutate(storm.injuries, freq.injuries = Total.INJURIES/ sum(Total.INJURIES)*100)

As for propdmg the most harmful events are:

arrange(storm.prop, desc (Total.PROPDMG))
## Source: local data frame [80 x 5]
## 
##               EVTYPE Frequency Total.PROPDMG Max.PROPDMG freq.PROP
## 1          HURRICANE       201   81118671010   1.693e+10 32.204950
## 2   STORM SURGE/TIDE       216   47834724000   3.130e+10 18.990880
## 3              FLOOD     10305   29245593200   3.000e+09 11.610803
## 4            TORNADO     12366   24616945710   2.800e+09  9.773182
## 5        FLASH FLOOD     19013   15222258910   1.000e+09  6.043394
## 6               HAIL     22690   14595213420   1.800e+09  5.794450
## 7  THUNDERSTORM WIND    105372    7913555880   7.500e+08  3.141763
## 8     TROPICAL STORM       410    7642475550   5.150e+09  3.034142
## 9          HIGH WIND      5531    5253421860   1.300e+09  2.085663
## 10          WILDFIRE       847    4758667000   1.040e+09  1.889240
## ..               ...       ...           ...         ...       ...
arrange(storm.prop, desc (Max.PROPDMG))
## Source: local data frame [80 x 5]
## 
##              EVTYPE Frequency Total.PROPDMG Max.PROPDMG freq.PROP
## 1  STORM SURGE/TIDE       216   47834724000   3.130e+10 18.990880
## 2         HURRICANE       201   81118671010   1.693e+10 32.204950
## 3    TROPICAL STORM       410    7642475550   5.150e+09  3.034142
## 4             FLOOD     10305   29245593200   3.000e+09 11.610803
## 5           TORNADO     12366   24616945710   2.800e+09  9.773182
## 6              HAIL     22690   14595213420   1.800e+09  5.794450
## 7  WILD/FOREST FIRE       381    3001782500   1.500e+09  1.191739
## 8         HIGH WIND      5531    5253421860   1.300e+09  2.085663
## 9          WILDFIRE       847    4758667000   1.040e+09  1.889240
## 10      FLASH FLOOD     19013   15222258910   1.000e+09  6.043394
## ..              ...       ...           ...         ...       ...

As for cropdmg the most harmful events are:

arrange(storm.crop, desc (Total.CROPDMG))
## Source: local data frame [80 x 5]
## 
##                    EVTYPE Frequency Total.CROPDMG Max.CROPDMG freq.crop
## 1                 DROUGHT       258   13367566000    1.00e+09 38.464795
## 2               HURRICANE       201    5349282800    1.51e+09 15.392411
## 3                   FLOOD     10305    5013161500    5.00e+08 14.425231
## 4                    HAIL     22690    2496822450    7.00e+07  7.184536
## 5            FROST/FREEZE       177    1368761000    2.86e+08  3.938571
## 6             FLASH FLOOD     19013    1334901700    2.00e+08  3.841142
## 7  EXTREME COLD/WINDCHILL       296    1326023000    5.96e+08  3.815594
## 8       THUNDERSTORM WIND    105372    1016942600    4.10e+07  2.926224
## 9                    RAIN      1053     739919800    2.00e+08  2.129098
## 10         TROPICAL STORM       410     677711000    2.00e+08  1.950094
## ..                    ...       ...           ...         ...       ...
arrange(storm.crop, desc (Max.CROPDMG))
## Source: local data frame [80 x 5]
## 
##                    EVTYPE Frequency Total.CROPDMG Max.CROPDMG freq.crop
## 1               HURRICANE       201    5349282800   1.510e+09 15.392411
## 2                 DROUGHT       258   13367566000   1.000e+09 38.464795
## 3  EXTREME COLD/WINDCHILL       296    1326023000   5.960e+08  3.815594
## 4                   FLOOD     10305    5013161500   5.000e+08 14.425231
## 5          EXCESSIVE HEAT       687     492402000   4.924e+08  1.416873
## 6            FROST/FREEZE       177    1368761000   2.860e+08  3.938571
## 7             FLASH FLOOD     19013    1334901700   2.000e+08  3.841142
## 8                    RAIN      1053     739919800   2.000e+08  2.129098
## 9          TROPICAL STORM       410     677711000   2.000e+08  1.950094
## 10              HIGH WIND      5531     634071300   1.750e+08  1.824522
## ..                    ...       ...           ...         ...       ...

And the analysis for fatalities and injuries yields:

arrange(storm.fatalities, desc (Total.FATAL))
## Source: local data frame [80 x 5]
## 
##                    EVTYPE Frequency Total.FATAL Max.FATAL freq.fatal
## 1          EXCESSIVE HEAT       687        1799        99  20.602382
## 2                 TORNADO     12366        1511       158  17.304169
## 3             FLASH FLOOD     19013         887        20  10.158039
## 4               LIGHTNING     11152         651         4   7.455337
## 5             RIP CURRENT       603         542         6   6.207055
## 6                   FLOOD     10305         444        11   5.084746
## 7       THUNDERSTORM WIND    105372         379         7   4.340357
## 8               HIGH WIND      5531         258         8   2.954650
## 9  EXTREME COLD/WINDCHILL       296         257        10   2.943197
## 10                   HEAT       164         237        16   2.714155
## ..                    ...       ...         ...       ...        ...
arrange(storm.fatalities, desc (Max.FATAL))
## Source: local data frame [80 x 5]
## 
##             EVTYPE Frequency Total.FATAL Max.FATAL freq.fatal
## 1          TORNADO     12366        1511       158 17.3041686
## 2   EXCESSIVE HEAT       687        1799        99 20.6023820
## 3          TSUNAMI        14          33        32  0.3779203
## 4   TROPICAL STORM       410          57        22  0.6527714
## 5      FLASH FLOOD     19013         887        20 10.1580394
## 6             RAIN      1053          94        19  1.0765002
## 7             HEAT       164         237        16  2.7141548
## 8        HURRICANE       201         126        15  1.4429684
## 9  COLD/WIND CHILL       119         132        14  1.5116812
## 10     DEBRIS FLOW       201          43        14  0.4924416
## ..             ...       ...         ...       ...        ...
arrange(storm.injuries, desc (Total.INJURIES))
## Source: local data frame [80 x 5]
## 
##               EVTYPE Frequency Total.INJURIES Max.INJURIES freq.injuries
## 1            TORNADO     12366          20667         1150     35.648124
## 2              FLOOD     10305           6838          800     11.794739
## 3     EXCESSIVE HEAT       687           6461          519     11.144459
## 4  THUNDERSTORM WIND    105372           5129           70      8.846917
## 5          LIGHTNING     11152           4141           51      7.142734
## 6        FLASH FLOOD     19013           1674          150      2.887451
## 7          HURRICANE       201           1321          780      2.278568
## 8       WINTER STORM      1460           1292          165      2.228547
## 9               HEAT       164           1222          223      2.107805
## 10         HIGH WIND      5531           1187           70      2.047434
## ..               ...       ...            ...          ...           ...
arrange(storm.injuries, desc (Max.INJURIES))
## Source: local data frame [80 x 5]
## 
##            EVTYPE Frequency Total.INJURIES Max.INJURIES freq.injuries
## 1         TORNADO     12366          20667         1150    35.6481242
## 2           FLOOD     10305           6838          800    11.7947391
## 3       HURRICANE       201           1321          780     2.2785683
## 4  EXCESSIVE HEAT       687           6461          519    11.1444588
## 5            HEAT       164           1222          223     2.1078051
## 6  TROPICAL STORM       410            338          200     0.5830099
## 7    WINTER STORM      1460           1292          165     2.2285468
## 8        BLIZZARD       228            385          150     0.6640793
## 9     FLASH FLOOD     19013           1674          150     2.8874515
## 10 WINTER WEATHER       546            483          137     0.8331177
## ..            ...       ...            ...          ...           ...

Looking at the results obtained, specially the ones concerning total damages, the most harmful weather events appeear to be:

Then, we can keep observations of these events to examine them more close. I will group observations by event and year:

harm.economic <- filter(storm, EVTYPE == "HURRICANE" | EVTYPE == "STORM SURGE/TIDE" | EVTYPE == "FLOOD" | EVTYPE == "TORNADO" | EVTYPE == "DROUGHT" | EVTYPE == "HAIL")
harm.human <- filter(storm, EVTYPE == "EXCESSIVE HEAT" | EVTYPE == "TORNADO" | EVTYPE == "FLASH FLOOD" | EVTYPE == "LIGHTNING" | EVTYPE == "FLOOD" | EVTYPE == "THUNDERSTORM WIND")
harm.economic <- group_by(harm.economic, EVTYPE, years)
harm.human <- group_by(harm.human, EVTYPE, years)

Then, we can obtain the number of events and the total number of economic and human casualties:

economic.graph <- summarize(harm.economic, length(STATE), sum(propdmg), sum(cropdmg))
colnames(economic.graph) <- c("EVTYPE", "year", "freq", "propdmg", "cropdmg")
human.graph <- summarize(harm.human, length(STATE), sum(FATALITIES), sum(INJURIES))
colnames(human.graph) <- c("EVTYPE", "year", "freq", "fatalities", "injuries")

The following graph shows the economic impact of the six most economically harmful weather events across the years. Economic impact has been plotted in logarythmic scale.

library(lattice)

xyplot(log10(propdmg) + log10(cropdmg) ~ year | EVTYPE, type="l", economic.graph, auto.key=TRUE)

Looking at the results, looks like tornadoes, floods and hail are the events that more regularly have high economic impact along the years. The other events, although can have also a high economic impact, appear less regularly.
As for human consequences, here is the evolution of fatalities along the years for the most harmful weather events:

xyplot(fatalities ~ year | EVTYPE, type="l", human.graph, auto.key=TRUE)

And here is a similar plot for injuries:

xyplot(injuries ~ year | EVTYPE, type="l", human.graph, auto.key=TRUE)

As for injuries and fatalities, there is a clear outlier which distorts results somewhat, which is the 2011 Joplin tornado:

joplin <- storm[which(storm$FATALITIES > 100), ]
joplin$FATALITIES
## [1] 158
joplin$INJURIES
## [1] 1150

In general, human impact looks like less predictable than economic impact. It must be noted, though, the effect of EXCESSIVE HEAT, (i.e., hetwaves), which calls for prevention actions regarding heatwaves.