Synopsis

This analysis seeks to answer two specific questions about the impact of storms and other extreme weather events in the US: which types of events cause the most fatalities and injuries, and which cause the most economic damage. It uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records.

Data Processing

The data are downloaded from the web into the working directory. The file is a .csv compressed with bzip2, so we decompress it after downloading and load it into R.

library(R.utils)
if(!file.exists("repSTData.csv.bz2"))
        download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                      destfile="repSTData.csv.bz2")

if(!file.exists("repSTData.csv"))
        bunzip2("repSTData.csv.bz2", overwrite=FALSE, remove=FALSE)

STData = read.csv("repSTData.csv")
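
As a side note on the loading step, `read.csv()` can usually read a bzip2-compressed file directly, because R detects the compression when it opens the file for reading. The sketch below illustrates this on a small toy file; the data frame `toy` and the temporary path are hypothetical stand-ins, not the storm data.

```r
# Toy stand-in for the storm file: write a small bz2-compressed CSV
# to a temporary location (hypothetical path, not the real data).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(0, 1))
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() is given the compressed file directly;
# no separate bunzip2() step is used here.
toyBack <- read.csv(tmp, stringsAsFactors = FALSE)
print(toyBack)
```

Under this assumption the explicit `bunzip2()` call is a convenience rather than a requirement; keeping it leaves an uncompressed copy on disk that other tools can open.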

Taking a look at the data

head(STData, 3)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
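
The BGN_DATE column shown above can be used to see how the recorded events are spread over the years. The following is a minimal sketch on toy rows copied from the output above; the names `toy`, `toyDates`, `toyYear` and `eventsPerYear` are illustrative only.

```r
# Toy rows in the same BGN_DATE format as the output above.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse month/day/year; the trailing time part is not needed.
toyDates <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")
toyYear <- as.integer(format(toyDates, "%Y"))

# Count events per year.
eventsPerYear <- table(toyYear)
print(eventsPerYear)
```

Applied to the full data set, the same steps would take `as.character(STData$BGN_DATE)` as input, since the column may have been read as a factor.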

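Based on the available documentation for the database and the questions to answer, we select only the few variables that we believe are relevant for the analysis: the event type (EVTYPE), fatalities (FATALITIES), injuries (INJURIES), property damage and its order of magnitude (PROPDMG, PROPDMGEXP), and crop damage and its order of magnitude (CROPDMG, CROPDMGEXP).

Now extract these variables and label them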

labels = c("tevent", "fatalities", 
                   "injuries", "apdomagnitude", "ompdamage",
                   "acdomagnitude", "omcdamage")

ImpData = STData[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

names(ImpData) = labels

The structure of the data is checked again

str(ImpData)
## 'data.frame':    902297 obs. of  7 variables:
##  $ tevent       : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
##  $ fatalities   : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries     : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ apdomagnitude: num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ ompdamage    : Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ acdomagnitude: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ omcdamage    : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

For the order of magnitude of property damage and the order of magnitude of crop damage, only the “K”, “M” and “B” values will be used.

unique(ImpData$ompdamage) 
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(ImpData$omcdamage)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

Known values are then replaced by their numerical value (K=1000, M=1000000, B=1000000000), ignoring case. Unknown values are replaced with zeros.

library(plyr)

# Property damage: symbols, "h"/"H" and bare digits are unknown codes, so set them to 0
ImpData$ompdamage = as.character(ImpData$ompdamage)
ImpData$ompdamage = gsub("\\-|\\+|\\?|h|H|[0-9]", "0", ImpData$ompdamage)
ImpData$ompdamage = gsub("k|K", "1000", ImpData$ompdamage)
ImpData$ompdamage = gsub("m|M", "1000000", ImpData$ompdamage)
ImpData$ompdamage = gsub("b|B", "1000000000", ImpData$ompdamage)
ImpData$ompdamage = as.numeric(ImpData$ompdamage)
# Empty codes become NA in as.numeric(); treat them as 0 as well
ImpData$ompdamage[is.na(ImpData$ompdamage)] = 0

# Crop damage: same treatment
ImpData$omcdamage = as.character(ImpData$omcdamage)
ImpData$omcdamage = gsub("\\-|\\+|\\?|h|H|[0-9]", "0", ImpData$omcdamage)
ImpData$omcdamage = gsub("k|K", "1000", ImpData$omcdamage)
ImpData$omcdamage = gsub("m|M", "1000000", ImpData$omcdamage)
ImpData$omcdamage = gsub("b|B", "1000000000", ImpData$omcdamage)
ImpData$omcdamage = as.numeric(ImpData$omcdamage)
ImpData$omcdamage[is.na(ImpData$omcdamage)] = 0

# Damage in US dollars = reported value * multiplier
ImpData = mutate(ImpData, Property = apdomagnitude * ompdamage, Crops = acdomagnitude * omcdamage)
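
To make the arithmetic concrete, the conversion can also be sketched with a named lookup vector on a few toy rows. This is an alternative illustration, not the code used in this report; the helper `expToMultiplier` and the data frame `toy` are hypothetical names introduced only for this example.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Anything other than K, M or B (in either case) becomes 0.
expToMultiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 0
  unname(mult)
}

# Toy rows: 25 K, 2.5 m, 10 B, plus an empty and an unknown code.
toy <- data.frame(dmg = c(25, 2.5, 10, 3, 7),
                  exp = c("K", "m", "B", "", "?"),
                  stringsAsFactors = FALSE)
toy$damage <- toy$dmg * expToMultiplier(toy$exp)
print(toy)
```

Here 25 with code "K" becomes 25000 dollars and 2.5 with code "m" becomes 2500000 dollars, while the rows with an empty or unknown code contribute 0.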

Results

In this section we try to answer the proposed questions through a graphical analysis of the data.

What events cause the most fatalities and injuries?

The following graph summarizes the ten weather events in the US with the highest totals of injuries and fatalities.

library(plyr)
library(reshape2)
library(ggplot2)

# Total injuries and fatalities per event type
ImpDataS = ddply(ImpData, "tevent", summarise, tot.injuries = sum(injuries), tot.fatalities = sum(fatalities))

# Rank by injuries (ties broken by fatalities) and keep the top 10
ImpDataS = arrange(ImpDataS, desc(tot.injuries), desc(tot.fatalities))
ImpDataS = head(ImpDataS, 10)

# Reshape to long format so both totals are stacked in one bar per event
ImpDataS = melt(ImpDataS, id.vars = "tevent")

p = ggplot(ImpDataS, aes(x = tevent, y = value, fill = variable))
p + geom_bar(stat="identity") + coord_flip() + scale_fill_brewer() + labs(x="Event Type", y="Fatalities and Injuries", title="Top 10 most dangerous weather events in the US")
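
The summarise, rank and reshape steps behind this plot can be sketched on a toy table using base R only. The sketch is illustrative: the data frame `toy` and its values are made up, and only the top 2 rows are kept instead of the top 10.

```r
# Made-up toy data with the same column names as ImpData.
toy <- data.frame(
  tevent = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  injuries = c(15, 2, 6, 4, 1),
  fatalities = c(1, 0, 2, 3, 0),
  stringsAsFactors = FALSE
)

# Sum both outcomes per event type.
totals <- aggregate(cbind(injuries, fatalities) ~ tevent, data = toy, FUN = sum)

# Rank by injuries, breaking ties by fatalities, and keep the top 2.
totals <- totals[order(-totals$injuries, -totals$fatalities), ]
top <- head(totals, 2)

# Long format: one row per (event, outcome) pair, ready for stacked bars.
long <- data.frame(
  tevent = rep(top$tevent, times = 2),
  variable = rep(c("injuries", "fatalities"), each = nrow(top)),
  value = c(top$injuries, top$fatalities)
)
print(long)
```

Note that ranking by injuries first means an event with few injuries but many fatalities could fall outside the selection; ranking by the sum of both columns would be one alternative.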

It is noteworthy that the weather events that cause the most fatalities and injuries in the United States are tornadoes and excessive heat events.

What events cause the most economic damage?

The following graph summarizes the ten weather events in the US that cause the most damage to property and crops.

library(plyr)
library(reshape2)
library(ggplot2)

# Total property and crop damage (in US dollars) per event type
ImpDataS = ddply(ImpData, "tevent", summarise, tot.property = sum(Property), tot.crops = sum(Crops))

# Rank by property damage (ties broken by crop damage) and keep the top 10
ImpDataS = arrange(ImpDataS, desc(tot.property), desc(tot.crops))
ImpDataS = head(ImpDataS, 10)

# Reshape to long format so both totals are stacked in one bar per event
ImpDataS = melt(ImpDataS, id.vars = "tevent")

# Divide by 1e9 so the axis matches its label (billions of US dollars)
p = ggplot(ImpDataS, aes(x = tevent, y = value / 1e9, fill = variable))
p + geom_bar(stat="identity") + coord_flip() + scale_fill_brewer() + labs(x="Event Type", y="Damage (billions of US dollars)", title="Top 10 weather events that cause the most economic damage in the US")

It is noteworthy that the meteorological events that cause the most economic damage in the United States are floods, followed by hurricanes/typhoons and, to a lesser extent, tornadoes.