This is a brief analysis that address two questions related with weather events in the United States: which types are most harmful with respect to population health?, and which ones have the greatest economic consequences? Knowing more about these events and their impact in our society will allow us to be focused on the types of events that have the major level of repercursions. Besides, resources on this field of study are limited.

The data for this study has been taken from the Storm Events Database from the National Climat Data Center page.

Here it is also shown the steps and code (written in R version ) that were performed to get the Results concluded in this analysis.

Data Processing

The file containing the data can be found here. It is in a compressed state. The following R code helps to download the file, and load it into a raw data frame.

filename = "stormdata.csv.bz2"
download.file(url = "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
              , destfile = filename)
rwDf <- read.csv(bzfile(filename)) # load raw data int rwDf

The raw data has 902,297 observations. Before starting manipulating it further, lets first filter out by a subset of interest and load it in a new data frame called anDf (analysis data frame). For harmful on the population health we select FATALITIES and INJURIES columns. For economic consequences we select PROPDMG (Property Damage), PROPDMGEXP (Property Damage Exponential), CROPDMG (Crop Damage) and CROPDMGEXP (Crop Damage Exponential) columns.

In addition, we select BGN_DATE (Beginning Date), STATE and EVTYPE (Event Type) variables for general analysis.

# Only when these vars are greater than zero
anDf <- rwDf[rwDf$FATALITIES > 0 | rwDf$INJURIES > 0 | rwDf$PROPDMG > 0 | rwDf$CROPDMG > 0 
             , c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES"
                 , "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")] # create analysis data frame anDf

Now we have 254,633 observations which are more managable.

Lets format correctly the Beginning Date and create a Year column.

anDf$BGN_DATE <- as.Date(anDf$BGN_DATE,"%m/%d/%Y") # convert to proper date format
anDf$BGN_YEAR <- format(anDf$BGN_DATE, "%Y")

The EVTYPE (Event Type) variable has 985 classifications which makes impossible to do a an analysis. Lets take the offical Event Type 48 classification. Taken from here (Table 2.1.1) on January 25th, 2015.

officialClass <-  c("ASTRONOMICAL LOW TIDE",  "AVALANCHE",  "BLIZZARD",  "COASTAL FLOOD",   "COLD/WIND CHILL",  "DEBRIS FLOW"
                    ,   "DENSE FOG",    "DENSE SMOKE",  "DROUGHT",  "DUST DEVIL",   "DUST STORM",   "EXCESSIVE HEAT"
                    ,   "EXTREME COLD/WIND CHILL",  "FLASH FLOOD",  "FLOOD",    "FROST/FREEZE", "FUNNEL CLOUD", "FREEZING FOG"
                    ,   "HAIL", "HEAT", "HEAVY RAIN",   "HEAVY SNOW",   "HIGH SURF",    "HIGH WIND",    "HURRICANE (TYPHOON)"
                    ,   "ICE STORM",    "LAKE-EFFECT SNOW", "LAKESHORE FLOOD",  "LIGHTNING",    "MARINE HAIL",  "MARINE HIGH WIND"
                    ,   "MARINE STRONG WIND",   "MARINE THUNDERSTORM WIND", "RIP CURRENT",  "SEICHE",   "SLEET",    "STORM SURGE/TIDE"
                    ,   "STRONG WIND",  "THUNDERSTORM WIND",    "TORNADO",  "TROPICAL DEPRESSION",  "TROPICAL STORM",   "TSUNAMI"
                    ,   "VOLCANIC ASH", "WATERSPOUT",   "WILDFIRE", "WINTER STORM", "WINTER WEATHER")

Before trying to match against the official classification, we need to make some manual modifications that will allow to classify.

anDf$EVTYPE <- toupper(anDf$EVTYPE) # Make all capitals
anDf$EVTYPE <- gsub("^\\s+|\\s+$", "", anDf$EVTYPE) # Remove leading and trailing spaces
anDf$EVTYPE <- gsub("TSTM", "THUNDERSTORM", anDf$EVTYPE) ## Change difficult known abbreviations
anDf$EVTYPE <- gsub("FLOODING", "FLOOD", anDf$EVTYPE) ## Change common flooding by flood
## Change very particular records since using grep get into error due to unclosed parethesis
anDf$EVTYPE <- gsub("ICE JAM FLOOD \\(MI", "ICE JAM FLOOD", anDf$EVTYPE) 
anDf$EVTYPE <- gsub("THUNDERSTORM WIND \\(G", "THUNDERSTORM WIND", anDf$EVTYPE)
anDf$EVTYPE <- gsub("THUNDERSTORM WIND \\(", "THUNDERSTORM WIND", anDf$EVTYPE)
anDf$EVTYPE <- gsub("THUNDERSTORM WIND  \\(", "THUNDERSTORM WIND", anDf$EVTYPE)
anDf$EVTYPE <- gsub("HIGH WIND \\(G4", "HIGH WIND", anDf$EVTYPE)

Note: The removal of the unclosed parethesis can be refined to look for all of them. In this case there are only 5 records.

Lets create a new CLNEVTYP (Cleaned Event Type) variable. Since finding matches for every entry would be a long effort, lets put the remaining unclassified records into a Other classification. These last should be minimal count so our analysis should not be impacted.

anDf$CLNEVTYP <- NA

## Build auxiliary functions to classify correctly
library(stringdist)
findClass1 <- function(str, vecstr, mxdst)vecstr[amatch(str, vecstr, maxDist = mxdst)]
findClass2 <- function(str, vecstr)grep(paste0("^", substring(str, 1, ceiling(nchar(str)/2))), vecstr, value = TRUE)[1]

## Exact matches and almost exact ones (small variations like typos, plural vs singular, extra special characters)
anDf$CLNEVTYP <- sapply(anDf$EVTYPE, FUN = findClass1, vecstr = officialClass, mxdst = 2)
sum(is.na(anDf$CLNEVTYP))
## [1] 4818
## Lets get a 50% of the first letters of each and see if matches any current classification by grep.
anDf[is.na(anDf$CLNEVTYP),"CLNEVTYP"] <- sapply(anDf[is.na(anDf$CLNEVTYP),"EVTYPE"], FUN = findClass2, vecstr = officialClass)
sum(is.na(anDf$CLNEVTYP))
## [1] 2810
## Lets put the rest in a Other classification
anDf[is.na(anDf$CLNEVTYP),"CLNEVTYP"] <- "Other"

Having the PROPDMG (Property Damage) and CROPDMG (Crop Damage) in two variables does not make the data tidy. Lets fix that by creating a PROPDMGAMT (Property Damage Amount) and CROPDMGAMT (Crop Damage Amount) variables. For make analysis easier, lets also create a TOTDMGAMT (Total Damage Amount) variable.

anDf$PROPDMGAMT <- ifelse(anDf$PROPDMGEXP == "0", anDf$PROPDMG
                 , ifelse(anDf$PROPDMGEXP == "1", anDf$PROPDMG * 10
                 , ifelse(anDf$PROPDMGEXP %in% c("2","h","H"), anDf$PROPDMG * 100
                 , ifelse(anDf$PROPDMGEXP %in% c("3","k","K"), anDf$PROPDMG * 1000
                 , ifelse(anDf$PROPDMGEXP == "4", anDf$PROPDMG * 10000
                 , ifelse(anDf$PROPDMGEXP == "5", anDf$PROPDMG * 100000
                 , ifelse(anDf$PROPDMGEXP %in% c("6","m","M"), anDf$PROPDMG * 1000000
                 , ifelse(anDf$PROPDMGEXP == "7", anDf$PROPDMG * 10000000
                 , ifelse(anDf$PROPDMGEXP == "8", anDf$PROPDMG * 100000000
                 , ifelse(anDf$PROPDMGEXP %in% c("9","b","B"), anDf$PROPDMG * 1000000000, NA))))))))))

anDf$CROPDMGAMT <- ifelse(anDf$CROPDMGEXP == "0", anDf$CROPDMG
                 , ifelse(anDf$CROPDMGEXP == "1", anDf$CROPDMG * 10
                 , ifelse(anDf$CROPDMGEXP %in% c("2","h","H"), anDf$CROPDMG * 100
                 , ifelse(anDf$CROPDMGEXP %in% c("3","k","K"), anDf$CROPDMG * 1000
                 , ifelse(anDf$CROPDMGEXP == "4", anDf$CROPDMG * 10000
                 , ifelse(anDf$CROPDMGEXP == "5", anDf$CROPDMG * 100000
                 , ifelse(anDf$CROPDMGEXP %in% c("6","m","M"), anDf$CROPDMG * 1000000
                 , ifelse(anDf$CROPDMGEXP == "7", anDf$CROPDMG * 10000000
                 , ifelse(anDf$CROPDMGEXP == "8", anDf$CROPDMG * 100000000
                 , ifelse(anDf$CROPDMGEXP %in% c("9","b","B"), anDf$CROPDMG * 1000000000, NA))))))))))

## Lets make a total damage variable
anDf$TOTDMGAMT <- anDf$PROPDMGAMT + anDf$CROPDMGAMT

Aggregate the data to make fewer records an also according to each analysis

phDf <- aggregate(cbind(FATALITIES, INJURIES) ~ BGN_YEAR + STATE + CLNEVTYP
                  , data = anDf[anDf$FATALITIES > 0 | anDf$INJURIES > 0, ]
                  , FUN = sum)
ecDf <- aggregate(cbind(PROPDMGAMT, CROPDMGAMT, TOTDMGAMT) ~ BGN_YEAR + STATE + CLNEVTYP
                  , data = anDf[anDf$PROPDMGAMT > 0 | anDf$CROPDMGAMT > 0, ]
                  , FUN = sum)

Lets get some top fives of the type of events with more fatalities and injuries:

topFatal <- aggregate(FATALITIES ~ CLNEVTYP, data = phDf, FUN = sum)
head(topFatal[order(-topFatal$FATALITIES), ], n = 5)
##          CLNEVTYP FATALITIES
## 35        TORNADO       5633
## 9  EXCESSIVE HEAT       1905
## 11    FLASH FLOOD       1018
## 17           HEAT        937
## 24      LIGHTNING        818
topInjur <- aggregate(INJURIES ~ CLNEVTYP, data = phDf, FUN = sum)
head(topInjur[order(-topInjur$INJURIES), ], n = 5)
##             CLNEVTYP INJURIES
## 35           TORNADO    91364
## 34 THUNDERSTORM WIND     9509
## 12             FLOOD     6791
## 9     EXCESSIVE HEAT     6548
## 24         LIGHTNING     5233

Lets do the same for the economic side:

topTotDmg <- aggregate(TOTDMGAMT ~ CLNEVTYP, data = ecDf, FUN = sum)
head(topTotDmg[order(-topTotDmg$TOTDMGAMT), ], n = 5)
##               CLNEVTYP    TOTDMGAMT
## 14               FLOOD 138072315000
## 24 HURRICANE (TYPHOON)  44203445800
## 39             TORNADO  16570343763
## 33               Other  12547527950
## 18                HAIL  10048611590

Let’s see in a graphic these top fives accross time

topFatalbyYr <- aggregate(FATALITIES ~ BGN_YEAR + CLNEVTYP
                          , data = phDf[phDf$CLNEVTYP %in% c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING"),]
                          , FUN = sum)

g <- ggplot(data = topFatalbyYr, aes(x=BGN_YEAR, y=FATALITIES)) 
g + geom_line(aes(group=CLNEVTYP, color=CLNEVTYP)) + ggtitle("Number of Fatalities by Event Type")

topInjurbyYr <- aggregate(INJURIES ~ BGN_YEAR + CLNEVTYP
                          , data = phDf[phDf$CLNEVTYP %in% c("TORNADO", "THUNDERSTORM WIND", "FLOOD", "EXCESSIVE HEAT", "LIGHTNING"),]
                          , FUN = sum)

g <- ggplot(data = topInjurbyYr, aes(x=BGN_YEAR, y=INJURIES)) 
g + geom_line(aes(group=CLNEVTYP, color=CLNEVTYP)) + ggtitle("Number of Injuries by Event Type")

topTotDmgbyYr <- aggregate(TOTDMGAMT ~ BGN_YEAR + CLNEVTYP
                          , data = ecDf[phDf$CLNEVTYP %in% c("FLOOD", "HURRICANE (TYPHOON)", "TORNADO", "Other", "HAIL"),]
                          , FUN = sum)

g <- ggplot(data = topTotDmgbyYr, aes(x=BGN_YEAR, y=TOTDMGAMT)) 
g + geom_line(aes(group=CLNEVTYP, color=CLNEVTYP)) + ggtitle("Total Damage (US $) by Event Type")

Results

The Tornado events have the most harmful consequences on the public safety while the Floods has caused the most economic issues.