Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

We explored the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States from 1950 through the end of November 2011.

Our analysis found that, across the United States, the following weather events were the most harmful with respect to population health: (1) TORNADO (2) HEAT (3) FLOOD by fatalities, and (1) TORNADO (2) THUNDERSTORM (3) HEAT by injuries.

We also found that the greatest economic damage was caused by the following weather events: (1) TORNADOES (2) THUNDERSTORMS (3) FLOODS (ranked by total damage per event type, which is the sum of crop and property damage).

Data Processing

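# setInternet2() is a Windows-only helper from older R versions, used there to allow https downloads
setInternet2(TRUE)
# Download the compressed CSV to a temporary file and read it directly through bzfile()
f <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", f)
data <- read.csv(bzfile(f), stringsAsFactors = FALSE)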

First, we check how clean the EVTYPE variable is, since we use it extensively in the rest of the analysis. EVTYPE lists the types of weather events that caused public health or economic damage.

event <- data$EVTYPE
event <- as.data.frame(table(event))
library(plyr)
arr <- arrange(event, -Freq)       
head(arr, 15)
##                 event   Freq
## 1                HAIL 288661
## 2           TSTM WIND 219940
## 3   THUNDERSTORM WIND  82563
## 4             TORNADO  60652
## 5         FLASH FLOOD  54277
## 6               FLOOD  25326
## 7  THUNDERSTORM WINDS  20843
## 8           HIGH WIND  20212
## 9           LIGHTNING  15754
## 10         HEAVY SNOW  15708
## 11         HEAVY RAIN  11723
## 12       WINTER STORM  11433
## 13     WINTER WEATHER   7026
## 14       FUNNEL CLOUD   6839
## 15   MARINE TSTM WIND   6175

Looking at the EVTYPE frequencies, we observe that most values are in upper case and a few are in lower case, so we convert all values to upper case. We also trim leading and trailing whitespace from the names.

trim <- function(x) gsub("^\\s+|\\s+$", "", x)
data$EVTYPE <- toupper(data$EVTYPE)
data$EVTYPE <- trim(data$EVTYPE)
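To illustrate what this normalization achieves, here is a minimal sketch on a handful of made-up labels. The vector `raw_labels` is hypothetical and not taken from the storm data; the `trim` helper is the same one defined above.

```r
# Hypothetical labels showing case and whitespace variants of the same event
raw_labels <- c("Hail", "HAIL", " hail ", "Tstm Wind", "TSTM WIND  ")

# Same helper as in the report: strip leading and trailing whitespace
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

clean_labels <- trim(toupper(raw_labels))

n_before <- length(unique(raw_labels))    # distinct labels before cleaning
n_after  <- length(unique(clean_labels))  # distinct labels after cleaning
print(table(clean_labels))
```

Comparing `n_before` with `n_after` shows how many spurious categories the two steps remove.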

Looking at the data, we observed another issue: in many cases the same event is coded differently in EVTYPE.

For example, all of the events below should be reported as one event because they all represent THUNDERSTORM:

##                      EVENT FREQUENCY
## 1                TSTM WIND    219940
## 2        THUNDERSTORM WIND     82563
## 3       THUNDERSTORM WINDS     20843
## 4         MARINE TSTM WIND      6175
## 5 MARINE THUNDERSTORM WIND      5812
## 6           TSTM WIND/HAIL      1028

Another example: all of the events below should be reported as one event because they all represent FLOOD:

##               EVENT FREQUENCY
## 1       FLASH FLOOD     54277
## 2             FLOOD     25326
## 3    FLASH FLOODING       682
## 4     COASTAL FLOOD       650
## 5 FLOOD/FLASH FLOOD       624
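One way to surface such variants is to filter the frequency table by keyword. The sketch below uses a made-up vector `labels` (hypothetical, not the real data) and base R only; it is an illustration, not the procedure used to build the tables above.

```r
# Made-up event labels standing in for data$EVTYPE
labels <- c("FLASH FLOOD", "FLOOD", "HAIL", "COASTAL FLOOD", "FLOOD", "TSTM WIND")

# Frequency table as a data frame with columns `event` and `Freq`
freq <- as.data.frame(table(event = labels), stringsAsFactors = FALSE)

# Keep only labels that contain the keyword, most frequent first
flood_like <- freq[grepl("FLOOD", freq$event, fixed = TRUE), ]
flood_like <- flood_like[order(-flood_like$Freq), ]
print(flood_like)
```

A keyword filter can also match labels that should not be merged, so the matches still need a manual review before recoding.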

The same holds for other values of EVTYPE. Similar aggregation is also required for HEAT, HAIL, WILDFIRE, EXTREMECOLD, and HIGH WIND, which we do with the code below:

new <- data

# Merge thunderstorm-related labels. "TROPICAL STORM" is folded in as well;
# "MARINE THUNDERSTORM WIND" is not in the list, so it stays a separate category.
new$EVTYPE[new$EVTYPE %in% c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", "MARINE TSTM WIND", "TSTM WIND/HAIL", "TROPICAL STORM", "THUNDERSTORM")] <- "THUNDERSTORM"

new$EVTYPE[new$EVTYPE %in% c("FLASH FLOOD", "FLOOD", "FLASH FLOODING", "COASTAL FLOOD", "FLOOD/FLASH FLOOD")] <- "FLOOD"

new$EVTYPE[new$EVTYPE %in% c("MARINE HAIL", "HAIL")] <- "HAIL"

# The merged label is spelled "WILDFIFRE"; the result tables show it with this spelling.
new$EVTYPE[new$EVTYPE %in% c("WILDFIRE", "WILD/FOREST FIRE")] <- "WILDFIFRE"

new$EVTYPE[new$EVTYPE %in% c("EXTREME COLD/WIND CHILL", "EXTREME COLD")] <- "EXTREMECOLD"

# EVTYPE was trimmed earlier, so a trailing-space variant of "WIND" cannot occur here
new$EVTYPE[new$EVTYPE %in% c("HIGH WIND", "HIGH WINDS", "WIND")] <- "HIGH WIND"

new$EVTYPE[new$EVTYPE %in% c("EXCESSIVE HEAT", "HEAT")] <- "HEAT"
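The recoding above can also be expressed as a lookup table, which keeps all mappings in one place. The following is a minimal sketch under stated assumptions: `recode_map` and `evtype` are hypothetical names, and the map covers only a few of the labels merged above.

```r
# Hypothetical lookup table: names are raw labels, values are merged categories
recode_map <- c(
  "TSTM WIND"         = "THUNDERSTORM",
  "THUNDERSTORM WIND" = "THUNDERSTORM",
  "FLASH FLOOD"       = "FLOOD",
  "COASTAL FLOOD"     = "FLOOD",
  "EXCESSIVE HEAT"    = "HEAT"
)

# Toy input; "LIGHTNING" and "FLOOD" have no entry and must pass through unchanged
evtype <- c("TSTM WIND", "FLASH FLOOD", "LIGHTNING", "EXCESSIVE HEAT", "FLOOD")

# Replace only the labels that appear in the map
hit <- evtype %in% names(recode_map)
recoded <- evtype
recoded[hit] <- unname(recode_map[evtype[hit]])
print(recoded)
```

Adding a new variant then means adding one entry to the table instead of editing a long logical condition.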

Data Analysis

After cleaning and preparing the data, we are ready to run the analysis and answer the requested questions.

First, let’s address the question of which weather events are most harmful to public health.

The following two variables report population casualties (each gives the number of people affected):

FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...

INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...

We load the required R packages and sum the numbers of fatalities and injuries for each EVTYPE:

library(plyr)
## Warning: package 'plyr' was built under R version 3.2.4
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.2.4
sum_population <- ddply(new, .(EVTYPE), summarise, fatalities = sum(FATALITIES), 
    injuries = sum(INJURIES))

Next, we sort the fatality and injury totals to find the most harmful weather events:

# Top 10 event types by fatalities (columns: EVTYPE, fatalities)
major_fatalities <- head(sum_population[order(sum_population$fatalities, decreasing = TRUE), 
    ], n = 10)[, c(1, 2)]
# Top 10 event types by injuries (columns: EVTYPE, injuries)
major_injuries <- head(sum_population[order(sum_population$injuries, decreasing = TRUE), ], 
    n = 10)[, c(1, 3)]
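The sum-then-rank steps above can be reproduced with base R alone. The sketch below is an illustration on a made-up data frame `toy` (hypothetical values, same column names as the storm data); it uses `aggregate()` in place of `ddply()` and is not the code that produced the results in this report.

```r
# Toy data frame with the same column names as the storm data (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  INJURIES   = c(10, 0, 5, 2, 1),
  stringsAsFactors = FALSE
)

# Sum both casualty columns within each event type
sums <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank by fatalities, largest first, and keep the top 2 event types
ranked <- sums[order(sums$FATALITIES, decreasing = TRUE), c("EVTYPE", "FATALITIES")]
top_fatalities <- head(ranked, n = 2)
print(top_fatalities)
```

Selecting columns by name rather than by position keeps the ranking step readable if the summary gains more columns later.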

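Now let’s address the question of which weather events are most harmful to the economy.

The following two variables report economic damage. Each holds a magnitude whose unit (thousands, millions, or billions of dollars) is coded in the companion columns PROPDMGEXP and CROPDMGEXP; the sums below use the magnitudes as recorded, without applying those units.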

PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 …

CROPDMG : num 0 0 0 0 0 0 0 0 0 0 …

We sum together the numbers of PROPERTY DAMAGE and CROP DAMAGE for each EVTYPE

sum_econdamage <- ddply(new, .(EVTYPE), summarise, cropdmg = sum(CROPDMG), propdmg = sum(PROPDMG))
sum_econdamage$total_damage <- sum_econdamage$cropdmg + sum_econdamage$propdmg

Next, we sort by total damage (PROPERTY DAMAGE plus CROP DAMAGE) to find the most harmful weather events:

major_econdamage <- head(sum_econdamage[order(sum_econdamage$total_damage, decreasing = TRUE), ], n = 10)
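For readers who want dollar amounts, the magnitudes can be scaled by the unit codes before summing. The sketch below is not part of the analysis reported here; it is a minimal illustration on a made-up data frame `toy`, and the `multiplier` mapping is an assumption that handles only the codes K, M, and B and treats every other code as 1.

```r
# Toy rows; PROPDMGEXP holds the unit code for PROPDMG (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL", "FLOOD"),
  PROPDMG    = c(25, 2.5, 500, 1),
  PROPDMGEXP = c("K", "M", "", "B"),
  stringsAsFactors = FALSE
)

# Assumed mapping from unit code to multiplier
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each row's multiplier; codes without an entry yield NA and fall back to 1
scale <- unname(multiplier[toupper(toy$PROPDMGEXP)])
scale[is.na(scale)] <- 1

# Property damage in dollars, then summed per event type
toy$prop_usd <- toy$PROPDMG * scale
totals <- aggregate(prop_usd ~ EVTYPE, data = toy, FUN = sum)
print(totals[order(-totals$prop_usd), ])
```

The same scaling would apply to CROPDMG with CROPDMGEXP. How to treat codes outside K, M, and B is a separate decision that this sketch does not settle.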

Results

From the list of top fatalities below, we conclude that, across the United States, the three major causes of human fatalities were (1) TORNADO (2) HEAT (3) FLOOD.

major_fatalities
##           EVTYPE fatalities
## 739      TORNADO       5633
## 229         HEAT       2840
## 141        FLOOD       1487
## 403    LIGHTNING        816
## 662 THUNDERSTORM        774
## 507  RIP CURRENT        368
## 306    HIGH WIND        306
## 122  EXTREMECOLD        287
## 11     AVALANCHE        224
## 864 WINTER STORM        206

The same data plotted as a bar chart:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
barchart1 <- melt(major_fatalities)
## Using EVTYPE as id variables
# Order the bars by descending value; columns are referenced by bare name inside aes()
ggplot(barchart1, aes(x = factor(EVTYPE, levels = EVTYPE[order(-value)]), y = value)) + 
    geom_bar(fill = "green", stat = "identity") + theme(axis.text.x = element_text(angle = 45)) + 
    labs(x = "WEATHER EVENT", y = "# OF FATALITIES") + 
    ggtitle("HUMAN FATALITIES DUE TO WEATHER EVENTS")

From the list of top injuries below, we conclude that, across the United States, the three major causes of human injuries were (1) TORNADO (2) THUNDERSTORM (3) HEAT.

major_injuries
##           EVTYPE injuries
## 739      TORNADO    91346
## 662 THUNDERSTORM     9808
## 229         HEAT     8625
## 141        FLOOD     8591
## 403    LIGHTNING     5230
## 372    ICE STORM     1975
## 306    HIGH WIND     1525
## 852    WILDFIFRE     1456
## 198         HAIL     1361
## 864 WINTER STORM     1321

The same data plotted as a bar chart:

barchart2 <- melt(major_injuries)
## Using EVTYPE as id variables
# Order the bars by descending value; columns are referenced by bare name inside aes()
ggplot(barchart2, aes(x = factor(EVTYPE, levels = EVTYPE[order(-value)]), y = value)) + 
    geom_bar(fill = "blue", stat = "identity") + theme(axis.text.x = element_text(angle = 45)) + 
    labs(x = "WEATHER EVENT", y = "# OF INJURIES") + 
    ggtitle("HUMAN INJURIES DUE TO WEATHER EVENTS")

From the list of top total damage (the sum of crop and property damage) below, we conclude that the greatest economic damage in the US was caused by the following weather events: (1) TORNADOES (2) THUNDERSTORMS (3) FLOODS.

major_econdamage
##           EVTYPE   cropdmg    propdmg total_damage
## 739      TORNADO 100018.52 3212258.16   3312276.68
## 662 THUNDERSTORM 204935.75 2719380.69   2924316.44
## 141        FLOOD 353792.09 2381300.31   2735092.40
## 198         HAIL 579596.28  688697.38   1268293.66
## 403    LIGHTNING   3580.61  603351.78    606932.39
## 306    HIGH WIND  19342.81  383017.10    402359.91
## 864 WINTER STORM   1978.99  132720.59    134699.58
## 852    WILDFIFRE   8553.74  123804.29    132358.03
## 260   HEAVY SNOW   2165.72  122251.99    124417.71
## 372    ICE STORM   1688.95   66000.67     67689.62

The same data plotted as a bar chart:

barchart3 <- melt(major_econdamage)
## Using EVTYPE as id variables
# Drop the total so that the stacked bars show crop and property damage only
barchart3 <- barchart3[barchart3$variable != "total_damage", ]
# Each event now appears twice (crop and property), so unique() is needed to keep
# the factor levels distinct; duplicated levels are an error in current R versions
ggplot(barchart3, aes(x = factor(EVTYPE, levels = unique(EVTYPE[order(-value)])), y = value, fill = variable)) + 
    geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45)) + labs(x = "WEATHER EVENT", y = "DAMAGE TO ECONOMY $") + 
    scale_fill_discrete(name = "Type of damage", labels = c("CROP", "PROPERTY")) + 
    theme(legend.position = "top") + ggtitle("ECONOMIC DAMAGE DUE TO WEATHER EVENTS")