Explore the NOAA Storm Database to understand the impact of severe weather events

1. Synopsis

There are two aspects to this assignment:

The scientific objective of the assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events.
The technical objective is to become familiar with the knitr package in R and with publishing to the RPubs platform.

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

In the next sections of this report, an analysis and conclusion shall be presented to answer the following two questions.

Across the United States, which types of events are most harmful with respect to human population life & health?
Across the United States, which types of events have the greatest economic consequences?

2. Data Processing

2.1 Download and load the Data

The data can be downloaded from the URL used in the snippet below. The documentation of the variables is not included in the dataset; however, it can be found in:

National Weather Service [Storm Data Documentation](https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf)
National Climatic Data Center Storm Events FAQ

The following snippet downloads the data linked above and loads it into the workspace.

library(data.table)
library(ggplot2)

# Download the dataset ------------------------------------
fileURL1 = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if(!file.exists("data")) {
  dir.create("data")
}
localFile <- "./data/repStormData.csv.bz2"
# Download only if the file is not already cached locally
if (!file.exists(localFile)) {
  download.file(fileURL1, destfile = localFile, method = "curl")
}
# read.csv decompresses the bz2 file transparently
stormData <- read.csv(localFile, header = TRUE)
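
Since only a handful of the columns are used later in this report, the load step could skip the rest. The following is a minimal sketch of that idea, not the code used for this report. It relies on the `colClasses` argument of base R's `read.csv`; the tiny bz2 file and the names `toy`, `tmp`, `keep` and `small` are made up for illustration.

```r
# Toy stand-in for the storm file (made-up values, not NOAA data)
toy <- data.frame(EVTYPE     = c("TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0),
                  REMARKS    = c("long text", "more text"))
tmp <- tempfile(fileext = ".csv.bz2")
write.csv(toy, bzfile(tmp), row.names = FALSE)

# Columns we want to keep (hypothetical subset)
keep <- c("EVTYPE", "FATALITIES")

# Read a single row just to learn the column names
header <- names(read.csv(tmp, nrows = 1))

# "NULL" drops a column while reading; NA lets read.csv guess the type
classes <- ifelse(header %in% keep, NA, "NULL")
small <- read.csv(tmp, colClasses = classes)
```

With this approach the unused columns are never stored in memory, which may help on machines where the full file is slow to load.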

2.2 Explore the data

Let's take a look at the column names to understand what data is present.

colnames(stormData)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The dataset has 37 columns; however, EVTYPE, the columns related to human harm (FATALITIES, INJURIES) and the columns related to damage (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) are sufficient for the analysis within the scope of this report.

2.3 Pre-processing

Before we start the analysis, we can do some pre-processing to make the available dataset lighter to handle and easy to interpret!

As explained in the previous sub-section, we have 37 columns, while we need fewer than 10 of these. So let's strip off the unnecessary data and convert the result to a data.table.

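# Remove any literal "Freq." prefix from the column names
# (none of the columns listed above carry one, so this is a safeguard only)
colnames(stormData) <- gsub("Freq.", "", colnames(stormData), fixed = TRUE)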

# Step 1 - Keep only the relevant columns
# List the column names to be retained
relevantCols <- c("EVTYPE",
                  "FATALITIES",
                  "INJURIES",
                  "PROPDMG",
                  "PROPDMGEXP",
                  "CROPDMG",
                  "CROPDMGEXP")

stormData <- stormData[relevantCols]

# Step 2 - Convert to Data Table
stormDT <- as.data.table(stormData)

# Step 3 - Retain data rows ONLY where there is impact on health or property
stormDT <- stormDT[(EVTYPE != "?" & 
                      (  INJURIES > 0 | 
                         FATALITIES > 0 |
                         PROPDMG > 0 |
                        CROPDMG > 0)),
                   c("EVTYPE",
                     "FATALITIES",
                     "INJURIES",
                     "PROPDMG",
                     "PROPDMGEXP",
                     "CROPDMG",
                     "CROPDMGEXP")]

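The filter above repeats the column list already stored in `relevantCols`. As a sketch of an alternative, data.table's `..` prefix can look up a character vector of column names from the calling scope. The toy table and the names `toyDT`, `keepCols` and `filtered` below are made up purely to illustrate the idiom.

```r
library(data.table)

# Made-up rows for illustration only
toyDT <- data.table(EVTYPE     = c("TORNADO", "?", "HAIL", "FLOOD"),
                    FATALITIES = c(2, 1, 0, 0),
                    INJURIES   = c(5, 0, 0, 0),
                    PROPDMG    = c(10, 3, 0, 25),
                    CROPDMG    = c(0, 0, 0, 1),
                    REFNUM     = 1:4)

keepCols <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# Keep rows with a known event type and at least one non-zero impact,
# selecting the columns through the ..keepCols lookup
filtered <- toyDT[EVTYPE != "?" &
                    (INJURIES > 0 | FATALITIES > 0 | PROPDMG > 0 | CROPDMG > 0),
                  ..keepCols]
```

Here the "?" row and the all-zero HAIL row are dropped, and the REFNUM column is left out.
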
The stripped-down data set contains 7 columns and only those observations that had some impact on health or property. Let's take a look at the damage exponent columns and see if any further pre-processing is required.

unique(stormDT$PROPDMGEXP)

##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"

unique(stormDT$CROPDMGEXP)

## [1] ""  "M" "K" "m" "B" "?" "0" "k"

The damage exponents are recorded as a mix of letters, digits and symbols rather than as numeric multipliers. Let's convert them to numeric values.

# First bring all data to upper case
cols <- c("PROPDMGEXP", "CROPDMGEXP")
stormDT[, (cols) := c(lapply(.SD, toupper)), .SDcols = cols]

# Lookup table for property damages exponentials
propdamagesKey <- c("\"\"" = 10^0,
                "-" = 10^0,
                "+" = 10^1,
                "0" = 10^2,
                "1" = 10^1,
                "2" = 10^2,
                "3" = 10^3,
                "4" = 10^4,
                "5" = 10^5,
                "6" = 10^6,
                "7" = 10^7,
                "8" = 10^8,
                "9" = 10^9,
                "H" = 10^2,   # Hundreds
                "K" = 10^3,   # Thousands
                "M" = 10^6,   # Million
                "B" = 10^9    # Billion
                )

# Lookup for crop damages exponentials
cropdamagesKey <- c("\"\"" = 10^0,
                    "?" = 10^0,
                    "0" = 10^0,
                    "K" = 10^3,   # Thousands
                    "M" = 10^6,   # Million
                    "B" = 10^9    # Billion
                    )

# Replace the alphanumeric codes with numeric multipliers using the keys defined above
stormDT[, PROPDMGEXP := propdamagesKey[as.character(PROPDMGEXP)]]
stormDT[, CROPDMGEXP := cropdamagesKey[as.character(CROPDMGEXP)]]

# Codes without a matching key (including the empty string) come back as NA;
# give them a multiplier of 1 (10^0)
stormDT[is.na(PROPDMGEXP), PROPDMGEXP := 10^0]
stormDT[is.na(CROPDMGEXP), CROPDMGEXP := 10^0]
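
The lookup step can be illustrated in isolation. The sketch below uses a shortened, hypothetical key (`key`) and a few sample codes (`codes`) to show how indexing a named vector by character works in R, and why a fallback for NA is needed: names that do not match, including the empty string, return NA.

```r
# Hypothetical, shortened key; the keys in the report have more entries
key <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Sample codes, including an empty string and an unknown symbol
codes <- c("K", "", "M", "?")

# Character indexing: names that do not match (including "") give NA
multiplier <- unname(key[codes])

# Treat unmatched codes as a multiplier of 1
multiplier[is.na(multiplier)] <- 1

multiplier
```

This is why the empty code ends up with a multiplier of 1 in the chunk above: it is handled by the NA replacement rather than by a key entry.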

2.4 Analysis

To understand the impact of weather events on crops and property, we need to establish the loss due to property damage and crop damage. Let's use a simple mechanism, where the loss for each event equals the damage figure multiplied by its exponent multiplier.

stormDT <- stormDT[, .(EVTYPE,
                       FATALITIES,
                       INJURIES,
                       PROPDMG,
                       PROPDMGEXP,
                       PROPLOSS = PROPDMG*PROPDMGEXP, # New Column for loss due to property damage
                       CROPDMG,
                       CROPDMGEXP,
                       CROPLOSS = CROPDMG*CROPDMGEXP # New Column for loss due to crop damage 
                       )]

# Sum up the property and crop losses by Event Type
totalLossDT <- stormDT[, .(PROPLOSS = sum(PROPLOSS),
                           CROPLOSS = sum(CROPLOSS),
                           totalLoss = sum(PROPLOSS)+sum(CROPLOSS)),
                       by = .(EVTYPE)]

# Order the loss summary by total losses
totalLossDT <- totalLossDT[order(-totalLoss),]

# Top five weather events by total loss (property plus crops)
head(totalLossDT, 5)

##               EVTYPE     PROPLOSS   CROPLOSS    totalLoss
## 1:             FLOOD 144657709807 5661968450 150319678257
## 2: HURRICANE/TYPHOON  69305840000 2607872800  71913712800
## 3:           TORNADO  56947394433  414953270  57362347703
## 4:       STORM SURGE  43323536000       5000  43323541000
## 5:              HAIL  15735290847 3025954473  18761245320
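
To make the arithmetic concrete, here is a small worked sketch on made-up records (the table `toyDT` and its values are hypothetical, not NOAA data). It follows the same idea as above: loss per record is the damage figure times its multiplier, and the losses are then summed per event type.

```r
library(data.table)

# Made-up damage records; the multipliers are already numeric
toyDT <- data.table(EVTYPE     = c("FLOOD", "FLOOD", "HAIL"),
                    PROPDMG    = c(2.5, 10, 4),
                    PROPDMGEXP = c(1e6, 1e3, 1e3),
                    CROPDMG    = c(1, 0, 5),
                    CROPDMGEXP = c(1e3, 1, 1e3))

# Loss per record = damage figure times its multiplier
toyDT[, `:=`(PROPLOSS = PROPDMG * PROPDMGEXP,
             CROPLOSS = CROPDMG * CROPDMGEXP)]

# Totals per event type, largest total first
toyLoss <- toyDT[, .(PROPLOSS = sum(PROPLOSS),
                     CROPLOSS = sum(CROPLOSS)),
                 by = EVTYPE][, totalLoss := PROPLOSS + CROPLOSS][order(-totalLoss)]
```

For the toy FLOOD rows, the property loss is 2.5 x 10^6 + 10 x 10^3 = 2,510,000 and the crop loss is 1,000, giving a total of 2,511,000.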

Now let's take a look at the impact on human life in terms of injuries and fatalities.

totalHLImpactDT <- stormDT[, .(FATALITIES = sum(FATALITIES),
                               INJURIES = sum(INJURIES),
                               totalHLF = sum(FATALITIES)+sum(INJURIES)), # New column indicating overall factor on Human Life
                           by = .(EVTYPE)]

# Order the human life analysis by fatalities
totalHLImpactDT <- totalHLImpactDT[order(-FATALITIES), ]

# Top five weather events in terms of fatalities
head(totalHLImpactDT, 5)

##            EVTYPE FATALITIES INJURIES totalHLF
## 1:        TORNADO       5633    91346    96979
## 2: EXCESSIVE HEAT       1903     6525     8428
## 3:    FLASH FLOOD        978     1777     2755
## 4:           HEAT        937     2100     3037
## 5:      LIGHTNING        816     5230     6046

3. Results & Conclusions

We set out to understand the data available in the storm database and to identify the weather events which cause the most harm in terms of economic damage and human life. The analysis in the previous section gives us a glimpse of how this looks. Let us now answer the two questions specifically.

3.1 Weather events most harmful to human population health

It is prudent to plot only the most significant weather events, rather than all the hundreds of them, to keep the analysis readable. Following tidy data principles, let's melt the impact columns (fatalities, injuries and their total) into one.

# Subset the data table with just top 10 entries
adverseHumanhealth <- totalHLImpactDT[1:10, ]
adverseHumanhealth <- melt(adverseHumanhealth, id.vars = "EVTYPE", variable.name = "adverse_impact")
head(adverseHumanhealth, 5)

##            EVTYPE adverse_impact value
## 1:        TORNADO     FATALITIES  5633
## 2: EXCESSIVE HEAT     FATALITIES  1903
## 3:    FLASH FLOOD     FATALITIES   978
## 4:           HEAT     FATALITIES   937
## 5:      LIGHTNING     FATALITIES   816
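
Note that `melt` without `measure.vars` reshapes every non-id column, so the combined `totalHLF` column becomes a third category alongside fatalities and injuries. If only the two underlying measures are wanted in the chart, a sketch along the following lines could be used. The table `toyImpact` and its values are made up for illustration.

```r
library(data.table)

# Made-up summary table with the same column layout as above
toyImpact <- data.table(EVTYPE     = c("TORNADO", "HEAT"),
                        FATALITIES = c(10, 4),
                        INJURIES   = c(50, 6),
                        totalHLF   = c(60, 10))

# Melt only the two measures; totalHLF is left out of the long table
long <- melt(toyImpact,
             id.vars       = "EVTYPE",
             measure.vars  = c("FATALITIES", "INJURIES"),
             variable.name = "adverse_impact")
```

The result has one row per event type and measure, with the measure name in `adverse_impact` and the count in `value`.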

Now we can plot the impact of various weather events in terms of severity in a bar graph.

# Create graph object
hl_chart <- ggplot(adverseHumanhealth, aes(x=reorder(EVTYPE, -value), y=value))

# Plot as bar graph
hl_chart = hl_chart + geom_bar(stat = "identity", aes(fill=adverse_impact), position="dodge")

# Set y-axis label
hl_chart = hl_chart + ylab("Incidence Count") 

# Set x-axis label
hl_chart = hl_chart + xlab("Weather Event Type") 

# Rotate x-axis tick labels 
hl_chart = hl_chart + theme(axis.text.x = element_text(angle=45, hjust=1))

# Set chart title and center it
hl_chart = hl_chart + ggtitle("Top 10 weather events causing adverse impact on human life in US") + theme(plot.title = element_text(hjust = 0.5))

hl_chart
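
The same kind of chart can also be built in a single chained expression. The following is a sketch on made-up data (`toyLong`), using `geom_col()`, which is ggplot2's shorthand for `geom_bar(stat = "identity")`, and `labs()` to set the labels and title in one call.

```r
library(ggplot2)

# Made-up long-format data for illustration only
toyLong <- data.frame(
  EVTYPE         = rep(c("TORNADO", "HEAT", "FLOOD"), times = 2),
  adverse_impact = rep(c("FATALITIES", "INJURIES"), each = 3),
  value          = c(10, 4, 2, 50, 6, 3)
)

# Dodged bar chart with labels, rotated ticks and a centred title
toy_chart <- ggplot(toyLong, aes(x = reorder(EVTYPE, -value), y = value,
                                 fill = adverse_impact)) +
  geom_col(position = "dodge") +
  labs(x = "Weather Event Type", y = "Incidence Count",
       title = "Toy example: impact by event type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title  = element_text(hjust = 0.5))
```

Printing `toy_chart` draws the plot; whether to chain the layers or add them step by step is mostly a matter of taste.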

From the above graph we can conclude that tornadoes have the most severe impact on human life in the US.

3.2 Weather events causing most economic damages

# Subset the data table with just top 10 entries
adversEcon <- totalLossDT[1:10, ]
adversEcon <- melt(adversEcon, id.vars = "EVTYPE", variable.name = "adverse_econ")
head(adversEcon, 5)

##               EVTYPE adverse_econ        value
## 1:             FLOOD     PROPLOSS 144657709807
## 2: HURRICANE/TYPHOON     PROPLOSS  69305840000
## 3:           TORNADO     PROPLOSS  56947394433
## 4:       STORM SURGE     PROPLOSS  43323536000
## 5:              HAIL     PROPLOSS  15735290847

Let's create a chart of the losses due to property and crop damage.

# Create graph object
loss_chart <- ggplot(adversEcon, aes(x=reorder(EVTYPE, -value), y=value))

# Plot as bar graph
loss_chart = loss_chart + geom_bar(stat = "identity", aes(fill=adverse_econ), position="dodge")

# Set y-axis label
loss_chart = loss_chart + ylab("Loss in dollars") 

# Set x-axis label
loss_chart = loss_chart + xlab("Weather Event Type") 

# Rotate x-axis tick labels 
loss_chart = loss_chart + theme(axis.text.x = element_text(angle=45, hjust=1))

# Set chart title and center it
loss_chart = loss_chart + ggtitle("Top 10 weather events causing economic loss in US") + theme(plot.title = element_text(hjust = 0.5))

loss_chart
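
Because the losses run into the billions of dollars, the default y-axis labels can be hard to read. One possible refinement, sketched below on made-up round numbers, is to pass a labelling function to `scale_y_continuous()`. The helper `in_billions` and the data frame `toyLoss` are hypothetical names introduced for this example.

```r
library(ggplot2)

# Hypothetical helper: express dollar amounts in billions
in_billions <- function(x) paste0(x / 1e9, "B")

# Made-up round figures for illustration only
toyLoss <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "HAIL"),
                      value  = c(150e9, 57e9, 19e9))

# Bar chart with the y-axis tick labels shown in billions
toy_loss_chart <- ggplot(toyLoss, aes(x = reorder(EVTYPE, -value), y = value)) +
  geom_col() +
  scale_y_continuous(labels = in_billions) +
  labs(x = "Weather Event Type", y = "Loss in US dollars (billions)")
```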

From the above graph we can conclude that floods cause the most severe adverse economic impact in the US.