There are two aspects to this assignment:
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
In the next sections of this report, an analysis and conclusion shall be presented to answer the following two questions.
The data can be downloaded via this link. The documentation of the variables is not included in the dataset; however, it can be found in
The following snippet downloads the data linked above and loads it into the workspace.
library(data.table)
library(ggplot2)
# Download the dataset ------------------------------------
fileURL1 = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("data")) {
  dir.create("data")
}
localFile <- "./data/repStormData.csv.bz2"
# Download only if the file is not already cached locally
if (!file.exists(localFile)) {
  download.file(fileURL1, destfile = localFile, method = "curl")
}
# read.csv reads the bz2-compressed file directly
stormData <- read.csv(localFile, header = TRUE)
Let's take a look at the column names to understand what data is present.
colnames(stormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The dataset has 37 columns; however, EVTYPE, the columns related to health impact (FATALITIES, INJURIES), and the columns related to damage (PROPDMG, CROPDMG and their exponent columns) are sufficient for the analysis within the scope of this report.
Before we start the analysis, we can do some pre-processing to make the dataset lighter to handle and easier to interpret.
As explained in the previous sub-section, the dataset has 37 columns, while we need fewer than 10 of them. Let's strip off the unnecessary data and convert the result to a data.table.
# Remove the prefix in the column names
colnames(stormData) <- gsub("Freq.", "", colnames(stormData))
# Step 1 - Delete unnecessary columns
# Column names to keep (everything else is dropped)
relevantCols <- c("EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG",
"PROPDMGEXP",
"CROPDMG",
"CROPDMGEXP")
stormData <- stormData[relevantCols]
# Step 2 - Convert to Data Table
stormDT <- as.data.table(stormData)
# Step 3 - Retain data rows ONLY where there is impact on health or property
stormDT <- stormDT[(EVTYPE != "?" &
( INJURIES > 0 |
FATALITIES > 0 |
PROPDMG > 0 |
CROPDMG > 0)),
c("EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG",
"PROPDMGEXP",
"CROPDMG",
"CROPDMGEXP")]
The stripped-down dataset contains 7 columns and only those observations that have some impact on health or property. Let's take a look at the damage exponent columns and see if any pre-processing is required.
unique(stormDT$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
unique(stormDT$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k"
The damage exponents are not stored as numbers but as alphanumeric codes. Let's convert them to numeric multipliers.
# First bring all data to upper case
cols <- c("PROPDMGEXP", "CROPDMGEXP")
stormDT[, (cols) := c(lapply(.SD, toupper)), .SDcols = cols]
# Lookup table for property damages exponentials
propdamagesKey <- c("\"\"" = 10^0,
"-" = 10^0,
"+" = 10^1,
"0" = 10^2,
"1" = 10^1,
"2" = 10^2,
"3" = 10^3,
"4" = 10^4,
"5" = 10^5,
"6" = 10^6,
"7" = 10^7,
"8" = 10^8,
"9" = 10^9,
"H" = 10^2, # Hundreds
"K" = 10^3, # Thousands
"M" = 10^6, # Million
"B" = 10^9 # Billion
)
# Lookup for crop damages exponentials
cropdamagesKey <- c("\"\"" = 10^0,
"?" = 10^0,
"0" = 10^0,
"K" = 10^3, # Thousands
"M" = 10^6, # Million
"B" = 10^9 # Billion
)
# Replace the alphanumeric description with numeric data using the above defined keys
stormDT[, PROPDMGEXP := propdamagesKey[as.character(stormDT[, PROPDMGEXP])]]
stormDT[, CROPDMGEXP := cropdamagesKey[as.character(stormDT[, CROPDMGEXP])]]
# Replace NA entries (blank codes that the lookup did not match) with a multiplier of 1
stormDT[is.na(PROPDMGEXP), PROPDMGEXP := 10^0]
stormDT[is.na(CROPDMGEXP), CROPDMGEXP := 10^0]
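The lookup above relies on indexing a named vector by character codes; any code without a matching name, including the empty string, comes back as NA, which is why the NA replacement step is needed. A minimal sketch of that behaviour on toy data, where `toyKey` and `toyExp` are hypothetical names and the key is deliberately shortened:

```r
# Hypothetical, shortened multiplier key (not the full key used in the report)
toyKey <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9)

# Toy exponent codes: mixed case, a blank, and an unknown symbol
toyExp <- c("K", "m", "", "B", "?")

# Upper-case first, then look up each code by name; unmatched codes give NA
toyMult <- unname(toyKey[toupper(toyExp)])

# Treat unmatched codes as a multiplier of 1
toyMult[is.na(toyMult)] <- 1

toyMult  # values: 1000, 1e6, 1, 1e9, 1
```

In base R, an empty string never matches a name when subsetting, so a blank code always falls through to the NA replacement regardless of how the key is named.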
To understand the impact of weather events on crops and property, we need to establish the loss due to property damage and crop damage. Let's use a simple mechanism, where the loss for each event equals the reported damage value times its exponent multiplier.
stormDT <- stormDT[, .(EVTYPE,
FATALITIES,
INJURIES,
PROPDMG,
PROPDMGEXP,
PROPLOSS = PROPDMG*PROPDMGEXP, # New Column for loss due to property damage
CROPDMG,
CROPDMGEXP,
CROPLOSS = CROPDMG*CROPDMGEXP # New Column for loss due to crop damage
)]
# Sum up the property and crop losses by Event Type
totalLossDT <- stormDT[, .(PROPLOSS = sum(PROPLOSS),
CROPLOSS = sum(CROPLOSS),
totalLoss = sum(PROPLOSS)+sum(CROPLOSS)),
by = .(EVTYPE)]
# Order the loss summary by total losses
totalLossDT <- totalLossDT[order(-totalLoss),]
# Top five weather events causing damages to property and crops are ordered by total loss
head(totalLossDT, 5)
## EVTYPE PROPLOSS CROPLOSS totalLoss
## 1: FLOOD 144657709807 5661968450 150319678257
## 2: HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3: TORNADO 56947394433 414953270 57362347703
## 4: STORM SURGE 43323536000 5000 43323541000
## 5: HAIL 15735290847 3025954473 18761245320
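As a quick check on the loss formula used above, the same arithmetic can be worked through by hand on a few made-up rows. This is only a sketch in base R (chosen to keep it dependency-free, whereas the report itself uses data.table); the `toy` data frame and its values are hypothetical.

```r
# Hypothetical rows: damage value and its numeric multiplier
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL"),
  PROPDMG    = c(2.5, 10, 0.5),
  PROPDMGEXP = c(1e6, 1e3, 1e9)
)

# Loss per event = damage value times multiplier
toy$PROPLOSS <- toy$PROPDMG * toy$PROPDMGEXP

# Sum the losses by event type and order from largest to smallest
toyTotals <- sort(tapply(toy$PROPLOSS, toy$EVTYPE, sum), decreasing = TRUE)
toyTotals  # HAIL: 5e8, FLOOD: 2510000
```

For instance, a PROPDMG of 2.5 with a multiplier of 10^6 corresponds to a loss of 2,500,000 dollars.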
Now let's take a look at the impact on human life in terms of injuries and fatalities.
totalHLImpactDT <- stormDT[, .(FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES),
totalHLF = sum(FATALITIES)+sum(INJURIES)), # New column indicating overall factor on Human Life
by = .(EVTYPE)]
# Order the human life analysis by fatalities
totalHLImpactDT <- totalHLImpactDT[order(-FATALITIES), ]
# Top five weather events in terms of fatalities
head(totalHLImpactDT, 5)
## EVTYPE FATALITIES INJURIES totalHLF
## 1: TORNADO 5633 91346 96979
## 2: EXCESSIVE HEAT 1903 6525 8428
## 3: FLASH FLOOD 978 1777 2755
## 4: HEAT 937 2100 3037
## 5: LIGHTNING 816 5230 6046
We set out to understand the data available in the storm database and to identify the weather events that cause the most harm in terms of economic damage and human life. The analysis in the previous section gives us a glimpse of how this looks. Let us now answer the following two questions specifically.
It is prudent to plot only the most significant weather events, as opposed to all the hundreds of them, to help with our analysis. Following tidy data principles, let's melt the impact columns (fatalities, injuries, and their total) into a single column.
# Subset the data table with just top 10 entries
adverseHumanhealth <- totalHLImpactDT[1:10, ]
adverseHumanhealth <- melt(adverseHumanhealth, id.vars = "EVTYPE", variable.name = "adverse_impact")
head(adverseHumanhealth, 5)
## EVTYPE adverse_impact value
## 1: TORNADO FATALITIES 5633
## 2: EXCESSIVE HEAT FATALITIES 1903
## 3: FLASH FLOOD FATALITIES 978
## 4: HEAT FATALITIES 937
## 5: LIGHTNING FATALITIES 816
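Note that `melt` with only `id.vars` given treats every remaining column as a measure, so the total column is stacked alongside fatalities and injuries. A minimal sketch on a hypothetical toy table (`toyDT` and its values are made up for illustration) showing how `measure.vars` could restrict the reshape to the two columns of interest:

```r
library(data.table)

# Hypothetical summary table with the same column layout as totalHLImpactDT
toyDT <- data.table(
  EVTYPE     = c("TORNADO", "HEAT"),
  FATALITIES = c(10, 4),
  INJURIES   = c(100, 8),
  totalHLF   = c(110, 12)
)

# Default behaviour: all three non-id columns are melted (2 events x 3 = 6 rows)
toyAll <- melt(toyDT, id.vars = "EVTYPE", variable.name = "adverse_impact")

# Restricting to fatalities and injuries only (2 events x 2 = 4 rows)
toyLong <- melt(toyDT,
                id.vars = "EVTYPE",
                measure.vars = c("FATALITIES", "INJURIES"),
                variable.name = "adverse_impact")

nrow(toyAll)   # 6
nrow(toyLong)  # 4
```

Whether the total belongs in the plot is a design choice; leaving it in gives a third bar per event.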
Now we can plot the impact of various weather events in terms of severity in a bar graph.
# Create graph object
hl_chart <- ggplot(adverseHumanhealth, aes(x=reorder(EVTYPE, -value), y=value))
# Plot as bar graph
hl_chart = hl_chart + geom_bar(stat = "identity", aes(fill=adverse_impact), position="dodge")
# Set y-axis label
hl_chart = hl_chart + ylab("Incidence Count")
# Set x-axis label
hl_chart = hl_chart + xlab("Weather Event Type")
# Rotate x-axis tick labels
hl_chart = hl_chart + theme(axis.text.x = element_text(angle=45, hjust=1))
# Set chart title and center it
hl_chart = hl_chart + ggtitle("Top 10 weather events causing adverse impact on human life in US") + theme(plot.title = element_text(hjust = 0.5))
hl_chart
From the above graph we can conclude that tornadoes have the most severe impact on human life in the US.
It is prudent to plot only the most significant weather events, as opposed to all the hundreds of them, to help with our analysis. Following tidy data principles, let's melt the loss columns (property, crop, and total) into a single column.
# Subset the data table with just top 10 entries
adversEcon <- totalLossDT[1:10, ]
adversEcon <- melt(adversEcon, id.vars = "EVTYPE", variable.name = "adverse_econ")
head(adversEcon, 5)
## EVTYPE adverse_econ value
## 1: FLOOD PROPLOSS 144657709807
## 2: HURRICANE/TYPHOON PROPLOSS 69305840000
## 3: TORNADO PROPLOSS 56947394433
## 4: STORM SURGE PROPLOSS 43323536000
## 5: HAIL PROPLOSS 15735290847
Let's create a chart of the losses due to property and crop damage.
# Create graph object
loss_chart <- ggplot(adversEcon, aes(x=reorder(EVTYPE, -value), y=value))
# Plot as bar graph
loss_chart = loss_chart + geom_bar(stat = "identity", aes(fill=adverse_econ), position="dodge")
# Set y-axis label
loss_chart = loss_chart + ylab("Loss in dollars")
# Set x-axis label
loss_chart = loss_chart + xlab("Weather Event Type")
# Rotate x-axis tick labels
loss_chart = loss_chart + theme(axis.text.x = element_text(angle=45, hjust=1))
# Set chart title and center it
loss_chart = loss_chart + ggtitle("Top 10 weather events causing economic loss in US") + theme(plot.title = element_text(hjust = 0.5))
loss_chart
From the above graph we can conclude that floods cause the most severe adverse economic impact in the US.
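Raw dollar values in the hundreds of billions can be hard to read on the y-axis. One possible refinement, sketched here on hypothetical rounded figures rather than the report's data, is to pass a labelling function to `scale_y_continuous` so the ticks are shown in billions. The names `toyEcon` and `toy_chart` are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical, rounded totals (not the report's exact figures)
toyEcon <- data.frame(
  EVTYPE = c("FLOOD", "TORNADO", "HAIL"),
  value  = c(150e9, 57e9, 19e9)
)

# Bar chart with y-axis tick labels expressed in billions of dollars
toy_chart <- ggplot(toyEcon, aes(x = reorder(EVTYPE, -value), y = value)) +
  geom_col() +
  scale_y_continuous(labels = function(x) paste0(x / 1e9, "B")) +
  xlab("Weather Event Type") +
  ylab("Loss in dollars (billions)")

toy_chart
```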