This report summarises the data found in the NOAA Storm Database, particularly in relation to the following questions:
The report presents only preliminary findings. It comprises (1) a section describing the transformations which were made to the original dataset (notably selecting relevant years to analyze, converting coded amounts to reflect their economic value, and doing some minor variable cleaing), and (2) a section presenting basic tables and figures related to the two questions stated above.
The data used in this report covers the period from 01 January 1996 to 30 November 2011.
First we read in the data using R. The source file is a bz2 file, so we can use bzcat to directly read the data in using fread from the “data.table” package.
According to the details available1, complete storm data for all event types are available only from 1996 on-wards, so we first convert the event date to an actual date format, and subset on the relevant rows. There are also several columns that we are not interested in, so we will drop those columns and re-save the Rdata file. This should speed up loading the data on subsequent runs.
## -------------------------------------------------------------------
## 01 - SETTING THE WORKING DIRECTORY --------------------------------
## -------------------------------------------------------------------
## I work on both Windows and Linux, and keep the files in Dropbox.
## This lets me load the relevant data from either system.
OS <- Sys.info()["sysname"]
if (OS == "Windows") {
setwd("C:/Users/Ananda PC/Dropbox/Coursera/reproducible_research")
} else if (OS == "Linux") {
setwd("~/Dropbox/Coursera/reproducible_research/")
}
## -------------------------------------------------------------------
## 02 - LOADING REQUIRED PACKAGES ------------------------------------
## -------------------------------------------------------------------
## Load "data.table"
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lattice))
## -------------------------------------------------------------------
## 03 - DECIDING WHICH VERSION OF THE DATA TO LOAD -------------------
## -------------------------------------------------------------------
## Check to see if the data exists as a rdata file.
## If not, read it in and save as rdata file
## -------------------------------------------------------------------
## 03A - LOADING THE RAW DATA ----------------------------------------
## -------------------------------------------------------------------
if (!file.exists("storm.Rdata")) {
storm <- fread("bzcat repdata-data-StormData.csv.bz2")
save(storm, file = "storm.Rdata")
}
## -------------------------------------------------------------------
## 03B - LOADING THE RAW DATA PROCESSED AS AN RDATA FILE -------------
## - PROCESS THE FILE AND RE-SAVE AS A NEW DATA SUBSET -----------
## -------------------------------------------------------------------
if (!file.exists("storm_final.Rdata")) {
cat("Loading from saved Rdata file.")
load("storm.Rdata")
## -----------------------------------------------------------------
## 03B1 - Keep only 1996 onwards since that's when
## full data is available
## - First, convert to proper dates,
## then select the relevant rows
## -----------------------------------------------------------------
storm[, BGN_DATE := as.Date(BGN_DATE, format = "%m/%d/%Y")]
storm[, END_DATE := as.Date(END_DATE, format = "%m/%d/%Y")]
storm <- storm[BGN_DATE >= "1996-01-01"]
## -----------------------------------------------------------------
## 03B2 - Make the dataset a little bit more manageable
## by only keeping selected columns
## -----------------------------------------------------------------
keep <- c("BGN_DATE", "TIME_ZONE", "COUNTY", "STATE", "EVTYPE", "F",
"MAG", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP",
"CROPDMG", "CROPDMGEXP", "WFO", "LATITUDE", "LONGITUDE")
storm <- storm[, keep, with = FALSE]
## -----------------------------------------------------------------
## 03B3 - Tidy up the EVTYPE column
## - Process the CROPDMGEXP and PROPDMGEXP columns
## - Save to a smaller and more manageable dataset
## -----------------------------------------------------------------
storm[, EVTYPE := trimws(gsub("[^A-Z]", "", toupper(EVTYPE)))]
## Write a basic function to process the multiplication.
## Input = The character column.
## Output = The converted value as numeric.
EXP <- c("K", "M", "B")
EXPV <- c("1000", "1e+06", "1e+09")
EXPF <- function(invec) {
as.numeric(
as.character(
factor(replace(invec, !invec %in% EXP, NA), EXP, EXPV)))}
## Replace the PROPDMG and CROPDMG columns.
## Drop the PROPDMGEXP and CROPDMGEXP columns.
storm[, PROPDMG := PROPDMG * EXPF(PROPDMGEXP)][, PROPDMGEXP := NULL]
storm[, CROPDMG := CROPDMG * EXPF(CROPDMGEXP)][, CROPDMGEXP := NULL]
## Resave and move on.
save(storm, file = "storm_final.Rdata")
} else {
## -------------------------------------------------------------------
## 03C - LOADING THE PROCESSED DATA AND PROCEED WITH ANALYSIS --------
## -------------------------------------------------------------------
cat("Loading processed storm data file.")
load("storm_final.Rdata")
}
## Loading processed storm data file.
names(storm)
## [1] "BGN_DATE" "TIME_ZONE" "COUNTY" "STATE" "EVTYPE"
## [6] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [11] "CROPDMG" "WFO" "LATITUDE" "LONGITUDE"
The raw data includes many values in EVTYPE, too many to present meaningfully in a summary table. When looking at the events most harmful to a population’s head, the following steps were taken:
EVTYPE.sum of fatalities or injuries for each EVTYPE.The following table lists the top 15 events that have caused fatalities.
Fatalities <- storm %>%
group_by(EVTYPE) %>%
summarise(FATALITIES = sum(FATALITIES, na.rm = TRUE)) %>%
arrange(desc(FATALITIES)) %>%
head(15)
knitr::kable(Fatalities)
| EVTYPE | FATALITIES |
|---|---|
| EXCESSIVEHEAT | 1797 |
| TORNADO | 1511 |
| FLASHFLOOD | 887 |
| LIGHTNING | 651 |
| FLOOD | 414 |
| RIPCURRENT | 340 |
| TSTMWIND | 241 |
| HEAT | 237 |
| HIGHWIND | 235 |
| AVALANCHE | 223 |
| RIPCURRENTS | 202 |
| WINTERSTORM | 191 |
| THUNDERSTORMWIND | 130 |
| EXTREMECOLDWINDCHILL | 125 |
| EXTREMECOLD | 115 |
The same process was used to calculate the top 15 events that have cause injuries, as displayed in the following table.
Injuries <- storm %>%
group_by(EVTYPE) %>%
summarise(INJURIES = sum(INJURIES, na.rm = TRUE)) %>%
arrange(desc(INJURIES)) %>%
head(15)
knitr::kable(Injuries)
| EVTYPE | INJURIES |
|---|---|
| TORNADO | 20667 |
| FLOOD | 6758 |
| EXCESSIVEHEAT | 6391 |
| LIGHTNING | 4141 |
| TSTMWIND | 3629 |
| FLASHFLOOD | 1674 |
| THUNDERSTORMWIND | 1400 |
| WINTERSTORM | 1292 |
| HURRICANETYPHOON | 1275 |
| HEAT | 1222 |
| HIGHWIND | 1083 |
| WILDFIRE | 911 |
| HAIL | 713 |
| FOG | 712 |
| HEAVYSNOW | 698 |
There are several events which intersect:
matchedEvents <- sprintf("* %s\n", Injuries$EVTYPE[Injuries$EVTYPE %in% Fatalities$EVTYPE])
cat("", matchedEvents)
A similar approach was taken to identify the events which had the greatest economic consequences. To understand this impact, totals for property damage (PROPDMG) and crop damage (CROPDMG) were calculated grouped by EVTYPE.
As in the previous section, only the top 15 values were identified.
When calculating the PROPDMG values, the totals for EVTYPE == "FLOOD" seemed to be inaccurate. The anomalous value was cross referenced with the original raw data, which includes a column titled REMARKS. In this column, the following was noted:
Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
The assumption that can be made is that instead of a billion multiplier, the multiplier should have been million.
## Initial run -- but numbers seem off...
PD_SUM <- storm %>%
group_by(EVTYPE) %>%
summarise(PD_SUM = sum(PROPDMG, na.rm = TRUE)) %>%
arrange(desc(PD_SUM)) %>%
head(15)
## Identify incorrect FLOOD value and replace it with a more reasonable value
fixme <- storm[EVTYPE == "FLOOD", which.max(PROPDMG)]
storm[EVTYPE == "FLOOD"][fixme]$PROPDMG <-
storm[EVTYPE == "FLOOD"][fixme]$PROPDMG/1000
## Recalculate property damage totals
PD_SUM <- storm %>%
group_by(EVTYPE) %>%
summarise(PD_SUM = sum(PROPDMG, na.rm = TRUE)) %>%
arrange(desc(PD_SUM)) %>%
mutate(EVTYPE = as.character(EVTYPE)) %>%
head(15)
## Required for proper sorting in the plot
PD_SUM$EVTYPE <- factor(PD_SUM$EVTYPE, PD_SUM$EVTYPE)
## Generate table and chart
PropertyDamage <- barchart(PD_SUM/1e+09 ~ EVTYPE, PD_SUM,
scales = list(x = list(rot = 45)),
xlab = "Event Type",
ylab = "Damage Cost (in USD Billion)",
main = "Property Damage Cost of Top 15 Event Types")
knitr::kable(PD_SUM %>%
mutate("Property Damage (in Billions)" =
sprintf("$ %6.2f", PD_SUM/1e+9)) %>%
select(-PD_SUM))
| EVTYPE | Property Damage (in Billions) |
|---|---|
| HURRICANETYPHOON | $ 69.31 |
| STORMSURGE | $ 43.19 |
| FLOOD | $ 29.06 |
| TORNADO | $ 24.62 |
| FLASHFLOOD | $ 15.22 |
| HAIL | $ 14.60 |
| HURRICANE | $ 11.81 |
| TROPICALSTORM | $ 7.64 |
| HIGHWIND | $ 5.25 |
| WILDFIRE | $ 4.76 |
| STORMSURGETIDE | $ 4.64 |
| TSTMWIND | $ 4.49 |
| ICESTORM | $ 3.64 |
| THUNDERSTORMWIND | $ 3.38 |
| WILDFORESTFIRE | $ 3.00 |
PropertyDamage
The same approach was taken for calculating crop damage.
## Calculate crop damage totals
CROP_SUM <- storm %>%
group_by(EVTYPE) %>%
summarise(CROP_SUM = sum(CROPDMG, na.rm = TRUE)) %>%
arrange(desc(CROP_SUM)) %>%
head(15)
## Required for proper sorting in the plot
CROP_SUM$EVTYPE <- factor(CROP_SUM$EVTYPE, CROP_SUM$EVTYPE)
## Generate table and chart
CropDamage <- barchart(CROP_SUM/1e+09 ~ EVTYPE, CROP_SUM,
scales = list(x = list(rot = 45)),
xlab = "Event Type",
ylab = "Damage Cost (in USD Billion)",
main = "Crop Damage Cost of Top 15 Event Types")
knitr::kable(CROP_SUM %>%
mutate("Crop Damage (in Billions)" =
sprintf("$ %6.2f", CROP_SUM/1e+9)) %>%
select(-CROP_SUM))
| EVTYPE | Crop Damage (in Billions) |
|---|---|
| DROUGHT | $ 13.37 |
| FLOOD | $ 4.97 |
| HURRICANE | $ 2.74 |
| HURRICANETYPHOON | $ 2.61 |
| HAIL | $ 2.48 |
| FLASHFLOOD | $ 1.33 |
| EXTREMECOLD | $ 1.31 |
| FROSTFREEZE | $ 1.09 |
| HEAVYRAIN | $ 0.73 |
| TROPICALSTORM | $ 0.68 |
| HIGHWIND | $ 0.63 |
| TSTMWIND | $ 0.55 |
| EXCESSIVEHEAT | $ 0.49 |
| THUNDERSTORMWIND | $ 0.40 |
| WILDFIRE | $ 0.30 |
CropDamage
The overall damage can be calculated as follows:
OVERALL <- storm %>%
mutate(PROPDMG = replace(PROPDMG, is.na(PROPDMG), 0),
CROPDMG = replace(CROPDMG, is.na(CROPDMG), 0),
DMG = PROPDMG + CROPDMG) %>%
group_by(EVTYPE) %>%
summarise(DMG = sum(DMG)) %>%
arrange(desc(DMG)) %>%
head(15)
## Required for proper sorting in the plot
OVERALL$EVTYPE <- factor(OVERALL$EVTYPE, OVERALL$EVTYPE)
## Generate table and chart
OverallDamage <- barchart(DMG/1e+09 ~ EVTYPE, OVERALL,
scales = list(x = list(rot = 45)),
xlab = "Event Type",
ylab = "Damage Cost (in USD Billion)",
main = "Overall Damage Cost of Top 15 Event Types")
knitr::kable(OVERALL %>%
mutate("Overall Damage (in Billions)" =
sprintf("$ %6.2f", DMG/1e+9)) %>%
select(-DMG))
| EVTYPE | Overall Damage (in Billions) |
|---|---|
| HURRICANETYPHOON | $ 71.91 |
| STORMSURGE | $ 43.19 |
| FLOOD | $ 34.03 |
| TORNADO | $ 24.90 |
| HAIL | $ 17.07 |
| FLASHFLOOD | $ 16.56 |
| HURRICANE | $ 14.55 |
| DROUGHT | $ 14.41 |
| TROPICALSTORM | $ 8.32 |
| HIGHWIND | $ 5.88 |
| WILDFIRE | $ 5.05 |
| TSTMWIND | $ 5.04 |
| STORMSURGETIDE | $ 4.64 |
| THUNDERSTORMWIND | $ 3.78 |
| ICESTORM | $ 3.66 |
OverallDamage
This should only be considered a very preliminary analysis. As is evident in the data summaries provided, there is still a fair amount of data cleaning still necessary. For instance, among the most economically impactful events were TSTMWIND and THUNDERSTORMWIND. It is likely that these are different terms for the same EVTYPE and should thus be added to gether. Similarly, there are different degrees mentioned for certain event types; these can also be treated as a common cluster in subsequent analyses.