Exploring the U.S. NOAA Storm Database

Michelsone Presendieu

August 12, 2018

Introduction

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The analysis below identifies the storm event types that cause the most injuries and fatalities, and then the event types that cause the greatest property damage.

Synopsis

The analysis of the storm event database shows that tornadoes are the weather event most harmful to population health. The second most harmful event type is excessive heat. The economic impact of weather events was also analyzed. Flash floods and thunderstorm winds caused billions of dollars in property damage between 1950 and 2011. The largest crop damage was caused by droughts, followed by floods and hail.

Load required libraries and basic settings

# Load required libraries. Set warning = FALSE and message = FALSE to hide the
# verbose messages printed while importing libraries
suppressMessages(library("ggplot2"))
## Warning: package 'ggplot2' was built under R version 3.4.4
suppressMessages(library("gridExtra"))
## Warning: package 'gridExtra' was built under R version 3.4.4
suppressMessages(library("R.utils"))
## Warning: package 'R.utils' was built under R version 3.4.4
## Warning: package 'R.oo' was built under R version 3.4.4
options(scipen = 1) # bias number printing toward fixed (non-scientific) notation
knitr::opts_chunk$set(fig.width=8, fig.height=4) 

Reading Data

We will download the data file and unzip it.

setwd("~/Coursera/Reproducibile Research/Week_3/Project")

if (!"storm_data.csv.bz2" %in% dir()) {
        download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "storm_data.csv.bz2")
        bunzip2("storm_data.csv.bz2", overwrite = TRUE, remove = FALSE)
}
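As a side note to the download step, base R can usually read a bzip2-compressed CSV directly, because `read.csv()` opens the path through `file()`, which detects the compression when reading. The sketch below illustrates this on a tiny made-up file (the temporary file and its two rows are assumptions for illustration, not the storm data):

```r
# Minimal sketch: read a bzip2-compressed CSV without unzipping it first.
# The file and its contents are made up purely for illustration.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "FLOOD,0"), con)
close(con)

# read.csv() opens the path with file(), which detects bzip2 compression
demo <- read.csv(tmp, stringsAsFactors = FALSE)
print(demo)
```

Unzipping first, as done above, keeps a plain csv on disk, which can be convenient for inspecting the file with other tools.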

Then we will read the generated csv file. If the data already exist in the working environment, we do not need to load them again. Otherwise, we read the csv file into a data frame.

if (!"storm_data" %in% ls()) {
        storm_data <- read.csv("storm_data.csv", sep = ",", stringsAsFactors = FALSE)
}
# Quick look at the data structure
str(storm_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
head(storm_data, n = 3)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
names(storm_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
# Trim the dataset to the required columns
storm_event <- storm_data[, c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", 
    "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

storm_event$year <- as.numeric(format(as.Date(storm_data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))

# Create subset for Question 1 and Question 2

# Select events with at least one fatality or injury for Question 1
event_health <- subset(storm_event, FATALITIES != 0 | INJURIES != 0, 
    select = c(EVTYPE, FATALITIES, INJURIES))

# Select events reporting both property and crop damage for Question 2
event_economic <- subset(storm_event, !storm_event$PROPDMG == 0 & !storm_event$CROPDMG == 
    0, select = c(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
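How the two conditions in a `subset()` call are combined decides which rows survive the filter. The toy sketch below (made-up rows, not taken from the storm data) compares combining the conditions with `&` and with `|`:

```r
# Toy data, invented for illustration only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(2, 5, 0, 0),
  INJURIES   = c(10, 0, 3, 0),
  stringsAsFactors = FALSE
)

# `&` keeps rows where BOTH counts are non-zero
both <- subset(toy, FATALITIES != 0 & INJURIES != 0)

# `|` keeps rows where AT LEAST ONE count is non-zero
either <- subset(toy, FATALITIES != 0 | INJURIES != 0)

print(both$EVTYPE)
print(either$EVTYPE)
```

With `&`, an event that caused deaths but no injuries is dropped before any totals are computed, so the choice of operator affects the sums that follow.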

The trimmed dataset has 902297 rows and 9 columns. The histogram below shows how the number of recorded events changes over time; the number of events tracked starts to increase significantly around 1995.

hist(storm_event$year, main = "Event Timeline in Years", xlab = "Years", ylab = "# of Events", breaks = 25)

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
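The year range and per-year counts can also be checked numerically rather than read off the histogram. A minimal sketch on a few date strings in the same format as BGN_DATE (the third and fourth values are made up for illustration):

```r
# Date strings in the BGN_DATE format; the sample itself is illustrative
bgn_date <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
              "11/30/2011 0:00:00", "6/8/1951 0:00:00")

# as.Date() parses the leading month/day/year and ignores the trailing time
year <- as.numeric(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y"))

year_range <- range(year)   # first and last year present
per_year   <- table(year)   # number of events per year

print(year_range)
print(per_year)
```

Applied to `storm_event$year`, the same two calls would give the covered period and the yearly event counts.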

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Data Processing

This step prepares the data needed to present the events most harmful to population health.

event_health_death <- aggregate(event_health$FATALITIES, by = list(event_health$EVTYPE), 
    FUN = sum)
# Give proper name for columns
colnames(event_health_death) <- c("EVENTTYPE", "FATALITIES")

# Injuries
event_health_inj <- aggregate(event_health$INJURIES, by = list(event_health$EVTYPE), 
    FUN = sum)

# Give column name
colnames(event_health_inj) <- c("EVENTTYPE", "INJURIES")

# Sort both datasets in decreasing order and keep the top 5 events in each
event_health_death <- event_health_death[order(event_health_death$FATALITIES, decreasing = TRUE), 
    ][1:5, ]

event_health_inj <- event_health_inj[order(event_health_inj$INJURIES, decreasing = TRUE), 
    ][1:5, ]
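The aggregate, sort, and truncate pattern used above can be sketched on a small made-up table. This version uses the formula interface of `aggregate()`, which keeps the column names, so no renaming step is needed; the data values are assumptions for illustration:

```r
# Toy data, invented for illustration only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 4, 5, 2, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities per event type; the result keeps the names EVTYPE and FATALITIES
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort by total in decreasing order and keep the top 2 rows
top2 <- head(totals[order(totals$FATALITIES, decreasing = TRUE), ], 2)
print(top2)
```

`head(x, n)` also behaves well when fewer than `n` rows exist, whereas indexing with `[1:n, ]` pads the result with NA rows.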

Results

Plot the top 5 causes of fatalities and of injuries

# plot top 5 events for fatalities and injuries

# Plot fatalities and store the plot in death_plot
death_plot <- ggplot() + geom_bar(data = event_health_death, aes(x = EVENTTYPE, 
    y = FATALITIES, fill = interaction(FATALITIES, EVENTTYPE)), stat = "identity", 
    show.legend = FALSE) + theme(axis.text.x = element_text(angle = 30, hjust = 1)) + 
    xlab("Harmful Events") + ylab("# of Fatalities") + ggtitle("Top 5 Weather Events - Fatalities")

# Plot injuries and store the plot in inj_plot
inj_plot <- ggplot() + geom_bar(data = event_health_inj, aes(x = EVENTTYPE, y = INJURIES, 
    fill = interaction(INJURIES, EVENTTYPE)), stat = "identity", show.legend = F) + 
    theme(axis.text.x = element_text(angle = 30, hjust = 1)) + xlab("Harmful Events") + 
    ylab("# of Injuries") + ggtitle("Top 5 Weather Events - Injuries")

# Draw two plots generated above dividing space in two columns

grid.arrange(death_plot, inj_plot, ncol = 2)

Tornadoes are the leading cause of harm to population health, for both fatalities and injuries.
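In the plots above the bars follow the alphabetical order of the event names, because ggplot2 places a discrete x variable by factor level. One possible way to have the bars appear in ranked order is to reorder the factor levels with `reorder()` before plotting. A minimal sketch with made-up totals (the numbers are illustrative assumptions, not results from the database):

```r
# Made-up totals, for illustration only
top <- data.frame(
  EVENTTYPE  = c("EXCESSIVE HEAT", "FLOOD", "TORNADO"),
  FATALITIES = c(50, 20, 90),
  stringsAsFactors = FALSE
)

# reorder() sorts levels by ascending score; negating gives largest first
top$EVENTTYPE <- reorder(top$EVENTTYPE, -top$FATALITIES)

print(levels(top$EVENTTYPE))
```

Passing the reordered column to `aes(x = EVENTTYPE, ...)` would then draw the bars from the largest total to the smallest.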

2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

This step prepares the data needed to present the events with the greatest economic damage.

# Keep only entries whose damage exponents are K, M or B (either case)
valid_exp <- c("K", "M", "B")
event_economic <- subset(event_economic, toupper(PROPDMGEXP) %in% valid_exp & 
    toupper(CROPDMGEXP) %in% valid_exp)

# Convert the exponent codes (K, M, B) to numeric multipliers
event_economic$PROPDMGEXP <- gsub("m", 1e+06, event_economic$PROPDMGEXP, ignore.case = TRUE)
event_economic$PROPDMGEXP <- gsub("k", 1000, event_economic$PROPDMGEXP, ignore.case = TRUE)
event_economic$PROPDMGEXP <- gsub("b", 1e+09, event_economic$PROPDMGEXP, ignore.case = TRUE)
event_economic$PROPDMGEXP <- as.numeric(event_economic$PROPDMGEXP)
event_economic$CROPDMGEXP <- gsub("m", 1e+06, event_economic$CROPDMGEXP, ignore.case = TRUE)
event_economic$CROPDMGEXP <- gsub("k", 1000, event_economic$CROPDMGEXP, ignore.case = TRUE)
event_economic$CROPDMGEXP <- gsub("b", 1e+09, event_economic$CROPDMGEXP, ignore.case = TRUE)
event_economic$CROPDMGEXP <- as.numeric(event_economic$CROPDMGEXP)

# then sum the damages by each event type
event_economic$TOTALDMG <- (event_economic$CROPDMG * event_economic$CROPDMGEXP) + 
    (event_economic$PROPDMG * event_economic$PROPDMGEXP)

event_economic <- aggregate(event_economic$TOTALDMG, by = list(event_economic$EVTYPE), 
    FUN = sum)

colnames(event_economic) <- c("EVTYPE", "TOTALDMG")

# Rank the event types by damage cost and keep the top 5 rows
event_economic <- event_economic[order(event_economic$TOTALDMG, decreasing = TRUE), 
    ]
event_economic <- event_economic[1:5, ]
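The conversion of exponent codes to dollar amounts can also be sketched with a named lookup vector instead of repeated `gsub()` calls. The name `exp_lookup` and the sample rows are assumptions introduced for illustration:

```r
# Hypothetical lookup table: exponent code -> numeric multiplier
exp_lookup <- c(K = 1e3, M = 1e6, B = 1e9)

# Toy data, invented for illustration only
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2),
  PROPDMGEXP = c("K", "m", "B"),
  stringsAsFactors = FALSE
)

# Upper-case the codes, then index the lookup vector by name
multiplier <- unname(exp_lookup[toupper(toy$PROPDMGEXP)])
toy$PROPDMG_USD <- toy$PROPDMG * multiplier

print(toy)
```

Codes that are not in the lookup vector come back as NA, which makes unexpected exponent values easy to spot.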

Results

# Now plot the graph
ggplot() + geom_bar(data = event_economic, aes(x = EVTYPE, y = TOTALDMG, fill = interaction(TOTALDMG, 
    EVTYPE)), stat = "identity", show.legend = F) + theme(axis.text.x = element_text(angle = 30, 
    hjust = 1)) + xlab("Event Type") + ylab("Total Damage")

Floods are the leading cause of economic damage.

Summary

From these data and graphs, we found that tornadoes are the most harmful events with respect to population health, while floods have the greatest economic consequences.