U.S. Storm Data Analysis

Synopsis

The goal of this analysis is to explore the NOAA Storm Database and answer basic questions about severe weather events in order to protect human lives and property.

The analysis presented here specifically focuses on the following questions:

Across the United States, which types of events are the most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

This analysis is intended for local and federal authorities and NGOs that make resource-allocation and funding decisions, so that they can prepare for these types of natural events and minimize their impact on human life and on the local and U.S. economies.

U.S. Storm Data

Storm Data is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents:

The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce;
Rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area; and
Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event.

Data Processing & Transformations

The data can be downloaded with the following code. The code checks the working directory for the compressed file “data.csv.bz2”. If the file does not exist in the folder, the code downloads it from the web.

suppressWarnings(suppressMessages(library(dplyr)))
library(tidyr)
library(ggplot2)

# The compressed data file is expected in the working directory as "data.csv.bz2".
# If it does not exist, it is downloaded automatically.
# read.table() reads the bz2-compressed file directly, so no separate unzip step is needed.

if (!file.exists("data.csv.bz2")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                  destfile = "data.csv.bz2")
}
Raw_Data <- read.table("data.csv.bz2", header = TRUE, sep = ",", stringsAsFactors = FALSE, na.strings = "NA")

The data has the following structure.

str(Raw_Data)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The data has 37 columns (variables) and 902297 rows (records).

Of these data fields, we only need the following fields to answer the aforementioned questions.

EVTYPE: Type of event
FATALITIES: Number of fatalities
INJURIES: Number of injuries
PROPDMG: Property damage amount
PROPDMGEXP: Magnitude of the property damage (K = thousand, M = million, B = billion)
CROPDMG: Crop damage amount
CROPDMGEXP: Magnitude of the crop damage (K = thousand, M = million, B = billion)

Source: NOAA Storm Database Data Dictionary

To answer the first question, we need to extract, group, summarize and plot the data by event type for fatalities and injuries.

# Extract data to answer question #1
Q1_Data <- select(Raw_Data, EVTYPE, FATALITIES, INJURIES)

# Summarize, organize and sort the data by event type for fatalities.
# Keep event types with at least 100 fatalities (to reduce data for the bar plot).
Q1_Fatalities <- Q1_Data %>%
    group_by(EVTYPE) %>%
    summarise(Total_Fatalities = sum(FATALITIES)) %>%
    arrange(desc(Total_Fatalities)) %>%
    filter(Total_Fatalities >= 100)

# Review the result of the data transformation for fatalities
head(Q1_Fatalities)

## # A tibble: 6 x 2
##           EVTYPE Total_Fatalities
##            <chr>            <dbl>
## 1        TORNADO             5633
## 2 EXCESSIVE HEAT             1903
## 3    FLASH FLOOD              978
## 4           HEAT              937
## 5      LIGHTNING              816
## 6      TSTM WIND              504

# Summarize, organize and sort the data by event type for injuries.
# Keep event types with at least 500 injuries (to simplify the plot).
Q1_Injuries <- Q1_Data %>%
    group_by(EVTYPE) %>%
    summarise(Total_Injuries = sum(INJURIES)) %>%
    arrange(desc(Total_Injuries)) %>%
    filter(Total_Injuries >= 500)

# Review the result of the data transformation for injuries
head(Q1_Injuries)

## # A tibble: 6 x 2
##           EVTYPE Total_Injuries
##            <chr>          <dbl>
## 1        TORNADO          91346
## 2      TSTM WIND           6957
## 3          FLOOD           6789
## 4 EXCESSIVE HEAT           6525
## 5      LIGHTNING           5230
## 6           HEAT           2100
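
The two tables above list closely related labels, such as HEAT and EXCESSIVE HEAT, as separate event types, because EVTYPE is grouped exactly as recorded. One possible refinement, which is not part of the analysis above, is to normalize the labels before grouping. The following is a minimal sketch on made-up labels; the helper name normalize_evtype and the single alias it expands (TSTM to THUNDERSTORM) are assumptions chosen for illustration.

```r
# Hypothetical helper: trim, upper-case and collapse repeated whitespace in
# event labels, then expand the abbreviation TSTM (an assumed alias).
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("\\s+", " ", x, perl = TRUE)
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)
  x
}

# Made-up labels that differ only in case, spacing or abbreviation
labels <- c("TSTM WIND", " Thunderstorm  Wind", "tstm wind", "HEAT")
normalized <- normalize_evtype(labels)

# Count how many raw labels fall into each normalized label
print(table(normalized))
```

Applied to the real data, such a step would go before group_by(EVTYPE); which labels should be merged is a judgment call that depends on the event definitions being used.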

To answer the second question, we need to extract and transform the data as well.

# Extract data to answer question #2
# To calculate the total damage by event, we will add property and crop damage
Q2_Data <- select(Raw_Data, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
head(Q2_Data)

##    EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO    25.0          K       0           
## 2 TORNADO     2.5          K       0           
## 3 TORNADO    25.0          K       0           
## 4 TORNADO     2.5          K       0           
## 5 TORNADO     2.5          K       0           
## 6 TORNADO     2.5          K       0

To find the total cost, we need to add property damage and crop damage. For each row, we multiply the numeric PROPDMG and CROPDMG values by the multipliers encoded in PROPDMGEXP and CROPDMGEXP (K = 1,000, M = 1,000,000, B = 1,000,000,000). For example, a PROPDMG of 25 with a PROPDMGEXP of K corresponds to $25,000. To achieve this, we first replace the K, M and B codes with their numeric values. The first rows of the resulting summary by event type are:

## # A tibble: 6 x 2
##                       EVTYPE Total_Cost
##                        <chr>      <dbl>
## 1 TORNADOES, TSTM WIND, HAIL 1602500000
## 2            HIGH WINDS/COLD  117500000
## 3  HURRICANE OPAL/HIGH WINDS  110000000
## 4    WINTER STORM HIGH WINDS   65000000
## 5       Heavy Rain/High Surf   15000000
## 6            LAKESHORE FLOOD    7540000
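
The code that performs the K, M and B conversion is not reproduced in this report. The following is a minimal sketch of that kind of conversion, run on a small made-up data frame rather than on Raw_Data. The helper name exp_to_multiplier, the toy values, and the decision to treat any code other than K, M or B (in either case) as a multiplier of 1 are all assumptions, and the sketch is not meant to reproduce the table above.

```r
library(dplyr)

# Hypothetical helper: map an exponent code to a numeric multiplier.
# Assumption: only K, M and B (upper or lower case) are meaningful;
# blanks and any other codes are treated as a multiplier of 1.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  mult[is.na(mult)] <- 1
  mult
}

# Made-up rows with the same columns as Q2_Data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  CROPDMG    = c(0, 1, 500, 5),
  CROPDMGEXP = c("", "K", "M", "k"),
  stringsAsFactors = FALSE
)

# Cost per row = property damage + crop damage, each scaled by its multiplier;
# then total by event type, largest first.
Q2_Summary_toy <- toy %>%
  mutate(Total_Cost = PROPDMG * exp_to_multiplier(PROPDMGEXP) +
                      CROPDMG * exp_to_multiplier(CROPDMGEXP)) %>%
  group_by(EVTYPE) %>%
  summarise(Total_Cost = sum(Total_Cost)) %>%
  arrange(desc(Total_Cost))

print(Q2_Summary_toy)
```

In this toy input, FLOOD ranks first because 1.2 B of property damage plus 500 M of crop damage gives 1.7 billion. The scaling is done per row before summing, since rows of the same event type can carry different exponent codes.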

Analysis & Results

Which types of events are the most harmful to population health?

To find out which natural events are the most harmful, we plot the data manipulated above.

pl <- ggplot(Q1_Fatalities, aes(x = reorder(EVTYPE, -Total_Fatalities), y = Total_Fatalities)) +
    geom_bar(stat = "identity", fill = "red") +
    ggtitle("Top 20 Fatalities by Event Type in the US") +
    labs(x = "Event Type", y = "Fatalities") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          plot.title = element_text(hjust = 0.5))
print(pl)

This plot indicates that tornadoes (5,633 deaths), excessive heat (1,903) and flash floods (978) are the leading causes of weather-related fatalities in the U.S.

pl <- ggplot(Q1_Injuries, aes(x = reorder(EVTYPE, -Total_Injuries), y = Total_Injuries)) +
    geom_bar(stat = "identity", fill = "blue") +
    ggtitle("Top Injury Causes by Event Type in the US") +
    labs(x = "Event Type", y = "Injuries") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          plot.title = element_text(hjust = 0.5))
print(pl)

According to the plot above, tornadoes cause by far the most injuries (91,346), followed by thunderstorm winds, floods and excessive heat.

Which types of events have the greatest economic consequences?

These natural events damage both property and crops. The plot below shows the total cost of this damage to the U.S. economy by event type.

pl <- ggplot(Q2_Summary, aes(x = reorder(EVTYPE, -Total_Cost), y = Total_Cost)) +
    geom_bar(stat = "identity", fill = "green") +
    ggtitle("Top Damage Causes by Event Type in the US") +
    labs(x = "Event Type", y = "Total Damage ($)") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          plot.title = element_text(hjust = 0.5))
print(pl)

This plot indicates that events combining tornadoes and strong winds cause the largest economic losses. The top category, "TORNADOES, TSTM WIND, HAIL", alone accounts for about $1.6 billion in damage.

Educating the general public and local authorities may reduce this damage.