Exploring the US National Oceanic and Atmospheric Administration’s (NOAA) Storm Database

Peer-Graded Assignment for Johns Hopkins University ‘Reproducible Research’ Course on Coursera

Arash Tavassoli | September 4, 2018

Synopsis

This report explores the US National Oceanic and Atmospheric Administration’s (NOAA) Storm Database, covering 1950 to 2011, to identify the weather events most harmful to public health (fatalities and injuries) and those with the greatest economic impact. The analysis finds that tornadoes were by far the most harmful events over that period, with the highest totals of both fatalities and injuries, while floods caused the greatest economic loss (property and crop damage combined).

Data Processing

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file is downloaded from the course website (if not downloaded already):

if(!file.exists("repdata-data-StormData.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
  destfile = "repdata-data-StormData.csv.bz2", method = "curl")
}

Before proceeding, the required package is loaded into R:

library(dplyr)

The data is then loaded into R using the read.csv() function:

raw.data <- read.csv("repdata-data-StormData.csv.bz2")
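read.csv() can read the bzip2 file directly because R's file connections detect common compression formats when a file is opened for reading, so no separate decompression step is needed. A minimal sketch of that behaviour, using a small made-up table and a temporary file (tmp and toy are illustrative names, not part of the analysis):

```r
# Write a tiny illustrative table to a bzip2-compressed CSV file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# read.csv() decompresses the file transparently when given its path.
toy <- read.csv(tmp)
str(toy)
```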

The names() function is called to see the variables in the dataset:

names(raw.data)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Out of the 37 variables in the dataset, only the following 7 are of interest for the current assignment:

Variable name	Description
EVTYPE	Event Type
FATALITIES	Number of fatalities
INJURIES	Number of injuries
PROPDMG	Estimated property damage
PROPDMGEXP	Character signifying the magnitude of the number (K, M, B)
CROPDMG	Estimated crop damage
CROPDMGEXP	Character signifying the magnitude of the number (K, M, B)

Therefore the dataset is filtered to include only the 7 variables. This assists with further analysis of the data:

data <- select(raw.data, EVTYPE, FATALITIES:CROPDMGEXP)
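The FATALITIES:CROPDMGEXP range selects every column positioned between those two names, so it relies on the column order shown by names() above. A small sketch with a made-up, one-row data frame (toy and kept are hypothetical names; the columns only mimic the order in the storm data):

```r
library(dplyr)

# One made-up row with columns before, inside and after the range.
toy <- data.frame(EVTYPE = "TORNADO", MAG = 0,
                  FATALITIES = 0, INJURIES = 15,
                  PROPDMG = 25, PROPDMGEXP = "K",
                  CROPDMG = 0, CROPDMGEXP = "",
                  WFO = "")

# Keep EVTYPE plus the contiguous block FATALITIES..CROPDMGEXP.
kept <- select(toy, EVTYPE, FATALITIES:CROPDMGEXP)
names(kept)
```

MAG and WFO fall outside the range and are dropped, leaving the 7 variables of interest.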

The next step is to compute the estimated property and crop damage in dollars using the magnitude indexes (“K” for thousands, “M” for millions, and “B” for billions). This is done by generating two new variables called PROPtotDMG and CROPtotDMG for the property and crop damages, respectively.

First a summary table of the current values for the magnitude index variables PROPDMGEXP and CROPDMGEXP is generated:

table(data$PROPDMGEXP)

## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

table(data$CROPDMGEXP)

## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

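The indexes “K”, “M” and “B” are taken as thousands, millions and billions, with lowercase “k” and “m” treated like their uppercase forms. All other indexes are assumed to mean a multiplier of 1 (i.e. they are treated like blank values).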

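Using a for() loop, the indices are converted to the corresponding amplification factors:

# Symbols found in the data and the multiplier assumed for each.
# "1" is listed explicitly so that every observed symbol has a mapping
# (previously it only worked because "1" already parses as the number 1).
exp <- c("", "-", "?", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8",
         "h", "H", "k", "K", "m", "M", "B")
num.exp <- c(rep(1, 15), 10^3, 10^3, 10^6, 10^6, 10^9)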

data$PROPDMGEXP <- as.character(data$PROPDMGEXP)
data$CROPDMGEXP <- as.character(data$CROPDMGEXP)

for(i in 1:length(exp)) {
    
    data$PROPDMGEXP [data$PROPDMGEXP == exp[i]] <- num.exp[i]
    data$CROPDMGEXP [data$CROPDMGEXP == exp[i]] <- num.exp[i]
}

data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
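The same mapping can also be written without a loop, using a named vector as a lookup table. The sketch below is an alternative formulation, not the code used in this report; multiplier, exp_to_factor and symbols are hypothetical names, and unlike the loop above it would also treat a lowercase "b" as billions (no such value appears in the summary tables above).

```r
# Multipliers for the recognised symbols; everything else maps to 1.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

exp_to_factor <- function(x) {
  # Look up each symbol by name, ignoring case; unknown symbols give NA.
  f <- unname(multiplier[toupper(as.character(x))])
  # Blank, punctuation and digit symbols are assumed to mean "times 1".
  f[is.na(f)] <- 1
  f
}

symbols <- c("K", "m", "", "?", "5", "B")
exp_to_factor(symbols)

# Made-up damage values scaled by their symbols.
c(25, 2.5, 10, 3, 1, 0.5) * exp_to_factor(symbols)
```

Because the lookup is vectorised, it handles a whole column in one call, e.g. exp_to_factor(data$PROPDMGEXP) applied before the column is converted.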

An updated summary table of the values for the magnitude index variables PROPDMGEXP and CROPDMGEXP is generated to ensure all indices are correctly converted:

table(data$PROPDMGEXP)

## 
##      1   1000  1e+06  1e+09 
## 466255 424665  11337     40

table(data$CROPDMGEXP)

## 
##      1   1000  1e+06  1e+09 
## 618440 281853   1995      9
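One caveat with this check: as.numeric() turns any string it cannot parse into NA (with a warning), and table() leaves NA values out by default, so an unmapped symbol would not show up in the tables above. A small sketch of an extra sanity check, using a made-up vector (converted is an illustrative name):

```r
# "x" stands in for a symbol that was never mapped to a multiplier.
converted <- suppressWarnings(as.numeric(c("1", "1000", "1e+06", "x")))

anyNA(converted)                   # flags the unparsed value
table(converted)                   # NA is silently omitted here
table(converted, useNA = "ifany")  # NA gets its own column here
```

On the real data, anyNA(data$PROPDMGEXP) and anyNA(data$CROPDMGEXP) would both be expected to return FALSE if every symbol was covered.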

The last processing step is to multiply the estimated property and crop damages by their magnitude factors and save the resulting values in the new PROPtotDMG and CROPtotDMG variables:

data$PROPtotDMG <- data$PROPDMG * data$PROPDMGEXP
data$CROPtotDMG <- data$CROPDMG * data$CROPDMGEXP

The PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP variables are no longer required and can be removed from the dataset:

data$PROPDMG <- NULL
data$PROPDMGEXP <- NULL
data$CROPDMG <- NULL
data$CROPDMGEXP <- NULL

A quick look at the dataset:

head(data)

##    EVTYPE FATALITIES INJURIES PROPtotDMG CROPtotDMG
## 1 TORNADO          0       15      25000          0
## 2 TORNADO          0        0       2500          0
## 3 TORNADO          0        2      25000          0
## 4 TORNADO          0        2       2500          0
## 5 TORNADO          0        2       2500          0
## 6 TORNADO          0        6       2500          0

Results

The key focus of the current project is to answer the following two questions:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

To be able to answer these questions, the tidy dataset from the previous section is grouped by the event type variable EVTYPE:

event.data <- group_by(data, EVTYPE)

The top 10 event types based on “total fatalities”, “total injuries” and “total economic loss” are identified using a combination of arrange() and summarize() functions:

# Top 10 event types with maximum number of total fatalities:
top.fatal <- arrange(summarize(event.data, sum = sum(FATALITIES)), desc(sum))[1:10,]

# Top 10 event types with maximum number of total injuries:
top.injur <- arrange(summarize(event.data, sum = sum(INJURIES)), desc(sum))[1:10,]

# Top 10 event types with maximum economic loss (property and crop damage in total):
top.eco <- arrange(summarize(event.data, sum = sum(PROPtotDMG + CROPtotDMG)), desc(sum))[1:10,]
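The nested calls above can also be written as a dplyr pipeline, which reads in the order the steps are applied: group, total, sort, keep the first rows. A minimal sketch on made-up data (toy and top are hypothetical names, and the counts are illustrative, not results from the storm database):

```r
library(dplyr)

toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0)
)

top <- toy %>%
  group_by(EVTYPE) %>%                    # one group per event type
  summarize(total = sum(FATALITIES)) %>%  # total fatalities per type
  arrange(desc(total)) %>%                # largest total first
  head(2)                                 # keep the top 2 rows

top
```

With the real data, replacing head(2) by head(10) would give the same kind of top 10 table as the [1:10,] indexing used above.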

Finally, bar plots are constructed using the base plotting system to show the top 10 most harmful events with respect to population health (fatalities and injuries on separate plots) and the top 10 events with the greatest economic consequences:

par(mar = c(8,6,4,2), mgp = c(4,1,0))
barplot(top.fatal$sum, names.arg = top.fatal$EVTYPE, 
        main = "The most harmful events with respect to population health (total fatalities)", 
        ylab = "Total fatalities",
        cex.main = 0.9, cex.axis = .8, cex.names=0.7,
        col = "blue", las=2)

par(mar = c(8,6,4,2), mgp = c(4,1,0))
barplot(top.injur$sum, names.arg = top.injur$EVTYPE, 
        main = "The most harmful events with respect to population health (total injuries)", 
        ylab = "Total injuries",
        cex.main = 0.9, cex.axis = .8, cex.names=0.7,
        col = "red", las=2)

Tornadoes are found to be the most harmful event type with respect to population health, for both fatalities and injuries.

par(mar = c(8,6,4,2), mgp = c(4,1,0))
barplot(top.eco$sum, names.arg = top.eco$EVTYPE, 
        main = "Events with the greatest economic consequences (property and crop damage)", 
        ylab = "Total damages in $",
        cex.main = 0.9, cex.axis = .8, cex.names=0.7,
        col = "orange", las=2)

Floods are shown to have the greatest economic impact in terms of combined property and crop damage.
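Because the damage totals run into billions of dollars, the axis can be easier to read when the values are rescaled before plotting. A hedged sketch with made-up totals (toy.eco and its numbers are illustrative placeholders, not results from the data):

```r
# Made-up totals in dollars for three placeholder event types.
toy.eco <- data.frame(EVTYPE = c("EVENT A", "EVENT B", "EVENT C"),
                      sum = c(3e9, 2e9, 1e9))

# Express the totals in billions so the axis labels stay short.
toy.eco$billions <- toy.eco$sum / 1e9

par(mar = c(8, 6, 4, 2), mgp = c(4, 1, 0))
barplot(toy.eco$billions, names.arg = toy.eco$EVTYPE,
        ylab = "Total damages (billion $)",
        col = "orange", las = 2)
```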