Executive Summary

This report is a course project for the Reproducible Research course in the Data Science Specialization offered by Johns Hopkins University on Coursera.

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.

Project instructions

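Your data analysis must address the following questions:

1- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

2- Across the United States, which types of events have the greatest economic consequences?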

Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Documentation

Data Processing

1- Libraries

library(dplyr)
library(ggplot2)

2- Download the raw data (the URL appears in the code below) and load it into R.

# Setting working directory first
setwd("~/Coursera/8_Data_Science_Specialization/5_Reproducible_Research/Week_4/Assignment")

# Downloading the bzip2-compressed CSV file containing the data
if(!file.exists("repdata_data_StormData.csv.bz2")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
                  destfile = "repdata_data_StormData.csv.bz2", 
                  method = "curl")
}

# Loading the Data
A_1_StormData <- read.csv("repdata_data_StormData.csv.bz2") 
A_1_StormData <- data.frame(lapply(A_1_StormData, as.character), stringsAsFactors=FALSE)

3- For this study we will only keep EVTYPE, and the variables related to the impact on population health and economic consequences.

# Keep only the event type and the health and damage variables
A_2_StormData <- A_1_StormData[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
# EVTYPE is converted to uppercase to avoid case-sensitive duplicates during data processing
A_2_StormData$EVTYPE <- toupper(A_2_StormData$EVTYPE)

EVTYPE   FATALITIES  INJURIES  PROPDMG  PROPDMGEXP  CROPDMG  CROPDMGEXP
TORNADO  0           15        25       K           0
TORNADO  0           0         2.5      K           0
TORNADO  0           2         25       K           0
TORNADO  0           2         2.5      K           0
TORNADO  0           2         2.5      K           0
TORNADO  0           6         2.5      K           0

Let’s take a look at the EVTYPE variable in the data.

# Number of possible events that are in the data.
N_Events_Data <- arrange(count(A_2_StormData, EVTYPE),EVTYPE, desc(n))
N_Events_Data <- dim(N_Events_Data)[1]

There are 898 different types of events in the data. We will relabel them according to the events listed in the Documentation in order to generate concise results.

4- Here is the complete list of possible event types according to the documentation (pages 2-4).

ASTRONOMICAL LOW TIDE  DUST DEVIL               HEAVY SNOW         MARINE HIGH WIND          TORNADO
AVALANCHE              DUST STORM               HIGH SURF          MARINE STRONG WIND        TROPICAL DEPRESSION
BLIZZARD               EXTREME COLD/WIND CHILL  HIGH WIND          MARINE THUNDERSTORM WIND  TROPICAL STORM
COASTAL FLOOD          FLOOD/FLASH FLOOD        HURRICANE/TYPHOON  RIP CURRENT               TSUNAMI
COLD/WIND CHILL        FREEZING FOG             ICE STORM          SEICHE                    VOLCANIC ASH
DEBRIS FLOW            FUNNEL CLOUD             LAKESHORE FLOOD    SLEET                     WATERSPOUT
DENSE FOG              HAIL                     LAKE-EFFECT SNOW   STORM TIDE                WILDFIRE
DENSE SMOKE            HEAT                     LIGHTNING          STRONG WIND               WINTER STORM
DROUGHT                HEAVY RAIN               MARINE HAIL        THUNDERSTORM WIND         WINTER WEATHER

The documentation lists 45 different event types, compared with 898 distinct labels in the data, which means that the EVTYPE variable in the data is very inconsistent.
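The code in the following steps refers to a data frame named A_3_Doc_Events with an EVTYPE column, but its construction is not shown in this report. A minimal sketch, assuming it simply stores the 45 labels listed above as character strings (the alphabetical ordering is also an assumption):

```r
# Assumption: A_3_Doc_Events is a one-column data frame holding the 45
# documented labels listed above; how the report really builds it is not shown.
A_3_Doc_Events <- data.frame(
    EVTYPE = c(
        "ASTRONOMICAL LOW TIDE", "AVALANCHE", "BLIZZARD", "COASTAL FLOOD",
        "COLD/WIND CHILL", "DEBRIS FLOW", "DENSE FOG", "DENSE SMOKE",
        "DROUGHT", "DUST DEVIL", "DUST STORM", "EXTREME COLD/WIND CHILL",
        "FLOOD/FLASH FLOOD", "FREEZING FOG", "FUNNEL CLOUD", "HAIL",
        "HEAT", "HEAVY RAIN", "HEAVY SNOW", "HIGH SURF", "HIGH WIND",
        "HURRICANE/TYPHOON", "ICE STORM", "LAKESHORE FLOOD",
        "LAKE-EFFECT SNOW", "LIGHTNING", "MARINE HAIL", "MARINE HIGH WIND",
        "MARINE STRONG WIND", "MARINE THUNDERSTORM WIND", "RIP CURRENT",
        "SEICHE", "SLEET", "STORM TIDE", "STRONG WIND", "THUNDERSTORM WIND",
        "TORNADO", "TROPICAL DEPRESSION", "TROPICAL STORM", "TSUNAMI",
        "VOLCANIC ASH", "WATERSPOUT", "WILDFIRE", "WINTER STORM",
        "WINTER WEATHER"
    ),
    stringsAsFactors = FALSE
)

# Number of documented event types
nrow(A_3_Doc_Events)
```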

5- Let’s fix the EVTYPE labels in the data

Multiple Events

Some labels combine several events (e.g., “HEAVY SNOW/HIGH WINDS/FREEZING”). If (and only if) an EVTYPE label starts with an event listed in the documentation (A_3_Doc_Events), we will treat that first event as the most relevant one and use it as the label.

In the example, “HEAVY SNOW/HIGH WINDS/FREEZING” starts with “HEAVY SNOW”. Because this event appears in the documentation list A_3_Doc_Events, we will replace “HEAVY SNOW/HIGH WINDS/FREEZING” with just “HEAVY SNOW”.

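# Collapse every label that starts with a documented event name to that name
for (i in seq_along(A_3_Doc_Events$EVTYPE)) {
    starts_with_event <- grepl(paste0("^", A_3_Doc_Events$EVTYPE[i]), A_2_StormData$EVTYPE, ignore.case = TRUE)
    A_2_StormData$EVTYPE[starts_with_event] <- A_3_Doc_Events$EVTYPE[i]
}
rm(i)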
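To make the prefix rule concrete, here is a small sketch on a toy vector. Apart from the label quoted in the text, the labels and the three-item event list are made up for illustration, and startsWith() is used in place of grepl("^...") because the labels are already uppercase.

```r
# Toy labels: the first is the example from the text, the rest are hypothetical
labels <- c("HEAVY SNOW/HIGH WINDS/FREEZING", "TORNADO F1",
            "HIGH WINDS/HEAVY SNOW", "SUMMARY OF MAY 10")

# Small subset of the documented event names
doc_events <- c("HEAVY SNOW", "HIGH WIND", "TORNADO")

for (ev in doc_events) {
    # Only labels that START with the documented name are collapsed
    labels[startsWith(labels, ev)] <- ev
}

labels
# "HEAVY SNOW" "TORNADO" "HIGH WIND" "SUMMARY OF MAY 10"
```

The last label does not start with any documented name, so it is left unchanged.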

Similar Events

Some labels, such as “HIGH WINDS” and “HIGH WIND”, are nearly identical. In this example, we will replace “HIGH WINDS” with “HIGH WIND” because the latter is the form used in the documentation.

There are many cases like this, and we can’t fix all of them because we are doing this manually. We will focus only on the most frequent ones.

# The rules run top to bottom; a later pattern overwrites any earlier label it also matches.
A_2_StormData$EVTYPE[grepl("COASTAL FLOOD", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "COASTAL FLOOD"
# "FLOOD|FLASH" also matches COASTAL FLOOD and LAKESHORE FLOOD, so they are merged into FLOOD/FLASH FLOOD
A_2_StormData$EVTYPE[grepl("FLOOD|FLASH", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "FLOOD/FLASH FLOOD"

A_2_StormData$EVTYPE[grepl("HEAT", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "HEAT"

A_2_StormData$EVTYPE[grepl("FOG", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "DENSE FOG"
A_2_StormData$EVTYPE[grepl("STORM SURGE", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "STORM TIDE"
A_2_StormData$EVTYPE[grepl("RAIN|STREAM", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "HEAVY RAIN"
A_2_StormData$EVTYPE[grepl("HURRICANE|TYPHOON", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "HURRICANE/TYPHOON"

# "SNOW" also matches LAKE-EFFECT SNOW, so those rows end up as HEAVY SNOW
A_2_StormData$EVTYPE[grepl("LAKE", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "LAKE-EFFECT SNOW"
A_2_StormData$EVTYPE[grepl("SNOW", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "HEAVY SNOW"

A_2_StormData$EVTYPE[grepl("FROST/FREEZE|FREEZE|FROST|ICE", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "ICE STORM"

A_2_StormData$EVTYPE[grepl("WILD", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "WILDFIRE"

# "COLD|WIND CHILL" also matches EXTREME COLD/WIND CHILL, so both become COLD/WIND CHILL
A_2_StormData$EVTYPE[grepl("EXTREME", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "EXTREME COLD/WIND CHILL"
A_2_StormData$EVTYPE[grepl("COLD|WIND CHILL", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "COLD/WIND CHILL"

A_2_StormData$EVTYPE[grepl("TSTM WIND|THUNDERSTORM WIND|THUNDERSTORM WINDS", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "THUNDERSTORM WIND"
A_2_StormData$EVTYPE[grepl("HIGH WINDS", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "HIGH WIND"
A_2_StormData$EVTYPE[grepl("STRONG WINDS", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "STRONG WIND"
# "WIND" matches every remaining label containing WIND (including THUNDERSTORM WIND, HIGH WIND
# and COLD/WIND CHILL), so STRONG WIND in the results is a combined wind category
A_2_StormData$EVTYPE[grepl("WIND", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "STRONG WIND"


# How many possible events are in the data now?
N_Events_Data2 <- rename(arrange(count(A_2_StormData, EVTYPE),EVTYPE, desc(n)), Events = EVTYPE)
N_Events_Data2 <- dim(N_Events_Data2)[1]

We have reduced the number of distinct EVTYPE labels in the data from 898 to 369.
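Because the grepl() rules above are applied one after another, a broad pattern can overwrite a label that an earlier, narrower rule has already set. The sketch below shows this on a toy vector using three of the patterns from the rules above. The relabel() helper and the guarded variant are hypothetical and are not part of this analysis; the figures in this report come from the sequential rules.

```r
# Toy labels (illustrative only)
labels <- c("COASTAL FLOOD", "FLASH FLOODING", "THUNDERSTORM WIND", "COLD/WIND CHILL")

# Sequential rules, in the same order as in the report
sequential <- labels
sequential[grepl("COASTAL FLOOD", sequential)] <- "COASTAL FLOOD"
sequential[grepl("FLOOD|FLASH", sequential)]   <- "FLOOD/FLASH FLOOD"
sequential[grepl("WIND", sequential)]          <- "STRONG WIND"
sequential
# "FLOOD/FLASH FLOOD" "FLOOD/FLASH FLOOD" "STRONG WIND" "STRONG WIND"

# Hypothetical variant: never relabel a value that is already a documented name
canonical <- c("COASTAL FLOOD", "FLOOD/FLASH FLOOD", "THUNDERSTORM WIND",
               "COLD/WIND CHILL", "STRONG WIND")

relabel <- function(x, pattern, target, protected) {
    hit <- grepl(pattern, x) & !(x %in% protected)
    x[hit] <- target
    x
}

guarded <- labels
guarded <- relabel(guarded, "COASTAL FLOOD", "COASTAL FLOOD", canonical)
guarded <- relabel(guarded, "FLOOD|FLASH", "FLOOD/FLASH FLOOD", canonical)
guarded <- relabel(guarded, "WIND", "STRONG WIND", canonical)
guarded
# "COASTAL FLOOD" "FLOOD/FLASH FLOOD" "THUNDERSTORM WIND" "COLD/WIND CHILL"
```

Applying a guard like this to the full data would change the category counts and totals, so it is shown only to illustrate how rule order affects the labels.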

6- Now let’s count the rows in the data whose EVTYPE labels still don’t match the documentation list.

In_Documentation N
No 4413
Yes 897884

Only 0.5% of the rows (4,413 of 902,297) have event labels that do not match the documentation, so the EVTYPE variable is now much more consistent.
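The code that produced the count above is not shown. One way to obtain such a count is with %in%, sketched here on toy data; the names storm and doc_events are placeholders for A_2_StormData and A_3_Doc_Events$EVTYPE.

```r
# Toy stand-ins for the documentation list and the storm data
doc_events <- c("TORNADO", "HAIL", "HEAT")
storm <- data.frame(
    EVTYPE = c("TORNADO", "HAIL", "SUMMARY OF MAY 10", "HEAT", "TORNADO"),
    stringsAsFactors = FALSE
)

# Flag each row according to whether its label is a documented event name
storm$In_Documentation <- ifelse(storm$EVTYPE %in% doc_events, "Yes", "No")

counts <- table(storm$In_Documentation)
counts

# Share of rows in each group, in percent
round(100 * prop.table(counts), 1)

# The same arithmetic with the counts reported above
round(100 * 4413 / (4413 + 897884), 1)  # 0.5
```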

7- Fixing the magnitude variables

Let’s take a look at PROPDMGEXP and CROPDMGEXP.

table(A_1_StormData$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(A_1_StormData$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

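As we can see, PROPDMGEXP and CROPDMGEXP are not consistent. We will keep only the codes H (hundreds), K (thousands), M (millions) and B (billions), in either case, and recode every other value, including blanks, as O, which is given a multiplier of 1. Let’s fix that.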

# Transformation 
A_2_StormData$PROPDMGEXP<-factor(A_2_StormData$PROPDMGEXP,levels=c("H","K","M","B","h","m","O"))
A_2_StormData$PROPDMGEXP[is.na(A_2_StormData$PROPDMGEXP)] <- "O"

A_2_StormData$CROPDMGEXP<-factor(A_2_StormData$CROPDMGEXP,levels=c("K","M","B","k","m","O"))
A_2_StormData$CROPDMGEXP[is.na(A_2_StormData$CROPDMGEXP)] <- "O"

A_2_StormData$PROPDMGEXP <- as.character(A_2_StormData$PROPDMGEXP)
A_2_StormData$CROPDMGEXP <- as.character(A_2_StormData$CROPDMGEXP)

A_2_StormData$PROPDMGMLT <- 0
A_2_StormData$CROPDMGMLT <- 0

# Replace the magnitude codes with their numeric equivalents
A_2_StormData$PROPDMGMLT[grepl("h", A_2_StormData$PROPDMGEXP,ignore.case = TRUE)]<-100
A_2_StormData$PROPDMGMLT[grepl("k", A_2_StormData$PROPDMGEXP,ignore.case = TRUE)]<-1000
A_2_StormData$PROPDMGMLT[grepl("m", A_2_StormData$PROPDMGEXP,ignore.case = TRUE)]<-1000000
A_2_StormData$PROPDMGMLT[grepl("b", A_2_StormData$PROPDMGEXP,ignore.case = TRUE)]<-1000000000
A_2_StormData$PROPDMGMLT[grepl("o", A_2_StormData$PROPDMGEXP,ignore.case = TRUE)]<-1

A_2_StormData$CROPDMGMLT[grepl("k", A_2_StormData$CROPDMGEXP,ignore.case = TRUE)]<-1000
A_2_StormData$CROPDMGMLT[grepl("m", A_2_StormData$CROPDMGEXP,ignore.case = TRUE)]<-1000000
A_2_StormData$CROPDMGMLT[grepl("b", A_2_StormData$CROPDMGEXP,ignore.case = TRUE)]<-1000000000
A_2_StormData$CROPDMGMLT[grepl("o", A_2_StormData$CROPDMGEXP,ignore.case = TRUE)]<-1

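After these changes, PROPDMGEXP and CROPDMGEXP contain only the expected codes. Upper- and lowercase variants remain, but the case-insensitive matching above assigns them the same multiplier.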

table(A_2_StormData$PROPDMGEXP)
## 
##      B      h      H      K      m      M      O 
##     40      1      6 424665      7  11330 466248
table(A_2_StormData$CROPDMGEXP)
## 
##      B      k      K      m      M      O 
##      9     21 281832      1   1994 618440

Now we can calculate the full amounts of property damage (PROPDMG) and crop damage (CROPDMG) by multiplying each value by its multiplier.

# Convert Property Damage and Crop Damage to full number format
A_2_StormData$PROPDMG <- as.numeric(A_2_StormData$PROPDMG) * A_2_StormData$PROPDMGMLT
A_2_StormData$CROPDMG <- as.numeric(A_2_StormData$CROPDMG) * A_2_StormData$CROPDMGMLT
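The conversion from magnitude code to multiplier can also be written as a lookup in a named vector. The following is a sketch on toy rows, not the code used in this report; the column PROPDMG_FULL is a name introduced here for illustration.

```r
# Multiplier for each magnitude code (O is the catch-all code with multiplier 1)
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9, O = 1)

# Toy rows; the first mirrors the first record shown earlier (25 with code K)
damage <- data.frame(
    PROPDMG    = c(25, 2.5, 10, 3),
    PROPDMGEXP = c("K", "m", "O", "B"),
    stringsAsFactors = FALSE
)

# Uppercase the code, then look up its multiplier by name
damage$PROPDMGMLT   <- unname(multiplier[toupper(damage$PROPDMGEXP)])
damage$PROPDMG_FULL <- damage$PROPDMG * damage$PROPDMGMLT

damage
```

For example, a PROPDMG of 25 with code K becomes 25 * 1000 = 25000.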

Results/Answers:

1- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Let’s create a table and a chart of the events with the highest total FATALITIES and INJURIES.

A_5_Health <- aggregate(cbind(as.numeric(FATALITIES),as.numeric(INJURIES)) ~ EVTYPE, data = A_2_StormData, sum, na.rm=TRUE)

names(A_5_Health) <- c("EVTYPE", "FATALITIES","INJURIES")

A_5_Health$TOTAL <- A_5_Health$FATALITIES + A_5_Health$INJURIES

A_5_Health <- arrange(A_5_Health, desc(TOTAL))

EVTYPE             FATALITIES  INJURIES  TOTAL
TORNADO            5633        91346     96979
STRONG WIND        1657        11734     13391
HEAT               3138        9224      12362
FLOOD/FLASH FLOOD  1525        8604      10129
LIGHTNING          816         5230      6046
ICE STORM          99          2155      2254
WILDFIRE           90          1606      1696
WINTER STORM       206         1321      1527
HURRICANE/TYPHOON  135         1333      1468
HAIL               15          1361      1376

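Tornadoes are by far the most harmful events with respect to population health, with 96,979 combined fatalities and injuries, more than seven times the total for the next category.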
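A bar chart of these totals could be drawn with ggplot2, which is loaded at the start of the analysis. The sketch below hard-codes the top five rows of the table above so that it runs on its own; the data frame name, labels and styling are assumptions, not the chart from the original report.

```r
library(ggplot2)

# Top five rows copied from the table above
health <- data.frame(
    EVTYPE     = c("TORNADO", "STRONG WIND", "HEAT", "FLOOD/FLASH FLOOD", "LIGHTNING"),
    FATALITIES = c(5633, 1657, 3138, 1525, 816),
    INJURIES   = c(91346, 11734, 9224, 8604, 5230)
)
health$TOTAL <- health$FATALITIES + health$INJURIES

# Horizontal bars, ordered by the combined total
health_plot <- ggplot(health, aes(x = reorder(EVTYPE, TOTAL), y = TOTAL)) +
    geom_col() +
    coord_flip() +
    labs(x = "Event type",
         y = "Fatalities + injuries",
         title = "Events most harmful to population health (top five)")

health_plot
```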

2- Across the United States, which types of events have the greatest economic consequences?

Let’s create a table of the events with the highest total PROPDMG and CROPDMG.

A_6_DMG <- aggregate(cbind(as.numeric(PROPDMG),as.numeric(CROPDMG)) ~ EVTYPE, data = A_2_StormData, sum, na.rm=TRUE)
names(A_6_DMG) <- c("EVTYPE", "PROPDMG","CROPDMG") 

A_6_DMG$TOTAL <- A_6_DMG$PROPDMG + A_6_DMG$CROPDMG

A_6_DMG <- arrange(A_6_DMG, desc(TOTAL))

EVTYPE             PROPDMG       CROPDMG      TOTAL
FLOOD/FLASH FLOOD  167529740932  12380109100  179909850032
HURRICANE/TYPHOON  85356410010   5516117800   90872527810
TORNADO            56937160779   414953270    57352114049
STORM TIDE         47964724000   855000       47965579000
STRONG WIND        17747390679   3445560088   21192950767
HAIL               15732267543   3025954473   18758222016
DROUGHT            1046106000    13972566000  15018672000
ICE STORM          3981204360    7019175300   11000379660
WILDFIRE           8491563500    402781630    8894345130
TROPICAL STORM     7703890550    678346000    8382236550

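Floods (including flash floods) have the greatest economic consequences, with about 180 billion dollars in combined property and crop damage, roughly twice the total for hurricanes and typhoons.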
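The aggregate() calls used for both questions can also be expressed with dplyr, which is already loaded. This is an equivalent sketch on toy data, not the code that produced the tables above; the toy values are made up.

```r
library(dplyr)

# Toy damage records (illustrative values only)
toy <- data.frame(
    EVTYPE  = c("TORNADO", "TORNADO", "HAIL", "DROUGHT"),
    PROPDMG = c(25000, 2500, 1000, 0),
    CROPDMG = c(0, 0, 500, 10000)
)

# Sum each damage type per event, add the combined total, sort descending
toy_dmg <- toy %>%
    group_by(EVTYPE) %>%
    summarise(PROPDMG = sum(PROPDMG, na.rm = TRUE),
              CROPDMG = sum(CROPDMG, na.rm = TRUE)) %>%
    mutate(TOTAL = PROPDMG + CROPDMG) %>%
    arrange(desc(TOTAL))

toy_dmg
```

Naming the summed columns inside summarise() avoids the separate names() assignment that the aggregate() version needs.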