This report is a course project for the Reproducible Research course in the Data Science Specialization offered by Johns Hopkins University on Coursera.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
Your data analysis must address the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
1- Libraries
library(dplyr)
library(ggplot2)
2- Download the raw data from the URL in the code below and load it into R.
# Setting working directory first
setwd("~/Coursera/8_Data_Science_Specialization/5_Reproducible_Research/Week_4/Assignment")
# Download the bzip2-compressed data file if it is not already present
if (!file.exists("repdata_data_StormData.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "repdata_data_StormData.csv.bz2",
                method = "curl")
}
# Loading the Data
A_1_StormData <- read.csv("repdata_data_StormData.csv.bz2")
A_1_StormData <- data.frame(lapply(A_1_StormData, as.character), stringsAsFactors=FALSE)
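As the loading step above relies on, read.csv() can read a bzip2-compressed file directly, so the archive does not need to be unpacked first. Here is a minimal sketch with a small made-up file; the temporary file and its two rows are assumptions for illustration, not the storm data.

```r
# Write a tiny CSV through a bzip2 connection, then read it back.
# The file name and rows are made up for illustration only.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,0", "HAIL,1"), con)
close(con)

# read.csv() opens the compressed file and decompresses it on the fly
toy <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy)
```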
3- For this study we will only keep EVTYPE and the variables related to the impact on population health and economic consequences.
# Keep only the event type plus the health and damage variables
A_2_StormData <- A_1_StormData[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
# Convert EVTYPE to uppercase to avoid case-sensitive duplicates during data processing
A_2_StormData$EVTYPE <- toupper(A_2_StormData$EVTYPE)
EVTYPE | FATALITIES | INJURIES | PROPDMG | PROPDMGEXP | CROPDMG | CROPDMGEXP |
---|---|---|---|---|---|---|
TORNADO | 0 | 15 | 25 | K | 0 | |
TORNADO | 0 | 0 | 2.5 | K | 0 | |
TORNADO | 0 | 2 | 25 | K | 0 | |
TORNADO | 0 | 2 | 2.5 | K | 0 | |
TORNADO | 0 | 2 | 2.5 | K | 0 | |
TORNADO | 0 | 6 | 2.5 | K | 0 | |
Let’s take a look at the EVTYPE variable in the data.
# Number of possible events that are in the data.
N_Events_Data <- arrange(count(A_2_StormData, EVTYPE),EVTYPE, desc(n))
N_Events_Data <- dim(N_Events_Data)[1]
There are 898 different types of events in the data. We will relabel them according to the events listed in the Documentation in order to generate concise results.
4- Here is the complete list of all the possible types of events according to the documentation (Pages 2-4).
ASTRONOMICAL LOW TIDE | DUST DEVIL | HEAVY SNOW | MARINE HIGH WIND | TORNADO |
AVALANCHE | DUST STORM | HIGH SURF | MARINE STRONG WIND | TROPICAL DEPRESSION |
BLIZZARD | EXTREME COLD/WIND CHILL | HIGH WIND | MARINE THUNDERSTORM WIND | TROPICAL STORM |
COASTAL FLOOD | FLOOD/FLASH FLOOD | HURRICANE/TYPHOON | RIP CURRENT | TSUNAMI |
COLD/WIND CHILL | FREEZING FOG | ICE STORM | SEICHE | VOLCANIC ASH |
DEBRIS FLOW | FUNNEL CLOUD | LAKESHORE FLOOD | SLEET | WATERSPOUT |
DENSE FOG | HAIL | LAKE-EFFECT SNOW | STORM TIDE | WILDFIRE |
DENSE SMOKE | HEAT | LIGHTNING | STRONG WIND | WINTER STORM |
DROUGHT | HEAVY RAIN | MARINE HAIL | THUNDERSTORM WIND | WINTER WEATHER |
There are 45 different types of events in the documentation, compared with 898 in the data, which means that the EVTYPE variable in the data is very inconsistent.
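The relabeling code further down refers to a data frame named A_3_Doc_Events, whose construction is not shown in this report. A minimal sketch of how it could be built from the list above, assuming it is simply a data frame with a single EVTYPE column (that structure is an assumption inferred from how the object is used):

```r
# Assumption: A_3_Doc_Events holds the 45 documented labels in one
# character column called EVTYPE; the original construction is not shown.
A_3_Doc_Events <- data.frame(
  EVTYPE = c(
    "ASTRONOMICAL LOW TIDE", "AVALANCHE", "BLIZZARD", "COASTAL FLOOD",
    "COLD/WIND CHILL", "DEBRIS FLOW", "DENSE FOG", "DENSE SMOKE", "DROUGHT",
    "DUST DEVIL", "DUST STORM", "EXTREME COLD/WIND CHILL", "FLOOD/FLASH FLOOD",
    "FREEZING FOG", "FUNNEL CLOUD", "HAIL", "HEAT", "HEAVY RAIN",
    "HEAVY SNOW", "HIGH SURF", "HIGH WIND", "HURRICANE/TYPHOON", "ICE STORM",
    "LAKESHORE FLOOD", "LAKE-EFFECT SNOW", "LIGHTNING", "MARINE HAIL",
    "MARINE HIGH WIND", "MARINE STRONG WIND", "MARINE THUNDERSTORM WIND",
    "RIP CURRENT", "SEICHE", "SLEET", "STORM TIDE", "STRONG WIND",
    "THUNDERSTORM WIND", "TORNADO", "TROPICAL DEPRESSION", "TROPICAL STORM",
    "TSUNAMI", "VOLCANIC ASH", "WATERSPOUT", "WILDFIRE", "WINTER STORM",
    "WINTER WEATHER"
  ),
  stringsAsFactors = FALSE
)

# Number of documented event types
print(nrow(A_3_Doc_Events))  # 45
```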
5- Let’s fix the EVTYPE labels in the data
Multiple Events
Some event labels combine several events (e.g., “HEAVY SNOW/HIGH WINDS/FREEZING”). If (and only if) EVTYPE starts with an event listed in the documentation (A_3_Doc_Events), we treat that leading event as the most relevant one and use it as the label.
In the example, “HEAVY SNOW/HIGH WINDS/FREEZING” starts with “HEAVY SNOW”. Because this event appears in the documentation list A_3_Doc_Events, we replace “HEAVY SNOW/HIGH WINDS/FREEZING” with just “HEAVY SNOW”.
for (i in seq_along(A_3_Doc_Events$EVTYPE)) {
  # Labels that start with a documented event name are replaced by that name
  A_2_StormData$EVTYPE[grepl(paste("^", A_3_Doc_Events$EVTYPE[i], sep = ""), A_2_StormData$EVTYPE, ignore.case = TRUE)] <- A_3_Doc_Events$EVTYPE[i]
} ; rm(i)
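To see what the prefix rule does in isolation, here is a small sketch on made-up labels. The vectors labels and doc_events are hypothetical stand-ins for the EVTYPE column and the documentation list.

```r
# Toy labels and a two-entry documentation list (made up for illustration)
labels <- c("HEAVY SNOW/HIGH WINDS/FREEZING", "HEAVY SNOW", "WET HEAVY SNOW", "HAIL 100")
doc_events <- c("HEAVY SNOW", "HAIL")

for (event in doc_events) {
  # "^" anchors the pattern at the start of the label
  starts_with_event <- grepl(paste0("^", event), labels, ignore.case = TRUE)
  labels[starts_with_event] <- event
}

print(labels)
# "HEAVY SNOW" "HEAVY SNOW" "WET HEAVY SNOW" "HAIL"
```

Note that “WET HEAVY SNOW” is left untouched, because the documented name does not appear at the start of the label.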
Similar Events
Some labels, like “HIGH WINDS” and “HIGH WIND”, are similar. In this example, we will replace “HIGH WINDS” with “HIGH WIND” because the latter is in the documentation.
There are many cases like this, and we can’t fix all of them because we are doing this manually. We will focus only on the most frequent ones.
# The replacements run in order on the already-relabeled column, so a later,
# broader pattern overwrites the result of an earlier, narrower one.
A_2_StormData$EVTYPE[grepl("COASTAL FLOOD", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "COASTAL FLOOD"
# "FLOOD|FLASH" also matches "COASTAL FLOOD" and "LAKESHORE FLOOD"
A_2_StormData$EVTYPE[grepl("FLOOD|FLASH", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "FLOOD/FLASH FLOOD"
A_2_StormData$EVTYPE[grepl("HEAT", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "HEAT"
A_2_StormData$EVTYPE[grepl("FOG", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "DENSE FOG"
A_2_StormData$EVTYPE[grepl("STORM SURGE", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "STORM TIDE"
A_2_StormData$EVTYPE[grepl("RAIN|STREAM", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "HEAVY RAIN"
A_2_StormData$EVTYPE[grepl("HURRICANE|TYPHOON", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "HURRICANE/TYPHOON"
A_2_StormData$EVTYPE[grepl("LAKE", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "LAKE-EFFECT SNOW"
# "SNOW" also matches the "LAKE-EFFECT SNOW" label assigned on the previous line
A_2_StormData$EVTYPE[grepl("SNOW", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "HEAVY SNOW"
A_2_StormData$EVTYPE[grepl("FROST/FREEZE|FREEZE|FROST|ICE", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "ICE STORM"
A_2_StormData$EVTYPE[grepl("WILD", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "WILDFIRE"
A_2_StormData$EVTYPE[grepl("EXTREME", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "EXTREME COLD/WIND CHILL"
A_2_StormData$EVTYPE[grepl("COLD|WIND CHILL", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "COLD/WIND CHILL"
A_2_StormData$EVTYPE[grepl("TSTM WIND|THUNDERSTORM WIND|THUNDERSTORM WINDS", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "THUNDERSTORM WIND"
A_2_StormData$EVTYPE[grepl("HIGH WINDS", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "HIGH WIND"
A_2_StormData$EVTYPE[grepl("STRONG WINDS", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "STRONG WIND"
# "WIND" matches every remaining label containing WIND, including
# "THUNDERSTORM WIND", "HIGH WIND" and "COLD/WIND CHILL"
A_2_StormData$EVTYPE[grepl("WIND", A_2_StormData$EVTYPE, ignore.case = TRUE)] <- "STRONG WIND"
# How many possible events are in the data now?
N_Events_Data2 <- rename(arrange(count(A_2_StormData, EVTYPE),EVTYPE, desc(n)), Events = EVTYPE)
N_Events_Data2 <- dim(N_Events_Data2)[1]
We have reduced EVTYPE from 898 different labels in the data to 369.
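The manual replacements above are applied one after another to the same column, so a broad pattern such as "WIND" also matches labels that an earlier line has already set (for example “HIGH WIND”). One possible alternative, shown here only as a sketch on made-up labels and not as the method used in this report, is a "first match wins" rule table: each pattern is tested against the original label, and a label is only assigned if no earlier rule has claimed it. The names labels, rules and relabeled are hypothetical.

```r
# Toy labels (made up for illustration)
labels <- c("HIGH WINDS", "TSTM WIND", "GUSTY WIND", "COASTAL FLOODING", "FLASH FLOOD")

# Ordered rules: more specific patterns first, broad ones last
rules <- data.frame(
  pattern = c("COASTAL FLOOD", "FLOOD|FLASH", "TSTM WIND|THUNDERSTORM WIND", "HIGH WIND", "WIND"),
  label   = c("COASTAL FLOOD", "FLOOD/FLASH FLOOD", "THUNDERSTORM WIND", "HIGH WIND", "STRONG WIND"),
  stringsAsFactors = FALSE
)

relabeled <- rep(NA_character_, length(labels))
for (i in seq_len(nrow(rules))) {
  # Match against the ORIGINAL labels and only fill entries not yet assigned
  hit <- is.na(relabeled) & grepl(rules$pattern[i], labels, ignore.case = TRUE)
  relabeled[hit] <- rules$label[i]
}

# Labels that matched no rule keep their original value
relabeled[is.na(relabeled)] <- labels[is.na(relabeled)]
print(relabeled)
# "HIGH WIND" "THUNDERSTORM WIND" "STRONG WIND" "COASTAL FLOOD" "FLOOD/FLASH FLOOD"
```

With this design the order of the rules expresses priority explicitly, at the cost of keeping a separate result vector.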
6- Now let’s count the rows in the data where the EVTYPE labels still don’t match the documentation list.
In_Documentation | N |
---|---|
No | 4413 |
Yes | 897884 |
Only 0.5% of the data rows (4,413 of 902,297) have events that are not properly labeled, which means that the EVTYPE variable is much more consistent now.
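The code that produced the In_Documentation table is not shown. A minimal sketch of one way to get such a count, using %in% on made-up labels (labels and doc_events are hypothetical stand-ins), followed by the percentage computed from the counts reported in the table:

```r
# Toy data (made up): two documented labels and one label that is not documented
doc_events <- c("TORNADO", "HAIL")
labels <- c("TORNADO", "HAIL", "TORNADO", "SUMMARY OF MAY 5")

# Flag each row by whether its label appears in the documentation list
in_documentation <- ifelse(labels %in% doc_events, "Yes", "No")
print(table(in_documentation))

# Share of unmatched rows, using the counts reported in the table above
unmatched_share <- 4413 / (4413 + 897884) * 100
print(round(unmatched_share, 1))  # 0.5
```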
7- Fixing the magnitude variables
Let’s take a look at PROPDMGEXP and CROPDMGEXP.
table(A_1_StormData$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(A_1_StormData$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
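As we can see, PROPDMGEXP and CROPDMGEXP are not consistent. They should only take the values H (hundreds), K (thousands), M (millions) or B (billions); we recode every other value as O (other), which gets a multiplier of 1. Let’s fix that.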
# Transformation
A_2_StormData$PROPDMGEXP<-factor(A_2_StormData$PROPDMGEXP,levels=c("H","K","M","B","h","m","O"))
A_2_StormData$PROPDMGEXP[is.na(A_2_StormData$PROPDMGEXP)] <- "O"
A_2_StormData$CROPDMGEXP<-factor(A_2_StormData$CROPDMGEXP,levels=c("K","M","B","k","m","O"))
A_2_StormData$CROPDMGEXP[is.na(A_2_StormData$CROPDMGEXP)] <- "O"
A_2_StormData$PROPDMGEXP <- as.character(A_2_StormData$PROPDMGEXP)
A_2_StormData$CROPDMGEXP <- as.character(A_2_StormData$CROPDMGEXP)
A_2_StormData$PROPDMGMLT <- 0
A_2_StormData$CROPDMGMLT <- 0
# Replace each magnitude code with its numeric equivalent
A_2_StormData$PROPDMGMLT[grepl("h", A_2_StormData$PROPDMGEXP,ignore.case = TRUE)]<-100
A_2_StormData$PROPDMGMLT[grepl("k", A_2_StormData$PROPDMGEXP,ignore.case = TRUE)]<-1000
A_2_StormData$PROPDMGMLT[grepl("m", A_2_StormData$PROPDMGEXP,ignore.case = TRUE)]<-1000000
A_2_StormData$PROPDMGMLT[grepl("b", A_2_StormData$PROPDMGEXP,ignore.case = TRUE)]<-1000000000
A_2_StormData$PROPDMGMLT[grepl("o", A_2_StormData$PROPDMGEXP,ignore.case = TRUE)]<-1
A_2_StormData$CROPDMGMLT[grepl("k", A_2_StormData$CROPDMGEXP,ignore.case = TRUE)]<-1000
A_2_StormData$CROPDMGMLT[grepl("m", A_2_StormData$CROPDMGEXP,ignore.case = TRUE)]<-1000000
A_2_StormData$CROPDMGMLT[grepl("b", A_2_StormData$CROPDMGEXP,ignore.case = TRUE)]<-1000000000
A_2_StormData$CROPDMGMLT[grepl("o", A_2_StormData$CROPDMGEXP,ignore.case = TRUE)]<-1
After these adjustments, we can see that PROPDMGEXP and CROPDMGEXP are consistent now.
table(A_2_StormData$PROPDMGEXP)
##
## B h H K m M O
## 40 1 6 424665 7 11330 466248
table(A_2_StormData$CROPDMGEXP)
##
## B k K m M O
## 9 21 281832 1 1994 618440
So now we can calculate the full dollar amount of property damage (PROPDMG) and crop damage (CROPDMG).
# Convert Property Damage and Crop Damage to full number format
A_2_StormData$PROPDMG <- as.numeric(A_2_StormData$PROPDMG) * A_2_StormData$PROPDMGMLT
A_2_StormData$CROPDMG <- as.numeric(A_2_StormData$CROPDMG) * A_2_StormData$CROPDMGMLT
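The conversion above boils down to "value times multiplier". Here is a compact sketch of the same arithmetic on made-up rows, using a named lookup vector instead of one grepl() call per code. The names multiplier, dmg, code and dmg_usd are hypothetical.

```r
# Named lookup vector: magnitude code -> multiplier ("O" stands for "other")
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9, O = 1)

# Toy rows (made up), including codes that are not valid magnitudes
dmg  <- c(25, 2.5, 10, 3)
code <- c("K", "m", "?", "")

# Upper-case the code and fall back to "O" when it is not a known magnitude
code_clean <- toupper(code)
code_clean[!code_clean %in% names(multiplier)] <- "O"

dmg_usd <- dmg * unname(multiplier[code_clean])
print(dmg_usd)
# 25000 2500000 10 3
```

For instance, the first row of the sample table earlier (PROPDMG 25 with PROPDMGEXP K) corresponds to 25 × 1000 = 25,000.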
1- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Let’s create a table of the events with the highest total FATALITIES and INJURIES.
A_5_Health <- aggregate(cbind(as.numeric(FATALITIES),as.numeric(INJURIES)) ~ EVTYPE, data = A_2_StormData, sum, na.rm=TRUE)
names(A_5_Health) <- c("EVTYPE", "FATALITIES","INJURIES")
A_5_Health$TOTAL <- A_5_Health$FATALITIES + A_5_Health$INJURIES
A_5_Health <- arrange(A_5_Health, desc(TOTAL))
EVTYPE | FATALITIES | INJURIES | TOTAL |
---|---|---|---|
TORNADO | 5633 | 91346 | 96979 |
STRONG WIND | 1657 | 11734 | 13391 |
HEAT | 3138 | 9224 | 12362 |
FLOOD/FLASH FLOOD | 1525 | 8604 | 10129 |
LIGHTNING | 816 | 5230 | 6046 |
ICE STORM | 99 | 2155 | 2254 |
WILDFIRE | 90 | 1606 | 1696 |
WINTER STORM | 206 | 1321 | 1527 |
HURRICANE/TYPHOON | 135 | 1333 | 1468 |
HAIL | 15 | 1361 | 1376 |
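Tornadoes are by far the most harmful events with respect to population health, with 96,979 combined fatalities and injuries, about seven times the total of the next event type (STRONG WIND, 13,391).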
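The aggregation behind this table can be checked on a tiny example. The sketch below uses a made-up data frame (toy) with columns that are already numeric, sums both outcomes per event type with base R's aggregate(), and sorts by the combined total.

```r
# Toy data (made up for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "HAIL"),
  FATALITIES = c(1, 0, 3, 0),
  INJURIES   = c(15, 2, 4, 1),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
health$TOTAL <- health$FATALITIES + health$INJURIES

# Sort from most to least harmful
health <- health[order(-health$TOTAL), ]
print(health)
```

Here TORNADO comes first with a total of 18, followed by HEAT with 7 and HAIL with 1.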
2- Across the United States, which types of events have the greatest economic consequences?
Let’s create a table of the events with the highest total PROPDMG and CROPDMG.
A_6_DMG <- aggregate(cbind(as.numeric(PROPDMG),as.numeric(CROPDMG)) ~ EVTYPE, data = A_2_StormData, sum, na.rm=TRUE)
names(A_6_DMG) <- c("EVTYPE", "PROPDMG","CROPDMG")
A_6_DMG$TOTAL <- A_6_DMG$PROPDMG + A_6_DMG$CROPDMG
A_6_DMG <- arrange(A_6_DMG, desc(TOTAL))
EVTYPE | PROPDMG | CROPDMG | TOTAL |
---|---|---|---|
FLOOD/FLASH FLOOD | 167529740932 | 12380109100 | 179909850032 |
HURRICANE/TYPHOON | 85356410010 | 5516117800 | 90872527810 |
TORNADO | 56937160779 | 414953270 | 57352114049 |
STORM TIDE | 47964724000 | 855000 | 47965579000 |
STRONG WIND | 17747390679 | 3445560088 | 21192950767 |
HAIL | 15732267543 | 3025954473 | 18758222016 |
DROUGHT | 1046106000 | 13972566000 | 15018672000 |
ICE STORM | 3981204360 | 7019175300 | 11000379660 |
WILDFIRE | 8491563500 | 402781630 | 8894345130 |
TROPICAL STORM | 7703890550 | 678346000 | 8382236550 |
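Floods are the events with the greatest economic consequences, with roughly 180 billion dollars in combined property and crop damage, about twice the total for hurricanes and typhoons.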
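The totals in the table are long to read in full dollar format. As a small sketch, the top three totals (copied from the table above) can be expressed in billions of dollars; the vector names are hypothetical.

```r
# Totals copied from the table above (top three rows)
total_usd <- c("FLOOD/FLASH FLOOD" = 179909850032,
               "HURRICANE/TYPHOON" = 90872527810,
               "TORNADO"           = 57352114049)

# Express the totals in billions of dollars, rounded to one decimal
total_billion <- round(total_usd / 1e9, 1)
print(total_billion)
# 179.9 90.9 57.4
```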