This report is the second course project for the course Reproducible Research from the Coursera course Data Science Specialization.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The main finding of the data analysis is that tornadoes, by far, have the largest impact on population health (measured by the number of injuries and fatalities) and floods cause the greatest economic damage in terms of US dollars (measured by property and crop damage).
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The following libraries are used in the data analysis:
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("plyr")
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library("ggplot2")
The data is obtained from the link above and saved. The working directory has been set in order to read the data in Rstudio.
setwd('C:/Users/Hidde/Documents/Coursera/Data Science Foundations using R Specialization/5_Reproducible_Research/Project2')
if(!exists("stormData")) {
stormData <- read.csv(bzfile("repdata_data_StormData.csv.bz2"), header = TRUE)
}
There are 902297 observations in the dataset and 37 variables.
dim(stormData)
## [1] 902297 37
The structure of the dataset is as follows:
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The variables of interest are population health, economy, and weather event types. For each of these, the following variables are selected:
Population health:
Economic variables:
Weather event types:
variables <- c( "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
dataSelection <- stormData[, variables]
In the new dataset with selected variables, it is important to check the number of missing values. The number of missing values per variable:
sapply(dataSelection, function(x) sum(is.na(x)))
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 0 0 0 0 0 0 0
And in percentage:
sapply(dataSelection, function(x) sum(is.na(x))) / nrow(dataSelection)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 0 0 0 0 0 0 0
since there are no missing values, the data analysis can continue without managing missing values.
First, the ten most occurring weather events are selected.
sort(table(dataSelection$EVTYPE), decreasing = T)[1:10]
##
## HAIL TSTM WIND THUNDERSTORM WIND TORNADO
## 288661 219940 82563 60652
## FLASH FLOOD FLOOD THUNDERSTORM WINDS HIGH WIND
## 54277 25326 20843 20212
## LIGHTNING HEAVY SNOW
## 15754 15708
Then, weather events are grouped based on reoccuring keywords. For example, “THUNDERSTORM WIND”, “THUNDERSTORM WINDS”, “HIGH WIND” are grouped based on the keyword “WIND”. Weather events that do not have common keywords are grouped in “OTHER”.
dataSelection$EVENT <- "OTHER"
dataSelection$EVENT[grep("HAIL", dataSelection$EVTYPE, ignore.case = T)] <- "HAIL"
dataSelection$EVENT[grep("HEAT", dataSelection$EVTYPE, ignore.case = T)] <- "HEAT"
dataSelection$EVENT[grep("FLOOD", dataSelection$EVTYPE, ignore.case = T)] <- "FLOOD"
dataSelection$EVENT[grep("WIND", dataSelection$EVTYPE, ignore.case = T)] <- "WIND"
dataSelection$EVENT[grep("STORM", dataSelection$EVTYPE, ignore.case = T)] <- "STORM"
dataSelection$EVENT[grep("SNOW", dataSelection$EVTYPE, ignore.case = T)] <- "SNOW"
dataSelection$EVENT[grep("TORNADO", dataSelection$EVTYPE, ignore.case = T)] <- "TORNADO"
dataSelection$EVENT[grep("WINTER", dataSelection$EVTYPE, ignore.case = T)] <- "WINTER"
dataSelection$EVENT[grep("RAIN", dataSelection$EVTYPE, ignore.case = T)] <- "RAIN"
sort(table(dataSelection$EVENT), decreasing = T)
##
## HAIL WIND STORM FLOOD TORNADO OTHER WINTER SNOW RAIN HEAT
## 289270 255362 113156 82686 60700 48970 19604 17660 12241 2648
The units of the economic variables are as follows.
sort(table(dataSelection$PROPDMGEXP), decreasing = T)[1:10]
##
## K M 0 B 5 1 2 ? m
## 465934 424665 11330 216 40 28 25 13 8 7
sort(table(dataSelection$CROPDMGEXP), decreasing = T)[1:10]
##
## K M k 0 B ? 2 m <NA>
## 618413 281832 1994 21 19 9 7 1 1
Since these units are not very representative, they are transformed into the same unit:
dataSelection$PROPDMGEXP <- as.character(dataSelection$PROPDMGEXP)
# Missing values are considered as dollars (in this case 0)
dataSelection$PROPDMGEXP[is.na(dataSelection$PROPDMGEXP)] <- 0
dataSelection$PROPDMGEXP[!grep("K|M|B", dataSelection$PROPDMGEXP, ignore.case = T)] <- 0
dataSelection$PROPDMGEXP[grep("K", dataSelection$PROPDMGEXP, ignore.case = T)] <- "3"
dataSelection$PROPDMGEXP[grep("M", dataSelection$PROPDMGEXP, ignore.case = T)] <- "6"
dataSelection$PROPDMGEXP[grep("B", dataSelection$PROPDMGEXP, ignore.case = T)] <- "9"
dataSelection$PROPDMGEXP <- as.numeric(as.character(dataSelection$PROPDMGEXP))
## Warning: NAs introduced by coercion
dataSelection$PropertyDamage <- dataSelection$PROPDMG * 10^dataSelection$PROPDMGEXP
dataSelection$CROPDMGEXP <- as.character(dataSelection$CROPDMGEXP)
# Missing values are considered as dollars (in this case 0)
dataSelection$CROPDMGEXP[is.na(dataSelection$CROPDMGEXP)] <- 0
dataSelection$CROPDMGEXP[!grep("K|M|B", dataSelection$CROPDMGEXP, ignore.case = T)] <- 0
dataSelection$CROPDMGEXP[grep("K", dataSelection$CROPDMGEXP, ignore.case = T)] <- "3"
dataSelection$CROPDMGEXP[grep("M", dataSelection$CROPDMGEXP, ignore.case = T)] <- "6"
dataSelection$CROPDMGEXP[grep("B", dataSelection$CROPDMGEXP, ignore.case = T)] <- "9"
dataSelection$CROPDMGEXP <- as.numeric(as.character(dataSelection$CROPDMGEXP))
## Warning: NAs introduced by coercion
dataSelection$CropDamage <- dataSelection$CROPDMG * 10^dataSelection$CROPDMGEXP
A second check of the values of economic variables, showing the first ten most occurring:
sort(table(dataSelection$PropertyDamage), decreasing = T)[1:10]
##
## 0 5000 10000 1000 2000 25000 50000 3000 20000 15000
## 197257 31731 21788 17545 17186 17104 13596 10364 9182 8617
sort(table(dataSelection$CropDamage), decreasing = T)[1:10]
##
## 0 5000 10000 50000 1e+05 1000 2000 25000 20000 5e+05
## 261781 4097 2349 1984 1233 956 951 830 758 721
The weather events are aggregated for the population health and economic variables.
aggFatalitiesInjuries <- ddply(dataSelection, .(EVENT), summarize, total = sum(FATALITIES + INJURIES, na.rm = T))
aggFatalitiesInjuries$Type <- "Fatalities and injuries"
aggFatalities <- ddply(dataSelection, .(EVENT), summarize, total = sum(FATALITIES, na.rm = T))
aggFatalities$Type <- "Fatalities"
aggInjuries <- ddply(dataSelection, .(EVENT), summarize, total = sum(INJURIES, na.rm = T))
aggInjuries$Type <- "Injuries"
aggHealth <- rbind(aggFatalities, aggInjuries)
healthByEvent <- join(aggFatalities, aggInjuries, by = "EVENT", type = "inner")
healthByEvent
## EVENT total Type total Type
## 1 FLOOD 1524 Fatalities 8602 Injuries
## 2 HAIL 15 Fatalities 1371 Injuries
## 3 HEAT 3138 Fatalities 9224 Injuries
## 4 OTHER 2626 Fatalities 12224 Injuries
## 5 RAIN 114 Fatalities 305 Injuries
## 6 SNOW 164 Fatalities 1164 Injuries
## 7 STORM 416 Fatalities 5339 Injuries
## 8 TORNADO 5661 Fatalities 91407 Injuries
## 9 WIND 1209 Fatalities 9001 Injuries
## 10 WINTER 278 Fatalities 1891 Injuries
aggPropCropDmg <- ddply(dataSelection, .(EVENT), summarize, total = sum(PropertyDamage + CropDamage, na.rm = T))
aggPropCropDmg$Type <- "Property and crop damage"
aggProp <- ddply(dataSelection, .(EVENT), summarize, total = sum(PropertyDamage, na.rm = T))
aggProp$Type <- "Property"
aggCrop <- ddply(dataSelection, .(EVENT), summarize, total = sum(CropDamage, na.rm = T))
aggCrop$Type <- "Crop"
aggEconomic <- rbind(aggProp, aggCrop)
EconomicByEvent <- join(aggProp, aggCrop, by = "EVENT", type = "inner")
EconomicByEvent
## EVENT total Type total Type
## 1 FLOOD 168184668589 Property 12266906100 Crop
## 2 HAIL 15736042956 Property 3046837470 Crop
## 3 HEAT 20325750 Property 904469280 Crop
## 4 OTHER 97248432317 Property 23588880870 Crop
## 5 RAIN 3270230190 Property 919315800 Crop
## 6 SNOW 1024339740 Property 134683100 Crop
## 7 STORM 66513046188 Property 6374474880 Crop
## 8 TORNADO 58603317864 Property 417461520 Crop
## 9 WIND 10847166565 Property 1403719150 Crop
## 10 WINTER 6777295251 Property 47444000 Crop
The first question to analyze is about the types of weather events that are most harmful with respect to population health.
aggHealth$EVENT <- as.character(aggHealth$EVENT)
plotHealth <- ggplot(aggHealth, aes(x = EVENT, y = total, fill = Type)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette="Pastel2") +
coord_flip() +
xlab("Event") +
ylab("Number of health impacts") +
ggtitle("Weather event types and impact on population health") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
print(plotHealth)
From the graph presented above it can be seen that tornadoes, by far, have the most influence on health impacts. Wind, heat, flood, and other weather events have the second largest impact on health. It can also be observed that injuries are the dominant force on health impacts.
The second question to analyze relates to the types of weather events that have the greatest economic consequences.
aggEconomic$EVENT <- as.character(aggEconomic$EVENT)
plotEconomic <- ggplot(aggEconomic, aes(x = EVENT, y = total, fill = Type)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette="Pastel1") +
coord_flip() +
xlab("Event") +
ylab("Economic damage in US dollars") +
ggtitle("Weather event types and impact on economy (property and crop damage)") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
print(plotEconomic)
From the graph presented above it can be concluded that greatest economic damage, measured in US dollars, is due to floods. Other weather events have the second largest impact. It can be seen that economic damage on property is the main factor in economic damage, relatively to crop damage.