Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
This analysis focuses on answering two basic questions:
1- Which types of events are most harmful with respect to population health? 2- Which types of events have the greatest economic consequences?
The data for this project come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from this web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Download and read the storm dataset. This code chunck is time-consuming and we will be using the cache = TRUE option. This way, after reading the dataset, it will be saved in cache.
# Set working directory
setwd("~/R_coursera/Reproducible Research/Assignment2")
# If file is not in the working directory then download it.
if (!file.exists("stormdata.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "stormdata.csv.bz2")
}
# Read storm dataset
stormdata <- read.csv("stormdata.csv.bz2", header = TRUE, sep = "," )
This is the structure of the raw dataset that will be used:
# Display stormdata dataset structure
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The first 6 records of the Storm dataset look like this:
# Display first 6 records of the stormdata dataset
head(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
In order to make the analysis less time-comsuming, a subset of the Storm dataset will be created. The new structure of the subset dataset will be displayed :
# Subset stormdata with only the required columns
cols <- c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
stormsub <- stormdata[c(8, 23:28)]
# Display stormdata subset head records
head(stormsub)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
To answer this questions is necessary to aggregate the dataset by type of event(EVTYPE) and add the variables FATALITIES and INJURIES. After, only the 15 most harmfull types of events will selected to be later displayed in a bar chart to better illustrate the answer.
# Aggregate number of harmed people by storm type of event
harmfull <- aggregate(FATALITIES+INJURIES ~ EVTYPE, stormsub, sum)
# Change data frame columns names
names(harmfull) = c("EVTYPE","FATALINJUR")
# Get only the top 15 most harmfull storm events
topharmfull <- head(harmfull[order(harmfull$FATALINJUR, decreasing= T),], n = 15)
tevent <- topharmfull$EVTYPE[1]
tinjur <- topharmfull$FATALINJUR[1]
topharmfull$FATALINJUR <- topharmfull$FATALINJUR/1000
Here is the bar chart to better illustrate the 15 most harmfull types of events:
# Plot a bar chart total number of harmed people by storm type of event
barplot(topharmfull$FATALINJUR,
names.arg = topharmfull$EVTYPE,
ylab = "Total number fatalities and injuries(in thousands)", main = "Top 15 Most Harmfull Types of Events",
cex.axis = 0.8, cex.names = 0.6, col = "blue", axes = T, las = 2, ylim = c(0,100))
The most harmfull type of event is TORNADO and it caused 9.697910^{4} fatalities and injuries.
To better answer this question, some extra data processing is necessary:
The exponents are stored in the dataset in many different characters:
# In order to get the total of economic damage, it is necessary to multiply the damage by its exponent
unique(stormsub$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
Property and crop damage are calculated multiplying the value of the damage(“PROPDMG” and “CROPDMG”) by its exponent(“PROPDMGEXP” and “CROPDMGEXP”). The exponents will be converted to a numeric based on its value and multiplied accordingly.
# create a column with numeric proprety damage exponent
stormsub$PROPDMGEXP <- as.character(stormsub$PROPDMGEXP)
stormsub$PROPDMGEXP[(stormsub$PROPDMGEXP=="")|(stormsub$PROPDMGEXP=="+")|(stormsub$PROPDMGEXP=="?")|
(stormsub$PROPDMGEXP=="-")|(stormsub$PROPDMGEXP=="0")] <- 0
stormsub$PROPDMGEXP[(stormsub$PROPDMGEXP=="h")|(stormsub$PROPDMGEXP=="H")] <- 2
stormsub$PROPDMGEXP[(stormsub$PROPDMGEXP=="K")] <- 3
stormsub$PROPDMGEXP[(stormsub$PROPDMGEXP=="M")|(stormsub$PROPDMGEXP=="m")] <- 6
stormsub$PROPDMGEXP[(stormsub$PROPDMGEXP=="B")] <- 9
stormsub$CROPDMGEXP <- as.character(stormsub$CROPDMGEXP)
stormsub$CROPDMGEXP[(stormsub$CROPDMGEXP=="")|(stormsub$CROPDMGEXP=="?")|(stormsub$CROPDMGEXP=="0")] <- 0
stormsub$CROPDMGEXP[(stormsub$CROPDMGEXP=="K")|(stormsub$CROPDMGEXP=="k")] <- 3
stormsub$CROPDMGEXP[(stormsub$CROPDMGEXP=="M")|(stormsub$CROPDMGEXP=="m")] <- 6
stormsub$CROPDMGEXP[(stormsub$CROPDMGEXP=="B")] <- 9
stormsub$PROPDMGEXP <- as.numeric(stormsub$PROPDMGEXP)
stormsub$CROPDMGEXP <- as.numeric(stormsub$CROPDMGEXP)
stormsub$TPROPDMG <- stormsub$PROPDMG*10^stormsub$PROPDMGEXP
stormsub$TCROPDMG <- stormsub$CROPDMG*10^stormsub$CROPDMGEXP
stormsub$TOTDMG <- stormsub$TPROPDMG+stormsub$TCROPDMG
It is also necessary to aggregate the types of events by the total damage it causes. Then, only the top 15 types of events with the heghest economic damages are selected. To better illustrate the answer , the total damage is divided by one billion and displayed in a bar chart in US$ billions.
# Aggregate number of harmed people by storm type of event
ecodmg <- aggregate(TOTDMG ~ EVTYPE, stormsub, sum)
# Get only the top 15 highest economic damage storm events
topecodmg <- head(ecodmg[order(ecodmg$TOTDMG, decreasing= T),], n = 15)
# divide totals by 1,000,000,000 to convert in units of billion
topecodmg$TOTDMG <- topecodmg$TOTDMG/1000000000
#variables with top event and its value
tedmg <- topecodmg$EVTYPE[1]
tvdmg <- topecodmg$TOTDMG[1]
Here is the bar plot that illustrates the top 15 types of events with highest economic damage:
# Plot a bar chart total economic damage by storm type of event
barplot(topecodmg$TOTDMG,
names.arg = topecodmg$EVTYPE,
ylab = "Total economic damage(in US$ billions)", main = "Top 15 highest economic damage Types of Events",
cex.axis = 0.8, cex.names = 0.6, col = "green", axes = T, las = 2)
The type of event with greatest economic consequence is FLOOD and it causes an economic damage of 150.3196783 in billions of US dollars.