Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This report explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database that tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
In this report we aim to analyse which types of events as indicated by the EVTYPE variable are most harmful with respect to population health, and which types have the greatest economic consequencies across the United States.
To accomplish this analysis we will explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database that tracks characteristics of major storms and weather events in the United States between the years 1950 and November 2011.
Since this report requires writing code chunks in the R markdown document; I have set echo = TRUE and set results = hold as global options so that the reader can read them.
knitr::opts_chunk$set(echo = TRUE, results = 'hold')
# load packages
suppressMessages(require("data.table"))
suppressMessages(require("dplyr"))
suppressMessages(require("tidyr"))
suppressMessages(require("ggplot2"))
suppressMessages(require("gridExtra"))
## Warning: package 'gridExtra' was built under R version 3.1.3
suppressMessages(require("R.utils"))
## Warning: package 'R.utils' was built under R version 3.1.3
## Warning: package 'R.oo' was built under R version 3.1.3
## Warning: package 'R.methodsS3' was built under R version 3.1.3
First we will download the database from the course web site. It is in the form of a comma-separated-value file, compressed via the bzip2 algorithm to reduce its size, so we will have to unzip it.
# download file from the URL
if (!file.exists("stormdata.csv.bz2")) {
fileUrl = "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "./stormdata.csv.bz2")
# unzip stormdata.csv dataset
bunzip2("./stormdata.csv.bz2", "./stormdata.csv", overwrite = T, remove = F)
}
Next we will read the stormdata.csv data that is now loaded in the working directory.
stormdata <- read.table("stormdata.csv", header = TRUE, sep = ",")
Then we will explore its dataset.
dim(stormdata)
## [1] 902297 37
We can see that the stormdata dataset has 902297 rows and 37 columns.
Next we will look at a sample view of the head of the dataset and the tail of the dataset.
# view rows at the head of the dataset
head(stormdata)[1:3,]
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
# view rows at the tail of the dataset
tail(stormdata)[1:3,]
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## 902292 47 11/28/2011 0:00:00 03:00:00 PM CST 21
## 902293 56 11/30/2011 0:00:00 10:30:00 PM MST 7
## 902294 30 11/10/2011 0:00:00 02:48:00 PM MST 9
## COUNTYNAME STATE EVTYPE BGN_RANGE
## 902292 TNZ001>004 - 019>021 - 048>055 - 088 TN WINTER WEATHER 0
## 902293 WYZ007 - 017 WY HIGH WIND 0
## 902294 MTZ009 - 010 MT HIGH WIND 0
## BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 902292 11/29/2011 0:00:00 12:00:00 PM 0
## 902293 11/30/2011 0:00:00 10:30:00 PM 0
## 902294 11/10/2011 0:00:00 02:48:00 PM 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG
## 902292 NA 0 0 0 NA 0
## 902293 NA 0 0 0 NA 66
## 902294 NA 0 0 0 NA 52
## FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO
## 902292 0 0 0 K 0 K MEG
## 902293 0 0 0 K 0 K RIW
## 902294 0 0 0 K 0 K TFX
## STATEOFFIC
## 902292 TENNESSEE, West
## 902293 WYOMING, Central and West
## 902294 MONTANA, Central
## ZONENAMES
## 902292 LAKE - LAKE - OBION - WEAKLEY - HENRY - DYER - GIBSON - CARROLL - LAUDERDALE - TIPTON - HAYWOOD - CROCKETT - MADISON - CHESTER - HENDERSON - DECATUR - SHELBY
## 902293 OWL CREEK & BRIDGER MOUNTAINS - OWL CREEK & BRIDGER MOUNTAINS - WIND RIVER BASIN
## 902294 NORTH ROCKY MOUNTAIN FRONT - NORTH ROCKY MOUNTAIN FRONT - EASTERN GLACIER
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## 902292 0 0 0 0
## 902293 0 0 0 0
## 902294 0 0 0 0
## REMARKS
## 902292 EPISODE NARRATIVE: A powerful upper level low pressure system brought snow to portions of Northeast Arkansas, the Missouri Bootheel, West Tennessee and extreme north Mississippi. Most areas picked up between 1 and 3 inches of with areas of Northeast Arkansas and the Missouri Bootheel receiving between 4 and 6 inches of snow.EVENT NARRATIVE: Around 1 inch of snow fell in Carroll County.
## 902293 EPISODE NARRATIVE: A strong cold front moved south through north central Wyoming bringing high wind to the Meeteetse area and along the south slopes of the western Owl Creek Range. Wind gusts to 76 mph were recorded at Madden Reservoir.EVENT NARRATIVE:
## 902294 EPISODE NARRATIVE: A strong westerly flow aloft produced gusty winds at the surface along the Rocky Mountain front and over the plains of Central Montana. Wind gusts in excess of 60 mph were reported.EVENT NARRATIVE: A wind gust to 60 mph was reported at East Glacier Park 1ENE (the Two Medicine DOT site).
## REFNUM
## 902292 902292
## 902293 902293
## 902294 902294
From this view, We can see that recording of events in the dataset starts around April 18, 1950 and ends in November 30, 2011. In the earlier years however, their seems to be fewer events recorded than in recent years; which is most likely due to a lack of good recording of storm events.
Next we will plot an histogram of recorded storm events between 1950 and November 2011, to further explore this analysis.
if (dim(stormdata)[2] == 37) {
stormdata$year <- as.numeric(format(as.Date(stormdata$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
}
hist(stormdata$year, breaks = 30, main = "Recorded Storm Events between 1950 and November 2011", xlab = "Year", ylab= "Number of Storm Events", col= "purple")
By looking at the histogram we can see that poor recording of storm events are between the period 1950 and 1998, with a slow but gradual increase of good recordings between the period 1999 to 2010 and followed by a drastic decrease in 2011 which is the end of the recording period.
Now, in order to identify the variables associated with this research we will look at the required dataset below and the information related to this study from the Storm Data Documentation.
names(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM" "year"
The variables identified are:
Next we will extract these columns and print a summary of the required dataset.
ext.col <- subset(stormdata, select = c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG", "CROPDMGEXP"))
# summarise the ext.col dataset
summary(ext.col)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
From the summary, their are no missing values so we can now process the data for analysis.
Looking at the property damage data, we will have to convert the property damage exponent data column PROPDMGEXP into a column that we will call NUMPROPEXP, that has comparible numerical data based on the units of measurement in the code book Storm Data Documentation.
We will also need to compute a new property damage variable for PROPDMG called PROPDMGVAR based on this conversion.
# property damage exponent data
unique(ext.col$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
# convert the property data exponent data
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "K"] <- 1000
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "M"] <- 1e+06
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == ""] <- 1
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "B"] <- 1e+09
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "m"] <- 1e+06
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "0"] <- 1
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "5"] <- 1e+05
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "6"] <- 1e+06
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "4"] <- 10000
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "2"] <- 100
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "3"] <- 1000
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "h"] <- 100
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "7"] <- 1e+07
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "H"] <- 100
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "1"] <- 10
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "8"] <- 1e+08
# give 0 to invalid exponent data, so they do not count in
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "+"] <- 0
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "-"] <- 0
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "?"] <- 0
# compute the new property damage variable
ext.col$PROPDMGVAR <- ext.col$PROPDMG * ext.col$NUMPROPEXP
Looking at the crop damage data, we will have to convert the crop damage exponent data column CROPDMGEXP into a column that we will call NUMCROPEXP, that has comparible numerical data based on the units of measurement in the code book Storm Data Documentation.
We will also need to compute a new crop damage variable for CROPDMG called CROPDMGVAR based on this conversion.
# crop damage exponent data
unique(ext.col$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
# convert the crop damage exponent data
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "M"] <- 1e+06
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "K"] <- 1000
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "m"] <- 1e+06
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "B"] <- 1e+09
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "0"] <- 1
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "k"] <- 1000
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "2"] <- 100
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == ""] <- 1
# give 0 to invalid exponent data, so they not count in
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "?"] <- 0
# compute the new crop damage variable
ext.col$CROPDMGVAR <- ext.col$CROPDMG * ext.col$NUMCROPEXP
Now that we have completed processing the property damage data as well as the crop damage data, we can aggregate all variables in the ext.col dataset.
# compute the storm data
fatalities <- aggregate(FATALITIES ~ EVTYPE, data = ext.col, sum)
injuries <- aggregate(INJURIES ~ EVTYPE, data = ext.col, sum)
property <- aggregate(PROPDMGVAR ~ EVTYPE, data = ext.col, sum)
crop <- aggregate(CROPDMGVAR ~ EVTYPE, data = ext.col, sum)
To answer this question, we will create a dataset of the top 15 total fatalites and injuries caused by storms and plot 2 graphs, one showing total storm fatalites and the other showing total storm injuries between the years 1950 and 2011.
# get top 15 event with highest fatalities
fatal <- fatalities[order(-fatalities$FATALITIES), ][1:15, ]
# get top 15 event with highest injuries
injury <- injuries[order(-injuries$INJURIES), ][1:15, ]
# plot graph showing fatalities
fatalitiesPlot <- qplot(EVTYPE, data = fatal, weight = FATALITIES, geom = "bar", binwidth = 1) +
scale_y_continuous("Number of Fatalities") +
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + xlab("Severe Storm Events") +
ggtitle("Total Fatalities \n by Severe Storms in the U.S.\n from 1950 - 2011")
# plot graph showing injuries
injuriesPlot <- qplot(EVTYPE, data = injury, weight = INJURIES, geom = "bar", binwidth = 1) +
scale_y_continuous("Number of Injuries") +
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + xlab("Severe Storm Events") +
ggtitle("Total Injuries \n by Severe Storms in the U.S.\n from 1950 - 2011")
grid.arrange(fatalitiesPlot, injuriesPlot, ncol=2 )
Base on the outcome of the total fatalities graph, we can see that tornadoes are the most harmful storm type, having over 5000 fatalities, followed by execessive heat having over 1500 fatalities then heat and lightning with approximately 1000 fatalities.
Base on the outcome of the total injuries graph we can see that once again tornadoes are the most harmful storm type, having over 7500 injuries, followed by tropical storm wind, flood and execessive heat; all in close proximity to 1250 injuries.
To answer this question, We will create a dataset of the top 15 property and crop damages caused by storms and plot 2 graphs, one showing total property damages and the other showing total crop damages between the years 1950 and 2011.
# get top 15 event with highest property damages
properties <- property[order(-property$PROPDMGVAR), ][1:15, ]
# get top 15 event with highest crop damages
crops <- crop[order(-crop$CROPDMGVAR), ][1:15, ]
# plot graph showing property damages
properyPlot <- qplot(EVTYPE, data = properties, weight = PROPDMGVAR/(10^9), geom = "bar", binwidth = 1) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous("Property Damages ($ billion)")+
xlab("Severe Storm Events") +
ggtitle("Total Property Damages by \n Severe Storms in the U.S. \n from 1950 - 2011")
# plot graph showing crop damages
cropPlot <- qplot(EVTYPE, data = crops, weight = CROPDMGVAR/(10^9), geom = "bar", binwidth = 1) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous("Crop Damages ($ billion)") +
xlab("Severe Storm Events") +
ggtitle("Total Crop Damages by \n Severe Storms in the U.S. \n from 1950 - 2011")
grid.arrange(properyPlot, cropPlot, ncol=2)
Base on the outcome of the total properties damage graph, we can see that floods are the most costly, having over US$125 billion in property damages, followed by hurricane and typhoon having close to US$75 billion. Next is tornado having cost of over US$50 billion and storm surge having cost close to US$50 billion.
Base on the outcome of the total crop damage graph we can see that droughts are the most costly, having over US$12.5 billion in crop damages, followed by flood, having cost of over US$5 billion. Next, almost sharing similar costs of US$5 billion are ice storm and river flood .
From this exploratory analysis we were able to see that among the most harmful storm events identifed, tornadoes are the most destructive. This type of storm not only drastically affects population health but the number one major contributing factor to economic consequencies across the United States from the year 1950 until the year 2011.