I use the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to identify the weather events that are most harmful to population health and that have the greatest economic impact in the country. The analysis is performed for the country as a whole, not by state or county. Effects on population health are measured as the number of people killed or injured, that is, fatalities and injuries summed together. Economic impact is measured as property and crop damage in billions of US dollars.
## you do not need this step if the packages are already installed; if they are not, remove the ## in front of the code line
## install.packages(c("ggplot2", "dplyr", "knitr"))
library(ggplot2)
library(dplyr)
library(knitr)
## SET PATH AND FOLDER
path <- getwd()
if(!file.exists("./data")){dir.create("./data")}
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
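## the archive is bzip2-compressed despite the .zip name used here; read.csv detects the compression from the file contents
filename <- "OriginalDataSet.zip"
## mode = "wb" keeps the compressed file intact when downloading on Windows
download.file(url, file.path(path, "data", filename), mode = "wb")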
After loading the whole data set into R and exploring it, I noticed that long chunks of narrative text (the REMARKS column) are printed together with the variable values. I inspected the column names and identified the 7 variables needed for this analysis. Note that the two PDF files suggested as help for understanding the data set do not contain a codebook, so I had to make several assumptions on my own.
dt <- read.csv(file.path(path, "data", "OriginalDataSet.zip"))
tail(dt,1)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 902297 1 11/28/2011 0:00:00 08:00:00 PM CST 6 ALZ006
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 902297 AL HEAVY SNOW 0 11/29/2011 0:00:00
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 902297 04:00:00 AM 0 NA 0
## LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 902297 0 0 NA 0 0 0 0 K 0
## CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 902297 K HUN ALABAMA, North MADISON - MADISON 0 0
## LATITUDE_E LONGITUDE_
## 902297 0 0
## REMARKS
## 902297 EPISODE NARRATIVE: An intense upper level low developed on the 28th at the base of a highly amplified upper trough across the Great Lakes and Mississippi Valley. The upper low closed off over the mid South and tracked northeast across the Tennessee Valley during the morning of the 29th. A warm conveyor belt of heavy rainfall developed in advance of the low which dumped from around 2 to over 5 inches of rain across the eastern two thirds of north Alabama and middle Tennessee. The highest rain amounts were recorded in Jackson and DeKalb Counties with 3 to 5 inches. The rain fell over 24 to 36 hour period, with rainfall remaining light to moderate during most its duration. The rainfall resulted in minor river flooding along the Little River, Big Wills Creek and Paint Rock. A landslide occurred on Highway 35 just north of Section in Jackson County. A driver was trapped in his vehicle, but was rescued unharmed. Trees, boulders and debris blocked 100 to 250 yards of Highway 35.\n\nThe rain mixed with and changed to snow across north Alabama during the afternoon and evening hours of the 28th, and lasted into the 29th. The heaviest bursts of snow occurred in northwest Alabama during the afternoon and evening hours, and in north central and northeast Alabama during the overnight and morning hours. Since ground temperatures were in the 50s, and air temperatures in valley areas only dropped into the mid 30s, most of the snowfall melted on impact with mostly trace amounts reported in valley locations. However, above 1500 foot elevation, snow accumulations of 1 to 2 inches were reported. The heaviest amount was 2.3 inches on Monte Sano Mountain, about 5 miles northeast of Huntsville.EVENT NARRATIVE: Snowfall accumulations of up to 2.3 inches were reported on the higher elevations of eastern Madison County. A snow accumulation of 1.5 inches was reported 2.7 miles south of Gurley, while 2.3 inches was reported 3 miles east of Huntsville atop Monte Sano Mountain.
## REFNUM
## 902297 902297
## nasty text rows appear; will explore the names of the variables and will load only the necessary columns
I will keep only 7 variables and I expect that this will solve the problem with the text chunks. The variables are: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. The variables ending in -EXP give the magnitude of the property and crop damage figures: thousands, millions or billions of dollars.
dt1 <- read.csv(file.path(path, "data", "OriginalDataSet.zip"))[ ,c('EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')]
tail(dt1)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 902292 WINTER WEATHER 0 0 0 K 0
## 902293 HIGH WIND 0 0 0 K 0
## 902294 HIGH WIND 0 0 0 K 0
## 902295 HIGH WIND 0 0 0 K 0
## 902296 BLIZZARD 0 0 0 K 0
## 902297 HEAVY SNOW 0 0 0 K 0
## CROPDMGEXP
## 902292 K
## 902293 K
## 902294 K
## 902295 K
## 902296 K
## 902297 K
## no more strange text rows; I assume what I did was enough to solve the problem
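The call above still parses every column before subsetting. As an alternative, here is a minimal sketch (not what the analysis above does) that skips unwanted columns at read time through the `colClasses` argument of `read.csv`, where the value "NULL" drops a column. The toy file, its contents and the names `wanted`, `classes` and `small` are made up for illustration.

```r
# Hypothetical small CSV standing in for the storm data file
tmp <- tempfile(fileext = ".csv")
writeLines(c("STATE,EVTYPE,FATALITIES,REMARKS",
             "AL,TORNADO,0,\"long narrative text\"",
             "AL,HAIL,1,\"more narrative text\""), tmp)

# Columns to keep (assumption: only these two matter in the toy file)
wanted <- c("EVTYPE", "FATALITIES")

# Read a single row just to learn the column names
header <- names(read.csv(tmp, nrows = 1))

# NA lets read.csv guess the type; "NULL" tells it to skip the column
classes <- ifelse(header %in% wanted, NA, "NULL")

small <- read.csv(tmp, colClasses = classes)
names(small)
```

With the real file the same idea would apply to the 7 variable names listed above, at the cost of reading the header once beforehand.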
Of the damage magnitude codes I could only interpret the letters K, M and B, as thousands, millions and billions. I could not make sense of the other symbols and I will drop the rows that contain them. Luckily they make up far less than 1% of the data set (about 0.04%). I also checked for missing values. There are none.
unique(dt1$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
## we do not actually have a codebook for this assignment; I have to guess what these codes mean
table(dt1$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
## 465934      1      8      5    216     25     13      4      4     28
##      6      7      8      B      h      H      K      m      M
##      4      5      1     40      1      6 424665      7  11330
unique(dt1$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
table(dt1$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
## the most frequent levels are K, M and the empty string (no code given); I will keep only them and B
keep <- c("", "K", "M", "B")
dt2 <- dt1[dt1$PROPDMGEXP %in% keep & dt1$CROPDMGEXP %in% keep, ]
## K stands for thousands, M for millions, B for billions; the rest of the codes I did not understand and their numbers were negligible
length(dt2$EVTYPE) / length(dt1$EVTYPE)
## [1] 0.9995833
## 99.96% of the data set is preserved
sum(is.na(dt2))
## [1] 0
## no missing values
All monetary damages are converted into one unit: billions of US dollars.
## I will convert damages to one unit: first to dollars and then to billions of dollars
dt2$PROPDMG <- with(dt2, ifelse(PROPDMGEXP == "K", PROPDMG*1000, ifelse(PROPDMGEXP == "M", PROPDMG*1000000, ifelse(PROPDMGEXP == "B", PROPDMG*1000000000, PROPDMG))))
dt2$CROPDMG <- with(dt2, ifelse(CROPDMGEXP == "K", CROPDMG*1000, ifelse(CROPDMGEXP == "M", CROPDMG*1000000, ifelse(CROPDMGEXP == "B", CROPDMG*1000000000, CROPDMG))))
dt2$PROPDMG <- dt2$PROPDMG/1000000000
dt2$CROPDMG <- dt2$CROPDMG/1000000000
dt3 <- select(dt2, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
rm(dt, dt1, dt2)
dt <- dt3
rm(dt3)
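The same conversion can also be written with a named lookup vector instead of nested `ifelse` calls. This is only a sketch under assumptions: the data frame `toy`, its values and the names `multiplier` and `PROPDMG_bln` are hypothetical, and codes other than K, M and B are treated as a multiplier of 1, as in the code above.

```r
# Toy data; the damage values are made up for illustration
toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 10),
                  PROPDMGEXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)

# Named vector mapping each magnitude code to its multiplier in dollars
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# as.character() guards against factor columns, which would otherwise
# be used as integer positions instead of names
mult <- unname(multiplier[as.character(toy$PROPDMGEXP)])

# Codes that are not in the lookup (here the empty string) give NA;
# leave those amounts unscaled
mult[is.na(mult)] <- 1

# Dollars first, then billions of dollars
toy$PROPDMG_bln <- toy$PROPDMG * mult / 1e9
toy
```

A lookup vector keeps the code-to-multiplier mapping in one place, so the same vector could be reused for both the property and the crop columns.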
From the cleaned data set I form a new data set (SumByEvtype) which sums the casualties and the costs by event type. In this new data set I calculate two new variables. The variable Health is the sum of fatalities and injuries; it measures the effect of each event type on population health. The variable Economic is the sum of property and crop damages; it quantifies the economic consequences. Then I create two more data sets, byHealth and byEconomic, sorted in descending order by Health and by Economic respectively. From them I take the top 3 most damaging event types for the plots.
SumByEvtype <- summarize(group_by(dt, EVTYPE), totalFatalities = sum(FATALITIES), totalInjuries = sum(INJURIES), totalPropdmg = sum(PROPDMG), totalCropdmg = sum(CROPDMG))
SumByEvtype$Health <- SumByEvtype$totalFatalities + SumByEvtype$totalInjuries
SumByEvtype$Economic <- SumByEvtype$totalPropdmg + SumByEvtype$totalCropdmg
byHealth <- arrange(SumByEvtype, desc(Health))
byEconomic <- arrange(SumByEvtype, desc(Economic))
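To make the aggregation logic easier to follow, here is a small base R sketch of the same steps on made-up data. It uses `aggregate` and `order` rather than the dplyr calls above, so it can serve as a dependency-free cross-check; the data frame `toy` and all its values are hypothetical.

```r
# Toy data with made-up values; damages are in billions of dollars
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(2, 0, 1, 0, 3),
  INJURIES   = c(10, 1, 5, 0, 0),
  PROPDMG    = c(0.5, 2.0, 0.1, 0.01, 1.0),
  CROPDMG    = c(0, 0.3, 0, 0.02, 0.1)
)

# Sum every numeric column within each event type
sums <- aggregate(cbind(FATALITIES, INJURIES, PROPDMG, CROPDMG) ~ EVTYPE,
                  data = toy, FUN = sum)

# Combined measures, defined as in the text
sums$Health   <- sums$FATALITIES + sums$INJURIES
sums$Economic <- sums$PROPDMG + sums$CROPDMG

# Sort in descending order by each measure
topHealth   <- sums[order(-sums$Health), ]
topEconomic <- sums[order(-sums$Economic), ]

topHealth[, c("EVTYPE", "Health")]
topEconomic[, c("EVTYPE", "Economic")]
```

In this toy example the tornado rows dominate Health and the flood rows dominate Economic, purely because of the values chosen.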
With regard to population health, tornadoes are the single most harmful event type. They killed or injured almost 100,000 people in the period 1950 to 2011. The second and third most damaging event types are far less harmful, on a much smaller scale.
g1 <- ggplot(byHealth[1:3,], aes(EVTYPE, Health))
g1 + geom_bar(stat = "identity") + labs(title = "Top 3 events most harmful to population health") + labs(x = 'Event', y = "Fatalities & Injuries, persons")
In monetary terms, floods are the most serious weather event type. The property and crop damages from them add up to about 150 billion dollars. The second and third event types combined do not reach that amount.
g2 <- ggplot(byEconomic[1:3,], aes(EVTYPE, Economic))
g2 + geom_bar(stat = "identity") + labs(title = "Top 3 events with the greatest economic consequences") + labs(x = 'Event', y = "Property & Crop Damages, bln $")
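One detail about the two bar charts: ggplot2 orders a discrete axis by factor levels, which are alphabetical for plain text labels, so the bars do not necessarily appear in descending order. A minimal sketch of one way to control this with base R's `reorder` is shown here. The data frame `top3`, the labels "EVENT B" and "EVENT C" and all values are made up, and the plot is drawn only if ggplot2 is installed.

```r
# Hypothetical top-3 table; labels and values are made up
top3 <- data.frame(EVTYPE = c("TORNADO", "EVENT B", "EVENT C"),
                   Health = c(30, 20, 10),
                   stringsAsFactors = FALSE)

# Alphabetical order would put TORNADO last on the axis
sort(top3$EVTYPE)

# reorder() returns a factor whose levels follow the given scores;
# negating Health sorts the levels from largest to smallest value
top3$EVTYPE <- reorder(top3$EVTYPE, -top3$Health)
levels(top3$EVTYPE)

# Draw the chart only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(top3, ggplot2::aes(EVTYPE, Health)) +
    ggplot2::geom_col() +
    ggplot2::labs(x = "Event", y = "Fatalities & Injuries, persons")
  print(p)
}
```

`geom_col()` is used here as the shorthand for `geom_bar(stat = "identity")`.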
In conclusion, tornadoes and floods are the most harmful weather events, and resources should be prioritized for them.