This report aims at identifying the most harmful weather events in the U.S. as measured from 1950 to 2011. It shows which events have had the greatest impact on public health and the economy. The report is based on the Storm Data published by the the National Oceanic and Atmospheric Administration (NOAA).
These data measure the number of fatalities and injuries as well as the values of damages on property and crops for each event. Specifically, the report addresses the two following questions:
It suggests that, overall, floods and tornados are the most harmful weather events in the U.S.
Note: We created the sub-directory “ds-project-52” for this project.
We are obtaining the data from the course web site Storm Data.
#Download and unzip zip file, read in the data
if (!file.exists("./ds-project-52/repdata_dat_StormData.csv")) {
library(R.utils)
getzip <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
ziparch <- "./ds-project-52/repdata_dat_StormData.csv.bz2"
download.file(getzip, destfile = ziparch)
bunzip2(ziparch, destname = "./ds-project-52/repdata_dat_StormData.csv")
}
dataraw <- read.csv("./ds-project-52/repdata_dat_StormData.csv", header=TRUE, stringsAsFactors=FALSE)
str(dataraw)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
#Checking for NA
sum (is.na(dataraw))
## [1] 1745947
For the questions to answer here we need only the following variables:
We don’t want to loose completely the time and space dimensions. So we keep also these variables:
We reduce the data set to these variables.
datared<-dataraw[, c("BGN_DATE", "STATE", "EVTYPE","FATALITIES","INJURIES","PROPDMG", "PROPDMGEXP","CROPDMG","CROPDMGEXP")]
In order to answer the questions of this report the variable EVTYPE is crucial. So let’s have a closer look at it. First, we look at how many types there are.
types_num0 <- length(sort(unique(datared$EVTYPE)))
There are 985 types, which seems far too many.
A closer look at all of these types shows that they are rather incoherently denominated. I.e. leading spaces, upper- and lowercases mixed (“Beach Erosion”, “BEACH EROSION”), typos, (“BEACH EROSIN”), abreviations (“TSTM”, “WND”), singular/plural (“THUNDERSTRORM WIND”, “THUNDERSTORM WINDS”, spaces (“WILD FIRES”, WILDFIRES). Displaying all types would unduly inflate this report, so we will just list the first 40 types for illustration.
evtype <- sort(unique(datared$EVTYPE))
head(evtype, 40)
## [1] " HIGH SURF ADVISORY" " COASTAL FLOOD"
## [3] " FLASH FLOOD" " LIGHTNING"
## [5] " TSTM WIND" " TSTM WIND (G45)"
## [7] " WATERSPOUT" " WIND"
## [9] "?" "ABNORMAL WARMTH"
## [11] "ABNORMALLY DRY" "ABNORMALLY WET"
## [13] "ACCUMULATED SNOWFALL" "AGRICULTURAL FREEZE"
## [15] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [19] "AVALANCHE" "BEACH EROSIN"
## [21] "Beach Erosion" "BEACH EROSION"
## [23] "BEACH EROSION/COASTAL FLOOD" "BEACH FLOOD"
## [25] "BELOW NORMAL PRECIPITATION" "BITTER WIND CHILL"
## [27] "BITTER WIND CHILL TEMPERATURES" "Black Ice"
## [29] "BLACK ICE" "BLIZZARD"
## [31] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"
## [33] "Blizzard Summary" "BLIZZARD WEATHER"
## [35] "BLIZZARD/FREEZING RAIN" "BLIZZARD/HEAVY SNOW"
## [37] "BLIZZARD/HIGH WIND" "BLIZZARD/WINTER STORM"
## [39] "BLOW-OUT TIDE" "BLOW-OUT TIDES"
We will try to eliminate the most obvious redundancies by hand.
datared$EVTYPE <- trimws(datared$EVTYPE)
datared$EVTYPE <- toupper(dataraw$EVTYPE)
datared$EVTYPE <- gsub("WND", "WIND", datared$EVTYPE)
datared$EVTYPE <- gsub("WINDS", "WIND", datared$EVTYPE)
datared$EVTYPE <- gsub("WIND CH", "WIND CHIL", datared$EVTYPE)
datared$EVTYPE <- gsub("TSTM", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("TIDES", "TIDE", datared$EVTYPE)
datared$EVTYPE <- gsub("AVALANCE", "AVALANCHE", datared$EVTYPE)
datared$EVTYPE <- gsub("EROSIN", "EROSION", datared$EVTYPE)
datared$EVTYPE <- gsub(" & ", "/", datared$EVTYPE)
datared$EVTYPE <- gsub("FIRES", "FIRE", datared$EVTYPE)
datared$EVTYPE <- gsub("WILDFIRES", "WILD FIRE", datared$EVTYPE)
datared$EVTYPE <- gsub("FLOODING", "FLOOD", datared$EVTYPE)
datared$EVTYPE <- gsub("CSTL", "COASTAL", datared$EVTYPE)
datared$EVTYPE <- gsub("COASTALFLOOD", "COASTAL FLOOD", datared$EVTYPE)
datared$EVTYPE <- gsub("COASTALSTORM", "COASTAL STORM", datared$EVTYPE)
datared$EVTYPE <- gsub("FUNNELS", "FUNNEL", datared$EVTYPE)
datared$EVTYPE <- gsub("TEMPERATURES", "TEMPERATURE", datared$EVTYPE)
datared$EVTYPE <- gsub("MUDSLIDES", "MUDSLIDE", datared$EVTYPE)
datared$EVTYPE <- gsub("THUNDERSTORMW", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("THUNERSTORM", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("THUNDERSTROM", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("THUNDERTSORM", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("THUNDERSTORMS", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("HUNDERSTORMWINDS", "THUNDERSTORM WIND", datared$EVTYPE)
datared$EVTYPE <- gsub("TORNADOES", "TORNADO", datared$EVTYPE)
datared$EVTYPE <- gsub("TORNADOS", "TORNADO", datared$EVTYPE)
datared$EVTYPE <- gsub("WATERSPROUT", "WATER SPROUT", datared$EVTYPE)
types_num1 <- length(sort(unique(datared$EVTYPE)))
So, we could reduce the types from 985 to 827. This is still far too many compared to the 48 types defined in the Storm Data Documentation. However, it is beyond the scope of this analysis to do more cleaning. As we will focus on the few types with highest impact, we asume that the effects of these incoherencies will be limited. Still, when interpreting the results, we have to bear in mind that there is a probleme here.
Eventually, we get the years out of the BGN_DATE variable.
#datared$BGN_DATE <- as.Date(data1$BGN_DATE, format = "%d/%m/%Y")
datared$YEAR <- as.numeric(format(as.Date(datared$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
Checking for NA
sum(is.na(datared))
## [1] 0
We don’t have to worry about missing values.
What are the main characteristics of the data?
str(datared)
## 'data.frame': 902297 obs. of 10 variables:
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ YEAR : num 1950 1950 1951 1951 1951 ...
head(datared)
## BGN_DATE STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 4/18/1950 0:00:00 AL TORNADO 0 15 25.0 K
## 2 4/18/1950 0:00:00 AL TORNADO 0 0 2.5 K
## 3 2/20/1951 0:00:00 AL TORNADO 0 2 25.0 K
## 4 6/8/1951 0:00:00 AL TORNADO 0 2 2.5 K
## 5 11/15/1951 0:00:00 AL TORNADO 0 2 2.5 K
## 6 11/15/1951 0:00:00 AL TORNADO 0 6 2.5 K
## CROPDMG CROPDMGEXP YEAR
## 1 0 1950
## 2 0 1950
## 3 0 1951
## 4 0 1951
## 5 0 1951
## 6 0 1951
Let’s see, how the the measurments are spread in time and space.
#Create a histogram for events by year.
#Load packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.5
yplot <- ggplot(data=datared, aes(datared$YEAR)) + geom_histogram(breaks=seq(1950, 2010, by = 0.5), col="red", fill="orange", alpha = .5) + labs(title="Weather Events in Time") + labs(x="YEAR", y="Measured Weather Events")
#Create a histogram for events by most affected states.
library(plyr)
## Warning: package 'plyr' was built under R version 3.2.3
states <- count(datared, vars = "STATE")
states <- states[order(states$freq, decreasing = TRUE), ]
sttop <- states[1:40, ]
sttop$STATE <- factor(sttop$STATE, levels = sttop$STATE)
sttop$STATE <- factor(sttop$STATE, levels = rev(levels(sttop$STATE)))
stplot <- ggplot(sttop, aes(x=factor(STATE), y=freq))+ geom_bar(stat ="identity", fill="green") + ggtitle("Weather Events by State") + xlab("States") + ylab("Number of Weather Events") + scale_y_continuous(breaks = seq(0,80000, by = 10000))
grid.arrange(yplot, stplot, ncol=1, bottom="Figure 1: Extreme Weather Events in Time and Space")
From figure 1 we see that the majority of the measurments falls within the period of 1995/2011 and that Texas is the state by far most affected by extreme weather events, followed by Kansas and Oklahoma.
Question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
The variables FATALITIES and INJURIES are relevant for answering this question.
#Aggregate FATALITIES and INJURIES to variable on population health.
datared$PHEALTH <- datared$FATALITIES + datared$INJURIES
phealth <- aggregate (PHEALTH~EVTYPE, datared, sum)
phealth <- phealth[order(phealth$PHEALTH, decreasing=TRUE), ]
unname(unlist(subset(phealth, PHEALTH >= 20, select = PHEALTH)))
## [1] 96979 10055 8428 7267 6046 3037 2782 2064 1722 1527 1376
## [12] 1339 1148 986 906 796 600 557 551 501 462 431
## [23] 412 398 395 393 360 349 260 251 223 162 153
## [34] 149 143 111 107 107 107 101 100 90 90 86
## [45] 78 68 53 52 51 51 48 45 45 40 37
## [56] 36 36 36 36 32 32 31 30 29 28 28
## [67] 27 25 25 23 23 22
We see that the impact decreases sharply within the first few cases. In order to detect the most harmful types we can concentrate on the first 10 cases without loosing too much information.
fatality <- aggregate (FATALITIES~EVTYPE, datared, sum)
fatality <- fatality[order(fatality$FATALITIES, decreasing=TRUE), ]
injury <- aggregate (INJURIES~EVTYPE, datared, sum)
injury <- injury[order(injury$INJURIES, decreasing=TRUE), ]
#We want to use ggplot2 for plotting. This requires the EVTYPE variable to be set to factor, so that ggplot arranges the events in decreasing order.
phtop <- phealth[1:7, ]
phtop$EVTYPE <- factor(phtop$EVTYPE, levels = phtop$EVTYPE)
phtop$EVTYPE <- factor(phtop$EVTYPE, levels = rev(levels(phtop$EVTYPE)))
fataltop <- fatality[1:7, ]
fataltop$EVTYPE <- factor(fataltop$EVTYPE , levels = fataltop$EVTYPE)
fataltop$EVTYPE <- factor(fataltop$EVTYPE, levels = rev(levels(fataltop$EVTYPE)))
injtop <- injury[1:7, ]
injtop$EVTYPE <- factor(injtop$EVTYPE , levels = injtop$EVTYPE)
injtop$EVTYPE <- factor(injtop$EVTYPE, levels = rev(levels(injtop$EVTYPE)))
Creating figure of most harmful events for population health.
phplot <- ggplot(phtop, aes(x=factor(EVTYPE), y=PHEALTH)) + geom_bar(stat ="identity", fill="blue") + ggtitle("Public Health Damage: Top 7 Events") + xlab("Wheater Events") + ylab("Total public health damage (number of cases") + coord_flip() + scale_y_continuous(breaks = seq(0,90000, by = 10000))
fatalplot <- ggplot(fataltop, aes(x=factor(EVTYPE), y=FATALITIES))+ geom_bar(stat ="identity", fill="black") + ggtitle("Fatalities: Top 7 Events") + xlab("Wheater Events") + ylab("Total number of Fatalities") + coord_flip() + scale_y_continuous(breaks = seq(0,5500, by = 500))
injplot <- ggplot(injtop, aes(x=factor(EVTYPE), y=INJURIES))+ geom_bar(stat ="identity", fill="green") + ggtitle("Injuries: Top 7 Events") + xlab("Wheater Events") + ylab("Total number of Injuries") + coord_flip() +scale_y_continuous(breaks = seq(0,90000, by = 10000))
grid.arrange(phplot, injplot, fatalplot, ncol=1, bottom="Figure 2: Events most harmful with respect to population health (fatalities and injuries")
Question: Across the United States, which types of events have the greatest economic consequences?
We will calculate the economic damages by aggregating property and crop damages. The values of these damages are each represented by two variabels, one recording a number rounded to three digits, the other indicating the magnitude of the values with letters (e.g. “K” for thousands, “M” for millions, and “B” for billions. In order to do any calculation, we will have to convert these letters into their numeric value. The product of the number and the converted value variables will be used as a new new variable for calculating economic damages.
#Transform the magnitude information into numeric.
#Create a vector with the magnitude indications.
dimprop <- unique(datared$PROPDMGEXP)
dimcrop <- unique(datared$CROPDMGEXP)
dim <- sort(unique(c(dimprop, dimcrop)))
dim
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "k"
## [18] "K" "m" "M"
#Create a numerical vector with the respective magnitude values.
dimnum <- c(1, 0, 0, 0, 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7, 10^8, 10^9, 10^2, 10^2, 10^3, 10^3, 10^6, 10^6)
#Create exponential variables for PROPDMG and CROPDMG
for (i in 1:length(dim)) {
datared$PROPEXP[datared$PROPDMGEXP == dim[i]] <- dimnum[i]
datared$CROPEXP[datared$CROPDMGEXP == dim[i]] <- dimnum[i]
}
datared$PROPDMG_VAL <- datared$PROPDMG * datared$PROPEXP
datared$CROPDMG_VAL <- datared$CROPDMG * datared$CROPEXP
datared$ECONDMG <- datared$PROPDMG_VAL + datared$CROPDMG_VAL
Now, we can calculate the economic damage (property and crop damages) by type of event based on the created variable ECONDMG.
econdmg <- aggregate(ECONDMG~EVTYPE, datared, sum)
econdmg <- econdmg[order(econdmg$ECONDMG, decreasing = TRUE), ]
unname(unlist(subset(econdmg, ECONDMG >= 10^8, select = ECONDMG)))
## [1] 150437379757 71913712800 57362334887 43323541000 18761221986
## [6] 18566870733 15018672000 14610229010 11072776564 10292723500
## [11] 8967041360 8382236550 6715441251 6557661893 5161586800
## [16] 4642038000 3191846000 3108658330 2500000000 1602500000
## [21] 1427647890 1380710400 1224560000 1104666000 1067412240
## [26] 942471520 771273950 624100000 601055000 500155700
## [31] 456930000 403258500 394110000 392592060 344613000
## [36] 304230000 274930006 269095007 247627740 241000000
## [41] 144082000 142000000 117500000 112800000 110000000
## [46] 109427250 105000000
We see that we can focus on the first 20 cases.
#Top 20 total economic damage events.
econtop <- econdmg[1:7, ]
#Top 20 property damage events.
propdmg <- aggregate (PROPDMG_VAL~EVTYPE, datared, sum)
propdmg <- propdmg[order(propdmg$PROPDMG_VAL, decreasing = TRUE), ]
proptop <- propdmg[1:7, ]
#Top 20 crop damage events.
cropdmg <- aggregate (CROPDMG_VAL~EVTYPE, datared, sum)
cropdmg <- cropdmg[order(cropdmg$CROPDMG_VAL, decreasing = TRUE), ]
croptop <- cropdmg[1:7, ]
#Set EVTYPE variable to factor, so that ggplot arranges the events in decreasing order.
econtop$EVTYPE <- factor(econtop$EVTYPE , levels = econtop$EVTYPE)
econtop$EVTYPE <- factor(econtop$EVTYPE, levels = rev(levels(econtop$EVTYPE)))
proptop$EVTYPE <- factor(proptop$EVTYPE , levels = proptop$EVTYPE)
proptop$EVTYPE <- factor(proptop$EVTYPE, levels = rev(levels(proptop$EVTYPE)))
croptop$EVTYPE <- factor(croptop$EVTYPE , levels = croptop$EVTYPE)
croptop$EVTYPE <- factor(croptop$EVTYPE, levels = rev(levels(croptop$EVTYPE)))
#Create figure 3 with panel plot for total economic, property and crop damages.
econplot <- ggplot(data=econtop, aes(x=EVTYPE, y=ECONDMG/10^9)) +
geom_bar(stat ="identity", fill = "blue") + ggtitle("Total Economic Damage: Top 7 Events") +
xlab("Wheater Events") + ylab("Total Damage in billion USD") + coord_flip() +
scale_y_continuous(breaks = seq(0,150, by = 20))
propplot <- ggplot(data=proptop, aes(x=EVTYPE, y=PROPDMG_VAL/10^9)) +
geom_bar(stat ="identity", fill = "red") + ggtitle("Property Damage: Top 7 Events") +
xlab("Wheater Events") + ylab("Total Damage in billion USD") + coord_flip() +
scale_y_continuous(breaks = seq(0,150, by = 20))
cropplot <- ggplot(data=croptop, aes(x=EVTYPE, y=CROPDMG_VAL/10^9)) +
geom_bar(stat ="identity", fill="green" ) + ggtitle("Crop Damage: Top 7 Events") +
xlab("Wheater Events") + ylab("Total Damage in billion USD") + coord_flip() +
scale_y_continuous(breaks = seq(0,15, by = 2))
grid.arrange(econplot, propplot, cropplot, ncol=1, bottom="Figure 3: Events with greatest economic impact (Total, Property, Crop")
The figure shows that flood, hurricane/typhoon, tornado and storm surge are the four most damaging events in economic terms. The economic damages are dominated by the property damages, crop damages count only for a relatively small part. For crops drought is the most harmful event.
From figures 2 and 3 we can infer that floods and tornados are by far the most harmful weather events. While tornados were responsible for most human victims, floods caused the highest economic damage. Moreover, both of these top events, were high ranked also in the other domain (tornado 3rd in the economic damage ranking, flood 4th in the public health ranking.