Synopsis

This report aims at identifying the most harmful weather events in the U.S. as measured from 1950 to 2011. It shows which events have had the greatest impact on public health and the economy. The report is based on the Storm Data published by the the National Oceanic and Atmospheric Administration (NOAA).

These data measure the number of fatalities and injuries as well as the values of damages on property and crops for each event. Specifically, the report addresses the two following questions:

It suggests that, overall, floods and tornados are the most harmful weather events in the U.S.

Data Processing

Note: We created the sub-directory “ds-project-52” for this project.

Loading the data

We are obtaining the data from the course web site Storm Data.

#Download and unzip zip file, read in the data
if (!file.exists("./ds-project-52/repdata_dat_StormData.csv")) {
    library(R.utils)
    getzip <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    ziparch <- "./ds-project-52/repdata_dat_StormData.csv.bz2"
    download.file(getzip, destfile = ziparch)
    bunzip2(ziparch, destname = "./ds-project-52/repdata_dat_StormData.csv")
}
dataraw <- read.csv("./ds-project-52/repdata_dat_StormData.csv", header=TRUE, stringsAsFactors=FALSE)

Checking out the data

str(dataraw)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
#Checking for NA
sum (is.na(dataraw))
## [1] 1745947

For the questions to answer here we need only the following variables:

  • EVTYPE = type of event according to the National Weather Service Storm Data Documentation, p. 6
  • FATALITIES = number of fatalities
  • INJURIES = number of injuries
  • PROPDMG = cost of property damage
  • PROPDMGEXP = indicator of magnitude of cost of property damage
  • CROPDMG = cost of crop damage
  • CROPDMGEXP = indicator of magnitude of cost of crop damage

We don’t want to loose completely the time and space dimensions. So we keep also these variables:

  • BGN_DATE = date of event
  • STATE = USPS code of state

We reduce the data set to these variables.

datared<-dataraw[, c("BGN_DATE", "STATE", "EVTYPE","FATALITIES","INJURIES","PROPDMG", "PROPDMGEXP","CROPDMG","CROPDMGEXP")]

In order to answer the questions of this report the variable EVTYPE is crucial. So let’s have a closer look at it. First, we look at how many types there are.

types_num0 <- length(sort(unique(datared$EVTYPE)))

There are 985 types, which seems far too many.

A closer look at all of these types shows that they are rather incoherently denominated. I.e. leading spaces, upper- and lowercases mixed (“Beach Erosion”, “BEACH EROSION”), typos, (“BEACH EROSIN”), abreviations (“TSTM”, “WND”), singular/plural (“THUNDERSTRORM WIND”, “THUNDERSTORM WINDS”, spaces (“WILD FIRES”, WILDFIRES). Displaying all types would unduly inflate this report, so we will just list the first 40 types for illustration.

evtype <- sort(unique(datared$EVTYPE))
head(evtype, 40)
##  [1] "   HIGH SURF ADVISORY"          " COASTAL FLOOD"                
##  [3] " FLASH FLOOD"                   " LIGHTNING"                    
##  [5] " TSTM WIND"                     " TSTM WIND (G45)"              
##  [7] " WATERSPOUT"                    " WIND"                         
##  [9] "?"                              "ABNORMAL WARMTH"               
## [11] "ABNORMALLY DRY"                 "ABNORMALLY WET"                
## [13] "ACCUMULATED SNOWFALL"           "AGRICULTURAL FREEZE"           
## [15] "APACHE COUNTY"                  "ASTRONOMICAL HIGH TIDE"        
## [17] "ASTRONOMICAL LOW TIDE"          "AVALANCE"                      
## [19] "AVALANCHE"                      "BEACH EROSIN"                  
## [21] "Beach Erosion"                  "BEACH EROSION"                 
## [23] "BEACH EROSION/COASTAL FLOOD"    "BEACH FLOOD"                   
## [25] "BELOW NORMAL PRECIPITATION"     "BITTER WIND CHILL"             
## [27] "BITTER WIND CHILL TEMPERATURES" "Black Ice"                     
## [29] "BLACK ICE"                      "BLIZZARD"                      
## [31] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"       
## [33] "Blizzard Summary"               "BLIZZARD WEATHER"              
## [35] "BLIZZARD/FREEZING RAIN"         "BLIZZARD/HEAVY SNOW"           
## [37] "BLIZZARD/HIGH WIND"             "BLIZZARD/WINTER STORM"         
## [39] "BLOW-OUT TIDE"                  "BLOW-OUT TIDES"

We will try to eliminate the most obvious redundancies by hand.

datared$EVTYPE <- trimws(datared$EVTYPE)
datared$EVTYPE <- toupper(dataraw$EVTYPE) 
datared$EVTYPE <- gsub("WND", "WIND", datared$EVTYPE)
datared$EVTYPE <- gsub("WINDS", "WIND", datared$EVTYPE)
datared$EVTYPE <- gsub("WIND CH", "WIND CHIL", datared$EVTYPE)
datared$EVTYPE <- gsub("TSTM", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("TIDES", "TIDE", datared$EVTYPE)
datared$EVTYPE <- gsub("AVALANCE", "AVALANCHE", datared$EVTYPE)
datared$EVTYPE <- gsub("EROSIN", "EROSION", datared$EVTYPE)
datared$EVTYPE <- gsub(" & ", "/", datared$EVTYPE)
datared$EVTYPE <- gsub("FIRES", "FIRE", datared$EVTYPE)
datared$EVTYPE <- gsub("WILDFIRES", "WILD FIRE", datared$EVTYPE)
datared$EVTYPE <- gsub("FLOODING", "FLOOD", datared$EVTYPE)
datared$EVTYPE <- gsub("CSTL", "COASTAL", datared$EVTYPE)
datared$EVTYPE <- gsub("COASTALFLOOD", "COASTAL FLOOD", datared$EVTYPE)
datared$EVTYPE <- gsub("COASTALSTORM", "COASTAL STORM", datared$EVTYPE)
datared$EVTYPE <- gsub("FUNNELS", "FUNNEL", datared$EVTYPE)
datared$EVTYPE <- gsub("TEMPERATURES", "TEMPERATURE", datared$EVTYPE)
datared$EVTYPE <- gsub("MUDSLIDES", "MUDSLIDE", datared$EVTYPE)
datared$EVTYPE <- gsub("THUNDERSTORMW", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("THUNERSTORM", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("THUNDERSTROM", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("THUNDERTSORM", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("THUNDERSTORMS", "THUNDERSTORM", datared$EVTYPE)
datared$EVTYPE <- gsub("HUNDERSTORMWINDS", "THUNDERSTORM WIND", datared$EVTYPE)
datared$EVTYPE <- gsub("TORNADOES", "TORNADO", datared$EVTYPE)
datared$EVTYPE <- gsub("TORNADOS", "TORNADO", datared$EVTYPE)
datared$EVTYPE <- gsub("WATERSPROUT", "WATER SPROUT", datared$EVTYPE)

types_num1 <- length(sort(unique(datared$EVTYPE)))

So, we could reduce the types from 985 to 827. This is still far too many compared to the 48 types defined in the Storm Data Documentation. However, it is beyond the scope of this analysis to do more cleaning. As we will focus on the few types with highest impact, we asume that the effects of these incoherencies will be limited. Still, when interpreting the results, we have to bear in mind that there is a probleme here.

Eventually, we get the years out of the BGN_DATE variable.

#datared$BGN_DATE <- as.Date(data1$BGN_DATE, format = "%d/%m/%Y")
datared$YEAR <- as.numeric(format(as.Date(datared$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))

Checking for NA

sum(is.na(datared))
## [1] 0

We don’t have to worry about missing values.

What are the main characteristics of the data?

str(datared)
## 'data.frame':    902297 obs. of  10 variables:
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ YEAR      : num  1950 1950 1951 1951 1951 ...
head(datared)
##             BGN_DATE STATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1  4/18/1950 0:00:00    AL TORNADO          0       15    25.0          K
## 2  4/18/1950 0:00:00    AL TORNADO          0        0     2.5          K
## 3  2/20/1951 0:00:00    AL TORNADO          0        2    25.0          K
## 4   6/8/1951 0:00:00    AL TORNADO          0        2     2.5          K
## 5 11/15/1951 0:00:00    AL TORNADO          0        2     2.5          K
## 6 11/15/1951 0:00:00    AL TORNADO          0        6     2.5          K
##   CROPDMG CROPDMGEXP YEAR
## 1       0            1950
## 2       0            1950
## 3       0            1951
## 4       0            1951
## 5       0            1951
## 6       0            1951

Let’s see, how the the measurments are spread in time and space.

#Create a histogram for events by year.

#Load packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.5
yplot <- ggplot(data=datared, aes(datared$YEAR)) + geom_histogram(breaks=seq(1950, 2010, by = 0.5), col="red", fill="orange", alpha = .5) + labs(title="Weather Events in Time") + labs(x="YEAR", y="Measured Weather Events")

#Create a histogram for events by most affected states.
library(plyr)
## Warning: package 'plyr' was built under R version 3.2.3
states <- count(datared, vars = "STATE")

states <- states[order(states$freq, decreasing = TRUE), ]
sttop <- states[1:40, ]
sttop$STATE <- factor(sttop$STATE, levels = sttop$STATE)
sttop$STATE <- factor(sttop$STATE, levels = rev(levels(sttop$STATE)))

stplot <- ggplot(sttop, aes(x=factor(STATE), y=freq))+ geom_bar(stat ="identity", fill="green") + ggtitle("Weather Events by State") + xlab("States") + ylab("Number of Weather Events") + scale_y_continuous(breaks = seq(0,80000, by = 10000))

grid.arrange(yplot, stplot, ncol=1, bottom="Figure 1: Extreme Weather Events in Time and Space")

From figure 1 we see that the majority of the measurments falls within the period of 1995/2011 and that Texas is the state by far most affected by extreme weather events, followed by Kansas and Oklahoma.

Results

Impact of Types of Events on Population Health

Question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The variables FATALITIES and INJURIES are relevant for answering this question.

#Aggregate FATALITIES and INJURIES to variable on population health.
datared$PHEALTH <- datared$FATALITIES + datared$INJURIES
phealth <- aggregate (PHEALTH~EVTYPE, datared, sum)
phealth <- phealth[order(phealth$PHEALTH, decreasing=TRUE), ]

unname(unlist(subset(phealth, PHEALTH >= 20, select = PHEALTH)))
##  [1] 96979 10055  8428  7267  6046  3037  2782  2064  1722  1527  1376
## [12]  1339  1148   986   906   796   600   557   551   501   462   431
## [23]   412   398   395   393   360   349   260   251   223   162   153
## [34]   149   143   111   107   107   107   101   100    90    90    86
## [45]    78    68    53    52    51    51    48    45    45    40    37
## [56]    36    36    36    36    32    32    31    30    29    28    28
## [67]    27    25    25    23    23    22

We see that the impact decreases sharply within the first few cases. In order to detect the most harmful types we can concentrate on the first 10 cases without loosing too much information.

fatality <- aggregate (FATALITIES~EVTYPE, datared, sum)
fatality <- fatality[order(fatality$FATALITIES, decreasing=TRUE), ]

injury <- aggregate (INJURIES~EVTYPE, datared, sum)
injury <- injury[order(injury$INJURIES, decreasing=TRUE), ]

#We want to use ggplot2 for plotting. This requires the EVTYPE variable to be set to factor, so that ggplot arranges the events in decreasing order.  

phtop <- phealth[1:7, ]
phtop$EVTYPE <- factor(phtop$EVTYPE, levels = phtop$EVTYPE)
phtop$EVTYPE <- factor(phtop$EVTYPE, levels = rev(levels(phtop$EVTYPE)))

fataltop <- fatality[1:7, ]
fataltop$EVTYPE <-  factor(fataltop$EVTYPE , levels = fataltop$EVTYPE)
fataltop$EVTYPE <-  factor(fataltop$EVTYPE, levels = rev(levels(fataltop$EVTYPE)))

injtop <- injury[1:7, ]
injtop$EVTYPE <-  factor(injtop$EVTYPE , levels = injtop$EVTYPE)
injtop$EVTYPE <-  factor(injtop$EVTYPE, levels = rev(levels(injtop$EVTYPE)))

Creating figure of most harmful events for population health.

phplot <- ggplot(phtop, aes(x=factor(EVTYPE), y=PHEALTH)) + geom_bar(stat ="identity", fill="blue") + ggtitle("Public Health Damage: Top 7 Events") + xlab("Wheater Events") + ylab("Total public health damage (number of cases") + coord_flip()  + scale_y_continuous(breaks = seq(0,90000, by = 10000)) 

fatalplot <- ggplot(fataltop, aes(x=factor(EVTYPE), y=FATALITIES))+ geom_bar(stat ="identity", fill="black") + ggtitle("Fatalities: Top 7 Events") + xlab("Wheater Events") + ylab("Total number of Fatalities") + coord_flip()  + scale_y_continuous(breaks = seq(0,5500, by = 500))

injplot <- ggplot(injtop, aes(x=factor(EVTYPE), y=INJURIES))+ geom_bar(stat ="identity", fill="green") + ggtitle("Injuries: Top 7 Events") + xlab("Wheater Events") + ylab("Total number of Injuries") + coord_flip()  +scale_y_continuous(breaks = seq(0,90000, by = 10000))

grid.arrange(phplot, injplot, fatalplot, ncol=1, bottom="Figure 2: Events most harmful with respect to population health (fatalities and injuries")

Economic Concequences of Type of Events

Question: Across the United States, which types of events have the greatest economic consequences?

We will calculate the economic damages by aggregating property and crop damages. The values of these damages are each represented by two variabels, one recording a number rounded to three digits, the other indicating the magnitude of the values with letters (e.g. “K” for thousands, “M” for millions, and “B” for billions. In order to do any calculation, we will have to convert these letters into their numeric value. The product of the number and the converted value variables will be used as a new new variable for calculating economic damages.

#Transform the magnitude information into numeric.

#Create a vector with the magnitude indications.
dimprop <- unique(datared$PROPDMGEXP)
dimcrop <- unique(datared$CROPDMGEXP)
dim <- sort(unique(c(dimprop, dimcrop)))
dim
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "k"
## [18] "K" "m" "M"
#Create a numerical vector with the respective magnitude values. 
dimnum <- c(1, 0, 0, 0, 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7, 10^8, 10^9, 10^2, 10^2, 10^3, 10^3, 10^6, 10^6)

#Create exponential variables for PROPDMG and CROPDMG
for (i in 1:length(dim)) {
    datared$PROPEXP[datared$PROPDMGEXP ==  dim[i]]  <- dimnum[i]
    datared$CROPEXP[datared$CROPDMGEXP == dim[i]] <- dimnum[i]
}

datared$PROPDMG_VAL <- datared$PROPDMG * datared$PROPEXP
datared$CROPDMG_VAL <- datared$CROPDMG * datared$CROPEXP
datared$ECONDMG <- datared$PROPDMG_VAL + datared$CROPDMG_VAL

Now, we can calculate the economic damage (property and crop damages) by type of event based on the created variable ECONDMG.

econdmg <- aggregate(ECONDMG~EVTYPE, datared, sum)
econdmg <- econdmg[order(econdmg$ECONDMG, decreasing = TRUE), ]
unname(unlist(subset(econdmg, ECONDMG >= 10^8, select = ECONDMG)))
##  [1] 150437379757  71913712800  57362334887  43323541000  18761221986
##  [6]  18566870733  15018672000  14610229010  11072776564  10292723500
## [11]   8967041360   8382236550   6715441251   6557661893   5161586800
## [16]   4642038000   3191846000   3108658330   2500000000   1602500000
## [21]   1427647890   1380710400   1224560000   1104666000   1067412240
## [26]    942471520    771273950    624100000    601055000    500155700
## [31]    456930000    403258500    394110000    392592060    344613000
## [36]    304230000    274930006    269095007    247627740    241000000
## [41]    144082000    142000000    117500000    112800000    110000000
## [46]    109427250    105000000

We see that we can focus on the first 20 cases.

#Top 20 total economic damage events.
econtop <- econdmg[1:7, ]

#Top 20 property damage events.
propdmg <- aggregate (PROPDMG_VAL~EVTYPE, datared, sum)
propdmg <- propdmg[order(propdmg$PROPDMG_VAL, decreasing = TRUE), ]
proptop <- propdmg[1:7, ]

#Top 20 crop damage events.
cropdmg <- aggregate (CROPDMG_VAL~EVTYPE, datared, sum)
cropdmg <- cropdmg[order(cropdmg$CROPDMG_VAL, decreasing = TRUE), ]
croptop <- cropdmg[1:7, ]

#Set EVTYPE variable to factor, so that ggplot arranges the events in decreasing order. 
econtop$EVTYPE <-  factor(econtop$EVTYPE , levels = econtop$EVTYPE)
econtop$EVTYPE <-  factor(econtop$EVTYPE, levels = rev(levels(econtop$EVTYPE)))

proptop$EVTYPE <-  factor(proptop$EVTYPE , levels = proptop$EVTYPE)
proptop$EVTYPE <-  factor(proptop$EVTYPE, levels = rev(levels(proptop$EVTYPE)))

croptop$EVTYPE <-  factor(croptop$EVTYPE , levels = croptop$EVTYPE)
croptop$EVTYPE <-  factor(croptop$EVTYPE, levels = rev(levels(croptop$EVTYPE)))

#Create figure 3 with panel plot for total economic, property and crop damages.

econplot <- ggplot(data=econtop, aes(x=EVTYPE, y=ECONDMG/10^9)) + 
    geom_bar(stat ="identity", fill = "blue") + ggtitle("Total Economic Damage: Top 7 Events") +
    xlab("Wheater Events") + ylab("Total Damage in billion USD") + coord_flip() +
    scale_y_continuous(breaks = seq(0,150, by = 20))

propplot <- ggplot(data=proptop, aes(x=EVTYPE, y=PROPDMG_VAL/10^9)) +
    geom_bar(stat ="identity", fill = "red") + ggtitle("Property Damage: Top 7 Events") +
    xlab("Wheater Events") + ylab("Total Damage in billion USD") + coord_flip() +
    scale_y_continuous(breaks = seq(0,150, by = 20))

cropplot <- ggplot(data=croptop, aes(x=EVTYPE, y=CROPDMG_VAL/10^9)) + 
    geom_bar(stat ="identity", fill="green" ) + ggtitle("Crop Damage: Top 7 Events") + 
    xlab("Wheater Events") + ylab("Total Damage in billion USD") + coord_flip() +
    scale_y_continuous(breaks = seq(0,15, by = 2))

grid.arrange(econplot, propplot, cropplot, ncol=1, bottom="Figure 3: Events with greatest economic impact (Total, Property, Crop")

The figure shows that flood, hurricane/typhoon, tornado and storm surge are the four most damaging events in economic terms. The economic damages are dominated by the property damages, crop damages count only for a relatively small part. For crops drought is the most harmful event.

Conclusion

From figures 2 and 3 we can infer that floods and tornados are by far the most harmful weather events. While tornados were responsible for most human victims, floods caused the highest economic damage. Moreover, both of these top events, were high ranked also in the other domain (tornado 3rd in the economic damage ranking, flood 4th in the public health ranking.