1 Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The goal of the project is to identify the most harmful events with respect to population health as well as the events with the greatest economic consequences. Substantial preprocessing is required to clean the data and transform them into a format suitable for an accurate analysis. We identify possible misspellings by calculating the Damerau-Levenshtein distance between the unique event names in the provided dataset and the names of the official weather events given in the National Weather Service Storm Data Documentation.

The results of our analysis show that the most harmful events with respect to population health are tornado, thunderstorm wind, excessive heat and flood. We further find that the events with the greatest economic consequences are flood, heavy rain and tornado.

All computational times reported below were measured on a laptop with an Intel Core i7-5500U CPU and 8 GB of DDR3-1600 RAM.

2 Overview of the Data Set

The data for this project come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded from here:

Storm Data

There is also some documentation of the database available, describing how some of the variables are defined:

– National Weather Service Storm Data Documentation

– National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

3 Data Processing

First we load the packages that we will need:

library(data.table)
library(knitr)
library(lubridate)
library(stringdist)
library(ggplot2)
library(gridExtra)
library(R.utils)

3.1 Loading the Data

We set the working directory:

setwd("D:/Coursera_Reproducible_Research/Project_Assignment_2")

Then we download the archive and decompress it:

if(!file.exists("./Storm_Data.bz2")){
    fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    # mode = "wb" ensures the binary archive is not corrupted on Windows:
    download.file(fileUrl, destfile = "./Storm_Data.bz2", mode = "wb")
    bunzip2(filename = "./Storm_Data.bz2", destname = "./Storm_Data.csv", remove = FALSE)
}

In order to decompress the “Storm_Data.bz2” file we use the R function bunzip2 from the R.utils package. We set remove=FALSE to keep the input compressed file “Storm_Data.bz2”. The default setting is remove=TRUE, which causes the input file to be removed after the decompressed output file is created.
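As a side note, base R can read bz2-compressed files directly through a connection, so the explicit decompression is mainly needed because the version of fread() used here requires a plain-text file. A sketch for comparison (much slower than fread() on a file of this size):

# read.csv() decompresses transparently through a bzfile() connection:
storm.df <- read.csv("./Storm_Data.bz2", stringsAsFactors = FALSE)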

Finally we read the data into the storm.db data table:

t1 <- Sys.time()

storm.db <- fread(input = "./Storm_Data.csv", sep = "auto", header = TRUE)
## Read 22.7% of 967216 rows
## Read 42.4% of 967216 rows
## Read 54.8% of 967216 rows
## Read 73.4% of 967216 rows
## Read 82.7% of 967216 rows
## Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:08
t2 <- Sys.time()
t2-t1
## Time difference of 7.507717 secs

We use the package data.table and its function fread() to load the storm data. The data.table class inherits from data.frame but is implemented in C, which makes it much faster, especially for big data sets such as this one. As the R documentation states, “data.table inherits from data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development.” Hence, all functions that accept data.frame objects also work on data.table objects, with the advantage that operations like subsetting, grouping and updating are much faster on data.table objects. The reported time needed to read the data is 7.508 s, far faster than reading the same file with read.csv() or read.table().
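To put the 7.5 s in perspective, one can time the base-R reader on the same decompressed file. A rough sketch (timings vary with hardware and disk cache):

# Rough timing comparison; read.csv() is typically an order of magnitude slower:
system.time(storm.df <- read.csv("./Storm_Data.csv"))
system.time(storm.dt <- fread("./Storm_Data.csv"))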

3.2 Structure and Summary of the Data

We start by taking a look at the structure of the storm.db data table. We see that there are 902297 observations of 37 variables:

dim(storm.db)
## [1] 902297     37

The variables are of type “character”, “numeric” and “logical”:

str(storm.db)
Classes 'data.table' and 'data.frame':  902297 obs. of  37 variables:
 $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
 $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
 $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
 $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
 $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
 $ STATE     : chr  "AL" "AL" "AL" "AL" ...
 $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
 $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
 $ BGN_AZI   : chr  "" "" "" "" ...
 $ BGN_LOCATI: chr  "" "" "" "" ...
 $ END_DATE  : chr  "" "" "" "" ...
 $ END_TIME  : chr  "" "" "" "" ...
 $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
 $ COUNTYENDN: logi  NA NA NA NA NA NA ...
 $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
 $ END_AZI   : chr  "" "" "" "" ...
 $ END_LOCATI: chr  "" "" "" "" ...
 $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
 $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
 $ F         : chr  "3" "2" "2" "2" ...
 $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
 $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
 $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
 $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
 $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ CROPDMGEXP: chr  "" "" "" "" ...
 $ WFO       : chr  "" "" "" "" ...
 $ STATEOFFIC: chr  "" "" "" "" ...
 $ ZONENAMES : chr  "" "" "" "" ...
 $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
 $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
 $ LATITUDE_E: num  3051 0 0 0 0 ...
 $ LONGITUDE_: num  8806 0 0 0 0 ...
 $ REMARKS   : chr  "" "" "" "" ...
 $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr> 

and their names are:

names(storm.db)
 [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
 [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
[11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
[16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
[21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
[26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
[31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
[36] "REMARKS"    "REFNUM"    

For our analysis we will need only the variables “EVTYPE”, “FATALITIES”, “INJURIES”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG” and “CROPDMGEXP”. Below we show the first six rows of the storm.db data table:

head(storm.db)
   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
1:       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
2:       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
3:       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
4:       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
5:       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
6:       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
1: TORNADO         0                                               0
2: TORNADO         0                                               0
3: TORNADO         0                                               0
4: TORNADO         0                                               0
5: TORNADO         0                                               0
6: TORNADO         0                                               0
   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
1:         NA         0                      14.0   100 3   0          0
2:         NA         0                       2.0   150 2   0          0
3:         NA         0                       0.1   123 2   0          0
4:         NA         0                       0.0   100 2   0          0
5:         NA         0                       0.0   150 2   0          0
6:         NA         0                       1.5   177 2   0          0
   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
1:       15    25.0          K       0                                    
2:        0     2.5          K       0                                    
3:        2    25.0          K       0                                    
4:        2     2.5          K       0                                    
5:        2     2.5          K       0                                    
6:        6     2.5          K       0                                    
   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
1:     3040      8812       3051       8806              1
2:     3042      8755          0          0              2
3:     3340      8742          0          0              3
4:     3458      8626          0          0              4
5:     3412      8642          0          0              5
6:     3450      8748          0          0              6
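Since only seven of the 37 variables are used in what follows, one could optionally work with a slimmer table. A sketch (the rest of the report keeps the full storm.db):

# Keep only the columns needed for the analysis; with = FALSE tells
# data.table to treat keep.cols as a character vector of column names:
keep.cols <- c("EVTYPE", "FATALITIES", "INJURIES",
               "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
storm.small <- storm.db[, keep.cols, with = FALSE]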

3.3 Cleaning the “EVTYPE” Variable

We start by determining the number of unique weather events stored in the “EVTYPE” variable:

length(unique(storm.db$EVTYPE))
## [1] 985

We see that there are 985 unique events, but according to the provided documentation there should be only 48 event types. Looking more closely at the storm data set we notice many typos. There are also events with the same name spelled in both upper and lower case. We further notice records whose event names combine two or more existing event types, as well as many new weather events that are not among the officially documented 48. Therefore, we preprocess the “EVTYPE” variable to reduce the number of unique events. First, we transform all event names to upper case:

storm.db$EVTYPE <- toupper(storm.db$EVTYPE)

length(unique(storm.db$EVTYPE))
## [1] 898

We observe that after turning all event names to upper case, the number of unique events is reduced by 87, from 985 to 898.

Next we display the 10 weather events with the greatest number of occurrences (i.e. the highest frequency):

df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event", "Count")
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[c(1:10), ], row.names = TRUE, caption = "Table 1")
Table 1
Event Count
1 HAIL 288661
2 TSTM WIND 219942
3 THUNDERSTORM WIND 82564
4 TORNADO 60652
5 FLASH FLOOD 54277
6 FLOOD 25327
7 THUNDERSTORM WINDS 20843
8 HIGH WIND 20214
9 LIGHTNING 15754
10 HEAVY SNOW 15708

From the table above we see that the weather event with the highest frequency is “HAIL”, followed by “TSTM WIND”, “THUNDERSTORM WIND”, “TORNADO” and “FLASH FLOOD” in fifth place. We notice that the event “THUNDERSTORM WIND” is also recorded as “TSTM WIND” and “THUNDERSTORM WINDS”. Hence, we rename all occurrences of “TSTM WIND” and “THUNDERSTORM WINDS” to “THUNDERSTORM WIND”:

t1 <- Sys.time()
storm.db[storm.db$EVTYPE == "TSTM WIND" | storm.db$EVTYPE == "THUNDERSTORM WINDS", ]$EVTYPE <- "THUNDERSTORM WIND"
t2 <- Sys.time()
t2-t1
## Time difference of 0.3711889 secs
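As a side note, the same renaming can be done in place with data.table’s := operator, which avoids the copy implied by the base-R assignment above. A sketch:

# In-place update of the matching rows (equivalent result):
storm.db[EVTYPE %in% c("TSTM WIND", "THUNDERSTORM WINDS"),
         EVTYPE := "THUNDERSTORM WIND"]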

After this transformation we display again the top 10 (by frequency) weather events:

df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event", "Count")
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[c(1:10), ], row.names = TRUE, caption = "Table 2")
Table 2
Event Count
1 THUNDERSTORM WIND 323349
2 HAIL 288661
3 TORNADO 60652
4 FLASH FLOOD 54277
5 FLOOD 25327
6 HIGH WIND 20214
7 LIGHTNING 15754
8 HEAVY SNOW 15708
9 HEAVY RAIN 11742
10 WINTER STORM 11433

Now “THUNDERSTORM WIND” is in first place, followed by “HAIL”, “TORNADO”, “FLASH FLOOD” and “FLOOD” in fifth place. Also, “HEAVY RAIN” and “WINTER STORM” appear in the top ten.

3.3.1 Measuring the String Distance

We will try to further reduce the number of unique weather events by identifying possible misspellings. We calculate the Damerau–Levenshtein distance between the unique event names in the storm.db data table and the names of the official weather events given in the National Weather Service Storm Data Documentation. According to Wikipedia, the “Damerau–Levenshtein distance is a distance (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.” In R we compute the aforementioned string distance with the function stringdistmatrix() from the stringdist package.
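To get a concrete feel for the metric, consider the hypothetical misspelling “TORNDAO”: one transposition of the adjacent characters “D” and “A” turns it into “TORNADO”:

# Damerau-Levenshtein counts the adjacent transposition as one operation:
stringdist("TORNDAO", "TORNADO", method = "dl")  # 1
# Plain Levenshtein (method "lv") has no transpositions, so two substitutions are needed:
stringdist("TORNDAO", "TORNADO", method = "lv")  # 2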

Let us denote by storm.db.name a unique value of the “EVTYPE” variable from the storm.db data table, and by official.name – any of the 48 official event names. We apply the following strategy for cleaning the “EVTYPE” variable:

  1. We calculate the Damerau–Levenshtein string distance between storm.db.name and official.name: dist(storm.db.name, official.name).

  2. We compute all pairwise string distances between any two distinct official event names and keep the pairs for which \(1\leq\) dist(official.name.1, official.name.2) \(\leq 3\). We denote by list(official.self.dist) the list of the calculated pairs.

  3. If the string distance dist(storm.db.name, official.name) is greater than zero and less than or equal to 3, and the pair (storm.db.name, official.name) does not belong to list(official.self.dist), we replace storm.db.name with official.name. (The reason for this exclusion is illustrated right after this list.)
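Step 2 is needed because some official event names are themselves only a few edits apart, and without the exclusion in step 3 valid records would be renamed into a different official event. For example:

# "HAIL" and "HEAT" are both official event types, yet only three edits apart;
# without the exclusion, every "HAIL" record would be "corrected" to "HEAT":
stringdist("HAIL", "HEAT", method = "dl")  # 3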

We start by creating the character vector event.type, which contains all unique weather event names from the storm.db data table:

event.type <- unique(storm.db$EVTYPE)

We see that there are 896 unique elements of the vector event.type:

length(event.type)
## [1] 896

Then we construct the character vector events which is composed of all 48 official weather events provided in the National Weather Service documentation:

events <- c("ASTRONOMICAL LOW TIDE", "AVALANCHE", "BLIZZARD", "COASTAL FLOOD",
            "COLD/WIND CHILL", "DEBRIS FLOW", "DENSE FOG", "DENSE SMOKE",
            "DROUGHT", "DUST DEVIL", "DUST STORM", "EXCESSIVE HEAT", 
            "EXTREME COLD/WIND CHILL", "FLASH FLOOD", "FLOOD", "FROST/FREEZE",
            "FUNNEL CLOUD", "FREEZING FOG", "HAIL", "HEAT",
            "HEAVY RAIN",
            "HEAVY SNOW", "HIGH SURF", "HIGH WIND", "HURRICANE (TYPHOON)",
            "ICE STORM", "LAKE-EFFECT SNOW", "LAKESHORE FLOOD", "LIGHTNING",
            "MARINE HAIL", "MARINE HIGH WIND", "MARINE STRONG WIND", "MARINE THUNDERSTORM WIND",
            "RIP CURRENT", "SEICHE", "SLEET", "STORM SURGE/TIDE", "STRONG WIND",
            "THUNDERSTORM WIND", "TORNADO", "TROPICAL DEPRESSION", "TROPICAL STORM",
            "TSUNAMI", "VOLCANIC ASH", "WATERSPOUT", "WILDFIRE", "WINTER STORM",
            "WINTER WEATHER")

Step 1:

Now we compute the pairwise Damerau-Levenshtein string distances between the elements of the two character vectors. We store the resulting matrix in the strdist object:

strdist <- stringdistmatrix(event.type, events, method = "dl")

Below we create the matrix matching.events that contains the indices of the elements of the character vectors event.type and events for which the respective string distance is greater than or equal to 1 and less than or equal to 3:

res <- list()
cnt <- 0

# Collect each index pair (i, j) whose string distance lies in [1, 3]:
for(i in 1:dim(strdist)[1]){
    for(j in 1:dim(strdist)[2]){
        if(strdist[i,j] > 0 & strdist[i,j] <= 3){
            cnt <- cnt + 1
            res[[cnt]] <- c(i,j)
        }
    }
}

# Bind the collected pairs into a matrix with one column per pair:
match.events <- do.call(cbind, res)


matching.events <- as.matrix(match.events)
matching.events <- t(matching.events)

The matrix matching.events has 101 rows and 2 columns. In the first column are stored the indices i of the elements of the character vector event.type and in the second column – the corresponding indices j of the elements of the vector events. For each pair of indices (i,j) we have that \(1\leq\)dist(event.type[i], events[j])\(\leq 3\). Here are the first few rows of the matrix matching.events:

head(matching.events)
##      [,1] [,2]
## [1,]    3   20
## [2,]   17   39
## [3,]   19   14
## [4,]   20   24
## [5,]   22   40
## [6,]   25   19
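The nested loops above can also be replaced by a single vectorized call. A sketch (which() traverses the matrix column by column, so the rows come back in a different order, but they describe the same 101 pairs):

# Vectorized equivalent of the double loop:
matching.events.alt <- which(strdist > 0 & strdist <= 3, arr.ind = TRUE)
dim(matching.events.alt)  # 101 rows: column 1 indexes event.type, column 2 indexes events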

For convenience we transform the matrix matching.events into a data frame:

matching.events <- as.data.frame(matching.events)

colnames(matching.events) <- c("index.db.name", "index.correct.name")

The character vectors db.name and real.name contain the unique values of the “EVTYPE” variable and the official weather event names, respectively, for which the Damerau-Levenshtein string distance is in the interval [1,3]:

db.name <- event.type[matching.events[ ,"index.db.name"]]

real.name <- events[matching.events[ ,"index.correct.name"]]

Step 2:

Next we compute the string distances between the elements of the character vector events (which contains the 48 official weather event names):

self.dist <- stringdistmatrix(events, events, method = "dl")

Then we generate the matrix self.match that contains the indices i and j of the elements of the character vector events for which the respective string distance dist(events[i], events[j]) is greater than or equal to 1 and less than or equal to 3:

res <- list()
cnt <- 0

# Collect each pair (i, j) of official names with distance in [1, 3]:
for(i in 1:dim(self.dist)[1]){
    for(j in 1:dim(self.dist)[2]){
        if(self.dist[i,j] >= 1 & self.dist[i,j] <= 3){
            cnt <- cnt + 1
            res[[cnt]] <- c(i,j)
        }
    }
}

# Bind the collected pairs into a matrix with one column per pair:
match.events <- do.call(cbind, res)

self.match <- as.matrix(match.events)

self.match <- t(self.match)

The matrix self.match has 4 rows and 2 columns. Next we create the character vector self.pairs, whose elements are composed of the pasted strings representing the official event names with string distance between them in the interval [1,3]:

self.pairs <- vector(mode = "character")

for(i in 1:dim(self.match)[1]){
    self.pairs <- c(self.pairs,
                    paste(events[self.match[i,1]], events[self.match[i,2]]))
}


self.pairs
## [1] "HAIL HEAT"  "HEAT HAIL"  "HEAT SLEET" "SLEET HEAT"
length(self.pairs)
## [1] 4

Our next step is to create the character vector pairs composed of the 101 pasted matching pairs (“EVTYPE” name, official name) such that the string distance between them, dist(“EVTYPE” name, official name), is in the interval [1,3]:

pairs <- mapply(function(db, correct){
    
    paste(db, correct, sep = " ")
    
}, db.name, real.name)

pairs <- as.vector(x = pairs, mode = "character")
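Since paste() is vectorized over its arguments, the mapply() call above (and likewise the loop that built self.pairs) collapses to a one-liner. A sketch of the equivalent forms:

# paste() works element-wise, so these give the same character vectors:
pairs      <- paste(db.name, real.name)
self.pairs <- paste(events[self.match[, 1]], events[self.match[, 2]])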

Step 3:

The only thing left is to exclude the matching pairs that coincide with pairs of official event names (those with a self-distance between 1 and 3). We filter db.name and real.name in parallel, so that the two vectors stay aligned element by element:

db.name.out.self   <- db.name[!(pairs %in% self.pairs)]
real.name.out.self <- real.name[!(pairs %in% self.pairs)]

Finally we replace the misspelled “EVTYPE” names with the respective correct official names:

t1 <- Sys.time()
# real.name.out.self is aligned with db.name.out.self, so the positions
# returned by match() index the correct replacement names:
idx <- which(storm.db$EVTYPE %in% db.name.out.self)
storm.db[idx]$EVTYPE <- real.name.out.self[match(storm.db[idx, ]$EVTYPE, db.name.out.self)]
t2 <- Sys.time()
t2-t1
## Time difference of 0.232378 secs

We report the time needed for the described string transformation: it is only 0.232 s.
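For completeness, the same replacement can be expressed as a data.table update join. A sketch, assuming each misspelling maps to a single official name (the lookup table corrections is introduced here only for illustration):

# Rows of storm.db whose EVTYPE matches a misspelling receive the corrected
# name from the lookup table; i.fixed refers to the column of the i argument:
corrections <- data.table(EVTYPE = db.name.out.self, fixed = real.name.out.self)
storm.db[corrections, EVTYPE := i.fixed, on = "EVTYPE"]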

We see that the number of unique event types in the “EVTYPE” variable decreased from 896 to 801:

length(unique(storm.db$EVTYPE))
## [1] 801

We show the top ten events (by frequency) after the above transformations:

df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event" , "Count")
df.event$Percentage <- round(df.event$Count/sum(df.event$Count)*100, 3)
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[1:10, ], row.names = TRUE, caption = "Table 3")
Table 3
Event Count Percentage
1 THUNDERSTORM WIND 324474 35.961
2 HAIL 288742 32.001
3 TORNADO 60664 6.723
4 FLASH FLOOD 55996 6.206
5 FLOOD 25985 2.880
6 HIGH WIND 20248 2.244
7 LIGHTNING 15842 1.756
8 HEAVY SNOW 15713 1.741
9 HEAVY RAIN 11840 1.312
10 WINTER STORM 11463 1.270

We observe that there is no change in the top ten weather events compared to Table 2.

3.3.2 Further Manual Cleaning

After closely examining the storm.db data set we notice further inconsistencies in the weather event names which we clean manually:

# Rename the exact event name "MARINE TSTM WIND":
storm.db[storm.db$EVTYPE == "MARINE TSTM WIND", ]$EVTYPE <- "MARINE THUNDERSTORM WIND"

# Rename the exact event name "WILD/FOREST FIRE":
storm.db[storm.db$EVTYPE == "WILD/FOREST FIRE", ]$EVTYPE <- "WILDFIRE"

# Rename the exact event name "UNSEASONABLY WARM":
storm.db[storm.db$EVTYPE == "UNSEASONABLY WARM", ]$EVTYPE <- "EXCESSIVE HEAT"

# Rename the exact event name "EXCESSIVE SNOW":
storm.db[storm.db$EVTYPE == "EXCESSIVE SNOW", ]$EVTYPE <- "HEAVY SNOW"

# Rename names that contain the string "HURRICANE":
storm.db[grep("HURRICANE", storm.db$EVTYPE), ]$EVTYPE <- "HURRICANE (TYPHOON)"

# Rename names that contain the string "HIGH TEMPERATURE":
storm.db[grep("HIGH TEMPERATURE", storm.db$EVTYPE), ]$EVTYPE <- "HEAT"

# Rename names that contain the string "HAIL STORM":
storm.db[grep("HAIL STORM", storm.db$EVTYPE), ]$EVTYPE <- "HAIL"

# Rename names that contain "EXTREME COLD", "RECORD COLD" or "COLD RECORD":
storm.db[grep("EXTREME COLD|RECORD COLD|COLD RECORD", storm.db$EVTYPE), ]$EVTYPE <- "EXTREME COLD/WIND CHILL"

# Rename names that contain the string "DROUGHT":
storm.db[grep("DROUGHT", storm.db$EVTYPE), ]$EVTYPE <- "DROUGHT"

# Rename names that contain "EXCESSIVE HEAT", "RECORD HEAT", "HEAT RECORD" or "EXTREME HEAT":
storm.db[grep("EXCESSIVE HEAT|RECORD HEAT|HEAT RECORD|EXTREME HEAT", storm.db$EVTYPE), ]$EVTYPE <- "EXCESSIVE HEAT"

# Rename names that contain the string "COASTAL FLOOD":
storm.db[grep("COASTAL FLOOD", storm.db$EVTYPE), ]$EVTYPE <- "COASTAL FLOOD"

After the above operations, the number of unique event names is:

length(unique(storm.db$EVTYPE))
## [1] 766

More preprocessing could probably be done, but it would require considerable time and effort. The task is further complicated by the many event names in the storm data set that have no analogue in the official documentation and therefore must be kept as they are.

We show again the top ten severe weather events:

df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event", "Count")
df.event$Percentage <- round(df.event$Count/sum(df.event$Count)*100, 3)
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[1:10, ], row.names = TRUE, caption = "Table 4")
Table 4
Event Count Percentage
1 THUNDERSTORM WIND 324474 35.961
2 HAIL 288743 32.001
3 TORNADO 60664 6.723
4 FLASH FLOOD 55996 6.206
5 FLOOD 25985 2.880
6 HIGH WIND 20248 2.244
7 LIGHTNING 15842 1.756
8 HEAVY SNOW 15738 1.744
9 MARINE THUNDERSTORM WIND 11987 1.328
10 HEAVY RAIN 11840 1.312

We see that compared to Table 3, “WINTER STORM” drops out, whereas “MARINE THUNDERSTORM WIND” makes it into the top ten.

4 Results

4.1 Most harmful events with respect to population health

For the analysis of the most harmful events with respect to population health we will use the variables “FATALITIES”, “INJURIES” and “EVTYPE”.

Below we create the data frame df.fatality which contains the total number of fatalities grouped by event type (the variable “EVTYPE”) and sorted in descending order:

# Fatalities:

df.fatality <- aggregate(storm.db$FATALITIES, by = list(storm.db$EVTYPE), FUN = sum)
colnames(df.fatality) <- c("Event", "Fatalities") 
df.fatality$Percentage <- round(df.fatality$Fatalities/sum(df.fatality$Fatalities)*100, 3)
df.fatality <- df.fatality[order(-df.fatality$Fatalities), ]
rownames(df.fatality) <- 1:nrow(df.fatality)

Then we display the top 10 severe weather events in terms of number of fatalities:

kable(df.fatality[1:10, ], row.names = TRUE, caption = "Table 5")
Table 5
Event Fatalities Percentage
1 TORNADO 5633 37.194
2 EXCESSIVE HEAT 2029 13.397
3 FLASH FLOOD 1016 6.708
4 HEAT 951 6.279
5 THUNDERSTORM WIND 928 6.127
6 LIGHTNING 854 5.639
7 FLOOD 534 3.526
8 RIP CURRENT 368 2.430
9 EXTREME COLD/WIND CHILL 288 1.902
10 HIGH WIND 249 1.644

From the table we see that the top 3 events with the greatest number of fatalities are tornado, excessive heat and flash flood.

Next we create the data frame df.injury which contains the total number of injuries grouped by event type (the variable “EVTYPE”) and sorted in descending order:

# Injuries:

df.injury <- aggregate(storm.db$INJURIES, by = list(storm.db$EVTYPE), FUN = sum)
colnames(df.injury) <- c("Event", "Injuries") 
df.injury$Percentage <- round(df.injury$Injuries/sum(df.injury$Injuries)*100, 3)
df.injury <- df.injury[order(-df.injury$Injuries), ]
rownames(df.injury) <- 1:nrow(df.injury)

We display the top 10 severe weather events in terms of number of injured people:

kable(df.injury[1:10, ], row.names = TRUE, caption = "Table 6")
Table 6
Event Injuries Percentage
1 TORNADO 91346 65.002
2 THUNDERSTORM WIND 9679 6.888
3 FLOOD 7523 5.353
4 EXCESSIVE HEAT 6747 4.801
5 LIGHTNING 5278 3.756
6 HEAT 2277 1.620
7 FLASH FLOOD 2080 1.480
8 ICE STORM 1985 1.413
9 HEAVY RAIN 1534 1.092
10 WILDFIRE 1473 1.048

From the table we see that the top 3 events with the greatest number of injuries are tornado, thunderstorm wind and flood.

In the next table we show the ten most harmful events with respect to number of fatalities, together with the respective number of injuries:

# Both fatalities and injuries:

df.health <- merge(df.fatality, df.injury, by = "Event")
df.health <- df.health[order(-df.health$Fatalities), ]
rownames(df.health) <- 1:nrow(df.health)
colnames(df.health) <- c("Event", "Fatalities", "Percent.Fatalities", "Injuries", "Percent.Injuries")
kable(df.health[1:10, ], row.names = TRUE, caption = "Table 7")
Table 7
Event Fatalities Percent.Fatalities Injuries Percent.Injuries
1 TORNADO 5633 37.194 91346 65.002
2 EXCESSIVE HEAT 2029 13.397 6747 4.801
3 FLASH FLOOD 1016 6.708 2080 1.480
4 HEAT 951 6.279 2277 1.620
5 THUNDERSTORM WIND 928 6.127 9679 6.888
6 LIGHTNING 854 5.639 5278 3.756
7 FLOOD 534 3.526 7523 5.353
8 RIP CURRENT 368 2.430 232 0.165
9 EXTREME COLD/WIND CHILL 288 1.902 255 0.181
10 HIGH WIND 249 1.644 1137 0.809

The next table provides the event types sorted in descending order by frequency together with the respective number of fatalities and injuries:

df.both <- merge(df.event, df.health, by = "Event")
df.both <- df.both[order(-df.both$Count), ]
rownames(df.both) <- 1:nrow(df.both)
colnames(df.both) <- c("Event", "Count.Event", "Percentage", "Fatalities", "Percent.Fatalities", "Injuries", "Percent.Injuries")
kable(df.both[1:15, ], row.names = TRUE, caption = "Table 8")
Table 8
Event Count.Event Percentage Fatalities Percent.Fatalities Injuries Percent.Injuries
1 THUNDERSTORM WIND 324474 35.961 928 6.127 9679 6.888
2 HAIL 288743 32.001 16 0.106 1368 0.973
3 TORNADO 60664 6.723 5633 37.194 91346 65.002
4 FLASH FLOOD 55996 6.206 1016 6.708 2080 1.480
5 FLOOD 25985 2.880 534 3.526 7523 5.353
6 HIGH WIND 20248 2.244 249 1.644 1137 0.809
7 LIGHTNING 15842 1.756 854 5.639 5278 3.756
8 HEAVY SNOW 15738 1.744 127 0.839 1023 0.728
9 MARINE THUNDERSTORM WIND 11987 1.328 19 0.125 34 0.024
10 HEAVY RAIN 11840 1.312 167 1.103 1534 1.092
11 WINTER STORM 11463 1.270 206 1.360 1321 0.940
12 WINTER WEATHER 7045 0.781 33 0.218 398 0.283
13 FUNNEL CLOUD 6849 0.759 4 0.026 153 0.109
14 WILDFIRE 4222 0.468 97 0.640 1473 1.048
15 WATERSPOUT 3860 0.428 4 0.026 32 0.023

From the table we see that the most frequent event, thunderstorm wind, accounts for only 6.1% of all fatalities and 6.9% of all injuries, whereas the third most frequent event, tornado, accounts for 37.2% of all fatalities and 65% of all injuries.

The last table concerning public health shows the total number of casualties (fatalities plus injuries) grouped by weather event and sorted in descending order:

df.total <- df.both
df.total$Total.Count.Accidents <- df.total$Fatalities + df.total$Injuries
df.total <- df.total[order(-df.total$Total.Count.Accidents), ]
rownames(df.total) <- 1:nrow(df.total)
drop.names <- match(c("Percentage", "Percent.Fatalities", "Percent.Injuries"), names(df.total))
df.total <- df.total[, -drop.names]
kable(df.total[1:15, ], row.names = TRUE, caption = "Table 9")
Table 9
Event Count.Event Fatalities Injuries Total.Count.Accidents
1 TORNADO 60664 5633 91346 96979
2 THUNDERSTORM WIND 324474 928 9679 10607
3 EXCESSIVE HEAT 1912 2029 6747 8776
4 FLOOD 25985 534 7523 8057
5 LIGHTNING 15842 854 5278 6132
6 HEAT 923 951 2277 3228
7 FLASH FLOOD 55996 1016 2080 3096
8 ICE STORM 2013 89 1985 2074
9 HEAVY RAIN 11840 167 1534 1701
10 WILDFIRE 4222 97 1473 1570
11 WINTER STORM 11463 206 1321 1527
12 HIGH WIND 20248 249 1137 1386
13 HAIL 288743 16 1368 1384
14 HEAVY SNOW 15738 127 1023 1150
15 BLIZZARD 2719 101 805 906

We see that the highest number of fatalities and injuries is due to tornadoes, followed by thunderstorm winds and excessive heat.

Next we show bar plots of the number of fatalities and the number of injuries by event type, using the ggplot2 package:

# Fatalities:
g1 <- ggplot(data = df.fatality[1:10, ], aes(x = reorder(Event, -Fatalities),
                                             fill = Event)) + 
        geom_bar(aes(y = Fatalities),
                 stat = "identity") +
        labs(x = "", 
             y = "Number of Fatalities",
             fill = "Event") +
        theme(legend.position = 'none',
              axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Number of Fatalities by Event") +
        scale_y_continuous(breaks = seq(0, 6000, 1000))


# Injuries:
g2 <- ggplot(data = df.injury[1:10, ], aes(x = reorder(Event, -Injuries),
                                           fill = Event)) + 
        geom_bar(aes(y = Injuries),
                 stat = "identity") +
        labs(x = "", 
             y = "Number of Injuries",
             fill = "Event") +
        theme(legend.position = 'none',
              axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Number of Injuries by Event") +
        scale_y_continuous(breaks = seq(0, 90000, 10000))


grid.arrange(grobs = list(g1, g2), ncol = 2)

4.2 Events with greatest economic consequences

For the analysis of the events with greatest economic consequences we will use the variables “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, “CROPDMGEXP” and “EVTYPE”.

According to the provided documentation, “PROPDMGEXP” contains the exponent values for “PROPDMG” (property damage). This means that to obtain the value of the property damage in US dollars, we have to multiply “PROPDMG” by 10 raised to the power stored in the “PROPDMGEXP” variable. By analogy, “CROPDMGEXP” contains the exponent values for “CROPDMG” (crop damage). In the variables “PROPDMGEXP” and “CROPDMGEXP” the powers of 10 are coded both as letters and as digits. B or b stands for billion, M or m for million, K or k for thousand, and H or h for hundred. There are also digits from 0 to 8, which directly give the power of ten (10^digit). For example, PROPDMG = 25 with PROPDMGEXP = “K” corresponds to 25 × 10^3 = $25,000 of property damage. The symbols “-”, “+” and “?” denote less than, greater than and low certainty, respectively; we will ignore them.

We start with the analysis of property damage. First we list the sorted unique character values of the “PROPDMGEXP” variable:

sort(unique(storm.db$PROPDMGEXP))
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

We see that there are both lower-case and upper-case versions of the same letters, so we transform all letters to upper case:

storm.db$PROPDMGEXP <- toupper(storm.db$PROPDMGEXP) 

sort(unique(storm.db$PROPDMGEXP))
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "H" "K" "M"

Next we modify the “PROPDMGEXP” variable (of atomic type character) so that it contains the respective power of 10, still represented as a string:

storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "H"] <- 10^2
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "K"] <- 10^3
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "M"] <- 10^6
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "B"] <- 10^9
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "0"] <- 1
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "1"] <- 10
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "2"] <- 10^2
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "3"] <- 10^3
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "4"] <- 10^4
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "5"] <- 10^5
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "6"] <- 10^6
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "7"] <- 10^7
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "8"] <- 10^8

We set to 0 the symbols that we do not use:

storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "+"] <- 0
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "-"] <- 0
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "?"] <- 0
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == ""] <- 0

We transform “PROPDMGEXP” into a numeric variable and then we create a new variable called “PROPDMGCOST”, which we add to the storm.db data table:

storm.db$PROPDMGEXP <- as.numeric(storm.db$PROPDMGEXP)

storm.db$PROPDMGCOST <- storm.db$PROPDMG*storm.db$PROPDMGEXP
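The chain of per-code assignments above can also be written as a single named lookup vector, which keeps the whole code-to-multiplier mapping in one place. A sketch, equivalent to the transformations above when applied to the original (upper-cased) character codes; exp.map, mult and prop.cost are illustrative names:

# One-place mapping from exponent codes to numeric multipliers:
exp.map <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9,
             "0" = 1,   "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
             "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8,
             "+" = 0,   "-" = 0,   "?" = 0)
mult <- exp.map[storm.db$PROPDMGEXP]
mult[is.na(mult)] <- 0  # codes not in the map (e.g. the empty string) count as 0
prop.cost <- storm.db$PROPDMG * mult  # identical to PROPDMGCOST above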

The “PROPDMGCOST” variable contains the value of the property damage costs in US dollars. Our next step is to generate the data frame df.property, which has two variables: “Event”, containing the weather event, and “Losses”, containing the total cost of property damage for that event. The data in df.property are sorted in descending order by the total losses due to property damage:

df.property <- aggregate(storm.db$PROPDMGCOST, by = list(storm.db$EVTYPE),
                         FUN = sum)
colnames(df.property) <- c("Event", "Losses")
df.property <- df.property[order(-df.property$Losses), ]
rownames(df.property) <- 1:nrow(df.property)

We display the top ten events with respect to cost of property damage:

kable(df.property[1:10, ], row.names = TRUE, caption = "Table 10")
Table 10
Event Losses
1 FLOOD 144679596800
2 HEAVY RAIN 70000103590
3 TORNADO 56947892815
4 STORM SURGE 43323536000
5 FLASH FLOOD 17563971075
6 HAIL 15749092677
7 HURRICANE (TYPHOON) 15450340010
8 THUNDERSTORM WIND 10235772311
9 WILDFIRE 7767443500
10 TROPICAL STORM 7703890550

The weather events with the biggest impact on the US economy through property damage are flood, heavy rain, tornado and storm surge.

Below we perform transformations on the “CROPDMGEXP” variable which are analogous to those we performed on the “PROPDMGEXP” variable:

sort(unique(storm.db$CROPDMGEXP))
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"
storm.db$CROPDMGEXP <- toupper(storm.db$CROPDMGEXP) 

sort(unique(storm.db$CROPDMGEXP))
## [1] ""  "?" "0" "2" "B" "K" "M"
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "K"] <- 10^3
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "M"] <- 10^6
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "B"] <- 10^9
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "0"] <- 1
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "2"] <- 10^2

storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "?"] <- 0
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == ""] <- 0

storm.db$CROPDMGEXP <- as.numeric(storm.db$CROPDMGEXP)

storm.db$CROPDMGCOST <- storm.db$CROPDMG*storm.db$CROPDMGEXP

The newly generated “CROPDMGCOST” variable contains the value of the crop damage costs in US dollars. We then create the data frame df.crop, which has two variables: “Event”, containing the weather event, and “Losses”, containing the total cost of crop damage for that event. The data in df.crop are sorted in descending order by the total losses due to crop damage:

df.crop <- aggregate(storm.db$CROPDMGCOST, by = list(storm.db$EVTYPE),
                         FUN = sum)
colnames(df.crop) <- c("Event", "Losses")
df.crop <- df.crop[order(-df.crop$Losses), ]
rownames(df.crop) <- 1:nrow(df.crop)

We display the top ten events with respect to costs of crop damage:

kable(df.crop[1:10, ], row.names = TRUE, caption = "Table 11")
Table 11
Event Losses
1 DROUGHT 13972621780
2 FLOOD 5661968450
3 RIVER FLOOD 5029459000
4 ICE STORM 5022113500
5 HEAVY RAIN 3341275600
6 HAIL 3086455470
7 HURRICANE (TYPHOON) 2907420000
8 FLASH FLOOD 1462093700
9 EXTREME COLD/WIND CHILL 1313023000
10 THUNDERSTORM WIND 1175491780

The biggest losses due to crop damage are caused by drought, flood, river flood and ice storm.

Below we show bar plots of the cost of crop and property damage in billions of US dollars, grouped by type of weather event:

# Crop damage:
g1 <- ggplot(data = df.crop[1:10, ], aes(x = reorder(Event, -Losses), 
                                         fill = Event)) + 
        geom_bar(aes(y = Losses/10^9),
                 stat = "identity") +
        labs(x = "", 
             y = "Billions US$",
             fill = "Event") +
        theme(legend.position = 'none',
              axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Crop Damage in Billions of US$") +
        scale_y_continuous(breaks = seq(0, 15, 2))


# Property damage:

g2 <- ggplot(data = df.property[1:10, ], aes(x = reorder(Event, -Losses),
                                             fill = Event)) + 
        geom_bar(aes(y = Losses/10^9),
                 stat = "identity") +
        labs(x = "", 
             y = "Billions US$",
             fill = "Event") +
        theme(legend.position = 'none',
              axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Property Damage in Billions of US$") +
        scale_y_continuous(breaks = seq(0, 150, 20))

grid.arrange(grobs = list(g1, g2), ncol = 2)

From the figures above we see that the biggest losses due to crop damage are caused by drought, followed by flood. The biggest losses due to property damage are caused by flood, heavy rain and tornado.

Next we calculate the total cost of both property and crop damage. We store the result in a new variable called “DMGTOTAL” and create the total.dmg data frame, which contains the total losses grouped by event type and sorted in descending order:

storm.db$DMGTOTAL <- storm.db$PROPDMGCOST + storm.db$CROPDMGCOST

total.dmg <- aggregate(storm.db$DMGTOTAL, by = list(storm.db$EVTYPE),
                         FUN = sum)
colnames(total.dmg) <- c("Event", "Losses")
total.dmg <- total.dmg[order(-total.dmg$Losses), ]
rownames(total.dmg) <- 1:nrow(total.dmg)

Finally we show a bar plot of the cost of total damages in billions of US dollars:

ggplot(data = total.dmg[1:10, ], aes(x = reorder(Event, -Losses), fill = Event)) + 
geom_bar(aes(y = Losses/10^9),
         stat = "identity") +
labs(x = "", 
     y = "Billions US$",
     fill = "Event") +
theme(legend.position = 'none',
      axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Total  Damage in Billions of US$") +
scale_y_continuous(breaks = seq(0, 150, 20))

We observe that flood, heavy rain and tornado are the three main severe weather events causing the biggest economic losses due to crop and property damage.