1 Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The goal of the project is to identify the most harmful events with respect to population health as well as the events with the greatest economic consequences. Substantial preprocessing is required to clean the data and transform them into a format suitable for an accurate analysis. We identify possible misspellings by calculating the Damerau-Levenshtein distance between the unique event names in the provided dataset and the names of the official weather events given in the National Weather Service Storm Data Documentation.

The results of our analysis show that the most harmful events with respect to population health are tornado, thunderstorm wind, excessive heat and flood. We further find that the events with the greatest economic consequences are flood, heavy rain and tornado.

All computational times reported below were measured on a laptop with an Intel Core i7-5500U CPU and 8 GB of DDR3-1600 RAM.

2 Overview of the Data Set

The data for this project come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded from here:

Storm Data

There is also some documentation of the database available, describing how some of the variables are defined:

– National Weather Service Storm Data Documentation

– National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

3 Data Processing

First we load the packages that we will need:

library(data.table)
library(knitr)
library(lubridate)
library(stringdist)
library(ggplot2)
library(gridExtra)
library(R.utils)

3.1 Loading the Data

We set the working directory:

setwd("D:/Coursera_Reproducible_Research/Project_Assignment_2")

Then we download the archive and decompress it:

if(!file.exists("./Storm_Data.bz2")){
    fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    # mode = "wb" ensures the binary archive is not corrupted on Windows:
    download.file(fileUrl, destfile = "./Storm_Data.bz2", mode = "wb")
    bunzip2(filename = "./Storm_Data.bz2", destname = "./Storm_Data.csv", remove = FALSE)
}

In order to decompress the “Storm_Data.bz2” file we use the R function bunzip2 from the R.utils package. We set remove=FALSE to keep the input compressed file “Storm_Data.bz2”. The default setting is remove=TRUE, which causes the input file to be removed after the decompressed output file is created.
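As a side note, base R can read bz2-compressed files directly through a connection, so the explicit decompression is mainly needed because the version of fread() used here requires a plain-text file. A sketch for comparison (much slower than fread() on a file of this size):

# read.csv() decompresses transparently through a bzfile() connection:
storm.df <- read.csv("./Storm_Data.bz2", stringsAsFactors = FALSE)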

Finally we read the data into the storm.db data table:

t1 <- Sys.time()

storm.db <- fread(input = "./Storm_Data.csv", sep = "auto", header = TRUE)
## Read 22.7% of 967216 rows
## Read 42.4% of 967216 rows
## Read 54.8% of 967216 rows
## Read 73.4% of 967216 rows
## Read 82.7% of 967216 rows
## Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:08
t2 <- Sys.time()
t2-t1
## Time difference of 7.507717 secs

We use the package data.table and its function fread() to load the storm data. The data.table class inherits from data.frame but is implemented in C, which makes it much faster, especially for big data sets such as this one. As the R documentation states, “data.table inherits from data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development.” Hence, all functions that accept data.frame objects also work on data.table objects, with the advantage that operations like subsetting, grouping and updating are much faster on data.table objects. The reported time needed to read the data is 7.508 s, far faster than reading the same file with read.csv() or read.table().
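To put the 7.5 s in perspective, one can time the base-R reader on the same decompressed file. A rough sketch (timings vary with hardware and disk cache):

# Rough timing comparison; read.csv() is typically an order of magnitude slower:
system.time(storm.df <- read.csv("./Storm_Data.csv"))
system.time(storm.dt <- fread("./Storm_Data.csv"))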

3.2 Structure and Summary of the Data

We start by taking a look at the structure of the storm.db data table. We see that there are 902297 observations of 37 variables:

dim(storm.db)
## [1] 902297     37

The variables are of type “character”, “numeric” and “logical”:

str(storm.db)
Classes 'data.table' and 'data.frame':  902297 obs. of  37 variables:
 $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
 $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
 $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
 $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
 $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
 $ STATE     : chr  "AL" "AL" "AL" "AL" ...
 $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
 $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
 $ BGN_AZI   : chr  "" "" "" "" ...
 $ BGN_LOCATI: chr  "" "" "" "" ...
 $ END_DATE  : chr  "" "" "" "" ...
 $ END_TIME  : chr  "" "" "" "" ...
 $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
 $ COUNTYENDN: logi  NA NA NA NA NA NA ...
 $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
 $ END_AZI   : chr  "" "" "" "" ...
 $ END_LOCATI: chr  "" "" "" "" ...
 $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
 $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
 $ F         : chr  "3" "2" "2" "2" ...
 $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
 $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
 $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
 $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
 $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ CROPDMGEXP: chr  "" "" "" "" ...
 $ WFO       : chr  "" "" "" "" ...
 $ STATEOFFIC: chr  "" "" "" "" ...
 $ ZONENAMES : chr  "" "" "" "" ...
 $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
 $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
 $ LATITUDE_E: num  3051 0 0 0 0 ...
 $ LONGITUDE_: num  8806 0 0 0 0 ...
 $ REMARKS   : chr  "" "" "" "" ...
 $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr> 

and their names are:

names(storm.db)
 [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
 [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
[11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
[16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
[21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
[26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
[31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
[36] "REMARKS"    "REFNUM"    

For our analysis we will need only the variables “EVTYPE”, “FATALITIES”, “INJURIES”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG” and “CROPDMGEXP”. Below we show the first six rows of the storm.db data table:

head(storm.db)
   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
1:       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
2:       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
3:       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
4:       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
5:       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
6:       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
1: TORNADO         0                                               0
2: TORNADO         0                                               0
3: TORNADO         0                                               0
4: TORNADO         0                                               0
5: TORNADO         0                                               0
6: TORNADO         0                                               0
   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
1:         NA         0                      14.0   100 3   0          0
2:         NA         0                       2.0   150 2   0          0
3:         NA         0                       0.1   123 2   0          0
4:         NA         0                       0.0   100 2   0          0
5:         NA         0                       0.0   150 2   0          0
6:         NA         0                       1.5   177 2   0          0
   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
1:       15    25.0          K       0                                    
2:        0     2.5          K       0                                    
3:        2    25.0          K       0                                    
4:        2     2.5          K       0                                    
5:        2     2.5          K       0                                    
6:        6     2.5          K       0                                    
   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
1:     3040      8812       3051       8806              1
2:     3042      8755          0          0              2
3:     3340      8742          0          0              3
4:     3458      8626          0          0              4
5:     3412      8642          0          0              5
6:     3450      8748          0          0              6
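Since only seven of the 37 variables are used in what follows, one could optionally work with a slimmer table. A sketch (the rest of the report keeps the full storm.db):

# Keep only the columns needed for the analysis; with = FALSE tells
# data.table to treat keep.cols as a character vector of column names:
keep.cols <- c("EVTYPE", "FATALITIES", "INJURIES",
               "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
storm.small <- storm.db[, keep.cols, with = FALSE]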

3.3 Cleaning the “EVTYPE” Variable

We start by determining the number of unique weather events stored in the “EVTYPE” variable:

length(unique(storm.db$EVTYPE))
## [1] 985

We see that there are 985 unique events, but according to the provided documentation there should be only 48 event types. Looking more closely at the storm data set we notice many typos. There are also events with the same name spelled in both upper and lower case. We further notice records whose event names combine two or more existing event types, as well as many new weather events that are not among the officially documented 48. Therefore, we preprocess the “EVTYPE” variable to reduce the number of unique events. First, we transform all event names to upper case:

storm.db$EVTYPE <- toupper(storm.db$EVTYPE)

length(unique(storm.db$EVTYPE))
## [1] 898

We observe that after turning all event names to upper case, the number of unique events is reduced by 87, from 985 to 898.

Next we display the 10 weather events with the greatest number of occurrences (i.e. the highest frequency):

df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event", "Count")
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[c(1:10), ], row.names = TRUE, caption = "Table 1")
Table 1
Event Count
1 HAIL 288661
2 TSTM WIND 219942
3 THUNDERSTORM WIND 82564
4 TORNADO 60652
5 FLASH FLOOD 54277
6 FLOOD 25327
7 THUNDERSTORM WINDS 20843
8 HIGH WIND 20214
9 LIGHTNING 15754
10 HEAVY SNOW 15708

From the table above we see that the weather event with the highest frequency is “HAIL”, followed by “TSTM WIND”, “THUNDERSTORM WIND”, “TORNADO” and “FLASH FLOOD” in fifth place. We notice that the event “THUNDERSTORM WIND” is also recorded as “TSTM WIND” and “THUNDERSTORM WINDS”. Hence, we rename all occurrences of “TSTM WIND” and “THUNDERSTORM WINDS” to “THUNDERSTORM WIND”:

t1 <- Sys.time()
storm.db[storm.db$EVTYPE == "TSTM WIND" | storm.db$EVTYPE == "THUNDERSTORM WINDS", ]$EVTYPE <- "THUNDERSTORM WIND"
t2 <- Sys.time()
t2-t1
## Time difference of 0.3711889 secs
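As a side note, the same renaming can be done in place with data.table’s := operator, which avoids the copy implied by the base-R assignment above. A sketch:

# In-place update of the matching rows (equivalent result):
storm.db[EVTYPE %in% c("TSTM WIND", "THUNDERSTORM WINDS"),
         EVTYPE := "THUNDERSTORM WIND"]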

After this transformation we display again the top 10 (by frequency) weather events:

df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event", "Count")
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[c(1:10), ], row.names = TRUE, caption = "Table 2")
Table 2
Event Count
1 THUNDERSTORM WIND 323349
2 HAIL 288661
3 TORNADO 60652
4 FLASH FLOOD 54277
5 FLOOD 25327
6 HIGH WIND 20214
7 LIGHTNING 15754
8 HEAVY SNOW 15708
9 HEAVY RAIN 11742
10 WINTER STORM 11433

Now “THUNDERSTORM WIND” is in first place, followed by “HAIL”, “TORNADO”, “FLASH FLOOD” and “FLOOD” in fifth place. Also, “HEAVY RAIN” and “WINTER STORM” appear in the top ten.

3.3.1 Measuring the String Distance

We will try to further reduce the number of unique weather events by identifying possible misspellings. We calculate the Damerau–Levenshtein distance between the unique event names in the storm.db data table and the names of the official weather events given in the National Weather Service Storm Data Documentation. According to Wikipedia, the “Damerau–Levenshtein distance is a distance (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.” In R we compute the aforementioned string distance with the function stringdistmatrix() from the stringdist package.
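To get a concrete feel for the metric, consider the hypothetical misspelling “TORNDAO”: one transposition of the adjacent characters “D” and “A” turns it into “TORNADO”:

# Damerau-Levenshtein counts the adjacent transposition as one operation:
stringdist("TORNDAO", "TORNADO", method = "dl")  # 1
# Plain Levenshtein (method "lv") has no transpositions, so two substitutions are needed:
stringdist("TORNDAO", "TORNADO", method = "lv")  # 2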

Let us denote by storm.db.name a unique value of the “EVTYPE” variable from the storm.db data table, and by official.name – any of the 48 official event names. We apply the following strategy for cleaning the “EVTYPE” variable:

  1. We calculate the Damerau–Levenshtein string distance between storm.db.name and official.name: dist(storm.db.name, official.name).

  2. We compute all pairwise string distances between any two distinct official event names and keep the pairs for which \(1\leq\) dist(official.name.1, official.name.2) \(\leq 3\). We denote by list(official.self.dist) the list of the calculated pairs.

  3. If the string distance dist(storm.db.name, official.name) is greater than zero and less than or equal to 3, and the pair (storm.db.name, official.name) does not belong to list(official.self.dist), we replace storm.db.name with official.name. (The reason for this exclusion is illustrated right after this list.)
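Step 2 is needed because some official event names are themselves only a few edits apart, and without the exclusion in step 3 valid records would be renamed into a different official event. For example:

# "HAIL" and "HEAT" are both official event types, yet only three edits apart;
# without the exclusion, every "HAIL" record would be "corrected" to "HEAT":
stringdist("HAIL", "HEAT", method = "dl")  # 3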

We start by creating the character vector event.type, which contains all unique weather event names from the storm.db data table:

event.type <- unique(storm.db$EVTYPE)

We see that there are 896 unique elements of the vector event.type:

length(event.type)
## [1] 896

Then we construct the character vector events which is composed of all 48 official weather events provided in the National Weather Service documentation:

events <- c("ASTRONOMICAL LOW TIDE", "AVALANCHE", "BLIZZARD", "COASTAL FLOOD",
            "COLD/WIND CHILL", "DEBRIS FLOW", "DENSE FOG", "DENSE SMOKE",
            "DROUGHT", "DUST DEVIL", "DUST STORM", "EXCESSIVE HEAT", 
            "EXTREME COLD/WIND CHILL", "FLASH FLOOD", "FLOOD", "FROST/FREEZE",
            "FUNNEL CLOUD", "FREEZING FOG", "HAIL", "HEAT",
            "HEAVY RAIN",
            "HEAVY SNOW", "HIGH SURF", "HIGH WIND", "HURRICANE (TYPHOON)",
            "ICE STORM", "LAKE-EFFECT SNOW", "LAKESHORE FLOOD", "LIGHTNING",
            "MARINE HAIL", "MARINE HIGH WIND", "MARINE STRONG WIND", "MARINE THUNDERSTORM WIND",
            "RIP CURRENT", "SEICHE", "SLEET", "STORM SURGE/TIDE", "STRONG WIND",
            "THUNDERSTORM WIND", "TORNADO", "TROPICAL DEPRESSION", "TROPICAL STORM",
            "TSUNAMI", "VOLCANIC ASH", "WATERSPOUT", "WILDFIRE", "WINTER STORM",
            "WINTER WEATHER")

Step 1:

Now we compute the pairwise Damerau-Levenshtein string distances between the elements of the two character vectors. We store the resulting matrix in the strdist object:

strdist <- stringdistmatrix(event.type, events, method = "dl")

Below we create the matrix matching.events that contains the indices of the elements of the character vectors event.type and events for which the respective string distance is greater than or equal to 1 and less than or equal to 3:

res <- list()
cnt <- 0

# Collect each index pair (i, j) whose string distance lies in [1, 3]:
for(i in 1:dim(strdist)[1]){
    for(j in 1:dim(strdist)[2]){
        if(strdist[i,j] > 0 & strdist[i,j] <= 3){
            cnt <- cnt + 1
            res[[cnt]] <- c(i,j)
        }
    }
}

# Bind the collected pairs into a matrix with one column per pair:
match.events <- do.call(cbind, res)


matching.events <- as.matrix(match.events)
matching.events <- t(matching.events)

The matrix matching.events has 101 rows and 2 columns. In the first column are stored the indices i of the elements of the character vector event.type and in the second column – the corresponding indices j of the elements of the vector events. For each pair of indices (i,j) we have that \(1\leq\)dist(event.type[i], events[j])\(\leq 3\). Here are the first few rows of the matrix matching.events:

head(matching.events)
##      [,1] [,2]
## [1,]    3   20
## [2,]   17   39
## [3,]   19   14
## [4,]   20   24
## [5,]   22   40
## [6,]   25   19
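The nested loops above can also be replaced by a single vectorized call. A sketch (which() traverses the matrix column by column, so the rows come back in a different order, but they describe the same 101 pairs):

# Vectorized equivalent of the double loop:
matching.events.alt <- which(strdist > 0 & strdist <= 3, arr.ind = TRUE)
dim(matching.events.alt)  # 101 rows: column 1 indexes event.type, column 2 indexes events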

For convenience we transform the matrix matching.events into a data frame:

matching.events <- as.data.frame(matching.events)

colnames(matching.events) <- c("index.db.name", "index.correct.name")

The character vectors db.name and real.name contain the unique values of the “EVTYPE” variable and the official weather event names, respectively, for which the Damerau-Levenshtein string distance is in the interval [1,3]:

db.name <- event.type[matching.events[ ,"index.db.name"]]

real.name <- events[matching.events[ ,"index.correct.name"]]

Step 2:

Next we compute the string distances between the elements of the character vector events (which contains the 48 official weather event names):

self.dist <- stringdistmatrix(events, events, method = "dl")

Then we generate the matrix self.match that contains the indices i and j of the elements of the character vector events for which the respective string distance dist(events[i], events[j]) is greater than or equal to 1 and less than or equal to 3:

res <- list()
cnt <- 0

# Collect each pair (i, j) of official names with distance in [1, 3]:
for(i in 1:dim(self.dist)[1]){
    for(j in 1:dim(self.dist)[2]){
        if(self.dist[i,j] >= 1 & self.dist[i,j] <= 3){
            cnt <- cnt + 1
            res[[cnt]] <- c(i,j)
        }
    }
}

# Bind the collected pairs into a matrix with one column per pair:
match.events <- do.call(cbind, res)

self.match <- as.matrix(match.events)

self.match <- t(self.match)

The matrix self.match has 4 rows and 2 columns. Next we create the character vector self.pairs, whose elements are composed of the pasted strings representing the official event names with string distance between them in the interval [1,3]:

self.pairs <- vector(mode = "character")

for(i in 1:dim(self.match)[1]){
    self.pairs <- c(self.pairs,
                    paste(events[self.match[i,1]], events[self.match[i,2]]))
}


self.pairs
## [1] "HAIL HEAT"  "HEAT HAIL"  "HEAT SLEET" "SLEET HEAT"
length(self.pairs)
## [1] 4

Our next step is to create the character vector pairs composed of the 101 pasted matching pairs (“EVTYPE” name, official name) such that the string distance between them, dist(“EVTYPE” name, official name), is in the interval [1,3]:

pairs <- mapply(function(db, correct){
    
    paste(db, correct, sep = " ")
    
}, db.name, real.name)

pairs <- as.vector(x = pairs, mode = "character")
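Since paste() is vectorized over its arguments, the mapply() call above (and likewise the loop that built self.pairs) collapses to a one-liner. A sketch of the equivalent forms:

# paste() works element-wise, so these give the same character vectors:
pairs      <- paste(db.name, real.name)
self.pairs <- paste(events[self.match[, 1]], events[self.match[, 2]])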

Step 3:

The only thing left is to exclude the matching pairs that coincide with pairs of official event names (those with a self-distance between 1 and 3). We filter db.name and real.name in parallel, so that the two vectors stay aligned element by element:

db.name.out.self   <- db.name[!(pairs %in% self.pairs)]
real.name.out.self <- real.name[!(pairs %in% self.pairs)]

Finally we replace the misspelled “EVTYPE” names with the respective correct official names:

t1 <- Sys.time()
# real.name.out.self is aligned with db.name.out.self, so the positions
# returned by match() index the correct replacement names:
idx <- which(storm.db$EVTYPE %in% db.name.out.self)
storm.db[idx]$EVTYPE <- real.name.out.self[match(storm.db[idx, ]$EVTYPE, db.name.out.self)]
t2 <- Sys.time()
t2-t1
## Time difference of 0.232378 secs

We report the time needed for the described string transformation: it is only 0.232 s.
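For completeness, the same replacement can be expressed as a data.table update join. A sketch, assuming each misspelling maps to a single official name (the lookup table corrections is introduced here only for illustration):

# Rows of storm.db whose EVTYPE matches a misspelling receive the corrected
# name from the lookup table; i.fixed refers to the column of the i argument:
corrections <- data.table(EVTYPE = db.name.out.self, fixed = real.name.out.self)
storm.db[corrections, EVTYPE := i.fixed, on = "EVTYPE"]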

We see that the number of unique event types in the “EVTYPE” variable decreased from 896 to 801:

length(unique(storm.db$EVTYPE))
## [1] 801

We show the top ten events (by frequency) after the above transformations:

df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event" , "Count")
df.event$Percentage <- round(df.event$Count/sum(df.event$Count)*100, 3)
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[1:10, ], row.names = TRUE, caption = "Table 3")
Table 3
Event Count Percentage
1 THUNDERSTORM WIND 324474 35.961
2 HAIL 288742 32.001
3 TORNADO 60664 6.723
4 FLASH FLOOD 55996 6.206
5 FLOOD 25985 2.880
6 HIGH WIND 20248 2.244
7 LIGHTNING 15842 1.756
8 HEAVY SNOW 15713 1.741
9 HEAVY RAIN 11840 1.312
10 WINTER STORM 11463 1.270

We observe that there is no change in the top ten weather events compared to Table 2.

3.3.2 Further Manual Cleaning

After closely examining the storm.db data set we notice further inconsistencies in the weather event names which we clean manually:

# Rename the exact event name "MARINE TSTM WIND":
storm.db[storm.db$EVTYPE == "MARINE TSTM WIND", ]$EVTYPE <- "MARINE THUNDERSTORM WIND"

# Rename the exact event name "WILD/FOREST FIRE":
storm.db[storm.db$EVTYPE == "WILD/FOREST FIRE", ]$EVTYPE <- "WILDFIRE"

# Rename the exact event name "UNSEASONABLY WARM":
storm.db[storm.db$EVTYPE == "UNSEASONABLY WARM", ]$EVTYPE <- "EXCESSIVE HEAT"

# Rename the exact event name "EXCESSIVE SNOW":
storm.db[storm.db$EVTYPE == "EXCESSIVE SNOW", ]$EVTYPE <- "HEAVY SNOW"

# Rename names that contain the string "HURRICANE":
storm.db[grep("HURRICANE", storm.db$EVTYPE), ]$EVTYPE <- "HURRICANE (TYPHOON)"

# Rename names that contain the string "HIGH TEMPERATURE":
storm.db[grep("HIGH TEMPERATURE", storm.db$EVTYPE), ]$EVTYPE <- "HEAT"

# Rename names that contain the string "HAIL STORM":
storm.db[grep("HAIL STORM", storm.db$EVTYPE), ]$EVTYPE <- "HAIL"

# Rename names that contain "EXTREME COLD", "RECORD COLD" or "COLD RECORD":
storm.db[grep("EXTREME COLD|RECORD COLD|COLD RECORD", storm.db$EVTYPE), ]$EVTYPE <- "EXTREME COLD/WIND CHILL"

# Rename names that contain the string "DROUGHT":
storm.db[grep("DROUGHT", storm.db$EVTYPE), ]$EVTYPE <- "DROUGHT"

# Rename names that contain "EXCESSIVE HEAT", "RECORD HEAT", "HEAT RECORD" or "EXTREME HEAT":
storm.db[grep("EXCESSIVE HEAT|RECORD HEAT|HEAT RECORD|EXTREME HEAT", storm.db$EVTYPE), ]$EVTYPE <- "EXCESSIVE HEAT"

# Rename names that contain the string "COASTAL FLOOD":
storm.db[grep("COASTAL FLOOD", storm.db$EVTYPE), ]$EVTYPE <- "COASTAL FLOOD"

After the above operations, the number of unique event names is:

length(unique(storm.db$EVTYPE))
## [1] 766

More preprocessing could probably be done, but it would require considerable time and effort. The task is further complicated by the many event names in the storm data set that have no analogue in the official documentation and therefore must be kept as they are.

We show again the top ten severe weather events:

df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event", "Count")
df.event$Percentage <- round(df.event$Count/sum(df.event$Count)*100, 3)
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[1:10, ], row.names = TRUE, caption = "Table 4")
Table 4
Event Count Percentage
1 THUNDERSTORM WIND 324474 35.961
2 HAIL 288743 32.001
3 TORNADO 60664 6.723
4 FLASH FLOOD 55996 6.206
5 FLOOD 25985 2.880
6 HIGH WIND 20248 2.244
7 LIGHTNING 15842 1.756
8 HEAVY SNOW 15738 1.744
9 MARINE THUNDERSTORM WIND 11987 1.328
10 HEAVY RAIN 11840 1.312

We see that compared to Table 3, “WINTER STORM” drops out, whereas “MARINE THUNDERSTORM WIND” makes it into the top ten.

4 Results

4.1 Most harmful events with respect to population health

For the analysis of the most harmful events with respect to population health we will use the variables “FATALITIES”, “INJURIES” and “EVTYPE”.

Below we create the data frame df.fatality which contains the total number of fatalities grouped by event type (the variable “EVTYPE”) and sorted in descending order:

# Fatalities:

df.fatality <- aggregate(storm.db$FATALITIES, by = list(storm.db$EVTYPE), FUN = sum)
colnames(df.fatality) <- c("Event", "Fatalities") 
df.fatality$Percentage <- round(df.fatality$Fatalities/sum(df.fatality$Fatalities)*100, 3)
df.fatality <- df.fatality[order(-df.fatality$Fatalities), ]
rownames(df.fatality) <- 1:nrow(df.fatality)

Then we display the top 10 severe weather events in terms of number of fatalities:

kable(df.fatality[1:10, ], row.names = TRUE, caption = "Table 5")
Table 5
Event Fatalities Percentage
1 TORNADO 5633 37.194
2 EXCESSIVE HEAT 2029 13.397
3 FLASH FLOOD 1016 6.708
4 HEAT 951 6.279
5 THUNDERSTORM WIND 928 6.127
6 LIGHTNING 854 5.639
7 FLOOD 534 3.526
8 RIP CURRENT 368 2.430
9 EXTREME COLD/WIND CHILL 288 1.902
10 HIGH WIND 249 1.644

From the table we see that the top 3 events with the greatest number of fatalities are tornado, excessive heat and flash flood.

Next we create the data frame df.injury which contains the total number of injuries grouped by event type (the variable “EVTYPE”) and sorted in descending order:

# Injuries:

df.injury <- aggregate(storm.db$INJURIES, by = list(storm.db$EVTYPE), FUN = sum)
colnames(df.injury) <- c("Event", "Injuries") 
df.injury$Percentage <- round(df.injury$Injuries/sum(df.injury$Injuries)*100, 3)
df.injury <- df.injury[order(-df.injury$Injuries), ]
rownames(df.injury) <- 1:nrow(df.injury)

We display the top 10 severe weather events in terms of number of injured people:

kable(df.injury[1:10, ], row.names = TRUE, caption = "Table 6")
Table 6
Event Injuries Percentage
1 TORNADO 91346 65.002
2 THUNDERSTORM WIND 9679 6.888
3 FLOOD 7523 5.353
4 EXCESSIVE HEAT 6747 4.801
5 LIGHTNING 5278 3.756
6 HEAT 2277 1.620
7 FLASH FLOOD 2080 1.480
8 ICE STORM 1985 1.413
9 HEAVY RAIN 1534 1.092
10 WILDFIRE 1473 1.048

From the table we see that the top 3 events with the greatest number of injuries are tornado, thunderstorm wind and flood.

In the next table we show the ten most harmful events with respect to number of fatalities, together with the respective number of injuries:

# Both fatalities and injuries:

df.health <- merge(df.fatality, df.injury, by = "Event")
df.health <- df.health[order(-df.health$Fatalities), ]
rownames(df.health) <- 1:nrow(df.health)
colnames(df.health) <- c("Event", "Fatalities", "Percent.Fatalities", "Injuries", "Percent.Injuries")
kable(df.health[1:10, ], row.names = TRUE, caption = "Table 7")
Table 7
Event Fatalities Percent.Fatalities Injuries Percent.Injuries
1 TORNADO 5633 37.194 91346 65.002
2 EXCESSIVE HEAT 2029 13.397 6747 4.801
3 FLASH FLOOD 1016 6.708 2080 1.480
4 HEAT 951 6.279 2277 1.620
5 THUNDERSTORM WIND 928 6.127 9679 6.888
6 LIGHTNING 854 5.639 5278 3.756
7 FLOOD 534 3.526 7523 5.353
8 RIP CURRENT 368 2.430 232 0.165
9 EXTREME COLD/WIND CHILL 288 1.902 255 0.181
10 HIGH WIND 249 1.644 1137 0.809

The next table provides the event types sorted in descending order by frequency together with the respective number of fatalities and injuries:

df.both <- merge(df.event, df.health, by = "Event")
df.both <- df.both[order(-df.both$Count), ]
rownames(df.both) <- 1:nrow(df.both)
colnames(df.both) <- c("Event", "Count.Event", "Percentage", "Fatalities", "Percent.Fatalities", "Injuries", "Percent.Injuries")
kable(df.both[1:15, ], row.names = TRUE, caption = "Table 8")
Table 8
Event Count.Event Percentage Fatalities Percent.Fatalities Injuries Percent.Injuries
1 THUNDERSTORM WIND 324474 35.961 928 6.127 9679 6.888
2 HAIL 288743 32.001 16 0.106 1368 0.973
3 TORNADO 60664 6.723 5633 37.194 91346 65.002
4 FLASH FLOOD 55996 6.206 1016 6.708 2080 1.480
5 FLOOD 25985 2.880 534 3.526 7523 5.353
6 HIGH WIND 20248 2.244 249 1.644 1137 0.809
7 LIGHTNING 15842 1.756 854 5.639 5278 3.756
8 HEAVY SNOW 15738 1.744 127 0.839 1023 0.728
9 MARINE THUNDERSTORM WIND 11987 1.328 19 0.125 34 0.024
10 HEAVY RAIN 11840 1.312 167 1.103 1534 1.092
11 WINTER STORM 11463 1.270 206 1.360 1321 0.940
12 WINTER WEATHER 7045 0.781 33 0.218 398 0.283
13 FUNNEL CLOUD 6849 0.759 4 0.026 153 0.109
14 WILDFIRE 4222 0.468 97 0.640 1473 1.048
15 WATERSPOUT 3860 0.428 4 0.026 32 0.023

From the table we see that the most frequent event, thunderstorm wind, accounts for only 6.1% of all fatalities and 6.9% of all injuries, whereas the third most frequent event, tornado, accounts for 37.2% of all fatalities and 65% of all injuries.

The last table concerning public health shows the total number of casualties (fatalities plus injuries) grouped by weather event and sorted in descending order:

df.total <- df.both
df.total$Total.Count.Accidents <- df.total$Fatalities + df.total$Injuries
df.total <- df.total[order(-df.total$Total.Count.Accidents), ]
rownames(df.total) <- 1:nrow(df.total)
drop.names <- match(c("Percentage", "Percent.Fatalities", "Percent.Injuries"), names(df.total))
df.total <- df.total[, -drop.names]
kable(df.total[1:15, ], row.names = TRUE, caption = "Table 9")
Table 9
Event Count.Event Fatalities Injuries Total.Count.Accidents
1 TORNADO 60664 5633 91346 96979
2 THUNDERSTORM WIND 324474 928 9679 10607
3 EXCESSIVE HEAT 1912 2029 6747 8776
4 FLOOD 25985 534 7523 8057
5 LIGHTNING 15842 854 5278 6132
6 HEAT 923 951 2277 3228
7 FLASH FLOOD 55996 1016 2080 3096
8 ICE STORM 2013 89 1985 2074
9 HEAVY RAIN 11840 167 1534 1701
10 WILDFIRE 4222 97 1473 1570
11 WINTER STORM 11463 206 1321 1527
12 HIGH WIND 20248 249 1137 1386
13 HAIL 288743 16 1368 1384
14 HEAVY SNOW 15738 127 1023 1150
15 BLIZZARD 2719 101 805 906

We see that the highest number of fatalities and injuries is due to tornadoes, followed by thunderstorm winds and excessive heat.

Next we show bar plots of the number of fatalities and the number of injuries by event type, using the ggplot2 package:

# Fatalities:
g1 <- ggplot(data = df.fatality[1:10, ], aes(x = reorder(Event, -Fatalities),
                                             fill = Event)) + 
        geom_bar(aes(y = Fatalities),
                 stat = "identity") +
        labs(x = "", 
             y = "Number of Fatalities",
             fill = "Event") +
        theme(legend.position = 'none',
              axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Number of Fatalities by Event") +
        scale_y_continuous(breaks = seq(0, 6000, 1000))


# Injuries:
g2 <- ggplot(data = df.injury[1:10, ], aes(x = reorder(Event, -Injuries),
                                           fill = Event)) + 
        geom_bar(aes(y = Injuries),
                 stat = "identity") +
        labs(x = "", 
             y = "Number of Injuries",
             fill = "Event") +
        theme(legend.position = 'none',
              axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Number of Injuries by Event") +
        scale_y_continuous(breaks = seq(0, 90000, 10000))


grid.arrange(grobs = list(g1, g2), ncol = 2)

4.2 Events with greatest economic consequences

For the analysis of the events with greatest economic consequences we will use the variables “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, “CROPDMGEXP” and “EVTYPE”.

According to the provided documentation, “PROPDMGEXP” contains the exponent values for “PROPDMG” (property damage). This means that to obtain the value of the property damage in US dollars, we have to multiply “PROPDMG” by 10 raised to the power stored in the “PROPDMGEXP” variable. By analogy, “CROPDMGEXP” contains the exponent values for “CROPDMG” (crop damage). In the variables “PROPDMGEXP” and “CROPDMGEXP” the powers of 10 are coded both as letters and as digits. B or b stands for billion, M or m for million, K or k for thousand, and H or h for hundred. There are also digits from 0 to 8, which directly give the power of ten (10^digit). For example, PROPDMG = 25 with PROPDMGEXP = “K” corresponds to 25 × 10^3 = $25,000 of property damage. The symbols “-”, “+” and “?” denote less than, greater than and low certainty, respectively; we will ignore them.

We start with the analysis of property damage. First we list the sorted unique character values of the “PROPDMGEXP” variable:

sort(unique(storm.db$PROPDMGEXP))
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

We see that there are both lower-case and upper-case versions of the same letters, so we transform all letters to upper case:

storm.db$PROPDMGEXP <- toupper(storm.db$PROPDMGEXP) 

sort(unique(storm.db$PROPDMGEXP))
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "H" "K" "M"

Next we modify the “PROPDMGEXP” variable (of atomic type character) so that it contains the respective power of 10, still represented as a string:

storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "H"] <- 10^2
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "K"] <- 10^3
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "M"] <- 10^6
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "B"] <- 10^9
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "0"] <- 1
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "1"] <- 10
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "2"] <- 10^2
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "3"] <- 10^3
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "4"] <- 10^4
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "5"] <- 10^5
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "6"] <- 10^6
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "7"] <- 10^7
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "8"] <- 10^8

We set to 0 the symbols that we do not use:

storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "+"] <- 0
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "-"] <- 0
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "?"] <- 0
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == ""] <- 0

We transform “PROPDMGEXP” into a numeric variable and then we create a new variable called “PROPDMGCOST”, which we add to the storm.db data table:

storm.db$PROPDMGEXP <- as.numeric(storm.db$PROPDMGEXP)

storm.db$PROPDMGCOST <- storm.db$PROPDMG*storm.db$PROPDMGEXP
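The chain of per-code assignments above can also be written as a single named lookup vector, which keeps the whole code-to-multiplier mapping in one place. A sketch, equivalent to the transformations above when applied to the original (upper-cased) character codes; exp.map, mult and prop.cost are illustrative names:

# One-place mapping from exponent codes to numeric multipliers:
exp.map <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9,
             "0" = 1,   "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
             "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8,
             "+" = 0,   "-" = 0,   "?" = 0)
mult <- exp.map[storm.db$PROPDMGEXP]
mult[is.na(mult)] <- 0  # codes not in the map (e.g. the empty string) count as 0
prop.cost <- storm.db$PROPDMG * mult  # identical to PROPDMGCOST above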

The “PROPDMGCOST” variable contains the value of the property damage costs in US dollars. Our next step is to generate the data frame df.property, which has two variables: “Event”, containing the weather event, and “Losses”, containing the total cost of property damage for that event. The data in df.property are sorted in descending order by the total losses due to property damage:

df.property <- aggregate(storm.db$PROPDMGCOST, by = list(storm.db$EVTYPE),
                         FUN = sum)
colnames(df.property) <- c("Event", "Losses")
df.property <- df.property[order(-df.property$Losses), ]
rownames(df.property) <- 1:nrow(df.property)

We display the top ten events with respect to cost of property damage:

kable(df.property[1:10, ], row.names = TRUE, caption = "Table 10")
Table 10
Event Losses
1 FLOOD 144679596800
2 HEAVY RAIN 70000103590
3 TORNADO 56947892815
4 STORM SURGE 43323536000
5 FLASH FLOOD 17563971075
6 HAIL 15749092677
7 HURRICANE (TYPHOON) 15450340010
8 THUNDERSTORM WIND 10235772311
9 WILDFIRE 7767443500
10 TROPICAL STORM 7703890550

The weather events with the biggest impact on the US economy through property damage are flood, heavy rain, tornado and storm surge.

Below we perform transformations on the “CROPDMGEXP” variable which are analogous to those we performed on the “PROPDMGEXP” variable:

sort(unique(storm.db$CROPDMGEXP))
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"
storm.db$CROPDMGEXP <- toupper(storm.db$CROPDMGEXP) 

sort(unique(storm.db$CROPDMGEXP))
## [1] ""  "?" "0" "2" "B" "K" "M"
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "K"] <- 10^3
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "M"] <- 10^6
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "B"] <- 10^9
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "0"] <- 1
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "2"] <- 10^2

storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "?"] <- 0
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == ""] <- 0

storm.db$CROPDMGEXP <- as.numeric(storm.db$CROPDMGEXP)

storm.db$CROPDMGCOST <- storm.db$CROPDMG*storm.db$CROPDMGEXP

The newly generated “CROPDMGCOST” variable contains the value of the crop damage costs in US dollars. We then create the data frame df.crop, which has two variables: “Event”, containing the weather event, and “Losses”, containing the total cost of crop damage for that event. The data in df.crop are sorted in descending order by the total losses due to crop damage:

df.crop <- aggregate(storm.db$CROPDMGCOST, by = list(storm.db$EVTYPE),
                         FUN = sum)
colnames(df.crop) <- c("Event", "Losses")
df.crop <- df.crop[order(-df.crop$Losses), ]
rownames(df.crop) <- 1:nrow(df.crop)

We display the top ten events with respect to costs of crop damage:

kable(df.crop[1:10, ], row.names = TRUE, caption = "Table 11")
Table 11
Event Losses
1 DROUGHT 13972621780
2 FLOOD 5661968450
3 RIVER FLOOD 5029459000
4 ICE STORM 5022113500
5 HEAVY RAIN 3341275600
6 HAIL 3086455470
7 HURRICANE (TYPHOON) 2907420000
8 FLASH FLOOD 1462093700
9 EXTREME COLD/WIND CHILL 1313023000
10 THUNDERSTORM WIND 1175491780

The biggest losses due to crop damage are caused by drought, flood, river flood and ice storm.

Below we show bar plots of the cost of crop and property damage in billions of US dollars, grouped by type of weather event:

# Crop damage:
g1 <- ggplot(data = df.crop[1:10, ], aes(x = reorder(Event, -Losses), 
                                         fill = Event)) + 
        geom_bar(aes(y = Losses/10^9),
                 stat = "identity") +
        labs(x = "", 
             y = "Billions US$",
             fill = "Event") +
        theme(legend.position = 'none',
              axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Crop Damage in Billions of US$") +
        scale_y_continuous(breaks = seq(0, 15, 2))


# Property damage:

g2 <- ggplot(data = df.property[1:10, ], aes(x = reorder(Event, -Losses),
                                             fill = Event)) + 
        geom_bar(aes(y = Losses/10^9),
                 stat = "identity") +
        labs(x = "", 
             y = "Billions US$",
             fill = "Event") +
        theme(legend.position = 'none',
              axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Property Damage in Billions of US$") +
        scale_y_continuous(breaks = seq(0, 150, 20))

grid.arrange(grobs = list(g1, g2), ncol = 2)

From the figures above we see that the biggest losses due to crop damage are caused by drought, followed by flood. The biggest losses due to property damage are caused by flood, heavy rain and tornado.

Next we calculate the total cost of both property and crop damage. We store the result in a new variable called “DMGTOTAL” and create the total.dmg data frame, which contains the total losses grouped by event type and sorted in descending order:

storm.db$DMGTOTAL <- storm.db$PROPDMGCOST + storm.db$CROPDMGCOST

total.dmg <- aggregate(storm.db$DMGTOTAL, by = list(storm.db$EVTYPE),
                         FUN = sum)
colnames(total.dmg) <- c("Event", "Losses")
total.dmg <- total.dmg[order(-total.dmg$Losses), ]
rownames(total.dmg) <- 1:nrow(total.dmg)

Finally we show a bar plot of the cost of total damages in billions of US dollars:

ggplot(data = total.dmg[1:10, ], aes(x = reorder(Event, -Losses), fill = Event)) + 
geom_bar(aes(y = Losses/10^9),
         stat = "identity") +
labs(x = "", 
     y = "Billions US$",
     fill = "Event") +
theme(legend.position = 'none',
      axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Total  Damage in Billions of US$") +
scale_y_continuous(breaks = seq(0, 150, 20))

We observe that flood, heavy rain and tornado are the three main severe weather events causing the biggest economic losses due to crop and property damage.