Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The goal of the project is to identify the most harmful events with respect to population health as well as the events with the greatest economic consequences. Substantial preprocessing is required to clean the data and transform them into a format suitable for an accurate analysis. We identify possible misspellings by calculating the Damerau-Levenshtein distance between the unique event names in the provided dataset and the names of the official weather events given in the National Weather Service Storm Data Documentation.
The results of our analysis show that the most harmful events with respect to population health are tornado, thunderstorm wind, excessive heat and flood. We further find that the events with the greatest economic consequences are flood, heavy rain and tornado.
The computational times given in this report were measured on a laptop with an Intel i7-5500U CPU and 8 GB of DDR3-1600 RAM.
The data for this project come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded from here:
There is also some documentation of the database available. Here you will find how some of the variables are defined:
– National Weather Service Storm Data Documentation
– National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
First we load the packages that we will need:
library(data.table)
library(knitr)
library(lubridate)
library(stringdist)
library(ggplot2)
library(gridExtra)
library(R.utils)

We set the working directory:
setwd("D:/Coursera_Reproducible_Research/Project_Assignment_2")

Then we download the archive and we decompress it:
if(!file.exists("./Storm_Data.bz2")){
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "./Storm_Data.bz2")
bunzip2(filename = "./Storm_Data.bz2", destname = "./Storm_Data.csv", remove = FALSE)
}

In order to decompress the “Storm_Data.bz2” file we use the function bunzip2 from the R.utils package. We set remove = FALSE to keep the compressed input file “Storm_Data.bz2”. The default setting is remove = TRUE, which removes the input file after the decompressed output file is created.
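As an aside, the decompression step can be skipped entirely: base R connections can read bzip2 files directly, at the cost of reading speed. A minimal sketch (not used in this report):

# Reading the compressed file directly via a bzfile() connection (slower):
storm.raw <- read.csv(bzfile("./Storm_Data.bz2"))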
Finally we read the data into the storm.db data table:
t1 <- Sys.time()
storm.db <- fread(input = "./Storm_Data.csv", sep = "auto", header = TRUE)
## Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:08
t2 <- Sys.time()
t2 - t1

## Time difference of 7.507717 secs
We use the data.table package and its function fread() to load the storm data. The data.table class inherits from data.frame, but its routines are written in C, which makes it much faster, especially for big data sets such as the storm data. As the R documentation states, “data.table inherits from data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development.” Hence, all functions that accept data.frame objects also work on data.table objects, with the advantage that operations like subsetting, grouping and updating are much faster on data.table objects. The time needed to read the data is 7.508 s, which is much faster than the time read.csv() or read.table() would need.
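To substantiate that comparison, one could time both readers on the same file; a minimal sketch (timings vary by machine, and the read.csv() call will typically take minutes rather than seconds on a file of this size):

# Rough timing comparison of the base-R and data.table readers:
system.time(storm.base <- read.csv("./Storm_Data.csv"))
system.time(storm.fast <- fread("./Storm_Data.csv"))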
We start by taking a look at the structure of the storm.db data table. We see that there are 902297 observations of 37 variables:
dim(storm.db)

## [1] 902297 37
The variables are of type “character”, “numeric” and “logical”:
str(storm.db)

Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
$ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
$ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
$ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
$ STATE : chr "AL" "AL" "AL" "AL" ...
$ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ BGN_AZI : chr "" "" "" "" ...
$ BGN_LOCATI: chr "" "" "" "" ...
$ END_DATE : chr "" "" "" "" ...
$ END_TIME : chr "" "" "" "" ...
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
$ COUNTYENDN: logi NA NA NA NA NA NA ...
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ END_AZI : chr "" "" "" "" ...
$ END_LOCATI: chr "" "" "" "" ...
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
$ F : chr "3" "2" "2" "2" ...
$ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ PROPDMGEXP: chr "K" "K" "K" "K" ...
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
$ CROPDMGEXP: chr "" "" "" "" ...
$ WFO : chr "" "" "" "" ...
$ STATEOFFIC: chr "" "" "" "" ...
$ ZONENAMES : chr "" "" "" "" ...
$ LATITUDE : num 3040 3042 3340 3458 3412 ...
$ LONGITUDE : num 8812 8755 8742 8626 8642 ...
$ LATITUDE_E: num 3051 0 0 0 0 ...
$ LONGITUDE_: num 8806 0 0 0 0 ...
$ REMARKS : chr "" "" "" "" ...
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, ".internal.selfref")=<externalptr>
and their names are:
names(storm.db)

 [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
[6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
[11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
[16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
[21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
[26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
[31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
[36] "REMARKS" "REFNUM"
For our analysis we will need only the variables “EVTYPE”, “FATALITIES”, “INJURIES”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG” and “CROPDMGEXP”. Below we show the first six rows of the storm.db data table:
head(storm.db)

   STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
1: 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
2: 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
3: 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
4: 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
5: 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
6: 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
1: TORNADO 0 0
2: TORNADO 0 0
3: TORNADO 0 0
4: TORNADO 0 0
5: TORNADO 0 0
6: TORNADO 0 0
COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
1: NA 0 14.0 100 3 0 0
2: NA 0 2.0 150 2 0 0
3: NA 0 0.1 123 2 0 0
4: NA 0 0.0 100 2 0 0
5: NA 0 0.0 150 2 0 0
6: NA 0 1.5 177 2 0 0
INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
1: 15 25.0 K 0
2: 0 2.5 K 0
3: 2 25.0 K 0
4: 2 2.5 K 0
5: 2 2.5 K 0
6: 6 2.5 K 0
LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
1: 3040 8812 3051 8806 1
2: 3042 8755 0 0 2
3: 3340 8742 0 0 3
4: 3458 8626 0 0 4
5: 3412 8642 0 0 5
6: 3450 8748 0 0 6
We start by determining the number of unique weather events stored in the “EVTYPE” variable:
length(unique(storm.db$EVTYPE))

## [1] 985
We see that there are 985 unique events, but according to the provided documentation there should be only 48 event types. Looking more closely at the storm data set we notice many typos. The same event names also appear in both upper and lower case. We further notice records with event names composed of two or more existing event types, as well as many new weather events that are not part of the officially documented 48. Therefore, we try to preprocess the “EVTYPE” variable and reduce the number of unique events. First, we transform all event names to upper case:
storm.db$EVTYPE <- toupper(storm.db$EVTYPE)
length(unique(storm.db$EVTYPE))

## [1] 898
We observe that after converting all event names to upper case, the number of unique events is reduced by 87, from 985 to 898.
Next we display the first 10 weather events with the greatest number of occurrences (i.e. with highest frequency):
df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event", "Count")
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[1:10, ], row.names = TRUE, caption = "Table 1")

|   | Event | Count |
|---|---|---|
| 1 | HAIL | 288661 |
| 2 | TSTM WIND | 219942 |
| 3 | THUNDERSTORM WIND | 82564 |
| 4 | TORNADO | 60652 |
| 5 | FLASH FLOOD | 54277 |
| 6 | FLOOD | 25327 |
| 7 | THUNDERSTORM WINDS | 20843 |
| 8 | HIGH WIND | 20214 |
| 9 | LIGHTNING | 15754 |
| 10 | HEAVY SNOW | 15708 |
From the table above we see that the weather event with the highest frequency is “HAIL”, followed by “TSTM WIND”, “THUNDERSTORM WIND”, “TORNADO” and “FLASH FLOOD” in fifth place. We notice that the event “THUNDERSTORM WIND” is also recorded as “TSTM WIND” and “THUNDERSTORM WINDS”. Hence, we rename all occurrences of “TSTM WIND” and “THUNDERSTORM WINDS” to “THUNDERSTORM WIND”:
t1 <- Sys.time()
storm.db[storm.db$EVTYPE == "TSTM WIND" | storm.db$EVTYPE == "THUNDERSTORM WINDS", ]$EVTYPE <- "THUNDERSTORM WIND"
t2 <- Sys.time()
t2 - t1

## Time difference of 0.3711889 secs
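As an aside, since storm.db is a data.table, the same renaming can be done with the update-by-reference operator :=, which avoids the copy made by the base-R replacement above; a sketch:

# := modifies the EVTYPE column of storm.db in place, without copying:
storm.db[EVTYPE %in% c("TSTM WIND", "THUNDERSTORM WINDS"),
         EVTYPE := "THUNDERSTORM WIND"]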
After this transformation we display again the top 10 (by frequency) weather events:
df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event", "Count")
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[1:10, ], row.names = TRUE, caption = "Table 2")

|   | Event | Count |
|---|---|---|
| 1 | THUNDERSTORM WIND | 323349 |
| 2 | HAIL | 288661 |
| 3 | TORNADO | 60652 |
| 4 | FLASH FLOOD | 54277 |
| 5 | FLOOD | 25327 |
| 6 | HIGH WIND | 20214 |
| 7 | LIGHTNING | 15754 |
| 8 | HEAVY SNOW | 15708 |
| 9 | HEAVY RAIN | 11742 |
| 10 | WINTER STORM | 11433 |
Now “THUNDERSTORM WIND” takes first place, followed by “HAIL”, “TORNADO”, “FLASH FLOOD” and “FLOOD” in fifth place. Also, “HEAVY RAIN” and “WINTER STORM” now appear in the top ten.
We will try to further reduce the number of unique weather events by identifying possible misspellings. We calculate the Damerau–Levenshtein distance between the unique event names in the storm.db data table and the names of the official weather events given in the National Weather Service Storm Data Documentation. According to Wikipedia, the “Damerau–Levenshtein distance is a distance (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.” In R we compute the aforementioned string distance with the function stringdistmatrix() from the stringdist package.
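As a quick illustration of the metric and of stringdistmatrix() (the event names below are made-up examples, not values from the data set):

library(stringdist)
# A transposition ("TORNDAO" vs "TORNADO") costs 1 under Damerau-Levenshtein,
# while plain Levenshtein ("lv") counts it as two substitutions:
stringdist("TORNDAO", "TORNADO", method = "dl")  # 1
stringdist("TORNDAO", "TORNADO", method = "lv")  # 2
# stringdistmatrix() returns the full matrix of pairwise distances,
# rows indexing the first vector and columns the second:
stringdistmatrix(c("TORNDAO", "FLOODS"), c("TORNADO", "FLOOD"), method = "dl")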
Let us denote by storm.db.name a unique value of the “EVTYPE” variable from the storm.db data table, and by official.name any of the 48 official event names. We apply the following strategy for cleaning the “EVTYPE” variable:

Step 1: We calculate the Damerau–Levenshtein string distance between storm.db.name and official.name: dist(storm.db.name, official.name).

Step 2: We compute all pairwise string distances between official event names and collect the pairs for which \(1 \leq \mathrm{dist} \leq 3\) in the list list(official.self.dist); the sketch after this list illustrates why this step is needed.

Step 3: If \(0 < \mathrm{dist}(\text{storm.db.name}, \text{official.name}) \leq 3\) and the pair (storm.db.name, official.name) does not belong to list(official.self.dist), we replace storm.db.name with official.name.
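To see why the exclusion in Steps 2 and 3 is needed, note that some official event names are themselves within string distance 3 of one another, so a naive application of Step 3 could rename records carrying a perfectly valid official name into a different official name. A sketch (distances verified by hand):

stringdist("HEAT", "HAIL", method = "dl")   # 3: both are official names
stringdist("HEAT", "SLEET", method = "dl")  # 3: likewise
# Without Step 2, the rule in Step 3 could rename valid "HEAT" records.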
We start by creating the character vector event.type, which contains all unique weather event names from the storm.db data table:
event.type <- unique(storm.db$EVTYPE)

We see that there are 896 unique elements of the vector event.type:
length(event.type)

## [1] 896
Then we construct the character vector events which is composed of all 48 official weather events provided in the National Weather Service documentation:
events <- c("ASTRONOMICAL LOW TIDE", "AVALANCHE", "BLIZZARD", "COASTAL FLOOD",
"COLD/WIND CHILL", "DEBRIS FLOW", "DENSE FOG", "DENSE SMOKE",
"DROUGHT", "DUST DEVIL", "DUST STORM", "EXCESSIVE HEAT",
"EXTREME COLD/WIND CHILL", "FLASH FLOOD", "FLOOD", "FROST/FREEZE",
"FUNNEL CLOUD", "FREEZING FOG", "HAIL", "HEAT",
"HEAVY RAIN",
"HEAVY SNOW", "HIGH SURF", "HIGH WIND", "HURRICANE (TYPHOON)",
"ICE STORM", "LAKE-EFFECT SNOW", "LAKESHORE FLOOD", "LIGHTNING",
"MARINE HAIL", "MARINE HIGH WIND", "MARINE STRONG WIND", "MARINE THUNDERSTORM WIND",
"RIP CURRENT", "SEICHE", "SLEET", "STORM SURGE/TIDE", "STRONG WIND",
"THUNDERSTORM WIND", "TORNADO", "TROPICAL DEPRESSION", "TROPICAL STORM",
"TSUNAMI", "VOLCANIC ASH", "WATERSPOUT", "WILDFIRE", "WINTER STORM",
"WINTER WEATHER")Step 1:
Now we compute the pairwise Damerau-Levenshtein string distances between the elements of the two character vectors. We store the resulting matrix in the strdist object:
strdist <- stringdistmatrix(event.type, events, method = "dl")

Below we create the matrix matching.events, which contains the indices of the elements of the character vectors event.type and events for which the respective string distance is greater than or equal to 1 and less than or equal to 3:
match.events <- list()
res <- list()
cnt <- 0
for(i in 1:dim(strdist)[1]){
  for(j in 1:dim(strdist)[2]){
    if(strdist[i,j] > 0 & strdist[i,j] <= 3){
      cnt <- cnt + 1
      res[[cnt]] <- c(i, j)
    }
  }
}
# Bind the collected index pairs once, after the loops:
match.events <- do.call(cbind, res)
matching.events <- as.matrix(match.events)
matching.events <- t(matching.events)

The matrix matching.events has 101 rows and 2 columns. The first column stores the indices i of the elements of the character vector event.type, and the second column stores the corresponding indices j of the elements of the vector events. For each pair of indices (i, j) we have \(1 \leq \mathrm{dist}(\text{event.type}[i], \text{events}[j]) \leq 3\). Here are the first few rows of the matrix matching.events:
head(matching.events)

## [,1] [,2]
## [1,] 3 20
## [2,] 17 39
## [3,] 19 14
## [4,] 20 24
## [5,] 22 40
## [6,] 25 19
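As an aside, the same index pairs can be obtained without the explicit double loop: which() with arr.ind = TRUE returns the (row, column) positions of all matrix entries satisfying a condition (the rows may come out in a different order, since which() scans the matrix column by column):

matching.events.alt <- which(strdist > 0 & strdist <= 3, arr.ind = TRUE)
dim(matching.events.alt)  # expected: 101 2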
For convenience we transform the matrix matching.events into a data frame:
matching.events <- as.data.frame(matching.events)
colnames(matching.events) <- c("index.db.name", "index.correct.name")

The character vectors db.name and real.name contain the unique values of the “EVTYPE” variable and the official weather event names, respectively, for which the Damerau-Levenshtein string distance is in the interval [1, 3]:
db.name <- event.type[matching.events[ ,"index.db.name"]]
real.name <- events[matching.events[ ,"index.correct.name"]]

Step 2:
Next we compute the string distances between the elements of the character vector events (which contains the 48 official weather event names):
self.dist <- stringdistmatrix(events, events, method = "dl")

Then we generate the matrix self.match, which contains the indices i and j of the elements of the character vector events for which the string distance dist(events[i], events[j]) is greater than or equal to 1 and less than or equal to 3:
match.events <- list()
res <- list()
cnt <- 0
for(i in 1:dim(self.dist)[1]){
  for(j in 1:dim(self.dist)[2]){
    if(self.dist[i,j] >= 1 & self.dist[i,j] <= 3){
      cnt <- cnt + 1
      res[[cnt]] <- c(i, j)
    }
  }
}
match.events <- do.call(cbind, res)
self.match <- as.matrix(match.events)
self.match <- t(self.match)

The matrix self.match has 4 rows and 2 columns. Next we create the character vector self.pairs, whose elements are the pasted pairs of official event names with string distance between them in the interval [1, 3]:
self.pairs <- vector(mode = "character")
for(i in 1:dim(self.match)[1]){
  self.pairs <- c(self.pairs, paste(events[self.match[i,1]], events[self.match[i,2]]))
}
self.pairs

## [1] "HAIL HEAT" "HEAT HAIL" "HEAT SLEET" "SLEET HEAT"
length(self.pairs)

## [1] 4
Our next step is to create the character vector pairs, composed of the 101 pasted matching pairs (“EVTYPE” name, official name) whose string distance dist(“EVTYPE” name, official name) is in the interval [1, 3]:
pairs <- mapply(function(db, correct){
paste(db, correct, sep = " ")
}, db.name, real.name)
pairs <- as.vector(x = pairs, mode = "character")
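As an aside, the mapply() call above is equivalent to a single vectorized paste(), since paste() iterates over both character vectors in parallel:

pairs.alt <- paste(db.name, real.name)  # default sep = " ", as above
identical(pairs.alt, pairs)             # expected: TRUE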
Step 3:

The only thing left is to exclude from db.name the entries whose pasted pair appears among the official-name pairs in self.pairs (i.e., official names within string distance 1 to 3 of each other):
db.name.out.self <- db.name[!(pairs %in% self.pairs)]

Finally we replace the misspelled “EVTYPE” names with the respective correct official names:
t1 <- Sys.time()
storm.db[which(storm.db$EVTYPE %in% db.name.out.self)]$EVTYPE <-
  real.name[match(storm.db[which(storm.db$EVTYPE %in% db.name.out.self), ]$EVTYPE,
                  db.name.out.self)]
t2 <- Sys.time()
t2 - t1

## Time difference of 0.232378 secs
We report the system time needed for the described string transformation and we see that it is only 0.232 s.
We see that the number of unique event types in the “EVTYPE” variable decreased from 896 to 801:
length(unique(storm.db$EVTYPE))

## [1] 801
We show the top ten events (by frequency) after the above transformations:
df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event" , "Count")
df.event$Percentage <- round(df.event$Count/sum(df.event$Count)*100, 3)
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[1:10, ], row.names = TRUE, caption = "Table 3")

|   | Event | Count | Percentage |
|---|---|---|---|
| 1 | THUNDERSTORM WIND | 324474 | 35.961 |
| 2 | HAIL | 288742 | 32.001 |
| 3 | TORNADO | 60664 | 6.723 |
| 4 | FLASH FLOOD | 55996 | 6.206 |
| 5 | FLOOD | 25985 | 2.880 |
| 6 | HIGH WIND | 20248 | 2.244 |
| 7 | LIGHTNING | 15842 | 1.756 |
| 8 | HEAVY SNOW | 15713 | 1.741 |
| 9 | HEAVY RAIN | 11840 | 1.312 |
| 10 | WINTER STORM | 11463 | 1.270 |
We observe that there is no change in the top ten weather events compared to Table 2.
After closely examining the storm.db data set we notice further inconsistencies in the weather event names, which we clean manually:
# Cleaning names equal to "MARINE TSTM WIND":
storm.db[storm.db$EVTYPE == "MARINE TSTM WIND", ]$EVTYPE <- "MARINE THUNDERSTORM WIND"
# Cleaning names equal to "WILD/FOREST FIRE":
storm.db[storm.db$EVTYPE == "WILD/FOREST FIRE", ]$EVTYPE <- "WILDFIRE"
# Cleaning names equal to "UNSEASONABLY WARM":
storm.db[storm.db$EVTYPE == "UNSEASONABLY WARM", ]$EVTYPE <- "EXCESSIVE HEAT"
# Cleaning names equal to "EXCESSIVE SNOW":
storm.db[storm.db$EVTYPE == "EXCESSIVE SNOW", ]$EVTYPE <- "HEAVY SNOW"
# Cleaning names that contain the string "HURRICANE":
storm.db[grep("HURRICANE", storm.db$EVTYPE), ]$EVTYPE <- "HURRICANE (TYPHOON)"
# Cleaning names that contain the string "HIGH TEMPERATURE":
storm.db[grep("HIGH TEMPERATURE", storm.db$EVTYPE), ]$EVTYPE <- "HEAT"
# Cleaning names that contain the string "HAIL STORM":
storm.db[grep("HAIL STORM", storm.db$EVTYPE), ]$EVTYPE <- "HAIL"
# Cleaning names that contain the string "EXTREME COLD":
storm.db[grep("EXTREME COLD|RECORD COLD|COLD RECORD", storm.db$EVTYPE), ]$EVTYPE <- "EXTREME COLD/WIND CHILL"
# Cleaning names that contain the string "DROUGHT":
storm.db[grep("DROUGHT", storm.db$EVTYPE), ]$EVTYPE <- "DROUGHT"
# Cleaning names that contain the string "EXCESSIVE HEAT|RECORD HEAT|HEAT RECORD|EXTREME HEAT":
storm.db[grep("EXCESSIVE HEAT|RECORD HEAT|HEAT RECORD|EXTREME HEAT", storm.db$EVTYPE), ]$EVTYPE <- "EXCESSIVE HEAT"
# Cleaning names that contain the string "COASTAL FLOOD":
storm.db[grep("COASTAL FLOOD", storm.db$EVTYPE), ]$EVTYPE <- "COASTAL FLOOD"After the above operations, the unique event names are:
After the above operations, the number of unique event names drops to 766:

length(unique(storm.db$EVTYPE))

## [1] 766
More preprocessing could probably be done, but it requires considerable time and effort. The task becomes even more difficult because many new weather events are introduced in the storm data set, and these should be kept as they are since they have no analogue in the official documentation.
We show again the top ten severe weather events:
df.event <- as.data.frame(table(as.factor(storm.db$EVTYPE)))
colnames(df.event) <- c("Event", "Count")
df.event$Percentage <- round(df.event$Count/sum(df.event$Count)*100, 3)
df.event <- df.event[order(-df.event$Count), ]
rownames(df.event) <- 1:nrow(df.event)
kable(df.event[1:10, ], row.names = TRUE, caption = "Table 4")

|   | Event | Count | Percentage |
|---|---|---|---|
| 1 | THUNDERSTORM WIND | 324474 | 35.961 |
| 2 | HAIL | 288743 | 32.001 |
| 3 | TORNADO | 60664 | 6.723 |
| 4 | FLASH FLOOD | 55996 | 6.206 |
| 5 | FLOOD | 25985 | 2.880 |
| 6 | HIGH WIND | 20248 | 2.244 |
| 7 | LIGHTNING | 15842 | 1.756 |
| 8 | HEAVY SNOW | 15738 | 1.744 |
| 9 | MARINE THUNDERSTORM WIND | 11987 | 1.328 |
| 10 | HEAVY RAIN | 11840 | 1.312 |
We see that compared to Table 3, “WINTER STORM” drops out, whereas “MARINE THUNDERSTORM WIND” makes it into the top ten.
For the analysis of the most harmful events with respect to population health we will use the variables “FATALITIES”, “INJURIES” and “EVTYPE”.
Below we create the data frame df.fatality which contains the total number of fatalities grouped by event type (the variable “EVTYPE”) and sorted in descending order:
# Fatalities:
df.fatality <- aggregate(storm.db$FATALITIES, by = list(storm.db$EVTYPE), FUN = sum)
colnames(df.fatality) <- c("Event", "Fatalities")
df.fatality$Percentage <- round(df.fatality$Fatalities/sum(df.fatality$Fatalities)*100, 3)
df.fatality <- df.fatality[order(-df.fatality$Fatalities), ]
rownames(df.fatality) <- 1:nrow(df.fatality)

Then we display the top 10 severe weather events in terms of number of fatalities:
kable(df.fatality[1:10, ], row.names = TRUE, caption = "Table 5")

|   | Event | Fatalities | Percentage |
|---|---|---|---|
| 1 | TORNADO | 5633 | 37.194 |
| 2 | EXCESSIVE HEAT | 2029 | 13.397 |
| 3 | FLASH FLOOD | 1016 | 6.708 |
| 4 | HEAT | 951 | 6.279 |
| 5 | THUNDERSTORM WIND | 928 | 6.127 |
| 6 | LIGHTNING | 854 | 5.639 |
| 7 | FLOOD | 534 | 3.526 |
| 8 | RIP CURRENT | 368 | 2.430 |
| 9 | EXTREME COLD/WIND CHILL | 288 | 1.902 |
| 10 | HIGH WIND | 249 | 1.644 |
From the table we see that the top 3 events with the greatest number of fatalities are tornado, excessive heat and flash flood.
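As an aside, since storm.db is a data.table, the same aggregation can be written more compactly (and computed faster) with its grouping syntax; a sketch:

# Group by EVTYPE, sum fatalities, then sort in descending order:
df.fatality.dt <- storm.db[, .(Fatalities = sum(FATALITIES)), by = EVTYPE][order(-Fatalities)]
head(df.fatality.dt)  # same totals as df.fatality, as a data.table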
Next we create the data frame df.injury which contains the total number of injuries grouped by event type (the variable “EVTYPE”) and sorted in descending order:
# Injuries:
df.injury <- aggregate(storm.db$INJURIES, by = list(storm.db$EVTYPE), FUN = sum)
colnames(df.injury) <- c("Event", "Injuries")
df.injury$Percentage <- round(df.injury$Injuries/sum(df.injury$Injuries)*100, 3)
df.injury <- df.injury[order(-df.injury$Injuries), ]
rownames(df.injury) <- 1:nrow(df.injury)

We display the top 10 severe weather events in terms of number of injured people:
kable(df.injury[1:10, ], row.names = TRUE, caption = "Table 6")

|   | Event | Injuries | Percentage |
|---|---|---|---|
| 1 | TORNADO | 91346 | 65.002 |
| 2 | THUNDERSTORM WIND | 9679 | 6.888 |
| 3 | FLOOD | 7523 | 5.353 |
| 4 | EXCESSIVE HEAT | 6747 | 4.801 |
| 5 | LIGHTNING | 5278 | 3.756 |
| 6 | HEAT | 2277 | 1.620 |
| 7 | FLASH FLOOD | 2080 | 1.480 |
| 8 | ICE STORM | 1985 | 1.413 |
| 9 | HEAVY RAIN | 1534 | 1.092 |
| 10 | WILDFIRE | 1473 | 1.048 |
From the table we see that the top 3 events with the greatest number of injuries are tornado, thunderstorm wind and flood.
In the next table we show the ten most harmful events with respect to number of fatalities, together with the respective number of injuries:
# Both fatalities and injuries:
df.health <- merge(df.fatality, df.injury, by = "Event")
df.health <- df.health[order(-df.health$Fatalities), ]
rownames(df.health) <- 1:nrow(df.health)
colnames(df.health) <- c("Event", "Fatalities", "Percent.Fatalities", "Injuries", "Percent.Injuries")
kable(df.health[1:10, ], row.names = TRUE, caption = "Table 7")

|   | Event | Fatalities | Percent.Fatalities | Injuries | Percent.Injuries |
|---|---|---|---|---|---|
| 1 | TORNADO | 5633 | 37.194 | 91346 | 65.002 |
| 2 | EXCESSIVE HEAT | 2029 | 13.397 | 6747 | 4.801 |
| 3 | FLASH FLOOD | 1016 | 6.708 | 2080 | 1.480 |
| 4 | HEAT | 951 | 6.279 | 2277 | 1.620 |
| 5 | THUNDERSTORM WIND | 928 | 6.127 | 9679 | 6.888 |
| 6 | LIGHTNING | 854 | 5.639 | 5278 | 3.756 |
| 7 | FLOOD | 534 | 3.526 | 7523 | 5.353 |
| 8 | RIP CURRENT | 368 | 2.430 | 232 | 0.165 |
| 9 | EXTREME COLD/WIND CHILL | 288 | 1.902 | 255 | 0.181 |
| 10 | HIGH WIND | 249 | 1.644 | 1137 | 0.809 |
The next table provides the event types sorted in descending order by frequency, together with the respective number of fatalities and injuries:
df.both <- merge(df.event, df.health, by = "Event")
df.both <- df.both[order(-df.both$Count), ]
rownames(df.both) <- 1:nrow(df.both)
colnames(df.both) <- c("Event", "Count.Event", "Percentage", "Fatalities", "Percent.Fatalities", "Injuries", "Percent.Injuries")
kable(df.both[1:15, ], row.names = TRUE, caption = "Table 8")

|   | Event | Count.Event | Percentage | Fatalities | Percent.Fatalities | Injuries | Percent.Injuries |
|---|---|---|---|---|---|---|---|
| 1 | THUNDERSTORM WIND | 324474 | 35.961 | 928 | 6.127 | 9679 | 6.888 |
| 2 | HAIL | 288743 | 32.001 | 16 | 0.106 | 1368 | 0.973 |
| 3 | TORNADO | 60664 | 6.723 | 5633 | 37.194 | 91346 | 65.002 |
| 4 | FLASH FLOOD | 55996 | 6.206 | 1016 | 6.708 | 2080 | 1.480 |
| 5 | FLOOD | 25985 | 2.880 | 534 | 3.526 | 7523 | 5.353 |
| 6 | HIGH WIND | 20248 | 2.244 | 249 | 1.644 | 1137 | 0.809 |
| 7 | LIGHTNING | 15842 | 1.756 | 854 | 5.639 | 5278 | 3.756 |
| 8 | HEAVY SNOW | 15738 | 1.744 | 127 | 0.839 | 1023 | 0.728 |
| 9 | MARINE THUNDERSTORM WIND | 11987 | 1.328 | 19 | 0.125 | 34 | 0.024 |
| 10 | HEAVY RAIN | 11840 | 1.312 | 167 | 1.103 | 1534 | 1.092 |
| 11 | WINTER STORM | 11463 | 1.270 | 206 | 1.360 | 1321 | 0.940 |
| 12 | WINTER WEATHER | 7045 | 0.781 | 33 | 0.218 | 398 | 0.283 |
| 13 | FUNNEL CLOUD | 6849 | 0.759 | 4 | 0.026 | 153 | 0.109 |
| 14 | WILDFIRE | 4222 | 0.468 | 97 | 0.640 | 1473 | 1.048 |
| 15 | WATERSPOUT | 3860 | 0.428 | 4 | 0.026 | 32 | 0.023 |
From the table we see that the most frequent event, thunderstorm wind, accounts for only 6.1% of all fatalities and 6.9% of all injuries, whereas the third most frequent event, tornado, accounts for 37.2% of all fatalities and 65% of all injuries.
The last table concerning public health shows the total number of health-related casualties (fatalities plus injuries) grouped by weather event and sorted in descending order:
df.total <- df.both
df.total$Total.Count.Accidents <- df.total$Fatalities + df.total$Injuries
df.total <- df.total[order(-df.total$Total.Count.Accidents), ]
rownames(df.total) <- 1:nrow(df.total)
drop.names <- match(c("Percentage", "Percent.Fatalities", "Percent.Injuries"), names(df.total))
df.total <- df.total[, -drop.names]
kable(df.total[1:15, ], row.names = TRUE, caption = "Table 9")

|   | Event | Count.Event | Fatalities | Injuries | Total.Count.Accidents |
|---|---|---|---|---|---|
| 1 | TORNADO | 60664 | 5633 | 91346 | 96979 |
| 2 | THUNDERSTORM WIND | 324474 | 928 | 9679 | 10607 |
| 3 | EXCESSIVE HEAT | 1912 | 2029 | 6747 | 8776 |
| 4 | FLOOD | 25985 | 534 | 7523 | 8057 |
| 5 | LIGHTNING | 15842 | 854 | 5278 | 6132 |
| 6 | HEAT | 923 | 951 | 2277 | 3228 |
| 7 | FLASH FLOOD | 55996 | 1016 | 2080 | 3096 |
| 8 | ICE STORM | 2013 | 89 | 1985 | 2074 |
| 9 | HEAVY RAIN | 11840 | 167 | 1534 | 1701 |
| 10 | WILDFIRE | 4222 | 97 | 1473 | 1570 |
| 11 | WINTER STORM | 11463 | 206 | 1321 | 1527 |
| 12 | HIGH WIND | 20248 | 249 | 1137 | 1386 |
| 13 | HAIL | 288743 | 16 | 1368 | 1384 |
| 14 | HEAVY SNOW | 15738 | 127 | 1023 | 1150 |
| 15 | BLIZZARD | 2719 | 101 | 805 | 906 |
We see that the highest number of fatalities and injuries is due to tornadoes, followed by thunderstorm winds and excessive heat.
Next we show bar plots of the number of fatalities and the number of injuries by event type, using the ggplot2 package:
# Fatalities:
g1 <- ggplot(data = df.fatality[1:10, ], aes(x = reorder(Event, -Fatalities), fill = Event)) +
geom_bar(aes(y = Fatalities),
stat = "identity") +
labs(x = "",
y = "Number of Fatalities",
fill = "Event") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Number of Fatalities by Event") +
scale_y_continuous(breaks = seq(0, 6000, 1000))
# Injuries:
g2 <- ggplot(data = df.injury[1:10, ], aes(x = reorder(Event, -Injuries), fill = Event)) +
geom_bar(aes(y = Injuries),
stat = "identity") +
labs(x = "",
y = "Number of Injuries",
fill = "Event") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Number of Injuries by Event") +
scale_y_continuous(breaks = seq(0, 90000, 10000))
grid.arrange(grobs = list(g1, g2), ncol = 2)

For the analysis of the events with the greatest economic consequences we will use the variables “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, “CROPDMGEXP” and “EVTYPE”.
According to the provided documentation, “PROPDMGEXP” contains the exponent values for “PROPDMG” (property damage). This means that to obtain the value of the property damage in US dollars, we have to multiply “PROPDMG” by 10 raised to the power stored in the “PROPDMGEXP” variable. By analogy, “CROPDMGEXP” contains the exponent values for “CROPDMG” (crop damage). In the variables “PROPDMGEXP” and “CROPDMGEXP” the respective powers of 10 are coded both as letters and as digits. B or b stands for billion, M or m for million, K or k for thousand, and H or h for hundred. There are also digits, which directly represent the power of ten (10^digit). The symbols “-”, “+” and “?” refer to less than, greater than and low certainty, respectively, and we will ignore them.
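For example, a record with PROPDMG = 25 and PROPDMGEXP = “K” represents 25 × 10^3 = 25,000 US dollars of property damage. The per-symbol recoding performed below can also be expressed as a single lookup table; a sketch, assuming the exponent symbols have already been converted to upper case:

# Named vector mapping each exponent symbol to its multiplier:
exp.map <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9,
             "0" = 1e0, "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
             "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8)
mult <- exp.map[storm.db$PROPDMGEXP]  # look up each symbol by name
mult[is.na(mult)] <- 0                # "", "-", "+" and "?" map to 0
# Equivalent to the PROPDMGCOST computed step by step below:
# storm.db$PROPDMGCOST <- storm.db$PROPDMG * mult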
We start with the analysis of property damage. First we list the sorted unique character values of the “PROPDMGEXP” variable:
sort(unique(storm.db$PROPDMGEXP))

## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
We see that there are both lower-case and upper-case versions of the same letters, so we transform all letters to upper case:
storm.db$PROPDMGEXP <- toupper(storm.db$PROPDMGEXP)
sort(unique(storm.db$PROPDMGEXP))

## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "H" "K" "M"
Next we modify “PROPDMGEXP” (which is of type character) so that it contains the respective power of 10, still represented as a string:
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "H"] <- 10^2
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "K"] <- 10^3
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "M"] <- 10^6
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "B"] <- 10^9
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "0"] <- 1
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "1"] <- 10
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "2"] <- 10^2
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "3"] <- 10^3
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "4"] <- 10^4
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "5"] <- 10^5
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "6"] <- 10^6
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "7"] <- 10^7
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "8"] <- 10^8We set to 0 the symbols that we do not use:
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "+"] <- 0
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "-"] <- 0
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == "?"] <- 0
storm.db$PROPDMGEXP[storm.db$PROPDMGEXP == ""] <- 0We transform “PROPDMGEXP” into a numeric variable and then we create a new variable called “PROPDMGCOST”, which we add to the storm.db data table:
storm.db$PROPDMGEXP <- as.numeric(storm.db$PROPDMGEXP)
storm.db$PROPDMGCOST <- storm.db$PROPDMG*storm.db$PROPDMGEXP

The “PROPDMGCOST” variable contains the value of property damage costs in US dollars. Our next step is to generate the data frame df.property, which has two variables: “Event”, containing a weather event, and “Losses”, containing the total cost of property damage for that event. The data in df.property are sorted in descending order by the total losses due to property damage for a given event:
df.property <- aggregate(storm.db$PROPDMGCOST, by = list(storm.db$EVTYPE),
FUN = sum)
colnames(df.property) <- c("Event", "Losses")
df.property <- df.property[order(-df.property$Losses), ]
rownames(df.property) <- 1:nrow(df.property)

We display the top ten events with respect to cost of property damage:
kable(df.property[1:10, ], row.names = TRUE, caption = "Table 10")

|   | Event | Losses |
|---|---|---|
| 1 | FLOOD | 144679596800 |
| 2 | HEAVY RAIN | 70000103590 |
| 3 | TORNADO | 56947892815 |
| 4 | STORM SURGE | 43323536000 |
| 5 | FLASH FLOOD | 17563971075 |
| 6 | HAIL | 15749092677 |
| 7 | HURRICANE (TYPHOON) | 15450340010 |
| 8 | THUNDERSTORM WIND | 10235772311 |
| 9 | WILDFIRE | 7767443500 |
| 10 | TROPICAL STORM | 7703890550 |
The weather events with the biggest impact on the US economy through property damage are flood, heavy rain, tornado and storm surge.
Below we perform transformations on the “CROPDMGEXP” variable which are analogous to those we performed on the “PROPDMGEXP” variable:
sort(unique(storm.db$CROPDMGEXP))

## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
storm.db$CROPDMGEXP <- toupper(storm.db$CROPDMGEXP)
sort(unique(storm.db$CROPDMGEXP))

## [1] "" "?" "0" "2" "B" "K" "M"
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "K"] <- 10^3
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "M"] <- 10^6
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "B"] <- 10^9
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "0"] <- 1
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "2"] <- 10^2
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == "?"] <- 0
storm.db$CROPDMGEXP[storm.db$CROPDMGEXP == ""] <- 0
storm.db$CROPDMGEXP <- as.numeric(storm.db$CROPDMGEXP)
storm.db$CROPDMGCOST <- storm.db$CROPDMG*storm.db$CROPDMGEXP

The newly generated “CROPDMGCOST” variable contains the value of crop damage costs in US dollars. Then we build the data frame df.crop, which has two variables: “Event”, containing a weather event, and “Losses”, containing the total cost of crop damage for that event. The data in df.crop are sorted in descending order by the total losses due to crop damage for a given event:
df.crop <- aggregate(storm.db$CROPDMGCOST, by = list(storm.db$EVTYPE),
FUN = sum)
colnames(df.crop) <- c("Event", "Losses")
df.crop <- df.crop[order(-df.crop$Losses), ]
rownames(df.crop) <- 1:nrow(df.crop)

We display the top ten events with respect to cost of crop damage:
kable(df.crop[1:10, ], row.names = TRUE, caption = "Table 11")

|   | Event | Losses |
|---|---|---|
| 1 | DROUGHT | 13972621780 |
| 2 | FLOOD | 5661968450 |
| 3 | RIVER FLOOD | 5029459000 |
| 4 | ICE STORM | 5022113500 |
| 5 | HEAVY RAIN | 3341275600 |
| 6 | HAIL | 3086455470 |
| 7 | HURRICANE (TYPHOON) | 2907420000 |
| 8 | FLASH FLOOD | 1462093700 |
| 9 | EXTREME COLD/WIND CHILL | 1313023000 |
| 10 | THUNDERSTORM WIND | 1175491780 |
The biggest losses due to crop damage are caused by drought, flood, river flood and ice storm.
Below we show bar plots of the cost of crop and property damage in billions of US dollars, grouped by type of weather event:
# Crop damage:
g1 <- ggplot(data = df.crop[1:10, ], aes(x = reorder(Event, -Losses),
fill = Event)) +
geom_bar(aes(y = Losses/10^9),
stat = "identity") +
labs(x = "",
y = "Billions US$",
fill = "Event") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Crop Damage in Billions of US$") +
scale_y_continuous(breaks = seq(0, 15, 2))
# Property damage:
g2 <- ggplot(data = df.property[1:10, ], aes(x = reorder(Event, -Losses), fill = Event)) +
geom_bar(aes(y = Losses/10^9),
stat = "identity") +
labs(x = "",
y = "Billions US$",
fill = "Event") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Property Damage in Billions of US$") +
scale_y_continuous(breaks = seq(0, 150, 20))
grid.arrange(grobs = list(g1, g2), ncol = 2)

From the figures above we see that the biggest losses due to crop damage are caused by drought, followed by flood. The biggest losses due to property damage are caused by flood, heavy rain and tornado.
Next we calculate the total cost of both property and crop damage. We store the result in a new variable called “DMGTOTAL” and build the total.dmg data frame, which contains the total losses grouped by event type and sorted in descending order:
storm.db$DMGTOTAL <- storm.db$PROPDMGCOST + storm.db$CROPDMGCOST
total.dmg <- aggregate(storm.db$DMGTOTAL, by = list(storm.db$EVTYPE),
FUN = sum)
colnames(total.dmg) <- c("Event", "Losses")
total.dmg <- total.dmg[order(-total.dmg$Losses), ]
rownames(total.dmg) <- 1:nrow(total.dmg)

Finally we show a bar plot of the cost of total damage in billions of US dollars:
ggplot(data = total.dmg[1:10, ], aes(x = reorder(Event, -Losses), fill = Event)) +
geom_bar(aes(y = Losses/10^9),
stat = "identity") +
labs(x = "",
y = "Billions US$",
fill = "Event") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Total Damage in Billions of US$") +
scale_y_continuous(breaks = seq(0, 150, 20))

We observe that flood, heavy rain and tornado are the three main severe weather events causing the biggest economic losses due to crop and property damage.