In the following analysis we explore the data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. It records the weather events in the territory of USA from 1950 to 2011. The purpose is to address 2 main questions: which types of weather events cause the greatest damages to the population (1) and to the economic resources (2) across the United States.
The analysis shows that TORNADO is the most harmful type of event for population health, as it caused about 65% of the total number of injuries related to weather events. Concerning the economic damages FLOOD has the greatest impact, but HURRICANE/TYPHOON and TORNADO play an important role too.
The raw data are in the form of a comma-separated-value (csv) file, compressed via the bzip2 algorithm, from the following web site:
[https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2]
We download the zip file and extract the file named repdata-data-StormData.csv in the working directory. Then we read in the raw data, check the dimensions of the dataframe and determine the class of its columns:
library(R.utils)
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("repdata-data-StormData.csv.bz2")) {
download.file(url, destfile = "repdata-data-StormData.csv.bz2")
bunzip2("repdata-data-StormData.csv.bz2", destname = "repdata-data-StormData.csv", overwrite = FALSE, remove = FALSE)
}
# read the csv file
if (!"storm" %in% ls()){
storm <- read.csv("repdata-data-StormData.csv")
}
# check the dimensions and display the structure of data
dim(storm)
## [1] 902297 37
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
In the following analysis our purpose is to answer 2 main questions, both referred to the United States territory.
The first question (Qst1) is: which types of weather events are most harmful to population health?
The second question (Qst2) is: which types of weather events cause the greatest economic damages?
Due to the very large dimension of the dataframe, we choose to filter the rows including only the events that are relevant to our 2 questions.
In particular we take the rows in which at least one of these 4 variables is greater than 0:
number of fatalities FATALITIES, number of injuries INJURIES, damages to properties PROPDMG, damages to crop CROPDMG.
qstdf <- storm[storm$FATALITIES > 0 | storm$INJURIES > 0 | storm$PROPDMG > 0 | storm$CROPDMG > 0,]
dim(qstdf)
## [1] 254633 37
The 2 variables PROPDMGEXP and CROPDMGEXP express the multiplier of damages to properties and crop respectively. The letters correspond to the following values:
| Letter | Multiplier |
|---|---|
| “” | 1 |
| H or h | 100 |
| K or k | 1000 |
| M or m | 1 million |
| B or b | 1 billion |
there are other multiplier values that are not consistent with the table above:
unique(qstdf$PROPDMGEXP)
## [1] K M B m + 0 5 6 4 h 2 7 3 H -
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
nc1 <- qstdf$PROPDMGEXP %in% c(0:9,"-","+","?")
unique(qstdf$CROPDMGEXP)
## [1] M K m B ? 0 k
## Levels: ? 0 2 B k K m M
nc2 <- qstdf$CROPDMGEXP %in% c(0:9,"-","+","?")
# share of inconsistent multiplier values
nc <- (sum(nc1) + sum(nc2)) / dim(qstdf)[1]
print(nc)
## [1] 0.001056422
but they are relatively few, 0.1056422% of the total (about 1 out of 1000), thus we choose to remove the corresponding rows from the dataframe qstdf:
qstdf <- qstdf[!nc1 & !nc2,]
Now we do some data processing operations to obtain a new variable ECONDMG expressing the total damages (in US dollars) to properties and crop caused by an event.
MPROP <- rep(1.0, dim(qstdf)[1])
MCROP <- rep(1.0, dim(qstdf)[1])
MPROP[qstdf$PROPDMGEXP %in% c("H","h")] <- 1.0 * 1e2
MPROP[qstdf$PROPDMGEXP %in% c("K","k")] <- 1.0 * 1e3
MPROP[qstdf$PROPDMGEXP %in% c("M","m")] <- 1.0 * 1e6
MPROP[qstdf$PROPDMGEXP %in% c("B","b")] <- 1.0 * 1e9
MCROP[qstdf$CROPDMGEXP %in% c("H","h")] <- 1.0 * 1e2
MCROP[qstdf$CROPDMGEXP %in% c("K","k")] <- 1.0 * 1e3
MCROP[qstdf$CROPDMGEXP %in% c("M","m")] <- 1.0 * 1e6
MCROP[qstdf$CROPDMGEXP %in% c("B","b")] <- 1.0 * 1e9
ECONDMG <- qstdf$PROPDMG * MPROP + qstdf$CROPDMG * MCROP
In the following steps of our analysis we use the R packages dplyr (for data processing) and ggplot2 (for data plotting).
library(dplyr)
library(ggplot2)
We select the columns of the dataframe qstdf that are relevant to our purpose, we append the new column ECONDMG, then we compute the sums of the numeric variables FATALITIES, INJURIES and ECONDMG.
qstdf <- select(qstdf, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES)
qstdf <- bind_cols(qstdf, as.data.frame(ECONDMG))
TOTFATAL <- sum(qstdf$FATALITIES)
TOTINJUR <- sum(qstdf$INJURIES)
TOTECONDMG <- sum(qstdf$ECONDMG)
We do some further processing in order to make more uniform the values in the column EVTYPE containing the words TORNADO, FLOOD or HEAT:
qstdf$EVTYPE <- as.character(qstdf$EVTYPE)
qstdf$EVTYPE <- toupper(qstdf$EVTYPE)
qstdf$EVTYPE[grepl("TORNADO", qstdf$EVTYPE)] <- "TORNADO"
qstdf$EVTYPE[grepl("FLOOD", qstdf$EVTYPE)] <- "FLOOD"
qstdf$EVTYPE[grepl("HEAT", qstdf$EVTYPE)] <- "HEAT"
qstdf$EVTYPE <- as.factor(qstdf$EVTYPE)
Finally we group data into rows with the same value of EVTYPE (type of weather event) and calculate the sums of the damages within each group.
grpdf <- group_by(qstdf, EVTYPE)
grpdf <- summarise(grpdf, FATAL = sum(FATALITIES), INJUR = sum(INJURIES), ECON = sum(ECONDMG))
grpdf <- mutate(grpdf, FATALPC = FATAL/TOTFATAL, INJURPC = INJUR/TOTINJUR, ECONPC = ECON/TOTECONDMG)
In order to determine which are the most harmful weather events to population health (Qst1) we look for those events with a percentage of fatalities FATALPC or injuries INJURPC above 3% of the total:
res1 <- grpdf[(grpdf$FATALPC > 0.03) | (grpdf$INJURPC > 0.03), -c(4,7)]
print(res1)
## Source: local data frame [5 x 5]
##
## EVTYPE FATAL INJUR FATALPC INJURPC
## 1 FLOOD 1525 8604 0.10075983 0.06124977
## 2 HEAT 3138 9224 0.20733399 0.06566340
## 3 LIGHTNING 816 5230 0.05391477 0.03723109
## 4 TORNADO 5658 91382 0.37383548 0.65052608
## 5 TSTM WIND 504 6957 0.03330030 0.04952518
The following plots show the number of fatalities (Plot1) and injuries (Plot2) for each event.
qplot(EVTYPE, FATAL, data = res1, fill = EVTYPE, geom = "bar", stat = "identity") + labs(title = "Plot1 - number of Fatalities", x = "type of event", y = "# of fatalities") + guides(fill = "none")
qplot(EVTYPE, INJUR, data = res1, fill = EVTYPE, geom = "bar", stat = "identity") + labs(title = "Plot2 - number of Injuries", x = "type of event", y = "# of injuries") + guides(fill = "none")
The analysis shows that TORNADO is the most harmful among the severe weather events (it caused about 65% of total injuries and 37% of total fatalities), followed by HEAT and FLOOD respectively at 2nd and 3rd position.
In order to answer Qst2 we focus on the weather events that had an impact (ECONPC) greater than 3% of total economic damages across the United States.
res2 <- grpdf[grpdf$ECONPC > 0.03, c(1,4,7)]
print(res2)
## Source: local data frame [7 x 3]
##
## EVTYPE ECON ECONPC
## 1 DROUGHT 15018672000 0.03152892
## 2 FLOOD 179908359644 0.37768422
## 3 HAIL 18733216730 0.03932691
## 4 HURRICANE 14610229010 0.03067147
## 5 HURRICANE/TYPHOON 71913712800 0.15096950
## 6 STORM SURGE 43323541000 0.09094974
## 7 TORNADO 58960381093 0.12377638
Plot3 shows the economic damages (in millions of US dollars) caused by those types of events.
qplot(EVTYPE, ECON / 1e6, data = res2, fill = "EVTYPE", geom = "bar", stat = "identity") + labs(title = "Plot3 - economic damages (millions of US dollars)", x = "type of event", y = "economic damages (millions of US dollars)") + guides(fill = "none")
We observe that FLOOD has the greatest economic impact (above 37% of total damages), followed by HURRICANE/TYPHOON (18% of total damages) and TORNADO (12% of total damages).