This exploratory study involves analysing data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. More information about the data collection and data keeping nuances you can find here. Main goal of this short exploratory study is to determine types of weather events which are most harmful with respect to population health and economic consequences. The storm data were collected from 1950 to 2011. The dataset contains 902297 records with 37 variables, most of them are irrelevant for the analysis in question so relevant variable were selected (event type, number of fatalities, number of injuries, damage estimates and others). Then the data were aggregated by event type and total numbers of fatalities as well as total amount of estimated damage costs were calculated for each event type. The 10 most harmful event types in terms of population health and economic consequences were determined and illustrated using bar plots.
The dataset for analysis come in the form of a compressed comma-separated-value file. It is available via the following link: Storm Data [47Mb].
You can find the description of some of the variables here [MS Word file].
First of all you need to install R packages R.utils, ggplot2 and gridExtra
library(R.utils)
library(ggplot2)
library(gridExtra)
The next step is to set working directory and a name for the downloaded archive file. Those who use this code should download the data previously, place it into a directory and specify path to this directory. Then you will need to specify the name of the downloaded archive.
working_dir <- "F:\\CouRsera\\Data Science Specialization\\Reproducible Research\\Peer 2"
setwd(working_dir)
destfile <- "storm_data.bz2"
## URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
## download.file(URL,destfile )
dataset <- read.csv(destfile, strip.white = TRUE, na.strings = c("NA",""))
The dataset contains 902297 rows (i.e. observations, records) and 37 columns (i.e. variables):
dim(dataset)
## [1] 902297 37
There are many irrelevant variables in the dataset:
colnames(dataset)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
It seems we should select a subset of relevant variables from the dataset to reduce its size. The relevant varables are:
EVTYPE is an event type;
FATALITIES is a number of fatal cases reported for an event;
INJURIES is a a number of injury cases reported for an event;
PROPDMG and PROPDMGEXP are variables which should be used to calculate an estimated value of property damage;
CROPDMG and CROPDMGEXP are variables which should be used to calculate an estimated value of crop damage.
So we select only these variables:
data <- data.frame(dataset$EVTYPE, dataset$FATALITIES, dataset$INJURIES, dataset$PROPDMG, dataset$PROPDMGEXP, dataset$CROPDMG, dataset$CROPDMGEXP)
colnames(data) <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
The data summary:
summary(data)
## EVTYPE FATALITIES INJURIES PROPDMG
## HAIL :288661 Min. : 0 Min. : 0.0 Min. : 0
## TSTM WIND :219940 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0
## THUNDERSTORM WIND: 82563 Median : 0 Median : 0.0 Median : 0
## TORNADO : 60652 Mean : 0 Mean : 0.2 Mean : 12
## FLASH FLOOD : 54277 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.: 0
## FLOOD : 25326 Max. :583 Max. :1700.0 Max. :5000
## (Other) :170878
## PROPDMGEXP CROPDMG CROPDMGEXP
## K :424665 Min. : 0.0 K :281832
## M : 11330 1st Qu.: 0.0 M : 1994
## 0 : 216 Median : 0.0 k : 21
## B : 40 Mean : 1.5 0 : 19
## 5 : 28 3rd Qu.: 0.0 B : 9
## (Other): 84 Max. :990.0 (Other): 9
## NA's :465934 NA's :618413
There are many missing data in the dataset if we consider PROPDMGMULT and CROPDMGMULT variables. This may be a big problem if we decide to perform a sophisticated statistical analysis. But if we are interested in comparing totals it is okay if we assume that missing data are unbiased in terms of event types.
It is also disturbing that EVTYPE variable contains much more event types than expected (should be 48 according to the documentation):
length(levels(data$EVTYPE))
## [1] 985
This is due to typos or different storage standards used during 60 years. Ignoring this fact we proceed to calculating totals.
First of all we should create a vector with documented multiplier levels:
multipliers <- c("H", "K", "M", "B")
Then switch these multiplier level names to upper case (so that, for example, h and H will be the same):
levels(data$PROPDMGEXP) <- toupper(levels(data$PROPDMGEXP))
levels(data$CROPDMGEXP) <- toupper(levels(data$CROPDMGEXP))
And get multiplier level names for all observations for property and crop damage:
data$PROPDMGEXP <- trim(as.character(data$PROPDMGEXP))
data$CROPDMGEXP <- trim(as.character(data$CROPDMGEXP))
We need to create a function to switch the multiplier level names by corresponding multiplier values:
multiplier_value <- function(exponent)
{
switch(exponent, "H" = 100, "K" = 1000, "M" = 10^6, "B" = 10^12, NA)
}
Now it is possible to construct variables containing multiplier values corresponding to the exponent: names
data[,8] <- sapply(data$PROPDMGEXP, multiplier_value)
data[,9] <- sapply(data$CROPDMGEXP, multiplier_value)
colnames(data)[c(8,9)] <- c("PROPDMGMULT", "CROPDMGMULT")
Based on the new variables, adding variables containing estimated property and crop damage costs:
data[,10] <- data$PROPDMG*data$PROPDMGMULT
data[,11] <- data$CROPDMG*data$CROPDMGMULT
colnames(data)[c(10,11)] <- c("PROPDMGCOST", "CROPDMGCOST")
Let’s look at the dataset again to check if new variables added successfully:
summary(data)
## EVTYPE FATALITIES INJURIES PROPDMG
## HAIL :288661 Min. : 0 Min. : 0.0 Min. : 0
## TSTM WIND :219940 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0
## THUNDERSTORM WIND: 82563 Median : 0 Median : 0.0 Median : 0
## TORNADO : 60652 Mean : 0 Mean : 0.2 Mean : 12
## FLASH FLOOD : 54277 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.: 0
## FLOOD : 25326 Max. :583 Max. :1700.0 Max. :5000
## (Other) :170878
## PROPDMGEXP CROPDMG CROPDMGEXP PROPDMGMULT
## Length:902297 Min. : 0.0 Length:902297 Min. :1.00e+02
## Class :character 1st Qu.: 0.0 Class :character 1st Qu.:1.00e+03
## Mode :character Median : 0.0 Mode :character Median :1.00e+03
## Mean : 1.5 Mean :9.18e+07
## 3rd Qu.: 0.0 3rd Qu.:1.00e+03
## Max. :990.0 Max. :1.00e+12
## NA's :466248
## CROPDMGMULT PROPDMGCOST CROPDMGCOST
## Min. :1.00e+03 Min. :0.00e+00 Min. :0.00e+00
## 1st Qu.:1.00e+03 1st Qu.:0.00e+00 1st Qu.:0.00e+00
## Median :1.00e+03 Median :1.00e+03 Median :0.00e+00
## Mean :3.17e+07 Mean :6.33e+08 Mean :4.81e+07
## 3rd Qu.:1.00e+03 3rd Qu.:1.00e+04 3rd Qu.:0.00e+00
## Max. :1.00e+12 Max. :1.15e+14 Max. :5.00e+12
## NA's :618440 NA's :466248 NA's :618440
We’ve done! The data was successfully processed to use in further analysis!
The following code creates a dataset for making a bar chart of total fatalities per event type below:
fatalities_per_event <- aggregate(data$FATALITIES, by=list(data$EVTYPE), FUN=sum, na.rm = T)
injuries_per_event <- aggregate(data$INJURIES, by=list(data$EVTYPE), FUN=sum, na.rm = T)
fatalities_per_event_srt <- fatalities_per_event[order(fatalities_per_event$x, decreasing = T),]
injuries_per_event_srt <- injuries_per_event[order(injuries_per_event$x, decreasing = T),]
health_damage <- data.frame(fatalities_per_event_srt$x, injuries_per_event_srt$x, injuries_per_event_srt$Group.1)
colnames(health_damage) <- c("fatalities", "injuries", "eventype")
This code draws a bar chart of total fatalities per each of the 10 most harmful event types:
health_damage$eventype <- with(health_damage, reorder(eventype, 1/fatalities))
most_fatal <- ggplot(health_damage[1:10,], aes(x = eventype, y = fatalities))
most_fatal + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
The most deadly event type is, obviously, the TORNADO.
cor(health_damage$fatalities, health_damage$injuries)
## [1] 0.9549
It is interesting that there is a very strong direct linear relationship between a number of fatal and a number of injury cases. It is interpretable and we can conclude that TORNADO, on average, also leads to a greater number of injuries if we compare it with another event types.
The following code creates a dataset for making a bar chart of total economy losses per event type below:
propdmgcost_per_event <- aggregate(data$PROPDMGCOST, by=list(data$EVTYPE), FUN=sum, na.rm = T)
cropdmgcost_per_event <- aggregate(data$CROPDMGCOST, by=list(data$EVTYPE), FUN=sum, na.rm = T)
propdmgcost_per_event_srt <- propdmgcost_per_event[order(propdmgcost_per_event$x, decreasing = T),]
cropdmgcost_per_event_srt <- cropdmgcost_per_event[order(cropdmgcost_per_event$x, decreasing = T),]
propcrop_damage <- data.frame(propdmgcost_per_event_srt$x, cropdmgcost_per_event_srt$x, propdmgcost_per_event_srt$Group.1)
colnames(propcrop_damage) <- c("prop.damage", "crop.damage", "eventype")
propcrop_damage[,4] <- propcrop_damage$prop.damage + propcrop_damage$crop.damage
colnames(propcrop_damage)[4] <- c("total.damage")
This code draws a bar chart of total cost per each of the 10 most harmful event types:
propcrop_damage$eventype <- with(propcrop_damage, reorder(eventype, 1/total.damage))
most_cropdmg <- ggplot(propcrop_damage[1:10,], aes(x = eventype, y = total.damage))
most_cropdmg + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
The most destructive event type is, obviously, the FLOOD.