Analysis of storm data

Synopsis

This exploratory study involves analysing data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. More information about the data collection and data keeping nuances you can find here. Main goal of this short exploratory study is to determine types of weather events which are most harmful with respect to population health and economic consequences. The storm data were collected from 1950 to 2011. The dataset contains 902297 records with 37 variables, most of them are irrelevant for the analysis in question so relevant variable were selected (event type, number of fatalities, number of injuries, damage estimates and others). Then the data were aggregated by event type and total numbers of fatalities as well as total amount of estimated damage costs were calculated for each event type. The 10 most harmful event types in terms of population health and economic consequences were determined and illustrated using bar plots.

Data Processing

Dataset Description

The dataset for analysis come in the form of a compressed comma-separated-value file. It is available via the following link: Storm Data [47Mb].

You can find the description of some of the variables here [MS Word file].

Loading the data

First of all you need to install R packages R.utils, ggplot2 and gridExtra

library(R.utils)
library(ggplot2)
library(gridExtra)

The next step is to set working directory and a name for the downloaded archive file. Those who use this code should download the data previously, place it into a directory and specify path to this directory. Then you will need to specify the name of the downloaded archive.

working_dir <- "F:\\CouRsera\\Data Science Specialization\\Reproducible Research\\Peer 2"
setwd(working_dir)
destfile <- "storm_data.bz2"
## URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
## download.file(URL,destfile )
dataset <- read.csv(destfile, strip.white = TRUE, na.strings = c("NA",""))

The dataset contains 902297 rows (i.e. observations, records) and 37 columns (i.e. variables):

dim(dataset)

## [1] 902297     37

There are many irrelevant variables in the dataset:

colnames(dataset)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

It seems we should select a subset of relevant variables from the dataset to reduce its size. The relevant varables are:

EVTYPE is an event type;
FATALITIES is a number of fatal cases reported for an event;
INJURIES is a a number of injury cases reported for an event;
PROPDMG and PROPDMGEXP are variables which should be used to calculate an estimated value of property damage;
CROPDMG and CROPDMGEXP are variables which should be used to calculate an estimated value of crop damage.

So we select only these variables:

data <- data.frame(dataset$EVTYPE, dataset$FATALITIES, dataset$INJURIES, dataset$PROPDMG, dataset$PROPDMGEXP, dataset$CROPDMG, dataset$CROPDMGEXP)
colnames(data) <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

The data summary:

summary(data)

##                EVTYPE         FATALITIES     INJURIES         PROPDMG    
##  HAIL             :288661   Min.   :  0   Min.   :   0.0   Min.   :   0  
##  TSTM WIND        :219940   1st Qu.:  0   1st Qu.:   0.0   1st Qu.:   0  
##  THUNDERSTORM WIND: 82563   Median :  0   Median :   0.0   Median :   0  
##  TORNADO          : 60652   Mean   :  0   Mean   :   0.2   Mean   :  12  
##  FLASH FLOOD      : 54277   3rd Qu.:  0   3rd Qu.:   0.0   3rd Qu.:   0  
##  FLOOD            : 25326   Max.   :583   Max.   :1700.0   Max.   :5000  
##  (Other)          :170878                                                
##    PROPDMGEXP        CROPDMG        CROPDMGEXP    
##  K      :424665   Min.   :  0.0   K      :281832  
##  M      : 11330   1st Qu.:  0.0   M      :  1994  
##  0      :   216   Median :  0.0   k      :    21  
##  B      :    40   Mean   :  1.5   0      :    19  
##  5      :    28   3rd Qu.:  0.0   B      :     9  
##  (Other):    84   Max.   :990.0   (Other):     9  
##  NA's   :465934                   NA's   :618413

There are many missing data in the dataset if we consider PROPDMGMULT and CROPDMGMULT variables. This may be a big problem if we decide to perform a sophisticated statistical analysis. But if we are interested in comparing totals it is okay if we assume that missing data are unbiased in terms of event types.

It is also disturbing that EVTYPE variable contains much more event types than expected (should be 48 according to the documentation):

length(levels(data$EVTYPE))

## [1] 985

This is due to typos or different storage standards used during 60 years. Ignoring this fact we proceed to calculating totals.

First of all we should create a vector with documented multiplier levels:

multipliers <- c("H", "K", "M", "B")

Then switch these multiplier level names to upper case (so that, for example, h and H will be the same):

levels(data$PROPDMGEXP) <- toupper(levels(data$PROPDMGEXP))
levels(data$CROPDMGEXP) <- toupper(levels(data$CROPDMGEXP))

And get multiplier level names for all observations for property and crop damage:

data$PROPDMGEXP <- trim(as.character(data$PROPDMGEXP))
data$CROPDMGEXP <- trim(as.character(data$CROPDMGEXP))

We need to create a function to switch the multiplier level names by corresponding multiplier values:

multiplier_value <- function(exponent)
{
  switch(exponent, "H" = 100, "K" = 1000, "M" = 10^6, "B" = 10^12, NA)
}

Now it is possible to construct variables containing multiplier values corresponding to the exponent: names

data[,8] <- sapply(data$PROPDMGEXP, multiplier_value)
data[,9] <- sapply(data$CROPDMGEXP, multiplier_value)
colnames(data)[c(8,9)] <- c("PROPDMGMULT", "CROPDMGMULT")

Based on the new variables, adding variables containing estimated property and crop damage costs:

data[,10] <- data$PROPDMG*data$PROPDMGMULT
data[,11] <- data$CROPDMG*data$CROPDMGMULT
colnames(data)[c(10,11)] <- c("PROPDMGCOST", "CROPDMGCOST")

Let’s look at the dataset again to check if new variables added successfully:

summary(data)

##                EVTYPE         FATALITIES     INJURIES         PROPDMG    
##  HAIL             :288661   Min.   :  0   Min.   :   0.0   Min.   :   0  
##  TSTM WIND        :219940   1st Qu.:  0   1st Qu.:   0.0   1st Qu.:   0  
##  THUNDERSTORM WIND: 82563   Median :  0   Median :   0.0   Median :   0  
##  TORNADO          : 60652   Mean   :  0   Mean   :   0.2   Mean   :  12  
##  FLASH FLOOD      : 54277   3rd Qu.:  0   3rd Qu.:   0.0   3rd Qu.:   0  
##  FLOOD            : 25326   Max.   :583   Max.   :1700.0   Max.   :5000  
##  (Other)          :170878                                                
##   PROPDMGEXP           CROPDMG       CROPDMGEXP         PROPDMGMULT      
##  Length:902297      Min.   :  0.0   Length:902297      Min.   :1.00e+02  
##  Class :character   1st Qu.:  0.0   Class :character   1st Qu.:1.00e+03  
##  Mode  :character   Median :  0.0   Mode  :character   Median :1.00e+03  
##                     Mean   :  1.5                      Mean   :9.18e+07  
##                     3rd Qu.:  0.0                      3rd Qu.:1.00e+03  
##                     Max.   :990.0                      Max.   :1.00e+12  
##                                                        NA's   :466248    
##   CROPDMGMULT        PROPDMGCOST        CROPDMGCOST      
##  Min.   :1.00e+03   Min.   :0.00e+00   Min.   :0.00e+00  
##  1st Qu.:1.00e+03   1st Qu.:0.00e+00   1st Qu.:0.00e+00  
##  Median :1.00e+03   Median :1.00e+03   Median :0.00e+00  
##  Mean   :3.17e+07   Mean   :6.33e+08   Mean   :4.81e+07  
##  3rd Qu.:1.00e+03   3rd Qu.:1.00e+04   3rd Qu.:0.00e+00  
##  Max.   :1.00e+12   Max.   :1.15e+14   Max.   :5.00e+12  
##  NA's   :618440     NA's   :466248     NA's   :618440

We’ve done! The data was successfully processed to use in further analysis!

Results

First analysis: which event type is the most harmful wrt population health?

The following code creates a dataset for making a bar chart of total fatalities per event type below:

fatalities_per_event <- aggregate(data$FATALITIES, by=list(data$EVTYPE), FUN=sum, na.rm = T)
injuries_per_event <- aggregate(data$INJURIES, by=list(data$EVTYPE), FUN=sum, na.rm = T)
fatalities_per_event_srt <- fatalities_per_event[order(fatalities_per_event$x, decreasing = T),]
injuries_per_event_srt <- injuries_per_event[order(injuries_per_event$x, decreasing = T),]
health_damage <- data.frame(fatalities_per_event_srt$x, injuries_per_event_srt$x, injuries_per_event_srt$Group.1)
colnames(health_damage) <- c("fatalities", "injuries", "eventype")

This code draws a bar chart of total fatalities per each of the 10 most harmful event types:

health_damage$eventype <- with(health_damage, reorder(eventype, 1/fatalities))
most_fatal <- ggplot(health_damage[1:10,], aes(x = eventype, y = fatalities))
most_fatal + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot of chunk unnamed-chunk-16

The most deadly event type is, obviously, the TORNADO.

cor(health_damage$fatalities, health_damage$injuries)

## [1] 0.9549

It is interesting that there is a very strong direct linear relationship between a number of fatal and a number of injury cases. It is interpretable and we can conclude that TORNADO, on average, also leads to a greater number of injuries if we compare it with another event types.

Second analysis: which types of events have the greatest economic consequences?

The following code creates a dataset for making a bar chart of total economy losses per event type below:

propdmgcost_per_event <- aggregate(data$PROPDMGCOST, by=list(data$EVTYPE), FUN=sum, na.rm = T)
cropdmgcost_per_event <- aggregate(data$CROPDMGCOST, by=list(data$EVTYPE), FUN=sum, na.rm = T)
propdmgcost_per_event_srt <- propdmgcost_per_event[order(propdmgcost_per_event$x, decreasing = T),]
cropdmgcost_per_event_srt <- cropdmgcost_per_event[order(cropdmgcost_per_event$x, decreasing = T),]
propcrop_damage <- data.frame(propdmgcost_per_event_srt$x, cropdmgcost_per_event_srt$x, propdmgcost_per_event_srt$Group.1)
colnames(propcrop_damage) <- c("prop.damage", "crop.damage", "eventype")
propcrop_damage[,4] <- propcrop_damage$prop.damage + propcrop_damage$crop.damage
colnames(propcrop_damage)[4] <- c("total.damage")

This code draws a bar chart of total cost per each of the 10 most harmful event types:

propcrop_damage$eventype <- with(propcrop_damage, reorder(eventype, 1/total.damage))
most_cropdmg <- ggplot(propcrop_damage[1:10,], aes(x = eventype, y = total.damage))
most_cropdmg + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot of chunk unnamed-chunk-19

The most destructive event type is, obviously, the FLOOD.