Synopsis:

This data analysis report reviews the impact of severe weather events on people and property in the United States. The data come from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, covering 1950 through 2011. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The report identifies the event types with the greatest Human Population Health Impact (Fatalities and Injuries) and the greatest Economic Impact (sum of crop and property damage). Economic damage is reported in millions of US dollars.

Data Processing:

Note: This section includes description and justification for any data transformations.

The database was provided as a bzip2-compressed CSV file (.csv.bz2). The file was read directly into a data frame called ‘stormdata’.

## set working directory
setwd("C:/Users/Harish/Desktop/Coursera/Reproducible research/Assgmnt2")
library(knitr)
library(ggplot2)
##load data to stormdata
stormdata <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
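The loading step relies on R reading bzip2-compressed text through a connection, so the file never has to be unpacked by hand. The following is a minimal sketch of that round trip on a made-up three-row table written to a temporary file; the `toy` data and the temporary path are illustrative assumptions, not part of the storm database.

```r
# Toy data standing in for the real storm file (illustrative only)
toy <- data.frame(EVTYPE = c("HAIL", "TORNADO", "FLOOD"),
                  FATALITIES = c(0L, 2L, 1L))

# Write the table as a bzip2-compressed CSV to a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back the same way the report does: read.csv() on a bzfile() connection
loaded <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
unlink(tmp)  # remove the temporary file

print(loaded)
```

Reading from a path relative to the project directory in this way would also make the hard-coded `setwd()` call unnecessary on other machines.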

EVTYPE is the key field: it holds the event categories on which the analysis is based. Many categories are duplicated under slightly different spellings (differences in case, punctuation, or wording), which inflates the number of distinct categories. The following code chunk is used to consolidate the duplicate categories.

#EVTYPE is our key field
# number of unique types
length(unique(stormdata$EVTYPE)) 
## [1] 985
# Reducing Event types
red_EVTYPE <- toupper(stormdata$EVTYPE)
red_EVTYPE <- gsub("[[:punct:]]", " ", red_EVTYPE)
length(unique(red_EVTYPE))
## [1] 874
# Final data frame
stormdata$EVTYPE <- red_EVTYPE
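To see what the normalization above does, here is a small sketch on made-up labels (the `raw` vector is an assumption for illustration, not taken from the database). It applies the same `toupper` and punctuation replacement, then shows an optional extra step, not used in this report, that collapses the leftover whitespace.

```r
# Made-up labels that differ only in case, punctuation, or spacing
raw <- c("Heavy Rain/High Surf", "TSTM WIND (G45)", " flash flood", "FLASH FLOOD")

# Same two steps as the report: upper-case, then punctuation -> space
upper    <- toupper(raw)
no_punct <- gsub("[[:punct:]]", " ", upper)

# Optional extra step (NOT in the original analysis): collapse runs of
# whitespace and trim the ends, so " FLASH FLOOD" and "FLASH FLOOD" merge
tidy <- trimws(gsub("[[:space:]]+", " ", no_punct))

length(unique(no_punct))  # labels still distinct after the report's steps
length(unique(tidy))      # labels distinct after whitespace cleanup
```

Applying the whitespace cleanup to the real data would change the category counts reported in this document, so it is shown only as a possible refinement.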

## Consolidate replicates
stormdata$EVTYPE <- gsub(".*HEAT", "EXCESSIVE HEAT", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*THUNDER", "TSTM WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*SLIDE", "LANDSLIDE", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*COAST", "COASTAL", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*COLD", "COLD WEATHER", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*DRY", "DRY WEATHER", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*FLOOD", "FLOOD", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*HURRICANE TYPHOON", "HURRICANE", stormdata$EVTYPE)
## Now only 746 unique EVTYPEs
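The consolidation patterns have the form `.*KEYWORD`, which replaces everything up to and including the keyword but leaves any trailing text in place. The sketch below, on a few made-up labels, illustrates that behaviour and would explain why a label such as TSTM WINDSTORM WIND appears in the injuries table. The last line shows one possible alternative pattern that replaces the whole label; it is an assumption offered for comparison, not the method used in this report.

```r
# Made-up labels (already upper-cased, as in the report)
labels <- c("HEAT WAVE", "EXTREME HEAT", "THUNDERSTORM WIND")

# Patterns as used in the report: only the prefix through the keyword is replaced
step1 <- gsub(".*HEAT", "EXCESSIVE HEAT", labels)
step2 <- gsub(".*THUNDER", "TSTM WIND", step1)
print(step2)
# "HEAT WAVE" keeps its " WAVE" suffix; "THUNDERSTORM WIND" keeps "STORM WIND"

# Alternative (NOT used in the report): match the trailing text too,
# so any label containing HEAT collapses to a single category
whole <- gsub(".*HEAT.*", "EXCESSIVE HEAT", labels)
print(whole)
```

Switching to the whole-label pattern would merge more categories and therefore change the tallies shown later, so the figures in this report should be read with the prefix-only behaviour in mind.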

Data Transformation for the dollar values:

It was also noted that the dollar values are stored in two parts: a number (PROPDMG, CROPDMG) and a magnitude suffix (K, M, B etc.) held in a separate column (PROPDMGEXP, CROPDMGEXP). The dollar amounts cannot be calculated from the number alone, so the suffix must first be converted into a numeric multiplier. The following code creates a new column for the dollar amount of each damage type. Finally, a new column is created that combines the crop and property damage dollars.

# Convert property damage suffixes (K, M/m, B) to numeric multipliers
stormdata$PROPDMGEXP <- as.character(stormdata$PROPDMGEXP)
stormdata$PROPDMGEXP[grep("K", stormdata$PROPDMGEXP)] <- "1000"
stormdata$PROPDMGEXP[grep("M", stormdata$PROPDMGEXP)] <- "1000000"
stormdata$PROPDMGEXP[grep("m", stormdata$PROPDMGEXP)] <- "1000000"
stormdata$PROPDMGEXP[grep("B", stormdata$PROPDMGEXP)] <- "1000000000"
# Any other code (blank, digit, or symbol) is treated as a multiplier of 1
blanks <- !(stormdata$PROPDMGEXP %in% c("1000", "1000000", "1000000000"))
stormdata$PROPDMGEXP[blanks] <- "1"
stormdata$PROPDMGEXP <- as.numeric(stormdata$PROPDMGEXP)
# Property damage in dollars
stormdata$PROPDMGDOLLARS <- stormdata$PROPDMG * stormdata$PROPDMGEXP

# Convert crop damage suffixes (K, M/m, B) to numeric multipliers
stormdata$CROPDMGEXP <- as.character(stormdata$CROPDMGEXP)
stormdata$CROPDMGEXP[grep("K", stormdata$CROPDMGEXP)] <- "1000"
stormdata$CROPDMGEXP[grep("M", stormdata$CROPDMGEXP)] <- "1000000"
stormdata$CROPDMGEXP[grep("m", stormdata$CROPDMGEXP)] <- "1000000"
stormdata$CROPDMGEXP[grep("B", stormdata$CROPDMGEXP)] <- "1000000000"
# Any other code (blank, digit, or symbol) is treated as a multiplier of 1
blanks <- !(stormdata$CROPDMGEXP %in% c("1000", "1000000", "1000000000"))
stormdata$CROPDMGEXP[blanks] <- "1"
stormdata$CROPDMGEXP <- as.numeric(stormdata$CROPDMGEXP)
# Crop damage in dollars, then total economic damage (crop + property)
stormdata$CROPDMGDOLLARS <- stormdata$CROPDMG * stormdata$CROPDMGEXP
stormdata$TOTALECONOMIC <- stormdata$CROPDMGDOLLARS + stormdata$PROPDMGDOLLARS
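The suffix conversion above can also be expressed as a lookup table, which may be easier to read and to apply to both damage columns. The sketch below is a hedged alternative on toy data: `exp_to_multiplier` and the `toy` data frame are hypothetical names introduced here, and the mapping mirrors the rules used in this report (K, M, m, B recognised; anything else treated as 1).

```r
# Hypothetical helper (not part of the original analysis):
# map a suffix code to its numeric multiplier, defaulting to 1
exp_to_multiplier <- function(exp_code) {
  lookup <- c(K = 1e3, M = 1e6, m = 1e6, B = 1e9)
  mult <- unname(lookup[as.character(exp_code)])
  mult[is.na(mult)] <- 1  # blank, digit, or symbol codes
  mult
}

# Toy rows covering each case
toy <- data.frame(DMG = c(25, 2.5, 1, 10, 3),
                  EXP = c("K", "M", "B", "", "5"),
                  stringsAsFactors = FALSE)
toy$DOLLARS <- toy$DMG * exp_to_multiplier(toy$EXP)
print(toy)
```

With a helper like this, the property and crop conversions would each reduce to a single line, and the original suffix columns would stay untouched for later inspection.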

Results:

The results section is broken down into two categories.

1. Population Health Damage:

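This is further broken down into Fatalities and Injuries. The counts for each of the two categories were tallied by event type, and a Pareto chart was created for each category to find the top 10 event types with the greatest impact. Note that the code divides the counts by 1,000,000, so the tables and charts report fatalities and injuries in millions of people (for example, 0.0056 corresponds to about 5,600 fatalities), not in dollars. The code used for Population Damage is given below.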

##POPULATION DAMAGE
library(plyr)
HEALTH_F <- ddply(stormdata, .(EVTYPE), summarize,
                    Sum_FATALITIES = sum(FATALITIES)/1000000)#In million
HEALTH_I <- ddply(stormdata, .(EVTYPE), summarize,
                    Sum_INJURIES = sum(INJURIES)/1000000)#In million

Once the two dataframes were created as above, we are able to find the top ten categories for each type. The kable function from knitr is used to create a neat table as well.

# Top 10 for Fatalities and Injuries
FATALITIES_TALLY <- head(HEALTH_F[order(HEALTH_F$Sum_FATALITIES, decreasing = T), ], 10)
INJURIES_TALLY <- head(HEALTH_I[order(HEALTH_I$Sum_INJURIES, decreasing = T), ], 10)
#Neat table for each category
kable(FATALITIES_TALLY, align = 'c', format = "markdown")
## 
## 
## |    |         EVTYPE          | Sum_FATALITIES |
## |:---|:-----------------------:|:--------------:|
## |557 |         TORNADO         |     0.0056     |
## |104 |     EXCESSIVE HEAT      |     0.0030     |
## |125 |          FLOOD          |     0.0015     |
## |354 |        LIGHTNING        |     0.0008     |
## |578 |        TSTM WIND        |     0.0005     |
## |429 |       RIP CURRENT       |     0.0004     |
## |271 |        HIGH WIND        |     0.0002     |
## |16  |        AVALANCHE        |     0.0002     |
## |59  | COLD WEATHER WIND CHILL |     0.0002     |
## |737 |      WINTER STORM       |     0.0002     |
kable(INJURIES_TALLY, align = 'c', format = "markdown")
## 
## 
## |    |       EVTYPE        | Sum_INJURIES |
## |:---|:-------------------:|:------------:|
## |557 |       TORNADO       |    0.0913    |
## |104 |   EXCESSIVE HEAT    |    0.0088    |
## |125 |        FLOOD        |    0.0086    |
## |578 |      TSTM WIND      |    0.0070    |
## |354 |      LIGHTNING      |    0.0052    |
## |330 |      ICE STORM      |    0.0020    |
## |608 | TSTM WINDSTORM WIND |    0.0015    |
## |181 |        HAIL         |    0.0014    |
## |308 |      HURRICANE      |    0.0013    |
## |737 |    WINTER STORM     |    0.0013    |
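The tally-and-rank steps above can be sketched on a small made-up data set using base R only. The `toy` data frame below is an illustrative assumption; `aggregate` stands in for the `ddply` call used in this report, and the ordering and `head` steps are the same.

```r
# Made-up events: several rows per event type
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD", "TORNADO"),
  FATALITIES = c(3, 1, 2, 0, 4, 1),
  stringsAsFactors = FALSE
)

# Sum fatalities per event type (base R stand-in for ddply + summarize)
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Rank in decreasing order and keep the top two event types
ranked <- totals[order(totals$FATALITIES, decreasing = TRUE), ]
top2   <- head(ranked, 2)
print(top2)
```

Keeping the raw counts, rather than dividing by one million as the report does, avoids values such as 0.0002 that are hard to compare at a glance.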

Finally, a Pareto chart was created to graphically represent the top 10 event types.

#Make Pareto chart for Fatalities only
library(ggplot2)
ordered_FATALITIES_TALLY <- FATALITIES_TALLY
ordered_FATALITIES_TALLY$EVTYPE <- with(ordered_FATALITIES_TALLY , reorder(EVTYPE, -Sum_FATALITIES))
g <- ggplot(data=ordered_FATALITIES_TALLY , aes(x=EVTYPE, y=Sum_FATALITIES, fill=EVTYPE, ymax=max(Sum_FATALITIES)*1.05))
g <- g + geom_bar(stat="identity") + xlab("Event Type") +  ylab("Fatalities (in millions)") + ggtitle("Population Impact from Fatalities Plot")
g <- g +geom_text(aes(label = round(Sum_FATALITIES,4)), position = position_dodge(width=1),vjust=-.5, size = 2)+ scale_fill_discrete(name="Event Type")+ theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6))
g

[Figure: Population Impact from Fatalities Plot (top 10 event types)]

#Make Pareto chart for Injuries only
ordered_INJURIES_TALLY <- INJURIES_TALLY
ordered_INJURIES_TALLY$EVTYPE <- with(ordered_INJURIES_TALLY , reorder(EVTYPE, -Sum_INJURIES))
g <- ggplot(data=ordered_INJURIES_TALLY , aes(x=EVTYPE, y=Sum_INJURIES, fill=EVTYPE, ymax=max(Sum_INJURIES)*1.05))
g <- g + geom_bar(stat="identity") + xlab("Event Type") +  ylab("Injuries (in millions)") + ggtitle("Population Impact from Injuries Plot")
g <- g +geom_text(aes(label = round(Sum_INJURIES,4)), position = position_dodge(width=1),vjust=-.5, size = 2)+ scale_fill_discrete(name="Event Type")+ theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6))
g

[Figure: Population Impact from Injuries Plot (top 10 event types)]

It can be seen that Tornado, Excessive Heat and Flood are the top three events in both categories.

2. Economic Damage:

Similar to the Health damage detailed above, we will perform an analysis on the economic damage. This is also divided into two categories, Crop damage and property damage. These two will be combined to create a Total Economic damage tally.

##ECONOMIC DAMAGE
ECONOMY_C <- ddply(stormdata, .(EVTYPE), summarize,
                    Sum_CROPDMGDOLLARS = sum(CROPDMGDOLLARS)/1000000) #In million
ECONOMY_P <- ddply(stormdata, .(EVTYPE), summarize,
                    Sum_PROPDMGDOLLARS = sum(PROPDMGDOLLARS)/1000000) #In million
ECONOMY_T <- ddply(stormdata, .(EVTYPE), summarize,
        Sum_TOTALECONOMIC = round(sum(TOTALECONOMIC),2)/1000000) #In million

# Top 10 for Crop and Property damage, and a new variable Total Economic tally that sums them up.
CROPDMGDOLLARS_TALLY <- head(ECONOMY_C[order(ECONOMY_C$Sum_CROPDMGDOLLARS, decreasing = T), ], 10)
PROPDMGDOLLARS_TALLY <- head(ECONOMY_P[order(ECONOMY_P$Sum_PROPDMGDOLLARS, decreasing = T), ], 10)
TOTALECONOMIC_TALLY <- head(ECONOMY_T[order(ECONOMY_T$Sum_TOTALECONOMIC, decreasing = T), ], 10)

As before, we will create a neat table using the kable function to display the top ten event types.

#Neat table for each category
kable(CROPDMGDOLLARS_TALLY, align = 'c', format = "markdown")
## 
## 
## |    |     EVTYPE     | Sum_CROPDMGDOLLARS |
## |:---|:--------------:|:------------------:|
## |74  |    DROUGHT     |      13972.6       |
## |125 |     FLOOD      |      12209.5       |
## |308 |   HURRICANE    |       5349.8       |
## |330 |   ICE STORM    |       5022.1       |
## |181 |      HAIL      |       3025.5       |
## |45  |  COLD WEATHER  |       1350.1       |
## |159 |  FROST FREEZE  |       1094.2       |
## |104 | EXCESSIVE HEAT |       898.9        |
## |215 |   HEAVY RAIN   |       733.4        |
## |571 | TROPICAL STORM |       678.3        |
kable(PROPDMGDOLLARS_TALLY, align = 'c', format = "markdown")
## 
## 
## |    |      EVTYPE      | Sum_PROPDMGDOLLARS |
## |:---|:----------------:|:------------------:|
## |125 |      FLOOD       |       166781       |
## |308 |    HURRICANE     |       81174        |
## |557 |     TORNADO      |       56937        |
## |481 |   STORM SURGE    |       43324        |
## |181 |       HAIL       |       15732        |
## |571 |  TROPICAL STORM  |        7704        |
## |737 |   WINTER STORM   |        6688        |
## |271 |    HIGH WIND     |        5270        |
## |724 |     WILDFIRE     |        4765        |
## |482 | STORM SURGE TIDE |        4641        |
kable(TOTALECONOMIC_TALLY, align = 'c', format = "markdown")
## 
## 
## |    |     EVTYPE     | Sum_TOTALECONOMIC |
## |:---|:--------------:|:-----------------:|
## |125 |     FLOOD      |      178991       |
## |308 |   HURRICANE    |       86524       |
## |557 |    TORNADO     |       57352       |
## |481 |  STORM SURGE   |       43324       |
## |181 |      HAIL      |       18758       |
## |74  |    DROUGHT     |       15019       |
## |330 |   ICE STORM    |       8967        |
## |571 | TROPICAL STORM |       8382        |
## |737 |  WINTER STORM  |       6715        |
## |271 |   HIGH WIND    |       5909        |

As before, we will also create a Pareto chart for the total economic tally top ten events. This is shown below.

# Pareto plot for Total Economic
ordered_TOTALECONOMIC_TALLY <- TOTALECONOMIC_TALLY
ordered_TOTALECONOMIC_TALLY$EVTYPE <- with(ordered_TOTALECONOMIC_TALLY , reorder(EVTYPE, -Sum_TOTALECONOMIC))
g <- ggplot(data=ordered_TOTALECONOMIC_TALLY , aes(x=EVTYPE, y=Sum_TOTALECONOMIC, fill=EVTYPE, ymax=max(Sum_TOTALECONOMIC)*1.05))
g <- g + geom_bar(stat="identity") + xlab("Event Type") +  ylab("Damage dollars in Million") + ggtitle("Economic Impact Plot")
g <- g +geom_text(aes(label = round(Sum_TOTALECONOMIC,2)), position = position_dodge(width=1),vjust=-.5, size = 2)+ scale_fill_discrete(name="Event Type")+ theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6))
g

[Figure: Economic Impact Plot (top 10 event types by total economic damage)]

It can be seen that Flood, Hurricane and Tornado are the top three Event Types for Economic damage.
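A Pareto chart is usually read together with a cumulative share, which the bar charts in this report do not show. As a hedged sketch, the cumulative percentage can be computed from the total economic damage table above (values in millions of US dollars, copied from that table). The shares are relative to the top ten event types only, not to all events in the database.

```r
# Total economic damage by event type, in millions of USD (from the table above)
total_damage <- c(
  FLOOD = 178991, HURRICANE = 86524, TORNADO = 57352, `STORM SURGE` = 43324,
  HAIL = 18758, DROUGHT = 15019, `ICE STORM` = 8967, `TROPICAL STORM` = 8382,
  `WINTER STORM` = 6715, `HIGH WIND` = 5909
)

# Sort in decreasing order and accumulate the share of the top-ten total
sorted    <- sort(total_damage, decreasing = TRUE)
cum_share <- round(100 * cumsum(sorted) / sum(sorted), 1)

pareto <- data.frame(
  EVTYPE          = names(sorted),
  Damage_Millions = as.numeric(sorted),
  Cumulative_Pct  = as.numeric(cum_share)
)
print(pareto)
```

Read this way, the first three event types account for roughly three quarters of the damage recorded for the top ten.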

Conclusion:

A detailed analysis was performed on the storm data that was made available to us. Several of the duplicate categories were consolidated to make the analysis more accurate. The top ten event types for each category were identified, and three plots were created to help visualize the data.