This document reports an analysis of the U.S. Storm Events database covering 1950 to 2015. The analysis attempts to answer the questions posed in the Coursera Reproducible Research course project; it is not complete or comprehensive, and it makes no claims about limitations in the data collection methods. The report utilises data obtained from the following URL: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. The database is a statistical collection of severe weather events that may have a significant demographic or economic impact on local economies across the US.

Synopsis

The analysis relies on the basic premise that the raw data in the database is incomplete in some respects, especially when examined chronologically. The website notes that the collection was not well established during the early years and covered only a few event types, whereas in later years it expanded to include a wide variety of weather events and can be considered close to comprehensive. First, the raw data is obtained, subsetted, processed and clustered, then summarised to yield the desired results. Plots are used where suitable to supplement the analysis.

Raw Data Load

The raw data is first obtained from the website and loaded into R. This requires significant memory and is a computationally expensive task. Only the relevant variables, such as EVTYPE, are needed; most of the other variables are not required. Hence, the data is first subsetted, and this subset is used for all subsequent analysis.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(stringdist)
## Warning: package 'stringdist' was built under R version 3.2.4
setwd("E:\\R data and codes\\Rep Research Storm Data")
#Read in the data
#download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile = "./StormData RepResProject.bz2",method ="curl")

data <- read.csv('./StormData RepResProject.csv.bz2', header = T)

The computational burden is quite heavy, considering that the data spans more than six decades across counties all over the US.
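
If memory or load time is a concern, one option is to read a handful of rows first, infer the column classes, and pass them to read.csv. This is only a sketch: it assumes the classes inferred from the first rows hold for the entire file, which is not guaranteed.

#Optional speed-up (sketch only): infer column classes from a small sample,
#then supply them to read.csv so the full load runs faster
head_rows <- read.csv('./StormData RepResProject.csv.bz2', header = T, nrows = 100)
classes <- sapply(head_rows, class)
data <- read.csv('./StormData RepResProject.csv.bz2', header = T, colClasses = classes)
print(object.size(data), units = "MB") #rough memory footprint of the loaded data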

Data Processing

Select only the relevant variables: COUNTY, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.

sdata <- select(data, COUNTY, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

#remove original data to free some memory
rm(data)

A significant part of this assignment was spent on analysing the EVTYPE variable. The variable did not contain any missing events, but it was quite messy, with the same event recorded under different wordings and frequent misspellings. Thus, most of the effort had to go into cleaning the data. For this purpose, a little research was done to identify better ways of cleaning. A similar analysis by Jeremy Beck at http://www.rpubs.com/jbeck/StormData provided the motivation to use a text mining technique, and this analysis follows a similar line to enhance my learning experience.

As a precursor, some pattern replacement is done in the hope of making the clustering more accurate.

sdata$EVTYPE <- gsub("TSTM | GUSTY THUNDERSTORM", "THUNDERSTORM", sdata$EVTYPE)
sdata$EVTYPE <- gsub(".*TORNADO.*", "TORNADO", sdata$EVTYPE)
sdata$EVTYPE <- gsub(".*FLOOD.*","FLOODS",sdata$EVTYPE)

Cluster Groupings - EVTYPE

As a learning component, I tried both K-means and hierarchical clustering. Although neither method gave 100% accurate groupings, hierarchical clustering seemed to work better when the cluster groupings were examined. Thus, I'm including only that method here.

As an important parameter, the number of clusters was fixed at 49 because the reference document defines 48 events for documentation purposes. One extra cluster was included to accommodate events that cannot be fitted into the 48 defined clusters. In the end, the results of this method were not completely satisfactory.

  #First extract unique names
  uniq_EV <- unique(sdata$EVTYPE)

  #Find string distance using stringdist package
  strDist  <- stringdistmatrix(uniq_EV,uniq_EV, method = "jw")
  rownames(strDist) <- uniq_EV
  
  #Apply Hierarchical clustering
  hc <- hclust(as.dist(strDist))
  Cut_EV <- cutree(hc, k = 49)
  Cut_EV <- as.data.frame(Cut_EV)
  
  #Extracting row names and creating a variable
  Cut_EV$Event_Type <- rownames(Cut_EV)
  colnames(Cut_EV) <- c("Cluster", "Event_Type")
  Cut_EV_df <- tbl_df(Cut_EV)
  Cut_EV_df <- Cut_EV_df %>% arrange(desc(Cluster))
  sdat_dup_df <- tbl_df(sdata)
  #Merge with original data the found cluster numbers
  sdat_dup_df <- left_join(sdat_dup_df, Cut_EV_df, c("EVTYPE"="Event_Type")) 
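
To eyeball the grouping quality, a purely illustrative inspection of a single cluster's member labels can be run; the cluster number chosen here is arbitrary.

  #Illustrative inspection: raw labels that landed in one arbitrary cluster
  as.data.frame(Cut_EV_df[Cut_EV_df$Cluster == 1, ])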

Identifying Cluster Names

Once the clusters have been identified for the various event types, the next step is to name them. There were two possibilities. One was to calculate word frequencies in each cluster and assign the most frequently occurring keyword; the limitation is that the most frequent word, or in this case event, need not be the most damaging one. So cluster naming in this analysis is based on the most damaging event in each group.
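
For completeness, the rejected word-frequency alternative might look like the sketch below. The word_freq helper is hypothetical and not part of the analysis.

#Sketch of the rejected alternative: name a cluster by its most frequent word
#(word_freq is a hypothetical helper, shown for illustration only)
word_freq <- function(labels) {
  words <- unlist(strsplit(labels, "[^A-Z]+")) #split uppercase labels into words
  names(sort(table(words), decreasing = TRUE))[1]
}
word_freq(c("FLASH FLOODS", "RIVER FLOODS")) #returns "FLOODS"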

  #First filter for total fatalities
  sdat_summary <- sdat_dup_df %>% group_by(Cluster,EVTYPE) %>%
                  summarise(totfatalities = sum(FATALITIES))
  #Initialise the name column, then give each cluster the name of its
  #most fatal event type
  sdat_dup_df$ClustName <- NA_character_
  for(i in 1:49)
  {
   tmp <- filter(sdat_summary, Cluster == i)
   sdat_dup_df$ClustName[sdat_dup_df$Cluster == i] <- tmp$EVTYPE[which.max(tmp$totfatalities)]
  }
  
  rm(sdat_summary,tmp)
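
A quick illustrative check confirms that every cluster received exactly one name:

#Illustrative check: each cluster number should map to a single name
chk <- unique(as.data.frame(sdat_dup_df)[, c("Cluster", "ClustName")])
nrow(chk) #expected to be 49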

Filter only relevant events

#now filter data for population damage 
popd <- sdat_dup_df %>% select(ClustName,FATALITIES, INJURIES) %>%   
                    filter(FATALITIES >0 | INJURIES >0)
#filter for property damage, crop damage
econd <- sdat_dup_df %>% select(ClustName,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP) %>%
         filter(PROPDMG >0 | CROPDMG >0)
#Modify for dplyr
popd_df <- tbl_df(popd)
econd_df <- tbl_df(econd)

Create New Economic Variables and Summarise

The next step is to create new variables for property and crop damages. For this, the scales encoded in PROPDMGEXP and CROPDMGEXP are used: K for thousands, M for millions and B for billions of dollars.

#Map the exponent codes to multipliers; any other code is treated as 1
rpStr <- function(x) { x <- toupper(x); if(x == "K") 1E3 else if(x == "M") 1E6 else if(x == "B") 1E9 else 1 }

#Create new variable
econd_df <-  mutate(econd_df, 
                   PROPDMG_COST = PROPDMG*sapply(PROPDMGEXP, FUN = rpStr),
                   CROPDMG_COST = CROPDMG*sapply(CROPDMGEXP, FUN = rpStr))
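
Since any unrecognised exponent code silently becomes a multiplier of 1, an illustrative tabulation makes the unhandled codes visible:

#Illustrative check: which exponent codes actually occur in the damage rows;
#anything other than K/M/B is treated as a multiplier of 1 above
table(econd_df$PROPDMGEXP)
table(econd_df$CROPDMGEXP)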

#Population damage summary data
smrypop <- popd_df %>% group_by(ClustName) %>% summarise(totfat = sum(FATALITIES), totinj = sum(INJURIES)) %>% arrange(desc(totfat))

#Economic damage summary data
smryecon <- econd_df %>% group_by(ClustName) %>% summarise(totpropdmg = sum(PROPDMG_COST), totcropdmg = sum(CROPDMG_COST)) %>% arrange(desc(totpropdmg))
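
As a quick usage check, the top of each summary table can be printed before plotting:

#Peek at the worst offenders before plotting (illustrative)
head(smrypop, 3)
head(smryecon, 3)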

Results

First, let us examine the top 10 events that contribute to population damage. Here is the plot.

Population damage

ggpl <- ggplot(smrypop[1:10,], aes(reorder(ClustName,totfat),totfat))+
geom_bar(aes(fill = ClustName),stat="identity", width=0.5,position = position_dodge(width = 0.9))+coord_flip()+
      theme_bw()+
       labs(x="Event Types", y="Total Fatalities")+
       labs(title="Total Fatalities Across US from 1950-2015 due to Storm Events")

print(ggpl)

As observed from the graph, TORNADO and EXCESSIVE HEAT are the biggest contributors to fatalities.

Economic Damage

Here is the plot for top 10 economic damages.

ggpl_prop <- ggplot(smryecon[1:10,], aes(reorder(ClustName,totpropdmg),totpropdmg/1E6))+
  geom_bar(aes(fill = ClustName),stat="identity", width=0.5,position = position_dodge(width = 0.9))+coord_flip()+
  theme_bw()+
  labs(x="Event Types", y="Total Property Damages")+
  labs(title="Total Property Damages Across US from 1950-2015 due to Storm Events")

print(ggpl_prop)

With regard to economic consequences, FLOODS contribute the highest property damage, followed by TORNADO.
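
A possible extension, sketched here only and not part of the analysis above, is to rank events by combined property plus crop damage:

#Sketch: rank events by total (property + crop) damage
smrytot <- econd_df %>% group_by(ClustName) %>%
           summarise(totdmg = sum(PROPDMG_COST + CROPDMG_COST)) %>%
           arrange(desc(totdmg))
head(smrytot, 5)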