Synopsis

This project was completed as an assignment for the Coursera Reproducible Research course, part of the Data Science Specialization. It explores the NOAA Storm Database; the data can be found here: Storm Data.

The analysis investigates which types of severe weather events have the greatest impact on the population (number of fatalities) and on the economy (financial losses from damage to property and to agriculture, i.e. crops).

Load required packages for this analysis

The analysis utilizes two external packages.
1. dplyr - For data manipulation
2. ggplot2 - For making plots

library(dplyr)

library(ggplot2)

Data Processing

1. Download and Load

Here we check whether the data file already exists in the working directory; if it does not, we download it. Then we load the data.

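if (!file.exists("./repdata_data_StormData.csv.bz2")) {
  # download.file() needs a destination path; mode = "wb" keeps the archive intact on Windows
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "./repdata_data_StormData.csv.bz2", mode = "wb")
}

# bzfile() lets read.csv() decompress the archive on the fly
StormData <- read.csv(bzfile("./repdata_data_StormData.csv.bz2"))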

2. Clean Up the Event Types

Here we concentrate on removing inconsistencies caused by similar events being named differently:

  • Letter case - Convert all characters in event names to lower case
  • Singulars and plurals - Remove a trailing “s”
  • Digits - Remove all digits from event names
  • Punctuation - Remove all punctuation
  • Misspellings - Correct visible misspellings

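
The steps listed above can be tried out on a handful of made-up labels before running them on the full data. The sketch below is only an illustration: the helper name normalize_labels and the toy labels are hypothetical, and it collapses whitespace with a simpler pattern than the analysis itself uses.

```r
# Hypothetical helper: applies the normalisation steps in the same order
# as the analysis to a character vector of event labels.
normalize_labels <- function(x) {
  x <- tolower(x)                          # letter case
  x <- sub("s$", "", x)                    # one trailing "s" (plural)
  x <- gsub("[[:digit:]]", "", x)          # digits
  x <- gsub("[[:punct:]]", "", x)          # punctuation
  x <- gsub("tstm", "thunderstorm", x)     # one of the abbreviation fixes
  x <- gsub("\\s+", " ", x)                # collapse runs of whitespace
  trimws(x)                                # trim leading/trailing whitespace
}

# Made-up labels in the style of EVTYPE
toy <- c("TSTM WIND", "Thunderstorm Winds", "HAIL 75",
         "Flash Flood/ Street", " High  Surf ")
normalize_labels(toy)
```

Note that the order of the steps matters: because the trailing "s" is removed before digits and whitespace, a label such as "WINDS 50" ends up as "winds" rather than "wind".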
Events <- StormData$EVTYPE

# Number of unique events before removing inconsistencies, typos, and unnecessary details

unique_before_cleanup <- length(unique(Events))  


Events <- tolower(Events) # Convert to lower case

Events <- sub("s$", "", Events)   # Remove a trailing "s" (plural)

Events <- gsub("[[:digit:]]", "", Events)   # Remove all the digits

Events <- gsub("[[:punct:]]", "", Events)  # Remove all punctuations


# Correct some visible typos/abbreviations

Events<-gsub("tstm", "thunderstorm", Events)

Events<-gsub("torndao", "tornado", Events)

Events<-gsub("vog", "fog", Events)

Events<-gsub("avalance", "avalanche", Events)

# Collapse repeated whitespace and trim leading/trailing whitespace
Events<-gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", Events, perl=TRUE)

# Let's remove unnecessary details, e.g. "hurricanegenerated swell" and "hurricane emily" are both
# hurricanes

for (i in seq_along(Events)) {
  
  if (grepl("hurricane", Events[i])) {Events[i] <- "hurricane"}
  else if (grepl("tornado", Events[i])) {Events[i] <- "tornado"}
  else if (grepl("blizzard", Events[i])) {Events[i] <- "blizzard"}
  else if (grepl("thunderstorm", Events[i])) {Events[i] <- "thunderstorm"}
  else if (grepl("hail", Events[i])) {Events[i] <- "hail"}
  else if (grepl("frost|freeze", Events[i])) {Events[i] <- "freeze"}
  else if (grepl("flood|high water", Events[i])) {Events[i] <- "flood"}
  else if (grepl("snow|ice|icy", Events[i])) {Events[i] <- "snow/ice"}
  else if (grepl("lightning", Events[i])) {Events[i] <- "lightning"}
  else if (grepl("rain", Events[i])) {Events[i] <- "rain"}
  else if (grepl("warm|heat", Events[i])) {Events[i] <- "heat"}
  else if (grepl("wind", Events[i])) {Events[i] <- "wind"}
  else if (grepl("volcanic", Events[i])) {Events[i] <- "volcanic"}
}


# Number of unique Event types after cleaning up

unique_after_cleanup <- length(unique(Events))
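
The loop above tests every record one at a time, which is slow on a data set of this size. Below is a minimal sketch of a vectorized alternative that keeps the same first-match-wins priority as the if/else chain. The names patterns and collapse_events are hypothetical and not part of the original analysis, and the toy labels are made up.

```r
# Ordered patterns: earlier entries take priority, mirroring the if/else chain.
patterns <- c(
  hurricane = "hurricane", tornado = "tornado", blizzard = "blizzard",
  thunderstorm = "thunderstorm", hail = "hail", freeze = "frost|freeze",
  flood = "flood|high water", "snow/ice" = "snow|ice|icy",
  lightning = "lightning", rain = "rain", heat = "warm|heat",
  wind = "wind", volcanic = "volcanic"
)

# Hypothetical helper: one vectorized grepl() per pattern instead of one
# pass of the if/else chain per record.
collapse_events <- function(x, patterns) {
  out <- x
  done <- rep(FALSE, length(x))              # records already assigned a label
  for (label in names(patterns)) {
    hit <- !done & grepl(patterns[[label]], x)
    out[hit] <- label
    done <- done | hit
  }
  out
}

toy <- c("hurricane emily", "thunderstorm wind", "hail", "flash flood",
         "ice storm", "heavy rain", "high wind", "drought")
collapse_events(toy, patterns)
```

Labels that match none of the patterns, such as "drought", are left unchanged, as in the loop.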

The cleanup reduced the number of unique event types from 985 to 239, a substantial reduction, so we can continue with the analysis.
Let's create a new data frame holding each event type and its frequency of occurrence.
Here we use the dplyr package.

# as_tibble() replaces tbl_df(), which is deprecated in current dplyr releases
Clean_EVtype_df <- arrange(as_tibble(as.data.frame(table(Events))), desc(Freq))

# Let's see what percentage of all records the ten most frequent events account for

top1o_percent <- round(((sum(Clean_EVtype_df$Freq[1:10])/sum(Clean_EVtype_df$Freq))*100), 2)

The ten most frequent event types account for 95.74% of all recorded events.
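
The same idea extends to any cut-off: a cumulative share shows how much of the data is covered as more event types are kept. The counts below are made up purely for illustration and are not taken from the storm data.

```r
# Made-up frequencies, for illustration only
freq <- c(thunderstorm = 50, hail = 30, flood = 12, tornado = 5, other = 3)

# Sort from most to least frequent, then accumulate the share of the total
freq <- sort(freq, decreasing = TRUE)
cum_share <- round(cumsum(freq) / sum(freq) * 100, 2)
cum_share
```

With these toy counts the first three types already cover 92% of the records.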

Here is a view of the ten most frequent severe events after our cleanup:

head(Clean_EVtype_df, 10)
## # A tibble: 10 x 2
##            Events   Freq
##            <fctr>  <int>
##  1   thunderstorm 336807
##  2           hail 289283
##  3          flood  82730
##  4        tornado  60701
##  5           wind  28123
##  6       snow/ice  19825
##  7      lightning  15764
##  8           rain  12163
##  9   winter storm  11436
## 10 winter weather   7045

This looks fine, and covering 95.74% of the records is good enough, so we can add the cleaned event types to our data frame.

StormData$Clean_EvType <- as.factor(Events)

Calculate the Number of fatalities by Major Events

Our data now contains a new variable, Clean_EvType, which we use in our analyses instead of the untidy original EVTYPE.
Next we create a summary (Fatalities_df) of our data consisting of the unique event types and the total number of fatalities each has caused.

Fatalities_df <- arrange(aggregate(FATALITIES ~ Clean_EvType, data = StormData,
                                   
                                   FUN = sum), -FATALITIES)

names(Fatalities_df) <- c("EventType", "FATALITIES")

Calculate the Total Economic Consequences by Major Events

Here we calculate the total economic loss associated with each unique event type from our cleanup.
First we define a function that converts the exponent codes (h = hundred, k = thousand, m = million, b = billion) into the corresponding power of ten; a numeric code is used as the exponent directly, and anything else is treated as 0.

CalTotalDamage <- function(e) {
  
  if (e == "h")
    
    return(2)
  
  else if (e == "k")
    
    return(3)
  
  else if (e == "m")
    
    return(6)
  
  else if (e == "b")
    
    return(9)
  
  else if (!is.na(as.numeric(e)))
    
    return(as.numeric(e))
  
  else{
    
    return(0)
  }
  
}

Next, use the function CalTotalDamage defined above to convert the exponent codes to numeric powers of ten.

propertyExp <- sapply(tolower(StormData$PROPDMGEXP), FUN = CalTotalDamage)

cropExp <- sapply(tolower(StormData$CROPDMGEXP), FUN = CalTotalDamage)
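
Calling a function once per record with sapply works, but it is slow on roughly 900,000 rows and as.numeric emits a warning for every non-numeric code. A lookup table is one possible alternative. The sketch below is an assumption-laden illustration: exp_lookup and exponent_of are hypothetical names, and it treats only plain digit strings as numeric codes.

```r
# Hypothetical lookup table: letter code -> power of ten
exp_lookup <- c(h = 2, k = 3, m = 6, b = 9)

# Hypothetical helper: converts a whole vector of exponent codes at once.
exponent_of <- function(codes) {
  codes <- tolower(as.character(codes))
  out <- unname(exp_lookup[codes])               # NA for anything not h/k/m/b
  is_digit <- grepl("^[0-9]+$", codes)           # numeric codes such as "5"
  out[is_digit] <- as.numeric(codes[is_digit])   # used directly as the exponent
  out[is.na(out)] <- 0                           # blanks and symbols fall back to 0
  out
}

exponent_of(c("K", "m", "", "5", "?", "B", "h"))
```

For the made-up codes above this gives 3, 6, 0, 5, 0, 9, 2, matching what CalTotalDamage returns for each code individually.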

Then we calculate the total financial consequence of each recorded event (property damage plus crop damage) and add it to our original data frame as the variable TotalDamage.

StormData$TotalDamage <- StormData$PROPDMG * (10**propertyExp) +
  
  StormData$CROPDMG * (10**cropExp)
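
A quick worked example makes the arithmetic concrete. The three records below are made up, with their exponent codes already converted to powers of ten (K = 3, M = 6, blank = 0).

```r
# Made-up records: damage amounts and their already-converted exponents
prop_dmg <- c(25, 2.5, 0)
prop_exp <- c(3, 6, 0)     # "K", "M", blank
crop_dmg <- c(0, 10, 5)
crop_exp <- c(0, 3, 3)     # blank, "K", "K"

# Same formula as above: amount * 10^exponent, property plus crops
total <- prop_dmg * 10^prop_exp + crop_dmg * 10^crop_exp
total
```

The totals are 25,000, 2,510,000 and 5,000 dollars. A blank or unrecognised code maps to an exponent of 0, so the amount is multiplied by 1 and counted as recorded.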

Next we create a summary (TotalDamages_df) of our data consisting of the unique event types and the total economic damage each has caused.

TotalDamages_df <- arrange(aggregate(TotalDamage ~ Clean_EvType, data = StormData,
                                     
                                     FUN = sum), -TotalDamage)

names(TotalDamages_df) <- c("EventType", "EconomicDamage")

Results

This section splits the results by the impact of event types on:
* Fatalities
* Economy

Events causing major Fatalities in USA

Here we utilize the dataframe Fatalities_df created above.

First we view the top ten events causing the most fatalities.

head(Fatalities_df, 10)
##       EventType FATALITIES
## 1       tornado       5661
## 2          heat       3178
## 3         flood       1528
## 4     lightning        817
## 5  thunderstorm        729
## 6          wind        690
## 7   rip current        572
## 8      snow/ice        271
## 9     avalanche        225
## 10 winter storm        216

Finally, we plot a bar graph of the event types and their fatalities.

ggplot(data = Fatalities_df[1:15,], aes(x = reorder(EventType, FATALITIES),
                                        
                                        y = FATALITIES, fill = FATALITIES)) +
  
  geom_bar(stat = "identity") +
  
  ggtitle("EVENTS CAUSING MOST FATALITIES (TOP 15)") +
  
  xlab("Event Type") +
  
  coord_flip()

Events causing major Economic consequences

Let's view the ten events causing the highest economic consequences in the USA.

head(TotalDamages_df, 10)
##         EventType EconomicDamage
## 1           flood   180592274935
## 2       hurricane    90271472810
## 3         tornado    59020779947
## 4     storm surge    43323541000
## 5            hail    19024452136
## 6         drought    15018672000
## 7    thunderstorm    12456462688
## 8        snow/ice    10140542710
## 9  tropical storm     8382236550
## 10           wind     7035769523

Finally, let's plot these results using the ggplot2 package.

ggplot(data = TotalDamages_df[1:15,], aes(x = reorder(EventType, EconomicDamage),
                                          
                                          y = EconomicDamage,
                                          
                                          fill = EconomicDamage)) +
  
  geom_bar(stat = "identity") +
  
  ggtitle("EVENTS CAUSING MOST SEVERE\n ECONOMIC CONSEQUENCES (TOP 15)") +
  
  xlab("Event Type") +
  
  coord_flip()