This project was done as an assignment for the Coursera Reproducible Research course, part of the Data Science Specialization. The project explores the NOAA Storm Database; the data can be found here: Storm Data.
This analysis investigates which types of severe weather events have the highest impact on the population (number of fatalities) and on the economy (financial loss from damage to property and to agriculture, i.e. crops).
The analysis utilizes two external packages.
1. dplyr - For data manipulation
2. ggplot2 - For making plots
library(dplyr)
library(ggplot2)
First check whether the data file exists in your working directory; if it does not, download it, then load the data.
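if (!file.exists("./repdata_data_StormData.csv.bz2")) {
  # download.file() requires a destination path; mode = "wb" keeps the archive binary-safe
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "./repdata_data_StormData.csv.bz2", mode = "wb")
}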
StormData <- read.csv(bzfile("./repdata_data_StormData.csv.bz2"))
Here we concentrate on removing inconsistencies caused by differences in the naming of similar events, such as letter case, plural forms, digits, punctuation, typos, abbreviations and unnecessary details.
Events <- StormData$EVTYPE
# Number of unique event types before removing inconsistencies, typos and unnecessary details
unique_before_cleanup <- length(unique(Events))
Events <- tolower(Events) # convert to lower cases
Events <- sub("s$", "", Events) # drop a trailing plural "s"
Events <- gsub("[[:digit:]]", "", Events) # Remove all the digits
Events <- gsub("[[:punct:]]", "", Events) # Remove all punctuations
# Correct some visible typos/abbreviations
Events<-gsub("tstm", "thunderstorm", Events)
Events<-gsub("torndao", "tornado", Events)
Events<-gsub("vog", "fog", Events)
Events<-gsub("avalance", "avalanche", Events)
# Collapse repeated whitespace and trim leading/trailing whitespace
Events<-gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", Events, perl=TRUE)
# Remove unnecessary details, e.g. "hurricanegenerated swell" and "hurricane emily" are both
# still a hurricane
for (i in seq_along(Events)) {
if (grepl("hurricane", Events[i])) {Events[i] <- "hurricane"}
else if (grepl("tornado", Events[i])) {Events[i] <- "tornado"}
else if (grepl("blizzard", Events[i])) {Events[i] <- "blizzard"}
else if (grepl("thunderstorm", Events[i])) {Events[i] <- "thunderstorm"}
else if (grepl("hail", Events[i])) {Events[i] <- "hail"}
else if (grepl("frost|freeze", Events[i])) {Events[i] <- "freeze"}
else if (grepl("flood|high water", Events[i])) {Events[i] <- "flood"}
else if (grepl("snow|ice|icy", Events[i])) {Events[i] <- "snow/ice"}
else if (grepl("lightning", Events[i])) {Events[i] <- "lightning"}
else if (grepl("rain", Events[i])) {Events[i] <- "rain"}
else if (grepl("warm|heat", Events[i])) {Events[i] <- "heat"}
else if (grepl("wind", Events[i])) {Events[i] <- "wind"}
else if (grepl("volcanic", Events[i])) {Events[i] <- "volcanic"}
}
# Number of unique Event types after cleaning up
unique_after_cleanup <- length(unique(Events))
The cleanup reduced the number of unique event types from 985 to 239, a reduction large enough to continue with the analysis.
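The grouping step above runs an if/else chain once per record, which can be slow on a dataset of this size. As a rough sketch only, the same "first matching pattern wins" idea can be written with vectorised `grepl()` calls. The sample values in `raw_events`, the names `patterns`, `labels`, `clean` and `grouped`, and the slightly simplified order of the cleanup steps are assumptions made for illustration, so this is not guaranteed to reproduce the counts reported above.

```r
# Hypothetical sample of raw EVTYPE values, for illustration only
raw_events <- c("TSTM WIND", "Hurricane Emily", "FLASH FLOODS",
                "  Heavy  Snow ", "HAIL 1.75", "Drought")

clean <- tolower(raw_events)
clean <- gsub("[[:digit:][:punct:]]", "", clean)  # drop digits and punctuation
clean <- gsub("\\s+", " ", trimws(clean))         # trim and collapse whitespace
clean <- sub("s$", "", clean)                     # drop a trailing plural "s"
clean <- gsub("tstm", "thunderstorm", clean)      # expand the abbreviation

# Patterns and labels in the same priority order as the if/else chain
patterns <- c("hurricane", "tornado", "blizzard", "thunderstorm", "hail",
              "frost|freeze", "flood|high water", "snow|ice|icy",
              "lightning", "rain", "warm|heat", "wind", "volcanic")
labels   <- c("hurricane", "tornado", "blizzard", "thunderstorm", "hail",
              "freeze", "flood", "snow/ice",
              "lightning", "rain", "heat", "wind", "volcanic")

# Apply the rules from lowest to highest priority, always matching against
# the cleaned text, so the highest-priority match is written last and wins
grouped <- clean
for (k in rev(seq_along(patterns))) {
  grouped[grepl(patterns[k], clean)] <- labels[k]
}

grouped
```

Here the loop runs over the 13 rules rather than over every record, and values that match no rule (such as "drought") are left unchanged, as in the original chain.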
Let's create a new data frame holding each event type and its frequency of occurrence.
Here we use the dplyr package.
# as_tibble() replaces tbl_df(), which is deprecated in current dplyr releases
Clean_EVtype_df <- arrange(as_tibble(as.data.frame(table(Events))), desc(Freq))
# Percentage of all records covered by the ten most frequent event types
top10_percent <- round(sum(Clean_EVtype_df$Freq[1:10]) / sum(Clean_EVtype_df$Freq) * 100, 2)
The ten most frequent event types account for 95.74% of all records.
Here is a view of the ten most frequent severe events after our cleanup:
head(Clean_EVtype_df, 10)
## # A tibble: 10 x 2
##    Events           Freq
##    <fctr>          <int>
##  1 thunderstorm   336807
##  2 hail           289283
##  3 flood           82730
##  4 tornado         60701
##  5 wind            28123
##  6 snow/ice        19825
##  7 lightning       15764
##  8 rain            12163
##  9 winter storm    11436
## 10 winter weather   7045
This looks fine, and 95.74% coverage is good enough, so we can add the cleaned event types to our data frame.
StormData$Clean_EvType <- as.factor(Events)
Our data now contain a new variable, Clean_EvType, which we shall use in our analyses instead of the untidy original EVTYPE.
Next we create a subset (Fatalities_df) of our data consisting of the unique event types and the total number of fatalities each has caused.
Fatalities_df <- arrange(aggregate(FATALITIES ~ Clean_EvType, data = StormData,
FUN = sum), -FATALITIES)
names(Fatalities_df) <- c("EventType", "FATALITIES")
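Since dplyr is already loaded, the same per-event totals could also be computed with a `group_by()` / `summarise()` pipeline. The sketch below uses a small made-up data frame (`toy`) in place of StormData, and the name `fatalities_by_event` is hypothetical; it is an alternative formulation, not the code used for the results in this report.

```r
library(dplyr)

# Hypothetical toy data standing in for StormData
toy <- data.frame(
  Clean_EvType = c("tornado", "heat", "tornado", "flood", "heat"),
  FATALITIES   = c(3, 2, 5, 1, 4)
)

fatalities_by_event <- toy %>%
  group_by(Clean_EvType) %>%                 # one group per cleaned event type
  summarise(FATALITIES = sum(FATALITIES)) %>% # total fatalities per group
  arrange(desc(FATALITIES)) %>%              # largest total first
  rename(EventType = Clean_EvType)

fatalities_by_event
```

With this toy input, tornado comes first with 8 fatalities, followed by heat with 6 and flood with 1.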
Here we aim to calculate the total economic loss associated with each unique event type, according to our cleanup.
First define a function that converts the exponent codes (h = hundred, k = thousand, m = million, b = billion) to usable values (powers of ten).
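# Map a single damage exponent code to a power of ten:
# h = 2, k = 3, m = 6, b = 9; numeric codes are used as-is; anything else gives 0
CalTotalDamage <- function(e) {
  # NA for non-numeric codes such as "" or "+"; the coercion warning is expected
  num <- suppressWarnings(as.numeric(e))
  if (e == "h")
    return(2)
  else if (e == "k")
    return(3)
  else if (e == "m")
    return(6)
  else if (e == "b")
    return(9)
  else if (!is.na(num))
    return(num)
  else {
    return(0)
  }
}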
Next use the function CalTotalDamage defined above to convert the exponent codes for property and crop damage to powers of ten.
propertyExp <- sapply(tolower(StormData$PROPDMGEXP), FUN = CalTotalDamage)
cropExp <- sapply(tolower(StormData$CROPDMGEXP), FUN = CalTotalDamage)
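Calling a function once per record with `sapply()` works, but a named-vector lookup can do the same conversion for a whole column at once. The following is a minimal sketch under the same mapping rules as CalTotalDamage; the names `exp_lookup` and `exp_to_power` are hypothetical and not part of the original analysis.

```r
# Lookup table for the letter codes described in the text
exp_lookup <- c(h = 2, k = 3, m = 6, b = 9)

# Hypothetical vectorised helper: converts a whole vector of codes at once
exp_to_power <- function(codes) {
  codes <- tolower(as.character(codes))
  power <- unname(exp_lookup[codes])          # NA where the code is not h/k/m/b
  digit <- grepl("^[0-9]+$", codes)           # purely numeric codes are used as-is
  power[digit] <- as.numeric(codes[digit])
  power[is.na(power)] <- 0                    # everything else ("", "+", "?") maps to 0
  power
}

exp_to_power(c("K", "m", "", "5", "?", "B"))
```

For the sample input this gives the powers 3, 6, 0, 5, 0 and 9.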
Then we calculate the total financial loss for each record (property damage plus crop damage) and add it to our original data frame as the variable TotalDamage.
StormData$TotalDamage <- StormData$PROPDMG * (10**propertyExp) +
StormData$CROPDMG * (10**cropExp)
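To make the arithmetic concrete, here is a small worked example for a single made-up record (the values and variable names are assumptions, not rows from the dataset). It also checks that `**`, as used above, is treated by R as the same operator as `^`.

```r
# Hypothetical record: PROPDMG = 25 with exponent "K", CROPDMG = 1.5 with exponent "M"
prop_dmg <- 25;  prop_power <- 3   # "K" -> 10^3
crop_dmg <- 1.5; crop_power <- 6   # "M" -> 10^6

# Same formula as TotalDamage: value * 10^power, summed over property and crops
total <- prop_dmg * 10^prop_power + crop_dmg * 10^crop_power
total

# R's parser translates ** to ^, so both spellings give the same result
stopifnot(10**3 == 10^3)
```

The property part contributes 25,000 and the crop part 1,500,000, for a total of 1,525,000.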
Next we create a subset (TotalDamages_df) of our data consisting of the unique event types and the total economic damage each has caused.
TotalDamages_df <- arrange(aggregate(TotalDamage ~ Clean_EvType, data = StormData,
FUN = sum), -TotalDamage)
names(TotalDamages_df) <- c("EventType", "EconomicDamage")
This section presents the impact of event types on:
* Fatalities
* Economy
Here we use the data frame Fatalities_df created above.
First we view the ten event types causing the most fatalities.
head(Fatalities_df, 10)
##       EventType FATALITIES
## 1       tornado       5661
## 2          heat       3178
## 3         flood       1528
## 4     lightning        817
## 5  thunderstorm        729
## 6          wind        690
## 7   rip current        572
## 8      snow/ice        271
## 9     avalanche        225
## 10 winter storm        216
Finally, make a bar plot of the event types and their fatalities.
ggplot(data = Fatalities_df[1:15,], aes(x = reorder(EventType, FATALITIES),
y = FATALITIES, fill = FATALITIES)) +
geom_bar(stat = "identity") +
ggtitle("EVENTS CAUSING MOST FATALITIES (TOP 15)") +
xlab("Event Type") +
coord_flip()
Let's view the ten event types with the highest economic consequences in the USA.
head(TotalDamages_df, 10)
##         EventType EconomicDamage
## 1           flood   180592274935
## 2       hurricane    90271472810
## 3         tornado    59020779947
## 4     storm surge    43323541000
## 5            hail    19024452136
## 6         drought    15018672000
## 7    thunderstorm    12456462688
## 8        snow/ice    10140542710
## 9  tropical storm     8382236550
## 10           wind     7035769523
Finally, let's plot these results using the ggplot2 package.
ggplot(data = TotalDamages_df[1:15,], aes(x = reorder(EventType, EconomicDamage),
y = EconomicDamage,
fill = EconomicDamage)) +
geom_bar(stat = "identity") +
ggtitle("EVENTS CAUSING MOST SEVERE\n ECONOMIC CONSEQUENCES (TOP 15)") +
xlab("Event Type") +
coord_flip()
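Raw dollar amounts of this magnitude are hard to read on an axis. One option, sketched here as a suggestion rather than as part of the original analysis, is to rescale the values to billions before plotting. The vector `damage` simply copies the top three totals from the table above; `damage_billions` is a hypothetical name.

```r
# Top three totals copied from the economic damage table above
damage <- c(flood = 180592274935, hurricane = 90271472810, tornado = 59020779947)

# Rescale to billions and round to one decimal place for readability
damage_billions <- round(damage / 1e9, 1)
damage_billions
```

In the plot this would amount to mapping `y = EconomicDamage / 1e9` and labelling the axis accordingly, for example with `ylab("Economic damage (billions)")`.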