Analysis of historical storm data of USA

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

The goal here is to determine which event type has been the costliest so far, in terms of both monetary damage and fatalities.

We first load the data into R using read.csv. We then subset the data to the columns required for the analysis further down the pipeline. For simplicity, we restrict the data to the period 1995-2011, because there are comparatively few records before 1995, so they do not provide accurate information about storms in the earlier decades.

From the analysis we found that, in terms of monetary losses, Thunderstorm winds contribute the most by a wide margin, followed at some distance by Flash Floods.
In terms of fatalities, we found that Excessive Heat has caused the most deaths, closely followed by Tornadoes.

Configuring Libraries and Reading the Data

#Loading libraries
set.seed(123)
library("lubridate")
library("plyr")
library("ggplot2")

#Reading the data
stormdata0<-read.csv("StormData.csv")

Data Processing

Since we are interested only in property damages, fatalities and injuries, we will remove other irrelevant columns for the analysis below.

#Keeping only the relevant columns
stormdata1<-stormdata0[,c("BGN_DATE","COUNTY","COUNTYNAME","STATE","EVTYPE","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP")]

The following chunk of code converts the date on which each event occurred to a date-time type and produces two new columns for the month and year of occurrence.

#Data massaging
stormdata1$BGN_DATE<-mdy_hms(stormdata1$BGN_DATE)
stormdata1$month<-month(stormdata1$BGN_DATE)
stormdata1$year<-year(stormdata1$BGN_DATE)
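As a rough illustration of what this step produces, the sketch below does the same parsing in base R on a few made-up values. It assumes the dates follow a month/day/year hour:minute:second layout (which is what mdy_hms expects); the sample strings and the names bgn, parsed, event_month and event_year are hypothetical and not taken from the dataset.

```r
# Hypothetical sample values in the assumed "m/d/Y H:M:S" layout
bgn <- c("4/18/1950 0:00:00", "12/5/1995 0:00:00", "7/4/2011 0:00:00")

# Parse as date-times in UTC, then extract month and year as integers
parsed <- as.POSIXct(bgn, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
event_month <- as.integer(format(parsed, "%m"))
event_year  <- as.integer(format(parsed, "%Y"))

data.frame(bgn, event_month, event_year)
```

Fixing the time zone explicitly avoids the result depending on the machine's local settings.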

Certain events caused colossal losses, so the dataset stores each damage estimate as a number in “PROPDMG” plus a magnitude indicator in “PROPDMGEXP” (for example, K for thousands, M for millions and B for billions). To obtain a single column holding the total calculated loss, the following chunk of code replaces each indicator with its numeric multiplier.

stormdata1$PROPDMGEXP<-mapvalues(stormdata1$PROPDMGEXP,from = c("","-","?","+","0","1","2","3","4","5","6","7","8","B","h","H","K","m","M"),
to = c("1","0","0","0","1","10","100","1000","10000","100000","1000000","10000000","100000000","1000000000","100","100","1000","1000000","1000000")
)
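The same idea can be sketched in base R with a named lookup vector. This is only an illustration on toy rows, not the code used in this report: exp_lookup, codes and amount are hypothetical names, and the sketch covers only the letter codes. Blank codes would need separate handling, because an empty string never matches a name when indexing in R.

```r
# Multiplier for each letter code, mirroring the mapping used above
exp_lookup <- c("H" = 1e2, "h" = 1e2, "K" = 1e3, "M" = 1e6, "m" = 1e6, "B" = 1e9)

# Hypothetical toy rows: a damage figure and its exponent code
codes  <- c("K", "M", "B", "K", "h")
amount <- c(25, 2.5, 1, 0.5, 3)

# Look up one multiplier per row and compute the damage in dollars
multiplier <- unname(exp_lookup[codes])
damage_usd <- amount * multiplier
damage_usd
```

A lookup vector keeps the codes and their multipliers side by side, which makes a mismatch between the two easier to spot than in two parallel from/to vectors.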

The following chunk of code converts the multiplier to a numeric type and then calculates the total damage.

#levels() returns only the distinct labels, so convert row by row via as.character()
stormdata1$PROPDMGEXP2<-as.numeric(as.character(stormdata1$PROPDMGEXP))
stormdata1$Tot_damage<-as.numeric(stormdata1$PROPDMG * stormdata1$PROPDMGEXP2)
stormdata1$EVTYPE<-as.character(stormdata1$EVTYPE)
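One detail worth checking in this step is how a factor column is turned into numbers. The toy example below (the vector f and the names per_level and per_row are hypothetical) shows that levels() yields one value per distinct label, whereas going through as.character() yields one value per row.

```r
# Hypothetical factor column of multipliers, one entry per row
f <- factor(c("1000", "1000000", "1000", "1", "1000"))

# levels() returns the distinct labels only, not one value per row
per_level <- as.numeric(levels(f))

# Converting through as.character() keeps the row order and length
per_row <- as.numeric(as.character(f))

length(per_level)  # number of distinct labels
length(per_row)    # number of rows
```

If the shorter vector is assigned to a data frame column, R recycles it whenever its length divides the number of rows, so the mistake can pass silently and pair rows with the wrong multipliers.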

Plots

The histogram of record counts by year shows that there are far fewer records for the years before the 1990s. Hence we will subset the data to the period from 1995 onwards.

#By year
g0<-ggplot(data=stormdata1,aes(x=year))+geom_histogram(binwidth=1,fill = "blue")+scale_x_continuous(breaks=seq(1950,2011,by=5))
g0<-g0+ggtitle("Total records by year")+theme(plot.title = element_text(hjust=0.5))+ylab("Frequency")
g0

We subset to the years for which the data is rich. This reduces the bias towards later years.

stormdata2<-stormdata1[stormdata1$year>=1995,]
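The counts behind the histogram and the effect of the cutoff can also be checked numerically. The sketch below uses a made-up vector of years in place of stormdata1$year; years, records_per_year, cutoff and recent are hypothetical names.

```r
# Hypothetical toy years standing in for stormdata1$year
years <- c(1950, 1950, 1995, 1995, 1995, 2003, 2011, 2011)

# Records per year, the same quantity the histogram displays
records_per_year <- table(years)
records_per_year

# Keep only the records from the cutoff year onwards
cutoff <- 1995
recent <- years[years >= cutoff]
length(recent)
```

Comparing the length before and after the filter shows how many records the cutoff discards.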

Now we list the top events by damage. We also found that certain events (in this case thunderstorms) were sometimes reported under abbreviated names, so we merge the data for “TSTM WIND” into “THUNDERSTORM WIND”. We could have done this cleaning for all event types, but because the dataset is large, the number of distinct event labels is high, and our focus is only on the costliest events, we restrict ourselves to this one change.

events<-aggregate(Tot_damage~EVTYPE,data=stormdata2,FUN=sum)
top_events<-events[order(events$Tot_damage,decreasing = TRUE),][1:10,]

#Replacing TSTM WIND with THUNDERSTORM WIND because both denote the same event
stormdata2$EVTYPE[stormdata2$EVTYPE=="TSTM WIND"]<-"THUNDERSTORM WIND"
top_events$EVTYPE[top_events$EVTYPE=="TSTM WIND"]<-"THUNDERSTORM WIND"
#Aggregating again
top_events<-aggregate(Tot_damage~EVTYPE,data=top_events,FUN=sum) 
top_events

##              EVTYPE   Tot_damage
## 1       FLASH FLOOD 1.286839e+14
## 2             FLOOD 8.772437e+13
## 3              HAIL 5.709295e+13
## 4        HEAVY SNOW 1.077879e+13
## 5         HIGH WIND 3.277512e+13
## 6         LIGHTNING 5.251686e+13
## 7 THUNDERSTORM WIND 2.208265e+14
## 8           TORNADO 1.270806e+14
## 9      WINTER STORM 1.341867e+13
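An alternative ordering of the same steps is to normalise the label first and aggregate once afterwards. The sketch below shows this on a small made-up data frame; toy and totals are hypothetical names, and the numbers are illustrative only.

```r
# Hypothetical toy data with the same column names as stormdata2
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "FLOOD", "TSTM WIND", "FLOOD"),
  Tot_damage = c(10, 20, 5, 30, 15),
  stringsAsFactors = FALSE
)

# Merge the abbreviated label into the full one before aggregating
toy$EVTYPE[toy$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"

# Sum damage per event type and sort from largest to smallest
totals <- aggregate(Tot_damage ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(totals$Tot_damage, decreasing = TRUE), ]
totals
```

Normalising before aggregating means the ranking already reflects the merged totals, so a label that is split across two spellings cannot fall out of the top list by accident.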

Top events by monetary damage. We find that “Thunderstorm winds” have caused the most damage: over $200,000 billion since 1995, according to the totals computed above.

#By total monetary damage
Damage_by_events<-aggregate(Tot_damage~EVTYPE,data=stormdata2,FUN=sum)
Damage_by_events_10<-data.frame(Damage_by_events[order(Damage_by_events$Tot_damage, decreasing = TRUE),][1:5,])
colnames(Damage_by_events_10)<-c("Event","Tot_Damage")

g1<-ggplot(data=Damage_by_events_10,aes(x=Event,y=as.integer(Tot_Damage/10^9),fill = Tot_Damage))+geom_bar(stat="identity")+scale_x_discrete(limits=Damage_by_events_10$Event)
g1<-g1+theme(plot.title = element_text(hjust=0.5))+ggtitle("Total Damage by events")+ylab("Damage in Billion $$")
g1<-g1+theme(axis.text.x = element_text(angle = 90))
g1

The following are the top 5 events by monetary damage. To make subsetting on them easier, let's first create a vector for them.

top_5_events<-c("TORNADO","FLASH FLOOD","HAIL","FLOOD","THUNDERSTORM WIND")

The following plot shows that “Excessive Heat” caused the most fatalities, even more than “Thunderstorm winds”.

#By total fatalities
fatalities_by_events<-aggregate(FATALITIES~EVTYPE,data=stormdata2,FUN=sum)
fatalities_by_events_10<-fatalities_by_events[order(fatalities_by_events$FATALITIES, decreasing = TRUE),][1:5,]
colnames(fatalities_by_events_10)<-c("Event","Fatalities")

g2<-ggplot(data=fatalities_by_events_10,aes(x=Event,y=Fatalities,fill = Fatalities))+geom_bar(stat="identity")+scale_x_discrete(limits=fatalities_by_events_10$Event)
g2<-g2+theme(plot.title = element_text(hjust=0.5))+ggtitle("Fatalities by events")+ylab("Total fatalities")
g2<-g2+theme(axis.text.x = element_text(angle = 90))
g2
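Both rankings above follow the same pattern: sum a measure per event type, sort, and keep the largest few. As a sketch only, that pattern could be wrapped in a small helper. The function top_n_by and the toy data are hypothetical and are not part of the analysis in this report.

```r
# Hypothetical helper: sum `value` per `group` and return the n largest totals
top_n_by <- function(df, group, value, n = 5) {
  totals <- aggregate(df[[value]], by = list(df[[group]]), FUN = sum)
  names(totals) <- c(group, value)
  totals <- totals[order(totals[[value]], decreasing = TRUE), ]
  head(totals, n)
}

# Hypothetical toy data with illustrative numbers
toy <- data.frame(
  EVTYPE = c("HEAT", "TORNADO", "HEAT", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(4, 3, 6, 1, 2, 0),
  stringsAsFactors = FALSE
)

top2 <- top_n_by(toy, "EVTYPE", "FATALITIES", n = 2)
top2
```

The same call with a damage column in place of the fatalities column would give the damage ranking.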

Results

Type of event most harmful to population health: Excessive Heat
Type of event with the greatest economic consequences: Thunderstorm wind (only after combining “TSTM WIND” and “THUNDERSTORM WIND”)