This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to identify the type of storms that have a major impact on public health and economy. In this report, we have quantified the impact on public health as a function of public fatalities and injuries and economic impact as a function of property and crop damage. From our analysis, we concluded that Tornados and Floods cause the most damage to public health and economy respectively.
The data for this project came from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Storm Data
Documentation for this data is also available at Storm Data Documentation
library(dplyr)
library(ggplot2)
We will first download the data from the download link. Then we will read the data from the bz2 compressed comma separated raw file into an R object.
## Download the data file
fl_nm = "storm_data.csv.bz2"
if(!file.exists(fl_nm)) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",fl_nm)
}
## Read the data
strm_dt <- read.csv("storm_data.csv.bz2",stringsAsFactors = FALSE)
After reading the data, we will summarize the R object to familiarize ourselves with the data.
str(strm_dt)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Lastly, we will check few rows to make sure everything looks good.
head(strm_dt[,4:10])
## TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE BGN_AZI
## 1 CST 97 MOBILE AL TORNADO 0
## 2 CST 3 BALDWIN AL TORNADO 0
## 3 CST 57 FAYETTE AL TORNADO 0
## 4 CST 89 MADISON AL TORNADO 0
## 5 CST 43 CULLMAN AL TORNADO 0
## 6 CST 77 LAUDERDALE AL TORNADO 0
In this section, we will transform the raw data into a tidy dataset that we will use for our analysis.
## Making a copy of the raw data into a new object
tdy_strm_dt <- strm_dt
First, Property damage and crop damage is expressed as a combination of values and exponents. In this step, we will convert them into absolute numbers.
Then, we will remove the records with erroneous exponents from our dataset.
## Standardize the Property Damage and Crop Damage Exponents
tdy_strm_dt[tdy_strm_dt$PROPDMGEXP=="","PROPDMGEXP"] <- "0"
tdy_strm_dt[grep("[Bb]",tdy_strm_dt$PROPDMGEXP),"PROPDMGEXP"] <- "9"
tdy_strm_dt[grep("[Mm]",tdy_strm_dt$PROPDMGEXP),"PROPDMGEXP"] <- "6"
tdy_strm_dt[grep("[Kk]",tdy_strm_dt$PROPDMGEXP),"PROPDMGEXP"] <- "3"
tdy_strm_dt[tdy_strm_dt$CROPDMGEXP=="","CROPDMGEXP"] <- "0"
tdy_strm_dt[grep("[Kk]",tdy_strm_dt$CROPDMGEXP),"CROPDMGEXP"] <- "3"
tdy_strm_dt[grep("[Mm]",tdy_strm_dt$CROPDMGEXP),"CROPDMGEXP"] <- "6"
tdy_strm_dt[grep("[Bb]",tdy_strm_dt$CROPDMGEXP),"CROPDMGEXP"] <- "9"
## Segregate the error records
err_strm_dt <- rbind(tdy_strm_dt[-grep("[0-9]",tdy_strm_dt$PROPDMGEXP),],
tdy_strm_dt[-grep("[0-9]",tdy_strm_dt$CROPDMGEXP),])
## Remove the error records
tdy_strm_dt <- tdy_strm_dt[grep("[0-9]",tdy_strm_dt$PROPDMGEXP),]
tdy_strm_dt <- tdy_strm_dt[grep("[0-9]",tdy_strm_dt$CROPDMGEXP),]
## Calculate Total Damage
tdy_strm_dt$PROPDMGEXP <- as.numeric(tdy_strm_dt$PROPDMGEXP)
tdy_strm_dt$CROPDMGEXP <- as.numeric(tdy_strm_dt$CROPDMGEXP)
tdy_strm_dt$PROPDMGABS <- tdy_strm_dt$PROPDMG*10^tdy_strm_dt$PROPDMGEXP
tdy_strm_dt$CROPDMGABS <- tdy_strm_dt$CROPDMG*10^tdy_strm_dt$CROPDMGEXP
Lastly, we will retain only the relevant fields for our analysis.
## Retain only the required information
tdy_strm_dt$BGN_YR <- format(as.Date(tdy_strm_dt$BGN_DATE,"%m/%d/%Y"),"%Y")
sbst_tdy_strm_dt <- tdy_strm_dt[,c("REFNUM","BGN_YR","EVTYPE","FATALITIES","INJURIES","PROPDMGABS","CROPDMGABS")]
We will define the most harmful events as a function of fatalities and injuries caused to the general population.
First, we will calculate key aggregated metrics for each Event over the entire duration of the data.
## Calculate key aggregated metrics
sbst_tdy_strm_dt <- group_by(sbst_tdy_strm_dt, EVTYPE)
agg_strm_dt <- summarize(sbst_tdy_strm_dt, ttl_FATALITIES=sum(FATALITIES), ttl_INJURIES=sum(INJURIES),
ttl_DMG=sum(PROPDMGABS+CROPDMGABS), mean_FATALITIES=mean(FATALITIES),
mean_INJURIES=mean(INJURIES), mean_DMG=mean(PROPDMGABS+CROPDMGABS),
ttl_INSTANCES=n())
Then, we will rank the events based on Public fatalities and Injuries.
For the purpose of ranking, we are equating 20 injuries to a fatality.
Followed by this, we will identify the top 5 events with highest public health impact based on fatalities and injuries.
## Calculate the rank and top 5 events
agg_strm_dt$rank <- dense_rank(desc(agg_strm_dt$ttl_FATALITIES+agg_strm_dt$ttl_INJURIES*.05))
agg_strm_dt$legend <- agg_strm_dt$EVTYPE
agg_strm_dt[agg_strm_dt$rank>5,"legend"] <- "All Others"
## Plot the Fatalities and Injuries and label the top 5 events
plot(agg_strm_dt$ttl_FATALITIES, agg_strm_dt$ttl_INJURIES, col=as.factor(agg_strm_dt$legend), lwd=3, pch=8,
ylab = "Total Injuries", xlab = "Total Fatalities")
legend("topleft",unique(agg_strm_dt$legend), col = as.factor(unique(agg_strm_dt$legend)), lwd=3)
From the above scatter plot, we conclude that Tornados have the highest impact on the public health by far.
To ensure data capture is not a factor in our conclusion, we will re-run the same analysis on data since 2000.
## Re-run the same steps on data since 2000
sbst_tdy_strm_dt_2k <- subset(sbst_tdy_strm_dt, BGN_YR>=2000)
agg_strm_dt_2k <- summarize(sbst_tdy_strm_dt_2k, ttl_FATALITIES=sum(FATALITIES), ttl_INJURIES=sum(INJURIES),
ttl_DMG=sum(PROPDMGABS+CROPDMGABS), mean_FATALITIES=mean(FATALITIES),
mean_INJURIES=mean(INJURIES), mean_DMG=mean(PROPDMGABS+CROPDMGABS),
ttl_INSTANCES=n())
agg_strm_dt_2k$rank <- dense_rank(desc(agg_strm_dt_2k$ttl_FATALITIES+agg_strm_dt_2k$ttl_INJURIES*.05))
agg_strm_dt_2k$legend <- agg_strm_dt_2k$EVTYPE
agg_strm_dt_2k[agg_strm_dt_2k$rank>5,"legend"] <- "All Others"
## Plot the Fatalities and Injuries and label the top 5 events since 2000
plot(agg_strm_dt_2k$ttl_FATALITIES, agg_strm_dt_2k$ttl_INJURIES, col=as.factor(agg_strm_dt_2k$legend), lwd=3, pch=8,
ylab = "Total Injuries", xlab = "Total Fatalities")
legend("topleft",unique(agg_strm_dt_2k$legend), col = as.factor(unique(agg_strm_dt_2k$legend)), lwd=3)
The above scatter plot confirms that Tornados continues to have a major impact on public health. In fact, that is true for Excessive Heat, Lighting and Flash floods as well.
In this section, we will analyze the economic impact of these events.
Similar to the above analysis, we will first identify the Top 5 events with highest property and crop damage (in USD).
For our scatter plot, we will focus on two key aspects. 1) Total damage caused by an event 2) Mean damage caused by an event
## Calculate the rank and top 5 events
agg_strm_dt$rank <- dense_rank(desc(agg_strm_dt$ttl_DMG))
agg_strm_dt$legend <- agg_strm_dt$EVTYPE
agg_strm_dt[agg_strm_dt$rank>5,"legend"] <- "All Others"
## Plot the total and mean damage and label the top 5 events
plot(agg_strm_dt$ttl_DMG, agg_strm_dt$mean_DMG, col=as.factor(agg_strm_dt$legend), lwd=3, pch=8,
ylab = "Average Damage", xlab = "Total Damage")
legend("topleft",unique(agg_strm_dt$legend), col = as.factor(unique(agg_strm_dt$legend)), lwd=3)
From this scatter plot, we can draw following conclusions: