by Joshua Jin
June 16, 2014
This assignment is to analyze and report public health and material damage of severe weather events. This document describes the data handling, data transformation, data analysis and reporting. It produced bar charts showing the top 10 most significant events in term of fatalities, injuries and economic damage caused. The result of the analysis is sufficient to answer the two questions listed below. Overall, tornadoes are the most hazardous to human health with 5,633 reported fatalities and 91,346 reported injuries, and floods have been responsible for the most economic damage, more than $150 billion US dollars.
The goal of the data analysis is to address the following questions:
Storm Data is an official publication of the National Oceanic and Atmospheric Administration. It receives data from Storm Data from the National Weather Service. The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
According to the documentation provided, 7 out of 37 available variables in the dataset are used for data analysis:
The data processing contains the following steps:
Note: From R, set the working directory first. Here some of data analysis is relying on sqldf package to leverage someone who is familiar with SQL language already. The charting is depended on ggplot2 package.
setwd("~/My_Projects/JohnsHDS/05_Reproducible_Research/Assignment_2")
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required namespace: tcltk
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
library(ggplot2)
z <- bzfile("repdata_data_StormData.csv.bz2", open = "r")
data <- read.csv(z)
close(z)
Create a data frame (fatalities.df) that contains the 10 most harmful events based on the total fatalities.
fatalities.df <- sqldf('
SELECT
SUM(FATALITIES) as TotalFatalities,
EVTYPE
FROM data
GROUP BY EVTYPE
ORDER BY TotalFatalities DESC
LIMIT 10')
## Loading required package: tcltk
# Report the top 10 fatalities and event types
fatalities.df
## TotalFatalities EVTYPE
## 1 5633 TORNADO
## 2 1903 EXCESSIVE HEAT
## 3 978 FLASH FLOOD
## 4 937 HEAT
## 5 816 LIGHTNING
## 6 504 TSTM WIND
## 7 470 FLOOD
## 8 368 RIP CURRENT
## 9 248 HIGH WIND
## 10 224 AVALANCHE
Create a data frame (injuries.df) that contains the 10 harmful events based on the total injuries.
injuries.df <- sqldf('
SELECT
SUM(INJURIES) as TotalInjuries,
EVTYPE
FROM data
GROUP BY EVTYPE
ORDER BY TotalInjuries DESC
LIMIT 10')
# Report the top 10 injuries and event types
injuries.df
## TotalInjuries EVTYPE
## 1 91346 TORNADO
## 2 6957 TSTM WIND
## 3 6789 FLOOD
## 4 6525 EXCESSIVE HEAT
## 5 5230 LIGHTNING
## 6 2100 HEAT
## 7 1975 ICE STORM
## 8 1777 FLASH FLOOD
## 9 1488 THUNDERSTORM WIND
## 10 1361 HAIL
Create a data frame damaged.event.data. It contains all records that has caused fatalities, injuries, property damages, and crop damages, including dollar multipliers (PROPDMGEXP and CROPDMGEXP).
damaged.event.data <- sqldf('
SELECT
EVTYPE,
FATALITIES,
INJURIES,
PROPDMG,
CROPDMG,
PROPDMGEXP,
CROPDMGEXP
FROM data
WHERE FATALITIES + INJURIES + PROPDMG + CROPDMG > 0
ORDER BY FATALITIES DESC ')
Calculate the total damages for each of harmful events by multiply the base dollar amount to its multiplier. Store the result to a data frame summary.dmg.df for charting.
# Set multipliers (e.g. M for thousands) that will be used for calculating damages
multiplier <- c( "0"= 10^0, "2"=10^2, "3"=10^3, "4"=10^4, "5"=10^5, "6"=10^6,
"7"=10^7, H=10^2, h=10^2, M=10^6, m=10^6, K=10^3, k=10^3,
B=10^9, b= 10^9)
damaged.event.data$PROPDMGDollar <- damaged.event.data$PROPDMG *
multiplier[as.character(damaged.event.data$PROPDMGEXP)]
damaged.event.data$PROPDMGDollar[is.na(damaged.event.data$PROPDMGDollar)] <- 0
damaged.event.data$CROPDMGDollar <- damaged.event.data$CROPDMG *
multiplier[as.character(damaged.event.data$CROPDMGEXP)]
damaged.event.data$CROPDMGDollar[is.na(damaged.event.data$CROPDMGDollar)] <- 0
summary.dmg.df <- sqldf('
SELECT
SUM(PROPDMGDollar) as TotalDollarPropDMG,
SUM(CROPDMGDollar) as TotalDollarCropDMG,
SUM(PROPDMGDollar) + SUM(CROPDMGDollar) as TotalDollarDMG,
EVTYPE
FROM "damaged.event.data"
GROUP BY EVTYPE
ORDER BY TotalDollarDMG DESC
LIMIT 10')
# Report the top 10 event types caused property and crop damages
summary.dmg.df[,3:4]
## TotalDollarDMG EVTYPE
## 1 1.503e+11 FLOOD
## 2 7.191e+10 HURRICANE/TYPHOON
## 3 5.736e+10 TORNADO
## 4 4.332e+10 STORM SURGE
## 5 1.876e+10 HAIL
## 6 1.824e+10 FLASH FLOOD
## 7 1.502e+10 DROUGHT
## 8 1.461e+10 HURRICANE
## 9 1.015e+10 RIVER FLOOD
## 10 8.967e+09 ICE STORM
p1 <- ggplot(fatalities.df, aes(reorder(EVTYPE,TotalFatalities),
TotalFatalities, fill=factor(EVTYPE))) +
geom_bar(stat="identity") +
xlab("Event Type") +
ylab("Total Fatalities") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p1)
p2 <- ggplot(injuries.df, aes(reorder(EVTYPE,TotalInjuries),
TotalInjuries, fill=factor(EVTYPE))) +
geom_bar(stat="identity") +
xlab("Event Type") +
ylab("Total Injuries") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p2)
# Plot damage
p3 <- ggplot(summary.dmg.df, aes(reorder(EVTYPE,TotalDollarDMG),
TotalDollarDMG, fill=factor(EVTYPE))) +
geom_bar(stat="identity") +
xlab("Event Type") +
ylab("Total Damages ($)") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p3)
By looking at result numbers and charts, it is obvious that tornadoes are the most hazardous to human health with 5,633 reported fatalities and 91,346 reported injuries. Floods have been responsible for the most economic damage with more than $150 billion US dollars.
In this assignment, it provided a step-by-step approach for analyzing data. The document was prepared with R version 3.0.2 on Mac OS X v10.9.3.