This report analyzes storm data collected in the United States from 1950 to 2011. It shows that tornadoes have caused the most fatalities and injuries, floods have caused the greatest property damage and total damage, and drought has caused the most crop damage. Further analysis shows that total fatalities and injuries increased over the 62 years, as did property and crop damage.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data is downloaded from a CloudFront-hosted URL, and the download is skipped if the file already exists locally.
if(!file.exists("repdata-data-StormData.csv.bz2"))
{
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,
destfile = "repdata-data-StormData.csv.bz2",
method = "curl",
cacheOK = TRUE)
}
rawData <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
After reading in the data, check the dimensions of the data frame. There are 902297 observations of 37 variables.
dim(rawData)
## [1] 902297 37
The variables in the data frame are:
names(rawData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The event types most harmful to population health are those that caused the most fatalities and/or injuries. The event types with the greatest economic consequences are those that caused the most property and/or crop damage. Here I extract the columns of interest and print a brief summary.
subsetData <- rawData[names(rawData) %in% c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
summary(subsetData)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
There are no missing values for fatalities, injuries, property damage or crop damage.
mean(is.na(subsetData$FATALITIES))
## [1] 0
mean(is.na(subsetData$INJURIES))
## [1] 0
mean(is.na(subsetData$PROPDMG))
## [1] 0
mean(is.na(subsetData$CROPDMG))
## [1] 0
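The four checks above could also be done in a single call. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and not part of the storm data), using `colSums(is.na(...))` to count missing values per column:

```r
# Hypothetical toy data frame standing in for subsetData
toy <- data.frame(FATALITIES = c(0, 1, NA),
                  INJURIES   = c(0, 0, 2),
                  PROPDMG    = c(2.5, 0, 10),
                  CROPDMG    = c(0, 0, 0))

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE entries in each column
naCounts <- colSums(is.na(toy))
print(naCounts)
```

A count of zero for every column corresponds to the result reported above for the real data.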
The subsetted data is aggregated by event type, summing fatalities and injuries, to find the events most harmful to population health. Event names are lowercased first so that differently cased spellings of the same event are grouped together.
populationHealthData <- aggregate(x = subsetData[,c(2,3)],
by = list(EVTYPE = tolower(subsetData$EVTYPE)),
FUN = sum,
na.rm = TRUE)
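To illustrate what this aggregation does, here is a small sketch on made-up rows (`toy` and `toyTotals` are hypothetical names). It uses the same `aggregate` call shape and shows how lowercasing merges "TORNADO" with "Tornado":

```r
# Hypothetical records with inconsistent capitalization
toy <- data.frame(EVTYPE     = c("TORNADO", "Tornado", "FLOOD", "flood"),
                  FATALITIES = c(3, 2, 1, 0),
                  INJURIES   = c(10, 5, 0, 4))

# Sum both measures within each lowercased event type
toyTotals <- aggregate(x   = toy[, c("FATALITIES", "INJURIES")],
                       by  = list(EVTYPE = tolower(toy$EVTYPE)),
                       FUN = sum,
                       na.rm = TRUE)
print(toyTotals)
```

The four input rows collapse to two groups, one per distinct lowercased event name.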
The data is then melted into long format (one row per event type and measure) to prepare for a stacked bar plot.
library(reshape)
populationHealthData.m <- melt(populationHealthData, id = "EVTYPE")
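For readers unfamiliar with melting, the sketch below builds the same long layout by hand in base R on a made-up two-row table. It is only an illustration of the resulting shape (columns `EVTYPE`, `variable`, `value`), not how the reshape package implements `melt`; `wide`, `measures` and `long` are hypothetical names.

```r
# Hypothetical wide table: one row per event, one column per measure
wide <- data.frame(EVTYPE     = c("flood", "tornado"),
                   FATALITIES = c(1, 5),
                   INJURIES   = c(4, 15),
                   stringsAsFactors = FALSE)

measures <- c("FATALITIES", "INJURIES")

# Long table: one row per (event, measure) pair
long <- data.frame(EVTYPE   = rep(wide$EVTYPE, times = length(measures)),
                   variable = rep(measures, each = nrow(wide)),
                   value    = unlist(wide[measures], use.names = FALSE),
                   stringsAsFactors = FALSE)
print(long)
```

With the data in this shape, a plotting call can map `variable` to the fill colour and `value` to the bar height.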
To find the events that have caused the most fatalities and injuries, I create a derived column TotalCasualty, sort by it in descending order, and take the names of the first ten events.
populationHealthData$TotalCasualty <- populationHealthData$FATALITIES + populationHealthData$INJURIES
TopTenHealthEvents <- head(populationHealthData[with(populationHealthData, order(-TotalCasualty)),1], 10)
Subset the melted data to keep only the data of the top 10 events for plotting.
populationHealthData.m <- populationHealthData.m[populationHealthData.m$EVTYPE %in% as.vector(TopTenHealthEvents), ]
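The ranking and filtering steps above can be sketched on a small made-up table. All names here (`totals`, `topTwo`, `kept`) are hypothetical, and the sketch keeps the top two events rather than ten:

```r
# Hypothetical per-event totals
totals <- data.frame(EVTYPE     = c("flood", "hail", "heat", "tornado"),
                     FATALITIES = c(1, 0, 7, 5),
                     INJURIES   = c(4, 2, 3, 15),
                     stringsAsFactors = FALSE)

# Derived ranking column
totals$TotalCasualty <- totals$FATALITIES + totals$INJURIES

# order() on the negated column sorts descending; head() keeps the first n names
topTwo <- head(totals$EVTYPE[order(-totals$TotalCasualty)], 2)
print(topTwo)

# Keep only the rows belonging to the top events
kept <- totals[totals$EVTYPE %in% topTwo, ]
print(kept)
```

Filtering with `%in%` keeps the original row order, which is why the plots reorder the bars separately.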
Before adding up the monetary damage to property and crops, the amounts need to be scaled according to the *EXP fields. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. In the function below, any other value (blank, digits, or other symbols) leaves the amount unchanged.
getNumber <- function(number, exp){
if (tolower(exp) == "k")
number * 1000
else if (tolower(exp) == "m")
number * 1000000
else if (tolower(exp) == "b")
number * 1000000000
else
number
}
# transform the property damage data
subsetData$PROPDMG <- mapply(getNumber, subsetData$PROPDMG, subsetData$PROPDMGEXP)
# transform the crop damage data
subsetData$CROPDMG <- mapply(getNumber, subsetData$CROPDMG, subsetData$CROPDMGEXP)
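Calling `getNumber` once per row through `mapply` can be slow on roughly 900,000 records. A vectorized alternative is sketched below; `expMultiplier` is a hypothetical helper, not part of the original analysis, and it assumes the same rule as `getNumber` (only k, m and b scale the amount, case-insensitively):

```r
# Hypothetical helper: map exponent codes to numeric multipliers in one pass
expMultiplier <- function(exp) {
  lookup <- c(k = 1e3, m = 1e6, b = 1e9)
  # Indexing by name returns NA for codes that are not k, m or b
  mult <- lookup[tolower(as.character(exp))]
  # Unknown or blank codes leave the amount unchanged
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Made-up amounts and exponent codes
amount  <- c(25, 2.5, 1, 10, 3)
exp     <- c("K", "m", "B", "", "5")
dollars <- amount * expMultiplier(exp)
print(dollars)
```

Under that assumption, the two transformations above could be written as `subsetData$PROPDMG * expMultiplier(subsetData$PROPDMGEXP)` and likewise for the crop columns.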
The data is then aggregated by event type, summing property and crop damage, to find the events with the greatest economic consequences.
damageData <- aggregate(x = subsetData[,c(4, 6)],
by = list(EVTYPE = tolower(subsetData$EVTYPE)),
FUN = sum,
na.rm = TRUE)
The damage data is then melted into long format to prepare for a stacked bar plot.
library(reshape)
damageData.m <- melt(damageData, id = "EVTYPE")
To find the events that have caused the most property and crop damage, I create a derived column TotalDamage, sort by it in descending order, and take the names of the first ten events.
damageData$TotalDamage <- damageData$PROPDMG + damageData$CROPDMG
TopTenDamageEvents <- head(damageData[with(damageData, order(-TotalDamage)),1], 10)
Subset the melted data to keep only the data of the top 10 events for plotting.
damageData.m <- damageData.m[damageData.m$EVTYPE %in% as.vector(TopTenDamageEvents), ]
The following plot shows the top 10 events that have caused the most fatalities and injuries. Tornadoes have caused the most fatalities and injuries.
# plot
library(ggplot2)
ggplot(populationHealthData.m, aes(x = reorder(EVTYPE, value),
y = value/1000,
fill = variable)) +
geom_bar(stat = "identity") +
ggtitle("US Fatalities and Injuries by Event, 1950-2011 (thousands of people)") +
xlab("") +
ylab("Number of Fatalities and Injuries (thousands of people)") +
theme(axis.text.x = element_text(angle = -45, vjust = 0.6, size = 8),
plot.title = element_text(size = 11))
The following plot shows the top 10 events that have caused the most property and crop damage. Floods have caused the most property damage and the most total damage, and drought has caused the most crop damage.
library(ggplot2)
ggplot(damageData.m, aes(x = reorder(EVTYPE, value),
y = value/1000000000,
fill = variable)) +
geom_bar(stat = "identity") +
ggtitle("US Property and Crop Damage by Event, 1950-2011 (billions of dollars)") +
xlab("") +
ylab("Property and Crop Damage (billions of dollars)") +
theme(axis.text.x = element_text(angle = -45, vjust = 0.6, size = 8),
plot.title = element_text(size = 11))
The results above cover the full 62 years of data, but they do not show how the total damage changed from year to year. To examine the trend, generate a new column “YEAR” for each record in subsetData from the event's begin date.
subsetData$YEAR <- format(as.Date(rawData$BGN_DATE, "%m/%d/%Y %H:%M:%S"), "%Y")
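As a quick check of the date parsing, the sketch below applies the same conversion to two made-up strings. It assumes the `BGN_DATE` values follow a month/day/year layout with a time part, as the format string above implies; `dates` and `years` are hypothetical names.

```r
# Made-up date strings in the layout assumed for BGN_DATE
dates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# Parse to Date, then format back to a four-digit year
years <- format(as.Date(dates, "%m/%d/%Y %H:%M:%S"), "%Y")
print(years)
```

Note that `format` returns character strings, so YEAR is treated as a discrete variable when plotted; `as.integer(years)` would give a numeric year if one were needed.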
Aggregate the data by year and find out the total number of fatalities, injuries, property damage and crop damage.
eventData <- aggregate(x = subsetData[,c(2,3,4,6)],
by = list(YEAR = tolower(subsetData$YEAR)),
FUN = sum,
na.rm = TRUE)
# subset fatalities and injuries
eventData1 <- eventData[, c(1, 2, 3)]
# subset property and crop damage
eventData2 <- eventData[, c(1, 4, 5)]
# melt data by year
eventData1.m <- melt(eventData1, id = "YEAR")
eventData2.m <- melt(eventData2, id = "YEAR")
The plots show that property and crop damage has increased since 1950 and peaked in 2006. Fatalities and injuries have also increased since 1950 and peaked in 1998.
p1 <- ggplot(eventData1.m, aes(x = YEAR,
y = value,
fill = variable)) +
geom_bar(stat = "identity") +
ggtitle("Fatalities and Injuries By Year") +
xlab("") +
ylab("Population Damage(no. of people)") +
theme(axis.text.x = element_text(angle = -90, vjust = 0.6, size = 8),
plot.title = element_text(size = 12))
p2 <- ggplot(eventData2.m, aes(x = YEAR,
y = value/1000000000,
fill = variable)) +
geom_bar(stat = "identity") +
ggtitle("Property and Crop Damage By Year") +
xlab("") +
ylab("Damage(billion $)") +
theme(axis.text.x = element_text(angle = -90, vjust = 0.6, size = 8),
plot.title = element_text(size = 12))
library(gridExtra)
## Loading required package: grid
# gridExtra 2.0.0 and later take the overall title via `top`; older versions used `main`
grid.arrange(p1, p2, ncol = 1, top = "US Event Damage by Year Between 1950-2011")