Synopsis

Storms and other severe natural disasters threaten human lives and property. After examining the fatalities and injuries caused by natural disasters, we found that from 1950 to 2004 tornadoes killed and injured the most people across the United States (4,658 fatalities and 80,084 injuries, respectively). Although tornadoes were not the most frequently recorded events, they caused the most property damage (approximately 41 billion dollars). Floods caused the most crop damage, with losses of 642 million dollars. Hail ranked second in both crop damage and number of records, which suggests that hail is a frequent and severe threat to crops.

Data Processing

Data Download

Looking into the data downloaded from the National Climatic Data Center, we found that many columns and rows were misread into the data frame, which left each variable with a large number of spurious factor levels.

##Download and read the data from the website
if(!dir.exists("~/Course Project 2")) dir.create("~/Course Project 2")
setwd("~/Course Project 2")
if(!file.exists("~/Course Project 2/FStormData.csv.bz2")) download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "~/Course Project 2/FStormData.csv.bz2")
FStorm <- read.csv("~/Course Project 2/FStormData.csv.bz2", blank.lines.skip = TRUE, na.strings = "")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
#Look into the data with so many factor levels
str(FStorm$EVTYPE)
##  Factor w/ 487 levels "AGRICULTURAL FREEZE",..: 416 416 416 416 416 416 416 416 416 416 ...
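
Part of the level count in a free-text column like EVTYPE can come from inconsistent case or stray whitespace rather than from genuinely different events. The report does not apply this step. The following is only a minimal sketch, on made-up labels, of how such variants could be collapsed before counting levels.

```r
# Toy event labels (made up for illustration, not taken from the data set)
ev <- c("TSTM WIND", "tstm wind", " TSTM WIND", "HAIL", "Hail", "FLOOD")

# Upper-case the labels and trim surrounding whitespace
ev_clean <- toupper(trimws(ev))

length(unique(ev))        # distinct labels before normalising
length(unique(ev_clean))  # distinct labels after normalising
table(ev_clean)
```

Whether this kind of normalisation would change the top 10 event types in the real data is not something the report checks.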

Clean Data

We checked how often each event type was recorded and found that the ten most frequently recorded event types account for 73% of all observations. We therefore dropped the remaining observations, which both cleans the data and reduces the complexity of the analysis. This relies on the assumption that the most harmful event types are among the ten most frequently recorded ones.

##Select the columns indicating disaster events' population damages and economic damages
FStormData <- FStorm[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG", "PROPDMGEXP", "CROPDMGEXP")]

##Select top 10 frequently recorded disaster events to clean the data
TypeFreq <- as.data.frame(table(FStormData$EVTYPE))
names(TypeFreq) <- c("EVTYPE", "EVTYPEFreq")
TypeFreqOrder <- TypeFreq[order(-TypeFreq$EVTYPEFreq), ]
HighFreqType <- TypeFreqOrder[1:10, ]
FStormHighFreq <- FStormData[FStormData$EVTYPE %in% HighFreqType$EVTYPE, ]

##Define the class of each variable
FStormHighFreq$EVTYPE <- as.character(FStormHighFreq$EVTYPE)
FStormHighFreq$INJURIES <- as.numeric(as.character(FStormHighFreq$INJURIES))
FStormHighFreq$FATALITIES <- as.numeric(as.character(FStormHighFreq$FATALITIES))
FStormHighFreq$PROPDMG <- as.numeric(as.character(FStormHighFreq$PROPDMG))
FStormHighFreq$CROPDMG <- as.numeric(as.character(FStormHighFreq$CROPDMG))
FStormHighFreq$PROPDMGEXP <- as.character(FStormHighFreq$PROPDMGEXP)
FStormHighFreq$CROPDMGEXP <- as.character(FStormHighFreq$CROPDMGEXP)

##Share of observations retained after cleaning the data
nrow(FStormHighFreq)/nrow(FStorm)
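
The retained share is the number of rows kept divided by the number of rows read, not the number of event types kept. The following is a minimal sketch of that calculation on made-up data. The names `toy`, `freq`, `top_types`, `kept` and `share` are illustrative and do not appear in the analysis.

```r
# Toy event types (made up): 3 distinct types across 10 rows
toy <- data.frame(EVTYPE = c(rep("HAIL", 5), rep("TORNADO", 3), rep("FLOOD", 2)),
                  stringsAsFactors = FALSE)

# Frequency table sorted from most to least frequent
freq <- sort(table(toy$EVTYPE), decreasing = TRUE)

# Keep the top 2 types and compute the share of rows they cover
top_types <- names(freq)[1:2]
kept <- toy[toy$EVTYPE %in% top_types, , drop = FALSE]
share <- nrow(kept) / nrow(toy)
share
```

In this toy case the two most frequent types cover 8 of the 10 rows.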

Calculate the Economic Damages

Another complication is that the crop and property damage values cannot be read directly from a single column. Each value has to be calculated by combining the damage number column with the corresponding damage magnitude (exponent) column.

##Calculate the actual value of economic damages
EXP <- data.frame(letter = c("K", "M", "B", NA), number = c(10^3, 10^6, 10^9, 1))
FStormHighFreq <- merge(FStormHighFreq, EXP, by.x = "PROPDMGEXP", by.y = "letter")
FStormHighFreq <- merge(FStormHighFreq, EXP, by.x = "CROPDMGEXP", by.y = "letter")
FStormHighFreq$PROPDMG <- FStormHighFreq$PROPDMG*FStormHighFreq$number.x
FStormHighFreq$CROPDMG <- FStormHighFreq$CROPDMG*FStormHighFreq$number.y

FStormSum <- aggregate(FStormHighFreq[, c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")], by = list("EVTYPE" = FStormHighFreq$EVTYPE), sum)
FStormSum <- merge(FStormSum, HighFreqType, by = "EVTYPE")
FStormSum$EVTYPEFreq <- as.numeric(as.character(FStormSum$EVTYPEFreq))
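
The exponent step above can be checked on a small example. The sketch below uses a named lookup vector instead of `merge()`. This is an alternative formulation and not the method used in the analysis, and the toy values and the names `mult`, `multiplier` and `PROPDMG_USD` are made up for illustration.

```r
# Toy damage records (made up); NA stands for a missing exponent
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 10),
                  PROPDMGEXP = c("K", "M", "B", NA),
                  stringsAsFactors = FALSE)

# Named lookup vector: exponent letter -> multiplier
mult <- c(K = 1e3, M = 1e6, B = 1e9)

# Map each letter to its multiplier; treat a missing exponent as 1
toy$multiplier <- unname(mult[toy$PROPDMGEXP])
toy$multiplier[is.na(toy$PROPDMGEXP)] <- 1

# Damage in dollars = recorded number * multiplier
toy$PROPDMG_USD <- toy$PROPDMG * toy$multiplier
toy
```

One difference in behaviour is worth noting: `merge()` with its default arguments keeps only rows whose key has a match in the lookup table, so records with any other exponent code are dropped, whereas the lookup vector keeps every row and leaves unrecognised codes as NA where they remain visible.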

Results

Population Damages of Natural Disasters

We plotted fatalities against injuries as a point plot with logarithmic scales to show which event type is the most dangerous to human life. The size of each point indicates how many records exist for that event type. Tornadoes killed and injured the most people across the United States (4,658 fatalities and 80,084 injuries, respectively).

library(ggplot2)
g <- ggplot(FStormSum, aes(x = FATALITIES, y = INJURIES, color = EVTYPE, size = EVTYPEFreq))
g + geom_point() + scale_x_continuous(trans = "log10") + 
        scale_y_continuous(trans = "log10") + 
        labs(title = "Population Damages caused by Natural Disasters across the U.S.", 
             x = "Fatality Populations", y = "Injury Populations", size = "Events Frequency")

Economic Damages of Natural Disasters

We also plotted property damage against crop damage as a point plot with logarithmic scales to show which event type is the most harmful economically. The size of each point indicates how many records exist for that event type. Although tornadoes were not the most frequently recorded events, they caused the most property damage (approximately 41 billion dollars). Floods caused the most crop damage, with losses of 642 million dollars. Hail ranked second in both crop damage and number of records, which suggests that hail is a frequent and severe threat to crops.

p <- ggplot(FStormSum, aes(x = PROPDMG, y = CROPDMG, color = EVTYPE, size = EVTYPEFreq))
p + geom_point() + scale_x_continuous(trans = "log10") + 
        scale_y_continuous(trans = "log10") +
        labs(title = "Economic Damages caused by Natural Disasters across the U.S.", 
             x = "Property Damages (Dollars)", y = "Crop Damages (Dollars)", size = "Events Frequency")
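
Both plots use log10 axes, and a total of zero has no finite position on such an axis. The following is a minimal sketch, on made-up totals, of how rows with a zero total could be identified before plotting. The names `toy` and `non_finite` are illustrative, and the report does not state whether any of its totals are zero.

```r
# Toy totals (made up); one event type has zero crop damage
toy <- data.frame(EVTYPE = c("A", "B", "C"),
                  PROPDMG = c(1e6, 5e4, 2e3),
                  CROPDMG = c(1e5, 0, 3e2),
                  stringsAsFactors = FALSE)

# log10(0) is -Inf, so such a point cannot be placed on a log axis
log10(toy$CROPDMG)

# Identify rows without a finite position on either log axis
non_finite <- !is.finite(log10(toy$PROPDMG)) | !is.finite(log10(toy$CROPDMG))
toy$EVTYPE[non_finite]
```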