Synopsis

This report analyzes storm data collected in the United States from 1950 to 2011 and shows that tornadoes have caused the most fatalities and injuries. Floods have caused the greatest property damage and total damage, and drought has caused the most crop damage. Further analysis shows that total fatalities and injuries increased over the 62 years, as did property and crop damage.

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

The data is downloaded from a CloudFront-hosted URL, and the download is skipped if the file already exists locally.

1. Reading in the original data
if(!file.exists("repdata-data-StormData.csv.bz2"))
{ 
    fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(fileUrl,
                  destfile = "repdata-data-StormData.csv.bz2",
                  method = "curl",
                  cacheOK = TRUE)
}

rawData <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
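As a side note on reading the compressed file, the following is a minimal sketch, not part of the original analysis. It assumes a small made-up data frame (`toy`) and a temporary file, and suggests that `read.csv()` can read a bzip2-compressed CSV either through an explicit `bzfile()` connection or directly from the file path, because base R detects the compression when opening a file for reading.

```r
# Made-up two-row data set, used only to illustrate the round trip.
toy <- data.frame(EVTYPE = c("HAIL", "FLOOD"), FATALITIES = c(0, 2))

# Write it as a bzip2-compressed CSV to a temporary file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back two ways: through bzfile() and directly from the path.
viaBzfile <- read.csv(bzfile(tmp))
viaPath <- read.csv(tmp)

print(viaPath)
print(identical(viaBzfile, viaPath))
```

Either form should work for the storm data file; the explicit `bzfile()` call used above simply makes the compression visible to the reader.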

After reading in the data, check the dimensions of the data frame: there are 902297 observations of 37 variables.

dim(rawData)
## [1] 902297     37

The variables in the data frame are:

names(rawData)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The events most harmful to population health are taken to be those that caused the most fatalities and/or injuries. The events with the greatest economic consequences are taken to be those that caused the most property and/or crop damage. Here I extract the columns of interest and print a brief summary.

subsetData <- rawData[names(rawData) %in% c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

summary(subsetData)
##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9

There are no missing values for fatalities, injuries, property damage, or crop damage.

mean(is.na(subsetData$FATALITIES))
## [1] 0
mean(is.na(subsetData$INJURIES))
## [1] 0
mean(is.na(subsetData$PROPDMG))
## [1] 0
mean(is.na(subsetData$CROPDMG))
## [1] 0

2. Process Population Health Data

The subsetted data is aggregated by event type (lower-cased so that differently capitalized labels are grouped together), summing fatalities and injuries to find the events most harmful to population health.

populationHealthData <- aggregate(x = subsetData[,c(2,3)], 
                                  by = list(EVTYPE = tolower(subsetData$EVTYPE)), 
                                  FUN = sum, 
                                  na.rm = TRUE)
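To illustrate why the event type is lower-cased before grouping, here is a minimal sketch on made-up data. The `toy` data frame and its values are assumptions for illustration only and are not taken from the storm database.

```r
# Made-up records standing in for subsetData; note the mixed capitalization.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "Tornado", "FLOOD", "flood", "HAIL"),
  FATALITIES = c(3, 1, 2, 0, 0),
  INJURIES   = c(10, 4, 1, 1, 0)
)

# Grouping on the lower-cased label merges "TORNADO" with "Tornado"
# and "FLOOD" with "flood", so each event type gets a single total.
toyHealth <- aggregate(x = toy[, c("FATALITIES", "INJURIES")],
                       by = list(EVTYPE = tolower(toy$EVTYPE)),
                       FUN = sum,
                       na.rm = TRUE)

print(toyHealth)
```

Selecting the columns by name, as in this sketch, is a small design choice that keeps the call correct even if the column order of the data frame changes.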

The data is then melted into long format to prepare for a stacked bar plot.

library(reshape)
populationHealthData.m <- melt(populationHealthData, id = "EVTYPE")

To find the events that have caused the most fatalities and injuries, I create a derived column TotalCasualty and take the names of the ten events with the highest totals.

populationHealthData$TotalCasualty <- populationHealthData$FATALITIES + populationHealthData$INJURIES

TopTenHealthEvents <- head(populationHealthData[with(populationHealthData, order(-TotalCasualty)),1], 10)
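The ranking step can be sketched in isolation as follows. The data frame `toyTotals` and the helper `topN` are hypothetical names introduced here for illustration; the values are made up.

```r
# Made-up totals per event type.
toyTotals <- data.frame(
  EVTYPE        = c("flood", "hail", "heat", "tornado"),
  TotalCasualty = c(4, 0, 9, 18)
)

# Hypothetical helper: sort rows by a numeric column, largest first,
# and return the event names of the first n rows.
topN <- function(df, valueCol, n) {
  ranked <- df[order(df[[valueCol]], decreasing = TRUE), ]
  head(ranked$EVTYPE, n)
}

topTwo <- topN(toyTotals, "TotalCasualty", 2)
print(topTwo)
```

Using `order(..., decreasing = TRUE)` is equivalent to ordering on the negated column as done above, and it also works for columns that cannot be negated.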

Subset the melted data to keep only the data of the top 10 events for plotting.

populationHealthData.m <- populationHealthData.m[populationHealthData.m$EVTYPE %in% as.vector(TopTenHealthEvents), ]

3. Process Economic Damage Data

Before the property and crop damage amounts can be added up, they need to be scaled by the corresponding *EXP fields. The alphabetical characters that signify magnitude are “K” for thousands, “M” for millions, and “B” for billions. The function below matches them case-insensitively and leaves amounts with any other code unscaled.

getNumber <- function(number, exp){
    if (tolower(exp) == "k")
        number * 1000
    else if (tolower(exp) == "m")
        number * 1000000
    else if (tolower(exp) == "b")
        number * 1000000000
    else
        number
}
# transform the property damage data
subsetData$PROPDMG <- mapply(getNumber, subsetData$PROPDMG, subsetData$PROPDMGEXP)

# transform the crop damage data
subsetData$CROPDMG <- mapply(getNumber, subsetData$CROPDMG, subsetData$CROPDMGEXP)
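Calling `getNumber` once per row through `mapply` can be slow on roughly 900,000 records. A vectorized alternative is sketched below under the same rules (only k, m, and b are scaled, case-insensitively; everything else is left unchanged). The names `expMultiplier` and `scaleDamage` are hypothetical and the input values are made up.

```r
# Lookup table from exponent code to multiplier. Only k, m and b are
# handled, mirroring getNumber(); any other code falls back to 1.
expMultiplier <- c(k = 1e3, m = 1e6, b = 1e9)

# Hypothetical vectorized replacement for mapply(getNumber, ...).
scaleDamage <- function(number, exp) {
  multiplier <- expMultiplier[tolower(as.character(exp))]
  multiplier[is.na(multiplier)] <- 1   # unknown or empty codes
  number * unname(multiplier)
}

# Made-up amounts with a mix of known, empty and unrecognized codes.
scaled <- scaleDamage(c(25, 2.5, 1, 7, 3), c("K", "m", "B", "", "5"))
print(scaled)
```

Under these assumptions, `subsetData$PROPDMG <- scaleDamage(subsetData$PROPDMG, subsetData$PROPDMGEXP)` would be a drop-in substitute for the `mapply` call.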

The data is then aggregated by event type, summing property and crop damage to find the events with the greatest economic consequences.

damageData <- aggregate(x = subsetData[,c(4, 6)], 
                      by = list(EVTYPE = tolower(subsetData$EVTYPE)), 
                      FUN = sum, 
                      na.rm = TRUE)

The data is then melted into long format to prepare for a stacked bar plot.

library(reshape)
damageData.m <- melt(damageData, id = "EVTYPE")

To find the events that have caused the most property and crop damage, I create a derived column TotalDamage and take the names of the ten events with the highest totals.

damageData$TotalDamage <- damageData$PROPDMG + damageData$CROPDMG

TopTenDamageEvents <- head(damageData[with(damageData, order(-TotalDamage)),1], 10)

Subset the melted data to keep only the data of the top 10 events for plotting.

damageData.m <- damageData.m[damageData.m$EVTYPE %in% as.vector(TopTenDamageEvents), ]

Results

1. Population Health Damage

The following plot shows the top 10 events that have caused the most fatalities and injuries. Tornadoes have caused the most fatalities and injuries.

# plot 
library(ggplot2)
ggplot(populationHealthData.m, aes(x = reorder(EVTYPE, value), 
                                   y = value/1000, 
                                   fill = variable)) +
    geom_bar(stat = "identity") +
    ggtitle("US Fatalities and Injuries by Event, 1950-2011 (thousand people)") +
    xlab("") +
    ylab("Number of Fatalities and Injuries (thousands of people)") +
    theme(axis.text.x = element_text(angle = -45, vjust = 0.6, size = 8),
          plot.title = element_text(size = 11))

2. Economic Damage

The following plot shows the top 10 events that have caused the most property and crop damage. Floods have caused the most property damage and the most total damage, while drought has caused the most crop damage.

library(ggplot2)
ggplot(damageData.m, aes(x = reorder(EVTYPE, value), 
                         y = value/1000000000, 
                         fill = variable)) +
    geom_bar(stat = "identity") +
    ggtitle("US Property and Crop Damage by Event, 1950-2011 (billion dollars)") +
    xlab("") +
    ylab("Property and Crop Damage (billion dollars)") +
    theme(axis.text.x = element_text(angle = -45, vjust = 0.6, size = 8),
          plot.title = element_text(size = 11))

3. Event Trend

The results above cover 62 years of data, but they do not show how the total damage changed from year to year. To examine the trend, a new column “YEAR” is generated for each record in subsetData.

subsetData$YEAR <- format(as.Date(rawData$BGN_DATE, "%m/%d/%Y %H:%M:%S"), "%Y") 
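The year extraction can be checked on a couple of sample strings. This is a minimal sketch: the two date strings are illustrative values assumed to follow the same month/day/year plus time layout that the format string above expects.

```r
# Illustrative date strings in "month/day/year hour:minute:second" form.
sampleDates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# Parse with the same format string, then keep only the four-digit year.
parsed <- as.Date(sampleDates, format = "%m/%d/%Y %H:%M:%S")
years <- format(parsed, "%Y")
print(years)

# The result is character; convert if a numeric axis is wanted in plots.
yearsInt <- as.integer(years)
print(yearsInt)
```

Keeping YEAR as character, as the report does, makes ggplot2 treat it as a discrete axis with one bar per year.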

Aggregate the data by year and find out the total number of fatalities, injuries, property damage and crop damage.

eventData <- aggregate(x = subsetData[,c(2,3,4,6)], 
                       by = list(YEAR = subsetData$YEAR), 
                       FUN = sum, 
                       na.rm = TRUE)


# subset fatalities and injuries
eventData1 <- eventData[, c(1, 2, 3)]
# subset property and crop damage
eventData2 <- eventData[, c(1, 4, 5)]

# melt data by year
eventData1.m <- melt(eventData1, id = "YEAR")
eventData2.m <- melt(eventData2, id = "YEAR")

The plots show that property and crop damage has increased since 1950 and peaked in 2006. Fatalities and injuries have also increased since 1950, peaking in 1998.

p1 <- ggplot(eventData1.m, aes(x = YEAR, 
                               y = value, 
                               fill = variable)) +
    geom_bar(stat = "identity") +
    ggtitle("Fatalities and Injuries By Year") +
    xlab("") +
    ylab("Population Damage(no. of people)") +
    theme(axis.text.x = element_text(angle = -90, vjust = 0.6, size = 8),
          plot.title = element_text(size = 12))


p2 <- ggplot(eventData2.m, aes(x = YEAR, 
                               y = value/1000000000, 
                               fill = variable)) +
    geom_bar(stat = "identity") +
    ggtitle("Property and Crop Damage By Year") +
    xlab("") +
    ylab("Damage(billion $)") +
    theme(axis.text.x = element_text(angle = -90, vjust = 0.6, size = 8),
          plot.title = element_text(size = 12))

library(gridExtra)
## Loading required package: grid
# gridExtra 2.0.0 and later take the overall title via `top`; older releases used `main`
grid.arrange(p1, p2, ncol = 1, top = "US Event Damage by Year Between 1950-2011")