Synopsis

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

This report shows that, according to the NOAA data for 1950 to 2011, Flood, Hurricane/Typhoon, Tornado and Storm Surge caused the most property and crop damage, while Tornado, TSTM Wind, Flood, Lightning and Excessive Heat caused the most injuries and fatalities in the United States.

Preprocessing the data

  1. Load the required libraries.
library(lubridate)
library(plyr)
library(reshape2)
library(data.table)
library(ggplot2)
  2. Download the NOAA data from the URL provided by the course.
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
infile <- "storm.csv.bz2"
download.file(fileUrl, destfile=infile, method="curl")
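
Re-running the report repeats the download every time. A minimal sketch of a guard that skips the download when the file already exists is shown here; `download_if_missing` and its `downloader` argument are hypothetical names introduced for illustration, not part of the original analysis.

```r
# Hypothetical helper: fetch `url` into `dest` only when `dest` is missing.
# `downloader` defaults to download.file; any function(url, destfile) works.
download_if_missing <- function(url, dest, downloader = download.file) {
  if (file.exists(dest)) {
    return(FALSE)   # existing file reused, nothing downloaded
  }
  downloader(url, destfile = dest)
  TRUE              # a download was attempted
}
```

With the variables defined above, the call would be `download_if_missing(fileUrl, infile)`.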

Once the data is downloaded, we load it and convert the date columns to date-time format.

DT <- as.data.table(read.csv(bzfile(infile)))
DT$BGN_DATE <- mdy_hms(DT$BGN_DATE)
DT$END_DATE <- mdy_hms(DT$END_DATE)
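
To illustrate what this conversion does, here is a small base R sketch that stands in for `mdy_hms`. It assumes the raw strings have the form month/day/year hour:minute:second and that some values may be blank. The sample strings are made up for illustration.

```r
# Sample strings in the assumed raw format; the second one is blank
raw <- c("4/18/1950 0:00:00", "")

# Base R stand-in for lubridate::mdy_hms(); unparseable input becomes NA
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

format(parsed, "%Y-%m-%d")  # date part of each parsed value
is.na(parsed)               # TRUE where the string could not be parsed
```

Blank values come back as `NA` rather than stopping the conversion, so later steps that use the date columns need to allow for missing values.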

Analysis

Checking the impact of weather events on population health in the USA

From the full dataset, we take the subset of columns that pertain to injuries and fatalities. On this subset we calculate the total, mean and median for each combination of event type and measure (injuries or fatalities), and keep the 20 combinations with the highest totals.

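# Keep only the columns needed for the health impact analysis
data.set1 <- DT[, list(BGN_DATE, END_DATE, EVTYPE, FATALITIES, INJURIES)]
# Reshape to long format: one row per event record and measure
data.set1.melted <- melt(data.set1, id = c("BGN_DATE", "END_DATE", "EVTYPE"))
# Total, mean and median per (EVTYPE, variable); keep the 20 largest totals
data.set1.topevents <- head(data.set1.melted[
    , list(sum = sum(value),
           mean = mean(value),
           median = median(value)),
    by = list(EVTYPE, variable)][order(-sum, -mean)], 20)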
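
The melt-then-aggregate pattern can be easier to follow on a toy table. The sketch below uses made-up records and assumes a version of `data.table` that ships its own `melt` method, so `reshape2` is not loaded here.

```r
library(data.table)

# Hypothetical toy records; the values are made up for illustration
toy <- data.table(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "FLOOD"),
  FATALITIES = c(1, 0, 0, 2),
  INJURIES   = c(10, 0, 3, 0)
)

# Wide to long: one row per (record, measure) pair
toy.melted <- melt(toy, id.vars = "EVTYPE",
                   variable.name = "variable", value.name = "value")

# Summarise each (EVTYPE, variable) group, then sort by total, largest first
toy.summary <- toy.melted[
  , list(sum = sum(value), mean = mean(value), median = median(value)),
  by = list(EVTYPE, variable)][order(-sum, -mean)]
toy.summary
```

Each row of `toy.summary` describes one event type and one measure, which is why the same event type can appear twice in a top 20 listing built this way.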

The 20 event type and measure combinations with the highest totals of injuries or fatalities are listed below.

data.set1.topevents
##                 EVTYPE   variable   sum         mean median
##  1:            TORNADO   INJURIES 91346  1.506067401      0
##  2:          TSTM WIND   INJURIES  6957  0.031631354      0
##  3:              FLOOD   INJURIES  6789  0.268064440      0
##  4:     EXCESSIVE HEAT   INJURIES  6525  3.888557807      0
##  5:            TORNADO FATALITIES  5633  0.092874101      0
##  6:          LIGHTNING   INJURIES  5230  0.331979180      0
##  7:               HEAT   INJURIES  2100  2.737940026      0
##  8:          ICE STORM   INJURIES  1975  0.984546361      0
##  9:     EXCESSIVE HEAT FATALITIES  1903  1.134088200      0
## 10:        FLASH FLOOD   INJURIES  1777  0.032739466      0
## 11:  THUNDERSTORM WIND   INJURIES  1488  0.018022601      0
## 12:               HAIL   INJURIES  1361  0.004714873      0
## 13:       WINTER STORM   INJURIES  1321  0.115542727      0
## 14:  HURRICANE/TYPHOON   INJURIES  1275 14.488636364      0
## 15:          HIGH WIND   INJURIES  1137  0.056253711      0
## 16:         HEAVY SNOW   INJURIES  1021  0.064998727      0
## 17:        FLASH FLOOD FATALITIES   978  0.018018682      0
## 18:               HEAT FATALITIES   937  1.221642764      0
## 19:           WILDFIRE   INJURIES   911  0.329952916      0
## 20: THUNDERSTORM WINDS   INJURIES   908  0.043563786      0

The following chart compares the fatalities and injuries for these top 20 entries.

ggplot(data.set1.topevents, aes(x=reorder(EVTYPE, sum), y=sum, fill=variable)) + 
    geom_bar(width=1, stat="identity") + 
    coord_flip() +
    labs(title="Top 20 Event Types for Injuries and Fatalities",
         y="Total Injuries + Fatalities", x="Event Type")

Checking the impact of weather events on property and crops

Again, we extract the subset of the dataset that pertains to crop and property damage.

data.set2 <-DT[,list(BGN_DATE, END_DATE, EVTYPE, PROPDMG, 
                     PROPDMGEXP, CROPDMG, CROPDMGEXP)]

In the loaded data, each property and crop damage amount is stored with a separate exponent code (for example K for thousands, M for millions, B for billions). Each amount must be multiplied by the value of its code to obtain the actual damage.

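# Lookup table: exponent codes (first vector) and their multipliers (second)
# Digits are read as powers of ten; "+", "?", "-" and blank map to 1
# Note: lowercase "h" maps to 1 here, whereas "H" maps to 1e2
exp <- list(c("k", "K", "M", "", "B", "m", "+", "?", "h", "H", "-", 
              "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"), 
            c(1e3, 1e3, 1e6, 1, 1e9, 1e6,  1, 1, 1, 1e2,  1, 
              1, 10, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9))
# Replace each exponent code with its multiplier
data.set2$PROPDMGEXP<-mapvalues(data.set2$PROPDMGEXP, unlist(exp[1]), 
                                unlist(exp[2]), warn_missing = FALSE)
data.set2$CROPDMGEXP<-mapvalues(data.set2$CROPDMGEXP, unlist(exp[1]),
                                unlist(exp[2]), warn_missing = FALSE)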

data.set2$totalpropdmg <- data.set2$PROPDMG * 
    as.numeric(as.character(data.set2$PROPDMGEXP))
data.set2$totalcropdmg <- data.set2$CROPDMG * 
    as.numeric(as.character(data.set2$CROPDMGEXP))
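
As a worked example of the conversion, the sketch below applies a reduced lookup table to a few made-up amounts. It uses base R `match()` in place of `mapvalues`. The names `keys`, `mult` and `to_dollars` are hypothetical, and only four of the codes are included.

```r
# Reduced lookup table (subset of the codes handled above)
keys <- c("K", "M", "B", "")
mult <- c(1e3, 1e6, 1e9, 1)

# Hypothetical helper: multiply each amount by the value of its code.
# match() is used because indexing a named vector by "" never matches.
to_dollars <- function(amount, code) {
  amount * mult[match(code, keys)]
}

# Made-up amounts: 25K, 2.5M, 10 with a blank code, 5 with an unknown code
damage <- to_dollars(c(25, 2.5, 10, 5), c("K", "M", "", "Z"))
damage  # 25000, 2500000, 10, NA
```

An unknown code yields `NA` in this sketch, which makes unmapped codes visible instead of silently treating them as a multiplier of 1.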

Now that the total crop and property damage is known, we calculate the total, mean and median damage to crops and property for each type of weather event and find the top 20 entries.

data.set2.melted<-melt(data.set2[,list(BGN_DATE, END_DATE, EVTYPE, 
                                       totalcropdmg, totalpropdmg)], 
                       id=c("BGN_DATE", "END_DATE", "EVTYPE"))

data.set2.topevents<- head(
    data.set2.melted[,list(sum=sum(value),  mean=mean(value), 
                           median=median(value)),  by=list(EVTYPE, variable)]
    [order(-sum, -mean)], 20)

The 20 event type and damage category combinations with the highest total crop or property damage are listed below.

data.set2.topevents
##                EVTYPE     variable          sum         mean  median
##  1:             FLOOD totalpropdmg 144657709807   5711826.18       0
##  2: HURRICANE/TYPHOON totalpropdmg  69305840000 787566363.64 6765000
##  3:           TORNADO totalpropdmg  56947380676    938920.08    2500
##  4:       STORM SURGE totalpropdmg  43323536000 165990559.39   37500
##  5:       FLASH FLOOD totalpropdmg  16822673978    309941.12       0
##  6:              HAIL totalpropdmg  15735267513     54511.23       0
##  7:           DROUGHT totalcropdmg  13972566000   5615983.12       0
##  8:         HURRICANE totalpropdmg  11868319010  68208729.94  500000
##  9:    TROPICAL STORM totalpropdmg   7703890550  11165058.77    5000
## 10:      WINTER STORM totalpropdmg   6688497251    585016.82       0
## 11:             FLOOD totalcropdmg   5661968450    223563.47       0
## 12:         HIGH WIND totalpropdmg   5270046295    260738.49       0
## 13:       RIVER FLOOD totalpropdmg   5118945500  29589280.35    5000
## 14:       RIVER FLOOD totalcropdmg   5029459000  29072017.34       0
## 15:         ICE STORM totalcropdmg   5022113500   2503546.11       0
## 16:          WILDFIRE totalpropdmg   4765114000   1725865.27       0
## 17:  STORM SURGE/TIDE totalpropdmg   4641188000  31359378.38       0
## 18:         TSTM WIND totalpropdmg   4484928495     20391.60       0
## 19:         ICE STORM totalpropdmg   3944927860   1966564.24       0
## 20: THUNDERSTORM WIND totalpropdmg   3483122472     42187.45     700

The following chart compares the property and crop damage for these top 20 entries.

ggplot(data.set2.topevents, aes(x=reorder(EVTYPE, sum), y=sum, fill=variable)) + 
    geom_bar(width=1, stat="identity") + 
    coord_flip() +
    labs(title="Top 20 Event Types for Property and Crop Damage",
         y="Total Property+Crop Damage",x="Event Type")

Results

Based on the NOAA data for 1950 to 2011 and the analysis and charts above, we can conclude the following:

  1. Tornado, TSTM Wind, Flood, Lightning and Excessive Heat caused the most injuries and fatalities in the United States.
  2. Flood, Hurricane/Typhoon, Tornado and Storm Surge caused the most property and crop damage.