Overview

This report analyzes data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which documents severe storm and weather events in the United States between 1950 and November 2011. The data describes each event's characteristics, including but not limited to location, date and duration, fatalities, injuries, and damages. This report has two objectives:

  1. to identify the event type that causes the greatest harm to population health
  2. to identify the event type that has the greatest economic consequences

Data Processing

We start by loading the packages we will need later for the descriptive analyses.

library(ggplot2)
library(knitr)
library(plyr)

Now we download the raw data, a bzip2-compressed CSV file. This file was obtained from the link here, as provided by the Coursera course ‘Reproducible Research’. We save the file (46.9 MB) to the local working directory and import it into a data frame.

StormURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# mode = "wb" keeps the compressed (binary) file intact on Windows
download.file(StormURL, destfile = "StormData.csv.bz2", mode = "wb")
Storm <- read.csv("StormData.csv.bz2", header=TRUE)
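
Because the file is large, it can help to skip the download when a local copy already exists. The following is a minimal sketch of that idea, not part of the original analysis; the helper name ‘fetch_if_missing’ is hypothetical.

```r
# Hypothetical helper (not in the original report): download only when the
# destination file is not already present in the working directory.
fetch_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(invisible(FALSE))  # local copy found, nothing downloaded
  }
  download.file(url, destfile = destfile, mode = "wb")
  invisible(TRUE)             # a fresh download took place
}

# Possible usage with the names defined above (left commented out to avoid
# triggering a 46.9 MB download):
# fetch_if_missing(StormURL, "StormData.csv.bz2")
```

The function returns TRUE or FALSE invisibly, so a caller can tell whether a download actually happened.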

We check the dimension of the data imported.

dim(Storm)
## [1] 902297     37

The data contains 37 columns and 902297 rows.

We then check the first 2 rows and the column names.

head(Storm,2)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                        14   100 3   0          0
## 2         NA         0                         2   150 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
names(Storm)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Results

Objective 1: Population health

To find the event type that causes the greatest harm, we focus on the characteristics INJURIES and FATALITIES separately.

Let us first take a look at the total number of injuries by event type. Having loaded the ‘plyr’ package earlier, we use its ‘ddply’ function to sum the injuries for each event type and place the result in a new data frame, which we name ‘Storm.Inj’.

Storm.Inj <- ddply(Storm,.(EVTYPE),summarize,Total.Injuries = sum(INJURIES,na.rm=TRUE))

We then order the number of injuries from highest to lowest, and display the 5 event types with the highest number of injuries.

Storm.Inj <- Storm.Inj[order(Storm.Inj$Total.Injuries,decreasing = TRUE), ]
head(Storm.Inj,5)
##             EVTYPE Total.Injuries
## 834        TORNADO          91346
## 856      TSTM WIND           6957
## 170          FLOOD           6789
## 130 EXCESSIVE HEAT           6525
## 464      LIGHTNING           5230
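
The same sum-then-sort pattern can be illustrated without ‘plyr’. The sketch below uses base R's ‘aggregate’ on a small made-up data frame (the ‘toy’ values are invented for illustration and are not taken from the storm data).

```r
# Toy data (invented for illustration); one record has a missing injury count
toy <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  INJURIES = c(15, 2, 4, NA, 1),
  stringsAsFactors = FALSE
)

# Sum injuries within each event type; na.rm = TRUE is passed on to sum()
totals <- aggregate(toy["INJURIES"], by = list(EVTYPE = toy$EVTYPE),
                    FUN = sum, na.rm = TRUE)
names(totals)[2] <- "Total.Injuries"

# Order from highest to lowest, as done for Storm.Inj
totals <- totals[order(totals$Total.Injuries, decreasing = TRUE), ]
totals
```

Note that a group whose only values are missing (HAIL here) ends up with a total of 0 rather than NA, which is also how ‘sum(..., na.rm = TRUE)’ behaves inside ‘ddply’.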

We see that Tornado is the event type with the highest number of injuries: 91346. It is followed, in decreasing number of injuries, by TSTM Wind, Flood, Excessive Heat, and Lightning.

We visualize the number of injuries by event type with a horizontal bar plot displaying the top 5 event types. Note that the ‘geom_bar’ function is given the argument “stat = ‘identity’”, which sets each bar's length to the value already computed in Total.Injuries for that EVTYPE, instead of counting the rows in each category.

PlotInj <- ggplot(Storm.Inj[1:5,],aes(EVTYPE,Total.Injuries,fill=EVTYPE))
PlotInj + geom_bar(stat='identity')  +  xlab('Type of Event') + ylab ('Total Number of Injuries')+ 
        ggtitle('Highest Number of Injuries by Event Type') + coord_flip()
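
One caveat: ggplot2 draws a discrete axis in the order of the factor levels (alphabetical by default), not in the row order of the data frame, so sorting the data frame does not by itself sort the bars. The sketch below, using the top-5 totals reported above, shows how base R's ‘reorder’ sets the level order by value. Wrapping EVTYPE this way is one possible option, not what the original code does.

```r
# Top-5 injury totals copied from the output above
top5 <- data.frame(
  EVTYPE = c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT", "LIGHTNING"),
  Total.Injuries = c(91346, 6957, 6789, 6525, 5230)
)

# reorder() returns a factor whose levels are sorted by the supplied values
# (ascending), so the largest total becomes the last level
top5$EVTYPE <- reorder(top5$EVTYPE, top5$Total.Injuries)
levels(top5$EVTYPE)

# With ggplot2 loaded, the same call can go directly into the aesthetic:
# ggplot(top5, aes(reorder(EVTYPE, Total.Injuries), Total.Injuries)) +
#   geom_bar(stat = "identity") + coord_flip()
```

After ‘coord_flip()’, the last level is drawn at the top, so the bars would run from the largest total at the top to the smallest at the bottom.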

The plot reaffirms our finding that Tornado causes the highest number of injuries, out of all event types.

Next, we look at the total number of fatalities, in the same way as for injuries.

Storm.Fatal <- ddply(Storm,.(EVTYPE),summarize,Total.Fatalities = sum(FATALITIES,na.rm = TRUE))
Storm.Fatal <- Storm.Fatal[order(Storm.Fatal$Total.Fatalities,decreasing = TRUE), ]
head(Storm.Fatal,5)
##             EVTYPE Total.Fatalities
## 834        TORNADO             5633
## 130 EXCESSIVE HEAT             1903
## 153    FLASH FLOOD              978
## 275           HEAT              937
## 464      LIGHTNING              816

We see that Tornado not only causes the highest number of injuries, but also fatalities. We again visualize this in a horizontal bar plot:

PlotFatal <- ggplot(Storm.Fatal[1:5,],aes(EVTYPE,Total.Fatalities,fill=EVTYPE))
PlotFatal + geom_bar(stat='identity') + xlab('Type of Event') + ylab ('Total Fatalities')+ 
        ggtitle('Highest Number of Fatalities by Event Type') + coord_flip()

Objective 2: Economic consequences

We first identify the event type that causes the most damage to CROPS, followed by that to PROPERTY. Then, we look at the TOTAL damage to crop and property, by event type.

We start by examining the crop damage. The column ‘CROPDMG’ holds the numeric part of the dollar amount, while the column ‘CROPDMGEXP’ encodes its magnitude (for example, thousands or millions). We can therefore create a new column with the actual monetary value by multiplying the amount by the corresponding power of ten. To do this, let us start by viewing the codes that occur in ‘CROPDMGEXP’.

unique(Storm$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

Two of the levels, ‘m’ and ‘k’, are lowercase but mean exactly the same as their uppercase counterparts, ‘M’ and ‘K’. We clean up the column so that the codes are uniform. We first convert the column from class ‘factor’ to ‘character’, so that we can then convert all levels to upper case using the ‘toupper’ function.

Storm$CROPDMGEXP <- as.character(Storm$CROPDMGEXP)
Storm$CROPDMGEXP <- toupper(Storm$CROPDMGEXP)

Next, we replace the levels that are neither a digit nor a recognized letter code (‘?’ and ‘’) with the character ‘0’.

Storm$CROPDMGEXP[Storm$CROPDMGEXP %in% c('','?')] <- '0'
unique(Storm$CROPDMGEXP) # check column has been cleaned to format desired
## [1] "0" "M" "K" "B" "2"

Now we replace the letter codes ‘M’ (for millions), ‘K’ (for thousands), and ‘B’ (for billions) with their powers of ten: 6, 3, and 9, respectively. Note that the character ‘2’ remains unchanged, since it is already a numeric exponent (10^2, the same magnitude that ‘H’, for hundreds, denotes). Then, we convert the column to class ‘numeric’ and raise 10 to that power, so that the column holds the actual multiplier.

Storm$CROPDMGEXP[Storm$CROPDMGEXP %in% c('M')] <- '6'
Storm$CROPDMGEXP[Storm$CROPDMGEXP %in% c('K')] <- '3'
Storm$CROPDMGEXP[Storm$CROPDMGEXP %in% c('B')] <- '9'
Storm$CROPDMGEXP <- 10^(as.numeric(Storm$CROPDMGEXP))
unique(Storm$CROPDMGEXP) # check
## [1] 1e+00 1e+06 1e+03 1e+09 1e+02
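
The step-by-step replacements above can also be expressed as a single lookup table. The following is a minimal sketch under the same conventions (letters map to 2, 3, 6, 9; digits are used as exponents; anything else is treated as exponent 0). The function name ‘exp_to_multiplier’ and the example input are hypothetical.

```r
# Hypothetical helper: map exponent codes to multipliers in one pass
exp_to_multiplier <- function(x) {
  letter_map <- c(H = 2, K = 3, M = 6, B = 9)  # letter code -> power of ten
  x <- toupper(as.character(x))

  # Digits convert directly; letters and symbols become NA for now
  pow <- suppressWarnings(as.numeric(x))

  # Fill in the recognized letter codes from the lookup table
  is_letter <- x %in% names(letter_map)
  pow[is_letter] <- letter_map[x[is_letter]]

  # Everything still unknown ('', '?', '+', '-') gets exponent 0
  pow[is.na(pow)] <- 0
  10^pow
}

exp_to_multiplier(c("K", "m", "", "?", "2", "B", "h", "+"))
```

Keeping the mapping in one named vector means the crop and property columns could share the same conversion instead of repeating the replacement lines.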

Before multiplying the 2 columns together, we check to ensure the class of ‘CROPDMG’ is also numeric.

class(Storm$CROPDMG)
## [1] "numeric"

We create the new column ‘Crop.Damage’ by multiplying the 2 columns together, so that it holds the crop damage in dollars. We then append this column to the ‘Storm’ data frame.

Crop.Damage <- Storm$CROPDMG * Storm$CROPDMGEXP # new column
Storm <- cbind(Storm,Crop.Damage) # merge column with Storm data

Next, by event type, we sum the total crop damage. This information is extracted into a new data frame which we call ‘Event.Crop.Dmg’. The top 5 event types with the highest damage are shown.

Event.Crop.Dmg <- ddply(Storm, .(EVTYPE), summarize, Total.Crop.Dmg = sum(Crop.Damage, na.rm = TRUE))
Event.Crop.Dmg <- Event.Crop.Dmg[order(Event.Crop.Dmg$Total.Crop.Dmg, decreasing = T), ]
head(Event.Crop.Dmg,5)
##          EVTYPE Total.Crop.Dmg
## 95      DROUGHT    13972566000
## 170       FLOOD     5661968450
## 590 RIVER FLOOD     5029459000
## 427   ICE STORM     5022113500
## 244        HAIL     3025954473

We see that Drought results in the greatest amount of financial damage to crops.

We then look at property damages in a similar way.

unique(Storm$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
Storm$PROPDMGEXP <-as.character(Storm$PROPDMGEXP)

Next, we convert all levels to their upper case counterparts and replace the levels that are neither a digit nor a recognized letter code (‘+’, ‘-’, ‘?’, and ‘’) with the character ‘0’.

Storm$PROPDMGEXP <- toupper(Storm$PROPDMGEXP)
Storm$PROPDMGEXP[Storm$PROPDMGEXP %in% c('+','-','?','')] <- '0'
unique(Storm$PROPDMGEXP) # check column has been cleaned to format desired
##  [1] "K" "M" "0" "B" "5" "6" "4" "2" "3" "H" "7" "1" "8"
Storm$PROPDMGEXP[Storm$PROPDMGEXP %in% c('M')] <- '6'
Storm$PROPDMGEXP[Storm$PROPDMGEXP %in% c('K')] <- '3'
Storm$PROPDMGEXP[Storm$PROPDMGEXP %in% c('B')] <- '9'
Storm$PROPDMGEXP[Storm$PROPDMGEXP %in% c('H')] <- '2'
Storm$PROPDMGEXP <- 10^(as.numeric(Storm$PROPDMGEXP))
unique(Storm$PROPDMGEXP) # check
##  [1] 1e+03 1e+06 1e+00 1e+09 1e+05 1e+04 1e+02 1e+07 1e+01 1e+08
Prop.Damage <- Storm$PROPDMG * Storm$PROPDMGEXP # new column
Storm <- cbind(Storm,Prop.Damage) # merge column with Storm data
Event.Prop.Dmg <- ddply(Storm, .(EVTYPE), summarize, Total.Prop.Dmg = sum(Prop.Damage, na.rm = TRUE))
Event.Prop.Dmg <- Event.Prop.Dmg[order(Event.Prop.Dmg$Total.Prop.Dmg, decreasing = T), ]
head(Event.Prop.Dmg,5)
##                EVTYPE Total.Prop.Dmg
## 170             FLOOD   144657709807
## 411 HURRICANE/TYPHOON    69305840000
## 834           TORNADO    56947380676
## 670       STORM SURGE    43323536000
## 153       FLASH FLOOD    16822673978

We see that Flood results in the highest damage to properties.

Now, we look at the total damages (both crop and property). We create a column of total damages named ‘Total.Dmg’, then merge this column into the Storm data frame.

Total.Dmg <- Storm$Crop.Damage + Storm$Prop.Damage
head(Total.Dmg)
## [1] 25000  2500 25000  2500  2500  2500
Storm <- cbind(Storm,Total.Dmg)
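
As a quick check of the arithmetic, the first two totals can be reproduced by hand from the values shown in the ‘head(Storm,2)’ output earlier (PROPDMG of 25.0 and 2.5 with exponent ‘K’, CROPDMG of 0 with an empty exponent). This is only a worked illustration of the formula, not an additional analysis step.

```r
# Values copied from the first two rows of the head(Storm, 2) output
prop <- c(25.0, 2.5) * 10^3  # PROPDMG with exponent 'K' (10^3)
crop <- c(0, 0) * 10^0       # CROPDMG with empty exponent, treated as 10^0

total <- prop + crop
total
```

The result agrees with the first two entries of ‘head(Total.Dmg)’ above.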

We then rank the event types by total damages and display the 6 highest.

Total.Damage <- ddply(Storm,.(EVTYPE),summarize,TotalDamage = sum(Total.Dmg,na.rm=TRUE))
Total.Damage <- Total.Damage[order(Total.Damage$TotalDamage, decreasing = TRUE), ]
head(Total.Damage)
##                EVTYPE  TotalDamage
## 170             FLOOD 150319678257
## 411 HURRICANE/TYPHOON  71913712800
## 834           TORNADO  57362333946
## 670       STORM SURGE  43323541000
## 244              HAIL  18761221986
## 153       FLASH FLOOD  18243991078
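
A simple consistency check is that, for a given event type, the total should equal the sum of the separately reported crop and property totals. The sketch below does this for FLOOD, using the figures printed in the three tables above.

```r
# Figures copied from the tables above (FLOOD rows)
flood_prop  <- 144657709807  # Total.Prop.Dmg
flood_crop  <- 5661968450    # Total.Crop.Dmg
flood_total <- 150319678257  # TotalDamage

# These whole numbers are small enough to be represented exactly as doubles,
# so a direct equality comparison is safe here
consistent <- (flood_prop + flood_crop) == flood_total
consistent
```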

Let us visualize the total damage by event type in a horizontal bar plot:

PlotDmg <- ggplot(Total.Damage[1:5,],aes(EVTYPE,TotalDamage,fill=EVTYPE))
PlotDmg + geom_bar(stat='identity') + xlab('Type of Event') + ylab ('Total Damages')+ 
        ggtitle('Top 5 Events Ranked by Total Damages') + coord_flip()

We see that FLOOD causes the greatest economic damage overall.

Conclusion

- Addressing Objective 1, TORNADO causes the greatest harm to population health, in terms of both injuries and fatalities. EXCESSIVE HEAT comes second overall, since it is 4th highest in total injuries and 2nd highest in total fatalities. The majority of the resources for rescue efforts should be allocated to TORNADO events.

- Addressing Objective 2, FLOOD causes the greatest economic loss, since it is the event type that causes the greatest damage to property and the 2nd greatest damage to crops.

Resources

More information on the database can be found here and here.