NOAA Storm Data Exploratory Analysis on Impacts of Storm Events

Synopsis

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We are interested in the following two questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

The analysis shows that tornadoes are the most harmful event type with respect to population health, and that floods have the greatest economic consequences.

At the end of this document, the top 10 events related to each question are listed for further information.

Data Processing

Read the NOAA storm data into a data frame and cache the result.

noaa <- read.csv("repdata-data-StormData.csv.bz2")
head(noaa, 3)

##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3

List the column names of the data set.

names(noaa)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

1. Data processing for question 1

The first question asks which types of events are most harmful. After looking at the names of the data columns, I think “FATALITIES” and “INJURIES” relate to the harmfulness of an event.

Explore the values of “FATALITIES” and “INJURIES”.

fatalities <- noaa$FATALITIES
injuries <- noaa$INJURIES
sum(fatalities == 0)

## [1] 895323

sum(injuries == 0)

## [1] 884693

There are many 0 values in the two columns. Keep only the rows where “FATALITIES” or “INJURIES” is non-zero.

data1 = subset(noaa, !((FATALITIES == 0) & (INJURIES == 0)))
dim(data1)

## [1] 21929    37
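
As a small illustration of the filter above, the following sketch applies the same condition to a made-up four-row data frame (the toy values are assumptions, not rows from the NOAA file). For non-negative counts, the condition is equivalent to keeping rows with at least one fatality or injury.

```r
# Toy stand-in for the storm data; the values are made up for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(0, 0, 2, 1),
  INJURIES   = c(15, 0, 0, 3),
  stringsAsFactors = FALSE
)

# Same form as the filter in the report: drop rows where both counts are zero.
kept_a <- subset(toy, !((FATALITIES == 0) & (INJURIES == 0)))

# Equivalent form for non-negative counts: keep rows with any casualty.
kept_b <- subset(toy, FATALITIES > 0 | INJURIES > 0)

print(kept_a)
```

Only the HAIL row, which has no fatalities and no injuries, is removed.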

Create a column called “harm” that holds the total fatalities and injuries for each record. Keep only the EVTYPE and harm columns, which are all that question 1 needs.

fatalities <- as.integer(data1$FATALITIES)
injuries <- as.integer(data1$INJURIES)
harm <- fatalities + injuries
data1 <- cbind(data1, as.data.frame(harm))
data1 <- data1[, c("EVTYPE", "harm")]
head(data1)

##    EVTYPE harm
## 1 TORNADO   15
## 3 TORNADO    2
## 4 TORNADO    2
## 5 TORNADO    2
## 6 TORNADO    6
## 7 TORNADO    1

Group the data by EVTYPE and compute the sum of harm for each type of event.

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

harm_by_event <- group_by(data1, EVTYPE)
data1 <- summarize(harm_by_event, totalharm = sum(harm))
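
To make the grouping step concrete, here is a minimal base R sketch of the same idea (sum a value per event type) on a made-up data frame. It uses `aggregate()` rather than the dplyr calls in the report, and the toy values are assumptions for illustration only.

```r
# Toy records: one row per event, with a made-up harm count.
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  harm   = c(15L, 2L, 6L, 4L, 1L),
  stringsAsFactors = FALSE
)

# Sum harm within each event type (base R counterpart of group_by + summarize).
totals <- aggregate(harm ~ EVTYPE, data = toy, FUN = sum)
names(totals)[2] <- "totalharm"

print(totals)
```

The result has one row per event type, just as `data1` does after the `summarize()` call.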

2. Data processing for question 2

The second question asks which types of events have the greatest economic consequences. After looking at the names of the data columns, I think “PROPDMG” and “CROPDMG” are related to economic consequences. Similarly, I keep only the rows where “PROPDMG” or “CROPDMG” is non-zero.

data2 = subset(noaa, !((PROPDMG == 0) & (CROPDMG == 0)))
dim(data2)

## [1] 245031     37

According to the code book provided on the project webpage, damage estimates are rounded to three significant digits and followed by an alphabetical character signifying the magnitude of the number, e.g., 1.55B for $1,550,000,000. I first examine the magnitude codes in the columns “PROPDMGEXP” and “CROPDMGEXP”.

propexp <- data2$PROPDMGEXP
cropexp <- data2$CROPDMGEXP
table(propexp)

## propexp
##             -      ?      +      0      1      2      3      4      5 
##   4357      1      0      5    209      0      1      1      4     18 
##      6      7      8      B      h      H      K      m      M 
##      3      2      0     40      1      6 229057      7  11319

table(cropexp)

## cropexp
##             ?      0      2      B      k      K      m      M 
## 145037      6     17      0      7     21  97960      1   1982

Many of the magnitude codes are not recognizable (for example “-”, “?”, “+”, and the digits). For this project, I analyze only the damage values whose magnitude symbols are in {B, h, H, K, m, M}.

mag_symbols = c("B", "h", "H", "K", "m", "M")
data2 <- subset(data2, PROPDMGEXP %in% mag_symbols & CROPDMGEXP %in% mag_symbols)
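
The sketch below shows how a filter of this form behaves on a made-up data frame (the toy rows are assumptions, not NOAA records). Because the two conditions are joined with `&`, a row is kept only when both exponent columns carry a recognized symbol, so a record with a valid property code but a blank crop code is dropped. The `|` variant is included purely for comparison.

```r
mag_symbols <- c("B", "h", "H", "K", "m", "M")

# Toy rows with a mix of recognized, blank, and unrecognized exponent codes.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10, 5),
  PROPDMGEXP = c("K", "K", "M", "0"),
  CROPDMG    = c(0, 0, 1, 0),
  CROPDMGEXP = c("", "K", "M", "?"),
  stringsAsFactors = FALSE
)

# Same form as the filter in the report: both codes must be recognized.
both <- subset(toy, PROPDMGEXP %in% mag_symbols & CROPDMGEXP %in% mag_symbols)

# For comparison only: keep rows where at least one code is recognized.
either <- subset(toy, PROPDMGEXP %in% mag_symbols | CROPDMGEXP %in% mag_symbols)

print(nrow(both))
print(nrow(either))
```

In this toy data the first row (25 with code K, blank crop code) survives the `|` filter but not the `&` filter.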

Multiply the values of “PROPDMG” and “CROPDMG” by the corresponding magnitudes, expressing the results in millions of dollars.

map = setNames(c(1000000000, 100, 100, 1000, 1000000, 1000000), c("B", "h", "H", "K", "m", "M"))
propexp <- as.vector(data2$PROPDMGEXP)
cropexp <- as.vector(data2$CROPDMGEXP)
propmag <- as.integer(map[unlist(propexp)])
cropmag <- as.integer(map[unlist(cropexp)])
propval = as.integer(data2$PROPDMG) * (propmag / 1000000)
cropval = as.integer(data2$CROPDMG) * (cropmag / 1000000)
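
The lookup can be illustrated on a few made-up values (the vectors `dmg` and `exp_code` are hypothetical names used only here). The sketch maps each symbol to its multiplier through a named vector and converts to millions of dollars. It also shows what `as.integer()` does to a fractional amount such as 2.5 or 1.55: the fractional part is truncated before the multiplication.

```r
# Multipliers keyed by magnitude symbol, same values as in the report.
map <- c(B = 1e9, h = 1e2, H = 1e2, K = 1e3, m = 1e6, M = 1e6)

# Made-up damage amounts and their magnitude symbols.
dmg      <- c(25, 2.5, 1.55)
exp_code <- c("K", "K", "B")

# Look up the multiplier for each symbol by name.
mult <- unname(map[exp_code])

# Damage in millions of dollars, keeping the fractional part.
in_millions <- dmg * mult / 1e6

# The same computation with the amounts coerced to integer first.
truncated <- as.integer(dmg) * mult / 1e6

print(in_millions)
print(truncated)
```

With the amounts kept as numeric, 1.55B comes out as 1550 million dollars, matching the code book example. With integer coercion, it becomes 1000 million.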

Create a column called “damage” (property plus crop damage, in millions of dollars) and keep only the EVTYPE and damage columns for answering question 2.

damage = propval + cropval
data2 <- cbind(data2, as.data.frame(damage))
data2 = data2[, c("EVTYPE", "damage")]
head(data2)

##                           EVTYPE damage
## 187566 HURRICANE OPAL/HIGH WINDS   10.0
## 187571        THUNDERSTORM WINDS    5.5
## 187581            HURRICANE ERIN   26.0
## 187583            HURRICANE OPAL   52.0
## 187584            HURRICANE OPAL   30.0
## 187653        THUNDERSTORM WINDS    0.1

Group the data by EVTYPE and compute the sum of damage for each type of event.

damage_by_event <- group_by(data2, EVTYPE)
data2 <- summarize(damage_by_event, totaldamage = sum(damage))

Results

I have pre-processed the raw data by combining and extracting the information needed to answer the questions. The data set data1 contains each event type and its total harm. The data set data2 contains each event type and its total economic damage. The following plot shows the total harm and total economic damage, on a log10 scale, against the index of each event type.

par(mfrow = c(1, 2), oma = c(0, 0, 3, 0))
plot(log10(data1$totalharm), type = "l", xlab = "Event Index", ylab = "Total Harm (log)")
plot(log10(data2$totaldamage), type = "l", xlab = "Event Index", ylab = "Total Economic Damage (log)")
mtext("Harms and Economic Damages vs. Event Indices", outer = TRUE, cex = 1.5)

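[Figure: Harms and Economic Damages vs. Event Indices. Left panel: Total Harm (log) against Event Index. Right panel: Total Economic Damage (log) against Event Index.]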

Rank the harm and damage values and find the events related to the largest values.

data1sorted = arrange(data1, desc(totalharm))
data2sorted = arrange(data2, desc(totaldamage))
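
As a minimal sketch of the ranking step, the following base R code sorts a made-up totals table in descending order and takes the leading event types. It uses `order()` instead of dplyr's `arrange()`, and the numbers are assumptions for illustration.

```r
# Toy per-event totals; the numbers are made up.
totals <- data.frame(
  EVTYPE    = c("FLOOD", "HEAT", "TORNADO", "HAIL"),
  totalharm = c(3, 4, 21, 1),
  stringsAsFactors = FALSE
)

# Sort rows from largest to smallest total.
ranked <- totals[order(totals$totalharm, decreasing = TRUE), ]

# Names of the two highest-ranked event types.
top2 <- head(ranked$EVTYPE, 2)

print(top2)
```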

The top 10 most harmful events are:

head(data1sorted[1], 10)

## Source: local data frame [10 x 1]
## 
##               EVTYPE
## 1            TORNADO
## 2     EXCESSIVE HEAT
## 3          TSTM WIND
## 4              FLOOD
## 5          LIGHTNING
## 6               HEAT
## 7        FLASH FLOOD
## 8          ICE STORM
## 9  THUNDERSTORM WIND
## 10      WINTER STORM

The top 10 events that have the greatest economic consequences are:

head(data2sorted[1], 10)

## Source: local data frame [10 x 1]
## 
##               EVTYPE
## 1              FLOOD
## 2  HURRICANE/TYPHOON
## 3            TORNADO
## 4          HURRICANE
## 5        RIVER FLOOD
## 6               HAIL
## 7        FLASH FLOOD
## 8          ICE STORM
## 9   STORM SURGE/TIDE
## 10 THUNDERSTORM WIND