Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The purpose of the analysis is to determine (1) which weather events have the greatest impact on population health, and (2) which types of events have the greatest economic consequences.

Data Processing

The data set is downloaded, decompressed, and then read with read.csv, as shown below. Events in the database start in 1950, but records for more recent years are more complete, so only data from 1990 onward is used in the analysis. The column of interest is EVTYPE, which stores the event type. However, the raw EVTYPE column contains far more distinct values than the 48 event categories described in the NOAA documentation. The list of 48 event categories was copied manually from the NOAA document into categories.csv, which is used to clean up the EVTYPE column in the data set.

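fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, destfile = "FStormData.csv.bz2")
# Decompress to FStormData.csv, keeping the archive (-k).
system("bzip2 -dk FStormData.csv.bz2")
rawData <- read.csv("FStormData.csv")
# BGN_DATE holds text such as "4/18/1950 0:00:00"; parse the date part.
rawData$BGN_DATE <- as.Date(as.character(rawData$BGN_DATE), format = "%m/%d/%Y")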

# Load the 48 official event categories.
categories <- read.csv("categories.csv", header = FALSE)
categList <- toupper(as.character(categories$V1))
# Select events from Jan 1, 1990 onward.
myData <- subset(rawData, BGN_DATE >= as.Date("1990-01-01"))
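
The date parsing and filtering step can be illustrated on a tiny in-memory table. This is only a sketch under stated assumptions: the toy data frame and its values are hypothetical, and BGN_DATE is assumed to arrive as text in month/day/year form followed by a time, for example "4/18/1950 0:00:00".

```r
# Hypothetical miniature of the storm table (values are made up).
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/1/1990 0:00:00", "5/22/2011 0:00:00"),
  EVTYPE   = c("TORNADO", "HAIL", "TORNADO"),
  stringsAsFactors = FALSE
)

# as.Date() reads only the part described by the format string,
# so the trailing time is ignored.
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep events that began on or after Jan 1, 1990.
recent <- subset(toy, BGN_DATE >= as.Date("1990-01-01"))
```

Parsing the text explicitly, rather than only changing the class of the column, is what makes the comparison against as.Date("1990-01-01") meaningful.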

Furthermore, to answer the first question, the data was restricted to events with more than 10 fatalities or more than 10 injuries, which dramatically reduces the amount of data to be processed and the number of issues with event names. A brief inspection with a frequency table, table(EVTYPE), shows that some name variants remain, but their occurrences are few. Some cleanup is done on the more frequent entries.

# Keep only events with more than 10 fatalities or more than 10 injuries.
myData1 <- subset(myData, FATALITIES > 10 | INJURIES > 10)
# Merge the most frequent spelling variants of event names.
myData1$EVTYPE <- gsub("TSTM", "THUNDERSTORM", myData1$EVTYPE)
myData1$EVTYPE <- gsub("WINDS", "WIND", myData1$EVTYPE)
myData1$EVTYPE <- gsub("THUNDERSTORMW", "THUNDERSTORM WIND", myData1$EVTYPE)
myData1$EVTYPE <- gsub("Heat Wave", "HEAT WAVE", myData1$EVTYPE)
myData1$EVTYPE <- gsub("HEAT WAVE", "HEAT", myData1$EVTYPE)
myData1$EVTYPE <- gsub("RECORD HEAT", "EXCESSIVE HEAT", myData1$EVTYPE)

print(table(myData1$EVTYPE))
## 
##                  BLIZZARD                DUST STORM 
##                         1                         8 
##            EXCESSIVE HEAT        EXCESSIVE RAINFALL 
##                       105                         1 
##              EXTREME HEAT               FLASH FLOOD 
##                         6                        19 
##                     FLOOD                       FOG 
##                        12                         1 
##                      HAIL                      HEAT 
##                        26                        34 
##              HEAT DROUGHT                HEAVY RAIN 
##                         1                         1 
##                HEAVY SNOW                 HIGH SURF 
##                         1                         2 
##                 HIGH WIND                 HURRICANE 
##                         8                         2 
##         HURRICANE/TYPHOON                 LIGHTNING 
##                         4                        33 
##     RECORD/EXCESSIVE HEAT              RIP CURRENTS 
##                         1                         3 
##          STORM SURGE/TIDE               STRONG WIND 
##                         1                         1 
##         THUNDERSTORM WIND    THUNDERSTORM WIND/HAIL 
##                        87                         2 
##                   TORNADO            TROPICAL STORM 
##                       634                         3 
##                   TSUNAMI         UNSEASONABLY WARM 
##                         1                         1 
## UNSEASONABLY WARM AND DRY      URBAN/SML STREAM FLD 
##                         1                         1 
##                WATERSPOUT        WATERSPOUT/TORNADO 
##                         1                         1 
##                  WILDFIRE          WILD/FOREST FIRE 
##                         8                         6 
##                      WIND              WINTER STORM 
##                         2                         1 
##        WINTER WEATHER MIX 
##                         1
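
One way to see which name variants still fall outside the official list is to wrap the substitutions in a helper and compare the result against the categories. The sketch below is illustrative only: normalize_evtype is a hypothetical helper, the input names are made up, and official is a short stand-in for the 48-entry categList.

```r
# Hypothetical helper: upper-case, trim, and merge common spelling variants.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  x <- gsub("TSTM", "THUNDERSTORM", x)
  x <- gsub("WINDS", "WIND", x)
  # Anchored with $ so that only a name ending in the truncated form is expanded.
  x <- gsub("THUNDERSTORMW$", "THUNDERSTORM WIND", x)
  x <- gsub("HEAT WAVE", "HEAT", x)
  gsub("RECORD HEAT", "EXCESSIVE HEAT", x)
}

# Made-up raw names and a stand-in for the official category list.
raw_names <- c("TSTM WIND", "THUNDERSTORM WINDS", "Heat Wave",
               "RECORD HEAT", "TORNADO", " FOG")
official  <- c("THUNDERSTORM WIND", "HEAT", "EXCESSIVE HEAT", "TORNADO")

clean    <- normalize_evtype(raw_names)
# Names that still have no match after cleaning.
leftover <- setdiff(clean, official)
```

Inspecting leftover after each new substitution gives a quick check of how many variants remain unmatched.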

Results

Events with the most significant impact on population health

To answer this question, after inspecting the variables in the data set, the FATALITIES and INJURIES columns were selected, and only events whose type matches one of the 48 official categories were kept. Fatalities and injuries were then analyzed separately: the total of each was computed per event type, and the six event types with the highest totals are shown.

names(myData1)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
myData1 <- subset.data.frame(myData1, myData1$EVTYPE %in% categList)
fat <- data.frame(event = myData1$EVTYPE, cnt = myData1$FATALITIES)
inj <- data.frame(event = myData1$EVTYPE, cnt = myData1$INJURIES)

library(plyr)
fat <- ddply(fat, .(event), numcolwise(sum, na.rm = TRUE))
inj <- ddply(inj, .(event), numcolwise(sum, na.rm = TRUE))
# Select output sorted descending
fat <- fat[with(fat, order(-cnt)),]
inj <- inj[with(inj, order(-cnt)),]

Tornadoes are the event type with the highest impact on population health, followed by excessive heat. The data frame fat stores the fatality totals, and the data frame inj stores the injury totals.

head(fat)
##                event  cnt
## 16           TORNADO 2071
## 3     EXCESSIVE HEAT  920
## 7               HEAT  807
## 4        FLASH FLOOD   66
## 15 THUNDERSTORM WIND   34
## 18           TSUNAMI   32
head(inj)
##                event   cnt
## 16           TORNADO 33585
## 3     EXCESSIVE HEAT  6259
## 7               HEAT  2377
## 15 THUNDERSTORM WIND  1932
## 4        FLASH FLOOD   813
## 6               HAIL   794
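
The same per-event totals can also be computed without plyr. The following is a minimal sketch using base R's aggregate on a hypothetical toy table (the values are made up); it is an alternative formulation, not the code that produced the output above.

```r
# Hypothetical toy data (values are made up).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLASH FLOOD"),
  FATALITIES = c(12, 20, 30, 11),
  INJURIES   = c(100, 5, 50, 3)
)

# Sum both columns per event type in one call.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort by fatalities, highest first.
totals <- totals[order(-totals$FATALITIES), ]
```

Keeping both columns in one table makes it easy to re-sort by INJURIES when the injury ranking is needed.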

The combined number of tornado-related fatalities and injuries per year from 1990 to 2014 is shown in the figure below. According to this graph, the number has decreased starting in 2001, possibly because the population is better prepared for such events and the warning system has improved.

library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following object is masked from 'package:plyr':
## 
##     here
myData3 <- subset.data.frame(myData1, select = c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES"))
myData3 <- subset.data.frame(myData3, myData3$EVTYPE %in% categList)
myData3 <- mutate(myData3, FATINJ = FATALITIES + INJURIES)
myData3 <- subset.data.frame(myData3, select = c("BGN_DATE", "EVTYPE","FATINJ"))

myData3$BGN_DATE <- year(myData3$BGN_DATE)
#myData3 <- transform(myData3, BGN_DATE = year(BGN_DATE)) 
fatinj <- ddply(myData3, .(BGN_DATE, EVTYPE), numcolwise(sum, na.rm = TRUE))
fatinj <- subset.data.frame(fatinj, fatinj$EVTYPE == "TORNADO", select = c("BGN_DATE", "FATINJ"))
plot(fatinj, xlab = "Year", ylab ="Fatalities and Injuries", main = "Event: TORNADO", col = "blue")

**The figure above shows the time course of the health impact of tornadoes from 1990 to 2014.**
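
The yearly series behind such a figure can be sketched in base R as follows. The toy table and its values are hypothetical, and the year is extracted with format() instead of lubridate's year(), which is a substitution made here only to keep the example free of package dependencies.

```r
# Hypothetical toy data (values are made up).
toy <- data.frame(
  BGN_DATE   = as.Date(c("1999-05-03", "1999-06-01", "2011-04-27", "2011-05-22")),
  EVTYPE     = c("TORNADO", "HAIL", "TORNADO", "TORNADO"),
  FATALITIES = c(2, 0, 3, 5),
  INJURIES   = c(10, 12, 20, 30)
)

# Keep one event type and combine fatalities and injuries.
tornado        <- subset(toy, EVTYPE == "TORNADO")
tornado$YEAR   <- as.integer(format(tornado$BGN_DATE, "%Y"))
tornado$FATINJ <- tornado$FATALITIES + tornado$INJURIES

# One row per year with the combined total.
yearly <- aggregate(FATINJ ~ YEAR, data = tornado, FUN = sum)
```

Passing yearly to plot(), for example plot(yearly, type = "b"), would draw the per-year totals as connected points.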

Across the United States, which types of events have the greatest economic consequences?

For simplicity, crop and property damage are considered together. Again, the data is subset to obtain a smaller, more significant data set: only rows where the property damage value or the crop damage value is greater than a chosen small threshold of 10 are kept. The total damage per event type is calculated and sorted in decreasing order; the top six are shown.

library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, mday, month, quarter, wday, week, yday, year
# Keep only the event type and the two damage columns.
myData2 <- myData[, c("EVTYPE", "PROPDMG", "CROPDMG")]
myData2 <- subset(myData2, EVTYPE %in% categList)
myData2 <- subset(myData2, PROPDMG > 10 | CROPDMG > 10)

myData2 <- mutate(myData2, dmg = CROPDMG + PROPDMG)
names(myData2)[1] <- "event"
myData2 <- subset.data.frame(myData2, select =c("event", "dmg"))
damage <- ddply(myData2, .(event), numcolwise(sum, na.rm = TRUE))
damage <- damage[with(damage, order(-dmg)),]

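Tornadoes have the greatest economic impact as measured here, that is, by the sum of the raw PROPDMG and CROPDMG values. Note that these sums do not apply the PROPDMGEXP and CROPDMGEXP magnitude columns, so they are not dollar amounts.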

head(damage)
##                event       dmg
## 30           TORNADO 1756911.2
## 10       FLASH FLOOD 1176444.6
## 14              HAIL  980066.7
## 11             FLOOD  639209.2
## 29 THUNDERSTORM WIND  624082.2
## 22         LIGHTNING  511978.1
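
The totals above add the raw PROPDMG and CROPDMG values. The data set also carries the PROPDMGEXP and CROPDMGEXP columns, in which the letters K, M, and B signify thousands, millions, and billions. A minimal sketch of scaling the damage values by these codes is shown below. It is not part of the analysis above: exp_to_multiplier is a hypothetical helper, the toy values are made up, and treating every other code (blank, digits, symbols) as a multiplier of 1 is an assumption made only for illustration.

```r
# Hypothetical helper: map a magnitude code to a numeric multiplier.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  # Assumption: any code other than K, M, or B leaves the value unscaled.
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Hypothetical toy data (values are made up).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(2.5, 1, 500),
  PROPDMGEXP = c("M", "B", "K"),
  CROPDMG    = c(0, 10, 0),
  CROPDMGEXP = c("", "m", ""),
  stringsAsFactors = FALSE
)

# Damage in dollars: value times multiplier, property plus crop.
toy$dmg_usd <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP) +
  toy$CROPDMG * exp_to_multiplier(toy$CROPDMGEXP)
```

With scaled values, a single billion-dollar event outweighs many small ones, so a ranking based on dmg_usd may differ from one based on the raw sums.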