Title: Explore the NOAA storm Database, and answer what types of weather events are harmful?

Synopis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project explores the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Exploring this database, we can answer two the following questions:

Data Processing

The data for this project come be download from web site. It is in the form of a comma-seperated-value file, compressed via the bzip2 algorithm.

Read data file


fname = "StormData.csv"
if (!file.exists(fname)) {
    dataURL = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    temp <- "StormData.csv.bz2"
    download.file(dataURL, temp, method = "curl")
    library(R.utils)
    bunzip2(temp, "StormData.csv", remove = FALSE)
    unlink(temp)
}
data <- read.csv(fname)

First, brief look at data and find out what kinds of information are available.

dim(data)
## [1] 902297     37
# summary(data)
head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
ncolumns <- ncol(data)

Extract select columns

Although the dataset has 37 fields, only a few are intereting for this project.

selectFields <- c("STATE", "EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG", 
    "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data <- data[, selectFields]

Restrict to US states

Here we are only interesting data on US, so using R's built-in state.abb dataset to restrict the records of US *STATE*s.

data <- data[data$STATE %in% state.abb, ]

Covert BGN_DATE field into a new beginDate date class.

# data$BGN_DATE <- as.Date(as.POSIXlt(data$BGN_DATE,format='%m/%d/%Y
# %H:%M:%S'))
data$beginDate <- as.Date(as.POSIXct(data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"))

Add new property damage variable propertyDamage

'propertyDamage', is defined as the product of PROPDMG and PROPDMGEXP, where “H” for hundres, “K” for thousands, “M” for millions, and “B” for billions.

data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
data$propertyDamage <- ifelse(data$PROPDMGEXP == "B", data$PROPDMG * 1e+09, 
    ifelse(data$PROPDMGEXP == "M", data$PROPDMG * 1e+06, ifelse(data$PROPDMGEXP == 
        "K", data$PROPDMG * 1000, ifelse(data$PROPDMGEXP == "H", data$PROPDMG * 
        100, data$PROPDMG))))

summary(data$propertyDamage)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.00e+00 0.00e+00 0.00e+00 4.80e+05 1.00e+03 1.15e+11

Another new variable, cropDamage, defined similar ass 'propDamage'.

data$CROPDMGEXP <- toupper(data$CROPDMGEXP)
data$cropDamage <- ifelse(data$CROPDMGEXP == "B", data$CROPDMG * 1e+09, ifelse(data$CROPDMGEXP == 
    "M", data$CROPDMG * 1e+06, ifelse(data$CROPDMGEXP == "K", data$CROPDMG * 
    1000, ifelse(data$CROPDMGEXP == "H", data$CROPDMG * 100, data$CROPDMG))))
summary(data$cropDamage)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.00e+00 0.00e+00 0.00e+00 5.48e+04 0.00e+00 5.00e+09

Clean EVTYPE variable

nEVTYPE <- length(unique(data$EVTYPE))

There are 952 types of events recorded in this database, listed as above. The unique number is too large to manage without grouping. Here we refer the group categories found in the 2009 Annual Summaries on page 3. There are 7 categories: Convection, Extreme Temperature, Flood, Marine, Tropical Cyclones, Winter and Other. In each category has many sub-types. For the sake of simplicity, marine and tropical cyclones are put into Other.

For the event types fall into the category of Convection.

Lightning <- "\\bL\\S+?G\\b"
Tornado <- "(NADO)|(\\bTOR\\S+?O\\b|(\\bFUN))"
Thunderstorm <- "THUNDERSTORM|TSTM"
Wind <- "(WIND)|(WND)"
Hail <- "HAIL"
regex <- paste(Lightning, Tornado, Thunderstorm, Wind, Hail, sep = "|")
data$eventConvection <- grepl(regex, data$EVTYPE, ignore.case = TRUE)

For variations of Cold and Heat, list the event types that fall into the category of Extreme Temperatures.

regex <- "COLD|HEAT"
data$eventExtremeTemp <- grepl(regex, data$EVTYPE, ignore.case = TRUE)

For variations of Flood and Rain,list the event types that fall into the category of Flood.

Flood <- "(\\bFL\\S+?D)"
Rain <- "RAIN|PRECIP|SHOWER"
regex <- paste(Flood, Rain, sep = "|")

data$eventFlood <- grepl(regex, data$EVTYPE, ignore.case = TRUE)

For variations of Snow, Ice, Freeze, or Winter Weather, list the event types that fall into the category of Winter.

regex <- "(SNOW)|(ICE)|(ICY)|(FREEZ)|(WINT)"
data$eventWinter <- grepl(regex, data$EVTYPE, ignore.case = TRUE)

For event types that don't satisfy any one of the aboves, create an Other indicator for ungrouped event types.

where <- expression(data$eventConvection == FALSE & data$eventExtremeTemp == 
    FALSE & data$eventFlood == FALSE & data$eventWinter == FALSE)
data$eventOther <- eval(where)

Categorize event types

Now we can set up a categorization hierarchy of event types. Notices 'EVTYPE' records can have multiple events, E.g., WINTER STORM/HIGH WINDS which set both eventConvection and eventWinter. A crosstabulation for the event type categories is below.

d1 <- aggregate(propertyDamage ~ eventConvection + eventExtremeTemp + eventFlood + 
    eventWinter + eventOther, data = data, sum)
d1
##    eventConvection eventExtremeTemp eventFlood eventWinter eventOther
## 1             TRUE            FALSE      FALSE       FALSE      FALSE
## 2            FALSE             TRUE      FALSE       FALSE      FALSE
## 3             TRUE             TRUE      FALSE       FALSE      FALSE
## 4            FALSE            FALSE       TRUE       FALSE      FALSE
## 5             TRUE            FALSE       TRUE       FALSE      FALSE
## 6            FALSE            FALSE      FALSE        TRUE      FALSE
## 7             TRUE            FALSE      FALSE        TRUE      FALSE
## 8            FALSE             TRUE      FALSE        TRUE      FALSE
## 9            FALSE            FALSE       TRUE        TRUE      FALSE
## 10            TRUE            FALSE       TRUE        TRUE      FALSE
## 11           FALSE            FALSE      FALSE       FALSE       TRUE
##    propertyDamage
## 1       9.262e+10
## 2       1.447e+08
## 3       1.211e+08
## 4       1.703e+11
## 5       1.807e+07
## 6       1.171e+10
## 7       6.555e+07
## 8       1.050e+06
## 9       2.765e+07
## 10      1.500e+06
## 11      1.487e+11

To simplify the problem, we choose higher categories outrank lower categories, The hierarchy is as follows. 1. Convection (including lightning, tornado, thunderstorm, wind, and hail) 2. Extreme temperature (including hot and cold) 3. Flood (including flood, flash flood, rain) 4. Winter (including snow, ice, freeze, or winter weather) 5. Other That is, the example event type of WINTER STORM/HIGH WIND would only be considered as Convection category.

data$eventCategory <- ifelse(data$eventConvection, "Convection", ifelse(data$eventExtremeTemp, 
    "ExtremeTemp", ifelse(data$eventFlood, "Flood", ifelse(data$eventWinter, 
        "Winter", ifelse(data$eventOther, "Others", NA)))))


table(data$eventCategory)
## 
##  Convection ExtremeTemp       Flood      Others      Winter 
##      724519        3515       92651       21616       40885

Restrict date range

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The date ranges for each category are listed below. Filter the data to include records including all considered event categories.

d1 <- aggregate(beginDate ~ eventCategory, data, min)
d2 <- aggregate(beginDate ~ eventCategory, data, max)

cbind(d1, end_BeginDate = d2$beginDate)
##   eventCategory  beginDate end_BeginDate
## 1    Convection 1950-01-03    2011-11-30
## 2   ExtremeTemp 1993-01-04    2011-11-15
## 3         Flood 1993-01-01    2011-11-30
## 4        Others 1993-01-01    2011-11-30
## 5        Winter 1993-01-01    2011-11-30
# boxplot(log(propertyDamage)~eventCategory,
# data=data[data$propertyDamage>0,])
library(lubridate)
minYear <- max(year(d1$beginDate))
maxYear <- max(year(d2$beginDate))
data <- data[year(data$beginDate) >= minYear, ]

Results

labels <- c("Convection", "Extreme temperature", "Flood", "Winter", "Other")
d_Fatalities <- data[data$FATALITIES > 0, ]
tab_Fatalities <- aggregate(FATALITIES ~ eventCategory, d_Fatalities, sum)
maxFcategory <- max(tab_Fatalities$FATALITIES)
maxFEvtCategory <- labels[which(tab_Fatalities$FATALITIES == maxFcategory)]
library(ggplot2)
ggplot(data = tab_Fatalities, aes(x = eventCategory, y = FATALITIES, fill = FATALITIES)) + 
    geom_bar(stat = "identity")

plot of chunk resultsFAT

The most harmful event category is Convection with the total dealth of 3592 between 1993 and 2011.

Here list the top 5 harmful events with largest fatalities in US.

data[order(data$FATALITIES, decreasing = T), ][1:5, c("STATE", "EVTYPE", "beginDate", 
    "eventCategory", "FATALITIES")]
##        STATE         EVTYPE  beginDate eventCategory FATALITIES
## 198704    IL           HEAT 1995-07-12   ExtremeTemp        583
## 862634    MO        TORNADO 2011-05-22    Convection        158
## 355128    IL EXCESSIVE HEAT 1999-07-28   ExtremeTemp         99
## 371112    PA EXCESSIVE HEAT 1999-07-04   ExtremeTemp         74
## 230927    PA EXCESSIVE HEAT 1995-07-01   ExtremeTemp         67
d_Injuries <- data[data$INJURIES > 0, ]
tab_Injuries <- aggregate(INJURIES ~ eventCategory, d_Injuries, sum)
maxFcategory <- max(tab_Injuries$INJURIES)
maxFEvtCategory <- labels[which(tab_Injuries$INJURIES == maxFcategory)]
ggplot(data = tab_Injuries, aes(x = eventCategory, y = INJURIES, fill = INJURIES)) + 
    geom_bar(stat = "identity")

plot of chunk inj

The most harmful event category is Convection with the total injuries of 3.7648 × 104 between 1993 and 2011.

Here list the top 5 harmful events with largest injuries in US.

data[order(data$INJURIES, decreasing = T), ][1:5, c("STATE", "EVTYPE", "beginDate", 
    "eventCategory", "INJURIES")]
##        STATE            EVTYPE  beginDate eventCategory INJURIES
## 223449    OH         ICE STORM 1994-02-08        Winter     1568
## 862634    MO           TORNADO 2011-05-22    Convection     1150
## 344159    TX             FLOOD 1998-10-17         Flood      800
## 860386    AL           TORNADO 2011-04-27    Convection      800
## 529351    FL HURRICANE/TYPHOON 2004-08-13        Others      780
d_DMG <- data[data$propertyDamage > 0, ]
tab_DMG <- aggregate(propertyDamage ~ eventCategory, d_DMG, sum)
maxFcategory <- max(tab_DMG$propertyDamage)
maxFEvtCategory <- labels[which(tab_DMG$propertyDamage == maxFcategory)]
maxFcategory <- maxFcategory/1e+09
ggplot(data = tab_DMG, aes(x = eventCategory, y = propertyDamage, fill = propertyDamage)) + 
    geom_bar(stat = "identity")

plot of chunk RpropertyDMG

The most property damaged event category is Flood with the total property damage of $170.3601B between 1993 and 2011.

Here list the top 5 property damaged event in US.

# data[which(data$propertyDamage==max(data$propertyDamage)),c('STATE','EVTYPE','beginDate','eventCategory','propertyDamage')]
data[order(data$propertyDamage, decreasing = T), ][1:5, c("STATE", "EVTYPE", 
    "beginDate", "eventCategory", "propertyDamage")]
##        STATE            EVTYPE  beginDate eventCategory propertyDamage
## 605953    CA             FLOOD 2006-01-01         Flood      1.150e+11
## 577676    LA       STORM SURGE 2005-08-29        Others      3.130e+10
## 577675    LA HURRICANE/TYPHOON 2005-08-28        Others      1.693e+10
## 581535    MS       STORM SURGE 2005-08-29        Others      1.126e+10
## 569308    FL HURRICANE/TYPHOON 2005-10-24        Others      1.000e+10