Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project explores the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Exploring this database, we can answer two the following questions:
The data for this project come be download from web site. It is in the form of a comma-seperated-value file, compressed via the bzip2 algorithm.
fname = "StormData.csv"
if (!file.exists(fname)) {
dataURL = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
temp <- "StormData.csv.bz2"
download.file(dataURL, temp, method = "curl")
library(R.utils)
bunzip2(temp, "StormData.csv", remove = FALSE)
unlink(temp)
}
data <- read.csv(fname)
First, brief look at data and find out what kinds of information are available.
dim(data)
## [1] 902297 37
# summary(data)
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
ncolumns <- ncol(data)
Although the dataset has 37 fields, only a few are intereting for this project.
selectFields <- c("STATE", "EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data <- data[, selectFields]
Here we are only interesting data on US, so using R's built-in state.abb
dataset to restrict the records of US *STATE*s.
data <- data[data$STATE %in% state.abb, ]
# data$BGN_DATE <- as.Date(as.POSIXlt(data$BGN_DATE,format='%m/%d/%Y
# %H:%M:%S'))
data$beginDate <- as.Date(as.POSIXct(data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"))
'propertyDamage', is defined as the product of PROPDMG and PROPDMGEXP, where “H” for hundres, “K” for thousands, “M” for millions, and “B” for billions.
data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
data$propertyDamage <- ifelse(data$PROPDMGEXP == "B", data$PROPDMG * 1e+09,
ifelse(data$PROPDMGEXP == "M", data$PROPDMG * 1e+06, ifelse(data$PROPDMGEXP ==
"K", data$PROPDMG * 1000, ifelse(data$PROPDMGEXP == "H", data$PROPDMG *
100, data$PROPDMG))))
summary(data$propertyDamage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 4.80e+05 1.00e+03 1.15e+11
cropDamage
, defined similar ass 'propDamage'.data$CROPDMGEXP <- toupper(data$CROPDMGEXP)
data$cropDamage <- ifelse(data$CROPDMGEXP == "B", data$CROPDMG * 1e+09, ifelse(data$CROPDMGEXP ==
"M", data$CROPDMG * 1e+06, ifelse(data$CROPDMGEXP == "K", data$CROPDMG *
1000, ifelse(data$CROPDMGEXP == "H", data$CROPDMG * 100, data$CROPDMG))))
summary(data$cropDamage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 5.48e+04 0.00e+00 5.00e+09
nEVTYPE <- length(unique(data$EVTYPE))
There are 952 types of events recorded in this database, listed as above. The unique number is too large to manage without grouping. Here we refer the group categories found in the 2009 Annual Summaries on page 3. There are 7 categories: Convection, Extreme Temperature, Flood, Marine, Tropical Cyclones, Winter and Other. In each category has many sub-types. For the sake of simplicity, marine and tropical cyclones are put into Other.
For the event types fall into the category of Convection.
Lightning <- "\\bL\\S+?G\\b"
Tornado <- "(NADO)|(\\bTOR\\S+?O\\b|(\\bFUN))"
Thunderstorm <- "THUNDERSTORM|TSTM"
Wind <- "(WIND)|(WND)"
Hail <- "HAIL"
regex <- paste(Lightning, Tornado, Thunderstorm, Wind, Hail, sep = "|")
data$eventConvection <- grepl(regex, data$EVTYPE, ignore.case = TRUE)
For variations of Cold and Heat, list the event types that fall into the category of Extreme Temperatures.
regex <- "COLD|HEAT"
data$eventExtremeTemp <- grepl(regex, data$EVTYPE, ignore.case = TRUE)
For variations of Flood and Rain,list the event types that fall into the category of Flood.
Flood <- "(\\bFL\\S+?D)"
Rain <- "RAIN|PRECIP|SHOWER"
regex <- paste(Flood, Rain, sep = "|")
data$eventFlood <- grepl(regex, data$EVTYPE, ignore.case = TRUE)
For variations of Snow, Ice, Freeze, or Winter Weather, list the event types that fall into the category of Winter.
regex <- "(SNOW)|(ICE)|(ICY)|(FREEZ)|(WINT)"
data$eventWinter <- grepl(regex, data$EVTYPE, ignore.case = TRUE)
For event types that don't satisfy any one of the aboves, create an Other indicator for ungrouped event types.
where <- expression(data$eventConvection == FALSE & data$eventExtremeTemp ==
FALSE & data$eventFlood == FALSE & data$eventWinter == FALSE)
data$eventOther <- eval(where)
Now we can set up a categorization hierarchy of event types. Notices 'EVTYPE' records can have multiple events, E.g., WINTER STORM/HIGH WINDS which set both eventConvection and eventWinter. A crosstabulation for the event type categories is below.
d1 <- aggregate(propertyDamage ~ eventConvection + eventExtremeTemp + eventFlood +
eventWinter + eventOther, data = data, sum)
d1
## eventConvection eventExtremeTemp eventFlood eventWinter eventOther
## 1 TRUE FALSE FALSE FALSE FALSE
## 2 FALSE TRUE FALSE FALSE FALSE
## 3 TRUE TRUE FALSE FALSE FALSE
## 4 FALSE FALSE TRUE FALSE FALSE
## 5 TRUE FALSE TRUE FALSE FALSE
## 6 FALSE FALSE FALSE TRUE FALSE
## 7 TRUE FALSE FALSE TRUE FALSE
## 8 FALSE TRUE FALSE TRUE FALSE
## 9 FALSE FALSE TRUE TRUE FALSE
## 10 TRUE FALSE TRUE TRUE FALSE
## 11 FALSE FALSE FALSE FALSE TRUE
## propertyDamage
## 1 9.262e+10
## 2 1.447e+08
## 3 1.211e+08
## 4 1.703e+11
## 5 1.807e+07
## 6 1.171e+10
## 7 6.555e+07
## 8 1.050e+06
## 9 2.765e+07
## 10 1.500e+06
## 11 1.487e+11
To simplify the problem, we choose higher categories outrank lower categories, The hierarchy is as follows. 1. Convection (including lightning, tornado, thunderstorm, wind, and hail) 2. Extreme temperature (including hot and cold) 3. Flood (including flood, flash flood, rain) 4. Winter (including snow, ice, freeze, or winter weather) 5. Other That is, the example event type of WINTER STORM/HIGH WIND would only be considered as Convection category.
data$eventCategory <- ifelse(data$eventConvection, "Convection", ifelse(data$eventExtremeTemp,
"ExtremeTemp", ifelse(data$eventFlood, "Flood", ifelse(data$eventWinter,
"Winter", ifelse(data$eventOther, "Others", NA)))))
table(data$eventCategory)
##
## Convection ExtremeTemp Flood Others Winter
## 724519 3515 92651 21616 40885
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The date ranges for each category are listed below. Filter the data to include records including all considered event categories.
d1 <- aggregate(beginDate ~ eventCategory, data, min)
d2 <- aggregate(beginDate ~ eventCategory, data, max)
cbind(d1, end_BeginDate = d2$beginDate)
## eventCategory beginDate end_BeginDate
## 1 Convection 1950-01-03 2011-11-30
## 2 ExtremeTemp 1993-01-04 2011-11-15
## 3 Flood 1993-01-01 2011-11-30
## 4 Others 1993-01-01 2011-11-30
## 5 Winter 1993-01-01 2011-11-30
# boxplot(log(propertyDamage)~eventCategory,
# data=data[data$propertyDamage>0,])
library(lubridate)
minYear <- max(year(d1$beginDate))
maxYear <- max(year(d2$beginDate))
data <- data[year(data$beginDate) >= minYear, ]
labels <- c("Convection", "Extreme temperature", "Flood", "Winter", "Other")
d_Fatalities <- data[data$FATALITIES > 0, ]
tab_Fatalities <- aggregate(FATALITIES ~ eventCategory, d_Fatalities, sum)
maxFcategory <- max(tab_Fatalities$FATALITIES)
maxFEvtCategory <- labels[which(tab_Fatalities$FATALITIES == maxFcategory)]
library(ggplot2)
ggplot(data = tab_Fatalities, aes(x = eventCategory, y = FATALITIES, fill = FATALITIES)) +
geom_bar(stat = "identity")
The most harmful event category is Convection with the total dealth of 3592 between 1993 and 2011.
Here list the top 5 harmful events with largest fatalities in US.
data[order(data$FATALITIES, decreasing = T), ][1:5, c("STATE", "EVTYPE", "beginDate",
"eventCategory", "FATALITIES")]
## STATE EVTYPE beginDate eventCategory FATALITIES
## 198704 IL HEAT 1995-07-12 ExtremeTemp 583
## 862634 MO TORNADO 2011-05-22 Convection 158
## 355128 IL EXCESSIVE HEAT 1999-07-28 ExtremeTemp 99
## 371112 PA EXCESSIVE HEAT 1999-07-04 ExtremeTemp 74
## 230927 PA EXCESSIVE HEAT 1995-07-01 ExtremeTemp 67
d_Injuries <- data[data$INJURIES > 0, ]
tab_Injuries <- aggregate(INJURIES ~ eventCategory, d_Injuries, sum)
maxFcategory <- max(tab_Injuries$INJURIES)
maxFEvtCategory <- labels[which(tab_Injuries$INJURIES == maxFcategory)]
ggplot(data = tab_Injuries, aes(x = eventCategory, y = INJURIES, fill = INJURIES)) +
geom_bar(stat = "identity")
The most harmful event category is Convection with the total injuries of 3.7648 × 104 between 1993 and 2011.
Here list the top 5 harmful events with largest injuries in US.
data[order(data$INJURIES, decreasing = T), ][1:5, c("STATE", "EVTYPE", "beginDate",
"eventCategory", "INJURIES")]
## STATE EVTYPE beginDate eventCategory INJURIES
## 223449 OH ICE STORM 1994-02-08 Winter 1568
## 862634 MO TORNADO 2011-05-22 Convection 1150
## 344159 TX FLOOD 1998-10-17 Flood 800
## 860386 AL TORNADO 2011-04-27 Convection 800
## 529351 FL HURRICANE/TYPHOON 2004-08-13 Others 780
d_DMG <- data[data$propertyDamage > 0, ]
tab_DMG <- aggregate(propertyDamage ~ eventCategory, d_DMG, sum)
maxFcategory <- max(tab_DMG$propertyDamage)
maxFEvtCategory <- labels[which(tab_DMG$propertyDamage == maxFcategory)]
maxFcategory <- maxFcategory/1e+09
ggplot(data = tab_DMG, aes(x = eventCategory, y = propertyDamage, fill = propertyDamage)) +
geom_bar(stat = "identity")
The most property damaged event category is Flood with the total property damage of $170.3601B between 1993 and 2011.
Here list the top 5 property damaged event in US.
# data[which(data$propertyDamage==max(data$propertyDamage)),c('STATE','EVTYPE','beginDate','eventCategory','propertyDamage')]
data[order(data$propertyDamage, decreasing = T), ][1:5, c("STATE", "EVTYPE",
"beginDate", "eventCategory", "propertyDamage")]
## STATE EVTYPE beginDate eventCategory propertyDamage
## 605953 CA FLOOD 2006-01-01 Flood 1.150e+11
## 577676 LA STORM SURGE 2005-08-29 Others 3.130e+10
## 577675 LA HURRICANE/TYPHOON 2005-08-28 Others 1.693e+10
## 581535 MS STORM SURGE 2005-08-29 Others 1.126e+10
## 569308 FL HURRICANE/TYPHOON 2005-10-24 Others 1.000e+10