Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
This report consists of analysis of the NOAA Storm Data, aiming at answering the questions how the weather events are related to threaten to public health and damage to properties and crops.
The data under study is the NOAA Storm Data, where a detailed documentation of the dataset is given as Storm Data Documentation. We first download the dataset directly from the data website and read the data into storm:
## download, unzip and load data set
url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists("StormData.csv.bz2")) {
download.file(url, destfile = "StormData.csv.bz2")
}
storm <- read.csv("StormData.csv.bz2", stringsAsFactors = FALSE)
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
From the output, we see the dataset consists of 37 features with 902297 observations. Most of the features will not be relevant for our analysis. The main features that we are interested in include:
EVTYPE: Severe weather typeFATALITIES: Number of individuals killed in the event of hazzardINJURIES: Number of individuals injuried during the eventPROPDMG + PROPDMGEXP: Property damages and the corresponding unitCROPDMG + CROPDMGEXP: Crop damages and the unitThe raw dataset is extremely untidy, such as existence of missing data, non-unified terminology of weather type. Thus a further data processing is needed. Here, we split our data analysis with respect to public health and economic consequence, by creating two separate clean dataset health and economy.
First, we subset the INJURIES and FATALITIES data from the storm dataset
library(dplyr)
health <- select(storm, EVTYPE, FATALITIES, INJURIES) %>%
mutate_each(funs(toupper), EVTYPE)
str(health)
## 'data.frame': 902297 obs. of 3 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
Further, we group the EVTYPE by names and take summation of INJURIES and FATALITIES data within each group. Another column HARMS is added to the dataset, which gives the total number of individuals affected in the event. The dataset is displaced in descending order of HARMS.
## clean `health` grouping elements in `EVTYPE`
healthclean <- aggregate(. ~ EVTYPE, health, FUN = sum) %>%
filter(FATALITIES > 0 | INJURIES > 0) %>%
mutate(HARMS = INJURIES + FATALITIES) %>%
arrange(desc(HARMS))
healthclean[1:25,]
## EVTYPE FATALITIES INJURIES HARMS
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
## 11 HIGH WIND 248 1137 1385
## 12 HAIL 15 1361 1376
## 13 HURRICANE/TYPHOON 64 1275 1339
## 14 HEAVY SNOW 127 1021 1148
## 15 WILDFIRE 75 911 986
## 16 THUNDERSTORM WINDS 64 908 972
## 17 BLIZZARD 101 805 906
## 18 FOG 62 734 796
## 19 RIP CURRENT 368 232 600
## 20 WILD/FOREST FIRE 12 545 557
## 21 HEAT WAVE 172 379 551
## 22 RIP CURRENTS 204 297 501
## 23 DUST STORM 22 440 462
## 24 WINTER WEATHER 33 398 431
## 25 TROPICAL STORM 58 340 398
Now, the dataset healthclean contains only 205 observations, dramatically reduced. The first few lines of the dataset indicate that EVTYPE name is still not tidy. For example,
THUNDERSTORM WIND and THUNDERSTORM WINDS differ by only an STSTM WIND and THUNDERSTORM WIND is an example disnormal abbreviationHURRICANE/TYPHOON is simply a concatenate of HURRICANE and TYPHOON, both have similar meaningCOLD\WIND CHILL is a combination of two weather events COLD and WIND CHILLSince the ranked public health data is decaying exponentially fast and we care only the most severe weathers. We can put it by hand the names of the top 20 EVTYPE as
ev_liv <- c("TORNADO" , "EXCESSIVE HEAT" ,
"THUNDERSTORM WIND", "FLOOD" ,
"LIGHTNING" , "HEAT" ,
"FLASH FLOOD" , "ICE STORM" ,
"WINTER STORM" , "HIGH WIND" ,
"HAIL" , "HURRICANE/TYPHOON" ,
"HEAVY SNOW" , "WILDFIRE" ,
"BLIZZARD" , "FOG" ,
"RIP CURRENT" , "DUST STORM" ,
"WINTER WEATHER" , "TROPICAL STORM" )
ev_liv_rep <- c("TORNADO" , "EXCESSIVE HEAT" ,
"THUNDERSTORM WIND|TSTM WIND" ,
"^FLOOD|RIVER FLOOD|/FLOOD" ,
"LIGHTNING" ,
"^HEAT" , "FLASH FLOOD" ,
"ICE STORM" , "WINTER STORM" ,
"HIGH WIND" , "HAIL" ,
"HURRICANE/TYPHOON|HURRICANE|TYPHOON" ,
"HEAVY SNOW" ,
"WILDFIRE|WILD/FOREST FIRE" ,
"BLIZZARD" , "FOG" ,
"RIP CURRENT" , "DUST STORM" ,
"WINTER WEATHER", "TROPICAL STORM" )
Once with the standard names defined, one can further clean the EVTYPE data
for (i in seq_along(ev_liv)) {
ind <- grep(ev_liv_rep[i], healthclean$EVTYPE)
healthclean$EVTYPE[ind] <- ev_liv[i]
}
healthclean <- aggregate(. ~ EVTYPE, healthclean, FUN = sum) %>%
arrange(desc(HARMS))
healthclean[1:20,]
## EVTYPE FATALITIES INJURIES HARMS
## 1 TORNADO 5661 91407 97068
## 2 THUNDERSTORM WIND 728 9493 10221
## 3 EXCESSIVE HEAT 1922 6525 8447
## 4 FLOOD 518 6809 7327
## 5 LIGHTNING 817 5232 6049
## 6 HEAT 1118 2494 3612
## 7 FLASH FLOOD 999 1787 2786
## 8 ICE STORM 89 1990 2079
## 9 HIGH WIND 298 1508 1806
## 10 WINTER STORM 217 1353 1570
## 11 WILDFIRE 87 1456 1543
## 12 HURRICANE/TYPHOON 133 1333 1466
## 13 HAIL 15 1371 1386
## 14 HEAVY SNOW 127 1034 1161
## 15 FOG 81 1077 1158
## 16 RIP CURRENT 577 529 1106
## 17 BLIZZARD 101 805 906
## 18 WINTER WEATHER 61 538 599
## 19 DUST STORM 22 440 462
## 20 TROPICAL STORM 66 383 449
Similar to the case with public health data, first, we subset the data relevant to economic consequence.
economy <- select(storm, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
str(economy)
## 'data.frame': 902297 obs. of 5 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
The features PROPDMGEXP and CROPDMGEXP are composed of character variables, which indicate thousand, million or billion etc (in unit of US dollars).
unique(economy$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(economy$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
Here, anomalous entries, e.g. "+", "-", "?", "" is void of real meaning and thus should be omitted. Characters with upper and lower cases, e.g. "H", "h", give the same unit. It is more convinient to deal with only one case, thus all the characters will be transfered into upper case.
economy <- filter(economy,
!(PROPDMGEXP %in% c("+", "-", "?", "") | CROPDMGEXP %in% c("?", "")) &
(PROPDMG > 0 | CROPDMG > 0)) %>%
mutate_each(funs(toupper), EVTYPE, PROPDMGEXP, CROPDMGEXP)
unique(economy$PROPDMGEXP)
## [1] "B" "M" "K" "5" "0" "3"
unique(economy$CROPDMGEXP)
## [1] "M" "K" "0" "B"
Now the data in PROPDMGEXP and CROPDMGEXP becomes tidy. The next step is to combine PROPDMG + PROPDMGEXP (CROPDMG + CROPDMGEXP) to give damage in units of billion dollars.
economy$PROPDMGEXP[grep("K", economy$PROPDMGEXP)] <- 3
economy$CROPDMGEXP[grep("K", economy$CROPDMGEXP)] <- 3
economy$PROPDMGEXP[grep("M", economy$PROPDMGEXP)] <- 6
economy$CROPDMGEXP[grep("M", economy$CROPDMGEXP)] <- 6
economy$PROPDMGEXP[grep("B", economy$PROPDMGEXP)] <- 9
economy$CROPDMGEXP[grep("B", economy$CROPDMGEXP)] <- 9
economy <- mutate(economy,
PROPDMG = PROPDMG * 10 ^ (as.numeric(PROPDMGEXP) - 9),
CROPDMG = CROPDMG * 10 ^ (as.numeric(CROPDMGEXP) - 9)) %>%
select(EVTYPE, PROPDMG, CROPDMG)
Given PROPDMG and CROPDMG is already tidy, the next step is to aggregate the dataset according to the weather types in EVTYPE.
economyclean <- aggregate(. ~ EVTYPE, economy, FUN = sum)
The economyclean dataset contains 120 observations. Let us print the top 25 observations when arranging the dataset according to property and crop damage in descending order separately in order to determine names of the EVTYPE
economyclean <- arrange(economyclean, desc(PROPDMG))
economyclean[1:25, c("EVTYPE", "PROPDMG")]
## EVTYPE PROPDMG
## 1 FLOOD 132.8364891
## 2 HURRICANE/TYPHOON 26.7402950
## 3 TORNADO 16.1669467
## 4 HURRICANE 9.7163580
## 5 HAIL 7.9947887
## 6 FLASH FLOOD 7.3284961
## 7 RIVER FLOOD 5.0796350
## 8 STORM SURGE/TIDE 4.6406430
## 9 WILDFIRE 3.4983655
## 10 THUNDERSTORM WIND 3.3989424
## 11 HIGH WIND 2.4257423
## 12 HURRICANE OPAL 2.1680000
## 13 TORNADOES, TSTM WIND, HAIL 1.6000000
## 14 TROPICAL STORM 1.0565913
## 15 WINTER STORM 1.0178442
## 16 ICE STORM 0.9030374
## 17 TSTM WIND 0.6583057
## 18 LIGHTNING 0.3152740
## 19 HEAVY RAIN 0.3107019
## 20 THUNDERSTORM WINDS 0.2801900
## 21 FLASH FLOOD/FLOOD 0.2710000
## 22 HURRICANE ERIN 0.2560000
## 23 DROUGHT 0.2337210
## 24 HEAVY SNOW 0.1782920
## 25 COASTAL FLOOD 0.1675806
economyclean <- arrange(economyclean, desc(CROPDMG))
economyclean[1:25, c("EVTYPE", "CROPDMG")]
## EVTYPE CROPDMG
## 1 FLOOD 5.1709555
## 2 RIVER FLOOD 5.0287340
## 3 ICE STORM 5.0221135
## 4 HURRICANE 2.6889100
## 5 HURRICANE/TYPHOON 2.6078728
## 6 HAIL 2.0538079
## 7 DROUGHT 1.6526960
## 8 FLASH FLOOD 1.3880291
## 9 FROST/FREEZE 0.9319010
## 10 HIGH WIND 0.6319243
## 11 TSTM WIND 0.4972844
## 12 EXCESSIVE HEAT 0.4924000
## 13 TROPICAL STORM 0.4517110
## 14 THUNDERSTORM WIND 0.4147055
## 15 TORNADO 0.4033796
## 16 THUNDERSTORM WINDS 0.1861133
## 17 WILDFIRE 0.1861029
## 18 HEAVY SNOW 0.1316431
## 19 BLIZZARD 0.1120600
## 20 WILD/FOREST FIRE 0.0987072
## 21 FLOOD/FLASH FLOOD 0.0950340
## 22 STRONG WIND 0.0649485
## 23 HEAVY RAIN 0.0645858
## 24 HEAVY RAINS 0.0605000
## 25 SEVERE THUNDERSTORM WINDS 0.0290000
Combine results shown in both prints above, the names chosen to study in EVTYPE is
ev_eco <- c("FLOOD" , "HURRICANE/TYPHOON" ,
"TORNADO" , "HAIL" ,
"FLASH FLOOD" , "STORM SURGE/TIDE" ,
"WILDFIRE" , "THUNDERSTORM WIND" ,
"HIGH WIND" , "TROPICAL STORM" ,
"WINTER STORM" , "ICE STORM" ,
"LIGHTNING" , "HEAVY RAIN" ,
"DROUGHT" , "HEAVY SNOW" ,
"COASTAL FLOOD" , "FROST/FREEZE" ,
"EXCESSIVE HEAT", "STRONG WIND" )
ev_eco_rep <- c("^FLOOD|RIVER FLOOD|/FLOOD" ,
"HURRICANE/TYPHOON|HURRICANE|TYPHOON" ,
"TORNADO" , "HAIL" ,
"FLASH FLOOD" ,
"STORM SURGE/TIDE|STORM TIDE" ,
"WILDFIRE|WILD/FOREST FIRE" ,
"THUNDERSTORM WIND|TSTM WIND" ,
"HIGH WIND" , "TROPICAL STORM" ,
"WINTER STORM" , "ICE STORM" ,
"LIGHTNING" , "HEAVY RAIN" ,
"DROUGHT" , "HEAVY SNOW" ,
"COASTAL FLOOD" ,
"FROST/FREEZE|FROST|FREEZE" ,
"EXCESSIVE HEAT", "STRONG WIND" )
Given the names of EVTYPE defined for economy dataset, further clean EVTYPE data
for (i in seq_along(ev_eco)) {
ind <- grep(ev_eco_rep[i], economyclean$EVTYPE)
economyclean$EVTYPE[ind] <- ev_eco[i]
}
economyclean <- aggregate(. ~ EVTYPE, economyclean, FUN = sum) %>%
arrange(desc(PROPDMG))
economyclean[1:20,]
## EVTYPE PROPDMG CROPDMG
## 1 FLOOD 138.4239991 10.33268695
## 2 HURRICANE/TYPHOON 38.9968830 5.33311780
## 3 TORNADO 17.7669570 0.40588687
## 4 HAIL 7.9997102 2.07883995
## 5 FLASH FLOOD 7.3895281 1.39813510
## 6 STORM SURGE/TIDE 4.6406430 0.00085000
## 7 THUNDERSTORM WIND 4.3397600 1.12724733
## 8 WILDFIRE 3.5477275 0.28532210
## 9 HIGH WIND 2.6982874 0.66415485
## 10 TROPICAL STORM 1.0620913 0.46826100
## 11 WINTER STORM 1.0183442 0.02422400
## 12 ICE STORM 0.9030374 5.02211350
## 13 HEAVY RAIN 0.3352019 0.12658580
## 14 LIGHTNING 0.3152740 0.00551215
## 15 DROUGHT 0.2339210 1.65274600
## 16 COASTAL FLOOD 0.1928806 0.00005600
## 17 HEAVY SNOW 0.1782920 0.13164310
## 18 LANDSLIDE 0.1503035 0.02001700
## 19 TSUNAMI 0.1440620 0.00002000
## 20 STRONG WIND 0.1196554 0.06494850
In the following, The results of ranking for both injuries and fatalities will be shown in separate tables
arrange(healthclean, desc(INJURIES))[1:10, c("EVTYPE", "INJURIES")]
## EVTYPE INJURIES
## 1 TORNADO 91407
## 2 THUNDERSTORM WIND 9493
## 3 FLOOD 6809
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5232
## 6 HEAT 2494
## 7 ICE STORM 1990
## 8 FLASH FLOOD 1787
## 9 HIGH WIND 1508
## 10 WILDFIRE 1456
arrange(healthclean, desc(FATALITIES))[1:10, c("EVTYPE", "FATALITIES")]
## EVTYPE FATALITIES
## 1 TORNADO 5661
## 2 EXCESSIVE HEAT 1922
## 3 HEAT 1118
## 4 FLASH FLOOD 999
## 5 LIGHTNING 817
## 6 THUNDERSTORM WIND 728
## 7 RIP CURRENT 577
## 8 FLOOD 518
## 9 HIGH WIND 298
## 10 AVALANCHE 224
The following shows the harm to public health from the severe weather events, combining the injury and fatality data (top 10).
library(ggplot2)
library(reshape)
library(scales)
# reorder `healthclean` leading to ordered dataset after melting
healthclean <- transform(healthclean, EVTYPE = reorder(EVTYPE, desc(HARMS)))
# subset and melt to generate suitable data for barplot
counts <- healthclean[1:10, c("EVTYPE", "FATALITIES", "INJURIES")]
counts <- melt(counts, id.vars = "EVTYPE")
# ggplot `identity` required, x ticks set to vertical
g <- ggplot(counts, aes(EVTYPE, value, fill = variable))
g + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "", y = "Individuals Affected",
title = "Weather Events Affecting Public Health")
There are several interesting conclusions obtained from the analysis:
TORNADO, which leads to almost one million injuries and six thousand death over the years 1950 - 2011.HEAT or EXCESSIVE HEAT, instead of other events, such as FLOOD or THUNDERSTORM.The following two graphs indicate property and crop damages caused by severe weather events respectively.
par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.8)
with(arrange(economyclean, desc(PROPDMG))[1:10,],
barplot(PROPDMG, las = 3, names.arg = EVTYPE,
main = "Weather Events: Property Damages",
ylab = "Cost of damages (billions usd)", col = "blue"))
with(arrange(economyclean, desc(CROPDMG))[1:10,],
barplot(CROPDMG, las = 3, names.arg = EVTYPE,
main = "Weather Events: Crop Damages",
ylab = "Cost of damages (billions usd)", col = "red"))
As a conclusion to this part of the analysis:
FLOOD, followed by HURRICANE/TYPHOON.TORNADO, which leads to significant damage to properties, does not seem to affect crop that much.ICE STORM or DROUGHT, though leads to crop damage, does not damage the properties.