This study focused on identifying which weather events have the greatest impact on 1.population health 2.property damage 3.crop damage. The data was obtained from the NOAA Storm Database which recorded injuries, fatalities, property and crop damage values resulting from a set of categorized weather events. We looked at data starting from the year 1996 till Nov 2011.
The following libraries were first included.
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)
The data was read from a csv file and the relevant columns were selected.
myData <- read.csv("repdata%2Fdata%2FStormData.csv",
header = TRUE,
stringsAsFactors = FALSE)
myDataSel <- select(myData, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG,
PROPDMGEXP, CROPDMG, CROPDMGEXP)
Filtering and cleanup of the data was done in the following sequence:
Taking only data from Jan 1996 onward.
Based on a pinned discussion forum post by Usama A.F. Khalil (Mentor), all weather event types were only recorded from Jan 1996 onward.
As the time-resolution of the analysis was coarse, no accounting for the various timezones in the data was done. All were simply assumed to be a single time zone.
Only begin dates were used.
myDataSel$BGN_DATE <- mdy_hms(myDataSel$BGN_DATE)
myDataSel <- filter(myDataSel,
as.numeric(BGN_DATE - mdy_hms("01/01/1996 00:00:00")) > 0)
Removing extra whitespaces and setting characters to uppercase.
Any repetition of a whitespace was reduced to a single character.
Trailing and preceding whitespaces were removed.
All alphabetic characters were set to uppercase.
myDataSel$EVTYPE <- gsub(pattern = "( )+", replacement = " ", myDataSel$EVTYPE) %>%
str_trim(side = "both") %>%
toupper()
Manual reclassification of certain event types.
myDataSel$EVTYPE[myDataSel$EVTYPE == "COASTAL FLOODING"] <- "COASTAL FLOOD"
myDataSel$EVTYPE[myDataSel$EVTYPE == "COLD"] <- "COLD/WIND CHILL"
myDataSel$EVTYPE[myDataSel$EVTYPE == "EXTREME WINDCHILL" | myDataSel$EVTYPE == "EXTREME COLD"] <- "EXTREME COLD/WIND CHILL"
myDataSel$EVTYPE[myDataSel$EVTYPE == "FROST" | myDataSel$EVTYPE == "FREEZE"] <- "FROST/FREEZE"
myDataSel$EVTYPE[myDataSel$EVTYPE == "HEAVY SURF" | myDataSel$EVTYPE == "HEAVY SURF/HIGH SURF"] <- "HIGH SURF"
myDataSel$EVTYPE[myDataSel$EVTYPE == "HURRICANE"] <- "HURRICANE/TYPHOON"
myDataSel$EVTYPE[myDataSel$EVTYPE == "MARINE TSTM WIND"] <- "MARINE THUNDERSTORM WIND"
myDataSel$EVTYPE[myDataSel$EVTYPE == "RIP CURRENTS"] <- "RIP CURRENT"
myDataSel$EVTYPE[myDataSel$EVTYPE == "STORM SURGE"] <- "STORM SURGE/TIDE"
myDataSel$EVTYPE[myDataSel$EVTYPE == "STRONG WINDS"] <- "STRONG WIND"
myDataSel$EVTYPE[myDataSel$EVTYPE == "TSTM WIND" | myDataSel$EVTYPE == "TSTM WIND (G45)"] <- "THUNDERSTORM WIND"
myDataSel$EVTYPE[myDataSel$EVTYPE == "WINTER WEATHER" | myDataSel$EVTYPE == "WINTRY MIX"] <- "WINTER WEATHER/MIX"
ind <- myDataSel$EVTYPE == "OTHER" | myDataSel$EVTYPE == "MONTHLY PRECIPITATION"
myDataSel <- myDataSel[!ind, ]
Setting a minimum threshold of event type appearances.
thresholdEVTypeCount <- 30
uniqueEVTypeCount <- table(myDataSel$EVTYPE)
evTypeSel <- names(uniqueEVTypeCount[uniqueEVTypeCount > thresholdEVTypeCount])
myDataSel <- filter(myDataSel, EVTYPE %in% evTypeSel)
At this stage, there were the following event types remaining:
sort(unique(myDataSel$EVTYPE))
## [1] "ASTRONOMICAL HIGH TIDE" "ASTRONOMICAL LOW TIDE"
## [3] "AVALANCHE" "BLIZZARD"
## [5] "COASTAL FLOOD" "COLD/WIND CHILL"
## [7] "DENSE FOG" "DROUGHT"
## [9] "DRY MICROBURST" "DUST DEVIL"
## [11] "DUST STORM" "EXCESSIVE HEAT"
## [13] "EXTREME COLD/WIND CHILL" "FLASH FLOOD"
## [15] "FLOOD" "FOG"
## [17] "FREEZING FOG" "FREEZING RAIN"
## [19] "FROST/FREEZE" "FUNNEL CLOUD"
## [21] "GLAZE" "GUSTY WINDS"
## [23] "HAIL" "HEAT"
## [25] "HEAVY RAIN" "HEAVY SNOW"
## [27] "HIGH SURF" "HIGH WIND"
## [29] "HURRICANE/TYPHOON" "ICE STORM"
## [31] "LAKE-EFFECT SNOW" "LANDSLIDE"
## [33] "LIGHT SNOW" "LIGHTNING"
## [35] "MARINE HAIL" "MARINE HIGH WIND"
## [37] "MARINE STRONG WIND" "MARINE THUNDERSTORM WIND"
## [39] "MIXED PRECIPITATION" "MODERATE SNOWFALL"
## [41] "RECORD COLD" "RECORD HEAT"
## [43] "RECORD WARMTH" "RIP CURRENT"
## [45] "RIVER FLOOD" "SLEET"
## [47] "SMALL HAIL" "SNOW"
## [49] "STORM SURGE/TIDE" "STRONG WIND"
## [51] "TEMPERATURE RECORD" "THUNDERSTORM WIND"
## [53] "TORNADO" "TROPICAL DEPRESSION"
## [55] "TROPICAL STORM" "TSTM WIND/HAIL"
## [57] "UNSEASONABLY DRY" "UNSEASONABLY WARM"
## [59] "URBAN/SML STREAM FLD" "WATERSPOUT"
## [61] "WILD/FOREST FIRE" "WILDFIRE"
## [63] "WIND" "WINTER STORM"
## [65] "WINTER WEATHER/MIX"
We proceeded to transform the data into more useful values:
Convert the event types into factors.
Convert the property damage value currently in mantissa+exponent form into a single variable.
Convert the crop damage value currently in mantissa+exponent form into a single variable.
Add the fatalities and injuries together into a casulty count.
Obtain a year value from the begin date.
# Set the evtype as factor
myDataSel$EVTYPE <- as.factor(myDataSel$EVTYPE)
# Convert the exponents to numbers
propExpNum <- sapply(myDataSel$PROPDMGEXP, function(a) {
a <- toupper(a)
if(a == "B") {9}
else if(a == "M") {6}
else if(a == "K") {3}
else if(a == "H") {2}
else if(a == "") {0}
else {as.numeric(a)}
})
cropExpNum <- sapply(myDataSel$CROPDMGEXP, function(a) {
a <- toupper(a)
if(a == "B") {9}
else if(a == "M") {6}
else if(a == "K") {3}
else if(a == "H") {2}
else if(a == "") {0}
else {as.numeric(a)}
})
myDataSel <- mutate(myDataSel, year = year(BGN_DATE),
casulties = FATALITIES + INJURIES,
cropDmgTotal = CROPDMG * 10 ^ cropExpNum,
propDmgTotal = PROPDMG * 10 ^ propExpNum)
myDataSel$PROPDMG <- NULL
myDataSel$PROPDMGEXP <- NULL
myDataSel$CROPDMG <- NULL
myDataSel$CROPDMGEXP <- NULL
myDataSel$FATALITIES <- NULL
myDataSel$INJURIES <- NULL
The final structure of the data frame is
str(myDataSel)
## 'data.frame': 652202 obs. of 6 variables:
## $ BGN_DATE : POSIXct, format: "1996-01-06" "1996-01-11" ...
## $ EVTYPE : Factor w/ 65 levels "ASTRONOMICAL HIGH TIDE",..: 64 53 52 52 52 23 28 52 52 52 ...
## $ year : num 1996 1996 1996 1996 1996 ...
## $ casulties : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cropDmgTotal: Named num 38000 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr "K" "" "" "" ...
## $ propDmgTotal: Named num 380000 100000 3000 5000 2000 0 400000 12000 8000 12000 ...
## ..- attr(*, "names")= chr "K" "K" "K" "K" ...
and a cursory check for NAs show all is OK.
any(is.na(myDataSel))
## [1] FALSE
To determine which weather event types had the greatest metric values, we simply summed all the data of the same weather event type and did no averaging.
myDataSel <- group_by(myDataSel, EVTYPE)
totalCasulties <- sum(myDataSel$casulties)
totalPropDmg <- sum(myDataSel$propDmgTotal)
totalCropDmg <- sum(myDataSel$cropDmgTotal)
myDataSelSummary <- summarize(myDataSel, CASULTIES = sum(casulties),
CASULTY_PCT = sum(casulties) / totalCasulties * 100,
PROP_DMG = sum(propDmgTotal),
PROP_DMG_PCT = sum(propDmgTotal) / totalPropDmg * 100,
CROP_DMG = sum(cropDmgTotal),
CROP_DMG_PCT = sum(cropDmgTotal) / totalCropDmg * 100,
NUM_EVENTS = length(casulties))
The following is the top 10 event types with the greatest casulty count:
arrData <- arrange(select(myDataSelSummary, EVTYPE, CASULTIES, CASULTY_PCT, NUM_EVENTS), desc(CASULTIES))
head(arrData, n = 10)
## # A tibble: 10 × 4
## EVTYPE CASULTIES CASULTY_PCT NUM_EVENTS
## <fctr> <dbl> <dbl> <int>
## 1 TORNADO 22178 33.539001 23153
## 2 EXCESSIVE HEAT 8188 12.382421 1656
## 3 FLOOD 7172 10.845961 24248
## 4 THUNDERSTORM WIND 5403 8.170765 210112
## 5 LIGHTNING 4790 7.243747 13203
## 6 FLASH FLOOD 2561 3.872909 51000
## 7 WINTER STORM 1483 2.242688 11315
## 8 HEAT 1459 2.206394 716
## 9 HURRICANE/TYPHOON 1446 2.186734 258
## 10 HIGH WIND 1318 1.993165 19907
The Top 10 accounted for 84.68% of all the casulties in the selected dataset.
Focusing on the Top 5, a time series (by year) casulty-count plot was generated.
myDataSel2 <- filter(myDataSel, EVTYPE %in% arrData$EVTYPE[1 : 5]) %>%
group_by(EVTYPE, year) %>%
summarize(casultiesInYr = sum(casulties))
ggplot(myDataSel2, aes(year, casultiesInYr / 1e3, color = EVTYPE, shape = EVTYPE)) +
geom_point(size = 3) +
geom_line() +
labs(x = "Year", y = "Casulties (K)")
Top 5 dangers to population health
Tornadoes appear to be the biggest perennial danger to population health, though a spike in the last year drove the casulty count way up. We did not continue to further analyze if this spike was due to a single event.
Excessive Heat, the No. 2 danger was typically above the rest of the event types as well.
Floods, the No. 3 danger appeared to have been driven up the rankings by the big spike in 1998. It appears to typically fall below the rest of the Top 5.
The following is the Top 10 event types with the greatest property damage value:
arrData <- arrange(select(myDataSelSummary, EVTYPE, PROP_DMG, PROP_DMG_PCT, NUM_EVENTS), desc(PROP_DMG))
head(arrData, n = 10)
## # A tibble: 10 × 4
## EVTYPE PROP_DMG PROP_DMG_PCT NUM_EVENTS
## <fctr> <dbl> <dbl> <int>
## 1 FLOOD 143944833550 39.348014 24248
## 2 HURRICANE/TYPHOON 81118659010 22.174176 258
## 3 STORM SURGE/TIDE 47834724000 13.075852 401
## 4 TORNADO 24616905710 6.729150 23153
## 5 FLASH FLOOD 15222253910 4.161076 51000
## 6 HAIL 14595143420 3.989653 207714
## 7 THUNDERSTORM WIND 7869045380 2.151041 210112
## 8 TROPICAL STORM 7642475550 2.089108 682
## 9 HIGH WIND 5247860360 1.434528 19907
## 10 WILDFIRE 4758667000 1.300805 2732
The Top 10 accounted for 96.45% of all the property damage in the selected dataset.
Again, we look at a time series plot of the Top 5.
myDataSel2 <- filter(myDataSel, EVTYPE %in% arrData$EVTYPE[1 : 5]) %>%
group_by(EVTYPE, year) %>%
summarize(propDamageInYr = sum(propDmgTotal))
ggplot(myDataSel2, aes(year, propDamageInYr / 1e9, color = EVTYPE, shape = EVTYPE)) +
geom_point(size = 3) +
geom_line() +
labs(x = "Year", y = "Property Damage (B$)")
Top 5 causes of property damage
There’s clearly a small number of huge spikes here. Looking at Floods and comparing to the summary table, we can see that the single spike of about $120 billion is a major contributor to its total damage value of $144 billion. The same can also be said about Hurricane/Typoon, which had 2 major spikes, and Storm Surge/Tide (1 major spike).
The following is the Top 10 event types with the greatest crop damage value:
arrData <- arrange(select(myDataSelSummary, EVTYPE, CROP_DMG, CROP_DMG_PCT, NUM_EVENTS), desc(CROP_DMG))
head(arrData, n = 10)
## # A tibble: 10 × 4
## EVTYPE CROP_DMG CROP_DMG_PCT NUM_EVENTS
## <fctr> <dbl> <dbl> <int>
## 1 DROUGHT 13367566000 38.675857 2433
## 2 HURRICANE/TYPHOON 5349282800 15.476871 258
## 3 FLOOD 4974778400 14.393332 24248
## 4 HAIL 2476029450 7.163799 207714
## 5 FLASH FLOOD 1334901700 3.862219 51000
## 6 EXTREME COLD/WIND CHILL 1326023000 3.836531 1823
## 7 FROST/FREEZE 1250911000 3.619212 1458
## 8 THUNDERSTORM WIND 952246350 2.755097 210112
## 9 HEAVY RAIN 728169800 2.106785 11528
## 10 TROPICAL STORM 677711000 1.960795 682
The Top 10 accounted for 93.85% of all the crop damage in the selected dataset.
The time series plot of the Top 5:
myDataSel2 <- filter(myDataSel, EVTYPE %in% arrData$EVTYPE[1 : 5]) %>%
group_by(EVTYPE, year) %>%
summarize(cropDamageInYr = sum(cropDmgTotal))
ggplot(myDataSel2, aes(year, cropDamageInYr / 1e9, color = EVTYPE, shape = EVTYPE)) +
geom_point(size = 3) +
geom_line() +
labs(x = "Year", y = "Crop Damage (B$)")
Top 5 causes of crop damage
The crop damage plot has no major spikes that are at extreme distance from the “noise floor” of the plot. Drought is above the noise floor for many years (about 10) and clearly represents the most major perennial problem. Hurricane/Typhoon is significantly above the noise floor for only about 4 years.