In this report we explore the NOAA Storm database between the years 1950 and 2011. The aim is to find out which weather events have caused greatest impact on public health and economy. We analyzed the NOAA storm database in two aspects: events with consequences for the public health and events with consequences for the economy. To investigate most harmful events to public health we totaled harms throughout the years and cut top 20% most harmful events. It came out that TORNADO is exceeding by large degree in terms of magnitude the other harmful event - EXCESSIVE HEAT, HIGH WIND and FLOOD. The same approach was taken with the investigation of the event types with greatest impact on economy. It came out that FLOOD, HURRICANE (TYPHOON) and TORNADO are the top 3 weather phenomena in that respect.
library(dplyr)
library(readr)
library(stringdist)
library(ggplot2)
options(scipen = 20)
First we read the data from the repdatadataStormData.csv.bz2 file, downloaded from Peer-graded Assignment: Course Project 2 site.
StormData <- read_csv("repdatadataStormData.csv.bz2")
we check the first few rows (there are 902 297) rows in this dataset.
dim(StormData)
## [1] 902297 37
head(StormData[1:3, ])
## # A tibble: 3 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## # ... with 30 more variables: EVTYPE <chr>, BGN_RANGE <dbl>,
## # BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>,
## # COUNTY_END <dbl>, COUNTYENDN <chr>, END_RANGE <dbl>, END_AZI <chr>,
## # END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>,
## # FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>,
## # CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>,
## # ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>,
## # LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>
The columns that we analyze are: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. We transform PROPDMGEXP and CROPDMGEXP to factor variables as they denote a multiplier of PROPDMG and CROPDMG values.
DF_StormData <- select(StormData,
EVTYPE,
INJURIES,
FATALITIES,
PROPDMG,
PROPDMGEXP,
CROPDMG,
CROPDMGEXP) %>%
mutate(EVTYPE = toupper((StormData$EVTYPE)),
PROPDMGEXP = as.factor(StormData$PROPDMGEXP),
CROPDMGEXP = as.factor(StormData$CROPDMGEXP))
head(DF_StormData)
## # A tibble: 6 x 7
## EVTYPE INJURIES FATALITIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## <chr> <dbl> <dbl> <dbl> <fctr> <dbl> <fctr>
## 1 TORNADO 15 0 25.0 K 0 NA
## 2 TORNADO 0 0 2.5 K 0 NA
## 3 TORNADO 2 0 25.0 K 0 NA
## 4 TORNADO 2 0 2.5 K 0 NA
## 5 TORNADO 2 0 2.5 K 0 NA
## 6 TORNADO 6 0 2.5 K 0 NA
In order to calculate correctly the amounts in PROPDMG and CROPDMG columns we have to take into account the values in PROPDMGEXP and CROPDMGEXP columns. They indicate whether we have to multiply the value by 100, 1 000, 1 000 0000 etc. For this purpose we define a function which sets the correct multiplier.
getExp <- function (multiplier){
if (is.na(multiplier)){
m <- 0
}
else if (any(multiplier == "H" | multiplier == "h")){
m <- 100
}
else if (any(multiplier == "K" | multiplier == "k")){
m <- 1000
}
else if (any(multiplier == "M" | multiplier == "m")){
m <- 1000000
}
else if (any(multiplier == "B" | multiplier == "b")){
m <- 1000000000
}
else{
m <- 0
}
m
}
According to the NOAA documentation there are 48 official event types. We load these types and transform to uppercase.
EVT_OFFIC <- toupper(c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane (Typhoon)","Ice Storm","Lake-Effect Snow","Lakeshore Flood","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather"))
EVT_OFFIC
## [1] "ASTRONOMICAL LOW TIDE" "AVALANCHE"
## [3] "BLIZZARD" "COASTAL FLOOD"
## [5] "COLD/WIND CHILL" "DEBRIS FLOW"
## [7] "DENSE FOG" "DENSE SMOKE"
## [9] "DROUGHT" "DUST DEVIL"
## [11] "DUST STORM" "EXCESSIVE HEAT"
## [13] "EXTREME COLD/WIND CHILL" "FLASH FLOOD"
## [15] "FLOOD" "FROST/FREEZE"
## [17] "FUNNEL CLOUD" "FREEZING FOG"
## [19] "HAIL" "HEAT"
## [21] "HEAVY RAIN" "HEAVY SNOW"
## [23] "HIGH SURF" "HIGH WIND"
## [25] "HURRICANE (TYPHOON)" "ICE STORM"
## [27] "LAKE-EFFECT SNOW" "LAKESHORE FLOOD"
## [29] "LIGHTNING" "MARINE HAIL"
## [31] "MARINE HIGH WIND" "MARINE STRONG WIND"
## [33] "MARINE THUNDERSTORM WIND" "RIP CURRENT"
## [35] "SEICHE" "SLEET"
## [37] "STORM SURGE/TIDE" "STRONG WIND"
## [39] "THUNDERSTORM WIND" "TORNADO"
## [41] "TROPICAL DEPRESSION" "TROPICAL STORM"
## [43] "TSUNAMI" "VOLCANIC ASH"
## [45] "WATERSPOUT" "WILDFIRE"
## [47] "WINTER STORM" "WINTER WEATHER"
In our database we check and find out that there are 890 unique present event types after we transform to upper case.
EVT_CURRENT <- unique(toupper(StormData$EVTYPE))
EVT_CURRENT[1:8]
## [1] "TORNADO" "TSTM WIND" "HAIL"
## [4] "FREEZING RAIN" "SNOW" "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE" "WINTER STORM"
In order to correctly match the official and present event types we create a distribution matrix, which analyzes how close is given present type to official one.
df_stringdist <- stringdistmatrix(EVT_CURRENT, EVT_OFFIC)
df_stringdist[1:3, ]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 16 8 7 10 13 9 8 10 6 9 9 14 20
## [2,] 15 9 8 9 11 10 9 10 9 8 8 12 16
## [3,] 19 8 7 11 13 9 9 11 7 8 10 13 21
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] 10 6 10 10 10 6 6 9 9 9 9
## [2,] 9 8 10 10 11 8 9 8 9 8 4
## [3,] 9 5 12 11 11 0 3 7 8 8 7
## [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
## [1,] 16 8 15 12 8 8 14 15 20 10 7
## [2,] 17 9 15 12 7 10 11 11 15 10 8
## [3,] 17 9 15 13 7 7 14 16 22 11 5
## [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## [1,] 7 12 8 14 0 14 10 4 9 8 8
## [2,] 8 12 5 8 8 16 12 6 11 9 8
## [3,] 5 15 10 15 6 17 12 6 10 9 7
## [,47] [,48]
## [1,] 9 11
## [2,] 10 11
## [3,] 12 13
The lesser the number, the closer is a present event type to an official one. The number denotes the number of edits it takes for the two strings to become the same.
After that we create Dictionary data frame, which pairs types that are closest to each other.
makeDictionary <- function(matrix, official, present){
df <- data.frame(Current = as.character(),
Official = as.character(),
Distance = as.numeric(),
stringsAsFactors = FALSE)
for(i in 1:length(present)){
index <- match(min(matrix[i, ]), matrix[i, ])
newEntry <- data.frame(Current = present[i],
Official = official[index],
Distance = min(matrix[i, ]))
df <- rbind(df, newEntry)
}
df
}
df_dictionary <- makeDictionary(df_stringdist, EVT_OFFIC, EVT_CURRENT)
head(df_dictionary)
## Current Official Distance
## 1 TORNADO TORNADO 0
## 2 TSTM WIND HIGH WIND 4
## 3 HAIL HAIL 0
## 4 FREEZING RAIN FREEZING FOG 4
## 5 SNOW FLOOD 4
## 6 ICE STORM/FLASH FLOOD FLASH FLOOD 10
We then construct a function to return the closest official event type for each present event type according to the dictionary.
getOfficialEvent <- function(curr_event, dict){
offic_event <- dict[dict$Current == curr_event, 2]
offic_event
}
After we have correct matching dictionary and multiplier, we finalize the data frame and calculate the totals. We take the totals as an indicator for harm and impact on economy.
DF_StormData_Totals <- select(DF_StormData,
EVTYPE,
PROPDMG,
PROPDMGEXP,
CROPDMG,
CROPDMGEXP,
INJURIES,
FATALITIES) %>%
mutate(PROPDMGEXP = sapply(PROPDMGEXP, getExp),
CROPDMGEXP = sapply(CROPDMGEXP, getExp),
EVTYPE = sapply(EVTYPE, getOfficialEvent, df_dictionary)) %>%
transmute(INJURIES,
FATALITIES,
EVTYPE = as.factor(EVTYPE),
PROPDMG = PROPDMG * PROPDMGEXP,
CROPDMG = CROPDMG * CROPDMGEXP) %>%
mutate(TOTECONDMG = PROPDMG + CROPDMG,
TOTHARMS = INJURIES + FATALITIES)
DF_StormData_SumHarms <- group_by(DF_StormData_Totals,
EVTYPE) %>%
summarise(TOTHARMS = sum(TOTHARMS)) %>%
filter(TOTHARMS > quantile(TOTHARMS, 0.8)) %>%
arrange(desc(TOTHARMS))
g <- ggplot(DF_StormData_SumHarms, aes(x = EVTYPE, y = TOTHARMS/10^3, fill = EVTYPE))
g <- g + geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=30,hjust=1,vjust=1))
g + labs(title = "Total health harm between the years 1950 and 2011") + labs(x = "") + labs(y = "$ Thousands") + scale_y_continuous(labels=function(y) format(y, big.mark = " "))
In conclusion the TOP 3 weather phenomena with consequences for public health:
DF_StormData_SumHarms[1:3, ]
## # A tibble: 3 x 2
## EVTYPE TOTHARMS
## <fctr> <dbl>
## 1 TORNADO 96997
## 2 HIGH WIND 9286
## 3 EXCESSIVE HEAT 8723
DF_StormData_SumEconDmg <- group_by(DF_StormData_Totals,
EVTYPE) %>%
summarise(TOTECONDMG = sum(TOTECONDMG)) %>%
filter(TOTECONDMG > quantile(TOTECONDMG, 0.8)) %>%
arrange(desc(TOTECONDMG))
g <- ggplot(DF_StormData_SumEconDmg, aes(x = EVTYPE, y = TOTECONDMG/10^6, fill = EVTYPE))
g <- g + geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=30,hjust=1,vjust=1))
g + labs(title = "Total US dollars between the years 1950 and 2011") + labs(x = "") + labs(y = "$ Millions") + scale_y_continuous(labels=function(y) format(y, big.mark = " "))
In conclusion the TOP 3 weather phenomena with consequences for economy:
DF_StormData_SumEconDmg[1:3, ]
## # A tibble: 3 x 2
## EVTYPE TOTECONDMG
## <fctr> <dbl>
## 1 FLOOD 151264401300
## 2 HURRICANE (TYPHOON) 75501243800
## 3 TORNADO 57356997190