This data analysis is the second peer assessment project of the Reproducible Research course (the 5th course in the Data Science: Foundations using R Specialization offered by Johns Hopkins University on Coursera).
Given the NOAA Storm Database and its documentation, the researcher set out to explore the data further. Two supplementary materials are provided: Explanation of Variables in the Database and Guide on What Each Value in PROPDMGEXP and CROPDMGEXP Columns Represent.
The analysis focuses on two main areas: the impact on public health (fatalities and injuries) and the economic consequences (damage to property and crops). Overall, the project aims to identify the most severe weather events in terms of public health and economic consequences.
The analysis is divided into 3 stages: 1. Loading and subsetting the data 2. Grouping the event types together 3. Calculating the total damages and plotting bar charts to visualize the results.
The analysis shows that TORNADO had the most devastating effect on public health, whilst FLOOD led to the greatest economic damage.
Below is the process of the data analysis from which the conclusion above is derived.
In trials that called Sys.time() before and after reading the data with different functions, fread() took the least time, so it is used to read the raw data into R. To avoid reloading the data every time the document is knitted, the argument cache = TRUE is added to the code chunk.
library(data.table)
raw_data <- fread("repdata-data-StormData.csv.bz2")
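The timing comparison described above can be sketched roughly as follows. This is only an illustration on a small temporary CSV, not the benchmark actually run on the NOAA file; the helper `time_reader` and the toy data are assumptions introduced here, and fread() is timed only when the data.table package is installed.

```r
# Hypothetical benchmark on a small temporary CSV (not the NOAA file).
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(id = 1:50000, value = rnorm(50000))
write.csv(toy, tmp, row.names = FALSE)

# Helper (assumed name): elapsed seconds for one call of a reader function,
# measured with Sys.time() before and after the call.
time_reader <- function(reader, path) {
  start <- Sys.time()
  result <- reader(path)
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  list(seconds = elapsed, rows = nrow(result))
}

timings <- list(read.csv = time_reader(read.csv, tmp))
# fread() is timed only if data.table is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  timings$fread <- time_reader(data.table::fread, tmp)
}
print(sapply(timings, function(x) x$seconds))
unlink(tmp)
```

A single run is noisy, so repeating each call a few times and comparing the medians would give a steadier picture.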
Inspection of raw_data shows that only a few columns are needed for the goals set above. Selecting only those columns reduces the size of the data set and, in turn, the time needed for later evaluations.
The columns of interest and their corresponding indices are: EVTYPE = 8, FATALITIES = 23, INJURIES = 24, PROPDMG = 25, PROPDMGEXP = 26, CROPDMG = 27 and CROPDMGEXP = 28.
data <- raw_data[, c(8,23,24,25,26,27,28)]
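Selecting by position works as long as the column order of the file never changes. As a hedged alternative, the same subset could be taken by name. The sketch below uses a made-up two-row data frame (`toy_raw`) in place of raw_data; only the column names are taken from the text.

```r
# Toy stand-in for raw_data; the values are invented for illustration.
toy_raw <- data.frame(
  STATE = c("AL", "TX"), EVTYPE = c("TORNADO", "FLOOD"),
  FATALITIES = c(0, 1), INJURIES = c(15, 0),
  PROPDMG = c(25, 2.5), PROPDMGEXP = c("K", "M"),
  CROPDMG = c(0, 0), CROPDMGEXP = c("", "")
)

keep <- c("EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Fail early if any expected column is missing.
stopifnot(all(keep %in% names(toy_raw)))

toy_data <- toy_raw[, keep]
# For a data.table the equivalent call is raw_data[, keep, with = FALSE].
names(toy_data)
```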
Now, let’s look at the dimensions of the data set.
dim(data)
## [1] 902297 7
At this stage, the data set is still huge, so for efficiency it is useful to reduce it further. Only the weather events that caused some damage are of interest; removing the occurrences with no recorded damage makes the data set much smaller.
data <- subset(data, !(FATALITIES == 0 & INJURIES == 0 & PROPDMG == 0.00 &
CROPDMG == 0 & PROPDMGEXP == "" & CROPDMGEXP == ""))
data$EVTYPE <- as.factor(data$EVTYPE)
data$PROPDMGEXP <- as.factor(data$PROPDMGEXP)
data$CROPDMGEXP <- as.factor(data$CROPDMGEXP)
str(data)
## Classes 'data.table' and 'data.frame': 447978 obs. of 7 variables:
## $ EVTYPE : Factor w/ 491 levels " HIGH SURF ADVISORY",..: 408 408 408 408 408 408 408 408 408 408 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
The basic structure of the updated data set is shown above. With approximately half of its rows omitted, the data set is much more compact. Note that all of the non-numeric columns are converted to factor variables so that their distinct values can be observed.
According to the structure of the data shown in the previous section, the EVTYPE column has 491 levels. Let’s take a closer look at them.
levels(data$EVTYPE)
## [1] " HIGH SURF ADVISORY" " FLASH FLOOD"
## [3] " TSTM WIND" " TSTM WIND (G45)"
## [5] "?" "AGRICULTURAL FREEZE"
## [7] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [9] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [11] "AVALANCHE" "Beach Erosion"
## [13] "BLACK ICE" "BLIZZARD"
## [15] "BLIZZARD/WINTER STORM" "BLOWING DUST"
## [17] "blowing snow" "BLOWING SNOW"
## [19] "BREAKUP FLOODING" "BRUSH FIRE"
## [21] "COASTAL FLOODING/EROSION" "COASTAL EROSION"
## [23] "Coastal Flood" "COASTAL FLOOD"
## [25] "Coastal Flooding" "COASTAL FLOODING"
## [27] "COASTAL FLOODING/EROSION" "Coastal Storm"
## [29] "COASTAL STORM" "COASTAL SURGE"
## [31] "COASTALSTORM" "Cold"
## [33] "COLD" "COLD AIR TORNADO"
## [35] "COLD AND SNOW" "COLD AND WET CONDITIONS"
## [37] "Cold Temperature" "COLD WAVE"
## [39] "COLD WEATHER" "COLD/WIND CHILL"
## [41] "COLD/WINDS" "COOL AND WET"
## [43] "DAM BREAK" "Damaging Freeze"
## [45] "DAMAGING FREEZE" "DENSE FOG"
## [47] "DENSE SMOKE" "DOWNBURST"
## [49] "DROUGHT" "DROUGHT/EXCESSIVE HEAT"
## [51] "DROWNING" "DRY MICROBURST"
## [53] "DRY MIRCOBURST WINDS" "Dust Devil"
## [55] "DUST DEVIL" "DUST DEVIL WATERSPOUT"
## [57] "DUST STORM" "DUST STORM/HIGH WINDS"
## [59] "Early Frost" "Erosion/Cstl Flood"
## [61] "EXCESSIVE HEAT" "EXCESSIVE RAINFALL"
## [63] "EXCESSIVE SNOW" "EXCESSIVE WETNESS"
## [65] "Extended Cold" "Extreme Cold"
## [67] "EXTREME COLD" "EXTREME COLD/WIND CHILL"
## [69] "EXTREME HEAT" "EXTREME WIND CHILL"
## [71] "EXTREME WINDCHILL" "FALLING SNOW/ICE"
## [73] "FLASH FLOOD" "FLASH FLOOD - HEAVY RAIN"
## [75] "FLASH FLOOD FROM ICE JAMS" "FLASH FLOOD LANDSLIDES"
## [77] "FLASH FLOOD WINDS" "FLASH FLOOD/"
## [79] "FLASH FLOOD/ STREET" "FLASH FLOOD/FLOOD"
## [81] "FLASH FLOOD/LANDSLIDE" "FLASH FLOODING"
## [83] "FLASH FLOODING/FLOOD" "FLASH FLOODING/THUNDERSTORM WI"
## [85] "FLASH FLOODS" "FLOOD"
## [87] "FLOOD & HEAVY RAIN" "FLOOD FLASH"
## [89] "FLOOD/FLASH" "FLOOD/FLASH FLOOD"
## [91] "FLOOD/FLASH/FLOOD" "FLOOD/FLASHFLOOD"
## [93] "FLOOD/RAIN/WINDS" "FLOOD/RIVER FLOOD"
## [95] "FLOODING" "FLOODING/HEAVY RAIN"
## [97] "FLOODS" "FOG"
## [99] "FOG AND COLD TEMPERATURES" "FOREST FIRES"
## [101] "Freeze" "FREEZE"
## [103] "Freezing drizzle" "Freezing Drizzle"
## [105] "FREEZING DRIZZLE" "FREEZING FOG"
## [107] "Freezing Rain" "FREEZING RAIN"
## [109] "FREEZING RAIN/SLEET" "FREEZING RAIN/SNOW"
## [111] "Freezing Spray" "FROST"
## [113] "Frost/Freeze" "FROST/FREEZE"
## [115] "FROST\\FREEZE" "FUNNEL CLOUD"
## [117] "Glaze" "GLAZE"
## [119] "GLAZE ICE" "GLAZE/ICE STORM"
## [121] "gradient wind" "Gradient wind"
## [123] "GRADIENT WIND" "GRASS FIRES"
## [125] "GROUND BLIZZARD" "GUSTNADO"
## [127] "GUSTY WIND" "GUSTY WIND/HAIL"
## [129] "GUSTY WIND/HVY RAIN" "Gusty wind/rain"
## [131] "Gusty winds" "Gusty Winds"
## [133] "GUSTY WINDS" "HAIL"
## [135] "HAIL 0.75" "HAIL 075"
## [137] "HAIL 100" "HAIL 125"
## [139] "HAIL 150" "HAIL 175"
## [141] "HAIL 200" "HAIL 275"
## [143] "HAIL 450" "HAIL 75"
## [145] "HAIL DAMAGE" "HAIL/WIND"
## [147] "HAIL/WINDS" "HAILSTORM"
## [149] "HARD FREEZE" "HAZARDOUS SURF"
## [151] "HEAT" "Heat Wave"
## [153] "HEAT WAVE" "HEAT WAVE DROUGHT"
## [155] "HEAT WAVES" "HEAVY LAKE SNOW"
## [157] "HEAVY MIX" "HEAVY PRECIPITATION"
## [159] "HEAVY RAIN" "HEAVY RAIN AND FLOOD"
## [161] "Heavy Rain/High Surf" "HEAVY RAIN/LIGHTNING"
## [163] "HEAVY RAIN/SEVERE WEATHER" "HEAVY RAIN/SMALL STREAM URBAN"
## [165] "HEAVY RAIN/SNOW" "HEAVY RAINS"
## [167] "HEAVY RAINS/FLOODING" "HEAVY SEAS"
## [169] "HEAVY SHOWER" "HEAVY SNOW"
## [171] "HEAVY SNOW AND HIGH WINDS" "HEAVY SNOW AND STRONG WINDS"
## [173] "Heavy snow shower" "HEAVY SNOW SQUALLS"
## [175] "HEAVY SNOW-SQUALLS" "HEAVY SNOW/BLIZZARD"
## [177] "HEAVY SNOW/BLIZZARD/AVALANCHE" "HEAVY SNOW/FREEZING RAIN"
## [179] "HEAVY SNOW/HIGH WINDS & FLOOD" "HEAVY SNOW/ICE"
## [181] "HEAVY SNOW/SQUALLS" "HEAVY SNOW/WIND"
## [183] "HEAVY SNOW/WINTER STORM" "HEAVY SNOWPACK"
## [185] "Heavy Surf" "HEAVY SURF"
## [187] "Heavy surf and wind" "HEAVY SURF COASTAL FLOODING"
## [189] "HEAVY SURF/HIGH SURF" "HEAVY SWELLS"
## [191] "HIGH" "HIGH WINDS"
## [193] "HIGH SEAS" "High Surf"
## [195] "HIGH SURF" "HIGH SWELLS"
## [197] "HIGH TIDES" "HIGH WATER"
## [199] "HIGH WAVES" "HIGH WIND"
## [201] "HIGH WIND (G40)" "HIGH WIND 48"
## [203] "HIGH WIND AND SEAS" "HIGH WIND DAMAGE"
## [205] "HIGH WIND/BLIZZARD" "HIGH WIND/HEAVY SNOW"
## [207] "HIGH WIND/SEAS" "HIGH WINDS"
## [209] "HIGH WINDS HEAVY RAINS" "HIGH WINDS/"
## [211] "HIGH WINDS/COASTAL FLOOD" "HIGH WINDS/COLD"
## [213] "HIGH WINDS/HEAVY RAIN" "HIGH WINDS/SNOW"
## [215] "HURRICANE" "Hurricane Edouard"
## [217] "HURRICANE EMILY" "HURRICANE ERIN"
## [219] "HURRICANE FELIX" "HURRICANE GORDON"
## [221] "HURRICANE OPAL" "HURRICANE OPAL/HIGH WINDS"
## [223] "HURRICANE-GENERATED SWELLS" "HURRICANE/TYPHOON"
## [225] "HVY RAIN" "HYPERTHERMIA/EXPOSURE"
## [227] "HYPOTHERMIA" "Hypothermia/Exposure"
## [229] "HYPOTHERMIA/EXPOSURE" "ICE"
## [231] "ICE AND SNOW" "ICE FLOES"
## [233] "ICE JAM" "Ice jam flood (minor"
## [235] "ICE JAM FLOODING" "ICE ON ROAD"
## [237] "ICE ROADS" "ICE STORM"
## [239] "ICE STORM/FLASH FLOOD" "ICE/STRONG WINDS"
## [241] "ICY ROADS" "Lake Effect Snow"
## [243] "LAKE EFFECT SNOW" "LAKE FLOOD"
## [245] "LAKE-EFFECT SNOW" "LAKESHORE FLOOD"
## [247] "LANDSLIDE" "LANDSLIDES"
## [249] "Landslump" "LANDSPOUT"
## [251] "LATE SEASON SNOW" "LIGHT FREEZING RAIN"
## [253] "Light snow" "Light Snow"
## [255] "LIGHT SNOW" "Light Snowfall"
## [257] "LIGHTING" "LIGHTNING"
## [259] "LIGHTNING WAUSEON" "LIGHTNING AND HEAVY RAIN"
## [261] "LIGHTNING AND THUNDERSTORM WIN" "LIGHTNING FIRE"
## [263] "LIGHTNING INJURY" "LIGHTNING THUNDERSTORM WINDS"
## [265] "LIGHTNING." "LIGHTNING/HEAVY RAIN"
## [267] "LIGNTNING" "LOW TEMPERATURE"
## [269] "MAJOR FLOOD" "Marine Accident"
## [271] "MARINE HAIL" "MARINE HIGH WIND"
## [273] "MARINE MISHAP" "MARINE STRONG WIND"
## [275] "MARINE THUNDERSTORM WIND" "MARINE TSTM WIND"
## [277] "Microburst" "MICROBURST"
## [279] "MICROBURST WINDS" "MINOR FLOODING"
## [281] "MIXED PRECIP" "Mixed Precipitation"
## [283] "MIXED PRECIPITATION" "MUD SLIDE"
## [285] "MUD SLIDES" "MUD SLIDES URBAN FLOODING"
## [287] "Mudslide" "MUDSLIDE"
## [289] "Mudslides" "MUDSLIDES"
## [291] "NON TSTM WIND" "NON-SEVERE WIND DAMAGE"
## [293] "NON-TSTM WIND" "Other"
## [295] "OTHER" "RAIN"
## [297] "RAIN/SNOW" "RAIN/WIND"
## [299] "RAINSTORM" "RAPIDLY RISING WATER"
## [301] "RECORD COLD" "RECORD HEAT"
## [303] "RECORD RAINFALL" "RECORD SNOW"
## [305] "RECORD WARMTH" "RECORD/EXCESSIVE HEAT"
## [307] "RIP CURRENT" "RIP CURRENTS"
## [309] "RIP CURRENTS/HEAVY SURF" "RIVER AND STREAM FLOOD"
## [311] "RIVER FLOOD" "River Flooding"
## [313] "RIVER FLOODING" "ROCK SLIDE"
## [315] "ROGUE WAVE" "ROUGH SEAS"
## [317] "ROUGH SURF" "RURAL FLOOD"
## [319] "SEICHE" "SEVERE THUNDERSTORM"
## [321] "SEVERE THUNDERSTORM WINDS" "SEVERE THUNDERSTORMS"
## [323] "SEVERE TURBULENCE" "SLEET"
## [325] "SLEET/ICE STORM" "SMALL HAIL"
## [327] "SMALL STREAM FLOOD" "Snow"
## [329] "SNOW" "SNOW ACCUMULATION"
## [331] "SNOW AND HEAVY SNOW" "SNOW AND ICE"
## [333] "SNOW AND ICE STORM" "SNOW FREEZING RAIN"
## [335] "SNOW SQUALL" "Snow Squalls"
## [337] "SNOW SQUALLS" "SNOW/ BITTER COLD"
## [339] "SNOW/ ICE" "SNOW/BLOWING SNOW"
## [341] "SNOW/COLD" "SNOW/FREEZING RAIN"
## [343] "SNOW/HEAVY SNOW" "SNOW/HIGH WINDS"
## [345] "SNOW/ICE" "SNOW/ICE STORM"
## [347] "SNOW/SLEET" "SNOW/SLEET/FREEZING RAIN"
## [349] "SNOWMELT FLOODING" "STORM FORCE WINDS"
## [351] "STORM SURGE" "STORM SURGE/TIDE"
## [353] "Strong Wind" "STRONG WIND"
## [355] "Strong Winds" "STRONG WINDS"
## [357] "THUDERSTORM WINDS" "THUNDEERSTORM WINDS"
## [359] "THUNDERESTORM WINDS" "THUNDERSNOW"
## [361] "THUNDERSTORM" "THUNDERSTORM WINDS"
## [363] "THUNDERSTORM DAMAGE TO" "THUNDERSTORM HAIL"
## [365] "THUNDERSTORM WIND" "THUNDERSTORM WIND (G40)"
## [367] "THUNDERSTORM WIND 60 MPH" "THUNDERSTORM WIND 65 MPH"
## [369] "THUNDERSTORM WIND 65MPH" "THUNDERSTORM WIND 98 MPH"
## [371] "THUNDERSTORM WIND G50" "THUNDERSTORM WIND G52"
## [373] "THUNDERSTORM WIND G55" "THUNDERSTORM WIND G60"
## [375] "THUNDERSTORM WIND TREES" "THUNDERSTORM WIND."
## [377] "THUNDERSTORM WIND/ TREE" "THUNDERSTORM WIND/ TREES"
## [379] "THUNDERSTORM WIND/AWNING" "THUNDERSTORM WIND/HAIL"
## [381] "THUNDERSTORM WIND/LIGHTNING" "THUNDERSTORM WINDS"
## [383] "THUNDERSTORM WINDS 13" "THUNDERSTORM WINDS 63 MPH"
## [385] "THUNDERSTORM WINDS AND" "THUNDERSTORM WINDS G60"
## [387] "THUNDERSTORM WINDS HAIL" "THUNDERSTORM WINDS LIGHTNING"
## [389] "THUNDERSTORM WINDS." "THUNDERSTORM WINDS/ FLOOD"
## [391] "THUNDERSTORM WINDS/FLOODING" "THUNDERSTORM WINDS/FUNNEL CLOU"
## [393] "THUNDERSTORM WINDS/HAIL" "THUNDERSTORM WINDS53"
## [395] "THUNDERSTORM WINDSHAIL" "THUNDERSTORM WINDSS"
## [397] "THUNDERSTORM WINS" "THUNDERSTORMS"
## [399] "THUNDERSTORMS WIND" "THUNDERSTORMS WINDS"
## [401] "THUNDERSTORMW" "THUNDERSTORMWINDS"
## [403] "THUNDERSTROM WIND" "THUNDERTORM WINDS"
## [405] "THUNERSTORM WINDS" "Tidal Flooding"
## [407] "TIDAL FLOODING" "TORNADO"
## [409] "TORNADO F0" "TORNADO F1"
## [411] "TORNADO F2" "TORNADO F3"
## [413] "TORNADOES" "TORNADOES, TSTM WIND, HAIL"
## [415] "TORNDAO" "Torrential Rainfall"
## [417] "TROPICAL DEPRESSION" "TROPICAL STORM"
## [419] "TROPICAL STORM ALBERTO" "TROPICAL STORM DEAN"
## [421] "TROPICAL STORM GORDON" "TROPICAL STORM JERRY"
## [423] "Tstm Wind" "TSTM WIND"
## [425] "TSTM WIND (G45)" "TSTM WIND (41)"
## [427] "TSTM WIND (G35)" "TSTM WIND (G40)"
## [429] "TSTM WIND (G45)" "TSTM WIND 40"
## [431] "TSTM WIND 45" "TSTM WIND 55"
## [433] "TSTM WIND 65)" "TSTM WIND AND LIGHTNING"
## [435] "TSTM WIND DAMAGE" "TSTM WIND G45"
## [437] "TSTM WIND G58" "TSTM WIND/HAIL"
## [439] "TSTM WINDS" "TSTMW"
## [441] "TSUNAMI" "TUNDERSTORM WIND"
## [443] "TYPHOON" "Unseasonable Cold"
## [445] "UNSEASONABLY COLD" "UNSEASONABLY WARM"
## [447] "UNSEASONABLY WARM AND DRY" "UNSEASONAL RAIN"
## [449] "URBAN AND SMALL" "URBAN AND SMALL STREAM FLOODIN"
## [451] "URBAN FLOOD" "URBAN FLOODING"
## [453] "URBAN FLOODS" "URBAN SMALL"
## [455] "URBAN/SMALL STREAM" "URBAN/SMALL STREAM FLOOD"
## [457] "URBAN/SMALL STREAM FLOODING" "URBAN/SML STREAM FLD"
## [459] "VOLCANIC ASH" "VOLCANIC ASHFALL"
## [461] "WARM WEATHER" "WATERSPOUT"
## [463] "WATERSPOUT TORNADO" "WATERSPOUT-"
## [465] "WATERSPOUT-TORNADO" "WATERSPOUT/ TORNADO"
## [467] "WATERSPOUT/TORNADO" "WET MICROBURST"
## [469] "Whirlwind" "WHIRLWIND"
## [471] "WILD FIRES" "WILD/FOREST FIRE"
## [473] "WILD/FOREST FIRES" "WILDFIRE"
## [475] "WILDFIRES" "Wind"
## [477] "WIND" "WIND AND WAVE"
## [479] "Wind Damage" "WIND DAMAGE"
## [481] "WIND STORM" "WIND/HAIL"
## [483] "WINDS" "WINTER STORM"
## [485] "WINTER STORM HIGH WINDS" "WINTER STORMS"
## [487] "WINTER WEATHER" "WINTER WEATHER MIX"
## [489] "WINTER WEATHER/MIX" "Wintry Mix"
## [491] "WINTRY MIX"
Many levels clearly differ only in letter case but denote the same event. In addition, the wording may have changed since the data was first recorded, producing different levels that essentially mean the same thing. There are also typing errors that must be taken into account. It therefore makes sense to group together the events that are similar or identical. Note that the grouping is based on the Event Types section of the document explaining the variables in the database.
In order to group the events together, the column must first be converted back to the character class; it is reverted to a factor variable afterwards.
data$EVTYPE <- as.character(data$EVTYPE)
data$EVTYPE[grep("AVALANCHE|AVALANCE", data$EVTYPE, ignore.case = T)] <- "Avalanche"
data$EVTYPE[grep("COLD.*|LOW TEMP.*", data$EVTYPE, ignore.case = T)] <- "Cold"
data$EVTYPE[grep("DEBRIS FLOW|LANDSLIDE.*|MUD.*SLIDE.*|ROCK.*SLIDE|landslump", data$EVTYPE, ignore.case = T)] <- "Landslide"
data$EVTYPE[grep(".*EROSION.*|APACHE COUNTY", data$EVTYPE, ignore.case = T)] <- "Erosion"
data$EVTYPE[grep(".*FOG", data$EVTYPE, ignore.case = T)] <- "Fog"
data$EVTYPE[grep("SMOKE", data$EVTYPE, ignore.case = T)] <- "Smoke"
data$EVTYPE[grep("DROUGHT", data$EVTYPE, ignore.case = T)] <- "Drought"
data$EVTYPE[grep(".*DUST.*", data$EVTYPE, ignore.case = T)] <- "Dust"
data$EVTYPE[grep(".*HEAT|.*HYPOTHERMIA|.*WARM|.*BURST|HYPERTHERMIA.*", data$EVTYPE, ignore.case = T)] <- "Heat"
data$EVTYPE[grep(".*FLOOD.*|.*STREAM.*|URBAN.*", data$EVTYPE, ignore.case = T)] <- "Flood"
data$EVTYPE[grep(".*FROST.*|.*FREEZ.*", data$EVTYPE, ignore.case = T)] <- "Frost/Freeze"
data$EVTYPE[grep("HAIL.*", data$EVTYPE, ignore.case = T)] <- "Hail"
data$EVTYPE[grep(".*RAIN.*|.*PRECIP.*|.*SHOWER", data$EVTYPE, ignore.case = T)] <- "Rain"
data$EVTYPE[grep(".*SURF|.*SWELL", data$EVTYPE, ignore.case = T)] <- "High Surf"
data$EVTYPE[grep("HURRICANE|TYPHOON", data$EVTYPE, ignore.case = T)] <- "Hurricane"
data$EVTYPE[grep("ICE|icy|GLAZE", data$EVTYPE, ignore.case = T)] <- "Ice"
data$EVTYPE[grep("LIGHTNING|LIGHTING|LIGNTNING", data$EVTYPE, ignore.case = T)] <- "Lightning"
data$EVTYPE[grep("MARINE.*|DROWNING", data$EVTYPE, ignore.case = T)] <- "Marine Accident"
data$EVTYPE[grep("RIP CURRENT", data$EVTYPE, ignore.case = T)] <- "Rip Current"
data$EVTYPE[grep("SNOW", data$EVTYPE, ignore.case = T)] <- "Snow"
data$EVTYPE[grep(".*SURGE|TIDE|.*WAVE.*|.*SEA.*|.*WATER|SEICHE", data$EVTYPE, ignore.case = T)] <- "Tide"
data$EVTYPE[grep("TORNADO|TORNDAO|LANDSPOUT|WATERSPOUT|FUNNEL CLOUD", data$EVTYPE, ignore.case = T)] <- "Tornado"
data$EVTYPE[grep("THUNDERSTORM.|.*STORM|DEPRESSION", data$EVTYPE, ignore.case = T)] <- "Storm"
data$EVTYPE[grep("TSUNAMI", data$EVTYPE, ignore.case = T)] <- "Tsunami"
data$EVTYPE[grep("VOLCANIC ASH.*", data$EVTYPE, ignore.case = T)] <- "Volcano"
data$EVTYPE[grep(".*FIRE.*", data$EVTYPE, ignore.case = T)] <- "Wildfire"
data$EVTYPE[grep("WIND|TSTMW|GUSTNADO|.*TURBULENCE", data$EVTYPE, ignore.case = T)] <- "Wind"
data$EVTYPE[grep("WINTER WEATHER|DRIZZLE|.*MIX|BLIZZARD|SLEET", data$EVTYPE, ignore.case = T)] <- "Winter Weather"
data$EVTYPE[grep(".*WET.*|DAM BREAK|HIGH|\\?|OTHER", data$EVTYPE, ignore.case = T)] <- "Other"
data$EVTYPE <- as.factor(data$EVTYPE)
str(data$EVTYPE)
## Factor w/ 28 levels "Avalanche","Cold",..: 23 23 23 23 23 23 23 23 23 23 ...
Please note that the grouping is simplified so that EVTYPE has a total of 28 levels rather than the 48 event types listed in the documentation.
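Because the patterns are applied one after another, the order of the rules affects the result: a label assigned by an early rule can itself be matched by a later pattern. The following is a minimal sketch of one way to make that order explicit, using a small rule table and a mask of rows that are already assigned. The function `group_events`, the three rules and the sample events are assumptions for illustration, not the grouping used in this analysis.

```r
# Ordered rules: pattern -> group. Earlier rules win.
rules <- c(
  "AVALANCHE|AVALANCE" = "Avalanche",
  "FLOOD"              = "Flood",
  "TORNADO|TORNDAO"    = "Tornado"
)

group_events <- function(x, rules) {
  out  <- toupper(trimws(x))        # normalise case and stray spaces
  done <- rep(FALSE, length(out))   # rows already assigned to a group
  for (i in seq_along(rules)) {
    hit <- !done & grepl(names(rules)[i], out)
    out[hit] <- rules[[i]]
    done <- done | hit
  }
  out
}

events <- c(" FLASH FLOOD", "Avalance", "TORNDAO", "Coastal Flood", "SEICHE")
group_events(events, rules)
```

Events that match no rule are returned in their normalised (upper-case, trimmed) form, so they remain visible for a later catch-all rule.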
Let’s first view how many times each group of events occurred during the period of investigation.
sort(table(data$EVTYPE), decreasing = TRUE)
##
## Storm Hail Wind Tornado Flood
## 102457 95238 84234 54650 53290
## Lightning Winter Weather Snow Marine Accident Rain
## 13355 8785 8035 6110 5957
## Wildfire Heat Cold Tide Drought
## 2431 2164 1673 1666 1538
## Ice Frost/Freeze Fog Rip Current Other
## 1449 1159 1105 691 584
## Landslide Dust Avalanche Hurricane Tsunami
## 432 379 319 235 19
## Erosion Smoke Volcano
## 9 9 5
The frequencies are clearly highly dispersed, ranging from only 5 to over 100,000. With such great variation, it is important to take the frequency of each event into account when considering its harmful effect. To clarify, if an event type is judged the most harmful by adding up its total deaths and injuries, the result will be biased towards event types that occur often. A more representative figure is therefore the average number of fatalities and injuries per event. This will be accounted for in the Results section. For now, let’s calculate the total health effect.
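The difference between a total and a per-event average can be seen on a tiny made-up data set. In this sketch (all values invented), event type A occurs four times with one fatality each and event type B occurs once with three fatalities.

```r
toy <- data.frame(
  EVTYPE     = c("A", "A", "A", "A", "B"),
  FATALITIES = c(1, 1, 1, 1, 3)
)

total   <- aggregate(FATALITIES ~ EVTYPE, toy, FUN = sum)
average <- aggregate(FATALITIES ~ EVTYPE, toy, FUN = mean)

# Side-by-side view: A leads on the total, B leads on the average.
merge(total, average, by = "EVTYPE", suffixes = c("_total", "_mean"))
```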
# Total fatalities and total injuries for each event type
death <- aggregate(FATALITIES ~ EVTYPE, data, FUN = sum)
injured <- aggregate(INJURIES ~ EVTYPE, data, FUN = sum)
Now, let’s create a new data frame called health, which records the number of cases by event type and by whether they were fatalities or injuries.
health <- data.frame(
Type = rep(c("Fatalities", "Injuries"), each = 28),
Event = rep(levels(data$EVTYPE), 2),
Cases = c(death[,2], injured[,2])
)
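Building the long table by position relies on the two aggregates being listed in the same order as the Type labels. A possible name-based alternative is sketched below on invented data: one aggregate() call sums both columns, and stack() labels each value with the column it came from. The names `toy` and `toy_health` are assumptions for illustration.

```r
toy <- data.frame(
  EVTYPE     = factor(c("Flood", "Tornado", "Tornado")),
  FATALITIES = c(2, 5, 1),
  INJURIES   = c(10, 40, 7)
)

# One call sums both columns, so they share the same row order.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, toy, FUN = sum)

# stack() returns the values together with the name of the source column.
long <- stack(totals[, c("FATALITIES", "INJURIES")])

toy_health <- data.frame(
  Type  = long$ind,
  Event = rep(totals$EVTYPE, times = 2),
  Cases = long$values
)
toy_health
```

Since each label comes from the column name itself, a value cannot end up under the wrong Type.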
For clarity, all of the event types will be plotted on a bar chart against the total number of cases. Using the ggplot2 package, fatalities and injuries are distinguished by fill color.
library(ggplot2)
gHealth <- ggplot(health, aes(Cases, Event, fill = Type))
gHealth + geom_bar(stat = "identity") + labs(title =
"Public Health Effect from Weather Events", x = "Total Number of Cases")
In short, Tornado had the greatest impact on public health.
Before calculating the total damage to property and crops from weather events, note that the damage figures are scaled by the exponent values in the PROPDMGEXP and CROPDMGEXP columns of the data set, which are currently factor variables. Let’s first see what these columns contain.
levels(data$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
levels(data$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
Some of these are self-explanatory, such as the common suffixes (K, M, B etc.), while others are less meaningful. Let’s convert these exponent values into numeric multipliers that can be multiplied by the PROPDMG and CROPDMG columns to get the actual total damages. This process uses the mapvalues function from the plyr package. The conversion is based on the Guide on Exponent Values.
library(plyr)
data$PROPDMGEXP <- as.character(data$PROPDMGEXP)
data$CROPDMGEXP <- as.character(data$CROPDMGEXP)
# Let's do the property damages first
data$PROPDMGEXP <- mapvalues(data$PROPDMGEXP, from = c("K", "M", "", "B", "m",
"+", "0", "5", "6", "?", "4", "2", "3", "h", "7", "H", "-", "1", "8"),
to = c(10^3, 10^6, 0, 10^9, 10^6, 1, 10, 10, 10, 0, 10, 10, 10, 100,
10, 100, 0, 10, 10))
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
data$PROPDMG <- data$PROPDMG * data$PROPDMGEXP
# Then the crop damages
data$CROPDMGEXP <- mapvalues(data$CROPDMGEXP, from = c("", "M", "K", "m", "B",
"?", "0", "k", "2"),to = c(0, 10^6, 10^3, 10^6, 10^9, 0, 10, 10^3, 10))
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
data$CROPDMG <- data$CROPDMG * data$CROPDMGEXP
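The same conversion could also be written as a lookup in a named vector, which keeps each code next to its multiplier. The sketch below mirrors the mapping used in the code above (H = 100, K = 10^3, M = 10^6, B = 10^9, "+" = 1, digits 0 to 8 = 10, everything else 0). The helper `exp_to_number` and the sample values are assumptions for illustration.

```r
# Multipliers keyed by upper-case code; digits "0" to "8" all map to 10.
multiplier <- c(
  H = 1e2, K = 1e3, M = 1e6, B = 1e9, "+" = 1,
  setNames(rep(10, 9), as.character(0:8))
)

exp_to_number <- function(code) {
  value <- multiplier[toupper(code)]
  value[is.na(value)] <- 0   # "", "-", "?" and anything unlisted
  unname(value)
}

dmg  <- c(25, 2.5, 10, 3, 7)        # made-up damage figures
code <- c("K", "m", "", "5", "?")   # made-up exponent codes
dmg * exp_to_number(code)
```

Upper-casing the code first means "k", "m" and "h" need no separate entries.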
After that, following the same principle as in 4.1.1, the total damages to property and crops are calculated.
prop <- aggregate(PROPDMG ~ EVTYPE, data, FUN = sum)
crop <- aggregate(CROPDMG ~ EVTYPE, data, FUN = sum)
As before, a new data frame called economy is created to categorize the damages by event type and by whether they affect property or crops.
economy <- data.frame(
Type = rep(c("Property Damage", "Crop Damage"), each = 28),
Event = rep(levels(data$EVTYPE), 2),
Damages = c(prop[,2], crop[,2])
)
The last part visualizes the damages with a bar chart. As before, it plots event type against total damage using the ggplot2 package.
gEcon <- ggplot(economy, aes(Damages, Event, fill = Type))
gEcon + geom_bar(stat = "identity") + labs(title =
"Economic Consequence from Weather Events", x = "Total Damage in US Dollars")
According to the bar chart, Flood was the most damaging to the US economy in terms of property and crops.
In terms of public health, Tornado was clearly by far the most damaging weather event. Although the gap is less distinct than in the health plot, the total damage to property and crops from Flood exceeds that of every other event type. Let’s see these impacts in actual numbers.
subset(health, Event == "Tornado")[, c(1,3)]
## Type Cases
## 23 Fatalities 5633
## 51 Injuries 91367
subset(economy, Event == "Flood")[, c(1,3)]
## Type Damages
## 6 Property Damage 167536782695
## 34 Crop Damage 12388597200
However, this is not the whole picture. If a government plans to build infrastructure to prevent, or at least reduce, the damage from these severe weather events, it has to consider other factors. The available budget and environmental concerns are some examples. Equally important is the probability of each event occurring. There is little point in spending a huge amount of money guarding against volcanic eruptions, which could be devastating but rarely occur. The following code calculates the proportion of each event type during the investigated period, which can be used as an approximate measure of probability.
prob <- data.frame(Event = levels(data$EVTYPE), Probability = rep(0,28))
for (i in 1:length(levels(data$EVTYPE))) {
subdata <- subset(data, EVTYPE == levels(data$EVTYPE)[i])
freq <- nrow(subdata)
prob[i,2] <- freq/nrow(data)
}
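The loop above computes the share of rows belonging to each event type. As a sketch of an equivalent vectorised form, table() and prop.table() give the same proportions in one step. The vector `toy_events` is invented for illustration.

```r
toy_events <- factor(c("Flood", "Tornado", "Tornado", "Hail"))

# table() counts each level; prop.table() divides by the total count.
toy_prob <- as.data.frame(prop.table(table(toy_events)),
                          stringsAsFactors = FALSE)
names(toy_prob) <- c("Event", "Probability")
toy_prob
```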
At this stage, the difference in frequency of occurrence is taken into account. This is done by calculating the Expected Value for both the health and the economic damages.
health1 <- merge(health, prob, by = "Event")
health1$Expected_Cases <- health1$Cases * health1$Probability
economy1 <- merge(economy, prob, by = "Event")
economy1$Expected_Damages <- economy1$Damages * economy1$Probability
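To make the arithmetic concrete, here is a small worked example with invented numbers: each total is multiplied by the proportion of occurrences of its event type. merge() matches the rows by the Event name, so the two tables do not need to be in the same order.

```r
toy_health <- data.frame(Event = c("Flood", "Tornado"),
                         Cases = c(100, 1000))
toy_prob   <- data.frame(Event = c("Tornado", "Flood"),
                         Probability = c(0.2, 0.8))

# Rows are paired by Event, not by position.
toy <- merge(toy_health, toy_prob, by = "Event")
toy$Expected_Cases <- toy$Cases * toy$Probability
toy
```

Here Flood gets 100 × 0.8 = 80 and Tornado gets 1000 × 0.2 = 200.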
Then the two expected-damage plots are drawn together. For clarity, only the top 5 weather events in each area of concern are plotted.
From the health plot in section 4.1.2, injuries far outnumber fatalities; fatalities are nevertheless the more severe outcome, so the top 5 events leading to the greatest harm are ranked by the number of fatalities.
Similarly, in the economy plot in section 4.2.2, the damage to property is much greater than the damage to crops, so the ranking for the economy is based on property damage.
# Rank by fatalities and keep the five events with the highest expected cases
rankHealth <- subset(health1, Type == "Fatalities", 1:5)
topHealth <- rankHealth[order(rankHealth$Expected_Cases, decreasing = TRUE)[1:5], 1]
health2 <- subset(health1, Event %in% topHealth)
# Rank by property damage and keep the five events with the highest expected damages
rankEconomy <- subset(economy1, Type == "Property Damage", 1:5)
topEconomy <- rankEconomy[order(rankEconomy$Expected_Damages, decreasing = TRUE)[1:5], 1]
economy2 <- subset(economy1, Event %in% topEconomy)
Lastly, the plots are made using the ggplot2 package. They illustrate the expected damages caused by the top 5 most severe event types for both areas of interest.
gHealth1 <- ggplot(health2, aes(Expected_Cases, Event, fill = Type)) +
geom_bar(stat = "identity") + labs(x = "Expected Number of Cases")
gEcon1 <- ggplot(economy2, aes(Expected_Damages, Event, fill = Type)) +
geom_bar(stat = "identity") + labs(x = "Expected Damages in US Dollars")
library(gridExtra)
grid.arrange(gHealth1, gEcon1, ncol = 1, top = "Expected Impacts of Severe Weather Events on Health and Economy")
From the plots, Tornado and Flood clearly remain in the lead for damage to public health and the economy, respectively. However, the economic consequences of Flood were far greater than the public health impact of Tornado. Let’s view this in absolute terms.
tornado <- subset(health2, Event == "Tornado")
case <- sum(tornado$Expected_Cases)
paste("It is expected that tornado will lead to", case, "cases")
## [1] "It is expected that tornado will lead to 11833.2819915264 cases"
flood <- subset(economy2, Event == "Flood")
dmg <- sum(flood$Expected_Damages)
paste("The expected damage resulting from flood is", dmg, "US Dollars")
## [1] "The expected damage resulting from flood is 21403335642.8319 US Dollars"
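Long unformatted numbers like the ones above are hard to read. As an optional sketch, formatC() can round a value and insert thousands separators before it is pasted into a sentence. The value `dmg_example` is made up and is not a result of this analysis.

```r
dmg_example <- 1234567.8   # made-up value for illustration

# format = "f" with digits = 0 rounds to a whole number;
# big.mark = "," adds thousands separators.
formatted <- formatC(dmg_example, format = "f", digits = 0, big.mark = ",")
paste("The expected damage is about", formatted, "US Dollars")
```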
Having considered the issue, if the government were to plan to reduce the damage from these severe weather events, it should consider investing in measures against Flood and Tornado damage, in that order.