The purpose of this analysis is to deepen our understanding of the harmful effects of natural disasters. Those effects range from public health consequences, such as fatalities and injuries, to economic ones, such as property and crop damage. The project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, whose events span from 1950 to November 2011.
We first load the dplyr library for use in the analysis.
library(dplyr)
Next we download and read the data. The read.csv function decompresses and reads the bz2 file directly.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","./stormdata.csv.bz2")
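# stringsAsFactors = TRUE keeps EVTYPE a factor, as in the output below (the default became FALSE in R 4.0.0)
data <- read.csv("./stormdata.csv.bz2", stringsAsFactors = TRUE)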
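Because the compressed file is large, re-running the report downloads it again every time. A minimal sketch of one way to avoid that, assuming a hypothetical helper named `fetch_once` (not part of the original analysis) that skips the download when the file already exists:

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is missing.
# `downloader` defaults to download.file; it is a parameter only so that the
# function can be exercised without network access.
fetch_once <- function(url, dest, downloader = utils::download.file) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed file intact on Windows
    downloader(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Usage, with the same URL and file name as above:
# fetch_once("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#            "./stormdata.csv.bz2")
# data <- read.csv("./stormdata.csv.bz2")
```

The existence check is deliberately simple; it does not detect a partial or corrupted download.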
Using dim we can see the number of observations and variables.
dim(data)
## [1] 902297 37
The main variable for the disaster type is EVTYPE. Because its values are very inconsistent, we clean it by grouping similar events together. To do that, we first explore the data to get an idea of the different categories.
str(data$EVTYPE)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
summary(data$EVTYPE)
##                     HAIL                TSTM WIND        THUNDERSTORM WIND
##                   288661                   219940                    82563
##                  TORNADO              FLASH FLOOD                    FLOOD
##                    60652                    54277                    25326
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING
##                    20843                    20212                    15754
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM
##                    15708                    11723                    11433
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND
##                     7026                     6839                     6175
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND
##                     5812                     3796                     3566
##     URBAN/SML STREAM FLD                 WILDFIRE                 BLIZZARD
##                     3392                     2761                     2719
##                  DROUGHT                ICE STORM           EXCESSIVE HEAT
##                     2488                     2006                     1678
##               HIGH WINDS         WILD/FOREST FIRE             FROST/FREEZE
##                     1533                     1457                     1342
##                DENSE FOG       WINTER WEATHER/MIX           TSTM WIND/HAIL
##                     1293                     1104                     1028
##  EXTREME COLD/WIND CHILL                     HEAT                HIGH SURF
##                     1002                      767                      725
##           TROPICAL STORM           FLASH FLOODING             EXTREME COLD
##                      690                      682                      655
##            COASTAL FLOOD         LAKE-EFFECT SNOW        FLOOD/FLASH FLOOD
##                      650                      636                      624
##                LANDSLIDE                     SNOW          COLD/WIND CHILL
##                      600                      587                      539
##                      FOG              RIP CURRENT              MARINE HAIL
##                      538                      470                      442
##               DUST STORM                AVALANCHE                     WIND
##                      427                      386                      340
##             RIP CURRENTS              STORM SURGE            FREEZING RAIN
##                      304                      261                      250
##              URBAN FLOOD     HEAVY SURF/HIGH SURF        EXTREME WINDCHILL
##                      249                      228                      204
##             STRONG WINDS           DRY MICROBURST    ASTRONOMICAL LOW TIDE
##                      196                      186                      174
##                HURRICANE              RIVER FLOOD               LIGHT SNOW
##                      174                      173                      154
##         STORM SURGE/TIDE            RECORD WARMTH         COASTAL FLOODING
##                      148                      146                      143
##               DUST DEVIL         MARINE HIGH WIND        UNSEASONABLY WARM
##                      141                      135                      126
##                 FLOODING   ASTRONOMICAL HIGH TIDE        MODERATE SNOWFALL
##                      120                      103                      101
##           URBAN FLOODING               WINTRY MIX        HURRICANE/TYPHOON
##                       98                       90                       88
##            FUNNEL CLOUDS               HEAVY SURF              RECORD HEAT
##                       87                       84                       81
##                   FREEZE                HEAT WAVE                     COLD
##                       74                       74                       72
##              RECORD COLD                      ICE  THUNDERSTORM WINDS HAIL
##                       64                       61                       61
##      TROPICAL DEPRESSION                    SLEET         UNSEASONABLY DRY
##                       60                       59                       56
##                    FROST              GUSTY WINDS      THUNDERSTORM WINDSS
##                       53                       53                       51
##       MARINE STRONG WIND                    OTHER               SMALL HAIL
##                       48                       48                       47
##                   FUNNEL             FREEZING FOG             THUNDERSTORM
##                       46                       45                       45
##       Temperature record          TSTM WIND (G45)         Coastal Flooding
##                       43                       39                       38
##              WATERSPOUTS    MONTHLY PRECIPITATION                    WINDS
##                       37                       36                       36
##                  (Other)
##                     2940
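The summary shows labels that differ only in letter case or spacing (for example "COASTAL FLOODING" and "Coastal Flooding", or the leading space in " HIGH SURF ADVISORY"). A minimal sketch of how much such purely cosmetic variation inflates the number of categories, using a made-up toy vector rather than the real column and a hypothetical helper `normalize_label`:

```r
# Toy labels (hypothetical) with the same kinds of inconsistency as EVTYPE
labels <- c(" HIGH SURF ADVISORY", "High Surf Advisory", "Coastal Flooding",
            "COASTAL FLOODING", "TSTM WIND", "TSTM  WIND")

normalize_label <- function(x) {
  x <- tolower(x)            # ignore case differences
  x <- gsub("\\s+", " ", x)  # collapse runs of whitespace
  trimws(x)                  # drop leading and trailing spaces
}

length(unique(labels))                   # distinct raw labels
length(unique(normalize_label(labels)))  # distinct labels after normalizing
```

Normalizing like this only removes cosmetic duplicates; synonyms such as "TSTM WIND" and "THUNDERSTORM WIND" still need keyword-based grouping.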
After exploring the data, we use grepl to match keywords and assign the matching events to the appropriate categories:
evtype <- tolower(data$EVTYPE)
# Rules run in order; once a label is recoded it no longer matches later
# keywords, so the first matching rule wins.
evtype[grepl("wind", evtype)] <- "strong wind"
evtype[grepl("winter|wintry", evtype)] <- "winter weather"
evtype[grepl("snow", evtype)] <- "snow"
evtype[grepl("thunderstorm", evtype)] <- "thunderstorm"
evtype[grepl("flood", evtype)] <- "flooding"
evtype[grepl("current", evtype)] <- "rip current"
evtype[grepl("hurricane", evtype)] <- "hurricane"
evtype[grepl("tornado", evtype)] <- "tornado"
evtype[grepl("heat", evtype)] <- "heat"
evtype[grepl("surf", evtype)] <- "high surf"
# EVTYPE is column 8 of the data frame
data$EVTYPE <- as.factor(evtype)
str(data$EVTYPE)
## Factor w/ 427 levels " lightning"," waterspout",..: 356 356 356 356 356 356 356 356 356 356 ...
There are many more duplicate categories that are hard to fix manually. However, this much cleaning should suffice for our analysis, especially if the results show large margins between categories.
Another thing that needs fixing is the property and crop damage. The data store each amount together with a suffix indicating whether the number is in thousands (K), millions (M), or billions (B) of dollars.
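The keyword grouping above can also be written as an ordered table of rules, which makes the order dependence explicit. This is a sketch on toy labels, with a hypothetical helper `recode_by_keyword` and only a subset of the keywords; it is not the code used to produce the results in this report.

```r
# Ordered keyword rules: name = regular expression, value = category.
# Only a subset of the report's keywords is shown here.
rules <- c("wind"          = "strong wind",
           "winter|wintry" = "winter weather",
           "snow"          = "snow",
           "flood"         = "flooding")

# Hypothetical helper: apply each rule in turn to the lower-cased labels
recode_by_keyword <- function(x, rules) {
  x <- tolower(x)
  for (pattern in names(rules)) {
    x[grepl(pattern, x)] <- rules[[pattern]]
  }
  x
}

toy <- c("TSTM WIND", "Wintry Mix", "HEAVY SNOW", "FLASH FLOOD", "HAIL")
recode_by_keyword(toy, rules)
```

Because the rules run in order, a label such as "thunderstorm wind" is recoded by the "wind" rule before any later rule sees it, which is also how the original sequence of grepl calls behaves.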
# Scale PROPDMG (column 25) and CROPDMG (column 27) by their K/M/B suffix.
# Rows with any other suffix are left unscaled.
grepped <- grepl("K|k", data$PROPDMGEXP)
data$PROPDMG[grepped] <- data$PROPDMG[grepped] * 1000
grepped <- grepl("M|m", data$PROPDMGEXP)
data$PROPDMG[grepped] <- data$PROPDMG[grepped] * 1000000
grepped <- grepl("B|b", data$PROPDMGEXP)
data$PROPDMG[grepped] <- data$PROPDMG[grepped] * 1000000000
grepped <- grepl("K|k", data$CROPDMGEXP)
data$CROPDMG[grepped] <- data$CROPDMG[grepped] * 1000
grepped <- grepl("M|m", data$CROPDMGEXP)
data$CROPDMG[grepped] <- data$CROPDMG[grepped] * 1000000
grepped <- grepl("B|b", data$CROPDMGEXP)
data$CROPDMG[grepped] <- data$CROPDMG[grepped] * 1000000000
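The same scaling can be expressed with a named lookup vector instead of one grepl call per suffix. The sketch below uses made-up toy values and a hypothetical helper `scale_damage`; like the steps above, it leaves amounts with an empty or unrecognized suffix unscaled.

```r
# Toy data (hypothetical values) mimicking PROPDMG and PROPDMGEXP
dmg      <- c(25, 2.5, 1, 10, 3)
exp_code <- c("K", "m", "B", "", "?")

# Multipliers for the three suffixes handled in the report
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: look up the multiplier for each suffix
scale_damage <- function(value, code) {
  m <- multiplier[toupper(as.character(code))]
  m[is.na(m)] <- 1   # unknown or empty suffix: keep the raw number
  value * unname(m)
}

scale_damage(dmg, exp_code)
```

A lookup vector keeps the suffix-to-multiplier mapping in one place, so adding another suffix would be a one-line change.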
First we total the injuries and the fatalities per disaster type, keeping for each measure only the disasters with a nonzero total:
injs <- tapply(data$INJURIES, data$EVTYPE, sum)
bool <- injs != 0
injs <- injs[which(bool)]
fatlts <- tapply(data$FATALITIES, data$EVTYPE, sum)
bool <- fatlts != 0
fatlts <- fatlts[which(bool)]
Then we draw bar plots of the six largest totals, setting the plot attributes to be as descriptive as possible:
injs <- sort(injs, decreasing = TRUE)
fatlts <- sort(fatlts, decreasing = TRUE)
par(mfrow = c(1, 2), las = 2, mar = c(6, 4, 2, 2))
barplot(injs[1:6], col = c("red", "lightblue", "yellow", "grey", "blue", "white"), main = "Injuries by Disaster")
barplot(fatlts[1:6], col = c("red", "lightblue", "yellow", "grey", "blue", "white"), main = "Deaths by Disaster")
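From the above plots we can see that tornadoes are by far the most harmful disaster to public health.
Using the same method as before, we total the property damage and the crop damage per disaster type, again keeping only the disasters with a nonzero total: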
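A claim such as "by far" can be backed with a number by expressing each category as a share of the overall total. The sketch below uses made-up totals, not values from the storm database, purely to illustrate the calculation.

```r
# Hypothetical totals per event type (not taken from the storm database)
totals <- c(tornado = 900, heat = 300, flooding = 200, lightning = 100)

# Share of the overall total per category, largest first
share <- sort(totals / sum(totals), decreasing = TRUE)

round(100 * share, 1)   # percentage of the overall total per category
share[1] / share[2]     # how many times larger the leader is than the runner-up
```

Applied to the real data, `totals` would be the per-event sums computed with tapply above.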
props <- tapply(data$PROPDMG, data$EVTYPE, sum)
bool <- props != 0
props <- props[which(bool)]
crops <- tapply(data$CROPDMG, data$EVTYPE, sum)
bool <- crops != 0
crops <- crops[which(bool)]
Again we draw bar plots of the six largest totals:
props <- sort(props, decreasing = TRUE)
crops <- sort(crops, decreasing = TRUE)
par(mfrow = c(1, 2), las = 2, mar = c(6, 4, 2, 2))
barplot(props[1:6], col = c("red", "lightblue", "yellow", "grey", "blue", "white"), main = "Property Damage by Disaster")
barplot(crops[1:6], col = c("red", "lightblue", "yellow", "grey", "blue", "white"), main = "Crop Damage by Disaster")
From the above plots we can see that flooding causes by far the most property damage, followed by hurricanes, while drought causes the most crop damage, followed by flooding.
As the plots show, tornadoes are by far the most harmful disaster to public health. For property damage, the most harmful disasters are flooding followed by hurricanes; for crops, they are drought followed by flooding.
This report should indicate where resources for both prevention of and protection from these natural disasters should be allocated.
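The aggregate, filter, and sort steps are repeated for injuries, fatalities, property damage, and crop damage. As a sketch, they could be factored into a single function; `top_totals` is a hypothetical name, and the toy vectors below stand in for columns such as data$INJURIES and data$EVTYPE.

```r
# Hypothetical helper wrapping the tapply / filter / sort steps used above
top_totals <- function(values, groups, n = 6) {
  totals <- tapply(values, groups, sum)
  totals <- totals[!is.na(totals) & totals != 0]  # drop empty or all-zero groups
  head(sort(totals, decreasing = TRUE), n)
}

# Toy data (made-up numbers)
toy_values <- c(5, 0, 12, 3, 0, 7)
toy_groups <- c("tornado", "hail", "tornado", "heat", "hail", "flooding")
top_totals(toy_values, toy_groups, n = 2)

# Possible use with the real data:
# barplot(top_totals(data$INJURIES, data$EVTYPE), main = "Injuries by Disaster")
```

With such a helper, each of the four bar plots would need only one call, and the number of bars would be controlled by `n` instead of the hard-coded `[1:6]`.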