Weather events range from mild sunny days to extreme hazards such as tornadoes. Because weather is complex [1] (its outcome depends on many variables in a non-linear fashion), it is very difficult to predict when natural disasters will happen and how intense they will be. Institutions like the National Oceanic and Atmospheric Administration (NOAA) collect data on past weather-related events and annotate their consequences for society in order to analyze the impact each type of event may cause. The objective of this study is to analyze NOAA's storm report and identify which events are worth investing in protection against. The two areas analyzed were danger to population health (injuries and deaths) and economic impact.
The initial data set was acquired from the course website (Reproducible Research), at this link. Together with the data set, I obtained the data set manual. To start processing the data, I set the working directory to the folder containing the initial file.
setwd("C:\\Users\\Caio\\Dropbox\\Coursera\\Reproducible Research\\Project 2")
and then read the file into a “data” variable
data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
Because of the size of the data, this operation can take a long time. On my computer (Core i5, 4 GB of RAM), this command took about 3 minutes and 20 seconds to execute.
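Since parsing the compressed CSV is the slow step, one option (my own addition, not part of the original analysis) is to cache the parsed data frame with saveRDS and reload it on later runs; the file name stormData.rds is an arbitrary choice:
# optional caching step: parse the CSV once, reuse the saved object afterwards
if (file.exists("stormData.rds")) {
    data <- readRDS("stormData.rds")
} else {
    data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
    saveRDS(data, "stormData.rds")
}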
To accomplish the objective, I analyzed the data set manual and decided that the important variables were EVTYPE (event type), FATALITIES, INJURIES, PROPDMG (property damage amount) and PROPDMGEXP (property damage magnitude).
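For a quick look at just these columns (a check I am adding here, not part of the original write-up):
str(data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP")])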
Visually inspecting the data set, I could observe a lot of noise and corrupted data, such as non-standardized event names and category values that do not exist according to NOAA's manual.
unique(data$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(data[grepl("[Tt][Oo][Rr][Nn][Aa]", data$EVTYPE), ]$EVTYPE)
## [1] TORNADO TORNADO F0
## [3] TORNADOS WATERSPOUT/TORNADO
## [5] WATERSPOUT TORNADO WATERSPOUT-TORNADO
## [7] TORNADOES, TSTM WIND, HAIL COLD AIR TORNADO
## [9] WATERSPOUT/ TORNADO TORNADO F3
## [11] TORNADO F1 TORNADO/WATERSPOUT
## [13] TORNADO F2 TORNADOES
## [15] TORNADO DEBRIS
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
The output above shows examples of non-standard data, illustrating two types of inconsistency: invalid PROPDMGEXP magnitude markers and multiple spellings of the same event type.
In an attempt to resolve, or at least attenuate, this problem, I wrote a function that deals with these inconsistencies and transforms them into a standard form (e.g. Tornado/Tornados/Tornadoes/Tornado F1 would all be converted to Tornado). I did this using regular expressions (regex) to detect variations from the standard form. The set of regexes was decided in a discussion on the class forum. The function receives a data set from this assignment, replaces the identified event-name variations with a standard version, and returns the processed data frame.
preProcess <- function(data) {
    library(data.table)
    evtype <- data.table(unique(toupper(data$EVTYPE)))
    # replace '\' or '/ ' with '/'
    evtype$V2 <- gsub("\\\\|(\\/\\s)", "/", evtype$V1)
    # remove leading and trailing spaces
    evtype$V2 <- gsub("^\\s+|\\s+$", "", evtype$V2)
    # replace 'and' or '&' with '/'
    evtype$V2 <- gsub("(\\sAND(\\s|$))|(\\s&\\s)", "/", evtype$V2)
    # remove parentheses around numeric magnitudes
    evtype$V2 <- sub("([[:alpha:]]*)(\\(|\\s){0,1}(([[:alpha:]]{0,1})[[:digit:]]{1,}(\\.[[:digit:]]{1,}){0,1})(\\){0,1})",
                     "\\1 \\3", evtype$V2, perl = TRUE)
    # remove trailing '/'
    evtype$V2 <- gsub("\\/$", "", evtype$V2)
    # collapse multiple spaces
    evtype$V2 <- gsub("\\s{2,}", " ", evtype$V2)
    # replace misspellings and abbreviations with the standard forms
    evtype$V2 <- sub("\\?", "UNKNOWN", evtype$V2)
    evtype$V2 <- sub("AVALANCE", "AVALANCHE", evtype$V2)
    evtype$V2 <- sub("WINTRY", "WINTER", evtype$V2)
    evtype$V2 <- sub("WINTERY", "WINTER", evtype$V2)
    evtype$V2 <- sub("WND", "WIND", evtype$V2)
    evtype$V2 <- sub("WI$|WI[^NDL]|WIN[^D]", "WIND", evtype$V2)
    evtype$V2 <- sub("\\sAND[[:alpha:]]{1,}", "/", evtype$V2)
    evtype$V2 <- gsub("\\s{0,1}-\\s", "/", evtype$V2)
    evtype$V2 <- sub("\\sSNO($|\\s)", " SNOW", evtype$V2)
    evtype$V2 <- sub("\\sCHIL($|\\s)", " CHILL", evtype$V2)
    evtype$V2 <- sub("\\sCHI([^[[:alpha:]]]{0,}|$)", " CHILL", evtype$V2)
    evtype$V2 <- sub("\\sCH([^[[:alpha:]]]{0,}|$)", " CHILL", evtype$V2)
    evtype$V2 <- sub("CSTL", "COASTAL", evtype$V2)
    evtype$V2 <- sub("FLD", "FLOOD", evtype$V2)
    evtype$V2 <- sub("FLOODING", "FLOOD", evtype$V2)
    evtype$V2 <- sub("FLOODIN", "FLOOD", evtype$V2)
    evtype$V2 <- sub("STRM", "STREAM", evtype$V2)
    evtype$V2 <- sub("MICOBURST", "MICROBURST", evtype$V2)
    evtype$V2 <- sub("WAYTERSPOUT", "WATERSPOUT", evtype$V2)
    evtype$V2 <- sub("WATERSPOUTS", "WATERSPOUT", evtype$V2)
    evtype$V2 <- sub("VOG", "FOG", evtype$V2)
    evtype$V2 <- sub("RIP CURRENTS", "RIP CURRENT", evtype$V2)
    evtype$V2 <- sub("FREEZE", "FREEZING", evtype$V2)
    evtype$V2 <- sub("WAUSEON", "", evtype$V2)  # WAUSEON is a city name
    evtype$V2 <- sub("EROSIN", "EROSION", evtype$V2)
    evtype$V2 <- sub("PRECIPATATION", "PRECIPITATION", evtype$V2)
    evtype$V2 <- sub("TORNADOES", "TORNADOS", evtype$V2)
    evtype$V2 <- sub("TORNDAO", "TORNADOS", evtype$V2)
    evtype$V2 <- sub("NON(-|\\s)[[:alpha:]]{1,}\\s(.*)", "\\2", evtype$V2)
    evtype$V2 <- sub("TSTM", "THUNDERSTORM", evtype$V2)
    evtype$V2 <- sub("TUNDERSTORM", "THUNDERSTORM", evtype$V2)
    evtype$V2 <- sub("CLOU$", "CLOUD", evtype$V2)
    evtype$V2 <- sub("DRIE", "DRY", evtype$V2)
    evtype$V3 <- evtype$V1 == evtype$V2  # TRUE if the name is already standard
    evtype <- evtype[order(V3, V2), ]
    evtype <- evtype[evtype$V3 == FALSE, ]  # keep only the names that need correction
    data$EVTYPE <- as.character(data$EVTYPE)
    for (i in 1:nrow(evtype)) {
        # column 8 is EVTYPE: replace each variation with its standard version
        data[toupper(data$EVTYPE) == evtype$V1[i], 8] <- tolower(evtype$V2[i])
    }
    data$EVTYPE <- tolower(data$EVTYPE)
    invisible(data)
}
This function was applied to two data frames, both subsets of the original 'data' data set: dataQ1 and dataQ2, used to answer question 1 (harm to population health caused by each event) and question 2 (economic impact caused by each event), respectively.
dataQ1 keeps only the rows where an event caused at least one fatality or one injury, while dataQ2 keeps the rows where the PROPDMGEXP variable has a valid magnitude marker (K, M or B, in either case) and the event caused some property damage.
dataQ1 <- data[data$FATALITIES != 0 | data$INJURIES != 0, ]
dataQ1 <- preProcess(dataQ1)
dataQ2 <- data[data$PROPDMGEXP == "K" | data$PROPDMGEXP == "k" | data$PROPDMGEXP ==
"M" | data$PROPDMGEXP == "m" | data$PROPDMGEXP == "B" | data$PROPDMGEXP ==
"b" & data$PROPDMG != 0, ]
dataQ2 <- preProcess(dataQ2)
remove("data") # clear space in memory
In the last line of the code above, since the subsets for questions 1 and 2 had already been created, the original data set was no longer needed, so I deleted it to free memory.
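As a spot check that the standardization worked (a verification I am adding, not part of the original analysis), the tornado-related labels remaining in the cleaned subset can be listed; they should be noticeably more uniform than the raw variants printed earlier:
# EVTYPE is lower-case after preProcess, so match on a lower-case fragment
unique(dataQ1[grepl("torna", dataQ1$EVTYPE), "EVTYPE"])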
To discover which events can be considered more dangerous than others, I split the dataQ1 data set by event type and calculated the total number of deaths and injuries for each event.
# split by event type
x <- split(dataQ1, as.factor(dataQ1$EVTYPE))
# sum death caused by each event type
sumDeath <- lapply(x, function(x) sum(x$FATALITIES))
# sort it in decreasing order
sumDeath <- sort(unlist(sumDeath), decreasing = TRUE)
# sum injuries caused by each event type
sumInj <- lapply(x, function(x) sum(x$INJURIES))
# sort it in decreasing order
sumInj <- sort(unlist(sumInj), decreasing = TRUE)
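The same totals could also be computed more compactly with tapply (an equivalent alternative, not the method used above):
# one-liners: total fatalities and injuries per event type, sorted
sumDeath2 <- sort(tapply(dataQ1$FATALITIES, dataQ1$EVTYPE, sum), decreasing = TRUE)
sumInj2 <- sort(tapply(dataQ1$INJURIES, dataQ1$EVTYPE, sum), decreasing = TRUE)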
By analyzing the 20 events with the most fatalities and the 10 events that injure the most people, I cover roughly 90% of these cases, a reasonable share when looking at a database this large.
sum(sumDeath[1:20])/sum(sumDeath)
## [1] 0.9071
sum(sumInj[1:10])/sum(sumInj)
## [1] 0.9029
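The cut-offs of 20 and 10 events could also be derived programmatically instead of by inspection (a small generalization I am adding):
# smallest number of top events needed to cover 90% of each total
which(cumsum(sumDeath)/sum(sumDeath) >= 0.9)[1]
which(cumsum(sumInj)/sum(sumInj) >= 0.9)[1]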
par(las = 3, mfrow = c(1, 2), mar = c(10, 5, 5.1, 2.1))
barplot(sumInj[1:10], main = "90% of Weather-Related Injuries", axes = FALSE,
col = "red", ylab = "Injuries")
axis(2, las = 1)
barplot(sumDeath[1:20], main = "90% of Weather-Related Fatalities", axes = FALSE,
col = "red", ylab = "Fatalities")
axis(2, las = 1)
Figure 1 - Barplot of the events that account for 90% of all deaths and injuries since the creation of NOAA's report. Note the tornado bar.
To analyze the economic impact of these events, I split the dataQ2 data set by order of magnitude. I did this because I assume damages on the order of billions of dollars cannot be outweighed by damages of a lower order of magnitude, unless an event appears far more often at the million-dollar level.
data3B <- dataQ2[tolower(dataQ2$PROPDMGEXP) == "b", c(8, 25, 26)]
data3B <- data3B[order(data3B$PROPDMG, decreasing = TRUE), ]
To guard against that possibility (an event appearing so often at the million level that it overtakes the billion level), I selected the events with an "M" (million) magnitude marker, converted their damages to billions by dividing by 1000, and then appended this newly converted data set to the existing magnitude-"B" data set.
# columns 8,25 and 26 are, respectively, event name, propDamage and
# propDamageExp
data3M <- dataQ2[tolower(dataQ2$PROPDMGEXP) == "m", c(8, 25, 26)]
# sort data from biggest to minimum
data3M <- data3M[order(data3M$PROPDMG, decreasing = TRUE), ]
toBillionData3M <- data3M
toBillionData3M$PROPDMG <- toBillionData3M$PROPDMG/1000
toBillionData3M$PROPDMGEXP <- "B"
data3B <- rbind(data3B, toBillionData3M)
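An alternative design, sketched here for comparison only (the names mult and dmgBillions are my own), would be to convert every damage figure to dollars up front with a lookup table instead of splitting by magnitude order:
# map each magnitude marker to a multiplier and express all damage in billions
mult <- c(k = 1e3, m = 1e6, b = 1e9)
dmgBillions <- dataQ2$PROPDMG * mult[tolower(as.character(dataQ2$PROPDMGEXP))] / 1e9
# total loss per event type, in billions of dollars
sort(tapply(dmgBillions, dataQ2$EVTYPE, sum), decreasing = TRUE)[1:10]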
Finally, I summed the economic cost of each event separately and sorted the results in decreasing order to find the biggest economic losses. Again, the 10 weather events with the biggest losses account for about 90% of all money lost since the start of NOAA's report.
BSplit <- split(data3B, data3B$EVTYPE)
sumDMG <- lapply(BSplit, function(x) sum(x$PROPDMG))
bSumDMG <- sort(unlist(sumDMG), decreasing = TRUE)
par(las = 3, mar = c(10, 4.1, 5.1, 2.1))
barplot(bSumDMG[1:10], main = "90% Weather-Related DMG in Billions of Dollars",
    axes = FALSE, col = "blue", ylab = "Loss in Billions of Dollars")
axis(2, las = 1)
Figure 2 - Weather-related events that account for 90% of all economic losses since the creation of NOAA's report, with the loss shown in billions of dollars.
From the plots for question 1, we can say, without any doubt, that we have to establish prevention measures against tornadoes. Since the start of NOAA's report, this weather event has been linked to more than 80,000 injuries and more than 5,000 deaths in the USA alone, and, by itself, it is responsible for more deaths and injuries than all the other events together. According to NOAA [2], these statistics could be lowered by taking preventive action.
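Those figures can be read directly off the totals computed earlier (a quick check I am adding; "tornado" is the standardized lower-case label produced by preProcess):
sumDeath["tornado"]
sumInj["tornado"]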
Secondary precautions should target extreme temperatures. The plot shows that temperature-related deaths, whether from extreme heat or extreme cold, are together the second biggest weather-related killer in the USA. Building shelters and raising public awareness of the risks of exposure to such temperatures could help soften this situation.
Analyzing the plot for question 2, we see that the weather events that cause the biggest losses are floods, followed by hurricanes and tornadoes. In light of these results, a better approach to this problem would be to improve cities' water drainage systems. Because of pavement, most of the water cannot run off to other regions; it concentrates in one spot and, consequently, floods its surroundings. In addition, creating more green zones (with more trees) can help drain and absorb the excess water, while also beautifying the city.