Weather events range from mild sunny days to extreme hazards such as tornadoes. Because weather is complex [1] (its outcome depends on many variables in a non-linear fashion), it is very difficult to predict when natural disasters will happen and how intense they will be. Institutions like the National Oceanic and Atmospheric Administration (NOAA) collect data on past weather-related events and annotate their consequences for society in order to analyze the impact each type of event may cause. The objective of this study is to analyze NOAA's storm report and identify which events are worth investing in protection against. The two areas analyzed were danger to population health (injuries and deaths) and economic impact.
The initial data set was acquired from the course website (Reproducible Research), at this link. Together with the data set, I obtained the data set manual. To start processing the data, I set the working directory to the folder containing the initial file.
setwd("C:\\Users\\Caio\\Dropbox\\Coursera\\Reproducible Research\\Project 2")
and then read the file into a “data” variable
data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
Because of the size of the data, this operation can take a long time. On my computer (Core i5, 4 GB of RAM), this command took about 3 minutes and 20 seconds to execute.
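Since parsing the compressed CSV is the slow step, one option (my own addition, not part of the original analysis) is to cache the parsed data frame with saveRDS and reload it on later runs; the file name stormData.rds is an arbitrary choice:
# optional caching step: parse the CSV once, reuse the saved object afterwards
if (file.exists("stormData.rds")) {
    data <- readRDS("stormData.rds")
} else {
    data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
    saveRDS(data, "stormData.rds")
}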
To accomplish the objective, I analyzed the data set manual and decided that the important variables were EVTYPE (event type), FATALITIES, INJURIES, PROPDMG (property damage amount) and PROPDMGEXP (property damage magnitude).
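For a quick look at just these columns (a check I am adding here, not part of the original write-up):
str(data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP")])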
Visually inspecting the data set, I could observe a lot of noise and corrupted data, such as non-standardized event names and category values that do not exist according to NOAA's manual.
unique(data$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(data[grepl("[Tt][Oo][Rr][Nn][Aa]", data$EVTYPE), ]$EVTYPE)
## [1] TORNADO TORNADO F0
## [3] TORNADOS WATERSPOUT/TORNADO
## [5] WATERSPOUT TORNADO WATERSPOUT-TORNADO
## [7] TORNADOES, TSTM WIND, HAIL COLD AIR TORNADO
## [9] WATERSPOUT/ TORNADO TORNADO F3
## [11] TORNADO F1 TORNADO/WATERSPOUT
## [13] TORNADO F2 TORNADOES
## [15] TORNADO DEBRIS
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
The output above shows examples of non-standard data, illustrating two types of inconsistency: invalid PROPDMGEXP magnitude markers and multiple spellings of the same event type.
In an attempt to resolve, or at least attenuate, this problem, I wrote a function that deals with these inconsistencies and transforms them into a standard form (e.g. Tornado/Tornados/Tornadoes/Tornado F1 would all be converted to Tornado). I did this using regular expressions (regex) to detect variations from the standard form. The set of regexes was decided in a discussion on the class forum. The function receives a data set from this assignment, replaces the identified event-name variations with a standard version, and returns the processed data frame.
preProcess <- function(data) {
    library(data.table)
    evtype <- data.table(unique(toupper(data$EVTYPE)))
    # replace '\' or '/ ' with '/'
    evtype$V2 <- gsub("\\\\|(\\/\\s)", "/", evtype$V1)
    # remove leading and trailing spaces
    evtype$V2 <- gsub("^\\s+|\\s+$", "", evtype$V2)
    # replace 'and' or '&' with '/'
    evtype$V2 <- gsub("(\\sAND(\\s|$))|(\\s&\\s)", "/", evtype$V2)
    # remove parentheses around numeric magnitudes
    evtype$V2 <- sub("([[:alpha:]]*)(\\(|\\s){0,1}(([[:alpha:]]{0,1})[[:digit:]]{1,}(\\.[[:digit:]]{1,}){0,1})(\\){0,1})",
                     "\\1 \\3", evtype$V2, perl = TRUE)
    # remove trailing '/'
    evtype$V2 <- gsub("\\/$", "", evtype$V2)
    # collapse multiple spaces
    evtype$V2 <- gsub("\\s{2,}", " ", evtype$V2)
    # replace misspellings and abbreviations with the standard forms
    evtype$V2 <- sub("\\?", "UNKNOWN", evtype$V2)
    evtype$V2 <- sub("AVALANCE", "AVALANCHE", evtype$V2)
    evtype$V2 <- sub("WINTRY", "WINTER", evtype$V2)
    evtype$V2 <- sub("WINTERY", "WINTER", evtype$V2)
    evtype$V2 <- sub("WND", "WIND", evtype$V2)
    evtype$V2 <- sub("WI$|WI[^NDL]|WIN[^D]", "WIND", evtype$V2)
    evtype$V2 <- sub("\\sAND[[:alpha:]]{1,}", "/", evtype$V2)
    evtype$V2 <- gsub("\\s{0,1}-\\s", "/", evtype$V2)
    evtype$V2 <- sub("\\sSNO($|\\s)", " SNOW", evtype$V2)
    evtype$V2 <- sub("\\sCHIL($|\\s)", " CHILL", evtype$V2)
    evtype$V2 <- sub("\\sCHI([^[[:alpha:]]]{0,}|$)", " CHILL", evtype$V2)
    evtype$V2 <- sub("\\sCH([^[[:alpha:]]]{0,}|$)", " CHILL", evtype$V2)
    evtype$V2 <- sub("CSTL", "COASTAL", evtype$V2)
    evtype$V2 <- sub("FLD", "FLOOD", evtype$V2)
    evtype$V2 <- sub("FLOODING", "FLOOD", evtype$V2)
    evtype$V2 <- sub("FLOODIN", "FLOOD", evtype$V2)
    evtype$V2 <- sub("STRM", "STREAM", evtype$V2)
    evtype$V2 <- sub("MICOBURST", "MICROBURST", evtype$V2)
    evtype$V2 <- sub("WAYTERSPOUT", "WATERSPOUT", evtype$V2)
    evtype$V2 <- sub("WATERSPOUTS", "WATERSPOUT", evtype$V2)
    evtype$V2 <- sub("VOG", "FOG", evtype$V2)
    evtype$V2 <- sub("RIP CURRENTS", "RIP CURRENT", evtype$V2)
    evtype$V2 <- sub("FREEZE", "FREEZING", evtype$V2)
    evtype$V2 <- sub("WAUSEON", "", evtype$V2)  # WAUSEON is a city name
    evtype$V2 <- sub("EROSIN", "EROSION", evtype$V2)
    evtype$V2 <- sub("PRECIPATATION", "PRECIPITATION", evtype$V2)
    evtype$V2 <- sub("TORNADOES", "TORNADOS", evtype$V2)
    evtype$V2 <- sub("TORNDAO", "TORNADOS", evtype$V2)
    evtype$V2 <- sub("NON(-|\\s)[[:alpha:]]{1,}\\s(.*)", "\\2", evtype$V2)
    evtype$V2 <- sub("TSTM", "THUNDERSTORM", evtype$V2)
    evtype$V2 <- sub("TUNDERSTORM", "THUNDERSTORM", evtype$V2)
    evtype$V2 <- sub("CLOU$", "CLOUD", evtype$V2)
    evtype$V2 <- sub("DRIE", "DRY", evtype$V2)
    evtype$V3 <- evtype$V1 == evtype$V2  # TRUE if the name is already standard
    evtype <- evtype[order(V3, V2), ]
    evtype <- evtype[evtype$V3 == FALSE, ]  # keep only the names that need correction
    data$EVTYPE <- as.character(data$EVTYPE)
    for (i in 1:nrow(evtype)) {
        # column 8 is EVTYPE: replace each variation with its standard version
        data[toupper(data$EVTYPE) == evtype$V1[i], 8] <- tolower(evtype$V2[i])
    }
    data$EVTYPE <- tolower(data$EVTYPE)
    invisible(data)
}
This function was applied to two data frames, both subsets of the original 'data' data set: dataQ1 and dataQ2, used to answer question 1 (harm to population health caused by each event) and question 2 (economic impact caused by each event), respectively.
dataQ1 keeps only the rows where an event caused at least one fatality or one injury, while dataQ2 keeps the rows where the PROPDMGEXP variable has a valid magnitude marker (K, M or B, in either case) and the event caused some property damage.
dataQ1 <- data[data$FATALITIES != 0 | data$INJURIES != 0, ]
dataQ1 <- preProcess(dataQ1)
dataQ2 <- data[data$PROPDMGEXP == "K" | data$PROPDMGEXP == "k" | data$PROPDMGEXP ==
"M" | data$PROPDMGEXP == "m" | data$PROPDMGEXP == "B" | data$PROPDMGEXP ==
"b" & data$PROPDMG != 0, ]
dataQ2 <- preProcess(dataQ2)
remove("data") # clear space in memory
In the last line of the code above, since the subsets for questions 1 and 2 had already been created, the original data set was no longer needed, so I deleted it to free memory.
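As a spot check that the standardization worked (a verification I am adding, not part of the original analysis), the tornado-related labels remaining in the cleaned subset can be listed; they should be noticeably more uniform than the raw variants printed earlier:
# EVTYPE is lower-case after preProcess, so match on a lower-case fragment
unique(dataQ1[grepl("torna", dataQ1$EVTYPE), "EVTYPE"])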
To discover which events can be considered more dangerous than others, I split the dataQ1 data set by event type and calculated the total number of deaths and injuries for each event.
# split by event type
x <- split(dataQ1, as.factor(dataQ1$EVTYPE))
# sum death caused by each event type
sumDeath <- lapply(x, function(x) sum(x$FATALITIES))
# sort it in decreasing order
sumDeath <- sort(unlist(sumDeath), decreasing = TRUE)
# sum injuries caused by each event type
sumInj <- lapply(x, function(x) sum(x$INJURIES))
# sort it in decreasing order
sumInj <- sort(unlist(sumInj), decreasing = TRUE)
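The same totals could also be computed more compactly with tapply (an equivalent alternative, not the method used above):
# one-liners: total fatalities and injuries per event type, sorted
sumDeath2 <- sort(tapply(dataQ1$FATALITIES, dataQ1$EVTYPE, sum), decreasing = TRUE)
sumInj2 <- sort(tapply(dataQ1$INJURIES, dataQ1$EVTYPE, sum), decreasing = TRUE)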
By analyzing the 20 events with the most fatalities and the 10 events that injure the most people, I cover roughly 90% of these cases, a reasonable share when looking at a database this large.
sum(sumDeath[1:20])/sum(sumDeath)
## [1] 0.9071
sum(sumInj[1:10])/sum(sumInj)
## [1] 0.9029
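The cut-offs of 20 and 10 events could also be derived programmatically instead of by inspection (a small generalization I am adding):
# smallest number of top events needed to cover 90% of each total
which(cumsum(sumDeath)/sum(sumDeath) >= 0.9)[1]
which(cumsum(sumInj)/sum(sumInj) >= 0.9)[1]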
par(las = 3, mfrow = c(1, 2), mar = c(10, 5, 5.1, 2.1))
barplot(sumInj[1:10], main = "90% of Weather-Related Injuries", axes = FALSE,
col = "red", ylab = "Injuries")
axis(2, las = 1)
barplot(sumDeath[1:20], main = "90% of Weather-Related Fatalities", axes = FALSE,
col = "red", ylab = "Fatalities")
axis(2, las = 1)
Figure 1 - Barplot of the events that account for 90% of all deaths and injuries since the creation of NOAA's report. Note the tornado bar.
To analyze the economic impact of these events, I split the dataQ2 data set by order of magnitude. I did this because I assume damages on the order of billions of dollars cannot be outweighed by damages of a lower order of magnitude, unless an event appears far more often at the million-dollar level.
data3B <- dataQ2[tolower(dataQ2$PROPDMGEXP) == "b", c(8, 25, 26)]
data3B <- data3B[order(data3B$PROPDMG, decreasing = TRUE), ]
To guard against that possibility (an event appearing so often at the million level that it overtakes the billion level), I selected the events with an "M" (million) magnitude marker, converted their damages to billions by dividing by 1000, and then appended this newly converted data set to the existing magnitude-"B" data set.
# columns 8,25 and 26 are, respectively, event name, propDamage and
# propDamageExp
data3M <- dataQ2[tolower(dataQ2$PROPDMGEXP) == "m", c(8, 25, 26)]
# sort data from biggest to minimum
data3M <- data3M[order(data3M$PROPDMG, decreasing = TRUE), ]
toBillionData3M <- data3M
toBillionData3M$PROPDMG <- toBillionData3M$PROPDMG/1000
toBillionData3M$PROPDMGEXP <- "B"
data3B <- rbind(data3B, toBillionData3M)
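An alternative design, sketched here for comparison only (the names mult and dmgBillions are my own), would be to convert every damage figure to dollars up front with a lookup table instead of splitting by magnitude order:
# map each magnitude marker to a multiplier and express all damage in billions
mult <- c(k = 1e3, m = 1e6, b = 1e9)
dmgBillions <- dataQ2$PROPDMG * mult[tolower(as.character(dataQ2$PROPDMGEXP))] / 1e9
# total loss per event type, in billions of dollars
sort(tapply(dmgBillions, dataQ2$EVTYPE, sum), decreasing = TRUE)[1:10]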
Finally, I summed the economic cost of each event separately and sorted the results in decreasing order to find the biggest economic losses. Again, the 10 weather events with the biggest losses account for about 90% of all money lost since the start of NOAA's report.
BSplit <- split(data3B, data3B$EVTYPE)
sumDMG <- lapply(BSplit, function(x) sum(x$PROPDMG))
bSumDMG <- sort(unlist(sumDMG), decreasing = TRUE)
par(las = 3, mar = c(10, 4.1, 5.1, 2.1))
barplot(bSumDMG[1:10], main = "90% Weather-Related DMG in Billions of Dollars",
    axes = FALSE, col = "blue", ylab = "Loss in Billions of Dollars")
axis(2, las = 1)
Figure 2 - Weather-related events that account for 90% of all economic losses since the creation of NOAA's report, with the loss shown in billions of dollars.
From the plots for question 1, we can say, without any doubt, that we have to establish prevention measures against tornadoes. Since the start of NOAA's report, this weather event has been linked to more than 80,000 injuries and more than 5,000 deaths in the USA alone, and, by itself, it is responsible for more deaths and injuries than all the other events together. According to NOAA [2], these statistics could be lowered by taking preventive action.
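Those figures can be read directly off the totals computed earlier (a quick check I am adding; "tornado" is the standardized lower-case label produced by preProcess):
sumDeath["tornado"]
sumInj["tornado"]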
Secondary precautions should target extreme temperatures. The plot shows that temperature-related deaths, whether from extreme heat or extreme cold, are together the second biggest weather-related killer in the USA. Building shelters and raising public awareness of the risks of exposure to such temperatures could help soften this situation.
Analyzing the plot for question 2, we see that the weather events that cause the biggest losses are floods, followed by hurricanes and tornadoes. In light of these results, a better approach to this problem would be to improve cities' water drainage systems. Because of pavement, most of the water cannot run off to other regions; it concentrates in one spot and, consequently, floods its surroundings. In addition, creating more green zones (with more trees) can help drain and absorb the excess water, while also beautifying the city.