The negative effects on health and the economic costs of weather events in the U.S.

Synopsis

This document reports on an analysis of the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. The main objective of the analysis is to identify the weather events that have the most severe effects on human health (injuries and fatalities) and the highest economic costs across the United States.

It is important to note that the NOAA storm database is of very poor quality, and if this were a real-life assignment I would expect whoever prepared such a data table to be held accountable. The naming conventions for weather events (the main focus of the database) are practically non-existent, and the event type field is full of typos and inconsistencies. A complete cleanup of the database would be highly desirable, but it would take far more time than I am willing to spend, and it is also outside the scope of this assignment.
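
Although a full cleanup is out of scope, a light normalization of the event type labels could be sketched as follows. The helper name normalize_evtype and the toy labels are hypothetical, invented for illustration; the sketch only removes formatting noise (case, stray whitespace) and does not attempt to merge true synonyms or fix typos.

```r
# Hypothetical helper: light normalization of event type labels.
# It only removes formatting noise; it does not merge synonyms or fix typos.
normalize_evtype <- function(x) {
  x <- toupper(as.character(x))      # unify case
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  x <- trimws(x)                     # drop leading/trailing blanks
  x
}

# Toy labels invented for illustration, not taken from the NOAA file
toy_labels <- c("Tornado", " TORNADO", "tornado  ", "Flash  Flood", "FLASH FLOOD")
cleaned <- normalize_evtype(toy_labels)
unique(cleaned)
```

Applied to the real EVTYPE column before splitting, a step like this would probably reduce the number of distinct event types, though by how much is not something this sketch can tell.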

Data Processing

The data is stored as a comma-separated table. First it needed to be uncompressed from the bzip2 (.bz2) format. Although the file is rather large (>500 MB), the data.table library cannot be used, due to a number of inconsistencies in the file. So read.csv it is, even if it takes several minutes to load.

Loading in data

data <- read.csv("repdata-data-StormData.csv", header = TRUE, sep = ",")  # comma-separated file with a header row
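
As a possible alternative to uncompressing the file by hand, read.csv can read from a bzfile() connection. The sketch below builds a tiny compressed CSV in a temporary file so that it is self-contained; the column names mirror the NOAA file, but the values and the temporary path are made up for illustration.

```r
# Build a tiny bzip2-compressed CSV so the example is self-contained.
# Column names mirror the NOAA file; the values are invented.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  INJURIES = c(3, 0),
                  stringsAsFactors = FALSE)
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv accepts a connection, so bzfile() decompresses on the fly
loaded <- read.csv(bzfile(path), header = TRUE, stringsAsFactors = FALSE)
str(loaded)
```

Whether this is faster than decompressing first was not measured here; it mainly saves a manual step.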

We only need 5 columns (event type, injuries, fatalities, property damage and crop damage), so to speed things up I saved only these columns to a new data frame.

Subsetting the relevant columns

# Keep only the five relevant columns; the order matters because
# the summing loop further down indexes columns by position
data_relevant <- data[, c("INJURIES", "FATALITIES", "PROPDMG", "CROPDMG", "EVTYPE")]

This new data frame, containing only the relevant columns, was split by event type.

Splitting by event types

splitted_data <- split(data_relevant, data_relevant$EVTYPE)
sums <- data.frame(names(splitted_data))

I've calculated the sum of each column to obtain the accumulated number of injuries and fatalities, and the accumulated property and crop damage, for each event type.

Calculating the sums of injuries, fatalities, property damage and crop damage

# Accumulators, one entry per event type
sum_of_injuries <- NULL
sum_of_fatalities <- NULL
sum_of_property_damage <- NULL
sum_of_crop_damage <- NULL

for (i in names(splitted_data)) {
    temp <- splitted_data[[i]]  # rows belonging to one event type
    # Columns by position: 1 = INJURIES, 2 = FATALITIES, 3 = PROPDMG, 4 = CROPDMG
    sum_of_injuries <- c(sum_of_injuries, sum(temp[, 1]))
    sum_of_fatalities <- c(sum_of_fatalities, sum(temp[, 2]))
    sum_of_property_damage <- c(sum_of_property_damage, sum(temp[, 3]))
    sum_of_crop_damage <- c(sum_of_crop_damage, sum(temp[, 4]))
}

colnames(sums) <- "EVTYPE"
sums <- cbind(sums, sum_of_injuries, sum_of_fatalities, sum_of_property_damage, 
    sum_of_crop_damage)
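
The split-and-loop approach above can also be expressed in a single call to aggregate(). The following is a minimal sketch on a toy data frame that uses the same column names as data_relevant; the values are invented. One difference to keep in mind: the formula interface of aggregate() drops rows with missing values by default, whereas sum() in the loop would return NA for them.

```r
# Toy data with the same column names as data_relevant; values are invented
toy <- data.frame(
  INJURIES   = c(2, 0, 5, 1),
  FATALITIES = c(0, 1, 2, 0),
  PROPDMG    = c(10, 2.5, 40, 0),
  CROPDMG    = c(0, 5, 0, 1),
  EVTYPE     = c("TORNADO", "HAIL", "TORNADO", "HAIL"),
  stringsAsFactors = FALSE
)

# Sum every numeric column within each event type in one call
toy_sums <- aggregate(cbind(INJURIES, FATALITIES, PROPDMG, CROPDMG) ~ EVTYPE,
                      data = toy, FUN = sum)
toy_sums
```

The result has one row per event type and keeps the original column names, so it could be ordered and plotted in the same way as the sums data frame.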

Results

I've plotted the top 10 event types for the four concerns: injuries, fatalities, property damage and crop damage.

Top causes of injuries and fatalities

par(mfrow = c(1, 2))
ordered_sums <- sums[order(sums$sum_of_injuries, decreasing = TRUE), ]
par(mar = c(15, 5, 5, 3), mgp = c(4, 1, 0))
plot(ordered_sums[1:10, ]$sum_of_injuries, bty = "n", xaxt = "n", xlab = "", 
    ylab = "Accumulated number of occurrences", pch = 21, bg = "blue", cex = 1.5, 
    main = "Top 10 causes of injuries")
axis(1, at = (1:10), labels = ordered_sums[1:10, ]$EVTYPE, las = 2)

ordered_sums <- sums[order(sums$sum_of_fatalities, decreasing = TRUE), ]
par(mar = c(15, 3, 5, 3), mgp = c(4, 1, 0))
plot(ordered_sums[1:10, ]$sum_of_fatalities, bty = "n", xaxt = "n", xlab = "", 
    ylab = "Accumulated number of occurrences", pch = 21, bg = "orange", cex = 1.5, 
    main = "Top 10 causes of fatalities")
axis(1, at = (1:10), labels = ordered_sums[1:10, ]$EVTYPE, las = 2)

Figure: Top 10 causes of injuries (left panel) and top 10 causes of fatalities (right panel)

This plot shows the top 10 event types (x-axis) against the accumulated number of occurrences (y-axis) for injuries (left panel) and for fatalities (right panel). Tornadoes appear to be responsible for the highest number of both injuries and fatalities. Excessive heat and flash flooding also take many lives, and they are responsible for a large number of injuries too.

Weather events with the strongest economic impact

par(mfrow = c(1, 2))
ordered_sums <- sums[order(sums$sum_of_property_damage, decreasing = TRUE), 
    ]
par(mar = c(15, 5, 5, 3), mgp = c(4, 1, 0))
plot(ordered_sums[1:10, ]$sum_of_property_damage, bty = "n", xaxt = "n", xlab = "", 
    ylab = "Accumulated costs (Mill. USD)", pch = 21, bg = "red", cex = 1.5, 
    main = "Top 10 property damage costs")
axis(1, at = (1:10), labels = ordered_sums[1:10, ]$EVTYPE, las = 2)

ordered_sums <- sums[order(sums$sum_of_crop_damage, decreasing = TRUE), ]
par(mar = c(15, 3, 5, 3), mgp = c(4, 1, 0))
plot(ordered_sums[1:10, ]$sum_of_crop_damage, bty = "n", xaxt = "n", xlab = "", 
    ylab = "Accumulated costs (Mill. USD)", pch = 21, bg = "green", cex = 1.5, 
    main = "Top 10 crop damage costs")
axis(1, at = (1:10), labels = ordered_sums[1:10, ]$EVTYPE, las = 2)

Figure: Top 10 property damage costs (left panel) and top 10 crop damage costs (right panel)

This plot shows the top 10 event types (x-axis) against the accumulated costs (y-axis) for property damage (left panel) and for crop damage (right panel). Again, tornadoes appear to be a major issue, as they cause massive amounts of property damage. With regard to crop damage, however, hail and floods have more severe effects.
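
The cost sums above add up the PROPDMG and CROPDMG values as stored. The NOAA file also carries PROPDMGEXP and CROPDMGEXP columns holding a magnitude code (K for thousands, M for millions, B for billions), so expressing the totals in dollars would require scaling each record first. The sketch below shows one way this could be done on invented toy data; the helper name exp_multiplier is hypothetical, and treating every other code (including blanks) as a multiplier of 1 is an assumption of this sketch, not a documented rule.

```r
# Hypothetical helper: convert a magnitude code to a numeric multiplier.
# Only K, M and B (any case) are handled; every other code, including
# blanks, is treated as 1, which is an assumption of this sketch.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy records invented for illustration, not taken from the NOAA file
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 1.5, 500),
  PROPDMGEXP = c("K", "B", ""),
  stringsAsFactors = FALSE
)

# Scale each record to plain USD before any summing by event type
toy$PROPDMG_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy
```

Scaling the real data this way could change the ranking of event types by cost, so the cost plots should be read with that in mind.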

Conclusions

The dataset is unreliable and would need a very extensive cleanup effort to standardize the nomenclature. In its current state, we can conclude from the dataset that tornadoes have the biggest impact on both human health (injuries and fatalities) and the economy (property damage). Flooding and excessive heat are the next major issues.