This document is a report on the analysis of the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. The main objective of the analysis is to identify the weather events that have the most severe effects on human health (injuries and fatalities) and those that have the highest economic costs across the United States.
It is very important to note that the NOAA storm database is of very low quality, and if this were a real-life assignment, I would expect whoever prepared such a data table to be held to account. Naming conventions for weather events (the main focus of the database) are practically non-existent, and the labels are full of typos and inconsistencies. A complete cleanup of the database would be highly desirable, but it would take a great deal of time and is outside the scope of this assignment.
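To give a sense of what even a light cleanup of the event labels could look like, here is a minimal sketch. The helper `normalize_evtype` and the sample labels are hypothetical and not part of the analysis below; expanding `TSTM` to `THUNDERSTORM` is an assumption about one common abbreviation, not an exhaustive rule.

```r
# Hypothetical helper (not used in the analysis below): light normalization
# of event-type labels before grouping.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))                    # unify case, drop outer whitespace
  x <- gsub("[[:space:]]+", " ", x)                        # collapse runs of whitespace
  x <- gsub("^TSTM\\b", "THUNDERSTORM", x, perl = TRUE)    # assumed abbreviation
  x
}

# Made-up labels illustrating typical inconsistencies
raw <- c("Tornado", " TORNADO", "tstm  wind", "TSTM WIND", "Flash Flood")
clean <- normalize_evtype(raw)
print(table(clean))
```

This only merges labels that differ in case, whitespace or that one abbreviation; genuine typos and overlapping categories would still need manual mapping.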
The data are stored in a comma-separated table format. The file first had to be decompressed from its bzip2 (.bz2) archive. Although the file is rather large (>500 MB), the data.table library could not be used because of a number of inconsistencies in the file. So I fell back on read.csv, even though it takes several minutes to load the data.
data <- read.csv("repdata-data-StormData.csv", header = TRUE, sep = ",")  # the file name says "data" two times too many
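Since only a handful of columns are needed, one way to shorten the load time is to let read.csv skip the rest through its `colClasses` argument. The sketch below uses a small temporary file as a stand-in for the storm data; the toy values and the `STATE` column are made up for illustration, and I have not timed this on the full file.

```r
# Toy stand-in for the storm file; the real file has many more columns.
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  STATE = c("AL", "TX"),
  EVTYPE = c("TORNADO", "HAIL"),
  FATALITIES = c(1, 0),
  INJURIES = c(15, 0),
  PROPDMG = c(25, 2.5),
  CROPDMG = c(0, 10)
)
write.csv(toy, csv_path, row.names = FALSE)

# Read the header only, then mark unwanted columns as "NULL" so that
# read.csv drops them while parsing (NA means "guess the type as usual").
wanted <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")
header <- names(read.csv(csv_path, nrows = 1))
col_classes <- ifelse(header %in% wanted, NA, "NULL")
small <- read.csv(csv_path, header = TRUE, colClasses = col_classes)
print(names(small))
```

The columns come back in file order, so any code that later refers to them by position should be checked against that order.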
We only need five columns (event type, injuries, fatalities, property damage and crop damage), so to speed things up I saved just these columns to a new data frame.
# Keep only the five relevant columns, in the order used below
data_relevant <- data[, c("INJURIES", "FATALITIES", "PROPDMG", "CROPDMG", "EVTYPE")]
This new data frame, containing only the relevant columns, was then split by event type.
splitted_data <- split(data_relevant, data_relevant$EVTYPE)
sums <- data.frame(names(splitted_data))
I've calculated the sum of each column to obtain the accumulated number of injuries and fatalities and the accumulated property and crop damage for each event type.
# Accumulate the totals per event type; columns 1 to 4 are injuries,
# fatalities, property damage and crop damage
sum_of_injuries <- NULL
sum_of_fatalities <- NULL
sum_of_property_damage <- NULL
sum_of_crop_damage <- NULL
for (i in seq_along(splitted_data)) {
    temp <- splitted_data[[i]]
    sum_of_injuries <- c(sum_of_injuries, sum(temp[, 1]))
    sum_of_fatalities <- c(sum_of_fatalities, sum(temp[, 2]))
    sum_of_property_damage <- c(sum_of_property_damage, sum(temp[, 3]))
    sum_of_crop_damage <- c(sum_of_crop_damage, sum(temp[, 4]))
}
colnames(sums) <- "EVTYPE"
sums <- cbind(sums, sum_of_injuries, sum_of_fatalities, sum_of_property_damage,
sum_of_crop_damage)
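The split-and-loop steps above can also be written as a single call to base R's `aggregate`. This is a sketch of an alternative, not the code that produced the results in this report; the toy data frame and the name `sums_alt` are made up for illustration.

```r
# Toy data with the same column names as data_relevant
toy <- data.frame(
  INJURIES = c(10, 5, 0, 2),
  FATALITIES = c(1, 0, 0, 3),
  PROPDMG = c(25, 2.5, 1, 0),
  CROPDMG = c(0, 0, 10, 4),
  EVTYPE = c("TORNADO", "TORNADO", "HAIL", "FLOOD")
)

# One row per event type, one summed column per measure
sums_alt <- aggregate(
  cbind(INJURIES, FATALITIES, PROPDMG, CROPDMG) ~ EVTYPE,
  data = toy,
  FUN = sum
)
print(sums_alt)
```

Note that the formula interface of `aggregate` drops rows containing missing values by default, so the totals could differ from the loop if any of the four columns had NAs.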
I've plotted the top 10 event types for the four concerns: injuries, fatalities, property damage and crop damage.
par(mfrow = c(1, 2))
ordered_sums <- sums[order(sums$sum_of_injuries, decreasing = TRUE), ]
par(mar = c(15, 5, 5, 3), mgp = c(4, 1, 0))
plot(ordered_sums[1:10, ]$sum_of_injuries, bty = "n", xaxt = "n", xlab = "",
    ylab = "Accumulated number of occurrences", pch = 21, bg = "blue", cex = 1.5,
    main = "Top 10 causes of injuries")
axis(1, at = (1:10), labels = ordered_sums[1:10, ]$EVTYPE, las = 2)
ordered_sums <- sums[order(sums$sum_of_fatalities, decreasing = TRUE), ]
par(mar = c(15, 3, 5, 3), mgp = c(4, 1, 0))
plot(ordered_sums[1:10, ]$sum_of_fatalities, bty = "n", xaxt = "n", xlab = "",
    ylab = "Accumulated number of occurrences", pch = 21, bg = "orange", cex = 1.5,
    main = "Top 10 causes of fatalities")
axis(1, at = (1:10), labels = ordered_sums[1:10, ]$EVTYPE, las = 2)
This plot shows the top 10 event types (x-axis) against the accumulated number of occurrences (y-axis) for injuries (left panel) and for fatalities (right panel). Tornadoes are responsible for the highest number of both injuries and fatalities. Excessive heat and flash flooding also take many lives, and they are responsible for a large number of injuries too.
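The order-then-take-ten step is repeated for every panel, so it could be wrapped in a small helper. The function `top_events` and the toy totals below are hypothetical and only illustrate the idea; the column names follow the `sums` data frame built earlier.

```r
# Hypothetical helper: the n event types with the largest total in `column`
top_events <- function(sums, column, n = 10) {
  ordered <- sums[order(sums[[column]], decreasing = TRUE), ]
  head(ordered[, c("EVTYPE", column)], n)
}

# Made-up totals shaped like the `sums` data frame
toy_sums <- data.frame(
  EVTYPE = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  sum_of_injuries = c(15, 0, 2, 7),
  sum_of_fatalities = c(1, 0, 3, 9)
)

top_inj <- top_events(toy_sums, "sum_of_injuries", n = 3)
print(top_inj)
```

The returned data frame can be passed to the same plot and axis calls used for the panels, which keeps the ordering logic in one place.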
par(mfrow = c(1, 2))
ordered_sums <- sums[order(sums$sum_of_property_damage, decreasing = TRUE), ]
par(mar = c(15, 5, 5, 3), mgp = c(4, 1, 0))
plot(ordered_sums[1:10, ]$sum_of_property_damage, bty = "n", xaxt = "n", xlab = "",
    ylab = "Accumulated PROPDMG (as recorded, unscaled)", pch = 21, bg = "red", cex = 1.5,
    main = "Top 10 property damage costs")
axis(1, at = (1:10), labels = ordered_sums[1:10, ]$EVTYPE, las = 2)
ordered_sums <- sums[order(sums$sum_of_crop_damage, decreasing = TRUE), ]
par(mar = c(15, 3, 5, 3), mgp = c(4, 1, 0))
plot(ordered_sums[1:10, ]$sum_of_crop_damage, bty = "n", xaxt = "n", xlab = "",
    ylab = "Accumulated CROPDMG (as recorded, unscaled)", pch = 21, bg = "green", cex = 1.5,
    main = "Top 10 crop damage costs")
axis(1, at = (1:10), labels = ordered_sums[1:10, ]$EVTYPE, las = 2)
This plot shows the top 10 event types (x-axis) against the accumulated costs (y-axis) for property damage (left panel) and for crop damage (right panel). Again, tornadoes appear to be a major issue, as they cause massive amounts of property damage. With regard to crop damage, however, hail and floods have more severe effects.
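One caveat on the cost figures: the sums above use the PROPDMG and CROPDMG values exactly as recorded. The raw table also has magnitude columns (PROPDMGEXP and CROPDMGEXP) that the analysis does not use. Below is a minimal sketch of how such a code could be applied, assuming "K", "M" and "B" stand for thousands, millions and billions and treating every other code as a multiplier of 1. The helper `scale_damage` and the toy rows are hypothetical.

```r
# Hypothetical helper: convert a recorded damage value to USD using its
# magnitude code. Assumption: K = 1e3, M = 1e6, B = 1e9; anything else = 1.
scale_damage <- function(value, exp_code) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- multipliers[toupper(as.character(exp_code))]
  m[is.na(m)] <- 1          # unknown or empty codes leave the value unchanged
  value * unname(m)
}

# Made-up rows for illustration
toy <- data.frame(
  PROPDMG = c(25, 2.5, 1.2, 500),
  PROPDMGEXP = c("K", "M", "B", "")
)
toy$PROPDMG_USD <- scale_damage(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)
```

Applying a scaling like this before summing could change the ranking of event types, so the cost rankings above are best read as a first approximation.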
The dataset is unreliable and would need a very extensive cleanup effort to standardize the nomenclature. In its current state, we can conclude from the dataset that tornadoes have the biggest impact both on human health (injuries and fatalities) and economically (property damage). Flooding and excessive heat are the next major issues.