Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Specifically this analysis will show that tornadoes are by far the most damaging in terms of human health. In terms of economic consequences floods are by far the most damaging.
In the working directory there should be a “repdata-data-StormData.csv.bz2” file.
## First we read in the data file, we can directly read the bz2 file
stormData <- read.csv("./repdata-data-StormData.csv.bz2")
## Let's look at what the data looks like
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## We turn the begin date into a date object so we can later extract the year the event took place more easily
stormData$BGN_DATE <- as.Date(stormData$BGN_DATE, "%m/%d/%Y %H:%M:%S")
We answer this question by first determining the number of fatalities and injuries throughout all the year for which we have data. Then, we show a stacked barplot of the top 7 event types that have resulted in the the most fatalities and injuries combined (in descending order).
The figure shows that if one looks at the number of fatalities and injuries caused by different event types throughout the years for which there is data (1950 to 2011) that tornadoes are by far the most damaging.
## We aggregate the fatalities and injuries by event type and sum up the results, then we turn
## the result in a dataframe for further processing.
healthData <- as.data.frame(aggregate(. ~ EVTYPE, data = stormData[, c("EVTYPE", "FATALITIES", "INJURIES")], FUN = sum))
## We sort the data on the sum of fatalities and injuries in descending order and take out the
## first seven rows of data
healthData <- healthData[order(healthData$FATALITIES + healthData$INJURIES, decreasing = TRUE),][1:7,]
## Now we are ready to create a stacked barplot of the results
barplot(t(as.matrix(healthData))[2:3, ], names.arg = t(as.matrix(healthData))[1, ], cex.names = 0.75, yaxt = "n", width = 5, col = c("red", "blue"), xlab = "Event type", ylab = "Number of fatalities and injuries", main = "Most harmful event types for human health throughout the years 1950 to 2011")
yAxisTicks <- seq(0, 100000, by=10000)
axis(2, at = yAxisTicks, labels=format(yAxisTicks, scientific = FALSE), xpd = TRUE, cex.axis = 0.75)
legend("topright", c("Fatalities", "Injuries"), fill = c("red", "blue"))
However, we should also check the variability throughout the years for this event to make clear how harmful tornadoes have been in the last decade. The diagram belows shows that also in recent years tornadoes have been very harmful.
## We use the following libraries
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## We aggregate the fatalities and injuries by event type and sum up the results, then we turn
## the result in a dataframe for further processing.
tornadoData <- stormData[stormData$EVTYPE == "TORNADO",]
tornadoData$BGN_YEAR <- year(tornadoData$BGN_DATE)
tornadoData <- aggregate(. ~ BGN_YEAR, data = tornadoData[, c("FATALITIES", "INJURIES", "BGN_YEAR")], FUN = sum)
## Now we are ready to create a plot of the fatalities and injuries throughout the years
plot(x = tornadoData$BGN_YEAR, y = tornadoData$INJURIES, type = "l", col = c("blue"), xlab = "Year", ylab = "Number of fatalities and injuries", main = "Fatalities & injuries caused by tornadoes throughout the years 1950 to 2011", yaxt = "n", xaxt = "n")
xAxisTicks <- seq(1950, 2011, by = 5)
axis(1, at = xAxisTicks, labels = xAxisTicks, xpd = TRUE, cex.axis = 0.75)
lines(x = tornadoData$BGN_YEAR, y = tornadoData$FATALITIES, type = "l", col = c("red"))
yAxisTicks <- seq(0, 7000, by=500)
axis(2, at = yAxisTicks, labels=format(yAxisTicks, scientific = FALSE), xpd=TRUE, cex.axis = 0.75)
legend("topleft", c("Fatalities", "Injuries"), fill = c("red", "blue"))
We answer this question by first adding up the property damage and the crop damage amounts provided in the data set. However some preprocessing is required since for both property damage and crop damage there are dollar amount fields as well as exponent fields determining by which factor to multiply the dollar amount.
## We use the library dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## We need to do some preprocessing here because the the property and crop damage fields have separate exponent fields.
## In the following publication https://rpubs.com/flyingdisc/PROPDMGEXP it is shown how the NOAA website deals with
## the different exponents. We will adopt this way of dealing with the exponent fields here also.
damageData <- stormData[, c("EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
## First we change the exponents to upper case (m will be M, h will be H etc.) which makes conversion simpler.
damageData$PROPDMGEXP <- toupper(damageData$PROPDMGEXP)
damageData$CROPDMGEXP <- toupper(damageData$CROPDMGEXP)
converter <- c("0" = 10, "1" = 10, "2" = 10, "3" = 10, "4" = 10, "5" = 10, "6" = 10, "7" = 10, "8" = 10, "H" = 100, "K" = 1000, "M" = 1000000, "B" = 1000000000, "+" = 1, "-" = 0, "?" = 0, "''" = 0)
## Now we convert all exponents to a factor
damageData <- damageData %>% mutate(PROPDMGFACTOR = ifelse(PROPDMGEXP == "", 0,converter[as.character(PROPDMGEXP)]))
damageData <- damageData %>% mutate(CROPDMGFACTOR = ifelse(CROPDMGEXP == "", 0,converter[as.character(CROPDMGEXP)]))
## And finally we calculate the resulting amount
damageData$PROPDMGAMOUNT <- damageData$PROPDMG * damageData$PROPDMGFACTOR
damageData$CROPDMGAMOUNT <- damageData$CROPDMG * damageData$CROPDMGFACTOR
From the diagram below it will be clear that in terms of economic consequences floods have by far the most damaging effect.
## We aggregate the fatalities and injuries by event type and sum up the results, then we turn
## the result in a dataframe for further processing.
damageData <- as.data.frame(aggregate(. ~ EVTYPE, data = damageData[, c("EVTYPE", "PROPDMGAMOUNT", "CROPDMGAMOUNT")], FUN = sum))
## We sort the data on the sum of property and crop damage in descending order and take out the
## first seven rows of data
damageData <- damageData[order(damageData$PROPDMGAMOUNT + damageData$CROPDMGAMOUNT, decreasing = TRUE),][1:7,]
## Now we are ready to create a stacked barplot of the results
barplot(t(as.matrix(damageData))[2:3, ], names.arg = t(as.matrix(damageData))[1, ], yaxt = "n", cex.names = 0.75, width = 5, col = c("blue", "skyblue2"), xlab = "Event type", ylab = "Amount of property & crop damage (in billion US dollars)", main = "Event types with the most economic consequences throughout the years 1950 to 2011")
yAxisTicks <- seq(0, 150000000000, by=10000000000)
axis(2, at = yAxisTicks, labels = yAxisTicks / 1000000000, xpd = TRUE, cex.axis = 0.75)
legend("topright", c("Property damage", "Crop damage"), fill = c("blue", "skyblue2"))