Storms in the U.S. cause significant damage, both in human and economic terms. Identification of the most damaging events could help direct the focus of our efforts to minimize the human and economic damages that result from these natural phenomena. The objective of this study was to determine which weather events warrant our focus. To that end, publicly available storm data were downloaded from the National Climatic Data Center. The data were imported and processed. Then for each weather event, the total number of fatalities and injuries and the economic cost in terms of property and crops (as well as the total cost) were calculated. The results indicate that tornadoes have the highest human cost, while flooding has the highest economic cost. Therefore, we should focus on minimizing the damages caused by these weather events.
The following code downloads the bz2-compressed file and reads it into rawData, unless rawData already exists in the current R session. No separate unzipping step is needed because read.csv decompresses bz2 files transparently.
fileName <- "data.csv"
zippedFileName <- paste(fileName, "bz2", sep = ".")
if (!exists("rawData")) {
  dataURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  # mode = "wb" keeps the binary archive intact on Windows
  download.file(dataURL, zippedFileName, mode = "wb")
  rawData <- read.csv(zippedFileName, header = TRUE)
}
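A variation that caches on disk rather than in the session, checking whether the compressed file is already in the working directory before downloading, might look like the sketch below. The helper name loadStormData and its arguments are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): download the
# compressed file only when it is not already on disk, then read it.
loadStormData <- function(url, destFile) {
  if (!file.exists(destFile)) {
    # mode = "wb" keeps the binary bz2 archive intact on Windows
    download.file(url, destFile, mode = "wb")
  }
  # read.csv decompresses bz2 files transparently
  read.csv(destFile, header = TRUE)
}

# Usage with the URL and file name from this report (commented out because
# it triggers a large download):
# rawData <- loadStormData(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "data.csv.bz2"
# )
```

Checking the file rather than the object avoids a repeat download in a fresh R session, at the cost of re-reading the file each time the function is called.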
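The imported data have 37 columns. However, only 7 of these are relevant to the questions I seek to address. The relevant columns are listed below:
* EVTYPE: The event type, a factor with 985 levels.
* FATALITIES: The number of fatalities due to the event.
* INJURIES: The number of injuries due to the event.
* PROPDMG: The amount of property damage.
* PROPDMGEXP: The units of property damage (a magnitude code such as K, M, or B).
* CROPDMG: The amount of crop damage.
* CROPDMGEXP: The units of crop damage (a magnitude code such as K, M, or B).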
The following code subsets the raw data, extracting only the columns listed above.
relevantColumns <- c("EVTYPE",
                     "FATALITIES",
                     "INJURIES",
                     "PROPDMG",
                     "PROPDMGEXP",
                     "CROPDMG",
                     "CROPDMGEXP")
relevantData <- rawData[relevantColumns]
The two code chunks below remap the levels of the appropriate units column to numeric multipliers and multiply them by the corresponding “amount of damage” column to calculate the total damage. Remapping the values required the use of the function “mapvalues” in the “plyr” package.
Total property damage (PROPDMGTOTAL, units = dollars) was calculated as follows:
library(plyr)
relevantData$PROPDMGEXP <- mapvalues(rawData$PROPDMGEXP,
    from = c("K", "M", "", "B", "m", "+", "0", "5", "6", "?", "4", "2", "3", "h", "7", "H", "-", "1", "8"),
    to = c(10^3, 10^6, 1, 10^9, 10^6, 0, 1, 10^5, 10^6, 0, 10^4, 10^2, 10^3, 10^2, 10^7, 10^2, 0, 10, 10^8))
relevantData$PROPDMGTOTAL <- as.numeric(as.character(relevantData$PROPDMGEXP)) * relevantData$PROPDMG
Total crop damage (CROPDMGTOTAL, units = dollars) was calculated as follows:
relevantData$CROPDMGEXP <- mapvalues(rawData$CROPDMGEXP,
    from = c("", "M", "K", "m", "B", "?", "0", "k", "2"),
    to = c(1, 10^6, 10^3, 10^6, 10^9, 0, 1, 10^3, 10^2))
relevantData$CROPDMGTOTAL <- as.numeric(as.character(relevantData$CROPDMGEXP)) * relevantData$CROPDMG
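The long from/to vectors above are easy to misalign. As an illustration only, the same mapping can be sketched in base R with a small lookup function. The name expToMultiplier is hypothetical. The rule it encodes is read off the mappings above: letters are case-insensitive, a digit d means 10^d, an empty code means 1, and any other symbol means 0.

```r
# Hypothetical helper: translate a damage exponent code into a dollar multiplier.
expToMultiplier <- function(code) {
  code <- toupper(as.character(code))
  lookup <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9)
  multiplier <- unname(lookup[code])             # NA for anything not H/K/M/B
  digit <- suppressWarnings(as.numeric(code))    # NA for non-numeric codes
  isDigit <- is.na(multiplier) & !is.na(digit)
  multiplier[isDigit] <- 10^digit[isDigit]       # "5" becomes 10^5
  multiplier[code %in% ""] <- 1                  # empty code: amount is in dollars
  multiplier[is.na(multiplier)] <- 0             # "+", "-", "?" and the like
  multiplier
}

# Toy rows (not taken from the storm data) showing the total-damage calculation
toy <- data.frame(PROPDMG = c(25, 2.5, 10, 3, 7),
                  PROPDMGEXP = c("K", "m", "", "5", "?"))
toy$PROPDMGTOTAL <- expToMultiplier(toy$PROPDMGEXP) * toy$PROPDMG
print(toy)
```

One function then serves both the property and the crop columns, so the two mappings cannot drift apart.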
The following code subsets the data once again because the columns used to calculate total damages are no longer needed.
processedColumns <- c("EVTYPE",
                      "FATALITIES",
                      "INJURIES",
                      "PROPDMGTOTAL",
                      "CROPDMGTOTAL")
processedData <- relevantData[processedColumns]
The following code aggregates fatalities by event type and sorts based on the number of fatalities.
fatalitiesByEvent <- aggregate(FATALITIES ~ EVTYPE, data = processedData, FUN = sum)
orderedFatalities <- fatalitiesByEvent[order(-fatalitiesByEvent$FATALITIES), ]
print(head(orderedFatalities), row.names = FALSE)
## EVTYPE FATALITIES
## TORNADO 5633
## EXCESSIVE HEAT 1903
## FLASH FLOOD 978
## HEAT 937
## LIGHTNING 816
## TSTM WIND 504
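The aggregate, order, and head pattern used here recurs for every outcome variable in this report. A minimal sketch of a reusable wrapper is shown below. The function name topEvents and the toy data frame are hypothetical and only illustrate the pattern.

```r
# Hypothetical helper: sum one column by event type and return the n largest totals.
topEvents <- function(data, column, n = 6) {
  totals <- aggregate(data[[column]], by = list(EVTYPE = data$EVTYPE), FUN = sum)
  names(totals)[2] <- column                     # aggregate() names the value column "x"
  totals <- totals[order(-totals[[column]]), ]   # largest total first
  head(totals, n)
}

# Toy data (not taken from the storm data)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
                  FATALITIES = c(3, 1, 2, 0))
print(topEvents(toy, "FATALITIES", n = 2), row.names = FALSE)
```

With the real data, a call such as topEvents(processedData, "INJURIES") would stand in for the three-line pattern.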
The following code aggregates injuries by event type and sorts based on the number of injuries.
injuriesByEvent <- aggregate(INJURIES ~ EVTYPE, data = processedData, FUN = sum)
orderedInjuries <- injuriesByEvent[order(-injuriesByEvent$INJURIES), ]
print(head(orderedInjuries), row.names = FALSE)
## EVTYPE INJURIES
## TORNADO 91346
## TSTM WIND 6957
## FLOOD 6789
## EXCESSIVE HEAT 6525
## LIGHTNING 5230
## HEAT 2100
The following code aggregates property damage by event type and sorts based on the total cost.
propCostByEvent <- aggregate(PROPDMGTOTAL ~ EVTYPE, data = processedData, FUN = sum)
orderedPropCost <- propCostByEvent[order(-propCostByEvent$PROPDMGTOTAL), ]
print(head(orderedPropCost), row.names = FALSE)
## EVTYPE PROPDMGTOTAL
## FLOOD 144657709807
## HURRICANE/TYPHOON 69305840000
## TORNADO 56947380617
## STORM SURGE 43323536000
## FLASH FLOOD 16822673979
## HAIL 15735267513
The following code plots the property costs associated with the 5 event types that cause the greatest property damage. This requires the function ggplot in the package ggplot2.
library(ggplot2)
output <- ggplot(data = orderedPropCost[1:5, ], aes(x = EVTYPE, y = PROPDMGTOTAL))
output + geom_bar(stat="identity") + xlab("Event type") + ylab("Economic Damage ($)") + labs(title="Top 5 events causing property damage")
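ggplot2 places discrete x values in factor-level order, which is alphabetical for event types, so the bars do not appear in order of cost. One possible adjustment, sketched below, reorders the levels by damage before plotting. The data frame top5 is a stand-in built from the totals printed above, and the ggplot2 call is guarded in case the package is not installed.

```r
# Stand-in for orderedPropCost[1:5, ], using the totals printed in this report
top5 <- data.frame(
  EVTYPE = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE", "FLASH FLOOD"),
  PROPDMGTOTAL = c(144657709807, 69305840000, 56947380617, 43323536000, 16822673979),
  stringsAsFactors = FALSE
)

# reorder() sorts levels by ascending score, so negate the cost to put the
# most damaging event first
top5$EVTYPE <- reorder(top5$EVTYPE, -top5$PROPDMGTOTAL)
print(levels(top5$EVTYPE))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  barPlot <- ggplot(data = top5, aes(x = EVTYPE, y = PROPDMGTOTAL)) +
    geom_bar(stat = "identity") +
    xlab("Event type") + ylab("Economic Damage ($)") +
    labs(title = "Top 5 events causing property damage")
  # print(barPlot) draws the chart
}
```

The same reorder step would apply to the crop and total economic damage plots.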
The following code aggregates crop damage by event type and sorts based on the total cost.
cropCostByEvent <- aggregate(CROPDMGTOTAL ~ EVTYPE, data = processedData, FUN = sum)
orderedCropCost <- cropCostByEvent[order(-cropCostByEvent$CROPDMGTOTAL), ]
print(head(orderedCropCost), row.names = FALSE)
## EVTYPE CROPDMGTOTAL
## DROUGHT 13972566000
## FLOOD 5661968450
## RIVER FLOOD 5029459000
## ICE STORM 5022113500
## HAIL 3025954473
## HURRICANE 2741910000
The following code plots the crop costs associated with the 5 event types that cause the greatest crop damage.
output <- ggplot(data = orderedCropCost[1:5, ], aes(x = EVTYPE, y = CROPDMGTOTAL))
output + geom_bar(stat="identity") + xlab("Event type") + ylab("Economic Damage ($)") + labs(title="Top 5 events causing crop damage")
The following code sums the total property and crop damage to create the variable ECONDMGTOTAL (unit = dollars). It then aggregates that variable based on event type (EVTYPE).
processedData$ECONDMGTOTAL <- processedData$PROPDMGTOTAL + processedData$CROPDMGTOTAL
totalCostByEvent <- aggregate(ECONDMGTOTAL ~ EVTYPE, data = processedData, FUN = sum)
orderedTotalCost <- totalCostByEvent[order(-totalCostByEvent$ECONDMGTOTAL),]
print(head(orderedTotalCost), row.names = FALSE)
## EVTYPE ECONDMGTOTAL
## FLOOD 150319678257
## HURRICANE/TYPHOON 71913712800
## TORNADO 57362333887
## STORM SURGE 43323541000
## HAIL 18761221986
## FLASH FLOOD 18243991079
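Because summation is linear, the total for an event type should equal its property total plus its crop total, provided no rows are dropped for missing values. A quick consistency check using figures copied from the tables printed in this report is sketched below. It covers only FLOOD and HAIL, the two event types that appear in all three tables.

```r
# Totals copied from the printed tables in this report (dollars)
propDamage    <- c(FLOOD = 144657709807, HAIL = 15735267513)
cropDamage    <- c(FLOOD = 5661968450,   HAIL = 3025954473)
reportedTotal <- c(FLOOD = 150319678257, HAIL = 18761221986)

computedTotal <- propDamage + cropDamage
print(computedTotal == reportedTotal)   # TRUE for both event types
```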
The following code plots the total economic costs associated with the 5 event types that cause the greatest total economic damage.
output <- ggplot(data = orderedTotalCost[1:5, ], aes(x = EVTYPE, y = ECONDMGTOTAL))
output + geom_bar(stat="identity") + xlab("Event type") + ylab("Economic Damage ($)") + labs(title="Top 5 events causing economic damage")
“Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?”
Tornadoes cause both more fatalities and more injuries than any other event.
“Across the United States, which types of events have the greatest economic consequences?”
Flooding causes the most property damage but only the second-most crop damage, behind drought. When property and crop damage are summed to get the total economic impact, flooding is clearly the most damaging event type.
Based on the data, flooding causes by far the most economic damage of any of the events analyzed. Consequently, we should put effort into flood-proofing buildings. Likewise, the data indicate that tornadoes cause more injuries and fatalities than any of the other events analyzed. Therefore, efforts to improve human wellbeing should focus on alerting people to tornadoes sooner.