Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2] [47 MB]
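Because the file is bzip2-compressed, it does not have to be unpacked by hand: R can read and write CSV data through a `bzfile` connection. The sketch below illustrates this offline with a tiny made-up data frame and a temporary file; the column names are borrowed from the storm data, but the values are purely illustrative.

```r
# Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), INJURIES = c(3, 1))

# Write the CSV through a bzip2 connection to a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back; read.csv opens and closes the connection itself
roundtrip <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(roundtrip)
```

The same `read.csv(bzfile(...))` pattern applies to the downloaded storm file.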
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation [https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf]
National Climatic Data Center Storm Events FAQ [https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf]
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
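One way to check the claim about sparser early records is to count events per year. The sketch below assumes the event start date lives in a column named `BGN_DATE` with values such as `"4/18/1950 0:00:00"`; both the column name and the format are assumptions here, and the toy rows are made up for illustration.

```r
# Toy stand-in for the storm data; BGN_DATE name and format are assumed
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
               "11/15/2011 0:00:00", "11/16/2011 0:00:00",
               "3/2/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# Count records per calendar year (hypothetical helper)
events_per_year <- function(df) {
  dates <- as.Date(df$BGN_DATE, format = "%m/%d/%Y")  # trailing time is ignored
  table(format(dates, "%Y"))
}

counts <- events_per_year(toy)
print(counts)
```

Once the real data frame is loaded, calling `events_per_year(data)` would show how the number of records grows over time, provided the date column matches the assumption above.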
# Download the compressed CSV, then read it through a bzip2 connection
url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "storm.bz2", mode = "wb")
data <- read.csv(bzfile("storm.bz2"))
library(ggplot2)
library(plyr)
# aggregate for injuries
INJbyEVT <- aggregate(INJURIES ~ EVTYPE, data, sum)
#sort injuries with highest results
INJbyEVTSort <- INJbyEVT[order(INJbyEVT$INJURIES,decreasing = TRUE),]
# find the percentage of each
INJbyEVTSort$Percentage <- INJbyEVTSort$INJURIES/sum(INJbyEVT$INJURIES)*100
# keep only the top 10
INJbyEVTSortTop <- head(INJbyEVTSort,10)
#aggregate for fatalities
FATbyEVT <- aggregate(FATALITIES ~ EVTYPE, data, sum)
#sort fatalities with highest results
FATbyEVTSort <- FATbyEVT[order(FATbyEVT$FATALITIES,decreasing = TRUE),]
# find the percentage of each
FATbyEVTSort$Percentage <- FATbyEVTSort$FATALITIES/sum(FATbyEVT$FATALITIES)*100
# keep only the top 10
FATbyEVTSortTop <- head(FATbyEVTSort,10)
#plot
par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.8)
barplot(FATbyEVTSortTop$FATALITIES, las = 3, names.arg = FATbyEVTSortTop$EVTYPE, main = "Events with Highest Fatalities",
ylab = "Number of fatalities", col = "red")
barplot(INJbyEVTSortTop$INJURIES, las = 3, names.arg = INJbyEVTSortTop$EVTYPE, main = "Events with Highest Injuries",
ylab = "Number of injuries", col = "red")
The charts above show the ten event types that caused the most fatalities and the most injuries. The top ten event types account for 89% of all injuries and 80% of all fatalities.
Tornadoes are by far the most harmful events on both measures: they account for 65% of all injuries and 37% of all fatalities.
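The shares quoted above come from dividing the sum over the leading event types by the grand total. A minimal sketch of that calculation follows; the helper `top_share` and the toy numbers are hypothetical and only illustrate the arithmetic.

```r
# Percentage of a total captured by the n largest groups (hypothetical helper)
top_share <- function(df, value, group, n = 10) {
  totals <- tapply(df[[value]], df[[group]], sum)  # sum per group
  totals <- sort(totals, decreasing = TRUE)        # largest first
  100 * sum(head(totals, n)) / sum(totals)
}

# Made-up data for illustration only
toy <- data.frame(
  EVTYPE   = c("TORNADO", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  INJURIES = c(60, 5, 20, 10, 5)
)

share_top2 <- top_share(toy, "INJURIES", "EVTYPE", n = 2)
print(share_top2)
```

On the storm data, `top_share(data, "INJURIES", "EVTYPE")` and `top_share(data, "FATALITIES", "EVTYPE")` should give the top-ten shares for injuries and fatalities.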
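The damage data need to be cleaned before they are usable. The PROPDMGEXP and CROPDMGEXP columns hold the multipliers for the property and crop damage amounts, encoded as a mix of letters, digits, and symbols. We replace the symbols and letters with numeric exponents: H means hundreds (10^2), K thousands (10^3), M millions (10^6), and B billions (10^9). Blanks and the symbols "+", "-", and "?" are treated as an exponent of 0, and digits are used as exponents directly.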
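Before the full cleaning code, the mapping can be illustrated on a small vector. This is a sketch of the same rules as a lookup function; the name `exp_to_multiplier` is hypothetical and not part of the analysis below.

```r
# Convert exponent codes (letters, digits, symbols) to numeric multipliers
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))
  lookup <- c(H = 2, K = 3, M = 6, B = 9)

  # Digits are used as exponents directly; everything else becomes NA for now
  power <- suppressWarnings(as.numeric(code))

  # Letters are replaced via the lookup table
  is_letter <- code %in% names(lookup)
  power[is_letter] <- lookup[code[is_letter]]

  # Blanks and symbols such as "+", "-", "?" get exponent 0, i.e. multiplier 1
  power[is.na(power)] <- 0
  10^power
}

multipliers <- exp_to_multiplier(c("K", "m", "B", "", "?", "5"))
print(multipliers)
```

A damage value would then be computed as, for example, `PROPDMG * exp_to_multiplier(PROPDMGEXP)`.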
# create a copy of the data that will be amended
dataUpdated <- data
# convert PROPDMGEXP to upper case and replace symbols and letters with exponents
dataUpdated$PROPDMGEXP <- toupper(dataUpdated$PROPDMGEXP)
dataUpdated$PROPDMGEXP[dataUpdated$PROPDMGEXP %in% c("", "+", "-", "?")] = "0"
dataUpdated$PROPDMGEXP[dataUpdated$PROPDMGEXP %in% c("B")] = "9"
dataUpdated$PROPDMGEXP[dataUpdated$PROPDMGEXP %in% c("M")] = "6"
dataUpdated$PROPDMGEXP[dataUpdated$PROPDMGEXP %in% c("K")] = "3"
dataUpdated$PROPDMGEXP[dataUpdated$PROPDMGEXP %in% c("H")] = "2"
# replace the exp numbers by their real value
dataUpdated$PROPDMGEXP <- 10^(as.numeric(dataUpdated$PROPDMGEXP))
# create a new column for the property damage value
dataUpdated["PROPDMGVAL"] <- dataUpdated$PROPDMG * dataUpdated$PROPDMGEXP
# rinse and repeat with crop damage
dataUpdated$CROPDMGEXP <- toupper(dataUpdated$CROPDMGEXP)
dataUpdated$CROPDMGEXP[dataUpdated$CROPDMGEXP %in% c("", "+", "-", "?")] = "0"
dataUpdated$CROPDMGEXP[dataUpdated$CROPDMGEXP %in% c("B")] = "9"
dataUpdated$CROPDMGEXP[dataUpdated$CROPDMGEXP %in% c("M")] = "6"
dataUpdated$CROPDMGEXP[dataUpdated$CROPDMGEXP %in% c("K")] = "3"
dataUpdated$CROPDMGEXP[dataUpdated$CROPDMGEXP %in% c("H")] = "2"
# replace the exp numbers by their real value
dataUpdated$CROPDMGEXP <- 10^(as.numeric(dataUpdated$CROPDMGEXP))
# create a new column for the crop damage value
dataUpdated["CROPDMGVAL"] <- dataUpdated$CROPDMG * dataUpdated$CROPDMGEXP
#sum the cost of damages on prop and crop by event and sort it
PROPDMGbyEVT <- aggregate(PROPDMGVAL ~ EVTYPE, dataUpdated, sum)
PROPDMGbyEVTSort <- PROPDMGbyEVT[order(PROPDMGbyEVT$PROPDMGVAL,decreasing = TRUE),]
PROPDMGbyEVTSortTop<- head(PROPDMGbyEVTSort,10)
CROPDMGbyEVT <- aggregate(CROPDMGVAL ~ EVTYPE, dataUpdated, sum)
CROPDMGbyEVTSort <- CROPDMGbyEVT[order(CROPDMGbyEVT$CROPDMGVAL,decreasing = TRUE),]
CROPDMGbyEVTSortTop<- head(CROPDMGbyEVTSort ,10)
#plot
par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.8)
barplot(CROPDMGbyEVTSortTop$CROPDMGVAL, las = 3, names.arg = CROPDMGbyEVTSortTop$EVTYPE, main = "Events with Highest Crop Damages",
ylab = "Crop damage (USD)", col = "red")
barplot(PROPDMGbyEVTSortTop$PROPDMGVAL, las = 3, names.arg = PROPDMGbyEVTSortTop$EVTYPE, main = "Events with Highest Property Damages",
ylab = "Property damage (USD)", col = "red")
While the same event type accounts for the most fatalities and the most injuries, the results for economic damage are more mixed.
Drought accounts for the most crop damage by value, while flood accounts for the most property damage by value.
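Since the two rankings disagree, one possible follow-up is to add property and crop damage into a single total per event type. The sketch below shows the idea on made-up numbers; the column `TOTALDMGVAL` is hypothetical and the toy values are not results from the storm data.

```r
# Made-up damage values for illustration only
toy <- data.frame(
  EVTYPE     = c("FLOOD", "DROUGHT", "FLOOD", "HAIL"),
  PROPDMGVAL = c(5e9, 1e8, 2e9, 3e8),
  CROPDMGVAL = c(1e8, 4e9, 2e8, 1e8)
)

# Combine both damage types, then sum and rank by event type
toy$TOTALDMGVAL <- toy$PROPDMGVAL + toy$CROPDMGVAL
total_by_event <- aggregate(TOTALDMGVAL ~ EVTYPE, toy, sum)
total_by_event <- total_by_event[order(total_by_event$TOTALDMGVAL,
                                       decreasing = TRUE), ]
print(total_by_event)
```

Applied to `dataUpdated`, the same steps would give one combined ranking, at the cost of hiding which kind of damage drives each event type's total.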