This project analyzes the National Oceanic and Atmospheric Administration’s (NOAA) Storm Events Database to explore two main questions: 1) which weather events are most harmful to population health and 2) which weather events have the greatest economic consequences. To address the first question, the fatalities and injuries of each event type were added together over the period 1950 to 2011; tornadoes were the most harmful event, with 96,979 combined fatalities and injuries. A “Deadly Ratio”, defined as fatalities/injuries, was also calculated for each event; among the 15 most harmful events, the deadliest was Flash Flood, with 55%. To answer the second question, the costs of property damage and crop damage were summed into a total damage cost; floods were the event with the greatest economic consequences, totaling about 150 Bn USD from 1950 to 2011.
The original data file was compressed with the bzip2 algorithm, so it can’t be extracted with the regular unzip() function. The following code first checks whether the .bz2 file is in the current directory and downloads it if it is not. The data is then read into R directly, without decompressing it to disk, by passing a bzfile() connection to read.csv(). Finally, two R packages are loaded: ggplot2 and dplyr.
#downloads the data and reads it into R
if (!file.exists("repdata-data-StormData.csv.bz2")) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  #mode = "wb" keeps the binary .bz2 file intact on Windows
  download.file(url, "repdata-data-StormData.csv.bz2", mode = "wb")
}
#only read the file if "data" is not already in the workspace
if (!"data" %in% ls()) {
  data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
}
library(ggplot2)
library(dplyr)
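As a quick check of the reading step, the sketch below writes a tiny CSV through a bzip2 connection and reads it back with read.csv(bzfile(...)). The temporary file and the two-row toy data frame are made up for illustration and are not part of the NOAA data.

```r
# Write a tiny CSV through a bzip2 connection, then read it back.
# The file name and the toy rows are illustrative assumptions.
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("FLOOD", "HAIL"), FATALITIES = c(1, 0))

con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the unopened bzfile() connection and closes it when done
roundTrip <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(roundTrip)
unlink(tmp)
```

The compressed file is never expanded on disk; decompression happens as the connection is read.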
To answer the first question (which events are most harmful to population health), fatalities and injuries are summed by event type over the whole data set, that is, from 1950 to 2011. For instance, the event type “TORNADO” killed or injured 96,979 people from 1950 to 2011, according to this dataset.
The code computes the new variable fatalities+injuries for every event type and then selects the 15 events with the highest value. This is shown in Figure 1. A second approach, presented after the figure, differentiates between injuries and fatalities.
#Group by event type, then sum fatalities+injuries. Sort decreasingly and take top 15.
eventTypes <- data %>% group_by(EVTYPE) %>% summarise(fatalInjuries= sum(FATALITIES,INJURIES),
fatalities=sum(FATALITIES), injuries=sum(INJURIES))
orderedEvents <- eventTypes %>% arrange(desc(fatalInjuries))
top15Events <- orderedEvents[1:15,] #Only give me the top 15 events by (fatalities+injuries)
top15Events$EVTYPE <- factor(top15Events$EVTYPE, levels=top15Events$EVTYPE) #Order the levels decreasingly
head(top15Events)
## Source: local data frame [6 x 4]
##
## EVTYPE fatalInjuries fatalities injuries
## 1 TORNADO 96979 5633 91346
## 2 EXCESSIVE HEAT 8428 1903 6525
## 3 TSTM WIND 7461 504 6957
## 4 FLOOD 7259 470 6789
## 5 LIGHTNING 6046 816 5230
## 6 HEAT 3037 937 2100
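The grouping step can be cross-checked without dplyr. The sketch below uses base R's aggregate() instead of the group_by()/summarise() pipeline, on a toy data frame whose counts are invented for illustration; only the column names follow the dataset.

```r
# Toy data with made-up counts (not taken from the NOAA file)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 0, 1, 3, 1),
  INJURIES   = c(10, 4, 5, 0, 2),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries separately for each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined harm, equivalent to sum(FATALITIES, INJURIES) within each group
totals$fatalInjuries <- totals$FATALITIES + totals$INJURIES

# Sort from most to least harmful
totals <- totals[order(-totals$fatalInjuries), ]
print(totals)
```

Agreement between the two approaches on the real data would suggest that the combined total is simply the sum of the two per-event totals.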
Figure 1.
#Plot1. Top 15 most harmful events (fatalities+injuries)
plot1 <- ggplot(top15Events, aes(EVTYPE, fatalInjuries))
plot1 + geom_bar(stat="identity", colour="black") +ylab("Fatalities + Injuries")+
xlab("") + ggtitle("Top 15 most harmful weather events to population health
in USA (1950-2011)") +
theme(legend.position="none") + theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))
Another dimension computed to analyze this question is the ratio of fatalities to injuries, that is, the number of people a weather event killed expressed as a percentage of the number of people it injured. The higher this percentage, the deadlier the event.
#Creates a new variable, fatalities/Injuries, by each Event Type.
top15Events$deadly <- top15Events$fatalities/top15Events$injuries
summary(top15Events$deadly) #the max is 55% deadly rate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01102 0.06545 0.08938 0.16160 0.18710 0.55040
The weather event with the highest deadly ratio is Flash Flood.
top15Events[top15Events$deadly==max(top15Events$deadly),]
## Source: local data frame [1 x 5]
##
## EVTYPE fatalInjuries fatalities injuries deadly
## 1 FLASH FLOOD 2755 978 1777 0.5503658
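The ratio can be verified by hand from the totals printed in this report. The sketch below recomputes it for three of the events (TORNADO, HEAT and FLASH FLOOD), using only the fatality and injury counts shown in the outputs above; the small data frame itself is an illustrative construction.

```r
# Totals copied from the outputs shown in this report
events <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "FLASH FLOOD"),
  fatalities = c(5633, 937, 978),
  injuries   = c(91346, 2100, 1777),
  stringsAsFactors = FALSE
)

# Deadly ratio: fatalities as a fraction of injuries
events$deadly <- events$fatalities / events$injuries

# which.max() returns the position of the first maximum
deadliest <- events$EVTYPE[which.max(events$deadly)]
print(events)
print(deadliest)
```

Note that which.max() returns a single row even if several events tie, whereas the comparison against max() used in the report returns every tied row.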
Two features are used to answer the second question (which events have the greatest economic consequences): Property Damage (“PROPDMG”) and Crop Damage (“CROPDMG”). They measure the costs incurred by the weather event, in US dollars. There is one problem with these variables:
Problem: These variables by themselves do not express the full USD amount of the cost. Two auxiliary features are required to compute the full cost: “PROPDMGEXP” and “CROPDMGEXP”. They indicate whether the cost is in the order of thousands (“K”), millions (“M”) or billions (“B”). However, there is one big issue with these features: they contain not only “K”, “M” or “B” entries, but also numeric entries (0-8) and symbols such as “?”, “+” and “-”.
After some research, including NOAA’s website and this thread: https://class.coursera.org/repdata-011/forum/thread?thread_id=39 , the following conclusion was reached:
Conclusion: given the negligible amounts involved, the proportionally small number of numeric and symbolic entries in the PROPDMGEXP and CROPDMGEXP features, and the differing interpretations across NOAA’s own datasets, these inconsistent observations are removed.
#Removing observations with numbers and symbols in PROPDMGEXP & CROPDMGEXP
leaveOut <- c(0:8, "?","+","-")
numbersSymbolsPROP <- data$PROPDMGEXP %in% leaveOut | data$CROPDMGEXP %in% leaveOut
#sum(numbersSymbolsPROP) 341 rows to be deleted
datav2 <- data[!numbersSymbolsPROP,] #logical negation also works when no row matches
#Converting H to 100, K to 1000, M to 1 000 000 and B to 1 000 000 000 in PROPERTY DAMAGE
datav2$unitPROPDMG[grepl("H|h",datav2$PROPDMGEXP)] <- 100
datav2$unitPROPDMG[grepl("K|k",datav2$PROPDMGEXP)] <- 1000
datav2$unitPROPDMG[grepl("M|m",datav2$PROPDMGEXP)] <- 1000000
datav2$unitPROPDMG[grepl("B|b",datav2$PROPDMGEXP)] <- 1000000000
datav2$unitPROPDMG[is.na(datav2$unitPROPDMG)] <- 0 # Turn NAs to 0 so that I can add up
#Converting H to 100, K to 1000, M to 1 000 000 and B to 1 000 000 000 in CROP DAMAGE
datav2$unitCROPDMG[grepl("H|h",datav2$CROPDMGEXP)] <- 100
datav2$unitCROPDMG[grepl("K|k",datav2$CROPDMGEXP)] <- 1000
datav2$unitCROPDMG[grepl("M|m",datav2$CROPDMGEXP)] <- 1000000
datav2$unitCROPDMG[grepl("B|b",datav2$CROPDMGEXP)] <- 1000000000
datav2$unitCROPDMG[is.na(datav2$unitCROPDMG)] <- 0 #Turn NAs to 0 so that I can add up
#Multiplying the Cost by the unit of measurement
datav2$PROPCOST <- datav2$PROPDMG*datav2$unitPROPDMG
datav2$CROPCOST <- datav2$CROPDMG*datav2$unitCROPDMG
#Adding up both costs
datav2$PROPCROPCOST <- datav2$PROPCOST + datav2$CROPCOST
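The conversion above can also be written with a lookup table. The sketch below is an alternative under stated assumptions, not the code used in this report: the toy records are invented, the name multiplier is hypothetical, and the lookup matches the exponent code exactly, whereas grepl() matches any string containing the letter. For single-character codes the two behave the same.

```r
# Hypothetical damage records (values invented for illustration)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10, 1, 3),
  PROPDMGEXP = c("K", "m", "", "B", "h"),
  stringsAsFactors = FALSE
)

# Named vector acting as a lookup table: exponent letter -> multiplier
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Upper-case the code and look it up; codes not in the table give NA
unit <- unname(multiplier[toupper(as.character(toy$PROPDMGEXP))])

# Treat missing or unknown codes as 0, as the report does
unit[is.na(unit)] <- 0

toy$PROPCOST <- toy$PROPDMG * unit
print(toy)
```

Keeping the multipliers in one named vector means the same table can be reused for CROPDMGEXP, which avoids repeating the four assignments.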
Now that the Property Damage Cost and the Crop Damage Cost have been added together, we can group by event type to see which events have the greatest economic impact.
#Group by type and add up cost and divides by 1000000000 to convert to billions, then takes top 15
costEvents <- datav2 %>% group_by(EVTYPE) %>% summarise(prop_cropCost= sum(PROPCROPCOST)/1000000000)
orderedCostEvents <- arrange(costEvents, desc(prop_cropCost))
top15Cost <- orderedCostEvents[1:15,]
top15Cost$EVTYPE <- factor(top15Cost$EVTYPE, levels=top15Cost$EVTYPE) #Relevel factors in order
head(top15Cost)
## Source: local data frame [6 x 2]
##
## EVTYPE prop_cropCost
## 1 FLOOD 150.31968
## 2 HURRICANE/TYPHOON 71.91371
## 3 TORNADO 57.30194
## 4 STORM SURGE 43.32354
## 5 HAIL 18.73322
## 6 FLASH FLOOD 17.56154
Figure 2.
#Plot2
plot2 <- ggplot(top15Cost, aes(EVTYPE, prop_cropCost))
plot2 + geom_bar(stat="identity", color="black") +
  #labs() takes named arguments directly; wrapping them in list() is not needed
  labs(x="", y="Total Property + Crop Damage (Bn USD)",
       title="Cumulative Economic Impact of weather events in the US.\n1950-2011 (Bn USD)") +
  theme(legend.position="none") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))