In this report we analyze all the weather events that occurred between 3 January 1950 and 30 November 2011.
After a first step in which the data are read and preprocessed, the analysis identifies the most harmful events, based on the number of fatalities and injuries. Bar charts show the top 20 events for fatalities and for injuries. The two rankings are then intersected to show which events appear in both.
Next we want to know which events have the greatest economic consequences in terms of property damage. Answering this question requires recoding some variables; a bar chart then reports the top 20 events by property damage.
The data are loaded from a bz2 archive that contains a CSV file.
data <- read.csv(bzfile("data/repdata_data_StormData.csv.bz2"))
After reading in the Storm Data dataset we want to get a summary of it, to start exploring the variable values.
summary <- summary(data$EVTYPE)
First I subset this very large dataset, keeping only the observations recorded from 1996 onward, when all 48 event types defined in NWS Directive 10-1605 began to be recorded.
Then I recode the events to the names defined in NWS Directive 10-1605. For each event I build an index of the rows carrying the old names and assign the official event name in a new variable called EVENT. This process requires a lot of code, so only the first six events are shown as an example.
Finally the EVENT variable is converted to a factor.
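The subsetting step is described but not shown. A minimal sketch of how it could look follows, using a toy data frame in place of the real table. The `storms` and `recent` names, the cutoff of 1 January 1996, and the assumed `month/day/year h:mm:ss` layout of BGN_DATE are illustrative assumptions, not taken from the original code.

```r
# Toy stand-in for the Storm Data table (values are made up for illustration).
storms <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/1/1996 0:00:00", "11/30/2011 0:00:00"),
  EVTYPE   = c("TORNADO", "HAIL", "FLOOD"),
  stringsAsFactors = FALSE
)

# as.Date() reads the leading month/day/year part and ignores the trailing time.
storms$BGN_DATE <- as.Date(storms$BGN_DATE, format = "%m/%d/%Y")

# Assumed cutoff: keep events that began on or after 1 January 1996.
cutoff <- as.Date("1996-01-01")
recent <- storms[storms$BGN_DATE >= cutoff, ]
```

Parsing the date before filtering keeps the comparison a true date comparison rather than a string comparison.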
## 1. Refactor dates as date, and other variables as factor
# assign back to the data frame: an assignment inside with() is discarded
data$BGN_DATE <- as.Date(as.character(data$BGN_DATE), format="%m/%d/%Y")
data$STATE__ <- as.factor(data$STATE__)
data$COUNTY <- as.factor(data$COUNTY)
## 2. Recalculate property damage values based on PROPDMGEXP
# transform the PROPDMGEXP to upper case
data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
## 3. Refactoring the variable EVTYPE
# creates a new variable EVENT
data$EVENT <- NA
data <- data[, 1:38]
# Regrouping events:
# ASTRONOMICAL LOW TIDE
astr_low_tide <- which(data$EVTYPE == "ASTRONOMICAL LOW TIDE")
data[astr_low_tide, ]$EVENT <- "ASTRONOMICAL LOW TIDE"
# AVALANCHE
avalanche <- which(data$EVTYPE == "AVALANCHE")
data[avalanche, ]$EVENT <- "AVALANCHE"
# BLIZZARD
blizzard <- which(data$EVTYPE == "BLIZZARD")
data[blizzard, ]$EVENT <- "BLIZZARD"
# COASTAL FLOOD
coastal_flood <- grep('coastal flood', data$EVTYPE, ignore.case = TRUE, perl = TRUE)
data[coastal_flood, ]$EVENT <- "COASTAL FLOOD"
# COLD/WIND CHILL
cold_wind_chill <- c(which(data$EVTYPE == "COLD"), grep('cold/wind chill', data$EVTYPE, ignore.case = TRUE, perl = TRUE))
data[cold_wind_chill, ]$EVENT <- "COLD/WIND CHILL"
# DEBRIS FLOW
debris_flow <- which(data$EVTYPE %in% c("LANDSLIDE", "LANDSLIDES", "MUDSLIDE/LANDSLIDE"))
data[debris_flow, ]$EVENT <- "DEBRIS FLOW"
# ...
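Because the regrouping repeats the same index-then-assign pattern for every event, it could also be driven by a lookup table. The sketch below is one possible way to do that on a toy vector, not the code used in this report. The `event_patterns` table and its regular expressions are hypothetical, and unlike the exact comparisons above they match case-insensitively.

```r
# Toy EVTYPE values standing in for the raw column.
evtype <- c("AVALANCHE", "Coastal Flooding", "COLD", "LANDSLIDES",
            "BLIZZARD", "COASTAL FLOOD", "SOMETHING ELSE")

# Hypothetical lookup: official event name -> regular expression matching
# the raw spellings that should be folded into it.
event_patterns <- c(
  "AVALANCHE"       = "^avalanche$",
  "BLIZZARD"        = "^blizzard$",
  "COASTAL FLOOD"   = "coastal flood",
  "COLD/WIND CHILL" = "^cold$|cold/wind chill",
  "DEBRIS FLOW"     = "^landslides?$|^mudslide/landslide$"
)

event <- rep(NA_character_, length(evtype))
for (name in names(event_patterns)) {
  hits <- grepl(event_patterns[[name]], evtype, ignore.case = TRUE)
  # Only fill rows not already claimed by an earlier pattern.
  event[hits & is.na(event)] <- name
}
```

Rows that match no pattern stay `NA` in this sketch, which makes unmapped spellings easy to list with `unique(evtype[is.na(event)])`.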
Now that we have the refactored events, we can create a barplot with the top 20 events by frequency.
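# loads the packages used below: plyr for count(), ggplot2 for the plots, scales for comma
library(plyr)
library(ggplot2)
library(scales)
# gets the frequencies of the events
event_freq <- count(data, 'EVENT')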
event_freq <- event_freq[order(-event_freq$freq), ]
# gets only the first 20 events by frequency
top20_events <- event_freq[seq(1, 20, 1), ]
# plots the bar chart of the first 20 events by frequency
ggplot(top20_events, aes(x=EVENT, y=freq, fill=freq)) +
geom_bar(stat="identity") +
xlab("Event") +
ylab("Frequency") +
scale_y_continuous(name="Frequency", labels = comma) +
ggtitle("Top 20 Events by Frequency") +
coord_flip() +
theme_bw() + theme(legend.position = "none")
So the two most frequent events, by a wide margin over the others, are Thunderstorm Wind and Hail.
The health consequences of weather events are recorded in two variables, FATALITIES and INJURIES. Let’s see which events are the most harmful.
These are the 20 most harmful events by number of direct and indirect fatalities.
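For readers without plyr, the frequency ranking can be sketched with base R alone. The example below uses a toy vector and keeps the top 2 rather than the top 20; the `event` and `top2` names are illustrative. It also reorders the factor levels by count, an optional step not in the original code, so that a flipped bar chart would list the largest bar at the top instead of in alphabetical order.

```r
# Toy EVENT values (made up for illustration).
event <- c("HAIL", "TORNADO", "HAIL", "FLOOD", "HAIL", "TORNADO")

# table() counts each value; as.data.frame() turns the counts into columns.
event_freq <- as.data.frame(table(EVENT = event), responseName = "freq",
                            stringsAsFactors = FALSE)

# Sort by descending frequency and keep the first rows.
event_freq <- event_freq[order(-event_freq$freq), ]
top2 <- head(event_freq, 2)

# After coord_flip() the first factor level is drawn at the bottom,
# so the levels are reversed to put the most frequent event on top.
top2$EVENT <- factor(top2$EVENT, levels = rev(top2$EVENT))
```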
# aggregates events by fatalities
fatalities <- with(data, aggregate(FATALITIES, by=list(EVENT), FUN = sum))
colnames(fatalities) <- c("EVENT", "FATALITIES")
# orders events by fatalities
fatalities <- fatalities[order(-fatalities$FATALITIES), ]
# selects only the first 20 events by fatalities
top20_fatalities <- fatalities[seq(1, 20, 1), ]
# plots the bar chart of the first 20 events by fatalities
ggplot(top20_fatalities, aes(x=EVENT, y=FATALITIES, fill=FATALITIES)) +
geom_bar(stat="identity") +
xlab("Event") + ylab("Fatalities") +
scale_y_continuous(name="Fatalities", labels = comma) +
ggtitle("Top 20 Events by Fatalities") +
coord_flip() +
theme_bw() + theme(legend.position = "none")
So the most harmful events by number of fatalities are tornadoes and excessive heat.
These are instead the 20 most harmful events by number of direct and indirect injuries.
# aggregates events by injuries
injuries <- with(data, aggregate(INJURIES, by=list(EVENT), FUN = sum))
colnames(injuries) <- c("EVENT", "INJURIES")
# orders events by injuries
injuries <- injuries[order(-injuries$INJURIES), ]
# selects only the first 20 events by injuries
top20_injuries <- injuries[seq(1, 20, 1), ]
ggplot(top20_injuries, aes(x=EVENT, y=INJURIES, fill=INJURIES)) +
geom_bar(stat="identity") +
xlab("Event") + ylab("Injuries") +
scale_y_continuous(name="Injuries", labels = comma) +
ggtitle("Top 20 Events by Injuries") +
coord_flip() +
theme_bw() + theme(legend.position = "none")
The event with the most injuries, by a wide margin over the others, is Tornado.
It is also possible to intersect the two data frames to find which weather events rank among the most harmful by both fatalities and injuries.
intersect(top20_fatalities[, 1], top20_injuries[, 1])
## [1] "TORNADO" "EXCESSIVE HEAT" "FLASH FLOOD"
## [4] "HEAT" "LIGHTNING" "THUNDERSTORM WIND"
## [7] "RIP CURRENT" "OTHER" "FLOOD"
## [10] "AVALANCHE" "HIGH WIND" "WINTER STORM"
## [13] "HEAVY SNOW" "ICE STORM"
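`intersect()` returns only the shared event names. If the counts are wanted side by side, an inner join is one option. The sketch below uses two toy tables with made-up numbers; `top_fatalities`, `top_injuries` and `both` are illustrative names, not objects from this report.

```r
# Toy rankings (numbers are invented for illustration only).
top_fatalities <- data.frame(
  EVENT      = c("TORNADO", "EXCESSIVE HEAT", "RIP CURRENT"),
  FATALITIES = c(30, 20, 10),
  stringsAsFactors = FALSE
)
top_injuries <- data.frame(
  EVENT    = c("TORNADO", "FLOOD", "EXCESSIVE HEAT"),
  INJURIES = c(300, 200, 100),
  stringsAsFactors = FALSE
)

# merge() defaults to an inner join, so only events present in both
# tables survive, each with its fatalities and injuries.
both <- merge(top_fatalities, top_injuries, by = "EVENT")
both <- both[order(-both$FATALITIES), ]
```

The set of events in `both` is the same one `intersect()` would return for the two EVENT columns.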
To find out which events have the greatest economic consequences in the US, each value of the PROPDMG variable must be multiplied by the factor encoded in PROPDMGEXP, properly recoded (e.g. a PROPDMGEXP value of K means 1,000 times the value of PROPDMG).
## PROPERTY_DAMAGE calculation
# sets a default value for PROPERTY_DAMAGE
data$PROPERTY_DAMAGE <- NA
# -
data[data$PROPDMGEXP == "-", ]$PROPERTY_DAMAGE <- data[data$PROPDMGEXP == "-", ]$PROPDMG * -1
# for blank value, ?, +, 0, 1, returns the PROPDMG value
data[data$PROPDMGEXP %in% c("", "?", "+", "0", "1"), ]$PROPERTY_DAMAGE <- data[data$PROPDMGEXP %in% c("", "?", "+", "0", "1"), ]$PROPDMG
# 2, H
data[data$PROPDMGEXP %in% c("2", "H"), ]$PROPERTY_DAMAGE <- data[data$PROPDMGEXP %in% c("2", "H"), ]$PROPDMG * 100
# 3, K
data[data$PROPDMGEXP %in% c("3", "K"), ]$PROPERTY_DAMAGE <- data[data$PROPDMGEXP %in% c("3", "K"), ]$PROPDMG * 1000
# 4
data[data$PROPDMGEXP == "4", ]$PROPERTY_DAMAGE <- data[data$PROPDMGEXP == "4", ]$PROPDMG * 10000
# 5
data[data$PROPDMGEXP == "5", ]$PROPERTY_DAMAGE <- data[data$PROPDMGEXP == "5", ]$PROPDMG * 100000
# 6, M
data[data$PROPDMGEXP %in% c("6", "M"), ]$PROPERTY_DAMAGE <- data[data$PROPDMGEXP %in% c("6", "M"), ]$PROPDMG * 1000000
# 7
data[data$PROPDMGEXP == "7", ]$PROPERTY_DAMAGE <- data[data$PROPDMGEXP == "7", ]$PROPDMG * 10000000
# 8
data[data$PROPDMGEXP == "8", ]$PROPERTY_DAMAGE <- data[data$PROPDMGEXP == "8", ]$PROPDMG * 100000000
# B
data[data$PROPDMGEXP == "B", ]$PROPERTY_DAMAGE <- data[data$PROPDMGEXP == "B", ]$PROPDMG * 1000000000
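The same mapping can be written more compactly as a named lookup vector. This is a sketch on toy values that mirrors the multipliers used above, not a replacement that has been run on the real data; `multiplier`, `exp_code` and `mult` are illustrative names. The blank code is handled separately because an empty string cannot be used to index a named vector.

```r
# Toy PROPDMG / PROPDMGEXP pairs (made up for illustration).
propdmg    <- c(25, 2.5, 1, 10, 5, 3)
propdmgexp <- c("K", "M", "B", "", "h", "2")

# Exponent code -> multiplier, following the rules applied above.
multiplier <- c("-" = -1, "?" = 1, "+" = 1, "0" = 1, "1" = 1,
                "2" = 1e2, "H" = 1e2, "3" = 1e3, "K" = 1e3,
                "4" = 1e4, "5" = 1e5, "6" = 1e6, "M" = 1e6,
                "7" = 1e7, "8" = 1e8, "B" = 1e9)

exp_code <- toupper(propdmgexp)
mult <- unname(multiplier[exp_code])  # NA for codes missing from the table
mult[exp_code == ""] <- 1             # blank exponent: keep PROPDMG as is
property_damage <- propdmg * mult
```

Codes absent from the table yield `NA`, so any unexpected exponent shows up in `summary()` instead of being silently treated as a multiplier of 1.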
After the multipliers are applied, the results are summarized to see what values the new variable PROPERTY_DAMAGE contains.
# summarizes the results
summary(data$PROPERTY_DAMAGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.500e+01 0.000e+00 0.000e+00 4.746e+05 5.000e+02 1.150e+11
## NA's
## 8
Finally, a bar chart shows the 20 events that had the greatest economic consequences.
# aggregates events by property damages
# na.rm = TRUE keeps the few NA values from turning a whole event's total into NA
property_damage <- with(data, aggregate(PROPERTY_DAMAGE, by=list(EVENT), FUN = sum, na.rm = TRUE))
colnames(property_damage) <- c("EVENT", "PROPERTY_DAMAGE")
# orders events by property damage
property_damage <- property_damage[order(-property_damage$PROPERTY_DAMAGE), ]
# selects only the first 20 events by property damage
top20_prp_dmg <- property_damage[seq(1, 20, 1), ]
# creates a barplot with these events
ggplot(top20_prp_dmg, aes(x=EVENT, y=PROPERTY_DAMAGE,fill=PROPERTY_DAMAGE)) +
geom_bar(stat="identity") +
xlab("Event") + ylab("Property Damage in $") +
scale_y_continuous(name="Property Damage in $", labels = comma) +
ggtitle("Top 20 Events by Property Damages") +
coord_flip() +
theme_bw() + theme(legend.position = "none")
The report shows that the worst weather event in terms of both fatalities and injuries is the tornado.
Floods, hurricanes and tornadoes are instead the events with the worst economic consequences.