Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. In this report we describe which types of events are most harmful to population health and which have the greatest economic consequences. Our overall hypothesis is that weather events are not all equal in their consequences, so the government can prioritize resources for preparing for different types of events. We used data covering 1950 to 2012. From these data, we found that Tornado is by far the most dangerous event type, causing many more fatalities and injuries than any other type of weather event. The most harmful type of weather event in terms of economic consequences is Flood.
From the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database we obtained characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
We first read the data from the raw CSV file contained in the bz2 archive. The data is a delimited file where fields are separated by the , character and missing values are coded as blank fields.
zip.data <- bzfile("repdata-data-StormData.csv.bz2", open = "r")
data <- read.csv(file = zip.data, header = TRUE)
close(zip.data)
After reading, we check the dimensions of the dataset (there are 902,297 rows and 37 columns).
dim(data)
[1] 902297 37
Then we check whether there are any NA values in the columns we need: fatalities, injuries, property damage, and crop damage. We found none.
sum(is.na(data$INJURIES))+sum(is.na(data$FATALITIES))+sum(is.na(data$PROPDMG))+sum(is.na(data$CROPDMG))
[1] 0
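The same check can also be written so that each column is reported separately, which makes it easier to see where a missing value comes from if one ever appears. The sketch below uses a small made-up data frame (`toy`, an assumption for illustration) in place of the full dataset; only the column names are taken from the report.

```r
# Toy stand-in for the storm data (values are invented for illustration)
toy <- data.frame(
  FATALITIES = c(0, 1, 0),
  INJURIES   = c(2, 0, 15),
  PROPDMG    = c(25, 2.5, 0),
  CROPDMG    = c(0, 0, 10)
)

# Count NA values separately for each column of interest
cols <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")
na_counts <- colSums(is.na(toy[, cols]))
na_counts
```

A named vector of zeros here corresponds to the single 0 obtained from the summed check.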
According to the data documentation, an alphabetical character signifies the magnitude of the number (e.g., 1.55B for $1,550,000,000) in the columns PROPDMGEXP and CROPDMGEXP, so we check their levels.
levels(data$PROPDMGEXP)
[1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
[18] "m" "M"
levels(data$CROPDMGEXP)
[1] "" "?" "0" "2" "B" "k" "K" "m" "M"
We see that, besides the valid magnitude characters (“m”, “M”, “k”, “K”, “B”, “h”, “H”), there are some other characters and digits whose interpretation is unknown. We therefore compute the share of PROPDMGEXP values that are either valid or blank.
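# Valid magnitude codes plus the blank code (named so that it does not mask base::c)
valid_exp <- c("m","M","k","K","B","h","H","")
length(data$PROPDMG[data$PROPDMGEXP %in% valid_exp])/length(data$PROPDMG)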
[1] 0.999652
We found that the share of observations with an “unknown” code is very small (about 0.03%), so we treat these codes as meaningless: their multiplier is set to 1, the same as for blank codes, which become NA on conversion to numeric. Then we make two new columns, PropMark and CropMark, that hold the dollar multipliers for the PROPDMG and CROPDMG values.
# Make new columns and replace the character multipliers with numeric ones
data$PropMark <- data$PROPDMGEXP
data$CropMark <- data$CROPDMGEXP
data$PropMark <- sub("[^HhKkmMB]", 1, data$PropMark)
data$PropMark <- sub("[Hh]", 100, data$PropMark)
data$PropMark <- sub("[Kk]", 1000, data$PropMark)
data$PropMark <- sub("[mM]", 1000000, data$PropMark)
data$PropMark <- sub("[B]", 1000000000, data$PropMark)
data$CropMark <- sub("[^HhKkmMB]", 1, data$CropMark)
data$CropMark <- sub("[Hh]", 100, data$CropMark)
data$CropMark <- sub("[Kk]", 1000, data$CropMark)
data$CropMark <- sub("[mM]", 1000000, data$CropMark)
data$CropMark <- sub("[B]", 1000000000, data$CropMark)
# Convert the multiplier strings to numeric (blank codes become NA)
data$PropMark <- as.numeric(data$PropMark)
data$CropMark <- as.numeric(data$CropMark)
# Set NA values (from blank codes) to 1
data$PropMark[is.na(data$PropMark)] <- 1
data$CropMark[is.na(data$CropMark)] <- 1
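As an aside, the same mapping can be expressed with a named lookup vector instead of a chain of `sub()` calls. This is only a sketch of an alternative, not the code used for the results in this report; the helper name `exp_to_multiplier` and the sample `codes` vector are hypothetical. It follows the same convention as above: blank and unrecognised codes get a multiplier of 1.

```r
# Named lookup vector: exponent code -> multiplier
multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map exponent codes to numeric multipliers.
# Blank or unrecognised codes get a multiplier of 1, as in the report.
exp_to_multiplier <- function(exp_codes) {
  m <- multipliers[toupper(as.character(exp_codes))]
  m[is.na(m)] <- 1   # codes not found in the lookup come back as NA
  unname(m)
}

# Made-up sample of codes, including blank and unknown ones
codes <- c("K", "m", "", "B", "?", "5", "h")
exp_to_multiplier(codes)
```

With such a helper, a multiplier column could be filled in one step, for example `data$PropMark <- exp_to_multiplier(data$PROPDMGEXP)`.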
Then we compute PropDamage and CropDamage in dollars, and add the new columns TotalDamage and TotalHurt.
data$PropDamage <- data$PropMark*data$PROPDMG
data$CropDamage <- data$CropMark*data$CROPDMG
data$TotalDamage <- data$PropMark*data$PROPDMG + data$CropMark*data$CROPDMG
data$TotalHurt <- data$FATALITIES + data$INJURIES
First, we want to show the top 10 event types that cause the greatest economic damage and, separately, the greatest harm to population health. For that we need to aggregate the data.
library(reshape2)
DataMelt <- melt (data, id=c("STATE","EVTYPE"), measure.vars=c("TotalDamage","TotalHurt"), na.rm=TRUE)
HurtDamageType <- dcast (DataMelt, EVTYPE ~ variable, sum)
library(plyr)
a <- arrange(HurtDamageType, desc(TotalDamage))
a$EVTYPE <- factor(a$EVTYPE, levels=a$EVTYPE) # fix the level order so bars are plotted by rank
b <- arrange(HurtDamageType, desc(TotalHurt))
b$EVTYPE <- factor(b$EVTYPE, levels=b$EVTYPE) # fix the level order so bars are plotted by rank
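For readers without reshape2 and plyr, a similar per-event-type summary can be sketched in base R with `aggregate()` and `order()`. The data frame `toy` below is invented for illustration and only reuses the report's column names; note that the formula interface of `aggregate()` drops rows containing NA by default.

```r
# Toy stand-in for the storm data (values are invented for illustration)
toy <- data.frame(
  EVTYPE      = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  TotalDamage = c(1e6, 5e6, 2e6, 1e5, 3e6),
  TotalHurt   = c(10, 1, 5, 0, 2),
  stringsAsFactors = FALSE
)

# Sum both measures within each event type
totals <- aggregate(cbind(TotalDamage, TotalHurt) ~ EVTYPE, data = toy, FUN = sum)

# Order by total damage, largest first, and fix the factor level order for plotting
by_damage <- totals[order(-totals$TotalDamage), ]
by_damage$EVTYPE <- factor(by_damage$EVTYPE, levels = by_damage$EVTYPE)
head(by_damage, 10)
```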
Then we can make two plots, one for economic consequences and one for health consequences.
library(ggplot2)
g <- ggplot(a[1:10,], aes(EVTYPE,TotalDamage/1000000000, fill=TotalDamage))+
geom_bar(stat="identity")+
labs(x = "Weather event type", y="Total Damage, billions $")+
theme_bw(base_family="Verdana", base_size=12)+
ggtitle("Top 10 events by Total damage ($) made")+
# a single fill scale carries both the legend name and the colours
scale_fill_gradient(name="Damage, $", low = "#E8DA62", high="#FF7665")
h <- ggplot(b[1:10,], aes(EVTYPE,TotalHurt/1000, fill=TotalHurt))+
geom_bar(stat="identity")+
labs(x = "Weather event type", y="People hurt, thousands")+
theme_bw(base_family="Verdana", base_size=12)+
ggtitle("Top 10 events by Total People hurt (fatalities+injuries)")+
scale_fill_continuous(name="Hurt")
library(gridExtra)
## Loading required package: grid
grid.arrange(g, h, nrow=2, ncol=1)
We can see that Tornado is by far the most dangerous event type, standing well above the other types of weather events in fatalities and injuries caused. The most harmful type of weather event in terms of economic consequences is Flood. Now we look more closely at how the damage divides into property and crop damage, and at which events are the most harmful in each category. For that we again build an aggregated data frame and then make plots.
#Making data for Prop and Crop damage
DataMelt <- melt (data, id=c("STATE","EVTYPE"), measure.vars=c("PropDamage","CropDamage"), na.rm=TRUE)
DamageTypes <- dcast (DataMelt, EVTYPE ~ variable, sum)
a <- arrange(DamageTypes, desc(PropDamage))
a$EVTYPE <- factor(a$EVTYPE, levels=a$EVTYPE)
b <- arrange(DamageTypes, desc(CropDamage))
b$EVTYPE <- factor(b$EVTYPE, levels=b$EVTYPE)
# Plots for Prop and Crop damage
p1 <- ggplot(a[1:5,], aes(EVTYPE,PropDamage/1000000000, fill=PropDamage))+
geom_bar(stat="identity")+
labs(x="", y="Damage, bln $")+
theme_bw(base_family="Verdana", base_size=11)+
ggtitle("Top 5 weather events by Property Damage")+
scale_fill_gradient(low = "#E8DA62", high="#FF7665")
c1 <- ggplot(b[1:5,], aes(EVTYPE,CropDamage/1000000000, fill=CropDamage))+
geom_bar(stat="identity")+
labs(x="", y="Damage, bln $")+
theme_bw(base_family="Verdana", base_size=11)+
ggtitle("Top 5 weather events by Crop Damage")+
scale_fill_gradient(low = "#E8DA62", high="#FF7665")
grid.arrange(p1, c1, nrow=1, ncol=2)
We see that Flood is the event with the highest property damage (almost 150 billion dollars over the observations made since 1950). Flood is also harmful in terms of crop damage (ranked second), where Drought is ranked first. However, the total crop damage is about 10 times smaller than the total property damage.
Then we apply the same operations to split the total number of people hurt into fatalities and injuries.
#Making data for injuries and fatalities
DataMelt <- melt (data, id=c("STATE","EVTYPE"), measure.vars=c("FATALITIES","INJURIES"), na.rm=TRUE)
HurtTypes <- dcast (DataMelt, EVTYPE ~ variable, sum)
a <- arrange(HurtTypes, desc(FATALITIES))
a$EVTYPE <- factor(a$EVTYPE, levels=a$EVTYPE)
b <- arrange(HurtTypes, desc(INJURIES))
b$EVTYPE <- factor(b$EVTYPE, levels=b$EVTYPE)
# Plots for Fatalities and Injuries
f1 <- ggplot(a[1:5,], aes(EVTYPE,FATALITIES/1000, fill=FATALITIES))+
geom_bar(stat="identity")+
labs(x = "", y="Fatalities, thousands")+
theme_bw(base_family="Verdana", base_size=11)+
ggtitle("Top 5 weather events by Fatalities")
i1 <- ggplot(b[1:5,], aes(EVTYPE,INJURIES/1000, fill=INJURIES))+
geom_bar(stat="identity")+
labs(x = "", y="People injured, thousands")+
theme_bw(base_family="Verdana", base_size=11)+
ggtitle("Top 5 weather events by Injuries")
grid.arrange(f1, i1, nrow=1, ncol=2)
From the plot above we can see that Tornado is an extremely dangerous disaster for population health. Its toll is much higher than that of all other weather events, and it both injures and kills a huge number of people. Tornado also ranks in the top 3 weather events by economic damage.