This analysis uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database tracks characteristics of major storms and weather events in the United States from 1950 to 2011, including when and where they occurred, as well as estimates of any fatalities, injuries, and property damage. The analysis shows that tornadoes are the most harmful natural event with respect to population health, with 5633 fatalities and 91346 injuries across the data set’s timespan. In terms of economic damage, floods are the most harmful natural event, responsible for more than 150 billion dollars in total damage over the same period. Interestingly, the leading cause of crop damage is not floods but drought, with floods coming in second place.
The data set used for this analysis can be found here and the National Weather Service Storm Data Documentation can be found here.
library(dplyr)
library(tidyr)
library(ggplot2)
library(patchwork)
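#stringsAsFactors=TRUE keeps text columns as factors, matching the str() output below (the default changed in R 4.0.0)
dat_storm<-read.csv("repdata_data_StormData.csv", stringsAsFactors=TRUE)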
str(dat_storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
All subsequent analyses depend on the type of event, so I grouped the data by event type.
grouped_event<-group_by(dat_storm, EVTYPE)
The first analysis finds which events bring the most harm to the population. For that, I used the FATALITIES and INJURIES variables together with the event type.
First, I determined the 10 events with the most fatalities.
fat_sum<-summarise(grouped_event, sum(FATALITIES))
fat_sum<-arrange(fat_sum, desc(`sum(FATALITIES)`))
fat_sum<-slice(fat_sum, 1:10)
fat_sum
## # A tibble: 10 x 2
## EVTYPE `sum(FATALITIES)`
## <fct> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
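The backticked column name `sum(FATALITIES)` appears because the summary is left unnamed. As a minimal sketch on made-up toy data, the same top-n table can be built with a named summary column in one pipeline. The names `toy` and `fat_top` are illustrative only, and `slice_max()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data standing in for dat_storm (values are made up for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "FLOOD", "FLOOD", "HAIL"),
  FATALITIES = c(3, 2, 4, 1, 0, 0)
)

# Name the summary column explicitly, then keep the n largest totals
fat_top <- toy %>%
  group_by(EVTYPE) %>%
  summarise(fatalities = sum(FATALITIES)) %>%
  slice_max(fatalities, n = 2, with_ties = FALSE)

fat_top
```

With a named column, later steps can refer to `fatalities` directly instead of the backticked expression.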
Then, the 10 events with the most injuries.
inj_sum<-summarise(grouped_event, sum(INJURIES))
inj_sum<-arrange(inj_sum, desc(`sum(INJURIES)`))
inj_sum<-slice(inj_sum, 1:10)
inj_sum
## # A tibble: 10 x 2
## EVTYPE `sum(INJURIES)`
## <fct> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
Lastly, I created a new variable with the sum of fatalities and injuries and determined the 10 most harmful events with respect to population health.
grouped_event<-mutate(grouped_event, fat_inj=FATALITIES+INJURIES)
fat_inj_sum<-summarise(grouped_event, sum(fat_inj))
fat_inj_sum<-arrange(fat_inj_sum, desc(`sum(fat_inj)`))
fat_inj_sum<-slice(fat_inj_sum, 1:10)
For the analysis itself, I merged the three data frames above by event type into a new data frame and, using the “fatalities + injuries” variable, found the five most harmful events for the population (the events with the highest combined number of dead and injured people).
top5<-merge(fat_sum, inj_sum, by="EVTYPE", all=TRUE)
top5<-merge(top5, fat_inj_sum, by="EVTYPE", all=TRUE)
colnames(top5)<-c("evtype", "fatalities", "injuries", "fat+inj")
top5<-arrange(top5, desc(`fat+inj`))
top5<-slice(top5, 1:5)
top5$evtype<-factor(top5$evtype)
top5
## evtype fatalities injuries fat+inj
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
With this data frame we can see that the tornado is easily the most harmful natural event.
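Merging separate top-10 lists with `all=TRUE` can leave `NA` entries for events that are missing from one of the lists. One possible alternative, sketched here on made-up toy data, computes all three totals in a single `summarise()` call before ranking. The names `toy` and `health` are illustrative and not part of the original analysis.

```r
library(dplyr)

# Toy data standing in for dat_storm (values are made up for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 2, 4, 1),
  INJURIES   = c(40, 10, 5, 20)
)

# One pass per event type: fatalities, injuries, and their combined total
health <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    fatalities = sum(FATALITIES),
    injuries   = sum(INJURIES),
    total      = sum(FATALITIES + INJURIES)
  ) %>%
  arrange(desc(total))

health
```

Because every event keeps all three columns, no merge is needed and no row can end up with a missing total.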
For the second analysis, the variables PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP were used to determine the top 3 natural events that led to the highest economic damage.
In the original data, PROPDMG and CROPDMG hold the estimated damage rounded to three significant digits. Each is followed by a magnitude code in PROPDMGEXP and CROPDMGEXP: “K” for thousands, “M” for millions, and “B” for billions. Other codes appear in PROPDMGEXP and CROPDMGEXP, but they aren’t explained in the Storm Data Documentation.
I started the analysis by processing the four variables. To work with a single variable per damage type, I first converted every level of PROPDMGEXP and CROPDMGEXP into the number its letter represents. Levels not specified in the Storm Data Documentation were given the value 0; this should not affect the analysis, since they cover a very small portion of the whole.
#this summary shows how many records carry each magnitude code for property damage
summary(dat_storm$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5      6
## 465934      1      8      5    216     25     13      4      4     28      4
##      7      8      B      h      H      K      m      M
##      5      1     40      1      6 424665      7  11330
#turning the 19 levels into their corresponding number, in level order
#(B = 1e+09, m and M = 1e+06, K = 1e+03, everything else = 0)
levels(dat_storm$PROPDMGEXP)<-c(rep(0, 13), 1e+09, 0, 0, 1e+03, 1e+06, 1e+06)
dat_storm$PROPDMGEXP<-as.numeric(as.character(dat_storm$PROPDMGEXP))
#this summary shows how many records carry each magnitude code for crop damage
summary(dat_storm$CROPDMGEXP)
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
#turning the 9 levels into their corresponding number, in level order
#(B = 1e+09, m and M = 1e+06, k and K = 1e+03, everything else = 0)
levels(dat_storm$CROPDMGEXP)<-c(rep(0, 4), 1e+09, 1e+03, 1e+03, 1e+06, 1e+06)
dat_storm$CROPDMGEXP<-as.numeric(as.character(dat_storm$CROPDMGEXP))
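Replacing levels by position depends on the exact order and number of levels, which is easy to get wrong. A minimal sketch of an alternative is a named lookup vector that maps each code to its multiplier regardless of level order. The helper `exp_to_multiplier` and the vector `exp_lookup` are hypothetical names, not part of the original analysis; as in the text above, any code other than K, M, or B (in either case) is assumed to count as 0.

```r
# Named lookup: magnitude code -> multiplier (codes not listed count as 0)
exp_lookup <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: accepts a factor or character vector of codes
exp_to_multiplier <- function(code) {
  mult <- exp_lookup[toupper(as.character(code))]  # unknown codes give NA
  mult[is.na(mult)] <- 0                           # treat them as 0
  unname(mult)
}

exp_to_multiplier(c("K", "m", "B", "", "?", "5"))
```

The same helper could then be applied to both PROPDMGEXP and CROPDMGEXP, so the two columns are guaranteed to use one consistent mapping.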
Now, to create the variable I will be working with, I multiplied the values in PROPDMG and PROPDMGEXP as well as the ones in CROPDMG and CROPDMGEXP. That gives me a single variable with an actual number, rather than an abbreviated one.
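dat_storm$PROPDMG<-dat_storm$PROPDMG*dat_storm$PROPDMGEXP
dat_storm$CROPDMG<-dat_storm$CROPDMG*dat_storm$CROPDMGEXP
#grouped_event was built before this rescaling, so regroup to pick up the new values
grouped_event<-group_by(dat_storm, EVTYPE)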
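One thing to keep in mind at this step: a grouped data frame is a separate object, so a grouping created earlier does not see columns that are rescaled later in the source data frame. The sketch below uses made-up numbers to show the difference; `toy`, `grouped_before`, and `grouped_after` are illustrative names.

```r
library(dplyr)

# Toy data standing in for dat_storm (values are made up for illustration)
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL"),
  PROPDMG    = c(2, 3, 5),
  PROPDMGEXP = c(1e6, 1e3, 1e3)
)

grouped_before <- group_by(toy, EVTYPE)      # grouped copy made before rescaling
toy$PROPDMG <- toy$PROPDMG * toy$PROPDMGEXP  # rescale the source data frame
grouped_after <- group_by(toy, EVTYPE)       # grouped copy made after rescaling

stale <- summarise(grouped_before, damage = sum(PROPDMG))  # sums the unscaled values
fresh <- summarise(grouped_after, damage = sum(PROPDMG))   # sums the dollar amounts

stale
fresh
```

In the toy data, `stale` ranks the two events as equal, while `fresh` shows that FLOOD dominates once the magnitude codes are applied.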
The second step was to create data frames with damage by event for both property and crop.
This data frame contains the 10 most economically harmful events for properties.
p_dmg_sum<-summarise(grouped_event, sum(PROPDMG))
p_dmg_sum<-arrange(p_dmg_sum, desc(`sum(PROPDMG)`))
p_dmg_sum<-slice(p_dmg_sum, 1:10)
p_dmg_sum$EVTYPE<-factor(p_dmg_sum$EVTYPE)
And this data frame contains the 17 most economically harmful events for crops.
c_dmg_sum<-summarise(grouped_event, sum(CROPDMG))
c_dmg_sum<-arrange(c_dmg_sum, desc(`sum(CROPDMG)`))
c_dmg_sum<-slice(c_dmg_sum, 1:17)
c_dmg_sum$EVTYPE<-factor(c_dmg_sum$EVTYPE)
To determine which events are the most economically harmful overall, I need the sum of the property and crop damage values. For that, I used the data frame grouped by event type and created a variable that is the sum of CROPDMG and PROPDMG. With that, I created a data frame with the total damage by event.
grouped_event<-mutate(grouped_event, total_dmg=CROPDMG+PROPDMG)
total_dmg<-summarise(grouped_event, sum(total_dmg))
total_dmg<-arrange(total_dmg, desc(`sum(total_dmg)`))
total_dmg$EVTYPE<-factor(total_dmg$EVTYPE)
To determine the three natural events responsible for the highest economic damage, I merged the three data frames that I had created. The top three natural events were found using total damage as the reference.
top3<-merge(c_dmg_sum, p_dmg_sum, by="EVTYPE", all=T)
top3<-merge(top3, total_dmg, by="EVTYPE", all=T)
colnames(top3)<-c("evtype", "cropdmg", "propdmg", "totaldmg")
top3<-arrange(top3, desc(totaldmg))
top3<-slice(top3, 1:3)
top3$evtype<-factor(top3$evtype)
top3
## evtype cropdmg propdmg totaldmg
## 1 TORNADO 100018.5 3212258 3312277
## 2 FLASH FLOOD 179200.5 1420125 1599325
## 3 TSTM WIND 109202.6 1335966 1445168
This analysis shows that tornadoes are the most harmful natural event for the population’s health. With more than 5600 fatalities and more than 90000 injuries, they exceed the second most harmful event (excessive heat) by a factor of more than 11 to 1 in the total number of people affected (injured and dead).
p1<-top5[1,1:3] %>%
  gather("type", "count", -evtype) %>%
  ggplot(aes(evtype, count, fill = type, label=count)) +
  geom_col(position= "dodge") +
  geom_text(aes(label=count), vjust=-0.5, color="black",
            position = position_dodge(0.9), size=2.5) +
  labs(x="", y="", title="Number of fatalities or \ninjuries in most harmful event") +
  scale_fill_discrete(name="", labels=c("Fatalities", "Injuries"))
p2<-top5[2:5,1:3] %>%
  gather("type", "count", -evtype) %>%
  ggplot(aes(evtype, count, fill = type, label=count)) +
  geom_col(position= "dodge") +
  geom_text(aes(label=count), vjust=-0.2, color="black",
            position = position_dodge(0.9), size=2.5) +
  labs(x="Event type", y="Count", title="Number of fatalities \nor injuries per event type") +
  facet_wrap(~evtype,scales = "free_x") +
  theme(legend.position = "none")
p1+p2
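The plots above reshape the wide table into long form with `gather()`, which tidyr now documents as superseded in favour of `pivot_longer()`. As a minimal sketch on made-up toy data, and assuming tidyr 1.0.0 or later, the same reshaping could look like this. The names `toy` and `long` are illustrative.

```r
library(tidyr)

# Toy table in the same shape as top5[, 1:3] (values are made up for illustration)
toy <- data.frame(
  evtype     = c("TORNADO", "FLOOD"),
  fatalities = c(5, 1),
  injuries   = c(50, 20)
)

# One row per event type and measure, ready for aes(evtype, count, fill = type)
long <- pivot_longer(toy, cols = c(fatalities, injuries),
                     names_to = "type", values_to = "count")

long
```

Note that `pivot_longer()` orders the rows by event first, whereas `gather()` stacks one measure after the other; for a dodged bar chart this should not change the plot.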
Regarding the economy, floods are the most harmful natural event overall. They caused more than 150 billion dollars in economic damage throughout this data set’s timespan. Floods are the most economically devastating natural event when it comes to property damage, but for crops, droughts are the worst. They are responsible for almost 14 billion dollars’ worth of damage, more than twice as much as floods, which come in second place.
#crop damage per event type, labelled with the crop damage value
p3<-ggplot(top3, aes(evtype, cropdmg)) +
  geom_col(fill="slateblue1") +
  geom_text(aes(label=cropdmg), vjust=-0.5, color="black", position = position_dodge(0.9), size=3) +
  labs(x="Event type", y="Estimated damage (Dollars)", title="Total amount of estimated damage \nin crops")
#property damage per event type, labelled with the property damage value
p4<-ggplot(top3, aes(evtype, propdmg)) +
  geom_col(fill="brown2") +
  geom_text(aes(label=propdmg), vjust=-0.5, color="black", position = position_dodge(0.9), size=3) +
  labs(x="Event type", y="Estimated damage (Dollars)", title="Total amount of estimated damage \nin properties")
p3+p4
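Raw dollar totals in the billions make long bar labels. A small formatting helper, sketched below in base R, could shorten them. The function `fmt_billions` is a hypothetical name, not part of the original analysis, and the input numbers are made up for illustration.

```r
# Hypothetical helper: express a dollar amount in billions with one decimal
fmt_billions <- function(x) {
  paste0("$", formatC(x / 1e9, format = "f", digits = 1), "B")
}

fmt_billions(c(1.5e11, 2.34e9))
```

It could be used inside the label aesthetic, for example `geom_text(aes(label = fmt_billions(propdmg)))`, leaving the bar heights themselves unchanged.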