This study, made for a Coursera assignment, is based on data provided by the U.S. National Oceanic and Atmospheric Administration (NOAA). The main goal was to answer two questions: which types of severe weather events are most harmful to population health, and which have the greatest economic consequences.
The analysis shows that tornadoes have the biggest impact on health, in terms of both injuries and fatalities. It also shows that floods have the biggest economic impact: they cause the most property damage, although for crops the main source of damage is drought. To improve readability, code chunks have been set to fold.
Loading necessary libraries
library(ggplot2)
library(ggthemes)
library(scales)
library(tidyr)
library(dplyr)
The chunk option cache=TRUE has been set for two code chunks to avoid repeating time-consuming processing.
# download the data only if it is not already present
if (!file.exists("StormData.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2")
}
Loading the data into the environment.
storm <- read.csv("StormData.csv.bz2")
A quick look at the data.
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
It can be observed that the PROPDMGEXP and CROPDMGEXP variables have strange factor levels. A closer look shows that some of them are suffixes such as K or M, which stand for thousands, millions and so on. Some are in lower case, some in upper case.
table(storm$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5      6
## 465934      1      8      5    216     25     13      4      4     28      4
##      7      8      B      h      H      K      m      M
##      5      1     40      1      6 424665      7  11330
table(storm$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
To fix this, we first convert all letters to upper case, and then multiply the values of PROPDMG and CROPDMG by the corresponding factor (hundreds, thousands, millions or billions) to get the correct amounts. Values with any other exponent code are set to zero.
storm.correct <- storm %>%
  # unify the case of the exponent codes
  mutate(PROPDMGEXP = toupper(PROPDMGEXP), CROPDMGEXP = toupper(CROPDMGEXP)) %>%
  # convert damage to USD; codes other than H, K, M and B are set to 0
  mutate(
    property.damage = case_when(
      PROPDMGEXP == "H" ~ PROPDMG * 1e2,
      PROPDMGEXP == "K" ~ PROPDMG * 1e3,
      PROPDMGEXP == "M" ~ PROPDMG * 1e6,
      PROPDMGEXP == "B" ~ PROPDMG * 1e9,
      TRUE ~ 0
    ),
    crop.damage = case_when(
      CROPDMGEXP == "H" ~ CROPDMG * 1e2,
      CROPDMGEXP == "K" ~ CROPDMG * 1e3,
      CROPDMGEXP == "M" ~ CROPDMG * 1e6,
      CROPDMGEXP == "B" ~ CROPDMG * 1e9,
      TRUE ~ 0
    )
  )
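The same unit conversion can also be written as a lookup table, which keeps the multipliers in one place. The following is only a minimal sketch in base R on a small made-up data frame, not the code used for this report; `exp_multiplier` and `toy` are hypothetical names, and, as in the code above, any exponent code other than H, K, M or B is assumed to mean zero damage.

```r
# Hypothetical lookup of multipliers for the exponent codes handled above.
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Made-up rows for illustration only; not taken from the NOAA data.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 7),
  PROPDMGEXP = c("K", "m", "B", "", "5"),
  stringsAsFactors = FALSE
)

# Upper-case the codes and look up the multiplier by name;
# codes missing from the lookup come back as NA and are set to 0.
mult <- unname(exp_multiplier[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 0

toy$property.damage <- toy$PROPDMG * mult
print(toy)
```

Adding support for another code would then mean adding one element to `exp_multiplier` rather than another branch in the conditional.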
Once the values have been corrected, we can move on to processing the data.
# Code for the question about harm to health
question1 <- storm.correct %>%
  select(EVTYPE, FATALITIES, INJURIES) %>%
  group_by(EVTYPE) %>%
  summarise(
    total = sum(FATALITIES, na.rm = TRUE) + sum(INJURIES, na.rm = TRUE),
    fatalities = sum(FATALITIES, na.rm = TRUE),
    injuries = sum(INJURIES, na.rm = TRUE)
  )
# code for comparing injuries and fatalities
question1a <- question1 %>%
  arrange(desc(total)) %>%
  head(10) %>%
  select(-total) %>%
  pivot_longer(cols = c(fatalities, injuries), names_to = "cases", values_to = "number")
# code for comparing tornadoes with all other weather events
question1b <- question1 %>%
  mutate(tornado = if_else(EVTYPE == "TORNADO", "tornado", "other")) %>%
  group_by(tornado) %>%
  summarise(
    fatalities = sum(fatalities),
    injuries = sum(injuries)
  ) %>%
  pivot_longer(cols = c(fatalities, injuries), names_to = "cases", values_to = "number")
# Code for the question about property and crop damage
question2 <- storm.correct %>%
  select(EVTYPE, property.damage, crop.damage) %>%
  group_by(EVTYPE) %>%
  summarise(
    total = sum(property.damage, na.rm = TRUE) + sum(crop.damage, na.rm = TRUE),
    property = sum(property.damage, na.rm = TRUE),
    crop = sum(crop.damage, na.rm = TRUE)
  )
# code for comparing property and crop damage
question2a <- question2 %>%
  arrange(desc(total)) %>%
  head(10) %>%
  select(-total) %>%
  pivot_longer(cols = c(property, crop), names_to = "damage", values_to = "number")
To answer the question about population health, five plots have been made.
ggplot(head(arrange(question1, desc(total)), 10), aes(x = reorder(EVTYPE, total), y = total)) +
  geom_col(fill = "blue") +
  labs(y = "number of injuries and fatalities", x = "type of event", title = "TOP 10 most harmful weather events") +
  geom_label(aes(label = total), hjust = -0.25, fill = "grey90") +
  scale_y_continuous(limits = c(0, 110000)) +
  coord_flip() +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
ggplot(head(arrange(question1, desc(injuries)), 10), aes(x = reorder(EVTYPE, injuries), y = injuries)) +
  geom_col(fill = "steelblue") +
  labs(y = "number of injuries", x = "type of event", title = "TOP 10 weather events causing most injuries") +
  geom_label(aes(label = injuries), hjust = -0.25, fill = "grey90") +
  scale_y_continuous(limits = c(0, 105000)) +
  coord_flip() +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
ggplot(head(arrange(question1, desc(fatalities)), 10), aes(x = reorder(EVTYPE, fatalities), y = fatalities)) +
  geom_col(fill = "red1") +
  labs(y = "number of fatalities", x = "type of event", title = "TOP 10 deadliest weather events") +
  geom_label(aes(label = fatalities), hjust = -0.25, fill = "grey90") +
  scale_y_continuous(limits = c(0, 6000)) +
  coord_flip() +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
ggplot(question1a, aes(x = reorder(EVTYPE, number), y = number, fill = cases)) +
  geom_col() +
  labs(y = "total number of injuries and fatalities", x = "type of event", title = "TOP 10 most harmful weather events", fill = "") +
  coord_flip() +
  scale_y_continuous(limits = c(0, 100000)) +
  # the colours apply to the fill aesthetic, so use the fill scale directly
  scale_fill_manual(values = c("injuries" = "steelblue", "fatalities" = "red1")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = c(0.8, 0.1), legend.direction = "horizontal")
ggplot(question1b, aes(x = reorder(tornado, number), y = number, fill = cases)) +
  geom_col() +
  labs(y = "total number", x = "type of event", title = "Comparing tornado and other weather events", fill = "") +
  facet_wrap(~cases, ncol = 2, scales = "free") +
  coord_flip() +
  scale_fill_manual(values = c("injuries" = "steelblue", "fatalities" = "red1")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
The study shows that tornadoes are not only the main threat to health among severe weather events, but are also responsible for more injuries than all other events combined.
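The "more than all other events combined" comparison comes down to one share per outcome: tornado cases divided by all cases. A minimal sketch of that calculation in base R follows; the `toy` data frame and its counts are made up for illustration and are not results from the NOAA data.

```r
# Made-up totals per event type; illustration only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HEAT", "LIGHTNING"),
  fatalities = c(50, 30, 40, 10),
  injuries   = c(900, 200, 150, 100),
  stringsAsFactors = FALSE
)

is_tornado <- toy$EVTYPE == "TORNADO"

# Share of each outcome attributable to tornadoes.
tornado_share <- c(
  fatalities = sum(toy$fatalities[is_tornado]) / sum(toy$fatalities),
  injuries   = sum(toy$injuries[is_tornado]) / sum(toy$injuries)
)

# A share above 0.5 means tornadoes exceed all other events combined.
print(round(tornado_share, 3))
print(tornado_share > 0.5)
```

Applied to a summary table such as `question1`, the same two sums would give the actual shares.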
To answer the question about economic consequences, four plots have been made.
ggplot(head(arrange(question2, desc(total)), 10), aes(x = reorder(EVTYPE, total), y = total / 1e9)) +
  geom_col(fill = "coral1") +
  labs(y = "total damage (in USD billions)", x = "type of event", title = "TOP 10 weather events causing most damage") +
  geom_label(aes(label = paste(round(total / 1e9, 1), "$bn")), hjust = -0.25, fill = "grey90") +
  scale_y_continuous(labels = comma, limits = c(0, 180)) +
  coord_flip() +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
ggplot(head(arrange(question2, desc(property)), 10), aes(x = reorder(EVTYPE, property), y = property / 1e9)) +
  geom_col(fill = "red1") +
  labs(y = "total damage (in USD billions)", x = "type of event", title = "TOP 10 weather events causing most property damage") +
  geom_label(aes(label = paste(round(property / 1e9, 1), "$bn")), hjust = -0.25, fill = "grey90") +
  scale_y_continuous(labels = comma, limits = c(0, 175)) +
  coord_flip() +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
ggplot(head(arrange(question2, desc(crop)), 10), aes(x = reorder(EVTYPE, crop), y = crop / 1e9)) +
  geom_col(fill = "seagreen4") +
  labs(y = "total damage (in USD billions)", x = "type of event", title = "TOP 10 weather events causing most crop damage") +
  geom_label(aes(label = paste(round(crop / 1e9, 1), "$bn")), hjust = -0.25, fill = "grey90") +
  scale_y_continuous(labels = comma, limits = c(0, 16)) +
  coord_flip() +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
ggplot(question2a, aes(x = reorder(EVTYPE, number), y = number / 1e9, fill = damage)) +
  geom_col() +
  # this plot stacks property and crop damage, so the title covers both
  labs(y = "total damage (in USD billions)", x = "type of event", title = "TOP 10 weather events causing most damage (property and crop)", fill = "") +
  coord_flip() +
  scale_fill_manual(values = c("crop" = "seagreen4", "property" = "red1")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = c(0.8, 0.1), legend.direction = "horizontal")
The study shows that, for the economy, floods are the biggest problem, causing over USD 150 bn in damage. Floods are the main threat to property; for crops, however, droughts pose the biggest challenge.
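The leading event for each damage category can also be read off programmatically instead of from the plots. The sketch below is a hedged illustration in base R; the `toy` data frame and its values are made up (in arbitrary units) and only mimic the shape of a summary table such as `question2`.

```r
# Made-up damage totals per event type; illustration only.
toy <- data.frame(
  EVTYPE   = c("FLOOD", "DROUGHT", "HAIL"),
  property = c(100, 1, 15),
  crop     = c(5, 14, 3),
  stringsAsFactors = FALSE
)

# For each damage column, pick the event type with the largest value.
top_event <- sapply(
  toy[c("property", "crop")],
  function(x) toy$EVTYPE[which.max(x)]
)

print(top_event)
```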