Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
In my analysis I took data which U.S. National Oceanic and Atmospheric Administration’s (NOAA) starts to accumulate in the year 1950 and end in November 2011 and identify most harmful events with respect to population health and events with the greatest economic consequences.
Since it is impossible to express the loss of life and health in monetary terms I didn’t mix this 2 analysis and all the following calculation divided into 2 groups:
The big part of work - Identify and clean duplicates in database in field EVTYPE, since data collected from 1950 and this field wasn’t unified all the time.
setwd("D:/R")
library(dplyr, warn.conflicts = FALSE)
library(ggplot2, warn.conflicts = FALSE)
library(reshape2, warn.conflicts = FALSE)
Dataset: Storm data
if (!file.exists("StormData.csv.bz2")) {
fileURL <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
download.file(fileURL, destfile='StormData.csv.bz2', method = 'curl')
}
sd <- read.csv(bzfile('StormData.csv.bz2'),header=TRUE, stringsAsFactors = FALSE)
Data processing For the analysis we need 7 fields:
4.1 EVTYPE - types of events. According to p.“2.1.1 Storm Data Event Table” of Storm Data Documenrarion there are 48 types of events, but downloaded file contains 985 different EVTYPE types
length(unique(sd$EVTYPE))
## [1] 985
So next part of code creates 2 new fields EVTYPE_NON_GROUPED and EVTYPE_GROUPED. EVTYPE_GROUPED grouped together following types:
EVTYPE_NON_GROUPED - just renamed EVTYPE, and deletes rows with “summary” in EVTYPE preventing distorting the results
sd$INTERIM<-sd$EVTYPE%>% gsub("[[:punct:][:blank:]]+", "",.)%>% gsub("1", "",.)%>%gsub("[[:digit:]]+","", .)%>%gsub("s$|S$","",.)%>%
gsub("TSTM", "THUNDERSTORM",.)%>%gsub("FLD", "FLOOD",.)%>%gsub("TYPHOON", "",.)%>%tolower(.)
sd<-sd[!grepl('summary', sd$INTERIM),]
sd_INTERIM <- aggregate(EVTYPE~INTERIM, data = sd, FUN=head,1)
sd_INTERIM$EVTYPE<-sd_INTERIM$EVTYPE%>%gsub("TSTM", "THUNDERSTORM",.)
sd <- sd %>% left_join(sd_INTERIM, by = "INTERIM")
names(sd)[names(sd) == "EVTYPE.x"] <- "EVTYPE_NON_GROUPED"
names(sd)[names(sd) == "EVTYPE.y"] <- "EVTYPE_GROUPED"
sd<- sd[,!names(sd) %in% c("INTERIM")]
To prove that final datasets ready to work I calculated cumulative share
sd_GROUPED<-sd %>% group_by(EVTYPE_GROUPED) %>% summarise(Count=length(STATE__),.groups='drop')%>%
arrange(desc(Count))%>%mutate(Share = (Count / sum(Count))*100)%>%mutate(Cumulative_Share = cumsum(Share))
So we can see, that biggest 20 EVTYPE_GROUPED types have correct names (all of them exists in Storm Data Documentation) with share 97,1% wich is acceptable for our analysis.
head(sd_GROUPED,n=20)
## # A tibble: 20 × 4
## EVTYPE_GROUPED Count Share Cumulative_Share
## <chr> <int> <dbl> <dbl>
## 1 THUNDERSTORM WIND 323400 35.8 35.8
## 2 HAIL 288761 32.0 67.9
## 3 TORNADO 60653 6.72 74.6
## 4 FLASH FLOOD 54311 6.02 80.6
## 5 FLOOD 25330 2.81 83.4
## 6 HIGH WINDS 21765 2.41 85.8
## 7 LIGHTNING 15756 1.75 87.6
## 8 HEAVY SNOW 15708 1.74 89.3
## 9 MARINE THUNDERSTORM WIND 11987 1.33 90.6
## 10 HEAVY RAIN 11768 1.30 91.9
## 11 WINTER STORM 11436 1.27 93.2
## 12 WINTER WEATHER 7045 0.781 94.0
## 13 FUNNEL CLOUD 6932 0.768 94.7
## 14 WATERSPOUT 3846 0.426 95.2
## 15 STRONG WINDS 3773 0.418 95.6
## 16 URBAN/SML STREAM FLD 3392 0.376 96.0
## 17 WILD FIRES 2773 0.307 96.3
## 18 BLIZZARD 2719 0.301 96.6
## 19 DROUGHT 2488 0.276 96.9
## 20 ICE STORM 2006 0.222 97.1
Deletion of all tables created for intermediate calculations
rm(sd_INTERIM)
rm(sd_GROUPED)
4.2 - 4.3 FATALITIES and INJURIES fields
Don’t have to be proccessed sinse it’s alredy numeric and not NA
class(sd$FATALITIES)
## [1] "numeric"
class(sd$INJURIES)
## [1] "numeric"
sum(is.na(sd$FATALITIES))
## [1] 0
sum(is.na(sd$INJURIES))
## [1] 0
4.4 - 4.5 PROPDMGEXP and CROPDMGEXP fields
Since the value which can have these fields can vary as following:
table(sd$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 465858 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(sd$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618337 7 19 1 9 21 281832 1 1994
I wrote the following function to interpret the value
interpreter <- function(x) {
case_when ((tolower(x) %in% "h") ~"2",
(tolower(x) %in% "k") ~"3",
(tolower(x) %in% "m") ~"6",
(tolower(x) %in% "b") ~"9",
x %in% c('','-', '+', '?') ~"0",
TRUE ~"0"
)}
4.6 - 4.7 PROPDMG and CROPDMG fields
Since there is correlation between PROPDMG&PROPDMGEXP and CROPDMG&CROPDMGEXP I created new fields in dataset sd which calculated the final values nessesary for work
sd<-sd%>%mutate(PROPDMG_FINAL =PROPDMG*(10**as.numeric(interpreter(PROPDMGEXP))))%>%mutate(CROPDMG_FINAL =CROPDMG*(10**as.numeric(interpreter(CROPDMGEXP))))
5.1 Definition of most harmful events with respect to population health
The following code determine 10 most harmful events in the section of FATALITIES, INJURIES and FATALITIES+INJURIES. Deletes duplicates and form top of the most harmful events with respect to population health (in our case top 13)
sd_harmful<-sd %>% group_by(EVTYPE_GROUPED) %>% summarise(FATALITIES=sum(FATALITIES),INJURIES=sum(INJURIES),.groups='drop')%>%
mutate(FATALITIES_INJURIES = FATALITIES+INJURIES)%>%arrange(desc(FATALITIES_INJURIES))
sd_harmful_top<-rbind(sd_harmful%>%arrange(desc(FATALITIES))%>%slice(1:10),sd_harmful%>%arrange(desc(INJURIES))%>%slice(1:10),sd_harmful%>%arrange(desc(FATALITIES_INJURIES))%>%slice(1:10))
sd_harmful_top<-sd_harmful_top[!duplicated(sd_harmful_top), ]
sd_harmful_top<-melt(sd_harmful_top,id=c("EVTYPE_GROUPED"),measure.vars=c("FATALITIES","INJURIES","FATALITIES_INJURIES"))
5.2 Definition of events with the greatest economic consequences
The following code determine 10 most harmful events in the section of PROPDMG, CROPDMG and PROPDMG+CROPDMG. Deletes duplicates and form top of the most harmful events with respect to population health (in our case top 14)
sd_economic<-sd%>% group_by(EVTYPE_GROUPED) %>% summarise(PROPDMG_FINAL=sum(PROPDMG_FINAL),CROPDMG_FINAL=sum(CROPDMG_FINAL),.groups='drop')%>%
mutate(PROPDMG_CROPDMG=PROPDMG_FINAL+CROPDMG_FINAL)%>%arrange(desc(PROPDMG_CROPDMG))
sd_economic_top<-rbind(sd_economic%>%arrange(desc(PROPDMG_FINAL))%>%slice(1:10),sd_economic%>%arrange(desc(CROPDMG_FINAL))%>%slice(1:10),sd_economic%>%arrange(desc(PROPDMG_CROPDMG))%>%slice(1:10))
sd_economic_top<-sd_economic_top[!duplicated(sd_economic_top), ]
sd_economic_top<-melt(sd_economic_top,id=c("EVTYPE_GROUPED"),measure.vars=c("PROPDMG_FINAL","CROPDMG_FINAL","PROPDMG_CROPDMG"))
sd_economic_top[is.na(sd_economic_top$value)] <- 0.1
6.1. Most harmful events with respect to population health
qplot(EVTYPE_GROUPED,value,data=sd_harmful_top,facets= variable ~. ,fill=variable) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Event Type") + ylab("Cases occurred in the vicinity") + ggtitle("Weather Events most harmful to population health")+
scale_fill_manual(values = c("darkblue","steelblue1","steelblue3"))+
geom_text(data = sd_harmful_top[sd_harmful_top$value < 15000 ,],aes(x = EVTYPE_GROUPED, y = value, label = value), vjust = -0.5,hjust = -0.5, size = 2.5,angle = 45)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
According to graphic we can see that top 5 the most harmful events with respect to population health are:
The sum of Fatalities and Injuries is not very representative since number of cases in Injuries significantly bigger than in Fatalities =>influence on the common results not in fair share.
6.2 Events with the greatest economic consequences Since the values to big, I used exponential value for presenting results
qplot(EVTYPE_GROUPED,log10(value),data=sd_economic_top,facets= variable ~. ,fill=variable) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Event Type") + ylab("Property damage in dollars (log-scale)") + ggtitle("Weather events with greatest economic consequences")+
scale_fill_manual(values = c("darkgreen","darkolivegreen4","chartreuse3"))+
geom_text(data = sd_economic_top,aes(x = EVTYPE_GROUPED, y = log10(value), label = round(log10(value),1)), vjust = 1,hjust = 1, size = 3,angle = 45)
According to graphic we can see that top 5 Events with the greatest economic consequences are:
6.3 In common I can say that 2 weather events presenting in both lists - TORNADO and FLOOD- the most dangerous and the studies should be paid special attention to it.