Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

In my analysis I took data which U.S. National Oceanic and Atmospheric Administration’s (NOAA) starts to accumulate in the year 1950 and end in November 2011 and identify most harmful events with respect to population health and events with the greatest economic consequences.

Since it is impossible to express the loss of life and health in monetary terms I didn’t mix this 2 analysis and all the following calculation divided into 2 groups:

The big part of work - Identify and clean duplicates in database in field EVTYPE, since data collected from 1950 and this field wasn’t unified all the time.

Data Pre-Processing

  1. On first stage I’m setting working directory and library’s necessary in the analysis
setwd("D:/R")
library(dplyr, warn.conflicts = FALSE)
library(ggplot2, warn.conflicts = FALSE)
library(reshape2, warn.conflicts = FALSE)
  1. Downloading file from the database to local computer (database starts to accumulate data in the year 1950 and end in November 2011).

Dataset: Storm data

if (!file.exists("StormData.csv.bz2")) {
    fileURL <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
    download.file(fileURL, destfile='StormData.csv.bz2', method = 'curl')
}
  1. Loading data into R
sd <- read.csv(bzfile('StormData.csv.bz2'),header=TRUE, stringsAsFactors = FALSE)
  1. Data processing For the analysis we need 7 fields:

    4.1 EVTYPE - types of events. According to p.“2.1.1 Storm Data Event Table” of Storm Data Documenrarion there are 48 types of events, but downloaded file contains 985 different EVTYPE types

length(unique(sd$EVTYPE))
## [1] 985

So next part of code creates 2 new fields EVTYPE_NON_GROUPED and EVTYPE_GROUPED. EVTYPE_GROUPED grouped together following types:

EVTYPE_NON_GROUPED - just renamed EVTYPE, and deletes rows with “summary” in EVTYPE preventing distorting the results

sd$INTERIM<-sd$EVTYPE%>% gsub("[[:punct:][:blank:]]+", "",.)%>% gsub("1", "",.)%>%gsub("[[:digit:]]+","", .)%>%gsub("s$|S$","",.)%>%
gsub("TSTM", "THUNDERSTORM",.)%>%gsub("FLD", "FLOOD",.)%>%gsub("TYPHOON", "",.)%>%tolower(.)
sd<-sd[!grepl('summary', sd$INTERIM),]

sd_INTERIM <- aggregate(EVTYPE~INTERIM, data = sd, FUN=head,1)
sd_INTERIM$EVTYPE<-sd_INTERIM$EVTYPE%>%gsub("TSTM", "THUNDERSTORM",.)
sd <-  sd %>% left_join(sd_INTERIM, by = "INTERIM")
names(sd)[names(sd) == "EVTYPE.x"] <- "EVTYPE_NON_GROUPED"
names(sd)[names(sd) == "EVTYPE.y"] <- "EVTYPE_GROUPED"
sd<- sd[,!names(sd) %in% c("INTERIM")]

To prove that final datasets ready to work I calculated cumulative share

sd_GROUPED<-sd %>% group_by(EVTYPE_GROUPED) %>% summarise(Count=length(STATE__),.groups='drop')%>% 
arrange(desc(Count))%>%mutate(Share = (Count / sum(Count))*100)%>%mutate(Cumulative_Share = cumsum(Share))

So we can see, that biggest 20 EVTYPE_GROUPED types have correct names (all of them exists in Storm Data Documentation) with share 97,1% wich is acceptable for our analysis.

head(sd_GROUPED,n=20)
## # A tibble: 20 × 4
##    EVTYPE_GROUPED            Count  Share Cumulative_Share
##    <chr>                     <int>  <dbl>            <dbl>
##  1 THUNDERSTORM WIND        323400 35.8               35.8
##  2 HAIL                     288761 32.0               67.9
##  3 TORNADO                   60653  6.72              74.6
##  4 FLASH FLOOD               54311  6.02              80.6
##  5 FLOOD                     25330  2.81              83.4
##  6 HIGH WINDS                21765  2.41              85.8
##  7 LIGHTNING                 15756  1.75              87.6
##  8 HEAVY SNOW                15708  1.74              89.3
##  9 MARINE THUNDERSTORM WIND  11987  1.33              90.6
## 10 HEAVY RAIN                11768  1.30              91.9
## 11 WINTER STORM              11436  1.27              93.2
## 12 WINTER WEATHER             7045  0.781             94.0
## 13 FUNNEL CLOUD               6932  0.768             94.7
## 14 WATERSPOUT                 3846  0.426             95.2
## 15 STRONG WINDS               3773  0.418             95.6
## 16 URBAN/SML STREAM FLD       3392  0.376             96.0
## 17 WILD FIRES                 2773  0.307             96.3
## 18 BLIZZARD                   2719  0.301             96.6
## 19 DROUGHT                    2488  0.276             96.9
## 20 ICE STORM                  2006  0.222             97.1

Deletion of all tables created for intermediate calculations

rm(sd_INTERIM)
rm(sd_GROUPED)

4.2 - 4.3 FATALITIES and INJURIES fields

Don’t have to be proccessed sinse it’s alredy numeric and not NA

class(sd$FATALITIES)
## [1] "numeric"
class(sd$INJURIES)
## [1] "numeric"
sum(is.na(sd$FATALITIES))
## [1] 0
sum(is.na(sd$INJURIES))
## [1] 0

4.4 - 4.5 PROPDMGEXP and CROPDMGEXP fields

Since the value which can have these fields can vary as following:

table(sd$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465858      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
table(sd$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618337      7     19      1      9     21 281832      1   1994

I wrote the following function to interpret the value

interpreter <- function(x) {
    case_when ((tolower(x) %in% "h") ~"2",
               (tolower(x) %in% "k") ~"3",
               (tolower(x) %in% "m") ~"6",
               (tolower(x) %in% "b") ~"9", 
               x %in% c('','-', '+', '?') ~"0",
               TRUE ~"0"
              )}

4.6 - 4.7 PROPDMG and CROPDMG fields

Since there is correlation between PROPDMG&PROPDMGEXP and CROPDMG&CROPDMGEXP I created new fields in dataset sd which calculated the final values nessesary for work

sd<-sd%>%mutate(PROPDMG_FINAL =PROPDMG*(10**as.numeric(interpreter(PROPDMGEXP))))%>%mutate(CROPDMG_FINAL =CROPDMG*(10**as.numeric(interpreter(CROPDMGEXP))))

Data Processing

5.1 Definition of most harmful events with respect to population health

The following code determine 10 most harmful events in the section of FATALITIES, INJURIES and FATALITIES+INJURIES. Deletes duplicates and form top of the most harmful events with respect to population health (in our case top 13)

  sd_harmful<-sd %>% group_by(EVTYPE_GROUPED) %>% summarise(FATALITIES=sum(FATALITIES),INJURIES=sum(INJURIES),.groups='drop')%>% 
mutate(FATALITIES_INJURIES = FATALITIES+INJURIES)%>%arrange(desc(FATALITIES_INJURIES))

sd_harmful_top<-rbind(sd_harmful%>%arrange(desc(FATALITIES))%>%slice(1:10),sd_harmful%>%arrange(desc(INJURIES))%>%slice(1:10),sd_harmful%>%arrange(desc(FATALITIES_INJURIES))%>%slice(1:10))
sd_harmful_top<-sd_harmful_top[!duplicated(sd_harmful_top), ]

sd_harmful_top<-melt(sd_harmful_top,id=c("EVTYPE_GROUPED"),measure.vars=c("FATALITIES","INJURIES","FATALITIES_INJURIES"))

5.2 Definition of events with the greatest economic consequences

The following code determine 10 most harmful events in the section of PROPDMG, CROPDMG and PROPDMG+CROPDMG. Deletes duplicates and form top of the most harmful events with respect to population health (in our case top 14)

sd_economic<-sd%>% group_by(EVTYPE_GROUPED) %>% summarise(PROPDMG_FINAL=sum(PROPDMG_FINAL),CROPDMG_FINAL=sum(CROPDMG_FINAL),.groups='drop')%>% 
mutate(PROPDMG_CROPDMG=PROPDMG_FINAL+CROPDMG_FINAL)%>%arrange(desc(PROPDMG_CROPDMG))


sd_economic_top<-rbind(sd_economic%>%arrange(desc(PROPDMG_FINAL))%>%slice(1:10),sd_economic%>%arrange(desc(CROPDMG_FINAL))%>%slice(1:10),sd_economic%>%arrange(desc(PROPDMG_CROPDMG))%>%slice(1:10))
sd_economic_top<-sd_economic_top[!duplicated(sd_economic_top), ]

sd_economic_top<-melt(sd_economic_top,id=c("EVTYPE_GROUPED"),measure.vars=c("PROPDMG_FINAL","CROPDMG_FINAL","PROPDMG_CROPDMG"))
sd_economic_top[is.na(sd_economic_top$value)] <- 0.1

Results

6.1. Most harmful events with respect to population health

qplot(EVTYPE_GROUPED,value,data=sd_harmful_top,facets= variable ~. ,fill=variable) + 
    geom_bar(stat = "identity") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
    xlab("Event Type") + ylab("Cases occurred in the vicinity") + ggtitle("Weather Events most harmful to population health")+
    scale_fill_manual(values = c("darkblue","steelblue1","steelblue3"))+
    geom_text(data = sd_harmful_top[sd_harmful_top$value < 15000 ,],aes(x = EVTYPE_GROUPED, y = value, label = value), vjust = -0.5,hjust = -0.5, size = 2.5,angle = 45)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

According to graphic we can see that top 5 the most harmful events with respect to population health are:

The sum of Fatalities and Injuries is not very representative since number of cases in Injuries significantly bigger than in Fatalities =>influence on the common results not in fair share.

6.2 Events with the greatest economic consequences Since the values to big, I used exponential value for presenting results

qplot(EVTYPE_GROUPED,log10(value),data=sd_economic_top,facets= variable ~. ,fill=variable) + 
    geom_bar(stat = "identity") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
    xlab("Event Type") + ylab("Property damage in dollars (log-scale)") + ggtitle("Weather events with greatest economic consequences")+
    scale_fill_manual(values = c("darkgreen","darkolivegreen4","chartreuse3"))+
    geom_text(data = sd_economic_top,aes(x = EVTYPE_GROUPED, y = log10(value), label = round(log10(value),1)), vjust = 1,hjust = 1, size = 3,angle = 45)

According to graphic we can see that top 5 Events with the greatest economic consequences are:

6.3 In common I can say that 2 weather events presenting in both lists - TORNADO and FLOOD- the most dangerous and the studies should be paid special attention to it.