Storms and other severe weather events can cause both public health and economic problems for communities and municipalities.
Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This document explores the impact, in the United States, of storms & severe weather events on :
- human fatalities / injuries
- the economy
28/05/2020 : correcting an error on cost calculation (Billions not correctly affected)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.2 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(varhandle)
## Setting Working directory
setwd("~/JohnsHopkins/JohnsHopkins/M05 - W04")
## Preparing a directory to store data
if (!file.exists("data"))
{
dir.create("data")
}
Donwloading raw data from the course Web site
https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Raw data is stored in csv (comma separated file)
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile="./data/rawstorm.bz2",method="libcurl")
Unzipping the storm data file to “data” disrectory by using read.csv and bzfile commands
Unzipped data is transferred to the rawstorm dataframe
rawstorm_df <- read.csv(bzfile("./data/rawstorm.bz2"),sep=',')
This question can be answered by focussing on data appearing in the “Fatalities” and “Injuries” columns of the dataset.
Note that data stored in the dataset is reporting direct fatalities / direct injuries related to a specific storm event.
Creating a subset to focus only on EVTYPE (type of events), Fatalities and injuries then building a pipe :
grouping by EV_TYPE, Summarizing by sum of fatalities & injuries per event + arranging by descending order
The resulting dataframe is stored in the “harmfulevent_df” variable
harmfulevent_df <- select(rawstorm_df,EVTYPE,FATALITIES,INJURIES) %>% group_by(EVTYPE) %>%
summarise(TOT_FATALITIES = sum(FATALITIES), TOT_INJURIES = sum(INJURIES)) %>% arrange(desc(TOT_FATALITIES), desc(TOT_INJURIES))
This question can be answered by focussing on data contained in the “Damage” Columns.
Damage data is encoded into 4 columns :
- PROPDMG : Property Damage
- PROPGMGEXP : Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
- CROPDMG : Crop Damage
- CROPDMGEXP : Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
Creating a subset to focus only on EVTYPE (type of events) and damages thanks to a pipe.
dmgevent_df <- select(rawstorm_df,EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP) %>% group_by(EVTYPE)
Checking the consistency of the data stored in the PROPDMGEXP column.
- Is there nas ?
- are alphabetical characters conform to desciption : “k”,“M” or “B” ?
sum(is.na(dmgevent_df$PROPDMG))
## [1] 0
sum(is.na(dmgevent_df$PROPDMGEXP))
## [1] 0
Checking unique values content
unique(dmgevent_df$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
A bunch of values aren’ conform to description : “+, 0, 5, 6, h, H, etc.”
Let’s determine the number of inconsistent records
sort(summary(dmgevent_df$PROPDMGEXP))
## - 8 h 3 4 6 + 7 H m ?
## 1 1 1 4 4 4 5 5 6 7 8
## 2 1 5 B 0 M K
## 13 25 28 40 216 11330 424665 465934
It appears that the majority of the events are correctly encoded with values “”,“K” or “M”
Rejecting the other values can be considered as part of the data cleanin process
Checking the consistency of the data stored in the CROPDMGEXP column.
- Is there nas ?
- are alphabetical characters conform to desciption : “k”,“M” or “B” ?
sum(is.na(dmgevent_df$CROPDMG))
## [1] 0
sum(is.na(dmgevent_df$CROPDMGEXP))
## [1] 0
Checking unique values content
unique(dmgevent_df$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
A bunch of values aren’ conform to description : “?”,“m”,“2” etc."
Let’s determine the number of inconsistent records
sort(summary(dmgevent_df$CROPDMGEXP))
## 2 m ? B 0 k M K
## 1 1 7 9 19 21 1994 281832 618413
It appears that the majority of the events are correctly encoded with values “”,“K” or “M”
Rejecting the other values can be considered as part of the data cleanin process
As a first step, data cleaning will consist in keeping correctly encoded values let’s define a vector of “correct amount characters” amount_char_v
amount_char_v <- c("","K","M","B")
Extracting proper data from the dataframe using the previous vector
dmgevent_clean_df <- dmgevent_df %>% filter(PROPDMGEXP %in% amount_char_v,CROPDMGEXP %in% amount_char_v)
# Correction factor levels
dmgevent_clean_df <- droplevels(dmgevent_clean_df)
This operations requires to merge information contained on xxxDMG and xxxDMGEXP columns into numerical values
ie if the value of xxxDMGEXP = K => then xxxDMG value will be multiplied by 1000
dmgevent_clean_df <- dmgevent_clean_df %>%
mutate(PROPDMGEXP_num =
case_when(
PROPDMGEXP=="" ~ 1,
PROPDMGEXP=="K" ~ 1000,
PROPDMGEXP=="M" ~ 1000000,
PROPDMGEXP=="B" ~ 1000000000)
)
dmgevent_clean_df <- dmgevent_clean_df %>%
mutate(CROPDMGEXP_num =
case_when(
CROPDMGEXP=="" ~ 1,
CROPDMGEXP=="K" ~ 1000,
CROPDMGEXP=="M" ~ 1000000,
CROPDMGEXP=="B" ~ 1000000000)
)
Adding a column to the DF containing the total damages costs (Property + Crop damages)
dmgevent_clean_df <- mutate(dmgevent_clean_df, TOTDMG = PROPDMG*PROPDMGEXP_num + CROPDMG*CROPDMGEXP_num)
Subsetting the DF in order to report total damage costs grouped by events and sorted in desc order
dmgcost_df <- select(dmgevent_clean_df,EVTYPE,TOTDMG) %>% group_by(EVTYPE) %>% summarise(TOT_DAM_COST = sum(TOTDMG)) %>%
arrange(desc(TOT_DAM_COST))
Focussing on the 6 most impacting events in terms of total number of deaths and injuries, it appears that the most harmful
event is TORNADO
head(harmfulevent_df)
## # A tibble: 6 x 3
## EVTYPE TOT_FATALITIES TOT_INJURIES
## <fct> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
This observation can be confirmed by a plot showing the 6 most Fatal events
Note that in order to arrange the plot in descending order, the factor levels need to be updated thanks to a pipe
head(harmfulevent_df) %>% mutate(EVTYPE=factor(EVTYPE, levels=EVTYPE)) %>%
ggplot(aes(EVTYPE, group=1)) + geom_line(aes(y=TOT_FATALITIES),color="darkred") +
labs(title = "TOTAL FATALITIES / EVENT")
This observation can be confirmed by a plot showing the 6 most events bringing the higgests injuries
Note that in order to arrange the plot in descending order, the factor levels need to be updated thanks to a pi
head(harmfulevent_df) %>% mutate(EVTYPE=factor(EVTYPE, levels=EVTYPE)) %>%
ggplot(aes(EVTYPE, group=1)) + geom_line(aes(y=TOT_INJURIES),color="darkblue") +
labs(title = "TOTAL INJURIES / EVENT")
Focussing on the 6 most impacting events in terms of economical aspect, it appears that the most significant event is Flood.
head(dmgcost_df)
## # A tibble: 6 x 2
## EVTYPE TOT_DAM_COST
## <fct> <dbl>
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57290435593
## 4 STORM SURGE 43323541000
## 5 HAIL 18727703230
## 6 FLASH FLOOD 17561538817
It appears that the event having, by far, the greatest economical impact is Flood
Let’s confirm this observation with a plot
head(dmgcost_df) %>% mutate(EVTYPE=factor(EVTYPE, levels=EVTYPE)) %>%
ggplot(aes(EVTYPE, group=1)) + geom_line(aes(y=TOT_DAM_COST),color="darkgreen") +
labs(title = "TOTAL DAMAGE COST / EVENT")