Storm events are observed in the USA every year and affect a wide range of people to different extents. The damage caused by these events includes buildings, bridges, roads, hospitals and other infrastructure, along with crops and livestock. This analysis estimates the impact of these storm events on public health, measured by fatalities and injuries, and their economic consequences, measured by property damage and crop damage.
The data for this analysis are sourced from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Data: Storm Data 47Mb
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011.
library(dplyr)
library(stringr)
library(reshape2)
library(ggplot2)
library(patchwork)
data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
storm_data <- data.table::fread(data_url)
The relevant variables for our current question are storm_data$EVTYPE, storm_data$FATALITIES, storm_data$INJURIES, storm_data$PROPDMG, storm_data$PROPDMGEXP, storm_data$CROPDMG, storm_data$CROPDMGEXP and storm_data$BGN_DATE.
storm_data2 <- storm_data[,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES","PROPDMG"
,"PROPDMGEXP","CROPDMG","CROPDMGEXP")]
names(storm_data2) <- sub("_",".",tolower(names(storm_data2)))
The storm_data2$bgn.date should be a date variable.
class(storm_data2$bgn.date)
## [1] "character"
Unfortunately it is a character variable. It needs to be converted to a date variable before we can make any use of it.
storm_data2$bgn.date <- as.Date(storm_data2$bgn.date, "%m/%d/%Y")
class(storm_data2$bgn.date)
## [1] "Date"
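As a quick illustration of this conversion, the sketch below parses a single made-up date string. The sample value is an assumption about how a raw BGN_DATE entry looks (month/day/year followed by a time); as.Date only processes as much of the string as the format needs, so the trailing time part is ignored.

```r
# Illustrative input only; assumed to resemble a raw BGN_DATE value
raw_date <- "4/18/1950 0:00:00"

# Parse month/day/year; the trailing time part is not covered by the format
parsed <- as.Date(raw_date, "%m/%d/%Y")

class(parsed)
format(parsed, "%Y-%m-%d")

# Month number, the quantity used later to study when events occur
month_number <- as.numeric(format(parsed, "%m"))
month_number
```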
For a proper representation of economic damage, the exponent
variables storm_data$PROPDMGEXP and
storm_data$CROPDMGEXP should be numeric, each giving the power of
ten by which the damage value is multiplied.
These two variables will first be coerced to factors and assigned to
prop_exp and crop_exp.
prop_exp <- factor(storm_data2[[6]])
crop_exp <- factor(storm_data2[[8]])
levels(prop_exp)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
levels(crop_exp)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
Looking at the levels, we can immediately notice that there are some numbers, some letters and some sign characters. The letters represent specific exponent values, where h/H stands for 2 (hundreds), k/K for 3 (thousands), m/M for 6 (millions) and b/B for 9 (billions).
levels(prop_exp) <- str_replace_all(levels(prop_exp),
c("h|H"="2","k|K"="3","m|M"="6","b|B"="9"))
levels(prop_exp)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
levels(crop_exp) <- str_replace_all(levels(crop_exp),
c("h|H"="2","k|K"="3","m|M"="6","b|B"="9"))
levels(crop_exp)
## [1] "" "?" "0" "2" "9" "3" "6"
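The same letter-to-exponent mapping can also be written as a named lookup vector. The following is only a minimal sketch on made-up codes, not the approach used in this analysis; exp_lookup and codes are hypothetical names introduced for illustration.

```r
# Hypothetical lookup table: letter code -> power of ten
exp_lookup <- c(h = 2, k = 3, m = 6, b = 9)

# Made-up exponent codes mixing letters (either case) and a digit
codes <- c("K", "m", "B", "h", "5")

lower <- tolower(codes)
is_letter <- lower %in% names(exp_lookup)

# Letters are translated through the lookup; digits are converted directly
exponent <- rep(NA_real_, length(codes))
exponent[is_letter] <- exp_lookup[lower[is_letter]]
exponent[!is_letter] <- as.numeric(codes[!is_letter])

exponent
```

A lookup vector keeps the mapping in one place, whereas the regular-expression approach above operates directly on the factor levels.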
But the sign characters (“-”, “?”, “+”) do not correspond to any specific exponent value.
sum(prop_exp %in% c("+","?","-"))/nrow(storm_data2) * 100
## [1] 0.001551596
sum(crop_exp %in% c("+","?","-"))/nrow(storm_data2) * 100
## [1] 0.0007757978
About 0.0016% of the values in prop_exp
(storm_data2$propdmgexp) and 0.00078% of the values in
crop_exp (storm_data2$cropdmgexp) are sign characters.
Their share of the data is very low, so we can safely
ignore them. However, removing observations based on one variable would
also discard the data in the other variables. To avoid that, we instead turn
the total damage value to 0 by assigning these records an exponent of
negative infinity.
levels(prop_exp) <- gsub("(\\+|\\?|\\-)","-Inf", levels(prop_exp))
levels(prop_exp)
## [1] "" "-Inf" "0" "1" "2" "3" "4" "5" "6" "7"
## [11] "8" "9"
levels(crop_exp) <- gsub("(\\+|\\?|\\-)","-Inf", levels(crop_exp))
levels(crop_exp)
## [1] "" "-Inf" "0" "2" "9" "3" "6"
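To see why an exponent of negative infinity zeroes out a record without dropping it, consider this small sketch. The damage values are made up for illustration; the point is only that 10^(-Inf) evaluates to 0 in R.

```r
# Made-up damage values and exponents; -Inf marks an unusable exponent code
damage   <- c(25, 5, 1.5)
exponent <- c(3, -Inf, 6)

# 10^(-Inf) is 0, so the second record contributes nothing to any sum
dollars <- damage * 10^exponent
dollars

# The row is still present, so its other variables remain usable
length(dollars)
```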
Let's take a look at the empty strings.
head(storm_data2[storm_data2$propdmgexp=="",c(5,6)])
## propdmg propdmgexp
## 54 0
## 55 0
## 56 0
## 57 0
## 58 0
## 59 0
head(storm_data2[storm_data2$cropdmgexp=="",c(7,8)])
## cropdmg cropdmgexp
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
It looks like the empty strings mostly correspond to a damage value of 0.
sum(storm_data2$propdmgexp=="" & storm_data2$propdmg != 0)/sum(storm_data2$propdmgexp=="") * 100
## [1] 0.01631132
Only about 0.02% of the records with an empty exponent have a nonzero damage value, so it is safe to assign the empty strings an exponent value of 1.
An exponent of 0 gives a multiplier of 10^0 = 1, so the damage amount is just the recorded value. Let's count the records with exponent 0 whose damage value is not 1.
sum(storm_data2$propdmgexp == "0" & storm_data2$propdmg!=1)
## [1] 209
sum(storm_data2$cropdmgexp == "0" & storm_data2$cropdmg!=1)
## [1] 19
They are few in number, and we treat them like the empty strings by replacing the exponent 0 with 1.
levels(prop_exp) <- gsub("^$|0","1", levels(prop_exp))
levels(prop_exp)
## [1] "1" "-Inf" "2" "3" "4" "5" "6" "7" "8" "9"
levels(crop_exp) <- gsub("^$|0","1", levels(crop_exp))
levels(crop_exp)
## [1] "1" "-Inf" "2" "9" "3" "6"
The empty string and 0 have both been assigned an exponent value of 1.
Lastly, we can safely coerce prop_exp and
crop_exp to numeric and assign them back to
their respective columns.
storm_data2$propdmgexp <- as.numeric(as.character(prop_exp))
storm_data2$cropdmgexp <- as.numeric(as.character(crop_exp))
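The intermediate as.character() step matters here. The sketch below, on a made-up factor, shows that calling as.numeric() directly on a factor returns the internal level codes rather than the numbers written in the labels.

```r
# Made-up factor whose labels look like numbers
f <- factor(c("3", "6", "9", "3"))

# Direct conversion returns the internal level codes (1, 2, 3, 1)
codes <- as.numeric(f)
codes

# Going through character first recovers the labelled values (3, 6, 9, 3)
values <- as.numeric(as.character(f))
values
```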
The storm_data2$evtype values should be converted to lowercase so that
differently capitalized labels of the same event are grouped together.
storm_data2$evtype <- tolower(storm_data2$evtype)
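The effect of lowercasing can be seen on a handful of made-up labels. This sketch also shows its limit: labels that differ in spelling rather than in case (for example "tstm wind" and "thunderstorm wind", which appear as separate rows in the results) stay separate.

```r
# Made-up labels: three case variants of one spelling, plus another spelling
raw_labels <- c("TSTM WIND", "Tstm Wind", "tstm wind", "THUNDERSTORM WIND")

# Before lowercasing, every variant is its own category
n_before <- length(unique(raw_labels))
n_before

# After lowercasing, the case variants collapse into one category
counts <- table(tolower(raw_labels))
counts
```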
health_data <- storm_data2[,2:4]
top_15_events <- group_by(health_data,evtype) %>%
summarise(fatality.sum = sum(fatalities),injury.sum=sum(injuries), total=sum(fatalities)+sum(injuries)) %>%
(function(x){x[order(x[[2]]+x[[3]],decreasing=T),][1:15,]})
top_15_events
## # A tibble: 15 × 4
## evtype fatality.sum injury.sum total
## <chr> <dbl> <dbl> <dbl>
## 1 tornado 5633 91346 96979
## 2 excessive heat 1903 6525 8428
## 3 tstm wind 504 6957 7461
## 4 flood 470 6789 7259
## 5 lightning 816 5230 6046
## 6 heat 937 2100 3037
## 7 flash flood 978 1777 2755
## 8 ice storm 89 1975 2064
## 9 thunderstorm wind 133 1488 1621
## 10 winter storm 206 1321 1527
## 11 high wind 248 1137 1385
## 12 hail 15 1361 1376
## 13 hurricane/typhoon 64 1275 1339
## 14 heavy snow 127 1021 1148
## 15 wildfire 75 911 986
These are the events with the greatest impact on public health.
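The ranking boils down to a grouped sum followed by a sort. As a rough sketch of that idea, here is a base R version on a made-up data frame; the toy values are invented for illustration and aggregate() is swapped in for the dplyr pipeline used in the analysis.

```r
# Made-up records; values are for illustration only
toy <- data.frame(
  evtype     = c("tornado", "flood", "tornado", "heat", "flood"),
  fatalities = c(2, 1, 3, 4, 0),
  injuries   = c(10, 0, 5, 1, 2)
)

# Sum fatalities and injuries per event type
sums <- aggregate(cbind(fatalities, injuries) ~ evtype, data = toy, FUN = sum)
sums$total <- sums$fatalities + sums$injuries

# Rank event types by combined fatalities and injuries, largest first
ranked <- sums[order(sums$total, decreasing = TRUE), ]
head(ranked, 2)
```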
top_15_events$evtype <- with(top_15_events,
reorder(evtype,-fatality.sum-injury.sum))
top_15_melt <- melt(top_15_events[,-4],id.vars=1)
g_ph <- ggplot(top_15_melt,aes(x=evtype,y=value,fill=variable))
g_ph +
geom_col() + labs(title ="Top 15 Impactful Events on Public Health",
x = "",
y = "Affected People")+
theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1),
plot.title = element_text(hjust = .5))+
scale_fill_discrete(name="",labels=c("Fatalities","Injuries"))
From the plot we can see that tornadoes are the most impactful storm event, with more than 96,000 fatalities and injuries from 1950 to 2011.
eco_data <- storm_data2[,c(2,5:8)]
eco_data <- mutate(eco_data,propdmg=propdmg*(10**(propdmgexp-6)),
cropdmg=cropdmg*(10**(cropdmgexp-6)),
.keep='unused')
top_eco_evnets <- group_by(eco_data,evtype) %>%
summarise(propdmg.sum=sum(propdmg),
cropdmg.sum=sum(cropdmg),
total=sum(propdmg)+sum(cropdmg)) %>%
(function(x){x[order(x[[2]]+x[[3]],decreasing = T),][1:15,]})
top_eco_evnets
## # A tibble: 15 × 4
## evtype propdmg.sum cropdmg.sum total
## <chr> <dbl> <dbl> <dbl>
## 1 flood 144658. 5662. 150320.
## 2 hurricane/typhoon 69306. 2608. 71914.
## 3 tornado 56947. 415. 57362.
## 4 storm surge 43324. 0.005 43324.
## 5 hail 15735. 3026. 18761.
## 6 flash flood 16823. 1421. 18244.
## 7 drought 1046. 13973. 15019.
## 8 hurricane 11868. 2742. 14610.
## 9 river flood 5119. 5029. 10148.
## 10 ice storm 3945. 5022. 8967.
## 11 tropical storm 7704. 678. 8382.
## 12 winter storm 6688. 26.9 6715.
## 13 high wind 5270. 639. 5909.
## 14 wildfire 4765. 295. 5061.
## 15 tstm wind 4485. 554. 5039.
These are the events with the greatest economic impact. Damage values are given in millions of dollars.
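The conversion to millions works by subtracting 6 from the exponent before raising 10 to that power. A small worked sketch with made-up values follows; to_millions is a hypothetical helper name, not a function from the analysis.

```r
# Hypothetical helper: damage value and power-of-ten exponent -> millions of dollars
to_millions <- function(value, exponent) {
  value * 10^(exponent - 6)
}

# 2.5 with exponent 9 (billions) is 2500 million
billions_case <- to_millions(2.5, 9)
billions_case

# 500 with exponent 3 (thousands) is 0.5 million
thousands_case <- to_millions(500, 3)
thousands_case

# An exponent of -Inf still yields 0, so flagged records drop out of the sums
flagged_case <- to_millions(10, -Inf)
flagged_case
```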
top_eco_evnets$evtype <- with(top_eco_evnets,
reorder(evtype,-propdmg.sum-cropdmg.sum))
top_eco_melt <- melt(top_eco_evnets[-4],id.vars=1)
top_eco_melt$variable <- with(top_eco_melt,reorder(variable,value))
g_eco <- ggplot(top_eco_melt,aes(x=evtype,y=value,fill=variable))
g_eco +
geom_col() + labs(title ="Top 15 Impactful Events on Economics",
x = "",
y = "Damage (Million Dollars)")+
theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1),
plot.title = element_text(hjust = .5))+
scale_fill_discrete(name="",
labels=c("Property Damage","Crop Damage"))
From the graph we can see that the most economically impactful event is flood, which caused about $150 billion worth of property and crop damage from 1950 to 2011.
event_time <- storm_data2[,1:2]
event_time$bgn.date <- as.numeric(format(event_time$bgn.date,"%m"))
event_time$evtype <- with(event_time,reorder(evtype,bgn.date,
length,decreasing=TRUE))
top_freq_events <- group_by(event_time,evtype) %>%
summarise(frequency=length(bgn.date),
most.occur=paste(
month.abb[quantile(bgn.date,0.25)],"-",
month.abb[quantile(bgn.date,0.75)]),
) %>%
(function(x){x[order(x$frequency,decreasing = T),]}) %>%
(function(x){x[,1] <- reorder(x$evtype,-x$frequency);x[1:15,]})
top_freq_events
## # A tibble: 15 × 3
## evtype frequency most.occur
## <fct> <int> <chr>
## 1 hail 288661 Apr - Jul
## 2 tstm wind 219942 May - Jul
## 3 thunderstorm wind 82564 May - Aug
## 4 tornado 60652 Apr - Jul
## 5 flash flood 54277 May - Aug
## 6 flood 25327 Mar - Aug
## 7 thunderstorm winds 20843 May - Jul
## 8 high wind 20214 Feb - Nov
## 9 lightning 15754 Jun - Aug
## 10 heavy snow 15708 Feb - Nov
## 11 heavy rain 11742 May - Sep
## 12 winter storm 11433 Feb - Nov
## 13 winter weather 7045 Jan - Nov
## 14 funnel cloud 6844 May - Jul
## 15 marine tstm wind 6175 May - Aug
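The most.occur column is the interquartile range of the month numbers, translated into month abbreviations. The sketch below reproduces that step on a made-up vector of months; note that when a quantile is not a whole number, R truncates it when it is used as an index into month.abb.

```r
# Made-up month numbers for one event type
months <- c(4, 5, 5, 6, 7, 7, 8)

# First and third quartiles of the month numbers
q <- quantile(months, c(0.25, 0.75))
q

# Translate the quartiles into a month range label
range_label <- paste(month.abb[q[1]], "-", month.abb[q[2]])
range_label
```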
fills <- c("#cf4a34",
"#48bc5e",
"#be55ba",
"#80b543",
"#7366cb",
"#b8ae49",
"#6a8fcd",
"#dc923a",
"#52c1b0",
"#c6456e",
"#458e5c",
"#c277ad",
"#687c32",
"#d88170",
"#99662f")
g_freq_hist <- ggplot(top_freq_events,aes(x=evtype,
y=frequency/1000)) +
geom_col(fill=fills)+
scale_x_discrete(limits=top_freq_events[[1]]) +
labs(title="Top Frequent Events and Their Occurrence Throughout Year",
x=NULL,
y="Frequency (k)") +
theme(plot.title = element_text(hjust=0.5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.line.x = element_blank(),
plot.margin = unit(c(0,0,0,0),"pt"),
strip.text = element_text(),
legend.position = "none")
g_time_box <- ggplot(event_time,aes(evtype,bgn.date)) +
geom_boxplot(fill=fills) +
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1),
plot.margin = unit(c(0,0,0,0),"pt"),
legend.position = "none") +
scale_x_discrete(limits=top_freq_events[[1]]) +
labs(title=NULL,x=NULL,y="Months") +
scale_y_continuous(breaks=1:12,labels = month.abb)
g_freq_hist / plot_spacer()/ g_time_box +plot_layout(heights=c(1,0,1))
From 1950 to 2011 the most frequent storm event is hail, which occurred more than 288k times, mostly between April and July. The most impactful event in terms of public health is the tornado, the 4th most frequent event, which also occurs mostly between April and July. In terms of economic consequences, the most impactful event is flood, the 6th most frequent event, which occurs mostly between March and August.
Top Events Affecting Public Health (Number of People)
| Event Type | Fatalities | Injuries | Total |
|---|---|---|---|
| tornado | 5633 | 91346 | 96979 |
| excessive heat | 1903 | 6525 | 8428 |
| tstm wind | 504 | 6957 | 7461 |
| flood | 470 | 6789 | 7259 |
| lightning | 816 | 5230 | 6046 |
| heat | 937 | 2100 | 3037 |
| flash flood | 978 | 1777 | 2755 |
| ice storm | 89 | 1975 | 2064 |
| thunderstorm wind | 133 | 1488 | 1621 |
| winter storm | 206 | 1321 | 1527 |
| high wind | 248 | 1137 | 1385 |
| hail | 15 | 1361 | 1376 |
| hurricane/typhoon | 64 | 1275 | 1339 |
| heavy snow | 127 | 1021 | 1148 |
| wildfire | 75 | 911 | 986 |
Top Events Affecting Economics (Million Dollars)
| Event Type | Property Damage | Crop Damage | Total |
|---|---|---|---|
| flood | 144657.710 | 5661.9685 | 150319.678 |
| hurricane/typhoon | 69305.840 | 2607.8728 | 71913.713 |
| tornado | 56947.382 | 414.9547 | 57362.337 |
| storm surge | 43323.536 | 0.0050 | 43323.541 |
| hail | 15735.270 | 3025.9547 | 18761.225 |
| flash flood | 16822.678 | 1421.3171 | 18243.995 |
| drought | 1046.106 | 13972.5660 | 15018.672 |
| hurricane | 11868.319 | 2741.9100 | 14610.229 |
| river flood | 5118.945 | 5029.4590 | 10148.405 |
| ice storm | 3944.928 | 5022.1135 | 8967.042 |
| tropical storm | 7703.891 | 678.3460 | 8382.237 |
| winter storm | 6688.497 | 26.9440 | 6715.441 |
| high wind | 5270.046 | 638.5713 | 5908.618 |
| wildfire | 4765.114 | 295.4728 | 5060.587 |
| tstm wind | 4484.959 | 554.0073 | 5038.966 |
Top Frequent Events and Their Most Frequent Month Range
| Event Type | Frequency | Most Frequent Months |
|---|---|---|
| hail | 288661 | Apr - Jul |
| tstm wind | 219942 | May - Jul |
| thunderstorm wind | 82564 | May - Aug |
| tornado | 60652 | Apr - Jul |
| flash flood | 54277 | May - Aug |
| flood | 25327 | Mar - Aug |
| thunderstorm winds | 20843 | May - Jul |
| high wind | 20214 | Feb - Nov |
| lightning | 15754 | Jun - Aug |
| heavy snow | 15708 | Feb - Nov |
| heavy rain | 11742 | May - Sep |
| winter storm | 11433 | Feb - Nov |
| winter weather | 7045 | Jan - Nov |
| funnel cloud | 6844 | May - Jul |
| marine tstm wind | 6175 | May - Aug |