In this analysis we tried to find the types of weather events which may cause highest impact in terms of death ,injuries and economic costs.The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database which contains weather events from 1950s to current decade has been explored to find the required answers.After the analysis we found that TORNADO causes most fatalities and injuries becoming top event causing public health problems. Calculation of economic costs is calculated via both property and crop damage. FLOOD is found to be the top most event causing economic costs.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [47Mb]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The following code downloads the data zip file from the website. As second step using read.csv() function, file is unzipped and loaded into dataframe.
if(!exists("rawdata.bz2")) {
Url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(Url, "rawdata.bz2")}
Url2<-"https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(Url2, basename(Url2))
if(!exists("df1")) {
df1 <- read.csv(bzfile("rawdata.bz2"),header = TRUE)
}
str(df1)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The data set has 902297 obs. of 37 variables and to our analysis we only need 6 variables from the dataset.
Health variables:
* FATALITIES: approx. number of deaths
* INJURIES: approx. number of injuries
Economic variables:
PROPDMG: approx. property damags
PROPDMGEXP: the units for property damage value
CROPDMG: approx. crop damages
CROPDMGEXP: the units for crop damage value
So we made a subset of dataframe using the following code.
df2 <- df1[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
dim(df2)
## [1] 902297 7
summary(df2)
## EVTYPE FATALITIES INJURIES PROPDMG
## Length:902297 Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00
## Mode :character Median : 0.0000 Median : 0.0000 Median : 0.00
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50
## Max. :583.0000 Max. :1700.0000 Max. :5000.00
## PROPDMGEXP CROPDMG CROPDMGEXP
## Length:902297 Min. : 0.000 Length:902297
## Class :character 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Mode :character
## Mean : 1.527
## 3rd Qu.: 0.000
## Max. :990.000
No missing values found in the dataset.
In the data there were two variables regarding health impact;FATALITIES & INJURIES. Using dplyr package, data was initially grouped according to event types, the summarized according to totals of fatalities and injuries for each event. After arranging values, top 10 selected and saved in st1 & st2.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
st1<-df2%>% group_by(EVTYPE)%>%
summarise(fatality=sum(FATALITIES))%>%
slice_max(order_by=fatality,n=10)
st1
## # A tibble: 10 × 2
## EVTYPE fatality
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
st2<-df2%>% group_by(EVTYPE)%>%
summarise(injury=sum(INJURIES))%>%
slice_max(order_by=injury,n=10)
st2
## # A tibble: 10 × 2
## EVTYPE injury
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
Economic Impact
The data provides two types of economic impact, namely property damage (PROPDMG) and crop damage (CROPDMG). The actual damage in $USD is indicated by PROPDMGEXP and CROPDMGEXP parameters as exponential. After searching for individual events in the storm website, the index in the PROPDMGEXP and CROPDMGEXP can be interpreted roughly as the following:-
H, h -> hundreds = x100
K, K -> kilos = x1,000
M, m -> millions = x1,000,000
B,b -> billions = x1,000,000,000
(+) -> x1
(-) -> x0
(?) -> x0
blank -> x0
(0) -> x0
Step1: Two character vectors of unique values were made for property damage and crop damage. Corresponding multiplier vectors also made using above mentioned logic on exponential.
Step2: The unique value vector and multiplier vector were combined into df2,df3 separately for property and crop damage. columns were renamed so as to match the df2 vector.
prop_unique<-unique(df2$PROPDMGEXP)
crop_unique<-unique(df2$CROPDMGEXP)
prop_multi<-c(10^3,10^6,0,10^9,10^6,1,0,10^5,10^6,0,10^4,10^2,10^3,10^2,10^7,10^2,0,10^1,10^8)
crop_multi<-c(0,10^6,10^3,10^6,10^9,0,0,10^3,10^2)
df3<-data.frame(prop_unique,prop_multi)
df4<-data.frame(crop_unique,crop_multi)
colnames(df3)<- c("PROPDMGEXP","prop_multi")
colnames(df4)<- c("CROPDMGEXP","crop_multi")
The resultant vectors are then put into data pipeline of dplyr and each step doing the following task.
st3 <-df2 %>% left_join(df3,by="PROPDMGEXP")%>%
left_join(df4,by="CROPDMGEXP")%>%
mutate(propdamage=PROPDMG*prop_multi,cropdamage=CROPDMG*crop_multi) %>%
mutate(totaldamage=cropdamage+propdamage)%>%
group_by(EVTYPE)%>%
summarise(cost=sum(totaldamage))%>%
slice_max(order_by=cost,n=10)
st3
## # A tibble: 10 × 2
## EVTYPE cost
## <chr> <dbl>
## 1 FLOOD 150319678250
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57362333650
## 4 STORM SURGE 43323541000
## 5 HAIL 18761221670
## 6 FLASH FLOOD 18243990610
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041310
par(mfrow = c(1, 2))
plot1<-barplot(st1$fatality,main="Top 10 high fatality events",ylab = "Total fatality",col="blue")
text(plot1, par("usr")[3], labels = st1$EVTYPE, srt = 90, adj = c(1.1,1.1), xpd = TRUE, cex=0.7)
plot2<-barplot(st2$injury,main="Top 10 high injuries events",ylab = "Total injuries",col="magenta")
text(plot2, par("usr")[3], labels = st2$EVTYPE, srt = 90, adj = c(1.1,1.1), xpd = TRUE, cex=0.7)
The st1 & st2 summary tables made in the analysis section is plotted parallely showing top 10 fatalities and injuries causing events. It is understood that TORNADO as the lead cause of mortality and morbidity.
par(mfrow = c(1, 1))
plot1<-barplot(st3$cost,main="Top 10 events with highest economic damage.",ylab = "Total cost",col="purple")
text(plot1, par("usr")[3], labels = st3$EVTYPE, srt = 90, adj = c(1.1,1.1), xpd = TRUE, cex=0.7)
The st3 summary table made earlier is plotted showing top 10 events causing highest damage. It is clear that FLOOD as the most damaging one.