Synopsis
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States.
This project attempts to answer two questions:
1) Across the United States, which types of events (as indicated in the “EVTYPE” variable) are most harmful with respect to population health?
2) Across the United States, which types of events have the greatest economic consequences?
Data Processing
The data file (“repdata_data_StormData.csv.bz2”) was provided as part of the assignment documents and is included in the GitHub repo (https://github.com/GaryFH/RepDataStormStudy). The file is read in, converted to a tbl_df and stored as “d1” with the following code:
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
c1<-read.csv("repdata_data_StormData.csv.bz2",stringsAsFactors = FALSE)
d1<-tbl_df(c1)
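read.csv() can be pointed at the .bz2 file directly because R's file connections detect bzip2 compression when reading. The following is a minimal sketch of that behaviour on a small temporary file; the file contents and the names tmp, con and back are made up for illustration and are not part of the original analysis.

```r
# Write a tiny, made-up CSV through a bzip2 connection.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, "w")
write.csv(data.frame(EVTYPE = c("FLOOD", "HAIL"), INJURIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# Read it back the same way the report reads the storm data:
# no explicit decompression step is needed.
back <- read.csv(tmp, stringsAsFactors = FALSE)
back
```

Passing stringsAsFactors = FALSE keeps text columns such as EVTYPE as character vectors, which is why they appear as <chr> in the tibble printout.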
The data is initially examined with the following code:
d1
## # A tibble: 902,297 × 37
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
##      <dbl>              <chr>    <chr>     <chr>  <dbl>      <chr> <chr>
##  1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
##  2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
##  3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
##  4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
##  5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
##  6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##  7       1 11/16/1951 0:00:00     0100       CST      9     BLOUNT    AL
##  8       1  1/22/1952 0:00:00     0900       CST    123 TALLAPOOSA    AL
##  9       1  2/13/1952 0:00:00     2000       CST    125 TUSCALOOSA    AL
## 10       1  2/13/1952 0:00:00     2000       CST     57    FAYETTE    AL
## # ... with 902,287 more rows, and 30 more variables: EVTYPE <chr>,
## # BGN_RANGE <dbl>, BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## # END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## # END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## # MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## # PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## # STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## # LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>
The following code plots the eight weather event types that caused the most injuries.
g2<-ggplot(d7,aes(x=reorder(EVTYPE,-INJURIES),y=INJURIES))
plotinjury<-g2+geom_col(fill="magenta")+
  labs(x="Weather Event",title="Eight Weather Events with the Most Injuries, US 1950-2011")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  geom_text(aes(label=INJURIES), vjust=1.5, color="black")
plotinjury
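The data frame d7 used in the plot is not defined in the code shown here. The following is a minimal sketch of one way such a table could be built, assuming it holds the total INJURIES per EVTYPE with only the largest eight kept. The toy values and the names toy, inj and top_inj are hypothetical, and base R is used so the example runs without extra packages.

```r
# Toy stand-in for d1; the values are made up for illustration only.
toy <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  INJURIES = c(10, 2, 5, 7, 0),
  stringsAsFactors = FALSE
)

# Sum injuries per event type (base-R counterpart of group_by + summarise).
inj <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)

# Sort from most to fewest injuries and keep at most eight rows.
inj <- inj[order(-inj$INJURIES), ]
top_inj <- head(inj, 8)
top_inj
```

With the real data, the same three steps (sum, sort, head) would be applied to d1 instead of toy.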

In the original dataset, damage amounts in dollars are stored in two variables (PROPDMG and CROPDMG). The values in these columns must be multiplied by a factor that varies per observation and is encoded as a character code in PROPDMGEXP and CROPDMGEXP. The code below adds two new columns (property damage multiplier, “pdmultplyr”, and crop damage multiplier, “cdmultplyr”) that convert those character codes into numeric factors: h/H = 100, k/K = 1,000, m/M = 1,000,000 and b/B = 1,000,000,000. Any other code is given a multiplier of 0.
dd2<-group_by(d1,EVTYPE)
# Property damage multiplier from the PROPDMGEXP code (0 for unrecognised codes)
dd25<-mutate(dd2, pdmultplyr =
    ifelse(PROPDMGEXP == "h" | PROPDMGEXP == "H", 100,
    ifelse(PROPDMGEXP == "k" | PROPDMGEXP == "K", 1000,
    ifelse(PROPDMGEXP == "m" | PROPDMGEXP == "M", 1000000,
    ifelse(PROPDMGEXP == "b" | PROPDMGEXP == "B", 1000000000, 0 )))))
# Crop damage multiplier from the CROPDMGEXP code (0 for unrecognised codes)
dd26<-mutate(dd25, cdmultplyr=
    ifelse(CROPDMGEXP == "h" | CROPDMGEXP == "H", 100,
    ifelse(CROPDMGEXP == "k" | CROPDMGEXP == "K", 1000,
    ifelse(CROPDMGEXP == "m" | CROPDMGEXP == "M", 1000000,
    ifelse(CROPDMGEXP == "b" | CROPDMGEXP == "B", 1000000000, 0 )))))
dd3<-select(dd26,EVTYPE,PROPDMG:CROPDMGEXP,pdmultplyr,cdmultplyr)
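The nested ifelse() calls map each exponent code to a number. As an alternative illustration of the same mapping, the sketch below uses a named lookup vector. The names exp_lookup and exp_to_multiplier are hypothetical helpers, not part of the original analysis, and one difference is flagged in the comments: a missing (NA) code becomes 0 here, whereas ifelse() would return NA.

```r
# Named lookup vector: exponent code -> numeric multiplier.
exp_lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a vector of codes to multipliers.
exp_to_multiplier <- function(code) {
  # toupper() makes "k" and "K" hit the same entry; unknown codes index to NA.
  m <- unname(exp_lookup[toupper(code)])
  # Unrecognised codes (and NA codes) get 0, mirroring the final
  # "else 0" branch of the nested ifelse for non-missing input.
  m[is.na(m)] <- 0
  m
}

exp_to_multiplier(c("k", "M", "", "B", "5", "h"))
```

Under these assumptions the helper could be used inside mutate(), for example pdmultplyr = exp_to_multiplier(PROPDMGEXP), in place of the nested ifelse().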
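The resulting data frame can now be used to answer the question: which weather events caused the most damage? The following code computes the total damage (property plus crop) per event type, expressed in billions of dollars, and keeps the eight largest: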
dd35<-mutate(dd3,TotalDamage=(PROPDMG*pdmultplyr)+(CROPDMG*cdmultplyr))
# keep rows with non-zero, non-missing damage (filter() expects logical conditions)
dd4<-filter(dd35,TotalDamage!=0,!is.na(TotalDamage))
dd5<-summarise(dd4,TotalDamage=sum(TotalDamage)/1000000000)
dd6<-arrange(dd5,desc(TotalDamage))
dd7<-head(dd6,8)
dd7
## # A tibble: 8 × 2
##              EVTYPE TotalDamage
##               <chr>       <dbl>
## 1             FLOOD   150.31968
## 2 HURRICANE/TYPHOON    71.91371
## 3           TORNADO    57.35211
## 4       STORM SURGE    43.32354
## 5              HAIL    18.75822
## 6       FLASH FLOOD    17.56213
## 7           DROUGHT    15.01867
## 8         HURRICANE    14.61023
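To make the arithmetic behind the TotalDamage column concrete, here is a small worked example on made-up rows, written in base R. The data frame toy and the object totals are hypothetical; the multiplier columns are filled in by hand using the same coding as above (K = 1e3, M = 1e6, B = 1e9, anything else 0).

```r
# Made-up observations with their multipliers already attached.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL", "TORNADO"),
  PROPDMG    = c(2.5, 10, 500, 1),
  pdmultplyr = c(1e9, 1e6, 1e3, 1e6),
  CROPDMG    = c(0, 5, 100, 0),
  cdmultplyr = c(0, 1e6, 1e3, 0),
  stringsAsFactors = FALSE
)

# Per-row damage in dollars: property plus crop.
toy$TotalDamage <- toy$PROPDMG * toy$pdmultplyr + toy$CROPDMG * toy$cdmultplyr

# Sum per event type, convert to billions, sort in descending order.
totals <- aggregate(TotalDamage ~ EVTYPE, data = toy, FUN = sum)
totals$TotalDamage <- totals$TotalDamage / 1e9
totals <- totals[order(-totals$TotalDamage), ]
totals
```

For FLOOD, the two rows contribute 2.5e9 and 10e6 + 5e6 dollars, so the total is 2.515 billion. TORNADO comes to 0.001 billion and HAIL to 0.0006 billion.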
g3<-ggplot(dd7,aes(x=reorder(EVTYPE,-TotalDamage),y=TotalDamage))
plotdamage<-g3+geom_col(fill="orange")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  geom_text(aes(label=TotalDamage), vjust=1.5)+
  labs(x="Weather event type",y="Total damage (property + crop) in billions of dollars",title="Top eight most expensive weather events")
plotdamage
