Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
We are exploring the effect severe weather has on population health and the economy. The analysis conducted on the NOAA Storm Database data collected from 1950 to the end in November 2011.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(scales)
The database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
df <- tbl_df(read.csv("repdata%2Fdata%2FStormData.csv"))
A first look at the data
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
#Head
head(df)
## # A tibble: 6 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## <dbl> <fct> <fct> <fct> <dbl> <fct> <fct> <fct>
## 1 1 4/18/19~ 0130 CST 97 MOBILE AL TORNA~
## 2 1 4/18/19~ 0145 CST 3 BALDWIN AL TORNA~
## 3 1 2/20/19~ 1600 CST 57 FAYETTE AL TORNA~
## 4 1 6/8/195~ 0900 CST 89 MADISON AL TORNA~
## 5 1 11/15/1~ 1500 CST 43 CULLMAN AL TORNA~
## 6 1 11/15/1~ 2000 CST 77 LAUDERDALE AL TORNA~
## # ... with 29 more variables: BGN_RANGE <dbl>, BGN_AZI <fct>,
## # BGN_LOCATI <fct>, END_DATE <fct>, END_TIME <fct>, COUNTY_END <dbl>,
## # COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <fct>, END_LOCATI <fct>,
## # LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>,
## # INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <fct>, CROPDMG <dbl>,
## # CROPDMGEXP <fct>, WFO <fct>, STATEOFFIC <fct>, ZONENAMES <fct>,
## # LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
## # REMARKS <fct>, REFNUM <dbl>
#MIddle
df[(nrow(df)/2):((nrow(df)/2)+10),]
## # A tibble: 11 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## <dbl> <fct> <fct> <fct> <dbl> <fct> <fct> <fct>
## 1 8 1/23/20~ 01:00:0~ MST 72 COZ072 CO HEAVY~
## 2 8 1/23/20~ 02:00:0~ MST 35 COZ035>03~ CO HEAVY~
## 3 8 1/28/20~ 11:40:0~ MST 70 COZ070 - ~ CO HIGH ~
## 4 8 1/28/20~ 09:00:0~ MST 2 COZ002>00~ CO WINTE~
## 5 8 1/29/20~ 05:00:0~ MST 61 COZ061 - ~ CO HEAVY~
## 6 8 1/29/20~ 06:00:0~ MST 35 COZ035 - ~ CO HEAVY~
## 7 8 1/30/20~ 05:00:0~ MST 91 COZ091>092 CO WINTE~
## 8 8 2/1/200~ 03:00:0~ MST 10 COZ010 CO AVALA~
## 9 8 2/6/200~ 04:00:0~ MST 12 COZ012 CO AVALA~
## 10 8 2/8/200~ 07:00:0~ MST 4 COZ004 - ~ CO WINTE~
## 11 8 2/8/200~ 03:00:0~ MST 6 COZ006 CO DUST ~
## # ... with 29 more variables: BGN_RANGE <dbl>, BGN_AZI <fct>,
## # BGN_LOCATI <fct>, END_DATE <fct>, END_TIME <fct>, COUNTY_END <dbl>,
## # COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <fct>, END_LOCATI <fct>,
## # LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>,
## # INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <fct>, CROPDMG <dbl>,
## # CROPDMGEXP <fct>, WFO <fct>, STATEOFFIC <fct>, ZONENAMES <fct>,
## # LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
## # REMARKS <fct>, REFNUM <dbl>
#Tail
tail(df)
## # A tibble: 6 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## <dbl> <fct> <fct> <fct> <dbl> <fct> <fct> <fct>
## 1 47 11/28/2~ 03:00:0~ CST 21 TNZ001>00~ TN WINTE~
## 2 56 11/30/2~ 10:30:0~ MST 7 WYZ007 - ~ WY HIGH ~
## 3 30 11/10/2~ 02:48:0~ MST 9 MTZ009 - ~ MT HIGH ~
## 4 2 11/8/20~ 02:58:0~ AKS 213 AKZ213 AK HIGH ~
## 5 2 11/9/20~ 10:21:0~ AKS 202 AKZ202 AK BLIZZ~
## 6 1 11/28/2~ 08:00:0~ CST 6 ALZ006 AL HEAVY~
## # ... with 29 more variables: BGN_RANGE <dbl>, BGN_AZI <fct>,
## # BGN_LOCATI <fct>, END_DATE <fct>, END_TIME <fct>, COUNTY_END <dbl>,
## # COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <fct>, END_LOCATI <fct>,
## # LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>,
## # INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <fct>, CROPDMG <dbl>,
## # CROPDMGEXP <fct>, WFO <fct>, STATEOFFIC <fct>, ZONENAMES <fct>,
## # LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
## # REMARKS <fct>, REFNUM <dbl>
We are only interested in the following relationships and if they have impact.
EVTYPE in relation to FATALITIES and INJURIES
EVTYPE in relation to PROPDMGm PROPDMGEXP, CROPDMG, and CROPDMGEXP
We will create two separate objects to contain the data we are interested in.
h <- select(df, EVTYPE, FATALITIES, INJURIES)
For the impact to the economy we need to update the values to match the units units.
e <- select(df, EVTYPE, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP)
e <- e %>% mutate(PROPDMGEXP = toupper(PROPDMGEXP), CROPDMGEXP = toupper(CROPDMGEXP))
exp <- data.frame("unit"=c("1","H","K","M","B"),"value"=c(1,100,1000,1000000,1000000000))
e$PROPDMGEXP <- sapply(e$PROPDMGEXP, function(x) exp$value[match(x, exp$unit, nomatch = "1")] )
e$CROPDMGEXP <- sapply(e$CROPDMGEXP, function(x) exp$value[match(x, exp$unit, nomatch = "1")] )
e <- e %>% mutate(PROPTOTAL = PROPDMG*PROPDMGEXP) %>%
mutate(CROPTOTAL = CROPDMG*CROPDMGEXP)
# Clean Up
rm(df)
hsum <- h %>%
group_by(EVTYPE) %>%
summarise_all(funs(sum)) %>%
mutate(IMPACT = FATALITIES + INJURIES)
esum <- e %>%
group_by(EVTYPE) %>%
summarise_all(funs(sum)) %>%
mutate(IMPACT = PROPTOTAL + CROPTOTAL)
# Clean Up
rm(h,e)
# Check for missing values
sum(is.na(hsum))
## [1] 0
sum(is.na(esum))
## [1] 0
There are no missing values to impute so we will remove the rows that have no impact.
hsum <- hsum %>% filter(IMPACT > 0)
esum <- esum %>% filter(IMPACT > 0)
Now that we have a reduced dataset we can look into cleaning the data. There are a number duplicate names and mispellings. We won’t
hsum <- hsum %>% mutate(EVTYPE = toupper(EVTYPE))
hsum$EVTYPE <- gsub("[^0-9A-Za-z' ]"," " , hsum$EVTYPE ,ignore.case = TRUE)
hsum <- hsum %>%
group_by(EVTYPE) %>%
summarise_all(funs(sum))
esum <- esum %>% mutate(EVTYPE = toupper(EVTYPE))
esum$EVTYPE <- gsub("[^0-9A-Za-z' ]"," " , esum$EVTYPE ,ignore.case = TRUE)
esum <- esum %>%
group_by(EVTYPE) %>%
summarise_all(funs(sum))
We will take a look at the top contributing events.
head(arrange(hsum, desc(IMPACT)))
## # A tibble: 6 x 4
## EVTYPE FATALITIES INJURIES IMPACT
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
head(arrange(esum, desc(IMPACT)))
## # A tibble: 6 x 8
## EVTYPE PROPDMG CROPDMG PROPDMGEXP CROPDMGEXP PROPTOTAL CROPTOTAL IMPACT
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FLOOD 9.00e5 168038. 6.54e 9 363283704 1.45e11 5.66e9 1.50e11
## 2 HURRI~ 5.84e3 4798. 1.20e10 1021011055 6.93e10 2.61e9 7.19e10
## 3 TORNA~ 3.21e6 100019. 7.52e 9 95558059 5.69e10 4.15e8 5.74e10
## 4 STORM~ 1.94e4 5 2.04e 9 7254 4.33e10 5.00e3 4.33e10
## 5 HAIL 6.89e5 579596. 1.96e 9 646946356 1.57e10 3.03e9 1.88e10
## 6 FLASH~ 1.42e6 179200. 2.54e 9 202530598 1.61e10 1.42e9 1.76e10
tail(arrange(hsum, desc(IMPACT)))
## # A tibble: 6 x 4
## EVTYPE FATALITIES INJURIES IMPACT
## <chr> <dbl> <dbl> <dbl>
## 1 TIDAL FLOODING 0 1 1
## 2 "TSTM WIND G35 " 1 0 1
## 3 "TSTM WIND G40 " 0 1 1
## 4 URBAN AND SMALL STREAM FLOODIN 1 0 1
## 5 WHIRLWIND 1 0 1
## 6 WIND STORM 1 0 1
tail(arrange(esum, desc(IMPACT)))
## # A tibble: 6 x 8
## EVTYPE PROPDMG CROPDMG PROPDMGEXP CROPDMGEXP PROPTOTAL CROPTOTAL IMPACT
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 COLD A~ 0.05 0.05 1000 1000 50 50 100
## 2 SNOW A~ 0.05 0 1032 33 50 0 50
## 3 URBAN ~ 0.05 0 1000 1 50 0 50
## 4 BREAKU~ 20 0 1 1 20 0 20
## 5 FLOODI~ 2 0 1 1 2 0 2
## 6 FLASH ~ 0.41 0 1 1 0.41 0 0.41
We will not do an extensive cleanup as the disparity in the values are quite clear.
Weather events that are most harmful to the population.
ggplot(data = head(arrange(hsum, desc(IMPACT))), aes(x = EVTYPE, y = IMPACT) ) + geom_bar(stat = "identity") + labs(x="Severe Weather Type", y="Total Fatalities and Injuries", title="Most Harmful Weather Events for Population") + scale_y_continuous(label=unit_format(unit = "K", scale = 1e-3))
Weathere events that have the greatest economic consequences.
ggplot(data = head(arrange(esum, desc(IMPACT))), aes(x = EVTYPE, y = IMPACT) ) + geom_bar(stat = "identity") + labs(x="Severe Weather Type", y="Economic Impact", title="Most Harmful Weather Events for Economy") + theme(axis.text.x = element_text(angle = 45, hjust = 1))+ scale_y_continuous(label=unit_format(unit = "B", scale = 1e-9))
In both cases. Tornados are the most damaging to the population while Floods are most damaging to the economy.