Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
library(tidyverse)
The basic goal of this assignment is to explore the NOAA Storm Database to determine the types of weather events that 1) are most harmful with respect to population health, and 2) have the greatest economic impact in the United States. The Storm Database was subset into two smaller dataframes, each with the data needed to answer one of the questions. Variable names and event types were changed to all lower case. No other transformation was performed on the original data, but additional variables were added to each subset to hold calculation results. group_by() and summarise() were used to get totals per event type in both subsets.
Read the data into R and take a look at its characteristics and structure.
stormdata <- read.csv("repdata_data_StormData.csv.bz2")
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
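The read step above passes the compressed file name straight to read.csv(), with no separate decompression step. A minimal sketch of why that works, using a small made-up data frame (the `toy` values and the temporary path are assumptions for illustration, not part of the storm data):

```r
# Toy data standing in for the storm file (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

# Write it as a bzip2-compressed CSV in a temporary location
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the compression
# when reading, so the .bz2 file can be read directly
toy_back <- read.csv(path)
str(toy_back)
```

The same call pattern is what loads the full storm file; only the file name differs.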
The analysis for this project will need the following variables:
- EVTYPE: Type of weather event
- FATALITIES
- INJURIES
- PROPDMG: property damage cost
- PROPDMGEXP: K=Thousands, M=Millions, etc.
- CROPDMG: crop damage
- CROPDMGEXP: K=Thousands, M=Millions, etc.
Create a subset of stormdata with only these variables.
events <- stormdata[, c("EVTYPE","FATALITIES","INJURIES","PROPDMG",
"PROPDMGEXP","CROPDMG","CROPDMGEXP")] ##select by name (columns 8 and 23-28)
head(events)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
Change variable names and EVTYPE to lowercase.
events$EVTYPE <- tolower(events$EVTYPE)
names(events) <- tolower(names(events))
head(events)
## evtype fatalities injuries propdmg propdmgexp cropdmg cropdmgexp
## 1 tornado 0 15 25.0 K 0
## 2 tornado 0 0 2.5 K 0
## 3 tornado 0 2 25.0 K 0
## 4 tornado 0 2 2.5 K 0
## 5 tornado 0 2 2.5 K 0
## 6 tornado 0 6 2.5 K 0
Check the data for NAs.
colSums(is.na(events))
## evtype fatalities injuries propdmg propdmgexp cropdmg cropdmgexp
## 0 0 0 0 0 0 0
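A short sketch of what this check counts, on a made-up data frame (the `toy` values are assumptions). It may also be worth keeping in mind that an empty string is not NA, so blank exponent codes such as the "" values in cropdmgexp are not counted by this check:

```r
# Hypothetical data with two kinds of "missing": NA and empty strings
toy <- data.frame(evtype     = c("tornado", NA, "flood"),
                  injuries   = c(2, NA, NA),
                  cropdmgexp = c("K", "", ""))

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts

# Empty strings need their own check
blank_counts <- colSums(toy == "", na.rm = TRUE)
blank_counts
```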
The first question asks about the effects of events on population health, so we only want data for events that resulted in fatalities or injuries. Create a new dataframe, health, that contains only the rows where fatalities or injuries is nonzero. Select columns evtype, fatalities, and injuries.
health <- events %>% filter(fatalities != 0 | injuries != 0) %>%
select(evtype,fatalities,injuries)
slice_sample(health, n=10)
## evtype fatalities injuries
## 1 tornado 1 1
## 2 tornado 0 4
## 3 tornado 0 3
## 4 lightning 0 1
## 5 glaze 0 2
## 6 tornado 0 15
## 7 tstm wind 0 1
## 8 tstm wind 0 1
## 9 tornado 1 0
## 10 tornado 2 69
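The filter condition keeps a row when either count is nonzero and drops it only when both are zero. A minimal sketch on made-up rows (the `toy` data frame is an assumption for illustration):

```r
library(dplyr)

# Hypothetical events: only "hail" and "lightning" have no health impact
toy <- data.frame(evtype     = c("tornado", "hail", "flood", "lightning"),
                  fatalities = c(0, 0, 2, 0),
                  injuries   = c(3, 0, 0, 0))

# Keep rows where at least one of the two counts is nonzero
harmful <- toy %>%
  filter(fatalities != 0 | injuries != 0) %>%
  select(evtype, fatalities, injuries)
harmful
```

Because slice_sample() draws rows at random, the ten rows shown above will differ between runs unless set.seed() is called first.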
Add the number of fatalities and injuries for each row and put the total in a new column, "healtheffect".
health <- health %>%
mutate(healtheffect = fatalities+injuries)
head(health)
## evtype fatalities injuries healtheffect
## 1 tornado 0 15 15
## 2 tornado 0 2 2
## 3 tornado 0 2 2
## 4 tornado 0 2 2
## 5 tornado 0 6 6
## 6 tornado 0 1 1
Calculate the total health effect for each event type.
health <- health %>%
group_by(evtype) %>%
summarise(healtheffect = sum(healtheffect))
head(health,10)
## # A tibble: 10 × 2
## evtype healtheffect
## <chr> <dbl>
## 1 avalance 1
## 2 avalanche 394
## 3 black ice 25
## 4 blizzard 906
## 5 blowing snow 16
## 6 brush fire 2
## 7 coastal flood 5
## 8 coastal flooding 3
## 9 coastal flooding/erosion 5
## 10 coastal storm 5
The summarised data are sorted alphabetically by event type. Sort by healtheffect in descending order.
health <- health[order(-health$healtheffect),]
head(health,5)
## # A tibble: 5 × 2
## evtype healtheffect
## <chr> <dbl>
## 1 tornado 96979
## 2 excessive heat 8428
## 3 tstm wind 7461
## 4 flood 7259
## 5 lightning 6046
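The grouping, totalling, and sorting steps can also be written as one dplyr pipeline, with arrange(desc()) in place of base order(). A sketch on made-up rows (the `toy` values are assumptions):

```r
library(dplyr)

# Hypothetical per-event health effects
toy <- data.frame(evtype       = c("flood", "tornado", "flood", "tornado", "hail"),
                  healtheffect = c(2, 10, 3, 5, 1))

ranked <- toy %>%
  group_by(evtype) %>%                              # one group per event type
  summarise(healtheffect = sum(healtheffect)) %>%   # total per group
  arrange(desc(healtheffect))                       # largest total first
ranked
```

Both approaches should give the same ordering; arrange() just keeps the whole step inside the pipeline.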
Tornadoes are by far the most damaging event in terms of population health. Let’s see just how much more damaging they are. Total the health effect of all events other than tornado, and compare that total to tornado’s total.
tornadoes <- health[1,]
others <- sum(health$healtheffect[-1]) ##total all rows except tornado (row 1)
## create new dataframe to store & plot results
compare <- add_row(tornadoes, evtype = "others", healtheffect = others)
compare
## # A tibble: 2 × 2
## evtype healtheffect
## <chr> <dbl>
## 1 tornado 96979
## 2 others 58694
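To put the two totals on a common scale, the tornado share of all recorded fatalities and injuries can be computed directly from the figures in the table above. The variable names here are illustrative:

```r
# Totals taken from the compare table above
tornado_total <- 96979
others_total  <- 58694

# Tornado share of all fatalities and injuries, as a percentage
tornado_share <- round(100 * tornado_total / (tornado_total + others_total), 1)
tornado_share   # 62.3

# Tornado total relative to every other event type combined
tornado_ratio <- round(tornado_total / others_total, 2)
tornado_ratio   # 1.65
```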
Plot the results.
barplot(compare$healtheffect, width=.25,names.arg= c("Tornadoes", "All Others"),
xlab="Event Type", ylab="Fatalities & Injuries", col="#1E7AB1", yaxt="n")
axis(2, at=seq(0,120000, by=20000), labels = c("","20K","40K","60K","80K","100K",""))
title("Tornadoes vs All Other Event Types",line=1)
mtext("1950 - November 2011",side=3,line=0)
Tornadoes are responsible for more fatalities and injuries than all other event types combined. Let’s see if this is true for economic impact as well.
To check the economic impact of event types, we need to know the cost of property damage and crop damage. Filter the events dataframe to keep rows where propdmg or cropdmg is nonzero, and select columns evtype, propdmg, propdmgexp, cropdmg, and cropdmgexp.
economic <- events %>% filter(propdmg != 0 | cropdmg != 0) %>%
select(evtype,propdmg, propdmgexp,cropdmg,cropdmgexp)
head(economic)
## evtype propdmg propdmgexp cropdmg cropdmgexp
## 1 tornado 25.0 K 0
## 2 tornado 2.5 K 0
## 3 tornado 25.0 K 0
## 4 tornado 2.5 K 0
## 5 tornado 2.5 K 0
## 6 tornado 2.5 K 0
Check the values in the PROPDMGEXP & CROPDMGEXP columns. The values will be used to create a multiplier (x1 & x2) for each DMG variable.
y1 <- unique(economic$propdmgexp)
y1
## [1] "K" "M" "B" "m" "" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
y2 <- unique(economic$cropdmgexp)
y2
## [1] "" "M" "K" "m" "B" "?" "0" "k"
Use the information above to create multiplier variables. According to the NWS documentation, propdmgexp & cropdmgexp should only contain monetary indicators (e.g., K, M, B). Values that are not monetary indicators will be assigned a multiplier of one (1).
economic <- economic %>%
mutate(x1=case_when(propdmgexp=="h"|propdmgexp=="H" ~ 100,
propdmgexp=="m"|propdmgexp=="M"~ 1000000,
propdmgexp=="B" ~ 1000000000,
propdmgexp=="K" ~ 1000,
TRUE ~ 1)
)
economic <- economic %>%
mutate(x2=case_when(cropdmgexp=="m"|cropdmgexp=="M" ~ 1000000,
cropdmgexp=="k"| cropdmgexp=="K" ~ 1000,
cropdmgexp=="B" ~ 1000000000,
TRUE ~ 1)
)
head(economic,10)
## evtype propdmg propdmgexp cropdmg cropdmgexp x1 x2
## 1 tornado 25.0 K 0 1000 1
## 2 tornado 2.5 K 0 1000 1
## 3 tornado 25.0 K 0 1000 1
## 4 tornado 2.5 K 0 1000 1
## 5 tornado 2.5 K 0 1000 1
## 6 tornado 2.5 K 0 1000 1
## 7 tornado 2.5 K 0 1000 1
## 8 tornado 2.5 K 0 1000 1
## 9 tornado 25.0 K 0 1000 1
## 10 tornado 25.0 K 0 1000 1
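As an alternative sketch, the same mapping can be expressed as a named lookup vector, which keeps each multiplier in one place for both damage columns. The helper name `exp_multiplier` is hypothetical, and treating the codes case-insensitively is an assumption that goes slightly beyond the case_when() rules above:

```r
# Hypothetical helper: map an exponent code to its numeric multiplier.
# Codes that are not h/k/m/b (including "", "+", "0", "?") map to 1.
exp_multiplier <- function(code) {
  lookup <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)
  mult <- unname(lookup[tolower(code)])   # unmatched codes come back as NA
  mult[is.na(mult)] <- 1
  mult
}

codes <- c("K", "m", "B", "H", "", "+", "0", "?")
multipliers <- exp_multiplier(codes)
multipliers

# First row of the table above: propdmg 25 with exponent "K"
first_row_cost <- 25 * exp_multiplier("K")
first_row_cost   # 25000
```

With a helper like this, x1 and x2 could be built from the same function, which avoids the two case_when() blocks drifting apart.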
Now create new columns holding each damage value multiplied by its multiplier (propdmg * x1 and cropdmg * x2), plus a total of the two.
economic <- economic %>%
mutate(totalprop=propdmg*x1) %>%
mutate(totalcrop=cropdmg*x2) %>%
mutate(total=totalprop+totalcrop)
head(economic)
## evtype propdmg propdmgexp cropdmg cropdmgexp x1 x2 totalprop totalcrop
## 1 tornado 25.0 K 0 1000 1 25000 0
## 2 tornado 2.5 K 0 1000 1 2500 0
## 3 tornado 25.0 K 0 1000 1 25000 0
## 4 tornado 2.5 K 0 1000 1 2500 0
## 5 tornado 2.5 K 0 1000 1 2500 0
## 6 tornado 2.5 K 0 1000 1 2500 0
## total
## 1 25000
## 2 2500
## 3 25000
## 4 2500
## 5 2500
## 6 2500
For easier review, let’s select only the evtype and totals columns.
costs <- select(economic,evtype,totalprop,totalcrop, total)
head(costs)
## evtype totalprop totalcrop total
## 1 tornado 25000 0 25000
## 2 tornado 2500 0 2500
## 3 tornado 25000 0 25000
## 4 tornado 2500 0 2500
## 5 tornado 2500 0 2500
## 6 tornado 2500 0 2500
Group by evtype and sum the totals.
costs <- costs %>%
group_by(evtype) %>%
summarise(totalcost = round((sum(total)/1000000000),3)) ## Convert to Billions
## for easier reading
costs <- costs[order(-costs$totalcost),] ##sort data by total cost in desc order
head(costs,5)
## # A tibble: 5 × 2
## evtype totalcost
## <chr> <dbl>
## 1 flood 150.
## 2 hurricane/typhoon 70.6
## 3 tornado 57.4
## 4 storm surge 43.3
## 5 hail 18.8
Plot the results.
barplot(costs$totalcost[1:5], width=.25,names.arg= c("floods", "hurricanes","tornadoes","storm surge","hail"),
xlab="Event Type", ylab="Property & Crop Damage Cost (Billions)", col="#ECE355")
title(main="Events with Highest Economic Impact", line=1)
mtext("1950 - November, 2011", line=0)
Floods and tornadoes are the only event types that appear in the top 5 for both economic impact and public health impact. Each earns a top spot in our research into how weather events impact the US in terms of public health and costs.
Floods have the largest economic impact in the US, costing approximately $80B more than the next most costly event type, hurricanes. They rank 4th in terms of their impact on public health.
Tornadoes come in at a distant 3rd in terms of costs, yet they account for more injuries/fatalities in the US than all other events combined.
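The overlap between the two rankings can be checked mechanically. This sketch copies the two top-5 lists from the tables above into plain vectors (the vector names are illustrative) and finds the event types common to both, along with their rank in each list:

```r
# Top 5 event types from the two tables above
health_top5 <- c("tornado", "excessive heat", "tstm wind", "flood", "lightning")
cost_top5   <- c("flood", "hurricane/typhoon", "tornado", "storm surge", "hail")

# Event types present in both rankings
both <- intersect(health_top5, cost_top5)
both

# Rank of each shared event type in each list
overlap <- data.frame(evtype      = both,
                      health_rank = match(both, health_top5),
                      cost_rank   = match(both, cost_top5))
overlap
```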