SUMMARY: This dataset had 985 different types of weather events, so we filtered out any weather event that had less than 100 occurrences in the dataset so that we were working with decent sample sizes for each event. Using dplyr, we manipulated the data to identify the weather events with the highest averages for each metric.

Excessive heat was the most deadly type of weather event in terms of fatalities and injuries, ranking first for both in terms of averages.

Hurricanes were the most costly event in terms of property damage and crop damage, ranking first for both in terms of averages.

Data Processing

Below we load in the data through a zip file.

fileURL <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'

download.file(fileURL, destfile='StormData.csv.bz2')

storm <- read.csv(bzfile('StormData.csv.bz2'),header=TRUE, stringsAsFactors = FALSE)

Next let’s take a quick look at the makeup of this dataframe. Looks like there are a lot of extra variables. Let’s start by converting BGN_DATE variable to a POSIXct just in case. Next let’s clean up the dataset by just selecting the variables we need and ommitting everything else.

str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
head(storm)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
storm$BGN_DATE <- mdy_hms(storm$BGN_DATE)

storm <- storm %>% select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)

kable(head(storm))
BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
1950-04-18 TORNADO 0 15 25.0 0
1950-04-18 TORNADO 0 0 2.5 0
1951-02-20 TORNADO 0 2 25.0 0
1951-06-08 TORNADO 0 2 2.5 0
1951-11-15 TORNADO 0 2 2.5 0
1951-11-15 TORNADO 0 6 2.5 0

As we know from the assignment description, EVTYPE is a key variable. After tabling a distrubition of the EVTYPE variable, it’s clear that there are a TON of weather event types, 985 to be exact! Since there are so many variables, let’s count the occurence of each weather type, and remove the ones with less than 100 occurrences in the dataset.

length(unique(storm$EVTYPE))
## [1] 985
storm_count <- storm %>% dplyr::select(EVTYPE) %>% dplyr::group_by(EVTYPE) %>% mutate(count = n()) %>% dplyr::arrange(desc(count)) %>% distinct()

storm <- storm %>% dplyr::group_by(EVTYPE) %>% mutate(count = n()) %>% filter(count > 100)

Results

Analyzing Fatalities and Injuries

In order to understand which event types are most fatal we should focus on the Fatality and Injuries variables. Of the 985 weather event types, we want to understand which ones cause the most fatalities and injuries. To deal with the high number of event types, we will find the average fatality for each event type using dplyr, and then sort the new tibble by average fatalities. Then we’ll repeat the same process for the injuries. There do seem to be a few outliers for fatalities and injuries, but there aren’t too many so we’ll leave them in the data.

EVTYPE Avg_Fatalities
HEAT 1.2216428
EXCESSIVE HEAT 1.1340882
RIP CURRENT 0.7829787
RIP CURRENTS 0.6710526
AVALANCHE 0.5803109
## No trace type specified:
##   Based on info supplied, a 'bar' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#bar

As we can see, the weather event with the most fatalities on average is Excessive Heat, and Heat is listed ast the number five most fatal. So heat seems to be the most fatal weather event. But Rip currents are right behind, averaging between 6 and 8 fatalites.

EVTYPE Avg_Injuries
EXCESSIVE HEAT 3.8885578
HEAT 2.7379400
TORNADO 1.5060674
FOG 1.3643123
DUST STORM 1.0304450
ICE STORM 0.9845464
## No trace type specified:
##   Based on info supplied, a 'bar' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#bar

For injuries we see the event that causes the most injuries is excessive heat, so it’s clear that excessive heat is the most harmful weather event. Need to make sure people have ways of staying hydrated and cool for excessive heat.

Economic Damage

Now we will repeat the same process for the economic damage of these storms, first with property damage and then with crop damage. We’ll start by filtering the data for just event type, property damage, and crop damage. Please note, looked at each of these variables and they do not seem to have any outliers, so we’ll leave these variables as-is.

econ_damage <- storm %>% select(EVTYPE, PROPDMG, CROPDMG)

prop_damage <- econ_damage %>% dplyr::group_by(EVTYPE) %>% dplyr::summarise(Avg_Prop_Damage = mean(PROPDMG, na.rm = TRUE)) %>% arrange(desc(Avg_Prop_Damage))

kable(head(prop_damage, 5))
EVTYPE Avg_Prop_Damage
HURRICANE 89.15908
RIVER FLOOD 80.09075
STORM SURGE 74.30456
TROPICAL STORM 70.17925
URBAN FLOOD 53.38554
prop_damage <- prop_damage %>% filter(Avg_Prop_Damage > 53.1)

plot_ly(prop_damage, x = ~EVTYPE, y=~Avg_Prop_Damage) %>% layout(title = "Top 5 Weather Events by Property Damage", xaxis = list(title = "Event Type", categoryorder = "array", categoryarray = c("HURRICANE", "RIVER FLOOD", "STORM SURGE", "TROPICAL STORM", "URBAN FLOOD")),  yaxis = list(title = "Avg. Property Damage"))
## No trace type specified:
##   Based on info supplied, a 'bar' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#bar

Hurricanes seem to cost the most property damage, with an average of almost $90 per weather event. Tropical storm was number four and is related to hurricanes, so those seem to be the worst events for property damage.

Finally, let’s look at crop damage.

crop_damage <- econ_damage %>% dplyr::group_by(EVTYPE) %>% dplyr::summarise(Avg_Crop_Damage = mean(CROPDMG, na.rm = TRUE)) %>% arrange(desc(Avg_Crop_Damage))

kable(head(crop_damage))
EVTYPE Avg_Crop_Damage
HURRICANE 30.685690
FLOODING 28.008333
RIVER FLOOD 20.173410
DROUGHT 13.624847
EXTREME COLD 9.345252
TROPICAL STORM 8.549449
crop_damage <- crop_damage %>% filter(Avg_Crop_Damage > 9)

plot_ly(crop_damage,x=~EVTYPE, y=~Avg_Crop_Damage) %>% layout(title = "Top 5 Weather Events by Crop Damage", xaxis = list(title = "Event Type", categoryorder = "array", categoryarray = c("HURRICANE", "FLOODING", "RIVER FLOOD", "DROUGHT", "EXTREME COLD")), yaxis = list(title = "Avg. Crop Damage"))
## No trace type specified:
##   Based on info supplied, a 'bar' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#bar

After calculating the average damage to crops, and then sorting in descending order, we see that hurricanes is the top weather event once again. So hurricanes cause the most property damage and crop damage.