SUMMARY: This dataset had 985 different types of weather events, so we filtered out any weather event that had less than 100 occurrences in the dataset so that we were working with decent sample sizes for each event. Using dplyr, we manipulated the data to identify the weather events with the highest averages for each metric.
Excessive heat was the most deadly type of weather event in terms of fatalities and injuries, ranking first for both in terms of averages.
Hurricanes were the most costly event in terms of property damage and crop damage, ranking first for both in terms of averages.
Below we load in the data through a zip file.
fileURL <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
download.file(fileURL, destfile='StormData.csv.bz2')
storm <- read.csv(bzfile('StormData.csv.bz2'),header=TRUE, stringsAsFactors = FALSE)
Next let’s take a quick look at the makeup of this dataframe. Looks like there are a lot of extra variables. Let’s start by converting BGN_DATE variable to a POSIXct just in case. Next let’s clean up the dataset by just selecting the variables we need and ommitting everything else.
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
storm$BGN_DATE <- mdy_hms(storm$BGN_DATE)
storm <- storm %>% select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
kable(head(storm))
BGN_DATE | EVTYPE | FATALITIES | INJURIES | PROPDMG | CROPDMG |
---|---|---|---|---|---|
1950-04-18 | TORNADO | 0 | 15 | 25.0 | 0 |
1950-04-18 | TORNADO | 0 | 0 | 2.5 | 0 |
1951-02-20 | TORNADO | 0 | 2 | 25.0 | 0 |
1951-06-08 | TORNADO | 0 | 2 | 2.5 | 0 |
1951-11-15 | TORNADO | 0 | 2 | 2.5 | 0 |
1951-11-15 | TORNADO | 0 | 6 | 2.5 | 0 |
As we know from the assignment description, EVTYPE is a key variable. After tabling a distrubition of the EVTYPE variable, it’s clear that there are a TON of weather event types, 985 to be exact! Since there are so many variables, let’s count the occurence of each weather type, and remove the ones with less than 100 occurrences in the dataset.
length(unique(storm$EVTYPE))
## [1] 985
storm_count <- storm %>% dplyr::select(EVTYPE) %>% dplyr::group_by(EVTYPE) %>% mutate(count = n()) %>% dplyr::arrange(desc(count)) %>% distinct()
storm <- storm %>% dplyr::group_by(EVTYPE) %>% mutate(count = n()) %>% filter(count > 100)
In order to understand which event types are most fatal we should focus on the Fatality and Injuries variables. Of the 985 weather event types, we want to understand which ones cause the most fatalities and injuries. To deal with the high number of event types, we will find the average fatality for each event type using dplyr, and then sort the new tibble by average fatalities. Then we’ll repeat the same process for the injuries. There do seem to be a few outliers for fatalities and injuries, but there aren’t too many so we’ll leave them in the data.
EVTYPE | Avg_Fatalities |
---|---|
HEAT | 1.2216428 |
EXCESSIVE HEAT | 1.1340882 |
RIP CURRENT | 0.7829787 |
RIP CURRENTS | 0.6710526 |
AVALANCHE | 0.5803109 |
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
As we can see, the weather event with the most fatalities on average is Excessive Heat, and Heat is listed ast the number five most fatal. So heat seems to be the most fatal weather event. But Rip currents are right behind, averaging between 6 and 8 fatalites.
EVTYPE | Avg_Injuries |
---|---|
EXCESSIVE HEAT | 3.8885578 |
HEAT | 2.7379400 |
TORNADO | 1.5060674 |
FOG | 1.3643123 |
DUST STORM | 1.0304450 |
ICE STORM | 0.9845464 |
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
For injuries we see the event that causes the most injuries is excessive heat, so it’s clear that excessive heat is the most harmful weather event. Need to make sure people have ways of staying hydrated and cool for excessive heat.
Now we will repeat the same process for the economic damage of these storms, first with property damage and then with crop damage. We’ll start by filtering the data for just event type, property damage, and crop damage. Please note, looked at each of these variables and they do not seem to have any outliers, so we’ll leave these variables as-is.
econ_damage <- storm %>% select(EVTYPE, PROPDMG, CROPDMG)
prop_damage <- econ_damage %>% dplyr::group_by(EVTYPE) %>% dplyr::summarise(Avg_Prop_Damage = mean(PROPDMG, na.rm = TRUE)) %>% arrange(desc(Avg_Prop_Damage))
kable(head(prop_damage, 5))
EVTYPE | Avg_Prop_Damage |
---|---|
HURRICANE | 89.15908 |
RIVER FLOOD | 80.09075 |
STORM SURGE | 74.30456 |
TROPICAL STORM | 70.17925 |
URBAN FLOOD | 53.38554 |
prop_damage <- prop_damage %>% filter(Avg_Prop_Damage > 53.1)
plot_ly(prop_damage, x = ~EVTYPE, y=~Avg_Prop_Damage) %>% layout(title = "Top 5 Weather Events by Property Damage", xaxis = list(title = "Event Type", categoryorder = "array", categoryarray = c("HURRICANE", "RIVER FLOOD", "STORM SURGE", "TROPICAL STORM", "URBAN FLOOD")), yaxis = list(title = "Avg. Property Damage"))
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
Hurricanes seem to cost the most property damage, with an average of almost $90 per weather event. Tropical storm was number four and is related to hurricanes, so those seem to be the worst events for property damage.
Finally, let’s look at crop damage.
crop_damage <- econ_damage %>% dplyr::group_by(EVTYPE) %>% dplyr::summarise(Avg_Crop_Damage = mean(CROPDMG, na.rm = TRUE)) %>% arrange(desc(Avg_Crop_Damage))
kable(head(crop_damage))
EVTYPE | Avg_Crop_Damage |
---|---|
HURRICANE | 30.685690 |
FLOODING | 28.008333 |
RIVER FLOOD | 20.173410 |
DROUGHT | 13.624847 |
EXTREME COLD | 9.345252 |
TROPICAL STORM | 8.549449 |
crop_damage <- crop_damage %>% filter(Avg_Crop_Damage > 9)
plot_ly(crop_damage,x=~EVTYPE, y=~Avg_Crop_Damage) %>% layout(title = "Top 5 Weather Events by Crop Damage", xaxis = list(title = "Event Type", categoryorder = "array", categoryarray = c("HURRICANE", "FLOODING", "RIVER FLOOD", "DROUGHT", "EXTREME COLD")), yaxis = list(title = "Avg. Crop Damage"))
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
After calculating the average damage to crops, and then sorting in descending order, we see that hurricanes is the top weather event once again. So hurricanes cause the most property damage and crop damage.