This document presents my analysis of the “NOAA Storm Database” dataset. It shows how the raw data is processed and the results are generated. Due conflicting deadlines, I have performed only a minimal analysis and have not completely cleaned up the dataset. Although, this may render the results less valid, my analysis can be reproduced and I point out some of the problems of the analysis. I think that this document still proves that I have understood the main concepts behind “reproducible research”.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The dataset needs to be cleaned up extensively. However, due to time constraints I will only show an example how this would be performed.
After loading the CSV-file, the values of the variables EVTYPE, CROPDMGEXP and PROPDMGEXP are converted to lowercase or uppercase, because lower-/uppercase is not used consequently in the CSV-file.
storm_data <-
read.csv("~/Coursera Data Science/Course Project Week 4/repdata_data_StormData.csv") %>%
mutate(EVTYPE = factor(tolower(EVTYPE)),
CROPDMGEXP = toupper(CROPDMGEXP),
PROPDMGEXP = toupper(PROPDMGEXP))
There are only 48 possible event types described in the manual. However, the variable EVTYPE has many more factors.
summary(storm_data$EVTYPE)
## hail tstm wind thunderstorm wind
## 288661 219942 82564
## tornado flash flood flood
## 60652 54277 25327
## thunderstorm winds high wind lightning
## 20843 20214 15754
## heavy snow heavy rain winter storm
## 15708 11742 11433
## winter weather funnel cloud marine tstm wind
## 7045 6844 6175
## marine thunderstorm wind waterspout strong wind
## 5812 3796 3569
## urban/sml stream fld wildfire blizzard
## 3392 2761 2719
## drought ice storm excessive heat
## 2488 2006 1678
## high winds wild/forest fire frost/freeze
## 1533 1457 1343
## dense fog winter weather/mix tstm wind/hail
## 1293 1104 1028
## extreme cold/wind chill heat high surf
## 1002 767 734
## tropical storm flash flooding extreme cold
## 690 682 657
## coastal flood lake-effect snow flood/flash flood
## 656 636 625
## snow landslide cold/wind chill
## 617 600 539
## fog rip current marine hail
## 538 470 442
## dust storm avalanche wind
## 427 386 346
## rip currents storm surge freezing rain
## 304 261 260
## urban flood heavy surf/high surf extreme windchill
## 251 228 204
## strong winds dry microburst coastal flooding
## 204 186 183
## light snow astronomical low tide hurricane
## 176 174 174
## river flood record warmth dust devil
## 173 154 149
## storm surge/tide marine high wind unseasonably warm
## 148 135 126
## flooding astronomical high tide moderate snowfall
## 120 103 101
## urban flooding wintry mix hurricane/typhoon
## 99 94 88
## funnel clouds heavy surf cold
## 87 87 82
## record heat freeze heat wave
## 82 76 75
## record cold gusty winds ice
## 67 65 61
## thunderstorm winds hail tropical depression sleet
## 61 60 59
## frost unseasonably dry small hail
## 57 56 53
## other thunderstorm windss marine strong wind
## 52 51 48
## freezing fog funnel thunderstorm
## 46 46 45
## glaze temperature record tstm wind (g45)
## 43 43 39
## mixed precipitation waterspouts monthly precipitation
## 37 37 36
## (Other)
## 2677
There are several problems:
storm_data$EVTYPE[storm_data$EVTYPE %in% c("avalance", "avalanche")] <- "avalanche"
storm_data$EVTYPE[substr(storm_data$EVTYPE, 1, 8) == "blizzard"] <- "blizzard"
Furthermore, the variables ending with “EXP” should contain “K”, “M” and “B” according to the manual. I have chosen to set other values to NA, although it might have been possible to guess what is meant by them.
storm_data$CROPDMGEXP[!(storm_data$CROPDMGEXP %in% c("", "K", "M", "B"))] <- NA
storm_data$CROPDMGEXP[storm_data$CROPDMGEXP == ""] <- 1
storm_data$CROPDMGEXP[storm_data$CROPDMGEXP == "K"] <- 1000
storm_data$CROPDMGEXP[storm_data$CROPDMGEXP == "M"] <- 1000000
storm_data$CROPDMGEXP[storm_data$CROPDMGEXP == "B"] <- 1000000000
storm_data$PROPDMGEXP[!(storm_data$PROPDMGEXP %in% c("", "K", "M", "B"))] <- NA
storm_data$PROPDMGEXP[storm_data$PROPDMGEXP == ""] <- 1
storm_data$PROPDMGEXP[storm_data$PROPDMGEXP == "K"] <- 1000
storm_data$PROPDMGEXP[storm_data$PROPDMGEXP == "M"] <- 1000000
storm_data$PROPDMGEXP[storm_data$PROPDMGEXP == "B"] <- 1000000000
storm_data <-
storm_data %>%
mutate(crop_damage = CROPDMG * as.numeric(CROPDMGEXP),
property_damage = PROPDMG * as.numeric(PROPDMGEXP),
total_damage = crop_damage + property_damage)
It is difficult to decide how the harmfulness can be compared. Is an event with only 8 fatalities and 43 injuries less harmful than an event with 14 fatalities and no injuries?
Here, you can se a table listing the top 10 most harmful events based on average fatalities. Because I have not extensively cleaned up the dataset, I have chosen to exclude events with a number of observations below 10, because they often refer to singular events (“tropical storm gordon”).
storm_data %>%
group_by(EVTYPE) %>%
summarize(injuries = mean(INJURIES),
fatalities = mean(FATALITIES),
n_observations = n()) %>%
arrange(desc(fatalities), desc(injuries)) %>%
filter(n_observations > 10) %>%
top_n(10, fatalities)
## Source: local data frame [10 x 4]
##
## EVTYPE injuries fatalities n_observations
## 1 extreme heat 7.0454545 4.3636364 22
## 2 heat wave 5.0533333 2.2933333 75
## 3 unseasonably warm and dry 0.0000000 2.2307692 13
## 4 tsunami 6.4500000 1.6500000 20
## 5 heat 2.7379400 1.2216428 767
## 6 excessive heat 3.8885578 1.1340882 1678
## 7 rip current 0.4936170 0.7829787 470
## 8 hurricane/typhoon 14.4886364 0.7272727 88
## 9 rip currents 0.9769737 0.6710526 304
## 10 flash flood/flood 0.0000000 0.6363636 22
Here, I show a list with the top 10 of events with the greates economic consequences and a barplot of the top 5. Note that there are problems with EVTYPE variable, because the events “hurricane/typhoon” and “hurricane” should be one category.
storm_data %>%
group_by(EVTYPE) %>%
summarize(damage = mean(total_damage),
n_observations = n()) %>%
arrange(desc(damage)) %>%
filter(n_observations > 10) %>%
top_n(10, damage)
## Source: local data frame [10 x 3]
##
## EVTYPE damage n_observations
## 1 hurricane/typhoon 817201282 88
## 2 storm surge 165990579 261
## 3 severe thunderstorm 92735385 13
## 4 hurricane 83966833 174
## 5 river flood 58661298 173
## 6 typhoon 54641364 11
## 7 storm surge/tide 31365122 148
## 8 flash flood/flood 12409318 22
## 9 tropical storm 12148169 690
## 10 tsunami 7204100 20
top5_damage <-
storm_data %>%
group_by(EVTYPE) %>%
summarize(damage = mean(total_damage),
n_observations = n()) %>%
arrange(desc(damage)) %>%
filter(n_observations > 10) %>%
top_n(5, damage)
barplot(height = top5_damage$damage, names.arg = top5_damage$EVTYPE, cex.names = .8)
title("Top 5 - Total damage")
Although storms seem to have the biggest economic impact, extreme heat seems to be more harmful to the population.