This report will, based on the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA), study the adverse effects of severe weather events on the US population health, i.e. the number of associated fatalities and injuries, as well as their economic consequences, i.e. the damage caused to property and crops. For each type of event registered in the NOAA storm database, the total number of fatalities and injuries, and the total property and crop damages (as well as total combined damages) are calculated. The 10 most severe events from each of these categories are selected and compared.
In the first step, we will check, whether the CSV.BZ2 file is already in the folder. If it isn’t, we will download it from the link that was provided in the instructions.
if(!("repdata_data_StormData.csv.bz2" %in% dir())) {
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,destfile="repdata_data_StormData.csv.bz2",method="libcurl")
}
Next we read in the data using the read.csv
function, and look at the variable names.
storm <- read.csv("repdata_data_StormData.csv.bz2",header=TRUE)
names(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The columns of interest are the ones labeling the event type, EVTYPE
, the number or associated fatalities and injuries, FATALITIES
and INJURIES
, and the columns classifying the property and crop damages, PROPDMG
and CROPDMG
. Since the damage amounts are supposed to be recorded in numbers between 0 and 999, together with a multiplier (“K” for thousand, “M” for million, “B” for billion), we also need the columns containing these multipliers, PROPDMGEXP
and CROPDMGEXP
. Starting off by looking at some of the values in EVTYPE
, we can see that there are some problems.
head(table(storm$EVTYPE),50)
##
## HIGH SURF ADVISORY COASTAL FLOOD
## 1 1
## FLASH FLOOD LIGHTNING
## 1 1
## TSTM WIND TSTM WIND (G45)
## 4 1
## WATERSPOUT WIND
## 1 1
## ? ABNORMAL WARMTH
## 1 4
## ABNORMALLY DRY ABNORMALLY WET
## 2 1
## ACCUMULATED SNOWFALL AGRICULTURAL FREEZE
## 4 6
## APACHE COUNTY ASTRONOMICAL HIGH TIDE
## 1 103
## ASTRONOMICAL LOW TIDE AVALANCE
## 174 1
## AVALANCHE BEACH EROSIN
## 386 1
## Beach Erosion BEACH EROSION
## 1 3
## BEACH EROSION/COASTAL FLOOD BEACH FLOOD
## 1 2
## BELOW NORMAL PRECIPITATION BITTER WIND CHILL
## 2 1
## BITTER WIND CHILL TEMPERATURES Black Ice
## 3 3
## BLACK ICE BLIZZARD
## 14 2719
## BLIZZARD AND EXTREME WIND CHIL BLIZZARD AND HEAVY SNOW
## 2 1
## Blizzard Summary BLIZZARD WEATHER
## 1 1
## BLIZZARD/FREEZING RAIN BLIZZARD/HEAVY SNOW
## 1 2
## BLIZZARD/HIGH WIND BLIZZARD/WINTER STORM
## 1 1
## BLOW-OUT TIDE BLOW-OUT TIDES
## 1 1
## BLOWING DUST blowing snow
## 4 2
## Blowing Snow BLOWING SNOW
## 3 12
## BLOWING SNOW- EXTREME WIND CHI BLOWING SNOW & EXTREME WIND CH
## 1 2
## BLOWING SNOW/EXTREME WIND CHIL BREAKUP FLOODING
## 1 1
## BRUSH FIRE BRUSH FIRES
## 3 1
Some entries are clearly misspelled, while others appear multiple times with different capitalizations. Since fixing all typos manually would exceed the scope of this project, and likely not make any useful differences (assuming that the majority of entries are spelled correctly), we will just employ a quick fix to capitalize all entries, which already reduces the number of event types by roughly 10%.
length(unique((storm$EVTYPE)))
## [1] 985
length(unique(toupper(storm$EVTYPE)))
## [1] 898
storm$EVTYPE <- toupper(storm$EVTYPE)
In the next step, we check that there are no NA
values in any of the four numeric columns.
sum(is.na(storm$FATALITIES))
## [1] 0
sum(is.na(storm$INJURIES))
## [1] 0
sum(is.na(storm$PROPDMG))
## [1] 0
sum(is.na(storm$CROPDMG))
## [1] 0
Now we take a look at the PROPDMGEXP
and CROPDMGEXP
columns. These should ideally only be filled with empty fields or the characters “K”, “M” and “B”.
table(storm$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(storm$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
Again, there are some clearly mislabeled entries, as well as cases of different capitalization. Over the next series of steps, we will create new columns for the property and crop damages, which multiply the entries with their correct exponents, i.e. 1000 for “K”, 1000000 for “M” and “1000000000 for”B".
thousands <- toupper(storm$PROPDMGEXP) == "K"
millions <- toupper(storm$PROPDMGEXP) == "M"
billions <- toupper(storm$PROPDMGEXP) == "B"
prop <- storm$PROPDMG
prop[thousands] <- 1000*prop[thousands]
prop[millions] <- 1000000*prop[millions]
prop[billions] <- 1000000000*prop[billions]
cthousands <- toupper(storm$CROPDMGEXP) == "K"
cmillions <- toupper(storm$CROPDMGEXP) == "M"
cbillions <- toupper(storm$CROPDMGEXP) == "B"
crop <- storm$CROPDMG
crop[cthousands] <- 1000*crop[cthousands]
crop[cmillions] <- 1000000*crop[cmillions]
crop[cbillions] <- 1000000000*crop[cbillions]
All mistaken entries in the two exponential columns are ignored, but different capitalizations are accounted for. The two columns are added to the original data frame.
storm$propd <- prop
storm$cropd <- crop
The analysis is split into two separate questions: which types of events are most harmful with respect to population health, and which types of events have the greatest economic consequences?
As harm to the population, we can define both the number of fatalities as well as the number of injuries related to certain types of events. These will be treated separately.
The dplyr
package will be used for the analysis. First the data set will be grouped by event type, and the total number of fatalities for each type is calculated.
library(dplyr)
fatal <- storm %>% group_by(EVTYPE) %>% summarise(number=sum(FATALITIES))
head(fatal)
## # A tibble: 6 x 2
## EVTYPE number
## <chr> <dbl>
## 1 " HIGH SURF ADVISORY" 0
## 2 " COASTAL FLOOD" 0
## 3 " FLASH FLOOD" 0
## 4 " LIGHTNING" 0
## 5 " TSTM WIND" 0
## 6 " TSTM WIND (G45)" 0
The deadliest event type and the associated number of fatalities can be calculated.
max(fatal$number)
## [1] 5633
fatal$EVTYPE[fatal$number == max(fatal$number)]
## [1] "TORNADO"
Tornados are responsible for the most fatalities, 5633. The same process is repeated for injuries.
inj <- storm %>% group_by(EVTYPE) %>% summarise(number=sum(INJURIES))
head(inj)
## # A tibble: 6 x 2
## EVTYPE number
## <chr> <dbl>
## 1 " HIGH SURF ADVISORY" 0
## 2 " COASTAL FLOOD" 0
## 3 " FLASH FLOOD" 0
## 4 " LIGHTNING" 0
## 5 " TSTM WIND" 0
## 6 " TSTM WIND (G45)" 0
max(inj$number)
## [1] 91346
inj$EVTYPE[inj$number == max(inj$number)]
## [1] "TORNADO"
Here, too, tornados are the number one source of injuries. In order to have a more comprehensive overview of the most harmful event types, we can look at the top 10 entries in each category. To that purpose, we can sort the earlier output.
fatal_ord <- order(fatal$number,decreasing = TRUE)
fatal10 <- head(fatal[fatal_ord,],10)
fatal10
## # A tibble: 10 x 2
## EVTYPE number
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
inj_ord <- order(inj$number,decreasing = TRUE)
inj10 <- head(inj[inj_ord,],10)
inj10
## # A tibble: 10 x 2
## EVTYPE number
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
While tornados are the clear leader in both categories, the following places show some variation. To illustrate these rankings, we combine both lists into one.
fatal10$type <- "Fatalities"
inj10$type <- "Injuries"
top10 <- rbind(fatal10,inj10)
This combined list can then be plotted, using the ggplot2
package.
library(ggplot2)
g <- ggplot(top10,aes(EVTYPE,number,fill=EVTYPE))
g + geom_bar(stat="identity") + facet_grid(type~.,scales="free") +
theme_bw() + theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Event Type", y = "Number of Fatalities/Injuries") +
ggtitle("Top 10 event types responsible for fatalities and injuries")
This figure shows the total number of fatalities and injuries for the ten most harmful event types in each category. In both cases, tornados are the most harmful type of event. We can also see, that some of the categories should possibly be combined, since they might refer to the same event type (i.e. HEAT
and EXCESSIVE HEAT
, or THUNDERSTORM WIND
and TSTM WIND
). But even then, they are no match for tornado events. The events which are most harmful to the population health can thus be identified as tornados, excessive heat, (flash) floods and lightning strikes.
For the purposes of this report, we assume that the economic consequences can be estimated by the total damages to properties and crops which are associated with each event type. We are interested in looking at these damages both separately and combined. As in the previous section, the original data is grouped according to the respective category, and total damages are calculated for each event type.
prop_sum <- storm %>% group_by(EVTYPE) %>% summarise(cost=sum(propd))
head(prop_sum)
## # A tibble: 6 x 2
## EVTYPE cost
## <chr> <dbl>
## 1 " HIGH SURF ADVISORY" 200000
## 2 " COASTAL FLOOD" 0
## 3 " FLASH FLOOD" 50000
## 4 " LIGHTNING" 0
## 5 " TSTM WIND" 8100000
## 6 " TSTM WIND (G45)" 8000
max(prop_sum$cost)
## [1] 144657709807
prop_sum$EVTYPE[prop_sum$cost == max(prop_sum$cost)]
## [1] "FLOOD"
crop_sum <- storm %>% group_by(EVTYPE) %>% summarise(cost=sum(cropd))
head(crop_sum)
## # A tibble: 6 x 2
## EVTYPE cost
## <chr> <dbl>
## 1 " HIGH SURF ADVISORY" 0
## 2 " COASTAL FLOOD" 0
## 3 " FLASH FLOOD" 0
## 4 " LIGHTNING" 0
## 5 " TSTM WIND" 0
## 6 " TSTM WIND (G45)" 0
max(crop_sum$cost)
## [1] 13972566000
crop_sum$EVTYPE[crop_sum$cost == max(crop_sum$cost)]
## [1] "DROUGHT"
combined <- merge(prop_sum,crop_sum,by="EVTYPE")
combined$total <- combined$cost.x+combined$cost.y
This time, the event type responsible for incurring the largest damage is not the same for the categories. Floods lead to the largest property damage, while droughts are responsible for the largest crop damage (but note that that sum is one order of magnitude smaller). As before, the lists are ordered and the 10 largest damage sources each are extracted:
prop_ord <- order(prop_sum$cost,decreasing = TRUE)
prop10 <- head(prop_sum[prop_ord,],10)
prop10
## # A tibble: 10 x 2
## EVTYPE cost
## <chr> <dbl>
## 1 FLOOD 144657709807
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56937160779.
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 16140812067.
## 6 HAIL 15732267048.
## 7 HURRICANE 11868319010
## 8 TROPICAL STORM 7703890550
## 9 WINTER STORM 6688497251
## 10 HIGH WIND 5270046295
crop_ord <- order(crop_sum$cost,decreasing = TRUE)
crop10 <- head(crop_sum[crop_ord,],10)
crop10
## # A tibble: 10 x 2
## EVTYPE cost
## <chr> <dbl>
## 1 DROUGHT 13972566000
## 2 FLOOD 5661968450
## 3 RIVER FLOOD 5029459000
## 4 ICE STORM 5022113500
## 5 HAIL 3025954473
## 6 HURRICANE 2741910000
## 7 HURRICANE/TYPHOON 2607872800
## 8 FLASH FLOOD 1421317100
## 9 EXTREME COLD 1312973000
## 10 FROST/FREEZE 1094186000
comb_ord <- order(combined$total,decreasing = TRUE)
comb10 <- head(combined[comb_ord,],10)
comb10
## EVTYPE cost.x cost.y total
## 154 FLOOD 144657709807 5661968450 150319678257
## 372 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 758 TORNADO 56937160779 414953270 57352114049
## 599 STORM SURGE 43323536000 5000 43323541000
## 212 HAIL 15732267048 3025954473 18758221521
## 138 FLASH FLOOD 16140812067 1421317100 17562129167
## 84 DROUGHT 1046106000 13972566000 15018672000
## 363 HURRICANE 11868319010 2741910000 14610229010
## 529 RIVER FLOOD 5118945500 5029459000 10148404500
## 387 ICE STORM 3944927860 5022113500 8967041360
The three lists are again combined, in order to create a panel plot.
comb10a <- data.frame(comb10$EVTYPE)
comb10a$cost <- comb10$total
colnames(comb10a) <- c("EVTYPE","cost")
prop10$type <- "Property damage"
crop10$type <- "Crop damage"
comb10a$type <- "Combined damage"
top10dam <- rbind(prop10,crop10,comb10a)
g <- ggplot(top10dam,aes(EVTYPE,cost,fill=EVTYPE))
g + geom_bar(stat="identity") + facet_grid(type~.,scales="free") +
theme_bw() + theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Event Type", y = "Estimated damage [USD]") +
ggtitle("Property and crop damage due to different event types")
This figure shows, the crop damage (middle), property damage (bottom) and total combined damage (top) associated with the top 10 types of severe weather events in each category. As before, some categories should possibly be combined (such as HURRICANE
and HURRICANE/TYPHOON
), but that does not affect the final results significantly. While droughts are the number one source of crop damages, the event types with the largest total economic consequences are floods, followed by hurricanes/typhoons and tornados.