The data on severe weather events shows that the event category with by far the highest impact on health is TORNADO, and the one with the highest economic loss is FLOOD. However, a lot of data on economic loss appears to be missing, as more than half of the event categories show no economic loss at all. Moreover, the categorization is chaotic, and there is, among other problems, a clear overlap between different categories.
In order to process the data, we need the following libraries:
library(dplyr)
The data was downloaded manually from the URL https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 as per the assignment instructions. read.csv() is used to read the downloaded data:
stormdata <- read.csv("repdata_data_StormData.csv", header = TRUE, sep = ",")
Note: A more reproducible way would be to download the data from the URL when generating the report. However, downloading data from the given URL requires a login. If you try to download it directly from R, you get a "403 Forbidden" error. Thus, the data has been downloaded manually.
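If the URL were to become accessible without a login, the manual step could be replaced by a guarded download. The following is only a minimal sketch under that assumption; the helper name fetch_if_missing and the local file name in the commented call are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original report): download a file
# only when no local copy exists, so repeated runs reuse the same file.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the compressed file intact on Windows
    download.file(url, destfile, mode = "wb")
  }
  destfile
}

# Example call, commented out because the URL currently answers "403 Forbidden":
# storm_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# stormdata <- read.csv(fetch_if_missing(storm_url, "StormData.csv.bz2"))
```

The guard means the report never re-downloads a file that is already present, which also keeps the manual download usable as a fallback.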
We are only looking for events and their consequences in terms of health and economic impact. Since the dataset is huge, we subset it to include only the columns we need for further processing, namely EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP (2). After subsetting, the original stormdata dataset is removed from memory to limit memory usage.
stormdata_subset <- stormdata[c("EVTYPE", "FATALITIES", "INJURIES",
                                "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
rm(stormdata)
Numbers for property damage and crop damage must be calculated, because the amounts are recorded as a number followed by a magnitude letter (1):
> Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. If additional precision is available, it may be provided in the narrative part of the entry.
The letters in the PROPDMGEXP and CROPDMGEXP columns are replaced by the corresponding numeric values; any other value is mapped to 0. We use a separate script, function_magnitude.R, to achieve this. The script looks like this:
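# Convert a magnitude letter ("K", "M", "B") to its numeric multiplier.
# Any other value falls through to the unnamed default and yields 0.
numeric_value <- function(magnitude) {
  numeric_magnitude <- as.numeric(
    switch(
      as.character(magnitude),
      "K" = 10^3,
      "M" = 10^6,
      "B" = 10^9,
      0)
  )
  return(numeric_magnitude)
}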
The numbers in the PROPDMG and CROPDMG columns are multiplied by the multipliers in the PROPDMGEXP and CROPDMGEXP columns, respectively, and the calculated values are placed in the PROPDMGRES and CROPDMGRES columns:
source("function_magnitude.R")
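# Replace each magnitude letter with its numeric multiplier. vapply() returns
# a plain numeric vector, so no list column is stored in the data frame.
stormdata_subset$PROPDMGEXP <- vapply(stormdata_subset$PROPDMGEXP,
                                      numeric_value, numeric(1),
                                      USE.NAMES = FALSE)
stormdata_subset$PROPDMGRES <- stormdata_subset$PROPDMG *
  stormdata_subset$PROPDMGEXP
stormdata_subset$CROPDMGEXP <- vapply(stormdata_subset$CROPDMGEXP,
                                      numeric_value, numeric(1),
                                      USE.NAMES = FALSE)
stormdata_subset$CROPDMGRES <- stormdata_subset$CROPDMG *
  stormdata_subset$CROPDMGEXP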
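The letter-to-multiplier step can also be written as a vectorized lookup, which avoids calling a function once per row. The sketch below is an alternative, not the method used in this report; the names magnitude_lookup and numeric_values are hypothetical. It uses the same mapping (K, M, B, everything else 0) and reproduces the arithmetic for the two largest single events reported further down (250 with M, 115 with B), plus one made-up row with an unrecognized lowercase code.

```r
# Named lookup vector: magnitude letter -> multiplier (same mapping as the switch).
magnitude_lookup <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical vectorized variant of numeric_value(); unknown codes map to 0.
numeric_values <- function(magnitude) {
  multipliers <- unname(magnitude_lookup[as.character(magnitude)])
  multipliers[is.na(multipliers)] <- 0  # codes missing from the lookup give NA
  multipliers
}

damage   <- c(250, 115, 10)
exponent <- c("M", "B", "k")  # "k" is a made-up example of an unrecognized code
result   <- damage * numeric_values(exponent)
print(result)  # 2.5e+08, 1.15e+11 and 0
```

The last value illustrates a side effect of the mapping: a damage figure with an unrecognized code contributes nothing to the totals.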
We add a HARMFULNESS column that holds the total of fatalities and injuries, and an ECONLOSS column that holds the total economic loss, based on property damage and crop damage.
stormdata_subset$HARMFULNESS <- stormdata_subset$FATALITIES +
  stormdata_subset$INJURIES
stormdata_subset$ECONLOSS <- stormdata_subset$PROPDMGRES +
  stormdata_subset$CROPDMGRES
Consequences for health are set out in the table’s FATALITIES and INJURIES columns. For each event, the sum of the two columns is stored in the added HARMFULNESS column.
The single most harmful event registered:
stormdata_subset[which.max(stormdata_subset$HARMFULNESS), ]
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 157885 TORNADO 42 1700 250 1e+06 0 0
## PROPDMGRES CROPDMGRES HARMFULNESS ECONLOSS
## 157885 2.5e+08 0 1742 2.5e+08
The most harmful event category:
df_harmful_events <- stormdata_subset %>% group_by(EVTYPE) %>% summarise(sum=sum(HARMFULNESS)) %>% data.frame()
df_harmful_events[which.max(df_harmful_events$sum), ]
## EVTYPE sum
## 834 TORNADO 96979
There are a few problems with the data, most importantly the relatively chaotic categorization, where some categories overlap in various ways. However, since one event category stands out in terms of harm to health, no recategorization is necessary to find the category with by far the highest effect on health:
arrange(df_harmful_events, desc(sum))[1:30, ]
## EVTYPE sum
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
## 11 HIGH WIND 1385
## 12 HAIL 1376
## 13 HURRICANE/TYPHOON 1339
## 14 HEAVY SNOW 1148
## 15 WILDFIRE 986
## 16 THUNDERSTORM WINDS 972
## 17 BLIZZARD 906
## 18 FOG 796
## 19 RIP CURRENT 600
## 20 WILD/FOREST FIRE 557
## 21 RIP CURRENTS 501
## 22 HEAT WAVE 481
## 23 DUST STORM 462
## 24 WINTER WEATHER 431
## 25 TROPICAL STORM 398
## 26 AVALANCHE 394
## 27 EXTREME COLD 391
## 28 STRONG WIND 383
## 29 DENSE FOG 360
## 30 HEAVY RAIN 349
Deciding the second most harmful event category, and so on, is another matter, as the listing above shows. Several categories belong together, for instance TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS, and merging them into one category affects the ranking.
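To illustrate the effect of merging, the sketch below combines the three thunderstorm wind categories using the totals from the listing above. It is a minimal example in base R, not a full recategorization: the regular expression, the CATEGORY column, and the column name total are assumptions made for this illustration, and only a handful of rows are included.

```r
# Totals copied from the listing above (only the rows needed for the example).
top <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD",
             "THUNDERSTORM WIND", "THUNDERSTORM WINDS"),
  total  = c(96979, 8428, 7461, 7259, 1621, 972),
  stringsAsFactors = FALSE
)

# Map the three spellings onto one label; all other categories stay unchanged.
top$CATEGORY <- sub("^(TSTM|THUNDERSTORM) WINDS?$", "THUNDERSTORM WIND", top$EVTYPE)

# Sum the totals per merged category and sort in descending order.
merged <- aggregate(total ~ CATEGORY, data = top, FUN = sum)
merged <- merged[order(merged$total, decreasing = TRUE), ]
print(merged, row.names = FALSE)
# THUNDERSTORM WIND now totals 7461 + 1621 + 972 = 10054
```

With this merge alone, THUNDERSTORM WIND (10054) moves ahead of EXCESSIVE HEAT (8428). Merging the heat-related categories as well would change the order again, which is why the ranking below first place depends on the recategorization chosen.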
The consequences in economic terms are calculated from the PROPDMGRES and CROPDMGRES columns. For each event, the sum of the two columns is stored in the added ECONLOSS column.
The single event with the highest economic loss:
stormdata_subset[which.max(stormdata_subset$ECONLOSS), ]
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 605953 FLOOD 0 0 115 1e+09 32.5 1e+06
## PROPDMGRES CROPDMGRES HARMFULNESS ECONLOSS
## 605953 1.15e+11 32500000 0 115032500000
The event category with the highest economic loss:
df_econloss_events <- stormdata_subset %>% group_by(EVTYPE) %>% summarise(sum=sum(ECONLOSS)) %>% data.frame()
df_econloss_events[which.max(df_econloss_events$sum), ]
## EVTYPE sum
## 170 FLOOD 150319678250
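Again, there are a few problems with the resulting data. The function numeric_value() maps any magnitude code other than K, M, or B to 0, so some recorded losses may be counted as 0. Moreover, more than half of the event categories have a total economic loss of 0, as we can see from the summary of the category totals (the median is 0):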
summary(df_econloss_events)
## EVTYPE sum
## HIGH SURF ADVISORY: 1 Min. :0.000e+00
## COASTAL FLOOD : 1 1st Qu.:0.000e+00
## FLASH FLOOD : 1 Median :0.000e+00
## LIGHTNING : 1 Mean :4.836e+08
## TSTM WIND : 1 3rd Qu.:8.500e+04
## TSTM WIND (G45) : 1 Max. :1.503e+11
## (Other) :979
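The summary above is computed over category totals, not over individual events. To check how many individual events have no loss estimate, the share of zero values can be computed per row and compared with the share per category. The following is a minimal sketch on made-up toy data; the values are illustrative and not taken from the dataset.

```r
# Made-up toy data with the same column names as stormdata_subset.
toy <- data.frame(
  EVTYPE   = c("FLOOD", "FLOOD", "HAIL", "FOG", "FOG"),
  ECONLOSS = c(0, 2.5e8, 5000, 0, 0),
  stringsAsFactors = FALSE
)

# Share of individual events with a loss of 0 (here 3 of 5 rows).
share_zero_events <- mean(toy$ECONLOSS == 0)

# Share of categories whose total loss is 0 (here 1 of 3 categories: FOG).
category_totals <- tapply(toy$ECONLOSS, toy$EVTYPE, sum)
share_zero_categories <- mean(category_totals == 0)

print(c(events = share_zero_events, categories = share_zero_categories))
```

On the real data, the event-level figure would be mean(stormdata_subset$ECONLOSS == 0). The two shares can differ considerably, because a single event with a loss is enough to give its category a non-zero total.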
Then there is the chaotic categorization. However, since one event category stands out in terms of economic loss (as one does for the effect on health above), no recategorization is necessary to find the event category with the highest loss:
arrange(df_econloss_events, desc(sum))[1:15, ]
## EVTYPE sum
## 1 FLOOD 150319678250
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57340613590
## 4 STORM SURGE 43323541000
## 5 HAIL 18752904170
## 6 FLASH FLOOD 17562128610
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041310
## 11 TROPICAL STORM 8382236550
## 12 WINTER STORM 6715441250
## 13 HIGH WIND 5908617560
## 14 WILDFIRE 5060586800
## 15 TSTM WIND 5038935790
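As a rough check of the claim that no recategorization is needed for first place, the sketch below merges the hurricane-related categories from the listing above and compares the result with FLOOD alone. Which categories belong together is a judgment call made only for this illustration, and the check covers the top 15 categories only, not the long tail of smaller ones.

```r
# Totals copied from the listing above.
top_losses <- c(
  "FLOOD"             = 150319678250,
  "HURRICANE/TYPHOON" = 71913712800,
  "STORM SURGE"       = 43323541000,
  "HURRICANE"         = 14610229010,
  "TROPICAL STORM"    = 8382236550
)

# Assumed grouping for illustration; a domain expert might group differently.
hurricane_related <- c("HURRICANE/TYPHOON", "STORM SURGE",
                       "HURRICANE", "TROPICAL STORM")

merged_hurricane <- sum(top_losses[hurricane_related])
print(merged_hurricane)                          # 138229719360
print(merged_hurricane < top_losses[["FLOOD"]])  # TRUE
```

Even this generous merge stays below FLOOD on its own, and merging FLASH FLOOD and RIVER FLOOD into FLOOD would only widen the gap.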
The most harmful category of events is by far TORNADO, as seen in the bar chart of the 15 most harmful event categories.
par(mar=c(2,4,1,1))
df_top15_harmfulness <- arrange(df_harmful_events, desc(sum))[1:15, ]
barplot(df_top15_harmfulness$sum, legend.text = df_top15_harmfulness$EVTYPE,
        col = rainbow(15), ylab = "Number of persons harmed",
        ylim = c(0, df_top15_harmfulness$sum[1]))
Figure 1: Harmful events by event category.
The event category causing the highest economic loss is FLOOD, as seen in the bar chart of the 15 event categories causing the highest economic loss.
par(mar=c(2,4,1,1))
df_top15_econloss <- arrange(df_econloss_events, desc(sum))[1:15, ]
barplot(df_top15_econloss$sum, legend.text = df_top15_econloss$EVTYPE,
        col = rainbow(15), ylab = "Loss in US dollars",
        ylim = c(0, df_top15_econloss$sum[1]))
Figure 2: Economic loss by event category.
Listing all categories, we can clearly see that there is a lot of overlap between them, which may affect the ranking of the event categories in second place and below. A recategorization is needed, in cooperation with a domain expert.
(1) [NATIONAL WEATHER SERVICE INSTRUCTION 10-1605](https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) - Downloaded January 26, 2018.
(2) [NOAA Storm Data FAQ Page](https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf) - Downloaded January 26, 2018.