Load packages:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cowplot)
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
library(knitr)
library(dplyr)

Research data and questions

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

  - Storm Data

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

  - National Weather Service Storm Data Documentation

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Your data analysis must address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.

Data Processing

Load data:

StormData <- read.csv("repdata_data_StormData.csv")
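Since the file is distributed as a bzip2-compressed CSV, it may be worth noting that read.csv() can usually read such a file without decompressing it first, because it opens a file path through file(), which detects bzip2 compression. The sketch below checks this on a small made-up data frame written to a temporary file; the toy values and the temporary file are assumptions for illustration, not part of the storm data.

```r
# Toy stand-in for the storm file (made-up values, not real records)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(0, 2))

# Write it as a bzip2-compressed CSV in a temporary location
bz_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(bz_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path via file(), which detects the compression
toy_back <- read.csv(bz_path)
print(toy_back)

# Remove the temporary file
unlink(bz_path)
```

If this holds for the downloaded file, the compressed file could be passed to read.csv() in the same way as the uncompressed one.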

Have a look at the data:

head(StormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6
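The assignment text notes that the early years contain fewer recorded events. One way this could be checked is by counting events per year from BGN_DATE. The sketch below uses only the six BGN_DATE values visible in the head(StormData) output; the names event_date, event_year and events_per_year are introduced here for illustration.

```r
# BGN_DATE values copied from the head(StormData) output
bgn_date <- c("4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:00",
              "6/8/1951 0:00:00", "11/15/1951 0:00:00", "11/15/1951 0:00:00")

# Parse month/day/year; the trailing time part is ignored by as.Date()
event_date <- as.Date(bgn_date, format = "%m/%d/%Y")

# Extract the year and count events per year
event_year <- as.integer(format(event_date, "%Y"))
events_per_year <- table(event_year)
print(events_per_year)
```

On the full data, the same steps applied to StormData$BGN_DATE would give one count per year from 1950 to 2011.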

Question #1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

As the head(StormData) output shows, casualties are recorded in the variables FATALITIES and INJURIES.

Let’s first add the two into a single variable, Casualties, for use in the analysis. We’ll then select the 15 event types with the most casualties.

StormData$Casualties <- StormData$FATALITIES + StormData$INJURIES

FreqCasual <- StormData%>%
  group_by(EVTYPE)%>%
  summarize(SumCasual = sum(Casualties))%>%
  arrange(desc(SumCasual))
FreqCasual <- FreqCasual[1:15,]

kable(FreqCasual)
EVTYPE             SumCasual
TORNADO                96979
EXCESSIVE HEAT          8428
TSTM WIND               7461
FLOOD                   7259
LIGHTNING               6046
HEAT                    3037
FLASH FLOOD             2755
ICE STORM               2064
THUNDERSTORM WIND       1621
WINTER STORM            1527
HIGH WIND               1385
HAIL                    1376
HURRICANE/TYPHOON       1339
HEAVY SNOW              1148
WILDFIRE                 986
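Adding FATALITIES and INJURIES gives a death and an injury the same weight. If that is a concern, the two can be kept side by side while still ranking by the combined total. The following is a base R sketch using aggregate() rather than the dplyr pipeline; the toy rows are made up for illustration and are not real storm records.

```r
# Made-up toy rows (not real storm records)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT", "HEAT"),
  FATALITIES = c(0, 1, 2, 5, 3),
  INJURIES   = c(15, 2, 0, 1, 0)
)

# Sum fatalities and injuries separately for each event type
by_type <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined total, used only for ranking
by_type$Casualties <- by_type$FATALITIES + by_type$INJURIES
by_type <- by_type[order(-by_type$Casualties), ]
print(by_type)
```

In this toy example TORNADO ranks first on combined casualties although HEAT has more fatalities, which is the kind of difference a combined total can hide.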

Let’s visualize the results using ggplot2:

ggplot(data = FreqCasual, aes(x = reorder(EVTYPE, SumCasual), y = SumCasual)) +
  coord_flip() +
  geom_bar(stat='identity') +
  geom_text(aes(label = SumCasual),
            hjust= -0.05, color="black", size = 3,
            position = position_dodge(0.6)) +
  scale_y_continuous(limits = c(0, 100000), 
                     breaks = c(25000, 50000, 75000, 100000),
                     labels = c("25000", "50000", "75000", "100000")) +
  theme_classic() +
  labs(x = "Type of Event",
       y = "Number of Casualties")

As we can see, Tornado causes by far the most casualties in the US (96,979), more than ten times as many as the next event type, Excessive Heat (8,428). Various heat, wind and flood events follow.

Question #2

Across the United States, which types of events have the greatest economic consequences?

Have a look at the data once again:

head(StormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM Casualties
## 1       3051       8806              1         15
## 2          0          0              2          0
## 3          0          0              3          2
## 4          0          0              4          2
## 5          0          0              5          2
## 6          0          0              6          6

This analysis measures economic consequences by the property damage estimates, which are stored in the variables PROPDMG and PROPDMGEXP (crop damage, stored in CROPDMG and CROPDMGEXP, is not considered here). Let’s look at them:

summary(StormData$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00
unique(StormData$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"

As we see, PROPDMG holds the dollar amount of property damage, and PROPDMGEXP acts as its multiplier. The documentation states: “Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.”

Let’s scale the damage values accordingly. The codes appear in both lower case (e.g., “m”) and upper case (e.g., “M”), so we’ll first convert them all to upper case for convenience. Rows with any other code (blank, digits, “H”, “+”, “-” or “?”) keep their raw PROPDMG value.

StormData$PROPDMGEXP <- toupper(StormData$PROPDMGEXP)

StormData$ActDamage <- StormData$PROPDMG
StormData[StormData$PROPDMGEXP=="K",]$ActDamage <- StormData[StormData$PROPDMGEXP=="K",]$PROPDMG*1000
StormData[StormData$PROPDMGEXP=="M",]$ActDamage <- StormData[StormData$PROPDMGEXP=="M",]$PROPDMG*1000000
StormData[StormData$PROPDMGEXP=="B",]$ActDamage <- StormData[StormData$PROPDMGEXP=="B",]$PROPDMG*1000000000
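The same scaling could also be written with a named lookup vector, which keeps the code-to-multiplier mapping in one place and makes the treatment of the remaining codes explicit. This is only a sketch under assumptions: the toy rows are made up, and the names multiplier, mult_by_row and n_unscaled are introduced here. As in the code above, any code other than K, M or B is given a multiplier of 1.

```r
# Made-up toy rows covering handled and unhandled exponent codes
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 10),
  PROPDMGEXP = c("K", "m", "B", "", "5"),
  stringsAsFactors = FALSE
)

# Multipliers for the documented codes
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each row's multiplier; codes without an entry come back as NA
code <- toupper(toy$PROPDMGEXP)
mult_by_row <- unname(multiplier[code])

# Leave rows with any other code unscaled (multiplier of 1)
mult_by_row[is.na(mult_by_row)] <- 1
toy$ActDamage <- toy$PROPDMG * mult_by_row
print(toy)

# How many rows were left unscaled
n_unscaled <- sum(!(code %in% names(multiplier)))
print(n_unscaled)
```

Counting the unscaled rows on the full data would show how much the undocumented codes could affect the totals.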

Here are the 15 most damaging event types in table form:

MostDamage <- StormData%>%
  group_by(EVTYPE)%>%
  summarize(SumDamage = sum(ActDamage))%>%
  arrange(desc(SumDamage))
MostDamage <- MostDamage[1:15,]

kable(MostDamage)
EVTYPE                 SumDamage
FLOOD               144657709807
HURRICANE/TYPHOON    69305840000
TORNADO              56937160779
STORM SURGE          43323536000
FLASH FLOOD          16140812067
HAIL                 15732267048
HURRICANE            11868319010
TROPICAL STORM        7703890550
WINTER STORM          6688497251
HIGH WIND             5270046295
RIVER FLOOD           5118945500
WILDFIRE              4765114000
STORM SURGE/TIDE      4641188000
TSTM WIND             4484928495
ICE STORM             3944927860

Since the numbers are big, let’s present the results in millions of US Dollars and round to the nearest integer (for better presentation):

MostDamage$SumDamageM <- round(MostDamage$SumDamage/1000000, digits = 0)

kable(MostDamage)
EVTYPE                 SumDamage  SumDamageM
FLOOD               144657709807      144658
HURRICANE/TYPHOON    69305840000       69306
TORNADO              56937160779       56937
STORM SURGE          43323536000       43324
FLASH FLOOD          16140812067       16141
HAIL                 15732267048       15732
HURRICANE            11868319010       11868
TROPICAL STORM        7703890550        7704
WINTER STORM          6688497251        6688
HIGH WIND             5270046295        5270
RIVER FLOOD           5118945500        5119
WILDFIRE              4765114000        4765
STORM SURGE/TIDE      4641188000        4641
TSTM WIND             4484928495        4485
ICE STORM             3944927860        3945
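The unit conversion can be checked by hand on the three largest totals. The sketch below repeats the division by one million and, as an alternative that may read more easily for a non-technical audience, also shows the totals in billions with one decimal place. The totals are copied from the table; the names in_millions and in_billions are introduced here.

```r
# Three largest property-damage totals, copied from the table (US dollars)
sum_damage <- c(FLOOD = 144657709807,
                `HURRICANE/TYPHOON` = 69305840000,
                TORNADO = 56937160779)

# Millions of dollars, rounded to the nearest integer (as in SumDamageM)
in_millions <- round(sum_damage / 1e6)
print(in_millions)

# Billions of dollars, rounded to one decimal place
in_billions <- round(sum_damage / 1e9, 1)
print(in_billions)
```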

Let’s visualize the results using ggplot2:

ggplot(data = MostDamage, aes(x = reorder(EVTYPE, SumDamageM), y = SumDamageM)) +
  coord_flip() +
  geom_bar(stat='identity') +
  geom_text(aes(label = SumDamageM),
            hjust= -0.05, color="black", size = 3,
            position = position_dodge(0.6)) +
  scale_y_continuous(limits = c(0, 150000),
                     breaks = c(50000, 100000, 150000),
                     labels = c("50000", "100000", "150000")) +
  theme_classic() +
  labs(x = "Type of Event",
       y = "Property Damage (in millions of US Dollars)")

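As we can see, Flood is the most damaging type of event in terms of property damage (about 144.7 billion US dollars), followed by Hurricane/Typhoon (about 69.3 billion) and Tornado (about 56.9 billion).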

Additional research

Let’s see how the top 15 changes if we combine the different types of Flood, Heat, Storm and Wind events into one category each:

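# Start from the original event type, then overwrite matches with a broad category.
# Later assignments win, so "THUNDERSTORM WIND" ends up as "STORM", not "WIND".
# grepl() is case-sensitive, so only upper-case matches are recoded.
StormData$EventMod <- StormData$EVTYPE

StormData$EventMod[grepl('WIND', StormData$EVTYPE)] <- "WIND"
StormData$EventMod[grepl('FLOOD', StormData$EVTYPE)] <- "FLOOD"
StormData$EventMod[grepl('STORM', StormData$EVTYPE)] <- "STORM"
StormData$EventMod[grepl('HEAT', StormData$EVTYPE)] <- "HEAT"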
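Because the four replacements run one after another, an event type that matches more than one pattern ends up in the category assigned last. The sketch below replays the same steps on a few event names taken from the tables in this report, to show where each one lands; the vector names evtype and event_mod are introduced here.

```r
# Event names taken from the tables in this report
evtype <- c("THUNDERSTORM WIND", "TSTM WIND", "FLASH FLOOD",
            "WINTER STORM", "EXCESSIVE HEAT", "HAIL")

# Same recoding steps, in the same order
event_mod <- evtype
event_mod[grepl("WIND", evtype)]  <- "WIND"
event_mod[grepl("FLOOD", evtype)] <- "FLOOD"
event_mod[grepl("STORM", evtype)] <- "STORM"
event_mod[grepl("HEAT", evtype)]  <- "HEAT"

# Show the original and recoded labels side by side
print(data.frame(evtype, event_mod))
```

Here THUNDERSTORM WIND is counted under STORM while TSTM WIND is counted under WIND, so two labels for what is presumably the same kind of event fall into different categories.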

FreqCasual2 <- StormData%>%
  group_by(EventMod)%>%
  summarize(SumCasual = sum(Casualties))%>%
  arrange(desc(SumCasual))
FreqCasual2 <- FreqCasual2[1:15,]

kable(FreqCasual2)
EventMod           SumCasual
TORNADO                96979
HEAT                   12292
WIND                   10276
FLOOD                  10124
STORM                   7324
LIGHTNING               6046
HAIL                    1376
HURRICANE/TYPHOON       1339
HEAVY SNOW              1148
WILDFIRE                 986
BLIZZARD                 906
FOG                      796
RIP CURRENT              600
WILD/FOREST FIRE         557
RIP CURRENTS             501

MostDamage2 <- StormData%>%
  group_by(EventMod)%>%
  summarize(SumDamage = round(sum(ActDamage)/1000000))%>%
  arrange(desc(SumDamage))
MostDamage2 <- MostDamage2[1:15,]

kable(MostDamage2)
EventMod                   SumDamage
FLOOD                         167379
STORM                          73055
HURRICANE/TYPHOON              69306
TORNADO                        56937
HAIL                           15732
WIND                           12451
HURRICANE                      11868
WILDFIRE                        4765
HURRICANE OPAL                  3173
WILD/FOREST FIRE                3002
HEAVY RAIN/SEVERE WEATHER       2500
DROUGHT                         1046
HEAVY SNOW                       933
LIGHTNING                        929
HEAVY RAIN                       694

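We see that merging does not change the leader in either ranking: Tornado still causes the most casualties and Flood the most property damage (shown here in millions of US dollars). Merging does free up places in the top 15, which are taken by event types such as Blizzard, Fog and Rip Current for casualties, and Wild/Forest Fire, Drought and Heavy Rain for property damage.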
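Which event types are new to the casualties top 15 can be read off the two tables with setdiff(). The sketch below copies the event names from the two casualty tables in this report and drops the four merged categories; the names top15_before, top15_after, merged and newcomers are introduced here.

```r
# Top 15 by casualties before merging (from the first casualties table)
top15_before <- c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING",
                  "HEAT", "FLASH FLOOD", "ICE STORM", "THUNDERSTORM WIND",
                  "WINTER STORM", "HIGH WIND", "HAIL", "HURRICANE/TYPHOON",
                  "HEAVY SNOW", "WILDFIRE")

# Top 15 by casualties after merging (from the second casualties table)
top15_after <- c("TORNADO", "HEAT", "WIND", "FLOOD", "STORM", "LIGHTNING", "HAIL",
                 "HURRICANE/TYPHOON", "HEAVY SNOW", "WILDFIRE", "BLIZZARD", "FOG",
                 "RIP CURRENT", "WILD/FOREST FIRE", "RIP CURRENTS")

# The four merged categories are not new event types, so exclude them
merged <- c("WIND", "FLOOD", "STORM", "HEAT")
newcomers <- setdiff(setdiff(top15_after, top15_before), merged)
print(newcomers)
```

The result includes both RIP CURRENT and RIP CURRENTS, which suggests that other near-duplicate labels could be merged in the same way.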

Results

Answering Question 1, the event types most harmful to population health (counting both injuries and fatalities) were Tornado, Excessive Heat, TSTM Wind, Flood and Lightning. Answering Question 2, the event types that caused the most property damage were Flood, Hurricane/Typhoon, Tornado, Storm Surge and Flash Flood.