Load packages:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cowplot)
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
library(knitr)
library(dplyr)
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
- Storm Data

There is also some documentation of the database available, which explains how some of the variables are constructed or defined:
- National Weather Service Storm Data Documentation
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
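Your data analysis must address the following questions:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?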
Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.
Load data:
StormData <- read.csv("repdata_data_StormData.csv")
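The assignment text describes a bzip2-compressed CSV, while the call above reads an already decompressed copy. As a minimal sketch, assuming the compressed file is what you have on disk, `read.csv()` can usually read it directly, because it opens the path with `file()`, which detects the compression. The toy data frame and temporary file below are made up for illustration and are not the storm data.

```r
# Toy stand-in for the storm file; the columns and values are made up.
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(3, 1))

# Write it out as a bzip2-compressed CSV in a temporary location.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the compression,
# so no separate decompression step is needed.
toy_back <- read.csv(tmp)
print(toy_back)
unlink(tmp)
```

The same pattern should apply to the real download: pass the path of the `.bz2` file to `read.csv()` instead of unpacking it first.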
Have a look at the data:
head(StormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
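The introduction notes that earlier years contain fewer recorded events. One way to check this is to parse BGN_DATE and count events per year. The sketch below uses a few made-up date strings in the same layout as the `head()` output; the names `bgn_date`, `event_date`, `event_year` and `events_per_year` are hypothetical. With the real data, `StormData$BGN_DATE` would take the place of `bgn_date`.

```r
# Toy BGN_DATE values in the same "m/d/Y H:M:S" layout as the head() output.
bgn_date <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
              "11/15/1951 0:00:00", "6/8/1951 0:00:00")

# as.Date() stops parsing once the format is satisfied, so the trailing
# time part is ignored.
event_date <- as.Date(bgn_date, format = "%m/%d/%Y")
event_year <- as.integer(format(event_date, "%Y"))

# Number of recorded events per year.
events_per_year <- table(event_year)
print(events_per_year)
```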
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
From the head(StormData) output above, the casualties are stored in the variables FATALITIES and INJURIES.
Let’s first combine fatalities and injuries into a single variable, Casualties, which we’ll use in the analysis. We’ll then select the 15 event types with the most casualties.
StormData$Casualties <- StormData$FATALITIES + StormData$INJURIES
FreqCasual <- StormData%>%
group_by(EVTYPE)%>%
summarize(SumCasual = sum(Casualties))%>%
arrange(desc(SumCasual))
FreqCasual <- FreqCasual[1:15,]
kable(FreqCasual)
| EVTYPE | SumCasual |
|---|---|
| TORNADO | 96979 |
| EXCESSIVE HEAT | 8428 |
| TSTM WIND | 7461 |
| FLOOD | 7259 |
| LIGHTNING | 6046 |
| HEAT | 3037 |
| FLASH FLOOD | 2755 |
| ICE STORM | 2064 |
| THUNDERSTORM WIND | 1621 |
| WINTER STORM | 1527 |
| HIGH WIND | 1385 |
| HAIL | 1376 |
| HURRICANE/TYPHOON | 1339 |
| HEAVY SNOW | 1148 |
| WILDFIRE | 986 |
Let’s visualize the results using ggplot2:
ggplot(data = FreqCasual, aes(x = reorder(EVTYPE, SumCasual), y = SumCasual)) +
coord_flip() +
geom_bar(stat='identity') +
geom_text(aes(label = SumCasual),
hjust= -0.05, color="black", size = 3,
position = position_dodge(0.6)) +
scale_y_continuous(limits = c(0, 100000),
breaks = c(25000, 50000, 75000, 100000),
labels = c("25000", "50000", "75000", "100000")) +
theme_classic() +
labs(x = "Type of Event",
y = "Number of Casualties")
As we can see, tornadoes cause by far the most casualties of any event type in the US (96,979), followed by various types of heat, wind and flood events.
Across the United States, which types of events have the greatest economic consequences?
Have a look at the data once again:
head(StormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM Casualties
## 1 3051 8806 1 15
## 2 0 0 2 0
## 3 0 0 3 2
## 4 0 0 4 2
## 5 0 0 5 2
## 6 0 0 6 6
The economic consequences considered here are the property damage estimates, which are stored in the variables PROPDMG and PROPDMGEXP (crop damage, stored in CROPDMG and CROPDMGEXP, is not included in this analysis). Let’s look at them:
summary(StormData$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 12.06 0.50 5000.00
unique(StormData$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
As we see, PROPDMG gives the dollar amount of property damage, and PROPDMGEXP acts as the multiplying factor. The documentation states: “Alphabetical characters used to signify magnitude include ‘K’ for thousands, ‘M’ for millions, and ‘B’ for billions.”
Let’s scale the numerical damage value accordingly. Note that the codes appear in both lower case (e.g., “m”) and upper case (e.g., “M”), so we first convert them all to upper case for convenience. Values with any other code (such as “H”, digits or symbols) are left as the raw PROPDMG number.
StormData$PROPDMGEXP <- toupper(StormData$PROPDMGEXP)
StormData$ActDamage <- StormData$PROPDMG
StormData[StormData$PROPDMGEXP=="K",]$ActDamage <- StormData[StormData$PROPDMGEXP=="K",]$PROPDMG*1000
StormData[StormData$PROPDMGEXP=="M",]$ActDamage <- StormData[StormData$PROPDMGEXP=="M",]$PROPDMG*1000000
StormData[StormData$PROPDMGEXP=="B",]$ActDamage <- StormData[StormData$PROPDMGEXP=="B",]$PROPDMG*1000000000
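The three assignments above can also be written as a single lookup. The sketch below is a base R illustration on made-up values, not a replacement for the analysis code; `propdmg`, `propdmgexp`, `multiplier`, `mult` and `act_damage` are hypothetical names. It maps only the codes the documentation names (K, M, B) and, like the code above, leaves every other code at a multiplier of 1. Treating “H” as hundreds would be an extra assumption and is not done here.

```r
# Toy values; the real columns are StormData$PROPDMG and StormData$PROPDMGEXP.
propdmg    <- c(25, 2.5, 1.2, 10, 3)
propdmgexp <- c("K", "m", "B", "", "?")

# Named lookup vector: exponent code -> multiplier.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each upper-cased code; codes not in the table come back as NA
# and fall back to a multiplier of 1, i.e. PROPDMG is kept unchanged.
mult <- unname(multiplier[toupper(propdmgexp)])
mult[is.na(mult)] <- 1

act_damage <- propdmg * mult
print(act_damage)
```

A lookup vector keeps the code-to-multiplier mapping in one place, which makes it easier to review or extend than a separate assignment per code.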
The 15 most damaging event types in table format:
MostDamage <- StormData%>%
group_by(EVTYPE)%>%
summarize(SumDamage = sum(ActDamage))%>%
arrange(desc(SumDamage))
MostDamage <- MostDamage[1:15,]
kable(MostDamage)
| EVTYPE | SumDamage |
|---|---|
| FLOOD | 144657709807 |
| HURRICANE/TYPHOON | 69305840000 |
| TORNADO | 56937160779 |
| STORM SURGE | 43323536000 |
| FLASH FLOOD | 16140812067 |
| HAIL | 15732267048 |
| HURRICANE | 11868319010 |
| TROPICAL STORM | 7703890550 |
| WINTER STORM | 6688497251 |
| HIGH WIND | 5270046295 |
| RIVER FLOOD | 5118945500 |
| WILDFIRE | 4765114000 |
| STORM SURGE/TIDE | 4641188000 |
| TSTM WIND | 4484928495 |
| ICE STORM | 3944927860 |
Since the numbers are large, let’s present the results in millions of US dollars, rounded to the nearest integer:
MostDamage$SumDamageM <- round(MostDamage$SumDamage/1000000, digits = 0)
kable(MostDamage)
| EVTYPE | SumDamage | SumDamageM |
|---|---|---|
| FLOOD | 144657709807 | 144658 |
| HURRICANE/TYPHOON | 69305840000 | 69306 |
| TORNADO | 56937160779 | 56937 |
| STORM SURGE | 43323536000 | 43324 |
| FLASH FLOOD | 16140812067 | 16141 |
| HAIL | 15732267048 | 15732 |
| HURRICANE | 11868319010 | 11868 |
| TROPICAL STORM | 7703890550 | 7704 |
| WINTER STORM | 6688497251 | 6688 |
| HIGH WIND | 5270046295 | 5270 |
| RIVER FLOOD | 5118945500 | 5119 |
| WILDFIRE | 4765114000 | 4765 |
| STORM SURGE/TIDE | 4641188000 | 4641 |
| TSTM WIND | 4484928495 | 4485 |
| ICE STORM | 3944927860 | 3945 |
Let’s visualize the results using ggplot2:
ggplot(data = MostDamage, aes(x = reorder(EVTYPE, SumDamageM), y = SumDamageM)) +
coord_flip() +
geom_bar(stat='identity') +
geom_text(aes(label = SumDamageM),
hjust= -0.05, color="black", size = 3,
position = position_dodge(0.6)) +
scale_y_continuous(limits = c(0, 150000),
breaks = c(50000, 100000, 150000),
labels = c("50000", "100000", "150000")) +
theme_classic() +
labs(x = "Type of Event",
y = "Property Damage (in millions of US Dollars)")
As we can see, Flood is the most damaging type of event in terms of property damage.
Let’s see how the rankings change if we combine the different types of wind, flood, storm and heat events into one category each:
StormData$EventMod <- StormData$EVTYPE
StormData$EventMod[grepl('WIND',StormData$EVTYPE )] <- "WIND"
StormData$EventMod[grepl('FLOOD',StormData$EVTYPE )] <- "FLOOD"
StormData$EventMod[grepl('STORM',StormData$EVTYPE )] <- "STORM"
StormData$EventMod[grepl('HEAT',StormData$EVTYPE )] <- "HEAT"
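Because the patterns overlap, the order of these assignments matters: a label that matches several patterns ends up in the category assigned last. The sketch below replays the same four patterns in the same order on a handful of example labels to make this visible; `evtype` and `event_mod` are hypothetical names used only here.

```r
# Toy event labels chosen to show how the overlapping patterns interact.
evtype <- c("THUNDERSTORM WIND", "TSTM WIND", "WINTER STORM",
            "FLASH FLOOD", "EXCESSIVE HEAT", "TORNADO")

# Same patterns, applied in the same order as above. Each pattern is
# matched against the original label, and later matches overwrite earlier ones.
event_mod <- evtype
for (pattern in c("WIND", "FLOOD", "STORM", "HEAT")) {
  event_mod[grepl(pattern, evtype)] <- pattern
}

print(data.frame(evtype, event_mod))
```

Here “THUNDERSTORM WIND” lands in STORM while “TSTM WIND” lands in WIND, so the two spellings of the same event type are split across categories. This is worth keeping in mind when reading the combined tables.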
FreqCasual2 <- StormData%>%
group_by(EventMod)%>%
summarize(SumCasual = sum(Casualties))%>%
arrange(desc(SumCasual))
FreqCasual2 <- FreqCasual2[1:15,]
kable(FreqCasual2)
| EventMod | SumCasual |
|---|---|
| TORNADO | 96979 |
| HEAT | 12292 |
| WIND | 10276 |
| FLOOD | 10124 |
| STORM | 7324 |
| LIGHTNING | 6046 |
| HAIL | 1376 |
| HURRICANE/TYPHOON | 1339 |
| HEAVY SNOW | 1148 |
| WILDFIRE | 986 |
| BLIZZARD | 906 |
| FOG | 796 |
| RIP CURRENT | 600 |
| WILD/FOREST FIRE | 557 |
| RIP CURRENTS | 501 |
MostDamage2 <- StormData%>%
group_by(EventMod)%>%
summarize(SumDamage = round(sum(ActDamage)/1000000))%>%
arrange(desc(SumDamage))
MostDamage2 <- MostDamage2[1:15,]
kable(MostDamage2)
| EventMod | SumDamage |
|---|---|
| FLOOD | 167379 |
| STORM | 73055 |
| HURRICANE/TYPHOON | 69306 |
| TORNADO | 56937 |
| HAIL | 15732 |
| WIND | 12451 |
| HURRICANE | 11868 |
| WILDFIRE | 4765 |
| HURRICANE OPAL | 3173 |
| WILD/FOREST FIRE | 3002 |
| HEAVY RAIN/SEVERE WEATHER | 2500 |
| DROUGHT | 1046 |
| HEAVY SNOW | 933 |
| LIGHTNING | 929 |
| HEAVY RAIN | 694 |
We see that the top event type is unchanged in both rankings (Tornado for casualties, Flood for property damage), but events such as Blizzard, Wild/Forest Fire and Drought now enter the top 15.
Answering Question 1, the most harmful event types in terms of human casualties (injuries and fatalities combined) were Tornado, Excessive Heat, TSTM Wind, Flood and Lightning. Answering Question 2, the most damaging event types in terms of property damage were Flood, Hurricane/Typhoon, Tornado, Storm Surge and Flash Flood.