Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
Data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database has been processed to highlight the most severe events. TORNADO is by far the most harmful type of event, followed by EXCESSIVE HEAT. FLASH FLOOD, HEAT, and LIGHTNING also cause a notable level of harm. Other event types appear to be much less harmful.
Property damage is spread across many event types, but TORNADO alone accounts for about 30% of the recorded property damage.
The dataset can be downloaded from the URL given in the download step below.
For more information about the dataset, please consult:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
Loading packages
library(dplyr)
library(ggplot2)
Downloading the data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.
if (!file.exists("FStormData.csv.bz2")) {
  download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "FStormData.csv.bz2")
}
Reading bz2 file
if (!exists("rawData")) {
  rawData <- read.csv("FStormData.csv.bz2")
}
Overview of the raw data
str(rawData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
data <- rawData %>% select("FATALITIES", "INJURIES", "PROPDMG", "EVTYPE")
Showing the distribution of the data
head(table(data$FATALITIES))
##
## 0 1 2 3 4 5
## 895323 5010 996 314 166 114
head(table(data$INJURIES))
##
## 0 1 2 3 4 5
## 884693 7756 3134 1552 931 709
head(table(data$PROPDMG))
##
## 0 0.01 0.02 0.03 0.04 0.05
## 663123 931 80 1610 4 588
head(sort(table(data$EVTYPE), decreasing = TRUE))
##
##              HAIL         TSTM WIND THUNDERSTORM WIND           TORNADO
##            288661            219940             82563             60652
##       FLASH FLOOD             FLOOD
##             54277             25326
More information on the variables
summary(data)
##    FATALITIES          INJURIES            PROPDMG
##  Min.   :  0.0000   Min.   :   0.0000   Min.   :   0.00
##  1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:   0.00
##  Median :  0.0000   Median :   0.0000   Median :   0.00
##  Mean   :  0.0168   Mean   :   0.1557   Mean   :  12.06
##  3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:   0.50
##  Max.   :583.0000   Max.   :1700.0000   Max.   :5000.00
##
##                EVTYPE
##  HAIL             :288661
##  TSTM WIND        :219940
##  THUNDERSTORM WIND: 82563
##  TORNADO          : 60652
##  FLASH FLOOD      : 54277
##  FLOOD            : 25326
##  (Other)          :170878
We can already see that the medians for fatalities, injuries and property damage are all 0: the large majority of the 902,297 recorded events resulted in no fatalities (about 99.2% of records), no injuries (about 98.0%) and no property damage (about 73.5%).
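The share of records with a value of zero can be computed directly rather than read off the tables. The following is a minimal sketch on a made-up table; `toy` and `zeroShare` are illustrative names, not objects from the analysis.

```r
# Toy stand-in for the `data` table above (made-up values, for illustration only)
toy <- data.frame(
  FATALITIES = c(0, 0, 0, 1),
  INJURIES   = c(0, 2, 0, 0),
  PROPDMG    = c(0, 0.5, 25, 0)
)

# `toy == 0` is a logical matrix; its column means are the shares of zero records
zeroShare <- colMeans(toy == 0)
print(round(100 * zeroShare, 1))
```

On the real data, the same idea would be `colMeans(data[, c("FATALITIES", "INJURIES", "PROPDMG")] == 0)`.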
Hail (HAIL) and thunderstorm wind (TSTM WIND, where TSTM is the abbreviation for thunderstorm) are the two most frequently recorded event types.
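The EVTYPE output lists TSTM WIND and THUNDERSTORM WIND as separate labels, and THUNDERSTORM WINDS appears as a third one in the property damage table further down. The following is a minimal sketch of how such spelling variants could be merged before grouping. The cleaning rules are assumptions chosen for illustration and are not part of the analysis in this report.

```r
# A few labels as they appear in the EVTYPE output of this report
evtype <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS",
            "HAIL", " HIGH SURF ADVISORY")

# Assumed cleaning rules: trim whitespace, upper-case,
# expand the TSTM abbreviation, drop the plural on WINDS
clean <- toupper(trimws(evtype))
clean <- sub("^TSTM", "THUNDERSTORM", clean)
clean <- sub("WINDS$", "WIND", clean)

print(table(clean))
```

Whether two labels really describe the same kind of event is a judgment call, so any such mapping should be checked against the NOAA documentation.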
Grouping the data by event type and summing fatalities and injuries for each type. An index of harmfulness is then computed as the sum of the normalized fatality total and the normalized injury total, each total being divided by its maximum across event types. The results are ordered by this index and only the top 10 event types are kept for readability.
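harmful <- data %>%
  group_by(EVTYPE) %>%
  summarize(sumFatalities = sum(FATALITIES, na.rm = TRUE),
            sumInjuries = sum(INJURIES, na.rm = TRUE))
# Keep the `+` at the end of the line: R parses a line that starts with `+`
# as a separate expression, which leaves only the fatalities term in Index
# (the Index column in the table below matches that case, e.g. 1903 / 5633 = 0.338).
harmful$Index <- harmful$sumFatalities / max(harmful$sumFatalities) +
  harmful$sumInjuries / max(harmful$sumInjuries)
mostHarmful <- harmful[order(-harmful$Index), ]
mostHarmful <- mostHarmful[1:10, ]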
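To make the index concrete, here is a minimal worked sketch on made-up totals. The event names A, B, C and all numbers are hypothetical. Each term is scaled by its column maximum, so the index lies between 0 and 2.

```r
# Made-up per-event totals, for illustration only
toy <- data.frame(
  EVTYPE        = c("A", "B", "C"),
  sumFatalities = c(100, 50, 10),
  sumInjuries   = c(100, 400, 20),
  stringsAsFactors = FALSE
)

# Sum of the two normalized totals; the `+` ends the first line so that
# R reads both lines as a single expression
toy$Index <- toy$sumFatalities / max(toy$sumFatalities) +
  toy$sumInjuries / max(toy$sumInjuries)

# Rank from most to least harmful
toy <- toy[order(-toy$Index), ]
print(toy)
```

In this toy case, B ranks first (0.5 + 1 = 1.5) ahead of A (1 + 0.25 = 1.25), even though A has more fatalities, because both terms carry equal weight.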
Grouping the data by event type and summing the property damage (the PROPDMG values as recorded). The share of each event type in the total is computed, then the results are ordered and only the top 10 event types are kept for readability.
damage <- data %>%
  group_by(EVTYPE) %>%
  summarize(sumPropdmg = sum(PROPDMG, na.rm = TRUE))
mostDamage <- damage[order(-damage$sumPropdmg), ]
mostDamage$Percentage <- mostDamage$sumPropdmg / sum(mostDamage$sumPropdmg)
mostDamage <- mostDamage[1:10, ]
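The sums above use the raw PROPDMG figures. In the NOAA storm data, PROPDMG is accompanied by PROPDMGEXP, a magnitude code in which K, M and B stand for thousands, millions and billions. The following is a hedged sketch of how a dollar amount could be derived. The helper `toDollars` is hypothetical, and treating every other code as a multiplier of 1 is an assumption, not something the documentation specifies.

```r
# Hypothetical helper: combine PROPDMG with its PROPDMGEXP magnitude code
toDollars <- function(value, exp) {
  code <- toupper(as.character(exp))
  multiplier <- unname(c(K = 1e3, M = 1e6, B = 1e9)[code])  # NA for any other code
  multiplier[is.na(multiplier)] <- 1  # assumption: unknown or empty codes mean "as recorded"
  value * multiplier
}

# Made-up records, for illustration only
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)
toy$dollars <- toDollars(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)
```

Applying such a conversion before summing could change the ranking of event types, so the figures in this report should be read as sums of the recorded PROPDMG values only.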
The table and plot below show that TORNADO is by far the most harmful event type, with EXCESSIVE HEAT a distant second.
Table of the most harmful event types
mostHarmful
## # A tibble: 10 x 4
##    EVTYPE         sumFatalities sumInjuries  Index
##    <fct>                  <dbl>       <dbl>  <dbl>
##  1 TORNADO                 5633       91346 1.00
##  2 EXCESSIVE HEAT          1903        6525 0.338
##  3 FLASH FLOOD              978        1777 0.174
##  4 HEAT                     937        2100 0.166
##  5 LIGHTNING                816        5230 0.145
##  6 TSTM WIND                504        6957 0.0895
##  7 FLOOD                    470        6789 0.0834
##  8 RIP CURRENT              368         232 0.0653
##  9 HIGH WIND                248        1137 0.0440
## 10 AVALANCHE                224         170 0.0398
Barplot of the most harmful event types
### Making a clean (long-format) dataset first
mostHarmfulBarPlot <- rbind(
  data.frame(EVTYPE = mostHarmful$EVTYPE, variable = "Fatalities",
             value = mostHarmful$sumFatalities, Index = mostHarmful$Index),
  data.frame(EVTYPE = mostHarmful$EVTYPE, variable = "Injuries",
             value = mostHarmful$sumInjuries, Index = mostHarmful$Index)
)
mostHarmfulBarPlot <- mostHarmfulBarPlot[order(-mostHarmfulBarPlot$Index), ]
### Making the plot
### unique() keeps each event type once in the axis limits (the long table lists every type twice)
ggplot(data = mostHarmfulBarPlot, aes(x = EVTYPE, y = value)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  coord_flip() +
  facet_grid(. ~ variable, scales = "free") +
  ggtitle("Event types causing the most fatalities and injuries") +
  scale_x_discrete(limits = rev(unique(as.character(mostHarmfulBarPlot$EVTYPE)))) +
  xlab("Type of event") +
  ylab("Number of fatalities and injuries")
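The table and plot below show that property damage is spread across many event types, with TORNADO accounting for about 30% of the total.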
Table of the event types causing the most property damage
mostDamage
## # A tibble: 10 x 3
##    EVTYPE             sumPropdmg Percentage
##    <fct>                   <dbl>      <dbl>
##  1 TORNADO               3212258     0.295
##  2 FLASH FLOOD           1420125     0.130
##  3 TSTM WIND             1335966     0.123
##  4 FLOOD                  899938     0.0827
##  5 THUNDERSTORM WIND      876844     0.0806
##  6 HAIL                   688693     0.0633
##  7 LIGHTNING              603352     0.0554
##  8 THUNDERSTORM WINDS     446293     0.0410
##  9 HIGH WIND              324732     0.0298
## 10 WINTER STORM           132721     0.0122
Barplot of the event types causing the most property damage
### Making the plot
### scipen = 10 avoids scientific notation on the damage axis
options(scipen = 10)
ggplot(data = mostDamage, aes(x = EVTYPE, y = sumPropdmg)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_x_discrete(limits = rev(as.character(mostDamage$EVTYPE))) +
  coord_flip() +
  ggtitle("Event types causing the most property damage") +
  xlab("Type of event") +
  ylab("Property damage (sum of PROPDMG)")