This analysis will review storm damage for weather events across the United States focusing on harm to the population and economic cost to both property and crop. The data will consist of storm data provided by the National Oceanic & Atmospheric Administration (NOAA). Each recorded event will consist of an event type, injury count, fatality count, property damage cost, and crop damage cost. The report will show the storms with the most harm to the population which include fatalities and injuries with a breakdown of each harmful type. On the economic side, the report will show the top storm events costing the most damage for both property and crops. The report will use the “NOAA National Weather Service Insturction 10-1605”" on Storm Data Preparation as a guideline for grouping weather events and calculating damage cost.
This section with outline the necessary requirements in order to replicate and reproduce the report. The outline will describe the packages used and version of R used to generate the report.
Below are the list of R libraries used in the analysis:
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(mice)
library(VIM)
library(grid)
library(gridExtra)
The data file used in the analysis is a CSV file compressed as a BZ2 file. The R function read.csv will automatically decompress the file and open it as a CSV file. Due to the size of the file, the time to open and read the data will depend on the machine memory and processor. Our execution took one minute to complete.
data <- read.csv("./StormData.csv.bz2", sep = ",", header = TRUE, fill = TRUE)
The data was first filtered by states, any state not directly a member of the United States was removed from the analysis.
The next step was to clean the Event Type data values. After completing a quick analysis on the data values, several key items are discovered. First, there are more event types then listed in the NOAA document. Second, the values seem to be manually entered based on the best guess by the observer. This caused the data values to contain missing spellings or similiar events defined with different values. To clean the Event Type values, the values were uppercased and the spaces removed. Then, the data was grouped under the researcher’s best guess based on the NOAA document definition. In order to achieve more accurate results based on weather event, the researcher believed this was a necessary step. Note, this is an area for others to review and move events to other groupings.
filteredData = data %>%
filter(STATE %in% state.abb) %>%
mutate(EVTYPE = toupper(gsub(" ", "", EVTYPE, fixed = FALSE)), PROPDMGEXP = toupper(PROPDMGEXP), CROPDMGEXP = toupper(CROPDMGEXP)) %>%
mutate(NOAAEVTYPE = ifelse(EVTYPE == "AVALANCE", "AVALANCHE",
ifelse(EVTYPE %in% c("COASTALFLOODING","COASTAL FLOODING"), "COASTALFLOOD",
ifelse(EVTYPE %in% c("COLD/WINDS","COLD","COLDTEMPERATURE","COLDWEATHER",
"RECORDCOLD","EXTENDEDCOLD"), "COLD/WINDCHILL",
ifelse(EVTYPE %in% c("EXTREMECOLD","EXTREMEWINDCHILL"), "EXTREMECOLD/WINDCHILL",
ifelse(EVTYPE %in% c("FLASHFLOODING","FLOOD/FLASHFLOOD","FLASHFLOOD/FLOOD","FLASHFLOODS",
"FLASHFLOODING/FLOOD","FLASHFLOODING/FLOOD"), "FLASHFLOOD",
ifelse(EVTYPE %in% c("URBANANDSMALLSTREAMFLOODIN","RIVERFLOODING"),"FLOOD",
ifelse(EVTYPE %in% c("FOG"),"DENSEFOG",
ifelse(EVTYPE %in% c("FOGANDCOLDTEMPERATURES","GLAZE","GLAZE/ICESTORM"),"FREEZINGFOG",
ifelse(EVTYPE %in% c("FREEZE","FREEZINGRAIN","FREEZINGDRIZZLE",
"FREEZINGSPRAY","FREEZINGRAIN/SNOW"),"FROST/FREEZE",
ifelse(EVTYPE %in% c("HEAT","EXTREMEHEAT","HEATWAVE","HEATWAVES",
"UNSEASONABLYWARMANDDRY","WARMWEATHER","RECORDHEAT",
"RECORD/EXCESSIVEHEAT","HEATWAVEDROUGHT"),"EXCESSIVEHEAT",
ifelse(EVTYPE %in% c("SMALLHAIL"),"HAIL",
ifelse(EVTYPE %in% c("HEAVYRAINS"),"HEAVYRAIN",
ifelse(EVTYPE %in% c("HEAVYSNOWANDHIGHWINDS"),"HEAVYSNOW",
ifelse(EVTYPE %in% c("HEAVYSURF/HIGHSURF"),"HEAVYSURFANDWIND",
ifelse(EVTYPE %in% c("HIGHWIND","HIGHWIND48"),"HIGHWINDS",
ifelse(EVTYPE %in% c("HURRICANE","HURRICANEEMILY","HURRICANEEDOUARD",
"HURRICANE-GENERATEDSWELLS",
"HURRICANEOPAL/HIGHWINDS","HURRICANEFELIX","HURRICANEOPAL",
"HURRICANEERIN"),"HURRICANE/TYPHOON",
ifelse(EVTYPE %in% c("HYPERTHERMIA/EXPOSURE"),"HYPOTHERMIA",
ifelse(EVTYPE %in% c("ICEONROAD","SLEET"),"ICEROADS",
ifelse(EVTYPE %in% c("LIGHTNING.","LIGHTNINGINJURY"),"LIGHTNING",
ifelse(EVTYPE %in% c("MUDSLIDE"),"MUDSLIDES",
ifelse(EVTYPE %in% c("THUNDERTORMWINDS","THUNDERSTORMWIND","THUNDERSTORMW",
"THUNDERSTORMWINDS/HAIL",
"THUNDERSTORMWINDS13","THUNDERSTORMWINDSS","THUNDERSTORMWINDG52",
"THUNDERSTORMWIND(G40)","TSTMWIND(G35)","TSTMWIND(G40)","TSTMWIND(G45)",
"LIGHTNINGANDTHUNDERSTORMWIN","TSTMWIND",
"THUNDERSTORM"),"THUNDERSTORMWINDS",
ifelse(EVTYPE %in% c("HIGHWIND/HEAVYSNOW","HEAVYSNOWSHOWER","SNOW/BITTERCOLD","EXCESSIVESNOW",
"LIGHTSNOW","SNOWSQUALL","SNOWSQUALLS"),"SNOW",
ifelse(EVTYPE %in% c("MARINEMISHAP"),"MARINEACCIDENT",
ifelse(EVTYPE %in% c("RIPCURRENT","RIPCURRENTS/HEAVYSURF"),"RIPCURRENTS",
ifelse(EVTYPE %in% c("RIVERFLOOD"),"RIVERFLOODING",
ifelse(EVTYPE %in% c("TORNADOES,TSTMWIND,HAIL","TORNADOF3","TORNADOF2"),"TORNADO",
ifelse(EVTYPE %in% c("TROPICALSTORMGORDON"),"TROPICALSTORM",
ifelse(EVTYPE %in% c("WILD/FORESTFIRE","WILDFIRES"),"WILDFIRE",
ifelse(EVTYPE %in% c("WATERSPOUT/TORNADO","WATERSPOUTTORNADO"),"WATERSPOUT",
ifelse(EVTYPE %in% c("WINDS","WIND","GUSTYWIND","GUSTYWINDS","NONTSTMWIND",
"STRONGWIND","STRONGWINDS"),"WINDSTORM",
ifelse(EVTYPE %in% c("WINTERSTORMS","WINTERSTORMHIGHWINDS"),"WINTERSTORM",
ifelse(EVTYPE %in% c("WINTRYMIX","WINTERWEATHERMIX","SNOWANDICE",
"WINTERWEATHER/MIX"),"WINTERWEATHER",
EVTYPE)))))))))))))))))))))))))))))))))
After cleaning the data, the column containing harmful counts which is the sum of fatalities and injuries will be used later in the analysis.
report = filteredData %>% mutate(HARMFUL = FATALITIES + INJURIES)
For events causing harm to the population, group the data by Event Type and summarize the data for number of events, number of injuries, fatalities and total harm (fatalities + injuries).
report.summary = report %>%
group_by(EVENT_TYPE = as.factor(NOAAEVTYPE)) %>%
summarize(NUMBER_OF_EVENTS = n(),
MEAN = mean(HARMFUL, na.rm = TRUE),
TOTAL = sum(HARMFUL, na.rm = TRUE),
INJURIES = sum(INJURIES, na.rm = TRUE),
FATALITIES = sum(FATALITIES, na.rm = TRUE)) %>%
mutate(PER_EVENT = TOTAL / NUMBER_OF_EVENTS) %>%
arrange(desc(TOTAL))
Now select only the top 20 for reporting and charts. Note: For per event counts, in order to make the data meaningful, records which had 30 or more events where selected.
topCount = 20
report.summary.topHarmful <- report.summary[1:topCount,]
report.summary.topInjuries <- arrange(report.summary, desc(INJURIES))[1:topCount,]
report.summary.topFatalities <- arrange(report.summary, desc(FATALITIES))[1:topCount,]
report.summary.topNumberOfEvents <- arrange(report.summary, desc(NUMBER_OF_EVENTS))[1:topCount,]
report.summary.topPerEvent <- (report.summary %>% filter(NUMBER_OF_EVENTS > 30) %>% arrange(desc(PER_EVENT)))[1:topCount,]
Below is a chart showing the top storms causing the most harm to the population based on injury and fatalities counts.
populationHealthChart <- ggplot(data = report.summary.topHarmful,
aes(x=reorder(EVENT_TYPE,TOTAL), y=TOTAL,
fill=colorRampPalette(brewer.pal(9,"Set1"))(topCount))) +
geom_bar(stat="identity") +
geom_text(aes(label=TOTAL), position=position_dodge(width=0.5), vjust=0.50, hjust="inward") +
coord_cartesian(ylim=c(0,200000)) + coord_flip() +
labs(y="Injuries or Fatalities to Population", x="Type of Event",
title="Most Harmful with Respect to Population Health", subtitle="(Top Twenty)") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, face = "bold")) +
theme(axis.title = element_text(face="bold"))
print(populationHealthChart)
The report shows Tornados with the largest number of fatalities and injuries combined across the United States. This is probably expected due to the large number of sever weather events across the Southern and Mid-West states. Next question to ask will this event hold true for fatalities and injuries.
Top events with the largest fatalities:
report.summary.topFatalities %>% select(EVENT_TYPE, NUMBER_OF_EVENTS, FATALITIES)
## # A tibble: 20 x 3
## EVENT_TYPE NUMBER_OF_EVENTS FATALITIES
## <fctr> <int> <dbl>
## 1 TORNADO 60641 5658
## 2 EXCESSIVEHEAT 2613 3143
## 3 FLASHFLOOD 54623 994
## 4 LIGHTNING 15702 807
## 5 THUNDERSTORMWINDS 323294 703
## 6 RIPCURRENTS 668 523
## 7 FLOOD 24983 467
## 8 EXTREMECOLD/WINDCHILL 1864 303
## 9 HIGHWINDS 21712 283
## 10 AVALANCHE 387 225
## 11 WINTERSTORM 11419 216
## 12 WINDSTORM 4191 140
## 13 COLD/WINDCHILL 695 140
## 14 HEAVYSNOW 15701 129
## 15 HURRICANE/TYPHOON 237 113
## 16 BLIZZARD 2716 101
## 17 WILDFIRE 4191 90
## 18 ICESTORM 2004 89
## 19 HIGHSURF 686 87
## 20 DENSEFOG 1821 80
Top events with the largest injuries:
report.summary.topInjuries %>% select(EVENT_TYPE, NUMBER_OF_EVENTS, INJURIES)
## # A tibble: 20 x 3
## EVENT_TYPE NUMBER_OF_EVENTS INJURIES
## <fctr> <int> <dbl>
## 1 TORNADO 60641 91364
## 2 THUNDERSTORMWINDS 323294 9369
## 3 EXCESSIVEHEAT 2613 8910
## 4 FLOOD 24983 6787
## 5 LIGHTNING 15702 5213
## 6 ICESTORM 2004 1975
## 7 FLASHFLOOD 54623 1790
## 8 WILDFIRE 4191 1605
## 9 HIGHWINDS 21712 1437
## 10 HAIL 288667 1371
## 11 WINTERSTORM 11419 1353
## 12 DENSEFOG 1821 1076
## 13 HEAVYSNOW 15701 1021
## 14 HURRICANE/TYPHOON 237 973
## 15 BLIZZARD 2716 803
## 16 WINTERWEATHER 8248 616
## 17 RIPCURRENTS 668 471
## 18 DUSTSTORM 422 440
## 19 WINDSTORM 4191 386
## 20 TROPICALSTORM 619 381
Top events with the largest Per Event Harm to the Population:
report.summary.topPerEvent %>% select(EVENT_TYPE, NUMBER_OF_EVENTS, TOTAL, PER_EVENT)
## # A tibble: 20 x 4
## EVENT_TYPE NUMBER_OF_EVENTS TOTAL PER_EVENT
## <fctr> <int> <dbl> <dbl>
## 1 EXCESSIVEHEAT 2613 12053 4.6127057
## 2 HURRICANE/TYPHOON 237 1086 4.5822785
## 3 FREEZINGFOG 91 240 2.6373626
## 4 ICE 61 143 2.3442623
## 5 TORNADO 60641 97022 1.5999406
## 6 RIPCURRENTS 668 994 1.4880240
## 7 ICYROADS 32 36 1.1250000
## 8 DUSTSTORM 422 462 1.0947867
## 9 ICESTORM 2004 2064 1.0299401
## 10 AVALANCHE 387 395 1.0206718
## 11 TROPICALSTORM 619 440 0.7108239
## 12 DENSEFOG 1821 1156 0.6348160
## 13 HEAVYSURF 85 48 0.5647059
## 14 WILDFIRE 4191 1695 0.4044381
## 15 LIGHTNING 15702 6020 0.3833907
## 16 HEAVYSURFANDWIND 192 69 0.3593750
## 17 HIGHSURF 686 230 0.3352770
## 18 BLIZZARD 2716 904 0.3328424
## 19 EXTREMECOLD/WINDCHILL 1864 563 0.3020386
## 20 DUSTDEVIL 149 45 0.3020134
Tornados remained the top event for the most fatalities and injuries to the population while Excessive Heat was the largest Per Event totals.
All economic dollar values are stored with a multiplier
PROPDMG and CROPDMG contain the value
PROPRDMGEXP and CROPDMGEXP contain the multipler with the following reference
H - hundreds multiplier $100
K - thousands multiplier $1000
M - millions multiplier $1,000,000
B - Billions multiplier $1,000,000,000
All other multipliers will be ignored due to lack of documenation and not a
significant number of recorded events to change the data outcome.
The first step is to make all dollar values for crop and property damage to be in the same monetary unit.
hundred = 100
thousand = 1000
million = 1000000
billion = 1000000000
economic = filteredData %>% filter(PROPDMGEXP %in% c("H","K","M","B") | CROPDMGEXP %in% c("H","K","M","B")) %>%
mutate(PROPERTY_COST = ifelse(PROPDMGEXP == "H", PROPDMG * hundred,
ifelse(PROPDMGEXP == "K", PROPDMG * thousand,
ifelse(PROPDMGEXP == "M", PROPDMG * million,
ifelse(PROPDMGEXP == "B", PROPDMG * billion, 0)))),
CROP_COST = ifelse(CROPDMGEXP == "H", CROPDMG * hundred,
ifelse(CROPDMGEXP == "K", CROPDMG * thousand,
ifelse(CROPDMGEXP == "M", CROPDMG * million,
ifelse(CROPDMGEXP == "B", CROPDMG * billion, 0))))
)
For events causing damage to crops and property, group the data by Event Type and summarize the data by cost of damage to crops, damage to properties and total cost damage (Property Cost + Crop Cost).
economic.summary = economic %>%
group_by(EVENT_TYPE = as.factor(NOAAEVTYPE)) %>%
summarize(NUMBER_OF_EVENTS = n(),
PROPERTY_MEAN = mean(PROPERTY_COST, na.rm = TRUE),
PROPERTY_DAMAGE = sum(PROPERTY_COST, na.rm = TRUE),
CROP_MEAN = mean(CROP_COST, na.rm = TRUE),
CROP_DAMAGE = sum(CROP_COST, na.rm = TRUE),
TOTAL_DAMAGE = sum(PROPERTY_COST + CROP_COST))
Summarize the data for top total cost, crop cost and property cost.
economic.summary.property = economic.summary %>%
select(EVENT_TYPE, DAMAGE_COST = PROPERTY_DAMAGE) %>%
mutate(DAMAGE_COST_TYPE = "Property Damage", DAMAGE_COST = DAMAGE_COST / billion) %>%
arrange(desc(DAMAGE_COST))
economic.summary.property.top = economic.summary.property[1:topCount,]
economic.summary.crop = economic.summary %>%
select(EVENT_TYPE, DAMAGE_COST = CROP_DAMAGE) %>%
mutate(DAMAGE_COST_TYPE = "Crop Damage", DAMAGE_COST = DAMAGE_COST / billion) %>%
arrange(desc(DAMAGE_COST))
economic.summary.crop.top = economic.summary.crop[1:topCount,]
economic.summary.total = economic.summary %>%
select(EVENT_TYPE, DAMAGE_COST = TOTAL_DAMAGE) %>%
mutate(DAMAGE_COST_TYPE = "Total Damage", DAMAGE_COST = DAMAGE_COST / billion) %>%
arrange(desc(DAMAGE_COST))
economic.summary.total.top = economic.summary.total[1:topCount,]
Below is a chart showing the top highest events impacting cost on properties and crops.
economicChart <- ggplot(data = economic.summary.total.top,
aes(x=reorder(EVENT_TYPE,DAMAGE_COST), y=DAMAGE_COST,
fill=colorRampPalette(brewer.pal(9,"Set1"))(topCount))) +
geom_bar(stat="identity") +
geom_text(aes(label=format(round(DAMAGE_COST, digits=2), big.mark = ",", scientific = FALSE)),
position=position_dodge(width=0.5), vjust=0.50, hjust="inward") +
coord_flip() +
labs(y="Cost in Dollars (Billions)", x="Type of Event",
title="Highest Economic Damage Cost") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, face = "bold")) +
theme(axis.title = element_text(face="bold"))
print(economicChart)
The report shows Floods with the highest cost (in billions) for damage to properties and crops combined across the United States. Tornados continue to show up in the top 10 due to the same factors found in comparing harm to the population.
Top events with the highest cost impact for Properties:
economic.summary.property.top %>% select(EVENT_TYPE, DAMAGE_COST)
## # A tibble: 20 x 2
## EVENT_TYPE DAMAGE_COST
## <fctr> <dbl>
## 1 FLOOD 144.6474628
## 2 HURRICANE/TYPHOON 82.5286190
## 3 TORNADO 58.5390135
## 4 STORMSURGE 43.3234610
## 5 FLASHFLOOD 16.6312325
## 6 HAIL 15.7323302
## 7 THUNDERSTORMWINDS 9.6962257
## 8 WILDFIRE 8.4844715
## 9 TROPICALSTORM 7.4756215
## 10 WINTERSTORM 6.7489922
## 11 HIGHWINDS 5.8765009
## 12 RIVERFLOODING 5.1189420
## 13 STORMSURGE/TIDE 4.6400380
## 14 ICESTORM 3.9449078
## 15 HEAVYRAIN/SEVEREWEATHER 2.5000000
## 16 SEVERETHUNDERSTORM 1.2053600
## 17 DROUGHT 1.0411060
## 18 HEAVYSNOW 0.9325541
## 19 LIGHTNING 0.9276158
## 20 HEAVYRAIN 0.7022801
Top events with the highest cost impact for Crops:
economic.summary.crop.top %>% select(EVENT_TYPE, DAMAGE_COST)
## # A tibble: 20 x 2
## EVENT_TYPE DAMAGE_COST
## <fctr> <dbl>
## 1 DROUGHT 13.9723610
## 2 FLOOD 5.6419935
## 3 RIVERFLOODING 5.0294590
## 4 ICESTORM 5.0221135
## 5 HURRICANE/TYPHOON 4.9589408
## 6 HAIL 3.0467474
## 7 FROST/FREEZE 1.5509110
## 8 FLASHFLOOD 1.5177851
## 9 EXTREMECOLD/WINDCHILL 1.3300230
## 10 THUNDERSTORMWINDS 1.1605896
## 11 EXCESSIVEHEAT 0.9044635
## 12 HEAVYRAIN 0.7937898
## 13 HIGHWINDS 0.6790179
## 14 TROPICALSTORM 0.5744950
## 15 TORNADO 0.4174521
## 16 WILDFIRE 0.4016046
## 17 DAMAGINGFREEZE 0.2962300
## 18 EXCESSIVEWETNESS 0.1420000
## 19 HEAVYSNOW 0.1346531
## 20 FLOOD/RAIN/WINDS 0.1128000
It is ironic that both Floods and Droughts had the highest costly impact due to weather events across the United States.