This data analysis uses the NOAA Storm Database to addresses two fundamental questions about severe weather events: Across the US, which types of events are most harmful with respect to human health? And which type of events have the greatest economic consequences?
The analysis is fully reproducible, beginning with the downloading and reading of the raw NOAA storm data csv file, in a bz2 zip format. Next, the data is processed by removing variables that are unnecessary for the analysis, formatting certain fields, and subsetting the data to cover the period of 1996 (when all event types were first recorded) to 2011 (most recent data in the file), when all event types were. Significant effort is spent cleaning the EVTYPE values to match names of the most recent storm data event table. The EVTYPE values are further condensed to remove distinction between ambiguous qualifiers to provide a clearer picture of the type of events causing the most damage.
The end result is a quantified and visual comparison of the human and economic damage inflicted by severe weather event types with heat and tornadoes causing the greatest human impact, and floods and hurricanes causing the most economic impact.
First, the appropriate packages are loaded and the raw data is read into a tibble. The raw data format is shown for reference.
#load necessary packages
library(tidyverse)
library(lubridate)
library(stringdist)
library(ggplot2)
#download data in a tibble
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
setwd("C:/Users/207014104/Desktop/DataScience/Reproduceable Research/Project2")
filename <- "stormData.csv.bz2"
if (!file.exists(filename)){
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, filename)
}
stormData <- as_tibble(read.csv(filename, stringsAsFactors = FALSE))
print(stormData)
## # A tibble: 902,297 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 1 4/18/19~ 0130 CST 97 MOBILE AL TORNA~
## 2 1 4/18/19~ 0145 CST 3 BALDWIN AL TORNA~
## 3 1 2/20/19~ 1600 CST 57 FAYETTE AL TORNA~
## 4 1 6/8/195~ 0900 CST 89 MADISON AL TORNA~
## 5 1 11/15/1~ 1500 CST 43 CULLMAN AL TORNA~
## 6 1 11/15/1~ 2000 CST 77 LAUDERDALE AL TORNA~
## 7 1 11/16/1~ 0100 CST 9 BLOUNT AL TORNA~
## 8 1 1/22/19~ 0900 CST 123 TALLAPOOSA AL TORNA~
## 9 1 2/13/19~ 2000 CST 125 TUSCALOOSA AL TORNA~
## 10 1 2/13/19~ 2000 CST 57 FAYETTE AL TORNA~
## # ... with 902,287 more rows, and 29 more variables: BGN_RANGE <dbl>,
## # BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>,
## # COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <chr>,
## # END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>,
## # FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>,
## # CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>,
## # ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>,
## # LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>
Then, significant effort is placed on cleaning the data so that records are matched in an efficient and logical manner.
#select only the variables relevant to determining the greatest economic consequences and harm to health
storm <- select(stormData, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,CROPDMG, CROPDMGEXP)
#format dates as a date class
storm$BGN_DATE <- as.Date(stormData$BGN_DATE, "%m/%d/%Y")
#remove events recorded in years when all the weather type were not recorded (prior to 1996)
storm <- storm[year(storm$BGN_DATE) >= 1996,]
#remove the summary rows (they were all in OK and had no data)
storm <- storm[!str_detect(storm$EVTYPE, "^Summary"),]
##CLEANING UP EVTYPE
#remove errors in entry of EVTYPE, such as abbreviations and extra info
storm$EVTYPE <- str_to_upper(storm$EVTYPE) #make all upper case
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "TSTM", "THUNDERSTORM")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FLD", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "URBAN/SML STREAM", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "/FOREST", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "/MIX", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "THUNDERSTORM WIND/", "")
#remove ambiguous qualifiers that result in separation of similar events
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "EXTREME", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "HIGH", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "DENSE", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "EXCESSIVE", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "STRONG", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "UNSEASONABLY", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "HEAVY", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "RECORD", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "WAVE", "")
#recode certain events to match more appropriate and general EVTYPES
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "LANDSLIDE", "DEBRIS FLOW")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "STORM SURGE$", "STORM SURGE/TIDE")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "GLAZE", "ICE STORM")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "SNOW SQUALL", "WINTER STORM")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "WINTER WEATHER MIX", "WINTER WEATHER")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "WINTRY MIX", "WINTER WEATHER")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "COASTAL FLOODING/EROSION", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "COASTAL FLOODING", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "RIVER FLOOD", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FLASH FLOOD", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "HURRICANE$", "HURRICANE/TYPHOON")
#remove extra spaces from both sides of the character string
storm$EVTYPE <- str_trim(storm$EVTYPE, side = "both")
#adjust certain ending or beginning words the inappropriately separate groups
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "^COLD$", "COLD/WIND CHILL")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "^TYPHOON", "HURRICANE/TYPHOON")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FREEZE$", "FROST/FREEZE")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FROST/FROST/FREEZE", "FROST/FREEZE")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FLOODING$", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FLOOD$", "FLOOD")
#load all of the 48 official EVTYPE codes
EVcodes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill",
"Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm",
"Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze",
"Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf",
"High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood",
"Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide",
"Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm",
"Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
#remove ambiguous qualifiers from the official EVTYPES to group like events
EVcodes <- str_to_upper(EVcodes)
EVcodes <- str_replace_all(EVcodes, "EXTREME", "")
EVcodes <- str_replace_all(EVcodes, "HIGH", "")
EVcodes <- str_replace_all(EVcodes, "DENSE", "")
EVcodes <- str_replace_all(EVcodes, "EXCESSIVE", "")
EVcodes <- str_replace_all(EVcodes, "STRONG", "")
EVcodes <- str_replace_all(EVcodes, "HEAVY", "")
EVcodes <- str_replace_all(EVcodes, " \\(TYPHOON\\)", "/TYPHOON")
EVcodes <- str_replace_all(EVcodes, "HEAVY", "")
EVcodes <- str_trim(EVcodes, side = "both")
#create a separate field that matches the processed EVTYPES to the official EV codes, including those that almost match
storm <- mutate(storm, EVTYPE1 = EVcodes[amatch(storm$EVTYPE, EVcodes, maxDist = 1)])
#fraction of records without a matched EVTYPE
unassigned <- mean(is.na(storm$EVTYPE1))
Out of the 653458 records included in the analysis, 0.4167062 percent do not have a matched event name. Effort was also made to ensure that those few events were not some of the largest impact events. Those event names with errors affecting jsut a small numbe rof records, but a large impact were recoded.
The final cleaning step is to calculate the dollar ($) value of the economic impact of events where multipliers are located in separate columns.
# replace multipliers in propdmexp with numeric values
storm$PROPDMGEXP <- str_replace_all(storm$PROPDMGEXP, "K", "1000")
storm$PROPDMGEXP <- str_replace_all(storm$PROPDMGEXP, "M", "1000000")
storm$PROPDMGEXP <- str_replace_all(storm$PROPDMGEXP, "B", "1000000000")
storm$PROPDMGEXP <- as.integer(storm$PROPDMGEXP)
storm$PROPDMGEXP[is.na(storm$PROPDMGEXP)] <- 1
# replace multipliers in cropdmexp with numeric values
storm$CROPDMGEXP <- str_replace_all(storm$CROPDMGEXP, "K", "1000")
storm$CROPDMGEXP <- str_replace_all(storm$CROPDMGEXP, "M", "1000000")
storm$CROPDMGEXP <- str_replace_all(storm$CROPDMGEXP, "B", "1000000000")
storm$CROPDMGEXP <- as.integer(storm$CROPDMGEXP)
storm$CROPDMGEXP[is.na(storm$CROPDMGEXP)] <- 1
# calculate the property damage and crop damage for each event
storm <- mutate(storm, PropertyDamage = PROPDMG * PROPDMGEXP) %>%
mutate(CropDamage = CROPDMG * CROPDMGEXP) %>%
mutate(TotalDamage = PropertyDamage + CropDamage) %>%
select(BGN_DATE, EVTYPE, EVTYPE1, FATALITIES, INJURIES, PropertyDamage, CropDamage, TotalDamage, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
First, the weather events with the greatest impact to human health are examined. The impact of the events is ordered by most fatalities, and also notes number of injuries.
#remove measurements with no matching event type after all cleaning efforts
stormCat <- storm[!is.na(storm$EVTYPE1),]
#create a table with that summarizes total deaths and injuries for each event type and sort by deaths from most to least
HarmSum <- group_by(stormCat, EVTYPE1) %>%
summarise(totalDeaths = sum(FATALITIES),
totalInjuries = sum(INJURIES)) %>%
arrange(desc(totalDeaths))
#print the top 15 events in terms of fatalities, showing associated injuries
print(HarmSum[1:15,])
## # A tibble: 15 x 3
## EVTYPE1 totalDeaths totalInjuries
## <chr> <dbl> <dbl>
## 1 HEAT 2036 7683
## 2 TORNADO 1511 20667
## 3 FLOOD 1334 8517
## 4 LIGHTNING 651 4141
## 5 RIP CURRENT 542 503
## 6 THUNDERSTORM WIND 371 5029
## 7 WIND 364 1466
## 8 COLD/WIND CHILL 353 127
## 9 AVALANCHE 223 156
## 10 WINTER STORM 194 1327
## 11 HURRICANE/TYPHOON 125 1326
## 12 SNOW 109 712
## 13 SURF 96 190
## 14 RAIN 94 230
## 15 WILDFIRE 87 1456
The 3 event types that have the greatest impact on human health are heat, tornado, and flood for both injuries and fatalities. heat is the #1 cause of death. It should be noted that this includes both “excessive heat” and “heat” from the official NOAA Event Table, and the decision to combine these is due to the ambiguous difference in naming. If an event of heat is classified as a severe event and results in deaths, it certainly could be considered excessive, whether explicit in name or not.
Chart 1 identifies both injuries and fatalities by event type on the x and y axes.
chart1 <- ggplot(data = HarmSum,
aes(x=totalInjuries, y=totalDeaths, label=EVTYPE1)) +
geom_point() +
geom_text(data=subset(HarmSum,
totalInjuries > 2500 | totalDeaths > 250),
aes(label=EVTYPE1),
hjust = -.1,
vjust = -.5,
check_overlap = TRUE,
size = 3) +
coord_cartesian(xlim = c(0,25000)) +
labs(x = "Total Human Injuries",
y = "Total Human Deaths",
title = "Chart 1: Greatest human harm by event type ('96-'11)") +
theme_minimal()
plot(chart1)
Next, total economic damage is analyzed by summarizing the cost of property damage and crop damage for each event type. Because both are in dollar ($) terms, they can be added together to calculate a total damage cost. A summary table of the event types causing a total of over $1 Billion in damage is shown below.
#create a summary table with total damage, and its two components
TotalSum <- group_by(stormCat, EVTYPE1) %>%
summarise(TotalDamage = sum(TotalDamage),
totalCropDamage = sum(CropDamage),
totalPropDamage = sum(PropertyDamage)) %>%
arrange(desc(TotalDamage)) %>%
filter(TotalDamage > 1000000000)
print(TotalSum)
## # A tibble: 15 x 4
## EVTYPE1 TotalDamage totalCropDamage totalPropDamage
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 165823736310 6348063200 159475673110
## 2 HURRICANE/TYPHOON 87068996810 5350107800 81718889010
## 3 STORM SURGE/TIDE 47835579000 855000 47834724000
## 4 TORNADO 24900370720 283425010 24616945710
## 5 HAIL 17180204620 2540725700 14639478920
## 6 DROUGHT 14413667000 13367566000 1046101000
## 7 THUNDERSTORM WIND 8821057230 952246350 7868810880
## 8 TROPICAL STORM 8320186550 677711000 7642475550
## 9 WILDFIRE 8162704630 402255130 7760449500
## 10 WIND 6126458900 698814800 5427644100
## 11 ICE STORM 3658058810 15660000 3642398810
## 12 WINTER STORM 1544787250 11944000 1532843250
## 13 COLD/WIND CHILL 1365617900 1334665500 30952400
## 14 RAIN 1313584240 728419800 585164440
## 15 FROST/FREEZE 1261591000 1250911000 10680000
#gather the data into a tidy table for plotting
tidyDamage <- select(TotalSum, EVTYPE1, totalCropDamage, totalPropDamage) %>%
gather(damageType, cost, 2:3)
Floods are clearly the most impactful severe weather events. Note that several different types of floods have been combined given that certain entries, such as “river flood” are not an official event type, but if they are added to general flooding, then it is only logical to also add “coastal” and “flash” flooding to this general flood category. While drought causes the most crop damage, the total value is much less than that of flooding, hurricanes/typhoons, storm surge, etc.
Chart 2 highlights the total economic damage of the top severe weather event types while also identifying the prortion of damage resulting from property or crop impacts by color.
chart2 <- ggplot(tidyDamage,
aes(x=reorder(EVTYPE1, cost),
y=cost/1000000000,
fill = damageType)) +
geom_col(color = "black") +
coord_flip() +
labs(y = "Total Damage ($B)",
x = "Type of Event",
title = "Chart 2: Greatest economic damage by event type ('96-'11)") +
scale_fill_discrete(name = "Type of damage",
labels = c("Crop loss", "Property loss"),
direction = -1, l = 75) +
theme_minimal()
plot(chart2)
This analysis provides a starting point for understanding the impact of several weather in both human and economic terms. While it must be noted that many discretionary decisions were made in the cleaning and recoding of event type data to improve the clarity of the analysis and results, many alternate interpretations of the data could be reached leading to diferences in findings. With that said, this analysis provide evidence that the heat and tornadoes are the leading causes of human harm, and floods and hurricane/typhoons cause the greatest economic impact.