Storm event data in the United States has historically been recorded by the National Oceanic and Atmospheric Administration (NOAA)/National Weather Service (NWS). NOAA’s current database contains records from January 1950 to the present. The object of this inquiry is to determine which weather event types cause the greatest economic damage (to crops and property, in monetary terms) and the greatest harm to population health (in terms of direct injuries and fatalities). The source and processing of the data are discussed, including the constraints applied to the data and their justification. Economic damage was assessed as the combined property and crop damage by event type, while population health impact was assessed as the combined direct injuries and fatalities, again by event type. The results indicate that, for the period studied (2007-01-01 through 2011-11-28), flood events caused the greatest damage to property and crops, while tornado events caused the largest number of direct injuries and fatalities.
Data for this inquiry were provided as an archive (.bz2 format),
downloadable from the course project page: Storm
Data. The file contains records of weather
events from 1950-04-18 through 2011-11-28.
Further information regarding the data set was provided:
The data set used for this inquiry is Storm Data
Operating under the assumption that the archive is in the working directory, load the libraries used in the analysis into R:
library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(data.table)
## Warning: package 'data.table' was built under R version 4.5.3
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, isoyear, mday, minute, month, quarter, second, wday,
## week, yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
Then import the data:
storm_data_full <- fread("repdata_data_StormData.csv.bz2")
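The import above assumes the archive sits in the working directory. As a minimal base R sketch (not the method used in this report, which relies on fread()), the hypothetical helper read_storm_archive() below stops early with a clear message when the file is missing and otherwise reads a bzip2 archive through bzfile(). The temporary file and its two rows are stand-ins invented for the demonstration:

```r
# Hypothetical helper: fail early, with a clear message, if the archive is missing.
read_storm_archive <- function(path) {
  if (!file.exists(path)) {
    stop("Archive not found: ", path, call. = FALSE)
  }
  # bzfile() gives a bzip2-compressed connection; read.csv() opens and closes it.
  read.csv(bzfile(path), stringsAsFactors = FALSE)
}

# Demonstration on a tiny stand-in archive written to a temporary file.
demo_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(demo_path, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "HAIL"), INJURIES = c(15, 0)),
          con, row.names = FALSE)
close(con)

demo <- read_storm_archive(demo_path)
print(demo)
```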
Before any analysis, it is useful to examine the structure and size of the data, and examine a sample of the contents. In this case we will inspect ten randomly-chosen rows:
str(storm_data_full)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
set.seed(93)
sample_n(storm_data_full, 10)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## <num> <char> <char> <char> <num> <char>
## 1: 12 6/8/1989 0:00:00 2345 CST 91 OKALOOSA
## 2: 39 8/4/1994 0:00:00 1241 EST 5 ASHLAND
## 3: 29 7/18/2000 0:00:00 07:00:00 PM CST 139 MONTGOMERY
## 4: 35 5/14/1998 0:00:00 04:30:00 PM MST 28 NMZ028>029
## 5: 29 6/4/1969 0:00:00 1950 CST 33 CARROLL
## 6: 28 1/1/2011 0:00:00 03:03:00 AM CST 99 NESHOBA
## 7: 16 12/16/2002 0:00:00 12:28:00 PM MST 10 IDZ010
## 8: 5 9/9/2001 0:00:00 10:30:00 PM CST 113 POLK
## 9: 21 1/27/2009 0:00:00 02:00:00 AM EST 52 KYZ052 - 058 - 068
## 10: 37 5/29/1996 0:00:00 06:50:00 PM EST 1 ALAMANCE
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI
## <char> <char> <num> <char> <char>
## 1: FL TSTM WIND 0
## 2: OH THUNDERSTORM WINDS 0 Sullivan
## 3: MO FUNNEL CLOUD 0 NEW FLORENCE
## 4: NM HIGH WIND 0
## 5: MO HAIL 0
## 6: MS THUNDERSTORM WIND 3 SSW OCOBLA
## 7: ID HIGH WIND 0
## 8: AR FLASH FLOOD 0 SOUTHEAST PORTION
## 9: KY WINTER WEATHER 0
## 10: NC HAIL 10 E LIBERTY
## END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI
## <char> <char> <num> <lgcl> <num> <char>
## 1: 0 NA 0
## 2: 0 NA 0
## 3: 7/18/2000 0:00:00 07:00:00 PM 0 NA 0
## 4: 5/14/1998 0:00:00 10:00:00 PM 0 NA 0
## 5: 0 NA 0
## 6: 1/1/2011 0:00:00 03:03:00 AM 0 NA 0
## 7: 12/16/2002 0:00:00 12:28:00 PM 0 NA 0
## 8: 9/9/2001 0:00:00 10:30:00 PM 0 NA 0
## 9: 1/28/2009 0:00:00 06:00:00 PM 0 NA 0
## 10: 5/29/1996 0:00:00 06:50:00 PM 0 NA 10 E
## END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## <char> <num> <num> <int> <num> <num> <num> <num>
## 1: 0 0 NA 50 0 0 0.0
## 2: 0 0 NA 0 0 0 50.0
## 3: NEW FLORENCE 0 0 NA 0 0 0 0.0
## 4: 0 0 NA 59 0 0 10.3
## 5: 0 0 NA 150 0 0 0.0
## 6: 0 0 NA 50 0 0 10.0
## 7: 0 0 NA 52 0 0 0.0
## 8: SOUTHEAST PORTION 0 0 NA 0 0 0 300.0
## 9: 0 0 NA 0 0 0 0.0
## 10: LIBERTY 0 0 NA 75 0 0 0.0
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## <char> <num> <char> <char> <char>
## 1: 0 MIA
## 2: K 0
## 3: 0 LSX MISSOURI, East
## 4: K 0 MAF NEW MEXICO, Southeast
## 5: 0
## 6: K 0 K JAN MISSISSIPPI, Central
## 7: 0 MSO IDAHO, North
## 8: K 0 LZK ARKANSAS, Central and North Central
## 9: K 0 K JKL KENTUCKY, Eastern
## 10: 0 RAH NORTH CAROLINA, Central
## ZONENAMES
## <char>
## 1:
## 2:
## 3:
## 4: EDDY COUNTY PLAINS - EDDY COUNTY PLAINS - NORTHERN LEA COUNTY
## 5:
## 6:
## 7: EASTERN LEMHI COUNTY - EASTERN LEMHI COUNTY
## 8:
## 9: ROWAN - ROWAN - ESTILL - ROCKCASTLE
## 10:
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## <num> <num> <num> <num>
## 1: 3045 8636 0 0
## 2: 0 0 0 0
## 3: 0 0 0 0
## 4: 0 0 0 0
## 5: 3935 9317 0 0
## 6: 3242 8903 0 0
## 7: 0 0 0 0
## 8: 0 0 0 0
## 9: 0 0 0 0
## 10: 3535 8011 3535 8011
## REMARKS
## <char>
## 1:
## 2: Trees were downed, some on power lines, with at least one damaging a house.
## 3: Local law enforcement reported a funnel cloud moving east along Interstate 70 near New Florence.\r
## 4: Gradient winds west of the dry line resulted in south to southwest winds sustained at 45 knots with gusts to 58 knots at the Lea County Airport in Hobbs (KHOB). A few trees were uprooted in town.\r\n\r\nGradient downsloping west winds in the wake of a Pacific Cold Front caused damage in Carlsbad. About 7 utility poles were knocked down as well as a roof taken off a mobile home. The peak gust at the Cavern City Airport (KCNM) was 59 knots.\r
## 5:
## 6: EPISODE NARRATIVE: A potent storm system brought a prolonged outbreak of severe thunderstorms to the Lower Mississippi Valley region from the afternoon hours of New Years Eve lasting through the morning hours of New Years Day. This rare combination of high instability and wind shear is mainly what supported the large outbreak that included multiple strong tornadoes. National Weather Service storm survey teams found 11 total tornadoes that occurred during this event. Of the 11, two were EF-3 and two were EF-2. Six were EF-1 with one EF-0. Damaging straight line winds also brought down numerous trees and large limbs across the area. Large hail also occurred during the event with reports ranging from quarter to golf ball size. In addition, flash flooding was a significant issue across the area. Roads were flooded in several locations, some vehicles were submerged in flood waters, and a few evacuations took place as a result of rising flood waters.EVENT NARRATIVE: Power lines were blown down in the Tucker Community.
## 7: Strong cold front brought damaging winds to north central Idaho during the early afternoon hours. In Idaho County, Sheriff reported roof damage and phone lines downed in Cottonwood. Tree limbs were broken in Grangeville where wind gusts to 56 mph were reported by automated weather station at the airport. In Lemhi County, RAWS station reported wind gusts to 60 mph.
## 8: Heavy rains resulted in flash flooding across a large part of southeast Polk County. Several roads were flooded, with county roads 61, 67, 69, and 664 washed out. At least a dozen bridges were washed out as well. People were stranded in campgrounds due to the bridge and road washouts. Also, a pontoon boat was washed out of a lake, over a small dam, and ended up in a county road.\r
## 9: EPISODE NARRATIVE: A major winter storm affected Eastern Kentucky beginning on January 27th. A low pressure system moved northeast from the Gulf Coast States, reaching Eastern Kentucky by the morning of the 28th. Snow, sleet, and freezing rain overspread the area during the early morning hours on the 27th. The precipitation gradually changed to all freezing rain, with some plain rain closer to the Tennessee border. By the morning of the 28th, numerous trees and power lines were brought down due to the weight of the ice from freezing rain. An estimated 100,000 people were without power. Some were without power for more than a week. Communications systems were affected, with both land line and cellular service inoperative. FEMA estimated that over $13 million in damage occurred as a result of the ice storm in eastern Kentucky alone. Estill, Johnson, Magoffin, and Morgan counties all received over $1 million dollars in federal aid respectively to help with recovery and cleanup.EVENT NARRATIVE: Two inches of snow and sleet accumulated in Morehead.
## 10:
## REFNUM
## <num>
## 1: 20909
## 2: 221935
## 3: 396097
## 4: 331256
## 5: 85173
## 6: 840487
## 7: 454305
## 8: 415678
## 9: 746367
## 10: 267477
As we have seen, the provided data set is large (902,297 rows, 37 columns). Not all of these columns are needed for the purpose of this inquiry. To determine which are germane to the matter at hand, we refer to National Weather Service Instruction 10-1605, which describes the variables recorded.
In addition to this information, the Storm Events Database description at the NOAA National Centers For Environmental Information web site indicates that the historical record is of varying completeness:
Based upon the NOAA/NWS publications referenced above:
bgn_date: Beginning date of the event.
evtype: Type of the event, of the 48 defined by NOAA/NWS.
fatalities: Number of fatalities directly caused by the weather event. Indirect fatalities are excluded from this figure.
injuries: Number of injuries directly caused by the weather event. Indirect injuries are excluded from this figure.
propdmg: Damage to property, base (three significant figures).
propdmgexp: Property damage order of magnitude (0, K, M, B).
cropdmg: Damage to crops, base (three significant figures).
cropdmgexp: Crop damage order of magnitude (0, K, M, B).
First, subset the data, retaining only the columns of interest, and tidy the variable names:
storm_data <- storm_data_full[, c(2, 8, 23:28)]
storm_data <- storm_data |> clean_names()
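Selecting columns by position works for this file, but it depends on the column order never changing. A minimal base R sketch of the same subset done by name is given here as an alternative, not as the method used in this report. The one-row data frame is a toy stand-in, and tolower() is used in place of clean_names() on the assumption that the names are already in upper snake case:

```r
# Toy stand-in with a few of the original columns plus two that are not needed.
toy <- data.frame(
  STATE = "AL", BGN_DATE = "4/18/1950 0:00:00", EVTYPE = "TORNADO",
  FATALITIES = 0, INJURIES = 15, PROPDMG = 25, PROPDMGEXP = "K",
  CROPDMG = 0, CROPDMGEXP = "", REFNUM = 1,
  stringsAsFactors = FALSE
)

# The eight columns retained in this report, addressed by name.
wanted <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES",
            "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Guard against a renamed or missing column before subsetting.
stopifnot(all(wanted %in% names(toy)))

toy_subset <- toy[, wanted]
names(toy_subset) <- tolower(names(toy_subset))
print(names(toy_subset))
```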
We retain the event beginning date (bgn_date) as it will
be needed to further constrain the data per our decision (2) above. To
accomplish this, further format manipulations are needed:
# Convert bgn_date from character to Date class
storm_data$bgn_date <- gsub(' 0:00:00', '', storm_data$bgn_date)
storm_data$bgn_date <- as.Date(storm_data$bgn_date, format = '%m/%d/%Y')
And subset by date to retain only those records from 2007-01-01 or later:
storm_data <- subset(storm_data, bgn_date >= as.Date('2007-01-01'))
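The date handling can be checked on a few toy strings laid out like bgn_date. This sketch relies on the documented behaviour of as.Date() for character input, which ignores any characters left over once the format has been matched. The three sample values are chosen for illustration only:

```r
# Toy timestamps in the same layout as bgn_date.
raw_dates <- c("4/18/1950 0:00:00", "2/13/2007 0:00:00", "11/28/2011 0:00:00")

# as.Date() stops reading once the format is satisfied, so the trailing
# " 0:00:00" is ignored and no gsub() step is needed in this sketch.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Keep only dates on or after the cutoff used in this report.
cutoff <- as.Date("2007-01-01")
recent <- parsed[parsed >= cutoff]
print(recent)
```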
For the purpose of this analysis we are interested in events which resulted in damage, either to property, crops, or persons. The data will therefore be further reduced by the elimination of those records which have no documentation of any of these damages.
storm_data <- storm_data |>
filter(!is.na(propdmg) | !is.na(cropdmg) | injuries > 0 | fatalities > 0)
str(storm_data)
## Classes 'data.table' and 'data.frame': 255104 obs. of 8 variables:
## $ bgn_date : Date, format: "2007-02-13" "2007-02-13" ...
## $ evtype : chr "TORNADO" "HAIL" "HAIL" "HAIL" ...
## $ fatalities: num 0 0 0 0 0 0 0 0 0 0 ...
## $ injuries : num 0 0 0 0 0 0 0 0 0 0 ...
## $ propdmg : num 10 0 0 0 0 0 0 0 0 0 ...
## $ propdmgexp: chr "K" "K" "K" "K" ...
## $ cropdmg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cropdmgexp: chr "K" "K" "K" "K" ...
## - attr(*, ".internal.selfref")=<externalptr>
255,104 records remain after the data constraints are applied. This is sufficient, in the author’s view, to provide suitable accuracy at the level requested in the brief.
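Note that a condition built on !is.na() keeps every row in which a damage value is present, including rows that record a damage of zero. Several such rows are visible in the str() output above. A minimal base R sketch on invented toy data compares that condition with a stricter one that requires a positive figure. The stricter variant is an assumption about the intent and is not the filter applied in this report:

```r
# Invented toy records; the last row has missing damage values.
toy <- data.frame(
  evtype     = c("TORNADO", "HAIL", "FLOOD", "HAIL"),
  fatalities = c(0, 0, 1, 0),
  injuries   = c(0, 0, 0, 0),
  propdmg    = c(10, 0, 0, NA),
  cropdmg    = c(0, 0, 0, NA),
  stringsAsFactors = FALSE
)

# Condition as used above: TRUE whenever a damage value is present, even if zero.
keep_recorded <- !is.na(toy$propdmg) | !is.na(toy$cropdmg) |
  toy$injuries > 0 | toy$fatalities > 0

# Stricter variant: TRUE only when some damage or casualty figure is positive.
# `%in% TRUE` turns an NA comparison into FALSE.
keep_positive <- (toy$propdmg > 0) %in% TRUE | (toy$cropdmg > 0) %in% TRUE |
  toy$injuries > 0 | toy$fatalities > 0

print(data.frame(evtype = toy$evtype, keep_recorded, keep_positive))
```

Rows with zero damage add nothing to a sum, so the per-event totals computed later should be the same under either condition. Only the number of retained rows differs.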
Using the coding specified in National Weather Service
Instruction 10-1605, the exponent codes are applied to the values
in propdmg and cropdmg to calculate the damage
caused by the event in US Dollars.
# First, substitute number for exponent in property damage
storm_data <- storm_data |>
mutate(propdmgexp = gsub('B', '1000000000', propdmgexp, ignore.case = TRUE)) |>
mutate(propdmgexp = gsub('M', '1000000', propdmgexp, ignore.case = TRUE)) |>
mutate(propdmgexp = gsub('K', '1000', propdmgexp, ignore.case = TRUE))
# Then crop damage
storm_data <- storm_data |>
mutate(cropdmgexp = gsub('B', '1000000000', cropdmgexp, ignore.case = TRUE)) |>
mutate(cropdmgexp = gsub('M', '1000000', cropdmgexp, ignore.case = TRUE)) |>
mutate(cropdmgexp = gsub('K', '1000', cropdmgexp, ignore.case = TRUE))
# And convert type of both exponent columns to number
storm_data <- storm_data |>
mutate(propdmgexp = as.numeric(propdmgexp)) |>
mutate(cropdmgexp = as.numeric(cropdmgexp))
# Combine property/crop damage and their exponents, replacing propdmg and cropdmg
# and remove exponent columns
storm_data <- storm_data |>
mutate(propdmg = propdmg * propdmgexp) |>
mutate(cropdmg = cropdmg * cropdmgexp) |>
select(-c('propdmgexp', 'cropdmgexp'))
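The exponent conversion can also be written as a lookup table, sketched here in base R on invented toy values. One caveat about the pipeline above: a blank exponent becomes NA under as.numeric(), and a "0" exponent becomes a multiplier of zero. This sketch instead treats any code other than K, M or B as a multiplier of 1. That treatment is an assumption of the sketch, not something taken from Instruction 10-1605:

```r
# Invented toy values; the exponent column mixes case and includes a blank.
toy <- data.frame(
  propdmg    = c(25, 2.5, 1.2, 10),
  propdmgexp = c("K", "m", "B", ""),
  stringsAsFactors = FALSE
)

# Named lookup vector: exponent code -> multiplier in US Dollars.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each code (case-insensitively); unmatched codes come back as NA.
mult <- unname(multiplier[toupper(toy$propdmgexp)])

# ASSUMPTION: codes other than K/M/B (blank, "0", ...) leave the base value as is.
mult[is.na(mult)] <- 1

toy$propdmg_usd <- toy$propdmg * mult
print(toy)
```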
To identify the most damaging weather events:
total_dmg.casualties.storm_data <- storm_data |>
mutate(total_dmg = propdmg + cropdmg) |>
mutate(casualties = fatalities + injuries)
Create individual dataframes which summarise the damage to crops and property by weather event type. We then select the top ten causes to plot:
storm_summ_econ <- storm_data |>
group_by(evtype) |>
summarise(total_property_damage = sum(propdmg),
total_crop_damage = sum(cropdmg)) |>
filter(total_property_damage != 0 | total_crop_damage != 0) |>
mutate(total_damage = total_property_damage + total_crop_damage) |>
arrange(desc(total_damage)) |>
head(10)
econ_x_labels <- as.vector(storm_summ_econ$evtype)
And damage to persons, again by weather event type and selecting the top ten causes:
storm_summ_hum <- storm_data |>
group_by(evtype) |>
summarise(total_fatalities = sum(fatalities),
total_injuries = sum(injuries)) |>
filter(total_fatalities != 0 | total_injuries != 0) |>
mutate(total_casualties = total_fatalities + total_injuries) |>
arrange(desc(total_casualties)) |>
head(10)
hum_x_labels <- as.vector(storm_summ_hum$evtype)
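The group, sum, rank and truncate pattern used in both summaries can be checked by hand on a small example. The sketch below uses base R aggregate() rather than the dplyr verbs above so that it runs without extra packages. All figures are invented for illustration:

```r
# Invented toy damage records in US Dollars.
toy <- data.frame(
  evtype  = c("FLOOD", "TORNADO", "FLOOD", "HAIL", "TORNADO", "DROUGHT"),
  propdmg = c(5e6, 2e6, 1e6, 3e5, 1e6, 0),
  cropdmg = c(1e6, 0, 5e5, 2e5, 0, 0),
  stringsAsFactors = FALSE
)

# Sum property and crop damage within each event type.
totals <- aggregate(cbind(propdmg, cropdmg) ~ evtype, data = toy, FUN = sum)
totals$total_damage <- totals$propdmg + totals$cropdmg

# Drop event types with no damage, rank by total, and keep the top two.
totals <- totals[totals$total_damage != 0, ]
top <- head(totals[order(-totals$total_damage), ], 2)
print(top)
```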
The data manipulations now complete, we proceed to plot the results and draw conclusions.
ggplot(data = storm_summ_econ) +
geom_col(aes(x = factor(evtype, levels = econ_x_labels),
y = total_damage,
fill = evtype)) +
labs(y = 'Damage (USD)', x = '',
title = 'Fig. 1: Weather event damage to property and crops',
subtitle = 'Ten largest causes, 2007-01-01 through 2011-11-28, NOAA/NWS data') +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = 'none')
The ten most damaging weather event types for the studied period
(2007-01-01 through 2011-11-28) are plotted in fig. 1, above. Damage is
computed as the sum of the damage to crops and damage to property. Flood
events are the largest cause of damage to property and crops for the
studied period.
(Note that NOAA/NWS records “flood” and “flash flood” events separately.
Were these combined, the prominence of inundation events in property and
crop damage would be further emphasized.)
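Combining the two flood categories, as the note above suggests, could be sketched as follows. The totals are invented purely to show the mechanics and are not results from the data. The label "FLOOD (ALL TYPES)" is a hypothetical grouping name, since NOAA/NWS keeps the two event types distinct:

```r
# Invented totals for illustration only.
toy <- data.frame(
  evtype       = c("FLOOD", "FLASH FLOOD", "TORNADO", "HAIL"),
  total_damage = c(100, 40, 90, 30),
  stringsAsFactors = FALSE
)

# Map both flood categories onto one hypothetical group label.
flood_types <- c("FLOOD", "FLASH FLOOD")
toy$event_group <- ifelse(toy$evtype %in% flood_types,
                          "FLOOD (ALL TYPES)", toy$evtype)

# Re-total by the new grouping and rank from largest to smallest.
combined <- sort(tapply(toy$total_damage, toy$event_group, sum),
                 decreasing = TRUE)
print(combined)
```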
ggplot(data = storm_summ_hum) +
geom_col(aes(x = factor(evtype, levels = hum_x_labels),
y = total_casualties,
fill = evtype)) +
labs(y = 'Casualties (injuries + fatalities)', x = '',
title = 'Fig. 2: Weather event direct injuries/fatalities',
subtitle = 'Ten largest causes, 2007-01-01 through 2011-11-28, NOAA/NWS data') +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = 'none')
The ten weather event types causing the highest number of direct casualties (defined here as the sum of fatalities and injuries) are shown in fig. 2. For the studied period, tornado events are far and away the greatest cause of injury and death.