Using the NOAA Storm Data, this report looks at both the economic damage and the damage to population health caused by weather events in the United States. The analysis summarises the total damage caused by each event type. Each kind of damage has two components. Economic damage consists of property damage and crop damage, measured in United States dollars. Population health damage consists of fatalities and injuries. The results show which event type caused the most damage in each category.
The data is downloaded from the link in the code below. It is read into
R using the readr package. If the data already exists in
the working directory, it will not be downloaded.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
if (!file.exists("data.bz2")) {
  print("downloading...")
  download.file(
    "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
    "./data.bz2"
  )
}
raw_data <- readr::read_csv("data.bz2")
## Rows: 902297 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): BGN_DATE, BGN_TIME, TIME_ZONE, COUNTYNAME, STATE, EVTYPE, BGN_AZI,...
## dbl (18): STATE__, COUNTY, BGN_RANGE, COUNTY_END, END_RANGE, LENGTH, WIDTH, ...
## lgl (1): COUNTYENDN
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The initial data processing consists of simple transformations that make the data easier to use and understand:
The damage data (PROPDMG and CROPDMG) is converted into full numeric values, so that each damage figure is held in a single column. This is done by replacing the letter representing the magnitude (PROPDMGEXP and CROPDMGEXP) with the corresponding power of ten, which is then used to calculate the full value.
The EVTYPE is converted to lower case for the next step (below) in the analysis.
Any event type that is a summary is dropped from the data set. It is dropped because it is not explicit which event the data refers to.
# Replace letter with the exponential value
replace_exp_with_numeric <- function(data_in) {
  data_in[is.na(data_in) | data_in == "?" | data_in == "-"] <- "0"
  data_in[data_in == "+"] <- "1"
  data_in[data_in == "H" | data_in == "h"] <- "2"
  data_in[data_in == "K" | data_in == "k"] <- "3"
  data_in[data_in == "M" | data_in == "m"] <- "6"
  data_in[data_in == "B" | data_in == "b"] <- "9"
  data_out <- as.numeric(data_in)
  data_out
}
data <- raw_data %>%
  transform(
    # BGN_DATE = parse_date_time(raw_data$BGN_DATE, "%m/%d/%y %H:%M:%S"),
    EVTYPE = tolower(raw_data$EVTYPE),
    PROPDMGEXP = replace_exp_with_numeric(raw_data$PROPDMGEXP),
    CROPDMGEXP = replace_exp_with_numeric(raw_data$CROPDMGEXP)
  ) %>%
  transform(
    PROPDMG = PROPDMG * (10 ^ PROPDMGEXP),
    CROPDMG = CROPDMG * (10 ^ CROPDMGEXP)
  ) %>%
  # mutate(year = format(BGN_DATE, "%Y")) %>%
  filter(!grepl("summary", EVTYPE, ignore.case = TRUE))
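The exponent handling above can be checked on a few hand-made values. The following is a minimal sketch in base R, not the report's own code: the helper `exp_to_power` and the toy vectors are hypothetical, and the mapping simply mirrors the rules in `replace_exp_with_numeric` (letters become powers of ten, digits pass through, and missing or unrecognised codes become 0).

```r
# Hypothetical helper (not part of the original analysis): map an exponent
# code to a power of ten using the same rules as replace_exp_with_numeric().
exp_to_power <- function(code) {
  lookup <- c("+" = 1, h = 2, k = 3, m = 6, b = 9)
  code <- tolower(code)
  # Digit codes such as "2" convert directly; everything else becomes NA here.
  power <- suppressWarnings(as.numeric(code))
  known <- code %in% names(lookup)
  power[known] <- lookup[code[known]]
  # NA, "?", "-" and any other unrecognised code are treated as 10^0.
  power[is.na(power)] <- 0
  power
}

# Toy values, assumed for illustration only.
toy_dmg <- c(25, 2.5, 1.2, 10, 5)
toy_exp <- c("K", "M", "B", NA, "2")

toy_total <- toy_dmg * 10 ^ exp_to_power(toy_exp)
print(toy_total)
```

In this toy input, a damage value of 25 with exponent "K" becomes 25,000, and a value of 10 with a missing exponent stays at 10.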
The final step in the processing is to transform the events into the
correct category. In the raw data set there are 977 different event
types and 823 different event types after removing the ‘summary’ types.
In the NOAA documentation there are only 48 event types. Using the
stringdist package, the event types (EVTYPE) will be
matched to the nearest event type from the documentation.
length(unique(raw_data$EVTYPE))
## [1] 977
length(unique(data$EVTYPE))
## [1] 823
# Valid event types taken from NOAA documentation section number 2.1.1
# https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
valid_event_types <- c(
  "Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
  "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought",
  "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill",
  "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog",
  "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind",
  "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood",
  "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind",
  "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet",
  "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
  "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
  "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
Using the stringdistmatrix function, a matrix is created
holding the distance between each event type in the data set and each
valid event type. The matrix is then converted to a data
frame, and the closest match is selected for each event. The closest
match is then inserted into the data object.
library(stringdist)
##
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
##
## extract
result <- stringdistmatrix(
  data$EVTYPE,
  valid_event_types,
  method = "dl",
  useNames = TRUE
)
result <- as_tibble(result)
data$CORRECTED_EVTYPE <- apply(result, 1, function (x) {names(which.min(x))})
This is not a perfect replacement (see the examples below). However, for the purposes of this exercise it does summarise the major events correctly, and the errors seem to affect only a small portion of the data set. The two examples shown appear only once each in the data set.
data$CORRECTED_EVTYPE[data$EVTYPE == "drowning"]
## [1] "Lightning"
length(data$EVTYPE[data$EVTYPE == "drowning"])
## [1] 1
data$CORRECTED_EVTYPE[data$EVTYPE == "heavy swells"]
## [1] "Heavy Rain"
length(data$EVTYPE[data$EVTYPE == "heavy swells"])
## [1] 1
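One possible refinement, offered here only as a sketch and not as part of the original analysis, is to compare names case-insensitively and to compute distances for the unique event types only. String distances are case-sensitive, so matching lower-cased EVTYPE values against title-case names charges an edit for every capital letter, which may contribute to mismatches like those above. The example below uses base R's `adist` (plain Levenshtein distance) as a stand-in for `stringdistmatrix` with `method = "dl"`, so its distances can differ where transpositions are involved. The vectors `toy_valid` and `toy_observed` are made up for illustration.

```r
# Toy inputs, assumed for illustration only.
toy_valid <- c("Flood", "Flash Flood", "Tornado", "Thunderstorm Wind", "Hail")
toy_observed <- c("tornado", "flash flood", "tornado", "hail", "tornados")

# Compute distances once per unique observed name, with both sides lower-cased.
unique_obs <- unique(toy_observed)
dist_matrix <- adist(unique_obs, tolower(toy_valid))

# For each unique name, pick the valid type with the smallest distance.
closest <- toy_valid[apply(dist_matrix, 1, which.min)]
lookup <- setNames(closest, unique_obs)

# Map every row back through the lookup table.
toy_corrected <- unname(lookup[toy_observed])
print(toy_corrected)
```

Working on unique values keeps the distance matrix small: its row count is the number of distinct event types rather than the number of rows in the data set.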
damage_data <- data %>%
  group_by(CORRECTED_EVTYPE) %>%
  summarise(
    PROPDMG = sum(PROPDMG),
    CROPDMG = sum(CROPDMG),
  ) %>%
  gather(DMGTYPE, DMG, PROPDMG, CROPDMG) %>%
  filter(DMG > quantile(DMG, 0.75)) %>%
  arrange(desc(DMG)) %>%
  mutate(CORRECTED_EVTYPE = factor(
    CORRECTED_EVTYPE, levels = unique(CORRECTED_EVTYPE)
  ))
ggplot(
  damage_data,
  aes(x = CORRECTED_EVTYPE, y = DMG / 1000000, fill = DMGTYPE)
) +
  geom_bar(stat = "identity", position = "stack") +
  coord_flip() +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 8)) +
  ylab("Total Damage Cost (value in USD millions)") +
  xlab("Event type") +
  ggtitle("Total cost of damages caused by weather events in the United States from 1950 to 2011") +
  scale_fill_discrete(
    name = "Damage Type",
    breaks = c("CROPDMG", "PROPDMG"),
    labels = c("Crop Damage", "Property Damage")
  )
The above chart illustrates the cost of the damage caused by the events listed. It only displays the event and damage-type totals that fall in the top quartile (top 25%) of damage cost. The chart shows that flooding is the most damaging event overall and for property damage. However, drought is more damaging to crops than any other event type. The values for flooding and drought are listed below.
x <- damage_data[damage_data$CORRECTED_EVTYPE == "Flood",]
print(x)
## # A tibble: 2 × 3
## CORRECTED_EVTYPE DMGTYPE DMG
## <fct> <chr> <dbl>
## 1 Flood PROPDMG 145404327857
## 2 Flood CROPDMG 5671708950
total_dmg_by_flood <- (x$DMG[x$DMGTYPE == "PROPDMG"] + x$DMG[x$DMGTYPE == "CROPDMG"])
print(total_dmg_by_flood)
## [1] 1.51076e+11
damage_data[damage_data$CORRECTED_EVTYPE == "Drought",]
## # A tibble: 1 × 3
## CORRECTED_EVTYPE DMGTYPE DMG
## <fct> <chr> <dbl>
## 1 Drought CROPDMG 14038631000
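Drought appears above with a single row because the quantile filter is applied after `gather`, that is, to each event and damage-type pair rather than to each event's combined total. The following base R sketch illustrates that behaviour on a made-up table; the data frame `toy` and its values are assumptions for illustration and are not taken from the storm data.

```r
# Toy long-format table: one row per event and damage type (values invented).
toy <- data.frame(
  event = c("Flood", "Flood", "Drought", "Drought",
            "Hail", "Hail", "Heat", "Heat"),
  type  = rep(c("PROPDMG", "CROPDMG"), times = 4),
  dmg   = c(100, 8, 1, 20, 3, 2, 0.5, 0.1)
)

# The 75th percentile is taken over all rows, mixing both damage types.
cutoff <- quantile(toy$dmg, 0.75)

# Keep only rows strictly above the cutoff, as filter(DMG > quantile(...)) does.
kept <- toy[toy$dmg > cutoff, ]
print(kept)
```

In this toy table only the property row for Flood and the crop row for Drought survive, so an event can show one damage type in the chart even though it has a non-zero total for the other.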
casualty_data <- data %>%
  group_by(CORRECTED_EVTYPE) %>%
  summarise(
    FATALITIES = sum(FATALITIES),
    INJURIES = sum(INJURIES),
  ) %>%
  gather(CASUALTY_TYPE, COUNT, FATALITIES, INJURIES) %>%
  filter(COUNT > quantile(COUNT, 0.75)) %>%
  arrange(desc(COUNT)) %>%
  mutate(CORRECTED_EVTYPE = factor(
    CORRECTED_EVTYPE, levels = unique(CORRECTED_EVTYPE)
  ))
ggplot(
  casualty_data,
  aes(x = CORRECTED_EVTYPE, y = COUNT / 1000, fill = CASUALTY_TYPE)
) +
  geom_bar(stat = "identity", position = "stack") +
  coord_flip() +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 8)) +
  ylab("Total Casualties (values in 1,000)") +
  xlab("Event type") +
  ggtitle("Total casualties (fatalities + injuries) caused by weather events\nin the United States from 1950 to 2011") +
  scale_fill_discrete(
    name = "Casualty Type",
    breaks = c("FATALITIES", "INJURIES"),
    labels = c("Fatalities", "Injuries")
  )
The above chart illustrates the total casualties caused by each weather event listed. A casualty is either a fatality or an injury. The chart shows that tornadoes cause the most casualties overall, as well as the most fatalities and the most injuries of any event. The values for the tornado event are listed below.
casualty_data[casualty_data$CORRECTED_EVTYPE == "Tornado",]
## # A tibble: 2 × 3
## CORRECTED_EVTYPE CASUALTY_TYPE COUNT
## <fct> <chr> <dbl>
## 1 Tornado INJURIES 91395
## 2 Tornado FATALITIES 5638
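The rounded headline figures in this report can be reproduced from the totals printed above. The sketch below is a simple arithmetic check in base R; the input numbers are copied from the printed tibbles, while the helper `to_billions` and the variable names are hypothetical.

```r
# Totals copied from the printed output above.
flood_prop <- 145404327857
flood_crop <- 5671708950
drought_crop <- 14038631000
tornado_injuries <- 91395
tornado_fatalities <- 5638

# Hypothetical helper: express a USD amount in billions, to two decimals.
to_billions <- function(usd) round(usd / 1e9, 2)

flood_total <- flood_prop + flood_crop
tornado_casualties <- tornado_injuries + tornado_fatalities

print(to_billions(c(flood_total, flood_prop, flood_crop, drought_crop)))
print(tornado_casualties)
```

The flood total is 151,076,036,807 USD, which rounds to 151.08 billion, and the tornado injuries and fatalities sum to 97,033 casualties.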
From the analysis in this report, the most damaging weather events in the United States from 1950 to 2011 are as follows. For population health, tornadoes caused the most injuries and fatalities, with a total of 97,033 casualties, of which 91,395 were injuries and 5,638 were deaths. For economic damage, the most damaging event overall is flooding, with a total cost of $151.08 billion, of which $145.4 billion is property damage and $5.67 billion is crop damage. Flooding is also the single most damaging event for property, while drought caused the most crop damage, with a total cost of $14.04 billion.