This paper will analyze data from NOAA Storm Database from 1996 to 2011, to answer the questions: Across the United States, which types of events are most harmful with respect to population health? Across the United States, which types of events have the greatest economic consequences?
The NOAA Storm Data can be downloaded here: National Oceanic and Atmospheric Administration (NOAA). It can be downloaded directly here (47Mb).
library(R.utils)
## Warning: package 'R.utils' was built under R version 4.1.2
## Warning: package 'R.oo' was built under R version 4.1.2
## Warning: package 'R.methodsS3' was built under R version 4.1.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
file_name <- "storm_data.csv.bz2"
download.file(file_url, file_name, method = "curl")
storm_data_original <- read.csv(file_name, sep = ",", header = TRUE)
The data set is loaded into storm_data_original.
storm_data_original has the following variables.
str(storm_data_original)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Of the above 37 variables in the original data, we will focus on the following 8 variables:
storm_data <- storm_data_original[, c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
storm_data is subset to include only events recorded
after 1996, since that’s when NOAA changed event type standards. Records
from prior to 1996 will be excluded to eliminate errors.
storm_data$BGN_DATE <- as.Date(as.character(storm_data$BGN_DATE), "%m/%d/%Y %H:%M:%S")
storm_data <- subset(storm_data, format(storm_data$BGN_DATE, "%Y") > 1996 )
Here is a list of the 55 event type standards:
storm_events <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Cold", "Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Extreme Cold", "Flash Flood", "Flood", "Freezing Fog", "Frost/Freeze", "Frost", "Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon", "Hurricane", "Typhoon", "Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
First, we will case match to ensure uniformity, then convert to factor, and eliminate the duplicates.
storm_data$EVTYPE <- factor(toupper(storm_data$EVTYPE))
storm_data <- subset(storm_data, (storm_data$EVTYPE %in% toupper(storm_events)))
droplevels(storm_data$EVTYPE)
FATALATIES and INJURIES variables are most
harmful to population health, we’ll set them to be greater than 0.
storm_data is a subset including only
FATALATIES and INJURIES variables are greater
than 0. We’ll name this subset
storm_data_for_harmfulness.
storm_data_for_harmfulness <- subset(storm_data, storm_data$FATALITIES > 0 | storm_data$INJURIES > 0 )
For population health affects, we’ll focus on EVTYPE,
FATALATIES and INJURIES.
storm_data_for_harmfulness <- storm_data_for_harmfulness[,c("EVTYPE", "FATALITIES", "INJURIES")]
We’ll create a HARMFULNESS variable to summarize the
total harmfulness for the population for FATATLITIES and
INJURIES variables:
storm_data_for_harmfulness$HARMFULNESS <- storm_data_for_harmfulness$FATALITIES + storm_data_for_harmfulness$INJURIES
HARMFULLNESS is then summed together by
EVTYPE and saved in
storm_data_for_harmfulness_grouped_per_event_type
storm_data_for_harmfulness_grouped_per_event_type <- aggregate(x = list(HEALTH_IMPACT = storm_data_for_harmfulness$HARMFULNESS),
by = list(EVENT_TYPE = storm_data_for_harmfulness$EVTYPE),
FUN = sum,
na.rm = TRUE)
storm_data_for_harmfulness_grouped_per_event_type <- storm_data_for_harmfulness_grouped_per_event_type[order(storm_data_for_harmfulness_grouped_per_event_type$HEALTH_IMPACT, decreasing = TRUE),]
head(storm_data_for_harmfulness_grouped_per_event_type)
## EVENT_TYPE HEALTH_IMPACT
## 33 TORNADO 21447
## 10 EXCESSIVE HEAT 8093
## 14 FLOOD 7121
## 26 LIGHTNING 4424
## 13 FLASH FLOOD 2425
## 32 THUNDERSTORM WIND 1530
PROPDMGEXP and CROPDMGEXP use a letter to
signify magnitude (“K” for thousands, “M” for millions, and “B” for
billions).
PROPDMGEXP or CROPDMGEXP is “K”, the
multiplier is 1,000 (e.g. ’PROPDMGFULL = PROPDMG * 1000`).PROPDMGEXP or CROPDMGEXP is “M”, the
multiplier is 1,000,000 (e.g. ’PROPDMGFULL = PROPDMG * 1000000`).PROPDMGEXP or CROPDMGEXP is “B”, the
multiplier is 1,000,000,000 (e.g. ’PROPDMGFULL = PROPDMG *
1000000000`).First, lets remove the missing values from PROPDMGEXP
and CROPDMGEXP.
storm_data_for_economy <- subset(storm_data, storm_data$PROPDMGEXP != "" & storm_data$CROPDMGEXP != "" )
FATALATIES and INJURIES variables are most
harmful to population health, we’ll set them to be greater than 0.
storm_data is a subset including only
FATALATIES and INJURIES variables are greater
than 0. We’ll name this subset
storm_data_for_harmfulness.
storm_data_for_economy <- subset(storm_data_for_economy, storm_data_for_economy$PROPDMG > 0 | storm_data_for_economy$CROPDMG > 0 )
Then simplify the dataframe to EVTYPE,
PROPDMG, PROPDMGEXP, CROPDMG and
CROPDMGEXP.
storm_data_for_economy <- storm_data_for_economy[,c("EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
We will case match to ensure uniformity, then conver to factor.
storm_data_for_economy$PROPDMGEXP <- factor(toupper(storm_data_for_economy$PROPDMGEXP))
storm_data_for_economy$CROPDMGEXP <- factor(toupper(storm_data_for_economy$CROPDMGEXP))
Calculate the numbers!
storm_data_for_economy$PROPDMGFULL <- ifelse(storm_data_for_economy$PROPDMGEXP == "K",
storm_data_for_economy$PROPDMG * 1000,
ifelse(storm_data_for_economy$PROPDMGEXP == "M",
storm_data_for_economy$PROPDMG * 1000000,
ifelse(storm_data_for_economy$PROPDMGEXP == "B",
storm_data_for_economy$PROPDMG * 1000000000, 0)))
storm_data_for_economy$CROPDMGFULL <- ifelse(storm_data_for_economy$CROPDMGEXP == "K",
storm_data_for_economy$CROPDMG * 1000,
ifelse(storm_data_for_economy$CROPDMGEXP == "M",
storm_data_for_economy$CROPDMG * 1000000,
ifelse(storm_data_for_economy$CROPDMGEXP == "B",
storm_data_for_economy$CROPDMG * 1000000000, 0)))
We’ll create a DAMAGE variable to summarize the total
harmfulness for the population for PROPDMGFULL and
CROPDMGFULL variables:
storm_data_for_economy$DAMAGE <- storm_data_for_economy$PROPDMGFULL + storm_data_for_economy$CROPDMGFULL
DAMAGE is then summed together by EVTYPE
and saved in
storm_data_for_economy_grouped_per_event_type
storm_data_for_economy_grouped_per_event_type <- aggregate(x = list(ECONOMIC_IMPACT = storm_data_for_economy$DAMAGE),
by = list(EVENT_TYPE = storm_data_for_economy$EVTYPE),
FUN = sum,
na.rm = TRUE)
storm_data_for_economy_grouped_per_event_type <- storm_data_for_economy_grouped_per_event_type[order(storm_data_for_economy_grouped_per_event_type$ECONOMIC_IMPACT, decreasing = TRUE),]
head(storm_data_for_economy_grouped_per_event_type)
## EVENT_TYPE ECONOMIC_IMPACT
## 15 FLOOD 136877233900
## 27 HURRICANE/TYPHOON 29348167800
## 40 TORNADO 16203902150
## 26 HURRICANE 11474663000
## 20 HAIL 9172124220
## 14 FLASH FLOOD 8246133530
Here we see the top 10 event types which negatively affect the population:
head(storm_data_for_harmfulness_grouped_per_event_type, 10)
## EVENT_TYPE HEALTH_IMPACT
## 33 TORNADO 21447
## 10 EXCESSIVE HEAT 8093
## 14 FLOOD 7121
## 26 LIGHTNING 4424
## 13 FLASH FLOOD 2425
## 32 THUNDERSTORM WIND 1530
## 18 HEAT 1459
## 24 HURRICANE/TYPHOON 1339
## 39 WINTER STORM 1209
## 22 HIGH WIND 1181
Below is a visualization of the top 10 weather event harmful to the population health:
health_impact_chart <- ggplot(data = head(storm_data_for_harmfulness_grouped_per_event_type, 10),
aes(x = reorder(EVENT_TYPE, HEALTH_IMPACT), y = HEALTH_IMPACT, fill = EVENT_TYPE)) +
coord_flip() +
geom_bar(stat = "identity") +
xlab("Event Type") +
ylab("Total Fatalities and Injures") +
ggtitle("Top 10 Weather Events Most Harmful to\nPopulation Health")
print(health_impact_chart)
Here we see the top 10 event types which negatively affect the economy:
head(storm_data_for_economy_grouped_per_event_type, 10)
## EVENT_TYPE ECONOMIC_IMPACT
## 15 FLOOD 136877233900
## 27 HURRICANE/TYPHOON 29348167800
## 40 TORNADO 16203902150
## 26 HURRICANE 11474663000
## 20 HAIL 9172124220
## 14 FLASH FLOOD 8246133530
## 39 THUNDERSTORM WIND 3780985440
## 46 WILDFIRE 3684468370
## 25 HIGH WIND 2873328540
## 42 TROPICAL STORM 1496252350
Below is a visualization of the top 10 weather event harmful to the economy:
health_impact_chart <- ggplot(data = head(storm_data_for_economy_grouped_per_event_type, 10),
aes(x = reorder(EVENT_TYPE, ECONOMIC_IMPACT), y = ECONOMIC_IMPACT, fill = EVENT_TYPE)) +
coord_flip() +
geom_bar(stat = "identity") +
xlab("Event Type") +
ylab("Total Fatalities and Injures") +
ggtitle("Top 10 Weather Events Most Harmful to\nPopulation Health")
print(health_impact_chart)