Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
This analysis will be used to explore the NOAA Storm Database and answer some basic questions about severe weather events. The data set in this analysis consists of weather events from 1950 to November 2011.
I will use the Storm Data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to perform the analysis.
Displaying Session Info
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.30 R6_2.5.1 jsonlite_1.8.3 magrittr_2.0.3
## [5] evaluate_0.18 stringi_1.7.8 rlang_1.0.6 cachem_1.0.6
## [9] cli_3.4.1 rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.1
## [13] rmarkdown_2.18 tools_4.2.2 stringr_1.4.1 xfun_0.33
## [17] fastmap_1.1.0 compiler_4.2.2 htmltools_0.5.3 knitr_1.40
## [21] sass_0.4.2
data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists(".\\Storm_Data.csv.bz2")) {
download.file(url = data_url, destfile = ".\\Storm_Data.csv.bz2")
}
storm_data <- read.csv(".\\Storm_Data.csv.bz2")
After reading the data into storm_data dataframe, checking the number of variables and observations.
str(storm_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(storm_data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
Loading the needed packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
Since not all the variables and observations necessary to perform the analysis, data will be subsetted where Event Type was captured and also damage is more than 0.
storm_data_subset <- subset(storm_data, EVTYPE != "?" & (FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0), select = c("BGN_DATE", "END_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
dim(storm_data_subset)
## [1] 254632 10
This subset data has only 254632 observations and 10 variables we need to use for the rest of the analysis. However, Event Type still needs to be cleaned since it has more than 487 unique values in it.
length(unique(storm_data_subset$EVTYPE))
## [1] 487
I will start cleaning every single event type by converting them to uppercase.
storm_data_subset$EVTYPE <- toupper(storm_data_subset$EVTYPE)
Next I will convert every event type to a unique value
# AVALANCHE
storm_data_subset$EVTYPE <- gsub('.*AVALANCE.*', 'AVALANCHE', storm_data_subset$EVTYPE)
# BLIZZARD
storm_data_subset$EVTYPE <- gsub('.*BLIZZARD.*', 'BLIZZARD', storm_data_subset$EVTYPE)
# CLOUD
storm_data_subset$EVTYPE <- gsub('.*CLOUD.*', 'CLOUD', storm_data_subset$EVTYPE)
# COLD
storm_data_subset$EVTYPE <- gsub('.*COLD.*', 'COLD', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*FREEZ.*', 'COLD', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*FROST.*', 'COLD', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*ICE.*', 'COLD', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*LOW TEMPERATURE RECORD.*', 'COLD', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*LO.*TEMP.*', 'COLD', storm_data_subset$EVTYPE)
# DRY
storm_data_subset$EVTYPE <- gsub('.*DRY.*', 'DRY', storm_data_subset$EVTYPE)
# DUST
storm_data_subset$EVTYPE <- gsub('.*DUST.*', 'DUST', storm_data_subset$EVTYPE)
# FIRE
storm_data_subset$EVTYPE <- gsub('.*FIRE.*', 'FIRE', storm_data_subset$EVTYPE)
# FLOOD
storm_data_subset$EVTYPE <- gsub('.*FLOOD.*', 'FLOOD', storm_data_subset$EVTYPE)
# FOG
storm_data_subset$EVTYPE <- gsub('.*FOG.*', 'FOG', storm_data_subset$EVTYPE)
# HAIL
storm_data_subset$EVTYPE <- gsub('.*HAIL.*', 'HAIL', storm_data_subset$EVTYPE)
# HEAT
storm_data_subset$EVTYPE <- gsub('.*HEAT.*', 'HEAT', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*WARM.*', 'HEAT', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*HIGH.*TEMP.*', 'HEAT', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*RECORD HIGH TEMPERATURES.*', 'HEAT', storm_data_subset$EVTYPE)
# HYPOTHERMIA/EXPOSURE
storm_data_subset$EVTYPE <- gsub('.*HYPOTHERMIA.*', 'HYPOTHERMIA/EXPOSURE', storm_data_subset$EVTYPE)
# LANDSLIDE
storm_data_subset$EVTYPE <- gsub('.*LANDSLIDE.*', 'LANDSLIDE', storm_data_subset$EVTYPE)
# LIGHTNING
storm_data_subset$EVTYPE <- gsub('^LIGHTNING.*', 'LIGHTNING', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('^LIGNTNING.*', 'LIGHTNING', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('^LIGHTING.*', 'LIGHTNING', storm_data_subset$EVTYPE)
# MICROBURST
storm_data_subset$EVTYPE <- gsub('.*MICROBURST.*', 'MICROBURST', storm_data_subset$EVTYPE)
# MUDSLIDE
storm_data_subset$EVTYPE <- gsub('.*MUDSLIDE.*', 'MUDSLIDE', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*MUD SLIDE.*', 'MUDSLIDE', storm_data_subset$EVTYPE)
# RAIN
storm_data_subset$EVTYPE <- gsub('.*RAIN.*', 'RAIN', storm_data_subset$EVTYPE)
# RIP CURRENT
storm_data_subset$EVTYPE <- gsub('.*RIP CURRENT.*', 'RIP CURRENT', storm_data_subset$EVTYPE)
# STORM
storm_data_subset$EVTYPE <- gsub('.*STORM.*', 'STORM', storm_data_subset$EVTYPE)
# SUMMARY
storm_data_subset$EVTYPE <- gsub('.*SUMMARY.*', 'SUMMARY', storm_data_subset$EVTYPE)
# TORNADO
storm_data_subset$EVTYPE <- gsub('.*TORNADO.*', 'TORNADO', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*TORNDAO.*', 'TORNADO', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*LANDSPOUT.*', 'TORNADO', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*WATERSPOUT.*', 'TORNADO', storm_data_subset$EVTYPE)
# SURF
storm_data_subset$EVTYPE <- gsub('.*SURF.*', 'SURF', storm_data_subset$EVTYPE)
# VOLCANIC
storm_data_subset$EVTYPE <- gsub('.*VOLCANIC.*', 'VOLCANIC', storm_data_subset$EVTYPE)
# WET
storm_data_subset$EVTYPE <- gsub('.*WET.*', 'WET', storm_data_subset$EVTYPE)
# WIND
storm_data_subset$EVTYPE <- gsub('.*WIND.*', 'WIND', storm_data_subset$EVTYPE)
# WINTER
storm_data_subset$EVTYPE <- gsub('.*WINTER.*', 'WINTER', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*WINTRY.*', 'WINTER', storm_data_subset$EVTYPE)
storm_data_subset$EVTYPE <- gsub('.*SNOW.*', 'WINTER', storm_data_subset$EVTYPE)
Next I will convert the damage values appropriately by using *EXP values. We will start with converting the EXP values to UPPER case and also trim any of the whitespace before or after the values.
storm_data_subset$PROPDMGEXP <- toupper(storm_data_subset$PROPDMGEXP)
storm_data_subset$PROPDMGEXP <- trimws(storm_data_subset$PROPDMGEXP)
storm_data_subset$CROPDMGEXP <- toupper(storm_data_subset$CROPDMGEXP)
storm_data_subset$CROPDMGEXP <- trimws(storm_data_subset$CROPDMGEXP)
We will then create a function to find the appropriate multiplier based on EXP values.
find_multiplier <- function(exp) {
exp <- exp;
if (exp == "") return (10^0);
if (exp == "+") return (10^0);
if (exp == "-") return (10^0);
if (exp == "?") return (10^0);
if (exp == "H") return (10^2);
if (exp == "K") return (10^3);
if (exp == "M") return (10^6);
if (exp == "B") return (10^9);
if (exp == "0") return (10^0);
if (exp == "1") return (10^1);
if (exp == "2") return (10^2);
if (exp == "3") return (10^3);
if (exp == "4") return (10^4);
if (exp == "5") return (10^5);
if (exp == "6") return (10^6);
if (exp == "7") return (10^7);
if (exp == "8") return (10^8);
if (exp == "9") return (10^9);
return (NA);
}
Now we can actually multiply the values in the DMG values with the multiplier to find the actual dollar amount. We will leverage the function we created to make this easier.
storm_data_subset$PROPDMG <- storm_data_subset$PROPDMG * sapply(storm_data_subset$PROPDMGEXP, find_multiplier)
storm_data_subset$CROPDMG <- storm_data_subset$CROPDMG * sapply(storm_data_subset$CROPDMGEXP, find_multiplier)
Next we will deal with date columns to convert them to appropriate date format and also add a new column for the year.
storm_data_subset$BGN_DATE <- as.Date(storm_data_subset$BGN_DATE, format = "%m/%d/%Y")
storm_data_subset$END_DATE <- as.Date(storm_data_subset$END_DATE, format = "%m/%d/%Y")
storm_data_subset$YEAR <- as.integer(format(storm_data_subset$BGN_DATE, "%Y"))
Now that we have our data set prepared, we will proceed with answering some questions related to analysis.
In order to make the analysis easier, we will convert the damage numbers to millions and store them in new columns. We will use these 2 columns going forward in the rest of the analysis.
storm_data_subset$PROPDMG_MIL <- storm_data_subset$PROPDMG / 10^6
storm_data_subset$CROPDMG_MIL <- storm_data_subset$CROPDMG / 10^6
We will group our results next by event type and store this in our final dataframe.
final_data <- storm_data_subset %>%
group_by(EVTYPE) %>%
summarize(EVENTS = length(EVTYPE),
FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES),
PROP_DMG = sum(PROPDMG_MIL),
CROP_DMG = sum(CROPDMG_MIL)) %>%
arrange(desc(EVENTS))
We will add a new column to calculate the casualty total which will be adding of Fatalities and Injuries. We will then sort the final data by the casualty and pick Top 10 to plot it.
final_data$CASUALTY <- final_data$FATALITIES + final_data$INJURIES
final_data <- arrange(final_data, desc(CASUALTY))
top10_casualty_data <- head(final_data, n=10)
We will plot this data next.
plot1 <- top10_casualty_data %>%
select(EVTYPE, FATALITIES, INJURIES)
plot1 <- melt(plot1, id = c("EVTYPE"))
names(plot1) <- c("EV_TYPE", "CAS_TYPE", "VALUE")
pl1 <-ggplot(plot1, aes(fill = CAS_TYPE, y = VALUE, x = reorder(EV_TYPE, -VALUE))) +
geom_bar(position = "stack", stat = "identity") +
ggtitle("Population Health Impact of Weather Events") +
xlab("Event") + ylab("Number of Casualties")
print(pl1)
As seen by the plot, Tornado events by far the biggest impact on the population health followed by Wind, Heat and Flood.
We will start with adding a new column to sum of crop damage and property damage so that we can sort the data by this total damage.
final_data$TOTAL_DAMAGE <- final_data$CROP_DMG + final_data$PROP_DMG
final_data <- arrange(final_data, desc(TOTAL_DAMAGE))
top10_damage_data <- head(final_data, n=10)
We will plot this data next.
plot2 <- top10_damage_data %>%
select(EVTYPE, PROP_DMG, CROP_DMG)
plot2 <- melt(plot2, id = c("EVTYPE"))
names(plot2) <- c("EV_TYPE", "DMG_TYPE", "VALUE")
pl2 <-ggplot(plot2, aes(fill = DMG_TYPE, y = VALUE, x = reorder(EV_TYPE, -VALUE))) +
geom_bar(position = "stack", stat = "identity") +
ggtitle("Economic Consequences of Weather Events") +
xlab("Event") + ylab("Damage Dollar Amount in Millions") +
theme(axis.text.x = element_text(angle = 45), text = element_text(size=15))
print(pl2)
As seen from the 2nd plot, Floods have the highest economic impact followed by Hurricane, Storm and Tornado.