Tornadoes and Floods Are the Major Weather Events in the United States with Respect to Their Health and Economic Impacts
The goal of this work is to explore the NOAA Storm Database to answer two basic questions about severe weather events: which event types are most harmful to population health, and which have the greatest economic consequences. Based on the analysis, across the United States, tornadoes (as recorded in the EVTYPE variable) are most harmful with respect to population health, while floods have the greatest economic consequences.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Data source
The data for this work come as a comma-separated-value (CSV) file, compressed with the bzip2 algorithm to reduce its size, which can be downloaded from the following web site:
Storm Data [47Mb]
Documentation of the database is also available; it describes how some of the variables are constructed and defined.
Data Analysis
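Set up
# load the packages used in the analysis
# (dplyr for %>%, select, filter, mutate, summarise; ggplot2 for the plots)
library(dplyr)
library(ggplot2)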
Importing data
# download the file unless a cached copy already exists
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
path <- getwd()
destfile <- file.path(path, "StormData.csv.bz2")
if (!file.exists(destfile)) {
  message("Downloading file...")
  download.file(url, destfile, mode = "wb")
} else {
  message("Using cached file.")
}
## Using cached file.
Reading data
# read data (read.csv decompresses through the bzfile connection)
data <- read.csv(bzfile("StormData.csv.bz2"), header = TRUE, sep = ",")
# inspect data
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Processing data
Since we are only interested in the health and economic impacts of extreme weather events, only the relevant columns are kept using the “select” function. The data are then filtered to exclude events with neither health nor economic impacts.
# subset data: keep the impact columns and drop events with no recorded impact
tdata <- tibble::as_tibble(data)
subset_data <- tdata %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,
         CROPDMG, CROPDMGEXP) %>%
  filter(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
str(subset_data)
## tibble [254,633 × 7] (S3: tbl_df/tbl/data.frame)
## $ EVTYPE : chr [1:254633] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num [1:254633] 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num [1:254633] 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num [1:254633] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr [1:254633] "K" "K" "K" "K" ...
## $ CROPDMG : num [1:254633] 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr [1:254633] "" "" "" "" ...
Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. The following code maps each exponent character to its multiplier and uses it to calculate the dollar value of the damage.
# Finding the property/crop damage exponents and levels
unique(subset_data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
unique(subset_data$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k"
# Remove the white spaces, if any.
subset_data$PROPDMGEXP <- trimws(subset_data$PROPDMGEXP)
subset_data$CROPDMGEXP <- trimws(subset_data$CROPDMGEXP)
# Convert lowercase exponent letters to uppercase
subset_data$PROPDMGEXP <- toupper(subset_data$PROPDMGEXP)
subset_data$CROPDMGEXP <- toupper(subset_data$CROPDMGEXP)
# Build a lookup data.frame for converting exponent characters ("keynames")
# to magnitudes ("keyvalues"); blank and symbol codes are treated as 1
keynames <- c("H", "K", "M", "B",
              "0", "1", "2", "3", "4", "5", "6", "7", "8",
              "", "?", "+", "-")
keyvalues <- c(100, 1000, 1e+06, 1e+09,
               1, 10, 100, 1000, 10000, 100000, 1000000, 1e+07, 1e+08,
               1, 1, 1, 1)
map_df <- data.frame(exp = keynames,
                     val = keyvalues,
                     stringsAsFactors = FALSE)
subset_data$PROPDMGEKEY <- map_df$val[match(subset_data$PROPDMGEXP, map_df$exp)]
subset_data$CROPDMGEKEY <- map_df$val[match(subset_data$CROPDMGEXP, map_df$exp)]
# Fill any remaining NAs with 1 (as a safety measure)
subset_data$PROPDMGEKEY[is.na(subset_data$PROPDMGEKEY)] <- 1
subset_data$CROPDMGEKEY[is.na(subset_data$CROPDMGEKEY)] <- 1
# Calculate the dollar value of the property and crop damage
subset_data$PROPDMGEVAL <- subset_data$PROPDMGEKEY * subset_data$PROPDMG
subset_data$CROPDMGEVAL <- subset_data$CROPDMGEKEY * subset_data$CROPDMG
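As a minimal sketch of the same lookup idea, the exponent codes can also be held in a named vector and applied through a small helper. The names `exp_multiplier` and `to_dollars` are hypothetical and not part of the analysis above. The multipliers mirror the table used in this report, and any unlisted code (blank, "?", "+", "-") is assumed to mean a multiplier of 1.

```r
# Hypothetical lookup: exponent code -> multiplier (same values as map_df above)
codes <- c("H", "K", "M", "B", as.character(0:8))
mults <- c(1e2, 1e3, 1e6, 1e9, 10^(0:8))
exp_multiplier <- setNames(mults, codes)

# Hypothetical helper: convert (value, exponent code) pairs to dollar amounts
to_dollars <- function(value, code) {
  code <- toupper(trimws(code))          # normalise case and whitespace
  mult <- unname(exp_multiplier[code])   # NA for codes not in the lookup
  mult[is.na(mult)] <- 1                 # assumption: unknown codes mean "times 1"
  value * mult
}

# 25 with code "K" becomes 25000; 3 with code "2" becomes 300
to_dollars(c(25, 2.5, 1.5, 10, 3), c("K", "m", "B", "", "2"))
```

A named vector keeps the mapping and the lookup in one place, at the cost of silently treating every unrecognised code as 1, so it is worth checking `unique()` of the exponent columns first, as done above.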
Results
Estimation of impacts
Next, we estimate the health and economic impacts. The health impact is the sum of fatalities and injuries, while the economic impact is the sum of the dollar values of property damage and crop damage.
# adding two new columns, HEALTHIMP for health impact and TOTALDMG for total damages
subset_data <- subset_data %>%
  mutate(HEALTHIMP = FATALITIES + INJURIES) %>%
  mutate(TOTALDMG = PROPDMGEVAL + CROPDMGEVAL)
# check the values of new columns
summary(subset_data$HEALTHIMP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.6114 0.0000 1742.0000
summary(subset_data$TOTALDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 2.500e+03 1.000e+04 1.875e+06 5.000e+04 1.150e+11
Plotting
Then, the impact of each event type is calculated by summing the impact values of the individual events in that category. A graph is then plotted to show the 10 event types with the greatest impact.
# summation of the health impact values of individual events under each category
summary_table_hi <- subset_data %>%
  group_by(EVTYPE) %>%
  summarise(GRAND_TOTAL = sum(HEALTHIMP, na.rm = TRUE)) %>%
  slice_max(GRAND_TOTAL, n = 10) %>% # Get top 10
  ungroup()
head(summary_table_hi)
## # A tibble: 6 × 2
## EVTYPE GRAND_TOTAL
## <chr> <dbl>
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
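To make the group-and-rank step concrete, here is a small base R sketch of the same idea on made-up data. The data frame `toy` and its values are invented for illustration only, and `aggregate()` stands in for the `group_by()`/`summarise()` pair used in this report.

```r
# Toy data (invented values): one row per event
toy <- data.frame(
  EVTYPE    = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  HEALTHIMP = c(15, 2, 3, 7, 1)
)

# Sum the health impact within each event type
totals <- aggregate(HEALTHIMP ~ EVTYPE, data = toy, FUN = sum)

# Rank event types from largest to smallest total and keep the top 2
totals <- totals[order(-totals$HEALTHIMP), ]
top2 <- head(totals, 2)
top2
```

In this toy case TORNADO (15 + 3 = 18) ranks first and HEAT (7) second, ahead of FLOOD (2 + 1 = 3).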
# plotting a graph to show the top 10 weather events most harmful to population health
summary_table_hi %>%
  ggplot(aes(x = reorder(EVTYPE, GRAND_TOTAL), y = GRAND_TOTAL)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 10 Most Harmful Weather Events for Population Health",
       x = "Event Type",
       y = "Health Impact (fatalities + injuries)")
# summation of the economic impact values of individual events under each category
summary_table_dmg <- subset_data %>%
  group_by(EVTYPE) %>%
  summarise(GRAND_TOTAL = sum(TOTALDMG, na.rm = TRUE)) %>%
  slice_max(GRAND_TOTAL, n = 10) %>% # Get top 10
  ungroup()
head(summary_table_dmg)
## # A tibble: 6 × 2
## EVTYPE GRAND_TOTAL
## <chr> <dbl>
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57362333946.
## 4 STORM SURGE 43323541000
## 5 HAIL 18761221986.
## 6 FLASH FLOOD 18243991078.
# plotting a graph to show the top 10 most costly weather events
summary_table_dmg %>%
  ggplot(aes(x = reorder(EVTYPE, GRAND_TOTAL), y = GRAND_TOTAL)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 10 Most Costly Weather Events",
       x = "Event Type",
       y = "Total Damage ($)")
Based on the graphs, across the United States tornadoes are most harmful with respect to population health, accounting for 96,979 fatalities and injuries combined, while floods have the greatest economic consequences, with losses of about USD 150 billion.
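The headline figures can be checked with a little arithmetic on the totals printed in the tables above. The sketch below retypes those totals by hand (it does not recompute them from the database); the variable names are hypothetical.

```r
# Totals copied from the summary tables in this report
health_totals <- c("TORNADO" = 96979, "EXCESSIVE HEAT" = 8428)
damage_totals <- c("FLOOD" = 150319678257, "HURRICANE/TYPHOON" = 71913712800)

# Express damage in billions of USD, rounded to one decimal place
damage_billions <- round(damage_totals / 1e9, 1)
damage_billions

# How many times larger is the tornado health impact than the runner-up?
health_ratio <- round(health_totals[["TORNADO"]] / health_totals[["EXCESSIVE HEAT"]], 1)
health_ratio
```

Flood damage comes to roughly 150.3 billion USD, about twice the hurricane/typhoon total, and the tornado health impact is about 11.5 times that of excessive heat.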