For more than half a century, storms and other severe weather events have caused serious harm to both public health and the economy. Using event records from the National Oceanic and Atmospheric Administration (NOAA), it is possible to tally these events and their consequences to support future prevention efforts. This analysis uses R (in RStudio) to find the weather events associated with the greatest casualties and the greatest damages. The results indicate that tornadoes cause the most casualties, while floods and hurricanes cause the greatest economic losses. Tornado and flood preparedness are therefore recommended to government units to minimize casualties and damages.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Here are the analysis questions to be answered:
- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
- Across the United States, which types of events have the greatest economic consequences?
The objective of this analysis is therefore to determine which types of events are most harmful to population health (measured by the columns FATALITIES and INJURIES) and which cause the greatest property and crop losses (measured by the columns PROPDMG and CROPDMG). All of these columns are available in the database (see Data Processing).
For documentation, here are the R version and the libraries in use at the time of the last render. I do not know whether differences in these versions would affect the reproducibility of this analysis.
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows >= 8 x64 (build 9200)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
## system code page: 932
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] knitr_1.29 magrittr_1.5 tidyselect_1.1.0 munsell_0.5.0
## [5] colorspace_1.4-1 R6_2.4.1 rlang_0.4.7 stringr_1.4.0
## [9] dplyr_1.0.2 tools_4.0.2 grid_4.0.2 gtable_0.3.0
## [13] xfun_0.16 withr_2.2.0 htmltools_0.5.0 ellipsis_0.3.1
## [17] yaml_2.2.1 digest_0.6.25 tibble_3.0.3 lifecycle_0.2.0
## [21] crayon_1.3.4 purrr_0.3.4 vctrs_0.3.4.9000 glue_1.4.2
## [25] evaluate_0.14 rmarkdown_2.3 stringi_1.4.6 compiler_4.0.2
## [29] pillar_1.4.6 generics_0.0.2 scales_1.1.1 pkgconfig_2.0.3
First, the dataset must be downloaded from the NOAA website (or from Coursera’s Peer-Graded Assignment page). In this analysis, the NOAA storm data file is named Stormdata.csv.bz2.
The storm data file can be loaded in full with a plain read.csv() call, which results in a data frame with 902297 rows and 37 columns.
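As a minimal sketch of such a full load: read.csv() decompresses .bz2 files transparently, so no separate unzip step is needed. The tiny stand-in file below is hypothetical and is written to a temporary path only so that the example is self-contained. With the real data, the call would simply be read.csv("Stormdata.csv.bz2").

```r
# Hypothetical stand-in for the storm file: three columns, two rows.
toy <- data.frame(STATE = c("AL", "AL"),
                  EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(0, 1))

# Write it as a bzip2-compressed CSV in a temporary location.
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and reads the file directly.
full <- read.csv(path)
dim(full)  # number of rows and columns of the loaded data frame
```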
However, it is better to load only the necessary columns to save memory and loading time. The code below uses colClasses to keep only the columns needed for the analysis: a class of "NULL" skips a column entirely. As a safety measure, the data frame is also copied to backup.df as a fallback in case something odd happens (which is unlikely in this analysis).
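# Keep only column 8 (EVTYPE) and columns 23 to 28 (FATALITIES through
# CROPDMGEXP); a class of "NULL" skips a column while reading.
dataframe <- read.csv("Stormdata.csv.bz2", na.strings = c("","NA")
,colClasses = c(rep("NULL",7),"character",rep("NULL",14)
,rep("numeric",3),"character","numeric"
,"character",rep("NULL",9)))
# Untouched copy to fall back on if a later step goes wrong.
backup.df <- dataframe
head(dataframe)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP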
## 1 TORNADO 0 15 25.0 K 0 <NA>
## 2 TORNADO 0 0 2.5 K 0 <NA>
## 3 TORNADO 0 2 25.0 K 0 <NA>
## 4 TORNADO 0 2 2.5 K 0 <NA>
## 5 TORNADO 0 2 2.5 K 0 <NA>
## 6 TORNADO 0 6 2.5 K 0 <NA>
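To see how a colClasses vector selects columns, here is a small sketch on made-up inline data (the three-column layout and its values are hypothetical, not taken from the storm file). Each entry of colClasses lines up with one column of the file, and "NULL" drops that column while the file is read.

```r
# Hypothetical CSV text with three columns; only two are wanted.
csv_text <- "STATE,EVTYPE,FATALITIES
AL,TORNADO,0
AL,HAIL,1"

# One colClasses entry per column: skip STATE, keep the other two.
subset_df <- read.csv(text = csv_text,
                      colClasses = c("NULL", "character", "numeric"))
names(subset_df)
```

The same idea scales to the 37-column storm file, where rep("NULL", n) is a compact way to skip a run of n unwanted columns.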
For column descriptions, see the documentation from the National Weather Service.
This section explores the event types present in the database. This matters because the rows are later grouped by event type and summed for comparison (see Results). The first 50 distinct values of EVTYPE are listed below.
## [1] "TORNADO" "TSTM WIND"
## [3] "HAIL" "FREEZING RAIN"
## [5] "SNOW" "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE" "WINTER STORM"
## [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"
## [11] "RECORD COLD" "HURRICANE ERIN"
## [13] "HURRICANE OPAL" "HEAVY RAIN"
## [15] "LIGHTNING" "THUNDERSTORM WIND"
## [17] "DENSE FOG" "RIP CURRENT"
## [19] "THUNDERSTORM WINS" "FLASH FLOOD"
## [21] "FLASH FLOODING" "HIGH WINDS"
## [23] "FUNNEL CLOUD" "TORNADO F0"
## [25] "THUNDERSTORM WINDS LIGHTNING" "THUNDERSTORM WINDS/HAIL"
## [27] "HEAT" "WIND"
## [29] "LIGHTING" "HEAVY RAINS"
## [31] "LIGHTNING AND HEAVY RAIN" "FUNNEL"
## [33] "WALL CLOUD" "FLOODING"
## [35] "THUNDERSTORM WINDS HAIL" "FLOOD"
## [37] "COLD" "HEAVY RAIN/LIGHTNING"
## [39] "FLASH FLOODING/THUNDERSTORM WI" "WALL CLOUD/FUNNEL CLOUD"
## [41] "THUNDERSTORM" "WATERSPOUT"
## [43] "EXTREME COLD" "HAIL 1.75)"
## [45] "LIGHTNING/HEAVY RAIN" "HIGH WIND"
## [47] "BLIZZARD" "BLIZZARD WEATHER"
## [49] "WIND CHILL" "BREAKUP FLOODING"
Viewed in full, the list contains many misspellings, extra spaces, and unnecessary remarks, all of which get in the way of grouping events together. For simplicity, only the main events listed below are compared in this analysis. This introduces some inaccuracy in the totals, but a thorough cleanup of this variable is beyond the scope of the analysis.
events <- c("TORNADO","THUNDERSTORM","HURRICANE","FLOOD"
,"SNOW","TSUNAMI","HAIL"
,"HEAT","WILDFIRE","RAIN")
This character vector is used with grep to roughly group related labels under a common name. The grouping is not 100% accurate, because some events are missed by this code. A thorough cleaning of this column would be ideal for a future analysis.
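# Relabel each EVTYPE that contains a keyword (case-sensitive) with that keyword.
for(event in events) dataframe[grep(event,dataframe$EVTYPE),"EVTYPE"] <- event
# Drop the rows that matched no keyword.
dataframe <- dataframe[dataframe$EVTYPE %in% events,]
To check the values in the EVTYPE column again: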
## [1] "TORNADO" "HAIL" "RAIN" "SNOW" "FLOOD"
## [6] "HURRICANE" "THUNDERSTORM" "HEAT" "WILDFIRE" "TSUNAMI"
As a result, EVTYPE is restricted to these ten event types, which keeps the analysis simple. They are compared by casualties and by property and crop damage, in line with the analysis questions.
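To make the behavior of this grouping concrete, here is a small sketch on a made-up vector of labels. Most of the labels are borrowed from the listing of event types earlier in this section. The lower-case "tornado" is a hypothetical addition for illustration. Because the keywords are applied in order, a combined label such as "THUNDERSTORM WINDS/HAIL" ends up under whichever keyword matches first, and labels that match no keyword are dropped.

```r
events <- c("TORNADO", "THUNDERSTORM", "HURRICANE", "FLOOD",
            "SNOW", "TSUNAMI", "HAIL", "HEAT", "WILDFIRE", "RAIN")

# Toy labels; the lower-case "tornado" is a hypothetical example.
labels <- c("TORNADO F0", "THUNDERSTORM WINDS/HAIL", "FLASH FLOODING",
            "HEAVY RAINS", "TSTM WIND", "tornado")

# Same relabeling loop as in the analysis, applied to a plain vector.
grouped <- labels
for (event in events) grouped[grep(event, grouped)] <- event

# Labels that matched a keyword, and the original labels that were lost.
kept <- grouped[grouped %in% events]
dropped <- labels[!(grouped %in% events)]
kept
dropped
```

In this toy run, "TSTM WIND" is lost even though it abbreviates a thunderstorm wind event, and "tornado" is lost because grep is case-sensitive by default. Both are examples of how the rough grouping can undercount.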
To subset the data further, it is split by analysis question: the data frame health addresses question 1, and economy addresses question 2.
The health data frame gets an additional column named CASUALTIES, which is the sum of FATALITIES and INJURIES. In this analysis, a “casualty” is defined as an injury or death caused by an event.
health <- dataframe[,c("EVTYPE","FATALITIES","INJURIES")]
health$CASUALTIES <- health$FATALITIES + health$INJURIES
head(health)
## EVTYPE FATALITIES INJURIES CASUALTIES
## 1 TORNADO 0 15 15
## 2 TORNADO 0 0 0
## 3 TORNADO 0 2 2
## 4 TORNADO 0 2 2
## 5 TORNADO 0 2 2
## 6 TORNADO 0 6 6
Before subsetting economy, the function multiplier below is defined to convert the codes in PROPDMGEXP and CROPDMGEXP into numeric form. The mapping from codes to numbers is based on this analysis, and the resulting numbers are used as multipliers for PROPDMG and CROPDMG respectively.
multiplier <- function(x) {
# Missing codes and "?" become 0; "+" and "-" become 1.
x[is.na(x)] <- as.character(0)
x <- gsub("\\?","0",x)
x <- gsub("\\+|\\-","1",x)
# Letter codes: hundreds, thousands, millions, billions (either case).
x <- gsub("H|h","100",x)
x <- gsub("K|k","1000",x)
x <- gsub("M|m","1000000",x)
x <- gsub("B|b","1000000000",x)
# Any other code, such as a digit, is passed to as.numeric() unchanged.
return(as.numeric(x))
}
To apply the function above, the code below builds a clean data frame similar to health. The multipliers are stored in the columns PROPMULT and CROPMULT, and each is multiplied by its corresponding PROPDMG or CROPDMG value.
economy <- dataframe[,c("EVTYPE"
,"PROPDMG","PROPDMGEXP"
,"CROPDMG","CROPDMGEXP")]
economy$PROPMULT <- multiplier(economy$PROPDMGEXP)
economy$CROPMULT <- multiplier(economy$CROPDMGEXP)
economy$PROPDMG <- economy$PROPDMG * economy$PROPMULT
economy$CROPDMG <- economy$CROPDMG * economy$CROPMULT
economy$TOTALDMG <- economy$PROPDMG + economy$CROPDMG
economy <- economy[,c("EVTYPE","PROPDMG","CROPDMG","TOTALDMG")]
head(economy)
## EVTYPE PROPDMG CROPDMG TOTALDMG
## 1 TORNADO 25000 0 25000
## 2 TORNADO 2500 0 2500
## 3 TORNADO 25000 0 25000
## 4 TORNADO 2500 0 2500
## 5 TORNADO 2500 0 2500
## 6 TORNADO 2500 0 2500
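As a quick check of how the exponent codes translate into dollar amounts, the sketch below repeats the multiplier function and applies it to a few sample values. The first pair (25.0 with code K) comes from the first row of the data. The other values are made up for illustration. Note that a missing code yields a multiplier of 0, so the corresponding damage value is counted as zero.

```r
# Copy of the multiplier function defined earlier in the analysis.
multiplier <- function(x) {
  x[is.na(x)] <- as.character(0)
  x <- gsub("\\?", "0", x)
  x <- gsub("\\+|\\-", "1", x)
  x <- gsub("H|h", "100", x)
  x <- gsub("K|k", "1000", x)
  x <- gsub("M|m", "1000000", x)
  x <- gsub("B|b", "1000000000", x)
  return(as.numeric(x))
}

# Sample damage values and codes; all but the first pair are hypothetical.
dmg <- c(25.0, 2.5, 1.2, 3.0)
code <- c("K", "M", "B", NA)

# Damage in USD: value times the multiplier implied by its code.
usd <- dmg * multiplier(code)
usd
```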
To sum the response variables by event type, the function aggregate() from base R is used. This results in two new data frames named health.summary and economy.summary. Note that the values in economy.summary are in USD.
health.summary <- aggregate(.~ EVTYPE, data = health, FUN = sum)
economy.summary <- aggregate(.~ EVTYPE, data = economy, FUN = sum)
This section presents the summary tables and plots from the data processing above. Each table or plot is accompanied by a brief description.
## EVTYPE FATALITIES INJURIES CASUALTIES
## 8 TORNADO 5661 91407 97068
## 3 HEAT 3138 9154 12292
## 1 FLOOD 1523 8603 10126
## 7 THUNDERSTORM 210 2479 2689
## 2 HAIL 20 1466 1486
## 4 HURRICANE 135 1326 1461
## 6 SNOW 167 1161 1328
## 10 WILDFIRE 75 911 986
## 5 RAIN 108 299 407
## 9 TSUNAMI 33 129 162
In every column, tornado is the most harmful weather event with respect to population health. The gap to heat, the second-ranked event, is large: about 85000 casualties. At the other end, tsunami has the fewest casualties over the period covered.
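The summary table appears to be sorted by CASUALTIES in descending order, although the sorting step itself is not shown. A minimal sketch of the aggregate-then-sort pattern, using a made-up toy data frame rather than the storm data, might look like this:

```r
# Toy data: hypothetical rows standing in for the health data frame.
toy <- data.frame(EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
                  FATALITIES = c(1, 2, 0, 1, 3),
                  INJURIES = c(10, 4, 5, 0, 1))
toy$CASUALTIES <- toy$FATALITIES + toy$INJURIES

# Sum every numeric column within each event type.
toy.summary <- aggregate(. ~ EVTYPE, data = toy, FUN = sum)

# Sort so that the event type with the most casualties comes first.
toy.summary <- toy.summary[order(toy.summary$CASUALTIES, decreasing = TRUE), ]
toy.summary
```

Because aggregate() returns the groups in alphabetical order, the row names that survive the sort (such as the 8 next to TORNADO in the real table) reflect each event's position before sorting.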
To visualize, here is a plot with a logarithmic scale for clarity.
ggplot(data = health.summary, mapping = aes(x=EVTYPE, y=CASUALTIES)) +
geom_bar(stat = "identity") +
scale_y_log10() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
ggtitle("Number of Casualties by Weather Events From 1950 to 2011"
,subtitle = "Log10 Scale") +
xlab("Event Types") +
ylab("Number of Casualties")
On the logarithmic y axis, each step of equal height corresponds to multiplying by 10 rather than adding a fixed amount, which keeps groups with very different totals readable on one plot. Data that grow exponentially appear as a straight line on this scale.
As mentioned above, tornadoes have the highest number of casualties. That the gap remains clearly visible even on a logarithmic scale indicates how much larger the tornado total is than those of the other weather events.
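To make the logarithmic scale concrete, the sketch below takes three CASUALTIES totals from the health summary table and computes their base-10 logarithms, which determine where the tops of the bars sit on the log10 axis. The variable names are illustrative only.

```r
# CASUALTIES totals copied from the health summary table.
casualties <- c(TORNADO = 97068, HEAT = 12292, TSUNAMI = 162)

# Positions of the bar tops on the log10 axis.
log_heights <- log10(casualties)
round(log_heights, 2)

# A difference of 1 on the log10 axis means a tenfold difference in casualties,
# so the raw ratio between the largest and smallest totals is much larger
# than the plot suggests at first glance.
ratio <- unname(casualties["TORNADO"] / casualties["TSUNAMI"])
ratio
```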
## EVTYPE PROPDMG CROPDMG TOTALDMG
## 1 FLOOD 167378619958 12352059100 179730679058
## 4 HURRICANE 84756180010 5515292800 90271472810
## 8 TORNADO 58593098301 417461360 59010559661
## 2 HAIL 16018899870 3111583850 19130483720
## 7 THUNDERSTORM 6432588578 653005300 7085593878
## 10 WILDFIRE 4865614000 295972800 5161586800
## 5 RAIN 3233041190 804662800 4037703990
## 6 SNOW 1025424749 134663100 1160087849
## 3 HEAT 20325750 904469280 924795030
## 9 TSUNAMI 144062000 20000 144082000
In terms of economic losses, flood and hurricane have the greatest losses, together more than 200 billion USD over a span of more than half a century. Next in rank is tornado, with about 59 billion USD in crop and property damage. To further illustrate this summary, the code below draws the corresponding plot, also on a logarithmic scale.
ggplot(data = economy.summary, mapping = aes(x=EVTYPE, y=TOTALDMG)) +
geom_bar(stat = "identity") +
scale_y_log10() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
ggtitle("Property and Crop Damages by Weather Events From 1950 to 2011"
,subtitle = "Log10 Scale") +
xlab("Event Types") +
ylab("Total Crop and Property Damage (in USD)")
Although the bars appear close to each other, the actual differences are large: the highest and lowest totals differ by a factor of more than 1000. As shown in the previous summary, flood has the greatest property and crop damage and therefore the greatest economic consequences, followed by hurricane and tornado.
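As a rough check on these magnitudes, the sketch below re-enters four TOTALDMG values from the economy summary table and expresses them in billions of USD, along with the ratio between the largest and smallest totals. The vector and variable names are illustrative.

```r
# TOTALDMG values (in USD) copied from the economy summary table.
total_dmg <- c(FLOOD = 179730679058, HURRICANE = 90271472810,
               TORNADO = 59010559661, TSUNAMI = 144082000)

# Totals in billions of USD, rounded to one decimal place.
round(total_dmg / 1e9, 1)

# Combined flood and hurricane losses, in billions of USD.
combined <- sum(total_dmg[c("FLOOD", "HURRICANE")]) / 1e9
combined

# How many times larger the flood total is than the tsunami total.
spread <- unname(total_dmg["FLOOD"] / total_dmg["TSUNAMI"])
spread
```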
Given these descriptive results, preventive measures should be taken. Government units should prioritize flood and tornado preparedness to minimize casualties and economic damage.
However, these results should be taken with a grain of salt. The totals are inaccurate to some degree because not every weather event was grouped in this analysis. In fact, 323602 cases were left unused, either because they do not belong to the selected events or because the grep function missed them. A more thorough analysis is recommended for more accurate results.
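For scale, the unused cases can be set against the 902297 rows of the full dataset. The sketch below does this arithmetic and then shows, on a made-up vector of labels, one possible way to rank leftover labels by frequency so that a future cleanup could start with the most common ones. The toy labels and variable names are assumptions for illustration, not results from the storm data.

```r
# Row counts reported in this analysis.
total_rows <- 902297
unused_rows <- 323602

# Rows that entered the summaries, and the share that did not (in percent).
used_rows <- total_rows - unused_rows
unused_share <- round(100 * unused_rows / total_rows, 1)
used_rows
unused_share

# Hypothetical leftover labels, ranked from most to least frequent.
leftover <- c("TSTM WIND", "HIGH WIND", "TSTM WIND", "LIGHTNING", "TSTM WIND")
ranked <- sort(table(leftover), decreasing = TRUE)
names(ranked)[1]
```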