EDA on Total Casualties and Damages from NOAA Storm Database

Synopsis

For more than half a century, storms and other weather events have caused severe damage to both public health and the economy. Using the event records kept by the National Oceanic and Atmospheric Administration (NOAA), these events can be tracked and tallied to inform future prevention. This analysis used R (RStudio) to find the weather events with the greatest damages or casualties. Results indicate that tornadoes cause the most casualties, while floods and hurricanes cause the greatest economic losses. Government units are advised to prioritize tornado and flood preparedness to minimize casualties and damages.

Background

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Here are the analysis questions to be answered:

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

2. Across the United States, which types of events have the greatest economic consequences?

The objective of this analysis is therefore to determine which types of events are most harmful to population health (columns FATALITIES and INJURIES) and which cause the greatest property and crop losses (columns PROPDMG and CROPDMG). All of these columns are in the database (see Data Processing).

Data Processing

For documentation, here are the libraries and versions (including R) in use when this report was last rendered. I do not know whether differences in versions affect the reproducibility of this analysis.

library(ggplot2)
sessionInfo()

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows >= 8 x64 (build 9200)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## system code page: 932
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.2
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.29       magrittr_1.5     tidyselect_1.1.0 munsell_0.5.0   
##  [5] colorspace_1.4-1 R6_2.4.1         rlang_0.4.7      stringr_1.4.0   
##  [9] dplyr_1.0.2      tools_4.0.2      grid_4.0.2       gtable_0.3.0    
## [13] xfun_0.16        withr_2.2.0      htmltools_0.5.0  ellipsis_0.3.1  
## [17] yaml_2.2.1       digest_0.6.25    tibble_3.0.3     lifecycle_0.2.0 
## [21] crayon_1.3.4     purrr_0.3.4      vctrs_0.3.4.9000 glue_1.4.2      
## [25] evaluate_0.14    rmarkdown_2.3    stringi_1.4.6    compiler_4.0.2  
## [29] pillar_1.4.6     generics_0.0.2   scales_1.1.1     pkgconfig_2.0.3

First, the dataset must be downloaded from the NOAA website (or from Coursera’s Peer-Graded Assignment page). In this analysis, the NOAA storm data file is named Stormdata.csv.bz2.

The storm data file can be loaded in full with the code below. This results in a data frame with 902297 rows and 37 columns.

dataframe <- read.csv("Stormdata.csv.bz2", na.strings = c("","NA"))

However, it is better to load only the necessary columns to save memory and loading time. The code below uses colClasses to do this: a column whose class is given as "NULL" is skipped when reading. As a safety measure, dataframe is copied to backup.df as a fallback in case something odd happens (which is unlikely for this analysis).

dataframe <- read.csv("Stormdata.csv.bz2", na.strings = c("","NA")
                      ,colClasses = c(rep("NULL",7),"character",rep("NULL",14)
                                      ,rep("numeric",3),"character","numeric"
                                      ,"character",rep("NULL",9)))
backup.df <- dataframe
head(dataframe)

##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0       <NA>
## 2 TORNADO          0        0     2.5          K       0       <NA>
## 3 TORNADO          0        2    25.0          K       0       <NA>
## 4 TORNADO          0        2     2.5          K       0       <NA>
## 5 TORNADO          0        2     2.5          K       0       <NA>
## 6 TORNADO          0        6     2.5          K       0       <NA>
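The column-skipping behaviour of colClasses can be checked on a small inline CSV before running it on the full file. This is a minimal sketch: the four-column text and the name toy are made up for illustration and are not part of the storm data.

```r
# Toy CSV with four columns; only EVTYPE and FATALITIES are wanted.
# (Hypothetical data, used only to illustrate colClasses.)
csv_text <- "STATE,EVTYPE,FATALITIES,REMARKS
AL,TORNADO,0,none
TX,HAIL,1,
"

# "NULL" (as a string) tells read.csv to skip that column entirely.
toy <- read.csv(text = csv_text, na.strings = c("", "NA"),
                colClasses = c("NULL", "character", "numeric", "NULL"))

str(toy)  # only EVTYPE (character) and FATALITIES (numeric) remain
```

The same idea scales to the 37-column file: the colClasses vector needs exactly one entry per column, in file order.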

For column descriptions, see the documentation from the National Weather Service.

In this section, we explore the event types present in the database. This matters because rows are later grouped by event type and summed for comparison (see Results).

head(unique(dataframe$EVTYPE), n = 50)

##  [1] "TORNADO"                        "TSTM WIND"                     
##  [3] "HAIL"                           "FREEZING RAIN"                 
##  [5] "SNOW"                           "ICE STORM/FLASH FLOOD"         
##  [7] "SNOW/ICE"                       "WINTER STORM"                  
##  [9] "HURRICANE OPAL/HIGH WINDS"      "THUNDERSTORM WINDS"            
## [11] "RECORD COLD"                    "HURRICANE ERIN"                
## [13] "HURRICANE OPAL"                 "HEAVY RAIN"                    
## [15] "LIGHTNING"                      "THUNDERSTORM WIND"             
## [17] "DENSE FOG"                      "RIP CURRENT"                   
## [19] "THUNDERSTORM WINS"              "FLASH FLOOD"                   
## [21] "FLASH FLOODING"                 "HIGH WINDS"                    
## [23] "FUNNEL CLOUD"                   "TORNADO F0"                    
## [25] "THUNDERSTORM WINDS LIGHTNING"   "THUNDERSTORM WINDS/HAIL"       
## [27] "HEAT"                           "WIND"                          
## [29] "LIGHTING"                       "HEAVY RAINS"                   
## [31] "LIGHTNING AND HEAVY RAIN"       "FUNNEL"                        
## [33] "WALL CLOUD"                     "FLOODING"                      
## [35] "THUNDERSTORM WINDS HAIL"        "FLOOD"                         
## [37] "COLD"                           "HEAVY RAIN/LIGHTNING"          
## [39] "FLASH FLOODING/THUNDERSTORM WI" "WALL CLOUD/FUNNEL CLOUD"       
## [41] "THUNDERSTORM"                   "WATERSPOUT"                    
## [43] "EXTREME COLD"                   "HAIL 1.75)"                    
## [45] "LIGHTNING/HEAVY RAIN"           "HIGH WIND"                     
## [47] "BLIZZARD"                       "BLIZZARD WEATHER"              
## [49] "WIND CHILL"                     "BREAKUP FLOODING"

When viewed in full, the list contains misspellings, extra spaces, unnecessary remarks, and similar inconsistencies, all of which are detrimental to grouping rows together. For simplicity, only the main events below are compared in this analysis. The resulting numbers will therefore be somewhat inaccurate, but a thorough cleanup of this variable is beyond the scope of this analysis.

events <- c("TORNADO","THUNDERSTORM","HURRICANE","FLOOD"
            ,"SNOW","TSUNAMI","HAIL"
            ,"HEAT","WILDFIRE","RAIN")

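This character vector is used with grep to roughly group matching EVTYPE values under a common name. Each pattern is matched as a case-sensitive substring, and the events are applied in the order listed, so a value such as “THUNDERSTORM WINDS/HAIL” is assigned to the first event it matches. This is not 100% accurate: some events are missed by this code (for example, “TSTM WIND” contains none of the patterns). A thorough cleaning of this column would be ideal for future analysis.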

for(event in events) dataframe[grep(event,dataframe$EVTYPE),"EVTYPE"] <- event
dataframe <- dataframe[dataframe$EVTYPE %in% events,]

To check the elements in the EVTYPE column again:

unique(dataframe$EVTYPE)

##  [1] "TORNADO"      "HAIL"         "RAIN"         "SNOW"         "FLOOD"       
##  [6] "HURRICANE"    "THUNDERSTORM" "HEAT"         "WILDFIRE"     "TSUNAMI"

As a result, EVTYPE is restricted to these events for the simplicity of this analysis. They are used to compare casualties and property damages in accordance with the analysis questions.

To further subset the data, it is split by the question it answers. The data frame health addresses analysis question 1, while economy addresses analysis question 2.

An additional column named CASUALTIES holds the sum of FATALITIES and INJURIES. “Casualty”, in this analysis, is defined as an injury or death from an event.
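To see what the grep-based grouping does and does not catch, the same loop can be run on a handful of labels. This is only a sketch: the vector raw is hypothetical, chosen to resemble the EVTYPE values listed earlier.

```r
# Hypothetical raw labels, similar in spirit to the EVTYPE values above.
raw <- c("TORNADO F0", "THUNDERSTORM WINDS/HAIL", "TSTM WIND",
         "Heavy Rain", "FLASH FLOOD", "HAIL 1.75)")

events <- c("TORNADO", "THUNDERSTORM", "HURRICANE", "FLOOD", "SNOW",
            "TSUNAMI", "HAIL", "HEAT", "WILDFIRE", "RAIN")

# Same loop as in the analysis: overwrite every label containing the
# pattern with the bare event name (case-sensitive substring match).
grouped <- raw
for (event in events) grouped[grep(event, grouped)] <- event

kept <- grouped %in% events
data.frame(raw, grouped, kept)

sum(!kept)  # number of labels that no pattern matched
```

Here “THUNDERSTORM WINDS/HAIL” ends up as THUNDERSTORM because that pattern is tried before HAIL, while “TSTM WIND” and the mixed-case “Heavy Rain” match nothing and would be dropped by the subsetting step.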

health <- dataframe[,c("EVTYPE","FATALITIES","INJURIES")]
health$CASUALTIES <- health$FATALITIES + health$INJURIES
head(health)

##    EVTYPE FATALITIES INJURIES CASUALTIES
## 1 TORNADO          0       15         15
## 2 TORNADO          0        0          0
## 3 TORNADO          0        2          2
## 4 TORNADO          0        2          2
## 5 TORNADO          0        2          2
## 6 TORNADO          0        6          6

Before subsetting economy, the function multiplier below is defined to convert the codes in PROPDMGEXP and CROPDMGEXP into numeric form. The conversion of these codes is based on this analysis, and the result is used as a multiplier for PROPDMG and CROPDMG respectively.

multiplier <- function(x) {
    x[is.na(x)] <- as.character(0)   # missing code: multiplier 0
    x <- gsub("\\?","0",x)           # "?": multiplier 0
    x <- gsub("\\+|\\-","1",x)       # "+" or "-": multiplier 1
    x <- gsub("H|h","100",x)         # hundreds
    x <- gsub("K|k","1000",x)        # thousands
    x <- gsub("M|m","1000000",x)     # millions
    x <- gsub("B|b","1000000000",x)  # billions
    return(as.numeric(x))            # digit codes pass through unchanged
}

The code below applies this function to create a clean data frame similar to health. The multipliers are stored in the columns PROPMULT and CROPMULT, and PROPDMG and CROPDMG are each multiplied by their corresponding multiplier.

economy <- dataframe[,c("EVTYPE"
                        ,"PROPDMG","PROPDMGEXP"
                        ,"CROPDMG","CROPDMGEXP")]

economy$PROPMULT <- multiplier(economy$PROPDMGEXP)
economy$CROPMULT <- multiplier(economy$CROPDMGEXP)

economy$PROPDMG <- economy$PROPDMG * economy$PROPMULT
economy$CROPDMG <- economy$CROPDMG * economy$CROPMULT

economy$TOTALDMG <- economy$PROPDMG + economy$CROPDMG
economy <- economy[,c("EVTYPE","PROPDMG","CROPDMG","TOTALDMG")]
head(economy)

##    EVTYPE PROPDMG CROPDMG TOTALDMG
## 1 TORNADO   25000       0    25000
## 2 TORNADO    2500       0     2500
## 3 TORNADO   25000       0    25000
## 4 TORNADO    2500       0     2500
## 5 TORNADO    2500       0     2500
## 6 TORNADO    2500       0     2500
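As a quick check of the conversion, multiplier can be applied to one example of each kind of code. The function is restated here only so the snippet runs on its own; the vector codes is made up for illustration.

```r
# Restated from the analysis so this snippet is self-contained.
multiplier <- function(x) {
    x[is.na(x)] <- as.character(0)
    x <- gsub("\\?", "0", x)
    x <- gsub("\\+|\\-", "1", x)
    x <- gsub("H|h", "100", x)
    x <- gsub("K|k", "1000", x)
    x <- gsub("M|m", "1000000", x)
    x <- gsub("B|b", "1000000000", x)
    return(as.numeric(x))
}

# One illustrative example per kind of exponent code.
codes <- c("K", "m", "B", "h", "+", "-", "?", NA, "5")
data.frame(code = codes, multiplier = multiplier(codes))

# A PROPDMG of 2.5 with exponent "K" becomes 2500 USD,
# matching the second row of the economy table above.
2.5 * multiplier("K")
```

Note that a missing or “?” code yields a multiplier of 0, so the corresponding damage value contributes nothing to the totals, and a digit code such as “5” is used as the multiplier itself.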

To sum the response variables by event type, the function aggregate() from base R is used. This results in two new data frames named health.summary and economy.summary. Note that the values in economy.summary are in USD.

health.summary <- aggregate(.~ EVTYPE, data = health, FUN = sum)
economy.summary <- aggregate(.~ EVTYPE, data = economy, FUN = sum)
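The behaviour of the formula interface `. ~ EVTYPE` can be seen on a toy data frame. This is a sketch with invented numbers; the NA is included on purpose to show that aggregate's formula method omits incomplete rows by default.

```r
# Toy data (hypothetical values), one row per recorded event.
toy <- data.frame(
    EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "FLOOD", "HEAT"),
    FATALITIES = c(1, 0, 2, 3, NA),
    INJURIES   = c(10, 5, 0, 1, 4)
)

# "." means "every other column": each one is summed within each EVTYPE.
toy.summary <- aggregate(. ~ EVTYPE, data = toy, FUN = sum)
toy.summary
```

The HEAT row disappears from the result because it contains an NA. In this analysis that does not bite, since the numeric columns are complete and multiplier() turns missing exponent codes into 0, but it is worth keeping in mind when reusing the pattern.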

Results and Conclusion

This section shows the table summaries and plots from the data processing done previously. Each table or plot is accompanied by a brief description.

print(health.summary[order(health.summary$CASUALTIES, decreasing = TRUE),])

##          EVTYPE FATALITIES INJURIES CASUALTIES
## 8       TORNADO       5661    91407      97068
## 3          HEAT       3138     9154      12292
## 1         FLOOD       1523     8603      10126
## 7  THUNDERSTORM        210     2479       2689
## 2          HAIL         20     1466       1486
## 4     HURRICANE        135     1326       1461
## 6          SNOW        167     1161       1328
## 10     WILDFIRE         75      911        986
## 5          RAIN        108      299        407
## 9       TSUNAMI         33      129        162

Across all three columns, tornado is the most harmful weather event with respect to population health. Its lead over heat, the second-ranked event, is large: about 85,000 casualties (97068 versus 12292). On the other hand, tsunami has the fewest casualties over the time period.

To visualize, here is a plot with a logarithmic scale for clarity.

ggplot(data = health.summary, mapping = aes(x=EVTYPE, y=CASUALTIES)) +
    geom_bar(stat = "identity") +
    scale_y_log10() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    ggtitle("Number of Casualties by Weather Events From 1950 to 2011"
            ,subtitle = "Log10 Scale") +
    xlab("Event Types") +
    ylab("Number of Casualties")

On the logarithmic y-axis, each equal step corresponds to a tenfold increase in casualties rather than a fixed amount. Data that grow exponentially would therefore appear linear in this plot.

As previously mentioned, tornado events have the highest number of casualties. A visible gap between bars on a logarithmic scale indicates a very large difference on the original scale.
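To make the log scale concrete, the casualty totals from the summary table can be converted by hand. The values below are copied from health.summary; the vector name casualties is just for this sketch.

```r
# Casualty totals taken from the health.summary table above.
casualties <- c(TORNADO = 97068, HEAT = 12292, FLOOD = 10126, TSUNAMI = 162)

# Position of each bar top on the log10 axis.
log_casualties <- round(log10(casualties), 2)
log_casualties

# On the original scale, tornado casualties are several times those of heat,
# although the two bars differ by less than one unit on the log10 axis.
ratio <- unname(casualties["TORNADO"] / casualties["HEAT"])
ratio
```

A difference of one unit on this axis means a factor of ten, so the gap of roughly 0.9 between tornado and heat corresponds to almost an eightfold difference in casualties.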

print(economy.summary[order(economy.summary$TOTALDMG, decreasing = TRUE),])

##          EVTYPE      PROPDMG     CROPDMG     TOTALDMG
## 1         FLOOD 167378619958 12352059100 179730679058
## 4     HURRICANE  84756180010  5515292800  90271472810
## 8       TORNADO  58593098301   417461360  59010559661
## 2          HAIL  16018899870  3111583850  19130483720
## 7  THUNDERSTORM   6432588578   653005300   7085593878
## 10     WILDFIRE   4865614000   295972800   5161586800
## 5          RAIN   3233041190   804662800   4037703990
## 6          SNOW   1025424749   134663100   1160087849
## 3          HEAT     20325750   904469280    924795030
## 9       TSUNAMI    144062000       20000    144082000

In terms of economic losses, flood and hurricane have the greatest losses, with a combined total of about 270 billion USD from 1950 to 2011. Next in rank is the tornado weather event, with about 59 billion USD in crop and property damages. To further illustrate this summary, the code below shows the plot, also on a logarithmic scale.

ggplot(data = economy.summary, mapping = aes(x=EVTYPE, y=TOTALDMG)) +
    geom_bar(stat = "identity") +
    scale_y_log10() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    ggtitle("Property and Crop Damages by Weather Events From 1950 to 2011"
            ,subtitle = "Log10 Scale") +
    xlab("Event Types") +
    ylab("Total Crop and Property Damage (in USD)")

Although the bars appear close to each other, the actual differences span orders of magnitude: the largest total (flood) is roughly 1000 times the smallest (tsunami). As shown in the previous summary, flood has the greatest property and crop damages, and thus the greatest economic consequences, followed by hurricanes and tornadoes.

Given these descriptive results, preventive measures should be taken. Government units should prioritize flood and tornado preparedness to minimize casualties and economic damages.

However, these results should be taken with a grain of salt. The calculations are inaccurate to the extent that weather events were left ungrouped. In fact, 323602 cases were left unused, either because they are not part of the selected events or because the grep patterns missed them. A more thorough analysis is recommended for better and more accurate results.
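One possible starting point for such a cleanup is to normalize the labels before matching. The sketch below is not part of the original analysis: normalize_evtype, is_matched, and the sample labels are hypothetical, and the single abbreviation rule (TSTM) is an assumption about what a fuller cleanup might include.

```r
events <- c("TORNADO", "THUNDERSTORM", "HURRICANE", "FLOOD", "SNOW",
            "TSUNAMI", "HAIL", "HEAT", "WILDFIRE", "RAIN")

# Hypothetical helper: unify case, trim outer spaces, expand one abbreviation.
normalize_evtype <- function(x) {
    x <- toupper(trimws(x))
    x <- gsub("TSTM", "THUNDERSTORM", x, fixed = TRUE)
    x
}

# TRUE where a label contains at least one of the event names.
is_matched <- function(x) grepl(paste(events, collapse = "|"), x)

# Made-up labels for illustration.
raw <- c("  tstm wind", "Heavy Rain", "FLASH FLOOD", "DENSE FOG")

unmatched_before <- sum(!is_matched(raw))
unmatched_after  <- sum(!is_matched(normalize_evtype(raw)))
c(before = unmatched_before, after = unmatched_after)
```

In this toy case normalization recovers the abbreviated and mixed-case labels, while “DENSE FOG” stays unmatched because it is genuinely outside the ten selected events. Applying the same before-and-after count to the real EVTYPE column would show how many of the unused cases are recoverable.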

KaidenFrizu

2020-10-31
