knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Synopsis

Natural phenomena are part of the life cycle of our planet, yet they can have disastrous and irreversible consequences for the world's population. Their frequency and their economic and social impact can now be estimated from recorded data, and that estimation is the main focus of this study.

The analysis uses the storm database of the United States National Oceanic and Atmospheric Administration (NOAA), which records the occurrence of major storms and weather events together with estimates of the fatalities, injuries and economic losses they cause. These estimates are essential inputs for assessing the impact of weather phenomena on the economy and on public health, and so support more effective prevention policies by government entities and civil society.

In summary, the events with the greatest impact in terms of fatalities and injuries are tornadoes, excessive heat, floods, lightning and tsunamis. In terms of economic impact, floods, hurricanes, storms, tornadoes and hail stand out.

Data Processing

This section covers data processing. The data set is first imported from its repository and loaded into the R work environment, followed by a brief descriptive analysis and the selection and transformation of the variables to be used. Finally, a regex process corrects typographical errors in the EVTYPE variable, and the symbols in the monetary variables that measure economic impact are converted to numeric multipliers.

Data Import

The data can be downloaded from the Storm_Data link. The file is in csv format and compressed with bzip2 to reduce its size. A more detailed description of the data set is available in the Storm_Data_Documentation document and in the official FAQ document FAQ_NOOA. The compressed file is loaded directly:

setwd("C:/Users/Moises/Desktop/Rafael/Universidad/coursera/Espcializacion_ds_JH/05_investigacion/proyecto")
data_set <- read.csv("repdata_data_StormData.csv.bz2",stringsAsFactors=FALSE)
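As a hedged aside, no separate decompression step is needed here because R's file connections detect bzip2 compression when a file is opened for reading. The sketch below demonstrates this on a small temporary file; the file contents and the names `tmp` and `toy` are illustrative and are not part of the storm data.

```r
# Write a tiny CSV through a bzip2 connection (illustrative data only).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "HAIL,0"), con)
close(con)

# read.csv opens the file by name; the compression is detected on read.
toy <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy)
```

The same call pattern applies to the full storm file, which simply takes longer to read.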

Libraries

library(magrittr)
library(tidyverse)
library(lubridate)
library(ggpubr)
library(knitr)

Descriptive Analysis

The data has a total of 902297 rows and 37 columns as can be verified below:

dim(data_set)
## [1] 902297     37
names(data_set)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

This study focuses solely on the economic and public health impact of weather events across the whole of the USA. Variables that distinguish specific regions or locations are therefore excluded, as are those that describe event intensity, since only economic losses and the number of fatalities and injuries are studied.

Selection and Transformation of Variables

The following variables were finally selected:

  • EVTYPE: Identifies the type of weather event.
  • BGN_DATE: Date on which the event was recorded.
  • FATALITIES: Deaths caused by the event.
  • INJURIES: Injuries caused by the event.
  • PROPDMG: Property damage figure, before the multiplier is applied.
  • PROPDMGEXP: Multiplicative factor of the variable PROPDMG.
  • CROPDMG: Crop damage figure, before the multiplier is applied.
  • CROPDMGEXP: Multiplicative factor of the variable CROPDMG.

data_set  %<>% select("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", 
                   "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
summary(data_set)
##     EVTYPE            BGN_DATE           FATALITIES      
##  Length:902297      Length:902297      Min.   :  0.0000  
##  Class :character   Class :character   1st Qu.:  0.0000  
##  Mode  :character   Mode  :character   Median :  0.0000  
##                                        Mean   :  0.0168  
##                                        3rd Qu.:  0.0000  
##                                        Max.   :583.0000  
##     INJURIES            PROPDMG         PROPDMGEXP       
##  Min.   :   0.0000   Min.   :   0.00   Length:902297     
##  1st Qu.:   0.0000   1st Qu.:   0.00   Class :character  
##  Median :   0.0000   Median :   0.00   Mode  :character  
##  Mean   :   0.1557   Mean   :  12.06                     
##  3rd Qu.:   0.0000   3rd Qu.:   0.50                     
##  Max.   :1700.0000   Max.   :5000.00                     
##     CROPDMG         CROPDMGEXP       
##  Min.   :  0.000   Length:902297     
##  1st Qu.:  0.000   Class :character  
##  Median :  0.000   Mode  :character  
##  Mean   :  1.527                     
##  3rd Qu.:  0.000                     
##  Max.   :990.000

The dates of the events are explored first. The variable BGN_DATE is converted to a date format, and a YEAR variable is derived from it for use in the rest of the analysis.

data_set %<>% mutate(BGN_DATE = as.Date(BGN_DATE,"%m/%d/%Y"), YEAR = year(BGN_DATE)) 
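The date conversion can be checked on a few values. This is a minimal base-R sketch that assumes the raw BGN_DATE strings look like month/day/year followed by a time stamp (the sample strings are invented); `format(..., "%Y")` stands in for `lubridate::year()` so that the snippet runs without extra packages.

```r
# Assumed raw format: "m/d/Y H:M:S" (sample values are illustrative).
raw_dates <- c("4/18/1950 0:00:00", "12/5/1993 0:00:00", "6/1/2011 0:00:00")

# Characters after the part matched by the format string are ignored.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Extract the year as an integer.
years <- as.integer(format(parsed, "%Y"))
print(data.frame(raw_dates, parsed, years))
```

A string that does not match the format yields `NA` rather than an error, so counting `NA` values after the conversion is a cheap sanity check.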

annotate_figure(
    ggarrange(
    data_set %>% group_by(YEAR) %>% 
        summarise(EVTYPE = length(unique(EVTYPE))) %>% 
        ggplot(aes(x = YEAR, y = EVTYPE)) + 
        geom_line(color = "blue") + 
        theme_minimal() + 
        scale_x_continuous(breaks = seq(1950, 2011, by = 10)) +
        labs(title = "Event Type", x = "Years", y = "Number of Event Type"),
    data_set %>% group_by(YEAR) %>% 
        summarise(n = n()) %>% ggplot(aes(x = YEAR, y = n)) + 
        geom_line(color = "red") + 
        theme_minimal() + 
        scale_x_continuous(breaks = seq(1950, 2011, by = 10)) + 
        labs(title = "Number of Events per Year", x = "Years", y = ""), 
    ncol = 2, nrow = 1),
    top = text_grob("Event Occurrence (1950-2011)", color = "black", 
                    face = "bold", size = 13)
    )

The left-hand graph shows that until the early 1990s only a small number of categories was used to record events, and that the number of categories in use grows sharply from 1993 onward. The study therefore keeps only the records dated after 1992.

The Storm_Data_Documentation document specifies a total of 48 event types; however, the data contain 985 distinct values of EVTYPE.

table(data_set$EVTYPE) %>% length()
## [1] 985
table(data_set$EVTYPE) %>% head()
## 
##    HIGH SURF ADVISORY         COASTAL FLOOD           FLASH FLOOD 
##                     1                     1                     1 
##             LIGHTNING             TSTM WIND       TSTM WIND (G45) 
##                     1                     4                     1
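One reason the count of distinct labels can be inflated is that labels may differ only in letter case or surrounding whitespace. The sketch below illustrates the effect on a hypothetical vector (the leading-space entries are an assumption made for illustration, not values confirmed from the data set).

```r
# Hypothetical labels: some differ only by case or leading spaces.
labels <- c("   HIGH SURF ADVISORY", "HIGH SURF ADVISORY",
            " COASTAL FLOOD", "Coastal Flood",
            "TSTM WIND", "TSTM WIND")

n_raw <- length(unique(labels))            # distinct labels as recorded

# Trim whitespace and unify case before counting again.
normalised <- toupper(trimws(labels))
n_clean <- length(unique(normalised))      # distinct labels after cleaning

print(c(raw = n_raw, clean = n_clean))
```

Normalisation of this kind only removes trivial variants; misspellings and synonyms still need pattern-based recoding.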

Many events are recorded under similar names even though they belong to the same category, which explains why there are almost 1,000 distinct values. The data are therefore filtered to records dated after 1992, and only the event types with more than 15 observations are kept. This removes the unrepresentative cases, most of which originate from typographical errors.

data_set %<>%  filter(YEAR > 1992)

type_even <- data_set %>% 
    group_by(EVTYPE) %>% 
    summarise(n = n()) %>% 
    filter(n > 15) %>% .$EVTYPE %>% unique()

summary(type_even)
##    Length     Class      Mode 
##       140 character character
head(type_even)
## [1] "ASTRONOMICAL HIGH TIDE" "ASTRONOMICAL LOW TIDE" 
## [3] "AVALANCHE"              "BLIZZARD"              
## [5] "COASTAL FLOOD"          "Coastal Flooding"
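The frequency filter above can be sketched in base R on a toy vector. The labels and the threshold `min_obs` are illustrative assumptions; the analysis itself uses a threshold of 15.

```r
# Toy labels: two frequent types plus two rare, misspelled variants.
events <- c(rep("TORNADO", 5), rep("HAIL", 4), "TORNDAO", "HAIL 075")

counts <- table(events)                    # observations per label
min_obs <- 3                               # illustrative threshold
frequent <- names(counts[counts > min_obs])

# Keep only the records whose label is frequent enough.
kept <- events[events %in% frequent]
print(frequent)
print(length(kept))
```

The trade-off is that rare but legitimate event types are dropped together with the typos, so the threshold should stay low relative to the size of the data.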

This leaves 140 event types, a substantial reduction. Restricting the data to these event types slightly reduces the number of records.

data_set %<>%  filter(EVTYPE %in% type_even)
dim(data_set)
## [1] 712747      9

The resulting data set has about 190,000 fewer records than the original (712,747 versus 902,297), counting both filters. Although 140 event types is a significant reduction, the goal is to reach the 48 defined in the Storm_Data_Documentation document.

Regex for Event Type

To mitigate the effect of typographical errors, a regex process was applied to the variable EVTYPE. Although the goal of this cleaning step is to end up with only the 48 official event types, a new type called DRY MICROBURST was created because it could not be assigned to any of the 48 categories. The Dense Smoke type is absent from the study because it had too few observations to pass the filter.

data_set %<>% mutate(
    EVTYPE_regex = tolower(case_when(
        EVTYPE %in% grep("Astronomical", type_even, value = T, ignore.case = T) ~ "Astronomical",
        EVTYPE %in% grep("Avalanche|LANDSLIDE", type_even, value = T, ignore.case = T) ~ "Avalanche ",
        EVTYPE %in% grep("COASTAL FLOOD|RIVER|TIDAL",
                         type_even, value = T, ignore.case = T) ~ "Coastal Flood", 
        EVTYPE %in% grep("^COLD|^WIND CHILL",
                         type_even, value = T, ignore.case = T) ~ "COLD/WIND CHILL",
        EVTYPE %in% grep("Dense|^Fog",type_even, value = T, ignore.case = T) ~ "Dense Fog",
        EVTYPE %in% grep("DRY",type_even, value = T, ignore.case = T) ~ "DRY MICROBURST",
        EVTYPE %in% grep("SURGE",type_even, value = T, ignore.case = T) ~ "STORM SURGE/TIDE",
        EVTYPE %in% grep("EXCESSIVE HEAT|EXTREME HEAT|RECORD HEAT|EXCESSIVE SNOW|HEAT WAVE|WARMTH",
                         type_even, value = T, ignore.case = T) ~ "Excessive Heat",
        EVTYPE %in% grep(".cold$|Extreme Cold/Wind Chill|EXTREME WINDCHILL",
                         type_even, value = T, ignore.case = T) ~ "Extreme Cold/Wind Chill",
        EVTYPE %in% grep("FLASH FLOOD",type_even, value = T, ignore.case = T) ~ "FLASH FLOOD",
        EVTYPE %in% grep("URBAN|^FLOODING",type_even, value = T, ignore.case = T) ~ "FLOOD",
        EVTYPE %in% grep("Frost|FREEZE",type_even, value = T, ignore.case = T) ~ "Frost/Freeze",
        EVTYPE %in% grep("Funnel",type_even, value = T, ignore.case = T) ~ "Funnel Cloud", 
        EVTYPE %in% grep("Freezing",type_even, value = T, ignore.case = T) ~ "FREEZING FOG",
        EVTYPE %in% grep("^Hail|^SMALL|GLAZE",type_even, value = T, ignore.case = T) ~ "HAIL",
        EVTYPE %in% grep("Heavy Rain|^RAIN|PRECIPITATION",
                         type_even, value = T, ignore.case = T) ~ "Heavy Rain", 
        EVTYPE %in% grep("HEAVY.SNOW|.LAKE.|SNOW SQUALL|^SNOW$",
                         type_even, value = T, ignore.case = T) ~ "Heavy Snow",
        EVTYPE %in% grep("Surf",type_even, value = T, ignore.case = T) ~ "High Surf",
        EVTYPE %in% grep("^High Wind|^WINDS$|^WIND$|GUSTY|DAMAGE",
                         type_even, value = T, ignore.case = T) ~ "High Wind",
        EVTYPE %in% grep("Hurricane",type_even, value = T, ignore.case = T) ~ "HURRICANE/TYPHOON",
        EVTYPE %in% grep("Ice|ICY",type_even, value = T, ignore.case = T) ~ "Ice Storm",
        EVTYPE %in% agrep("Lake Effect Snow",
                          type_even, value = T, ignore.case = T) ~ "Lake-Effect Snow",
        EVTYPE %in% grep("Light Snow|MODERATE SNOWFALL",
                         type_even, value = T, ignore.case = T) ~ "Light Snow", 
        EVTYPE %in% grep("Rip Current",type_even, value = T, ignore.case = T) ~ "Rip Current", 
        EVTYPE %in% grep("Storm Surge",type_even, value = T, ignore.case = T) ~ "Storm Surge/Tide",
        EVTYPE %in% grep("^Strong Wind",type_even, value = T, ignore.case = T) ~ "Strong Wind",
        EVTYPE %in% grep("^Thunderstorm|SEVERE THUNDERSTORMS",
                         type_even, value = T, ignore.case = T) ~ "Thunderstorm Wind",
        EVTYPE %in% grep("Tornado",type_even, value = T, ignore.case = T) ~ "Tornado",
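        # Note: "TSTM" is the Storm Data abbreviation for thunderstorm, so this rule also labels TSTM WIND records as tsunami.
        EVTYPE %in% grep("Tsunami|TSTM",type_even, value = T, ignore.case = T) ~ "Tsunami",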
        EVTYPE %in% grep("Waterspout",type_even, value = T, ignore.case = T) ~ "Waterspout",
        EVTYPE %in% grep("fire",type_even, value = T, ignore.case = T) ~ "Wildfire",
        EVTYPE %in% grep("Winter Weather|WINTRY",
                         type_even, value = T, ignore.case = T) ~ "Winter Weather",
        TRUE ~ as.character(.$EVTYPE)
        )
    )
)

data_set$EVTYPE_regex %>% unique() %>% sort()
##  [1] "astronomical"             "avalanche "              
##  [3] "blizzard"                 "coastal flood"           
##  [5] "cold/wind chill"          "dense fog"               
##  [7] "drought"                  "dry microburst"          
##  [9] "dust devil"               "dust storm"              
## [11] "excessive heat"           "extreme cold/wind chill" 
## [13] "flash flood"              "flood"                   
## [15] "freezing fog"             "frost/freeze"            
## [17] "funnel cloud"             "hail"                    
## [19] "heat"                     "heavy rain"              
## [21] "heavy snow"               "high surf"               
## [23] "high wind"                "hurricane/typhoon"       
## [25] "ice storm"                "lake-effect snow"        
## [27] "lakeshore flood"          "light snow"              
## [29] "lightning"                "marine hail"             
## [31] "marine high wind"         "marine strong wind"      
## [33] "marine thunderstorm wind" "other"                   
## [35] "rip current"              "seiche"                  
## [37] "sleet"                    "storm surge/tide"        
## [39] "strong wind"              "temperature record"      
## [41] "thunderstorm wind"        "tornado"                 
## [43] "tropical depression"      "tropical storm"          
## [45] "tsunami"                  "unseasonably warm"       
## [47] "unseasonably wet"         "volcanic ash"            
## [49] "waterspout"               "wildfire"                
## [51] "winter storm"             "winter weather"
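The recoding logic above follows a "first matching pattern wins" rule. The base-R sketch below shows that rule in isolation. The `rules` table is a hypothetical subset and not the full rule set of this report, and `recode_event` is an invented helper name. As a labelled assumption, the sketch treats the TSTM abbreviation as thunderstorm wind, which differs from the rule list above.

```r
# Hypothetical rules: category name = regular expression. Order matters.
rules <- c(
  "flash flood"       = "FLASH FLOOD",
  "flood"             = "FLOOD|URBAN",
  "thunderstorm wind" = "^THUNDERSTORM|^TSTM",
  "tornado"           = "TORNADO"
)

recode_event <- function(x, rules) {
  out <- tolower(x)                  # unmatched labels keep their lower-cased name
  done <- rep(FALSE, length(x))
  for (category in names(rules)) {
    # Only labels not yet assigned can be claimed by a later rule.
    hit <- !done & grepl(rules[[category]], x, ignore.case = TRUE)
    out[hit] <- category
    done <- done | hit
  }
  out
}

raw <- c("FLASH FLOODING", "Urban Flood", "TSTM WIND", "TORNADO F2", "DROUGHT")
recoded <- recode_event(raw, rules)
print(data.frame(raw, recoded))
```

Because specific rules must come before general ones ("flash flood" before "flood"), tabulating the raw labels against the recoded ones is a useful check for patterns that capture more than intended.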

The types UNSEASONABLY WARM, UNSEASONABLY WET, Temperature record and OTHER could not be assigned to any of the 48 event types and were therefore omitted.

data_set %<>% filter(!EVTYPE_regex %in% tolower(c("UNSEASONABLY WARM",
                                         "UNSEASONABLY WET","Temperature record","OTHER")))

data_set$EVTYPE_regex %>% unique() %>% length()
## [1] 48

The variable EVTYPE_regex now contains 48 event types.

Multiplicative Factors of Losses

The variables PROPDMGEXP and CROPDMGEXP contain characters that represent multiplicative factors for the losses caused by the events. The loss amounts in PROPDMG and CROPDMG are multiplied by the value of the corresponding factor. The symbols present in both variables are:

table(data_set$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 311440      1      8      3    215     25     13      4      4     27 
##      6      7      8      B      H      K      m      M 
##      4      5      1     34      6 392245      6   8470
table(data_set$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      M 
## 428791      6     18      1      9     21 281710   1955

To find the numeric equivalents of the factors, the methodology suggested in the course discussion forum (DISCUSSION FORUM week 4) was followed. That thread points to the Exponent Value study, which describes the transformation of the factors in detail, as summarised below:

These are possible values of CROPDMGEXP and PROPDMGEXP: H, h, K, k, M, m, B, b, +, -, ?, 0, 1, 2, 3, 4, 5, 6, 7, 8, and blank-character.

  • H,h = hundreds = 100
  • K,k = kilos = thousands = 1,000
  • M,m = millions = 1,000,000
  • B,b = billions = 1,000,000,000
  • (+) = 1
  • (-) = 0
  • (?) = 0
  • blank/empty character = 0
  • numeric 0..8 = 10

# Convert the exponent symbols to numeric multipliers, then compute total damage.
# A symbol matched by no rule becomes NA.
data_set %<>% mutate(PROPDMGEXP = case_when(PROPDMGEXP %in% c(as.character(0:8)) ~ 10,
                                           PROPDMGEXP %in% c("","-","?") ~ 0,
                                           PROPDMGEXP == "+" ~ 1,
                                           PROPDMGEXP %in% c("H","h") ~ 10^2,
                                           PROPDMGEXP %in% c("K","k") ~ 10^3,
                                           PROPDMGEXP %in% c("M","m") ~ 10^6,
                                           PROPDMGEXP %in% c("B","b") ~ 10^9),
                    CROPDMGEXP = case_when(CROPDMGEXP %in% c(as.character(0:8)) ~ 10,
                                           CROPDMGEXP %in% c("","-","?") ~ 0,
                                           CROPDMGEXP == "+" ~ 1,
                                           CROPDMGEXP %in% c("H","h") ~ 10^2,
                                           CROPDMGEXP %in% c("K","k") ~ 10^3,
                                           CROPDMGEXP %in% c("M","m") ~ 10^6,
                                           CROPDMGEXP %in% c("B","b") ~ 10^9),
                    TOTAL_DAM = PROPDMG*PROPDMGEXP+CROPDMG*CROPDMGEXP)
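The mapping from symbols to multipliers can be tried on a toy data frame. This is a base-R sketch under the equivalences listed above; the helper name `exp_to_multiplier`, the column `PROP_TOTAL` and the sample rows are hypothetical.

```r
# Translate exponent symbols into numeric multipliers (case-insensitive).
exp_to_multiplier <- function(code) {
  code <- toupper(code)
  mult <- rep(NA_real_, length(code))      # unknown symbols stay NA
  mult[code %in% c("", "-", "?")] <- 0
  mult[code == "+"] <- 1
  mult[code %in% as.character(0:8)] <- 10
  mult[code == "H"] <- 1e2
  mult[code == "K"] <- 1e3
  mult[code == "M"] <- 1e6
  mult[code == "B"] <- 1e9
  mult
}

toy <- data.frame(PROPDMG = c(25, 2.5, 10, 3),
                  PROPDMGEXP = c("K", "m", "", "5"),
                  stringsAsFactors = FALSE)
toy$PROP_TOTAL <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
print(toy)
```

Note that under these equivalences a damage figure with a blank symbol contributes zero to the total, which is a modelling choice inherited from the referenced methodology.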

Results

Weather events in the USA were analysed from the perspective of economic impact and of public health. The two questions and their respective analyses are presented below.

Population Health

Across the United States, which types of events are most harmful with respect to population health?

Injuries far outnumber fatalities, by a factor of more than six, which is to be expected.

data_set %>% select(FATALITIES,INJURIES) %>% apply(2,sum)
## FATALITIES   INJURIES 
##      10583      68001

The impact of the event types is shown below, grouped by event type, by year of occurrence and by outcome (fatalities or injuries).

Question_1.1 <- data_set %>% 
    group_by(EVTYPE_regex) %>% 
    summarise(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES)) %>% 
    arrange(-INJURIES,-FATALITIES) %>% 
    head(5) 

Question_1.2 <- data_set %>%  
    filter(EVTYPE_regex %in% Question_1.1$EVTYPE_regex[1:5]) %>% 
    group_by(YEAR,EVTYPE_regex) %>% 
    summarise(INJURIES = sum(INJURIES))

Question_1.3 <- data_set %>%  
    filter(EVTYPE_regex %in% Question_1.1$EVTYPE_regex[1:5]) %>% 
    group_by(YEAR,EVTYPE_regex) %>% 
    summarise(FATALITIES = sum(FATALITIES))
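The ranking step can be sketched in base R with `aggregate()` and `order()`. The toy data are invented for illustration. As in the code above, types are sorted by injuries first, with fatalities acting only as a tie-breaker.

```r
# Invented records: one row per event.
toy <- data.frame(
  EVTYPE     = c("tornado", "tornado", "flood", "heat", "flood", "lightning"),
  FATALITIES = c(3, 1, 2, 5, 0, 1),
  INJURIES   = c(40, 10, 5, 8, 2, 3),
  stringsAsFactors = FALSE
)

# Sum both outcomes per event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort by injuries (descending), then fatalities (descending), keep the top rows.
ranked <- totals[order(-totals$INJURIES, -totals$FATALITIES), ]
top2 <- head(ranked, 2)
print(top2)
```

Since injuries are far more numerous than fatalities, this ordering is effectively a ranking by injuries; sorting by fatalities first could produce a different top five.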
ggarrange(
    Question_1.1 %>% gather("Severity", "Cases", 2:3) %>%
        ggplot(aes(x = reorder(EVTYPE_regex, -Cases), y = Cases, fill = Severity)) +
        geom_bar(stat="identity") + 
        theme_minimal() + 
        labs(title="Impact on Public Health by Type of Event (1993-2011/USA)",
             x =""),
    ggarrange(
        qplot(YEAR, INJURIES, data = Question_1.2, color = EVTYPE_regex) + 
            geom_line() + 
            theme_minimal() +
            labs(title="Impact of Injuries", x ="Years", y = "Cases", color = "Type of Event") + 
            theme(legend.text = element_text(size = 7)),
        qplot(YEAR, FATALITIES, data = Question_1.3, color = EVTYPE_regex) + 
            geom_line() +
            theme_minimal() +
            labs(title="Impact of Fatalities", x ="Years", y = "", color = "Type of Event") +
            theme(legend.text = element_text(size = 7)),
        ncol = 2, common.legend = TRUE, legend = "bottom"), nrow = 2)

The five event types with the greatest impact on public health are tornadoes, excessive heat, floods, lightning and tsunamis; these accumulate the greatest number of fatalities and injuries.

Regarding injuries, tornadoes and excessive heat have been the predominant events over the years. Floods, however, show a peak in 1997 and tornadoes another in 2010. These outliers correspond to catastrophic situations that should be taken into account in any contingency plan.

Regarding fatalities, tornadoes and excessive heat are also predominant over the years. These two event types also account for the catastrophic events, showing strong peaks over the years, especially for excessive heat.

Economic Consequences

The economic impact of weather events is evaluated as the sum of property and crop losses. The total losses by event type are shown, along with the fluctuation of losses over the years.

Question_2.1 <- data_set %>% 
    group_by(EVTYPE_regex) %>%
    summarise(TOTAL_DAM = sum(TOTAL_DAM)) %>%
    arrange(-TOTAL_DAM) %>% 
    head(5)

Question_2.2 <- data_set %>% 
    mutate(PROPDMG = PROPDMG*PROPDMGEXP) %>% 
    filter(EVTYPE_regex %in% Question_2.1$EVTYPE_regex) %>% 
    group_by(YEAR,EVTYPE_regex) %>% 
    summarise(TOTAL_PROPDMG = sum(PROPDMG))

Question_2.3 <- data_set %>% 
    mutate(CROPDMG = CROPDMG*CROPDMGEXP) %>% 
    filter(EVTYPE_regex %in% Question_2.1$EVTYPE_regex) %>% 
    group_by(YEAR,EVTYPE_regex) %>% 
    summarise(TOTAL_CROPDMG = sum(CROPDMG))
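The yearly breakdown groups by two keys at once. A minimal base-R sketch, assuming the exponent column has already been converted to a numeric multiplier; the rows and the column name `PROP_TOTAL` are illustrative.

```r
# Invented records with numeric multipliers already in place.
toy <- data.frame(
  YEAR       = c(2005, 2005, 2005, 2006, 2006),
  EVTYPE     = c("flood", "flood", "hail", "flood", "hail"),
  PROPDMG    = c(2, 3, 1, 10, 4),
  PROPDMGEXP = c(1e6, 1e6, 1e3, 1e9, 1e3),
  stringsAsFactors = FALSE
)

# Scale each figure by its multiplier, then sum within year and event type.
toy$PROP_TOTAL <- toy$PROPDMG * toy$PROPDMGEXP
yearly <- aggregate(PROP_TOTAL ~ YEAR + EVTYPE, data = toy, FUN = sum)
print(yearly)
```

A single record with a billion-scale multiplier dominates its year, as the 2006 flood row shows, so inspecting the largest individual records is worthwhile before interpreting peaks.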
ggarrange(
    ggplot(Question_2.1,aes(x = reorder(EVTYPE_regex, -TOTAL_DAM), y = TOTAL_DAM)) +
        geom_bar(stat="identity") +
        theme_minimal() + 
        labs(title="Economic Impact by type of event (1993-2011/USA)",x ="",
             y = "Monetary Units") + 
        theme(axis.text.x = element_text(size = 7)), 
    ggarrange(
        qplot(YEAR, TOTAL_PROPDMG, data = Question_2.2, color = EVTYPE_regex) + 
            geom_line() + 
            theme_minimal() + 
            labs(title="Losses for Property",x ="", y = "Monetary Units" , color = "Event Type") +
            theme(legend.text = element_text(size = 7), 
                  axis.text.x = element_text(size = 7)),
        qplot(YEAR, TOTAL_CROPDMG, data = Question_2.3, color = EVTYPE_regex) + 
            geom_line() + 
            theme_minimal() +
            labs(title="Crop Losses",x ="", y = "" , color = "Event Type") + 
            theme(legend.text = element_text(size = 7), 
                  axis.text.x = element_text(size = 7)),
        ncol = 2, common.legend = TRUE, legend = "bottom"), nrow = 2)

Floods, hurricanes, storms, tornadoes and hail generate the greatest total losses. Looking at the distribution of the economic impact over the period 1993-2011, property damage is on average caused mostly by hurricanes and storms, with a notable peak in 2002 produced by a flood event that qualifies as catastrophic damage. For crop losses, hurricanes and floods stand out as the most damaging events, both in catastrophic years and in their regular behaviour over the years.