Synopsis

An exploratory data analysis was performed on the Storm Data database from NOAA. The goal was to find which types of events had the greatest economic and health consequences. The study covers the entire United States from 1950 to 2011, and the consequences are summed over all states for each type of event. The economic consequences are measured as the sum of property losses and crop losses, while the health consequences are studied separately as fatalities and injuries. We found that the greatest economic consequences are produced by floods, whereas the greatest consequences for people’s health are produced by tornadoes and excessive heat.

Data processing

Some initial configuration:

Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"

The data can be obtained from: “https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2”. Documentation about the data can be obtained from: “https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf”. The .bz2 file should be downloaded and unzipped; the .csv file is inside it.

A file describing the columns can also be found at: “https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/Storm-Data-Bulk-csv-Format.pdf”
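As an alternative to unzipping by hand, the archive can be read directly, because read.csv() accepts a bzfile() connection. The sketch below is only an illustration: the helper name read_storm_data and the destination file name are assumptions, not part of the original analysis, and the helper is defined but not called. The second half exercises the same mechanism offline on a tiny temporary file.

```r
# Hypothetical helper (not in the original report): download the archive once
# and read it through a bzfile() connection, without manual extraction.
storm_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

read_storm_data <- function(url = storm_url, destfile = "StormData.csv.bz2") {
    if (!file.exists(destfile)) {
        # Binary mode avoids corrupting the archive on Windows
        download.file(url, destfile, mode = "wb")
    }
    read.csv(bzfile(destfile))
}

# Offline check of the same mechanism using a small temporary bz2 file
# (toy rows, not real Storm Data records).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(0, 1)),
          con, row.names = FALSE)
close(con)
demo <- read.csv(bzfile(tmp))
```

Calling read_storm_data() would replace both the manual download and the read.csv() call that follows, at the cost of decompressing the file on every read.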

Read the data into a data frame

data <- read.csv("repdata_data_StormData.csv")

Explore first rows and dimensions

print(dim(data))
## [1] 902297     37
head(data, 2)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                        14   100 3   0          0       15    25.0
## 2         0                         2   150 2   0          0        0     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2

There are 7 columns of interest for this report: EVTYPE, CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP, INJURIES and FATALITIES.

Load the dplyr and tidyr libraries (they must be installed beforehand using install.packages())

#install.packages("dplyr")
#install.packages("tidyr")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

Select the columns of interest

reduced_data <- data %>% select(EVTYPE, CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP, INJURIES, FATALITIES)
head(reduced_data)
##    EVTYPE CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP INJURIES FATALITIES
## 1 TORNADO       0               25.0          K       15          0
## 2 TORNADO       0                2.5          K        0          0
## 3 TORNADO       0               25.0          K        2          0
## 4 TORNADO       0                2.5          K        2          0
## 5 TORNADO       0                2.5          K        2          0
## 6 TORNADO       0                2.5          K        6          0

Date range of the data.

summary(as.Date(data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"))
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "1950-01-03" "1995-04-20" "2002-03-18" "1998-12-27" "2007-07-28" "2011-11-30"

Check the exponents for damage

table(reduced_data$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
table(reduced_data$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330

Some exponent codes are undefined, such as “?” and “h”. Let’s study their possible impact.

factors1 <- reduced_data %>% group_by(CROPDMGEXP) %>% summarise(total = sum(CROPDMG), count = n())
factors1
## # A tibble: 9 × 3
##   CROPDMGEXP     total  count
##   <chr>          <dbl>  <int>
## 1 ""              11   618413
## 2 "0"            260       19
## 3 "2"              0        1
## 4 "?"              0        7
## 5 "B"             13.6      9
## 6 "K"        1342956.  281832
## 7 "M"          34141.    1994
## 8 "k"            436       21
## 9 "m"             10        1
factors2 <- reduced_data %>% group_by(PROPDMGEXP) %>% summarise(total = sum(PROPDMG), count = n())
factors2
## # A tibble: 19 × 3
##    PROPDMGEXP      total  count
##    <chr>           <dbl>  <int>
##  1 ""              527.  465934
##  2 "+"             117        5
##  3 "-"              15        1
##  4 "0"            7108.     216
##  5 "1"               0       25
##  6 "2"              12       13
##  7 "3"              20        4
##  8 "4"              14.5      4
##  9 "5"             210.      28
## 10 "6"              65        4
## 11 "7"              82        5
## 12 "8"               0        1
## 13 "?"               0        8
## 14 "B"             276.      40
## 15 "H"              25        6
## 16 "K"        10735292.  424665
## 17 "M"          140694.   11330
## 18 "h"               2        1
## 19 "m"              38.9      7

As these exponents may influence the results, they are taken into account using the following interpretation:

#exponents of ten
map_factors <- c("0"= 0, "1"= 1, "2" = 2, "3" = 3, "4" = 4, "5" = 5, "6" = 6,
                 "7" =  7, "8" = 8, "-" = 0, "+" = 0, "m" = 6, "M" = 6,
                 "B" = 9, "K" = 3, "h" = 2, "H" = 2, "?" = 0)
#"?" explicitly chosen to be 0, as all the values for this factor are zero.
#Note: lowercase "k" and the empty string are not in the map, so they become NA
#in the lookup and receive exponent 0 in the next step.

#Get numerical exponents and fill NA with zero.
reduced_data$CROPDMGEXP_num <- map_factors[reduced_data$CROPDMGEXP]
reduced_data$CROPDMGEXP_num[is.na(reduced_data$CROPDMGEXP_num)] <- 0

reduced_data$PROPDMGEXP_num <- map_factors[reduced_data$PROPDMGEXP]
reduced_data$PROPDMGEXP_num[is.na(reduced_data$PROPDMGEXP_num)] <- 0
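The lookup above relies on indexing a named vector by character codes: any code that is not a name in the vector yields NA, which is then replaced by 0. The following is a minimal sketch of that behaviour on toy input, using only a subset of the map. The as.character() call is an added precaution, assuming the column could be a factor (for example under the pre-4.0 read.csv default), in which case indexing would use the integer codes instead of the labels.

```r
# Subset of the exponent map, for illustration only
map_factors <- c("K" = 3, "M" = 6, "B" = 9, "h" = 2, "?" = 0)

# Toy exponent codes, not real rows from the data set
codes <- c("K", "", "M", "k", "B", "h")

# as.character() guards against factor columns, which would index by integer code
exps <- unname(map_factors[as.character(codes)])

# Codes that were not found in the map (they produced NA in the lookup)
unmapped <- sort(unique(codes[is.na(exps)]))

# Same fallback as in the report: unmapped codes get exponent 0
exps[is.na(exps)] <- 0
```

Listing the unmapped codes before filling them with 0 makes it easy to see which labels are silently treated as "no multiplier".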

Let’s convert all damage amounts to millions of dollars so they can be compared on a common scale.

reduced_data_all_exp <- reduced_data %>% mutate(PROPDMG = PROPDMG * 10^PROPDMGEXP_num / 10^6, 
                                        CROPDMG = CROPDMG * 10^CROPDMGEXP_num / 10^6) %>%
                                select(EVTYPE, CROPDMG, PROPDMG, INJURIES, FATALITIES)
head(reduced_data_all_exp)
##    EVTYPE CROPDMG PROPDMG INJURIES FATALITIES
## 1 TORNADO       0  0.0250       15          0
## 2 TORNADO       0  0.0025        0          0
## 3 TORNADO       0  0.0250        2          0
## 4 TORNADO       0  0.0025        2          0
## 5 TORNADO       0  0.0025        2          0
## 6 TORNADO       0  0.0025        6          0
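The conversion is plain arithmetic: the recorded value is multiplied by ten raised to its exponent and then divided by 10^6. As a small worked check, the first row above has PROPDMG 25.0 with exponent code K (3), which gives 25 × 10^3 / 10^6 = 0.025 million. The function name to_millions below is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper: convert a recorded amount and its power-of-ten
# exponent into millions of dollars.
to_millions <- function(value, exponent) {
    value * 10^exponent / 10^6
}

row1 <- to_millions(25.0, 3)   # first row of the table: 25.0 with "K"
row2 <- to_millions(2.5, 3)    # second row of the table: 2.5 with "K"
one_billion <- to_millions(1, 9)  # a value of 1 with "B"
```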

Group by event to highlight the most economically harmful events.

dmg_by_event <- reduced_data_all_exp %>% mutate(EVTYPE = as.factor(EVTYPE)) %>%
    mutate(total_dmg = CROPDMG + PROPDMG) %>% group_by(EVTYPE) %>%
    summarise(economic_dmg = sum(total_dmg)) %>% arrange(desc(economic_dmg))

head(dmg_by_event)
## # A tibble: 6 × 2
##   EVTYPE            economic_dmg
##   <fct>                    <dbl>
## 1 FLOOD                  150320.
## 2 HURRICANE/TYPHOON       71914.
## 3 TORNADO                 57362.
## 4 STORM SURGE             43324.
## 5 HAIL                    18761.
## 6 FLASH FLOOD             18244.

This part checks whether using only the defined exponents K, M and B would change the results.

#All values with exponents different than "K", "M", "B" are set to zero.
reduced_data_def_exp <- reduced_data %>% 
    mutate(PROPDMG = if_else(PROPDMGEXP %in% c("K", "M", "B"), PROPDMG, 0),
           CROPDMG = if_else(CROPDMGEXP %in% c("K", "M", "B"), CROPDMG, 0))


dmg_by_event_def_exp <- reduced_data_def_exp %>% mutate(EVTYPE = as.factor(EVTYPE)) %>%
    mutate(PROPDMG = as.numeric(PROPDMG) * 10^PROPDMGEXP_num / 10^6, 
            CROPDMG = as.numeric(CROPDMG) * 10^CROPDMGEXP_num / 10^6) %>%
    mutate(total_dmg = CROPDMG + PROPDMG) %>% group_by(EVTYPE) %>%
    summarise(economic_dmg = sum(total_dmg)) %>% arrange(desc(economic_dmg))

head(dmg_by_event_def_exp)
## # A tibble: 6 × 2
##   EVTYPE            economic_dmg
##   <fct>                    <dbl>
## 1 FLOOD                  150320.
## 2 HURRICANE/TYPHOON       71914.
## 3 TORNADO                 57341.
## 4 STORM SURGE             43324.
## 5 HAIL                    18753.
## 6 FLASH FLOOD             17562.

The results do not vary, at least for the top-ranked event types, so we can be confident that our interpretation of the exponents won’t affect the conclusions.
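The comparison of the two rankings can also be made explicit rather than by eye. The sketch below re-enters the top six rows of both tables by hand (the rounded values printed above, so differences are approximate) and checks that the order is the same and how large the biggest relative difference is. The object names all_exp and def_exp are assumptions for this illustration.

```r
events <- c("FLOOD", "HURRICANE/TYPHOON", "TORNADO",
            "STORM SURGE", "HAIL", "FLASH FLOOD")

# Rounded totals (millions of dollars) copied from the two printed tables
all_exp <- data.frame(EVTYPE = events,
                      economic_dmg = c(150320, 71914, 57362, 43324, 18761, 18244),
                      stringsAsFactors = FALSE)
def_exp <- data.frame(EVTYPE = events,
                      economic_dmg = c(150320, 71914, 57341, 43324, 18753, 17562),
                      stringsAsFactors = FALSE)

# Is the ranking of the top events identical under both interpretations?
same_order <- identical(all_exp$EVTYPE, def_exp$EVTYPE)

# Largest relative difference between the two totals, in percent
rel_diff <- (all_exp$economic_dmg - def_exp$economic_dmg) / all_exp$economic_dmg
max_rel_diff_pct <- round(100 * max(rel_diff), 1)
```

With these rounded values the order is identical and the largest gap is for FLASH FLOOD, at a few percent of its total.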

Group by event to highlight the most harmful events for people

harm_by_event <- reduced_data_all_exp %>% mutate(EVTYPE = as.factor(EVTYPE)) %>%
    group_by(EVTYPE) %>%
    summarise(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES)) %>% 
    arrange(desc(Fatalities)) %>% pivot_longer(cols = 2:3, names_to = "type",
                                               values_to = "Count")

head(harm_by_event)
## # A tibble: 6 × 3
##   EVTYPE         type       Count
##   <fct>          <chr>      <dbl>
## 1 TORNADO        Fatalities  5633
## 2 TORNADO        Injuries   91346
## 3 EXCESSIVE HEAT Fatalities  1903
## 4 EXCESSIVE HEAT Injuries    6525
## 5 FLASH FLOOD    Fatalities   978
## 6 FLASH FLOOD    Injuries    1777

Results

The 15 most harmful events for people are shown.

library(ggplot2)
# harm_by_event is in long format (two rows per event), so 15 events span 30 rows
p <- ggplot(data = harm_by_event[1:30,], aes(x = Count, y = reorder(EVTYPE, Count)))
p + facet_grid(cols = vars(type)) + geom_col(aes(fill = type)) +
    labs(title = "Fatalities and Injuries by Event type", y="Type of Event")

The most harmful event type for people is Tornado, with the largest number of both fatalities and injuries. Excessive Heat ranks second in fatalities, although it is not second in injuries.

The 15 most economically harmful events are shown.

g <- ggplot(data = dmg_by_event[1:15,], aes(x = economic_dmg, y = reorder(EVTYPE, economic_dmg)))
g + geom_col(fill = "cyan3") + labs(title = "Property and Crop Monetary Damage",
                           x = "Damage in Millions of Dollars", y="Event")

The most economically harmful events are Floods, followed by Hurricanes/Typhoons.