Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Tornados is as much the event type that involve de most large amount of fatalities, injuries and damages: 5,633 fatalities, 91,364 injuries and $51,657,454,115 of property damages. Drought is de type event tha most affect crops, with damages of $12,474,066,000.

For property and crop damages, tornados, floods, hails and hurricanes are the most importants events (top 5) that produce economic losses.

library(dplyr, warn.conflicts = F)
library(ggplot2, warn.conflicts = F)
library(stringdist, warn.conflicts = F)
library(gridExtra, warn.conflicts = F)

Data Processing

This first part is about loading data directly from original .bz2 file. The next part is the selection of only columns we will use in the analisys, and an extra step is added changing the class of EVTYPE column for character class, for a posterior easy manipulation.

data <- read.csv(file = "repdata_data_StormData.csv.bz2")

data2 <- data %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
data2$EVTYPE <- as.character(data2$EVTYPE)

data2$EVTYPE <- as.character(data2$EVTYPE)
data2$EVTYPE <-  tolower(data2$EVTYPE)

The formal list of type events is included in the STORM DATA PREPARATION documentation. That is why I included an extra table about events list. In this step I load this file and I prepare it for following processing steps.

events <- read.table(file = "eventTable", sep = ",", header = T)

events$Event <- as.character(events$Event)
events$Event <- tolower(events$Event)
str(events)

## 'data.frame':    48 obs. of  1 variable:
##  $ Event: chr  "astronomical low tide" "avalanche" "blizzard" "coastal flood" ...

Is clear that EVTYPE column in row data (NOAA data) is a lot of complex, because this columns has many typing errors. I trayed to repair the evidents errors in this column with this functions that help me to process a reclasification of data wrongly classficated.

The last row of this chunck call a external extended script that intent to reduce the variability of diferents unique events in EVTYPE column. I call this eternal script because it es very long and it will take a lot of space in the present report. The external script use the functions here presented. I resumned the external script in this code block tha run de complete analysis.

# script for observe the strig dist between events
ev <- function(event = x, maxDist  = 15){
  
  n <- maxDist:1
  
  for(i in 1:maxDist){
    m <- ain(x = data2$EVTYPE, table = events$Event[event], maxDist = n[i])
    print(paste(events$Event[event], " --------------- ", n[i], " ---------------", sep = ""))
    print(table(data2$EVTYPE[m]))
  }
}

# Script that make a event name sustitution
sel <- function(x,y){
  m <- ain(x = data2$EVTYPE, table = events$Event[x], maxDist = y)
  data2$EVTYPE[m] <- events$Event[x]
  return(data2)
}
## Script ###
x <- 1
y <- 3
data2 <- sel(x,y)
x <- 2
y <- 5
data2 <- sel(x,y)
x <- 3
y <- 5
data2 <- sel(x,y)
x <- 4
y <- 3
data2 <- sel(x,y)
x <- 5
y <- 3
data2 <- sel(x,y)
x <- 7
y <- 3
data2 <- sel(x,y)
data2$EVTYPE[
  grep(pattern = "fog", x = data2$EVTYPE, value = F)
  ] <- "dense fog"
x <- 8
y <- 4
data2 <- sel(x,y)
x <- 9
y <- 3
data2 <- sel(x,y)
x <- 10
y <- 4
data2 <- sel(x,y)
x <- 11
y <- 3
data2 <- sel(x,y)
x <- 12
y <- 3
data2 <- sel(x,y)
x <- 13
y <- 7
data2 <- sel(x,y)
x <- 14
y <- 3
data2 <- sel(x,y)
data2$EVTYPE[
  grep(pattern = "flood/flash", x = data2$EVTYPE, value = F, fixed = T)
  ] <- "flash flood"

data2$EVTYPE[
  grep(pattern = "flash flood/flood", x = data2$EVTYPE, value = F, fixed = T)
  ] <- "flash flood"

data2$EVTYPE[
  grep(pattern = "flash flood", x = data2$EVTYPE, value = F)
  ] <- "flash flood"

table(
grep(pattern = "flash flood", x = data2$EVTYPE, value = T)
)

## 
## flash flood 
##       55676

x <- 15
y <- 2
data2 <- sel(x,y)

data2$EVTYPE[
  grep(pattern = "flooding", x = data2$EVTYPE, value = F, fixed = T)
  ] <- "flood"

x <- 16
y <- 6
data2 <- sel(x,y)
data2$EVTYPE[
  grep(pattern = "frost", x = data2$EVTYPE, value = F)
  ] <- "frost/freeze"
data2$EVTYPE[
  grep(pattern = "freezing rain", x = data2$EVTYPE, value = F, fixed = T)
  ] <- "frost/freeze"

x <- 17
y <- 4
data2 <- sel(x,y)
data2$EVTYPE[
  grep(pattern = "funnel", x = data2$EVTYPE, fixed = T, value = F)
  ] <- "funnel cloud"
x <- 18
y <- 3
data2 <- sel(x,y)
x <- 19 # Aquí se debe revisar "hail"
y <- 1
data2 <- sel(x,y)
data2$EVTYPE <- gsub(pattern = "hail 75", 
                     replacement = "hail",
                     x = data2$EVTYPE, 
                     fixed = T
                     )

data2$EVTYPE <- gsub(pattern = "hail 80", 
                     replacement = "hail",
                     x = data2$EVTYPE, 
                     fixed = T
                     )

data2$EVTYPE <- gsub(pattern = "hail1",
                     replacement = "hail",
                     x = data2$EVTYPE, 
                     fixed = T
                     )

data2$EVTYPE <- gsub(pattern = "^hail[0-9]",
                     replacement = "hail",
                     x = data2$EVTYPE
                     )
data2$EVTYPE[
  grep(pattern = "^hail[0-9]", x = data2$EVTYPE,value = F)
  ] <- "hail"

data2$EVTYPE[
  grep(pattern = "hail\\.[0-9]", x = data2$EVTYPE,value = F)
  ] <- "hail"

data2$EVTYPE[
  grep(pattern = "^hail", x = data2$EVTYPE, value = F) 
  ] <- "hail"

x <- 20
y <- 2
data2 <- sel(x,y)
x <- 21
y <- 2
data2 <- sel(x,y)
x <- 22
y <- 2
data2 <- sel(x,y)

data2$EVTYPE[
  grep(pattern = "heavy snow", x = data2$EVTYPE,fixed = T, value = F)
  ] <- "heavy snow"

x <- 23
y <- 2
data2 <- sel(x,y)

x <- 24
y <- 3
data2 <- sel(x,y)

data2$EVTYPE[
  grep(pattern = "high wind", x = data2$EVTYPE, value = F, fixed = T)
  ] <- "high wind"

x <- 25 # Revisar typhon
y <- 11
data2 <- sel(x,y)

data2$EVTYPE[
  grep(pattern = "typhoon", x = data2$EVTYPE, value = F)
] <- "hurricane (typhoon)"

x <- 26
y <- 2
data2 <- sel(x,y)

x <- 27
y <- 8
data2 <- sel(x,y)
x <- 28
y <- 5
data2 <- sel(x,y)
x <- 29
y <- 3
data2 <- sel(x,y)
x <- 30
y <- 4
data2 <- sel(x,y)
x <- 31
y <- 3
data2 <- sel(x,y)
x <- 32
y <- 4
data2 <- sel(x,y)
x <- 33
y <- 4
data2 <- sel(x,y)
x <- 34
y <- 7
data2 <- sel(x,y)
x <- 35
y <- 2
data2 <- sel(x,y)
x <- 36
y <- 2
data2 <- sel(x,y)
x <- 37
y <- 6
data2 <- sel(x,y)
x <- 38
y <- 5
data2 <- sel(x,y)
x <- 39
y <- 8
data2 <- sel(x,y)

data2$EVTYPE[
  grep(pattern = "tstm wind", x = data2$EVTYPE, value = F, fixed = T)
] <- "thunderstorm wind"

x <- 40
y <- 3
data2 <- sel(x,y)
x <- 41
y <- 7
data2 <- sel(x,y)
x <- 42
y <- 4
data2 <- sel(x,y)
x <- 43
y <- 3
data2 <- sel(x,y)
x <- 44
y <- 7
data2 <- sel(x,y)
x <- 45
y <- 3
data2 <- sel(x,y)
x <- 46
y <- 4
data2 <- sel(x,y)
x <- 47
y <- 3
data2 <- sel(x,y)
x <- 48
y <- 4
data2 <- sel(x,y)

This is the final reclassification of EVTYPE column. I reduced the unique EVTYPE entries, from 985 differents entries to 547. It is evident that it need more work to do about this, but is sufficient for now beacuse is very difficult to make changes without a complete knowledge about data.

str(data2)

## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : chr  "tornado" "tornado" "tornado" "tornado" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

In the same way like EVTYPE, columns PROPDMGEXP and CROPDMGEXP, has problems with the format of damages values.

This is the way that columns look:

table(data2$PROPDMGEXP, useNA = "ifany")

## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

table(data2$CROPDMGEXP, useNA = "ifany")

## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

To correct this issue, I followed the next procedure:

## This for Property damage
data2$PROPDMGEXP <- as.character(data2$PROPDMGEXP)

data2$PROP <- gsub(pattern = "[-?+]", replacement = 1, x = data2$PROPDMGEXP)
data2$PROP <- gsub(pattern = "0", replacement = 1, x = data2$PROP)
data2$PROP <- gsub(pattern = "1", replacement = 10, x = data2$PROP)
data2$PROP <- gsub(pattern = "2", replacement = 100, x = data2$PROP)
data2$PROP <- gsub(pattern = "3", replacement = 1000, x = data2$PROP)
data2$PROP <- gsub(pattern = "4", replacement = 10000, x = data2$PROP)
data2$PROP <- gsub(pattern = "5", replacement = 1e+05, x = data2$PROP)
data2$PROP <- gsub(pattern = "6", replacement = 1e+06, x = data2$PROP)
data2$PROP <- gsub(pattern = "7", replacement = 1e+07, x = data2$PROP)
data2$PROP <- gsub(pattern = "8", replacement = 1e+08, x = data2$PROP)
data2$PROP <- gsub(pattern = "B", replacement = 1e+06, x = data2$PROP)
data2$PROP <- gsub(pattern = "[Hh]", replacement = 100, x = data2$PROP)
data2$PROP <- gsub(pattern = "K", replacement = 1000, x = data2$PROP)
data2$PROP <- gsub(pattern = "[Mm]", replacement = 1e+06, x = data2$PROP)
data2$PROP <- as.numeric(data2$PROP)
data2$PROP[is.na(data2$PROP)] <- 1

# This for Crop damage
data2$CROPDMGEXP <- as.character(data2$CROPDMGEXP)

data2$CROP <- gsub(pattern = "[?]", replacement = 1, x = data2$CROPDMGEXP)
data2$CROP <- gsub(pattern = "0", replacement = 1, x = data2$CROP)
data2$CROP <- gsub(pattern = "2", replacement = 100, x = data2$CROP)
data2$CROP <- gsub(pattern = "B", replacement = 1e+06, x = data2$CROP)
data2$CROP <- gsub(pattern = "[Kk]", replacement = 1000, x = data2$CROP)
data2$CROP <- gsub(pattern = "[Mm]", replacement = 1e+06, x = data2$CROP)
data2$CROP <- as.numeric(data2$CROP)
data2$CROP[is.na(data2$CROP)] <- 1

The new values looks like this:

table(data2$PROP, useNA = "ifany")

## 
##      1     10    100   1000  10000  1e+05  1e+06  1e+07  1e+08 
## 465934    255     20 424669      4     28  11381      5      1

table(data2$CROP, useNA = "ifany")

## 
##      1    100   1000  1e+06 
## 618439      1 281853   2004

Then I calculated the correct damages values to each variable.

data2$PROPVAL <- data2$PROP*data2$PROPDMG
data2$CROPVAL <- data2$CROP*data2$CROPDMG

Results

Population Healt

For this part I first grouped data by event type, summarising the values of fatalities by the sum of total values in each type event. After that, I extracted only the hightest top 20 types events.

############### FATALITIES GROUPING #################
fatalities <- data2 %>%
  group_by(EVTYPE) %>%
  summarise(Fatalities = sum(FATALITIES, na.rm = T)) %>%
  arrange(desc(Fatalities))

#FATALITIES TOP 20
fatalities <- fatalities[1:20,]

# CONSTRUCT A FATALITIES PLOT
fatalities$EVTYPE <- factor(x = fatalities$EVTYPE, 
                            levels = fatalities$EVTYPE)

gfatalities <- ggplot(fatalities, aes(x = EVTYPE,
                       y = Fatalities)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  ggtitle("Number of fatalities by type event") +
  xlab("") + 
  ylab("Number of fatalities") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
###############  INJURIES GROUPING ############### 
injuries <- data2 %>%
  group_by(EVTYPE) %>%
  summarise(Injuries = sum(INJURIES, na.rm = T)) %>%
  arrange(desc(Injuries))

# INJUIRIES TOP 20
injuries <-  injuries[1:20,]

#CONSTRUCTING A INJURIES PLOT
injuries$EVTYPE <- factor(injuries$EVTYPE, 
                          levels = injuries$EVTYPE)



ginjuries <- ggplot(injuries, aes(x = EVTYPE,
                                      y = Injuries)) +
  geom_bar(stat = "identity") +
  #coord_flip() + 
  theme_minimal() +
  ggtitle("Number of injuries by type event") +
  xlab("") + 
  ylab("Number of injuries") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Figure 1. The next Figure show the amount of fatalities and injuries that represent each one of type event (Top 20).

#PLOTING IN ONE PANEL
grid.arrange(gfatalities, ginjuries, ncol=2)

Tornados are the type event with the biggest efect in population healt. There are a big difference in magnitud of this type of event versus the rest of the events.

Economic impact

For this part I first grouped data by event type, summarising the values of damages by the sum of total values in each type avent. After that, I extracted only the hightest top 20 types events.

###############  GROUPING DAMAGES ############### 
prop <- data2 %>%
  group_by(EVTYPE) %>%
  summarise(prop = sum(PROPVAL, na.rm = T)) %>%
  arrange(desc(prop))

# SELECTION DAMAGES TOP 20
prop <- prop[1:20,]

prop$EVTYPE <- factor(prop$EVTYPE,
                      levels = prop$EVTYPE)

# CONSTRUCTING A PLOT OF PROPERTY DAMAGES
gprop <- ggplot(prop , aes(x = EVTYPE,
                           y = prop)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  ggtitle("Property damage by type event") +
  xlab("") + 
  ylab("Property damage dollars cost") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
 
###############  GROUPING CROP DAMAGES ############### 
crop <- data2 %>%
  group_by(EVTYPE) %>%
  summarise(crop = sum(CROPVAL, na.rm = T)) %>%
  arrange(desc(crop))

# SELECTION TOP 20 CROP DAMAGES
crop <- crop[1:20,]

crop$EVTYPE <- factor(crop$EVTYPE,
                      levels = crop$EVTYPE)

# CONSTRUCTING A CROP DAMAGES PLOT
gcrop <- ggplot(crop , aes(x = EVTYPE,
                           y = crop)) +
  geom_bar(stat = "identity") +
  #coord_flip() + 
  theme_minimal() +
  ggtitle("Crop damages by type event") +
  xlab("") + 
  ylab("Crop Damage dollars cost") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Figure 2. The next figure show the cost in dollars that represent each one of the type events (Top 20)

# PLOTING DAMAGES IN A SINGLE PANEL
grid.arrange(gprop, gcrop, ncol=2)

Tornado is again the type event that have a big influence in the economic impact, It is clearly the biggest factor in property damage. About crop damage, the type avent that has the bigger impact is drought.

Conclusions