Synopsis:

In this work, U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database is analyzed. The focus of the study is to analyze the data and find the natural events that are most harmful for population health and for US economy. The database starts from 1950 to November 2011. The official natural event types are 48 but before January 1993 not all event information were included. Hence, I started the primary data analysis from 1993 and finally from 1996. There are total 37 variables but we don’t need all of them. I only kept 10 variables related to types of events, casualties and economic loss. The total and average damage is calculated by each event type. The analysis shows that Hurricane/Typhoon is the most harmful for economy. On the other hand Tornado and heat are most deadly for human life.

Data Processing:

At first, “stringdist”, “ggplot2”, “gridExtra” and “dplyr” library is loaded.

library(stringdist)
library(ggplot2)
library(gridExtra)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Then the stormData.csv.bz2 file was loaded to “R” using read.csv function.

fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileurl, "StormData.csv.bz2")
storm <- read.csv(gzfile("StormData.csv.bz2"))
storm[1:5, 1:5]
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY
## 1       1  4/18/1950 0:00:00     0130       CST     97
## 2       1  4/18/1950 0:00:00     0145       CST      3
## 3       1  2/20/1951 0:00:00     1600       CST     57
## 4       1   6/8/1951 0:00:00     0900       CST     89
## 5       1 11/15/1951 0:00:00     1500       CST     43

To find the approximate number of events (typos are not fixed) measured in each year, the fillowing analysis is done.

data1 <- storm
data1$BGN_DATE <- as.Date(data1$BGN_DATE, "%m/%d/%Y")
data1$EVTYPE <- as.character(data1$EVTYPE)
data1$BGN_DATE <- format(data1$BGN_DATE, "%Y")
data1 <- table(data1$BGN_DATE, data1$EVTYPE)
data1 <- data.frame(data1)
data1 <- data1[data1$Freq >0 , ]
data1 <- table(data1$Var1)
data1 <- data.frame(data1)
data1$Var1 <- format(as.Date(data1$Var1, "%Y"), "%Y")
plot(data1$Var1, data1$Freq, pch = 17, xlab = "Year", 
     ylab = "Number of Events", col = "red", lwd = 4, 
     main = "Number of events (no processing was done) 
     measured in each year", sub = "Figure 1")

From Figure 1, we can see that before 1993 very limited number of events were stored in the database. So, it is justified to use the information from 1993. There is a drastic increase in the number of events in 1993, 1994 and 1995 but it’s become more consistent from 1996. To get a better idea, I divided the data into two sections (1993-1995 and 1996-2011) and did the following analysis.

  1. Convert date and damage related column into date and character class
storm_96 <- storm
storm_96$BGN_DATE <- as.Date(storm_96$BGN_DATE, "%m/%d/%Y")
storm_96$PROPDMGEXP <- as.character(storm_96$PROPDMGEXP)
storm_96$CROPDMGEXP <- as.character(storm_96$CROPDMGEXP)
  1. Make two data sets, from 1993-1995 and 1996-2011
storm_93 <- subset(storm_96, BGN_DATE >= "1993-01-01" & BGN_DATE < "1996-01-01")
storm_96 <- subset(storm_96, BGN_DATE >= "1996-01-01" & BGN_DATE < "2012-01-01")

unique(storm_93$PROPDMGEXP)
##  [1] ""  "B" "K" "M" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(storm_93$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"
dim(storm_93)
## [1] 61208    37
unique(storm_96$PROPDMGEXP)
## [1] "K" ""  "M" "B" "0"
unique(storm_96$CROPDMGEXP)
## [1] "K" ""  "M" "B"
dim(storm_96)
## [1] 653530     37

There are many types of multipliers in the database from 1993 to 1995. According to this https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html discussion, the values of multipliers are

H,h,K,k,M,m,B,b,+,-,?,0,1,2,3,4,5,6,7,8, and blank-character

H,h = hundreds = 100

K,k = kilos = thousands = 1,000

M,m = millions = 1,000,000

B,b = billions = 1,000,000,000

(+) = 1

(-) = 0

(?) = 0

black/empty character = 0

numeric 0..8 = 10

We are interested to find the events which are most harmful for human and economy. Hence, except H, K, M and B, all other multipliers can be ignored. Data set storm_93 contains many low value multiplier in PROPDMGEXP and CROPDMGEXP variables. Besides, the number of observations are more than 10 times higher in year 1996 (653530), compared to year 1993-1995 (61208). So, it is reasonable to explore the data from data set storm_96 (year 1996).

In the next section, a subset containing 10 variables are created that are necessary for this analysis. Then, another subset is created, where at least one observation is non-zero among FATALITIES, INJURIES, PROPDMG and CROPDMG variables. It means, an observation will be ignored if all the values are zero.

storm_96 <- subset(storm_96, 
                  select = c("STATE__", "BGN_DATE", "EVTYPE", 
                             "FATALITIES", "INJURIES", "PROPDMG", 
                             "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "REFNUM"))
storm_96 <- storm_96[storm_96$FATALITIES != 0 | storm_96$INJURIES != 0 | 
                       storm_96$PROPDMG != 0 | storm_96$CROPDMG != 0, ]
storm_96$EVTYPE <- as.character(storm_96$EVTYPE)
storm_96$EVTYPE <- toupper(storm_96$EVTYPE)
str(storm_96)
## 'data.frame':    201318 obs. of  10 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Date, format: "1996-01-06" "1996-01-11" ...
##  $ EVTYPE    : chr  "WINTER STORM" "TORNADO" "TSTM WIND" "TSTM WIND" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PROPDMG   : num  380 100 3 5 2 400 12 8 12 75 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  38 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "K" "" "" "" ...
##  $ REFNUM    : num  248768 248769 248770 248771 248772 ...

In the below section all the official event types (48) are written in an array to match with the storm_96 data set.

eventtype <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
               "Cold", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought",
               "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill",
               "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog",
               "Hail", "Heat", "Heavy Rain", "Heavy Snow", "high Surf", "High Wind",
               "Hurricane/Typhoon", "Ice storm", "Lake-Effect Snow", "Lakeshore Flood",
               "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind",
               "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet",
               "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
               "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
               "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")

eventtype <- toupper(eventtype)

Some of the event types in the data set is not possible to match using matching functions. For example, “LANDSLIDE” is not included in the official type but according to the Storm data document given in the course website, “LANDSLIDE”" was renamed as “DEBRIS FLOW”. There is no type named “OTHER” but from the data set “remarks” column, we can see that most “OTHER” events are “HEAVY RAIN”, although there are exceptions too. Hence, in the following code chunk, this kind of events are replaced with appropriate event name.

pattern1 <- c("TSTM", "LANDSLIDE", "URBAN/SML STREAM FLD", "HYPOTHERMIA/EXPOSURE", 
              "HYPERTHERMIA/EXPOSURE", "MIXED PRECIPITATION", "UNSEASONABLY WARM", 
              "UNSEASONABLE COLD", "UNSEASONABLY COLD", "WIND AND WAVE", "FALLING SNOW/ICE",
              "MUDSLIDE", "^FOG", "RAIN", "OTHER", "BURST", "LANDSLUMP", "LIGHT SNOW",
              "GLAZE")
 
replacement1 <- c("THUNDERSTORM", "DEBRIS FLOW", "HEAVY RAIN", "COLD/WIND CHILL",
                  "EXCESSIVE HEAT", "COLD/WIND CHILL", "HEAT", "COLD/WIND CHILL", 
                  "COLD/WIND CHILL", "MARINE HIGH WIND", "SLEET", "HEAVY RAIN",
                  "FREEZING FOG", "HEAVY RAIN", "HEAVY RAIN", "THUNDERSTORM WIND", 
                  "HEAVY RAIN", "WINTER STORM", "FREEZING FOG")
 
  for (i in 1:length(pattern1))
 {
   storm_96$EVTYPE <- gsub(pattern1[i], replacement1[i], storm_96$EVTYPE)
  }

In the following code chunk, “amatch” function is used to check the typos and match events with the official event types.

st_match <- amatch(storm_96$EVTYPE, eventtype, maxDist = 20, method = "lcs")
print(paste0("Sum : ", sum(is.na(st_match))))
## [1] "Sum : 0"
storm_96$EVTYPE <- eventtype[st_match]
print(paste0("Number of events : ", length(unique(storm_96$EVTYPE))))
## [1] "Number of events : 48"

In the above code, we see that there is no “NA” values after matching the column. The total number of unique types are 48, which is the official number of events.

unique(storm_96$PROPDMGEXP)
## [1] "K" ""  "M" "B"
unique(storm_96$CROPDMGEXP)
## [1] "K" ""  "M" "B"
dmgexp <- c("K", "M", "", "B")
dmgmlt <- c(1000, 1000000, 0, 1000000000)
 
 storm_96 <- storm_96 %>% 
     mutate(PROPDMGEXP = ifelse(REFNUM == 605943, "M", PROPDMGEXP))
 
 for (i in 1:length(dmgexp))
 {
     storm_96 <- storm_96 %>% 
         mutate(PROPDMG = ifelse(PROPDMGEXP == dmgexp[i], PROPDMG*dmgmlt[i], PROPDMG))
     
     storm_96 <- storm_96 %>% 
         mutate(CROPDMG = ifelse(CROPDMGEXP == dmgexp[i], CROPDMG*dmgmlt[i], CROPDMG))
     
 }

The property damage value in reference number (REFNUM) 605943 is 115 Billion, which is not possible by a flood, whereas the property damage due to Hurricane Katrina (577615, 522616) is 16.93 and 31.30 Billion. So, I have changed 115 Billion to 115 Million, in the above code. The unique multipliers are “M”, “B”, “K” and “” for both property and crop damage. The damage values are multiplied by the associated multipliers in the EXP columns.

 storm_96$FATALINJURY <- storm_96$FATALITIES + storm_96$INJURIES
 storm_96$TDAMAGE <- storm_96$PROPDMG + storm_96$CROPDMG
 
 storm_new <- storm_96 %>% group_by(EVTYPE) %>% 
     summarise(PROPERTY = sum(PROPDMG), CROP = sum(CROPDMG), 
               TOTALDMG = sum(TDAMAGE), INJURY = sum(FATALINJURY),
               AVGDMG = mean(TDAMAGE), AVGINJURY = mean(FATALINJURY))
 
 head(storm_new)
## # A tibble: 6 x 7
##                  EVTYPE  PROPERTY       CROP   TOTALDMG INJURY     AVGDMG
##                   <chr>     <dbl>      <dbl>      <dbl>  <dbl>      <dbl>
## 1 ASTRONOMICAL LOW TIDE   9745000          0    9745000      0  974500.00
## 2             AVALANCHE   3711800          0    3711800    379   14059.85
## 3              BLIZZARD 525658950    7060000  532718950    455 2336486.62
## 4         COASTAL FLOOD 407218560          0  407218560     14 2046324.42
## 5                  COLD  23561600 1308973000 1332534600    322 5670360.00
## 6           DEBRIS FLOW 324583000   20017000  344600000     91 1804188.48
## # ... with 1 more variables: AVGINJURY <dbl>

In the above code chunk, two new variables are created.

  1. Fatalinjury = Sum of both fatalities and injuries
  2. Tdamage = Sum of both property and crop damage

Then a new data frame (storm_new) is created where calculations are grouped by events,

  1. Property = Sum of all property damage
  2. Crop = Sum of all crop damage
  3. Totaldmg = Sum of total damage (Tdamage)
  4. Injury = Sum of all Fatalities and injuries (Fatalinjury)
  5. Avgdmg = Average of total damage
  6. Avginjury = average of all Fatalities and injuries

Results:

In the following code chunk, a global theme for the two figures are created. Then the first figure is created from the total and average damage.

th <- theme_grey() + 
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, 
                                   size = 12, face = "bold"),
           axis.text.y=element_text(size = 12, face = "bold"),
           axis.title.x = element_text(size = 14, face = "bold"),
           axis.title.y = element_text(size = 14, face = "bold"),
           plot.caption = element_text(size = 12, face = "bold"))
 
 
df1 <- storm_new %>% subset(TOTALDMG >= quantile(TOTALDMG)[4])
df2 <- storm_new %>% subset(AVGDMG >= quantile(AVGDMG)[4])
 
 p1 <- ggplot(df1, aes(x = EVTYPE, y = TOTALDMG)) + 
     geom_bar(stat = "identity") + 
     labs(x = "Event Type", y = "Total Damage", caption = "Figure 2(a): shows 
          the  total damage by 12 events (75 percentile)") + th 
     
 p2 <- ggplot(df1, aes(x = EVTYPE, y = AVGDMG)) + 
     geom_bar(stat = "identity") + 
     labs(x = "Event Type", y = "Average Damage",  caption = "Figure 2(b): shows 
          the avearge damage by 12 events (75 percentile)") + th
 
 
grid.arrange(p1, p2, ncol = 2)

The events that are responsible for more than 75 percentile of total and average damage are shown in Figure 2. Figure 2(a) shows that Hurricane, Storm and Flood are three main reasons for product and crop damage. The mean value of each event also shows the similar pattern, except Drought takes the third position instead of Flood.

In the code below, the third figure is plotted from the total and average fatalities.

df1 <- storm_new %>% subset(INJURY >= quantile(INJURY)[4])
df2 <- storm_new %>% subset(AVGINJURY >= quantile(AVGINJURY)[4])
 
 p1 <- ggplot(df1, aes(x = EVTYPE, y = INJURY)) + 
     geom_bar(stat = "identity") + 
     labs(x = "Event Type", y = "Total Fatalities and Injuries",
          caption = "Figure 3(a): shows the events responsible for more than 
          75 parcentile of total population casualties") + th
 
 p2 <- ggplot(df2, aes(x = EVTYPE, y = AVGINJURY)) + 
     geom_bar(stat = "identity") + 
     labs(x = "Event Type", y = "Average Fatalities and Injuries", 
          caption = "Figure 3(b): shows the average population casualties 
          by the events that are responsible for more than 75 
          parcentile of the casualties") + th
 
 
 grid.arrange(p1, p2, ncol = 2)

The events, that are responsible for more than 75 percentile of human life loss and injury are shown in Figure 3. Figure 3(a) shows that Tornado, Excessive heat and Flood are the three most harmful events for population health. On the contrary, in case of mean life loss and injury by each event, Excessive heat, heat and Tsunami are the most harmful.