Synopsis

The goals of the work were to address the following questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

As the result of the analysis described below, the following answers are:

1. Across the United States, which types of events are most harmful with respect to population health?

  1. Tornado (5633 + 91364 appropriately)
  2. High Wind (808 + 8478 appropriately)
  3. Excessive Heat (2020 + 6703 appropriately)
  1. Tornado (5633)
  2. Excessive Heat (2020)
  3. Heat (1116)
  1. Tornado (91364)
  2. High Wind (8478)
  3. Flood(7915)

2. Across the United States, which types of events have the greatest economic consequences?

  1. Flood (145 + 6 appropriately)
  2. Hurricane (Typhoone) (73 + 3 appropriately)
  3. Tornado (57 + 0,4 appropriately)
  1. Flood (145)
  2. Hurricane (Typhoone) (73)
  3. Tornado (57)
  1. Drought (14)
  2. Flash flood (6,6)
  3. Flood (6)

Overview

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.


Data describtion

The data for this work come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The data can be downloaded from the web site:

Storm Data [47Mb] https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

Documentation describing how some of the variables are constructed/defined can be obtained here:

National Weather Service Storm Data Documentation https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

National Climatic Data Center Storm Events FAQ https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.


Data Processing

This part describes (in words and code) how the data were loaded into R and processed for analysis. Analysis starts from the raw CSV file containing the data.

Read the data

#########################################################
# Obtain name of the file (.bz2 archive with the dataset)
#########################################################

urlStormDataSet <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
fileName <- URLdecode(urlStormDataSet) %>% basename()


###########################
# Download the .bz2 archive
###########################

if (!file.exists(fileName)) {
  download.file(urlStormDataSet, fileName, method = "curl")
  print ("File was downloaded successfully")
} else {
  cat("The file [", fileName, "] already exists!")
}
## The file [ StormData.csv.bz2 ] already exists!

Since the data file in the archive after unpacking has size about 550Mb, let’s first read only names of the columns

###################
# Read column names
###################

stormDataHeaders <- read.csv(bzfile(fileName), nrows = 1)

Take a glance at the column names:

headers <- colnames(stormDataHeaders)
headers
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

In order to address the questions posted above, out of 37 columns we need only 7:

  1. For first question:
  • EVTYPE: Event types e.g. Tornado, Flood
  • FATALITIES: Number of fatalities caused
  • INJURIES: Number of injuries caused
  1. Fot the second question:
  • EVTYPE: Event types e.g. Tornado, Flood
  • PROPDMG: Property damage caused
  • PROPDMGEXP: Property damage exponential levels
  • CROPDMG: Crop damage caused
  • CROPDMGEXP: Crop damage exponential levels

So to save memory let’s read the dataframe only with these columns:

requiredColumns <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
columnClasses <- ifelse(headers %in% requiredColumns, NA, "NULL")

# Depending on the performance of your system this process may take some time
subsetedDataFrame <- read.csv(bzfile(fileName), colClasses = columnClasses)

stormDataFrame <- subsetedDataFrame

Take a look at the data frame:

head(stormDataFrame)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0

Let’s check structure of the data set:

str(stormDataFrame)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

We see plenty of rows: 902297 rows! Also we can notice EVTYPE as a factor has 985 levels!

Take a look at some of them:

head(levels(stormDataFrame$EVTYPE) , 50)
##  [1] "   HIGH SURF ADVISORY"          " COASTAL FLOOD"                
##  [3] " FLASH FLOOD"                   " LIGHTNING"                    
##  [5] " TSTM WIND"                     " TSTM WIND (G45)"              
##  [7] " WATERSPOUT"                    " WIND"                         
##  [9] "?"                              "ABNORMAL WARMTH"               
## [11] "ABNORMALLY DRY"                 "ABNORMALLY WET"                
## [13] "ACCUMULATED SNOWFALL"           "AGRICULTURAL FREEZE"           
## [15] "APACHE COUNTY"                  "ASTRONOMICAL HIGH TIDE"        
## [17] "ASTRONOMICAL LOW TIDE"          "AVALANCE"                      
## [19] "AVALANCHE"                      "BEACH EROSIN"                  
## [21] "Beach Erosion"                  "BEACH EROSION"                 
## [23] "BEACH EROSION/COASTAL FLOOD"    "BEACH FLOOD"                   
## [25] "BELOW NORMAL PRECIPITATION"     "BITTER WIND CHILL"             
## [27] "BITTER WIND CHILL TEMPERATURES" "Black Ice"                     
## [29] "BLACK ICE"                      "BLIZZARD"                      
## [31] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"       
## [33] "Blizzard Summary"               "BLIZZARD WEATHER"              
## [35] "BLIZZARD/FREEZING RAIN"         "BLIZZARD/HEAVY SNOW"           
## [37] "BLIZZARD/HIGH WIND"             "BLIZZARD/WINTER STORM"         
## [39] "BLOW-OUT TIDE"                  "BLOW-OUT TIDES"                
## [41] "BLOWING DUST"                   "blowing snow"                  
## [43] "Blowing Snow"                   "BLOWING SNOW"                  
## [45] "BLOWING SNOW- EXTREME WIND CHI" "BLOWING SNOW & EXTREME WIND CH"
## [47] "BLOWING SNOW/EXTREME WIND CHIL" "BREAKUP FLOODING"              
## [49] "BRUSH FIRE"                     "BRUSH FIRES"

Here we see that ENVTYPE column has names of the disasters having differents spelling like ‘BLIZZARD/HIGH WIND’ and ‘BLIZZARD/WINTER STORM’ some names contain digits like ‘TSTM WIND (G45)’. But accirding to the table above we have 47 event types so our task as to this column is to match each of the names to the 48 event types. Obviously it’s very hard to perform manually so we will have to use Cluster Analysis for this purpose.

Let’s look at another factor columns ‘PROPDMGEXP’ and ‘CROPDMGEXP’:

levels(stormDataFrame$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(stormDataFrame$CROPDMGEXP)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

The next part will describe process of transformation the data set to tidy form.


Data preparation

In order to reduce 985 levels of the ‘EVTYPE’ column to 48 event types from the book (page 6), we are going to use Clustering technic. Fot this purpose we are to use ‘stringdist’ package which will calculate distance between strings. The idea of the method to create clusters from 985 levels and link it 48 reference event types.

For more information about the package follow: https://cran.r-project.org/web/packages/stringdist/stringdist.pdf

For first, we need to cleand the column ‘EVTYPE’ from rows containing:

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
stormDataFrame <- stormDataFrame[!(tolower(stormDataFrame$EVTYPE) %like% "summary"), ]
stormDataFrame <- stormDataFrame[!(tolower(stormDataFrame$EVTYPE) %like% "none"), ]
stormDataFrame <- stormDataFrame[!(tolower(stormDataFrame$EVTYPE) %like% "other"), ]
stormDataFrame <- stormDataFrame[!(tolower(stormDataFrame$EVTYPE) == '?'), ]

dim(stormDataFrame)
## [1] 902159      7

138 rows were deleted.

For second, we are going to delete rows where values of ‘FATALITIES’, ‘INJURIES’, ‘PROPDMG’, ‘CROPDMG’ are 0 since it will give us nothing during aggregation operation which will be described below.

Let’s delete these rows:

stormDataFrame <- stormDataFrame[!(stormDataFrame$FATALITIES == 0 & 
                       stormDataFrame$INJURIES == 0 & 
                          stormDataFrame$PROPDMG == 0 &
                              stormDataFrame$CROPDMG == 0), ]

dim(stormDataFrame)
## [1] 254591      7

647568 rows were deleted.

For third, let’s replace values of ‘PROPDMGEXP’ and ‘CROPDMGEXP’ columns with numbers

#Combine all literal exps:
strangeValues <- c(levels(stormDataFrame$PROPDMGEXP), levels(stormDataFrame$CROPDMGEXP))
strangeValues <- unique(strangeValues)

#create dataframe setting appropriate values
expDictionary <- data.frame(value = strangeValues, exp = c(0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 2, 2, 3, 6, 6, 3))
expDictionary$value <- as.character(expDictionary$value)



stormDataFrame$PROPDMGEXP <- as.character(stormDataFrame$PROPDMGEXP)
stormDataFrame$CROPDMGEXP <- as.character(stormDataFrame$CROPDMGEXP)

Replacement literal exp with numerical one:

#Frequency of exp in PROPDMG
propDmgExpVector <- stormDataFrame$PROPDMGEXP

frequencyProp <- table(propDmgExpVector)
frequencyProp <- as.data.frame(frequencyProp)
frequencyProp <- frequencyProp[order(-frequencyProp$Freq), ]
frequencyProp
##    propDmgExpVector   Freq
## 14                K 231424
## 1                    11547
## 16                M  11320
## 4                 0    210
## 11                B     40
## 8                 5     18
## 15                m      7
## 13                H      6
## 3                 +      5
## 7                 4      4
## 9                 6      3
## 10                7      3
## 2                 -      1
## 5                 2      1
## 6                 3      1
## 12                h      1
#Frequency of exp in CROPDMG
cropDmgExpVector <- stormDataFrame$CROPDMGEXP

frequencyCrop <- table(cropDmgExpVector)
frequencyCrop <- as.data.frame(frequencyCrop)
frequencyCrop <- frequencyCrop[order(-frequencyCrop$Freq), ]
frequencyCrop
##   cropDmgExpVector   Freq
## 1                  152652
## 6                K  99902
## 8                M   1985
## 5                k     21
## 3                0     17
## 4                B      7
## 2                ?      6
## 7                m      1

We can see that ‘K’, ’’ and ‘M’ are the most popular here

#Replace 'PROPDMGEXP' with the dictionary:
if('0' %in% propDmgExpVector) {
    for(index in 1 : length(propDmgExpVector)) {
    pde <- propDmgExpVector[index]
      if(pde != 'K' & pde != '' & pde != 'M') {
        e <- expDictionary[expDictionary$value == pde, ]$exp
        propDmgExpVector[index] <- 10^e
      }
    }
} else {
    print("Replacement is already done!")
}


#Replace 'СROPDMGEXP' with the dictionary:
if('0' %in% cropDmgExpVector) {
  for(index in 1 : length(cropDmgExpVector)) {
    cde <- cropDmgExpVector[index]
    if(cde != 'K' & cde != '' & cde != 'M') {
      e <- expDictionary[expDictionary$value == cde, ]$exp
      cropDmgExpVector[index] <- 10^e
    }
  }
} else {
    print("Replacement is already done!")
}


#replace 'PROPDMGEXP'
propDmgExpVectorReplaced <- replace(propDmgExpVector, propDmgExpVector == 'K', 10^3) %>%     
                                   replace(propDmgExpVector == '', 1) %>%
                                          replace(propDmgExpVector == 'M', 10^6)

propDmgExpVectorReplaced <- as.numeric(propDmgExpVectorReplaced)

stormDataFrame$PROPDMG <- stormDataFrame$PROPDMG * propDmgExpVectorReplaced

#replace 'CROPDMGEXP'
cropDmgExpVectorReplaced <- replace(cropDmgExpVector, cropDmgExpVector == 'K', 10^3) %>%     
                                   replace(cropDmgExpVector == '', 1) %>%
                                          replace(cropDmgExpVector == 'M', 10^6)

cropDmgExpVectorReplaced <- as.numeric(cropDmgExpVectorReplaced)
stormDataFrame$CROPDMG <- stormDataFrame$CROPDMG * cropDmgExpVectorReplaced

Drop ‘PROPDMGEXP’ and ‘CROPDMGEXP’:

stormDataFrame <- stormDataFrame[-c(5, 7)]

testCluster <- stormDataFrame

Finally, we need to reduce ‘EVTYPE’ to 48 book types from the table above

#Create a vector with the reference event types:

targetEventTypes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow","Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout","Wildfire", "Winter Storm","Winter Weather")

To perform Clustering we are going to use ‘stringdist’ function with the Damerau-Levenshtein distance method which is like the optimal string alignment distance method calculating counts the number of deletions, insertions and substitutions necessary to turn b into a, but except that it allows for multiple edits on substrings.

#get factual event names
factualEventTypes <-as.character(testCluster$EVTYPE)
length(factualEventTypes)
## [1] 254591

Now we are going to use cluster analysis in order to reduce factual event types to 48 book reference event types

#install.packages("stringdist")
library("stringdist")

# Depending on your system this operation may take some time
distanceMatrix <- stringdistmatrix(tolower(factualEventTypes), tolower(targetEventTypes), method = "dl", useNames = "strings")

for(rowNumber in 1 : length(factualEventTypes)) {
  minInRow <- min(distanceMatrix[rowNumber, ])
  indexInTargetVector <- match(minInRow, distanceMatrix[rowNumber, ])
  factualEventTypes[rowNumber] <- colnames(distanceMatrix)[indexInTargetVector]
}

#check that all factyal values were reduced ti target event types
length(unique(factualEventTypes))
## [1] 48

The last step in the data prepatation is aggregation

testCluster$EVTYPE <- toupper(factualEventTypes)

#aggregation per event types:
library(dplyr)
aggregatedStormDataFrame <- testCluster %>% 
          group_by(EVTYPE) %>% 
              summarise(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), PROPDMG = sum(PROPDMG), CROPDMG = sum(CROPDMG))

Aggregated tidy data set

aggregatedStormDataFrame
## # A tibble: 48 x 5
##    EVTYPE                FATALITIES INJURIES    PROPDMG     CROPDMG
##    <chr>                      <dbl>    <dbl>      <dbl>       <dbl>
##  1 ASTRONOMICAL LOW TIDE          0        0    9745000           0
##  2 AVALANCHE                    226      194    3721800     5000000
##  3 BLIZZARD                     101      805  659333950   112060000
##  4 COASTAL FLOOD                 41       89  520273960    43691600
##  5 COLD/WIND CHILL               97       27    2314500    66600000
##  6 DEBRIS FLOW                    1        0   56050000    42000000
##  7 DENSE FOG                     19      342    9774000           0
##  8 DENSE SMOKE                    0        0     100050           0
##  9 DROUGHT                       16       52 1053120600 13972631000
## 10 DUST DEVIL                     2       43     743130           0
## # ... with 38 more rows

1. Across the United States, which types of events are most harmful with respect to population health?

#health impact
healthImpactDataFrame <- aggregatedStormDataFrame[c(1, 2, 3)]

transponedHealthImpactDataFrame <- t(healthImpactDataFrame)
colnames(transponedHealthImpactDataFrame) <- healthImpactDataFrame$EVTYPE
transponedHealthImpactDataFrame <- transponedHealthImpactDataFrame[-1, ]
transponedHealthImpactDataFrame <- transponedHealthImpactDataFrame[ , order((transponedHealthImpactDataFrame[1, ]), decreasing = TRUE)]

Plot “Health impacts from disasters sorted by fatalities”

barplot(transponedHealthImpactDataFrame, 
        main = "Injuries and fatalities from disasters sorted by fatalities",
        col = c("red","blue"),
        ylab = "Casualties, people",
        ylim = range(pretty(c(0, 10000))),
        cex.names = 0.6,
        las = 2
)
legend("topright",
       c("Injuries", "Fatalities"),
       fill = c("blue" ,"red")
)

2. Across the United States, which types of events have the greatest economic consequences?

#economic impact
economicImpactDataFrame <- aggregatedStormDataFrame[c(1, 4, 5)]

transponedEconomicImpactDataFrame <- t(economicImpactDataFrame)
colnames(transponedEconomicImpactDataFrame) <- healthImpactDataFrame$EVTYPE
transponedEconomicImpactDataFrame <- transponedEconomicImpactDataFrame[-1, ]
transponedEconomicImpactDataFrame <- transponedEconomicImpactDataFrame[ , order((transponedEconomicImpactDataFrame[1, ]), decreasing = TRUE)]

Plot “Property and crop damages from disasters sorted by property damages”

barplot(transponedEconomicImpactDataFrame,
        main = "Property and crop damages from disasters sorted by prop damages, USD",
        col = c("red","blue"),
        cex.names = 0.6,
        las = 2
)
legend("topright",
c("Crop damages", "Property damages"),
fill = c("blue" ,"red"),
cex = 0.75
)


Results

As it can be seen from the plots above:

1. Types of events are most harmful with respect to population health:

By total result (Fatalities + Injuries), people:

  1. Tornado (5633 + 91364 appropriately)
  2. High Wind (808 + 8478 appropriately)
  3. Excessive Heat (2020 + 6703 appropriately)

By Fatalities, people:

  1. Tornado (5633)
  2. Excessive Heat (2020)
  3. Heat (1116)

By Injuries, people:

  1. Tornado (91364)
  2. High Wind (8478)
  3. Flood(7915)

2. Types of events have the greatest economic consequences:

By total result (Property damages + Crop damages), billions USD:

  1. Flood (145 + 6 appropriately)
  2. Hurricane (Typhoone) (73 + 3 appropriately)
  3. Tornado (57 + 0,4 appropriately)

By Property damages, billions USD:

  1. Flood (145)
  2. Hurricane (Typhoone) (73)
  3. Tornado (57)

By Crop damages, billions USD:

  1. Drought (14)
  2. Flash flood (6,6)
  3. Flood (6)