Of all weather events, Tornado is most harmful to public health and Flood has the greatest economic consequences in United States

Synopsis

In this report we aim to find the weahter events that have been most harmful to public health and had greatest economic consequences in United States between the years 1950 and 2011. To find these events, we obtained National Oceanic and Atmospheric Administration’s (NOAA) storm database from Coursera web page of Reproducible Research course. From these data, we found that Tornado has by far done the most harm to public health in United States, with more than 58% of total population’s health damage. We also found that Flood and Hurrican/Typhoon had the greatest economic consequences in United States, with more than 33% and 19% of economic damage corresponding to property and crop damages.

Data Processing

This project involves exploring National Oceanic and Atmospheric Administration’s (NOAA) storm database from Coursera web page of Reproducible Research course. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

As part of data processing, we will load and clean this data set in the following sections.

Loading NOAA Storm Database

First of all, we check to see if there is a folder named “data” in working directory. If this folder doesn’t exist, we create it. Next we check to see if NOAA Storm Database named “repdatadataStormData.csv.bz2” exist in “data” directory and download it otherwise.

#check and create data folder
if(!dir.exists("./data")){dir.create("./data")}
#check to see if "repdatadataStormData.csv.bz2" exist, and if not, dowload it
if(!file.exists("./data//repdatadataStormData.csv.bz2")){
    fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(fileURL, destfile = "./data/repdatadataStormData.csv.bz2", method = "curl")
    downloadTime <- date()
}

When we are sure that “repdatadataStormData.csv.bz2” is in “data” folder, we read it simply using read.csv and store it in “stormDataFull” data frame.

#reading Storm Data
stormDataFull <- read.csv("./data/repdatadataStormData.csv.bz2")

Looking at data

After reading data, we will look at some summeries of it. We start by looking at its dimension.

dim(stormDataFull)
## [1] 902297     37

We can see that it has 902297 rows and 37 columns. Now we look at its column names.

names(stormDataFull)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Preprocessing NOAA Storm Database

For this data analysis, we want to answer two specific questions which concerncs about events that affect public health and economy of U.S. So we can select only the columns that relate to these questions (with REFNUM as id) and discard other columns. We will use select function from “dplyr” package for this purpose.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
stormData <- select(stormDataFull, EVTYPE, FATALITIES:CROPDMGEXP, REFNUM)

We can look at first few rows to get a feeling of this data.

head(stormData)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REFNUM
## 1 TORNADO          0       15    25.0          K       0                 1
## 2 TORNADO          0        0     2.5          K       0                 2
## 3 TORNADO          0        2    25.0          K       0                 3
## 4 TORNADO          0        2     2.5          K       0                 4
## 5 TORNADO          0        2     2.5          K       0                 5
## 6 TORNADO          0        6     2.5          K       0                 6

As we can see, each row correspond to an event and has an event type (EVTYPE) that caused fatalities (FATALITIES), injuries (INJURIES), property damage (PROPDMG times 10 to the power of PROPDMGEXP) in dollars and crop damage (CROPDMG times 10 to the power of CROPDMGEXP) in dollars.

Lets look at summary and structure of this data set.

summary(stormData)
##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9  
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 
str(stormData)
## 'data.frame':    902297 obs. of  8 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Now we look deeper into PROPDMGEXP and CROPDMGEXP.

table(stormData$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(stormData$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

We want these exponential multipliers to be numbers (and in numeric type). As “K” is the most common value for both of these variables, we will impute “K” for values that don’t have proper values. It is important to note that there is a probability that missing values for these multipiers were 0, which would result in 1$ multiplications, but we assume that they are equal to most common value which is “K” (x1000$).

#changing the type of PROPDMGEXP and CROPDMGEXP to character
stormData$PROPDMGEXP <- as.character(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- as.character(stormData$CROPDMGEXP)
#changing unknown exponential multiplires to the common one, which is "K"
stormData[stormData$PROPDMGEXP %in% c("", "-", "?", "+"), "PROPDMGEXP"] <- "K"
stormData[stormData$CROPDMGEXP %in% c("", "?"), "CROPDMGEXP"] <- "K"
#changing abbreviations of exponential multipliers to numbers for PROPDMGEXP
stormData[stormData$PROPDMGEXP %in% c("h", "H"), "PROPDMGEXP"] <- "2"
stormData[stormData$PROPDMGEXP == "K", "PROPDMGEXP"] <- "3"
stormData[stormData$PROPDMGEXP %in% c("m", "M"), "PROPDMGEXP"] <- "6"
stormData[stormData$PROPDMGEXP == "B", "PROPDMGEXP"] <- "9"
#changing abbreviations of exponential multipliers to numbers for CROPDMGEXP
stormData[stormData$CROPDMGEXP %in% c("k", "K"), "CROPDMGEXP"] <- "3"
stormData[stormData$CROPDMGEXP %in% c("m", "M"), "CROPDMGEXP"] <- "6"
stormData[stormData$CROPDMGEXP == "B", "CROPDMGEXP"] <- "9"
#changing the type of PROPDMGEXP and CROPDMGEXP to numeric
stormData$PROPDMGEXP <- as.numeric(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- as.numeric(stormData$CROPDMGEXP)

Now we want to clean EVTYPE, which according to documents from NOAA should have 48 unique values. Lets look at its number of unique values.

n_distinct(stormData$EVTYPE)
## [1] 985

It has 985 unique values, so we need to do a lot of cleaning to reduce these values to the original 48 values. A quick look at this variable show that there are lots of typos in it. for cleaning this variable, I first changed all of the values to upper case and removed the starting spaces from them.

stormData <- mutate(stormData, EVTYPE = toupper(EVTYPE))
stormData$EVTYPE <- sub("^ +", "", x = stormData$EVTYPE)

After this quick fix, I worked for several hours (about 6 hours) to reduce the number of values to near 48 proper values, specified by NOAA document. My strategy was to group the “stormData” data set by “EVTYPE”, then calculate number of rows for each group, arranged by both number of rows and EVTYPE name. I assumed that EVTYPE values that accured more, have higher probability that have typos in data set. So each time I looked at top of “EVTYPEnumber”, and then in the “EVTYPEname” searched for patterns that might be corresponded to those top values in “EVTYPEnumber”. Whenever I didn’t know what a value might be, I referred to document from NOAA and searched the values in it.

EVTYPEnumber <- group_by(stormData, EVTYPE) %>% summarize(n = n()) %>% arrange(desc(n))
EVTYPEname <- group_by(stormData, EVTYPE) %>% summarize(n = n()) %>% arrange(EVTYPE)

In every cycle of the above strategy, I found a pattern and used mostly gsub to correct the pattern for one value (out of total 48 values). For example, every EVTYPE value that started with “HAIL” and “SMALL HAIL” values corresponded to “HAIL” value. This long list is for correcting incorrect patterns:

#changing events for HAIL
stormData$EVTYPE <- gsub("^HAIL.+|SMALL HAIL", "HAIL", x = stormData$EVTYPE)
#changing events for THUNDERSTORM
stormData$EVTYPE <- gsub("^TSTM", "THUNDERSTORM", x = stormData$EVTYPE)
stormData$EVTYPE <- gsub("^THU.+|.*MICROBURST.*", "THUNDERSTORM WIND", x = stormData$EVTYPE)
stormData$EVTYPE <- gsub("TUNDERSTORM WIND", "THUNDERSTORM WIND", x = stormData$EVTYPE)
#changing events for TORNADO
stormData$EVTYPE <- gsub("^TORN.+", "TORNADO", x = stormData$EVTYPE)
#changing events for FLASH FLOOD
stormData$EVTYPE <- gsub("^FLASH.+", "FLASH FLOOD", x = stormData$EVTYPE)
#changing events for FLOOD
stormData$EVTYPE <- gsub("^FLOOD.+|^RIVER.+", "FLOOD", x = stormData$EVTYPE)
#changing events for HIGH WIND
stormData$EVTYPE <- gsub("^HIGH.+WIND.+", "HIGH WIND", x = stormData$EVTYPE)
#changing events for LIGHTNING
stormData$EVTYPE <- gsub("^LIGHTNING.+|LIGHTING", "LIGHTNING", x = stormData$EVTYPE)
#changing events for HEAVY SNOW
stormData$EVTYPE <- gsub("^HEAVY SNOW.+|HEAVY WET SNOW|HEAVY MIX|^SNOW.*|^EXCESSIVE SNOW.*", "HEAVY SNOW", x = stormData$EVTYPE)
#changing events for HEAVY RAIN
stormData$EVTYPE <- gsub("^HEAVY RAIN.+|^HEAVY PRECIP.+|^HEAVY SHOWER.*", "HEAVY RAIN", x = stormData$EVTYPE)
#changing events for WINTER STORM
stormData$EVTYPE <- gsub("^WINTER STORM.+|^FREEZING RAIN.*|^FREEZING DRIZZLE", "WINTER STORM", x = stormData$EVTYPE)
#changing events for WINTER WEATHER
stormData$EVTYPE <- gsub("^WINTER WEATHER.+|^WINT.+MIX|^LIGHT SNOW.*|^MODERATE SNOW.*", "WINTER WEATHER", x = stormData$EVTYPE)
#changing events for FUNNEL CLOUD
stormData$EVTYPE <- gsub(".*FUNNEL.*|.*CLOUD.*", "FUNNEL CLOUD", x = stormData$EVTYPE)
#changing TSTM in events to THUNDERSTORM
stormData$EVTYPE <- gsub("TSTM", "THUNDERSTORM", x = stormData$EVTYPE)
#changing events for WATERSPOUT
stormData$EVTYPE <- gsub("^WA.*TER.+", "WATERSPOUT", x = stormData$EVTYPE)
#changing events for STRONG WIND
stormData$EVTYPE <- gsub("^STRONG WIND.+|^WIND|^WIND [ADGS].+|^GUSTY WIND.*", "STRONG WIND", x = stormData$EVTYPE)
#changing events of URBAN... (small floods) to HEAVY RAIN
stormData$EVTYPE <- gsub("^URBAN.+|^SMALL STREAM.*|^STREET FLOOD.*", "HEAVY RAIN", x = stormData$EVTYPE)
#changing events for WILDFIRE
stormData$EVTYPE <- gsub("^WILD.+", "WILDFIRE", x = stormData$EVTYPE)
#changing events for BLIZZARD
stormData$EVTYPE <- gsub("^BLIZZARD.+", "BLIZZARD", x = stormData$EVTYPE)
#changing events for DROUGHT
stormData$EVTYPE <- gsub("^DROUGHT.+", "DROUGHT", x = stormData$EVTYPE)
#changing events for ICE STORM
stormData$EVTYPE <- gsub("^ICE.*", "ICE STORM", x = stormData$EVTYPE)
#changing events for EXCESSIVE HEAT
stormData$EVTYPE <- gsub("EXTREME HEAT|^EXCESSIVE HEAT.+", "EXCESSIVE HEAT", x = stormData$EVTYPE)
#changing events for HIGH SURF
stormData$EVTYPE <- gsub(".*SURF.*|^HIGH WA|^HIGH TIDES|^HIGH.+SWELLS", "HIGH SURF", x = stormData$EVTYPE)
#changing events for FROST/FREEZE
stormData$EVTYPE <- gsub(".*FROST.*|^FREEZE", "FROST/FREEZE", x = stormData$EVTYPE)
#changing events for DENSE FOG
stormData$EVTYPE <- gsub("^FOG.*", "DENSE FOG", x = stormData$EVTYPE)
#changing events for EXTREME COLD/WIND CHILL
stormData$EVTYPE <- gsub("^EXTREME WIND.+|^EXTREME.* COLD.*|EXCESSIVE COLD", "EXTREME COLD/WIND CHILL", x = stormData$EVTYPE)
#changing events for HEAT
stormData$EVTYPE <- gsub("^HEAT.+|^UNSEASONABLY WARM|^UNSEASONABLY HOT|^UNUSUAL.*WARM|^RECORD HEAT.*|^RECORD WARM.*", "HEAT", x = stormData$EVTYPE)
#changing events for TROPICAL STORM
stormData$EVTYPE <- gsub("^TROPICAL STORM.+", "TROPICAL STORM", x = stormData$EVTYPE)
#changing events for COASTAL FLOOD
stormData$EVTYPE <- gsub("^COASTAL.*FLOOD.*|^COASTAL.*STORM", "COASTAL FLOOD", x = stormData$EVTYPE)
#changing events for LAKE-EFFECT SNOW
stormData$EVTYPE <- gsub(".*LAKE.* SNOW.*", "LAKE-EFFECT SNOW", x = stormData$EVTYPE)
#changing events for LANDSLIDE (Debris Flow)
stormData$EVTYPE <- gsub("^LAND.+", "LANDSLIDE", x = stormData$EVTYPE)
#changing events for COLD/WIND CHILL
stormData$EVTYPE <- gsub("^COLD.*|^COOL.*|^WIND CHILL.*|^RECORD.*COLD.*|.*LOW.*TEMP.*", "COLD/WIND CHILL", x = stormData$EVTYPE)
#changing events for RIP CURRENT
stormData$EVTYPE <- gsub("RIP CURRENTS", "RIP CURRENT", x = stormData$EVTYPE)
#changing events for STORM SURGE/TIDE
stormData$EVTYPE <- gsub("^STORM SURGE.*", "STORM SURGE/TIDE", x = stormData$EVTYPE)
#changing events for ASTRONOMICAL LOW TIDE
stormData$EVTYPE <- gsub(".*ASTRONOMICAL.*", "ASTRONOMICAL LOW TIDE", x = stormData$EVTYPE)
#changing events for HURRICANE (Typhoon)
stormData$EVTYPE <- gsub("^HURRICANE.*|^TYPHOON", "HURRICANE/TYPHOON", x = stormData$EVTYPE)
#changing events for SLEET
stormData$EVTYPE <- gsub("^SLEET.*", "SLEET", x = stormData$EVTYPE)
#changing events for FREEZING FOG
stormData$EVTYPE <- gsub("^GLAZE.*", "FREEZING FOG", x = stormData$EVTYPE)

So, this way we reduced the number of unique EVTYPE values.

n_distinct(stormData$EVTYPE)
## [1] 347

We can see what proportion of rows are in the most accuring values. We check first 42 values:

#calculating percentage of cleaned data for EVTYPE
a <- group_by(stormData, EVTYPE) %>% summarize(n = n()) %>% arrange(desc(n))
sum(a[1:42, 2])/sum(a[, 2])
## [1] 0.9985404

So, the other 305 values have less than 0.15% of rows! This is enough for cleaning the data and we assume that other 305 values corresponding to 0.15% of rows wouldn’t affect our analysis.

A more solution oriented strategy

A more solution oriented strategy to clean data could be to summerize data with respect to harm to population health and damage to economy for each EVTYPE, and then do the above mentioned steps; but with optimizing our summaries (instead of number of rows). We could also assume a threshold (like 95%) for our output summary and stop the EVTYPE data cleaning at that point.

While this strategy would be more efficient and capture only EVTYPE values that had great impact on our questions, that would be alot solution oriented and not a good data cleaning!

Results

Our data analysis must address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Question 1

The first question is that: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

We can group our data by EVTYPE, then make summaries of fatalities and injuries to answer this question. I hypothesize that every fatality should have 3 times effect and every injury would have 1 time effect. It’s just my subjective hypothesis and by no means is an objective decision. You can make other decisions and with summarizing in other ways, continue with this analysis.

q1StormData <- group_by(stormData, EVTYPE) %>%
    summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), population = sum(FATALITIES*3 + INJURIES)) %>%
    arrange(desc(population))

We can look at events that are most harmful to population health (with our assumption about harmfulness criteria).

head(q1StormData, 10)
## Source: local data frame [10 x 4]
## 
##               EVTYPE FATALITIES INJURIES population
## 1            TORNADO       5658    91364     108338
## 2     EXCESSIVE HEAT       1999     6680      12677
## 3  THUNDERSTORM WIND        715     9537      11682
## 4              FLOOD        499     6809       8306
## 5          LIGHTNING        817     5232       7683
## 6               HEAT       1131     2561       5954
## 7        FLASH FLOOD       1018     1785       4839
## 8          ICE STORM         96     2115       2403
## 9          HIGH WIND        293     1471       2350
## 10       RIP CURRENT        572      529       2245

First 6 events do more that 83% of the harm to population health.

populationSum <- sum(q1StormData$population)
sum(q1StormData$population[1:6])*100/populationSum
## [1] 83.15633

As we want to find the types of events (as indicated in the EVTYPE variable) that are most harmful with respect to population health, we will plot the percentage of harm that each event has done to population health.

barplot(q1StormData$population[1:6]*100/populationSum, names.arg = q1StormData$EVTYPE[1:6], cex.names = 0.45, ylab = "Percentage of harm to population health", main = "Most harmful events for population health")

We can see that Tornado is by far the most harmful event for population health (with our criteria), with 58.26% of the harms, along with Exessive heat and Thundestorm wind with about 6.82% and 6.28% respectively.

Question 2

The second question is that: Across the United States, which types of events have the greatest economic consequences?

We can groupd our data frame by EVTYPE, then calculate damage summaries for property and crop and total economic damage, then arrange our data frame by total economic damage. Here we assume that economic consequence of an event is sum of its damage to properties and crops and we store this value in “totalDMG” variable.

q2StormData <- group_by(stormData, EVTYPE) %>%
    summarize(property = sum(PROPDMG*(10^PROPDMGEXP)), crop = sum(CROPDMG*(10^CROPDMGEXP)), totalDMG = sum(property + crop)) %>%
    arrange(desc(totalDMG))

We can look at events that have implied the most damage to economy, with their damages in dollars.

head(q2StormData, 10)
## Source: local data frame [10 x 4]
## 
##               EVTYPE     property        crop     totalDMG
## 1              FLOOD 150194585307 10936191950 161130777257
## 2  HURRICANE/TYPHOON  85356410010  5516117800  90872527810
## 3            TORNADO  58552215314   417461470  58969676784
## 4   STORM SURGE/TIDE  47964724000      855000  47965579000
## 5               HAIL  15977596956  3046890620  19024487576
## 6        FLASH FLOOD  17414948282  1437163150  18852111432
## 7            DROUGHT   1046106000 13972571780  15018677780
## 8  THUNDERSTORM WIND   9979128673  1225486980  11204615653
## 9          ICE STORM   3971716860  5027113500   8998830360
## 10          WILDFIRE   8491563500   402781630   8894345130

Again, first 6 events have done more than 83% of the damage to the economy.

damageSum <- sum(q2StormData$totalDMG)
sum(q2StormData$totalDMG[1:6])*100/damageSum
## [1] 83.1323

As we want to find the types of events that had the greatest economic consequences, we will plot the percentage of damage that each event has done to economy.

barplot(q2StormData$totalDMG[1:6]*100/damageSum, names.arg = q2StormData$EVTYPE[1:6], cex.names = 0.5, ylab = "Percentage of damage to economy", main = "Most harmful events for economy")

We can see that Flood and Hurricane/Typhoon have done the most damage to economy, with 33.76% and 19.04% of damages respectively. Again, we can see that Tornado is in the third place and has done 12.35% of damage to economy.