1. Synopsis

The basic goal of this analysis is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database and answer the following questions:

  1. Across the United States, which types of events are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

The NOAA Storm Database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The database for this assignment was downloaded directly from the coursera assignment web page. For more details, access this link Storm Data Documentation

Results: The analysis showed that, across the United States, tornados are the most harmful event to population health, while floods have the greatest economic consequences.

2. Data Processing

2.1 Loading the data

Setting Rmarkdown global options and loading needed libraries.

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)

library(dplyr)
library(knitr)
library(kableExtra)
library(ggplot2)

Let’s first load the data using the read.csv() function and look at its variable names.

raw_data <- read.csv("repdata_data_StormData.csv.bz2", na.strings = "")

names(raw_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The 7 relevant Variables to answer the assignment questions are:

  • EVTYPE: event types e.g. Tornado, Flood
  • FATALITIES: number of fatalities caused
  • INJURIES: number of injuries caused
  • PROPDMG: property damage caused
  • CROPDMG: crop damage caused
  • PROPDMGEXP and CROPDMGEXP: alphabetical characters indicating, respectively, magnitude for property and crop damage: “K” for thousands, “M” for millions, and “B” for billions.

2.2 Working Subset

Creating a subset from the original database containing only these 7 variables. Changing variable names to more intuitive ones.

df <- raw_data %>% 
        select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG,  CROPDMGEXP)

#renaming col names
colnames(df) <- c("event", "fatalities", "injuries", "property_dmg", "property_dmg_magnitute",
                  "crop_dmg", "crop_dmg_magnitute") 

Some event types, representing the same event, are recorded differently due to typos or different labels.

groups <- df %>% group_by(event) %>% count() 


groups [c(18,19,21,22, 42,43,44),] %>% kable() %>% 
        kable_styling(full_width = FALSE, position = "left") 
event n
AVALANCE 1
AVALANCHE 386
Beach Erosion 1
BEACH EROSION 3
blowing snow 2
Blowing Snow 3
BLOWING SNOW 12

2.3 Reggrouping Events

Before proceeding to actual analysis, we need to correct this. Let’s do it using grepl() function with ignore.case = TRUE, so case is ignored during matching. We finish this step transforming this variable in a factor.

df$event[grepl("AVALANCE", df$event, ignore.case = TRUE)] <- "AVALANCHE"
df$event[grepl("FLOOD|FLOOODING", df$event, ignore.case = TRUE)] <- "FLOOD"
df$event[grepl("TORN", df$event, ignore.case = TRUE)] <- "TORNADO"
df$event[grepl("TSTM|THUNDERSTORM", df$event, ignore.case = TRUE)] <- "TSTM" 
df$event[grepl("TROPICAL|STORM", df$event, ignore.case = TRUE)] <- "STORM"
df$event[grepl("HURRICANE", df$event, ignore.case = TRUE)] <- "HURRICANE"  
df$event[grepl("ICE|SNOW|FROST|SLEET", df$event, ignore.case = TRUE)] <- "SNOW"
df$event[grepl("FOG", df$event, ignore.case = TRUE)] <- "FOG"
df$event[grepl("COLD|WINDCHILL|FREEZE|WINTER|COOL|^LOW T", df$event, ignore.case = TRUE)] <- "COLD"
df$event[grepl("HEAT|WARM|HOT", df$event, ignore.case = TRUE)] <- "HEAT"
df$event[grepl("CLOUD|FUNNEL", df$event, ignore.case = TRUE)] <- "CLOUD" 
df$event[grepl("HAIL", df$event, ignore.case = TRUE)] <- "HAIL"
df$event[grepl("DROUGHT|DRY", df$event, ignore.case = TRUE)] <- "DROUGHT"
df$event[grepl("LIGHTNING", df$event, ignore.case = TRUE)] <- "LIGHTNING"
df$event[grepl("FIRE", df$event, ignore.case = TRUE)] <- "FIRE" 
df$event[grepl("RAIN|SHOWER", df$event, ignore.case = TRUE)] <- "RAIN"   
df$event[grepl("WATERSPOUT", df$event, ignore.case = TRUE)] <- "WATERSPOUT"   
df$event[grepl("SURF", df$event, ignore.case = TRUE)] <- "SURF"   
df$event[grepl("CURRENT", df$event, ignore.case = TRUE)] <- "CURRENT"    
df$event[grepl("WIND|MICROBURST", df$event, ignore.case = TRUE)] <- "WIND" 
df$event[grepl("BLIZZARD", df$event, ignore.case = TRUE)] <- "BLIZZARD"
df$event[grepl("SLIDE", df$event, ignore.case = TRUE)] <- "LANDSLIDE"
df$event[grepl("DUST", df$event, ignore.case = TRUE)] <- "DUST"   
df$event[grepl("^(BEACH ERO|COASTAL ERO)", df$event, ignore.case = TRUE)] <- "BEACH EROSION"  
df$event[grepl("[?]|^SUMM", df$event, ignore.case = TRUE)] <- "UNKNOWN EVENT"
df$event[grepl("^BLOW-OUT TIDE", df$event, ignore.case = TRUE)] <- "BLOW-OUT TIDE"
df$event[grepl("^DAM", df$event, ignore.case = TRUE)] <- "DAM BREAK"
df$event[grepl("EXCESSIVE|^HIGH$|^OTH", df$event, ignore.case = TRUE)] <- "UNKNOWN EVENT"
df$event[grepl("^FREEZING DRI|Freezing Spray", df$event, ignore.case = TRUE)] <- "FREEZING DRIZZLE"
df$event[grepl("GLAZE", df$event, ignore.case = TRUE)] <- "GLAZE"
df$event[grepl("^GUSTNADO", df$event, ignore.case = TRUE)] <- "GUSTNADO"
df$event[grepl("^HEAVY PREC", df$event, ignore.case = TRUE)] <- "HEAVY PRECIPITATION"
df$event[grepl("^(HEAVY SE|HIGH SE|^ROUG)", df$event, ignore.case = TRUE)] <- "HEAVY SEAS"
df$event[grepl("^(HEAVY SWE|HIGH SWEL)", df$event, ignore.case = TRUE)] <- "HEAVY SWELLS"
df$event[grepl("^HIGH WAV|HIGH TID", df$event, ignore.case = TRUE)] <- "HIGH WAVES"
df$event[grepl("^HYPO", df$event, ignore.case = TRUE)] <- "HYPOTHERMIA/EXPOSURE"
df$event[grepl("^ICY R", df$event, ignore.case = TRUE)] <- "ICY ROADS"
df$event[grepl("^LANDSL", df$event, ignore.case = TRUE)] <- "LANDSLIDE"
df$event[grepl("^LIG", df$event, ignore.case = TRUE)] <- "LIGHTNING"
df$event[grepl("^MIXED PR", df$event, ignore.case = TRUE)] <- "MIXED PRECIPITATION"
df$event[grepl("^(RECORD HIGH$|RECORD LOW$)", df$event, ignore.case = TRUE)] <- "UNKNOWN EVENT"
df$event[grepl("^RECORD TEMP", df$event, ignore.case = TRUE)] <- "RECORD TEMPERATURE"
df$event[grepl("^(No Sev|NORMAL PREC)|NONE", df$event, ignore.case = TRUE)] <- "NO SEVERE WEATHER"
df$event[grepl("HIGH TEMP", df$event, ignore.case = TRUE)] <- "RECORD HIGH TEMPERATURE"
df$event[grepl("STREAM|STRM", df$event, ignore.case = TRUE)] <- "URBAN_SMALL STREAM"
df$event[grepl("^(TEMPERATURE RE|RECORD TEMP)", df$event, ignore.case = TRUE)] <- "TEMP RECORD"
df$event[grepl("ASH|PLUME", df$event, ignore.case = TRUE)] <- "VOLCANIC ASH"
df$event[grepl("SPOUT", df$event, ignore.case = TRUE)] <- "WATERSPOUT"
df$event[grepl("WINT", df$event, ignore.case = TRUE)] <- "WINTRY MIX"

df$event<-factor(df$event)
str(df$event)
##  Factor w/ 86 levels "ABNORMALLY WET",..: 65 65 65 65 65 65 65 65 65 65 ...

2.4 Processing property and crop damage data

This database has 2 different variables to register property and crop damage: one variable represents the numeric damage and the other indicates its magnitude, recorded as the following characters: “K” for thousands, “M” for millions, and “B” for billions. For some reason (maybe typos or other record errors) there are a few magnitude values that do not match the characters “K”, “M” and “B”. Before deciding what to do with them, let’s check the values proportions on that magnitudes strings.

#KMB index for magnitudes
magnitude.index <- df %>% mutate(
        prop.mag.index = case_when(is.na(property_dmg_magnitute) ~ "NA",
                                  property_dmg_magnitute == "K" ~ "KMB registered (Prop)",
                                  property_dmg_magnitute == "M" ~ "KMB registered (Prop)",
                                  property_dmg_magnitute == "B" ~ "KMB registered (Prop)",
                                  TRUE ~ "other entries"),
        crop.mag.index = case_when(is.na(crop_dmg_magnitute) ~ "NA",
                                  crop_dmg_magnitute == "K" ~ "KMB registered (Crop)",
                                  crop_dmg_magnitute == "M" ~ "KMB registered (Crop)",
                                  crop_dmg_magnitute == "B" ~ "KMB registered (Crop)",
                                  TRUE ~ "other entries"))

#property damage magnitude
magnitude.index %>% group_by(property_dmg_magnitute) %>% count() %>%  
        arrange(desc(n)) %>% head(10) %>% kable() %>% 
        kable_styling(full_width = FALSE, position = "left") 
property_dmg_magnitute n
NA 465934
K 424665
M 11330
0 216
B 40
5 28
1 25
2 13
? 8
m 7
#crop damage magnitude
magnitude.index %>% group_by(crop_dmg_magnitute) %>% count() %>%  
        arrange(desc(n)) %>% head(10) %>% kable() %>% 
        kable_styling(full_width = FALSE, position = "left") 
crop_dmg_magnitute n
NA 618413
K 281832
M 1994
k 21
0 19
B 9
? 7
2 1
m 1
t<-apply(magnitude.index[c("prop.mag.index", "crop.mag.index")], 2, function(x){
        tbl <- table(x)
        res <- cbind(tbl,paste0(round(prop.table(tbl)*100,2),"%"))
        colnames(res) <- c('Count','Perc')
        as.data.frame (res)})

kable(t,  table.attr = "style='width:60%;'") %>%
         kableExtra::kable_styling()
Count Perc
KMB registered (Prop) 436035 48.32%
NA 465934 51.64%
other entries 328 0.04%
Count Perc
KMB registered (Crop) 283835 31.46%
NA 618413 68.54%
other entries 49 0.01%

As we can see, the “other entries” category is basically irrelevant and we can just ignore it on further analysis. The NA entries usually occur when there’s no damage recorded to a specific event. No damage, no magnitude.

To compute economic damages, we are going to convert the KMB magnitude characters to their respective numeric values and multiply these values with property_dmg and crop_dmg.

To do this, we are going to use recode() function; it will preserve the existing order of levels while changing the values. Note that we decided to to set default=1, so all “other entries” values will be given the value 1.

#Converting property magnitudes to numeric values 
df$property_dmg_magnitute<-recode(df$property_dmg_magnitute,
                                 'K'=1000,'M'=1000000,'B'=1000000000,.default=1)

#Converting crop magnitudes to numeric values
df$crop_dmg_magnitute<-recode(df$crop_dmg_magnitute,
                                'K'=1000,'M'=1000000,'B'=1000000000,.default=1)

# multiplying to the corresponding damage numeric variable. storing the result in new variables
df$PropDmgValue <- df$property_dmg*df$property_dmg_magnitute
df$CropDmgValue <- df$crop_dmg*df$crop_dmg_magnitute

head(df)[,-c(2,3)] %>% kable(table.attr = "style='width:100%;'") %>% kable_styling()
event property_dmg property_dmg_magnitute crop_dmg crop_dmg_magnitute PropDmgValue CropDmgValue
TORNADO 25.0 1000 0 NA 25000 NA
TORNADO 2.5 1000 0 NA 2500 NA
TORNADO 25.0 1000 0 NA 25000 NA
TORNADO 2.5 1000 0 NA 2500 NA
TORNADO 2.5 1000 0 NA 2500 NA
TORNADO 2.5 1000 0 NA 2500 NA

3. Results

3.1 Most harmful event to population health

Assignment question 1, asks us to identify the most harmful events with respect to population health across the US.

To run this analysis, we will use the fatalities and the injuries variables as proxies for population health and look at the impact the event variable had on them.

Let’s first create a data frame with these 3 variables. We will use the aggregate() function to sum fatalities and injuries and place the result under the variable fatalities_injures

The top 10 most harmful event types are illustrated below.

harmful <- aggregate(fatalities + injuries ~ event, df, FUN = sum)

harmful %>%  slice_max(`fatalities + injuries`, n=10)
##        event fatalities + injuries
## 1    TORNADO                 97068
## 2       HEAT                 12421
## 3       TSTM                 10273
## 4      FLOOD                 10129
## 5  LIGHTNING                  6048
## 6      STORM                  4634
## 7       WIND                  2337
## 8       FIRE                  1698
## 9       SNOW                  1511
## 10 HURRICANE                  1463
healthplot <- harmful %>%  slice_max(`fatalities + injuries`, n=10)

ggplot(healthplot, 
       aes(x= reorder(event,desc(`fatalities + injuries`)), y= `fatalities + injuries`,
           fill = event)) +
        geom_bar(stat="identity") +
        scale_fill_hue(c = 40) +
        theme(legend.position="none")+
        labs(title="10 Most Harmful Events to Population Health", x="Event Type", y="Count")      

The data shows that the most harmful event to population health is a Tornado, followed by Heat, Thunderstorms (TSTS) and Floods. Other event impacts can be seen on the table and plot above.

3.2 Event economic consequences

Assignment question 2, asks us to identify the events economic consequences. For this analysis we will evaluated property and crop damage collectively. The top 10 event types with greatest economic consequences are shown below.

#new dataframe with economic damage
eco <- aggregate(PropDmgValue + CropDmgValue ~ event, df, FUN = sum)

eco %>%  slice_max(`PropDmgValue + CropDmgValue`, n=10)
##        event PropDmgValue + CropDmgValue
## 1      FLOOD                157764930803
## 2  HURRICANE                 44283445830
## 3    TORNADO                 18172843863
## 4      STORM                 13215140400
## 5       HAIL                 10046289587
## 6       TSTM                  5513399354
## 7       FIRE                  3838549570
## 8       WIND                  3355326595
## 9    DROUGHT                  1886540000
## 10      SNOW                  1292667400
ecoplot <- eco %>%  slice_max(`PropDmgValue + CropDmgValue`, n=10)

ggplot(ecoplot, 
       aes(x= reorder(event,desc(`PropDmgValue + CropDmgValue`)), y= `PropDmgValue + CropDmgValue`,
           fill = event)) +
        geom_bar(stat="identity") +
        scale_fill_hue(c = 40) +
        theme(legend.position="none")+
        labs(title="Top 10 economically harmfull events", x="Event Type", y="Count")

According to NOAA Storm Database, the most economically harmful events are Floods, followed by Thunderstorms (TSTM), Hails and Tornados. Other relevant events can be seen on the table and plot above.