Synopsis:

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The analysis addresses two questions: which event types most affect population health, and which event types most affect the United States economy.

Data Processing

Data Retrieval

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: Data
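As a minimal sketch of why no separate decompression step is needed, the toy example below writes a two-row CSV through a bzip2 connection and reads it back with read.csv(). The temporary file and the toy data frame are made up for illustration and are not part of the storm data.

```r
# Toy data frame and temporary file name, both hypothetical
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))

# Write the CSV through a bzip2-compressing connection
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path as a file connection, which detects the
# compression and decompresses while reading
toy_back <- read.csv(tmp)
print(toy_back)
```

The same behaviour is what lets the full StormData.csv.bz2 file be passed directly to read.csv().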

Information about the data can be found in the Storm Data Documentation.

Retrieving Data

library(readr)

if(!file.exists("StormData.csv.bz2")) {
    fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(fileUrl, destfile = "StormData.csv.bz2")
}

stormData <- read.csv("StormData.csv.bz2")
str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Cleaning Data

The main questions are as follows:

  1. Which types of events are most harmful to U.S. population health?
  2. Which types of events are most harmful to the U.S. economy?

Therefore, the columns I am going to use are as follows:

  1. Event Type (EVTYPE)
  2. Date (BGN_DATE)
  3. Fatalities (FATALITIES): population health
  4. Injuries (INJURIES): population health
  5. Property Damage (PROPDMG, scaled by PROPDMGEXP): economy
  6. Crop Damage (CROPDMG, scaled by CROPDMGEXP): economy

Selecting columns to create new dataset

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)

table(stormData$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
names <- c("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG",
    "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data1 <- select(stormData, all_of(names)) 

After checking the structure of the data, I will change the event type column into a factor variable and the date column into a date variable.

Changing into appropriate variable type

data1$EVTYPE <- as.factor(data1$EVTYPE)
data1$BGN_DATE <- as.Date(data1$BGN_DATE, format = "%m/%d/%Y")
str(data1)
## 'data.frame':    902297 obs. of  8 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_DATE  : Date, format: "1950-04-18" "1950-04-18" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
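The date conversion above works even though the raw strings carry a trailing "0:00:00" time, because as.Date() stops parsing once the supplied format is satisfied. A small sketch on two of the raw values shown earlier (the variable names here are illustrative only):

```r
# Two raw BGN_DATE strings as they appear in the str() output
raw_dates <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00")

# Only month/day/year are parsed; the trailing time text is ignored
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
print(parsed)

# Once the column is a Date, components such as the year are easy to extract
years <- format(parsed, "%Y")
print(years)
```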

Analysis Pt.1.1 Total Fatalities by Event Types

First, the analysis will attempt to understand which types of events are most harmful with respect to population health.

Identifying top 10 events causing fatalities

data2 <- data1

data2 %<>%
    group_by(EVTYPE) %>%
    summarise(FatalSum = sum(FATALITIES)) %>%
    arrange(desc(FatalSum))
## `summarise()` ungrouping output (override with `.groups` argument)
top10_harmingHealth <- data2[1:10,]
top10_harmingHealth
## # A tibble: 10 x 2
##    EVTYPE         FatalSum
##    <fct>             <dbl>
##  1 TORNADO            5633
##  2 EXCESSIVE HEAT     1903
##  3 FLASH FLOOD         978
##  4 HEAT                937
##  5 LIGHTNING           816
##  6 TSTM WIND           504
##  7 FLOOD               470
##  8 RIP CURRENT         368
##  9 HIGH WIND           248
## 10 AVALANCHE           224

Based on the analysis, the top 10 events that cause the most fatalities are as follows:

  1. TORNADO
  2. EXCESSIVE HEAT
  3. FLASH FLOOD
  4. HEAT
  5. LIGHTNING
  6. TSTM WIND
  7. FLOOD
  8. RIP CURRENT
  9. HIGH WIND
  10. AVALANCHE
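The group, sum, and rank pattern used above can be tried on a small made-up table. The sketch below is an alternative phrasing, not the code that produced the results in this report: it assumes dplyr 1.0.0 or later and uses slice_max() instead of arrange() plus row indexing. The toy data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one row per recorded event
toy <- data.frame(
    EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "HAIL"),
    FATALITIES = c(3, 2, 4, 1, 3, 0)
)

# Sum fatalities per event type, then keep the two largest totals
top2 <- toy %>%
    group_by(EVTYPE) %>%
    summarise(FatalSum = sum(FATALITIES), .groups = "drop") %>%
    slice_max(FatalSum, n = 2, with_ties = FALSE)

print(top2)
```

Setting .groups = "drop" also silences the ungrouping message that summarise() prints in dplyr 1.0.x.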

Graphing top 10 events causing fatalities

library(ggplot2)
g <- ggplot(data = top10_harmingHealth, aes(x = reorder(EVTYPE, -FatalSum),
    y = FatalSum))
g + geom_bar( stat = "identity", fill = "#003f5c") +
    labs(x = "Event Type", y = "Fatalities", title = "Total Fatalities by Event Type (1950-2011)") +
    theme(axis.text.x = element_text(size = 8)) +
    geom_text(aes(label = FatalSum), vjust = -.75, size = 3.5)

Analysis Pt.1.2 Total Injuries by Event Types

Identifying top 10 events causing injuries

data3 <- data1
data3 %<>%
    group_by(EVTYPE) %>%
    summarise(InjurySum = sum(INJURIES)) %>%
    arrange(desc(InjurySum))
## `summarise()` ungrouping output (override with `.groups` argument)
top10_injury_events <- data3[1:10,]
top10_injury_events
## # A tibble: 10 x 2
##    EVTYPE            InjurySum
##    <fct>                 <dbl>
##  1 TORNADO               91346
##  2 TSTM WIND              6957
##  3 FLOOD                  6789
##  4 EXCESSIVE HEAT         6525
##  5 LIGHTNING              5230
##  6 HEAT                   2100
##  7 ICE STORM              1975
##  8 FLASH FLOOD            1777
##  9 THUNDERSTORM WIND      1488
## 10 HAIL                   1361

Based on the analysis, the top 10 events that cause the most injuries are as follows:

  1. TORNADO
  2. TSTM WIND
  3. FLOOD
  4. EXCESSIVE HEAT
  5. LIGHTNING
  6. HEAT
  7. ICE STORM
  8. FLASH FLOOD
  9. THUNDERSTORM WIND
  10. HAIL

Graphing top 10 events causing injuries

g2 <- ggplot(data = top10_injury_events, aes(x = reorder(EVTYPE, -InjurySum),
    y = InjurySum))
g2 + geom_bar( stat = "identity", fill = "#003f5c") +
    labs(x = "Event Type", y = "Injuries", title = "Total Injuries by Event Type (1950-2011)") +
    theme(axis.text.x = element_text(size = 8)) +
    geom_text(aes(label = InjurySum), vjust = -.75, size = 3.5)

Analysis Pt.2.1 Total Economic Cost by Event Types

Because crop damage and property damage are both measured in dollars, I created a new variable that combines the two into a single total cost. The PROPDMGEXP and CROPDMGEXP columns hold the power of ten by which each damage figure must be multiplied.

Creating total cost variable

data4 <- data1

data4 %<>% 
    mutate(PROPDMGEXP = case_when(
        PROPDMGEXP == "K" ~ 3,
        PROPDMGEXP == "M" ~ 6,
        PROPDMGEXP == "B" ~ 9, 
        PROPDMGEXP == "m" ~ 6,
        PROPDMGEXP == "5" ~ 5,
        PROPDMGEXP == "6" ~ 6,
        PROPDMGEXP == "4" ~ 4,
        PROPDMGEXP == "2" ~ 2,
        PROPDMGEXP == "3" ~ 3,
        PROPDMGEXP == "h" ~ 2,
        PROPDMGEXP == "H" ~ 2,
        PROPDMGEXP == "7" ~ 7,
        PROPDMGEXP == "1" ~ 1,
        PROPDMGEXP == "8" ~ 8)) %>%
    mutate(CROPDMGEXP = case_when(
        CROPDMGEXP == "M" ~ 6,
        CROPDMGEXP == "K" ~ 3,
        CROPDMGEXP == "m" ~ 6,
        CROPDMGEXP == "B" ~ 9,
        CROPDMGEXP == "k" ~ 3,
        CROPDMGEXP == "2" ~ 2))

# Codes not matched above (blank, "-", "?", "+", "0") become NA; treat them as 10^0
data4$PROPDMGEXP[is.na(data4$PROPDMGEXP)] <- 0
data4$CROPDMGEXP[is.na(data4$CROPDMGEXP)] <- 0

data4 %<>%
    mutate(total_cost = (PROPDMG * 10^PROPDMGEXP) + (CROPDMG * 10^CROPDMGEXP))
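To make the arithmetic concrete: a record with PROPDMG = 25 and PROPDMGEXP = "K" is worth 25 * 10^3 = 25,000 dollars. The sketch below restates the exponent mapping as a named lookup vector. The names exp_lookup and to_dollars are hypothetical helpers introduced here for illustration, and the helper applies one rule to both property and crop codes, which is an assumption rather than the exact code used above.

```r
# Letter codes and the power of ten each one stands for (hypothetical helper)
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

to_dollars <- function(value, exp_code) {
    # Fold lowercase codes such as "k", "m", "h" into uppercase
    code <- toupper(exp_code)
    power <- unname(exp_lookup[code])
    # Digit codes "1" to "8" are used as the power itself
    is_digit <- code %in% as.character(1:8)
    power[is_digit] <- as.numeric(code[is_digit])
    # Everything else (blank, "-", "?", "+", "0") falls back to 10^0
    power[is.na(power)] <- 0
    value * 10^power
}

# Thousands, millions, billions, and a blank code
costs <- to_dollars(c(25, 2.5, 1.2, 7), c("K", "M", "b", ""))
print(costs)
```

A lookup vector keeps the mapping in one place, so adding or correcting a code does not require editing two separate case_when() calls.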

Identifying top 10 events causing most economic cost

top10_costs_events <- data4

top10_costs_events %<>%
    group_by(EVTYPE) %>%
    summarize(totalcost = sum(total_cost)) %>%
    arrange(desc(totalcost)) %>%
    mutate(totalcost = totalcost/1000000000)
## `summarise()` ungrouping output (override with `.groups` argument)
top10_costs_events <- top10_costs_events[1:10,]
top10_costs_events
## # A tibble: 10 x 2
##    EVTYPE            totalcost
##    <fct>                 <dbl>
##  1 FLOOD                150.  
##  2 HURRICANE/TYPHOON     71.9 
##  3 TORNADO               57.4 
##  4 STORM SURGE           43.3 
##  5 HAIL                  18.8 
##  6 FLASH FLOOD           18.2 
##  7 DROUGHT               15.0 
##  8 HURRICANE             14.6 
##  9 RIVER FLOOD           10.1 
## 10 ICE STORM              8.97

The top 10 events that cause the most economic cost are as follows:

  1. FLOOD
  2. HURRICANE/TYPHOON
  3. TORNADO
  4. STORM SURGE
  5. HAIL
  6. FLASH FLOOD
  7. DROUGHT
  8. HURRICANE
  9. RIVER FLOOD
  10. ICE STORM

Graphing top 10 events causing most economic cost

g3 <- ggplot(top10_costs_events, aes(x = reorder(EVTYPE, -totalcost), y = totalcost))
g3 + 
    geom_bar(stat = "identity", fill = "#003f5c") + 
    labs(x = "Event Type", y = "Total Cost (billions of dollars)", title = "Total Cost By Event (1950-2011)") +
    # vjust and size are fixed settings, so they belong outside aes()
    geom_text(aes(label = round(totalcost, digits = 1)), vjust = -.75, size = 3.5) +
    theme(legend.position = "none", axis.text.x = element_text(size = 8))

Results

The analysis found the following:

The top 10 events that cause the most fatalities are as follows:

  1. TORNADO
  2. EXCESSIVE HEAT
  3. FLASH FLOOD
  4. HEAT
  5. LIGHTNING
  6. TSTM WIND
  7. FLOOD
  8. RIP CURRENT
  9. HIGH WIND
  10. AVALANCHE

The top 10 events that cause the most injuries are as follows:

  1. TORNADO
  2. TSTM WIND
  3. FLOOD
  4. EXCESSIVE HEAT
  5. LIGHTNING
  6. HEAT
  7. ICE STORM
  8. FLASH FLOOD
  9. THUNDERSTORM WIND
  10. HAIL

The top 10 events that cause the most economic cost are as follows:

  1. FLOOD
  2. HURRICANE/TYPHOON
  3. TORNADO
  4. STORM SURGE
  5. HAIL
  6. FLASH FLOOD
  7. DROUGHT
  8. HURRICANE
  9. RIVER FLOOD
  10. ICE STORM