Impact of major weather events on Population Health and Economy of United States between 1950 and 2011

Synopsis :

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events: The data analysis must address the following questions:

1.- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

2.- Across the United States, which types of events have the greatest economic consequences?

This analysis shows that the most harmful type of weather events (1950 - 2011) to population health (including fatalities and injuries) was “Tornados” with 96,980 casualties and the most harmful to economy cost (Property and Crops) was “Floods” with $150,320 Million dollars


Loading the Libraries

library(ggplot2)
library(grid)
library(gridExtra)
library(ggpubr)
library(dplyr)

Obtaining and Loading the data :

The data for this assignment can be downloaded from the course web site: Storm Data [47Mb]. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. For more information, please review the documentation Storm Data Documentation

I have already downloaded the data into my working directory.

storm_data <- read.csv("./repdata_data_StormData.csv")
storm_data <- as_tibble(storm_data)
storm_data
## # A tibble: 902,297 x 37
##    STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##      <dbl> <chr>    <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
##  1       1 4/18/19~ 0130     CST           97 MOBILE     AL    TORNA~         0
##  2       1 4/18/19~ 0145     CST            3 BALDWIN    AL    TORNA~         0
##  3       1 2/20/19~ 1600     CST           57 FAYETTE    AL    TORNA~         0
##  4       1 6/8/195~ 0900     CST           89 MADISON    AL    TORNA~         0
##  5       1 11/15/1~ 1500     CST           43 CULLMAN    AL    TORNA~         0
##  6       1 11/15/1~ 2000     CST           77 LAUDERDALE AL    TORNA~         0
##  7       1 11/16/1~ 0100     CST            9 BLOUNT     AL    TORNA~         0
##  8       1 1/22/19~ 0900     CST          123 TALLAPOOSA AL    TORNA~         0
##  9       1 2/13/19~ 2000     CST          125 TUSCALOOSA AL    TORNA~         0
## 10       1 2/13/19~ 2000     CST           57 FAYETTE    AL    TORNA~         0
## # ... with 902,287 more rows, and 28 more variables: BGN_AZI <chr>,
## #   BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>, COUNTY_END <dbl>,
## #   COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <chr>, END_LOCATI <chr>,
## #   LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>,
## #   INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>, CROPDMG <dbl>,
## #   CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>, ZONENAMES <chr>,
## #   LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
## #   REMARKS <chr>, REFNUM <dbl>

Data Preprocessing :

In order to optimize resources, we only consider rows where Fatalities, Injuries, Property Damage Expense or Crop Damage Expense are > 0 and select 7 columns that will be used in this analysis (below are the columns selected). Next, transform the exponent values (“”,1,H,K,M,B) for both Crop Damages and Property Damage into numerical values and multiply them by economic damages

The raw data contains a lot of rows with values 0. Hence I’m filtering rows when values > 0.

sum (storm_data["FATALITIES"] == 0 & storm_data["INJURIES"] == 0 &
      storm_data["PROPDMG"] == 0 & storm_data["CROPDMG"] == 0)
## [1] 647664
req_data <- storm_data %>%
                select(c(8, 23:28)) %>%
                filter(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
    
req_data
## # A tibble: 254,633 x 7
##    EVTYPE  FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
##    <chr>        <dbl>    <dbl>   <dbl> <chr>        <dbl> <chr>     
##  1 TORNADO          0       15    25   K                0 ""        
##  2 TORNADO          0        0     2.5 K                0 ""        
##  3 TORNADO          0        2    25   K                0 ""        
##  4 TORNADO          0        2     2.5 K                0 ""        
##  5 TORNADO          0        2     2.5 K                0 ""        
##  6 TORNADO          0        6     2.5 K                0 ""        
##  7 TORNADO          0        1     2.5 K                0 ""        
##  8 TORNADO          0        0     2.5 K                0 ""        
##  9 TORNADO          1       14    25   K                0 ""        
## 10 TORNADO          0        0    25   K                0 ""        
## # ... with 254,623 more rows

As you can see the required data does not have any NA values.

sum(is.na(req_data))
## [1] 0

Calculating the impact on Health :

Preparing the data The storm data set includes event-related injuries and fatalities, both of which need to be considered for impact to population health.

All the fatalities are grouped by weather events, summed and sorted.

fatalities <- req_data %>%
                group_by(EVTYPE) %>%
                select(FATALITIES) %>%
                summarise(FATALITIES = sum(FATALITIES)) %>%
                arrange(desc(FATALITIES))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
fatalities
## # A tibble: 488 x 2
##    EVTYPE         FATALITIES
##    <chr>               <dbl>
##  1 TORNADO              5633
##  2 EXCESSIVE HEAT       1903
##  3 FLASH FLOOD           978
##  4 HEAT                  937
##  5 LIGHTNING             816
##  6 TSTM WIND             504
##  7 FLOOD                 470
##  8 RIP CURRENT           368
##  9 HIGH WIND             248
## 10 AVALANCHE             224
## # ... with 478 more rows

All the injuries are grouped by weather events, summed and sorted.

injuries <- req_data %>%
                group_by(EVTYPE) %>%
                select(INJURIES) %>%
                summarise(INJURIES = sum(INJURIES)) %>%
                arrange(desc(INJURIES))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
injuries
## # A tibble: 488 x 2
##    EVTYPE            INJURIES
##    <chr>                <dbl>
##  1 TORNADO              91346
##  2 TSTM WIND             6957
##  3 FLOOD                 6789
##  4 EXCESSIVE HEAT        6525
##  5 LIGHTNING             5230
##  6 HEAT                  2100
##  7 ICE STORM             1975
##  8 FLASH FLOOD           1777
##  9 THUNDERSTORM WIND     1488
## 10 HAIL                  1361
## # ... with 478 more rows

Both the fatalities and injuries are combined and grouped by weather events, summed and sorted.

combined <- req_data %>%
                group_by(EVTYPE) %>%
                select(FATALITIES, INJURIES) %>%
                summarise(FATALITIES.and.INJURIES = sum(FATALITIES + INJURIES)) %>%
                arrange(desc(FATALITIES.and.INJURIES))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
combined
## # A tibble: 488 x 2
##    EVTYPE            FATALITIES.and.INJURIES
##    <chr>                               <dbl>
##  1 TORNADO                             96979
##  2 EXCESSIVE HEAT                       8428
##  3 TSTM WIND                            7461
##  4 FLOOD                                7259
##  5 LIGHTNING                            6046
##  6 HEAT                                 3037
##  7 FLASH FLOOD                          2755
##  8 ICE STORM                            2064
##  9 THUNDERSTORM WIND                    1621
## 10 WINTER STORM                         1527
## # ... with 478 more rows

All the three subsets created above are joined together.

health_effects <- left_join(fatalities, injuries, by="EVTYPE") %>%
                     left_join(., combined, by='EVTYPE') 
health_effects
## # A tibble: 488 x 4
##    EVTYPE         FATALITIES INJURIES FATALITIES.and.INJURIES
##    <chr>               <dbl>    <dbl>                   <dbl>
##  1 TORNADO              5633    91346                   96979
##  2 EXCESSIVE HEAT       1903     6525                    8428
##  3 FLASH FLOOD           978     1777                    2755
##  4 HEAT                  937     2100                    3037
##  5 LIGHTNING             816     5230                    6046
##  6 TSTM WIND             504     6957                    7461
##  7 FLOOD                 470     6789                    7259
##  8 RIP CURRENT           368      232                     600
##  9 HIGH WIND             248     1137                    1385
## 10 AVALANCHE             224      170                     394
## # ... with 478 more rows

Calculating the impact on Ecomnomy :

Preparing the data Physical damages include property damages and crop damages, so we will combine both to find the events with the greatest economic consequences.

The damage value is represented by two parts “-DMG” (numeric) and “-DMGEXP” (alphanumeric) so the the first step is to convert exponent DMGEXP to numerical value for comparison then calculate to damage mounts for property and crops, then combine property damages and crop damages into a new column (CROP_PROP_TOTAL_DMG) and finally summarize damages by event types and rank to find the most significant events.

In order to convert exponential values to numeric values, I wrote the function below.

convExponent <- function(dmg, exp) {
    if (exp == "K") {
        dmg * 1000
    } else if (exp == "M") {
        dmg * 1e+06
    } else if (exp == "B") {
        dmg * 1e+09
    } else if (exp == "") {
        dmg
    } else {
        stop("NOT VALID DATA")
    }
}

Changing the exponential values to numeric values.

exp_data <- req_data %>%
                filter(PROPDMGEXP %in% c("", "K", "M", "B") &
                           CROPDMGEXP %in% c("", "K", "M", "B"))

exp_data$NEW_PROPDMG <- mapply(convExponent, exp_data$PROPDMG, exp_data$PROPDMGEXP)
exp_data$NEW_CROPDMG <- mapply(convExponent, exp_data$CROPDMG, exp_data$CROPDMGEXP)

exp_data
## # A tibble: 254,329 x 9
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP NEW_PROPDMG
##    <chr>       <dbl>    <dbl>   <dbl> <chr>        <dbl> <chr>            <dbl>
##  1 TORNA~          0       15    25   K                0 ""               25000
##  2 TORNA~          0        0     2.5 K                0 ""                2500
##  3 TORNA~          0        2    25   K                0 ""               25000
##  4 TORNA~          0        2     2.5 K                0 ""                2500
##  5 TORNA~          0        2     2.5 K                0 ""                2500
##  6 TORNA~          0        6     2.5 K                0 ""                2500
##  7 TORNA~          0        1     2.5 K                0 ""                2500
##  8 TORNA~          0        0     2.5 K                0 ""                2500
##  9 TORNA~          1       14    25   K                0 ""               25000
## 10 TORNA~          0        0    25   K                0 ""               25000
## # ... with 254,319 more rows, and 1 more variable: NEW_CROPDMG <dbl>

All the crop damages are grouped by weather events, summed and sorted.

crop_damage <- exp_data %>%
                    group_by(EVTYPE) %>%
                    select(NEW_CROPDMG) %>%
                    summarise(CROP_TOTAL = sum(NEW_CROPDMG)/10^6) %>%
                    arrange(desc(CROP_TOTAL))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
crop_damage
## # A tibble: 482 x 2
##    EVTYPE            CROP_TOTAL
##    <chr>                  <dbl>
##  1 DROUGHT               13973.
##  2 FLOOD                  5662.
##  3 RIVER FLOOD            5029.
##  4 ICE STORM              5022.
##  5 HAIL                   3001.
##  6 HURRICANE              2742.
##  7 HURRICANE/TYPHOON      2608.
##  8 FLASH FLOOD            1421.
##  9 EXTREME COLD           1293.
## 10 FROST/FREEZE           1094.
## # ... with 472 more rows

All the property damages are grouped by weather events, summed and sorted.

prop_damage <- exp_data %>%
                    group_by(EVTYPE) %>%
                    select(NEW_PROPDMG) %>%
                    summarise(PROP_TOTAL = sum(NEW_PROPDMG)/10^6) %>%
                    arrange(desc(PROP_TOTAL))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
prop_damage
## # A tibble: 482 x 2
##    EVTYPE            PROP_TOTAL
##    <chr>                  <dbl>
##  1 FLOOD                144658.
##  2 HURRICANE/TYPHOON     69306.
##  3 TORNADO               56925.
##  4 STORM SURGE           43324.
##  5 FLASH FLOOD           16141.
##  6 HAIL                  15727.
##  7 HURRICANE             11868.
##  8 TROPICAL STORM         7704.
##  9 WINTER STORM           6688.
## 10 HIGH WIND              5270.
## # ... with 472 more rows

crop and property damages are combined together, grouped by weather events, summed and sorted.

combined_damage <- exp_data %>%
                        group_by(EVTYPE) %>%
                        select(NEW_PROPDMG, NEW_CROPDMG) %>%
                        summarise(TOTAL_DMG = sum(NEW_PROPDMG + NEW_CROPDMG)/10^6) %>%
                        arrange(desc(TOTAL_DMG))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
combined_damage
## # A tibble: 482 x 2
##    EVTYPE            TOTAL_DMG
##    <chr>                 <dbl>
##  1 FLOOD               150320.
##  2 HURRICANE/TYPHOON    71914.
##  3 TORNADO              57290.
##  4 STORM SURGE          43324.
##  5 HAIL                 18728.
##  6 FLASH FLOOD          17562.
##  7 DROUGHT              15019.
##  8 HURRICANE            14610.
##  9 RIVER FLOOD          10148.
## 10 ICE STORM             8967.
## # ... with 472 more rows

All the data subsets created above are joined together.

economic_effects <- left_join(crop_damage, prop_damage, by="EVTYPE") %>%
                        left_join(., combined_damage, by='EVTYPE')

economic_effects
## # A tibble: 482 x 4
##    EVTYPE            CROP_TOTAL PROP_TOTAL TOTAL_DMG
##    <chr>                  <dbl>      <dbl>     <dbl>
##  1 DROUGHT               13973.    1046.      15019.
##  2 FLOOD                  5662.  144658.     150320.
##  3 RIVER FLOOD            5029.    5119.      10148.
##  4 ICE STORM              5022.    3945.       8967.
##  5 HAIL                   3001.   15727.      18728.
##  6 HURRICANE              2742.   11868.      14610.
##  7 HURRICANE/TYPHOON      2608.   69306.      71914.
##  8 FLASH FLOOD            1421.   16141.      17562.
##  9 EXTREME COLD           1293.      67.7      1361.
## 10 FROST/FREEZE           1094.       9.48     1104.
## # ... with 472 more rows

Results:

The following plot demonstrates that tornado is the most harmful weather event to population health and floods have the greatest economic consequences.

g1 <- ggplot(health_effects[1:10,], aes(x=FATALITIES.and.INJURIES, 
                                       y= reorder(EVTYPE,FATALITIES.and.INJURIES )))

g2 <- ggplot(economic_effects[1:10,], aes(x=TOTAL_DMG, y= reorder(EVTYPE, TOTAL_DMG)))



plot1 <- g1 + geom_bar(stat="identity", fill=heat.colors(10, alpha = 1)) +
                labs(x="Fatalities and Injuries", y="Event Types", 
                     title ="Population Health") +
                theme(plot.title = element_text(hjust = 0.5,family="CM Roman", face = "italic")) 

plot2 <- g2 + geom_bar(stat="identity", fill=topo.colors(10, alpha = 1)) +
                labs(x="Crop and Property Damage", y="Event Types", 
                    title = "USA Economy") +
                theme(plot.title = element_text(hjust = 0.5, family="CM Roman", face = "italic")) 


title1=text_grob("Top 10 Events has most Impact on", size = 15, face = "bold", hjust=0.5)

grid.arrange(plot1, plot2, top=title1)