Synopsis

The analysis address two questions about the impact of different event types in the NOAA storm database. First, the health impact. Then, the economic impact. The data es pre-processed to drop unused variables, clean inconsistent explanation characters, and group & summarize the impact by each event type. The health impact is measured with the total amount of fatalities and the total amount of injuries. The economic impact is measured with the total estimated monetary cost in terms of property cost plus crop cost. The final recommendation is to prioritize tornados as the worst event by all means.

Analysis Goal

The 2 questions to be answered in this analysis are:

  1. Across the United States, which types of events are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

Required libraries

To complete the following analysis it is required to load these libraries in R.

library(ggplot2)
library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Reading Data

This is the starting point for the analysis. The data used for this analysis is the NOAA Storm Database available in the course site in the following link Storm Data.
In order to read the data it is supposed that the file is already unzipped in the working directory. To read the table it is used the read.csv function. Also, by using the str function, it is shown the primary details of the database.

storm <- read.csv("./repdata_data_StormData.csv")
str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Cleaning Data

As the goal of the analysis is to indicate which types of events are more harmful to the health and economy across the United States, there is no need about specific location details like state, county, latitude, longitude, etc.
To complete the whole analysis it is going to be used 8 variables:

  • BGN_DATE: variable indicating the date of occurrence.

  • EVTYPE: variable indicating the event type.

  • FATALITIES: variable indicating the amount of fatalities caused by the event.

  • INJURIES: variable indicating the amount of injuries caused by the event.

  • PROPDMG: variable indicating the 3 significant digits for the estimated property damage, in USD, caused by the event.

  • PROPDMGEXP: variable indicating the magnitude of the damage. “K” for thousands, “M” for millions, and “B” for billions.

  • CROPDMG: variable indicating the 3 significant digits for the estimated crop damage, in USD, caused by the event.

  • CROPDMGEXP: variable indicating the magnitude of the damage. “K” for thousands, “M” for millions, and “B” for billions.

The first step to clean the data is to remove unused columns. This will be performed by sub-setting the storm data frame. Then, we convert the BGN_DATE variable to a date R format. Also, a summary of the resulting data frame is presented.

storm <- storm[,c(2, 8, 23:28)]
storm <- mutate(storm, BGN_DATE = as.Date(BGN_DATE, format = "%m/%d/%Y"))
summary(storm)
##     BGN_DATE                        EVTYPE         FATALITIES      
##  Min.   :1950-01-03   HAIL             :288661   Min.   :  0.0000  
##  1st Qu.:1995-04-20   TSTM WIND        :219940   1st Qu.:  0.0000  
##  Median :2002-03-18   THUNDERSTORM WIND: 82563   Median :  0.0000  
##  Mean   :1998-12-27   TORNADO          : 60652   Mean   :  0.0168  
##  3rd Qu.:2007-07-28   FLASH FLOOD      : 54277   3rd Qu.:  0.0000  
##  Max.   :2011-11-30   FLOOD            : 25326   Max.   :583.0000  
##                       (Other)          :170878                     
##     INJURIES            PROPDMG          PROPDMGEXP        CROPDMG       
##  Min.   :   0.0000   Min.   :   0.00          :465934   Min.   :  0.000  
##  1st Qu.:   0.0000   1st Qu.:   0.00   K      :424665   1st Qu.:  0.000  
##  Median :   0.0000   Median :   0.00   M      : 11330   Median :  0.000  
##  Mean   :   0.1557   Mean   :  12.06   0      :   216   Mean   :  1.527  
##  3rd Qu.:   0.0000   3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000  
##  Max.   :1700.0000   Max.   :5000.00   5      :    28   Max.   :990.000  
##                                        (Other):    84                    
##    CROPDMGEXP    
##         :618413  
##  K      :281832  
##  M      :  1994  
##  k      :    21  
##  0      :    19  
##  B      :     9  
##  (Other):     9

As seen in the summary, the PROPDMGEXP and CROPDMGEXP variables contain other characters than “K”, “M”, and “B”. This can be seen in the summary of each variable.

summary(storm$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
summary(storm$CROPDMGEXP)
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

In the following analysis “k” and “m” are considered as “K” and “M”. Events containing other character will not be used for now on. In this step the lower case characters are converted to upper case. Then, three data frames are created, one for the observations of the health damage, one for the valid observations in property damage, and one for the valid observations in crop damage..

for(i in 1:length(storm$PROPDMGEXP)) {
    if(storm[i,6] == "k") {
        storm[i,6] = "K"
    } else if(storm[i,6] == "m") {
        storm[i,6] = "M"
    }
    if(storm[i,8] == "k"){
        storm[i,8] = "K"
    } else if(storm[i,8] == "m"){
        storm[i,8] = "M"
    }
}
valid_EXP <- c("K", "M", "B")

health <- storm[, 1:4]
prop <- storm[storm$PROPDMGEXP %in% valid_EXP, c(1:2, 5:6)]
crop <- storm[storm$CROPDMGEXP %in% valid_EXP, c(1:2, 7:8)]
prop$PROPDMGEXP <- droplevels(prop$PROPDMGEXP)
crop$CROPDMGEXP <- droplevels(crop$CROPDMGEXP)

summary(health)
##     BGN_DATE                        EVTYPE         FATALITIES      
##  Min.   :1950-01-03   HAIL             :288661   Min.   :  0.0000  
##  1st Qu.:1995-04-20   TSTM WIND        :219940   1st Qu.:  0.0000  
##  Median :2002-03-18   THUNDERSTORM WIND: 82563   Median :  0.0000  
##  Mean   :1998-12-27   TORNADO          : 60652   Mean   :  0.0168  
##  3rd Qu.:2007-07-28   FLASH FLOOD      : 54277   3rd Qu.:  0.0000  
##  Max.   :2011-11-30   FLOOD            : 25326   Max.   :583.0000  
##                       (Other)          :170878                     
##     INJURIES        
##  Min.   :   0.0000  
##  1st Qu.:   0.0000  
##  Median :   0.0000  
##  Mean   :   0.1557  
##  3rd Qu.:   0.0000  
##  Max.   :1700.0000  
## 
summary(prop)
##     BGN_DATE                        EVTYPE         PROPDMG        PROPDMGEXP
##  Min.   :1950-01-03   HAIL             :91966   Min.   :   0.00   B:    40  
##  1st Qu.:2000-07-20   THUNDERSTORM WIND:81941   1st Qu.:   0.00   K:424665  
##  Median :2007-09-12   TSTM WIND        :62844   Median :   1.00   M: 11337  
##  Mean   :2003-09-21   TORNADO          :51827   Mean   :  24.94             
##  3rd Qu.:2010-01-21   FLASH FLOOD      :32942   3rd Qu.:  10.00             
##  Max.   :2011-11-30   FLOOD            :17394   Max.   :5000.00             
##                       (Other)          :97128
summary(crop)
##     BGN_DATE                        EVTYPE         CROPDMG        CROPDMGEXP
##  Min.   :1993-01-04   HAIL             :82305   Min.   :  0.000   B:     9  
##  1st Qu.:2007-12-15   THUNDERSTORM WIND:81425   1st Qu.:  0.000   K:281853  
##  Median :2009-05-08   FLASH FLOOD      :21679   Median :  0.000   M:  1995  
##  Mean   :2008-09-23   FLOOD            :13622   Mean   :  4.853             
##  3rd Qu.:2010-09-03   HIGH WIND        :11501   3rd Qu.:  0.000             
##  Max.   :2011-11-30   TORNADO          : 9593   Max.   :990.000             
##                       (Other)          :63732

Damage Conversion

The next step in data processing is to multiply the significant digits with the appropriate number to find the total damage in USD.

for(i in 1:length(prop$PROPDMGEXP)){
    if(prop[i,4] == "K") {prop$PROPDMG_T <- prop$PROPDMG * 1000}
    if(prop[i,4] == "M") {prop$PROPDMG_T <- prop$PROPDMG * 1000000}
    if(prop[i,4] == "B") {prop$PROPDMG_T <- prop$PROPDMG * 1000000000}
}
for(i in 1:length(crop$CROPDMGEXP)){
    if(crop[i,4] == "K") {crop$CROPDMG_T <- crop$CROPDMG * 1000}
    if(crop[i,4] == "M") {crop$CROPDMG_T <- crop$CROPDMG * 1000000}
    if(crop[i,4] == "B") {crop$CROPDMG_T <- crop$CROPDMG * 1000000000}
}
summary(prop)
##     BGN_DATE                        EVTYPE         PROPDMG        PROPDMGEXP
##  Min.   :1950-01-03   HAIL             :91966   Min.   :   0.00   B:    40  
##  1st Qu.:2000-07-20   THUNDERSTORM WIND:81941   1st Qu.:   0.00   K:424665  
##  Median :2007-09-12   TSTM WIND        :62844   Median :   1.00   M: 11337  
##  Mean   :2003-09-21   TORNADO          :51827   Mean   :  24.94             
##  3rd Qu.:2010-01-21   FLASH FLOOD      :32942   3rd Qu.:  10.00             
##  Max.   :2011-11-30   FLOOD            :17394   Max.   :5000.00             
##                       (Other)          :97128                               
##    PROPDMG_T      
##  Min.   :      0  
##  1st Qu.:      0  
##  Median :   1000  
##  Mean   :  24943  
##  3rd Qu.:  10000  
##  Max.   :5000000  
## 
summary(crop)
##     BGN_DATE                        EVTYPE         CROPDMG        CROPDMGEXP
##  Min.   :1993-01-04   HAIL             :82305   Min.   :  0.000   B:     9  
##  1st Qu.:2007-12-15   THUNDERSTORM WIND:81425   1st Qu.:  0.000   K:281853  
##  Median :2009-05-08   FLASH FLOOD      :21679   Median :  0.000   M:  1995  
##  Mean   :2008-09-23   FLOOD            :13622   Mean   :  4.853             
##  3rd Qu.:2010-09-03   HIGH WIND        :11501   3rd Qu.:  0.000             
##  Max.   :2011-11-30   TORNADO          : 9593   Max.   :990.000             
##                       (Other)          :63732                               
##    CROPDMG_T     
##  Min.   :     0  
##  1st Qu.:     0  
##  Median :     0  
##  Mean   :  4853  
##  3rd Qu.:     0  
##  Max.   :990000  
## 

Grouping and Summarizing Data by Event Type

As the analysis questions address the total damage by event type, the next step to process data is to summarize the total damage by each event type. This will be done by using the group_by and summarize functions. In the case of property and crop damage, both data frames will be merged together to obtain the total damage of both.

health <- group_by(health, EVTYPE)
health_summ <- summarize(health, T_FAT_EVT = sum(FATALITIES),
                         T_INJ_EVT = sum(INJURIES))
summary(health_summ$T_FAT_EVT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   15.38    0.00 5633.00
summary(health_summ$T_INJ_EVT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   142.7     0.0 91346.0
prop <- group_by(prop, EVTYPE)
prop_summ <- summarize(prop, T_DMG_EVT = sum(PROPDMG_T))
summary(prop_summ$T_DMG_EVT)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 5.875e+03 6.000e+04 2.692e+07 5.750e+05 3.212e+09
crop <- group_by(crop, EVTYPE)
crop_summ <- summarize(crop, T_DMG_EVT = sum(CROPDMG_T))
summary(crop_summ$T_DMG_EVT)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##         0      5000     33500   8609727    684525 579573280
P_C_dmg <- rbind(crop_summ, prop_summ)
P_C_dmg <- group_by(P_C_dmg,EVTYPE)
P_C_dmg <- summarize(P_C_dmg, total_dmg = sum(T_DMG_EVT))
summary(P_C_dmg)
##                    EVTYPE      total_dmg        
##     HIGH SURF ADVISORY:  1   Min.   :0.000e+00  
##   FLASH FLOOD         :  1   1st Qu.:1.000e+04  
##   TSTM WIND           :  1   Median :5.900e+04  
##   TSTM WIND (G45)     :  1   Mean   :2.856e+07  
##  ?                    :  1   3rd Qu.:6.010e+05  
##  AGRICULTURAL FREEZE  :  1   Max.   :3.312e+09  
##  (Other)              :423

Results

Health Impact

The health impact will be measured using two different metrics. The first is the amount of fatalities and the second is the amount on injured people.

Fatalities

The following table presents the top 10 event types by impact to health measuring the total amount of fatalities.

health_fat <- arrange(health_summ, desc(T_FAT_EVT))
head(health_fat, 10)
## # A tibble: 10 x 3
##    EVTYPE         T_FAT_EVT T_INJ_EVT
##    <fct>              <dbl>     <dbl>
##  1 TORNADO             5633     91346
##  2 EXCESSIVE HEAT      1903      6525
##  3 FLASH FLOOD          978      1777
##  4 HEAT                 937      2100
##  5 LIGHTNING            816      5230
##  6 TSTM WIND            504      6957
##  7 FLOOD                470      6789
##  8 RIP CURRENT          368       232
##  9 HIGH WIND            248      1137
## 10 AVALANCHE            224       170
{ggplot(health_fat[1:10,], aes(EVTYPE, weight = T_FAT_EVT)) + 
    geom_bar() +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(title = "Top-10 Health Impact Event Types - Fatalities",
         x = "Event Type",
         y = "Total Fatalities",
         caption = "(Top-ten health impact event types measuring the total amount of fatalities as result)")}

Injuries

The following table presents the top 10 event types by impact to health measuring the total amount of injuries.

health_inj <- arrange(health_summ, desc(T_INJ_EVT))
head(health_inj, 10)
## # A tibble: 10 x 3
##    EVTYPE            T_FAT_EVT T_INJ_EVT
##    <fct>                 <dbl>     <dbl>
##  1 TORNADO                5633     91346
##  2 TSTM WIND               504      6957
##  3 FLOOD                   470      6789
##  4 EXCESSIVE HEAT         1903      6525
##  5 LIGHTNING               816      5230
##  6 HEAT                    937      2100
##  7 ICE STORM                89      1975
##  8 FLASH FLOOD             978      1777
##  9 THUNDERSTORM WIND       133      1488
## 10 HAIL                     15      1361
{ggplot(health_inj[1:10,], aes(EVTYPE, weight = T_INJ_EVT)) + 
    geom_bar() +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(title = "Top-10 Health Impact Event Types - Injuries",
         x = "Event Type",
         y = "Total Injuries",
         caption = "(Top-ten health impact event types measuring the total amount of injuries as result)")}

The following event types appear in the top-ten of both measurements, the list is in order of total fatalities:

  1. Tornado.

  2. Excessive heat.

  3. Flash flood.

  4. Heat.

  5. Lightning.

  6. TSTM wind.

  7. Flood.

Economic Impact

The following table presents the top 10 event types by impact to economy measuring the total estimated monetary damage.

P_C_dmg <- arrange(P_C_dmg, desc(total_dmg))
head(P_C_dmg, 10)
## # A tibble: 10 x 2
##    EVTYPE              total_dmg
##    <fct>                   <dbl>
##  1 TORNADO            3311817980
##  2 FLASH FLOOD        1598767950
##  3 TSTM WIND          1445113210
##  4 HAIL               1267938960
##  5 FLOOD              1067969360
##  6 THUNDERSTORM WIND   943491620
##  7 LIGHTNING           606765690
##  8 THUNDERSTORM WINDS  458716910
##  9 HIGH WIND           341979770
## 10 WINTER STORM        134698580
{ggplot(P_C_dmg[1:10,], aes(EVTYPE, weight = total_dmg)) + 
    geom_bar() +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(title = "Top-10 Economy Impact Event Types",
         x = "Event Type",
         y = "Total Estimated Monetary Cost",
         caption = "(Top-ten economy impact events type measured by total estimates monetary cost)")}

Final Recommendation

As seen on both, health and economic, analyses, the worst event type is Tornado. There are other event types that should be considered as priority such as:

  • Flash flood.

  • TSTM wind.

  • Flood.