In the following document we will explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. We will do this with the objective of answering the following 2 questions:

  1. Across the United States, which types of events are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

The document consists on 3 sections:

  1. Data processing: Short section detailing the loading of the data.
  2. Analysis: Longest section, containing the analysis process with explanation of the steps undertaken and the code used.
  3. Results: Final section containing the charts and plots answering the proposed questions using the information derived from the ‘Analysis’ section.

1. Data processing

The dataset provided for the assignment has been downloaded an extract it. We will load it in it’s original, raw version now, and store it in a variable called ‘StormData’. To do the following, make sure to set your directory to the location of the file.

StormData <- read.csv("repdata-data-StormData.csv")

2. Analysis

The first step for this analysis will be to load the ggplot2 and dplyr packages, which will be necessary for our analysis.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

We will now run head() on StormData to get a clue of what we’re working with.

head(StormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

Now, let’s call colnames() on our dataset in order to see what variables we’re working with.

colnames(StormData)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Determining most harmful events

For this section, we will work on determining which events are the most harmful. We will first focus on fatalities, since this is the worst case of damage towards population health. Based on our use of colnames(), this measurement corresponds to the variable named ‘FATALITIES’.

Criteria

In order to determine which events are the most harmful towards population health, we will analyze this part in two ways. We will identify: a) Which events have the highest total number of fatalities overall; and b) Which events have the highest total number of injuries overall.

a) Which events have the highest total number of fatalities overall

For this task, we will extract the 2 columns we’re most interested in. That is “EVTYPE” for event type, and “FATALITIES”.

ev_fat <- select(StormData, EVTYPE, FATALITIES)

Now, we will group our data by EVTYPE and summarise the data by calculating the sum of fatalities by event.

total_fat <- group_by(ev_fat, EVTYPE) %>% summarise(total_fatalities = sum(FATALITIES))

Next, lets reorder the dataset in descending order of ‘total_fatalities’ and store it in the variable ‘most_fat’. We will use the dplyr function arrange for this task, specifying the argument ‘desc’.

most_fat <- arrange(total_fat, desc(total_fatalities))

We will analyze the results of this variable in the next section called ‘results’. For now, we will continue with the next criteria.

b) Which events have the highest total number of injuries overall

This section will essentially be the same as the previous section, but applied to the ‘INJURIES’ column, as follows:

ev_inj <- select(StormData, EVTYPE, INJURIES)
total_inj <- group_by(ev_inj, EVTYPE) %>% summarise(total_injuries = sum(INJURIES))
most_inj <- arrange(total_inj, desc(total_injuries))

Determining most ecomomically costly events

To make this analysis possible, we will begin by extracting the columns of interest. We will need “EVTYPE” once more, which indicates the event type. We will also need “PROPDMG” and “CROPDMG” for property damage and crop damage respectively. We will also need “PROPDMGEXP” and “CROPDMGEXP” for property damage exponent and crop damage exponent. This columns will indicate the exponent at which the damage columns are elevated to. Let’s do that now.

dmg <- select(StormData, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Given the damage values have different exponents, we will first make them all into the same exponent. We will choose ‘M’, or millions of dollars. We now run into a problem. The dataset documentation states the meaning of K, M, B as thousands, millions and billions respectively. Let’s now run unique() on dmg$PROPDMGEXP to understand the problem.

unique(dmg$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"

We can see the familiar K, M & B. However, we also observe a bunch of different entry types, most of them being integers. Since no documentation is provided that specifies the meaning of these, we will filter them out. We will only keep the entries from which ‘PROPDMGEXP’ and ‘CROPDMGEXP’ are equal to either K, M or B. We will also keep the observations where this value is empty, assuming this means the numeric value is to be taken as it is, with no additional exponent. We will also split them into two datasets, one for property damage and another for crop damage.

dmg_prop <- select(dmg, EVTYPE, PROPDMG, PROPDMGEXP)
dmg_crop <- select(dmg, EVTYPE, CROPDMG, CROPDMGEXP)
dmg_propf <- filter(dmg_prop, PROPDMGEXP %in% c("K","M","B", ""))
dmg_cropf <- filter(dmg_crop, CROPDMGEXP %in% c("K","M","B", ""))

After this, we ensure all of the exponents in the data are in either thousands, millions or billions of dollars, or simply no exponent at all. Now, we wish to express the damage quantities in millions of dollars. For this, we will multiply each entry corresponding to ‘B’ by a thousand, each “K” entry by 1e-3, and each empty entry by 1e-6. The ‘M’ entries will be left as they are since they are already expressed in millions of dollars.

dmg_propf <- mutate(dmg_propf, PROPDMGF = PROPDMG * case_when(
      PROPDMGEXP == "" ~ 1e-6,
      PROPDMGEXP == "K" ~ 1e-3,
      PROPDMGEXP == "M" ~ 1,
      PROPDMGEXP == "B" ~ 1e3
))

dmg_cropf <- mutate(dmg_cropf, CROPDMGF = CROPDMG * case_when(
      CROPDMGEXP == "" ~ 1e-6,
      CROPDMGEXP == "K" ~ 1e-3,
      CROPDMGEXP == "M" ~ 1,
      CROPDMGEXP == "B" ~ 1e3
))

Now, we will group by EVTYPE and sum the total damage in both cases.

total_dmg_prop <- group_by(dmg_propf, EVTYPE) %>% summarise(TOT_DMG = sum(PROPDMGF))
total_dmg_crop <- group_by(dmg_cropf, EVTYPE) %>% summarise(TOT_DMG = sum(CROPDMGF))

With the total damage sum by event type, we will merge both datasets into one, containing the total damage by event type, across both property and crop damage.

total_dmg <- merge(total_dmg_crop, total_dmg_prop, by = "EVTYPE")
total_dmg$TOT_DMG <- total_dmg$TOT_DMG.x + total_dmg$TOT_DMG.y
total_dmg <- select(total_dmg, EVTYPE, TOT_DMG)

We are now ready to analyze the results.

3. Results

In the following section we present the results of our analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.

a) Which events are the most harmful with respect to population?

After analizing the data, we found the following:

most_fat_sliced <- slice(most_fat, 1:10)
most_fat_sliced$EVTYPE <- factor(most_fat_sliced$EVTYPE, levels = unique(most_fat_sliced$EVTYPE))
ggplot(most_fat_sliced, aes(x = EVTYPE, y = total_fatalities)) +
      geom_bar(stat = 'identity', fill = 'royalblue') + 
      labs(title = "Deadliest event types", x = "Event Type", y = "Total Fatalities") +
      theme_minimal() +
      theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

print(most_fat)
## # A tibble: 985 × 2
##    EVTYPE         total_fatalities
##    <chr>                     <dbl>
##  1 TORNADO                    5633
##  2 EXCESSIVE HEAT             1903
##  3 FLASH FLOOD                 978
##  4 HEAT                        937
##  5 LIGHTNING                   816
##  6 TSTM WIND                   504
##  7 FLOOD                       470
##  8 RIP CURRENT                 368
##  9 HIGH WIND                   248
## 10 AVALANCHE                   224
## # ℹ 975 more rows

As we can tell from the chart, the 5 most deadly event types are, in descending order, tornadoes, excessive heat, flash flood, heat and lightning. Tornadoes alone account for over 5 thousand deaths.

Which events cause the most injuries?

most_inj_sliced <- slice(most_inj, 1:10)
most_inj_sliced$EVTYPE <- factor(most_inj_sliced$EVTYPE, levels = unique(most_inj_sliced$EVTYPE))
ggplot(most_inj_sliced, aes(x = EVTYPE, y = total_injuries)) +
      geom_bar(stat = 'identity', fill = 'darkred') + 
      labs(title = "Event types by injuries", x = "Event Type", y = "Total Cost (in millions of dollars)") +
      theme_minimal() +
      theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

print(most_inj)
## # A tibble: 985 × 2
##    EVTYPE            total_injuries
##    <chr>                      <dbl>
##  1 TORNADO                    91346
##  2 TSTM WIND                   6957
##  3 FLOOD                       6789
##  4 EXCESSIVE HEAT              6525
##  5 LIGHTNING                   5230
##  6 HEAT                        2100
##  7 ICE STORM                   1975
##  8 FLASH FLOOD                 1777
##  9 THUNDERSTORM WIND           1488
## 10 HAIL                        1361
## # ℹ 975 more rows

Once more, tornadoes take the first place, this time by an even larger margin, causing almost 100 thousand injuries. TSTM wind, floods, excessive heat and lighting occupied the second to fifth place, respectively. As a conclussion, while there are several event types that cause harm to the human population, such as flood, excessive heat and lightning, non cause as much human damage as tornadoes, which ocuppy a clear first place in both cases.

b) Which events are the costliest?

The following are the results as to the costliest event types:

tot_damage_arr <- arrange(total_dmg, desc(TOT_DMG))
total_dmg_sliced <- slice(tot_damage_arr, 1:10)
total_dmg_sliced$EVTYPE <- factor(total_dmg_sliced$EVTYPE, levels = unique(total_dmg_sliced$EVTYPE))
ggplot(total_dmg_sliced, aes(x = EVTYPE, y = TOT_DMG)) +
      geom_bar(stat = 'identity', fill = 'steelblue') + 
      labs(title = "Costliest event types", x = "Event Type", y = "Total Cost (millions of dollars)") +
      theme_minimal() +
      theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

print(total_dmg_sliced)
##               EVTYPE    TOT_DMG
## 1              FLOOD 150319.678
## 2  HURRICANE/TYPHOON  71913.713
## 3            TORNADO  57340.614
## 4        STORM SURGE  43323.541
## 5               HAIL  18752.904
## 6        FLASH FLOOD  17562.129
## 7            DROUGHT  15018.672
## 8          HURRICANE  14610.229
## 9        RIVER FLOOD  10148.405
## 10         ICE STORM   8967.041

As we can observe, in monetary terms, the 5 costliest event types are flood, hurricane/typhoon, tornado, storm surge and hail. Floods alone accounted for over 150 billion dollars in damages. Hurricanes, tornadoes and storm surges where not too far behind, with total costs of ~72, ~57 and ~43 billion dollars respectively.