This assignment explores the NOAA Storm Database and looks into the types of weather events that cause the most human and material damage across the United States. We conclude that, between 1950 and 2011, heat, tornadoes, thunderstorms, and floods have caused the highest numbers of fatalities and injuries. Similarly, tornadoes, floods, drought, and hurricanes have caused damage in billions of dollars to private property. States that are particularly affected by these events are Texas, Alabama, and Florida, for which we urge the authorities of these states to take the necessary precautions in order to avoid future disasters.

Data Processing

library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(ggplot2)

We open the .zip file with the data. The columns we want to analyze are BGNDATE, EVTYPE, FATALITIES, INJURIES, CROPDMG, PROPDMG, CROPDMGEXP, and PROPDMGEXP.

dataurl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
temp_file <- tempfile(fileext = ".bz2")
download.file(dataurl, temp_file)
data <- read.csv(temp_file, header = TRUE, sep = ",")
head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

We keep only the relevant columns. We will also turn the BGN_DATE column into date objects, add a separate column for year only, and change EVTYPE data into lowercase to avoid duplicates. Trailing and leading spaces were also removed.

nrow(data)
## [1] 902297
damage<-data[, c(2,6:8,23:28)] %>%
  mutate(BGN_DATE= str_sub(BGN_DATE, end = -8)) %>%
  mutate(BGN_DATE= mdy(BGN_DATE))%>%
  mutate(year= year(BGN_DATE))  %>%
mutate(EVTYPE= tolower(EVTYPE)) %>% #lowercase
mutate(EVTYPE=trimws(EVTYPE)) #remove trailing and leading spaces

A quick look into unique(damage$EVTYPE) shows that many events can be merged into one type. We collapse all events with the words “heat” (e.g. “heat”, “excessive heat”) into one type of event; we do the same with events containing “thunderstorm”, “hurricane/typhoon”, and “flood”.

evtypes<-unique(damage$EVTYPE)
heat<-str_subset(evtypes, "heat")
thstm<-str_subset(evtypes, "thunderstorm")
flood<- str_subset(evtypes, "flood")
ht<-str_subset(evtypes, "hurricane|typhoon")
damage <- damage %>%
  mutate(EVTYPE = str_replace_all(EVTYPE, "\\?", "unknown")) %>%
  mutate(EVTYPE = str_replace_all(EVTYPE, "tstm", "thunderstorm")) %>%
      mutate(EVTYPE = if_else(EVTYPE %in% heat, "heat", EVTYPE)) %>%
        mutate(EVTYPE = if_else(EVTYPE %in% thstm, "thunderstorm", EVTYPE)) %>%
          mutate(EVTYPE = if_else(EVTYPE %in% ht, "hurricane/typhoon", EVTYPE)) %>%
            mutate(EVTYPE= if_else(EVTYPE %in% flood, "flood", EVTYPE)) #collapsed all evtypes with "heat", "thstm wind", and "flood"

The columns CROPDMGEXP and PROPDMGEXP seem to have different exponents:

unique(data$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"
unique(data$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"

It is unclear what the numbers encode, but the remarks column suggests that K, M, and B refer to thousands, millions, and billions. We will turn everything to lowercase and transform into thousands.

damage$PROPDMGEXP<-tolower(data$PROPDMGEXP)
damage$CROPDMGEXP<-tolower(data$CROPDMGEXP)
damage <- damage %>%
  mutate(CROPS= if_else(CROPDMGEXP == "m", CROPDMG*1000, 
                 if_else(CROPDMGEXP == "b", CROPDMG*1000000, CROPDMG)))  %>% 
  mutate(PROPERTY= if_else(PROPDMGEXP == "m", PROPDMG*1000, 
                 if_else(PROPDMGEXP == "b", PROPDMG*1000000, PROPDMG)))

Results

Public health: fatalities and injuries

Fatalities

First we will look into the number of fatalities per event.

summary(damage$FATALITIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000

We see that the data is highly skewed, with the mean above the 3rd quartile. This means that, from 1950 to date, more than 75% of the weather events registered in the database have caused zero fatalities. We look into the top 3 events that have caused the highest number of fatalities over time:

damage %>%
  group_by(EVTYPE) %>%
  summarise(mean.fatalities = mean(FATALITIES), total.fatalities= sum(FATALITIES), deadliest.single=max(FATALITIES), n=n()) %>% slice_max(total.fatalities, n=3)
## # A tibble: 3 × 5
##   EVTYPE  mean.fatalities total.fatalities deadliest.single     n
##   <chr>             <dbl>            <dbl>            <dbl> <int>
## 1 tornado          0.0929             5633              158 60652
## 2 heat             1.19               3138              583  2648
## 3 flood            0.0184             1525               20 82725

We see that tornadoes, heat, and floods are the events with the highest numbers fatalities. Of these, we see that tornadoes are frequent and deadly, but the mean fatalities are fewer than those left by heat waves. Excessive heat, while less frequent, is more likely to cause a higher number of deaths per event, where the deadliest heatwave caused 583 deaths. Floods, while very frequent, tend to cause fewer fatalities.

Injuries

We now look into the events causing more injuries.

summary(damage$INJURIES)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000

Again, we see that the mean is above the third quartile. We group the total of events by type and check for the three types of event that have caused the largest numbers of injuries over time:

damage %>%
  group_by(EVTYPE) %>% 
  summarise(mean.injuries = mean(INJURIES), total.injuries= sum(INJURIES), worst.single=max(INJURIES), n=n()) %>% slice_max(total.injuries, n=3)
## # A tibble: 3 × 5
##   EVTYPE       mean.injuries total.injuries worst.single      n
##   <chr>                <dbl>          <dbl>        <dbl>  <int>
## 1 tornado             1.51            91346         1700  60652
## 2 thunderstorm        0.0283           9540           70 336745
## 3 heat                3.48             9224          519   2648

We see that tornadoes are again on top of the list, causing the most total injuries from 1950 to date. Thunderstorms, while very frequent, are less likely to cause injuries. However, heat waves cause the largest number of mean injuries per event, despite being less frequent.

Damages

We will now look into damages.

summary(damage$PROPERTY)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.00e+00 0.00e+00 0.00e+00 4.74e+02 0.00e+00 1.15e+08
summary(damage$CROPS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0      54       0 5000000

Again, we see that 75% of the events cause no damage.

Property damage

We filter out those cases of property damage with unclear exponents, and leave only those with “k”, “m”, and “b”.

damage %>%
  filter(PROPDMGEXP %in% c("b", "m", "k")) %>%
  group_by(EVTYPE) %>%
  summarise(mean.propdmg = mean(PROPERTY), sum.propdmg= sum(PROPERTY), worst.single=max(PROPERTY), n=n()) %>% slice_max(sum.propdmg, n=3)
## # A tibble: 3 × 5
##   EVTYPE            mean.propdmg sum.propdmg worst.single     n
##   <chr>                    <dbl>       <dbl>        <dbl> <int>
## 1 flood                    3219.  167529215.    115000000 52040
## 2 hurricane/typhoon      381055.   85356410.     16930000   224
## 3 tornado                  1099.   56937160.      2800000 51827

We see that flood, hurricane/typhoon, and tornado are the event types that lead to the highest sums of property damage. While hurricanes and typhoons are less frequent, they leave the highest mean damage.

Crop damage

We now turn to see the damages in agriculture.

damage %>%
  filter(CROPDMGEXP %in% c("b", "m", "k")) %>%
  group_by(EVTYPE) %>%
  summarise(mean.cropdmg = mean(CROPS), sum.cropdmg= sum(CROPS), worst.single=max(CROPS), n=n()) %>% slice_max(sum.cropdmg, n=3)
## # A tibble: 3 × 5
##   EVTYPE            mean.cropdmg sum.cropdmg worst.single     n
##   <chr>                    <dbl>       <dbl>        <dbl> <int>
## 1 drought                  9199.   13972566       1000000  1519
## 2 flood                     343.   12380079.      5000000 36115
## 3 hurricane/typhoon       47146.    5516118.      1510000   117

Crops are affected by drought, flood, and hurricanes. Drought leads the sum of crop damages from 1950 to date, whereas floods are the most frequent. Hurricanes are the least frequent but the count of mean crop damage caused is the highest.

Now we visualize the differences between types of events and the number of fatalities and injuries caused.

worstevents <- c("tornado", "heat", "flood", "thunderstorm", "hurricane/typhoon", "drought")
topmost<-damage %>% filter(EVTYPE %in% worstevents) %>%
    pivot_longer(cols=c(FATALITIES, INJURIES, PROPERTY, CROPS), names_to="TypeofDamage", values_to="counts") %>%
  mutate(AFFECTED=if_else(TypeofDamage %in% c("FATALITIES", "INJURIES"), "human", "material"))
sum<-topmost %>% summarise(.by = c(EVTYPE, TypeofDamage, AFFECTED), counts=sum(counts), mean=sum(counts)/n(), n=n())
sum
## # A tibble: 24 × 6
##    EVTYPE       TypeofDamage AFFECTED    counts      mean      n
##    <chr>        <chr>        <chr>        <dbl>     <dbl>  <int>
##  1 tornado      FATALITIES   human        5633    0.0929   60652
##  2 tornado      INJURIES     human       91346    1.51     60652
##  3 tornado      PROPERTY     material 56937459. 939.       60652
##  4 tornado      CROPS        material   415113.   6.84     60652
##  5 thunderstorm FATALITIES   human         728    0.00216 336745
##  6 thunderstorm INJURIES     human        9540    0.0283  336745
##  7 thunderstorm PROPERTY     material 10982003.  32.6     336745
##  8 thunderstorm CROPS        material  1271797.   3.78    336745
##  9 flood        FATALITIES   human        1525    0.0184   82725
## 10 flood        INJURIES     human        8604    0.104    82725
## # ℹ 14 more rows

Since the data is rather skewed, we log-transform for visualization purposes.

ggplot(topmost, aes(x=EVTYPE, y= log2(counts), color=TypeofDamage)) +  geom_boxplot() + facet_grid(vars(AFFECTED)) + scale_color_discrete(breaks=c("FATALITIES","INJURIES","PROPERTY", "CROPS")) + labs(x= "Event type", y="total damage (log-transformed)")
## Warning: Removed 1727288 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Total damage by event, 1950-2011 (material damage in thousands of dollars)

Total damage by event, 1950-2011 (material damage in thousands of dollars)

Changes over time

Finally, what is the evolution of fatalities and injuries caused by these events over time? Have better measures been implemented after events that turned into massive disasters?

yearsum<-topmost %>%
    filter(AFFECTED== "human") %>%
  group_by(TypeofDamage, EVTYPE, year)%>%
  filter_out(EVTYPE== "drought" | EVTYPE=="hurricane/typhoon") %>%
summarise(total=sum(counts), mean=mean(counts), n=n(), .groups = "drop_last")
ggplot(yearsum, aes(x=year, y= total, color=TypeofDamage)) + geom_line() + facet_wrap(vars(EVTYPE), scales="free")
Total fatalities and injuries caused by event types by year, 1960-2011

Total fatalities and injuries caused by event types by year, 1960-2011

Here we see that disasters show a relatively steady trend regarding fatalities, except for heat, where the latest years show fewer deaths. Tornadoes seem to cause high numbers of fatalities even after 2010. Regarding injuries, floods seem to be relatively under control after 2000, but the other events show high numbers of injuries also in later years.

Regarding damages to crop and property, we summarize and look for total numbers:

yearsum2<-topmost %>%
    filter(AFFECTED== "material") %>%
  group_by(TypeofDamage, EVTYPE, year)%>%
  filter_out(EVTYPE== "thunderstorm" | EVTYPE=="heat") %>%
summarise(total=sum(counts), mean=mean(counts), n=n(), .groups = "drop_last")
yearsum2 %>%
   slice_max(total, n=1)
## # A tibble: 8 × 6
## # Groups:   TypeofDamage, EVTYPE [8]
##   TypeofDamage EVTYPE             year      total      mean     n
##   <chr>        <chr>             <dbl>      <dbl>     <dbl> <int>
## 1 CROPS        drought            2000   2438100    19662.    124
## 2 CROPS        flood              1993   5179773.    3223.   1607
## 3 CROPS        hurricane/typhoon  2005   2012682    42823.     47
## 4 CROPS        tornado            2005     82105.      61.1  1343
## 5 PROPERTY     drought            2003    645150    12407.     52
## 6 PROPERTY     flood              2006 118704535.   30159.   3936
## 7 PROPERTY     hurricane/typhoon  2005  49786635  1059290.     47
## 8 PROPERTY     tornado            2011   9819601.    4480.   2192
ggplot(yearsum2, aes(x=year, y= total, color=TypeofDamage)) + geom_line() + facet_wrap(vars(EVTYPE), scales="free") + labs(y= "Annual damage (in thousands of dollars)")
Annual damage to crops and property, 1950-2011

Annual damage to crops and property, 1950-2011

Crop damage seems steady for all types of events over time, except for drought, where four peaks can be seen between 1995 and 2006. Floods and hurricanes/typhoons have caused the largest damage in property in years 2005 and 2006, where 47 hurricanes and typhoons left total property damages of over 49 billion dollars, and 3936 floods left a total damage of 118 billion dollars. in 2011, tornadoes were exceptionally destructive, causing 9 billion dollars worth of damages to property.

States with highest risk of disasters

We look into the events that cause the most damage and check which states are more at risk:

topmost %>%
  group_by(EVTYPE, STATE) %>%
  summarise(n=n(), .groups = "drop_last") %>%
  slice_max(n, n=1)
## # A tibble: 6 × 3
## # Groups:   EVTYPE [6]
##   EVTYPE            STATE     n
##   <chr>             <chr> <int>
## 1 drought           TX     1732
## 2 flood             TX    37988
## 3 heat              AL      972
## 4 hurricane/typhoon FL      276
## 5 thunderstorm      TX    93568
## 6 tornado           TX    33152

Across the US, the state of Texas has been the most affected by drought, flood, thunderstorm, and tornadoes. Alabama shows the highest incidence of events related to heat. Finally, Florida shows the highest number of hurricanes/typhoons in the country.