U.S. Storms: characteristics

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The data comes in the form of a comma-separated-value file compressed via the bzip2 algorithm retrievable at: Storm Data

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

Key Questions

The basic goal of this assignment is to explore the NOAA Storm Database and answer the following basic questions about severe weather events.

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Synopsis

Key Findings: Tornado causes the most severe health consequence (contribute the greatest number of injuries and fatalities), whilst floods causes the most severe economic consequence (contribute the greatest property and crop damage)

Data Processing

Loading libraries and dataset

The storm dataset is added into a dataframe titled ‘df’.

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df <- read.csv("repdata_data_StormData.csv.bz2")

Examining the dataframe df

The storms dataframe df has 902297 observations and 37 variables.

str(df)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

dim(df)

## [1] 902297     37

head(df, n = 10)

##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1        1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2        1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3        1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4        1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5        1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6        1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
## 7        1 11/16/1951 0:00:00     0100       CST      9     BLOUNT    AL
## 8        1  1/22/1952 0:00:00     0900       CST    123 TALLAPOOSA    AL
## 9        1  2/13/1952 0:00:00     2000       CST    125 TUSCALOOSA    AL
## 10       1  2/13/1952 0:00:00     2000       CST     57    FAYETTE    AL
##     EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1  TORNADO         0                                               0         NA
## 2  TORNADO         0                                               0         NA
## 3  TORNADO         0                                               0         NA
## 4  TORNADO         0                                               0         NA
## 5  TORNADO         0                                               0         NA
## 6  TORNADO         0                                               0         NA
## 7  TORNADO         0                                               0         NA
## 8  TORNADO         0                                               0         NA
## 9  TORNADO         0                                               0         NA
## 10 TORNADO         0                                               0         NA
##    END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1          0                      14.0   100 3   0          0       15    25.0
## 2          0                       2.0   150 2   0          0        0     2.5
## 3          0                       0.1   123 2   0          0        2    25.0
## 4          0                       0.0   100 2   0          0        2     2.5
## 5          0                       0.0   150 2   0          0        2     2.5
## 6          0                       1.5   177 2   0          0        6     2.5
## 7          0                       1.5    33 2   0          0        1     2.5
## 8          0                       0.0    33 1   0          0        0     2.5
## 9          0                       3.3   100 3   0          1       14    25.0
## 10         0                       2.3   100 3   0          0        0    25.0
##    PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1           K       0                                         3040      8812
## 2           K       0                                         3042      8755
## 3           K       0                                         3340      8742
## 4           K       0                                         3458      8626
## 5           K       0                                         3412      8642
## 6           K       0                                         3450      8748
## 7           K       0                                         3405      8631
## 8           K       0                                         3255      8558
## 9           K       0                                         3334      8740
## 10          K       0                                         3336      8738
##    LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1        3051       8806              1
## 2           0          0              2
## 3           0          0              3
## 4           0          0              4
## 5           0          0              5
## 6           0          0              6
## 7           0          0              7
## 8           0          0              8
## 9        3336       8738              9
## 10       3337       8737             10

Examining key variables related to population health and economy

In this dataset, the following variables are either related to population health or the economy, which we are interested in to answer the 2 research questions stated above.

Health variables:
FATALITIES: approximate number of deaths
INJURIES: approximate number of injuries

Economic variables:
PROPDMG: approximate property damages (an absolute numerical figure i.e. 25.00)
PROPDMGEXP: the denomination for property damage value (i.e. K, M which correspond to thousands and millions respectively)
CROPDMG: approximate crop damages
CROPDMGEXP: the denomination for crop damage value

Furthermore, we are interested in the type of weather event, EVTYPE and would like to see which weather event corresponds to more severe consequences on health and the economy.

Creating new dataframe df1 with only columns of interest to us

df1 <- df[, c("FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP", "EVTYPE")]
head(df1, n = 5)

##   FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP  EVTYPE
## 1          0       15    25.0          K       0            TORNADO
## 2          0        0     2.5          K       0            TORNADO
## 3          0        2    25.0          K       0            TORNADO
## 4          0        2     2.5          K       0            TORNADO
## 5          0        2     2.5          K       0            TORNADO

Check for missing values

There seems to be no NA values in our new dataset df1.

colSums(is.na(df1))

## FATALITIES   INJURIES    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP     EVTYPE 
##          0          0          0          0          0          0          0

Types of weather events

There are numerous weather events and we aim to categorise these events such as Thunderstorm Winds and High Wind into broad categories such as “Wind”. First, we see the 30 most common weather events.

From this we can identify the broad categories of events: Wind, Tornado, Flood, Snow, Winter, Storm, Fire, Hail, Rain. For other events, we categorise them into “Others”.

is.factor(df1$EVTYPE)

## [1] FALSE

df1$EVTYPE <- as.factor(df1$EVTYPE)
sort(table(df1$EVTYPE), decreasing = TRUE)[1:30]

## 
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288661                   219940                    82563 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60652                    54277                    25326 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20843                    20212                    15754 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15708                    11723                    11433 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7026                     6839                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3796                     3566 
##     URBAN/SML STREAM FLD                 WILDFIRE                 BLIZZARD 
##                     3392                     2761                     2719 
##                  DROUGHT                ICE STORM           EXCESSIVE HEAT 
##                     2488                     2006                     1678 
##               HIGH WINDS         WILD/FOREST FIRE             FROST/FREEZE 
##                     1533                     1457                     1342 
##                DENSE FOG       WINTER WEATHER/MIX           TSTM WIND/HAIL 
##                     1293                     1104                     1028

Proceed to categorise the EVTYPES column into a new column EV, and view the count of each new broad category.

df1$EV <- as.character(df1$EV)
df1$EV <- "OTHER"
df1$EV[grep("wind",df1$EVTYPE, ignore.case = TRUE)] <- "WIND"
df1$EV[grep("tornado",df1$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
df1$EV[grep("flood",df1$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
df1$EV[grep("snow",df1$EVTYPE, ignore.case = TRUE)] <- "SNOW"
df1$EV[grep("winter",df1$EVTYPE, ignore.case = TRUE)] <- "WINTER"
df1$EV[grep("storm",df1$EVTYPE, ignore.case = TRUE)] <- "STORM"
df1$EV[grep("fire",df1$EVTYPE, ignore.case = TRUE)] <- "FIRE"
df1$EV[grep("hail",df1$EVTYPE, ignore.case = TRUE)] <- "HAIL"
df1$EV[grep("rain",df1$EVTYPE, ignore.case = TRUE)] <- "RAIN"

table(df1$EV)

## 
##    FIRE   FLOOD    HAIL   OTHER    RAIN    SNOW   STORM TORNADO    WIND  WINTER 
##    4240   82689  290401   47378   12241   17636  124527   60699  254323    8163

Values for property and crop damage

Values for both damages are found in two separate columns - DMG and DMGEXP. For example, for the first observation in the dataset, property damage is 25.00K (25.00 recorded under PROPDMG and K under PROPDMGEXP).

We need to form a new column reflecting the absolute numerical figure of this damage, that is 25,000.

View the denominations

table(df1$PROPDMGEXP)

## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330

table(df1$CROPDMGEXP)

## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

K corresponds to thousands,M corresponds to millions and B corresponds to billions.
For the remaining denominations which cannot be comprehended, we assume that the figure is just denominated in dollars, since the number of observations with such incomprehensible units are not many anyway. We will store the denominations into a separate column, units_prop and units_crop.

df1$units_prop <- 1
is.character(df1$PROPDMGEXP)

## [1] TRUE

df1$units_prop[grep("k", df1$PROPDMGEXP, ignore.case = TRUE)] <- "1000"
df1$units_prop[grep("m", df1$PROPDMGEXP, ignore.case = TRUE)] <- "1000000"
df1$units_prop[grep("b", df1$PROPDMGEXP, ignore.case = TRUE)] <- "1000000000"
table(df1$units_prop)

## 
##          1       1000    1000000 1000000000 
##     466255     424665      11337         40

df1$units_crop <- 1
df1$units_crop[grep("k", df1$CROPDMGEXP, ignore.case = TRUE)] <- "1000"
df1$units_crop[grep("m", df1$CROPDMGEXP, ignore.case = TRUE)] <- "1000000"
df1$units_crop[grep("b", df1$CROPDMGEXP, ignore.case = TRUE)] <- "1000000000"
table(df1$units_crop)

## 
##          1       1000    1000000 1000000000 
##     618440     281853       1995          9

create a new column with final damage expenses, crop_dmg and prop_dmg.
Note that we must first check if columns are numeric before multiplying the values and units.

is.numeric(df1$units_prop) #check that column is numeric

## [1] FALSE

df1$units_prop <- as.numeric(df1$units_prop)
df1$units_crop <- as.numeric(df1$units_crop)
is.numeric(df1$PROPDMG)

## [1] TRUE

is.numeric(df1$CROPDMG)

## [1] TRUE

df1$prop_dmg <- df1$PROPDMG * df1$units_prop
df1$crop_dmg <- df1$CROPDMG * df1$units_crop

Analysis

Health consequences

To analyse the effects of weather event types on population health, we first group the dataframe df1 by the event type, EV (broad categories manually created above). Thereafter, we analyse the total fatalities and injuries by event type.

df1_health <- df1 %>% group_by(df1$EV) %>% summarise(total_fatalities = sum(FATALITIES), total_injuries = sum(INJURIES))

## `summarise()` ungrouping output (override with `.groups` argument)

df1_health

## # A tibble: 10 x 3
##    `df1$EV` total_fatalities total_injuries
##    <chr>               <dbl>          <dbl>
##  1 FIRE                   90           1608
##  2 FLOOD                1524           8602
##  3 HAIL                   45           1467
##  4 OTHER                5674          19840
##  5 RAIN                  114            305
##  6 SNOW                  164           1164
##  7 STORM                 633           6691
##  8 TORNADO              5636          91407
##  9 WIND                 1204           8906
## 10 WINTER                 61            538

Economic onsequences

Similarly, we now summarise df1 by grouping by event EV and coomputing the total property and crop damage.

df1_economic <- df1 %>% group_by(df1$EV) %>% summarise(total_prop_dmg = sum(prop_dmg), total_crop_dmg = sum(crop_dmg))

## `summarise()` ungrouping output (override with `.groups` argument)

df1_economic

## # A tibble: 10 x 3
##    `df1$EV` total_prop_dmg total_crop_dmg
##    <chr>             <dbl>          <dbl>
##  1 FIRE        8501628500       403281630
##  2 FLOOD     167507193930.    12266906100
##  3 HAIL       17619991072.     3114212873
##  4 OTHER      88765409587.    24090068520
##  5 RAIN        3270230192       919315800
##  6 SNOW        1023569752.      134683100
##  7 STORM      72812921120.     6406789888
##  8 TORNADO    56993098029.      414961520
##  9 WIND       10797310118      1338972750
## 10 WINTER        27298000        15000000

Results

Question 1. Across the United States, which types of events are most harmful with respect to population health?

Event type sorted by total fatalities and injuries in descending order

arrange(df1_health, desc(df1_health$total_fatalities))

## # A tibble: 10 x 3
##    `df1$EV` total_fatalities total_injuries
##    <chr>               <dbl>          <dbl>
##  1 OTHER                5674          19840
##  2 TORNADO              5636          91407
##  3 FLOOD                1524           8602
##  4 WIND                 1204           8906
##  5 STORM                 633           6691
##  6 SNOW                  164           1164
##  7 RAIN                  114            305
##  8 FIRE                   90           1608
##  9 WINTER                 61            538
## 10 HAIL                   45           1467

arrange(df1_health, desc(df1_health$total_injuries))

## # A tibble: 10 x 3
##    `df1$EV` total_fatalities total_injuries
##    <chr>               <dbl>          <dbl>
##  1 TORNADO              5636          91407
##  2 OTHER                5674          19840
##  3 WIND                 1204           8906
##  4 FLOOD                1524           8602
##  5 STORM                 633           6691
##  6 FIRE                   90           1608
##  7 HAIL                   45           1467
##  8 SNOW                  164           1164
##  9 WINTER                 61            538
## 10 RAIN                  114            305

Create a graph to visualise total fatalities and injuries by event type. Tornado ranks the highest by total injuries and fatalities (excluding ‘others’ category).

colnames(df1_health)

## [1] "df1$EV"           "total_fatalities" "total_injuries"

colnames(df1_health)[1] <- "EV" #rename first column 
is.factor(df1_health$EV)

## [1] FALSE

df1_health$EV <- as.factor(df1_health$EV)
library(RColorBrewer)
require(gridExtra)

## Loading required package: gridExtra

## Warning: package 'gridExtra' was built under R version 4.0.2

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

plot_fatalities <- ggplot(df1_health, aes(fill = EV, x = reorder(EV, -total_fatalities), y = total_fatalities)) + geom_bar(stat = 'identity') + geom_label(label = df1_health$total_fatalities) + scale_fill_brewer(palette = "Set3") + ggtitle("Event type by total fatalities") + xlab("Event Type") + ylab("Total fatalities")
plot_injuries <- ggplot(df1_health, aes(fill = EV, x = reorder(EV, -total_injuries), y = total_injuries)) + geom_bar(stat = 'identity') + geom_label(label = df1_health$total_injuries) + scale_fill_brewer(palette = "Set3") + ggtitle("Event type by total injuries") + xlab("Event Type") + ylab("Total injjuries")
grid.arrange(plot_fatalities,plot_injuries, ncol = 2)

Question 2. Across the United States, which types of events have the greatest economic consequences?

Floods pose the greatest economic damage to properties and crops.

colnames(df1_economic)

## [1] "df1$EV"         "total_prop_dmg" "total_crop_dmg"

colnames(df1_economic)[1] <- "EV"
plot_prop <- ggplot(df1_economic, aes( fill = EV, x = reorder(EV, -total_prop_dmg), y = total_prop_dmg)) + geom_bar(stat = 'identity') +
        scale_fill_brewer(palette = "Set3") + ggtitle("Event type by total property damage") + xlab("Event Type") + ylab("Total property damage")
plot_crop <- ggplot(df1_economic, aes( fill = EV, x = reorder(EV, -total_crop_dmg), y = total_crop_dmg)) + geom_bar(stat = 'identity')  +
        scale_fill_brewer(palette = "Set3") + ggtitle("Event type by total crop damage") + xlab("Event Type") + ylab("Total crop damage")
grid.arrange(plot_prop, plot_crop, ncol = 2)

options(scipen=999)