Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. This document looks for answers to the following two questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

Setting up the environment

To make the analysis robust, the system locale for dates and times is manually set to English. Then the required libraries are loaded.

Sys.setlocale("LC_TIME", "English")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.5.3
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(stringdist)
## 
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
## 
##     extract

Loading, inspecting and initial cleaning of the data

The Storm Data is loaded into the data object with the read.csv() function.

stormdata <- read.csv("repdata_data_StormData.csv.bz2")

Inspecting the data.

head(stormdata)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6
dim(stormdata)
## [1] 902297     37

The data has 37 columns and 902297 observations. The 37 variable names are listed below.

names(stormdata)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Reducing the size of the data

In order to reduce the size of the data, observations that are not relevant for this analysis are removed. These are the events that caused no harm to population health (no fatalities or injuries) and no property or crop damage.

stormdata_filt <- stormdata %>%
  filter(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0) 

Date

The variable BGN_DATE is split into date and time variables, and the date column is then converted to the Date class. Since only the year in which a severe weather event happened is considered in this document, the date column is further reduced to the year. The unnecessary time column is removed.

stormdata_filt <- stormdata_filt %>%
  separate(BGN_DATE, c("date", "time"), sep = " ")

stormdata_filt$date <- year(as.Date(stormdata_filt$date, format = "%m/%d/%Y"))
stormdata_filt <- subset(stormdata_filt, select = -c(time))
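
As a small illustration of the date handling above, the following sketch extracts the year from two BGN_DATE strings taken from the head() output. It uses only base R instead of separate() and year(), and the object names raw_dates, date_part and years are hypothetical, not part of the analysis.

```r
# Two BGN_DATE values copied from the head(stormdata) output above
raw_dates <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00")

# Drop everything from the first space onwards to keep only the date part
date_part <- sub(" .*$", "", raw_dates)

# Parse as month/day/year and keep the four-digit year as an integer
years <- as.integer(format(as.Date(date_part, format = "%m/%d/%Y"), "%Y"))
years
```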

The data contains severe weather and storm data between 1950 and 2011.

range(stormdata_filt$date)
## [1] 1950 2011

In the first 33 years (1950 to 1982) only tornado events were recorded.

unique(stormdata_filt[stormdata_filt$date < 1983,]$EVTYPE)
## [1] "TORNADO"

In 1983 thunderstorm wind events started to be recorded under the name “TSTM WIND”.

unique(stormdata_filt[stormdata_filt$date == 1983,]$EVTYPE)
## [1] "TORNADO"   "TSTM WIND"

One year later hail events were added, and these three event types were the only ones recorded up until 1992.

unique(stormdata_filt[stormdata_filt$date == 1984, ]$EVTYPE)
## [1] "TORNADO"   "TSTM WIND" "HAIL"
unique(stormdata_filt[stormdata_filt$date > 1986 & stormdata_filt$date < 1993 , ]$EVTYPE)
## [1] "TSTM WIND" "TORNADO"   "HAIL"
length(unique(stormdata_filt[stormdata_filt$date == 1993,]$EVTYPE))
## [1] 107

In 1993 a further 104 event types were registered. Therefore, this analysis considers only the observations after 1992.

stormdata_filt <- stormdata_filt %>%
  filter(date > 1992)

range(stormdata_filt$date)
## [1] 1993 2011
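
The year-by-year checks above can also be done in one step by counting the distinct event types per year. The sketch below shows the idea on a hypothetical toy data frame (toy and n_types are illustrative names); it is not run on the storm data here.

```r
# Hypothetical toy data: recording year and event type
toy <- data.frame(
  date   = c(1982, 1983, 1983, 1984, 1984, 1984),
  EVTYPE = c("TORNADO", "TORNADO", "TSTM WIND", "TORNADO", "TSTM WIND", "HAIL")
)

# Number of distinct event types recorded in each year
n_types <- tapply(toy$EVTYPE, toy$date, function(x) length(unique(x)))
n_types
```

A sudden jump in such a count would point to the year in which recording practice changed.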

Cleaning the EVTYPE column

We see that the EVTYPE column contains far more unique values than it should.

length(unique(stormdata_filt$EVTYPE))
## [1] 488
head(unique(stormdata_filt$EVTYPE), 20)
##  [1] "ICE STORM/FLASH FLOOD"        "WINTER STORM"                
##  [3] "HURRICANE OPAL/HIGH WINDS"    "THUNDERSTORM WINDS"          
##  [5] "TORNADO"                      "HURRICANE ERIN"              
##  [7] "HURRICANE OPAL"               "HEAVY RAIN"                  
##  [9] "LIGHTNING"                    "THUNDERSTORM WIND"           
## [11] "DENSE FOG"                    "HAIL"                        
## [13] "RIP CURRENT"                  "THUNDERSTORM WINS"           
## [15] "FLASH FLOODING"               "FLASH FLOOD"                 
## [17] "TORNADO F0"                   "THUNDERSTORM WINDS LIGHTNING"
## [19] "THUNDERSTORM WINDS/HAIL"      "HEAT"
sum(is.na(stormdata_filt$EVTYPE))
## [1] 0

Just by looking at the first 20 unique event type names, it is clear that there are typos and redundancies in the event names. There are no missing values in this variable.

The official Storm Data Event names can be found in the official documentation. The validECTYPE vector contains all these names.

validECTYPE <- c("Astronomical Low Tide", 
                 "Avalanche", 
                 "Blizzard",
                 "Coastal Flood", 
                 "Cold/Wind Chill", 
                 "Debris Flow",
                 "Dense Fog", 
                 "Dense Smoke", 
                 "Drought",
                 "Dust Devil",
                 "Dust Storm", 
                 "Excessive Heat",
                 "Extreme Cold/Wind Chill",
                 "Flash Flood", 
                 "Flood", 
                 "Frost/Freeze", 
                 "Funnel Cloud", 
                 "Freezing Fog",
                 "Hail", 
                 "Heat", 
                 "Heavy Rain",
                 "Heavy Snow", 
                 "High Surf", 
                 "High Wind",
                 "Hurricane (Typhoon)", 
                 "Ice Storm", 
                 "Lake-Effect Snow",
                 "Lakeshore Flood", 
                 "Lightning", 
                 "Marine Hail",
                 "Marine High Wind", 
                 "Marine Strong Wind", 
                 "Marine Thunderstorm Wind",
                 "Rip Current",
                 "Seiche", 
                 "Sleet", 
                 "Storm Surge/Tide",
                 "Strong Wind",
                 "Thunderstorm Wind", 
                 "Tornado", 
                 "Tropical Depression",
                 "Tropical Storm",
                 "Tsunami", 
                 "Volcanic Ash",
                 "Waterspout", 
                 "Wildfire", 
                 "Winter Storm", 
                 "Winter Weather")

length(validECTYPE)
## [1] 48

In order to make the cleaning process easier the event type names are converted to lowercase.

stormdata_filt$EVTYPE <- tolower(stormdata_filt$EVTYPE)
validECTYPE <- tolower(validECTYPE)
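
One possible way to find candidate corrections is to compare each raw name with the official names by edit distance. The sketch below is only an illustration under stated assumptions: it uses adist() from the utils package rather than the stringdist package loaded earlier, and the vectors valid and raw are small hypothetical samples.

```r
# Hypothetical subset of the official names and a few raw entries (lowercase)
valid <- c("thunderstorm wind", "flash flood", "tornado", "hail")
raw   <- c("thunderstorm wins", "flash flooding", "tornado f0")

# Edit distance between every raw entry (rows) and every valid name (columns)
d <- adist(raw, valid)

# For each raw entry, pick the valid name with the smallest distance
nearest <- valid[apply(d, 1, which.min)]
data.frame(raw = raw, nearest = nearest, distance = apply(d, 1, min))
```

Nearest-name matching can mislabel short or ambiguous entries, so the suggestions would still need a manual check.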

Since the aim of this analysis is to find the events that cause the most harm to human health or the most damage, only the most harmful and most frequent event names are cleaned up. During the exploratory analysis some common patterns were found among the incorrectly entered event type names; these are corrected below.

stormdata_filt <- stormdata_filt %>% mutate(EVTYPE = case_when(
  grepl("^tornado+(.*)|^(.*)tornado", EVTYPE) ~ "tornado",
  grepl("^(record|extreme|excessive) .*", EVTYPE) ~ "excessive heat",
  grepl("^tstm wind+(.*)|^thunderstorm wind+(.*)|^tstmw+(.*)", EVTYPE) ~ "thunderstorm wind",
  grepl("^(severe|gusty) thunder+(.*)|^(thu|tun| tstm)+(.*)", EVTYPE) ~ "thunderstorm wind",
  grepl("^(floods|flooding|flood)", EVTYPE) ~ "flood",
  grepl("^(urban|river|minor|rural) flood.*", EVTYPE) ~ "flood",
  grepl("^lightning+(.*)|^ lightning|^(.*)lightning|^lightning+(.*)th+(.*)", EVTYPE) ~ "lightning",
  grepl("^heat+(.*)", EVTYPE) ~ "heat",
  grepl("^flash flo+(.*)|^local flash+(.*)|^flood(/| )(flood/)?flash+(.*)|^ flash flo+(.*)", EVTYPE) ~ "flash flood",
  grepl("lake flood", EVTYPE) ~ "lakeshore flood",
  grepl("major flood", EVTYPE) ~ "flash flood",
  grepl("high  winds|^high wind+(.*)", EVTYPE) ~ "high wind",
  grepl("^non(-| )tstm wind|^gusty+(.*)|^(.*)+wind gusts|^gustnado+(.*)", EVTYPE) ~ "strong wind",
  grepl("strong winds|^wind+(|s)$", EVTYPE) ~ "strong wind",
  grepl("^drought+(.*)|^(.*)+drought", EVTYPE) ~ "drought",
  grepl("^[^marine]+(.*)+hail|^hail+(.*)", EVTYPE) ~ "hail",
  grepl("marine tstm wind", EVTYPE) ~ "marine thunderstorm wind",
  grepl("^hurricane+(.*)", EVTYPE) ~ "hurricane (typhoon)",
  grepl("cold+(/| |and)+wind+(.*)", EVTYPE) ~ "cold/wind chill",
  grepl("(.*)+extreme wind ch+(.*)", EVTYPE) ~ "extreme cold/wind chill",
  grepl("^urban+(/| )+(small|sml)+(.*)|^heavy rain+(.*)|^urban and small+(.*)|^small stream+(.*)", EVTYPE) ~  "heavy rain",
  grepl("rip currents", EVTYPE) ~ "rip current",
  grepl("storm surge", EVTYPE) ~ "storm surge/tide",
  grepl("^coastal.*", EVTYPE) ~ "coastal flood",
  grepl("ice storm/flash flood", EVTYPE) ~ "ice storm",
  TRUE ~ EVTYPE
))
length(unique(stormdata_filt$EVTYPE))
## [1] 227

Now the number of unique event type names is reduced from 488 to 227.
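
Because case_when() applies the first pattern that matches, it can help to probe a pattern on a few names before relying on it. The sketch below tests the second pattern from the call above against a hypothetical vector of names (probe); the narrower pattern is an assumption shown for comparison, not the one used in this analysis.

```r
# Hypothetical lowercase event names used only to probe the patterns
probe <- c("excessive heat", "record heat", "extreme cold", "extreme windchill")

# Pattern copied from the case_when() call above
pattern <- "^(record|extreme|excessive) .*"
broad <- grepl(pattern, probe)

# A narrower, hypothetical alternative that requires the word "heat"
narrow <- grepl("^(record|extreme|excessive) heat", probe)

data.frame(probe, broad, narrow)
```

The original pattern matches all four names, including the two cold-related ones, so any such names in the data would be relabelled as excessive heat. The narrower pattern matches only the first two.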

Cleaning the Damage Exponents

CROPDMGEXP holds the exponent for CROPDMG (crop damage) and, in the same way, PROPDMGEXP holds the exponent for PROPDMG (property damage). The characters used to signify magnitude include:

  • “K” for thousands,
  • “M” for millions,
  • “B” for billions,
  • “-” for less than,
  • “+” for greater than,
  • “?” for low certainty,
  • numbers between 0 and 7, which mean a multiplier of 10.

table(stormdata_filt$PROPDMGEXP)
## 
##             -      +      0      2      3      4      5      6      7      B 
##  10207      1      5    210      1      1      4     18      3      3     40 
##      h      H      K      m      M 
##      1      6 208203      7   8547
table(stormdata_filt$CROPDMGEXP)
## 
##             ?      0      B      k      K      m      M 
## 125288      6     17      7     21  99932      1   1985

These exponent values need to be converted to multiplier numbers in order to determine the economic consequences in dollar amount.

stormdata_filt <- stormdata_filt %>% mutate(PROPDMGEXP = case_when(
  grepl("[Bb]", PROPDMGEXP) ~ "1000000000",
  grepl("[Mm]", PROPDMGEXP) ~ "1000000",
  grepl("[Kk]", PROPDMGEXP) ~ "1000",
  grepl("[Hh]", PROPDMGEXP) ~ "100",
  grepl("0|2|3|4|5|6|7", PROPDMGEXP) ~ "10",
  # "-", "+" and blank exponents carry no magnitude, so the multiplier is 1
  PROPDMGEXP %in% c("-", "+", "") ~ "1",
  PROPDMGEXP == "?" ~ "0",
  TRUE ~ PROPDMGEXP
)) 

table(stormdata_filt$PROPDMGEXP)  
## 
##          1         10        100       1000    1000000 1000000000 
##      10213        240          7     208203       8554         40
stormdata_filt <- stormdata_filt %>% mutate(CROPDMGEXP = case_when(
  grepl("[B]", CROPDMGEXP) ~ "1000000000",
  grepl("[Mm]", CROPDMGEXP) ~ "1000000",
  grepl("[Kk]", CROPDMGEXP) ~ "1000",
  grepl("0", CROPDMGEXP) ~ "10",
  # Blank and "?" exponents both become 0 here, as the table below shows
  CROPDMGEXP %in% c("", "?") ~ "0",
  TRUE ~ CROPDMGEXP
))

table(stormdata_filt$CROPDMGEXP) 
## 
##          0         10       1000    1000000 1000000000 
##     125294         17      99953       1986          7
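
The letter codes could also be converted with a lookup table instead of a chain of patterns. The following is a minimal sketch under that assumption; multiplier, codes and mult are hypothetical names, and only the letter codes are covered.

```r
# Hypothetical lookup table: exponent code -> multiplier (uppercase names)
multiplier <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)

# A few exponent codes as they appear in PROPDMGEXP
codes <- c("K", "m", "B", "h", "M")

# Convert to uppercase, then look each code up by name
mult <- unname(multiplier[toupper(codes)])
mult
```

Codes that are not in the table come back as NA, so the digits and the symbols would need separate handling.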

Across the United States, which types of events are most harmful with respect to population health?

In order to find which types of events are most harmful with respect to population health, the following are considered.

  • The different types of events which are registered in the EVTYPE column.
  • The harm to population health caused by each event, which is recorded in the FATALITIES and INJURIES columns.
  • The year in which the event was recorded, which is stored in the date column.

To find which events caused the most fatalities and injuries over the years, the prepared stormdata_filt data is grouped by the EVTYPE variable. For each group the number of fatalities, the number of injuries and their sum are computed and stored in the group1 object. Ordering group1 in decreasing order of this sum answers the main question of this section.

group1 <- stormdata_filt %>% 
  group_by(EVTYPE) %>% 
  summarize(fatalities = sum(FATALITIES), 
            injuries = sum(INJURIES), 
            total = sum(FATALITIES) + sum(INJURIES))
  
group1 <- group1[order(group1$total, decreasing = TRUE),] 

head(group1)
## # A tibble: 6 × 4
##   EVTYPE            fatalities injuries total
##   <chr>                  <dbl>    <dbl> <dbl>
## 1 tornado                 1649    23371 25020
## 2 excessive heat          2308     7013  9321
## 3 flood                    500     6809  7309
## 4 thunderstorm wind        449     6183  6632
## 5 lightning                817     5232  6049
## 6 heat                    1118     2494  3612

The above table makes it clear that tornado is the most hazardous event type for overall population health: between 1993 and 2011 it caused a total of 1649 fatalities and 23371 injuries. Flood, thunderstorm wind and lightning caused far fewer fatalities than excessive heat, but a similar order of magnitude of injuries. Heat is the sixth most hazardous event type, with more than 1000 fatalities and almost 2500 injuries.
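
The group-and-sum step behind this ranking can be illustrated on a tiny example. The sketch below uses base R aggregate() instead of dplyr, and the toy data frame is hypothetical, so the numbers have nothing to do with the storm data.

```r
# Hypothetical toy data: three event records for two event types
toy <- data.frame(
  EVTYPE     = c("tornado", "tornado", "heat"),
  FATALITIES = c(1, 0, 2),
  INJURIES   = c(10, 5, 0)
)

# Sum fatalities and injuries within each event type
sums <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined harm, then sort from most to least harmful
sums$total <- sums$FATALITIES + sums$INJURIES
sums <- sums[order(sums$total, decreasing = TRUE), ]
sums
```

In this toy case tornado ranks first by the combined total even though heat has more fatalities, which is why the ranking depends on the measure chosen.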

If we order the data by fatalities instead, excessive heat, with 2308 fatalities, ranks above tornado, with 1649. The table below lists the injuries of the six event types with the most fatalities.

head(group1[order(group1$fatalities, decreasing = TRUE), c(1,3) ])
## # A tibble: 6 × 2
##   EVTYPE         injuries
##   <chr>             <dbl>
## 1 excessive heat     7013
## 2 tornado           23371
## 3 heat               2494
## 4 flash flood        1785
## 5 lightning          5232
## 6 rip current         529

In the following, tornado and excessive heat events are analyzed further. For this, a new object called group2 is created, which contains for both event types the yearly number of fatalities (variable fatalities) and injuries (variable injuries) and their sum (variable total).

group2 <- stormdata_filt %>% 
  subset(EVTYPE %in% group1$EVTYPE[1:2]) %>% 
  group_by(date, EVTYPE) %>%
  summarize(fatalities = sum(FATALITIES), 
            injuries = sum(INJURIES), 
            total = sum(FATALITIES) + sum(INJURIES), .groups = "drop_last")

In order to see how the number of fatalities and injuries for each event type changed over the years, a panel plot is created.

ggplot(group2) + 
  geom_line(mapping = aes(x = date, y = fatalities, colour = "Fatalities"), linewidth = 1) +
  geom_line(mapping = aes(x = date, y = injuries, colour = "Injuries"), linewidth = 1) +
  facet_wrap(~EVTYPE, ncol = 2) +
  labs(title = "The total number of fatalities and injuries caused by \nthe two most hazardous weather event types",
       x = "Year",
       y = "Number of Fatalities and Injuries") +
  theme_bw()

The above figure shows that the severe excessive heat wave in 1999 injured about 1500 people and caused a couple of hundred fatalities. In 2006 there was another, less severe excessive heat wave, which injured about 1000 people and caused some fatalities. Tornadoes are common weather events in the US: every year they cause a few hundred or even one to two thousand injuries. In 2011 there were some huge tornado events that left more than 6000 people injured and about 200 dead. This was the most severe year between 1993 and 2011 in terms of population health.

Across the United States, which types of events have the greatest economic consequences?

In order to find which types of events have the greatest economic consequences across the United States, the variables PROPDMG (property damage) and CROPDMG (crop damage) need to be multiplied by their multipliers, which are stored in the variables PROPDMGEXP and CROPDMGEXP, respectively.

stormdata_filt <- stormdata_filt %>% 
  mutate(
    PROPDAMAGE = PROPDMG * as.numeric(PROPDMGEXP), 
    CROPDAMAGE = CROPDMG * as.numeric(CROPDMGEXP))
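
As a worked example of this multiplication, the sketch below takes the PROPDMG values of the first two rows shown by head(stormdata), both recorded with exponent "K". These rows are used purely for the arithmetic; the object names are hypothetical.

```r
# PROPDMG of the first two rows shown by head(stormdata), both with exponent "K"
propdmg    <- c(25.0, 2.5)
propdmgexp <- c("1000", "1000")  # the multiplier that "K" is converted to

# Damage in dollars = reported value times the numeric multiplier
dollars <- propdmg * as.numeric(propdmgexp)
dollars
```

The two events therefore correspond to 25000 and 2500 dollars of property damage.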

Now the total property and crop damage per event type are determined and stored in the group3 object.

group3 <- stormdata_filt %>% 
  group_by(EVTYPE) %>% 
  summarise(tProperty = sum(PROPDAMAGE),
            tCrop = sum(CROPDAMAGE),
            tDamage = sum(PROPDAMAGE) + sum(CROPDAMAGE))

By far the most damage is caused by flood: the total amount between 1993 and 2011 (without considering the time value of money) is more than 161 billion dollars.

head(group3[order(group3$tDamage, decreasing = TRUE),c(1,4) ] )
## # A tibble: 6 × 2
##   EVTYPE                    tDamage
##   <chr>                       <dbl>
## 1 flood               161154146711 
## 2 hurricane (typhoon)  90271472810 
## 3 storm surge/tide     47965579000 
## 4 tornado              28412364540 
## 5 hail                 19021430734 
## 6 flash flood          18275035478.

Flood is mainly responsible for property damage, which added up to about 150 billion dollars.

head(group3[order(group3$tProperty, decreasing = TRUE), c(1,2)])
## # A tibble: 6 × 2
##   EVTYPE                  tProperty
##   <chr>                       <dbl>
## 1 flood               150216664761 
## 2 hurricane (typhoon)  84756180010 
## 3 storm surge/tide     47964724000 
## 4 tornado              27994901580 
## 5 flash flood          16837872328.
## 6 hail                 15974542934

Drought is responsible for the most crop damage, with almost 14 billion dollars, and flood is responsible for almost 11 billion dollars of crop damage.

head(group3[order(group3$tCrop, decreasing = TRUE), c(1,3)])
## # A tibble: 6 × 2
##   EVTYPE                    tCrop
##   <chr>                     <dbl>
## 1 drought             13972571780
## 2 flood               10937481950
## 3 hurricane (typhoon)  5515292800
## 4 ice storm            5022113500
## 5 hail                 3046887800
## 6 excessive heat       1969425000
group4 <- stormdata_filt %>% 
  subset(EVTYPE %in% c( "flood", "drought")) %>% 
  group_by(date, EVTYPE) %>%
  summarize(property = sum(PROPDAMAGE),
            crop = sum(CROPDAMAGE), .groups = "drop_last")

In order to see how the amount of crop and property damage caused by flood and drought changed over the years, a separate plot is created for each of the two event types.

ggplot(group4[group4$EVTYPE == "drought",]) + 
  geom_line(mapping = aes(x = date, y = property, colour = "Property Damage"), linewidth = 1) +
  geom_line(mapping = aes(x = date, y = crop, colour = "Crop Damage"), linewidth = 1) +
  labs(title = "The total amount of property and crop damage caused by drought",
       x = "Year",
       y = "Property and Crop Damage [$]") +
  theme_bw()

During the analyzed time interval there were three years (1998, 2000 and 2006) in which severe drought events caused more than 2 billion dollars of crop damage. In 2003 the property damage caused by drought was also significant, at more than half a billion dollars.

ggplot(group4[group4$EVTYPE == "flood",]) + 
  geom_line(mapping = aes(x = date, y = property, colour = "Property Damage"), linewidth = 1) +
  geom_line(mapping = aes(x = date, y = crop, colour = "Crop Damage"), linewidth = 1) +
  facet_wrap(~EVTYPE, ncol = 2) +
  labs(title = "The total amount of property and crop damage caused by flood",
       x = "Year",
       y = "Property and Crop Damage [$]") +
  theme_bw()

From the above figure we can observe that in 2006 there were some severe flood events (known as the 2006 Mid-Atlantic United States flood), which caused the most property damage in the US between 1993 and 2011. In dollar terms the damage reached 120 billion. In the analysed time interval there were no other similarly destructive events. Compared with property damage, the crop damage caused by flood was small.

Results

This analysis looked for the most harmful and destructive severe weather events across the US between 1993 and 2011. We found that tornadoes are overall the most harmful in terms of population health, while floods are responsible for the most combined crop and property damage. The analysis also showed that flood events caused record-high property damage in 2006. Drought caused the most crop damage, with excessive heat also among the six most damaging event types for crops.