Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. This document seeks answers to two questions: which types of events are most harmful with respect to population health, and which types of events have the greatest economic consequences across the United States?
To make the analysis reproducible on any system, the time locale is manually set to English. Then the required libraries are loaded.
Sys.setlocale("LC_TIME", "English")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.5.3
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(stringdist)
##
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
##
## extract
The Storm Data is loaded into the stormdata object with the
read.csv() function.
stormdata <- read.csv("repdata_data_StormData.csv.bz2")
The first rows and the dimensions of the data are inspected.
head(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
dim(stormdata)
## [1] 902297 37
The data has 902297 observations and 37 columns. The 37 variable names are listed below.
names(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
To reduce the size of the data, observations that are not relevant for this analysis are removed. These are the events that caused no harm to population health (no fatalities or injuries) and no property or crop damage.
stormdata_filt <- stormdata %>%
filter(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
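As a minimal sketch of what this filter keeps, the toy data frame below (invented values, not taken from the database) is filtered in base R with the same condition. Only rows with at least one non-zero harm or damage value remain.

```r
# Toy data: invented values for illustration only
toy <- data.frame(
  EVTYPE     = c("HAIL", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(0, 1, 0, 0),
  INJURIES   = c(0, 3, 0, 0),
  PROPDMG    = c(0, 25, 0, 0),
  CROPDMG    = c(0, 0, 5, 0)
)

# Same logical condition as in the dplyr filter() call above
keep <- with(toy, FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
toy_filt <- toy[keep, ]

# Only the tornado and flood rows survive
print(toy_filt$EVTYPE)
```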
The variable BGN_DATE is split into date and
time variables, then the date column is
converted to the Date class. In this document only the year in which a
severe weather event happened is considered, therefore the date column
is further processed to contain only the year. The unnecessary
time column is removed.
stormdata_filt <- stormdata_filt %>%
separate(BGN_DATE, c("date", "time"), sep = " ")
stormdata_filt$date <- year(as.Date(stormdata_filt$date, format = "%m/%d/%Y"))
stormdata_filt <- subset(stormdata_filt, select = -c(time))
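The same year extraction can be sketched in base R on two BGN_DATE strings copied from the head() output above. This is only an illustration of the parsing step, not a replacement for the pipeline; the helper names bgn, date_part and yr are hypothetical.

```r
# Two BGN_DATE values as they appear in the raw data
bgn <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00")

# Drop everything after the first space, leaving month/day/year
date_part <- sub(" .*$", "", bgn)

# Parse as a Date and keep only the four-digit year
yr <- as.integer(format(as.Date(date_part, format = "%m/%d/%Y"), "%Y"))
print(yr)
```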
The data contains severe weather and storm data between 1950 and 2011.
range(stormdata_filt$date)
## [1] 1950 2011
In the first 33 years (1950 to 1982) only tornado events were recorded.
unique(stormdata_filt[stormdata_filt$date < 1983,]$EVTYPE)
## [1] "TORNADO"
In 1983 thunderstorm wind events started to be recorded under the name “TSTM WIND”.
unique(stormdata_filt[stormdata_filt$date == 1983,]$EVTYPE)
## [1] "TORNADO" "TSTM WIND"
One year later hail events were added, and these three event types remained the only ones recorded until 1992.
unique(stormdata_filt[stormdata_filt$date == 1984, ]$EVTYPE)
## [1] "TORNADO" "TSTM WIND" "HAIL"
unique(stormdata_filt[stormdata_filt$date > 1986 & stormdata_filt$date < 1993 , ]$EVTYPE)
## [1] "TSTM WIND" "TORNADO" "HAIL"
length(unique(stormdata_filt[stormdata_filt$date == 1993,]$EVTYPE))
## [1] 107
In 1993 a further 104 event types were registered, giving 107 in total. Therefore this analysis considers only the observations after 1992.
stormdata_filt <- stormdata_filt %>%
filter(date > 1992)
range(stormdata_filt$date)
## [1] 1993 2011
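The year-by-year checks above can be condensed into a single count of distinct event types per year. The sketch below uses a small invented data frame with the same column names (date, EVTYPE); applied to the full data before the 1993 cut, the same call would show where the number of types jumps.

```r
# Toy data with the same column names as stormdata_filt (values invented)
toy <- data.frame(
  date   = c(1982, 1983, 1983, 1984, 1984, 1984),
  EVTYPE = c("TORNADO", "TORNADO", "TSTM WIND",
             "TORNADO", "TSTM WIND", "HAIL")
)

# Number of distinct event types recorded in each year
types_per_year <- tapply(toy$EVTYPE, toy$date,
                         function(x) length(unique(x)))
print(types_per_year)
```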
The EVTYPE column contains far more unique values than it should.
length(unique(stormdata_filt$EVTYPE))
## [1] 488
head(unique(stormdata_filt$EVTYPE), 20)
## [1] "ICE STORM/FLASH FLOOD" "WINTER STORM"
## [3] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"
## [5] "TORNADO" "HURRICANE ERIN"
## [7] "HURRICANE OPAL" "HEAVY RAIN"
## [9] "LIGHTNING" "THUNDERSTORM WIND"
## [11] "DENSE FOG" "HAIL"
## [13] "RIP CURRENT" "THUNDERSTORM WINS"
## [15] "FLASH FLOODING" "FLASH FLOOD"
## [17] "TORNADO F0" "THUNDERSTORM WINDS LIGHTNING"
## [19] "THUNDERSTORM WINDS/HAIL" "HEAT"
sum(is.na(stormdata_filt$EVTYPE))
## [1] 0
Just by looking at the first 20 unique event type names, it is clear that there are typos and redundancies in the event names. There are no missing values in this variable.
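One way to see which spelling variants matter most is to normalise case and surrounding whitespace and then rank the names by frequency. The sketch below does this on a handful of invented example strings; raw, norm and freq are hypothetical names used only here.

```r
# Example spellings of the kind seen in EVTYPE (invented sample)
raw <- c("THUNDERSTORM WINDS", "Thunderstorm Wind",
         " TSTM WIND", "TSTM WIND", "HAIL")

# Lowercase and strip leading/trailing whitespace
norm <- tolower(trimws(raw))

# Count each normalised name, most frequent first
freq <- sort(table(norm), decreasing = TRUE)
print(freq)
```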
The official Storm Data Event names can be found in the official
documentation. The validECTYPE vector contains all these
names.
validECTYPE <- c("Astronomical Low Tide",
"Avalanche",
"Blizzard",
"Coastal Flood",
"Cold/Wind Chill",
"Debris Flow",
"Dense Fog",
"Dense Smoke",
"Drought",
"Dust Devil",
"Dust Storm",
"Excessive Heat",
"Extreme Cold/Wind Chill",
"Flash Flood",
"Flood",
"Frost/Freeze",
"Funnel Cloud",
"Freezing Fog",
"Hail",
"Heat",
"Heavy Rain",
"Heavy Snow",
"High Surf",
"High Wind",
"Hurricane (Typhoon)",
"Ice Storm",
"Lake-Effect Snow",
"Lakeshore Flood",
"Lightning",
"Marine Hail",
"Marine High Wind",
"Marine Strong Wind",
"Marine Thunderstorm Wind",
"Rip Current",
"Seiche",
"Sleet",
"Storm Surge/Tide",
"Strong Wind",
"Thunderstorm Wind",
"Tornado",
"Tropical Depression",
"Tropical Storm",
"Tsunami",
"Volcanic Ash",
"Waterspout",
"Wildfire",
"Winter Storm",
"Winter Weather")
length(validECTYPE)
## [1] 48
In order to make the cleaning process easier the event type names are converted to lowercase.
stormdata_filt$EVTYPE <- tolower(stormdata_filt$EVTYPE)
validECTYPE <- tolower(validECTYPE)
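With both sides in lowercase, the recorded names can be compared directly with the official ones. The sketch below is an assumption about how such a check could look, not the method used in this report: it uses base R's adist() (generalized edit distance) rather than the stringdist package, and a shortened, hypothetical list of valid names.

```r
# Shortened list of official names and some non-standard recorded names
valid <- c("thunderstorm wind", "flash flood", "tornado")
raw   <- c("thunderstorm wins", "flash flooding", "tornado f0")

# Names that are not in the official list
not_valid <- raw[!raw %in% valid]

# Edit distance between every recorded name and every official name
d <- adist(not_valid, valid)

# For each recorded name, the official name with the smallest distance
nearest <- valid[apply(d, 1, which.min)]
print(data.frame(recorded = not_valid, suggestion = nearest))
```

A nearest-name suggestion of this kind is only a starting point; names such as "tstm wind" are abbreviations rather than typos and still need explicit rules.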
Since the aim of this analysis is to find the events which cause the most harm to human health or the most damage, only the names of the most harmful and most frequent events are cleaned up. The exploratory analysis revealed some common patterns among the incorrectly entered event type names, which are corrected below.
stormdata_filt <- stormdata_filt %>% mutate(EVTYPE = case_when(
grepl("^tornado+(.*)|^(.*)tornado", EVTYPE) ~ "tornado",
grepl("^(record|extreme|excessive) .*", EVTYPE) ~ "excessive heat",
grepl("^tstm wind+(.*)|^thunderstorm wind+(.*)|^tstmw+(.*)", EVTYPE) ~ "thunderstorm wind",
grepl("^(severe|gusty) thunder+(.*)|^(thu|tun| tstm)+(.*)", EVTYPE) ~ "thunderstorm wind",
grepl("^(floods|flooding|flood)", EVTYPE) ~ "flood",
grepl("^(urban|river|minor|rural) flood.*", EVTYPE) ~ "flood",
grepl("^lightning+(.*)|^ lightning|^(.*)lightning|^lightning+(.*)th+(.*)", EVTYPE) ~ "lightning",
grepl("^heat+(.*)", EVTYPE) ~ "heat",
grepl("^flash flo+(.*)|^local flash+(.*)|^flood(/| )(flood/)?flash+(.*)|^ flash flo+(.*)", EVTYPE) ~ "flash flood",
grepl("lake flood", EVTYPE) ~ "lakeshore flood",
grepl("major flood", EVTYPE) ~ "flash flood",
grepl("high winds|^high wind+(.*)", EVTYPE) ~ "high wind",
grepl("^non(-| )tstm wind|^gusty+(.*)|^(.*)+wind gusts|^gustnado+(.*)", EVTYPE) ~ "strong wind",
grepl("strong winds|^wind+(|s)$", EVTYPE) ~ "strong wind",
grepl("^drought+(.*)|^(.*)+drought", EVTYPE) ~ "drought",
grepl("^[^marine]+(.*)+hail|^hail+(.*)", EVTYPE) ~ "hail",
grepl("marine tstm wind", EVTYPE) ~ "marine thunderstorm wind",
grepl("^hurricane+(.*)", EVTYPE) ~ "hurricane (typhoon)",
grepl("cold+(/| |and)+wind+(.*)", EVTYPE) ~ "cold/wind chill",
grepl("(.*)+extreme wind ch+(.*)", EVTYPE) ~ "extreme cold/wind chill",
grepl("^urban+(/| )+(small|sml)+(.*)|^heavy rain+(.*)|^urban and small+(.*)|^small stream+(.*)", EVTYPE) ~ "heavy rain",
grepl("rip currents", EVTYPE) ~ "rip current",
grepl("storm surge", EVTYPE) ~ "storm surge/tide",
grepl("^coastal.*", EVTYPE) ~ "coastal flood",
grepl("ice storm/flash flood", EVTYPE) ~ "ice storm",
TRUE ~ EVTYPE
))
length(unique(stormdata_filt$EVTYPE))
## [1] 227
The number of unique event type names is now reduced from 488 to 227.
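In a case_when() call the first matching condition wins, so the order of the patterns matters. The base R sketch below imitates this behaviour with two of the patterns in simplified form; the rules list and the classify() helper are hypothetical and exist only for this illustration.

```r
# Ordered rules: pattern first, replacement second
rules <- list(
  c("^(floods|flooding|flood)", "flood"),
  c("^flash flo",               "flash flood")
)

# Apply the rules in order; a name is only rewritten by its first match
classify <- function(x) {
  out  <- x
  done <- rep(FALSE, length(x))
  for (r in rules) {
    hit <- !done & grepl(r[1], x)
    out[hit] <- r[2]
    done <- done | hit
  }
  out
}

result <- classify(c("flooding", "flash flooding", "flood/flash flood", "hail"))
print(result)
```

Here "flood/flash flood" ends up as "flood" because the flood rule is tested first; names that match no rule are left unchanged.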
‘CROPDMGEXP’ holds the exponent for ‘CROPDMG’ (crop damage). In the same way, ‘PROPDMGEXP’ holds the exponent for ‘PROPDMG’ (property damage). The characters used to signify magnitude are:
- “H” for hundreds,
- “K” for thousands,
- “M” for millions,
- “B” for billions,
- “-” refers to less than,
- “+” refers to greater than,
- “?” refers to low certainty,
- numbers between 0 and 7 mean a multiplier of 10.
table(stormdata_filt$PROPDMGEXP)
##
## - + 0 2 3 4 5 6 7 B
## 10207 1 5 210 1 1 4 18 3 3 40
## h H K m M
## 1 6 208203 7 8547
table(stormdata_filt$CROPDMGEXP)
##
## ? 0 B k K m M
## 125288 6 17 7 21 99932 1 1985
These exponent values need to be converted to numeric multipliers in order to express the economic consequences in dollars.
stormdata_filt <- stormdata_filt %>% mutate(PROPDMGEXP = case_when(
grepl("[Bb]", PROPDMGEXP) ~ "1000000000",
grepl("[Mm]", PROPDMGEXP) ~ "1000000",
grepl("[Kk]", PROPDMGEXP) ~ "1000",
grepl("[Hh]", PROPDMGEXP) ~ "100",
# digits 0 to 7 mean a multiplier of 10
PROPDMGEXP %in% as.character(0:7) ~ "10",
# "-", "+" and blank exponents leave the value unchanged
PROPDMGEXP %in% c("-", "+", "") ~ "1",
# "?" is matched literally; as a regex it is not a valid pattern
PROPDMGEXP == "?" ~ "0",
TRUE ~ PROPDMGEXP
))
table(stormdata_filt$PROPDMGEXP)
##
## 1 10 100 1000 1000000 1000000000
## 10213 240 7 208203 8554 40
stormdata_filt <- stormdata_filt %>% mutate(CROPDMGEXP = case_when(
grepl("[Bb]", CROPDMGEXP) ~ "1000000000",
grepl("[Mm]", CROPDMGEXP) ~ "1000000",
grepl("[Kk]", CROPDMGEXP) ~ "1000",
CROPDMGEXP == "0" ~ "10",
# blank and "?" exponents are both mapped to 0 (see the table below)
CROPDMGEXP %in% c("", "?") ~ "0",
TRUE ~ CROPDMGEXP
))
table(stormdata_filt$CROPDMGEXP)
##
## 0 10 1000 1000000 1000000000
## 125294 17 99953 1986 7
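The conversion rules listed above can also be written as a lookup table instead of a chain of regular expressions. The following is a minimal sketch of that idea; exp_to_multiplier() is a hypothetical helper, and it treats a blank exponent as a multiplier of 1, which is an assumption.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier
exp_to_multiplier <- function(code) {
  lookup <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)
  key <- toupper(code)

  # Letters are looked up by name; anything else is NA for now
  out <- unname(lookup[key])

  # Digits 0 to 7 mean a multiplier of 10
  out[key %in% as.character(0:7)] <- 10
  # "-", "+" and blank: value taken as is (assumption for blank)
  out[key %in% c("", "-", "+")] <- 1
  # "?" marks low certainty: the value is dropped
  out[key == "?"] <- 0
  out
}

mult <- exp_to_multiplier(c("K", "m", "B", "h", "5", "", "+", "?"))
print(mult)
```

Codes that match none of the rules come back as NA, which makes unexpected values easy to spot.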
In order to find which types of events are most harmful with respect to population health, the following are considered:
- the EVTYPE column,
- the FATALITIES and INJURIES columns,
- the date column.
To find which events caused the most fatalities and injuries over the
years, the prepared stormdata_filt data is grouped
by the EVTYPE variable, and for each group the number of fatalities,
the number of injuries and their sum are computed and stored
in the group1 object. Sorting group1 in decreasing order of this sum
answers the main question of this section.
group1 <- stormdata_filt %>%
group_by(EVTYPE) %>%
summarize(fatalities = sum(FATALITIES),
injuries = sum(INJURIES),
total = sum(FATALITIES) + sum(INJURIES))
group1 <- group1[order(group1$total, decreasing = TRUE),]
head(group1)
## # A tibble: 6 × 4
## EVTYPE fatalities injuries total
## <chr> <dbl> <dbl> <dbl>
## 1 tornado 1649 23371 25020
## 2 excessive heat 2308 7013 9321
## 3 flood 500 6809 7309
## 4 thunderstorm wind 449 6183 6632
## 5 lightning 817 5232 6049
## 6 heat 1118 2494 3612
The table above shows that tornadoes are the most hazardous to overall population health: over the years they caused a total of 1649 fatalities and 23371 injuries. Flood, thunderstorm wind and lightning caused far fewer fatalities than excessive heat, but a number of injuries of the same order. Heat is the sixth most hazardous event, with more than 1000 fatalities and almost 2500 injuries.
If we order the data by fatalities, excessive heat comes first: it caused 2308 fatalities, more than tornado events did. The table below lists the event types in this order together with their injury counts.
head(group1[order(group1$fatalities, decreasing = TRUE), c(1,3) ])
## # A tibble: 6 × 2
## EVTYPE injuries
## <chr> <dbl>
## 1 excessive heat 7013
## 2 tornado 23371
## 3 heat 2494
## 4 flash flood 1785
## 5 lightning 5232
## 6 rip current 529
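The grouping and ordering steps of this section can be reproduced on a toy example with base R's aggregate(). The values below are invented and only show the mechanics; they are not taken from the database.

```r
# Toy data: invented counts for two event types
toy <- data.frame(
  EVTYPE     = c("tornado", "tornado", "heat"),
  FATALITIES = c(1, 2, 4),
  INJURIES   = c(10, 5, 1)
)

# Sum fatalities and injuries per event type
agg <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
agg$total <- agg$FATALITIES + agg$INJURIES

# Ranking by total harm and by fatalities alone can differ
by_total      <- agg[order(agg$total, decreasing = TRUE), ]
by_fatalities <- agg[order(agg$FATALITIES, decreasing = TRUE), ]
print(by_total)
print(by_fatalities)
```

In this toy case the tornado rows lead by total, while heat leads by fatalities, which mirrors the pattern seen in the real data.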
In the following, tornado and excessive heat events are analyzed
further. For this a new object called group2 is created,
which contains, for both events and for each year, the total number of
fatalities (variable fatalities), injuries (variable
injuries) and their sum (variable total).
group2 <- stormdata_filt %>%
subset(EVTYPE %in% group1$EVTYPE[1:2]) %>%
group_by(date, EVTYPE) %>%
summarize(fatalities = sum(FATALITIES),
injuries = sum(INJURIES),
total = sum(FATALITIES) + sum(INJURIES), .groups = "drop_last")
To see how the number of fatalities and injuries changed over the years for each of the two event types, a panel plot is created.
ggplot(group2) +
geom_line(mapping = aes(x = date, y = fatalities, colour = "Fatalities"), linewidth = 1) +
geom_line(mapping = aes(x = date, y = injuries, colour = "Injuries"), linewidth = 1) +
facet_wrap(~EVTYPE, ncol = 2) +
labs(title = "The total number of fatalities and injuries caused by \nthe two most hazardous weather event types",
x = "Year",
y = "Number of Fatalities and Injuries") +
theme_bw()
The figure above shows that the severe heat wave of 1999 left about 1500 people injured and caused a couple of hundred fatalities. In 2006 there was another, less severe heat wave, with about 1000 injuries and some fatalities. Tornadoes are common in the US: every year they cause from a few hundred to one or two thousand injuries. In 2011 some very large tornado events left more than 6000 people injured and about 200 dead. This was the most severe year between 1993 and 2011 in terms of population health.
To find which types of events have the greatest economic consequences across the United States, the variables PROPDMG (property damage) and CROPDMG (crop damage) need to be multiplied by their multipliers, which are stored in the variables PROPDMGEXP and CROPDMGEXP, respectively.
stormdata_filt <- stormdata_filt %>%
mutate(
PROPDAMAGE = PROPDMG * as.numeric(PROPDMGEXP),
CROPDAMAGE = CROPDMG * as.numeric(CROPDMGEXP))
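As a worked example of this multiplication, the first two rows shown by head() earlier have PROPDMG values of 25.0 and 2.5, both with the exponent "K". After conversion the multiplier is the string "1000", so the dollar amounts are 25000 and 2500. The variable names below are hypothetical.

```r
# PROPDMG and converted PROPDMGEXP for the first two rows of the data
propdmg    <- c(25.0, 2.5)
propdmgexp <- c("1000", "1000")   # "K" after conversion, stored as text

# The multiplier is text, so it is converted to numeric before multiplying
propdamage <- propdmg * as.numeric(propdmgexp)
print(propdamage)
```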
Now the total property and crop damage per event type are determined and stored in the group3 object.
group3 <- stormdata_filt %>%
group_by(EVTYPE) %>%
summarise(tProperty = sum(PROPDAMAGE),
tCrop = sum(CROPDAMAGE),
tDamage = sum(PROPDAMAGE) + sum(CROPDAMAGE))
By far the most damage is caused by flood: the total amount between 1993 and 2011 (without considering the time value of money) is more than 161 billion dollars.
head(group3[order(group3$tDamage, decreasing = TRUE),c(1,4) ] )
## # A tibble: 6 × 2
## EVTYPE tDamage
## <chr> <dbl>
## 1 flood 161154146711
## 2 hurricane (typhoon) 90271472810
## 3 storm surge/tide 47965579000
## 4 tornado 28412364540
## 5 hail 19021430734
## 6 flash flood 18275035478.
Flood damage is mainly property damage, which adds up to 150 billion dollars.
head(group3[order(group3$tProperty, decreasing = TRUE), c(1,2)])
## # A tibble: 6 × 2
## EVTYPE tProperty
## <chr> <dbl>
## 1 flood 150216664761
## 2 hurricane (typhoon) 84756180010
## 3 storm surge/tide 47964724000
## 4 tornado 27994901580
## 5 flash flood 16837872328.
## 6 hail 15974542934
Drought is responsible for the most crop damage, with almost 14 billion dollars, and flood follows with almost 11 billion dollars.
head(group3[order(group3$tCrop, decreasing = TRUE), c(1,3)])
## # A tibble: 6 × 2
## EVTYPE tCrop
## <chr> <dbl>
## 1 drought 13972571780
## 2 flood 10937481950
## 3 hurricane (typhoon) 5515292800
## 4 ice storm 5022113500
## 5 hail 3046887800
## 6 excessive heat 1969425000
group4 <- stormdata_filt %>%
subset(EVTYPE %in% c( "flood", "drought")) %>%
group_by(date, EVTYPE) %>%
summarize(property = sum(PROPDAMAGE),
crop = sum(CROPDAMAGE), .groups = "drop_last")
To see how the crop and property damage caused by flood and drought changed over the years, one plot is created for each event type.
ggplot(group4[group4$EVTYPE == "drought",]) +
geom_line(mapping = aes(x = date, y = property, colour = "Property Damage"), linewidth = 1) +
geom_line(mapping = aes(x = date, y = crop, colour = "Crop Damage"), linewidth = 1) +
labs(title = "The total amount of property and crop damage caused by drought",
x = "Year",
y = "Property and Crop Damage [$]") +
theme_bw()
During the analyzed period there were three years (1998, 2000 and 2006) in which severe drought events caused more than 2 billion dollars of crop damage per year. In 2003 the property damage caused by drought was also significant, at more than half a billion dollars.
ggplot(group4[group4$EVTYPE == "flood",]) +
geom_line(mapping = aes(x = date, y = property, colour = "Property Damage"), linewidth = 1) +
geom_line(mapping = aes(x = date, y = crop, colour = "Crop Damage"), linewidth = 1) +
facet_wrap(~EVTYPE, ncol = 2) +
labs(title = "The total amount of property and crop damage caused by flood",
x = "Year",
y = "Property and Crop Damage [$]") +
theme_bw()
The figure above shows that in 2006 there were some severe flood events (known as the 2006 Mid-Atlantic United States flood), which caused the most property damage in the US between 1993 and 2011. In dollar terms the damage reached 120 billion. No other similarly destructive event occurred in the analysed period. Compared with the property damage, the crop damage caused by flood was minimal.
This analysis looked for the most harmful and destructive severe weather events across the US between 1993 and 2011. Overall, tornadoes are the most harmful in terms of population health, while excessive heat caused the most fatalities. In terms of combined crop and property damage, flood is responsible for the largest losses. The analysis also showed that in 2006 a record high property damage was caused by several flood events. Drought is the most destructive weather event for crops.