Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The data comes in the form of a comma-separated-value file compressed via the bzip2 algorithm retrievable at: Storm Data
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The basic goal of this assignment is to explore the NOAA Storm Database and answer the following basic questions about severe weather events.
Key Findings: Tornado causes the most severe health consequence (contribute the greatest number of injuries and fatalities), whilst floods causes the most severe economic consequence (contribute the greatest property and crop damage)
The storm dataset is added into a dataframe titled ‘df’.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- read.csv("repdata_data_StormData.csv.bz2")
The storms dataframe df has 902297 observations and 37 variables.
str(df)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
dim(df)
## [1] 902297 37
head(df, n = 10)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## 7 1 11/16/1951 0:00:00 0100 CST 9 BLOUNT AL
## 8 1 1/22/1952 0:00:00 0900 CST 123 TALLAPOOSA AL
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 TORNADO 0 0 NA
## 2 TORNADO 0 0 NA
## 3 TORNADO 0 0 NA
## 4 TORNADO 0 0 NA
## 5 TORNADO 0 0 NA
## 6 TORNADO 0 0 NA
## 7 TORNADO 0 0 NA
## 8 TORNADO 0 0 NA
## 9 TORNADO 0 0 NA
## 10 TORNADO 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## 7 0 1.5 33 2 0 0 1 2.5
## 8 0 0.0 33 1 0 0 0 2.5
## 9 0 3.3 100 3 0 1 14 25.0
## 10 0 2.3 100 3 0 0 0 25.0
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## 7 K 0 3405 8631
## 8 K 0 3255 8558
## 9 K 0 3334 8740
## 10 K 0 3336 8738
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
## 7 0 0 7
## 8 0 0 8
## 9 3336 8738 9
## 10 3337 8737 10
There seems to be no NA values in our new dataset df1.
colSums(is.na(df1))
## FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP EVTYPE
## 0 0 0 0 0 0 0
There are numerous weather events and we aim to categorise these events such as Thunderstorm Winds and High Wind into broad categories such as “Wind”. First, we see the 30 most common weather events.
From this we can identify the broad categories of events: Wind, Tornado, Flood, Snow, Winter, Storm, Fire, Hail, Rain. For other events, we categorise them into “Others”.
is.factor(df1$EVTYPE)
## [1] FALSE
df1$EVTYPE <- as.factor(df1$EVTYPE)
sort(table(df1$EVTYPE), decreasing = TRUE)[1:30]
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE BLIZZARD
## 3392 2761 2719
## DROUGHT ICE STORM EXCESSIVE HEAT
## 2488 2006 1678
## HIGH WINDS WILD/FOREST FIRE FROST/FREEZE
## 1533 1457 1342
## DENSE FOG WINTER WEATHER/MIX TSTM WIND/HAIL
## 1293 1104 1028
Proceed to categorise the EVTYPES column into a new column EV, and view the count of each new broad category.
df1$EV <- as.character(df1$EV)
df1$EV <- "OTHER"
df1$EV[grep("wind",df1$EVTYPE, ignore.case = TRUE)] <- "WIND"
df1$EV[grep("tornado",df1$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
df1$EV[grep("flood",df1$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
df1$EV[grep("snow",df1$EVTYPE, ignore.case = TRUE)] <- "SNOW"
df1$EV[grep("winter",df1$EVTYPE, ignore.case = TRUE)] <- "WINTER"
df1$EV[grep("storm",df1$EVTYPE, ignore.case = TRUE)] <- "STORM"
df1$EV[grep("fire",df1$EVTYPE, ignore.case = TRUE)] <- "FIRE"
df1$EV[grep("hail",df1$EVTYPE, ignore.case = TRUE)] <- "HAIL"
df1$EV[grep("rain",df1$EVTYPE, ignore.case = TRUE)] <- "RAIN"
table(df1$EV)
##
## FIRE FLOOD HAIL OTHER RAIN SNOW STORM TORNADO WIND WINTER
## 4240 82689 290401 47378 12241 17636 124527 60699 254323 8163
Values for both damages are found in two separate columns - DMG and DMGEXP. For example, for the first observation in the dataset, property damage is 25.00K (25.00 recorded under PROPDMG and K under PROPDMGEXP).
We need to form a new column reflecting the absolute numerical figure of this damage, that is 25,000.
View the denominations
table(df1$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(df1$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
K corresponds to thousands,M corresponds to millions and B corresponds to billions.
For the remaining denominations which cannot be comprehended, we assume that the figure is just denominated in dollars, since the number of observations with such incomprehensible units are not many anyway. We will store the denominations into a separate column, units_prop and units_crop.
df1$units_prop <- 1
is.character(df1$PROPDMGEXP)
## [1] TRUE
df1$units_prop[grep("k", df1$PROPDMGEXP, ignore.case = TRUE)] <- "1000"
df1$units_prop[grep("m", df1$PROPDMGEXP, ignore.case = TRUE)] <- "1000000"
df1$units_prop[grep("b", df1$PROPDMGEXP, ignore.case = TRUE)] <- "1000000000"
table(df1$units_prop)
##
## 1 1000 1000000 1000000000
## 466255 424665 11337 40
df1$units_crop <- 1
df1$units_crop[grep("k", df1$CROPDMGEXP, ignore.case = TRUE)] <- "1000"
df1$units_crop[grep("m", df1$CROPDMGEXP, ignore.case = TRUE)] <- "1000000"
df1$units_crop[grep("b", df1$CROPDMGEXP, ignore.case = TRUE)] <- "1000000000"
table(df1$units_crop)
##
## 1 1000 1000000 1000000000
## 618440 281853 1995 9
create a new column with final damage expenses, crop_dmg and prop_dmg.
Note that we must first check if columns are numeric before multiplying the values and units.
is.numeric(df1$units_prop) #check that column is numeric
## [1] FALSE
df1$units_prop <- as.numeric(df1$units_prop)
df1$units_crop <- as.numeric(df1$units_crop)
is.numeric(df1$PROPDMG)
## [1] TRUE
is.numeric(df1$CROPDMG)
## [1] TRUE
df1$prop_dmg <- df1$PROPDMG * df1$units_prop
df1$crop_dmg <- df1$CROPDMG * df1$units_crop
To analyse the effects of weather event types on population health, we first group the dataframe df1 by the event type, EV (broad categories manually created above). Thereafter, we analyse the total fatalities and injuries by event type.
df1_health <- df1 %>% group_by(df1$EV) %>% summarise(total_fatalities = sum(FATALITIES), total_injuries = sum(INJURIES))
## `summarise()` ungrouping output (override with `.groups` argument)
df1_health
## # A tibble: 10 x 3
## `df1$EV` total_fatalities total_injuries
## <chr> <dbl> <dbl>
## 1 FIRE 90 1608
## 2 FLOOD 1524 8602
## 3 HAIL 45 1467
## 4 OTHER 5674 19840
## 5 RAIN 114 305
## 6 SNOW 164 1164
## 7 STORM 633 6691
## 8 TORNADO 5636 91407
## 9 WIND 1204 8906
## 10 WINTER 61 538
Similarly, we now summarise df1 by grouping by event EV and coomputing the total property and crop damage.
df1_economic <- df1 %>% group_by(df1$EV) %>% summarise(total_prop_dmg = sum(prop_dmg), total_crop_dmg = sum(crop_dmg))
## `summarise()` ungrouping output (override with `.groups` argument)
df1_economic
## # A tibble: 10 x 3
## `df1$EV` total_prop_dmg total_crop_dmg
## <chr> <dbl> <dbl>
## 1 FIRE 8501628500 403281630
## 2 FLOOD 167507193930. 12266906100
## 3 HAIL 17619991072. 3114212873
## 4 OTHER 88765409587. 24090068520
## 5 RAIN 3270230192 919315800
## 6 SNOW 1023569752. 134683100
## 7 STORM 72812921120. 6406789888
## 8 TORNADO 56993098029. 414961520
## 9 WIND 10797310118 1338972750
## 10 WINTER 27298000 15000000
Event type sorted by total fatalities and injuries in descending order
arrange(df1_health, desc(df1_health$total_fatalities))
## # A tibble: 10 x 3
## `df1$EV` total_fatalities total_injuries
## <chr> <dbl> <dbl>
## 1 OTHER 5674 19840
## 2 TORNADO 5636 91407
## 3 FLOOD 1524 8602
## 4 WIND 1204 8906
## 5 STORM 633 6691
## 6 SNOW 164 1164
## 7 RAIN 114 305
## 8 FIRE 90 1608
## 9 WINTER 61 538
## 10 HAIL 45 1467
arrange(df1_health, desc(df1_health$total_injuries))
## # A tibble: 10 x 3
## `df1$EV` total_fatalities total_injuries
## <chr> <dbl> <dbl>
## 1 TORNADO 5636 91407
## 2 OTHER 5674 19840
## 3 WIND 1204 8906
## 4 FLOOD 1524 8602
## 5 STORM 633 6691
## 6 FIRE 90 1608
## 7 HAIL 45 1467
## 8 SNOW 164 1164
## 9 WINTER 61 538
## 10 RAIN 114 305
Create a graph to visualise total fatalities and injuries by event type. Tornado ranks the highest by total injuries and fatalities (excluding ‘others’ category).
colnames(df1_health)
## [1] "df1$EV" "total_fatalities" "total_injuries"
colnames(df1_health)[1] <- "EV" #rename first column
is.factor(df1_health$EV)
## [1] FALSE
df1_health$EV <- as.factor(df1_health$EV)
library(RColorBrewer)
require(gridExtra)
## Loading required package: gridExtra
## Warning: package 'gridExtra' was built under R version 4.0.2
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
plot_fatalities <- ggplot(df1_health, aes(fill = EV, x = reorder(EV, -total_fatalities), y = total_fatalities)) + geom_bar(stat = 'identity') + geom_label(label = df1_health$total_fatalities) + scale_fill_brewer(palette = "Set3") + ggtitle("Event type by total fatalities") + xlab("Event Type") + ylab("Total fatalities")
plot_injuries <- ggplot(df1_health, aes(fill = EV, x = reorder(EV, -total_injuries), y = total_injuries)) + geom_bar(stat = 'identity') + geom_label(label = df1_health$total_injuries) + scale_fill_brewer(palette = "Set3") + ggtitle("Event type by total injuries") + xlab("Event Type") + ylab("Total injjuries")
grid.arrange(plot_fatalities,plot_injuries, ncol = 2)
Floods pose the greatest economic damage to properties and crops.
colnames(df1_economic)
## [1] "df1$EV" "total_prop_dmg" "total_crop_dmg"
colnames(df1_economic)[1] <- "EV"
plot_prop <- ggplot(df1_economic, aes( fill = EV, x = reorder(EV, -total_prop_dmg), y = total_prop_dmg)) + geom_bar(stat = 'identity') +
scale_fill_brewer(palette = "Set3") + ggtitle("Event type by total property damage") + xlab("Event Type") + ylab("Total property damage")
plot_crop <- ggplot(df1_economic, aes( fill = EV, x = reorder(EV, -total_crop_dmg), y = total_crop_dmg)) + geom_bar(stat = 'identity') +
scale_fill_brewer(palette = "Set3") + ggtitle("Event type by total crop damage") + xlab("Event Type") + ylab("Total crop damage")
grid.arrange(plot_prop, plot_crop, ncol = 2)
options(scipen=999)