The study the follows investigates the relationship between severe weather in the US based on the official data from the National Oceanic and Atmospheric Administration (NOAA). The NOAA database shows the number of injuries and fatalities occurred during a meteorological event, as well as the estimated damage caused to properties and crop during the event. Further details on the database are available at https://www.ncdc.noaa.gov/stormevents/.
The first step in this study is to collect the data and load them
into R. The data is taken from the Coursera
Reproducible Research Course website, downloaded and stored as a
data frame into a variable storm.
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "repdata_data_StormData.csv.bz2", method = "curl")
storm <- read.csv("./repdata_data_StormData.csv.bz2")
Use colnames, str to have a first look at
the variables.
colnames(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
It is convenient to consider the variable EVTYPE as
factor. Also, in order to tidy up the entries of this variable, we shall
convert all strings into upper-case. Furthermore, since the date is not
in the right format, we have to fix the data type and, since the data
set presents limited data before 1996, is shall be convenient to create
an additional variable, BGN_YEAR to store the year where
the event had happened.
# Converting EVTYPE in Uppercase and
# saving as factors
storm$EVTYPE <- toupper(storm$EVTYPE)
storm$EVTYPE <- factor(storm$EVTYPE)
# Converting the date into Date type
# and creating new variable BGN_YEAR
dates <- sapply(strsplit(storm$BGN_DATE, split = " "), function(x) x[[1]])
storm$BGN_DATE <- dates
storm$BGN_DATE <- as.Date(storm$BGN_DATE, "%m/%d/%Y")
storm$BGN_YEAR <- as.numeric(format(storm$BGN_DATE, "%Y"))
rm(dates)
From the inspection it is clear that the variables of interest are
EVTYPE, the type of meteorological event occurredINJURIES, counting the number of direct and indirect
injuries due to the eventFATALITIES, counting the number of direct and indirect
fatalities due to the eventPROPDMG, PROPDMGEXP, representing the
quantity of monetary cost for property damage and the corresponding
exponentCROPDMG, CROPDMGEXP, representing the
quantity of monetary cost for crop damage and the corresponding
exponentBGN_YEAR, the year where the meteorological event took
placeTo select and work with this specific subset it is convenient to load
the library dplyr.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
In taking the selection, we have also changed the position of the
variable BGN_YEAR at the front of the data set.
stormData <- select(storm, c(BGN_DATE, BGN_YEAR, EVTYPE,
FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP,
CROPDMG, CROPDMGEXP))
# Only take events from 1996 (before data are incomplete)
stormData <- filter(stormData, BGN_YEAR >= 1996)
# Only take data with an impact
stormData <- filter(stormData, FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
From the preliminary data inspection it turns out that the monetary cost of property and crop damage are stored in two pairs of variables containing the monetary cost and the power of this cost. This power, however, is stored in several different ways, as shown in the tables.
table(storm$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(storm$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
Some letters appear both in upper-case and lower-case, some other times, instead of the magnitude (k, M, B) a number is used (0 to 9). There are also some irregular entries (-, +, ?) which need to be addressed: we shall consider these as \(10^0 = 1\). The following function has aims to fix these exponents in an organic way.
ExponentConvert <- function(x) {
if(x %in% 0:9) x <- 10 ^ as.numeric(x)
else if(tolower(x) == "h") x <- 10 ^ 2
else if(tolower(x) == "k") x <- 10 ^ 3
else if(tolower(x) == "m") x <- 10 ^ 6
else if(tolower(x) == "b") x <- 10 ^ 9
else x <- 1
return(x)
}
This function can be used to create two new variables, representing magnitude, as a power of 10.
# Adding power variables of economic damages
stormData$PROPDMGFACTOR <- sapply(stormData$PROPDMGEXP, ExponentConvert)
stormData$CROPDMGFACTOR <- sapply(stormData$CROPDMGEXP, ExponentConvert)
With these two new variables, we can now create the variables which
represent the health impact and the economical cost of the
meteorological event, which are labelled by HEALTHIMP and
ECONCOST.
# Getting health impact and economic impact variables
stormData <- mutate(stormData, HEALTHIMP = FATALITIES + INJURIES)
stormData <- mutate(stormData, ECONCOST = PROPDMG * PROPDMGFACTOR + CROPDMG * CROPDMGFACTOR)
We can now remove from the data set the variables that we don’t need and focus only on the variable of interest.
stormData <- select(stormData, BGN_YEAR, EVTYPE, HEALTHIMP, ECONCOST)
This gives finally a tidy data set that can be analysed.
head(stormData)
## BGN_YEAR EVTYPE HEALTHIMP ECONCOST
## 1 1996 WINTER STORM 0 418000
## 2 1996 TORNADO 0 100000
## 3 1996 TSTM WIND 0 3000
## 4 1996 TSTM WIND 0 5000
## 5 1996 TSTM WIND 0 2000
## 6 1996 HIGH WIND 0 400000
Let us start by noticing that the variable EVTYPE has
many different levels and some may be duplicates. For example:
evtypeUnique <- unique(stormData$EVTYPE)
evtypeUnique[grep("THUND", evtypeUnique)]
## [1] THUNDERSTORM THUNDERSTORM WIND (G40) THUNDERSTORM WIND
## [4] MARINE THUNDERSTORM WIND
## 898 Levels: HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD ... WND
This also shows that there are actually 898 levels. Going through all
these levels is pointlessly time consuming, so we shall take a faster
route. To this end, let us start by taking a new data set consisting of
the total health impact due to severe weather, taken adding up the
variable HEALTHIMP grouped by EVTYPE and let’s
check the subset of the 95% most extreme events.
healthImpact <- stormData %>% group_by(EVTYPE) %>% summarise(HEALTHIMP = sum(HEALTHIMP))
subset(healthImpact, HEALTHIMP > quantile(HEALTHIMP, prob = 0.95))
## # A tibble: 10 × 2
## EVTYPE HEALTHIMP
## <fct> <dbl>
## 1 EXCESSIVE HEAT 8188
## 2 FLASH FLOOD 2561
## 3 FLOOD 7172
## 4 HEAT 1459
## 5 HURRICANE/TYPHOON 1339
## 6 LIGHTNING 4792
## 7 THUNDERSTORM WIND 1530
## 8 TORNADO 22178
## 9 TSTM WIND 3870
## 10 WINTER STORM 1483
healthImpact <- as.data.frame(healthImpact)
As it is clear from the table shown, line 7 and line 9 both correspond to thunderstorm wind though sometimes this is recorded as TSTM. Let us also fix the record HURRICANTE/TYPHOON changing every possible hurricane to HURRICANE.
stormData[(stormData$EVTYPE == "TSTM WIND"), "EVTYPE"] <- "THUNDERSTORM WIND"
stormData[grep("HURRICANE", stormData$EVTYPE), "EVTYPE"] <- "HURRICANE"
Let’s do the same with the economic cost data set:
economicCost <- stormData %>% group_by(EVTYPE) %>% summarise(ECONCOST = sum(ECONCOST))
subset(as.data.frame(economicCost), ECONCOST > quantile(ECONCOST, prob = 0.95))
## EVTYPE ECONCOST
## 32 DROUGHT 14413667000
## 46 FLASH FLOOD 16557105610
## 48 FLOOD 148919611950
## 66 HAIL 17071172870
## 83 HIGH WIND 5881421660
## 86 HURRICANE 86467941810
## 139 STORM SURGE 43193541000
## 144 THUNDERSTORM WIND 8812957230
## 147 TORNADO 24900370720
## 150 TROPICAL STORM 8320186550
stormData[grep("STORM SURGE", stormData$EVTYPE), "EVTYPE"] <- "STORM SURGE"
Once the problem of duplicates has been dealt with, we can finally
recreate the data sets healthImpact and
economicCost and plot a diagram to see order the events
depending on their health impact and economic cost, repectively. Let us
start by loading the package ggplot2.
library(ggplot2)
# health Impact
healthImpact <- stormData %>%
group_by(EVTYPE) %>%
summarise(HEALTHIMP = sum(HEALTHIMP)) %>%
arrange(desc(HEALTHIMP))
ggplot(healthImpact[1:10,], aes(x = reorder(EVTYPE, -HEALTHIMP),
y = HEALTHIMP, colour = EVTYPE)) +
geom_bar(stat = "identity", fill = "white") +
ggtitle("Fatalities and Injuries in the US caused by severe weather") +
xlab("Event") + ylab("Number of fatalities and injuries") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"))
The diagram shows the first 10 most significative events in terms of health impact (injuries and fatalities). The graph shows that
economicCost <- stormData %>%
group_by(EVTYPE) %>%
summarise(ECONCOST = sum(ECONCOST)) %>%
arrange(desc(ECONCOST))
ggplot(economicCost[1:10,], aes(x = reorder(EVTYPE, -ECONCOST),
y = ECONCOST / 1e6, colour = EVTYPE)) +
geom_bar(stat = "identity", fill = "white") +
ggtitle("Economic Cost in the US caused by severe weather") +
xlab("Event") + ylab("Economic cost in million USD") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"))
The diagram shows the first 10 most significative events in terms of economic cost. The graph shows that