In this report, we aim to analyze the consequences of storm events in the United States between the years 1996 and 2011. Our goal is to find out the most impacting events in this period regarding the total number of fatalities and injuries, as well as the economic damage to properties and crops.
To achieve this objective, we obtained storm events data from the National Weather Service, which is collected from reports across the U.S.. These reports date back to 1950, however, up until 1995, only tornado, thunderstorm wind and hail events were registered. This way, our analyses consider only 1996 and beyond.
Results indicate that excessive heat and tornadoes are the most fatal events, while tornadoes are the ones that cause most injuries, followed by floods and excessive heat. In particular, it seems that major tornadoes hit the U.S. in 2011. Finally, floods have caused most property damages, while droughts have caused most crop damages, and, considering both property and crop damages together, floods are the most critical events.
The data file has been downloaded from the following link: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
For the analysis, this file is expected to be downloaded and placed in the working directory with its original name (i.e., repdata%2Fdata%2FStormData.csv.bz2).
Load the necessary libraries in order to perform the analysis.
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("data.table"))
suppressPackageStartupMessages(library("lubridate"))
suppressPackageStartupMessages(library("lattice"))
Let's take a look at the first 5 rows of our dataset and have a general idea of what it looks like.
filebz2 <- "repdata%2Fdata%2FStormData.csv.bz2"
filecsv <- bzfile(filebz2)
read.csv(filecsv, nrows = 5)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 NA NA NA NA 0
## 2 TORNADO 0 NA NA NA NA 0
## 3 TORNADO 0 NA NA NA NA 0
## 4 TORNADO 0 NA NA NA NA 0
## 5 TORNADO 0 NA NA NA NA 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 NA NA 14.0 100 3 0 0
## 2 NA 0 NA NA 2.0 150 2 0 0
## 3 NA 0 NA NA 0.1 123 2 0 0
## 4 NA 0 NA NA 0.0 100 2 0 0
## 5 NA 0 NA NA 0.0 150 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0 NA NA NA NA
## 2 0 2.5 K 0 NA NA NA NA
## 3 2 25.0 K 0 NA NA NA NA
## 4 2 2.5 K 0 NA NA NA NA
## 5 2 2.5 K 0 NA NA NA NA
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 NA 1
## 2 3042 8755 0 0 NA 2
## 3 3340 8742 0 0 NA 3
## 4 3458 8626 0 0 NA 4
## 5 3412 8642 0 0 NA 5
Now we know we are only interested in the following variables: BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. So we can save some time by reading just them.
# Load and process the data
filecsv <- bzfile(filebz2)
dat <- read.csv(filecsv,
as.is = TRUE,
colClasses = c("NULL",
"character", # BGN_DATE
rep("NULL", 5),
"factor", # EVTYPE
rep("NULL", 14),
"numeric", # FATALITIES
"numeric", # INJURIES
"numeric", # PROPDMG
"factor", # PROPDMGEXP
"numeric", # CROPDMG
"factor", # CROPDMGEXP
rep("NULL", 9)),
check.names = FALSE)
Extract the year from BGN_DATE
dat$BGN_DATE <- year(as.Date(dat$BGN_DATE, format = "%m/%d/%Y %T", tz = "GMT"))
Let's filter some data.
First, from the Storm Events Database website, we know that up until 1995 (included), there were only data for tornado, thunderstorm wind and hail. In order to avoid such bias in our analysis, we can take only data from 1996 (included) onwards.
Furthermore, from the preparation document, we know that magnitude characters from damage estimates include none, “K” for thousands, “M” for millions, and “B” for billions. This way, we can ignore additional characters, which are potentially inaccurate.
dat <- filter(dat, BGN_DATE >= 1996,
PROPDMGEXP %in% c("", "B", "M", "K"),
CROPDMGEXP %in% c("", "B", "M", "K"))
Let's check some summaries from our current data
# Convert to data.table
dat <- dat %>% data.table %>% setkey(EVTYPE, BGN_DATE)
# Print data summaries
str(dat)
## Classes 'data.table' and 'data.frame': 653529 obs. of 8 variables:
## $ BGN_DATE : num 2001 2003 2002 1998 1998 ...
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 2 2 3 4 4 4 4 5 5 5 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
## - attr(*, "sorted")= chr "EVTYPE" "BGN_DATE"
summary(dat)
## BGN_DATE EVTYPE FATALITIES
## Min. :1996 HAIL :207715 Min. : 0.00000
## 1st Qu.:2000 TSTM WIND :128662 1st Qu.: 0.00000
## Median :2005 THUNDERSTORM WIND: 81402 Median : 0.00000
## Mean :2004 FLASH FLOOD : 50998 Mean : 0.01336
## 3rd Qu.:2008 FLOOD : 24247 3rd Qu.: 0.00000
## Max. :2011 TORNADO : 23154 Max. :158.00000
## (Other) :137351
## INJURIES PROPDMG PROPDMGEXP CROPDMG
## Min. :0.00e+00 Min. : 0.00 K :369938 Min. : 0.000
## 1st Qu.:0.00e+00 1st Qu.: 0.00 :276185 1st Qu.: 0.000
## Median :0.00e+00 Median : 0.00 M : 7374 Median : 0.000
## Mean :8.87e-02 Mean : 11.69 B : 32 Mean : 1.839
## 3rd Qu.:0.00e+00 3rd Qu.: 1.00 - : 0 3rd Qu.: 0.000
## Max. :1.15e+03 Max. :5000.00 ? : 0 Max. :990.000
## (Other): 0
## CROPDMGEXP
## :373069
## K :278685
## M : 1771
## B : 4
## ? : 0
## 0 : 0
## (Other): 0
In this section, we show our results according to each question.
There is a number of ways to answer this question. We will analyze fatalities and injuries separately.
Let's check the 5 most harmful events regarding fatalities.
tot.fat <- dat[, .(TOTAL_FATALITIES = sum(FATALITIES)), by = EVTYPE] %>%
setorder(-TOTAL_FATALITIES) %>%
head(5)
tot.fat
## EVTYPE TOTAL_FATALITIES
## 1: EXCESSIVE HEAT 1797
## 2: TORNADO 1511
## 3: FLASH FLOOD 887
## 4: LIGHTNING 651
## 5: FLOOD 414
Now we plot their effects in each year.
# Get data to plot
plt <- dat[EVTYPE %in% tot.fat$EVTYPE,
.(TOTAL_FATALITIES = sum(FATALITIES)),
by = .(BGN_DATE, EVTYPE)]
plt$EVTYPE <- droplevels(plt$EVTYPE)
# Plot
xyplot(TOTAL_FATALITIES ~ BGN_DATE,
plt,
groups = EVTYPE,
type = "b",
auto.key = list(corner = c(0.6, 0.9)),
lwd = 2,
main = "Total fatalities by year",
xlab = "Year",
ylab = "Number of fatalities")
Let's check the 5 most harmful events regarding injuries.
tot.inj <- dat[, .(TOTAL_INJURIES = sum(INJURIES)), by = EVTYPE] %>%
setorder(-TOTAL_INJURIES) %>%
head(5)
tot.inj
## EVTYPE TOTAL_INJURIES
## 1: TORNADO 20667
## 2: FLOOD 6758
## 3: EXCESSIVE HEAT 6391
## 4: LIGHTNING 4141
## 5: TSTM WIND 3629
Now we plot their effects in each year.
# Get data to plot
plt <- dat[EVTYPE %in% tot.inj$EVTYPE,
.(TOTAL_INJURIES = sum(INJURIES)),
by = .(BGN_DATE, EVTYPE)]
plt$EVTYPE <- droplevels(plt$EVTYPE)
# Plot
xyplot(TOTAL_INJURIES ~ BGN_DATE,
plt,
groups = EVTYPE,
type = "b",
auto.key = list(corner = c(0.6, 0.9)),
lwd = 2,
main = "Total injuries by year",
xlab = "Year",
ylab = "Number of injuries")
There is a number of ways to answer this question. We will analyze property and crop damages separately.
First, we need to convert magnitudes to values.
convertMagToVal <- function(mag) {
if (mag == "") {
val <- 1
} else if (mag == "K") {
val <- 1000
} else if (mag == "M") {
val <- 1000000
} else if (mag == "B") {
val <- 1000000000
}
return (val)
}
dat$PROPDMG <- dat$PROPDMG * sapply(dat$PROPDMGEXP, convertMagToVal)
dat$CROPDMG <- dat$CROPDMG * sapply(dat$CROPDMGEXP, convertMagToVal)
Let's check the 5 most damaging events regarding property.
tot.prop <- dat[, .(TOTAL_PROPDMG = sum(PROPDMG)), by = EVTYPE] %>%
setorder(-TOTAL_PROPDMG) %>%
head(5)
tot.prop
## EVTYPE TOTAL_PROPDMG
## 1: FLOOD 143944833550
## 2: HURRICANE/TYPHOON 69305840000
## 3: STORM SURGE 43193536000
## 4: TORNADO 24616945710
## 5: FLASH FLOOD 15222203910
Let's check the 5 most damaging events regarding crop.
tot.crop <- dat[, .(TOTAL_CROPDMG = sum(CROPDMG)), by = EVTYPE] %>%
setorder(-TOTAL_CROPDMG) %>%
head(5)
tot.crop
## EVTYPE TOTAL_CROPDMG
## 1: DROUGHT 13367566000
## 2: FLOOD 4974778400
## 3: HURRICANE 2741410000
## 4: HURRICANE/TYPHOON 2607872800
## 5: HAIL 2476029450
Now we plot the 5 most damaging events regarding both property and crop together.
tot.dmg <- dat[, .(TOTAL_DMG = sum(PROPDMG, CROPDMG)), by = EVTYPE] %>%
setorder(-TOTAL_DMG) %>%
head(5)
tot.dmg
## EVTYPE TOTAL_DMG
## 1: FLOOD 148919611950
## 2: HURRICANE/TYPHOON 71913712800
## 3: STORM SURGE 43193541000
## 4: TORNADO 24900370720
## 5: HAIL 17071172870
barchart(EVTYPE ~ TOTAL_DMG/1000000000,
tot.dmg,
main = "Most damaging events",
xlab = "Damage (in billions of dollars)",
ylab = "Event",
col = "darkred")