In this work, U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database is analyzed. The focus of the study is to analyze the data and find the natural events that are most harmful for population health and for US economy. The database starts from 1950 to November 2011. The official natural event types are 48 but before January 1993 not all event information were included. Hence, I started the primary data analysis from 1993 and finally from 1996. There are total 37 variables but we don’t need all of them. I only kept 10 variables related to types of events, casualties and economic loss. The total and average damage is calculated by each event type. The analysis shows that Hurricane/Typhoon is the most harmful for economy. On the other hand Tornado and heat are most deadly for human life.
At first, “stringdist”, “ggplot2”, “gridExtra” and “dplyr” library is loaded.
library(stringdist)
library(ggplot2)
library(gridExtra)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Then the stormData.csv.bz2 file was loaded to “R” using read.csv function.
fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileurl, "StormData.csv.bz2")
storm <- read.csv(gzfile("StormData.csv.bz2"))
storm[1:5, 1:5]
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## 1 1 4/18/1950 0:00:00 0130 CST 97
## 2 1 4/18/1950 0:00:00 0145 CST 3
## 3 1 2/20/1951 0:00:00 1600 CST 57
## 4 1 6/8/1951 0:00:00 0900 CST 89
## 5 1 11/15/1951 0:00:00 1500 CST 43
To find the approximate number of events (typos are not fixed) measured in each year, the fillowing analysis is done.
data1 <- storm
data1$BGN_DATE <- as.Date(data1$BGN_DATE, "%m/%d/%Y")
data1$EVTYPE <- as.character(data1$EVTYPE)
data1$BGN_DATE <- format(data1$BGN_DATE, "%Y")
data1 <- table(data1$BGN_DATE, data1$EVTYPE)
data1 <- data.frame(data1)
data1 <- data1[data1$Freq >0 , ]
data1 <- table(data1$Var1)
data1 <- data.frame(data1)
data1$Var1 <- format(as.Date(data1$Var1, "%Y"), "%Y")
plot(data1$Var1, data1$Freq, pch = 17, xlab = "Year",
ylab = "Number of Events", col = "red", lwd = 4,
main = "Number of events (no processing was done)
measured in each year", sub = "Figure 1")
From Figure 1, we can see that before 1993 very limited number of events were stored in the database. So, it is justified to use the information from 1993. There is a drastic increase in the number of events in 1993, 1994 and 1995 but it’s become more consistent from 1996. To get a better idea, I divided the data into two sections (1993-1995 and 1996-2011) and did the following analysis.
storm_96 <- storm
storm_96$BGN_DATE <- as.Date(storm_96$BGN_DATE, "%m/%d/%Y")
storm_96$PROPDMGEXP <- as.character(storm_96$PROPDMGEXP)
storm_96$CROPDMGEXP <- as.character(storm_96$CROPDMGEXP)
storm_93 <- subset(storm_96, BGN_DATE >= "1993-01-01" & BGN_DATE < "1996-01-01")
storm_96 <- subset(storm_96, BGN_DATE >= "1996-01-01" & BGN_DATE < "2012-01-01")
unique(storm_93$PROPDMGEXP)
## [1] "" "B" "K" "M" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(storm_93$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
dim(storm_93)
## [1] 61208 37
unique(storm_96$PROPDMGEXP)
## [1] "K" "" "M" "B" "0"
unique(storm_96$CROPDMGEXP)
## [1] "K" "" "M" "B"
dim(storm_96)
## [1] 653530 37
There are many types of multipliers in the database from 1993 to 1995. According to this https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html discussion, the values of multipliers are
H,h,K,k,M,m,B,b,+,-,?,0,1,2,3,4,5,6,7,8, and blank-character
H,h = hundreds = 100
K,k = kilos = thousands = 1,000
M,m = millions = 1,000,000
B,b = billions = 1,000,000,000
(+) = 1
(-) = 0
(?) = 0
black/empty character = 0
numeric 0..8 = 10
We are interested to find the events which are most harmful for human and economy. Hence, except H, K, M and B, all other multipliers can be ignored. Data set storm_93 contains many low value multiplier in PROPDMGEXP and CROPDMGEXP variables. Besides, the number of observations are more than 10 times higher in year 1996 (653530), compared to year 1993-1995 (61208). So, it is reasonable to explore the data from data set storm_96 (year 1996).
In the next section, a subset containing 10 variables are created that are necessary for this analysis. Then, another subset is created, where at least one observation is non-zero among FATALITIES, INJURIES, PROPDMG and CROPDMG variables. It means, an observation will be ignored if all the values are zero.
storm_96 <- subset(storm_96,
select = c("STATE__", "BGN_DATE", "EVTYPE",
"FATALITIES", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "REFNUM"))
storm_96 <- storm_96[storm_96$FATALITIES != 0 | storm_96$INJURIES != 0 |
storm_96$PROPDMG != 0 | storm_96$CROPDMG != 0, ]
storm_96$EVTYPE <- as.character(storm_96$EVTYPE)
storm_96$EVTYPE <- toupper(storm_96$EVTYPE)
str(storm_96)
## 'data.frame': 201318 obs. of 10 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Date, format: "1996-01-06" "1996-01-11" ...
## $ EVTYPE : chr "WINTER STORM" "TORNADO" "TSTM WIND" "TSTM WIND" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 380 100 3 5 2 400 12 8 12 75 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 38 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "K" "" "" "" ...
## $ REFNUM : num 248768 248769 248770 248771 248772 ...
In the below section all the official event types (48) are written in an array to match with the storm_96 data set.
eventtype <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
"Cold", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought",
"Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill",
"Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog",
"Hail", "Heat", "Heavy Rain", "Heavy Snow", "high Surf", "High Wind",
"Hurricane/Typhoon", "Ice storm", "Lake-Effect Snow", "Lakeshore Flood",
"Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet",
"Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
"Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
"Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
eventtype <- toupper(eventtype)
Some of the event types in the data set is not possible to match using matching functions. For example, “LANDSLIDE” is not included in the official type but according to the Storm data document given in the course website, “LANDSLIDE”" was renamed as “DEBRIS FLOW”. There is no type named “OTHER” but from the data set “remarks” column, we can see that most “OTHER” events are “HEAVY RAIN”, although there are exceptions too. Hence, in the following code chunk, this kind of events are replaced with appropriate event name.
pattern1 <- c("TSTM", "LANDSLIDE", "URBAN/SML STREAM FLD", "HYPOTHERMIA/EXPOSURE",
"HYPERTHERMIA/EXPOSURE", "MIXED PRECIPITATION", "UNSEASONABLY WARM",
"UNSEASONABLE COLD", "UNSEASONABLY COLD", "WIND AND WAVE", "FALLING SNOW/ICE",
"MUDSLIDE", "^FOG", "RAIN", "OTHER", "BURST", "LANDSLUMP", "LIGHT SNOW",
"GLAZE")
replacement1 <- c("THUNDERSTORM", "DEBRIS FLOW", "HEAVY RAIN", "COLD/WIND CHILL",
"EXCESSIVE HEAT", "COLD/WIND CHILL", "HEAT", "COLD/WIND CHILL",
"COLD/WIND CHILL", "MARINE HIGH WIND", "SLEET", "HEAVY RAIN",
"FREEZING FOG", "HEAVY RAIN", "HEAVY RAIN", "THUNDERSTORM WIND",
"HEAVY RAIN", "WINTER STORM", "FREEZING FOG")
for (i in 1:length(pattern1))
{
storm_96$EVTYPE <- gsub(pattern1[i], replacement1[i], storm_96$EVTYPE)
}
In the following code chunk, “amatch” function is used to check the typos and match events with the official event types.
st_match <- amatch(storm_96$EVTYPE, eventtype, maxDist = 20, method = "lcs")
print(paste0("Sum : ", sum(is.na(st_match))))
## [1] "Sum : 0"
storm_96$EVTYPE <- eventtype[st_match]
print(paste0("Number of events : ", length(unique(storm_96$EVTYPE))))
## [1] "Number of events : 48"
In the above code, we see that there is no “NA” values after matching the column. The total number of unique types are 48, which is the official number of events.
unique(storm_96$PROPDMGEXP)
## [1] "K" "" "M" "B"
unique(storm_96$CROPDMGEXP)
## [1] "K" "" "M" "B"
dmgexp <- c("K", "M", "", "B")
dmgmlt <- c(1000, 1000000, 0, 1000000000)
storm_96 <- storm_96 %>%
mutate(PROPDMGEXP = ifelse(REFNUM == 605943, "M", PROPDMGEXP))
for (i in 1:length(dmgexp))
{
storm_96 <- storm_96 %>%
mutate(PROPDMG = ifelse(PROPDMGEXP == dmgexp[i], PROPDMG*dmgmlt[i], PROPDMG))
storm_96 <- storm_96 %>%
mutate(CROPDMG = ifelse(CROPDMGEXP == dmgexp[i], CROPDMG*dmgmlt[i], CROPDMG))
}
The property damage value in reference number (REFNUM) 605943 is 115 Billion, which is not possible by a flood, whereas the property damage due to Hurricane Katrina (577615, 522616) is 16.93 and 31.30 Billion. So, I have changed 115 Billion to 115 Million, in the above code. The unique multipliers are “M”, “B”, “K” and “” for both property and crop damage. The damage values are multiplied by the associated multipliers in the EXP columns.
storm_96$FATALINJURY <- storm_96$FATALITIES + storm_96$INJURIES
storm_96$TDAMAGE <- storm_96$PROPDMG + storm_96$CROPDMG
storm_new <- storm_96 %>% group_by(EVTYPE) %>%
summarise(PROPERTY = sum(PROPDMG), CROP = sum(CROPDMG),
TOTALDMG = sum(TDAMAGE), INJURY = sum(FATALINJURY),
AVGDMG = mean(TDAMAGE), AVGINJURY = mean(FATALINJURY))
head(storm_new)
## # A tibble: 6 x 7
## EVTYPE PROPERTY CROP TOTALDMG INJURY AVGDMG
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ASTRONOMICAL LOW TIDE 9745000 0 9745000 0 974500.00
## 2 AVALANCHE 3711800 0 3711800 379 14059.85
## 3 BLIZZARD 525658950 7060000 532718950 455 2336486.62
## 4 COASTAL FLOOD 407218560 0 407218560 14 2046324.42
## 5 COLD 23561600 1308973000 1332534600 322 5670360.00
## 6 DEBRIS FLOW 324583000 20017000 344600000 91 1804188.48
## # ... with 1 more variables: AVGINJURY <dbl>
In the above code chunk, two new variables are created.
Then a new data frame (storm_new) is created where calculations are grouped by events,
In the following code chunk, a global theme for the two figures are created. Then the first figure is created from the total and average damage.
th <- theme_grey() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,
size = 12, face = "bold"),
axis.text.y=element_text(size = 12, face = "bold"),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"),
plot.caption = element_text(size = 12, face = "bold"))
df1 <- storm_new %>% subset(TOTALDMG >= quantile(TOTALDMG)[4])
df2 <- storm_new %>% subset(AVGDMG >= quantile(AVGDMG)[4])
p1 <- ggplot(df1, aes(x = EVTYPE, y = TOTALDMG)) +
geom_bar(stat = "identity") +
labs(x = "Event Type", y = "Total Damage", caption = "Figure 2(a): shows
the total damage by 12 events (75 percentile)") + th
p2 <- ggplot(df1, aes(x = EVTYPE, y = AVGDMG)) +
geom_bar(stat = "identity") +
labs(x = "Event Type", y = "Average Damage", caption = "Figure 2(b): shows
the avearge damage by 12 events (75 percentile)") + th
grid.arrange(p1, p2, ncol = 2)
The events that are responsible for more than 75 percentile of total and average damage are shown in Figure 2. Figure 2(a) shows that Hurricane, Storm and Flood are three main reasons for product and crop damage. The mean value of each event also shows the similar pattern, except Drought takes the third position instead of Flood.
In the code below, the third figure is plotted from the total and average fatalities.
df1 <- storm_new %>% subset(INJURY >= quantile(INJURY)[4])
df2 <- storm_new %>% subset(AVGINJURY >= quantile(AVGINJURY)[4])
p1 <- ggplot(df1, aes(x = EVTYPE, y = INJURY)) +
geom_bar(stat = "identity") +
labs(x = "Event Type", y = "Total Fatalities and Injuries",
caption = "Figure 3(a): shows the events responsible for more than
75 parcentile of total population casualties") + th
p2 <- ggplot(df2, aes(x = EVTYPE, y = AVGINJURY)) +
geom_bar(stat = "identity") +
labs(x = "Event Type", y = "Average Fatalities and Injuries",
caption = "Figure 3(b): shows the average population casualties
by the events that are responsible for more than 75
parcentile of the casualties") + th
grid.arrange(p1, p2, ncol = 2)
The events, that are responsible for more than 75 percentile of human life loss and injury are shown in Figure 3. Figure 3(a) shows that Tornado, Excessive heat and Flood are the three most harmful events for population health. On the contrary, in case of mean life loss and injury by each event, Excessive heat, heat and Tsunami are the most harmful.