The goal of this research is to explore the National Oceanic and Atmospheric Administration (NOAA) Storm Database and answer some basic questions about severe weather events. We assess the impact on population health through two variables, fatalities and injuries. We find that heat exposure causes the most fatalities, followed by tornadoes, and that tornadoes cause the most injuries, followed by heat exposure.
We also analyze which types of events have the greatest economic consequences. We find that floods are the costliest cause of damage to both crops and property, followed in both categories by hurricanes.
First we read NOAA’s storm data into R for analysis.
library(knitr)
storm <- read.csv("repdata-data-StormData.csv.bz2", header = T, na.strings="NA")
dim(storm)
## [1] 902297 37
The data set is large (902,297 rows by 37 columns), so we subset it to the relevant columns and, later, to the relevant events. The columns we identified as most useful are BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. Some further processing specific to each question is shown in the results section.
keycolm <- storm[, c(2,8,23:28)] #columns 2, 8 and 23 to 28 are the eight columns listed above
keycolm$EVTYPE <- tolower(keycolm$EVTYPE) #standardizing events to lower case
keycolm$BGN_DATE <- gsub("0:00:00", "", keycolm$BGN_DATE) #tidying up the dates
keycolm$BGN_DATE <- as.Date(strptime(keycolm$BGN_DATE, format = "%m/%d/%Y"))
keycolm$BGN_DATE <- as.numeric(format(keycolm$BGN_DATE,format="%Y"))
hist(keycolm$BGN_DATE, breaks=61, col="light blue", main="Most observations are after 1995", xlab="Years", ylab="Number of observations")
data <- subset(keycolm, BGN_DATE >=1995)
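The date handling above can be checked on a few sample strings. The following is a minimal sketch, assuming the raw BGN_DATE values follow a month/day/year layout with a trailing time; the sample values and variable names are made up for illustration and are not taken from the data.

```r
# Toy BGN_DATE strings in the assumed "m/d/Y H:M:S" layout
# (illustrative values only, not rows from the storm data).
raw_dates <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00")

# as.Date() with an explicit format stops reading once the format is
# consumed, so the trailing time component is ignored.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
years  <- as.numeric(format(parsed, "%Y"))

# Keep only observations from 1995 onwards, mirroring the subset() call.
keep <- years >= 1995
print(years[keep])
```

Because the trailing characters are ignored during parsing, the gsub() step is probably optional for strings in this layout, although keeping it does no harm.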
The histogram shows that there are far fewer observations before 1995, so we restrict the analysis to data from 1995 onwards. Exploratory analysis also suggested that the pre-1995 records are of poorer quality. We then run an initial aggregation by event type (EVTYPE) before cleaning the labels.
datafatal <- aggregate(FATALITIES ~ EVTYPE, data=data, sum)
head(datafatal[order(datafatal$FATALITIES, decreasing= T),], 30)
## EVTYPE FATALITIES
## 99 excessive heat 1903
## 594 tornado 1545
## 120 flash flood 934
## 200 heat 924
## 314 lightning 729
## 129 flood 423
## 403 rip current 360
## 251 high wind 241
## 610 tstm wind 241
## 16 avalanche 223
## 404 rip currents 204
## 707 winter storm 195
## 201 heat wave 161
## 536 thunderstorm wind 131
## 107 extreme cold 128
## 108 extreme cold/wind chill 125
## 220 heavy snow 115
## 456 strong wind 103
## 244 high surf 102
## 60 cold/wind chill 95
## 207 heavy rain 95
## 109 extreme heat 91
## 288 ice storm 84
## 695 wildfire 75
## 25 blizzard 71
## 276 hurricane/typhoon 64
## 141 fog 61
## 269 hurricane 61
## 605 tropical storm 57
## 239 heavy surf/high surf 42
datainjur <- aggregate(INJURIES ~ EVTYPE, data=data, sum)
head(datainjur[order(datainjur$INJURIES, decreasing= T),], 30)
## EVTYPE INJURIES
## 594 tornado 21765
## 129 flood 6769
## 99 excessive heat 6525
## 314 lightning 4631
## 610 tstm wind 3630
## 200 heat 2030
## 120 flash flood 1734
## 536 thunderstorm wind 1426
## 707 winter storm 1298
## 276 hurricane/typhoon 1275
## 251 high wind 1093
## 175 hail 916
## 695 wildfire 911
## 220 heavy snow 751
## 141 fog 718
## 693 wild/forest fire 545
## 561 thunderstorm winds 444
## 91 dust storm 420
## 711 winter weather 398
## 25 blizzard 385
## 201 heat wave 378
## 288 ice storm 357
## 605 tropical storm 338
## 404 rip currents 297
## 456 strong wind 278
## 67 dense fog 276
## 207 heavy rain 237
## 403 rip current 227
## 160 glaze 216
## 16 avalanche 159
The simple analysis above shows that event types in the storm data set are not labelled consistently: the same kind of event appears under several names (for example “tstm wind”, “thunderstorm wind” and “thunderstorm winds”), so there are far more event types than necessary. We therefore group the most relevant event types into a smaller set of standardized categories. This initial cleaning is enough to attempt the first question.
data$EVTYPE[grepl("heat|warm", data$EVTYPE, ignore.case=T)] <- "heat exposure"
data$EVTYPE[grepl("torn", data$EVTYPE, ignore.case=T)] <- "tornado"
data$EVTYPE[grepl("flood|fld", data$EVTYPE, ignore.case=T)] <- "flood"
data$EVTYPE[grepl("wind", data$EVTYPE, ignore.case=T)] <- "high wind"
data$EVTYPE[grepl("cold|winter|hypo", data$EVTYPE, ignore.case=T)] <- "cold exposure"
data$EVTYPE[grepl("hurr|trop|typh", data$EVTYPE, ignore.case=T)] <-"hurricane"
data$EVTYPE[grepl("fog", data$EVTYPE, ignore.case=T)] <-"fog accident"
data$EVTYPE[grepl("fire", data$EVTYPE, ignore.case=T)] <- "wild fire"
data$EVTYPE[grepl("rip", data$EVTYPE, ignore.case=T)] <- "rip current"
data$EVTYPE[grepl("surf|seas|coast", data$EVTYPE, ignore.case=T)] <- "rough seas"
data$EVTYPE[grepl("snow|blizz", data$EVTYPE, ignore.case=T)] <- "snow"
data$EVTYPE[grepl("land|mud", data$EVTYPE, ignore.case=T)] <- "landslide"
data$EVTYPE[grepl("surge", data$EVTYPE, ignore.case=T)] <- "storm surge"
data$EVTYPE[grepl("rain", data$EVTYPE, ignore.case=T)] <- "heavy rain"
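Because each assignment above overwrites EVTYPE before the next pattern is tested, the order of the rules affects the result. The sketch below illustrates this on three labels from the tables above; `recode_events` and the two rule vectors are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: apply (pattern -> label) rules one after another,
# so each rule sees the output of the rules before it.
recode_events <- function(x, rules) {
  for (i in seq_along(rules)) {
    x[grepl(names(rules)[i], x, ignore.case = TRUE)] <- rules[[i]]
  }
  x
}

events <- c("extreme cold/wind chill", "winter storm", "tstm wind")

# Same two rules, applied in the two possible orders.
wind_first <- c("wind" = "high wind", "cold|winter|hypo" = "cold exposure")
cold_first <- c("cold|winter|hypo" = "cold exposure", "wind" = "high wind")

print(recode_events(events, wind_first))
print(recode_events(events, cold_first))
```

In the sequence used in this analysis the “wind” rule runs before the “cold” rule, so a label such as “extreme cold/wind chill” is counted under “high wind” rather than “cold exposure”. Whether that is the intended grouping is a judgment call; reordering the rules would change the category totals.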
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.2
datafatal <- aggregate(FATALITIES ~ EVTYPE, data=data, sum)
datainjur <- aggregate(INJURIES ~ EVTYPE, data=data, sum)
fatalinjur <- merge(datafatal, datainjur, by="EVTYPE")
fatalinjur <- head(fatalinjur[order(fatalinjur$FATALITIES, decreasing= T),], 10)
fatal <- fatalinjur[,1:2]
names(fatal) <- c("EVTYPE","NUMBER")
fatal$TYPE <- "fatal"
injured <- fatalinjur[,c(1,3)]
names(injured) <- c("EVTYPE","NUMBER")
injured$TYPE <- "injured"
casualty <- rbind(injured, fatal)
g <- ggplot(casualty, aes(x=EVTYPE, y=NUMBER, fill=TYPE))+
geom_bar(stat="identity",colour="blue")+
theme(axis.text.x =element_text(size = 10,angle = 90,hjust = 1,vjust= 1))+
ggtitle("Top 10 event types that cause casualties")+
labs(x="EVENT TYPES", y="NUMBER OF PEOPLE")
print(g)
Heat exposure caused the most fatalities, while tornadoes caused by far the most injuries and the largest number of casualties overall. Tornadoes account for the second-highest number of deaths, and heat exposure for the second-highest number of injuries. Floods and high winds also cause numerous fatalities and injuries.
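The reshaping step behind the casualty plot (select the top event types, then stack fatalities and injuries into one column) can be illustrated on toy numbers. This is a minimal sketch with made-up totals and placeholder event names, not the values from the analysis.

```r
# Toy totals per event type (made-up numbers for illustration).
wide <- data.frame(
  EVTYPE     = c("a", "b", "c"),
  FATALITIES = c(10, 30, 20),
  INJURIES   = c(100, 50, 300),
  stringsAsFactors = FALSE
)

# Keep the top 2 event types, ranked by fatalities only.
top <- head(wide[order(wide$FATALITIES, decreasing = TRUE), ], 2)

# Stack the two measures into one NUMBER column with a TYPE label,
# the long shape that a stacked bar chart with fill = TYPE expects.
long <- rbind(
  data.frame(EVTYPE = top$EVTYPE, NUMBER = top$FATALITIES, TYPE = "fatal",
             stringsAsFactors = FALSE),
  data.frame(EVTYPE = top$EVTYPE, NUMBER = top$INJURIES, TYPE = "injured",
             stringsAsFactors = FALSE)
)
print(long)
```

Note that ranking by fatalities alone means an event type with many injuries but few deaths (event “a” in the toy data) drops out of the plot.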
Across the United States, which types of events have the greatest economic consequences? The key data for this question are fairly untidy and spread over four columns: PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. The dollar amounts are not recorded in consistent units.
unique(data$PROPDMGEXP)
## [1] B M K m + 0 5 6 ? 4 2 3 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(data$CROPDMGEXP)
## [1] M m K B ? 0 k 2
## Levels: ? 0 2 B k K m M
According to the FAQ, alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, “B” for billions and “H” for hundreds. The data uses both upper and lower case. Other values, for example “+”, “5” and “6”, are given a multiplier of 1 when calculating dollar amounts. The next step is therefore to convert the damage amounts to a common unit (dollars), which requires some extra columns for the calculations.
data$PROPDMGUNIT <- 1
data$PROPDMGUNIT[data$PROPDMGEXP %in% c("b","B")] <- 1000000000
data$PROPDMGUNIT[data$PROPDMGEXP %in% c("m","M")] <- 1000000
data$PROPDMGUNIT[data$PROPDMGEXP %in% c("k","K")] <- 1000
data$PROPDMGUNIT[data$PROPDMGEXP %in% c("h","H")] <- 100
data$PROPDMGAMT <- data$PROPDMG*data$PROPDMGUNIT
dataprop <- aggregate(PROPDMGAMT ~ EVTYPE, data=data, sum)
data$CROPDMGUNIT <- 1
data$CROPDMGUNIT[data$CROPDMGEXP %in% c("b","B")] <- 1000000000
data$CROPDMGUNIT[data$CROPDMGEXP %in% c("m","M")] <- 1000000
data$CROPDMGUNIT[data$CROPDMGEXP %in% c("k","K")] <- 1000
data$CROPDMGAMT <- data$CROPDMG*data$CROPDMGUNIT
datacrop <- aggregate(CROPDMGAMT ~ EVTYPE, data=data, sum)
databoth <- merge(datacrop, dataprop, by="EVTYPE")
databoth<- head(databoth[order(databoth$PROPDMGAMT, decreasing=T),],10)
propdata <- databoth[,1:2]
names(propdata) <- c("EVTYPE","COST")
propdata$ASSET <- "Property"
cropdata <- databoth[,c(1,3)]
names(cropdata) <- c("EVTYPE","COST")
cropdata$ASSET <- "Crop"
cropprop <- rbind(cropdata, propdata)
g <- ggplot(cropprop, aes(x=EVTYPE,y=COST/10^9, fill=ASSET))+
geom_bar(stat="identity", colour="blue")+
theme(axis.text.x =element_text(size = 10,angle = 90,hjust = 1,vjust = 1))+
ggtitle("Economic cost from natural disasters")+
labs(x="EVENT TYPES", y="DAMAGE AMOUNT(Billion $)")
print(g)
The graph above shows that floods cause by far the most property damage ($160B) and also the most crop damage ($7B). The next major cause of economic damage is hurricanes, with $9.3B of damage to property and $6.2B of damage to crops. Storm surges also cause a considerable amount of property damage ($4.8B).
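The magnitude-code conversion applied to PROPDMGEXP and CROPDMGEXP can also be written as a single lookup shared by both columns. The following is a minimal sketch under the same rule as the analysis (unrecognized codes get a multiplier of 1); `unit_lookup` and `damage_multiplier` are hypothetical names, and the sample codes and amounts are made up.

```r
# Multipliers for the magnitude codes described in the text.
unit_lookup <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)

# Hypothetical helper, not part of the original analysis: map each code
# to its multiplier, case-insensitively; anything else ("+", "?", digits,
# blank) falls back to 1.
damage_multiplier <- function(exp_code) {
  mult <- unit_lookup[tolower(as.character(exp_code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

codes   <- c("K", "m", "B", "h", "+", "5", "")
amounts <- c(25, 2.5, 1, 3, 10, 10, 10)
print(amounts * damage_multiplier(codes))
```

A usage such as `data$PROPDMG * damage_multiplier(data$PROPDMGEXP)` would then replace the repeated assignments. Unlike the code above, this sketch also applies the “H” multiplier to crop damage, which should make no difference here since that code does not appear in CROPDMGEXP.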