In this report we aim to identify two things: (1) which types of weather events are most harmful to public health, and (2) which types of events have the greatest economic consequences. Our hypothesis is that weather events vary in the severity of their effects on population health and the economy, and the intention of this report is to identify the most severe ones. To investigate this hypothesis we obtained data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Events from before 1995 were excluded from the analysis because earlier years are sparsely recorded. From the analysis we found that the weather events with the greatest health impacts, measured by total fatalities and injuries, are tornadoes, followed by heat events and then flood events. The weather events with the greatest economic impact, measured by combined property and crop damage, are floods, followed by hurricanes and then storm surges.
We used data from the U.S. NOAA storm database. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. A codebook for the data set is also available.
We first manually downloaded and decompressed the data. The file is compressed with the bzip2 algorithm, so appropriate decompression software was used.
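The manual decompression step is not strictly required, since base R can read bzip2-compressed text through a bzfile() connection. A minimal sketch, using a tiny temporary file as a stand-in for the real download (the file and its contents are made up for illustration):

```r
# Write a tiny bzip2-compressed CSV to a temporary file (stand-in for the real data)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c('"EVTYPE","FATALITIES"', '"TORNADO",0', '"HAIL",1'), con)
close(con)

# read.csv() decompresses through the bzfile() connection as it reads
demo <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
unlink(tmp)
demo
```

The analysis in this report still works from the manually decompressed file.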
We then read in the data from the raw text file. It is text data, with fields separated by a “,” (comma character). We skip the header row at this stage and attach the column names afterwards.
sdat <- read.csv("./data/dataset.csv", skip = 1, header = FALSE,
na.strings = "")
After reading in the data, we check its dimensions and the first few rows (there are 902,297 rows in the database).
dim(sdat)
## [1] 902297 37
head(sdat[,1:10])
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO 0 <NA>
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO 0 <NA>
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO 0 <NA>
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO 0 <NA>
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO 0 <NA>
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO 0 <NA>
We then attach column headers to the dataset and make sure they are properly formatted for R data frames.
# Read only the header line, split it on commas, and convert the fields to valid R names
cnames <- readLines("./data/dataset.csv", 1)
cnames <- strsplit(cnames, ",", fixed = TRUE)
names(sdat) <- make.names(cnames[[1]])
head(sdat[,1:10])
## X.STATE__. X.BGN_DATE. X.BGN_TIME. X.TIME_ZONE. X.COUNTY.
## 1 1 4/18/1950 0:00:00 0130 CST 97
## 2 1 4/18/1950 0:00:00 0145 CST 3
## 3 1 2/20/1951 0:00:00 1600 CST 57
## 4 1 6/8/1951 0:00:00 0900 CST 89
## 5 1 11/15/1951 0:00:00 1500 CST 43
## 6 1 11/15/1951 0:00:00 2000 CST 77
## X.COUNTYNAME. X.STATE. X.EVTYPE. X.BGN_RANGE. X.BGN_AZI.
## 1 MOBILE AL TORNADO 0 <NA>
## 2 BALDWIN AL TORNADO 0 <NA>
## 3 FAYETTE AL TORNADO 0 <NA>
## 4 MADISON AL TORNADO 0 <NA>
## 5 CULLMAN AL TORNADO 0 <NA>
## 6 LAUDERDALE AL TORNADO 0 <NA>
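The X. prefix and trailing dot in these names appear to come from the quote characters that strsplit leaves around each header field, which make.names then turns into dots. A small sketch, assuming the header fields are double-quoted in the raw file (the sample header line below is illustrative, not copied from the data):

```r
# Illustrative header line with double-quoted fields (assumed format)
hdr <- '"STATE__","BGN_DATE","EVTYPE"'
fields <- strsplit(hdr, ",", fixed = TRUE)[[1]]

# Quote characters are invalid in R names, so make.names() prepends "X"
# and replaces each quote with a dot
quoted_names <- make.names(fields)

# Stripping the quotes first keeps the original column names
clean_names <- make.names(gsub('"', "", fields, fixed = TRUE))

quoted_names
clean_names
```

The rest of this report keeps the X.-prefixed names, so the cleaner names are shown only for comparison.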
We are interested in the effects of different event types. To assess public health impacts we look at fatalities and injuries. To assess economic impacts we look at property damage and crop damage. Here, we extract each variable and compute the proportion of missing values.
Event type:
etyp <- sdat$X.EVTYPE.
mean(is.na(etyp))
## [1] 0
Fatalities:
fats <- sdat$X.FATALITIES.
mean(is.na(fats))
## [1] 0
Injuries:
injs <- sdat$X.INJURIES.
mean(is.na(injs))
## [1] 0
Property damage:
pdmg <- sdat$X.PROPDMG.
mean(is.na(pdmg))
## [1] 0
pdmgex <- sdat$X.PROPDMGEXP.
mean(is.na(pdmgex))
## [1] 0.5163865
Crop damage:
cdmg <- sdat$X.CROPDMG.
mean(is.na(cdmg))
## [1] 0
cdmgex <- sdat$X.CROPDMGEXP.
mean(is.na(cdmgex))
## [1] 0.6853763
The damage exponents contain missing values (about 52% for property damage and 69% for crop damage), which we need to account for in the later analysis.
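Beyond the share of missing values, it can help to see which exponent codes actually occur. A minimal sketch on a toy vector (the values are invented for illustration and are not taken from the database):

```r
# Toy exponent column; NA marks records with no exponent recorded
expo <- c("K", "M", NA, "K", "B", NA, NA)

# Share of missing values, computed the same way as for the real columns
na_share <- mean(is.na(expo))

# Frequency of each code, with NA counted as its own category
code_counts <- table(expo, useNA = "ifany")
code_counts
```

Running the same table() call on the real exponent columns would show whether codes other than K, M, and B are present.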
Some event types are really the same kind of event but are recorded under different labels. We therefore combine all “heat” events into one category, and do the same for floods, thunderstorms, and so on.
sdat$X.EVTYPE. <- sub(".*[Hh][Ee][Aa][Tt].*","HEAT",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*[Ff][Ll][Oo][Oo][Dd].*","FLOOD",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*HURRICANE.*","HURRICANE",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*TROPICAL STORM.*","TROPICAL STORM",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*RIP CURRENT.*","RIP CURRENTS",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*TSTM.*","THUNDER STORM",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*THUNDERSTORM.*","THUNDER STORM",sdat$X.EVTYPE.)
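The effect of this relabelling can be checked on a few sample labels. The sketch below is a compact variation rather than the code used in this report: normalize_evtype is a hypothetical helper, it covers only three of the categories, and it matches case-insensitively throughout (the calls above are case-sensitive for everything except heat and flood).

```r
# Hypothetical helper: map any label matching a pattern to one canonical name
normalize_evtype <- function(x) {
  rules <- c("HEAT" = "HEAT",
             "FLOOD" = "FLOOD",
             "TSTM|THUNDERSTORM" = "THUNDER STORM")
  for (pat in names(rules)) {
    # Assumption: case-insensitive matching for every pattern
    x[grepl(pat, x, ignore.case = TRUE)] <- rules[[pat]]
  }
  x
}

# Sample labels for illustration
labels <- c("Excessive Heat", "FLASH FLOOD", "TSTM WIND", "HAIL")
normalized <- normalize_evtype(labels)
normalized
```

Labels that match none of the patterns, such as HAIL here, pass through unchanged.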
The data was collected starting in 1950. The older data may not be comprehensive. To check this we take a look at the number of events recorded per year and plot the data.
library(dplyr)
sdat <- mutate(sdat, X.YEAR = as.numeric(format(as.Date(as.character(X.BGN_DATE.),
format = "%m/%d/%Y %H:%M:%S"), "%Y")))
by_yr <- group_by(sdat,X.YEAR) %>%
summarise(No.Events = n())
library(ggplot2)
g <- ggplot(by_yr, aes(x = X.YEAR,y = No.Events))
p <- g + geom_line() + ggtitle("Number of recorded events per year") + xlab("Year") + ylab("Number of events")
print(p)
The number of recorded events rises sharply around 1995. So that the sparsely recorded early years do not bias the comparison between event types, we keep only the data from 1995 onwards.
sdat <- subset(sdat,X.YEAR > 1994)
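The year extraction and the cutoff can be verified on a handful of date strings in the same format as the begin-date column. In this sketch the first date appears in the data preview above; the others are made up for illustration.

```r
# Date strings in the month/day/year hour:minute:second layout of the data
dates <- c("4/18/1950 0:00:00", "6/8/1994 0:00:00",
           "1/1/1995 0:00:00", "11/15/2011 0:00:00")

# Parse each string as a date, then pull out the four-digit year as a number
yr <- as.numeric(format(as.Date(dates, format = "%m/%d/%Y %H:%M:%S"), "%Y"))

# Same cutoff as in the subset() call: keep 1995 and later
keep <- yr > 1994

# Events per remaining year
per_year <- table(yr[keep])
per_year
```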
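Finally, we need to convert the property and crop damage figures to dollar values by applying the multiplier encoded in each exponent (K, M, or B). Records whose exponent is missing or is not one of these three codes are given a multiplier of 0, so they contribute no damage: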
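# Lookup tables mapping exponent codes to multipliers (K = thousand, M = million, B = billion)
pex <- data.frame("exp" = c("K","M","B"),"PROP.MULT" = c(1000,1000000,1000000000))
dex <- data.frame("exp" = c("K","M","B"),"CROP.MULT" = c(1000,1000000,1000000000))
# Left joins (all.x = TRUE) keep events whose exponent has no match in the lookup table
sdat <- merge(sdat,pex,by.x = "X.PROPDMGEXP.", by.y = "exp",sort=FALSE, all.x = TRUE)
sdat <- merge(sdat,dex,by.x = "X.CROPDMGEXP.", by.y = "exp",sort=FALSE, all.x = TRUE)
# Missing or unrecognized exponents get a multiplier of 0
sdat$PROP.MULT[is.na(sdat$PROP.MULT)] <- 0
sdat$CROP.MULT[is.na(sdat$CROP.MULT)] <- 0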
sdat <- mutate(sdat, X.TOTAL.PDAMAGE = X.PROPDMG. * PROP.MULT) %>%
mutate(X.TOTAL.CDAMAGE = X.CROPDMG. * CROP.MULT)
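The scaling arithmetic can be illustrated with a named-vector lookup on toy values. This is a sketch of an alternative to the merge-based approach, not the code used for the results in this report; the damage figures are invented, and folding lowercase codes to uppercase is an assumption that the merge above does not make.

```r
# Multipliers for the three recognized exponent codes
mult <- c(K = 1e3, M = 1e6, B = 1e9)

# Toy damage mantissas and exponent codes (invented for illustration)
dmg <- c(25, 2.5, 10, 1, 3)
exp_codes <- c("K", "M", NA, "B", "k")

# Look up each multiplier by name; assumption: lowercase codes count as uppercase
m <- unname(mult[toupper(exp_codes)])

# Missing or unrecognized codes get a multiplier of 0
m[is.na(m)] <- 0

total <- dmg * m
total
```

A lookup like this leaves the row order of the data untouched, which can make the result easier to check against the input.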
We take a look at the health impacts of each type of event, using total fatalities and total injuries as metrics:
health <- group_by(sdat,X.EVTYPE.) %>%
summarise(Total.Fatalities = sum(X.FATALITIES.),
Total.Injuries = sum(X.INJURIES.)) %>%
mutate(Total.Casualties = Total.Fatalities + Total.Injuries) %>%
filter(Total.Casualties > 0)
health <- health[order(-health$Total.Casualties),]
head(health)
## # A tibble: 6 x 4
## X.EVTYPE. Total.Fatalities Total.Injuries Total.Casualties
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 1545 21765 23310
## 2 HEAT 3087 9103 12190
## 3 FLOOD 1386 8519 9905
## 4 THUNDER STORM 438 5688 6126
## 5 LIGHTNING 729 4631 5360
## 6 WINTER STORM 195 1298 1493
We plot the casualties for the five event types with the most casualties.
health <- health[1:5,]
g <- ggplot(health, aes(X.EVTYPE.,Total.Casualties))
p <- g + geom_bar(stat="identity") + ggtitle("Total casualties by weather event type") +
xlab("Event type") + ylab("Total casualties") +
geom_text(aes(label = format(Total.Casualties, big.mark = ",")), vjust = -1) +
ylim(c(0,25000))
print(p)
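So, since 1995, the weather events that have caused the greatest number of casualties (fatalities plus injuries) are tornadoes, followed by heat and then flooding. Heat, however, accounts for the most fatalities (3,087, compared with 1,545 for tornadoes).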
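The ranking depends on which health metric is used. The sketch below re-enters the top three rows of the table above by hand and orders them by each metric; the data frame and column names are chosen here for illustration only.

```r
# Top three event types, with totals copied from the table above
health_top <- data.frame(
  type = c("TORNADO", "HEAT", "FLOOD"),
  fatalities = c(1545, 3087, 1386),
  injuries = c(21765, 9103, 8519),
  stringsAsFactors = FALSE
)
health_top$casualties <- health_top$fatalities + health_top$injuries

# Rank by combined casualties, then by fatalities alone
by_casualties <- health_top$type[order(-health_top$casualties)]
by_fatalities <- health_top$type[order(-health_top$fatalities)]

by_casualties
by_fatalities
```

Tornadoes lead on combined casualties because of their large injury count, while heat leads on fatalities alone.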
We now take a look at economic impacts, using the property and crop damage metrics:
econ <- group_by(sdat,X.EVTYPE.) %>%
summarise(Total.PROP.DAMAGE = sum(X.TOTAL.PDAMAGE),
Total.CROP.DAMAGE = sum(X.TOTAL.CDAMAGE)) %>%
mutate(Total.DAMAGE.Bil = (Total.PROP.DAMAGE + Total.CROP.DAMAGE)/1000000000) %>%
filter(Total.DAMAGE.Bil > 0)
econ <- econ[order(-econ$Total.DAMAGE.Bil),]
head(econ)
## # A tibble: 6 x 4
## X.EVTYPE. Total.PROP.DAMAGE Total.CROP.DAMAGE Total.DAMAGE.Bil
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 160151551770 7032166400 167.
## 2 HURRICANE 84630180010 5504792800 90.1
## 3 STORM SURGE 43193536000 5000 43.2
## 4 TORNADO 24915219460 296595610 25.2
## 5 HAIL 15040821320 2613777050 17.7
## 6 DROUGHT 1046106000 13922066000 15.0
We plot the damage for the five costliest event types:
econ <- econ[1:5,]
g <- ggplot(econ, aes(X.EVTYPE.,Total.DAMAGE.Bil))
p <- g + geom_bar(stat="identity") +
ggtitle("Total economic damage ($Billions) by weather event type") +
xlab("Event type") + ylab("Total damage ($Billions)") +
geom_text(aes(label = format(round(Total.DAMAGE.Bil, 2))), vjust = -1) +
ylim(c(0,180))
print(p)
So the weather events with the largest economic impacts are floods, followed by hurricanes and then storm surges.
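By default ggplot2 arranges the bars alphabetically by event type. If bars sorted by damage are preferred, one option is to reorder the factor levels before plotting. A minimal sketch, re-entering the top five rows of the table above by hand (the data frame and column names are chosen for illustration; the plot is built only if ggplot2 is installed):

```r
# Top five event types, with property and crop damage copied from the table above
econ_top <- data.frame(
  type = c("FLOOD", "HURRICANE", "STORM SURGE", "TORNADO", "HAIL"),
  prop = c(160151551770, 84630180010, 43193536000, 24915219460, 15040821320),
  crop = c(7032166400, 5504792800, 5000, 296595610, 2613777050)
)
econ_top$total_bil <- (econ_top$prop + econ_top$crop) / 1e9

# Order the factor levels by descending total damage
econ_top$type_ordered <- reorder(econ_top$type, -econ_top$total_bil)
bar_order <- levels(econ_top$type_ordered)
bar_order

# Build the bar chart with the reordered factor on the x axis
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(econ_top, aes(type_ordered, total_bil)) +
    geom_bar(stat = "identity") +
    xlab("Event type") + ylab("Total damage ($Billions)")
}
```

The same reorder() call could be applied to the casualties plot.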