In this report we identify the types of severe weather events that cause the greatest economic loss and are most harmful to population health. We examine the fatalities, injuries, and property and crop damage resulting from these events, using data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The events in the database start in 1950 and end in November 2011. From our analysis, we found that flooding events have the greatest economic consequences, and tornado events are most harmful to population health.
The database we analyse tracks characteristics of major storms and weather events in the United States from 1950 to 2011, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data are stored as a comma-separated-value file compressed with bzip2 and can be obtained from the course website.
library(ggplot2)
library(scales)
# Load data
data <- read.csv(bzfile("data/repdata_data_StormData.csv.bz2"))
# Check the features of the dataset
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
There are 902297 observations with 37 variables. We are interested in:
1. Event type: ‘EVTYPE’
2. Health related variables: ‘FATALITIES’ and ‘INJURIES’
3. Damage cost related variables: ‘PROPDMG’, ‘PROPDMGEXP’, ‘CROPDMG’, and ‘CROPDMGEXP’
Details of some of the variables can be found in the National Weather Service Storm Data Documentation.
The damage cost of an event is split into property damage ‘PROPDMG’ and crop damage ‘CROPDMG’, with the magnitudes given in the ‘PROPDMGEXP’ and ‘CROPDMGEXP’ fields. The valid values of the magnitude fields are blank, ‘K’ for thousands, ‘M’ for millions, and ‘B’ for billions. We first take a look at these two fields.
summary(data$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5
## 465934      1      8      5    216     25     13      4      4     28
##      6      7      8      B      h      H      K      m      M
##      4      5      1     40      1      6 424665      7  11330
summary(data$CROPDMGEXP)
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
Both fields contain values other than the valid characters. However, their number is small enough for us to ignore for now.
To make the calculation more convenient, we create a new field, ‘TOTALDMG’, holding the actual dollar cost computed from the ‘DMG’ and ‘DMGEXP’ fields.
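As a rough check on how small that number is, the counts printed by the two `summary()` calls can be tallied directly. This is only a sketch: the counts are copied by hand from the output above, the first (unlabelled) count is assumed to be the blank level, and "valid" is taken strictly as blank, ‘K’, ‘M’, and ‘B’, so lower-case codes such as ‘m’ count as invalid here.

```r
# Exponent codes and their counts, copied from the summary() output above.
# Assumption: the first, unlabelled count belongs to the blank level "".
prop_codes  <- c("", "-", "?", "+", "0", "1", "2", "3", "4", "5",
                 "6", "7", "8", "B", "h", "H", "K", "m", "M")
prop_counts <- c(465934, 1, 8, 5, 216, 25, 13, 4, 4, 28,
                 4, 5, 1, 40, 1, 6, 424665, 7, 11330)

crop_codes  <- c("", "?", "0", "2", "B", "k", "K", "m", "M")
crop_counts <- c(618413, 7, 19, 1, 9, 21, 281832, 1, 1994)

# Codes the report treats as valid (strict, case-sensitive).
valid <- c("", "K", "M", "B")

# Number of rows whose exponent code is not one of the valid codes.
prop_invalid <- sum(prop_counts[!(prop_codes %in% valid)])
crop_invalid <- sum(crop_counts[!(crop_codes %in% valid)])

# Share of all rows with an invalid code, per field.
prop_share <- prop_invalid / sum(prop_counts)
crop_share <- crop_invalid / sum(crop_counts)
```

Under these assumptions, 328 property rows and 49 crop rows out of 902297 carry an invalid code, well below 0.1% of the data in both fields.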
#`%+=%` = function(e1,e2) eval.parent(substitute(e1 <- e1 + e2))
data[,'TOTALDMG'] <- 0
data$TOTALDMG[grep("B", data$CROPDMGEXP)] = (data$CROPDMG[grep("B", data$CROPDMGEXP)] * 1e9)
data$TOTALDMG[grep("M", data$CROPDMGEXP)] = (data$CROPDMG[grep("M", data$CROPDMGEXP)] * 1e6)
data$TOTALDMG[grep("K", data$CROPDMGEXP)] = (data$CROPDMG[grep("K", data$CROPDMGEXP)] * 1e3)
data$TOTALDMG[grep("B", data$PROPDMGEXP)] = (data$PROPDMG[grep("B", data$PROPDMGEXP)] * 1e9)
data$TOTALDMG[grep("M", data$PROPDMGEXP)] = (data$PROPDMG[grep("M", data$PROPDMGEXP)] * 1e6)
data$TOTALDMG[grep("K", data$PROPDMGEXP)] = (data$PROPDMG[grep("K", data$PROPDMGEXP)] * 1e3)
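Note that the assignments above write each result into ‘TOTALDMG’ rather than accumulating it, so a row that has both crop and property damage keeps only the property figure. A minimal sketch of one alternative, which maps each exponent code to a multiplier and sums the two costs, is shown below. The helper name `exp_multiplier` and the toy data frame are hypothetical and used for illustration only; invalid codes are assumed to contribute zero, as in the code above.

```r
# Hypothetical helper: map exponent codes to dollar multipliers.
# Only "K", "M" and "B" are recognised (case-sensitive); anything else,
# including blank, is assumed to contribute 0.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[as.character(code)]
  mult[is.na(mult)] <- 0
  unname(mult)
}

# Toy rows (made-up values, not taken from the NOAA data).
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3),
  PROPDMGEXP = c("K", "M", "B", ""),
  CROPDMG    = c(10, 0, 2, 5),
  CROPDMGEXP = c("K", "", "M", "?"),
  stringsAsFactors = FALSE
)

# Total cost per row = property cost + crop cost, both in dollars.
toy$TOTALDMG <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP) +
  toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)
```

In the toy data the first row has 25K of property damage and 10K of crop damage, so its total is 35000 rather than 25000. Applying this to the full dataset would change the totals reported below, so it is offered as a sketch rather than a drop-in replacement.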
Verify the calculation by showing some of the newly calculated costs.
head(data$TOTALDMG[data$TOTALDMG > 1e9])
## [1] 5.0000e+09 2.1050e+09 1.6025e+09 1.0000e+10 2.5000e+09 5.0005e+09
Similarly, since there are two health-related variables, we combine them into one for further analysis.
data[,'TOTALHEALTH'] <- 0
data$TOTALHEALTH <- data$FATALITIES + data$INJURIES
Finally, we sum the injuries/fatalities and the damage cost for each event type, and combine the results into one data frame.
evt_econ <- setNames(aggregate(TOTALDMG ~ EVTYPE, data = data, FUN = sum), c("Event", "Value"))
evt_health <- setNames(aggregate(TOTALHEALTH ~ EVTYPE, data = data, FUN = sum), c("Event", "Value"))
evt_health$cat <- "Health"
evt_econ$cat <- "Economy"
re <- rbind(evt_health, evt_econ)
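To make the shape of the combined data frame concrete, here is the same aggregate-and-stack pattern on a small toy data frame. The values are made up for illustration and the `toy_*` names are hypothetical; only the column names mirror the ones used above.

```r
# Toy event records (made-up values, not taken from the NOAA data).
toy <- data.frame(
  EVTYPE      = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  TOTALDMG    = c(100, 500, 50, 10),
  TOTALHEALTH = c(3, 1, 2, 0)
)

# Sum each measure per event type and rename the columns.
toy_econ <- setNames(aggregate(TOTALDMG ~ EVTYPE, data = toy, FUN = sum),
                     c("Event", "Value"))
toy_health <- setNames(aggregate(TOTALHEALTH ~ EVTYPE, data = toy, FUN = sum),
                       c("Event", "Value"))

# Tag each table with its category and stack them into long format.
toy_econ$cat <- "Economy"
toy_health$cat <- "Health"
toy_long <- rbind(toy_health, toy_econ)
```

The result has one row per event type and category, which is the layout `facet_grid(cat ~ .)` expects.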
The following figure gives a rough idea of the relationship between event types and the damage they cause.
ggplot(re, aes(x=Event, y=Value)) + geom_point() + facet_grid(cat ~ ., scales = "free")
Since too many event types carry damage costs to show clearly, we look at the top 4, which are the most prominent ones.
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(head(evt_econ[order(-evt_econ[,2]),], 4), aes(Event, Value)) + geom_point(size=4, color="steelblue") + labs(title = "The 4 Events with the Greatest Economic Consequences", y = "Cost")
We look at the proportion of total economic damage caused by the top 1 and top 4 events.
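By default ggplot2 orders a discrete x axis alphabetically, so the largest event does not necessarily appear first. A possible variation, sketched on made-up values, uses `reorder()` to sort the axis by descending cost. The helper `top_k` and the data frame `evt_toy` are hypothetical names introduced for this example.

```r
library(ggplot2)

# Toy per-event totals (made-up values, not taken from the NOAA data).
evt_toy <- data.frame(
  Event = c("HAIL", "FLOOD", "TORNADO", "DROUGHT", "LIGHTNING"),
  Value = c(10, 500, 150, 40, 5)
)

# Hypothetical helper: keep the k rows with the largest Value.
top_k <- function(df, k) head(df[order(-df$Value), ], k)

top4 <- top_k(evt_toy, 4)

# reorder() sorts the x axis by descending Value instead of alphabetically.
p <- ggplot(top4, aes(x = reorder(Event, -Value), y = Value)) +
  geom_point(size = 4, color = "steelblue") +
  labs(x = "Event", y = "Cost")
```

Printing `p` draws the four points from the most to the least costly event, left to right.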
total_cost <- sum(evt_econ$Value)
percent(sum(head(evt_econ[order(-evt_econ[,2]),], 4)$Value)/total_cost)
## [1] "67.8%"
percent(sum(head(evt_econ[order(-evt_econ[,2]),], 1)$Value)/total_cost)
## [1] "31.6%"
We see that the event “Flood” accounts for 31.6% of the total damage cost, and the top 4 events together account for 67.8%.
Similarly, we show the top 5 events related to population health, which are the most prominent ones.
ggplot(head(evt_health[order(-evt_health[,2]),], 5), aes(Event, Value)) + geom_point(size=4, color="steelblue") + labs(title = "The 5 Most Harmful Events with respect to Population Health", y = "Injuries/Fatalities") + theme(plot.title = element_text(hjust = 0.5))
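The share calculation above can be wrapped in a small function so that the same logic serves both the economic and the health totals. This is a sketch only: `top_share` is a hypothetical helper, and the toy values are made up to keep the arithmetic easy to follow.

```r
# Hypothetical helper: fraction of the overall total contributed by the
# k largest values.
top_share <- function(values, k) {
  sorted <- sort(values, decreasing = TRUE)
  sum(head(sorted, k)) / sum(values)
}

# Toy per-event costs (made-up values, not taken from the NOAA data).
toy_costs <- c(FLOOD = 50, HURRICANE = 30, TORNADO = 15, HAIL = 5)

share_top1 <- top_share(toy_costs, 1)  # 50 / 100
share_top2 <- top_share(toy_costs, 2)  # (50 + 30) / 100
```

With the real data, a call such as `top_share(evt_econ$Value, 4)` would be expected to reproduce the proportion computed above.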
“Tornado” is clearly the event type causing the highest number of casualties.
percent(sum(head(evt_health[order(-evt_health[,2]),], 1)$Value)/sum(evt_health$Value))
## [1] "62.3%"
Tornado events account for 62.3% of all recorded casualties.