Abstract

This work investigates damage from severe weather events in the United States from January 1950 to November 2011. The data was taken from the NOAA storm database. Damage was calculated separately for public health, property, and crops. Event type names were restructured to correspond to NOAA’s official 48 event types. We also built a new variable, “HEALTHDMG”, that combines the number of fatalities and injuries into a single number. Two graphs display the damage by event type: one for public health damage and one for property and crop damage combined. Our results show that tornados and floods are the two event types that appear among the top five most destructive events in both rankings (public health, and property and crops).

Part I: Data Processing.

First, we process the raw data from the NOAA storm database.
R packages used:
- dplyr
- reshape2
- ggplot2

1.1 Reading the data.

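library(dplyr)     # select(), mutate(), recode(), %>%
library(reshape2)  # melt()
library(ggplot2)   # plotting

t1 <- Sys.time()
# stringsAsFactors = FALSE keeps EVTYPE and the *EXP codes as character vectors
d <- read.csv("Stormdata.csv.bz2", sep=",", header = TRUE, stringsAsFactors = FALSE)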
print( paste("Time reading the dataset:",
              round(difftime(Sys.time(), t1, units="secs")), "secs", sep=" "))
## [1] "Time reading the dataset: 132 secs"
#head(d)


1.2 Omitting irrelevant variables.

To process the data more quickly, we keep only the variables that are relevant to our analysis, namely:
- Case reference number (REFNUM)
- Event type name (EVTYPE)
- Damage to public health (FATALITIES and INJURIES)
- Property damage (PROPDMG and PROPDMGEXP)
- Crop damage (CROPDMG and CROPDMGEXP)

The results are aggregated across all states and all years, so the state and date variables are omitted as well.

d <- select(d, c(REFNUM, EVTYPE, FATALITIES:CROPDMGEXP)) 
head(d, 3)
##   REFNUM  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1      1 TORNADO          0       15    25.0          K       0           
## 2      2 TORNADO          0        0     2.5          K       0           
## 3      3 TORNADO          0        2    25.0          K       0


1.3 Reducing event type names.

One major challenge with the raw data is the large number (985) of unstandardized event type names, including typos, compared with the 48 official types:

paste0("Number of unique event types: ", length(unique(d$EVTYPE)))
## [1] "Number of unique event types: 985"

Because this analysis is preliminary, we remove all event type names that occur 50 times or fewer in the whole dataset. We assume that this does not severely affect the accuracy of the results:

table_cut <- table(d$EVTYPE)
table_cut <- table_cut[ which(table_cut>50) ]
d <- d [d$EVTYPE %in% names(table_cut), ]
paste0("Number of unique event types: ", length(unique(d$EVTYPE)))
## [1] "Number of unique event types: 87"
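
The frequency cut-off can be illustrated on a small scale. The sketch below uses a made-up vector of labels (the values and the threshold `min_count` are assumptions for illustration, not taken from the NOAA data) and keeps only labels that occur more than `min_count` times, in the same way as the code above:

```r
# Toy event-type vector (hypothetical values, not from the NOAA data)
evtype <- c(rep("TORNADO", 5), rep("HAIL", 3), "TORNDAO", "HAIL 075")

# Count occurrences per label and keep labels seen more than `min_count` times
min_count <- 2
counts <- table(evtype)
keep <- names(counts[counts > min_count])

# Rows with rare labels (here a typo and a one-off variant) are dropped
filtered <- evtype[evtype %in% keep]
print(sort(unique(filtered)))
```

Note that rare labels are dropped together with their rows, so any damage recorded under them is lost rather than reassigned.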

This leaves only 87 event types. We now introduce the list of official event types to compare them with those in the data. The full list is available here: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. We also convert both the official event names and the names in our data to lower case, in order to avoid mismatches:

events <- tolower(c("Astronomical Low Tide",
"Avalanche",
"Blizzard",
"Coastal Flood",
"Cold/Wind Chill",
"Debris Flow",
"Dense Fog",
"Dense Smoke",
"Drought",
"Dust Devil",
"Dust Storm",
"Excessive Heat",
"Extreme Cold/Wind Chill",
"Flash Flood",
"Flood",
"Frost/Freeze",
"Funnel Cloud",
"Freezing Fog",
"Hail",
"Heat",
"Heavy Rain",
"Heavy Snow",
"High Surf",
"High Wind",
"Hurricane (Typhoon)",
"Ice Storm",
"Lake-Effect Snow",
"Lakeshore Flood",
"Lightning",
"Marine Hail",
"Marine High Wind",
"Marine Strong Wind",
"Marine Thunderstorm Wind",
"Rip Current",
"Seiche",
"Sleet",
"Storm Surge/Tide",
"Strong Wind",
"Thunderstorm Wind",
"Tornado",
"Tropical Depression",
"Tropical Storm",
"Tsunami",
"Volcanic Ash",
"Waterspout",
"Wildfire",
"Winter Storm",
"Winter Weather"))

d$EVTYPE <- tolower(d$EVTYPE)

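We match the official event names against our data and replace a name in the data whenever it contains an official name as a substring. The loop takes each official event name in turn and searches for matches in our dataset. Because the names are processed in list order, a more specific type can be absorbed by a broader one that is processed later (for example, rows labeled “flash flood” also match “flood” and end up as “flood”):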

for (i in seq_along(events)) {
    rows <- grep(events[i], d$EVTYPE, ignore.case=TRUE)
    d$EVTYPE[rows] <- events[i]
}
paste0("Number of unique event types: ", length(unique(d$EVTYPE)))
## [1] "Number of unique event types: 61"
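
To make the effect of substring matching concrete, here is a minimal sketch on made-up labels (the vectors `official` and `raw` are hypothetical). The first loop mirrors the approach used above. The second is only one possible variant, not the method used in this report: it always matches against the untouched labels and processes shorter names first, so that longer, more specific names win.

```r
official <- c("flash flood", "flood", "heat")
raw <- c("flash flooding", "river flood", "heat wave", "flash flood")

# Approach used above: labels are overwritten in place, in list order
in_place <- raw
for (name in official) {
  hits <- grep(name, in_place)
  in_place[hits] <- name
}
print(in_place)   # "flash flood" rows are later re-labeled as "flood"

# Hypothetical variant: match against the raw labels, shortest names first,
# so the longest matching official name is assigned last and is kept
longest <- raw
for (name in official[order(nchar(official))]) {
  longest[grepl(name, raw, fixed = TRUE)] <- name
}
print(longest)
```

Switching to the variant would change the event counts and the rankings reported below, so it is shown only to clarify how the matching behaves.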


Finally, we manually match the remaining event types to the official list of 48 names.

d$EVTYPE [ grep("marine tstm wind", d$EVTYPE)] <- "marine thunderstorm wind"
d$EVTYPE [ grep("tstm wind", d$EVTYPE)] <- "thunderstorm wind"
d$EVTYPE [ grep("freezing rain", d$EVTYPE)] <- "frost/freeze"
d$EVTYPE [ grep("snow|wintry|snowfall", d$EVTYPE)] <- "winter weather"
d$EVTYPE [ grep("extreme cold|record cold|extreme windchill", d$EVTYPE) ] <- 
    "extreme cold/wind chill"
d$EVTYPE [ grep("warm", d$EVTYPE)] <- "heat"
d$EVTYPE [ grep("hurricane|typhoon", d$EVTYPE)] <- "hurricane (typhoon)"
d$EVTYPE [ grep("tide", d$EVTYPE)] <- "storm surge/tide"
d$EVTYPE [ grep("microburst", d$EVTYPE)] <- "tornado"
d$EVTYPE [ grep("stream fld", d$EVTYPE)] <- "flood"

d$EVTYPE [d$EVTYPE=="storm surge"] <- "storm surge/tide"
d$EVTYPE [d$EVTYPE=="wind"] <- "strong wind"
d$EVTYPE [d$EVTYPE=="gusty winds"] <- "strong wind"
d$EVTYPE [d$EVTYPE=="heavy surf"] <- "high surf"
d$EVTYPE [d$EVTYPE=="cold"] <- "cold/wind chill"
d$EVTYPE [d$EVTYPE=="freeze"] <- "frost/freeze"
d$EVTYPE [d$EVTYPE=="frost"] <- "frost/freeze"
d$EVTYPE [d$EVTYPE=="fog"] <- "dense fog"
d$EVTYPE [d$EVTYPE=="ice"] <- "winter weather"
d$EVTYPE [d$EVTYPE=="unseasonably dry"] <- "drought"
d$EVTYPE [d$EVTYPE=="forest fire"] <- "wildfire"

paste0("Number of unique event types: ", length(unique(d$EVTYPE)))
## [1] "Number of unique event types: 34"


1.4 Calculating damage to public health.

Damage to public health is coded in the dataset by two variables: the number of deaths (FATALITIES) and the number of injuries (INJURIES). We build a new composite variable by weighting each death by a factor of 20 (a death being a far graver outcome than an injury) and adding the number of injuries. The new variable is called “HEALTHDMG”. The original variables are deleted.

d <- d %>% mutate(HEALTHDMG=FATALITIES*20+INJURIES) %>% 
                    select(-FATALITIES,-INJURIES)
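
As a worked example of the formula, the sketch below applies the same weighting to three made-up records (the data frame `toy` and the name `fatality_weight` are introduced here for illustration only). A record with 2 deaths and no injuries scores 2 * 20 + 0 = 40, more than a record with 15 injuries and no deaths:

```r
# Weight applied to each death, as chosen in the text above
fatality_weight <- 20

# Hypothetical records, not taken from the NOAA data
toy <- data.frame(FATALITIES = c(0, 2, 1),
                  INJURIES   = c(15, 0, 4))

# Composite health damage: weighted deaths plus injuries
toy$HEALTHDMG <- toy$FATALITIES * fatality_weight + toy$INJURIES
print(toy)
```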


1.5 Calculating property and crop damage.

The variables PROPDMG and CROPDMG must be multiplied by a power of 10, which is coded in the variables PROPDMGEXP and CROPDMGEXP. The codes and their descriptions are available at the link below: https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html

First, we recode the variables PROPDMGEXP and CROPDMGEXP into numeric multipliers:

d$PROPDMGEXP = recode (d$PROPDMGEXP,
                         .default=0,
                         `+`=1,
                         `0`=10,
                         `1`=10,
                         `2`=10,
                         `3`=10,
                         `4`=10,
                         `5`=10,
                         `6`=10,
                         `7`=10,
                         `8`=10,
                         h=100,
                         H=100,
                         k=1000,
                         K=1000,
                         m=10^6,
                         M=10^6,
                         b=10^9,
                         B=10^9)

d$CROPDMGEXP = recode (d$CROPDMGEXP,
                         .default=0,
                         `0`=10,
                         `2`=10,
                         k=1000,
                         K=1000,
                         m=10^6,
                         M=10^6,
                         B=10^9)

Then, we create new variables PROPDMG_R and CROPDMG_R, deleting the original variables:

d <- d %>% mutate(PROPDMG_R=PROPDMG*PROPDMGEXP) %>%
                mutate(CROPDMG_R=CROPDMG*CROPDMGEXP) %>%
                    select(-PROPDMG:-CROPDMGEXP)
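
The conversion from exponent code to dollar amount can be checked on a few made-up rows. The sketch below uses a base R lookup vector instead of `recode` (the helper `to_multiplier` and the data frame `toy` are hypothetical names), but applies the same mapping as above: letters map to powers of ten, the digits 0 to 8 map to 10, and anything else, including a blank code, maps to 0.

```r
# Letter and "+" codes with their multipliers (same values as the recode above)
exp_codes <- c(h = 100, H = 100, k = 1e3, K = 1e3,
               m = 1e6, M = 1e6, b = 1e9, B = 1e9, "+" = 1)

# Hypothetical helper: translate a character vector of codes into multipliers
to_multiplier <- function(code) {
  mult <- unname(exp_codes[code])              # unknown or blank codes give NA
  mult[code %in% as.character(0:8)] <- 10      # digit codes map to 10
  mult[is.na(mult)] <- 0                       # everything else maps to 0
  mult
}

# Made-up rows, not taken from the NOAA data
toy <- data.frame(PROPDMG    = c(25, 2.5, 3, 7),
                  PROPDMGEXP = c("K", "M", "", "5"),
                  stringsAsFactors = FALSE)
toy$PROPDMG_R <- toy$PROPDMG * to_multiplier(toy$PROPDMGEXP)
print(toy)
```

One consequence of this mapping is that a record with a blank exponent code contributes nothing to the totals, whatever its PROPDMG value.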

As pointed out by a colleague (credit goes to Mark Blackmore on Coursera’s discussion forums), the case with reference number 605943 is mis-coded: it has a power of ten of e+9 instead of e+6. We therefore divide the value for this case by 1000:

d$PROPDMG_R[ which(d$REFNUM==605943) ] <-
    d$PROPDMG_R[ which(d$REFNUM==605943) ] / 1000

We now have all the data ready for the analysis.

Part II: Results.

In this part, we show bar graphs of total health damage and total property and crop damage by event type, sorted in descending order.

2.1 Health damage.

Health damage is based on our composite variable “HEALTHDMG”, which equals the number of fatalities multiplied by 20 plus the number of injuries (see section 1.4).

ggplot(d, aes(x=reorder(EVTYPE, HEALTHDMG, sum), y=HEALTHDMG) ) +
    geom_bar(stat="identity", fill="#FF9999") +
    coord_flip() +
    labs (x="Event Type", y="Health damage") +
    theme (plot.title=element_text(hjust=0.5)) +
    ggtitle ("Damage to health (fatalities and injuries)")


We conclude that, in terms of public health, the top 5 most harmful events are:
1. Tornados.
2. Heat.
3. Floods.
4. Thunderstorm Winds.
5. Lightning.
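
A ranking like the one above can also be read from a table of totals rather than from the graph. The following is a minimal sketch on made-up numbers (the data frame `toy` is hypothetical; on the real data the same two steps would be applied to `d`):

```r
# Hypothetical per-record health damage, not taken from the NOAA data
toy <- data.frame(EVTYPE    = c("tornado", "heat", "tornado", "flood", "heat"),
                  HEALTHDMG = c(55, 20, 10, 5, 40),
                  stringsAsFactors = FALSE)

# Sum the damage per event type, then sort from largest to smallest
totals <- aggregate(HEALTHDMG ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(-totals$HEALTHDMG), ]
print(head(totals, 5))
```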

2.2 Property damage.

Please note that the damage is displayed in billions of US dollars.

# First, melting the variables of our interest:
gprop <- melt(d[, c("EVTYPE", "CROPDMG_R", "PROPDMG_R")], id.vars=1)
gprop[,3] <- gprop[,3]/10^9
levels(gprop$variable)[levels(gprop$variable)=="CROPDMG_R"] <- "Crop"
levels(gprop$variable)[levels(gprop$variable)=="PROPDMG_R"] <- "Property"

# Then, plotting the graph:
ggplot(gprop, aes(x=reorder(EVTYPE, value, sum), y=value) ) +
    geom_bar(stat="identity", aes(fill=variable)) +
    coord_flip() +
    labs (x="Event Type", y="Damage cost (Billions USD)") +
    theme (legend.position = "top", plot.title=element_text(hjust=0.5)) +
    ggtitle ("Damage to property and crops") +
    scale_fill_manual (name = "Type of damage",
                        values=c("#00CCCC", "#FF9999"))


We conclude that, in terms of property and crop damage, the top 5 most harmful events are:
1. Hurricanes (Typhoons).
2. Floods.
3. Tornados.
4. Storm Surges and Tides.
5. Hail.

2.3 Conclusion.

We conclude that tornados and floods were the two most harmful event types in the US from 1950 to 2011, since they are the only ones present in the top 5 both by damage to public health and by damage to property and crops.