Economic and Health Impact of Storms in the United States

Synopsis

The National Oceanic and Atmospheric Administration’s (NOAA) Storm Events Database contains data on storms in the United States and its territories from 1951 to 2011. The data details the type of storm, location, duration, economic costs, and human costs. The purpose of this analysis is to analyze past events and present any trends or insights that would assist in preparing for natural disasters. The analysis includes details on data sources, data preparation, and data visualization. The visualizations reveal storm event types that are national in reach, while other are regional and have never affected other parts of the United States. While causation is discussed in the analysis, it is up to the reader to make their own judgments based on their knowledge of storms and their region.

Data Processing

We will be using the stringr and dplyr libraries to process the data into a format we can use. We will then use maps to visualize the data. We also disable scientific notation to ease the reading of tables.

library(stringr)
library(dplyr)
library(xtable)
library(maps)
options(scipen=999)

With the file in our working directory, we unzip and read in the data

data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"), stringsAsFactors = FALSE) #, nrows =100)

We will now separate the date from the time, which is 0:00:00 throughout the variable. Reviewing the structure of the data on the NOAA Website indicates that there are three distinct time periods where there will be variations in how the data was recorded. Looking at the time periods and the amount of data in each shows that the time period from 1996 - 2011 contains 70% of the records in the data-set. Given that the more recent data has more complete information, is more representative of current population and other demographics, and has fewer corrections for inflation affects, we will confine the analysis to the data from 1996 to 2011 and place it in a data.frame called recent.

data$BGN_DATE <- gsub(" 0:00:00", "", data$BGN_DATE)
data$year <- as.numeric(str_sub(data$BGN_DATE, start= -4))
recent <- subset(data, year >= 1996)

There are several variables that are not related to the analysis, so we will subset out the variables we are interested in into a data.frame called recent.clean.

recent.clean <- subset(recent, select = c(BGN_DATE, COUNTYNAME, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, LATITUDE, LONGITUDE, year))

With the data subsetted into the most recent era of data collection, we will convert the Property and Crop damage values from exponential to absolute values. In the raw data, the damage is broken up into 3 significant figures in one column, with another column containing K, M, B, or nothing to indicate Thousands, Millions, or Billions of dollars.

recent.clean$PROPDMGEXP <- gsub("", "1", recent.clean$PROPDMGEXP)
recent.clean$PROPDMGEXP <- gsub("101", "1", recent.clean$PROPDMGEXP)
recent.clean$PROPDMGEXP <- gsub("1B1", "1000000000", recent.clean$PROPDMGEXP)
recent.clean$PROPDMGEXP <- gsub("1M1", "1000000", recent.clean$PROPDMGEXP)
recent.clean$PROPDMGEXP <- gsub("1K1", "1000", recent.clean$PROPDMGEXP)
recent.clean$PROPDMGEXP <- as.numeric(recent.clean$PROPDMGEXP)

recent.clean$CROPDMGEXP <- gsub("", "1", recent.clean$CROPDMGEXP)
recent.clean$CROPDMGEXP <- gsub("1B1", "1000000000", recent.clean$CROPDMGEXP)
recent.clean$CROPDMGEXP <- gsub("1M1", "1000000", recent.clean$CROPDMGEXP)
recent.clean$CROPDMGEXP <- gsub("1K1", "1000", recent.clean$CROPDMGEXP)
recent.clean$CROPDMGEXP <- as.numeric(recent.clean$CROPDMGEXP)

Reviewing the location data reveals that Latitude and Longitude are not in the correct format. We multiply each by 0.01 and change the sign on the Longitude values since the United States is West of the Prime Meridian.

recent.clean$PROPDMG <- recent.clean$PROPDMG*recent.clean$PROPDMGEXP
recent.clean$CROPDMG <- recent.clean$CROPDMG*recent.clean$CROPDMGEXP
recent.clean$LATITUDE <- recent.clean$LATITUDE *0.01
recent.clean$LONGITUDE <- recent.clean$LONGITUDE *-0.01

We wish to confine the study to the 50 states, so we use the matching function %in% to subset out only the events that match up with the abbreviations for the 50 states.

states <- c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL",
            "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA",
            "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE",
            "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI",
            "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")

recent.clean <- subset(recent.clean, STATE %in% states)

Next we combine the counts for Fatalities and Injuries into one stat called Harm since it appears to be a variable that is (thankfully) sparse in the data-set.

recent.clean$harm <- recent.clean$FATALITIES + recent.clean$INJURIES

We then create a new variable with the full state names using the state.name and state.abb functions.

recent.clean$statename <- state.name[match(recent.clean$STATE, state.abb)]

While the NOAA website lists 48 Event Types, the data-set records over 400 unique events in the EVTYPE variable. Upon inspection we find alternative spelling, misspellings, lowercase vs. uppercase, etc. It would greatly expand the scope of the project to clean up every instance, and the author’s only rudimentary knowledge of text analytics does not contain an easy solution. The following text adjustments were made to the data and took the number of unique events down to only 348.

recent.clean$EVTYPE <- tolower(recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub(" ", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("[0-9]*", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("\\(g)", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("\\(.)", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("\\(", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("\\)", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("tstmwind", "thunderstormwind", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("\\/typhoon", "", recent.clean$EVTYPE)

Review of the updated EVTYPE variable shows that over 80% of the entries fall into the top 20 Events by count. Converting the events that have 1 entry into one of the 48 would have diminishing returns, hence we will work with the data set having the 348 unique event types.

Results

With the tidy data-set names recent.clean, we can begin to look at the impact of the weather events on property/crop damage and harm to the population. The first aspect to look at is the top 6 events by count, Total Property Damage, Total Crop Damage, and Number (Count) of People Harmed from 1996 - 2011. We will use the group_by() and summarise() functions from dplyr to perform the analysis

events <- group_by(recent.clean, EVTYPE)
events_sum <- summarise(events, Count =n(), Total_Prop = sum(PROPDMG),
                        Total_Crop = sum(CROPDMG), Health = sum(harm))

With the data summarized in the data.table events_sum, we can start looking a the top events. Note that all of the monetary values have not be adjusted for inflation. In an expansion of this report, it would be recommended to get all monetary values in the terms of a base year.

Top 6 Events by Count

events_sum <- arrange(events_sum, desc(Count))
topsix_count <- events_sum[1:6,1]
events_sum[1:6,]
Source: local data frame [6 x 5]

            EVTYPE  Count   Total_Prop Total_Crop Health
1 thunderstormwind 209947   7849636570  952241350   5372
2             hail 207685  14595136920 2476029450    720
3       flashflood  50012  14950618910 1320489700   2512
4            flood  23874 143818447550 4926778400   7163
5          tornado  23139  24616723710  283424010  22178
6         highwind  19878   5246651360  633287300   1315

The number of events is dominated by Thunder Storm Winds and Hail. The next closest category, at quarter of the numbers, is Flash Floods. Interestingly, the events with the most occurrences are nowhere near the top events in the Damage categories, except maybe for Hail, as we will examine next.

Top 6 Events by Total Property Damage ($)

events_sum <- arrange(events_sum, desc(Total_Prop))
topsix_prop <- events_sum[1:6,1]
events_sum[1:6,]
Source: local data frame [6 x 5]

      EVTYPE  Count   Total_Prop Total_Crop Health
1      flood  23874 143818447550 4926778400   7163
2  hurricane    211  78942098010 4793430800   1069
3 stormsurge    251  43193461000       5000     39
4    tornado  23139  24616723710  283424010  22178
5 flashflood  50012  14950618910 1320489700   2512
6       hail 207685  14595136920 2476029450    720

Looking at Property Damage, Floods have had the most economic impact on Property. After that, each successive ‘event’ is almost half of the previous. One might have thought that Hurricanes would be the top event, but this may be explained when we look at the geographical distribution of events later.

Top 6 Events by Total Crop Damage ($)

events_sum <- arrange(events_sum, desc(Total_Crop))
topsix_crop <- events_sum[1:6,1]
events_sum[1:6,]
Source: local data frame [6 x 5]

       EVTYPE  Count   Total_Prop  Total_Crop Health
1     drought   2408   1041101000 13367361000      4
2       flood  23874 143818447550  4926778400   7163
3   hurricane    211  78942098010  4793430800   1069
4        hail 207685  14595136920  2476029450    720
5  flashflood  50012  14950618910  1320489700   2512
6 extremecold    617     19760400  1308973000    194

Crop Damage has mostly occurred from Drought for 1996 - 2011, with Floods, Hurricanes, and Storm Surge rounding out the top 4. Drought damage to crops is 4 times as high as the next Event on the list. If you add up the three Events immediately after Drought, they still would not be equal to the damage drought has done.

Top 6 Events by People Harmed (Count)

events_sum <- arrange(events_sum, desc(Health))
topsix_health <- events_sum[1:6,1]
events_sum[1:6,]
Source: local data frame [6 x 5]

            EVTYPE  Count   Total_Prop Total_Crop Health
1          tornado  23139  24616723710  283424010  22178
2    excessiveheat   1638      7723700  492402000   7866
3            flood  23874 143818447550 4926778400   7163
4 thunderstormwind 209947   7849636570  952241350   5372
5        lightning  13152    742045580    6898440   4765
6       flashflood  50012  14950618910 1320489700   2512

Harmful events show that Tornadoes are by far and away the most dangerous of weather events in the United State. The next highest Event, Excessive Heat, is only 1 one third of the numbers. Floods, Thunder storm Winds, and Lightning round out the top 5. It would be an interesting additional study to look into the relationships between Tornadoes and Thunder Storm Winds to see if they are the same thing being reported differently.

Visualization

While lists of the top events are informative, a more useful visualization would be to plot the levels of the each event on a map of the United States to get a feel for the geographic distribution of Storm Events. A choropleth is a plot utilizing Latitude and Longitude as the X and Y variables. For the purposes of this analysis, we will not be plotting at the point scale since storms cover a large geographic area. We will be plotting the Top 6 Events on a state level. We will create 6 maps per plot, each one representing each of the Top 6 Events for Property and Crop Damage, along the a count of the Harm incurred during the Event.

The code for the creating the plot is as follows: We initially create some color functions in order to distinguish between the various levels of each category. All of the data is skewed to the left, but Log based axis may not make much sense. We will use a set of bins that will use a pseudo-log style, but keep the numbers in an easily translatable form. Also, given the distance that U.S. territories, Alaska, and Hawai’i are from the U.S. mainland, only the 48 states and the district of Columbia will be shown in the choropleth plots.

getColor <- function(x) {
  if (x > 1000000000) {
    col <- "#13373e"    
  } else if (x > 100000000) {
    col <- "#246571"
  } else if (x > 1000000) {
    col <- "#308898"
  } else {
    col <- "#7bc7d5"
  }
  return(col)
}

getColor.harm <- function(x) {   
  if (x > 600) {
    col <- "#13373e"    
  } else if (x > 400) {
    col <- "#246571"
  } else if (x > 200) {
    col <- "#308898"
  } else {
    col <- "#7bc7d5"
  }
}

With the color functions set, we can start to lay out the plots for 6 charts on each plot.

par(mfrow=c(4,2), mar = c(1,1,1,1))
topevents <- 1:6

The heart of the process is taking each event in the Top 6, summarizing the relevant category, assigning a color to it, and then plotting it. The map() function in R needs some help because some state are broken up into multiple regions, hence the section with mapnames, regionlist, ect. Once the maps are on the plot, we create a legend based on our pseudo-log assignment of colors.

par(mfrow=c(4,2), mar = c(1,1,1,1), oma = c(0,0,2,0))
topevents <- 1:6
for (evt in topevents) {
  # Select out the current Top Event, Sum up the category for each state
  prop.state <- recent.clean[recent.clean$EVTYPE == topsix_prop[evt],]
  prop.state.group <- group_by(prop.state, statename)
  prop.state.group_sum <- summarize(prop.state.group, total = sum(PROPDMG))
  
  # Match values to database region names
  mapnames <- map("state", plot=FALSE)$names
  regionlist <- strsplit(mapnames, ":")
  mapnames.fin <- sapply(regionlist, "[", 1)
  m <- match(mapnames.fin, tolower(prop.state.group_sum$statename))
  maprates <- prop.state.group_sum$total[m]
  maprates <- na.omit(maprates)
  
  # Assign colors based on Event category and then plot eac state to that color
  statecols <- sapply(maprates, FUN=getColor)
  map("state", regions=prop.state.group_sum$statename[m], proj = "albers",
      param = c(39,45), fill=TRUE, col=statecols, border=NA, resolution=0)
  text(mapproject(-90,26), topsix_prop[evt], cex = 1)
}
mtext("Lattice Plot of Top 6 Events: Property Damage by State (1996-2011)", outer=TRUE)
# Make-shift legend
plot(0, 0, type="n", axes=FALSE, xlim=c(0,30), ylim=c(0,2))
rect(c(5,10,15,20), c(1,1,1,1), c(10,15,20,25), c(1.25,1.25,1.25,1.25),
     col=sapply(c(0,1000001,100000001,1000000001), getColor), border="white", lwd=0.4)
text(15, 1.35, "Property Damage ($)")
text(c(5,10,15,20), c(0.9,0.9,0.9), c("$0", "$1M","$100M","$1B"), cex=0.8)  # Tick labels

plot of chunk unnamed-chunk-18

The choropleths reveal some interesting aspects of the data. Floods, which are the top property damage values are present in almost every state and may be the largest total damage from reports in California, Florida, and the Northeast corridor. These areas have high property values and would result in a high summation. Hurricanes on the other hand, while very destructive, are concentrated on the southern coasts. Note that Nevada shows up in the Hurricane plot. The bin goes from $0 to $1 Million, but while most state reported no Hurricane costs, Nevada reports $0, so the state gets assigned a color. Other interesting observations are that, as expected, Storm Surges are confined to the coasts (except for Montana), Tornadoes are predominately in the area known as ‘Tornado Alley’, and Flash Floods and Hail have been encountered in all states.

par(mfrow=c(4,2), mar = c(1,1,1,1), oma = c(0,0,2,0))
topevents <- 1:6

for (evt in topevents) {
  crop.state <- recent.clean[recent.clean$EVTYPE == topsix_crop[evt],]
  crop.state.group <- group_by(crop.state, statename)
  crop.state.group_sum <- summarize(crop.state.group, total = sum(CROPDMG))
  
  # Match values to database region names
  mapnames <- map("state", plot=FALSE)$names
  regionlist <- strsplit(mapnames, ":")
  mapnames.fin <- sapply(regionlist, "[", 1)
  m <- match(mapnames.fin, tolower(crop.state.group_sum$statename))
  maprates <- crop.state.group_sum$total[m]
  maprates <- na.omit(maprates)
  statecols <- sapply(maprates, FUN=getColor)
  map("state", regions=crop.state.group_sum$statename[m], proj = "albers",
      param = c(39,45), fill=TRUE, col=statecols, border=NA, resolution=0)
  text(mapproject(-90,26), topsix_crop[evt], cex = 1)
}
mtext("Lattice Plot of Top 6 Events: Crop Damage by State (1996-2011)", outer=TRUE)
# Make-shift legend
plot(0, 0, type="n", axes=FALSE, xlim=c(0,30), ylim=c(0,2))
rect(c(5,10,15,20), c(1,1,1,1), c(10,15,20,25), c(1.25,1.25,1.25,1.25),
     col=sapply(c(0,1000001,100000001,1000000001), getColor), border="white", lwd=0.4)
text(15, 1.35, "Crop Damage ($)")
text(c(5,10,15,20), c(0.9,0.9,0.9), c("$0", "$1M","$100M","$1B"), cex=0.8)  # Tick labels

plot of chunk unnamed-chunk-19

The next plot of choropleths show the distribution of the Top 6 Events for crop damage. Crop damage is going to be influenced by the effect that most crops in the United States are grown in the South, Midwest, and California. Drought leads the way with concentrations in Texas and Oklahoma. The rest of the Midwest and states south of the Great Lakes also show effects from Drought. Floods occur in all states, with a Concentration in California, where Drought did not have such a large effect. Again Hurricanes show up on the southeast coastline. Hail and Flashfloods show up in the Midwest and South. Extreme Cold is interesting because it shows up in some states that usually do not get as cold as the rest of the country. Because of this, we can deduce more cold sensitive crops are grown in Texas, Louisiana, Florida, and California.

par(mfrow=c(4,2), mar = c(1,1,1,1), oma = c(0,0,2,0))
topevents <- 1:6

for (evt in topevents) {
  health.state <- recent.clean[recent.clean$EVTYPE == topsix_health[evt],]
  health.state.group <- group_by(health.state, statename)
  health.state.group_sum <- summarize(health.state.group, total = sum(harm))
  
  # Match values to database region names
  mapnames <- map("state", plot=FALSE)$names
  regionlist <- strsplit(mapnames, ":")
  mapnames.fin <- sapply(regionlist, "[", 1)
  m <- match(mapnames.fin, tolower(health.state.group_sum$statename))
  maprates <- health.state.group_sum$total[m]
  maprates <- na.omit(maprates)
  statecols <- sapply(maprates, FUN=getColor.harm)
  map("state", regions=health.state.group_sum$statename[m], proj = "albers",
      param = c(39,45), fill=TRUE, col=statecols, border=NA, resolution=0)
  text(mapproject(-90,26), topsix_health[evt], cex = 1)
}
mtext("Lattice Plot of Top 6 Events: Harmful Events by State (1996-2011)", outer = TRUE)
# Make-shift legend
plot(0, 0, type="n", axes=FALSE, xlim=c(0,30), ylim=c(0,2))
rect(c(5,10,15,20), c(1,1,1,1), c(10,15,20,25), c(1.25,1.25,1.25,1.25),
     col=sapply(c(0,201,401,601), getColor.harm), border="white", lwd=0.4)
text(15, 1.35, "Harm (Count)")
text(c(5,10,15,20), c(0.9,0.9,0.9), c("0", "200","400","600"), cex=0.8)  # Tick labels

plot of chunk unnamed-chunk-20

The final plot shows the choropleths associated with Harm, injuries and fatalities, from Storm Events. Tornado deaths and injuries are concentrated in the ‘Tornado Alley’ and are practically non-existent in the West. Excessive Heat shows up in Northern Midwest and Northeast states that normally do not get as hot, along with California and Texas. Floods seem unusually concentrated in Texas, but with only a count of 7163 in 15 years, one or two events could skew Texas to the top. Thunder Storm Winds follow a path from Texas to New York, which could be a result of Jet Stream effects. Lightning is concentrated in Florida, with some additional cases in Texas, Georgia, and North Carolina. Flash Floods are similar to Floods and concentrated in Texas, with some in Arizona.