Dispersion of Disasters in the United States

A question of tremendous importance arises when justifying the allocation of resources for natural disasters: what should be prepared for, and where? In this analysis, we will look at the top contributors to disaster, both to the population and to the economy.

Loading Packages and Setting the Environment

First, it is wise to load all of the packages that will be used in this analysis.

suppressMessages(require(downloader))
suppressMessages(require(dplyr))
suppressMessages(require(zipcode))
suppressMessages(require(lubridate))
suppressMessages(require(ggplot2))
suppressMessages(require(maps))
suppressMessages(require(tidyr))
suppressMessages(require(scales))
set.seed(48884)

Data Processing

This analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The dataset is easily accessed from the web. It is downloaded here.

temp <- tempfile()
download("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
         dest = temp,
         mode = "wb")
data <- read.csv(bzfile(temp))
unlink(temp)
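
Since the compressed file is fairly large, it may help to cache the download when re-running the analysis. A minimal sketch, assuming an arbitrary local filename:

# Optional caching sketch: download only if the file is not already present.
# The local filename is arbitrary.
local_file <- "StormData.csv.bz2"
if (!file.exists(local_file)) {
     download("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
              dest = local_file, mode = "wb")
}
data <- read.csv(bzfile(local_file))
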
In this analysis, results for more recent years will be presented, so only records from the year 2000 onward will be used. In addition, only a subset of the columns is necessary; the rest can be dropped.
# Keep only events beginning in 2000 or later
data <- filter(data, year(mdy_hms(as.character(BGN_DATE))) >= 2000)
# Keep the date/time, state, event type, casualty, damage, and coordinate columns
data <- data[,c(2,3,7,8,23:28,32,33)]
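
For readability, the same selection can be written with column names rather than positions. A sketch, equivalent to the indices above:

# Equivalent selection by name (sketch); more robust than positional indices
keep <- c("BGN_DATE", "BGN_TIME", "STATE", "EVTYPE",
          "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP",
          "CROPDMG", "CROPDMGEXP", "LATITUDE", "LONGITUDE")
data <- data[, keep]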

Here is a brief view of the data set being used.

head(data)
##            BGN_DATE    BGN_TIME STATE    EVTYPE FATALITIES INJURIES
## 1 2/13/2000 0:00:00 06:09:00 PM    AL TSTM WIND          0        0
## 2 2/13/2000 0:00:00 06:15:00 PM    AL TSTM WIND          0        0
## 3 2/13/2000 0:00:00 06:24:00 PM    AL TSTM WIND          0        0
## 4 2/13/2000 0:00:00 06:35:00 PM    AL TSTM WIND          0        0
## 5 2/13/2000 0:00:00 06:35:00 PM    AL TSTM WIND          0        0
## 6 2/13/2000 0:00:00 06:48:00 PM    AL      HAIL          0        0
##   PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP LATITUDE LONGITUDE
## 1      20          K       0          K     3452      8732
## 2      20          K       0          K     3450      8717
## 3       2          K       0          K     3446      8742
## 4       5          K       0          K     3345      8807
## 5       2          K       0          K     3401      8746
## 6       5          K       0          K     3409      8724

It doesn’t appear as though there are any NA values to worry about, but it couldn’t hurt to check.

data_na <- data[!complete.cases(data),]
v1 <- nrow(data_na) #Automatically fills in value below
v2 <- round(nrow(data_na) / nrow(data) * 100, 2) #Automatically finds percentage

There are 47 records with NA values. That is only 0.01% of the data. These values can be ignored.

data <- data[complete.cases(data),]

The only events of interest in this analysis took place in the United States. All other places can be removed.

data <- filter(data, (STATE %in% state.abb))

Now, the dataset as it stands is a little messy. According to the documentation, if the value in column “PROPDMGEXP” is a “K”, then the corresponding value in “PROPDMG” is meant to be multiplied by 1,000. If the value is an “M”, then it should be multiplied by 1,000,000, and so on. Ditto for “CROPDMGEXP” and “CROPDMG”. Converting these codes to actual dollar amounts involves some simple conditional logic.

for (i in 1:nrow(data)){
  if (data$CROPDMGEXP[i] %in% c("K", "k")){
    data$CROPDMG[i] <- data$CROPDMG[i] * 1000
  } else if (data$CROPDMGEXP[i] %in% c("M", "m")){
    data$CROPDMG[i] <- data$CROPDMG[i] * 1000000
  } else if (data$CROPDMGEXP[i] %in% c("B", "b")){
    data$CROPDMG[i] <- data$CROPDMG[i] * 1000000000
  }
  
  if (data$PROPDMGEXP[i] %in% c("K", "k")){
    data$PROPDMG[i] <- data$PROPDMG[i] * 1000
  } else if (data$PROPDMGEXP[i] %in% c("M", "m")){
    data$PROPDMG[i] <- data$PROPDMG[i] * 1000000
  } else if (data$PROPDMGEXP[i] %in% c("B", "b")){
    data$PROPDMG[i] <- data$PROPDMG[i] * 1000000000
  }
}
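
For reference, the same conversion can be written without a row-by-row loop. A minimal vectorized sketch (to be used in place of the loop above, not in addition to it):

# Vectorized sketch: map each exponent code to a multiplier (1 if unrecognized)
exp_to_mult <- function(e) {
     ifelse(e %in% c("K", "k"), 1e3,
     ifelse(e %in% c("M", "m"), 1e6,
     ifelse(e %in% c("B", "b"), 1e9, 1)))
}
data$CROPDMG <- data$CROPDMG * exp_to_mult(data$CROPDMGEXP)
data$PROPDMG <- data$PROPDMG * exp_to_mult(data$PROPDMGEXP)
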

Now, a look at EVTYPE reveals that these values were not entered consistently. For instance, “THUNDERSTORM WIND” and “TSTM WIND” are the same, but count as two different entities. Here, values are changed to create such a consistency. I do not claim that all values were corrected; only as many as I could find were categorized.

data <- mutate(data, EVTYPE2 = "")
for (i in 1:nrow(data)){
     if(grepl("TSTM", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "THUNDERSTORM WIND"
     } else if (grepl("THUNDERSTORM WIND", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "THUNDERSTORM WIND"
     } else if (grepl("HAIL", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "HAIL"
     } else if (grepl("HEAT", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "HEAT"
     } else if (grepl("COLD", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "COLD"
     } else if (grepl("PRECIPITATION", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "PRECIPITATION CHANGE"
     } else if (grepl("FLOOD", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "FLOOD"
     } else if (grepl("GUST", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "GUST"
     } else if (grepl("UNSEASONABLY", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "UNSEASONABLE WEATHER"
     } else if (grepl("SNOW", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "SNOW"
     } else if (grepl("DRY", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "DRYNESS"
     } else if (grepl("RIP", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "RIP CURRENT"
     } else if (grepl("TORNADO", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "TORNADO"
     } else if (grepl("SLEET", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "SLEET"
     } else if (grepl("HURRICANE", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "HURRICANE"
     } else if (grepl("FIRE", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "FIRE"
     } else if (grepl("RAIN", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "RAIN"
     } else if (grepl("SEAS", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "SEA INCIDENT"
     } else if (grepl("WIND", data$EVTYPE[i])){
          data$EVTYPE2[i] <- "MISC. WIND"
     } else {
          data$EVTYPE2[i] <- as.character(data$EVTYPE[i])
     }
}

# Replace the original EVTYPE column with the cleaned EVTYPE2
data <- data[,c(1,2,3,13,5,6,7,8,9,10,11,12)]
names(data)[4] <- "EVTYPE"
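
For reference, a faster pattern-table sketch of the same relabeling, to be used in place of the row-by-row loop above. Patterns are applied from lowest to highest priority so that a higher-priority match overwrites an earlier one, reproducing the first-match-wins behavior of the if/else chain:

# Sketch: loop over patterns instead of rows. Listed from lowest to
# highest priority so later assignments overwrite earlier ones.
pat <- c("WIND"                   = "MISC. WIND",
         "SEAS"                   = "SEA INCIDENT",
         "RAIN"                   = "RAIN",
         "FIRE"                   = "FIRE",
         "HURRICANE"              = "HURRICANE",
         "SLEET"                  = "SLEET",
         "TORNADO"                = "TORNADO",
         "RIP"                    = "RIP CURRENT",
         "DRY"                    = "DRYNESS",
         "SNOW"                   = "SNOW",
         "UNSEASONABLY"           = "UNSEASONABLE WEATHER",
         "GUST"                   = "GUST",
         "FLOOD"                  = "FLOOD",
         "PRECIPITATION"          = "PRECIPITATION CHANGE",
         "COLD"                   = "COLD",
         "HEAT"                   = "HEAT",
         "HAIL"                   = "HAIL",
         "TSTM|THUNDERSTORM WIND" = "THUNDERSTORM WIND")
data$EVTYPE2 <- as.character(data$EVTYPE)
for (p in names(pat)) {
     data$EVTYPE2[grepl(p, data$EVTYPE)] <- pat[[p]]
}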

Results

Different types of events affect different areas of the United States. As such, it is necessary to prepare each region accordingly. There are, however, common denominators to consider in preparations. Thunderstorms, in particular their winds, are the greatest cause of injury and property damage throughout the United States. In addition, Floods and Tornadoes cause the most fatalities, while Hail and Floods cause the most crop damage.

While these events do occur the most often, they destroy at a low rate. The vast majority of the time, these events cause no damage and no injury. When they do cause injury and damage, it does not affect local communities on a large scale; rather, on average, one or two individuals are hurt, with a relatively low amount of crop and property damage. There are exceptions, of course. According to this data, it appears the best way to prepare for disasters is to maintain a large and evenly distributed number of hospitals in the United States (particularly in the eastern, central, and coastal U.S.). In addition, it may be wise to construct walls in key areas to prevent winds and floods from destroying crops and property.

Analysis

What Events are Most Harmful to People?

Naturally, different types of events will be more harmful to people in different areas of the United States. Longitude and latitude for each event are given, so it may be best to plot each event across a map of the United States. A further look at the LONGITUDE and LATITUDE columns proves helpful here: upon closer investigation of the data, it is easy to see that if the longitude and latitude are unknown (or not given), their values are recorded as zeros. How many of these values are there?
data_place_check <- filter(data,
                           as.numeric(LONGITUDE) == 0,
                           as.numeric(LATITUDE) == 0)
v3 <- nrow(data_place_check) #Automatically fills in value below
v4 <- round(v3 / nrow(data) * 100, 2) #Automatically fills in percentage below

There are 108822 records with unknown longitudes and latitudes. That is 21.54% of the data! These values need to be filled in somehow before continuing.

In addition to the longitudes and latitudes, the state for each disaster is given. For each disaster with missing coordinates, a location will be chosen in that disaster's home state. The package "zipcode" is useful here: it loads a dataset that contains major areas of interest for each of the fifty states with corresponding longitudes and latitudes.
data(zipcode)

# This loop will assign coordinates to every disaster with no coordinates.
for (i in 1:nrow(data_place_check)){
     # first, we filter zips to the state that matches the disaster's state
     foo <- filter(zipcode, state == data_place_check$STATE[i])
     
     # We pick a location at random from that subset
     r <- sample(1:nrow(foo), 1)
     
     # We reassign the coordinates. The multiplication
     # changes the scale to match that of the original dataset
     data_place_check$LONGITUDE[i] <- (-100 * foo$longitude[r])
     data_place_check$LATITUDE[i] <- (100 * foo$latitude[r])
}

# We keep only the records that already had their coordinates filled in
data_place_check.1 <-  filter(data,
                              !(as.numeric(LONGITUDE) == 0),
                              !(as.numeric(LATITUDE) == 0))

#We bind these two subsets together
data <- rbind(data_place_check, data_place_check.1)

# The resulting data frame is the same as the original, but with the
# zero longitudes and latitudes replaced.
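
For reference, the same imputation can be written without the explicit loop. A minimal sketch, drawing one random matching zipcode row per record (to be used instead of the loop above):

# Vectorized sketch: pick one random zipcode row per state-matched record
idx <- sapply(as.character(data_place_check$STATE), function(s) {
     rows <- which(zipcode$state == s)
     rows[sample(length(rows), 1)]
})
data_place_check$LONGITUDE <- -100 * zipcode$longitude[idx]
data_place_check$LATITUDE <- 100 * zipcode$latitude[idx]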

There are some anomalies within the dataset that need to be corrected.

# Just in case some were missed in the previous step, these values need
# to be removed. Naturally, this subset will be minuscule.
data <- filter(data, !(as.numeric(LONGITUDE) == 0),
                !(as.numeric(LATITUDE) == 0))

#Fit scale of points to scale of map. 
# The x coordinates need to be flipped across the y axis.
data$LONGITUDE <- (-1 * data$LONGITUDE) / 100
data$LATITUDE <- (data$LATITUDE) / 100

# There are a handful (<10) of instances where the datapoints were 
# entered incorrectly. The following four lines limit the longitude and 
# latitude to be close enough to the United States.
data <- filter(data, LONGITUDE < -20)
data <- filter(data, LONGITUDE > -200)
data <- filter(data, LATITUDE < 90)
data <- filter(data, LATITUDE > 10)
Now the dataset is ready to be plotted. Before doing so, it may be wise to investigate the EVTYPE column of the dataset, which contains the labels for the types of disasters. How many different types are there?
v5 <- length(unique(data$EVTYPE))
There are 89 different types of disasters in this dataset. Arguably, that is too many to plot. It is wiser instead to choose the top five biggest causes of injury and death and plot those.
# Organize disasters by # of injury
data1.1 <- group_by(data, EVTYPE) %>%
     summarize(sum = sum(as.numeric(INJURIES))) %>%
     arrange(desc(sum))

# Get the names of the top five events
injuries <- as.character(data1.1$EVTYPE[1:5])

# Get only the records from the original dataset that are in the top 
# five for injuries, and tag them for plotting.
data1 <- filter(data, EVTYPE %in% injuries) %>%
     mutate(type = "Injury", value = INJURIES)

#####################################################################

# Organize disasters by # of deaths
data1.2 <- group_by(data, EVTYPE) %>%
     summarize(sum = sum(as.numeric(FATALITIES))) %>%
     arrange(desc(sum))

# Get the names of the top five events
deaths <- as.character(data1.2$EVTYPE[1:5])

# Get only the records from the original dataset that are in the top 
# five for fatalities, and tag them for plotting.
data2 <- filter(data, EVTYPE %in% deaths) %>%
     mutate(type = "Fatalities", value = FATALITIES)

####################################################################

data3 <- rbind(data1, data2)

Now the data is truly ready to plot.

world <- map_data("world")

p1 <- ggplot()
p1 <- p1 + geom_polygon(data = world,
                        aes (x = long,
                             y = lat,
                             group = group),
                        col = "white")
p1 <- p1 + geom_point(data = data3,
                      aes(x = LONGITUDE,
                          y = LATITUDE,
                          col = EVTYPE,
                          alpha = value),
                      size = 0.7)
p1 <- p1 + scale_x_continuous(limits = c(min(data3$LONGITUDE) - 1,
                                         max(data3$LONGITUDE) + 15))
p1 <- p1 + scale_y_continuous(limits = c(min(data3$LATITUDE) - 5,
                                         max(data3$LATITUDE) + 15))
p1 <- p1 + guides(colour = guide_legend(override.aes = list(alpha = 1,
                                                            size = rel(3))))
p1 <- p1 + facet_grid(type ~ .)
p1 <- p1 + scale_color_manual(values = c("red", "orange", "yellow",
                                         "brown", "blue", "purple",
                                         "grey"))
p1 <- p1 + labs(x = "Longitude",
                y = "Latitude",
                col = "Type of Disaster",
                title = "Disasters in the United States\n2000-2011",
                alpha = "Number of Injuries or Deaths")
p1 <- p1 + theme(plot.title=element_text(size = rel(1),
                                         colour ="#777777",
                                         vjust=1),
                 axis.title.y = element_text(size = rel(1), 
                                             angle = 90, 
                                             colour="#777777",
                                             vjust=1.5),
                 axis.title.x = element_text(size = rel(1),
                                             angle = 0, 
                                             colour="#777777",
                                             vjust=-0.5),
                 axis.text = element_text(size=rel(0.5),
                                          colour="#32363f"),
                 panel.grid.major=element_line(colour="#aeb0b6",
                                               size=.5),
                 panel.grid.minor=element_blank(),
                 axis.ticks=element_blank(),
                 text=element_text(family="sans"))

The first thing immediately seen from this graphic is that what hurts the population most is certainly not the same as what kills the population most.

  1. Thunderstorm winds appear to injure the most people, over a fairly consistent area. A close second appears to be Fire on the west coast.
  2. Fatalities in the United States paint a different picture. Flooding is a major problem, with Tornadoes striking in the central U.S.

Looking more closely at preventing injuries, the stats for Thunderstorm Winds are:

summary(filter(data, EVTYPE == "THUNDERSTORM WIND") %>%
             select(INJURIES))
##     INJURIES       
##  Min.   : 0.00000  
##  1st Qu.: 0.00000  
##  Median : 0.00000  
##  Mean   : 0.01894  
##  3rd Qu.: 0.00000  
##  Max.   :70.00000

What does this say? Although Thunderstorm Winds cause widespread injury, they do so less than 25% of the time (note the third quartile of zero). In fact,

v6 <- round(nrow(filter(data, EVTYPE == "THUNDERSTORM WIND", INJURIES == 0)) / nrow(data) * 100,
            2)

Records of Thunderstorm Winds that caused no injury make up 32.72% of all the (recorded) data. When Thunderstorm Winds do cause injury, it is only minor.
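
Note that the denominator in the percentage above is the entire dataset, not just thunderstorm-wind records. For comparison, a sketch restricted to thunderstorm-wind events alone (v6b and tw are hypothetical names):

# Hypothetical v6b: share of injury-free thunderstorm-wind records,
# using only thunderstorm-wind events as the denominator
tw <- filter(data, EVTYPE == "THUNDERSTORM WIND")
v6b <- round(nrow(filter(tw, INJURIES == 0)) / nrow(tw) * 100, 2)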

summary(filter(data, EVTYPE == "THUNDERSTORM WIND", !(INJURIES == 0)) %>%
             select(INJURIES))
##     INJURIES     
##  Min.   : 1.000  
##  1st Qu.: 1.000  
##  Median : 1.000  
##  Mean   : 2.258  
##  3rd Qu.: 2.000  
##  Max.   :70.000

A broad spread of good hospitals is all that is needed to take care of injuries on this scale.

What about fatalities? The widest distribution of deaths across the United States appears to come from Floods. What can be said about them?

summary(filter(data, EVTYPE == "FLOOD") %>%
             select(FATALITIES))
##    FATALITIES      
##  Min.   : 0.00000  
##  1st Qu.: 0.00000  
##  Median : 0.00000  
##  Mean   : 0.01383  
##  3rd Qu.: 0.00000  
##  Max.   :20.00000
v7 <- round(nrow(filter(data, EVTYPE == "FLOOD", FATALITIES == 0)) / nrow(data) * 100,
            2)
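
As with the thunderstorm-wind figure, the share can also be taken among flood records alone; a sketch (v7b and fl are hypothetical names):

# Hypothetical v7b: share of fatality-free flood records,
# using only flood events as the denominator
fl <- filter(data, EVTYPE == "FLOOD")
v7b <- round(nrow(filter(fl, FATALITIES == 0)) / nrow(fl) * 100, 2)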

Again, rarely do disasters appear to do any real harm. Records of Floods that took no lives make up 11.75% of all the (recorded) data, so it looks as though it would be better to prepare for Floods than Thunderstorm Winds. But alas,

summary(filter(data, EVTYPE == "FLOOD", !(FATALITIES == 0)) %>%
             select(FATALITIES))
##    FATALITIES    
##  Min.   : 1.000  
##  1st Qu.: 1.000  
##  Median : 1.000  
##  Mean   : 1.424  
##  3rd Qu.: 1.000  
##  Max.   :20.000

Excluding the events that took no lives, fatal events kill at a slightly lower rate than injurious events injure.

In terms of human health, the best way to prepare for disasters is to maintain well-equipped hospitals throughout the country, particularly in the eastern, central, and coastal United States.

What Types of Events have the Greatest Economic Impact?

For this type of analysis, it seems obvious that different disasters will have more or less effect on the economy based on location. A graphic similar to the one above will be constructed. Because the original dataset was messy, it is worth investigating how many values were entered incorrectly and therefore were not processed correctly.
v8 <- nrow(filter(data, !(CROPDMGEXP %in% c("K", "k", "M", "m", "B", "b"))))
v9 <- nrow(filter(data, !(PROPDMGEXP %in% c("K", "k", "M", "m", "B", "b"))))
v10 <- nrow(filter(data, !(CROPDMGEXP %in% c("K", "k", "M", "m", "B", "b")),  
                   !(CROPDMG == 0)))
v11 <- nrow(filter(data, !(PROPDMGEXP %in% c("K", "k", "M", "m", "B", "b")),  
                   !(PROPDMG == 0)))
v12 <- round(nrow(filter(data, 
                         (!(CROPDMGEXP %in% c("K", "k", "M", "m", "B", "b")))
                         | (!(PROPDMGEXP %in% c("K", "k", "M", "m", "B", "b"))))) / nrow(data) * 100, 2) 
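
As a sanity check, a quick tabulation (a sketch) shows which exponent codes actually occur in the data:

# Tabulate the exponent codes to see what falls outside K/M/B
table(as.character(data$PROPDMGEXP))
table(as.character(data$CROPDMGEXP))
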
For crop and property damage, there are 241233 and 180247 records (respectively) that were incorrectly entered. But of those records, there are 0 and 0 (respectively) that have nonzero damage amounts. It is impossible to tell whether these are all errors, or whether incorrect formatting truly means zero. Logically, though, if these values are zero, then it is okay for them to be ignored; they would not be represented visually on the map. If these values are not zero, they represent 48.22% of the data and should indeed be placed. For the purposes of this report, because a 0 must have been entered for a 0 to appear, we will assume that these values are correct. This 48.22% will not be altered. They will appear on the map, but their influence will be minuscule.

A graphic similar to the previous one will now be constructed.

# Organize disasters by Property Damage
data1.3 <- group_by(data, EVTYPE) %>%
     summarize(sum = sum(as.numeric(PROPDMG))) %>%
     arrange(desc(sum))

# Get the names of the top five events by property damage
types1 <- as.character(data1.3$EVTYPE[1:5])

# Get only the records from the original dataset that are in the top 
# five for property damage.
data4 <- filter(data, EVTYPE %in% types1) %>%
     mutate(type = "Property", value = PROPDMG)

#####################################################################

# Organize disasters by Crop damage
data1.4 <- group_by(data, EVTYPE) %>%
     summarize(sum = sum(as.numeric(CROPDMG))) %>%
     arrange(desc(sum))

# Get the names of the top five events by crop damage
types2 <- as.character(data1.4$EVTYPE[1:5])

# Get only the records from the original dataset that are in the top 
# five for crop damage.
data5 <- filter(data, EVTYPE %in% types2) %>%
     mutate(type = "Crop", value = CROPDMG)

####################################################################

data6 <- rbind(data4, data5)

The data is ready to plot.

p2 <- ggplot()
p2 <- p2 + geom_polygon(data = world,
                        aes (x = long,
                             y = lat,
                             group = group),
                        col = "white")
p2 <- p2 + geom_point(data = data6,
                      aes(x = LONGITUDE,
                          y = LATITUDE,
                          col = EVTYPE,
                          alpha = value),
                      size = 0.7)
p2 <- p2 + scale_x_continuous(limits = c(min(data6$LONGITUDE) - 1,
                                         max(data6$LONGITUDE) + 15))
p2 <- p2 + scale_y_continuous(limits = c(min(data6$LATITUDE) - 5,
                                         max(data6$LATITUDE) + 15))
p2 <- p2 + guides(colour = guide_legend(override.aes = list(alpha = 1,
                                                            size = rel(3))))
p2 <- p2 + facet_grid(type ~ .)
p2 <- p2 + scale_color_manual(values = c("red", "orange", "yellow",
                                         "brown", "blue", "purple",
                                         "grey", "green", "white",
                                         "pink"))
p2 <- p2 + labs(x = "Longitude",
                y = "Latitude",
                col = "Type of Disaster",
                title = "Disasters in the United States\n2000-2011",
                alpha = "Damage")
p2 <- p2 + theme(plot.title=element_text(size = rel(1),
                                         colour ="#777777",
                                         vjust=1),
                 axis.title.y = element_text(size = rel(1), 
                                             angle = 90, 
                                             colour="#777777",
                                             vjust=1.5),
                 axis.title.x = element_text(size = rel(1),
                                             angle = 0, 
                                             colour="#777777",
                                             vjust=-0.5),
                 axis.text = element_text(size=rel(0.5),
                                          colour="#32363f"),
                 panel.grid.major=element_line(colour="#aeb0b6",
                                               size=.5),
                 panel.grid.minor=element_blank(),
                 axis.ticks=element_blank(),
                 text=element_text(family="sans"))

It appears that the events that damage crops are different from the events that damage property. While Thunderstorm Winds account for the bulk of property damage in the United States, Hail and Floods affect crops. The stats for these events are:

summary(filter(data, EVTYPE == "HAIL") %>%
             select(CROPDMG))
##     CROPDMG        
##  Min.   :       0  
##  1st Qu.:       0  
##  Median :       0  
##  Mean   :   10800  
##  3rd Qu.:       0  
##  Max.   :70000000
summary(filter(data, EVTYPE == "FLOOD") %>%
             select(CROPDMG))
##     CROPDMG         
##  Min.   :        0  
##  1st Qu.:        0  
##  Median :        0  
##  Mean   :    84489  
##  3rd Qu.:        0  
##  Max.   :500000000
summary(filter(data, EVTYPE == "THUNDERSTORM WIND") %>%
             select(PROPDMG))
##     PROPDMG         
##  Min.   :        0  
##  1st Qu.:        0  
##  Median :        0  
##  Mean   :    33221  
##  3rd Qu.:     5000  
##  Max.   :750000000

While the number of times that these winds are reported to do damage is still low, property damage has certainly upped the ante a bit. Now, given that there is damage, the summaries change to:

summary(filter(data, EVTYPE == "HAIL",
               !(CROPDMG == 0)) %>%
             select(CROPDMG))
##     CROPDMG        
##  Min.   :     500  
##  1st Qu.:    5000  
##  Median :   10000  
##  Mean   :  287363  
##  3rd Qu.:  100000  
##  Max.   :70000000
summary(filter(data, EVTYPE == "FLOOD",
               !(CROPDMG == 0)) %>%
             select(CROPDMG))
##     CROPDMG        
##  Min.   :9.00e+02  
##  1st Qu.:1.00e+04  
##  Median :5.00e+04  
##  Mean   :1.88e+06  
##  3rd Qu.:2.50e+05  
##  Max.   :5.00e+08
summary(filter(data, EVTYPE == "THUNDERSTORM WIND",
               !(PROPDMG == 0)) %>%
             select(PROPDMG))
##     PROPDMG         
##  Min.   :       10  
##  1st Qu.:     2000  
##  Median :     5000  
##  Mean   :    69307  
##  3rd Qu.:    15000  
##  Max.   :750000000

It can be seen that Thunderstorm Winds do a considerable amount of damage. It is of some concern that Thunderstorm Winds are leading causes of both injury and property damage. It is worth investigating when the most damage occurs for Thunderstorm Winds.

#This data frame finds the mean value of PROPDMG and INJURIES for each
#day reported
data6 <- filter(data, EVTYPE == "THUNDERSTORM WIND") %>%
     mutate(date = mdy_hms(as.character(BGN_DATE)) + hms(as.character(BGN_TIME))) %>%
     mutate(date = ymd(paste("2000-", #2000 is chosen to put all of the entries 
                                      #in the same time interval
                                 substr(as.character(date), 6, 10),
                                 sep = ""))) %>%
     group_by(date) %>%
     summarize(p_dollar_total = mean(PROPDMG),
               i_total = mean(INJURIES)) %>%
     mutate(delim = "mean")


#This data frame finds the median value of PROPDMG and INJURIES for each
#day reported
data7 <- filter(data, EVTYPE == "THUNDERSTORM WIND") %>%
     mutate(date = mdy_hms(as.character(BGN_DATE)) + hms(as.character(BGN_TIME))) %>%
     mutate(date = ymd(paste("2000-",
                                 substr(as.character(date), 6, 10),
                                 sep = ""))) %>%
     group_by(date) %>%
     summarize(p_dollar_total = median(PROPDMG),
               i_total = median(INJURIES)) %>%
     mutate(delim = "median")
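
The two pipelines above differ only in the statistic applied. A sketch of a helper that could replace them both (summarize_by_day is a hypothetical name):

# Sketch: factor the duplicated pipelines into one helper
summarize_by_day <- function(df, stat, label) {
     filter(df, EVTYPE == "THUNDERSTORM WIND") %>%
          mutate(date = mdy_hms(as.character(BGN_DATE)) + hms(as.character(BGN_TIME))) %>%
          mutate(date = ymd(paste0("2000-", substr(as.character(date), 6, 10)))) %>%
          group_by(date) %>%
          summarize(p_dollar_total = stat(PROPDMG),
                    i_total = stat(INJURIES)) %>%
          mutate(delim = label)
}
data6 <- summarize_by_day(data, mean, "mean")
data7 <- summarize_by_day(data, median, "median")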

#This data frame finds the mean number of reported cases for each
# interval reported
data8 <- filter(data, EVTYPE == "THUNDERSTORM WIND") %>%
     mutate(date = mdy_hms(as.character(BGN_DATE)) + hms(as.character(BGN_TIME))) %>%
     group_by(date = ymd(substr(as.character(date), 1, 10))) %>%
     summarize(total = n()) %>%
     mutate(date = ymd(paste("2000-",
                                 substr(as.character(date), 6, 10),
                                 sep = ""))) %>%
     group_by(date) %>%
     summarize(total = mean(total)) %>%
     mutate(option = "Frequency",
            delim = "mean")
data8 <- data8[,c(1,4,3,2)]
names(data8) <- c("Date", "delim", "option", "value")


#This data frame finds the median number of reported cases for each
# interval reported
data8.5 <- filter(data, EVTYPE == "THUNDERSTORM WIND") %>%
     mutate(date = mdy_hms(as.character(BGN_DATE)) + hms(as.character(BGN_TIME))) %>%
     group_by(date = ymd(substr(as.character(date), 1, 10))) %>%
     summarize(total = n()) %>%
     mutate(date = ymd(paste("2000-",
                                 substr(as.character(date), 6, 10),
                                 sep = ""))) %>%
     group_by(date) %>%
     summarize(total = median(total)) %>%
     mutate(option = "Frequency",
            delim = "median")
data8.5 <- data8.5[,c(1,4,3,2)]
names(data8.5) <- c("Date", "delim", "option", "value")

#Start putting the data frames together
data8 <- rbind(data8, data8.5)

#some formatting
data9 <- rbind(data6, data7)
names(data9) <- c("Date", "Avg. Prop Damage (in $)", "Avg Injuries (in People)", "delim")
data9 <- gather(data9, "option", "value", 2:3)

#last bind
data9 <- rbind(data8, data9)

#time to plot!
p3 <- ggplot(data = data9,
             aes(x = Date))
p3 <- p3 + geom_line(data = data9,
                     aes(y = value,
                         col = delim),
                     size = 1)
p3 <- p3 + facet_grid(option ~ .,
                      scales = "free")
p3 <- p3 + labs(x = "Month",
                y = "Value",
                col = "Operation",
                title = "Thunderstorm Winds Average\n2000-2011")
p3 <- p3 + scale_color_manual(values = c("red",
                                        "blue"))
p3 <- p3 + scale_x_datetime(labels = date_format("%B"))

What's interesting to note here is that the magnitude of average injuries has an inverse relationship with frequency. In other words, where there are more records of these winds occurring, fewer people are injured. In addition, there is no interval where people are guaranteed to be hurt even a third of the time. The average cost stays relatively low year round.
sessionInfo()
## R version 3.1.3 (2015-03-09)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scales_0.2.4    tidyr_0.2.0     maps_2.3-9      ggplot2_1.0.1  
## [5] lubridate_1.3.3 zipcode_1.0     dplyr_0.4.1     downloader_0.3 
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   colorspace_1.2-6 DBI_0.3.1        digest_0.6.8    
##  [5] evaluate_0.6     formatR_1.1      grid_3.1.3       gtable_0.1.2    
##  [9] htmltools_0.2.6  knitr_1.9        labeling_0.3     lazyeval_0.1.10 
## [13] magrittr_1.5     MASS_7.3-39      memoise_0.2.1    munsell_0.4.2   
## [17] parallel_3.1.3   plyr_1.8.1       proto_0.3-10     Rcpp_0.11.5     
## [21] reshape2_1.4.1   rmarkdown_0.5.1  stringr_0.6.2    tools_3.1.3