The exploration of the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database was conducted in less than a week, and the most striking lesson was how much of that time was spent tidying the data. A thorough analysis would require at least another week of data cleansing.
However, this is an academic exercise, and despite the limited time and data quality, some interesting facts emerged. The frequency and severity of relevant weather events clearly differ across the United States and between regions, although data quality and time limited how far that regional analysis could be taken. If this study were to continue, it would be very helpful to develop a data dictionary defining each data element and to standardize the reference values (especially the Event column). Data issues aside, the sections below describe some interesting information on the impact of weather events across the nation.
This data certainly reinforces the idea that Mother Nature is extremely powerful. Probably the most surprising finding was how many lives have been lost to heat-related weather incidents. Heat seems to be the easiest of these phenomena to counter: making fans, portable air-conditioning units, and public shelters available is a low-cost way to prevent loss of life.
That said, severe weather events cause a significant amount of damage across the United States, resulting in the loss of life and property. The more we know about the frequency and likelihood of these severe events, the better prepared we can be to protect both people and property. Observations begin in 1950 and end in November 2011; as you might expect, the completeness of the data improves for more recent events.
The first step in the process is to load the data and study it for problems such as missing data elements and invalid values (for example, times that are out of range, or alphabetic characters in numeric fields). One thing I immediately noticed is that there are no field definitions, as one would normally expect. Also, the Event field is not standardized, so events like Flood and Flash Flood are counted separately, as are Heat and Extreme Heat. Due to time, this latter problem was not fully addressed, but a standardized category for grouping event types would be extremely helpful.
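To illustrate the kind of grouping that would help, the sketch below collapses a few related EVTYPE spellings into broader buckets. The patterns and group names here are illustrative assumptions, not official NOAA categories, and this mapping is not applied in the analysis that follows.
# Illustrative only: collapse related EVTYPE spellings into broader groups.
# The regular expressions and group names are assumptions, not NOAA categories.
normalizeEvent <- function(ev) {
    ev <- toupper(ev)
    ev[grepl("FLOOD", ev)] <- "FLOOD"
    ev[grepl("HEAT", ev)] <- "HEAT"
    ev[grepl("TSTM|THUNDERSTORM", ev)] <- "THUNDERSTORM WIND"
    ev
}
normalizeEvent(c("FLASH FLOOD", "EXTREME HEAT", "TSTM WIND"))
# -> "FLOOD" "HEAT" "THUNDERSTORM WIND"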
library(knitr)
##
## Attaching package: 'knitr'
##
## The following object is masked _by_ '.GlobalEnv':
##
## opts_chunk
opts_chunk$set(echo = TRUE, results = "show")
opts_chunk$fig.height=6
opts_chunk$fig.width=8
library(dplyr)
NOAA <- tbl_df(read.csv(bzfile("repdata-data-StormData.csv.bz2"),
                        na.strings = "NA", stringsAsFactors = FALSE))
To answer the two questions, on storm events causing loss of life and/or injuries and on economic impact, this study reviews the following fields from the data:
- BGN_DATE: Renamed Date
- BGN_TIME: Renamed Time
- EVTYPE: Renamed Event
- FATALITIES: Renamed Fatalities
- INJURIES: Renamed Injuries
- PROPDMG: Renamed PropDmg
- PROPDMGEXP: Renamed PUnits
- CROPDMG: Renamed CropDmg
- CROPDMGEXP: Renamed CUnits
- Combined Date and Time to create a dateTime field
- LATITUDE: Combined with LATITUDE_E and renamed lat
- LATITUDE_E: Combined with LATITUDE and renamed lat
- LONGITUDE: Combined with LONGITUDE_ and renamed long
- LONGITUDE_: Combined with LONGITUDE and renamed long
This data set is extremely messy, and getting it to the point of being truly accurate would require more time than is available in this course.
As a result, some assumptions were made:
- The first fields requiring attention were LONGITUDE and LATITUDE, since the maps in the sections below rely on them. In many cases the LONG/LAT data was not provided; in those cases a center of mass for the state was computed using the "maps" package and placed in the LONG/LAT fields. This is not the most precise method, but the packages for calculating an accurate center of mass were not available for this version of R (3.1.3), so the median of the state boundary coordinates was used as a close approximation.
- The date and time were provided as two separate character fields and needed to be converted.
- The units of measure for damages were inconsistently applied, and there was limited documentation available to define the data elements. As a result, the following multipliers were used (in both upper and lower case):
- K = 1,000
- M = 1,000,000
- B = 1,000,000,000
library(dplyr)
library(lubridate)
# There are 49 records with a missing latitude or an inconsistent longitude.
# They are all in GU and AS -- both US territories.
# LONGITUDE should be negative (west), but it is left to the user to make it
# negative. Two records are the only ones already holding negative numbers,
# and their lat/long coordinates fall in the middle of the Pacific, not near
# Guam, which is where those observations are from.
FIELDS <- select(NOAA, REFNUM, STATE , BGN_DATE, BGN_TIME, EVTYPE, FATALITIES,
INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP,
LATITUDE, LATITUDE_E, LONGITUDE, LONGITUDE_)%>%
filter(!is.na(LATITUDE) & LONGITUDE >= 0 )
names(FIELDS) <- c("REFNUM", "ST", "Date", "Time", "Event", "Fatalities",
"Injuries", "PropDmg", "PUnits", "CropDmg", "CUnits",
"lat", "lat1", "long", "long1")
FIELDS$Date <- mdy_hms(FIELDS$Date)
# Pad the extension fields, append them to the base values, and rescale the
# result into decimal degrees (longitude is made negative for the western
# hemisphere).
FIELDS <- mutate(FIELDS, lat1 = ifelse(lat1 == 0, "0000", as.character(lat1)))
FIELDS$temp <- paste(as.character(FIELDS$lat), as.character(FIELDS$lat1), sep = "")
FIELDS$lat <- as.numeric(sprintf("%.6f", as.numeric(FIELDS$temp) * 10^-6))
FIELDS <- mutate(FIELDS, long1 = ifelse(long1 == 0, "0000", as.character(long1)))
FIELDS$temp <- paste(as.character(FIELDS$long), as.character(FIELDS$long1), sep = "")
FIELDS$long <- as.numeric(sprintf("%.6f", as.numeric(FIELDS$temp) * -10^-6))
FIELDS <- select(FIELDS, -temp, -lat1, -long1)
sum(is.na(FIELDS))
## [1] 0
The 49 records filtered above have a missing latitude or an invalid longitude. All of these observations lie outside the continental US, in one of the US territories.
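The filter above can be checked directly; the sketch below recovers the excluded records (the counts come from the comments in the code above).
# Sketch: inspect the records excluded by the latitude/longitude filter.
dropped <- filter(NOAA, is.na(LATITUDE) | is.na(LONGITUDE) | LONGITUDE < 0)
nrow(dropped)           # 49, per the comments above
unique(dropped$STATE)   # "GU" and "AS", both US territories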
In combining the Date and Time fields to create the dateTime field, it was discovered that the Time field uses inconsistent formats.
There are 248,767 records in a 24-hour, 4-digit character format, while 653,536 are in a 12-hour format designating AM and PM; some of these (75,757 records) carry 24-hour values yet still include a PM designation. Additionally, there are six records in the 24-hour format with invalid values, such as a digit greater than 5 in the third position or the letter "O" used instead of the number zero (0). The invalid times are replaced below with a time of "0000".
head(FIELDS$Time)
## [1] "0130" "0145" "1600" "0900" "1500" "2000"
tail(FIELDS$Time)
## [1] "03:00:00 PM" "10:30:00 PM" "02:48:00 PM" "02:58:00 PM" "10:21:00 AM"
## [6] "08:00:00 PM"
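The malformed 24-hour values can also be located before parsing; the check below is a sketch of one way to do so, written against the descriptions above.
# Sketch: flag 4-digit times whose minutes digit exceeds 5, and 4-character
# times containing a non-digit (such as the letter "O" in place of zero).
four <- grepl("^[0-9]{4}$", FIELDS$Time)
badMinutes <- four & substr(FIELDS$Time, 3, 3) > "5"
nonDigit <- nchar(FIELDS$Time) == 4 & !four
FIELDS$Time[badMinutes | nonDigit]   # the invalid values described above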
The next step in tidying the data is to convert the date and time to a POSIXct field.
library(dplyr)
library(lubridate)
# Combined Date and Time in 24 hour format into dateTime Field
FIELDS$dateTime <- paste(FIELDS$Date, FIELDS$Time, sep = " ")
FIELDS$dateTime <- parse_date_time(FIELDS$dateTime,
c("%y%m%d %I%M%S %p", "%y%m%d %H%M", "%y%m%d %H%M%S"))
## Warning: 5 failed to parse.
Here the 5 malformed 24-hour Time values discovered above are replaced with "0000":
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
FIELDS <- mutate(FIELDS, Time = ifelse(!is.na(dateTime), Time, "0000"))
FIELDS$dateTime <- paste(FIELDS$Date, FIELDS$Time, sep = " ")
FIELDS$dateTime <- parse_date_time(FIELDS$dateTime,
c("%y%m%d %I%M%S %p", "%y%m%d %H%M", "%y%m%d %H%M%S"))
All missing or invalid values are replaced, and the dateTime field is in the desired data type.
sum(is.na(FIELDS))
## [1] 0
class(FIELDS$dateTime)
## [1] "POSIXct" "POSIXt"
The last task to perform in tidying the data is to convert damages to dollars.
As mentioned above, the data is not in good shape, and some assumptions have been made for the sake of time. The following units will be used to convert the property and crop damages into dollars:
- K = 1,000
- M = 1,000,000
- B = 1,000,000,000
If the fields PROPDMGEXP (renamed PUnits) and CROPDMGEXP (renamed CUnits) do not contain one of the above values in either upper or lower case, the observation will be ignored. Without a data dictionary, it is impractical to know what the various other values represent.
library(dplyr)
unique(FIELDS$PUnits)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(FIELDS$CUnits)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
# Recognized unit codes (upper and lower case), per the assumptions above.
Units <- c("K", "k", "M", "m", "B")
# Scale each damage figure into dollars according to its unit code. A figure
# whose own unit code is unrecognized is left unscaled; such raw values are
# negligible next to the scaled figures.
Damages <- select(FIELDS, ST, Event, PropDmg, PUnits,
                  CropDmg, CUnits, Fatalities, Injuries, long, lat, dateTime)%>%
    filter(PUnits %in% Units | CUnits %in% Units)%>%
    mutate(PropDmg = as.numeric(ifelse(PUnits %in% c("K", "k"),
                                       PropDmg * 1000, PropDmg)))%>%
    mutate(PropDmg = as.numeric(ifelse(PUnits %in% c("M", "m"),
                                       PropDmg * 1000000, PropDmg)))%>%
    mutate(PropDmg = as.numeric(ifelse(PUnits %in% c("B", "b"),
                                       PropDmg * 1000000000, PropDmg)))%>%
    mutate(CropDmg = as.numeric(ifelse(CUnits %in% c("K", "k"),
                                       CropDmg * 1000, CropDmg)))%>%
    mutate(CropDmg = as.numeric(ifelse(CUnits %in% c("M", "m"),
                                       CropDmg * 1000000, CropDmg)))%>%
    mutate(CropDmg = as.numeric(ifelse(CUnits %in% c("B", "b"),
                                       CropDmg * 1000000000, CropDmg)))
sum(is.na(Damages))
## [1] 0
Now the data is ready to work with.
1. Across the United States, which types of events (as indicated in the Event variable) are most harmful with respect to population health?
Tornados are among the top killers in the nation but, as the following figure shows, they are not the top killer everywhere. The map below shows that tornados tend to occur in a fairly broad band running from the Southwest up through the Midwest and into the Northeast of the country.
To be counted in this category, an observation must list either a fatality or an injury from the event.
library(dplyr)
library(ggplot2)
library(maps)
all_states <- map_data("state") # map_data {ggplot2}
# Developing a center-of-mass LAT/LONG for each state, to fill in events where
# the state is provided but LAT/LONG is not. (All51, the ST/state-name lookup
# table built in a later chunk, must be defined before this join runs.)
States_COM <- select(all_states, lat, long, region)%>%
    group_by(region)%>%
    summarize(median(lat), median(long))
names(States_COM) <- c("State", "latCOM", "longCOM")
States_COM <- inner_join(States_COM, All51, by = "State")%>%
    select(State, ST, latCOM, longCOM)
# The records in FIELDS that do not have a matching ST code are not joined in.
# These are all observations outside of the continental US, including Alaska
# and Hawaii.
# Next, get the observations with the highest fatalities per state, join that
# list to the center-of-mass long/lat for the state, and show the event and
# fatality count on the map. Different events should appear in different
# regions.
TornadoFatality<- select(FIELDS, ST, Event, long, lat, Fatalities)%>%
filter(Event == "TORNADO" & Fatalities != 0 )%>%
group_by(ST, Event)
#tally(PropDmg)%>%
#top_n(1)
TornadoInjury<- select(FIELDS, ST, Event, long, lat, Injuries)%>%
filter(Event == "TORNADO" & Injuries != 0 )%>%
group_by(ST, Event)
#tally(CropDmg)%>%
#top_n(1)
# Filter LAT/LONG values that fall outside of the map boundaries
TornadoFatality <- filter(TornadoFatality, lat > 18.00, lat < 48.98,
                          long > -126.62, long < -62.36,
                          lat != 0 | long != 0)
# Filter LAT/LONG values that fall outside of the map boundaries
TornadoInjury <- filter(TornadoInjury, lat > 18.00, lat < 48.98,
                        long > -126.62, long < -62.36,
                        lat != 0 | long != 0)
Fatal <- c("Fatalities")
Injury <- c("Injuries")
p <- ggplot()
p <- p + geom_polygon( data=all_states, aes(x=long, y=lat,
group = group),colour="black", fill="steelblue4" ) +
labs(title="Tornado Event Fatalities and Injuries") +
xlab("Longitude") + ylab("Latitude")
p <- p + geom_jitter( data=TornadoInjury,
position=position_jitter(width=0.5, height=0.5),
aes(x=long, y=lat, color=Injury)) +
geom_jitter( data=TornadoFatality,
position=position_jitter(width=0.5, height=0.5),
aes(x=long, y=lat, size = Fatalities,color=Fatal)) +
scale_size(name="Number Fat/Inj")
p
The table below depicts the top event in fatalities and injuries per state across the continental United States. CO and RI each had two events tied for the top fatality total, so there are 53 records.
In general, the top events causing fatalities vary by region and include:
- Avalanche
- Heat
- Flood
- High Surf
- High Wind
- Lightning
- Rip Current
- Tornado
- Thunderstorm Wind
library(dplyr)
All51 <- data.frame(ST = c("CT", "ME", "MA", "NH", "RI", "VT","NJ", "NY",
"PA","IL", "IN", "MI", "OH", "WI","IA", "KS", "MN", "MO", "NE",
"ND", "SD","DE", "FL", "GA", "MD", "NC", "SC", "VA", "DC", "WV",
"AL","KY", "MS", "TN","AR", "LA", "OK", "TX","AZ", "CO", "ID",
"MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR","WA"),
State = c("connecticut", "maine", "massachusetts", "new hampshire",
"rhode island", "vermont", "new jersey", "new york", "pennsylvania",
"illinois", "indiana", "michigan", "ohio", "wisconsin", "iowa",
"kansas", "minnesota", "missouri", "nebraska", "north dakota",
"south dakota", "delaware", "florida", "georgia", "maryland",
"north carolina", "south carolina", "virginia", "district of columbia",
"west virginia", "alabama", "kentucky", "mississippi", "tennessee",
"arkansas", "louisiana", "oklahoma", "texas", "arizona", "colorado",
"idaho", "montana", "nevada", "new mexico", "utah", "wyoming",
"alaska", "california", "hawaii", "oregon", "washington"),
stringsAsFactors = FALSE)
WeatherFatality <- select(FIELDS, ST, Event, Fatalities, Injuries)
WeatherFatality <- left_join(All51, WeatherFatality, by = "ST")
WeatherFatalityF<- group_by(WeatherFatality, ST, Event)%>%
filter(Fatalities != 0)%>%
tally(Fatalities)%>%
top_n(1)
## Selecting by n
WeatherFatalityI<- group_by(WeatherFatality, ST, Event)%>%
filter( Injuries != 0)%>%
tally(Injuries)%>%
top_n(1)
## Selecting by n
WeatherFatalityR <- inner_join(WeatherFatalityF, WeatherFatalityI, by = c("ST"))
names(WeatherFatalityR) <- c("ST", "Event", "Fatalities",
"Event", "Injuries")
WeatherFatalityR <- WeatherFatalityR[order(WeatherFatalityR$ST),]
print(tbl_df(WeatherFatalityR), n=52)
## Source: local data frame [53 x 5]
##
## ST Event Fatalities Event.1 Injuries
## 1 AK AVALANCHE 33 ICE STORM 34
## 2 AL TORNADO 617 TORNADO 7929
## 3 AR TORNADO 379 TORNADO 5116
## 4 AZ FLASH FLOOD 62 DUST STORM 179
## 5 CA EXCESSIVE HEAT 110 WILDFIRE 623
## 6 CO AVALANCHE 48 TORNADO 261
## 7 CO LIGHTNING 48 TORNADO 261
## 8 CT HIGH WIND 6 TORNADO 703
## 9 DC EXCESSIVE HEAT 20 EXCESSIVE HEAT 316
## 10 DE EXCESSIVE HEAT 7 TORNADO 73
## 11 FL RIP CURRENT 172 TORNADO 3340
## 12 GA TORNADO 180 TORNADO 3926
## 13 HI HIGH SURF 21 STRONG WIND 20
## 14 IA TORNADO 81 TORNADO 2208
## 15 ID AVALANCHE 16 THUNDERSTORM WIND 74
## 16 IL HEAT 653 TORNADO 4145
## 17 IN TORNADO 252 TORNADO 4224
## 18 KS TORNADO 236 TORNADO 2721
## 19 KY TORNADO 125 TORNADO 2806
## 20 LA TORNADO 153 TORNADO 2637
## 21 MA TORNADO 108 TORNADO 1758
## 22 MD EXCESSIVE HEAT 88 EXCESSIVE HEAT 461
## 23 ME LIGHTNING 6 LIGHTNING 70
## 24 MI TORNADO 243 TORNADO 3362
## 25 MN TORNADO 99 TORNADO 1976
## 26 MO TORNADO 388 TORNADO 4330
## 27 MS TORNADO 450 TORNADO 6244
## 28 MT LIGHTNING 9 WILD/FOREST FIRE 33
## 29 NC TORNADO 126 TORNADO 2536
## 30 ND TORNADO 25 TORNADO 326
## 31 NE TORNADO 54 TORNADO 1158
## 32 NH TSTM WIND 6 LIGHTNING 85
## 33 NJ EXCESSIVE HEAT 39 EXCESSIVE HEAT 300
## 34 NM FLASH FLOOD 16 TORNADO 155
## 35 NV HEAT 54 FLOOD 50
## 36 NY EXCESSIVE HEAT 93 TORNADO 315
## 37 OH TORNADO 191 TORNADO 4438
## 38 OK TORNADO 296 TORNADO 4829
## 39 OR HIGH WIND 19 HIGH WIND 50
## 40 PA EXCESSIVE HEAT 359 TORNADO 1241
## 41 RI HEAT 2 TORNADO 23
## 42 RI HIGH SURF 2 TORNADO 23
## 43 SC TORNADO 59 TORNADO 1314
## 44 SD TORNADO 18 TORNADO 452
## 45 TN TORNADO 368 TORNADO 4748
## 46 TX TORNADO 538 TORNADO 8207
## 47 UT AVALANCHE 44 WINTER STORM 415
## 48 VA TORNADO 36 TORNADO 914
## 49 VT FLOOD 4 TSTM WIND 24
## 50 WA AVALANCHE 35 TORNADO 303
## 51 WI TORNADO 96 TORNADO 1601
## 52 WV FLASH FLOOD 24 TSTM WIND 142
## .. .. ... ... ... ...
Fatalities and injuries appear to be closely associated with the seasons, as the barplot below shows:
library(ggplot2)
library(lubridate)
x <- select (FIELDS, dateTime, Fatalities) %>%
filter(!is.na(Fatalities) & Fatalities != 0) %>%
group_by(month(dateTime, label=TRUE, abbr=TRUE)) %>%
summarize(mean(Fatalities))
names(x) <- c("Month", "Fatalities")
g <- ggplot(x, mapping = aes(x = Month, y = round(Fatalities, digits = 2))) +
geom_bar(stat= "identity", fill = "red")
g <- g + labs(title="Weather Related Fatalities by Month (mean)
from all Events Across United States") +
xlab("Month") + ylab("Fatalities (mean by month)") +
theme( axis.line = element_line(colour = "black"),
axis.ticks = element_line( color = "blue", size = 2),
axis.text = element_text( angle = 90),
panel.background = element_rect(fill = "whitesmoke"))
print(g)
2. Across the United States, which types of events have the greatest economic consequences?
The following map shows the property and crop damages across the United States due to tornados alone:
library(dplyr)
library(ggplot2)
library(maps)
all_states <- map_data("state") # map_data {ggplot2}
# The center of mass is not exact, but it is the best available on R 3.1.3
# with limited time.
# Developing a center-of-mass LAT/LONG for each state, to fill in events where
# the state is provided but LAT/LONG is not. (All51 is defined above.)
States_COM <- select(all_states, lat, long, region)%>%
    group_by(region)%>%
    summarize(median(lat), median(long))
names(States_COM) <- c("State", "latCOM", "longCOM")
States_COM <- inner_join(States_COM, All51, by = "State")%>%
    select(State, ST, latCOM, longCOM)
# The records in Damages that do not have a matching ST code are not joined
# in. These are all observations outside of the continental US, including
# Alaska and Hawaii.
# Next, get the observations with the highest property and crop damages per
# state caused by tornados, join that list to the center-of-mass long/lat for
# the state, and show the event and damages on the map.
TornadoDamagesProp<- select(Damages, ST, Event, long, lat, PropDmg)%>%
filter(Event == "TORNADO" & PropDmg != 0 )%>%
group_by(ST, Event)
#tally(PropDmg)%>%
#top_n(1)
TornadoDamagesCrop<- select(Damages, ST, Event, long, lat, CropDmg)%>%
filter(Event == "TORNADO" & CropDmg != 0 )%>%
group_by(ST, Event)
#tally(CropDmg)%>%
#top_n(1)
# Filter LAT/LONG values that fall outside of the map boundaries
TornadoDamagesProp <- filter(TornadoDamagesProp, lat > 18.00, lat < 48.98,
                             long > -126.62, long < -62.36,
                             lat != 0 | long != 0)
# Filter LAT/LONG values that fall outside of the map boundaries
TornadoDamagesCrop <- filter(TornadoDamagesCrop, lat > 18.00, lat < 48.98,
                             long > -126.62, long < -62.36,
                             lat != 0 | long != 0)
PropDmgC <- c("Prop Dmg")
TypeDmg <- c("Crop Dmg")
p <- ggplot()
p <- p + geom_polygon( data=all_states, aes(x=long, y=lat,
group = group),colour="black", fill="steelblue4" ) +
labs(title="Tornado Event Damages") +
xlab("Longitude") + ylab("Latitude")
p <- p + geom_jitter( data=TornadoDamagesCrop,
position=position_jitter(width=0.5, height=0.5),
aes(x=long, y=lat, color=TypeDmg)) +
scale_size(name="US $") +
geom_jitter( data=TornadoDamagesProp,
position=position_jitter(width=0.5, height=0.5),
aes(x=long, y=lat, size = PropDmg,color=PropDmgC))
p
The table below depicts the top events in property and crop damages per state across the continental United States.
library(dplyr)
# All51 (the ST / state-name lookup table) was defined above and is reused here.
printCurrency <- function(value, currency.sym="$",
digits=2, sep=",", decimal=".") {
v1 <- as.character(value)
for(i in 1:length(value)){
if (is.na(value[i])){Unit<- "UNK"; value[i]<-0; Multiplier<-0
} else if(value[i] < 1000000){
Unit <- "K"
Multiplier <- 0.001
} else if(value[i] < 1000000000){
Unit <- "M"
Multiplier <- 0.000001
} else if(value[i] >= 1000000000){
Unit <- "B"
Multiplier <- 0.000000001
}
x <- paste(currency.sym, formatC(value[i]*Multiplier, format = "f",
big.mark = sep, digits=digits, decimal.mark=decimal), sep="")
v1[i] <- paste(x, Unit)
}
v1
}
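# Usage illustration (not part of the original review):
# printCurrency(c(123456, 2.5e9)) returns "$123.46 K" "$2.50 B"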
WeatherDamages <- select(Damages, ST, Event, PropDmg, CropDmg)
WeatherDamages <- left_join(All51, WeatherDamages, by = "ST")
WeatherDamagesP<- group_by(WeatherDamages, ST, Event)%>%
tally(PropDmg)%>%
top_n(1)
## Selecting by n
WeatherDamagesC<- group_by(WeatherDamages, ST, Event)%>%
filter( CropDmg != 0)%>%
tally(CropDmg)%>%
top_n(1)
## Selecting by n
WeatherDamagesR <- left_join(WeatherDamagesP, WeatherDamagesC, by = c("ST"))
names(WeatherDamagesR) <- c("ST", "EventPropDmg", "PropCost",
"EventCropDmg", "CropCost")
# The LEFT JOIN produces NA for states without a qualifying crop-damage
# record; replace those NAs with blanks/zeros below.
WeatherDamagesR <- mutate( WeatherDamagesR,
EventCropDmg = ifelse(is.na(EventCropDmg), "", EventCropDmg))
WeatherDamagesR <- mutate( WeatherDamagesR,
CropCost = ifelse(is.na(CropCost), 0 , CropCost))
WeatherDamagesR <- WeatherDamagesR[order(WeatherDamagesR$ST),]
WeatherDamagesR$PropCost <- printCurrency(WeatherDamagesR$PropCost)
WeatherDamagesR$CropCost <- printCurrency(WeatherDamagesR$CropCost)
print(tbl_df(WeatherDamagesR), n=53)
## Source: local data frame [51 x 5]
##
## ST EventPropDmg PropCost EventCropDmg CropCost
## 1 AK FLOOD $157.13 M HIGH WIND $157.00 K
## 2 AL TORNADO $6.32 B HEAT $400.10 M
## 3 AR TORNADO $2.59 B FLOOD $141.06 M
## 4 AZ HAIL $2.83 B TROPICAL STORM $200.00 M
## 5 CA FLOOD $116.75 B EXTREME COLD $731.16 M
## 6 CO HAIL $1.42 B HAIL $116.49 M
## 7 CT TORNADO $596.24 M HAIL $30.00 K
## 8 DC TROPICAL STORM $127.60 M DROUGHT $5.00 K
## 9 DE COASTAL FLOOD $40.15 M DROUGHT $29.10 M
## 10 FL HURRICANE/TYPHOON $27.60 B HURRICANE/TYPHOON $955.20 M
## 11 GA TORNADO $3.26 B DROUGHT $717.28 M
## 12 HI FLASH FLOOD $156.51 M HIGH WIND $2.60 M
## 13 IA TORNADO $2.29 B DROUGHT $2.01 B
## 14 ID FLOOD $114.19 M TSTM WIND/HAIL $6.00 M
## 15 IL RIVER FLOOD $5.02 B RIVER FLOOD $5.01 B
## 16 IN TORNADO $2.59 B FLOOD $698.20 M
## 17 KS TORNADO $2.67 B HAIL $259.41 M
## 18 KY TORNADO $888.77 M DROUGHT $226.00 M
## 19 LA STORM SURGE $31.74 B DROUGHT $587.43 M
## 20 MA TORNADO $756.04 M THUNDERSTORM WIND $1.25 M
## 21 MD TROPICAL STORM $538.50 M DROUGHT $99.72 M
## 22 ME ICE STORM $318.23 M HEAVY RAINS $500.00 K
## 23 MI TORNADO $1.07 B DROUGHT $150.00 M
## 24 MN TORNADO $1.90 B HAIL $140.70 M
## 25 MO TORNADO $4.80 B FLOOD $593.04 M
## 26 MS HURRICANE/TYPHOON $13.49 B ICE STORM $5.00 B
## 27 MT HAIL $94.73 M HAIL $34.34 M
## 28 NC HURRICANE $4.98 B HURRICANE $1.43 B
## 29 ND FLOOD $3.92 B HAIL $189.36 M
## 30 NE TORNADO $1.72 B HAIL $737.99 M
## 31 NH ICE STORM $64.93 M FLOOD $200.00 K
## 32 NJ FLOOD $2.11 B DROUGHT $80.00 M
## 33 NM WILD/FOREST FIRE $1.51 B DROUGHT $14.40 M
## 34 NV FLOOD $677.94 M FLOOD $6.00 M
## 35 NY FLASH FLOOD $1.83 B DROUGHT $100.20 M
## 36 OH TORNADO $2.28 B DROUGHT $200.00 M
## 37 OK TORNADO $3.27 B DROUGHT $1.10 B
## 38 OR FLOOD $722.17 M HAIL $36.03 M
## 39 PA TORNADO $1.79 B DROUGHT $539.40 M
## 40 RI FLOOD $92.86 M $0.00 K
## 41 SC TORNADO $531.51 M HURRICANE $20.30 M
## 42 SD TORNADO $231.21 M HAIL $64.27 M
## 43 TN FLOOD $4.25 B TSTM WIND $9.15 M
## 44 TX TROPICAL STORM $5.49 B DROUGHT $6.37 B
## 45 UT FLOOD $331.76 M HIGH WIND $1.11 M
## 46 VA HURRICANE/TYPHOON $512.00 M DROUGHT $297.48 M
## 47 VT FLOOD $1.10 B FLASH FLOOD $14.72 M
## 48 WA FLOOD $212.68 M HAIL $209.50 M
## 49 WI HAIL $961.72 M FLOOD $466.89 M
## 50 WV FLASH FLOOD $485.23 M DROUGHT $19.75 M
## 51 WY HAIL $111.22 M HAIL $1.88 M
Below is the unique list of events that top the property damage totals in at least one state:
unique(WeatherDamagesR$EventPropDmg)
## [1] "FLOOD" "TORNADO" "HAIL"
## [4] "TROPICAL STORM" "COASTAL FLOOD" "HURRICANE/TYPHOON"
## [7] "FLASH FLOOD" "RIVER FLOOD" "STORM SURGE"
## [10] "ICE STORM" "HURRICANE" "WILD/FOREST FIRE"
Below is the unique list of events that top the crop damage totals in at least one state:
unique(WeatherDamagesR$EventCropDmg)
## [1] "HIGH WIND" "HEAT" "FLOOD"
## [4] "TROPICAL STORM" "EXTREME COLD" "HAIL"
## [7] "DROUGHT" "HURRICANE/TYPHOON" "TSTM WIND/HAIL"
## [10] "RIVER FLOOD" "THUNDERSTORM WIND" "HEAVY RAINS"
## [13] "ICE STORM" "HURRICANE" ""
## [16] "TSTM WIND" "FLASH FLOOD"
It would be interesting to see whether, over time, early warning systems have reduced the number of fatalities and injuries, and whether property and crop damage continues to rise regardless of early warnings due to inflation and population expansion. We might also find that improved building standards and codes have helped limit the damages caused by storms. Who knows what patterns might be identified that would improve our understanding of the weather and the ways we can better prepare for events to reduce loss of life and property.
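As a first cut at that question, fatalities could be totaled by year using the fields already prepared above; the sketch below was not run as part of this review.
library(dplyr)
library(lubridate)
# Sketch: total weather-related fatalities per year, a starting point for
# judging whether early-warning systems have reduced the human toll over time.
YearlyFatal <- group_by(FIELDS, Year = year(dateTime))%>%
    summarize(Fatalities = sum(Fatalities))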
The code below was not used in this review due to a lack of time. However, it provides a foundation for further parsing and dissecting this data to improve our understanding.
library(dplyr)
library(lubridate)
library(scales)
# The nine divisions, grouped into four regions (listed after the code below).
NewEngland <- c("CT", "ME", "MA", "NH", "RI", "VT")
MidAtlantic <- c("NJ", "NY", "PA")
MidwestEastNC <- c("IL", "IN", "MI", "OH", "WI")
MidwestWestNC <- c("IA", "KS", "MN", "MO", "NE", "ND", "SD")
SouthSouthAtlantic <- c("DE", "FL", "GA", "MD", "NC", "SC", "VA", "DC", "WV")
SouthEastSC <- c("AL","KY", "MS", "TN")
SouthWestSC <- c("AR", "LA", "OK", "TX")
WestMountain <- c("AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY")
WestPacific <- c("AK", "CA", "HI", "OR","WA")
RegionsDivsions<- data.frame ( RegDiv = c("R1DIV1", "R1DIV2", "R2DIV3",
"R2DIV4", "R3DIV5", "R3DIV6", "R3DIV7", "R4DIV8", "R4DIV9"),
RegNm = c("NewEngland","MidAtlantic","MidwestEastNC",
"MidwestWestNC","SouthSouthAtlantic","SouthEastSC",
"SouthWestSC","WestMountain","WestPacific"),
stringsAsFactors = FALSE)
for (i in 1:nrow(RegionsDivsions)){
# Create a Data Frame for each Division of the nation for Property Damage
DataName <-paste(RegionsDivsions$RegDiv[i], "P", sep="")
df <- select(Damages, ST, dateTime, PropDmg)%>%
filter(ST %in% get(RegionsDivsions$RegNm[i]) & PropDmg !=0 )%>%
group_by(ST, dateTime)%>%
tally(PropDmg)
names(df)<- c("ST","dateTime", "PropDmg")
df <- mutate(df, Region = RegionsDivsions$RegNm[i])
assign(DataName,df)
# Create a Data Frame for each Region of the nation for Crop Damage
DataName <-paste(RegionsDivsions$RegDiv[i], "C", sep="")
df <- select(Damages, ST, dateTime, CropDmg)%>%
filter(ST %in% get(RegionsDivsions$RegNm[i]) & CropDmg !=0 )%>%
group_by(ST, dateTime)%>%
tally(CropDmg)
names(df)<- c("ST","dateTime", "CropDmg")
df <- mutate(df, Region = RegionsDivsions$RegNm[i])
assign(DataName,df)
# Create a Data Frame for each Region of the nation for Fatalities
DataName <-paste(RegionsDivsions$RegDiv[i], "F", sep="")
df <- select(Damages, ST, dateTime, Fatalities)%>%
filter(ST %in% get(RegionsDivsions$RegNm[i]) & Fatalities !=0 )%>%
group_by(ST, dateTime)%>%
tally(Fatalities)
names(df)<- c("ST","dateTime", "Fatalities")
df <- mutate(df, Region = RegionsDivsions$RegNm[i])
assign(DataName,df)
# Create a Data Frame for each Region of the nation for Injures
DataName <-paste(RegionsDivsions$RegDiv[i], "I", sep="")
df <- select(Damages, ST, dateTime, Injuries)%>%
filter(ST %in% get(RegionsDivsions$RegNm[i]) & Injuries !=0 )%>%
group_by(ST, dateTime)%>%
tally(Injuries)
names(df)<- c("ST","dateTime", "Injuries")
df <- mutate(df, Region = RegionsDivsions$RegNm[i])
assign(DataName,df)
}
# The various regions and divisions can be grouped as needed for further analysis
Prop <- rbind(R1DIV1P, R1DIV2P,R2DIV3P,R2DIV4P,R3DIV5P,R3DIV6P,R3DIV7P,R4DIV8P,R4DIV9P)
Crop<- rbind(R1DIV1C, R1DIV2C,R2DIV3C,R2DIV4C,R3DIV5C,R3DIV6C,R3DIV7C,R4DIV8C,R4DIV9C)
Fatal<- rbind(R1DIV1F, R1DIV2F,R2DIV3F,R2DIV4F,R3DIV5F,R3DIV6F,R3DIV7F,R4DIV8F,R4DIV9F)
Inj<- rbind(R1DIV1I, R1DIV2I,R2DIV3I,R2DIV4I,R3DIV5I,R3DIV6I,R3DIV7I,R4DIV8I,R4DIV9I)
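A more compact alternative to the assign() loop above would tag each observation with its region through a single lookup table and summarize all four measures at once; the sketch below assumes the same groupings.
# Sketch: one ST -> Region lookup replaces the per-division data frames.
RegionLookup <- data.frame(
    ST = c(NewEngland, MidAtlantic, MidwestEastNC, MidwestWestNC,
           SouthSouthAtlantic, SouthEastSC, SouthWestSC,
           WestMountain, WestPacific),
    Region = rep(RegionsDivsions$RegNm, times = c(6, 3, 5, 7, 9, 4, 4, 8, 5)),
    stringsAsFactors = FALSE)
ByRegion <- inner_join(Damages, RegionLookup, by = "ST")%>%
    group_by(Region, ST, dateTime)%>%
    summarize(PropDmg = sum(PropDmg), CropDmg = sum(CropDmg),
              Fatalities = sum(Fatalities), Injuries = sum(Injuries))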
Regions of the United States:
Region 1: Northeast
Division 1: New England
(Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont)
Division 2: Mid-Atlantic
(New Jersey, New York, and Pennsylvania)
Region 2: Midwest
Division 3: East North Central
(Illinois, Indiana, Michigan, Ohio, and Wisconsin)
Division 4: West North Central
(Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, and South Dakota)
Region 3: South
Division 5: South Atlantic
(Delaware, Florida, Georgia, Maryland, North Carolina, South Carolina, Virginia, Washington D.C., and West Virginia)
Division 6: East South Central
(Alabama, Kentucky, Mississippi, and Tennessee)
Division 7: West South Central
(Arkansas, Louisiana, Oklahoma, and Texas)
Region 4: West
Division 8: Mountain
(Arizona, Colorado, Idaho, Montana, Nevada, New Mexico, Utah, and Wyoming)
Division 9: Pacific
(Alaska, California, Hawaii, Oregon, and Washington)
Positive latitude is north of the equator (N), which is where the United States lies; negative latitude is south of the equator (S).
Positive longitude is east of the prime meridian (a north-south line that runs through Greenwich, England), while negative longitude is west of it, which is where the United States lies.
Latitude and longitude boundaries of the continental US:
- 48.987386 is the northernmost latitude
- 18.005611 is the southernmost latitude
- -124.626080 is the westernmost longitude
- -62.361014 is the easternmost longitude
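For reuse, these boundaries could be wrapped in a small helper; note that the map chunks above used slightly wider cutoffs (48.98 and -126.62).
library(dplyr)
# Sketch: keep only coordinates inside the continental US bounding box above.
inConus <- function(df) {
    filter(df, lat > 18.005611, lat < 48.987386,
           long > -124.626080, long < -62.361014)
}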