Joshua Poirier

Abstract

Across the United States, storms and severe weather events cause public health and economic problems for communities and municipalities. These events directly result in fatalities, injuries, and property damage. This paper will explore the United States’ National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks information about severe weather events in the United States. This information includes estimates of fatalities, injuries, and property damage.

Information tracking for severe weather began in 1950, and includes data recorded up to November 2011. It is worth noting that in earlier years there are fewer events recorded. This can likely be attributed to a lack of good records; more recent years can be considered complete.

Data Source and Loading

The data used for this paper originates from the United States’s NOAA storm database. This data was supplied to the author by the organizers of Coursera course Reproducible Research (http://www.coursera.org/course/repdata). For the purpose of this paper, assume the data has been downloaded to the working directory. This data was downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 on December 16, 2014 at 21:55 MT.

Further information regarding this dataset can be found through the National Weather Service Storm Data Documentation (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) or the National Climatic Data Center Storm Events FAQ (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf).

To start, let’s set our working directory (where the downloaded data is stored), and read in the data.

library(lubridate)
setwd("C:/Papers/Coursera/Reproducible Research/RepData_PeerAssessment2_Data")
stormData <- read.csv("repdata-data-StormData.csv.bz2", nrow=902297)
stormData$BGN_DATE <- mdy_hms(stormData$BGN_DATE)
stormData$Year <- year(stormData$BGN_DATE)

Below shows the first few records in the data set:

rawNum <- nrow(stormData)
print(head(stormData,3))
##   STATE__   BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1 1950-04-18     0130       CST     97     MOBILE    AL TORNADO
## 2       1 1950-04-18     0145       CST      3    BALDWIN    AL TORNADO
## 3       1 1951-02-20     1600       CST     57    FAYETTE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES
## 1         0                      14.0   100 3   0          0       15
## 2         0                       2.0   150 2   0          0        0
## 3         0                       0.1   123 2   0          0        2
##   PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## 1    25.0          K       0                                         3040
## 2     2.5          K       0                                         3042
## 3    25.0          K       0                                         3340
##   LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM Year
## 1      8812       3051       8806              1 1950
## 2      8755          0          0              2 1950
## 3      8742          0          0              3 1951

The raw data set features 902297 severe weather event records.

Data Processing

Cleaning the Data

Let us remove any records where the exponent is not blank, K, M, or B (dollars, thousands, millions, or billions of dollars respectively) for both the PROPDGMEXP and CROPDMGEXP fields. Other values are considered to be invalid because they cannot be interpreted.

## Convert fields to factor
stormData$PROPDMGEXP <- as.factor(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- as.factor(stormData$CROPDMGEXP)

## compute boolean vectors for prop/crop values being 'B' (billions), 
## 'M' (millions), 'K' (thousands), or blank (dollars)
propBillions <- stormData$PROPDMGEXP == "B"
propMillions <- stormData$PROPDMGEXP == "M"
propThousands <- stormData$PROPDMGEXP == "K"
propDollars <- stormData$PROPDMGEXP == ""
cropBillions <- stormData$CROPDMGEXP == "B"
cropMillions <- stormData$CROPDMGEXP == "M"
cropThousands <- stormData$CROPDMGEXP == "K"
cropDollars <- stormData$CROPDMGEXP == ""

## calculate actual dollar values for property and crop damages
stormData$PROP_DOLLARS <- (stormData$PROPDMG * propBillions * 1000000000) + 
                            (stormData$PROPDMG * propMillions * 1000000) +
                            (stormData$PROPDMG * propThousands * 1000) + 
                            (stormData$PROPDMG * propDollars)
stormData$CROP_DOLLARS <- (stormData$CROPDMG * cropBillions * 1000000000) +
                            (stormData$CROPDMG * cropMillions * 1000000) +
                            (stormData$CROPDMG * cropThousands * 1000) +
                            (stormData$CROPDMG * cropDollars)

## remove any records with un-interpretable damage exponents
stormData <- stormData[(propBillions | propMillions | 
                           propThousands | propDollars |
                           cropBillions | cropMillions | 
                           cropThousands | cropDollars),]

## cleanup memory
rm(propBillions, propMillions, propThousands, propDollars,
   cropBillions, cropMillions, cropThousands, cropDollars)

There are a lot of crossover for event types. The author interpreted and classified each of the permitted event types into a new field EVENT_CLASS.

## build vector of event types which are permitted
permittedElements <- c("HAIL", "MARINE HAIL", "SMALL HAIL", "TSTM WIND/HAIL", "TSTM_WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", "HIGH WIND", "MARINE TSTM WIND", "MARINE THUNDERSTORM WIND", "STRONG WIND", "HIGH WINDS", "WIND", "STRONG WINDS", "DRY MICROBURST", "MARINE HIGH WIND", "THUNDERSTORM WINDS HAIL", "GUSTY WINDS", "THUNDERSTORM WINDSS", "MARINE STRONG WIND", "THUNDERSTORM", "TSTM WIND (G45)", "WINDS", "WATERSPOUTS", "TORNADO", "FUNNEL CLOUD", "WATERSPOUT", "DUST DEVEL", "FUNNEL CLOUDS", "FUNNEL", "FLASH FLOOD", "FLOOD", "HEAVY RAIN", "URBAN/SML STREAM FLD", "URBAN FLOODING", "FLASH FLOODING", "FLOOD/FLASH FLOOD", "URBAN FLOOD", "RIVER FLOOD", "COASTAL FLOODING", "FLOODING", "COASTAL FLOOD", "Coastal Flooding", "MONTHLY PRECIPITATION", "LIGHTNING", "HEAVY SNOW", "WINTER STORM", "WINTER WEATHER", "BLIZZARD", "ICE STORM", "WINTER WEATHER/MIX", "LAKE-EFFECT SNOW", "SNOW", "FREEZING RAIN", "LIGHT SNOW", "MODERATE SNOWFALL", "ICE", "WINTRY MIX", "SLEET", "WILDFIRE", "WILD/FOREST FIRE", "DROUGHT", "UNSEASONABLY DRY", "EXCESSIVE HEAT", "HEAT", "RECORD WARMTH", "UNSEASONABLY WARM", "RECORD HEAT", "HEAT WAVE", "DENSE FOG", "FOG", "FREEZING FOG", "HIGH SURF", "TROPICAL STORM", "STORM SURGE", "HEAVY SURF/HIGH SURF", "ASTRONOMICAL LOW TIDE", "HURRICANE", "STORM SURGE/TIDE", "ASTRONOMICAL HIGH TIDE", "HURRICANE/TYPHOON", "HEAVY SURF", "TROPICAL DEPRESSION", "FROST/FREEZE", "EXTREME COLD/WIND CHILL", "EXTREME COLD", "COLD/WIND CHILL", "EXTREME WINDCHILL", "FREEZE", "COLD", "RECORD COLD", "FROST", "LANDSLIDE", "RIP CURRENT", "RIP CURRENTS", "DUST STORM", "AVALANCHE", "Temperature Record")

## classify un-interpretable event types as 'NOT CLASSIFIED'
stormData$EVENT_CLASS <- apply(stormData, 1, FUN = function(x) {
    if (is.element(x[8], permittedElements)) { x[8]
    } else "NOT CLASSIFIED"
    })

## remove records with un-interpretable event types
good <- stormData$EVENT_CLASS != "NOT CLASSIFIED"
stormData <- stormData[good,]

Now let’s clean the latitude and longitude fields. Both are given as a four or five digit number, where the final two digits represent the minutes, and the preceding digits the degrees longitude/latitude. Longitudinal values are West, and latitudinal values are North. Now let’s parse the latitude and longitude values into decimal degrees.

## compute decimal longitude values
stormData$LONGITUDE <- -1 * 
    (floor(abs(stormData$LONGITUDE) / 100) +
    ((abs(stormData$LONGITUDE) - 
         floor(abs(stormData$LONGITUDE) / 100) * 100) / 60))

## compute decimal latitude values
stormData$LATITUDE <- floor(abs(stormData$LATITUDE) / 100) +
    ((abs(stormData$LATITUDE) - 
         floor(abs(stormData$LATITUDE) / 100) * 100) / 60)

Let’s finish by removing any 0 or NA values:

## First, remove any data where the lat/long values are not given (zero)
## The United States does not cross (0,0) geographic coordinates
good <- stormData$LONGITUDE != 0 & stormData$LATITUDE != 0 & 
    stormData$LONGITUDE != NA & stormData$LATITUDE != NA
stormData <- stormData[good,]

Create a clean data set with only the columns we are interested in:

-EVENT_CLASS - The type of severe weather event -BGN_DATE - Beginning date of the severe weather event
-FATALITIES - Number of fatalities resulting from the severe weather event
-INJURIES - Number of injuries resulting from the severe weather event
-PROP_DOLLARS - Property damage (in dollars) resulting from the severe weather event
-CROP_DOLLARS - Crop damage (in dollars) resulting from the severe weather event -STATE - State for which the severe weather event occurred
-Year - Year for which the severe weather event took place
-LATITUDE - Geographic latitude of the severe weather event
-LONGITUDE - Geographic longitude of the severe weather event

Therefore, let’s create a clean dataset featuring only these fields.

## create clean data set
stormDataClean <- data.frame(EVENT_CLASS=stormData$EVENT_CLASS, 
                             BGN_DATE=stormData$BGN_DATE,
                             FATALITIES=stormData$FATALITIES, 
                             INJURIES=stormData$INJURIES,
                             PROP_DOLLARS=stormData$PROP_DOLLARS,
                             CROP_DOLLARS=stormData$CROP_DOLLARS,
                             STATE=stormData$STATE, 
                             Year=stormData$Year,
                             LATITUDE=stormData$LATITUDE, 
                             LONGITUDE=stormData$LONGITUDE)
cleanNum <- nrow(stormDataClean)

## cleanup memory
rm(stormData)

We have reduced the data set from 902297 to 468479 records. Valuable information could be hidden within these records; however, the author expects the major trends (and purpose of the study) to remain within the data set.

Exploratory Data Analysis

It is worth noting that the reliability of records has increased dramatically since recordkeeping began. If the reliability of records increased dramatically at a discrete point in time, we may wish to eliminate records occurring prior to that. The reliability of records may have changed with the implementation of new recordkeeping systems (such as computers).

The years 1993-1995 and 2006 are highlighted above as years featuring major jumps in the data. Let’s look at the number of injuries, fatalities, property damage, and crop damage occurring to see if this helps explain these anomalies.

## Summarize data by year
annualData <- aggregate(data.frame(INJURIES=stormDataClean$INJURIES,
                                   FATALITIES=stormDataClean$FATALITIES,
                                   PROP_DOLLARS=stormDataClean$PROP_DOLLARS,
                                   CROP_DOLLARS=stormDataClean$CROP_DOLLARS),                            
                        by=list(stormDataClean$Year),
                        FUN=sum)
names(annualData) <- c("Year", "Injuries", "Fatalities", "Prop_Dollars", 
                        "Crop_Dollars")

library(ggplot2)
library(gridExtra)    

## Plot severe weather injuries on an annual basis
g1 <- ggplot(annualData, aes(x=Year, y=Injuries)) +
    geom_point(aes(x=Year, y=Injuries)) +
    geom_smooth(method="lm") + 
    geom_vline(xintercept=1995) +
    geom_vline(xintercept=2006) +
    labs(title="Injuries")

## Plot severe weather fatalities on an annual basis
g2 <- ggplot(annualData, aes(x=Year, y=Fatalities)) +
    geom_point(aes(x=Year, y=Fatalities)) +
    geom_vline(xintercept=1995) +
    geom_vline(xintercept=2006) +
    labs(title="Fatalities")

## Plot severe weather property damage (dollars) on an annual basis
g3 <- ggplot(annualData, aes(x=Year, y=log(Prop_Dollars))) +
    geom_point(aes(x=Year, y=log(Prop_Dollars))) +
    geom_smooth(method="lm") +
    geom_vline(xintercept=1995) +
    geom_vline(xintercept=2006) +
    labs(title="Property Damage")

## Plot severe weather crop damage (dollars) on an annual basis
g4 <- ggplot(annualData, aes(x=Year, y=log(Crop_Dollars))) +
    geom_point(aes(x=Year, y=log(Crop_Dollars))) +
    geom_vline(xintercept=1995) +
    geom_vline(xintercept=2006) +
    labs(title="Crop Damage")

## Print out the plots!
grid.arrange(g1,g2,g3,g4,ncol=2)

plot of chunk unnamed-chunk-8

The data shows that crop damage records began both widespread and reliably in 1993. The period between 1993-1995 had unreliable latitude/longitude coordinates - this was likely due to a transition between manual estimates and using GPS measurements.

eventType <- aggregate(data.frame(INJURIES=stormDataClean$INJURIES,
                                   FATALITIES=stormDataClean$FATALITIES,
                                   PROP_DOLLARS=stormDataClean$PROP_DOLLARS,
                                   CROP_DOLLARS=stormDataClean$CROP_DOLLARS),                            
                        by=list(stormDataClean$EVENT_CLASS),
                        FUN=sum)
names(eventType) <- c("EVENT_CLASS", "INJURIES", "FATALITIES", 
                      "PROP_DOLLARS", "CROP_DOLLARS")

## Find row having maximum injuries/fatalities/prop&crop damage
injMaxEvent <- which.max(eventType$INJURIES)
fatMaxEvent <- which.max(eventType$FATALITIES)
propMaxEvent <- which.max(eventType$PROP_DOLLARS)
cropMaxEvent <- which.max(eventType$CROP_DOLLARS)

## Find the event with maximum injuries/fatalities/prop&crop damage
maxInj <- eventType$EVENT_CLASS[injMaxEvent]
maxFat <- eventType$EVENT_CLASS[fatMaxEvent]
maxProp <- eventType$EVENT_CLASS[propMaxEvent]
maxCrop <- eventType$EVENT_CLASS[cropMaxEvent]

Results

Population Health

Furthermore, it is clear that there total injuries has been relatively constant over time. Fatalities have increased in the past few years; although it is difficult to predict if that trend will continue. Prior to the recent increase, fatalities due to severe weather were relatively constant on an annual basis.

From the earlier analysis, the severe weather event type with most injuries is TORNADO .
The severe weather event type with most fatalities is TORNADO.

## Injuries over time
g1 <- ggplot(annualData, aes(x=Year, y=Injuries)) +
    geom_point(aes(x=Year, y=Injuries)) +
    geom_smooth(method="lm") + 
    labs(title="Severe Weather Injuries 1950-2011")

## Fatalities over time
g2 <- ggplot(annualData, aes(x=Year, y=Fatalities)) +
    geom_point(aes(x=Year, y=Fatalities)) +
    geom_smooth(method="lm") +
    labs(title="Severe Weather Fatalities 1950-2011")
grid.arrange(g1,g2,ncol=2)

plot of chunk unnamed-chunk-10

We can tell from the above figures that injuries and fatalities are concentrated in the major population centers within the United States (most are notably in the eastern half of the country). Injuries have been fairly constant over the study period. Fatalities have seen a recent surge; however, prior to the past few years their numbers were relatively constant as well.

Government agencies should expect the following injuries and fatalities over the upcoming years:

## Build linear model between time and injuries
modInj <- lm(annualData$Injuries ~ annualData$Year)
injIntercept <- coef(modInj)[1]
injSlope <- coef(modInj)[2]

## Build linear model between time and fatalities
modFat <- lm(annualData$Fatalities ~ annualData$Year)
fatIntercept <- coef(modFat)[1]
fatSlope <- coef(modFat)[2]

## Years to calculate model for
predictedHealth <- data.frame(Year = c(2012, 2013, 2014, 2015, 2016, 
                                       2017, 2018, 2019, 2020))

## Calculate predicted injuries through 2020
predictedHealth$PredInjuries <- round(injSlope * predictedHealth$Year + injIntercept)

## Calculate predicted injuries through 2020
predictedHealth$PredFatalities <- round(fatSlope * predictedHealth$Year + fatIntercept)

print(predictedHealth)
##   Year PredInjuries PredFatalities
## 1 2012         1423            119
## 2 2013         1419            119
## 3 2014         1416            120
## 4 2015         1413            120
## 5 2016         1410            120
## 6 2017         1406            121
## 7 2018         1403            121
## 8 2019         1400            122
## 9 2020         1397            122

Economic Effects

There appears to be an exponential increase in property damage due to severe weather over time. Crop damage appears to be increasing linearly with respect to time.

The severe weather event type with most property damage is FLOOD. The severe weather event type with most crop damage is FLOOD.

## Property damage over time
g3 <- ggplot(annualData, aes(x=Year, y=log(Prop_Dollars))) +
    geom_point(aes(x=Year, y=log(Prop_Dollars))) +
    geom_smooth(method="lm") +
    labs(title="Property Damage")

## Restrict crop data to beyond 1995, when recordkeeping became widespread
cropData <- annualData[annualData$Year >= 1995,]

## Crop damage over time
g4 <- ggplot(cropData, aes(x=Year, y=log(Crop_Dollars))) +
    geom_point(aes(x=Year, y=log(Crop_Dollars))) +
    geom_smooth(method="lm") +
    labs(title="Severe Weather Crop Damage 1995-2011")
grid.arrange(g3,g4,ncol=2)

plot of chunk unnamed-chunk-12

From the above figures, we can see that property damage has been increasing exponentially over the study period. We have limited the crop damage data from 1995-2011 as that is when record began. There appears to be an exponential relationship increase with respect to time.

Government agencies should expect the following property and crop damage values over the upcoming years:

## Build linear model between time and property damage
modProp <- lm(log(annualData$Prop_Dollars) ~ annualData$Year)
PropIntercept <- coef(modProp)[1]
PropSlope <- coef(modProp)[2]

## Build linear model between time and crop damage
modCrop <- lm(log(cropData$Crop_Dollars) ~ cropData$Year)
CropIntercept <- coef(modCrop)[1]
CropSlope <- coef(modCrop)[2]

## Years to calculate model for
predictedEcon <- data.frame(Year = c(2012, 2013, 2014, 2015, 2016, 
                                       2017, 2018, 2019, 2020))

## Calculate predicted property damages through 2020
predictedEcon$PredPropDMG <- paste0("$", round(
    exp(PropSlope * predictedEcon$Year + PropIntercept)))

## Calculate predicted crop damages through 2020
predictedEcon$PredCropDMG <- paste0("$", round(
    exp(CropSlope * predictedEcon$Year + CropIntercept)))

print(predictedEcon)
##   Year PredPropDMG PredCropDMG
## 1 2012 $5541897578 $1011474784
## 2 2013 $5926458185 $1190424285
## 3 2014 $6337704032 $1401033421
## 4 2015 $6777486846 $1648903396
## 5 2016 $7247786851 $1940626376
## 6 2017 $7750721681 $2283960808
## 7 2018 $8288555915 $2688037757
## 8 2019 $8863711275 $3163603752
## 9 2020 $9478777531 $3723306593

Further work

It would be worth correcting dollar values for inflation, and performing calculations on a per capita basis. We could also further divide the values by state and county to see how recordkeeping compares. Some counties may still not report crop damages.