Summary

Based on NOAA Storm data from Jan. 1996 to Nov. 2011.

Data Processing

Loading the Raw Data

fileurl = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileurl, "StormData.csv.bz2", method = "curl")

After downloading the CSV file, it’s read into a dataframe called ‘rawdata’

rawdata <- read.csv("StormData.csv.bz2")

Reducing the Dataset

Select only the columns needed

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
selvector <- c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", 
               "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

procdata <- select(rawdata, which(names(rawdata) %in% selvector))

Next, subset the data by date. The original ‘rawdata’ contains records back to 1970, but according to NOAA, “All Event Types (48 from Directive 10-1605): From 1996 to present, 48 event types are recorded as defined in NWS Directive 10-1605”, Therefore, we’ll only use data starting in 1996 for analysis. To do this, we have to fix the DATE variable so it becomes useful.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
procdata$BGN_DATE <- mdy_hms(as.character.factor(procdata$BGN_DATE))
# https://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype . All Event Types (48 from Directive 10-1605): From 1996 to present, 48 event types are recorded as defined in NWS Directive 10-1605. 

procdata <- filter(procdata, procdata$BGN_DATE >= "1996-01-01")

The property and crop damage cost data consists of two variable - a number and an exponent. Unfortunately, the exponent is expressed as K, M, or B instead of a more useful 10^3, 10^6 or 10^9. This required some reworking by using sapply() with a matching argument.

# fix expo
m <- c("", "0", "K", "M", "B")
n <- c(0,0,3,6,9)

propdmgp10 <- sapply(procdata$PROPDMGEXP, function(x) {n[match(x, m)]})
cropdmgp10 <- sapply(procdata$CROPDMGEXP, function(x) {n[match(x, m)]})

procdata$PROPDMGEXP <- propdmgp10
procdata$CROPDMGEXP <- cropdmgp10

We can get our almost cleaned ‘procdata’ completed by mutating it to multiply the cost by its exponent. Also note the creation of a new variable called CASUALTIES which is the sum of fatalites and injuries.

procdata <- procdata %>% 
        mutate(CASUALTIES = FATALITIES + INJURIES) %>%
        mutate(PROPCOST = PROPDMG * 10^PROPDMGEXP) %>%
        mutate(CROPCOST = CROPDMG * 10^CROPDMGEXP) %>%
        mutate(TOTALCOST = PROPCOST + CROPCOST)

Checking and Handling Outlier Data

After cleaning the dataset, an examination of casualties and damage was made. This involved looking at the maximum value in each of the variables and referring back to ‘rawdata’ to see if it “made sense”. In one case, a property damage entry was recorded as being $115 Billion. The following is the procedure used in finding, and correcting the data:

#Outliers
range(procdata$PROPCOST)
## [1] 0.00e+00 1.15e+11
procdata[which(procdata$PROPCOST == max(procdata$PROPCOST)) ,]
##          BGN_DATE STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 357163 2006-01-01    CA  FLOOD          0        0     115          9
##        CROPDMG CROPDMGEXP CASUALTIES PROPCOST CROPCOST    TOTALCOST
## 357163    32.5          6          0 1.15e+11 32500000 115032500000

The highest property damage from a storm was $115 Billion. Getting data on this entry from ‘procdata’, a search was done on ‘rawdata’ to find the specific record.

with(rawdata, which(STATE == "CA" & EVTYPE == "FLOOD" & PROPDMG == 115 & PROPDMGEXP == "B"))
## [1] 605953
rawdata[605953,] 
##        STATE__         BGN_DATE    BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 605953       6 1/1/2006 0:00:00 12:00:00 AM       PST     55       NAPA
##        STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI         END_DATE
## 605953    CA  FLOOD         0         COUNTYWIDE 1/1/2006 0:00:00
##           END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 605953 07:00:00 AM          0         NA         0         COUNTYWIDE
##        LENGTH WIDTH  F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 605953      0     0 NA   0          0        0     115          B    32.5
##        CROPDMGEXP WFO          STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 605953          M MTR CALIFORNIA, Western               3828     12218
##        LATITUDE_E LONGITUDE_
## 605953       3828      12218
##                                                                                                                                                                                                                                                                                                                                                                                               REMARKS
## 605953 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
##        REFNUM
## 605953 605943

Reading the remarks and doing a subsequent Internet search, the storm in question, while causing significant property damage, was characterized as in the “millions” and not “billions”. As such, the values in ‘procdata’ were explicitly changed to reflect their true value

procdata[357163, 7] = 6
procdata[357163, 11] = 1.15e+8

Cleaning EVTYPE

The NOAA Storm database has 48 official event, or EVTYPES. However, the raw dataset has over 900 factors, many of these are due to misuse of case, improper spelling, or simply added variables not in the official description.

After exploratory investigation of the data, and using a Pareto analysis (explained later), the need to clean these entries was vastly reduced by doing two operations on ‘procdata’:

First - change all the EVTYPES to uppercase:

procdata$EVTYPE <- toupper(procdata$EVTYPE)

Second - searce and replace HURRICAINE and TSTM WIND

for(i in 1:nrow(procdata)) {
        
        if(procdata$EVTYPE[i] == "HURRICANE") {
                
                procdata$EVTYPE[i] <- "HURRICANE/TYPHOON"
                
        }
}
for(i in 1:nrow(procdata)) {        
        if(procdata$EVTYPE[i] == "TSTM WIND") {
                        
                        procdata$EVTYPE[i] <- "THUNDERSTORM WIND"
        }
        
}

Pareto Analysis

Pareto analysis from Wikipedia

Pareto analysis is a formal technique useful where many possible courses of action are competing for attention. In essence, the problem-solver estimates the benefit delivered by each action, then selects a number of the most effective actions that deliver a total benefit reasonably close to the maximal possible one.

Pareto is another version of the “80/20 rule” and provides policy makers a decision system for concentrating resources to the problems that are the most severe.

To do this, a function was created

# Inspiration for these funcltion came from these web sites:
# https://rstudio-pubs-static.s3.amazonaws.com/72023_670962b57f444c04999fd1a0a393e113.html
# https://www.r-bloggers.com/passing-columns-of-a-dataframe-to-a-function-without-quotes/

pareto <- function(y, data) {
        arguments <- as.list(match.call())
        y = eval(arguments$y, data)
        
        mutate( data, 
                cumsum = cumsum(y),
                freq = round(y / sum(y), 3),
                cum_freq = cumsum(freq)
        )
        
}

Preparing Dataset for Analysis

Now create and run pareto() on four new datasets for casualties, total damage, property damage and crop damage. We’ll select only the data with greater than 0 occurances.

# Create Datasets
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
casdata <- select(procdata, BGN_DATE, STATE, EVTYPE, CASUALTIES)
casdata <- filter(casdata, CASUALTIES > 0)

totcostdata <- select(procdata, BGN_DATE, STATE, EVTYPE, TOTALCOST)
totcostdata <- filter(totcostdata, TOTALCOST > 0)

propcostdata <- select(procdata, BGN_DATE, STATE, EVTYPE, PROPCOST)
propcostdata <- filter(propcostdata, PROPCOST > 0)

cropcostdata <- select(procdata, BGN_DATE, STATE, EVTYPE, CROPCOST)
cropcostdata <- filter(cropcostdata, CROPCOST > 0)

Now run pareto() on all the datasets:

sumcas <- aggregate(casdata$CASUALTIES, list(type = casdata$EVTYPE), sum)
sumcas <- arrange(sumcas, desc(x))
sumcas <- pareto(x, data = sumcas)

sumtotcost <- aggregate(totcostdata$TOTALCOST, list(type = totcostdata$EVTYPE), sum)
sumtotcost <- arrange(sumtotcost, desc(x))
sumtotcost <- pareto(x, data = sumtotcost)

sumpropcost <- aggregate(propcostdata$PROPCOST, list(type = propcostdata$EVTYPE), sum)
sumpropcost <- arrange(sumpropcost, desc(x))
sumpropcost <- pareto(x, data = sumpropcost)

sumcropcost <- aggregate(cropcostdata$CROPCOST, list(type = cropcostdata$EVTYPE), sum)
sumcropcost <- arrange(sumcropcost, desc(x))
sumcropcost <- pareto(x, data = sumcropcost)

Results

Caualties

head(sumcas, 10)
##                 type     x cumsum  freq cum_freq
## 1            TORNADO 22178  22178 0.332    0.332
## 2     EXCESSIVE HEAT  8188  30366 0.123    0.455
## 3              FLOOD  7172  37538 0.108    0.563
## 4  THUNDERSTORM WIND  5400  42938 0.081    0.644
## 5          LIGHTNING  4790  47728 0.072    0.716
## 6        FLASH FLOOD  2561  50289 0.038    0.754
## 7       WINTER STORM  1483  51772 0.022    0.776
## 8               HEAT  1459  53231 0.022    0.798
## 9  HURRICANE/TYPHOON  1446  54677 0.022    0.820
## 10         HIGH WIND  1318  55995 0.020    0.840

Tornados accounted for 33% of casualties (fatalities + injures) over the studied timeframe. The Pareto chart of the top results show the other storm types requiring major concern.

par(mar=c(2,5,4,2)) 
sumcasplot <- sumcas[1:10,]

yscale = sumcasplot$cum_freq * max(sumcasplot$x) # For right hand of Pareto chart

scp <- barplot(sumcasplot$x, border = NA, axes = F, width = 1, 
               
               ylab = "Casualties", 
               names.arg = sumcasplot$type,
               cex.names = 0.4,
               main = "Pareto Chart of Storm Casualties")

axis(side = 2, at = c(0, sumcasplot$x), las = 1, col.axis = "grey62", col = "grey62", 
     tick = T, cex.axis = 0.8)

lines(scp, yscale, type = "b", cex = 0.7, pch = 19, col="cyan4")

axis(side = 4, at = c(0, yscale), labels = paste(c(0, round(sumcasplot$cum_freq * 100)) ,"%",sep=""), 
     las = 1, col.axis = "grey62", 
     col = "cyan4", 
     cex.axis = 0.8, col.axis = "cyan4")

Total Damage

head(sumtotcost, 10)
##                 type            x       cumsum  freq cum_freq
## 1              FLOOD 148919611950 148919611950 0.371    0.371
## 2  HURRICANE/TYPHOON  86467941810 235387553760 0.215    0.586
## 3        STORM SURGE  43193541000 278581094760 0.108    0.694
## 4            TORNADO  24900330720 303481425480 0.062    0.756
## 5               HAIL  17071172870 320552598350 0.043    0.799
## 6        FLASH FLOOD  16557105610 337109703960 0.041    0.840
## 7            DROUGHT  14413667000 351523370960 0.036    0.876
## 8  THUNDERSTORM WIND   8812957230 360336328190 0.022    0.898
## 9     TROPICAL STORM   8320186550 368656514740 0.021    0.919
## 10         HIGH WIND   5881421660 374537936400 0.015    0.934

Floods cause the most total damage (37%), followed closely by Hurricanes(Typoons)(22%). The other factors are listed in the Pareto chart

par(mar=c(2,5,4,2)) 
sumtotplot <- sumtotcost[1:6,]

# Reduce the size of the dollar amounts

sumtotplot$x <- round(sumtotplot$x/10^9)

yscale = sumtotplot$cum_freq * max(sumtotplot$x) # For right hand of Pareto chart

stp <- barplot(sumtotplot$x, border = NA, axes = F, width = 1, 
                
               ylab = "Total Damage in $billions", 
               names.arg = sumtotplot$type,
               cex.names = 0.6,
               main = "Pareto Chart of Total Property Damage from Storms")

axis(side = 2, at = c(0, sumtotplot$x), las = 1, col.axis = "grey62", col = "grey62", 
     tick = T, cex.axis = 0.8)

lines(stp, yscale, type = "b", cex = 0.7, pch = 19, col="cyan4")

axis(side = 4, at = c(0, yscale), labels = paste(c(0, round(sumtotplot$cum_freq * 100)) ,"%",sep=""), 
     las = 1, col.axis = "grey62", 
     col = "cyan4", 
     cex.axis = 0.8, col.axis = "cyan4")

Property and Crop Damage

head(sumpropcost)
##                type           x       cumsum  freq cum_freq
## 1 HURRICANE/TYPHOON 81118659010  81118659010 0.322    0.322
## 2       STORM SURGE 43193536000 124312195010 0.171    0.493
## 3             FLOOD 29059833550 153372028560 0.115    0.608
## 4           TORNADO 24616905710 177988934270 0.098    0.706
## 5       FLASH FLOOD 15222203910 193211138180 0.060    0.766
## 6              HAIL 14595143420 207806281600 0.058    0.824
head(sumcropcost)
##                type           x      cumsum  freq cum_freq
## 1           DROUGHT 13367566000 13367566000 0.385    0.385
## 2 HURRICANE/TYPHOON  5349282800 18716848800 0.154    0.539
## 3             FLOOD  4974778400 23691627200 0.143    0.682
## 4              HAIL  2476029450 26167656650 0.071    0.753
## 5       FLASH FLOOD  1334901700 27502558350 0.038    0.791
## 6      EXTREME COLD  1308973000 28811531350 0.038    0.829

Breaking down the total cost data, we find that while Hurricanes cause most property damage (32%), Drought causes the most crop damage (39%). Hurricanes and Floods are also major factors in each catagory.

par(mar=c(2,5,4,2), mfrow=c(1,2)) 
sumpropplot <- sumpropcost[1:6,]
sumcropplot <- sumcropcost[1:6,]

# Reduce the size of the dollar amounts

sumpropplot$x <- round(sumpropplot$x/10^9)
sumcropplot$x <- round(sumcropplot$x/10^9)


pyscale = sumpropplot$cum_freq * max(sumpropplot$x) # For right hand of Pareto chart
cyscale = sumcropplot$cum_freq * max(sumcropplot$x)


spp <- barplot(sumpropplot$x, border = NA, axes = F, width = 1, 
               
               ylab = "Total Damage in $billions", 
               names.arg = sumpropplot$type,
               cex.names = 0.3,
               main = "Pareto Chart of Property Damage from Storms")

axis(side = 2, at = c(0, sumpropplot$x), las = 1, col.axis = "grey62", col = "grey62", 
     tick = T, cex.axis = 0.8)

lines(spp, pyscale, type = "b", cex = 0.7, pch = 19, col="cyan4")

axis(side = 4, at = c(0, pyscale), labels = paste(c(0, round(sumpropplot$cum_freq * 100)) ,"%",sep=""), 
     las = 1, col.axis = "grey62", 
     col = "cyan4", 
     cex.axis = 0.8, col.axis = "cyan4")


scp <- barplot(sumcropplot$x, border = NA, axes = F, width = 1, 
               
               ylab = "Total Damage in $billions", 
               names.arg = sumcropplot$type,
               cex.names = 0.3,
               main = "Pareto Chart of Crop Damage from Storms")

axis(side = 2, at = c(0, sumcropplot$x), las = 1, col.axis = "grey62", col = "grey62", 
     tick = T, cex.axis = 0.8)

lines(spp, cyscale, type = "b", cex = 0.7, pch = 19, col="cyan4")

axis(side = 4, at = c(0, cyscale), labels = paste(c(0, round(sumcropplot$cum_freq * 100)) ,"%",sep=""), 
     las = 1, col.axis = "grey62", 
     col = "cyan4", 
     cex.axis = 0.8, col.axis = "cyan4")