Based on NOAA Storm data from Jan. 1996 to Nov. 2011.
fileurl = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileurl, "StormData.csv.bz2", method = "curl")
After downloading the CSV file, it’s read into a dataframe called ‘rawdata’
rawdata <- read.csv("StormData.csv.bz2")
Select only the columns needed
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
selvector <- c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
procdata <- select(rawdata, which(names(rawdata) %in% selvector))
Next, subset the data by date. The original ‘rawdata’ contains records back to 1970, but according to NOAA, “All Event Types (48 from Directive 10-1605): From 1996 to present, 48 event types are recorded as defined in NWS Directive 10-1605”, Therefore, we’ll only use data starting in 1996 for analysis. To do this, we have to fix the DATE variable so it becomes useful.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
procdata$BGN_DATE <- mdy_hms(as.character.factor(procdata$BGN_DATE))
# https://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype . All Event Types (48 from Directive 10-1605): From 1996 to present, 48 event types are recorded as defined in NWS Directive 10-1605.
procdata <- filter(procdata, procdata$BGN_DATE >= "1996-01-01")
The property and crop damage cost data consists of two variable - a number and an exponent. Unfortunately, the exponent is expressed as K, M, or B instead of a more useful 10^3, 10^6 or 10^9. This required some reworking by using sapply() with a matching argument.
# fix expo
m <- c("", "0", "K", "M", "B")
n <- c(0,0,3,6,9)
propdmgp10 <- sapply(procdata$PROPDMGEXP, function(x) {n[match(x, m)]})
cropdmgp10 <- sapply(procdata$CROPDMGEXP, function(x) {n[match(x, m)]})
procdata$PROPDMGEXP <- propdmgp10
procdata$CROPDMGEXP <- cropdmgp10
We can get our almost cleaned ‘procdata’ completed by mutating it to multiply the cost by its exponent. Also note the creation of a new variable called CASUALTIES which is the sum of fatalites and injuries.
procdata <- procdata %>%
mutate(CASUALTIES = FATALITIES + INJURIES) %>%
mutate(PROPCOST = PROPDMG * 10^PROPDMGEXP) %>%
mutate(CROPCOST = CROPDMG * 10^CROPDMGEXP) %>%
mutate(TOTALCOST = PROPCOST + CROPCOST)
After cleaning the dataset, an examination of casualties and damage was made. This involved looking at the maximum value in each of the variables and referring back to ‘rawdata’ to see if it “made sense”. In one case, a property damage entry was recorded as being $115 Billion. The following is the procedure used in finding, and correcting the data:
#Outliers
range(procdata$PROPCOST)
## [1] 0.00e+00 1.15e+11
procdata[which(procdata$PROPCOST == max(procdata$PROPCOST)) ,]
## BGN_DATE STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 357163 2006-01-01 CA FLOOD 0 0 115 9
## CROPDMG CROPDMGEXP CASUALTIES PROPCOST CROPCOST TOTALCOST
## 357163 32.5 6 0 1.15e+11 32500000 115032500000
The highest property damage from a storm was $115 Billion. Getting data on this entry from ‘procdata’, a search was done on ‘rawdata’ to find the specific record.
with(rawdata, which(STATE == "CA" & EVTYPE == "FLOOD" & PROPDMG == 115 & PROPDMGEXP == "B"))
## [1] 605953
rawdata[605953,]
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 605953 6 1/1/2006 0:00:00 12:00:00 AM PST 55 NAPA
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 605953 CA FLOOD 0 COUNTYWIDE 1/1/2006 0:00:00
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 605953 07:00:00 AM 0 NA 0 COUNTYWIDE
## LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 605953 0 0 NA 0 0 0 115 B 32.5
## CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 605953 M MTR CALIFORNIA, Western 3828 12218
## LATITUDE_E LONGITUDE_
## 605953 3828 12218
## REMARKS
## 605953 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
## REFNUM
## 605953 605943
Reading the remarks and doing a subsequent Internet search, the storm in question, while causing significant property damage, was characterized as in the “millions” and not “billions”. As such, the values in ‘procdata’ were explicitly changed to reflect their true value
procdata[357163, 7] = 6
procdata[357163, 11] = 1.15e+8
The NOAA Storm database has 48 official event, or EVTYPES. However, the raw dataset has over 900 factors, many of these are due to misuse of case, improper spelling, or simply added variables not in the official description.
After exploratory investigation of the data, and using a Pareto analysis (explained later), the need to clean these entries was vastly reduced by doing two operations on ‘procdata’:
First - change all the EVTYPES to uppercase:
procdata$EVTYPE <- toupper(procdata$EVTYPE)
Second - searce and replace HURRICAINE and TSTM WIND
for(i in 1:nrow(procdata)) {
if(procdata$EVTYPE[i] == "HURRICANE") {
procdata$EVTYPE[i] <- "HURRICANE/TYPHOON"
}
}
for(i in 1:nrow(procdata)) {
if(procdata$EVTYPE[i] == "TSTM WIND") {
procdata$EVTYPE[i] <- "THUNDERSTORM WIND"
}
}
Pareto analysis from Wikipedia
Pareto analysis is a formal technique useful where many possible courses of action are competing for attention. In essence, the problem-solver estimates the benefit delivered by each action, then selects a number of the most effective actions that deliver a total benefit reasonably close to the maximal possible one.
Pareto is another version of the “80/20 rule” and provides policy makers a decision system for concentrating resources to the problems that are the most severe.
To do this, a function was created
# Inspiration for these funcltion came from these web sites:
# https://rstudio-pubs-static.s3.amazonaws.com/72023_670962b57f444c04999fd1a0a393e113.html
# https://www.r-bloggers.com/passing-columns-of-a-dataframe-to-a-function-without-quotes/
pareto <- function(y, data) {
arguments <- as.list(match.call())
y = eval(arguments$y, data)
mutate( data,
cumsum = cumsum(y),
freq = round(y / sum(y), 3),
cum_freq = cumsum(freq)
)
}
Now create and run pareto() on four new datasets for casualties, total damage, property damage and crop damage. We’ll select only the data with greater than 0 occurances.
# Create Datasets
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
casdata <- select(procdata, BGN_DATE, STATE, EVTYPE, CASUALTIES)
casdata <- filter(casdata, CASUALTIES > 0)
totcostdata <- select(procdata, BGN_DATE, STATE, EVTYPE, TOTALCOST)
totcostdata <- filter(totcostdata, TOTALCOST > 0)
propcostdata <- select(procdata, BGN_DATE, STATE, EVTYPE, PROPCOST)
propcostdata <- filter(propcostdata, PROPCOST > 0)
cropcostdata <- select(procdata, BGN_DATE, STATE, EVTYPE, CROPCOST)
cropcostdata <- filter(cropcostdata, CROPCOST > 0)
Now run pareto() on all the datasets:
sumcas <- aggregate(casdata$CASUALTIES, list(type = casdata$EVTYPE), sum)
sumcas <- arrange(sumcas, desc(x))
sumcas <- pareto(x, data = sumcas)
sumtotcost <- aggregate(totcostdata$TOTALCOST, list(type = totcostdata$EVTYPE), sum)
sumtotcost <- arrange(sumtotcost, desc(x))
sumtotcost <- pareto(x, data = sumtotcost)
sumpropcost <- aggregate(propcostdata$PROPCOST, list(type = propcostdata$EVTYPE), sum)
sumpropcost <- arrange(sumpropcost, desc(x))
sumpropcost <- pareto(x, data = sumpropcost)
sumcropcost <- aggregate(cropcostdata$CROPCOST, list(type = cropcostdata$EVTYPE), sum)
sumcropcost <- arrange(sumcropcost, desc(x))
sumcropcost <- pareto(x, data = sumcropcost)
head(sumcas, 10)
## type x cumsum freq cum_freq
## 1 TORNADO 22178 22178 0.332 0.332
## 2 EXCESSIVE HEAT 8188 30366 0.123 0.455
## 3 FLOOD 7172 37538 0.108 0.563
## 4 THUNDERSTORM WIND 5400 42938 0.081 0.644
## 5 LIGHTNING 4790 47728 0.072 0.716
## 6 FLASH FLOOD 2561 50289 0.038 0.754
## 7 WINTER STORM 1483 51772 0.022 0.776
## 8 HEAT 1459 53231 0.022 0.798
## 9 HURRICANE/TYPHOON 1446 54677 0.022 0.820
## 10 HIGH WIND 1318 55995 0.020 0.840
Tornados accounted for 33% of casualties (fatalities + injures) over the studied timeframe. The Pareto chart of the top results show the other storm types requiring major concern.
par(mar=c(2,5,4,2))
sumcasplot <- sumcas[1:10,]
yscale = sumcasplot$cum_freq * max(sumcasplot$x) # For right hand of Pareto chart
scp <- barplot(sumcasplot$x, border = NA, axes = F, width = 1,
ylab = "Casualties",
names.arg = sumcasplot$type,
cex.names = 0.4,
main = "Pareto Chart of Storm Casualties")
axis(side = 2, at = c(0, sumcasplot$x), las = 1, col.axis = "grey62", col = "grey62",
tick = T, cex.axis = 0.8)
lines(scp, yscale, type = "b", cex = 0.7, pch = 19, col="cyan4")
axis(side = 4, at = c(0, yscale), labels = paste(c(0, round(sumcasplot$cum_freq * 100)) ,"%",sep=""),
las = 1, col.axis = "grey62",
col = "cyan4",
cex.axis = 0.8, col.axis = "cyan4")
head(sumtotcost, 10)
## type x cumsum freq cum_freq
## 1 FLOOD 148919611950 148919611950 0.371 0.371
## 2 HURRICANE/TYPHOON 86467941810 235387553760 0.215 0.586
## 3 STORM SURGE 43193541000 278581094760 0.108 0.694
## 4 TORNADO 24900330720 303481425480 0.062 0.756
## 5 HAIL 17071172870 320552598350 0.043 0.799
## 6 FLASH FLOOD 16557105610 337109703960 0.041 0.840
## 7 DROUGHT 14413667000 351523370960 0.036 0.876
## 8 THUNDERSTORM WIND 8812957230 360336328190 0.022 0.898
## 9 TROPICAL STORM 8320186550 368656514740 0.021 0.919
## 10 HIGH WIND 5881421660 374537936400 0.015 0.934
Floods cause the most total damage (37%), followed closely by Hurricanes(Typoons)(22%). The other factors are listed in the Pareto chart
par(mar=c(2,5,4,2))
sumtotplot <- sumtotcost[1:6,]
# Reduce the size of the dollar amounts
sumtotplot$x <- round(sumtotplot$x/10^9)
yscale = sumtotplot$cum_freq * max(sumtotplot$x) # For right hand of Pareto chart
stp <- barplot(sumtotplot$x, border = NA, axes = F, width = 1,
ylab = "Total Damage in $billions",
names.arg = sumtotplot$type,
cex.names = 0.6,
main = "Pareto Chart of Total Property Damage from Storms")
axis(side = 2, at = c(0, sumtotplot$x), las = 1, col.axis = "grey62", col = "grey62",
tick = T, cex.axis = 0.8)
lines(stp, yscale, type = "b", cex = 0.7, pch = 19, col="cyan4")
axis(side = 4, at = c(0, yscale), labels = paste(c(0, round(sumtotplot$cum_freq * 100)) ,"%",sep=""),
las = 1, col.axis = "grey62",
col = "cyan4",
cex.axis = 0.8, col.axis = "cyan4")
head(sumpropcost)
## type x cumsum freq cum_freq
## 1 HURRICANE/TYPHOON 81118659010 81118659010 0.322 0.322
## 2 STORM SURGE 43193536000 124312195010 0.171 0.493
## 3 FLOOD 29059833550 153372028560 0.115 0.608
## 4 TORNADO 24616905710 177988934270 0.098 0.706
## 5 FLASH FLOOD 15222203910 193211138180 0.060 0.766
## 6 HAIL 14595143420 207806281600 0.058 0.824
head(sumcropcost)
## type x cumsum freq cum_freq
## 1 DROUGHT 13367566000 13367566000 0.385 0.385
## 2 HURRICANE/TYPHOON 5349282800 18716848800 0.154 0.539
## 3 FLOOD 4974778400 23691627200 0.143 0.682
## 4 HAIL 2476029450 26167656650 0.071 0.753
## 5 FLASH FLOOD 1334901700 27502558350 0.038 0.791
## 6 EXTREME COLD 1308973000 28811531350 0.038 0.829
Breaking down the total cost data, we find that while Hurricanes cause most property damage (32%), Drought causes the most crop damage (39%). Hurricanes and Floods are also major factors in each catagory.
par(mar=c(2,5,4,2), mfrow=c(1,2))
sumpropplot <- sumpropcost[1:6,]
sumcropplot <- sumcropcost[1:6,]
# Reduce the size of the dollar amounts
sumpropplot$x <- round(sumpropplot$x/10^9)
sumcropplot$x <- round(sumcropplot$x/10^9)
pyscale = sumpropplot$cum_freq * max(sumpropplot$x) # For right hand of Pareto chart
cyscale = sumcropplot$cum_freq * max(sumcropplot$x)
spp <- barplot(sumpropplot$x, border = NA, axes = F, width = 1,
ylab = "Total Damage in $billions",
names.arg = sumpropplot$type,
cex.names = 0.3,
main = "Pareto Chart of Property Damage from Storms")
axis(side = 2, at = c(0, sumpropplot$x), las = 1, col.axis = "grey62", col = "grey62",
tick = T, cex.axis = 0.8)
lines(spp, pyscale, type = "b", cex = 0.7, pch = 19, col="cyan4")
axis(side = 4, at = c(0, pyscale), labels = paste(c(0, round(sumpropplot$cum_freq * 100)) ,"%",sep=""),
las = 1, col.axis = "grey62",
col = "cyan4",
cex.axis = 0.8, col.axis = "cyan4")
scp <- barplot(sumcropplot$x, border = NA, axes = F, width = 1,
ylab = "Total Damage in $billions",
names.arg = sumcropplot$type,
cex.names = 0.3,
main = "Pareto Chart of Crop Damage from Storms")
axis(side = 2, at = c(0, sumcropplot$x), las = 1, col.axis = "grey62", col = "grey62",
tick = T, cex.axis = 0.8)
lines(spp, cyscale, type = "b", cex = 0.7, pch = 19, col="cyan4")
axis(side = 4, at = c(0, cyscale), labels = paste(c(0, round(sumcropplot$cum_freq * 100)) ,"%",sep=""),
las = 1, col.axis = "grey62",
col = "cyan4",
cex.axis = 0.8, col.axis = "cyan4")