First we will load in the data from the url:
“https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2”, “.\data\StormData.csv.bz2”
And read into R with read.csv
if(!file.exists(".\\data")){dir.create(".\\data")}
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
".\\data\\StormData.csv.bz2")
StormData <- read.csv(".\\data\\StormData.csv.bz2")
In cleaning our data, first we will isolate the relevant columns the refer to casualty amounts (injuries and fatalities), damage amounts (crop and property damage), and identifying characteristics. Then removing entries that do not feature either damage or casualties.
sData <- StormData[ , c(2, 7,8,23:28,37)]
names(sData)
## [1] "BGN_DATE" "STATE" "EVTYPE" "FATALITIES" "INJURIES"
## [6] "PROPDMG" "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "REFNUM"
# remove all rows with zero crop/property damage, and
# no injuries/fatalities
sData <- sData[sData[,4]!=0 | sData[,5]!=0 |
sData[,6]!=0 | sData[,8]!=0,]
Crop and property damage exist in two colums each, a number and a multiplier. The multiplier has values of “k,” “m,” or “b” of thousands, millions or billions of dollars. Here combine those columns into one by multiplying the number by 1e3, 1e6, or 1e9 as appropriate. Finally, discard now redundant columns for property and corp damage and combine property damage and crop damage into a total damage column
#Create Vector of multipliers for property damage.
# Treat all non "m","k",or"b" as equal to 1
sData$pdmult <- sData$PROPDMGEXP
sData$pdmult <- gsub("[^KMmB]|\"\"", "1", sData$pdmult)
sData$pdmult[sData$pdmult==""] <- "1"
sData$pdmult <- gsub("K", 1000, sData$pdmult)
sData$pdmult <- gsub("[Mm]", 1000000, sData$pdmult)
sData$pdmult <- gsub("B", 1000000000, sData$pdmult)
# Vector of multipliers for crop damage.
# Treat all non "m","k",or"b" as equal to 1
sData$cdmult <- sData$CROPDMGEXP
sData$cdmult <- gsub("[^KMmB]|\"\"", "1", sData$cdmult)
sData$cdmult[sData$cdmult==""] <- "1"
sData$cdmult <- gsub("K", 1000, sData$cdmult)
sData$cdmult <- gsub("[Mm]", 1000000, sData$cdmult)
sData$cdmult <- gsub("B", 1000000000, sData$cdmult)
#as.numeric
sData$cdmult <- as.numeric(sData$cdmult)
sData$pdmult <- as.numeric(sData$pdmult)
#vectors of damages in dollars
sData$pdamage <- sData$PROPDMG*sData$pdmult
sData$cdamage <- sData$CROPDMG*sData$cdmult
# Trim data to identifiers, damage, injury and fatalitys
sData <- sData[ , c(1:5,10,13,14)]
sData$tdamage <- sData$pdamage + sData$cdamage
For our last bit of data procesing, we will split our data into two (non-mutually exclusive) tables. One, sdam contains evens which featured damage, the other sharm involved casualties. In sharm we will create a casualty total that adds injuries and fatalities
sdam <- sData[sData$tdamage!=0,c(1:3,6:9)]
sharm <- sData[sData[,4]!=0 | sData[,5]!=0,c(1:6)]
sharm$total <- sharm$FATALITIES + sharm$INJURIES
First let’s look at a boxplot and scatter plot of total damages and casualty figures
par(mfrow = c(2, 2))
boxplot(sdam$tdamage, xlab = "Total Damage in $")
boxplot(sharm$total, xlab = "Sum of Injuries / Fatalities")
plot(sharm$total, sharm$REFNUM, xlab = "Scatterplot of Casualty Totals", ylab = "Reference Number")
plot(sdam$tdamage, sdam$REFNUM, xlab = "Scatterplot of Damage Totals", ylab = "Reference Number")
Many events cause damage and casualty, but we are most interested in events that cause the greatest damage, and most threat to loss of life. As such, we will take the top 5% of casualty inducing events and the top 3% of damage inducing events (There are many, many more damage inducing events).
# Look at top 10% of harm totals
quantile(sharm$total, prob = seq(.9, 1, .01))
## 90% 91% 92% 93% 94% 95% 96% 97% 98%
## 10.00 12.00 13.00 15.00 18.32 22.00 28.00 38.00 53.00
## 99% 100%
## 101.00 1742.00
#take top 5%
harmCut <- quantile(sharm$total, prob = .95)
topharm <- sharm[sharm$total >= harmCut, ]
#Look at top 10% of damage totals
quantile(sdam$tdamage, prob = seq(.9, 1, .01))
## 90% 91% 92% 93% 94%
## 250000 255000 375000 500000 550000
## 95% 96% 97% 98% 99%
## 1000000 1500000 2500000 4000000 11000000
## 100%
## 115032500000
#Because of larger amount of data, take only top 3%
damCut <- quantile(sdam$tdamage, prob = .97)
topdam <- sdam[sdam$tdamage >= damCut, ]
Once again take a look at the box and scatter plots of damages
par(mfrow = c(2, 2))
boxplot(topdam$tdamage, xlab = "Total Damage in $")
boxplot(topharm$total, xlab = "Sum of Injuries / Fatalities")
plot(topharm$total,topharm$REFNUM, xlab = "Scatterplot of Casualty Totals", ylab = "Reference Number")
plot(topdam$tdamage, topdam$REFNUM, xlab = "Scatterplot of Damage Totals", ylab = "Reference Number")
Even looking at just these “top” events, the distribution is dominated by a few major, major events. In fact, with the plots this small, they almost look identical to the plots of all damage/casualty inducing events! However, because each of these “top” events involved significant impact we will look at the most frequent type of events for these damages/casualties. Our interest is to determine which kind of events will likely to become major, impactful events, not “storm of the century” events.
tail( sort( table( droplevels( topdam )$EVTYPE)), 15)
##
## HEAVY SNOW HURRICANE TROPICAL STORM
## 67 74 80
## WINTER STORM WILDFIRE THUNDERSTORM WINDS
## 82 94 104
## THUNDERSTORM WIND ICE STORM DROUGHT
## 113 116 138
## HIGH WIND TSTM WIND HAIL
## 188 269 695
## FLASH FLOOD FLOOD TORNADO
## 847 1036 3729
For damage, Tornadoes are far and away the most frequent culprit, however, floods, (non-tornado) winds and hail also feature prominently.
tail( sort( table( droplevels( topharm )$EVTYPE)), 15)
##
## THUNDERSTORM WINDS WILDFIRE DUST STORM
## 6 6 7
## ICE STORM HEAVY SNOW LIGHTNING
## 7 8 8
## HAIL FOG FLASH FLOOD
## 9 10 11
## WINTER STORM HEAT TSTM WIND
## 14 20 24
## FLOOD EXCESSIVE HEAT TORNADO
## 33 70 817
For casualties, Tornadoes, once again, are responsible for many of the top events, with other event-types not coming anywhere near as close.
From this table it is possible to see some of the coding inconsistencies of the EVTYPE variable as both “TSTM WIND” and “THUNDERSTORM WINDS” are mentioned. organizing the 900+ Event-types went beyond the scope of this project.
Quite clearly, when it comes to weather events that have impact on human health and on property/crops, tornadoes are most likely to involve relatively high casualty and damage numbers.
The major major damage event was:
topdam[topdam$tdamage == max(topdam$tdamage),]
## BGN_DATE STATE EVTYPE REFNUM pdamage cdamage tdamage
## 605953 1/1/2006 0:00:00 CA FLOOD 605943 1.15e+11 32500000 115032500000
In looking up descriptions of December 2005/ January 2006 flooding in CA, there is little indication that 115 Billion dollars of damage occured. However, I decided casting out outliars to be outside the scope of this investigation. Link below to report: