This is an analysis of storm data between 1950 and November of 2011 from the NOAA (National Oceanic and Atmospheric Administration) Storm Database. This analysis aims to answer two important questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The analysis showed that Floods, Hail, High Winds, and Tornadoes are the most harmful in regards to population health and economic damages. The deadliest natural event is the Tornado, which also caused the most injuries. Floods cause the most property damage and Drought causes the most crop damage. All of these events have become more frequent in the years leading up to 2011.
Some important links are below:
Downloading and reading in the data.
download.file(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile="./stormdata.csv.bz2",method="curl")
stormdata=read.csv("./stormdata.csv.bz2")
print(object.size(stormdata),units="Mb")
## 409.4 Mb
We can cut out a few columns that we don’t need to trim down the data set. We only really need the following variables for our analysis:
REFNUM, so the data can be referenced back to the original table if needed
MONTH, which will be extracted from the BGN_DATE column
YEAR, which will be extracted from the BGN_DATE column
EVTYPE, which describes the event type
FATALITIES
INJURIES
PROPDMG.INF.ADJ, which will be a new inflation and magnitude adjusted number
CROPDMG.INF.ADJ, which will be a new inflation and magnitude adjusted number
We can also cut down the data set to where either FATALITIES, INJURIES, CROPDMG, or PROPDMG are greater than zero. Then we can extract the easy variables (from the list above) first.
stormdata2<-stormdata[stormdata$FATALITIES>0|stormdata$INJURIES>0|stormdata$CROPDMG>0|stormdata$PROPDMG>0,]
REFNUM<-stormdata2$REFNUM
MONTH<-as.factor(unclass(as.POSIXlt(stormdata2$BGN_DATE,format="%m/%d/%Y"))$mon+1)
YEAR<-as.factor(unclass(as.POSIXlt(stormdata2$BGN_DATE,format="%m/%d/%Y"))$year+1900)
FATALITIES<-as.numeric(stormdata2$FATALITIES)
INJURIES<-as.numeric(stormdata2$INJURIES)
Next, we need to figure out what the inflation adjustments needs to be and how to eventually work that into the damage numbers. Inflation is defined as a sustained increase in the general level of prices of goods and services. That means that a dollar fifty years ago had much more purchasing power than a dollar does today, assuming positive cumulative inflation. This analysis will be incomplete if historical dollar figures are unadjusted and compared to more recent dollar figures. A good proxy for inflation is the Consumer Price Index.
The following data set could be downloaded from The BLS Website. The “From” box has been set to 1950 and the “To” box has been set to 2011. Unfortunately, there is no automated way to get this exact dataset from the BLS (U.S. Bureau of Labor Statistics). It is worth sticking to the BLS website to get the most accurate inflation estimates.
First, let’s import the CPI data and get it ready for use.
if("xlsx" %in% rownames(installed.packages()) == FALSE) {install.packages("xlsx")}
library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
CPIdata<-read.xlsx("./BLS CPI Data 1950-2011.xlsx",rowIndex = 11:73, sheetName = "BLS Data Series")
names(CPIdata)[2:13]<-c(1:12)
if("reshape2" %in% rownames(installed.packages()) == FALSE) {install.packages("reshape2")}
library(reshape2)
CPIdatamelt<-melt(CPIdata,measure.vars = names(CPIdata)[2:16])
CPIdatamelt$Year<-as.factor(CPIdatamelt$Year)
names(CPIdatamelt)[2]<-"Year.Part"
Now let’s prepare both data sets for the CPI matching loop.
MONTH.YEAR.CPI<-paste(CPIdatamelt$Year.Part,CPIdatamelt$Year)
CPIdatamelt<-cbind(MONTH.YEAR.CPI,CPIdatamelt)
MONTH.YEAR.STORM<-paste(MONTH,YEAR)
stormdata2<-cbind(MONTH.YEAR.STORM,stormdata2)
Finally, let’s write a loop to match the correct CPI numbers to the stormdata2 time periods.
CPInumbers<-vector()
i=1
for(i in i:length(stormdata2$MONTH.YEAR.STORM))
{
q<-which(as.character(CPIdatamelt$MONTH.YEAR.CPI) == as.character(stormdata2$MONTH.YEAR.STORM[i]))
CPInumbers<-c(CPInumbers,CPIdatamelt[q,4])
}
At this point, we have a vector that has the correct CPI number for each month in each year of the data set. This will be used in a later calculation when adjusting the damage figures.
Now we need to see what the orders of magnitude are for PROPDMGEXP so we can bring the numbers to the correct scale. Let’s do a spot check on one. Let’s also see how many instances we have where the damage is 0 but there is a scale.
levels(stormdata2$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
q<-which(stormdata2$PROPDMGEXP=="3")
stormdata2[q,]
## MONTH.YEAR.STORM STATE__ BGN_DATE BGN_TIME TIME_ZONE
## 214375 5 1995 29 5/16/1995 0:00:00 1750 CST
## COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE BGN_AZI
## 214375 205 SHELBY MO THUNDERSTORM WINDS 2 S
## BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE
## 214375 Shelbyville 0 NA 0
## END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 214375 0 0 NA 0 0 0 20
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## 214375 3 2.5 K 0
## LONGITUDE LATITUDE_E LONGITUDE_
## 214375 0 0 0
## REMARKS
## 214375 The county sheriff reported numerous windows broken and large tree limbs down in Shelbyville from high wind gusts. As the storms moved south, four barns were heavily damaged south of town. Numerous large trees across the county were blown over as well.
## REFNUM
## 214375 214307
sum(stormdata2$PROPDMG=="0" & stormdata2$PROPDMGEXP!="")
## [1] 3950
We can see that there are different ways to call the same order of magnitude and all of these need to be accounted for. Some of the PROPDMG numbers are recorded as “0” with an EXP most likely because there was no clear estimate. We need to keep all of these levels in mind as we are going through the loop and adjust for the 0s. 1 through 9 will be sampled randomly every time there is a property damage order of magnitude without a base number in PROPDMG. This figure will be multiplied by the order of magnitude to produce a pseudo-random estimate of property damage.
set.seed(123)
zero.random.adj<-sample(1:9,size=sum(stormdata2$PROPDMG=="0" & stormdata2$PROPDMGEXP!=""),replace=TRUE)
i=1
j=1
propdmgfull<-vector()
stormdata2$PROPDMGEXP<-as.character(stormdata2$PROPDMGEXP)
for(i in i:length(stormdata2$PROPDMGEXP))
{
if(stormdata2$PROPDMGEXP[i]==""|stormdata2$PROPDMGEXP[i]=="-"|stormdata2$PROPDMGEXP[i]=="?"|stormdata2$PROPDMGEXP[i]=="+"|stormdata2$PROPDMGEXP[i]=="0")
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i])}
if(stormdata2$PROPDMGEXP[i]=="1" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e1)}
if(stormdata2$PROPDMGEXP[i]=="1" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e1))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="2" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e2)}
if(stormdata2$PROPDMGEXP[i]=="2" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e2))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="3" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e3)}
if(stormdata2$PROPDMGEXP[i]=="3" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e3))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="4" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e4)}
if(stormdata2$PROPDMGEXP[i]=="4" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e4))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="5" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e5)}
if(stormdata2$PROPDMGEXP[i]=="5" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e5))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="6" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e6)}
if(stormdata2$PROPDMGEXP[i]=="6" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e6))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="7" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e7)}
if(stormdata2$PROPDMGEXP[i]=="7" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e7))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="8" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e8)}
if(stormdata2$PROPDMGEXP[i]=="8" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e8))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="B" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e9)}
if(stormdata2$PROPDMGEXP[i]=="B" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e9))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="h" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e2)}
if(stormdata2$PROPDMGEXP[i]=="h" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e2))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="H" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e2)}
if(stormdata2$PROPDMGEXP[i]=="H" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e2))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="K" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e3)}
if(stormdata2$PROPDMGEXP[i]=="K" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e3))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="m" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e6)}
if(stormdata2$PROPDMGEXP[i]=="m" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e6))
j<-j+1
}
if(stormdata2$PROPDMGEXP[i]=="M" & stormdata2$PROPDMG[i]!=0)
{propdmgfull<-c(propdmgfull,stormdata2$PROPDMG[i]*1e6)}
if(stormdata2$PROPDMGEXP[i]=="M" & stormdata2$PROPDMG[i]==0)
{propdmgfull<-c(propdmgfull,(zero.random.adj[j]*1e6))
j<-j+1
}
}
Now we can do the same thing for crop damage.
set.seed(123)
zero.random.adj<-sample(1:9,size=sum(stormdata2$CROPDMG=="0" & stormdata2$CROPDMGEXP!=""),replace=TRUE)
i=1
j=1
CROPdmgfull<-vector()
stormdata2$CROPDMGEXP<-as.character(stormdata2$CROPDMGEXP)
for(i in i:length(stormdata2$CROPDMGEXP))
{
if(stormdata2$CROPDMGEXP[i]==""|stormdata2$CROPDMGEXP[i]=="-"|stormdata2$CROPDMGEXP[i]=="?"|stormdata2$CROPDMGEXP[i]=="+"|stormdata2$CROPDMGEXP[i]=="0")
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i])}
if(stormdata2$CROPDMGEXP[i]=="1" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e1)}
if(stormdata2$CROPDMGEXP[i]=="1" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e1))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="2" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e2)}
if(stormdata2$CROPDMGEXP[i]=="2" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e2))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="3" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e3)}
if(stormdata2$CROPDMGEXP[i]=="3" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e3))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="4" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e4)}
if(stormdata2$CROPDMGEXP[i]=="4" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e4))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="5" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e5)}
if(stormdata2$CROPDMGEXP[i]=="5" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e5))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="6" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e6)}
if(stormdata2$CROPDMGEXP[i]=="6" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e6))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="7" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e7)}
if(stormdata2$CROPDMGEXP[i]=="7" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e7))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="8" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e8)}
if(stormdata2$CROPDMGEXP[i]=="8" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e8))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="B" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e9)}
if(stormdata2$CROPDMGEXP[i]=="B" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e9))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="h" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e2)}
if(stormdata2$CROPDMGEXP[i]=="h" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e2))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="H" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e2)}
if(stormdata2$CROPDMGEXP[i]=="H" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e2))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="K" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e3)}
if(stormdata2$CROPDMGEXP[i]=="K" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e3))}
j<-j+1
if(stormdata2$CROPDMGEXP[i]=="k" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e3)}
if(stormdata2$CROPDMGEXP[i]=="k" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e3))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="m" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e6)}
if(stormdata2$CROPDMGEXP[i]=="m" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e6))
j<-j+1
}
if(stormdata2$CROPDMGEXP[i]=="M" & stormdata2$CROPDMG[i]!=0)
{CROPdmgfull<-c(CROPdmgfull,stormdata2$CROPDMG[i]*1e6)}
if(stormdata2$CROPDMGEXP[i]=="M" & stormdata2$CROPDMG[i]==0)
{CROPdmgfull<-c(CROPdmgfull,(zero.random.adj[j]*1e6))
j<-j+1
}
}
At this point we have the full non-inflation adjusted PROPDMG and CROPDMG numbers. Now all we need to do is apply the inflation adjustment. The adjustment works as follows:
i=1
inf.numerator<-tail(CPInumbers,1)
PROPDMG.INF.ADJ<-vector()
for(i in i:length(propdmgfull))
{
PROPDMGADJ<-(inf.numerator/CPInumbers[i])*propdmgfull[i]
PROPDMG.INF.ADJ<-c(PROPDMG.INF.ADJ,PROPDMGADJ)
}
i=1
CROPDMG.INF.ADJ<-vector()
for(i in i:length(CROPdmgfull))
{
CROPDMGADJ<-(inf.numerator/CPInumbers[i])*CROPdmgfull[i]
CROPDMG.INF.ADJ<-c(CROPDMG.INF.ADJ,CROPDMGADJ)
}
Now the CPI adjustment is completed and it is time to move on to the last part of tidying up the data.
Finally, the last step is to clean up EVTYPE. There are many levels to this variable. The documentation lists 48 different event names but the EVTYPE variable has many more levels.
summary(stormdata2$EVTYPE)
## TSTM WIND THUNDERSTORM WIND TORNADO
## 63234 43655 39944
## HAIL FLASH FLOOD LIGHTNING
## 26130 20967 13293
## THUNDERSTORM WINDS FLOOD HIGH WIND
## 12086 10175 5522
## STRONG WIND WINTER STORM HEAVY SNOW
## 3370 1508 1342
## HEAVY RAIN WILDFIRE ICE STORM
## 1105 857 708
## URBAN/SML STREAM FLD EXCESSIVE HEAT HIGH WINDS
## 702 698 657
## TSTM WIND/HAIL TROPICAL STORM WINTER WEATHER
## 441 416 407
## RIP CURRENT WILD/FOREST FIRE FLASH FLOODING
## 400 388 302
## FLOOD/FLASH FLOOD AVALANCHE DROUGHT
## 279 268 266
## BLIZZARD RIP CURRENTS HEAT
## 253 241 215
## EXTREME COLD LAKE-EFFECT SNOW LANDSLIDE
## 197 194 193
## STORM SURGE COASTAL FLOOD URBAN FLOOD
## 177 164 139
## WINTER WEATHER/MIX HIGH SURF HURRICANE
## 139 130 129
## LIGHT SNOW FROST/FREEZE EXTREME COLD/WIND CHILL
## 119 116 111
## MARINE TSTM WIND FOG RIVER FLOOD
## 109 107 107
## DUST STORM COLD/WIND CHILL DUST DEVIL
## 103 90 89
## WIND URBAN FLOODING DRY MICROBURST
## 83 80 78
## DENSE FOG HURRICANE/TYPHOON FLOODING
## 74 72 58
## COASTAL FLOODING SNOW HEAVY SURF/HIGH SURF
## 55 52 50
## STRONG WINDS STORM SURGE/TIDE WATERSPOUT
## 50 47 47
## MARINE STRONG WIND THUNDERSTORM WINDS HAIL TSTM WIND (G45)
## 46 40 36
## FREEZING RAIN TROPICAL DEPRESSION HEAT WAVE
## 35 35 34
## THUNDERSTORM WINDSS MARINE THUNDERSTORM WIND OTHER
## 34 33 33
## COLD HEAVY SURF EXCESSIVE SNOW
## 32 30 25
## ICE ICY ROADS LIGHT FREEZING RAIN
## 24 22 22
## THUNDERSTORM WINDS/HAIL GUSTY WINDS HEAVY SNOW SQUALLS
## 22 21 21
## Light Snow HEAVY RAINS EXTREME WINDCHILL
## 21 20 19
## GLAZE MARINE HIGH WIND THUNDERSTORM
## 19 19 19
## WINDS EXTREME HEAT FLASH FLOOD/FLOOD
## 18 17 16
## FREEZE SNOW SQUALL HEAVY SNOW-SQUALLS
## 16 16 15
## MIXED PRECIPITATION TSUNAMI FLASH FLOODS
## 15 14 13
## FUNNEL CLOUD GUSTY WIND RIVER FLOODING
## 13 13 13
## SMALL HAIL DROUGHT/EXCESSIVE HEAT Gusty Winds
## 11 10 10
## (Other)
## 773
The stringdist package has some great tools to help with this sort of problem. The amatch function, which stands for “approximate match”, will help us with partial string matching. First, we need to manually input the correct 48 names for EVTYPE. Then we will use amatch to force the old EVTYPE variables into one of the new 48 levels. This will not be perfect for every case but it will handle the vast majority of cases correctly.
Label = c("Astronomical Low Tide", "Avalanche", "Blizzard",
"Coastal Flood", "Cold/Wind Chill", "Debris Flow",
"Dense Fog", "Dense Smoke", "Drought", "Dust Devil",
"Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill",
"Flash Flood", "Flood", "Freezing Fog", "Frost/Freeze",
"Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow",
"High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm",
"Lake-Effect Snow", "Lakeshore Flood", "Lightning",
"Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet",
"Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind",
"Tornado", "Tropical Depression", "Tropical Storm", "Tsunami",
"Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm",
"Winter Weather")
if("stringdist" %in% rownames(installed.packages()) == FALSE) {install.packages("stringdist")}
library(stringdist)
ClosestMatch2 = function(string, stringVector){
stringVector[amatch(string, stringVector, maxDist=Inf)]
}
EVTYPE<-stormdata2$EVTYPE
EVTYPE<-as.character(EVTYPE)
i=1
for (i in i:length(EVTYPE))
{
EVTYPE[i]<-ClosestMatch2(string=stormdata2$EVTYPE[i],stringVector=Label)
}
Now we can join together all of the relevant variables into a much smaller and tidier data set.
tidystormdata<-data.frame(REFNUM,MONTH,YEAR,EVTYPE,FATALITIES,INJURIES,PROPDMG.INF.ADJ,CROPDMG.INF.ADJ)
Now we can perform some exploratory analysis. Let’s try to answer the first question:
par(mar=c(7,4.5,2,2))
par(mfrow=c(1,2))
plotdata.fat<-tapply(tidystormdata$FATALITIES,tidystormdata$EVTYPE, FUN = sum,na.rm=TRUE)
barplot(plotdata.fat[plotdata.fat>500],col = "blue", main = "Total Fatalities By Event",las=2)
plotdata.inj<-tapply(tidystormdata$INJURIES,tidystormdata$EVTYPE, FUN = sum,na.rm=TRUE)
barplot(plotdata.inj[plotdata.inj>5000],col = "blue", main = "Total Injuries By Event",las=2)
We can see that Tornadoes cause the most fatalities and injuries. This could be because Tornadoes are usually the hardest to prepare for.
Let’s take a look at the frequency of these events over time. We will start the plot at 1991 and end it at 2011 for each event.
par(mar=c(3,4.5,2,2))
Wind.Count<-table(tidystormdata$YEAR,tidystormdata$EVTYPE=="High Wind")[,1]
Tornado.Count<-table(tidystormdata$YEAR,tidystormdata$EVTYPE=="Tornado")[,1]
Hail.Count<-table(tidystormdata$YEAR,tidystormdata$EVTYPE=="Hail")[,1]
Flood.Count<-table(tidystormdata$YEAR,tidystormdata$EVTYPE=="Flood")[,1]
counts<-cbind(Wind.Count,Tornado.Count,Hail.Count,Flood.Count)
par(mfrow=c(1,1))
barplot(counts[42:62,],beside=TRUE,col=c("Blue"), main = "Counts Of Event Per Year 1991-2011")
We can see that there has been an abnormally high number of tornadoes in the years leading up to 2011. High Winds and Hail have been getting progressively more frequent.
Finally, let’s move on to answer question number two.
par(mar=c(7,4.5,2,2))
par(mfrow=c(1,2))
plotdata.prop<-tapply(tidystormdata$PROPDMG.INF.ADJ,tidystormdata$EVTYPE, FUN = sum, na.rm=TRUE)
barplot(plotdata.prop[plotdata.prop>10000000000], col = "blue", main = "Total Prop Damage in 2011 $",las=2)
plotdata.crop<-tapply(tidystormdata$CROPDMG.INF.ADJ,tidystormdata$EVTYPE, FUN = sum, na.rm=TRUE)
barplot(plotdata.crop[plotdata.crop>1000000000], col = "blue", main = "Total Crop Damage in 2011 $",las=2)
We can see that Floods and Tornadoes cause the most property damage overall. Drought causes the most crop damage by far.
Thank you for your time.