This analysis was prepared with the aim to find the event types that have the most drastic effects on population. The assumption is that these events are the ones with the greatest fatalities and injuries and property and crop damages.Natioanal Oceanic and Atmospheric Administration’s (NOAA) Data spanning from 1950 and 2011 is preprocessed and analyzed. Analysis show that tornados and excessive heat are the event types with most injuries and fatalities while flood, hurricane typhoon and tornados again causing the biggest economic damage.
For this review the data from U.S. National Oceanic and Atmospheric Administration’s (NOAA) is used. It covers the events from 1950 until November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The data is downloaded and uploaded via the open source R Program ( version 3.1.2 ) with the below code.
if(!file.exists("Storm.bz2")){
URL <- ("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")
download.file(URL,"Storm.bz2")
Storm <- read.csv("Storm.bz2")
}
Storm <-read.csv("Storm.bz2")
After reading the data we check the structure of the data
str(Storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Since the Beginning Date and End DAte are factor variables, we are turning them into to a Date format.
Storm$BGN_DATE <- as.Date(Storm$BGN_DATE,"%m/%d/%Y")
Storm$END_DATE <- as.Date(Storm$END_DATE,"%m/%d/%Y")
Since Time and Time Zone variables are not particular interest for this paper, we do not process them even though their format could be converted.
A look to the EVTYPE variable shows that there are encoding problems.
head(sort(table(Storm$EVTYPE),decreasing=TRUE),n=60)
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE BLIZZARD
## 3392 2761 2719
## DROUGHT ICE STORM EXCESSIVE HEAT
## 2488 2006 1678
## HIGH WINDS WILD/FOREST FIRE FROST/FREEZE
## 1533 1457 1342
## DENSE FOG WINTER WEATHER/MIX TSTM WIND/HAIL
## 1293 1104 1028
## EXTREME COLD/WIND CHILL HEAT HIGH SURF
## 1002 767 725
## TROPICAL STORM FLASH FLOODING EXTREME COLD
## 690 682 655
## COASTAL FLOOD LAKE-EFFECT SNOW FLOOD/FLASH FLOOD
## 650 636 624
## LANDSLIDE SNOW COLD/WIND CHILL
## 600 587 539
## FOG RIP CURRENT MARINE HAIL
## 538 470 442
## DUST STORM AVALANCHE WIND
## 427 386 340
## RIP CURRENTS STORM SURGE FREEZING RAIN
## 304 261 250
## URBAN FLOOD HEAVY SURF/HIGH SURF EXTREME WINDCHILL
## 249 228 204
## STRONG WINDS DRY MICROBURST ASTRONOMICAL LOW TIDE
## 196 186 174
## HURRICANE RIVER FLOOD LIGHT SNOW
## 174 173 154
First we created another variable for cleaned Event Types. With only targeting the high frequency event types some degree of data cleaning is performed. Such as all uppercase converted to lower case, space trimming, “/” replacing, winds" converted into “wind”, “tstm” into “thunderstorm”. Even though there are other event types to be consolidated (e.g “Excessive Heat” vs. “Heat”, “Hail” vs. “hail 1.5”) documentation doesn’t make it clear that they can be classified as the same.So they were left out.
Storm$CL_EVTYPE <- Storm$EVTYPE
Storm$CL_EVTYPE <- tolower(Storm$CL_EVTYPE)
Storm$CL_EVTYPE <- gsub("/"," ", Storm$CL_EVTYPE)
Storm$CL_EVTYPE <- gsub(" "," ", Storm$CL_EVTYPE)
Storm$CL_EVTYPE <- gsub("winds","wind", Storm$CL_EVTYPE)
Storm$CL_EVTYPE <- gsub("tstm","thunderstorm", Storm$CL_EVTYPE)
Finally we convert our old EVTYPE variable with our new variable and delete the CL_EVTYPE variable and take a look to the cleaned event types
Storm$EVTYPE <- Storm$CL_EVTYPE
Storm$CL_EVTYPE <- NULL
head(sort(table(Storm$EVTYPE),decreasing=TRUE),n=60)
##
## thunderstorm wind hail tornado
## 323362 288661 60652
## flash flood flood high wind
## 54277 25327 21748
## lightning heavy snow marine thunderstorm wind
## 15754 15708 11987
## heavy rain winter storm winter weather
## 11742 11433 7045
## funnel cloud waterspout strong wind
## 6844 3796 3773
## urban sml stream fld wildfire blizzard
## 3392 2761 2719
## drought ice storm excessive heat
## 2488 2006 1678
## wild forest fire frost freeze dense fog
## 1457 1343 1293
## thunderstorm wind hail winter weather mix extreme cold wind chill
## 1115 1110 1002
## heat high surf tropical storm
## 767 734 690
## flash flooding extreme cold coastal flood
## 682 657 656
## lake-effect snow flood flash flood snow
## 636 626 617
## landslide cold wind chill fog
## 600 539 538
## rip current marine hail dust storm
## 470 442 427
## avalanche wind rip currents
## 386 382 304
## storm surge freezing rain urban flood
## 261 260 251
## heavy surf high surf extreme windchill dry microburst
## 228 204 186
## coastal flooding light snow astronomical low tide
## 183 176 174
## hurricane river flood record warmth
## 174 173 154
## dust devil storm surge tide marine high wind
## 149 148 135
Inspecting most harmful events, total fatalities and injuries could be revealing.
head(sort(tapply(Storm$FATALITIES,Storm$EVTYPE,sum),decreasing=TRUE))
## tornado excessive heat flash flood heat
## 5633 1903 978 937
## lightning thunderstorm wind
## 816 701
head(sort(tapply(Storm$INJURIES,Storm$EVTYPE,sum),decreasing=TRUE))
## tornado thunderstorm wind flood excessive heat
## 91346 9363 6789 6525
## lightning heat
## 5230 2100
In order to have separete lists with top ten total values we can do the following subsetting
FatalitiesByEvent <- aggregate(FATALITIES ~EVTYPE,Storm,sum)
TopTenFatalities <- FatalitiesByEvent[order(FatalitiesByEvent$FATALITIES,decreasing=TRUE),]
TopTenFatalities <- TopTenFatalities[1:10,]
TopTenFatalities
## EVTYPE FATALITIES
## 732 tornado 5633
## 115 excessive heat 1903
## 137 flash flood 978
## 234 heat 937
## 401 lightning 816
## 660 thunderstorm wind 701
## 152 flood 470
## 505 rip current 368
## 306 high wind 283
## 19 avalanche 224
The same for Injuries:
InjuriesByEvent <- aggregate(INJURIES ~EVTYPE,Storm,sum)
TopTenInjuries <- InjuriesByEvent[order(InjuriesByEvent$INJURIES,decreasing=TRUE),]
TopTenInjuries <- TopTenInjuries[1:10,]
TopTenInjuries
## EVTYPE INJURIES
## 732 tornado 91346
## 660 thunderstorm wind 9363
## 152 flood 6789
## 115 excessive heat 6525
## 401 lightning 5230
## 234 heat 2100
## 371 ice storm 1975
## 137 flash flood 1777
## 306 high wind 1439
## 204 hail 1361
To see the the total picture of most harmful event types with respect to population health (priorizing fatalities) we can look to a combined table
HealthDamage <- merge(TopTenFatalities,TopTenInjuries,by.x="EVTYPE",by.y="EVTYPE",all=TRUE)
HealthDamage <- HealthDamage[order(HealthDamage$FATALITIES,decreasing=TRUE),]
HealthDamage
## EVTYPE FATALITIES INJURIES
## 12 tornado 5633 91346
## 2 excessive heat 1903 6525
## 3 flash flood 978 1777
## 6 heat 937 2100
## 9 lightning 816 5230
## 11 thunderstorm wind 701 9363
## 4 flood 470 6789
## 10 rip current 368 NA
## 7 high wind 283 1439
## 1 avalanche 224 NA
## 5 hail NA 1361
## 8 ice storm NA 1975
With this table wee see that tornados and excessive heat are the leading events for fatalities and injuries.
To show the results side by side we have utilized the reshape2 package and created long data frame with the melt function. And then created a bar plot with ggplot2.
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.1.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
TopFive <- HealthDamage[1:5,]
LongTable <- melt(TopFive,id="EVTYPE", measure.vars=c("FATALITIES","INJURIES"))
ggplot(LongTable, aes(x=EVTYPE,y=value,fill=factor(variable)))+geom_bar(stat="identity",position="dodge",width=.5)+ scale_fill_discrete(name="") + xlab("Event Type") + ylab("Total Incidents")
Another point of interest would be the economic impact. In order to calculate the damage, the exponent variables are used.To understand both variables are summarizied with table function.
table(Storm$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
table(Storm$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
With this short look to the variables and the help of Storm Data Documentation, a series of steps are applied to data and number symbols such as “k”, “m” converted to the appropriate numbers for multiplication. All the remaining ambiguous characters converted into 1 and other numbers left as they are.
Storm$PropEXP <- Storm$PROPDMGEXP
Storm$PropEXP <- gsub("[Kk]","1000",Storm$PropEXP)
Storm$PropEXP <- gsub("[Mm]","1000000",Storm$PropEXP)
Storm$PropEXP <- gsub("B","1000000000",Storm$PropEXP)
Storm$PropEXP <- gsub("[Hh]","1",Storm$PropEXP)
Storm$PropEXP <- gsub("^0","1",Storm$PropEXP)
Storm$PropEXP <- as.integer(Storm$PropEXP)
## Warning: NAs introduced by coercion
Storm$PropEXP[is.na(Storm$PropEXP)] <- 1
Storm$PROPDMGEXP <- Storm$PropEXP
Storm$PropEXP <-NULL
table(Storm$PROPDMGEXP)
##
## 1 2 3 4 5 6 7 8 1000 1e+06
## 466196 13 4 4 28 4 5 1 424665 11337
## 1e+09
## 40
The same was done for the CROPEXP variable.
Storm$CropEXP <- Storm$CROPDMGEXP
Storm$CropEXP <- gsub("[Kk]","1000",Storm$CropEXP)
Storm$CropEXP <- gsub("[Mm]","1000000",Storm$CropEXP)
Storm$CropEXP <- gsub("B","1000000000",Storm$CropEXP)
Storm$CropEXP <- gsub("^0","1",Storm$CropEXP)
Storm$CropEXP <- as.integer(Storm$CropEXP)
## Warning: NAs introduced by coercion
Storm$CropEXP[is.na(Storm$CropEXP)] <- 1
Storm$CROPDMGEXP <- Storm$CropEXP
Storm$CropEXP <-NULL
table(Storm$CROPDMGEXP)
##
## 1 2 1000 1e+06 1e+09
## 618439 1 281853 1995 9
To see the total damage two variables added into a new Total Damage Variable after multiplying with the respective “..EXP variable.
Storm$TOTDAM <- (Storm$PROPDMG *Storm$PROPDMGEXP ) +(Storm$CROPDMG * Storm$CROPDMGEXP)
As the next step We are aggregating the total damages by Event Types:
DamageByEvent <- aggregate(TOTDAM ~EVTYPE,Storm,sum)
Checking the dimension of this data shows there are a lot of type of events, a closer look to top ten would be more meaningful.
TopFiveDamage <- DamageByEvent[order(DamageByEvent$TOTDAM,decreasing=TRUE),]
TopFiveDamage <- TopFiveDamage[1:5,]
TopFiveDamage
## EVTYPE TOTDAM
## 152 flood 150319678257
## 355 hurricane typhoon 71913712800
## 732 tornado 57352114458
## 575 storm surge 43323541000
## 204 hail 18758221641
Plot showing the top five event types with most total economic damages :
ggplot(TopFiveDamage,aes(x=EVTYPE,y=TOTDAM))+ geom_bar(stat="identity",position="dodge",fill="blue",width=.5)+xlab("Event Type") + ylab("Total Damage")
Analysis show that tornados and excessive heat are the event types with most injuries and fatalities while flood, hurricane typhoon and tornados causing the biggest economic damage. Even though there are common event types in both economic and health damages they do not overlap totally. Excessive heat and heat are some of top causes of fatalities while they are causing considerably less economic damage. This could be one of the areas that needs further analysis.