This report is published to identify which type of storm events, across the United States, that are (1) most harmful in terms of population health and (2) has greatest economic consequences. Data used is from U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, having 902,297 rows and 37 columns of data. While data availability is from 1950 to 2011, only those data having fatalities and injuries, and damages to crops and properties were selected, which results in two datasets of 21,929 records (3 columns) and 245,031 records (5 columns) respectively, for purpose of this report. The selected datasets were further processed to map the events into 48 types as per the National Weather Service Storm Data Documentation. Further data cleaning is done to fatalities, casualties, property damage exponent, and crop damage exponents column of the datasets. Finally, the data is aggregated and sorted accordingly to derive the answer. Based on the analysis, Tornado is most harmful in terms of population health (fatalities and injuries totalling 96,997 people) while Flood caused greatest economic consequences (damages of 161 billion dollars).
df <- read.csv("repdata-data-StormData.csv.bz2")
#understand dataset
head(df)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
summary(df)
## STATE__ BGN_DATE BGN_TIME
## Min. : 1.0 5/25/2011 0:00:00: 1202 12:00:00 AM: 10163
## 1st Qu.:19.0 4/27/2011 0:00:00: 1193 06:00:00 PM: 7350
## Median :30.0 6/9/2011 0:00:00 : 1030 04:00:00 PM: 7261
## Mean :31.2 5/30/2004 0:00:00: 1016 05:00:00 PM: 6891
## 3rd Qu.:45.0 4/4/2011 0:00:00 : 1009 12:00:00 PM: 6703
## Max. :95.0 4/2/2006 0:00:00 : 981 03:00:00 PM: 6700
## (Other) :895866 (Other) :857229
## TIME_ZONE COUNTY COUNTYNAME STATE
## CST :547493 Min. : 0.0 JEFFERSON : 7840 TX : 83728
## EST :245558 1st Qu.: 31.0 WASHINGTON: 7603 KS : 53440
## MST : 68390 Median : 75.0 JACKSON : 6660 OK : 46802
## PST : 28302 Mean :100.6 FRANKLIN : 6256 MO : 35648
## AST : 6360 3rd Qu.:131.0 LINCOLN : 5937 IA : 31069
## HST : 2563 Max. :873.0 MADISON : 5632 NE : 30271
## (Other): 3631 (Other) :862369 (Other):621339
## EVTYPE BGN_RANGE BGN_AZI
## HAIL :288661 Min. : 0.000 :547332
## TSTM WIND :219940 1st Qu.: 0.000 N : 86752
## THUNDERSTORM WIND: 82563 Median : 0.000 W : 38446
## TORNADO : 60652 Mean : 1.484 S : 37558
## FLASH FLOOD : 54277 3rd Qu.: 1.000 E : 33178
## FLOOD : 25326 Max. :3749.000 NW : 24041
## (Other) :170878 (Other):134990
## BGN_LOCATI END_DATE END_TIME
## :287743 :243411 :238978
## COUNTYWIDE : 19680 4/27/2011 0:00:00: 1214 06:00:00 PM: 9802
## Countywide : 993 5/25/2011 0:00:00: 1196 05:00:00 PM: 8314
## SPRINGFIELD : 843 6/9/2011 0:00:00 : 1021 04:00:00 PM: 8104
## SOUTH PORTION: 810 4/4/2011 0:00:00 : 1007 12:00:00 PM: 7483
## NORTH PORTION: 784 5/30/2004 0:00:00: 998 11:59:00 PM: 7184
## (Other) :591444 (Other) :653450 (Other) :622432
## COUNTY_END COUNTYENDN END_RANGE END_AZI
## Min. :0 Mode:logical Min. : 0.0000 :724837
## 1st Qu.:0 NA's:902297 1st Qu.: 0.0000 N : 28082
## Median :0 Median : 0.0000 S : 22510
## Mean :0 Mean : 0.9862 W : 20119
## 3rd Qu.:0 3rd Qu.: 0.0000 E : 20047
## Max. :0 Max. :925.0000 NE : 14606
## (Other): 72096
## END_LOCATI LENGTH WIDTH
## :499225 Min. : 0.0000 Min. : 0.000
## COUNTYWIDE : 19731 1st Qu.: 0.0000 1st Qu.: 0.000
## SOUTH PORTION : 833 Median : 0.0000 Median : 0.000
## NORTH PORTION : 780 Mean : 0.2301 Mean : 7.503
## CENTRAL PORTION: 617 3rd Qu.: 0.0000 3rd Qu.: 0.000
## SPRINGFIELD : 575 Max. :2315.0000 Max. :4400.000
## (Other) :380536
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
## WFO STATEOFFIC
## :142069 :248769
## OUN : 17393 TEXAS, North : 12193
## JAN : 13889 ARKANSAS, Central and North Central: 11738
## LWX : 13174 IOWA, Central : 11345
## PHI : 12551 KANSAS, Southwest : 11212
## TSA : 12483 GEORGIA, North and Central : 11120
## (Other):690738 (Other) :595920
## ZONENAMES
## :594029
## :205988
## GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M : 639
## GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA : 592
## JEFFERSON - JEFFERSON : 303
## MADISON - MADISON : 302
## (Other) :100444
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## Min. : 0 Min. :-14451 Min. : 0 Min. :-14455
## 1st Qu.:2802 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0
## Median :3540 Median : 8707 Median : 0 Median : 0
## Mean :2875 Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.:4019 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. :9706 Max. : 17124 Max. :9706 Max. :106220
## NA's :47 NA's :40
## REMARKS REFNUM
## :287433 Min. : 1
## : 24013 1st Qu.:225575
## Trees down.\n : 1110 Median :451149
## Several trees were blown down.\n : 568 Mean :451149
## Trees were downed.\n : 446 3rd Qu.:676723
## Large trees and power lines were blown down.\n: 432 Max. :902297
## (Other) :588295
str(df)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
From the summary, the relevant columns (EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) does not contain NAs.
Create a list of the 48 events for crosschecking after mapping the event types.
Events <- as.factor(c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Freezing Fog", "Frost/Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon", "Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather"))
Load relevant libraries
library(stringr)
library(ggplot2)
library(scales)
The following steps were done to identify the most harmful storm event.
Select columns EVTYPE, FATALITIES and INJURIES. Select rows where FATALITIES OR INJURIES are more than zero.
df2 <- df[,c(8,23,24)]
df2.health <-df2[df2$FATALITIES>0|df2$INJURIES>0,]
nrow(df2.health)
## [1] 21929
sum(complete.cases(df2.health))
## [1] 21929
This results in 21929 rows of data. All columns does not contain NAs as indicated by sum(complete.cases) which returns same value as nrow().
Convert values to upper case, remove leading/trailing whitespaces,remove whitespaces in between.
#upper case
df2.health$EVTYPE <- toupper(df2.health$EVTYPE)
#remove leading or trailing white spaces
df2.health$EVTYPE <- str_trim(df2.health$EVTYPE)
#remove double spaces
df2.health$EVTYPE <- gsub(" "," ",df2.health$EVTYPE)
Map the values in EVTYPE column to 48 types. Most mapping were done based on pattern matching. Some required further reading, by referring to the REMARKS column of the raw dataset, the storm data documentation or via Internet searches.
#correct Avalanche
df2.health$EVTYPE <- gsub("AVALANCE","AVALANCHE",df2.health$EVTYPE)
#correct Blizzard
df2.health$EVTYPE <- gsub("BLOWING SNOW|BLIZZARD SUMMARY","BLIZZARD",df2.health$EVTYPE)
#correct Coastal Flood
df2.health$EVTYPE <- gsub("^COASTAL FLOODING.*|^TIDAL FLOOD$|^TIDAL FLOODING$","COASTAL FLOOD",df2.health$EVTYPE)
#correct Cold/Wind Chill
df2.health$EVTYPE <- gsub("^LOW TEMPERATURE$|^COLD$|^COLD/WINDS$|COLD WEATHER|^COLD TEMPERATURE$|^COLD WAVE$|UNSEASONABLY COLD|HYPOTHERMIA.*|^COLD AND SNOW$|^FOG AND COLD/WIND CHILLS$","COLD/WIND CHILL",df2.health$EVTYPE)
#correct Debris Flow
df2.health$EVTYPE <- gsub("^MUDSLIDE.*|^LANDSLIDE.*","DEBRIS FLOW",df2.health$EVTYPE)
#correct Dense Fog
df2.health$EVTYPE <- gsub("^FOG$|^FOG AND COLD TEMPERATURES$","DENSE FOG",df2.health$EVTYPE)
#correct Dense Smoke - not found in dataset
#correct Drought - not found in dataset
#correct Dust Devil
df2.health$EVTYPE <- gsub("^OTHER$","DUST DEVIL",df2.health$EVTYPE)
#correct Dust Storm - not found in dataset
#correct Excessive Heat
df2.health$EVTYPE <- gsub("^EXTREME HEAT$|DROUGHT/EXCESSIVE HEAT|RECORD HEAT|RECORD/EXCESSIVE HEAT|^HEAT WAVE.*|HYPERTHERMIA.*","EXCESSIVE HEAT",df2.health$EVTYPE)
#correct Extreme Cold/Wind Chill
df2.health$EVTYPE <- gsub("^EXTREME WINDCHILL$|^EXTENDED COLD$|^EXTREME COLD$|RECORD COLD","EXTREME COLD/WIND CHILL",df2.health$EVTYPE)
#correct Flash Flood
df2.health$EVTYPE <- gsub("^FLASH FLOOD/FLOOD$|LOCAL FLASH FLOOD|^FLASH FLOODING$|^FLASH FLOODING/FLOOD$|^FLASH FLOODS$","FLASH FLOOD",df2.health$EVTYPE)
#correct Flood
df2.health$EVTYPE <- gsub("^FLOOD/FLASH FLOOD$|^FLOOD/FLOOD$|FLOOD & HEAVY RAIN|^FLOODING$|^LOCAL FLOOD|^MAJOR FLOOD$|^MINOR FLOOD$|^MINOR FLOODING$|^URBAN FLOOD$|^URBAN FLOODING$|^URBAN FLOODS$|URBAN/STREET FLOODING|URBAN/SML STREAM FLDG|URBAN/SML STREAM FLD|URBAN/SMALL STRM FLDG|URBAN/SMALL STREAM FLOODING|URBAN/SMALL STREAM FLOOD|URBAN/SMALL FLOODING|URBAN SMALL STREAM FLOOD|URBAN AND SMALL STREAM FLOODIN|URBAN AND SMALL STREAM FLOOD|SMALL STREAM AND URBAN FLOOD|SMALL STREAM AND URBAN FLOODIN|SMALL STREAM FLOOD|SMALL STREAM FLOODING|SMALL STREAM URBAN FLOOD|SMALL STREAM/URBAN FLOOD|SML STREAM FLD|^RURAL FLOOD$|^RIVER FLOOD$|^RIVER FLOODING$|RIVER AND STREAM FLOOD|^FLOOD/RIVER FLOOD$","FLOOD",df2.health$EVTYPE)
#correct Freezing Fog
df2.health$EVTYPE <- gsub("^GLAZE$|GLAZE/ICE STORM","FREEZING FOG",df2.health$EVTYPE)
#correct Frost/Freeze
df2.health$EVTYPE <- gsub("^ICE$|^ICE ROADS$|^ICY ROADS$|ICE ON ROAD|BLACK ICE|LIGHT SNOW|FROST\\FREEZE|HARD FREEZE|LATE FREEZE|DAMAGING FREEZE|AGRICULTURAL FREEZE|^FREEZING.*|^FROST$|^FREEZE$","FROST/FREEZE",df2.health$EVTYPE)
#correct Funnel Cloud - not found in dataset
#correct Hail
df2.health$EVTYPE <- gsub("^GUSTY WIND/HAIL$|DEEP HAIL|^WIND/HAIL$|HAIL ALOFT|HAIL FLOODING|HAIL DAMAGE|HAIL/ICY ROADS|LATE SEASON HAIL|^HAILSTORM$|^HAILSTORMS$|NON SEVER HAIL|SMALL HAIL","HAIL",df2.health$EVTYPE)
#correct Heat
df2.health$EVTYPE <- gsub("^WARM WEATHER$|^UNSEASONABLY WARM.*","HEAT",df2.health$EVTYPE)
#correct Heavy Rain
df2.health$EVTYPE <- gsub("^RAIN/WIND$|^HIGH WATER$|^HEAVY RAINFALL$|EXCESSIVE RAINFALL|^DROWNING$|RAPIDLY RISING WATER|TORRENTIAL RAIN|^HEAVY RAINS$","HEAVY RAIN",df2.health$EVTYPE)
#correct Heavy Snow
df2.health$EVTYPE <- gsub("^SNOW$|^SNOW AND ICE$|^SNOW/ BITTER COLD$|^SNOW/HIGH WINDS$|FALLING SNOW/ICE|^EXCESSIVE SNOW$|^SNOW SQUALL.*|^HEAVY SNOW.*","HEAVY SNOW",df2.health$EVTYPE)
#correct High Surf
df2.health$EVTYPE <- gsub("^ROGUE WAVE$|^ROUGH SEAS$|^ROUGH SURF$|^MARINE ACCIDENT$|^HEAVY SEAS$|^HIGH SEAS$|^HIGH SWELLS$|^HIGH WAVES$|HAZARDOUS SURF|HEAVY SURF.*","HIGH SURF",df2.health$EVTYPE)
#correct High Wind
df2.health$EVTYPE <- gsub("^NON THUNDERSTORM WIND$|^NON-SEVERE WIND DAMAGE$|^HIGH WIND.*|^HIGH$","HIGH WIND",df2.health$EVTYPE)
#correct Hurricane/Typhoon
df2.health$EVTYPE <- gsub("^HURRICANE.*|^TYPHOON$","HURRICANE/TYPHOON",df2.health$EVTYPE)
#correct Ice Storm
df2.health$EVTYPE <- gsub("^ICE STORM/FLASH FLOOD$","ICE STORM",df2.health$EVTYPE)
#correct Lakeshore Flood - not found in dataset
#correct "Lake-Effect Snow - not found in dataset
#correct Lightning
df2.health$EVTYPE <- gsub("LIGHTING|LIGHTNING .*|LIGHTNING.|LIGNTNING","LIGHTNING",df2.health$EVTYPE)
#correct Marine Hail - not found in dataset
#correct Marine High Wind - not found in dataset
#correct Marine Strong Wind
df2.health$EVTYPE <- gsub("MARINE MISHAP","MARINE STRONG WIND",df2.health$EVTYPE)
#correct Marine Thunderstorm Wind
df2.health$EVTYPE <- gsub("^MARINE TSTM WIND$","MARINE THUNDERSTORM WIND",df2.health$EVTYPE)
#correct Rip Current
df2.health$EVTYPE <- gsub("^RIP CURRENT.*","RIP CURRENT",df2.health$EVTYPE)
#correct Seiche - not found in dataset
#correct Sleet - not found in dataset
#correct Storm Surge/Tide
df2.health$EVTYPE <- gsub("^STORM SURGE$","STORM SURGE/TIDE",df2.health$EVTYPE)
#correct Strong Wind
df2.health$EVTYPE <- gsub("^NON TSTM WIND$|^WIND$|^WINDS$|^WIND STORM$|^STRONG WINDS$|^GUSTY WIND.*","STRONG WIND",df2.health$EVTYPE)
#correct Thunderstorm Wind
df2.health$EVTYPE <- gsub("^THUNDERSTORM$|^THUNDERSTORM WIND (G40)$|^THUNDERSTORM WIND G52$|^THUNDERSTORM WINDS$|^THUNDERSTORM WINDS 13$|^THUNDERSTORM WINDS/HAIL$|^THUNDERSTORM WINDSS$|^THUNDERSTORMS WIND$|^THUNDERSTORMS WINDS$|^THUNDERTORM WINDS$|^TORNADOES, TSTM WIND, HAIL$|^THUNDERSTORM$|^TSTM WIND$|^TSTM WIND \\(G.*|THUNDERSTORM WINS|DRY MICROBURST|DRY MIRCOBURST WIND|THUNDERESTORM WIND|^THUNDERSTORM WIND 13|THUNDERSTORM WIND.|^THUNDERSTORMW.*|THUNDERSTROM WIND|^THUNDERTORM WIND$|^THUNDERSTORMS$|^THUNDERSTORM WIND .*|WHIRLWIND|^TSTM WIND/HAIL$","THUNDERSTORM WIND",df2.health$EVTYPE)
#correct Tornado
df2.health$EVTYPE <- gsub("TORNADO DEBRIS|TORNDAO|^TORNADOES$|TORNADO F0|TORNADO F1|TORNADO F2|TORNADO F3|COLD AIR TORNADO|^TORNADOS$","TORNADO",df2.health$EVTYPE)
#correct Tropical Depression - not found in dataset
#correct Tropical Storm
df2.health$EVTYPE <- gsub("COASTAL STORM|COASTALSTORM|TROPICAL STORM.*","TROPICAL STORM",df2.health$EVTYPE)
#correct Waterspout
df2.health$EVTYPE <- gsub("WATERSPOUT TORNADO|WATERSPOUT/TORNADO","WATERSPOUT",df2.health$EVTYPE)
#correct Wildfire
df2.health$EVTYPE <- gsub("WILD FIRES|WILD/FOREST FIRE|WILD/FOREST FIRES|WILDFIRES|BRUSH FIRE|BRUSH FIRES|FOREST FIRES|GRASS FIRES","WILDFIRE",df2.health$EVTYPE)
#correct Winter Storm
df2.health$EVTYPE <- gsub("^RAIN/SNOW$|THUNDERSNOW|^MIXED PRECIP$|WINTER STORM HIGH WINDS|WINTER STORMS","WINTER STORM",df2.health$EVTYPE)
#correct Winter Weather
df2.health$EVTYPE <- gsub("WINTER WEATHER MIX|WINTER WEATHER/MIX|WINTRY MIX","WINTER WEATHER",df2.health$EVTYPE)
#final cleanup
df2.health$EVTYPE <- gsub("^THUNDERSTORM WINDS$","THUNDERSTORM WIND",df2.health$EVTYPE)
df2.health$EVTYPE <- gsub("^HEAVY RAINFALL$","HEAVY RAIN",df2.health$EVTYPE)
After the above mapping, all 21,929 values in EVTYPE has been mapped to the 48 Event Types. This can be verified by below code:
df2.health[!df2.health$EVTYPE %in% toupper(Events),]
## [1] EVTYPE FATALITIES INJURIES
## <0 rows> (or 0-length row.names)
There are no values that doesn’t match any event in the Events list.
Next step is to aggregate the FATALITIES and INJURIES values by EVTYPE. Both values are merged into a table df2.sum.
df2.health.Fsum <- aggregate(FATALITIES ~ EVTYPE, data = df2.health,sum)
df2.health.Isum <- aggregate(INJURIES ~ EVTYPE, data = df2.health,sum)
df2.sum <- cbind(df2.health.Fsum,df2.health.Isum$INJURIES)
For purpose of this analysis, criteria for establishing harm to public health is to add up number of fatalities and number of injuries as casualties.
df2.sum$CASUALTIES <- df2.sum$FATALITIES+df2.sum$`df2.health.Isum$INJURIES`
Rename columns appropriately.
colnames(df2.sum) <- c("EventType","Fatalities","Injuries","Total")
To identify the top event, the dataset is ordered in descending order by TOTAL column. Then the top 5 events are selected for comparison.
df2.ordered <- df2.sum[order(df2.sum$Total,decreasing = TRUE),]
df2.top <- df2.ordered[1:5,]
knitr::kable(df2.top)
| EventType | Fatalities | Injuries | Total | |
|---|---|---|---|---|
| 33 | TORNADO | 5633 | 91364 | 96997 |
| 32 | THUNDERSTORM WIND | 740 | 9537 | 10277 |
| 10 | EXCESSIVE HEAT | 2202 | 7124 | 9326 |
| 13 | FLOOD | 529 | 6888 | 7417 |
| 24 | LIGHTNING | 817 | 5232 | 6049 |
ggplot(df2.top,aes(EventType,Total)) + geom_bar(stat="identity",fill=rainbow(5)) + labs(title="Top 5 Most Harmful Events", x="Event Type", y="Total Casualties")
Based on the table and plot above, the most harmful storm type is TORNADO which caused casualties (fatalities + injuries) of 96,997. It is far ahead in terms of casualties if compared to the second ranked event.
The following steps were done to identify the storm event with most economic consequences.
Select columns EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. Select rows where PROPDMG OR CROPDMG values are more than zero.
df3 <- df[,c(8,25,26,27,28)]
df3.economy <-df3[df3$PROPDMG>0|df3$CROPDMG>0,]
nrow(df3.economy)
## [1] 245031
sum(complete.cases(df3.economy))
## [1] 245031
This results in 245031 rows of data. Again, there is no NAs as the sum(complete.cases()) returns same value as nrow().
Convert values to upper case, remove leading/trailing whitespaces, remove whitespaces in between.
#upper case
df3.economy$EVTYPE <- toupper(df3.economy$EVTYPE)
#remove leading or trailing white spaces (reduces 8 duplicates)
df3.economy$EVTYPE <- str_trim(df3.economy$EVTYPE)
#remove double spaces
df3.economy$EVTYPE <- gsub(" "," ",df3.economy$EVTYPE)
There were 2 rows with invalid data for EVTYPE (values ? and “APACHE”). The strategy used is to remove the rows.
#remove rows with invalid value
df3.economy <- df3.economy[-grep("\\?|APACHE",df3.economy$EVTYPE),]
Map the values in EVTYPE column to 48 types. Most mapping were done based on pattern matching. Some required further reading, by referring to the REMARKS column of the raw dataset, the storm data documentation or via Internet searches. For values stated as OTHER, the mapping was done based on their respective REMARKS columns.
#correct OTHER
df3.economy["399112",1] <- "DUST DEVIL"
df3.economy["296122",1] <- "DUST DEVIL"
df3.economy["249470",1] <- "FLOOD"
df3.economy$EVTYPE <- gsub("^OTHER$","HEAVY RAIN",df3.economy$EVTYPE)
#correct Avalanche
df3.economy$EVTYPE <- gsub("AVALANCE","AVALANCHE",df3.economy$EVTYPE)
#correct Blizzard
df3.economy$EVTYPE <- gsub("^GROUND BLIZZARD$|^BLIZZARD/WINTER STORM$|BLOWING SNOW|BLIZZARD SUMMARY","BLIZZARD",df3.economy$EVTYPE)
#correct Coastal Flood
df3.economy$EVTYPE <- gsub("^EROSION/CSTL FLOOD$|^COASTAL SURGE$|^COASTAL FLOODING.*|^TIDAL FLOOD$|^TIDAL FLOODING$","COASTAL FLOOD",df3.economy$EVTYPE)
#correct Cold/Wind Chill
df3.economy$EVTYPE <- gsub("^UNSEASONABLE COLD$|^COOL AND WET$|^LOW TEMPERATURE$|^COLD$|^COLD/WINDS$|COLD WEATHER|^COLD TEMPERATURE$|^COLD WAVE$|^UNSEASONABLY COLD$|^HYPOTHERMIA.*|^COLD AND SNOW$|^FOG AND COLD/WIND CHILLS$","COLD/WIND CHILL",df3.economy$EVTYPE)
#correct Debris Flow
df3.economy$EVTYPE <- gsub("^ROCK SLIDE$|^LANDSLUMP$|^ICE FLOES$|^MUDSLIDE.*|^MUD SLIDE.*|^LANDSLIDE.*","DEBRIS FLOW",df3.economy$EVTYPE)
#correct Dense Fog
df3.economy$EVTYPE <- gsub("^FOG$|^FOG AND COLD TEMPERATURES$","DENSE FOG",df3.economy$EVTYPE)
#correct Dense Smoke - not found in dataset
#correct Drought - not found in dataset
#correct Dust Devil
df3.economy$EVTYPE <- gsub("^LANDSPOUT$|^DUST DEVIL WATERSPOUT$|^BLOWING DUST$","DUST DEVIL",df3.economy$EVTYPE)
#correct Dust Storm
df3.economy$EVTYPE <- gsub("^DUST STORM/HIGH WINDS$","DUST STORM",df3.economy$EVTYPE)
#correct Excessive Heat
df3.economy$EVTYPE <- gsub("^EXTREME HEAT$|DROUGHT/EXCESSIVE HEAT|^RECORD HEAT$|RECORD/EXCESSIVE HEAT|^HEAT WAVE.*|^HYPERTHERMIA.*","EXCESSIVE HEAT",df3.economy$EVTYPE)
#correct Extreme Cold/Wind Chill
df3.economy$EVTYPE <- gsub("^EXTREME WIND CHILL$|^EXTREME WINDCHILL$|^EXTENDED COLD$|^EXTREME COLD$|^RECORD COLD$","EXTREME COLD/WIND CHILL",df3.economy$EVTYPE)
#correct Flash Flood
df3.economy$EVTYPE <- gsub("^FLASH FLOODING/THUNDERSTORM WI$|^FLASH FLOOD WINDS$|^FLOODING/HEAVY RAIN$|^FLOOD/FLASH$|^FLOOD FLASH$|^FLASH FLOOD/LANDSLIDE$|^FLASH FLOOD/ STREET$|^FLASH FLOOD LANDSLIDES$|^FLASH FLOOD/$|^FLASH FLOOD FROM ICE JAMS$|^FLASH FLOOD - HEAVY RAIN$|^FLASH FLOOD/FLOOD$|LOCAL FLASH FLOOD|^FLASH FLOODING$|^FLASH FLOODING/FLOOD$|^FLASH FLOODS$","FLASH FLOOD",df3.economy$EVTYPE)
#correct Flood
df3.economy$EVTYPE <- gsub("^SNOWMELT FLOODING$|^FLOODS$|^FLOOD/RAIN/WINDS$|^FLOOD/FLASH/FLOOD$|^FLOOD/FLASHFLOOD$|^BREAKUP FLOODING$|^FLOOD/FLASH FLOOD$|^FLOOD/FLOOD$|FLOOD & HEAVY RAIN|^FLOODING$|^LOCAL FLOOD|^MAJOR FLOOD$|^MINOR FLOOD$|^MINOR FLOODING$|^URBAN FLOOD$|^URBAN FLOODING$|^URBAN FLOODS$|URBAN/STREET FLOODING|URBAN/SML STREAM FLDG|URBAN/SML STREAM FLD|URBAN/SMALL STRM FLDG|URBAN/SMALL STREAM FLOODING|URBAN/SMALL STREAM FLOOD|URBAN/SMALL FLOODING|URBAN SMALL STREAM FLOOD|URBAN AND SMALL STREAM FLOODIN|URBAN AND SMALL STREAM FLOOD|SMALL STREAM AND URBAN FLOOD|SMALL STREAM AND URBAN FLOODIN|SMALL STREAM FLOOD|SMALL STREAM FLOODING|SMALL STREAM URBAN FLOOD|SMALL STREAM/URBAN FLOOD|SML STREAM FLD|^RURAL FLOOD$|^RIVER FLOOD$|^RIVER FLOODING$|RIVER AND STREAM FLOOD|^FLOOD/RIVER FLOOD$|^ICE JAM$|^ICE JAM FLOOD \\(MINOR$|^ICE JAM FLOODING$|^URBAN/SMALL STREAM$|^URBAN SMALL$|^URBAN AND SMALL$","FLOOD",df3.economy$EVTYPE)
#correct Freezing Fog
df3.economy$EVTYPE <- gsub("^GLAZE ICE$|^GLAZE$|GLAZE/ICE STORM","FREEZING FOG",df3.economy$EVTYPE)
#correct Frost/Freeze
df3.economy$EVTYPE <- gsub("^FROST\\\\FREEZE$|^FROST/FREEZEFALL|^EARLY FROST$|^ICE$|^ICE ROADS$|^ICY ROADS$|ICE ON ROAD|BLACK ICE|LIGHT SNOW|HARD FREEZE|LATE FREEZE|DAMAGING FREEZE|AGRICULTURAL FREEZE|^FREEZING.*|^FROST$|^FREEZE$","FROST/FREEZE",df3.economy$EVTYPE)
#correct Funnel Cloud - not found in dataset
#correct Hail
df3.economy$EVTYPE <- gsub("^GUSTY WIND/HAIL$|^DEEP HAIL$|^WIND/HAIL$|HAIL ALOFT|^HAIL FLOODING$|HAIL DAMAGE|HAIL/ICY ROADS|LATE SEASON HAIL|^HAILSTORM$|^HAILSTORMS$|NON SEVER HAIL|SMALL HAIL|HAIL 0.75|HAIL 075|HAIL 100|HAIL 125|HAIL 150|HAIL 175|HAIL 200|HAIL 275|HAIL 450|HAIL 75|^HAIL/WIND$|^HAIL/WINDS$","HAIL",df3.economy$EVTYPE)
#correct Heat
df3.economy$EVTYPE <- gsub("^WARM WEATHER$|^UNSEASONABLY WARM.*","HEAT",df3.economy$EVTYPE)
#correct Heavy Rain
df3.economy$EVTYPE <- gsub("^RAINSTORM$|^RECORD RAINFALL$|^EXCESSIVE WETNESS$|^DAM BREAK$|^COLD AND WET CONDITIONS$|^COLD AND WET$|^RAIN/WIND$|^HIGH WATER$|^HEAVY RAINFALL$|EXCESSIVE RAINFALL|^DROWNING$|RAPIDLY RISING WATER|TORRENTIAL RAIN|^HEAVY RAINS$|^HEAVY RAIN AND FLOOD$|^HEAVY RAIN/HIGH SURF$|^HEAVY RAIN/LIGHTNING$|^HEAVY RAIN/SEVERE WEATHER$|^HEAVY RAIN/SMALL STREAM URBAN$|^HEAVY RAIN/SNOW$|^HEAVY RAINS/FLOODING$|^HEAVY SHOWER$|^HVY RAIN$|^RAIN$|^UNSEASONAL RAIN$","HEAVY RAIN",df3.economy$EVTYPE)
#correct Heavy Snow
df3.economy$EVTYPE <- gsub("^SNOW/BLIZZARD$|^SNOW/ ICE$|^SNOW/COLD$|^SNOW/HEAVY SNOW$|^SNOW/ICE$|^SNOW/ICE STORM$|^SNOW ACCUMULATION$|^SNOW AND HEAVY SNOW$|^LATE SEASON SNOW$|^SNOW$|^SNOW AND ICE$|^SNOW/ BITTER COLD$|^SNOW/HIGH WINDS$|FALLING SNOW/ICE|^EXCESSIVE SNOW$|^SNOW SQUALL.*|^HEAVY SNOW.*|^RECORD SNOW.*","HEAVY SNOW",df3.economy$EVTYPE)
#correct High Surf
df3.economy$EVTYPE <- gsub("^HIGH SURF ADVISORY$|^HIGH TIDES$|^HEAVY SWELLS$|^COASTAL EROSION$|^BEACH EROSION$|^ASTRONOMICAL HIGH TIDE$|^ROGUE WAVE$|^ROUGH SEAS$|^ROUGH SURF$|^MARINE ACCIDENT$|^HEAVY SEAS$|^HIGH SEAS$|^HIGH SWELLS$|^HIGH WAVES$|HAZARDOUS SURF|HEAVY SURF.*","HIGH SURF",df3.economy$EVTYPE)
#correct High Wind
df3.economy$EVTYPE <- gsub("^GRADIENT WIND$|^NON THUNDERSTORM WIND$|^NON-SEVERE WIND DAMAGE$|^HIGH WIND.*|^HIGH$","HIGH WIND",df3.economy$EVTYPE)
#correct Hurricane/Typhoon
df3.economy$EVTYPE <- gsub("^HURRICANE.*|^TYPHOON$","HURRICANE/TYPHOON",df3.economy$EVTYPE)
#correct Ice Storm
df3.economy$EVTYPE <- gsub("^LIGHT FREEZING RAIN$|^ICE STORM/FLASH FLOOD$","ICE STORM",df3.economy$EVTYPE)
#correct Lakeshore Flood
df3.economy$EVTYPE <- gsub("^LAKE FLOOD$","LAKESHORE FLOOD",df3.economy$EVTYPE)
#correct "Lake-Effect Snow
df3.economy$EVTYPE <- gsub("^LAKE EFFECT SNOW$|^HEAVY LAKE SNOW$","LAKE-EFFECT SNOW",df3.economy$EVTYPE)
#correct Lightning
df3.economy$EVTYPE <- gsub("^LIGHTNING AND HEAVY RAIN$|^LIGHTNING FIRE$|^LIGHTNING THUNDERSTORM WINDS$|^LIGHTNING WAUSEON$|^LIGHTNING/HEAVY RAIN$|LIGHTING|LIGHTNING .*|LIGHTNING.|LIGNTNING","LIGHTNING",df3.economy$EVTYPE)
#correct Marine Hail - not found in dataset
#correct Marine High Wind - not found in dataset
#correct Marine Strong Wind
df3.economy$EVTYPE <- gsub("^WIND AND WAVE$|MARINE MISHAP","MARINE STRONG WIND",df3.economy$EVTYPE)
#correct Marine Thunderstorm Wind
df3.economy$EVTYPE <- gsub("^MARINE TSTM WIND$","MARINE THUNDERSTORM WIND",df3.economy$EVTYPE)
#correct Rip Current
df3.economy$EVTYPE <- gsub("^RIP CURRENT.*","RIP CURRENT",df3.economy$EVTYPE)
#correct Seiche - not found in dataset
#correct Sleet
df3.economy$EVTYPE <- gsub("^SLEET/ICE STORM$|^ICE AND SNOW$","SLEET",df3.economy$EVTYPE)
#correct Storm Surge/Tide
df3.economy$EVTYPE <- gsub("^STORM SURGE$","STORM SURGE/TIDE",df3.economy$EVTYPE)
#correct Strong Wind
df3.economy$EVTYPE <- gsub("^STORM FORCE WINDS$|^ICE/STRONG WINDS$|^SEVERE TURBULENCE$|^NON-TSTM WIND$|^NON TSTM WIND$|^WIND$|^WINDS$|^WIND STORM$|^STRONG WINDS$|^GUSTY WIND.*","STRONG WIND",df3.economy$EVTYPE)
#correct Thunderstorm Wind
df3.economy$EVTYPE <- gsub("^WIND DAMAGE$|^THUDERSTORM WINDS$|^SEVERE THUNDERSTORM$|^SEVERE THUNDERSTORMS$|^SEVERE THUNDERSTORM WINDS$|^GUSTNADO$|^THUNDERSTORM$|^THUNDERSTORM WIND (G40)$|^THUNDERSTORM WIND G52$|^THUNDERSTORM WINDS$|^THUNDERSTORM WINDS 13$|^THUNDERSTORM WINDS/HAIL$|^THUNDERSTORM WINDSS$|^THUNDERSTORMS WIND$|^THUNDERSTORMS WINDS$|^THUNDERTORM WINDS$|^TORNADOES, TSTM WIND, HAIL$","THUNDERSTORM WIND",df3.economy$EVTYPE)
df3.economy$EVTYPE <- gsub("^THUNDERSTORM$|^TSTM WIND$|^TSTM WIND \\(G.*|THUNDERSTORM WINS|^DOWNBURST$|DRY MICROBURST|DRY MIRCOBURST WIND|^WET MICROBURST$|^MICROBURST$|^MICROBURST WINDS$|THUNDERESTORM WIND|^THUNDERSTORM WIND 13|^THUNDERSTORM WIND.$|^THUNDERSTORMW.*|THUNDERSTROM WIND|^THUNDERTORM WIND$|^THUNDERSTORMS$|^THUNDERSTORM WIND .*|WHIRLWIND|^TSTM WIND/HAIL$|^THUNDEERSTORM WINDS$","THUNDERSTORM WIND",df3.economy$EVTYPE)
df3.economy$EVTYPE <- gsub("^THUNDERSTORM DAMAGE TO$|^THUNDERSTORM HAIL$|^THUNDERSTORM WIND FLOOD$|^THUNDERSTORM WIND3$|^THUNDERSTORM WINDAIL$|^THUNDERSTORM WINDFLOODING$|^THUNDERSTORM WINDFUNNEL CLOU$|^THUNDERSTORM WINDIGHTNING$|^THUNDERSTORM WINDWNING$|^THUNERSTORM WINDS$|^TSTM WIND.*|^TSTMW$|^TUNDERSTORM WIND$|^THUNDERSTORM WINDHTNING$|^THUNDERSTORM WINDING$|^THUNDERSTORM WINDL$|^THUNDERSTORM WINDNNEL CLOU$|^THUNDERSTORM WINDOODING$","THUNDERSTORM WIND",df3.economy$EVTYPE)
#correct Tornado
df3.economy$EVTYPE <- gsub("TORNADO DEBRIS|TORNDAO|^TORNADOES$|TORNADO F0|TORNADO F1|TORNADO F2|TORNADO F3|COLD AIR TORNADO|^TORNADOS$","TORNADO",df3.economy$EVTYPE)
#correct Tropical Depression - not found in dataset
#correct Tropical Storm
df3.economy$EVTYPE <- gsub("COASTAL STORM|COASTALSTORM|TROPICAL STORM.*","TROPICAL STORM",df3.economy$EVTYPE)
#correct Waterspout
df3.economy$EVTYPE <- gsub("^WATERSPOUT-$|^WATERSPOUT-TORNADO$|WATERSPOUT TORNADO|^WATERSPOUT/ TORNADO$|WATERSPOUT/TORNADO","WATERSPOUT",df3.economy$EVTYPE)
#correct Wildfire
df3.economy$EVTYPE <- gsub("^WILD/FOREST FIRE$|^GRASS FIRES$|^FOREST FIRES$|WILD FIRES|^WILD/FOREST FIRES$|^WILDFIRES$|^BRUSH FIRE$|^BRUSH FIRES$|GRASS FIRES","WILDFIRE",df3.economy$EVTYPE)
#correct Winter Storm
df3.economy$EVTYPE <- gsub("^SNOW/FREEZING RAIN$|^SNOW/SLEET$|^SNOW/SLEET/FREEZING RAIN$|^SNOW FREEZING RAIN$|^SNOW AND ICE STORM$|^HEAVY MIX$|^HEAVY PRECIPITATION$|^RAIN/SNOW$|THUNDERSNOW|^MIXED PRECIP$|^MIXED PRECIPITATION$|WINTER STORM HIGH WINDS|WINTER STORMS","WINTER STORM",df3.economy$EVTYPE)
#correct Winter Weather
df3.economy$EVTYPE <- gsub("WINTER WEATHER MIX|WINTER WEATHER/MIX|WINTRY MIX","WINTER WEATHER",df3.economy$EVTYPE)
#final cleanup
df3.economy$EVTYPE <- gsub("^FROST/FREEZEFALL$","FROST/FREEZE",df3.economy$EVTYPE)
df3.economy$EVTYPE <- gsub("^THUNDERSTORM WIND.*","FROST/FREEZE",df3.economy$EVTYPE)
After the above mapping, all values in EVTYPE has been mapped to the 48 Event Types. This can be verified by below code:
df3.economy[!df3.economy$EVTYPE %in% toupper(Events),]
## [1] EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## <0 rows> (or 0-length row.names)
There are no values that doesn’t match any event in the Events list.
A check of PROPDMG and CROPDMG columns shows that all values are filled and properly formatted, thus requiring no further processing.
#create a copy of dataset to work on
df4.economy <- df3.economy
p <- as.data.frame(unique(df4.economy$PROPDMG))
c <- as.data.frame(unique(df4.economy$CROPDMG))
Remove leading or trailing white spaces:
df4.economy$CROPDMGEXP <- str_trim(df4.economy$CROPDMGEXP)
A list of unique values in the column is generated.
ce <- as.data.frame(unique(df4.economy$CROPDMGEXP))
ce
## unique(df4.economy$CROPDMGEXP)
## 1
## 2 M
## 3 K
## 4 m
## 5 B
## 6 ?
## 7 0
## 8 k
Based on the unique values above, the following strategies were adopted based on the Storm Data documentation:
“” converted to 0
“?” converted to 0
“B” converted to 1,000,000,000 (B refers to billion)
“k” and “K” converted to 1,000 (k or K refers to kilo or thousand)
“m” and “M” converted to 1,000,000 (m or M refers to million)
However, after running step 1 above, the other values changed as below:
“?” converted to “0?0”"
“B” converted to “0B0”
“k” and “K” converted to “0k0” and “0K0”
“m” and “M” converted to “0m0” and “0M0”
Thus the following adjustments were made to the replacement expression codes:
df4.economy$CROPDMGEXP <- gsub("",0,df4.economy$CROPDMGEXP)
df4.economy$CROPDMGEXP <- gsub("0\\?0","0",df4.economy$CROPDMGEXP)
df4.economy$CROPDMGEXP <- gsub("0|000","0",df4.economy$CROPDMGEXP)
df4.economy$CROPDMGEXP <- gsub("0k0|0K0",1000,df4.economy$CROPDMGEXP)
df4.economy$CROPDMGEXP <- gsub("0m0|0M0",1000000,df4.economy$CROPDMGEXP)
df4.economy$CROPDMGEXP <- gsub("0b0|0B0",1000000000,df4.economy$CROPDMGEXP)
After replacing the values, the unique values for CROPDMGEXP are 0, 1e+06, 1000, 1e+09 as shown below:
ce <- as.data.frame(unique(df4.economy$CROPDMGEXP))
ce
## unique(df4.economy$CROPDMGEXP)
## 1 0
## 2 1e+06
## 3 1000
## 4 1e+09
Remove leading or trailing white spaces:
df4.economy$PROPDMGEXP <- str_trim(df4.economy$PROPDMGEXP)
A list of unique values in the column is generated.
pe <- as.data.frame(unique(df4.economy$PROPDMGEXP))
pe
## unique(df4.economy$PROPDMGEXP)
## 1 K
## 2 M
## 3 B
## 4 m
## 5
## 6 +
## 7 0
## 8 5
## 9 6
## 10 4
## 11 h
## 12 2
## 13 7
## 14 3
## 15 H
## 16 -
Based on the unique values above, the following strategies were adopted based on the Storm Data documentation and source listed in Reference section:
“” converted to 0
Integers 1 to 10 converted to 10
“+” converted to 1
“-” converted to 0
“h” and “H” converted to 100 (h or H refers to hundred)
“k” and “K” converted to 1,000 (k or K refers to kilo or thousand)
“m” and “M” converted to 1,000,000 (m or M refers to million)
“B” converted to 1,000,000,000 (B refers to billion)
However, after running step 1 above, the other values changed as below:
Integers 1 to 10 converted to 010 to 0100
“+” converted to “0+0”
“-” converted to “0-0”"
“h” and “H” converted to “0h0” and “0H0”
“k” and “K” converted to “0k0” and “0K0”
“m” and “M” converted to “0m0” and “0M0”
“B” converted to “0B0”
Thus the following adjustments were made to the replacement expression codes:
df4.economy$PROPDMGEXP <- gsub("",0,df4.economy$PROPDMGEXP)
df4.economy$PROPDMGEXP <- gsub("010|020|030|040|050|060|070|080|090|0100",10,df4.economy$PROPDMGEXP)
df4.economy$PROPDMGEXP <- gsub("0\\+0",1,df4.economy$PROPDMGEXP)
df4.economy$PROPDMGEXP <- gsub("0-0|000",0,df4.economy$PROPDMGEXP)
df4.economy$PROPDMGEXP <- gsub("0H0|0h0",100,df4.economy$PROPDMGEXP)
df4.economy$PROPDMGEXP <- gsub("0k0|0K0",1000,df4.economy$PROPDMGEXP)
df4.economy$PROPDMGEXP <- gsub("0m0|0M0",1000000,df4.economy$PROPDMGEXP)
df4.economy$PROPDMGEXP <- gsub("0b0|0B0",1000000000,df4.economy$PROPDMGEXP)
After replacing the values, the unique values for PROPDMGEXP are 1000, 1e+06, 1e+09, 0, 1, 10, 100 as shown below:
pe <- as.data.frame(unique(df4.economy$PROPDMGEXP))
pe
## unique(df4.economy$PROPDMGEXP)
## 1 1000
## 2 1e+06
## 3 1e+09
## 4 0
## 5 1
## 6 10
## 7 100
The damage value for Property is PROPDMG times PROPDMGEXP while damage value for Crop is CROPDMG time CROPDMGEXP. Thus the following new columns, PROP and CROP are added, which are product of the respective columns.
df4.economy$PROP <- df4.economy$PROPDMG*as.numeric(df4.economy$PROPDMGEXP)
df4.economy$CROP <- df4.economy$CROPDMG*as.numeric(df4.economy$CROPDMGEXP)
head(df4.economy)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP PROP CROP
## 1 TORNADO 25.0 1000 0 0 25000 0
## 2 TORNADO 2.5 1000 0 0 2500 0
## 3 TORNADO 25.0 1000 0 0 25000 0
## 4 TORNADO 2.5 1000 0 0 2500 0
## 5 TORNADO 2.5 1000 0 0 2500 0
## 6 TORNADO 2.5 1000 0 0 2500 0
The damage in terms of economy is calculated by adding together the damages to property and crops. A new column TOTALDMG is added to reflect this.
df4.economy$TOTALDMG <- df4.economy$PROP + df4.economy$CROP
head(df4.economy)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP PROP CROP TOTALDMG
## 1 TORNADO 25.0 1000 0 0 25000 0 25000
## 2 TORNADO 2.5 1000 0 0 2500 0 2500
## 3 TORNADO 25.0 1000 0 0 25000 0 25000
## 4 TORNADO 2.5 1000 0 0 2500 0 2500
## 5 TORNADO 2.5 1000 0 0 2500 0 2500
## 6 TORNADO 2.5 1000 0 0 2500 0 2500
To calculate total damages by event, the dataset is aggregated by using sum().
df4.sum <- aggregate(TOTALDMG ~ EVTYPE, data = df4.economy,sum)
df4.sum
## EVTYPE TOTALDMG
## 1 ASTRONOMICAL LOW TIDE 320000
## 2 AVALANCHE 3721800
## 3 BLIZZARD 771888950
## 4 COASTAL FLOOD 444335060
## 5 COLD/WIND CHILL 38286500
## 6 DEBRIS FLOW 347518100
## 7 DENSE FOG 22829500
## 8 DENSE SMOKE 100000
## 9 DROUGHT 15018672000
## 10 DUST DEVIL 751630
## 11 DUST STORM 9199000
## 12 EXCESSIVE HEAT 521536530
## 13 EXTREME COLD/WIND CHILL 1463313400
## 14 FLASH FLOOD 18171037648
## 15 FLOOD 161341827502
## 16 FROST/FREEZE 15895960491
## 17 FUNNEL CLOUD 194600
## 18 HAIL 19021448120
## 19 HEAT 403268500
## 20 HEAVY RAIN 4253396390
## 21 HEAVY SNOW 1108907857
## 22 HIGH SURF 111901500
## 23 HIGH WIND 6689699845
## 24 HURRICANE/TYPHOON 90872527810
## 25 ICE STORM 8967492310
## 26 LAKE-EFFECT SNOW 40682000
## 27 LAKESHORE FLOOD 7570000
## 28 LIGHTNING 945834542
## 29 MARINE HAIL 4000
## 30 MARINE HIGH WIND 1297010
## 31 MARINE STRONG WIND 1418330
## 32 MARINE THUNDERSTORM WIND 5907400
## 33 RIP CURRENT 163000
## 34 SEICHE 980000
## 35 SLEET 5500000
## 36 STORM SURGE/TIDE 47965579000
## 37 STRONG WIND 263718240
## 38 TORNADO 57356894672
## 39 TROPICAL DEPRESSION 1737000
## 40 TROPICAL STORM 8409336550
## 41 TSUNAMI 144082000
## 42 VOLCANIC ASH 500000
## 43 WATERSPOUT 60730200
## 44 WILDFIRE 8899910130
## 45 WINTER STORM 6788541250
## 46 WINTER WEATHER 42310500
To identify the top event, the dataset is ordered in descending order by TOTALDMG column. Then the top 5 events are selected for comparison.
df4.ordered <- df4.sum[order(df4.sum$TOTALDMG,decreasing = TRUE),]
df4.top <- df4.ordered[1:5,]
knitr::kable(df4.top)
| EVTYPE | TOTALDMG | |
|---|---|---|
| 15 | FLOOD | 161341827502 |
| 24 | HURRICANE/TYPHOON | 90872527810 |
| 38 | TORNADO | 57356894672 |
| 36 | STORM SURGE/TIDE | 47965579000 |
| 18 | HAIL | 19021448120 |
ggplot(df4.top,aes(EVTYPE,TOTALDMG/1000000000)) + geom_bar(stat="identity",fill=rainbow(5)) + labs(title="Top 5 Most Economically Damaging Events", x="Event Type", y="Total Damage (Billion Dollars)")
Based on the table and plot above, the most economically damaging storm type is FLOOD which caused damages (properties + crops) of 161,341,827,502 dollars. It is about 44% more damaging than the second ranked event.
Based on the analysis in the previous section:
The event which is most harmful to population health is TORNADO which caused casualties (fatalities + injuries) of 96,997.
The event with greatest economic consequence is FLOOD which caused damages (properties + crops) of 161,341,827,502 dollars.