Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The basic goal of this analysis is to explore the NOAA Storm Database and answer some basic questions about severe weather events. We will use the database to answer the questions:
This report aims to help a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, this document doesn’t make any specific recommendations on how to prevent these events.
According to our analysis tornadoes and floods are the most harmful events to public health and the economy respectively.
The data for this analysis comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded here [47Mb].
if(!file.exists("./data")){dir.create("./data")}
if(!file.exists("./data/repdata-data-StormData.csv.bz2")) {
dataUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(dataUrl, "./data/repdata-data-StormData.csv.bz2", method = "curl")
dateDownloaded <- date()
}
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
eventsds <- read.csv("./data/repdata-data-StormData.csv.bz2", stringsAsFactors = FALSE, na.strings = "")
str(eventsds)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr NA NA NA NA ...
## $ BGN_LOCATI: chr NA NA NA NA ...
## $ END_DATE : chr NA NA NA NA ...
## $ END_TIME : chr NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr NA NA NA NA ...
## $ END_LOCATI: chr NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
## $ WFO : chr NA NA NA NA ...
## $ STATEOFFIC: chr NA NA NA NA ...
## $ ZONENAMES : chr NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The database has 902297 observations with 37 variables each. However, to answer the proposed questions, we will need only the following variables:
The other variables can be discarded in this analysis. So we will delete them to free memory space and make the analysis faster.
eventsds <- select(eventsds, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG,
CROPDMGEXP)
The database has one string variable called “BGN_DATE” that represents the date when the event began. This is in the format “%m/%d/%Y %H:%M:%S”. This variable needs to be converted to date type to make the analysis easier.
eventsds$BGN_DATE <- as.Date(eventsds$BGN_DATE, "%m/%d/%Y %H:%M:%S")
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database, there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
In order to define which years we will consider, we will list the number of events per year, the percentage of events the year represents comparing with the whole database, and the accumulated percentage of events along the years.
occurrences <- table( Ocurrences = format(eventsds$BGN_DATE, "%Y"))
occurrences <- cbind(occurrences, percentage = round(occurrences/nrow(eventsds)*100, digits = 2))
occurrences <- occurrences[order(row.names(occurrences), decreasing = TRUE),]
accum <- occurrences[1, "percentage"]
for (i in c(2:nrow(occurrences))){
accum <- rbind(accum, occurrences[i, "percentage"] + accum[i-1])
}
occurrences <- cbind(occurrences, accumulated = round(accum, digits = 2))
kable(occurrences, caption = "TABLE 1: Events per year", col.names = c("Number of events",
"Percentage from total",
"Accumulated percantage"),
align = c("c", "c", "c"))
| Number of events | Percentage from total | Accumulated percantage | |
|---|---|---|---|
| 2011 | 62174 | 6.89 | 6.89 |
| 2010 | 48161 | 5.34 | 12.23 |
| 2009 | 45817 | 5.08 | 17.31 |
| 2008 | 55663 | 6.17 | 23.48 |
| 2007 | 43289 | 4.80 | 28.28 |
| 2006 | 44034 | 4.88 | 33.16 |
| 2005 | 39184 | 4.34 | 37.50 |
| 2004 | 39363 | 4.36 | 41.86 |
| 2003 | 39752 | 4.41 | 46.27 |
| 2002 | 36293 | 4.02 | 50.29 |
| 2001 | 34962 | 3.87 | 54.16 |
| 2000 | 34471 | 3.82 | 57.98 |
| 1999 | 31289 | 3.47 | 61.45 |
| 1998 | 38128 | 4.23 | 65.68 |
| 1997 | 28680 | 3.18 | 68.86 |
| 1996 | 32270 | 3.58 | 72.44 |
| 1995 | 27970 | 3.10 | 75.54 |
| 1994 | 20631 | 2.29 | 77.83 |
| 1993 | 12607 | 1.40 | 79.23 |
| 1992 | 13534 | 1.50 | 80.73 |
| 1991 | 12522 | 1.39 | 82.12 |
| 1990 | 10946 | 1.21 | 83.33 |
| 1989 | 10410 | 1.15 | 84.48 |
| 1988 | 7257 | 0.80 | 85.28 |
| 1987 | 7367 | 0.82 | 86.10 |
| 1986 | 8726 | 0.97 | 87.07 |
| 1985 | 7979 | 0.88 | 87.95 |
| 1984 | 7335 | 0.81 | 88.76 |
| 1983 | 8322 | 0.92 | 89.68 |
| 1982 | 7132 | 0.79 | 90.47 |
| 1981 | 4517 | 0.50 | 90.97 |
| 1980 | 6146 | 0.68 | 91.65 |
| 1979 | 4279 | 0.47 | 92.12 |
| 1978 | 3657 | 0.41 | 92.53 |
| 1977 | 3728 | 0.41 | 92.94 |
| 1976 | 3768 | 0.42 | 93.36 |
| 1975 | 4975 | 0.55 | 93.91 |
| 1974 | 5386 | 0.60 | 94.51 |
| 1973 | 4463 | 0.49 | 95.00 |
| 1972 | 2168 | 0.24 | 95.24 |
| 1971 | 3471 | 0.38 | 95.62 |
| 1970 | 3215 | 0.36 | 95.98 |
| 1969 | 2926 | 0.32 | 96.30 |
| 1968 | 3312 | 0.37 | 96.67 |
| 1967 | 2688 | 0.30 | 96.97 |
| 1966 | 2388 | 0.26 | 97.23 |
| 1965 | 2855 | 0.32 | 97.55 |
| 1964 | 2348 | 0.26 | 97.81 |
| 1963 | 1968 | 0.22 | 98.03 |
| 1962 | 2389 | 0.26 | 98.29 |
| 1961 | 2246 | 0.25 | 98.54 |
| 1960 | 1945 | 0.22 | 98.76 |
| 1959 | 1813 | 0.20 | 98.96 |
| 1958 | 2213 | 0.25 | 99.21 |
| 1957 | 2184 | 0.24 | 99.45 |
| 1956 | 1703 | 0.19 | 99.64 |
| 1955 | 1413 | 0.16 | 99.80 |
| 1954 | 609 | 0.07 | 99.87 |
| 1953 | 492 | 0.05 | 99.92 |
| 1952 | 272 | 0.03 | 99.95 |
| 1951 | 269 | 0.03 | 99.98 |
| 1950 | 223 | 0.02 | 100.00 |
If we apply the Pareto principle (states that, for many events, roughly 80% of the effects come from 20% of the causes). We will reduce the number of years to 20%. It means that we will consider the events from 1999 to 2011. Conversely, in this case, the Pareto principle doesn’t seem to apply very well, as the cumulative percentage of events between 1999 and 2011 is 61.45%. Anyway this subset seems to be representative enough.
eventsds <- filter(eventsds, format(eventsds$BGN_DATE, "%Y") >= 1999)
According to the database documentation, the event type variable “EVTYPE” should have 48 variations, as follow:
A quick look into the database shows us that this variable has a lot of filler errors such as spaces at the beginning or at the end, double spaces, misspellings, typing errors, abbreviations and so on.
table(eventsds$EVTYPE[order(eventsds$EVTYPE)])
##
## HIGH SURF ADVISORY FLASH FLOOD
## 1 1
## TSTM WIND WATERSPOUT
## 3 1
## WIND ABNORMALLY DRY
## 1 2
## ABNORMALLY WET ACCUMULATED SNOWFALL
## 1 4
## ASTRONOMICAL HIGH TIDE ASTRONOMICAL LOW TIDE
## 103 174
## AVALANCHE BEACH EROSION
## 351 2
## BLACK ICE BLIZZARD
## 7 2316
## BLOWING DUST BRUSH FIRE
## 2 1
## COASTAL FLOOD COASTAL FLOODING
## 563 67
## COASTAL FLOODING/EROSION COLD
## 2 19
## COLD WEATHER COLD WIND CHILL TEMPERATURES
## 1 4
## COLD/WIND CHILL COOL SPELL
## 539 1
## CSTL FLOODING/EROSION DAM BREAK
## 2 1
## DENSE FOG DENSE SMOKE
## 1188 10
## DRIEST MONTH DROUGHT
## 1 2253
## DROWNING DRY
## 1 8
## DRY CONDITIONS DRY MICROBURST
## 6 100
## DRY SPELL DRYNESS
## 3 1
## DUST DEVEL DUST DEVIL
## 1 115
## DUST STORM EARLY SNOWFALL
## 394 5
## EXCESSIVE HEAT EXCESSIVE HEAT/DROUGHT
## 1395 1
## EXCESSIVE RAINFALL EXCESSIVE SNOW
## 2 25
## EXCESSIVELY DRY EXTREME COLD
## 1 321
## EXTREME COLD/WIND CHILL EXTREME WIND CHILL
## 1002 1
## EXTREME WINDCHILL EXTREME WINDCHILL TEMPERATURES
## 93 19
## EXTREMELY WET FALLING SNOW/ICE
## 1 2
## FIRST FROST FIRST SNOW
## 1 2
## FLASH FLOOD FLASH FLOOD/FLOOD
## 42550 2
## FLASH FLOODING FLOOD
## 1 20881
## FLOOD/FLASH/FLOOD FOG
## 1 372
## FREEZE FREEZING DRIZZLE
## 59 5
## FREEZING FOG FREEZING RAIN
## 45 121
## FREEZING RAIN/SLEET FROST
## 2 41
## FROST/FREEZE FUNNEL CLOUD
## 1342 4959
## FUNNEL CLOUDS GLAZE
## 1 4
## GRADIENT WIND GUSTY LAKE WIND
## 2 1
## GUSTY THUNDERSTORM WIND GUSTY THUNDERSTORM WINDS
## 3 5
## GUSTY WIND GUSTY WIND/HAIL
## 13 1
## GUSTY WINDS HAIL
## 32 175844
## HARD FREEZE HAZARDOUS SURF
## 5 1
## HEAT HEAVY RAIN
## 711 10146
## HEAVY RAIN EFFECTS HEAVY RAINFALL
## 1 2
## HEAVY SEAS HEAVY SNOW
## 1 11662
## HEAVY SURF HEAVY SURF/HIGH SURF
## 56 228
## HIGH SEAS HIGH SURF
## 6 628
## HIGH SURF ADVISORIES HIGH SURF ADVISORY
## 1 4
## HIGH WATER HIGH WIND
## 2 17359
## HURRICANE HURRICANE/TYPHOON
## 91 88
## ICE ICE ON ROAD
## 1 1
## ICE ROADS ICE STORM
## 1 1477
## ICE/SNOW ICY ROADS
## 2 10
## LAKE EFFECT SNOW LAKE-EFFECT SNOW
## 8 635
## LAKESHORE FLOOD LANDSLIDE
## 23 588
## LANDSLUMP LANDSPOUT
## 1 2
## LATE SEASON HAIL LATE SEASON SNOW
## 1 1
## LIGHT FREEZING RAIN LIGHT SNOW
## 23 126
## LIGHT SNOW/FREEZING PRECIP LIGHTNING
## 1 10548
## LOCALLY HEAVY RAIN MARINE HAIL
## 1 442
## MARINE HIGH WIND MARINE STRONG WIND
## 135 48
## MARINE THUNDERSTORM WIND MARINE TSTM WIND
## 5812 6175
## MIXED PRECIPITATION MODERATE SNOWFALL
## 34 83
## MONTHLY PRECIPITATION MONTHLY SNOWFALL
## 36 1
## MONTHLY TEMPERATURE MUD SLIDE
## 4 3
## MUDSLIDE NON SEVERE HAIL
## 5 7
## NON TSTM WIND NON-SEVERE WIND DAMAGE
## 2 1
## NON-TSTM WIND NORTHERN LIGHTS
## 1 1
## OTHER PATCHY ICE
## 12 1
## PROLONG COLD PROLONG WARMTH
## 9 4
## RAIN RECORD COLD
## 7 1
## RECORD COLD RECORD COOL
## 19 3
## RECORD DRYNESS RECORD HEAT
## 2 27
## RECORD LOW RAINFALL RECORD PRECIPITATION
## 2 1
## RECORD RAINFALL RECORD SNOW
## 11 2
## RECORD SNOWFALL RECORD TEMPERATURE
## 1 3
## RECORD WARMTH RED FLAG CRITERIA
## 92 2
## RED FLAG FIRE WX REMNANTS OF FLOYD
## 2 2
## RIP CURRENT RIP CURRENTS
## 431 217
## ROGUE WAVE ROUGH SEAS
## 1 3
## SEICHE SEVERE THUNDERSTORM
## 13 1
## SEVERE THUNDERSTORMS SLEET
## 7 58
## SLEET STORM SMALL HAIL
## 12 36
## SMOKE SNOW
## 11 281
## SNOW ADVISORY SNOW AND ICE
## 1 2
## SNOW AND SLEET SNOW DROUGHT
## 4 2
## SNOW SHOWERS SNOW SQUALLS
## 5 6
## SNOW/BLOWING SNOW SNOW/FREEZING RAIN
## 6 4
## SNOW/SLEET SNOWMELT FLOODING
## 3 3
## STORM SURGE STORM SURGE/TIDE
## 152 148
## STREET FLOODING STRONG WIND
## 2 3514
## STRONG WINDS THUNDERSTORM
## 88 14
## THUNDERSTORM WIND THUNDERSTORM WIND (G40)
## 81402 1
## THUNDERSTORMS TIDAL FLOODING
## 2 10
## TORNADO TORNADO DEBRIS
## 19206 1
## TROPICAL DEPRESSION TROPICAL STORM
## 59 621
## TSTM WIND TSTM WIND (G40)
## 95322 3
## TSTM WIND (G45) TSTM WIND G45
## 11 1
## TSTM WIND/HAIL TSUNAMI
## 530 20
## UNSEASONABLY COLD UNSEASONABLY COOL
## 11 7
## UNSEASONABLY COOL & WET UNSEASONABLY DRY
## 2 26
## UNSEASONABLY HOT UNSEASONABLY WARM
## 6 61
## UNSEASONABLY WARM & WET UNSEASONABLY WARM AND DRY
## 1 4
## UNSEASONABLY WARM/WET UNSEASONABLY WET
## 2 13
## UNSEASONAL LOW TEMP UNSEASONAL RAIN
## 2 1
## UNUSUALLY COLD UNUSUALLY LATE SNOW
## 8 1
## UNUSUALLY WARM URBAN/SML STREAM FLD
## 4 1867
## VERY DRY VERY WARM
## 2 1
## VOG VOLCANIC ASH
## 1 20
## VOLCANIC ASHFALL VOLCANIC ERUPTION
## 3 2
## WALL CLOUD WARM WEATHER
## 1 1
## WATERSPOUT WHIRLWIND
## 2838 1
## WILD/FOREST FIRE WILDFIRE
## 1231 2732
## WIND WIND ADVISORY
## 239 12
## WIND AND WAVE WIND CHILL
## 1 5
## WIND DAMAGE WIND GUSTS
## 2 3
## WINDS WINTER MIX
## 2 1
## WINTER STORM WINTER WEATHER
## 10169 6951
## WINTER WEATHER MIX WINTER WEATHER/MIX
## 6 1104
## WINTRY MIX WND
## 50 1
We will try to clean the filler errors in the variable “EVTYPE” and group the different types to the ones listed in the documentation. Unfortunately, there are some event types in the database that are impossible to be matched with the types shown in the documentation. These types won’t be matched.
eventsds$EVTYPE <- toupper(eventsds$EVTYPE)
eventsds$EVTYPE <- trimws(eventsds$EVTYPE, "both")
eventsds$EVTYPE <- gsub(".*(ASTR).*", "ASTRONOMICAL LOW TIDE", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(AVAL|LAN|SLI).*", "AVALANCHE", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(BLIZ).*", "BLIZZARD", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(COAST|CST|BEA).*", "COASTAL FLOOD", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub("^(?=.*(COLD|COO|CHI|LOW|WND))((?!FUN|TOR|FRO|SNO|EXC|EXTR|HIG|REC|SEV|UNU|TID).)*$", "COLD/WIND CHILL", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*(FOG|VOG))((?!FRE|ICE).)*$", "DENSE FOG", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub(".*(SMO).*", "DENSE SMOKE", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(DROU|DRY|DRI).*", "DROUGHT", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(DEV).*", "DUST DEVIL", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub("^(?=.*DUS)((?!DEV).)*$", "DUST STORM", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub(".*(EXCESSIVE HEAT|EXTREME HEAT|HEATBURST|RECORD HEAT).*",
"EXCESSIVE HEAT", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(WAR).*", "EXCESSIVE HEAT", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(EXC|EXT|REC|SEV|UNU).*(COLD|CHI|COO).*", "EXTREME COLD/WIND CHILL",
eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(FLAS|FLD|FLDG).*", "FLASH FLOOD", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub("^(?=.*FLOO)((?!COA|BEA|CST|FLA|TID).)*$", "FLOOD", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*(FRO|FRE))((?!FOG).)*$", "FROST/FREEZE", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub(".*(FUN).*", "FUNNEL CLOUD", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub("^(?=.*FOG)((?!DEN).)*$", "FREEZING FOG", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*HAI)((?!MAR).)*$", "HAIL", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*HEAT)((?!EXC).)*$", "HEAT", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*(HEAV|RAI|PREC|SHOW))((?!SNO|SUR).)*$", "HEAVY RAIN",
eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*HEAV)((?!RAI|SUR|LAK|ICE).)*$", "HEAVY SNOW", eventsds$EVTYPE,
perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*SNO)((?!LAK).)*$", "HEAVY SNOW", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*SURF)((?!RIP).)*$", "HIGH SURF", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*HIG)((?!SUR|SWE|SEA|TEM|TID|WAT|WAV|REC|MAR).)*$", "HIGH WIND",
eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub(".*(HUR).*", "HURRICANE (TYPHOON)", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(ICE).*", "ICE STORM", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(LAK).*", "LAKE-EFFECT SNOW", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(LIGHTN).*", "LIGHTNING", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub("^(?=.*MARI)((?!HAI|HIG|THU|TST).)*$", "MARINE STRONG WIND", eventsds$EVTYPE,
perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*MARI)((?!HAI|HIG|STR).)*$", "MARINE THUNDERSTORM WIND", eventsds$EVTYPE,
perl = TRUE)
eventsds$EVTYPE <- gsub(".*(RIP).*", "RIP CURRENT", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(SLE).*", "SLEET", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub("^(?=.*(TID|SURG))((?!AST).)*$", "STORM SURGE/TIDE", eventsds$EVTYPE,
perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*WIND)((?!CHI|EXT|THU|HIG|MAR|TST|TUN).)*$", "STRONG WIND",
eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub("^(?=.*(THU|TST))((?!MAR).)*$", "THUNDERSTORM WIND", eventsds$EVTYPE,
perl = TRUE)
eventsds$EVTYPE <- gsub(".*(TORN).*", "TORNADO", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub("^(?=.*TROP)((?!DEP).)*$", "TROPICAL STORM", eventsds$EVTYPE, perl = TRUE)
eventsds$EVTYPE <- gsub(".*(VOL).*", "VOLCANIC ASH", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(WAT).*", "WATERSPOUT", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(FIRE).*", "WILDFIRE", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub(".*(WINTER S).*", "WINTER STORM", eventsds$EVTYPE)
eventsds$EVTYPE <- gsub("^(?=.*WINT)((?!STO).)*$", "WINTER WEATHER", eventsds$EVTYPE, perl = TRUE)
One last step to prepare the data is merging the values in PROPDMG with PROPDMGEXP and CROPDMG with CROPDMGEXP. By merging these variables we are able to obtain the actual value in dollar of the loss in properties and crops caused by each event.
eventsds[eventsds$PROPDMGEXP == "K" & !is.na(eventsds$PROPDMGEXP), "PROPDMGEXP"] <- 1000
eventsds[eventsds$PROPDMGEXP == "M" & !is.na(eventsds$PROPDMGEXP), "PROPDMGEXP"] <- 1000000
eventsds[eventsds$PROPDMGEXP == "B" & !is.na(eventsds$PROPDMGEXP), "PROPDMGEXP"] <- 1000000000
eventsds$PROPDMGEXP <- as.numeric(eventsds$PROPDMGEXP)
eventsds$PROPDMG <- eventsds$PROPDMG * eventsds$PROPDMGEXP
eventsds[eventsds$CROPDMGEXP == "K" & !is.na(eventsds$CROPDMGEXP), "CROPDMGEXP"] <- 1000
eventsds[eventsds$CROPDMGEXP == "M" & !is.na(eventsds$CROPDMGEXP), "CROPDMGEXP"] <- 1000000
eventsds[eventsds$CROPDMGEXP == "B" & !is.na(eventsds$CROPDMGEXP), "CROPDMGEXP"] <- 1000000000
eventsds$CROPDMGEXP <- as.numeric(eventsds$CROPDMGEXP)
eventsds$CROPDMG <- eventsds$CROPDMG * eventsds$CROPDMGEXP
Firstly we will analyse the consequence of the events in public health. We will start listing the 10 events with the most impact on public health.
eventsds <- group_by(eventsds, EVTYPE)
healthSummary <- summarise(eventsds, totalInjuries = sum(INJURIES, na.rm = TRUE),
totalFatalities = sum(FATALITIES, na.rm = TRUE),
totalPeopleAffected = sum(INJURIES + FATALITIES, na.rm = TRUE),
totalEconomicLoss = currency(sum(PROPDMG + CROPDMG, na.rm = TRUE)/1000000000),
frequency = sum(!is.na(EVTYPE)))
healthSummary <- healthSummary[order(-healthSummary$totalPeopleAffected),]
kable(healthSummary[1:10, c(1:4, 6)], caption = "TABLE 2: Top 10 events impacting on population health",
col.names = c("Event type", "Total injuries", "Total fatalities", "Total people affected",
"Event frequency"), align = c("c", "c", "c", "c"))
| Event type | Total injuries | Total fatalities | Total people affected | Event frequency |
|---|---|---|---|---|
| TORNADO | 17055 | 1287 | 18342 | 19207 |
| EXCESSIVE HEAT | 5188 | 1515 | 6703 | 1588 |
| THUNDERSTORM WIND | 3478 | 276 | 3754 | 176778 |
| LIGHTNING | 3236 | 512 | 3748 | 10548 |
| FLASH FLOOD | 1132 | 664 | 1796 | 44422 |
| HEAT | 1222 | 231 | 1453 | 711 |
| WILDFIRE | 1326 | 86 | 1412 | 3966 |
| HURRICANE (TYPHOON) | 1295 | 84 | 1379 | 179 |
| HIGH WIND | 802 | 158 | 960 | 17359 |
| RIP CURRENT | 414 | 485 | 899 | 648 |
The table above can be represented as a bubble chart where the x axis represents the number total of fatalities caused by each type of event, while the y axis represents the number of injuries. The diameter of the bubble represents the frequency of each type of event occurred in the 13 years interval.
ggplot(healthSummary[1:10,], aes(x = totalFatalities, y = totalInjuries, col = EVTYPE)) +
geom_point(aes(size = frequency), alpha = 0.5) +
scale_fill_manual(values = getPalette(colourCount)) +
scale_size_area(max_size = 30) +
ggtitle("FIGURE 1: Top 10 events impacting on population health") +
xlab("Total fatality") +
ylab("Total injuries") +
labs(col = "Event type", size = "Frequency") +
geom_text(data = healthSummary[1,], aes(x = totalFatalities, y = totalInjuries, label = EVTYPE),
vjust = 3) +
geom_text(data = healthSummary[2,], aes(x = totalFatalities, y = totalInjuries, label = EVTYPE),
vjust = 2, hjust = 0.9) +
geom_text(data = healthSummary[3,], aes(x = totalFatalities, y = totalInjuries, label = EVTYPE),
vjust = -4.7) +
geom_text(data = healthSummary[4,], aes(x = totalFatalities, y = totalInjuries, label = EVTYPE),
vjust = -1.5, hjust = -0.1) +
geom_text(data = healthSummary[5,], aes(x = totalFatalities, y = totalInjuries, label = EVTYPE),
vjust = -1.5, hjust = -0.2) +
geom_text(data = healthSummary[6,], aes(x = totalFatalities, y = totalInjuries, label = EVTYPE),
vjust = 0.2, hjust = -0.2) +
geom_text(data = healthSummary[7,], aes(x = totalFatalities, y = totalInjuries, label = EVTYPE),
vjust = 4.5, hjust = 0.3) +
geom_text(data = healthSummary[8,], aes(x = totalFatalities, y = totalInjuries, label = EVTYPE),
vjust = -1.2, hjust = 0.1) +
geom_text(data = healthSummary[9,], aes(x = totalFatalities, y = totalInjuries, label = EVTYPE),
vjust = 0.7, hjust = -0.23) +
geom_text(data = healthSummary[10,], aes(x = totalFatalities, y = totalInjuries, label = EVTYPE),
vjust = 1.5, hjust = -0.05)
Following the same approach to analyse the economic loss caused by each event type we will find the top 10 events impacting on the economy:
economicSummary <- summarise(eventsds,
totalPropertyLoss = currency(sum(PROPDMG, na.rm = TRUE)/1000000000),
totalCropLoss = currency(sum(CROPDMG, na.rm = TRUE)/1000000000),
totalPeopleAffected = sum(INJURIES + FATALITIES, na.rm = TRUE),
totalEconomicLoss = currency(sum(PROPDMG + CROPDMG, na.rm = TRUE)
/1000000000),
frequency = sum(!is.na(EVTYPE)))
economicSummary <- economicSummary[order(-economicSummary$totalEconomicLoss),]
kable(economicSummary[1:10, c(1:3, 5, 6)],
caption = "TABLE 3: Top 10 events impacting on economy (billions)",
col.names = c("Event type", "Total property loss", "Total crop loss", "Total economic loss",
"Event frequency"), align = c("c", "c", "c", "c"))
| Event type | Total property loss | Total crop loss | Total economic loss | Event frequency |
|---|---|---|---|---|
| FLOOD | $135.06 | $4.29 | $135.39 | 20909 |
| HURRICANE (TYPHOON) | $76.51 | $4.44 | $38.04 | 179 |
| TORNADO | $21.45 | $0.24 | $15.72 | 19207 |
| HAIL | $12.45 | $1.98 | $8.89 | 176419 |
| FLASH FLOOD | $12.94 | $1.21 | $7.96 | 44422 |
| STORM SURGE/TIDE | $47.82 | $0.00 | $4.64 | 310 |
| THUNDERSTORM WIND | $5.85 | $0.74 | $4.32 | 176778 |
| WILDFIRE | $7.25 | $0.31 | $3.71 | 3966 |
| HIGH WIND | $5.03 | $0.50 | $2.82 | 17359 |
| TROPICAL STORM | $7.22 | $0.45 | $1.45 | 621 |
A bubble chart can give us again a visual representation of the number shown in the table above.
ggplot(economicSummary[1:10,], aes(x = totalPropertyLoss, y = currency(totalCropLoss), col = EVTYPE)) +
geom_point(aes(size = frequency), alpha = 0.5) +
scale_fill_manual(values = getPalette(colourCount)) +
scale_size_area(max_size = 30) +
ggtitle("FIGURE 2: Top 10 events impacting on economy") +
xlab("Total property loss (billions)") +
ylab("Total crop loss (billions)") +
labs(col = "Event type", size = "Frequency") +
geom_text(data = economicSummary[1,], aes(x = totalPropertyLoss, y = totalCropLoss, label = EVTYPE),
vjust = 3, hjust = 0.7) +
geom_text(data = economicSummary[2,], aes(x = totalPropertyLoss, y = totalCropLoss, label = EVTYPE),
vjust = 2, hjust = 0.5) +
geom_text(data = economicSummary[3,], aes(x = totalPropertyLoss, y = totalCropLoss, label = EVTYPE),
vjust = 0.3, hjust = -0.25) +
geom_text(data = economicSummary[4,], aes(x = totalPropertyLoss, y = totalCropLoss, label = EVTYPE),
vjust = -1.5, hjust = -1.3) +
geom_text(data = economicSummary[5,], aes(x = totalPropertyLoss, y = totalCropLoss, label = EVTYPE),
vjust = -0.4, hjust = -0.24) +
geom_text(data = economicSummary[6,], aes(x = totalPropertyLoss, y = totalCropLoss, label = EVTYPE),
vjust = 0.2, hjust = -0.04) +
geom_text(data = economicSummary[7,], aes(x = totalPropertyLoss, y = totalCropLoss, label = EVTYPE),
vjust = -1.5, hjust = -0.27) +
geom_text(data = economicSummary[8,], aes(x = totalPropertyLoss, y = totalCropLoss, label = EVTYPE),
vjust = 3, hjust = 0.3) +
geom_text(data = economicSummary[9,], aes(x = totalPropertyLoss, y = totalCropLoss, label = EVTYPE),
vjust = -1.5, hjust = 0.18) +
geom_text(data = economicSummary[10,], aes(x = totalPropertyLoss, y = totalCropLoss,
label = EVTYPE), vjust = -0.5, hjust = -0.02)
Now we have two lists of the 10 events that cause the most impact on the people’s health and on the national economy. Depending on the focus of the analysis, we can use the charts above or we can join them in a health/economic chart.
We have 6 event types that appear on both lists (Flash Flood, High Wind, Hurricane (Typhoon), Thunderstorm Wind, Tornado and Wildfire). 4 event types appearing only in the top 10 health harassers (Excessive Heat, Heat, Lightning and Rip Current) and other 4 appearing only in the top 10 economy harassers (Flood, Hail, Storm Surge/Tide and Tropical Storm). Merging the two lists we have a single list of 14 events.
uniqueSummary = merge(healthSummary[1:10,], economicSummary[1:10,], all = TRUE)
kable(uniqueSummary[, 1:4], caption = "TABLE 4: Top 14 events impacting on the health and economy",
col.names = c("Event type", "Total injuries and fatalities", "Total economic loss (billions)",
"Event frequency"), align = c("c", "c", "c", "c"))
| Event type | Total injuries and fatalities | Total economic loss (billions) | Event frequency |
|---|---|---|---|
| EXCESSIVE HEAT | 6703 | $0.49 | 1588 |
| FLASH FLOOD | 1796 | $7.96 | 44422 |
| FLOOD | 606 | $135.39 | 20909 |
| HAIL | 637 | $8.89 | 176419 |
| HEAT | 1453 | $0.00 | 711 |
| HIGH WIND | 960 | $2.82 | 17359 |
| HURRICANE (TYPHOON) | 1379 | $38.04 | 179 |
| LIGHTNING | 3748 | $0.31 | 10548 |
| RIP CURRENT | 899 | $0.00 | 648 |
| STORM SURGE/TIDE | 28 | $4.64 | 310 |
| THUNDERSTORM WIND | 3754 | $4.32 | 176778 |
| TORNADO | 18342 | $15.72 | 19207 |
| TROPICAL STORM | 318 | $1.45 | 621 |
| WILDFIRE | 1412 | $3.71 | 3966 |
We can show the information in the table above on a chart where the x axis represents the total health impact and the y axis represents the total economic impact. The size of the bubble still represents the frequency of the event.
Knowing the growing trend of event types can be a important information in prioritizing their prevention. Events with a fast growth trend may have higher priority over downward trend events.
topEventsds <- filter(eventsds, EVTYPE %in% uniqueSummary$EVTYPE)
bubble <- ggplot(uniqueSummary, aes(x = totalPeopleAffected, y = totalEconomicLoss, col = EVTYPE)) +
geom_point(aes(size = frequency), alpha = 0.5) +
scale_fill_manual(values = getPalette(colourCount)) +
scale_size_area(max_size = 30) +
xlab("Total health impact") +
ylab("Total economy loss (billions)") +
labs(col = "Event type", size = "Frequency")
healthTrend <- ggplot(topEventsds, aes(x = BGN_DATE, y = FATALITIES + INJURIES,
col = EVTYPE)) +
geom_smooth(method = lm, se = FALSE, na.rm = TRUE, alpha = 0.5) +
ggtitle("Total health impact trend") +
xlab("Year") +
ylab("Total health impact") +
labs(col = "Event type")
economicTrend <- ggplot(topEventsds, aes(x = BGN_DATE, y = (PROPDMG + CROPDMG)/1000000000, col = EVTYPE)) +
geom_smooth(method = lm, se = FALSE, na.rm = TRUE, alpha = 0.5) +
ggtitle("Total economy loss trend") +
xlab("Year") +
ylab("Total economy loss (billions)") +
labs(col = "Event type")
fig3 <- ggarrange(bubble,
ggarrange(healthTrend, economicTrend, ncol = 2, legend = "none"),
nrow = 2, common.legend = TRUE, legend = "right")
annotate_figure(fig3, top = "FIGURE 3: Top 14 events impacting on the health and economy")
According to Figure 1, tornadoes are the most harmful event to the population health in relation to either the number of injuries or the number of fatalities. Tornadoes injured 17,055 people and killed 1,287 others between 1999 and 2011.
Tornadoes are destructive events and always make headlines. Conversely, it’s important to highlight that a more silent event has stood out in our analysis. The excessive heat killed 17.72% more people than tornadoes in the same time frame (1,515 fatalities).
The most frequent events, however, are the thunderstorm winds with 176,778 occurrences and 3,478 injuries.
The Figure 2 clearly shows that floods have the greatest economic consequences to either property or crop loss. Floods caused a property loss of 135.06 billions of dollars and a crop loss of 4.29 billions. The database has 20,909 occurrences of floods in the period between 1999 and 2011.
Hurricanes reach second place in the ranking with $76.51 billion in property loss and $4.44 billions in crop loss (3.38% over floods). Hurricanes, however, are relatively rare as there are only 179 occurrences in the 13 years time frame.
The third in the ranking are tornadoes with a total economic loss of 15.72 billions in the period. Despite the fact that it’s in the third position the difference from the first place is enormous.
It seems from the analysis of the Figure 3 that the prioritization of a single event to prevent would force the stakeholders to choose between public health and economy. There are no events with big impacts in both areas. Tornadoes and floods lead the health and economy ranks respectively, with a huge advantage over the second places.
In the case of economic restrictions to invest in the prevention of both events, tornadoes can be the best single choice. Firstly because they are the most harmful event to the public health and the third in the greatest economic consequence. Secondly, because this analysis doesn’t consider the economic impact to treat and recover injured people after the events. Tornadoes can be even more expensive if we include this economic impact in the equation.
As we can see in Figure 3, tornadoes have a rapidly growing trend in their impact on public health compared to other types of events. Regarding the economic impact, tornadoes are the only events that show a slight increase in their trend.