Over 40 weather events are monitored by the National Oceanic and Atmospheric Administration (NOAA). An analysis of this data from 1996 to 2011 reveal that excessive heat, tornadoes, and flash floods are the top three causes of fatalities while tornadoes, excessive heat and thunderstorm wind are the top three causes of injuries among weather events in the United States.
Hurricanes, tornadoes, and floods lead the pack in terms of total cost from damage to property while drought, hurricane, and flood top the list in terms of total cost from damage to crops.
Analysis of this data will be useful to formulate disaster risk reduction strategies to lessen the impact of these weather events. With the advent of climate change, there is an even greater need to improve the ability to forecast which communities are going to be hit the hardest and create mitigating solutions to safeguard lives, property and food security.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: Storm Data[47Mb]
fileurl <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileurl, destfile = "./data/storm.csv.bz2")
dateDownloaded <- date()
print(dateDownloaded)
## [1] "Wed Oct 28 12:56:36 2015"
clc <- c("NULL", "character", "NULL", "NULL", "NULL", "NULL", "NULL", "character", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL","NULL", "NULL", "NULL", "NULL", "numeric", "numeric","numeric", "character", "numeric", "character", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "character", "numeric")
stormdt <- bzfile("./data/storm.csv.bz2", open = "rt")
stormdata <- read.csv(stormdt, header = TRUE, nrows = 653641, sep = ",", colClasses = clc, skip = 251830)
colnames(stormdata) <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "REMARKS", "REFNUM")
str(stormdata)
## 'data.frame': 653641 obs. of 10 variables:
## $ BGN_DATE : chr "11/9/1994 0:00:00" "11/10/1994 0:00:00" "11/10/1994 0:00:00" "11/10/1994 0:00:00" ...
## $ EVTYPE : chr "URBAN FLOODS" "THUNDERSTORM WINDS" "URBAN FLOODS" "FLASH FLOODS" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 1 0 0.5 0 0 0 0 5 5 0 ...
## $ PROPDMGEXP: chr "K" "" "K" "" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ REMARKS : chr "Due to the rains 12 to 15 houses were flooded at Guayaney, principal road number 43. " "A tree fell over telephone lines at Barrio Guayabotas and Calabazas, road 182. " "A house was flooded at Monte Brisas urbanization, road 101. Several streets at Barrio Bebe Calzada were flooded EL Conquistador"| __truncated__ "Road 970 at Barrio Duque, Maizal sector was impassable. Also, road 31 from La Fe sector towards Pitina, and road from Naguabo t"| __truncated__ ...
## $ REFNUM : num 248622 248623 248624 248625 248626 ...
The dataset contains data from January 1950 to July 2015, as entered by NOAA’s National Weather Service (NWS). Due to changes in the data collection and processing procedures over time, only the data from 1996 onwards will be used for the analysis. Beginning in 1996, data collection increased from 4 to 48 different weather events. We will only retain the column variables that will be needed for the analysis.
```{r, transformers, cache=TRUE
library(dplyr)
post1996 <- filter(stormdata, stormdata$BGN_DATE >= "1996-01-01")
str(post1996)
## 'data.frame': 564718 obs. of 10 variables:
## $ BGN_DATE : chr "5/27/1995 0:00:00" "5/30/1995 0:00:00" "6/16/1995 0:00:00" "6/17/1995 0:00:00" ...
## $ EVTYPE : chr "URBAN FLOOD" "HEAVY RAIN" "THUNDERSTORM WINDS" "THUNDERSTORM WINDS" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 1 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 0 0 5 5 0 0 0 0 0 500 ...
## $ PROPDMGEXP: chr "" "" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ REMARKS : chr "Civil Defense reported street flooding in Mayaguez and Aguadilla. " "Civil Defense reported River Fajardo close to overflowing its banks due to the heavy rains in Rio Grande and Fajardo. " "Thunderstorm winds downed power lines and trees in Aguada, Aguadilla, and Anasco. Most of the damage occurred at the Piedras B"| __truncated__ "Downed power lines were reported by Civil Defense at Barrio Miraflores, road 109, km 7.8 in Anasco. " ...
## $ REFNUM : num 248627 248628 248629 248630 248631 ...
Due to the wide variety of sources from which the data is gathered, errors in data tabulations are inevitable. Furthermore, the weather events described are contiguous with one another, often occuring in groups of 3 or 4. We will attempt to properlly allocate the data in their proper classification based on the criteria put forth in the National Weather Service Storm data Documentation. First, we remove entries of summaries in the column variable that should contain only the type of events monitored.
summary_ent <- grepl("summary", post1996$EVTYPE, ignore.case = TRUE)
no_summary_ent <- post1996[!summary_ent, ]
Since our analysis will focus on the impact of weather events on the health and the economy, we will be removing entries that contain 0 values in terms of number of fatalities and injuries.
zeroes <- no_summary_ent$FATALITIES == 0 & no_summary_ent$INJURIES == 0
non_zeroes <- no_summary_ent[!zeroes, ]
str(non_zeroes)
## 'data.frame': 10247 obs. of 10 variables:
## $ BGN_DATE : chr "9/18/1993 0:00:00" "2/2/1996 0:00:00" "2/5/1996 0:00:00" "2/5/1996 0:00:00" ...
## $ EVTYPE : chr "FLASH FLOODS" "FLASH FLOOD" "EXTREME COLD" "EXTREME COLD" ...
## $ FATALITIES: num 1 1 1 1 0 0 4 2 0 0 ...
## $ INJURIES : num 0 0 0 0 15 1 40 17 1 3 ...
## $ PROPDMG : num 500 0 0 0 500 2 8 1.5 0 0 ...
## $ PROPDMGEXP: chr "K" "" "" "" ...
## $ CROPDMG : num 0 0 0 0 0 0 50 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ REMARKS : chr "Heavy rains affected the interior section of Puerto Rico causing the rivers to overflow in many areas. A man of 70 years of ag"| __truncated__ "Heavy rain was responsible for flooding a number of small streams and creeks across Chilton County. A woman was killed on Coun"| __truncated__ "A 71 YEAR OLD MALE DIED FROM COLD WEATHER IN THE PRICHARD AREA. THE TEMPERATURE DROPPED TO 11 DEGREES DURING THE NIGHT. EVIDE"| __truncated__ "A man believed to be in his 70s was found dead in his residence around noon on Monday. M74PH" ...
## $ REFNUM : num 248637 248799 248802 248803 248818 ...
The column variable EVTYPE contains entries which are not part of the prescribed classification of the National Weather Service Instruction. Furthermore, spelling mistakes and unauthorized abbreviations are frequent. And as mentioned earlier, classification is inherently difficult due to the contiguous nature of the events, with differences that are not intuitive. Let’s start with reclassifying entries to: Wildfire, Dust Devil, Hail, Debris Flow, Freezing Fog, Frost/Freeze, Tropical Depression, Sleet, Storm Surge/Tide, Marine Thunderstorm Wind, Dense Fog and High Surf.
non_zeroes$EVTYPE[grep("BRUSHFIRE|BRUSH FIRE|WILDFIRE|FOREST", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "WILDFIRE"
non_zeroes$EVTYPE[grep("DUST DEVIL|BLOWING DUST|LANDSPOUT", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "DUST DEVIL"
non_zeroes$EVTYPE[grep("SMALL HAIL|GUSTY WIND/HAIL", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "HAIL"
non_zeroes$EVTYPE[grep("ROCK SLIDE|MUD SLIDE|MUDSLIDE|Mudslides|LANDSLIDE|LANDSLIDES", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "DEBRIS FLOW"
non_zeroes$EVTYPE[grep("GLAZE", ignore.case = TRUE, non_zeroes$EVTYPE)] <- "FREEZING FOG"
non_zeroes$EVTYPE[grep("AGRICULTURAL FREEZE|Damaging Freeze|DAMAGING FREEZE|Freeze|FREEZE|HARD FREEZE|FROST|Frost/Freeze|FROST/FREEZE|Early Frost", non_zeroes$EVTYPE)] <- "FROST/FREEZE"
non_zeroes$EVTYPE[grep("GRADIENT WIND|LAKESHORE FLOOD", ignore.case = TRUE, non_zeroes$EVTYPE)] <- "TROPICAL DEPRESSION"
non_zeroes$EVTYPE[grep("ICE ON ROAD|ICE ROADS|ICY ROADS", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "SLEET"
non_zeroes$EVTYPE[grep("STORM SURGE", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "STORM SURGE/TIDE"
non_zeroes$EVTYPE[grep("MARINE TSTM WIND|SNOW SQUALL|Snow Squalls|SNOW SQUALLS", non_zeroes$EVTYPE)] <- "MARINE THUNDERSTORM WIND"
non_zeroes$EVTYPE[grep("FOG", non_zeroes$EVTYPE)] <- "DENSE FOG"
non_zeroes$EVTYPE[grep("High Surf|Erosion/Cstl Flood|HIGH SWELLS|Beach Erosion|COASTAL FLOODING/EROSION|COASTAL FLOODING/EROSION|COASTAL EROSION| HEAVY SEAS|HEAVY SURF|Heavy surf and wind|HIGH SEAS", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "HIGH SURF"
We continue our re-classification with events that are more inter-related compared to the previous group we re-classified.
non_zeroes$EVTYPE[grep("Hurricane Edouard|HURRICANE/TYPHOON", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "HURRICANE"
non_zeroes$EVTYPE[grep("Coastal Storm|COASTAL STORM|COASTALSTORM", non_zeroes$EVTYPE)] <- "TROPICAL STORM"
non_zeroes$EVTYPE[grep("ASTRONOMICAL HIGH TIDE|COASTAL FLOOD|TIDAL FLOODING|COASTAL FLOODING", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "COASTAL FlOOD"
non_zeroes$EVTYPE[grep("SNOW AND ICE|Freezing Spray|LIGHT SNOW|Light Snow|Light snow|SNOW|Snow|Light Snowfall|EXCESSIVE SNOW", non_zeroes$EVTYPE)] <- "WINTER STORM"
non_zeroes$EVTYPE[grep("SNOW AND ICE|SNOW|Snow|EXCESSIVE SNOW", non_zeroes$EVTYPE)] <- "WINTER STORM"
non_zeroes$EVTYPE[grep(" FLASH FLOOD|FLASH FLOOD/FLOOD|FLOOD/FLASH/FLOOD|DAM BREAK|RIVER FLOOD|River Flooding|RIVER FLOODING|Ice jam flood", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "FLASH FLOOD"
non_zeroes$EVTYPE[grep("MIXED PRECIPITATION|MIXED PRECIP|WINTRY MIX|WINTER WEATHER MIX|WINTER WEATHER/MIX|RAIN/SNOW|Heavy snow shower|HEAVY SNOW|FREEZING RAIN|FREEZING DRIZZLE|FALLING SNOW/ICE|blowing snow", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "WINTER WEATHER"
non_zeroes$EVTYPE[grep("URBAN/SML STREAM FLD", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "HEAVY RAIN"
non_zeroes$EVTYPE[grep("HIGH WINDS|NON TSTM WIND|NON-TSTM WIND", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "HIGH WIND"
non_zeroes$EVTYPE[grep("STRONG WIND|GUSTY WIND/HVY RAIN|Gusty wind/rain|GUSTY WINDS|STRONG WINDS", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "STRONG WIND"
non_zeroes$EVTYPE[grep(" TSTM|THUNDERSTORM WIND (G40)|STRONG WIND|Microburst|DOWNBURST|DRY MICROBURST|WHIRLWIND|Wind Damage", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "THUNDERSTORM WIND"
non_zeroes$EVTYPE[grep("^THUNDERSTORM", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "THUNDERSTORM WIND"
non_zeroes$EVTYPE[grep("^WIND", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "THUNDERSTORM WIND"
non_zeroes$EVTYPE[grep("Cold|COLD AND SNOW|Cold Temperature|COLD WEATHER|UNSEASONABLY COLD|COLD|Unseasonable Cold", non_zeroes$EVTYPE)] <- "COLD/WIND CHILL"
We now attend to the items that are difficult to subset as it may cause mixing of the data we previously classified.
non_zeroes$EVTYPE[grep("^TSTM", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "THUNDERSTORM WIND"
non_zeroes$EVTYPE[grep("Extreme Cold|EXTREME COLD/WIND CHILL|EXTREME WINDCHILL|Hypothermia/Exposure|HYPOTHERMIA/EXPOSURE|HYPERTHERMIA/EXPOSURE", non_zeroes$EVTYPE)] <- "EXTREME COLD"
non_zeroes$EVTYPE[grep(")$", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "HIGH WIND"
non_zeroes$EVTYPE[grep("RECORD HEAT", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "EXCESSIVE HEAT"
non_zeroes$EVTYPE[grep("RIP CURRENTs", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "RIP CURRENT"
non_zeroes$EVTYPE[grep("RECORD HEAT|Heat Wave", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "EXCESSIVE HEAT"
non_zeroes$EVTYPE[grep("RECORD HEAT|Heat Wave", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "EXCESSIVE HEAT"
non_zeroes$EVTYPE[grep("WARM WEATHER|UNSEASONABLY WARM", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "HEAT"
non_zeroes$EVTYPE[grep("RAIN|Torretial Rainfall|TYPHOON|UNSEASONAL RAIN", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "HURRICANE"
non_zeroes$EVTYPE[grep("ROUGH SEAS|ROUGH SURF|HEAVY SEAS|GUSTY WIND|HAZARDOUS SURF", non_zeroes$EVTYPE, ignore.case = TRUE)] <- "HIGH SURF"
We now undertake to reclassify those items whose meaning are not intuitive and require examination of the remarks column in the original data.
non_zeroes$EVTYPE[grep("BLACK ICE", non_zeroes$EVTYPE)] <- "WINTER WEATHER"
non_zeroes$EVTYPE[grep("Landslump", non_zeroes$EVTYPE)] <- "HIGH SURF"
non_zeroes$EVTYPE[grep("HIGH WATER", non_zeroes$EVTYPE)] <- "FLOOD"
Upon Examination of the entries labelled “OTHERS” and “Others” in the column variable EVTYPE in the original data frame, we discover that various entries were lumped together. Based on the column variable REMARKS, we now remove these rows from our data
not_zeroes <- filter(non_zeroes, non_zeroes$EVTYPE != "other" | non_zeroes$EVTYPE != "OTHERS")
evnt_types <- group_by(not_zeroes, EVTYPE)
sum_fatal <- summarise(evnt_types, sum(FATALITIES))
names(sum_fatal) <- c("Event", "Total_Fatalities")
sum_fatal <- arrange(sum_fatal, desc(Total_Fatalities))
sum_injur <- summarise(evnt_types, sum(INJURIES))
names(sum_injur) <- c("Event", "Total_Injured")
sum_injur <- arrange(sum_injur, desc(Total_Injured))
head(sum_fatal, 10)
## Source: local data frame [10 x 2]
##
## Event Total_Fatalities
## 1 EXCESSIVE HEAT 1798
## 2 TORNADO 1324
## 3 FLASH FLOOD 743
## 4 LIGHTNING 631
## 5 RIP CURRENT 473
## 6 THUNDERSTORM WIND 455
## 7 FLOOD 292
## 8 HEAT 236
## 9 HURRICANE 208
## 10 HIGH WIND 135
head(sum_injur, 10)
## Source: local data frame [10 x 2]
##
## Event Total_Injured
## 1 TORNADO 17796
## 2 EXCESSIVE HEAT 6461
## 3 THUNDERSTORM WIND 5023
## 4 LIGHTNING 3997
## 5 FLASH FLOOD 1312
## 6 HEAT 1240
## 7 HURRICANE 1199
## 8 WILDFIRE 941
## 9 WINTER STORM 879
## 10 HAIL 710
Since our analysis will focus on the impact of weather events on the damage to crops and properties, we will be removing entries that contain 0 values in terms of number of fatalities and injuries.
zerodam <- no_summary_ent$CROPDMG == 0 & no_summary_ent$PROPDMG == 0
non_zerodam <- no_summary_ent[!zerodam, ]
The column variable EVTYPE contains entries which are not part of the prescribed classification of the National Weather Service Instruction. Furthermore, spelling mistakes and unauthorized abbreviations are frequent. And as mentioned earlier, classification is inherently difficult due to the contiguous nature of the events, with differences that are not intuitive. Let’s start with reclassifying entries to: Wildfire, Dust Devil, Hail, Debris Flow, Freezing Fog, Frost/Freeze, Tropical Depression, Sleet, Storm Surge/Tide, Marine Thunderstorm Wind, Dense Fog and High Surf.
non_zerodam$EVTYPE[grep("BRUSHFIRE|BRUSH FIRE|WILDFIRE|FOREST", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "WILDFIRE"
non_zerodam$EVTYPE[grep("DUST DEVIL|BLOWING DUST|LANDSPOUT", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "DUST DEVIL"
non_zerodam$EVTYPE[grep("SMALL HAIL|GUSTY WIND/HAIL", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "HAIL"
non_zerodam$EVTYPE[grep("ROCK SLIDE|MUD SLIDE|MUDSLIDE|Mudslides|LANDSLIDE|LANDSLIDES", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "DEBRIS FLOW"
non_zerodam$EVTYPE[grep("GLAZE", ignore.case = TRUE, non_zerodam$EVTYPE)] <- "FREEZING FOG"
non_zerodam$EVTYPE[grep("AGRICULTURAL FREEZE|Damaging Freeze|DAMAGING FREEZE|Freeze|FREEZE|HARD FREEZE|FROST|Frost/Freeze|FROST/FREEZE|Early Frost", non_zerodam$EVTYPE)] <- "FROST/FREEZE"
non_zerodam$EVTYPE[grep("GRADIENT WIND|LAKESHORE FLOOD", ignore.case = TRUE, non_zerodam$EVTYPE)] <- "TROPICAL DEPRESSION"
non_zerodam$EVTYPE[grep("ICE ON ROAD|ICE ROADS|ICY ROADS", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "SLEET"
non_zerodam$EVTYPE[grep("STORM SURGE", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "STORM SURGE/TIDE"
non_zerodam$EVTYPE[grep("MARINE TSTM WIND|SNOW SQUALL|Snow Squalls|SNOW SQUALLS", non_zerodam$EVTYPE)] <- "MARINE THUNDERSTORM WIND"
non_zerodam$EVTYPE[grep("FOG", non_zerodam$EVTYPE)] <- "DENSE FOG"
non_zerodam$EVTYPE[grep("High Surf|Erosion/Cstl Flood|HIGH SWELLS|Beach Erosion|COASTAL FLOODING/EROSION|COASTAL FLOODING/EROSION|COASTAL EROSION| HEAVY SEAS|HEAVY SURF|Heavy surf and wind|HIGH SEAS", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "HIGH SURF"
We continue our re-classification with events that are more inter-related compared to the previous group we re-classified.
non_zerodam$EVTYPE[grep("Hurricane Edouard|HURRICANE/TYPHOON", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "HURRICANE"
non_zerodam$EVTYPE[grep("Coastal Storm|COASTAL STORM|COASTALSTORM", non_zerodam$EVTYPE)] <- "TROPICAL STORM"
non_zerodam$EVTYPE[grep("ASTRONOMICAL HIGH TIDE|COASTAL FLOOD|TIDAL FLOODING|COASTAL FLOODING", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "COASTAL FlOOD"
non_zerodam$EVTYPE[grep("SNOW AND ICE|Freezing Spray|LIGHT SNOW|Light Snow|Light snow|SNOW|Snow|Light Snowfall|EXCESSIVE SNOW", non_zerodam$EVTYPE)] <- "WINTER STORM"
non_zerodam$EVTYPE[grep("SNOW AND ICE|SNOW|Snow|EXCESSIVE SNOW", non_zerodam$EVTYPE)] <- "WINTER STORM"
non_zerodam$EVTYPE[grep(" FLASH FLOOD|FLASH FLOOD/FLOOD|FLOOD/FLASH/FLOOD|DAM BREAK|RIVER FLOOD|River Flooding|RIVER FLOODING|Ice jam flood", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "FLASH FLOOD"
non_zerodam$EVTYPE[grep("MIXED PRECIPITATION|MIXED PRECIP|WINTRY MIX|WINTER WEATHER MIX|WINTER WEATHER/MIX|RAIN/SNOW|Heavy snow shower|HEAVY SNOW|FREEZING RAIN|FREEZING DRIZZLE|FALLING SNOW/ICE|blowing snow", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "WINTER WEATHER"
non_zerodam$EVTYPE[grep("URBAN/SML STREAM FLD", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "HEAVY RAIN"
non_zerodam$EVTYPE[grep("HIGH WINDS|NON TSTM WIND|NON-TSTM WIND", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "HIGH WIND"
non_zerodam$EVTYPE[grep("STRONG WIND|GUSTY WIND/HVY RAIN|Gusty wind/rain|GUSTY WINDS|STRONG WINDS", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "STRONG WIND"
non_zerodam$EVTYPE[grep(" TSTM|THUNDERSTORM WIND (G40)|STRONG WIND|Microburst|DOWNBURST|DRY MICROBURST|WHIRLWIND|Wind Damage", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "THUNDERSTORM WIND"
non_zerodam$EVTYPE[grep("^THUNDERSTORM", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "THUNDERSTORM WIND"
non_zerodam$EVTYPE[grep("^WIND", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "THUNDERSTORM WIND"
non_zerodam$EVTYPE[grep("Cold|COLD AND SNOW|Cold Temperature|COLD WEATHER|UNSEASONABLY COLD|COLD|Unseasonable Cold", non_zerodam$EVTYPE)] <- "COLD/WIND CHILL"
We now attend to the items that are difficult to subset as it may cause mixing of the data we previously classified.
non_zerodam$EVTYPE[grep("^TSTM", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "THUNDERSTORM WIND"
non_zerodam$EVTYPE[grep("Extreme Cold|EXTREME COLD/WIND CHILL|EXTREME WINDCHILL|Hypothermia/Exposure|HYPOTHERMIA/EXPOSURE|HYPERTHERMIA/EXPOSURE", non_zerodam$EVTYPE)] <- "EXTREME COLD"
non_zerodam$EVTYPE[grep(")$", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "HIGH WIND"
non_zerodam$EVTYPE[grep("RECORD HEAT", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "EXCESSIVE HEAT"
non_zerodam$EVTYPE[grep("RIP CURRENTs", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "RIP CURRENT"
non_zerodam$EVTYPE[grep("RECORD HEAT|Heat Wave", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "EXCESSIVE HEAT"
non_zerodam$EVTYPE[grep("RECORD HEAT|Heat Wave", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "EXCESSIVE HEAT"
non_zerodam$EVTYPE[grep("WARM WEATHER|UNSEASONABLY WARM", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "HEAT"
non_zerodam$EVTYPE[grep("RAIN|Torretial Rainfall|TYPHOON|UNSEASONAL RAIN", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "HURRICANE"
non_zerodam$EVTYPE[grep("ROUGH SEAS|ROUGH SURF|HEAVY SEAS|GUSTY WIND|HAZARDOUS SURF", non_zerodam$EVTYPE, ignore.case = TRUE)] <- "HIGH SURF"
We now undertake to reclassify those items whose meaning are not intuitive and require examination of the remarks column in the original data.
non_zerodam$EVTYPE[grep("BLACK ICE", non_zerodam$EVTYPE)] <- "WINTER WEATHER"
non_zerodam$EVTYPE[grep("Landslump", non_zerodam$EVTYPE)] <- "HIGH SURF"
non_zerodam$EVTYPE[grep("HIGH WATER", non_zerodam$EVTYPE)] <- "FLOOD"
Upon Examination of the entries labelled “OTHERS” and “Others” in the column variable EVTYPE in the original data frame, we discover that various entries were lumped together. Based on the column variable REMARKS, we now remove these rows from our data
no_otherss <- filter(non_zerodam, non_zerodam$EVTYPE != "other" | non_zerodam$EVTYPE != "OTHERS")
damages <- select(no_otherss, c(2, 5, 6, 7, 8))
str(damages)
## 'data.frame': 31632 obs. of 5 variables:
## $ EVTYPE : chr "HURRICANE" "WINTER STORM" "WINTER WEATHER" "COLD/WIND CHILL" ...
## $ PROPDMG : num 0 595 10 0 0 0 195 15 0 2.5 ...
## $ PROPDMGEXP: chr "" "K" "K" "" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
unique(damages$PROPDMGEXP)
## [1] "" "K" "M" "B"
unique(damages$CROPDMGEXP)
## [1] "" "K" "M"
We can see that the column variables “PROPDMGEXP” and “CROPDMGEXP” are multipliers for the values in “PROPDMG” and “CROPDMG” column variables. The multipliers in the form of empty spaces and letters (K, M, B) serve to provide the actual amount of damage in dollars by multiplying the values by billions, millions, thousands, hundreds and ones. In order to assure that we don’t mistakenly alter the data, we take note of the number of items represented by each letter and space.
sum(damages$PROPDMGEXP == "K")
## [1] 18215
sum(damages$PROPDMGEXP == "M")
## [1] 316
sum(damages$PROPDMGEXP == "B")
## [1] 2
sum(damages$PROPDMGEXP == "")
## [1] 13099
sum(damages$CROPDMGEXP == "K")
## [1] 13717
sum(damages$CROPDMGEXP == "M")
## [1] 79
sum(damages$CROPDMGEXP == "")
## [1] 17836
We now replace them with the appropriate multipliers in numbers.
damages$PROPDMGEXP[damages$PROPDMGEXP == "K"] <- 1000
damages$PROPDMGEXP[damages$PROPDMGEXP == "M"] <- 1000000
damages$PROPDMGEXP[damages$PROPDMGEXP == "B"] <- 1000000000
damages$PROPDMGEXP[damages$PROPDMGEXP == ""] <- 1
damages$CROPDMGEXP[damages$CROPDMGEXP == "K"] <- 1000
damages$CROPDMGEXP[damages$CROPDMGEXP == "M"] <- 1000000
damages$CROPDMGEXP[damages$CROPDMGEXP == ""] <- 1
We now compare the transformation to verify that we did not alter the data.
sum(damages$PROPDMGEXP == 1000)
## [1] 18215
sum(damages$PROPDMGEXP == 1000000)
## [1] 316
sum(damages$PROPDMGEXP == 1000000000)
## [1] 2
sum(damages$PROPDMGEXP == 1)
## [1] 13099
sum(damages$CROPDMGEXP == 1000)
## [1] 13717
sum(damages$CROPDMGEXP == 1000000)
## [1] 79
sum(damages$CROPDMGEXP == 1)
## [1] 17836
We now look at our data and transform the multiplier column variables into numeric class. We then create new column variables (“cropcost” and “propcost”) to reflect the actual cost of damages in dollars.
str(damages)
## 'data.frame': 417220 obs. of 5 variables:
## $ EVTYPE : chr "URBAN FLOOD" "HURRICANE" "HURRICANE" "WATERSPOUT" ...
## $ PROPDMG : num 0 0 0 0 0 0 0 500 0 0 ...
## $ PROPDMGEXP: chr "1" "1" "1" "1" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "1" "1" "1" "1" ...
damages$CROPDMGEXP <- as.numeric(damages$CROPDMGEXP)
damages$PROPDMGEXP <- as.numeric(damages$PROPDMGEXP)
damages$cropcost <- damages$CROPDMG * damages$CROPDMGEXP
damages$propcost <- damages$PROPDMG * damages$PROPDMGEXP
We now aggregate and summarise the data to reflect the order in which different climate cause damage to crop and property.
damages_byEV <- group_by(damages, EVTYPE)
sum_damages <- summarise(damages_byEV, sum(propcost), sum(cropcost))
colnames(sum_damages) <- c("Event", "Total_PropCost", "Total_CropCost")
sum_damagesProp <- arrange(sum_damages, desc(Total_PropCost))
sum_damagesCrop <- arrange(sum_damages, desc(Total_CropCost))
head(sum_damagesProp[1:2], 15)
## Source: local data frame [15 x 2]
##
## Event Total_PropCost
## 1 HURRICANE 41246572100
## 2 TORNADO 19545883000
## 3 FLOOD 7043247040
## 4 TROPICAL STORM 6543704400
## 5 FLASH FLOOD 5948940260
## 6 STORM SURGE/TIDE 4001600000
## 7 THUNDERSTORM WIND 3147571800
## 8 HIGH WIND 2843022870
## 9 HAIL 2777771540
## 10 WILDFIRE 1518205200
## 11 BLIZZARD 234810000
## 12 ICE STORM 216737000
## 13 DROUGHT 156182000
## 14 WINTER STORM 100860000
## 15 TSUNAMI 84000000
head(sum_damagesCrop[c(1,3)], 15)
## Source: local data frame [15 x 2]
##
## Event Total_CropCost
## 1 DROUGHT 8993943000
## 2 HURRICANE 5380501700
## 3 FLOOD 4181022400
## 4 HAIL 2358314450
## 5 THUNDERSTORM WIND 1070827100
## 6 FLASH FLOOD 1050193700
## 7 FROST/FREEZE 675338000
## 8 TROPICAL STORM 591161000
## 9 HIGH WIND 547340700
## 10 EXCESSIVE HEAT 492402000
## 11 COLD/WIND CHILL 387370500
## 12 WILDFIRE 268670630
## 13 TORNADO 266404010
## 14 DEBRIS FLOW 20017000
## 15 WINTER WEATHER 15000000
We now present the data in a barplot
par( oma = c( 6, 0, 0, 0 ) )
barplot(height = sum_fatal$Total_Fatalities[1:15], names.arg = sum_fatal$Event[1:15], las = 2, cex.axis = 0.8, cex.names = 0.8, col = rainbow(20), ylab = "Number of Fatalities")
title("Top 15\nWeather Events \n Causing Fatalities", line=-4, cex = 1.5)
barplot(height = sum_injur$Total_Injured[1:15], names.arg = sum_injur$Event[1:15], las = 2, cex.axis = 0.8, cex.names = 0.8, col = rainbow(20), ylab = "Number of Injuries")
title("Top 15\nWeather Events \n Causing Injuries", line=-4, cex = 1.5)
par( mfrow = c( 1, 2 ) )
par( oma = c( 5, 0, 4, 0 ) )
barplot(height = sum_damagesProp$Total_PropCost[1:15]/1000000000, names.arg = sum_damagesProp$Event[1:15], las = 2, cex.axis = 0.8, cex.names = 0.8, col = rainbow(20), ylab = "Damage to Property (B$)")
barplot(height = sum_damagesCrop$Total_CropCost[1:15]/100000000, names.arg = sum_damagesCrop$Event[1:15], las = 2, cex.axis = 0.8, cex.names = 0.8, col = rainbow(20), ylab = "Damage to Crops (B$)")
mtext("Top 15 Weather Events \n Causing Damage to Crops and Property", outer = TRUE, col = "Black", cex = 2)
An analysis of this data from 1996 to 2011 reveal that excessive heat, tornadoes, and flash floods are the top three causes of fatalities while tornadoes, excessive heat and thunderstorm wind are the top three causes of injuries among weather events in the United States.
Hurricanes, tornadoes, and floods lead the pack in terms of total cost from damage to property while drought, hurricane, and flood top the list in terms of total cost from damage to crops.
Analysis of this data will be useful to formulate disaster risk reduction strategies to lessen the impact of these weather events. With the advent of climate change, there is an even greater need to improve the ability to forecast which communities are going to be hit the hardest and create mitigating solutions to safeguard lives, property and food security.
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.4.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.0 codetools_0.2-14 digest_0.6.8 assertthat_0.1
## [5] R6_2.1.0 DBI_0.3.1 formatR_1.2 magrittr_1.5
## [9] evaluate_0.7.2 stringi_0.5-5 lazyeval_0.1.10 rmarkdown_0.7
## [13] tools_3.2.1 stringr_1.0.0 yaml_2.1.13 parallel_3.2.1
## [17] htmltools_0.2.6 knitr_1.11