This analysis examines weather data from the National Oceanic and Atmospheric Administration (NOAA). The data list weather events, when they occurred, and several measures of adverse events associated with each event. For this analysis we are interested in the toll that weather events take in terms of injury, death, and property damage.
To examine which weather events are the most destructive, this analysis looks at the rates of injury, death, and property damage per event for all events since 1990, limited to event types with at least 500 occurrences in that period.
The raw data were available from a URL provided by the Coursera project instructions, and were downloaded using the download.file() function in R:
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "~/Ed/R directory/storms.csv.bz2")
The data were then read into R using the read.csv() function:
storms <- read.csv("storms.csv.bz2",stringsAsFactors=F)
Once the data were in the dataframe called “storms”, the weather event text description was converted to upper case, and the weather event date was converted to a proper date variable:
storms$EVTYPE <- toupper(storms$EVTYPE)
storms$BGN_DATE <- as.Date(storms$BGN_DATE,format="%m/%d/%Y")
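As a small illustration of these two conversions, here is a sketch on toy values. The sample strings are assumptions about what EVTYPE and BGN_DATE look like (in particular, that the date text carries a trailing time component); they are not copied from the data.

```r
# Toy values in the style assumed for EVTYPE and BGN_DATE (not taken from the data)
evtype   <- c("Heavy Rain", "TSTM WIND", "tornado")
bgn_date <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/30/2011 0:00:00")

# toupper() removes case differences so later pattern matching is simpler
evtype_upper <- toupper(evtype)

# as.Date() parses the leading month/day/year and ignores the trailing time text
parsed <- as.Date(bgn_date, format = "%m/%d/%Y")

# Values that do not match the format come back as NA instead of raising an error
bad <- as.Date("1996-01-06", format = "%m/%d/%Y")

print(evtype_upper)
print(parsed)
print(sum(is.na(parsed)))  # number of dates that failed to parse
```

Counting the NA values after parsing is a cheap check that the format string matches every record.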
Next, I subset the data to include only weather events that occurred in 1990 or later. I chose this year because it seemed to strike the right balance between recency and availability of data. Much earlier than this the records become sparse, and conditions (infrastructure, the monetary value of property) may have been different enough to skew the results.
## Keep the records 1990 and later for fair comparison
library(lubridate)
storms2 <- storms[year(storms$BGN_DATE)>=1990,]
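The same year filter can be sketched in base R on a toy data frame. The data below are made up, and the NA date is an assumption added to show what happens if a date fails to parse; the text above reports no such failures.

```r
# Toy data frame; the NA date stands in for a record that failed to parse
toy <- data.frame(
  EVTYPE   = c("TORNADO", "HAIL", "FLOOD", "FOG"),
  BGN_DATE = as.Date(c("1985-06-01", "1990-01-01", "2005-08-29", NA))
)

# Base-R equivalent of lubridate::year(): format the date as a four-digit year
yr <- as.integer(format(toy$BGN_DATE, "%Y"))

# A logical index containing NA returns an all-NA row for that position
with_na_row <- toy[yr >= 1990, ]

# which() keeps only positions where the condition is TRUE, dropping the NA
recent <- toy[which(yr >= 1990), ]

print(nrow(with_na_row))
print(recent)
```

Wrapping the condition in which() is a defensive choice: it makes no difference when every date parses, and avoids phantom NA rows when one does not.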
As initial exploratory analyses (not reproduced here), I examined the frequencies of the FATALITIES, INJURIES, and PROPDMG fields in the raw data. There were no missing values in these fields and all values seemed logical. I also examined the EVTYPE (weather event type) field, looking at its unique values and their frequencies in the raw data (also not reproduced here). Noting similarities between many of the less frequent categories and the most frequent categories of EVTYPE, I combined similar categories using the grep() command:
## Combine similar categories
storms2$EVTYPE[grep("AVALAN",storms2$EVTYPE)] <- "AVALANCHE"
storms2$EVTYPE[grep("BLIZZARD|WINTER",storms2$EVTYPE)] <- "BLIZZARD"
storms2$EVTYPE[grep("FLOOD|FLDG|FLD|URBAN|FLOOO",storms2$EVTYPE)] <- "FLOOD"
storms2$EVTYPE[grep("RAIN",storms2$EVTYPE)] <- "HEAVY RAIN"
storms2$EVTYPE[grep("FREEZING RAIN",storms2$EVTYPE)] <- "FREEZING RAIN"
storms2$EVTYPE[grep("ICE|ICY",storms2$EVTYPE)] <- "ICE"
storms2$EVTYPE[grep("SNOW",storms2$EVTYPE)] <- "SNOW"
storms2$EVTYPE[grep("HAIL",storms2$EVTYPE)] <- "HAIL"
storms2$EVTYPE[grep("SURF",storms2$EVTYPE)] <- "SURF"
storms2$EVTYPE[grep("HIGH WIND",storms2$EVTYPE)] <- "HIGH WIND"
storms2$EVTYPE[grep("HURRICANE|TYPHOON",storms2$EVTYPE)] <- "HURRICANE"
storms2$EVTYPE[grep("TROPICAL",storms2$EVTYPE)] <- "TROP STORM"
storms2$EVTYPE[grep("THUNDER|TSTM",storms2$EVTYPE)] <- "THUNDER STORM"
storms2$EVTYPE[grep("LIGHTN|LIGHTI|LIGNT",storms2$EVTYPE)] <- "LIGHTNING"
storms2$EVTYPE[grep("WIND|WND",storms2$EVTYPE)] <- "WIND"
storms2$EVTYPE[grep("TORN|FUNNEL",storms2$EVTYPE)] <- "TORNADO"
storms2$EVTYPE[grep("FIRE",storms2$EVTYPE)] <- "WILDFIRE"
storms2$EVTYPE[grep("HEAT|HOT|HIGH TEMP",storms2$EVTYPE)] <- "HEATWAVE"
storms2$EVTYPE[grep("SURGE",storms2$EVTYPE)] <- "STORM SURGE"
storms2$EVTYPE[grep("RIP",storms2$EVTYPE)] <- "RIPTIDE"
storms2$EVTYPE[grep("SWELL|SEA|TIDE",storms2$EVTYPE)] <- "HI SEAS"
storms2$EVTYPE[grep("SPOUT",storms2$EVTYPE)] <- "WATERSPOUT"
storms2$EVTYPE[grep("SLIDE",storms2$EVTYPE)] <- "LANDSLIDE"
storms2$EVTYPE[grep("DUST",storms2$EVTYPE)] <- "DUSTSTORM"
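Because each of these assignments runs against the labels left by the ones before it, the final category of a record depends on the order of the rules. The sketch below replays a few of the rules, in the same relative order, on toy labels (the labels are illustrative assumptions, not values checked against the data). It suggests that, as written, the FREEZING RAIN rule finds nothing to match, HIGH WIND ends up as WIND, and RIPTIDE ends up as HI SEAS. Whether that merging was intended is not stated.

```r
# Toy event labels (illustrative, not taken from the data)
ev <- c("FREEZING RAIN", "HEAVY RAIN", "RIP CURRENT", "HIGH WIND",
        "ASTRONOMICAL HIGH TIDE")

# Same assignment pattern as above, applied in the same relative order
ev[grep("RAIN", ev)]           <- "HEAVY RAIN"
ev[grep("FREEZING RAIN", ev)]  <- "FREEZING RAIN"  # no matches left at this point
ev[grep("HIGH WIND", ev)]      <- "HIGH WIND"
ev[grep("WIND|WND", ev)]       <- "WIND"           # also matches "HIGH WIND"
ev[grep("RIP", ev)]            <- "RIPTIDE"
ev[grep("SWELL|SEA|TIDE", ev)] <- "HI SEAS"        # also matches "RIPTIDE"

print(ev)
```

When a narrower category is meant to survive, one option is to apply the more specific pattern first and choose replacement labels that later patterns cannot match.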
Once this was complete, I set about aggregating the total numbers of injuries and deaths by weather event and the count of weather events themselves:
## Count injuries and deaths by event type and sum damage by event type
deaths <- as.data.frame(with(storms2,aggregate(list(Deaths=FATALITIES),list(Event=EVTYPE),sum)))
injuries <- as.data.frame(with(storms2,aggregate(list(Injuries=INJURIES),list(Event=EVTYPE),sum)))
events <- as.data.frame(table(storms2$EVTYPE)); colnames(events) <- c("Event","Count")
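For readers unfamiliar with aggregate() and table(), here is a minimal sketch of the same idea on a toy data frame. It uses the formula interface to sum both casualty columns in one call, which is an alternative to the two separate calls above rather than the method used in this analysis; all values are made up.

```r
# Toy records: two FOG events and three ICE events (made-up numbers)
toy <- data.frame(
  EVTYPE     = c("FOG", "FOG", "ICE", "ICE", "ICE"),
  FATALITIES = c(1, 0, 0, 2, 0),
  INJURIES   = c(3, 1, 0, 4, 2)
)

# Sum both casualty columns per event type in one call
totals <- aggregate(cbind(Deaths = FATALITIES, Injuries = INJURIES) ~ EVTYPE,
                    data = toy, FUN = sum)

# Count rows (events) per type and join the counts onto the totals
counts <- as.data.frame(table(EVTYPE = toy$EVTYPE),
                        responseName = "Count", stringsAsFactors = FALSE)
combined <- merge(totals, counts, by = "EVTYPE")

print(combined)
```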
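I next examined the codebook to understand how the damage variables work. Each numeric damage value (for both property and crop damage) must be multiplied by the factor indicated by the accompanying PROPDMGEXP or CROPDMGEXP field: H for hundreds, K for thousands, M for millions, and B for billions of dollars. Property and crop damage are then summed and expressed in thousands of dollars.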
## Compute damage amounts from raw data as the sum of property and crop damage,
## adjusted by the correct units of dollars
storms2$PDMG[storms2$PROPDMG==0] <- 0
storms2$PDMG[storms2$PROPDMGEXP=="H" | storms2$PROPDMGEXP=="h"] <-
100*storms2$PROPDMG[storms2$PROPDMGEXP=="H" | storms2$PROPDMGEXP=="h"]
storms2$PDMG[storms2$PROPDMGEXP=="K" | storms2$PROPDMGEXP=="k"] <-
1000*storms2$PROPDMG[storms2$PROPDMGEXP=="K" | storms2$PROPDMGEXP=="k"]
storms2$PDMG[storms2$PROPDMGEXP=="M" | storms2$PROPDMGEXP=="m"] <-
1000000*storms2$PROPDMG[storms2$PROPDMGEXP=="M" | storms2$PROPDMGEXP=="m"]
storms2$PDMG[storms2$PROPDMGEXP=="B" | storms2$PROPDMGEXP=="b"] <-
1000000000*storms2$PROPDMG[storms2$PROPDMGEXP=="B" | storms2$PROPDMGEXP=="b"]
storms2$CDMG[storms2$CROPDMG==0] <- 0
storms2$CDMG[storms2$CROPDMGEXP=="H" | storms2$CROPDMGEXP=="h"] <-
100*storms2$CROPDMG[storms2$CROPDMGEXP=="H" | storms2$CROPDMGEXP=="h"]
storms2$CDMG[storms2$CROPDMGEXP=="K" | storms2$CROPDMGEXP=="k"] <-
1000*storms2$CROPDMG[storms2$CROPDMGEXP=="K" | storms2$CROPDMGEXP=="k"]
storms2$CDMG[storms2$CROPDMGEXP=="M" | storms2$CROPDMGEXP=="m"] <-
1000000*storms2$CROPDMG[storms2$CROPDMGEXP=="M" | storms2$CROPDMGEXP=="m"]
storms2$CDMG[storms2$CROPDMGEXP=="B" | storms2$CROPDMGEXP=="b"] <-
1000000000*storms2$CROPDMG[storms2$CROPDMGEXP=="B" | storms2$CROPDMGEXP=="b"]
storms2$TDMG <- (storms2$PDMG + storms2$CDMG)/1000
damage <- as.data.frame(with(storms2,aggregate(list(Damage=TDMG),list(Event=EVTYPE),sum)))
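The unit conversion can also be written as a lookup from exponent code to multiplier. The sketch below is an alternative formulation on toy values, not the code used for the results; to_dollars() is a hypothetical helper and the "?" code is a made-up example of an unrecognised value. It also shows a property shared with the assignments above: a nonzero amount whose code is not H, K, M, or B stays NA, and sum() returns NA for any group containing such a row unless na.rm = TRUE is passed.

```r
# Multipliers for the four letter codes handled above
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a damage value and exponent code to dollars
to_dollars <- function(value, exp_code) {
  # Unrecognised codes index to NA; unname() drops the lookup names
  dollars <- value * unname(multiplier[toupper(exp_code)])
  dollars[value == 0] <- 0   # zero damage is zero whatever the code says
  dollars
}

# Made-up rows: two recognised codes, one zero amount, one unrecognised code
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 0, 10),
  PROPDMGEXP = c("K", "m", "", "?"),
  stringsAsFactors = FALSE
)
toy$PDMG <- to_dollars(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)

print(sum(toy$PDMG))                # NA, because one row could not be converted
print(sum(toy$PDMG, na.rm = TRUE))  # total over the rows that could be converted
```

Counting the NA values in the converted column before aggregating is a quick way to see whether any event types would be affected.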
I then combined these counts and sums into one dataframe, kept only the event types with at least 500 occurrences, and computed rates for each outcome (injury, death, and damage):
## Combine weather events and casualty events
casualties <- merge(deaths,injuries,by="Event")
totaldam <- merge(casualties,damage,by="Event")
event_final <- merge(totaldam,events,by="Event")
event_final <- event_final[event_final$Count>=500,]
## Compute the rates
event_final$deathrate <- event_final$Deaths/event_final$Count
event_final$injuryrate <- event_final$Injuries/event_final$Count
event_final$damagerate <- event_final$Damage/event_final$Count
barplot(head(event_final[order(event_final$injuryrate,decreasing=T),7]),
names.arg=head(event_final[order(event_final$injuryrate,decreasing=T),1]),
ylab="Injury Rate (Injuries/Event)", main="Most Injurious Weather Events",
cex.axis=0.8, cex.names=0.5, col="red", axes=T, las=2, ylim=c(0,3.6))
head(event_final[order(event_final$injuryrate,decreasing=T),c(1,5,3,7)])
##          Event Count Injuries injuryrate
## 74    HEATWAVE  2673     9224  3.4508043
## 59         FOG   538      734  1.3643123
## 89         ICE  2215     2195  0.9909707
## 45   DUSTSTORM   586      483  0.8242321
## 215    TORNADO 36799    26738  0.7265958
## 216 TROP STORM   757      383  0.5059445
This chart and the printed output show that HEATWAVE (abnormally high heat) events are responsible for the greatest number of injuries on average, with nearly 3.5 injuries per heat event. The next most injurious weather event was FOG, with 1.36 injuries per event. These were followed by ICE, DUSTSTORM, TORNADO, and TROP STORM (tropical storm), all with injury rates between 0.5 and 1.0.
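The ranking code above selects columns by position (for example 7 and c(1,5,3,7)), which depends on the order in which the dataframes were merged. A minimal sketch of the same ranking by column name follows; top_by() is a hypothetical helper and the numbers are made up, not results from this analysis.

```r
# Toy summary table with made-up numbers (not results from the analysis)
toy <- data.frame(
  Event    = c("A", "B", "C", "D"),
  Count    = c(1000, 500, 2000, 800),
  Injuries = c(500, 1000, 200, 800)
)
toy$injuryrate <- toy$Injuries / toy$Count

# Hypothetical helper: top n rows of df, ordered by the named rate column
top_by <- function(df, rate_col, n = 6) {
  head(df[order(df[[rate_col]], decreasing = TRUE), ], n)
}

top2 <- top_by(toy, "injuryrate", n = 2)
print(top2[, c("Event", "Count", "Injuries", "injuryrate")])
```

The same helper could be called with "deathrate" or "damagerate" to produce the other two rankings.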
I next repeated the chart and the output call for the death rate.
barplot(head(event_final[order(event_final$deathrate,decreasing=T),6]),
names.arg=head(event_final[order(event_final$deathrate,decreasing=T),1]),
ylab="Mortality Rate (Deaths/Event)", main="Most Deadly Weather Events",
cex.axis=0.8, cex.names=0.5, col="black", axes=T, las=2, ylim=c(0,1.4))
head(event_final[order(event_final$deathrate,decreasing=T),c(1,5,2,6)])
##            Event Count Deaths  deathrate
## 74      HEATWAVE  2673   3138 1.17396184
## 81       HI SEAS  1334    631 0.47301349
## 54  EXTREME COLD   657    162 0.24657534
## 212         SURF  1064    166 0.15601504
## 59           FOG   538     62 0.11524164
## 216   TROP STORM   757     66 0.08718626
Here we can see that HEATWAVE is not only the most injurious weather event but also the deadliest, with a rate of 1.17 deaths per heat event. FOG is also among the six deadliest weather events, but drops to the #5 slot with a rate of 0.12 deaths per fog event. As in the injury rankings, TROP STORM is in the #6 spot, with a rate of 0.09 deaths per tropical storm event. The remaining three of the top six are different, however, with HI SEAS, EXTREME COLD, and SURF filling the #2, #3, and #4 spots respectively. Death rates among the top six ranged from 0.09 to 1.17 deaths per event, on average.
Finally, I examined what I call the damage rate: the combined property and crop damage per weather event, measured in thousands of dollars per event.
barplot(head(event_final[order(event_final$damagerate,decreasing=T),8]),
names.arg=head(event_final[order(event_final$damagerate,decreasing=T),1]),
ylab="Damage Rate ($k/Event)", main="Most Expensive Weather Events",
cex.axis=0.8, cex.names=0.5, col="dark green", axes=T, las=2, ylim=c(0,14000))
head(event_final[order(event_final$damagerate,decreasing=T),c(1,5,4,8)])
##            Event Count     Damage damagerate
## 216   TROP STORM   757  8411023.6 11110.9954
## 31       DROUGHT  2488 15018672.0  6036.4437
## 54  EXTREME COLD   657  1380710.4  2101.5379
## 238     WILDFIRE  4239  8899910.1  2099.5306
## 67  FROST/FREEZE  1343  1104666.0   822.5361
## 90     LANDSLIDE   646   346843.1   536.9088
It should be noted that the event rates computed here do not account for two important factors: time and area. A HEATWAVE event usually takes place over several days and across a large area, as do cold weather events, tropical storms, hurricanes, and others. Some events may be missing from the rankings because they take place over a shorter time period and/or a smaller area, putting fewer people and less property at risk. A more thorough analysis should take these factors into account, potentially by computing the rankings within a higher-level weather event category, within geographic size categories, and by duration. As always, asking the right question and understanding the data are paramount to reaching good conclusions.