In this report we aim to identify the types of weather events in the U.S. that pose the greatest threats to the nation. We analyse data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and consider two measures of threat: harm to the human population and damage in economic terms.
Using the data from January 1996 to November 2011, we find the following:
- The events “tornado”, “excessive heat”, “flood”, “lightning” and “flash flood” are the five weather events most harmful to the human population in terms of both fatality and injury counts, with tornados having caused a substantially higher injury count than any other event.
- In terms of economic consequences, “flood” and “tornado” have led to the most property damage, whereas “drought” and “flood” have resulted in the most crop damage. Overall, the nation suffered the most damage from “flood” and “tornado”.
The data processing steps and concluding results are detailed below.
The following libraries are needed to run the code in this report.
library(plyr)
library(dplyr)
library(pdftools)
library(tesseract)
library(reshape2)
library(ggplot2)
The NOAA dataset is downloaded from this link and is read into R.
raw <- read.csv("repdata_data_StormData.csv", sep=",", header=T)
The original dataset has 902297 rows and 37 variables,
dim(raw)
## [1] 902297 37
with the variable names listed as below.
names(raw)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
To process the data into a form suitable for our analysis, we first select the columns of interest: the event (beginning) date, event type, fatality count, injury count, property damage and crop damage. We then filter the data and keep only the events that began on or after 01 January 1996, since this is the date from which the NOAA started recording all types of weather events (see this page from the NOAA for reference).
#select relevant columns
processed <- raw[,c(2,8,23:28)]
#remove data before 01 Jan 1996
processed$BGN_DATE <- as.POSIXct(strptime(processed$BGN_DATE,"%m/%d/%Y %H:%M:%S"))
processed <- processed[processed$BGN_DATE >= as.POSIXct("1996-01-01"),]
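The parsing and filtering step can be sketched on a small toy data frame. The three rows below are made up for illustration, and passing an explicit `tz = "UTC"` is an assumption of this sketch (the report itself uses the machine's local time zone, which is why its output shows CET).

```r
# Toy BGN_DATE values in the same "%m/%d/%Y %H:%M:%S" layout as the raw file
toy <- data.frame(
  BGN_DATE = c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "11/30/2011 0:00:00"),
  EVTYPE   = c("HAIL", "TORNADO", "FLOOD"),
  stringsAsFactors = FALSE
)

# Parse with an explicit time zone so the comparison does not depend on the machine's settings
toy$BGN_DATE <- as.POSIXct(strptime(toy$BGN_DATE, "%m/%d/%Y %H:%M:%S", tz = "UTC"))
cutoff <- as.POSIXct("1996-01-01", tz = "UTC")

# Keep only events that began on or after the cutoff
kept <- toy[toy$BGN_DATE >= cutoff, ]
kept$EVTYPE
```

Only the 1996 and 2011 rows survive the filter; the 1995 row is dropped.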
Now the dataset consists of data from 01 January 1996 to 30 November 2011.
min(processed$BGN_DATE)
## [1] "1996-01-01 CET"
max(processed$BGN_DATE)
## [1] "2011-11-30 CET"
Unfortunately the event type entries in the dataset do not follow any standardized format, and they also contain a considerable number of typos. For example, the first 25 distinct event type entries are
unique(processed$EVTYPE)[1:25]
## [1] "WINTER STORM" "TORNADO" "TSTM WIND"
## [4] "HAIL" "HIGH WIND" "HEAVY RAIN"
## [7] "FLASH FLOOD" "FREEZING RAIN" "EXTREME COLD"
## [10] "EXCESSIVE HEAT" "LIGHTNING" "FUNNEL CLOUD"
## [13] "EXTREME WINDCHILL" "BLIZZARD" "URBAN/SML STREAM FLD"
## [16] "FLOOD" "TSTM WIND/HAIL" "WATERSPOUT"
## [19] "RIP CURRENTS" "HEAVY SNOW" "Other"
## [22] "Record dry month" "Temperature record" "WILD/FOREST FIRE"
## [25] "Minor Flooding"
from which we already see various entry styles and problems: full event names (e.g. “tornado”), abbreviations (e.g. “tstm” for thunderstorm?), two events in one entry (e.g. “tstm wind/hail”) and confusing strings (e.g. “other”, “temperature record”).
To make an analysis possible, we categorize the chaotic entries into the 48 official weather event types reported by the NOAA in their documentation. The 48 official event types are extracted from the 2007 documentation available online (see the URL in the code below).
#download the PDF documentation from the NOAA and extract the text on the relevant page
if(!file.exists(pdf <- "repdata_peer2_doc_pd01016005curr.pdf")){
fileUrl <- "http://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(fileUrl,pdf, mode="wb")
}
extractpdf <- pdf_ocr_text(pdf, pages=6, dpi=700) %>% strsplit(split="\n")
#sort out the event types
EventType <- extractpdf[[1]][9:32] %>%
strsplit("( Z )| Z$|( C )| C$|( M )| M$") %>%
unlist() %>%
tolower()
For example, the first 10 official event types are
EventType[1:10]
## [1] "astronomical low tide" "hurricane (typhoon)" "avalanche"
## [4] "ice storm" "blizzard" "lake-effect snow"
## [7] "coastal flood" "lakeshore flood" "cold/wind chill"
## [10] "lightning"
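To see what the split pattern does, here is a minimal sketch on two hypothetical text lines. They are stand-ins modelled on the output above and are not the actual OCR result. Each event name is assumed to be followed by a one-letter designator (Z, C or M), either between two names or at the end of the line.

```r
# Hypothetical stand-ins for two OCR lines (two event names per line, each followed by a designator)
ocr_lines <- c("Astronomical Low Tide Z Hurricane (Typhoon) Z",
               "Cold/Wind Chill Z Lightning C")

# Same pattern as in the report: split on " Z ", " C ", " M " or on a trailing " Z", " C", " M"
pattern <- "( Z )| Z$|( C )| C$|( M )| M$"
event_names <- tolower(unlist(strsplit(ocr_lines, pattern)))
event_names
```

The pattern requires a space before the designator, so the capital "C" in "Cold" or "Chill" is not treated as a separator.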
We then create a data frame called “matching”, which we will use to map the messy string entries to the 48 official categories.
matching <- data.frame(event=EventType, string=EventType)
for(i in 1:length(EventType)){
if(grepl("\\(.*\\)",EventType[i])){
temp <- unlist(strsplit(EventType[i],"\\("))
temp <- trimws(gsub("\\)","",temp))
alternative <- data.frame(event=EventType[i], string=temp)
matching <- rbind(matching, alternative)
}
if(grepl("/",EventType[i])){
temp <- unlist(strsplit(EventType[i],"/"))
alternative <- data.frame(event=EventType[i], string=temp)
matching <- rbind(matching, alternative)
}
}
matching <- matching[order(matching$event),]
Part of the “matching” data frame looks like the following:
matching[4:8,]
## event string
## 7 coastal flood coastal flood
## 9 cold/wind chill cold/wind chill
## 51 cold/wind chill cold
## 52 cold/wind chill wind chill
## 11 debris flow debris flow
meaning that, for example, if a record's event type entry equals “cold/wind chill” (see row 9), “cold” (see row 51) or “wind chill” (see row 52), we assign the official category name “cold/wind chill” to this record. The match is case-insensitive: only the characters matter, not whether they are upper or lower case.
The following code checks the “matching” data frame for duplicated string entries and removes them:
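The alias expansion described above can be sketched as a small helper applied to three official names. The function name `expand_aliases` is hypothetical and introduced only for this illustration; it applies the same two rules as the loop (split at parentheses, split at a slash).

```r
# Hypothetical helper: return the official name plus the alternative strings derived from it
expand_aliases <- function(event) {
  aliases <- event
  if (grepl("\\(.*\\)", event)) {
    # "hurricane (typhoon)" -> "hurricane", "typhoon"
    pieces  <- unlist(strsplit(event, "\\("))
    aliases <- c(aliases, trimws(gsub("\\)", "", pieces)))
  }
  if (grepl("/", event)) {
    # "cold/wind chill" -> "cold", "wind chill"
    aliases <- c(aliases, unlist(strsplit(event, "/")))
  }
  data.frame(event = event, string = aliases, stringsAsFactors = FALSE)
}

toy_matching <- do.call(rbind, lapply(c("hurricane (typhoon)", "cold/wind chill", "hail"),
                                      expand_aliases))

# Case-insensitive lookup of a raw entry
toy_matching$event[match(tolower("TYPHOON"), toy_matching$string)]
```

An entry such as "TYPHOON" is therefore assigned to the official category "hurricane (typhoon)", while a name without parentheses or slash (here "hail") only matches itself.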
#check if strings are unique
matching %>% group_by(string) %>% filter(n()>1)
## # A tibble: 2 x 2
## # Groups: string [1]
## event string
## <chr> <chr>
## 1 cold/wind chill wind chill
## 2 extreme cold/wind chill wind chill
#display rows that have the identical string "wind chill"
matching[matching$string=="wind chill",]
## event string
## 52 cold/wind chill wind chill
## 54 extreme cold/wind chill wind chill
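#remove one of the rows, it is decided to be row 52 (cold/wind chill);
#select it by content, since a negative numeric index such as -52 drops by position, not by row name
matching <- matching[!(matching$event=="cold/wind chill" & matching$string=="wind chill"),]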
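The same duplicate check can be sketched in base R on a hypothetical four-row table. The rows mirror the two categories involved; the choice of which duplicate to drop follows the decision made above.

```r
# Hypothetical excerpt of a matching table in which "wind chill" appears under two categories
toy <- data.frame(
  event  = c("cold/wind chill", "cold/wind chill",
             "extreme cold/wind chill", "extreme cold/wind chill"),
  string = c("cold", "wind chill", "extreme cold", "wind chill"),
  stringsAsFactors = FALSE
)

# Strings that occur more than once
dup <- unique(toy$string[duplicated(toy$string)])
toy[toy$string %in% dup, ]

# Drop the unwanted row by its content rather than by its number
deduped <- toy[!(toy$event == "cold/wind chill" & toy$string == "wind chill"), ]
```

After this step every string in `deduped` maps to exactly one official category.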
The records are now matched to the 48 official categories.
#transform event type strings to lower case for easier matching and remove leading/trailing white spaces
processed$EVTYPE <- trimws(tolower(processed$EVTYPE))
#map to official categories
processed$EVTmatched <- matching$event[match(processed$EVTYPE, matching$string)]
Under this mapping method, we see that around 22.6% of the records cannot be mapped.
sum(is.na(processed$EVTmatched))/nrow(processed)
## [1] 0.2259789
These include, among others, strings that do not exist in the list of official event names, unidentified abbreviations and typos. We remove these records from our processing dataset.
processed <- processed[!is.na(processed$EVTmatched),]
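The mapping and the removal of unmatched records can be illustrated on a toy example. The lookup table and the four raw entries below are made up; they only show how `match()` returns `NA` for strings that have no exact counterpart.

```r
# Hypothetical lookup table with three strings
lookup <- data.frame(
  event  = c("thunderstorm wind", "cold/wind chill", "cold/wind chill"),
  string = c("thunderstorm wind", "cold/wind chill", "cold"),
  stringsAsFactors = FALSE
)

# Made-up raw entries: padded, abbreviated, exact, and uninformative
raw_types <- c("  Cold ", "TSTM WIND", "THUNDERSTORM WIND", "Other")

cleaned <- trimws(tolower(raw_types))
matched <- lookup$event[match(cleaned, lookup$string)]

# Share of entries without an exact match, and the entries that are kept
share_unmatched <- mean(is.na(matched))
kept <- matched[!is.na(matched)]
```

Here "TSTM WIND" stays unmatched because exact matching does not recognise the abbreviation, which is the kind of record that ends up in the unmapped share.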
Finally, we also need the values of property and crop damage with the appropriate orders of magnitude. For this, the following function is constructed and called:
correctMag <- function(DMG,DMGEXP){
if(!length(DMG)==length(DMGEXP)) stop("vectors of unequal length")
for(i in 1:length(DMG)){
if(DMGEXP[i]=="K") DMG[i] <- DMG[i]*10^3
else if(DMGEXP[i]=="M") DMG[i] <- DMG[i]*10^6
else if(DMGEXP[i]=="B") DMG[i] <- DMG[i]*10^9
}
return(DMG)
}
processed$PROPDMGnum <- correctMag(processed$PROPDMG, processed$PROPDMGEXP)
processed$CROPDMGnum <- correctMag(processed$CROPDMG, processed$CROPDMGEXP)
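As an alternative to the row-by-row loop, the scaling can be sketched in vectorised form with a named lookup vector. The function name `scale_damage` is hypothetical, and the sketch assumes that "K", "M" and "B" stand for thousands, millions and billions and that any other exponent code leaves the value unchanged.

```r
# Hypothetical vectorised variant: look up a multiplier per row, default to 1
scale_damage <- function(dmg, dmgexp) {
  stopifnot(length(dmg) == length(dmgexp))
  multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
  # Indexing by name yields NA for codes that are not K, M or B (including "")
  factor_by_row <- unname(multiplier[as.character(dmgexp)])
  factor_by_row[is.na(factor_by_row)] <- 1
  dmg * factor_by_row
}

# Made-up values: 380 thousand, 2.5 million, 1.2 billion, and an entry without exponent
scaled <- scale_damage(c(380, 2.5, 1.2, 7), c("K", "M", "B", ""))
```

On a dataset with several hundred thousand rows, a lookup of this kind avoids the per-element `if` chain.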
After removing the now unnecessary columns,
processed <- processed[,c(1,9,3,4,10,11)]
the final processed dataset has the following form:
head(processed,4)
## BGN_DATE EVTmatched FATALITIES INJURIES PROPDMGnum CROPDMGnum
## 248768 1996-01-06 winter storm 0 0 380000 38000
## 248769 1996-01-11 tornado 0 0 100000 0
## 248773 1996-01-18 hail 0 0 0 0
## 248774 1996-01-18 high wind 0 0 400000 0
with 505846 rows and 6 columns in total.
dim(processed)
## [1] 505846 6
To find out which weather events pose the greatest threats to the nation, we group the processed dataset by event type and calculate 4 quantities for each event: the total fatality count, the total injury count, the total amount of property damage and the total amount of crop damage throughout the period considered.
grouped <- ddply(processed, .(EVTmatched), summarize,
Fatalities=sum(FATALITIES),
Injuries=sum(INJURIES),
"Property damage"=sum(PROPDMGnum),
"Crop damage"=sum(CROPDMGnum) )
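The grouping step can also be sketched in base R with `aggregate()`, shown here on a made-up four-row dataset with only two of the four quantities. This is an illustration of the same group-and-sum idea, not the code used for the results in this report.

```r
# Made-up records: two floods, one tornado, one hail event
toy <- data.frame(
  EVTmatched = c("flood", "tornado", "flood", "hail"),
  FATALITIES = c(1, 3, 2, 0),
  INJURIES   = c(0, 10, 4, 1),
  stringsAsFactors = FALSE
)

# Sum both columns within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTmatched, data = toy, FUN = sum)
totals
```

The result has one row per event type, with the two flood records combined into a single total.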
We then pick out, for each of the 4 quantities listed, the five weather events with the largest totals.
max5Fatalities <- grouped$EVTmatched[order(grouped$Fatalities,decreasing=T)[1:5]]
max5Injuries <- grouped$EVTmatched[order(grouped$Injuries,decreasing=T)[1:5]]
max5PropDmg <- grouped$EVTmatched[order(grouped$`Property damage`,decreasing=T)[1:5]]
max5CropDmg <- grouped$EVTmatched[order(grouped$`Crop damage`,decreasing=T)[1:5]]
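The ranking idiom can be sketched on a made-up table of totals. The sketch uses `head()` instead of `[1:5]`, a small variation that avoids `NA` entries when the table has fewer rows than the number requested.

```r
# Made-up injury totals for four event types
totals <- data.frame(
  event    = c("flood", "hail", "tornado", "lightning"),
  Injuries = c(40, 5, 200, 12),
  stringsAsFactors = FALSE
)

# order() returns row positions sorted by decreasing total; keep the first two
top2 <- totals$event[head(order(totals$Injuries, decreasing = TRUE), 2)]
top2
```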
The 5 events which caused the most fatalities and injuries are (in order of fatality counts) “excessive heat”, “tornado”, “flash flood”, “lightning” and “flood”.
max5Fatalities
## [1] "excessive heat" "tornado" "flash flood" "lightning"
## [5] "flood"
max5Injuries
## [1] "tornado" "flood" "excessive heat" "lightning"
## [5] "flash flood"
The 5 events which caused the most property damage are “flood”, “tornado”, “flash flood”, “hail” and “hurricane (typhoon)”, whereas those 5 which caused the most crop damage are “drought”, “flood”, “hurricane (typhoon)”, “hail” and “flash flood”.
max5PropDmg
## [1] "flood" "tornado" "flash flood"
## [4] "hail" "hurricane (typhoon)"
max5CropDmg
## [1] "drought" "flood" "hurricane (typhoon)"
## [4] "hail" "flash flood"
Two data frames are then created by reshaping data on these most threatening events.
#data frame by melting data on fatalities and injuries
Dangerous <- melt(grouped, id.var="EVTmatched",
measure.vars=c("Fatalities","Injuries"))
Dangerous <- filter(Dangerous,
EVTmatched %in% max5Fatalities | EVTmatched %in% max5Injuries)
colnames(Dangerous) <- c("Event", "Threat", "Count")
#data frame by melting data on property and crop damage
EconLoss <- melt(grouped, id.var="EVTmatched",
measure.vars=c("Property damage","Crop damage"))
EconLoss <- filter(EconLoss,
EVTmatched %in% max5PropDmg | EVTmatched %in% max5CropDmg)
EconLoss$EVTmatched <- gsub("hurricane \\(typhoon\\)","hurricane", EconLoss$EVTmatched)
colnames(EconLoss) <- c("Event", "Type", "Damage")
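What the reshaping does can be sketched in base R on a made-up two-row table: each wide row with one column per damage type becomes one long row per event and damage type, which is the layout `ggplot2` needs to colour the bars by type. The column names and numbers here are illustrative only.

```r
# Made-up wide table: one row per event, one column per damage type
wide <- data.frame(
  Event    = c("flood", "drought"),
  Property = c(10, 1),
  Crop     = c(2, 5),
  stringsAsFactors = FALSE
)

# Long table: stack the two damage columns and record which column each value came from
long <- data.frame(
  Event  = rep(wide$Event, times = 2),
  Type   = rep(c("Property damage", "Crop damage"), each = nrow(wide)),
  Damage = c(wide$Property, wide$Crop),
  stringsAsFactors = FALSE
)
long
```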
Using the data frames above, we create several plots to visualize the results.
The first plot is a bar chart of the weather events with the highest fatality and injury counts.
gDangerous <- ggplot(Dangerous, aes(y=Count,x=Event,fill=Threat)) +
geom_bar(position="dodge", stat="identity") +
ggtitle("Weather events being the most harmful to population in the U.S. in 1996-2011")
print(gDangerous)
From the chart it can be seen that excessive heat and tornados have resulted in the most fatalities in the period considered. Furthermore, compared to other events, tornados have led to a substantially higher injury count, over 20000 across the 16-year period.
The second plot is a bar chart of the weather events which caused the most property and crop damage.
gEconLoss <- ggplot(EconLoss, aes(y=Damage/10^9,x=Event,fill=Type)) +
geom_bar(position="dodge", stat="identity") + labs(y="Damage (Billion $)") +
ggtitle("Weather events causing the most economic damage in the U.S. in 1996-2011")
print(gEconLoss)
Observe from the chart that floods and tornados have resulted in the most property damage, whereas drought and floods have contributed to the most crop damage.
We can also look at a stacked bar chart to inspect the total economic damage from these events:
gEconLossStack <- ggplot(EconLoss, aes(y=Damage/10^9,x=Event,fill=Type)) +
geom_bar(stat="identity") + labs(y="Damage (Billion $)") +
ggtitle("Weather events causing the most economic damage in the U.S. in 1996-2011")
print(gEconLossStack)
Overall, floods and tornados have brought about the most economic damage. In addition, most of these influential events have caused greater damage to property than to crops, with drought as the exception.