This publication is part of the Coursera Reproducible Research class taught by Dr. Roger Peng as part of the Data Science Specialization.
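Using data collected and documented by the National Oceanic & Atmospheric Administration, we want to answer these two questions: (1) Across the United States, which types of events are most harmful with respect to population health? (2) Across the United States, which types of events have the greatest economic consequences?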
In their raw format, the data have a lot of variability in how specific weather events are labeled as the result of abbreviations, typos, etc. The first part of the processing code simplifies and combines the Event Type classification variable. Then plots are constructed to assess the top ten extreme weather event types by Fatalities (to gauge effect on population health) and Property Damage (as a proxy for economic damage).
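library(ggplot2)      #Plotting
library(RColorBrewer) #brewer.pal() color palettes
library(scales)       #dollar() axis labels
if (!file.exists("repdata-data-StormData.csv.bz2")) {
  link <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" #From the Course Website
  download.file(link, "repdata-data-StormData.csv.bz2")
}
#stringsAsFactors=TRUE keeps text columns as factors (as in the output below) on R >= 4.0
if (!exists("stormDF")) stormDF <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors=TRUE)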
#Let's examine the data
str(stormDF)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(table(stormDF$EVTYPE),10)
##
## HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD
## 1 1 1
## LIGHTNING TSTM WIND TSTM WIND (G45)
## 1 4 1
## WATERSPOUT WIND ?
## 1 1 1
## ABNORMAL WARMTH
## 4
In this raw format, our event type labels are not adequate. For example, we have events listed under WINTER STORM HIGH WINDS, WINTER STORM/HIGH WINDS, and WINTER STORM/HIGH WIND as well as THUNDERSTORM WINDS and TSTM WINDS. We need to consolidate the categories as best we can into the 48 official categories given by the National Weather Service.
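First, labels will be cleaned by capitalizing every label, removing excess whitespace, keeping only the first event when several are listed together, and removing plurals.
Second, labels will be combined based on keyword or abbreviation. For example, “ice” and “icy” should both qualify as “ice storm,” and “thunder” or “tstm” should both qualify as thunderstorms (unless paired with “marine”).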
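Before running this on the full dataset, the two stages can be tried on a handful of labels. This is a minimal sketch on made-up sample labels (the `labels` vector is hypothetical, not taken from the data), and it simplifies the slash handling by ignoring the official categories that legitimately contain a slash.

```r
# Hypothetical sample labels that mimic the kinds of variation described above
labels <- c("  Tstm Wind ", "THUNDERSTORM WINDS", "Icy  Roads",
            "WINTER STORM/HIGH WINDS", "MARINE TSTM WIND")

# Stage 1: clean the labels
clean <- toupper(labels)              # capitalize
clean <- trimws(clean)                # strip leading/trailing whitespace
clean <- gsub("\\s+", " ", clean)     # collapse repeated whitespace
clean <- sub("/.*$", "", clean)       # keep only the first listed event
clean <- sub("S$", "", clean)         # drop a trailing plural "S"

# Stage 2: combine by keyword or abbreviation
is_thunder <- grepl("TSTM|THUN", clean) & !grepl("MARINE", clean)
clean[is_thunder] <- "THUNDERSTORM WIND"
clean[grepl("ICE|ICY", clean)] <- "ICE STORM"

print(clean)
```

The marine label is left alone because the thunderstorm rule explicitly excludes anything containing "MARINE".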
#Goal is to have 48 types (or as near as possible). Currently have 985
eventTypes <- stormDF$EVTYPE
length(levels(eventTypes))
## [1] 985
#Step 1: Capitalize all the Labels
eventTypes <- toupper(as.character(eventTypes))
#Step 2: Remove excess whitespace
eventTypes <- gsub("^\\s+|\\s+$", "", eventTypes)
eventTypes <- gsub("\\s{2}", " ", eventTypes)
#Step 3: Remove multiple event listings
getFirst <- function(label) {
  #Labels without a slash list a single event
  if (!grepl("/", label, fixed=TRUE)) {return(label)}
  #Official categories that legitimately contain a slash
  if (label %in% c("EXTREME COLD/WIND CHILL", "FROST/FREEZE", "STORM SURGE/TIDE")) {return(label)}
  #Otherwise keep only the first event listed
  label <- unlist(strsplit(label, split="/"))
  return(label[1])
}
eventTypes <- sapply(eventTypes, FUN=getFirst)
#Step 4: Remove Plurals
eventTypes <- gsub("[S]$","", eventTypes)
length(levels(as.factor(eventTypes)))
## [1] 694
#Step 5: Combine labels by keyword or abbreviation (the first matching rule wins)
majLabels <- function(label) {
  if (grepl("TORNADO",label)) {return("TORNADO")}
  if (grepl("BLIZZARD",label)) {return("BLIZZARD")}
  if (grepl("HAIL",label)) {return("HAIL")}
  if (grepl("COLD|CHILL|COOL",label)) {return("EXTREME COLD/WIND CHILL")}
  if (grepl("HEAT|HOT|WARM",label)) {return("EXCESSIVE HEAT")}
  if (grepl("HIGH WIND",label)) {return("HIGH WIND")}
  if (grepl("HEAVY SNOW",label)) {return("HEAVY SNOW")}
  if (grepl("TROPICAL STORM",label)) {return("TROPICAL STORM")}
  if (grepl("SNOW",label)) {return("HEAVY SNOW")}
  if (grepl("HEAVY RAIN",label)) {return("HEAVY RAIN")}
  if (grepl("DRY|DROUGHT",label)) {return("DROUGHT")}
  if (grepl("ICE|ICY",label)) {return("ICE STORM")}
  if (grepl("HURRICANE|TYPHOON",label)) {return("HURRICANE (TYPHOON)")}
  #Thunderstorm labels, unless they are marine events
  if (!grepl("MARINE",label) &&
      grepl("TSTM|THUN",label)) {return("THUNDERSTORM WIND")}
  #Generic floods only; flash and lakeshore floods keep their own categories
  if (!grepl("FLASH|LAKESHORE",label) &&
      grepl("FLOOD",label)) {return("FLOOD")}
  label
}
eventTypes <- sapply(eventTypes, FUN=majLabels)
length(levels(as.factor(eventTypes)))
## [1] 313
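Because majLabels returns at the first match, the order of the checks decides which category a label with several keywords ends up in. The sketch below shows the same first-match-wins idea in a table-driven form on made-up labels. The names `rules`, `apply_rules`, and `sample_labels` are hypothetical, only four of the rules are reproduced, and this is an illustration rather than the code used for the results in this report.

```r
# Ordered rules: names are regular expressions, values are target categories.
# Only a few of the rules above are reproduced here, for illustration.
rules <- c("TORNADO"         = "TORNADO",
           "HAIL"            = "HAIL",
           "COLD|CHILL|COOL" = "EXTREME COLD/WIND CHILL",
           "SNOW"            = "HEAVY SNOW")

# Apply the rules in order; a label is relabeled by the first rule it matches
apply_rules <- function(labels, rules) {
  out  <- labels
  done <- rep(FALSE, length(labels))
  for (i in seq_along(rules)) {
    hit <- !done & grepl(names(rules)[i], labels)
    out[hit] <- rules[[i]]
    done <- done | hit
  }
  out
}

# Hypothetical sample labels
sample_labels <- c("TORNADO F2", "SMALL HAIL", "COLD AND SNOW", "LIGHT SNOW", "LIGHTNING")
combined <- apply_rules(sample_labels, rules)
print(combined)
```

"COLD AND SNOW" lands in the cold category because the cold rule comes before the snow rule, and "LIGHTNING" passes through unchanged because no rule matches it.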
#Reapply to dataset and filter
stormDF$EVTYPE <- as.factor(eventTypes)
#Official event categories; built once rather than inside a per-row function
masterList <- toupper(c("Astronomical Low Tide",
                        "Avalanche",
                        "Blizzard",
                        "Coastal Flood",
                        "Cold/Wind Chill",
                        "Debris Flow",
                        "Dense Fog",
                        "Dense Smoke",
                        "Drought",
                        "Dust Devil",
                        "Dust Storm",
                        "Excessive Heat",
                        "Extreme Cold/Wind Chill",
                        "Flash Flood",
                        "Flood",
                        "Frost/Freeze",
                        "Funnel Cloud",
                        "Freezing Fog",
                        "Hail",
                        "Heat",
                        "Heavy Rain",
                        "Heavy Snow",
                        "High Surf",
                        "High Wind",
                        "Hurricane (Typhoon)",
                        "Ice Storm",
                        "Lake-Effect Snow",
                        "Lakeshore Flood",
                        "Lightning",
                        "Marine Hail",
                        "Marine High Wind",
                        "Marine Strong Wind",
                        "Marine Thunderstorm Wind",
                        "Rip Current",
                        "Seiche",
                        "Sleet",
                        "Storm Surge/Tide",
                        "Strong Wind",
                        "Thunderstorm Wind",
                        "Tornado",
                        "Tropical Depression",
                        "Tropical Storm",
                        "Tsunami",
                        "Volcanic Ash",
                        "Waterspout",
                        "Wildfire",
                        "Winter Storm",
                        "Winter Weather"))
#Keep only the rows whose cleaned label is an official category
stormDF <- subset(stormDF, EVTYPE %in% masterList)
Across the United States, which types of events are most harmful with respect to population health? To answer this, let’s identify the event types responsible for the most deaths throughout the entire time range of this dataset.
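The grouping step can be checked on a small table first. This is a minimal sketch with made-up numbers (the `toy` data frame is hypothetical), showing how `tapply` totals one column within each level of another and how `sort` then ranks the totals.

```r
# Hypothetical toy data: five events across three event types
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(3, 1, 2, 0, 1)
)

# Total fatalities per event type, largest first
totals <- sort(tapply(toy$FATALITIES, toy$EVTYPE, sum), decreasing = TRUE)
print(totals)

# The names of the sorted vector are the event types in rank order
print(names(totals))
```

The same pattern is applied to the full dataset below, keeping only the first ten entries.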
#Use tapply to total FATALITIES by EVTYPE, sorted from most to fewest deaths
sums <- sort(tapply(stormDF$FATALITIES, stormDF$EVTYPE, sum), decreasing=TRUE)
#Keep the top ten event types and display the table
topEvents <- data.frame(Event_Type=factor(names(sums[1:10])), Deaths=as.numeric(sums[1:10]))
print(topEvents)
## Event_Type Deaths
## 1 TORNADO 5658
## 2 EXCESSIVE HEAT 3159
## 3 FLASH FLOOD 994
## 4 LIGHTNING 816
## 5 THUNDERSTORM WIND 711
## 6 RIP CURRENT 577
## 7 FLOOD 506
## 8 EXTREME COLD/WIND CHILL 467
## 9 HIGH WIND 297
## 10 AVALANCHE 224
And the graphical representation of this same table:
#Take the table produced above and use ggplot2 to create a bar graph
graph.health <- ggplot(data=topEvents)
graph.health <- graph.health + aes(x=reorder(Event_Type,Deaths), y=Deaths, fill=factor(reorder(Event_Type,Deaths))) +
  ggtitle("Fig 1. Fatalities by Event Type") +
  xlab("Event Type") +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  scale_fill_manual(values=c(brewer.pal(9,"Reds"),"#000000")) + #Nine reds, with black for the deadliest event type
  guides(fill="none") #Hide the redundant fill legend
graph.health
So we can see from the graph and from the table that tornadoes have caused more harm in terms of population health (deaths) than any other form of severe weather.
Across the United States, which types of events have the greatest economic consequences? Let’s also look at the aggregate property damage (PROPDMG) and crop damage (CROPDMG) attributed to each event type.
To do this, we will take the sum of all property damage plus all crop damage across all storms, grouped by event type, and then display a sorted table of the top 10 event types by dollars of damage.
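The dataset also carries exponent columns, PROPDMGEXP and CROPDMGEXP, next to the damage values, and the sums in this section add the raw PROPDMG and CROPDMG numbers without them. A minimal sketch of how the values could be scaled to dollars follows, assuming "K", "M", and "B" stand for thousands, millions, and billions, and treating every other code as unscaled (that fallback is an assumption for illustration only). The helper name `exp_multiplier` and the `toy` data frame are hypothetical.

```r
# Assumed mapping from exponent code to multiplier; not part of the analysis above
exp_multiplier <- function(code) {
  code <- toupper(as.character(code))
  mult <- rep(1, length(code))   # blank or unrecognized codes: leave unscaled (assumption)
  mult[code == "K"] <- 1e3       # thousands
  mult[code == "M"] <- 1e6       # millions
  mult[code == "B"] <- 1e9       # billions
  mult
}

# Hypothetical toy rows
toy <- data.frame(PROPDMG    = c(25, 2.5, 1.2, 10),
                  PROPDMGEXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)

# Scale each raw value by its multiplier
toy$PROP_DOLLARS <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

Applied to the full data, a scaled column like this could replace the raw PROPDMG + CROPDMG sum, which would likely change the ranking, since a single value recorded in billions outweighs many recorded in thousands.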
sums<-sort(tapply((stormDF$PROPDMG + stormDF$CROPDMG),stormDF$EVTYPE,sum),decreasing=TRUE)
topEvents<-data.frame(Event_Type=factor(names(sums[1:10]),ordered=TRUE),Damage=as.numeric(sums[1:10]))
print(topEvents)
## Event_Type Damage
## 1 TORNADO 3314617.78
## 2 THUNDERSTORM WIND 2873494.34
## 3 FLASH FLOOD 1603895.55
## 4 HAIL 1272128.79
## 5 FLOOD 1157771.83
## 6 LIGHTNING 606937.39
## 7 HIGH WIND 403286.98
## 8 HEAVY SNOW 154043.28
## 9 WINTER STORM 135699.58
## 10 WILDFIRE 89923.54
That same table in graphical form:
graph.econ <- ggplot(data=topEvents)
graph.econ <- graph.econ + aes(x=reorder(Event_Type,Damage), y=Damage, fill=factor(reorder(Event_Type,Damage))) +
  ggtitle("Fig 2. Economic Damage by Event Type") +
  xlab("Event Type") +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  scale_y_continuous(labels=dollar) +
  scale_fill_manual(values=c(brewer.pal(9,"Greens"),"#000000")) + #Nine greens, with black for the costliest event type
  guides(fill="none") #Hide the redundant fill legend
graph.econ
So we can see from the graph and from the table that, again, tornadoes have caused more economic harm (in terms of damage to crops and property) than any other form of severe weather.