Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
This data analysis addresses the following questions:
- Across the United States, which types of events are most harmful with respect to population health?
- Across the United States, which types of events have the greatest economic consequences
The event type variable from the NOAA database has been cleaned (typo’s, abbreviations,upper-/lowercase, …) before analysis.
Both fatalities and injuries are used as seperate measures for “most harmful to population health”, and accross the US, tornado’s are by far the most harmful weather events, followed by heat events.
Economic consequences (measured as the sum of both property- and cropdamage) are worst for floods, followed by hurricanes. Tornado’s and storm surge damage is important as well.
Regional variations do exist, although the most severe events for each region are mostly the ones above (such as fatalities from avalanches in Mountains region and injuries from wildfires in the Pacific region).
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url=url,destfile = "StormData.csv.bz2")
storm <- read.csv("StormData.csv.bz2")
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(storm)
The preprocessing uses the dplyr package extensively. Care is taken to use only the relevant columns for further use, in order to reduce the dataset size.
library("dplyr")
library("ggplot2")
library("gridExtra")
# keep only the columns useful for further analysis
storm <- storm %>% select(STATE,BGN_DATE,EVTYPE, FATALITIES, INJURIES, PROPDMGEXP,PROPDMG,CROPDMGEXP,CROPDMG,LATITUDE)
The year (dataset contains events between 1950-2011) is retrieved from begindate.
# get year of event
storm$begindate <- as.Date(storm$BGN_DATE,"%m/%d/%Y")
storm$YR <- format(storm$begindate,"%Y")
The region is derived from the state (and latitude) in order to have a geographical variable with fewer values than state, so it can be shown in a plot more easily. Some of the non-state codes, like LO (=Lake Ontario) have been mapped (manually) using the STATEOFFIC variable to check what they meant. The Atlantic codes AN and AM have been mapped to one of the three Atlantic Regions by using LATITUDE.
# set regions :
storm$Region <- ifelse( storm$STATE %in% c('IL','IN','OH','MI','WI','LC','LE','LH','LM', 'LO', 'LS','SL'), 'East North-Central',
ifelse( storm$STATE %in% c('AL','KY','MS','TN'), 'East South-Central',
ifelse( storm$STATE %in% c('NJ','NY','PA'), 'Mid-Atlantic',
ifelse( storm$STATE %in% c('AZ','CO','ID','MT','NV','NM','UT','WY'), 'Mountain',
ifelse( storm$STATE %in% c('CT','ME','NH','MA','RI','VT'), 'New England',
ifelse( storm$STATE %in% c('AK','CA','HI','OR','WA','GU', 'AS','VI','MH','PM','PH','PK','PZ'), 'Pacific',
ifelse( storm$STATE %in% c('DE','DC','FL','GA','NC','MD','SC','VA','WV','PR'), 'South Atlantic',
ifelse( storm$STATE %in% c('IA','KS','NE','ND','MN','MO','SD'), 'West North-Central',
ifelse( storm$STATE %in% c('AR','LA','OK','TX','GM'), 'West South-Central',
ifelse( storm$STATE %in% c('AM','AN'), ifelse(storm$LATITUDE<3872,'South Atlantic',ifelse(storm$LATITUDE>4133,'New England','Mid-Atlantic')),NA
))))))))))
storm$Region <- as.factor(storm$Region)
## keep only the columns useful for analysis
storm <- storm %>% select(Region,YR,EVTYPE, FATALITIES, INJURIES, PROPDMGEXP,PROPDMG,CROPDMGEXP,CROPDMG)
From page 12 of the documentation
one deduces that each damage (property or crop) is spread over two variables, one containing the significant digits, the other (with the EXP suffix) a character denoting its magnitude. Here we only took into account “K” for thousands, “M” for millions, and “B” for billions. Subsequently the damage figures were reconstructed, and only the total damage has been retained.
# set damage
storm$PROPDMGEXP <- toupper(storm$PROPDMGEXP)
storm$CROPDMGEXP <- toupper(storm$CROPDMGEXP)
storm$PropertyDamage <- ifelse( storm$PROPDMGEXP == "K", storm$PROPDMG*1000,
ifelse( storm$PROPDMGEXP == "M", storm$PROPDMG*1000000,
ifelse( storm$PROPDMGEXP == "B", storm$PROPDMG*1000000000,storm$PROPDMG )))/1000000
storm$CropDamage <- ifelse( storm$CROPDMGEXP == "K", storm$CROPDMG*1000,
ifelse( storm$CROPDMGEXP == "M", storm$CROPDMG*1000000,
ifelse( storm$CROPDMGEXP == "B", storm$CROPDMG*1000000000,storm$CROPDMG )))/1000000
storm$Damage <- storm$PropertyDamage + storm$CropDamage
## keep only the columns useful for analysis
storm <- storm %>% select(YR,Region,EVTYPE, FATALITIES, INJURIES, Damage)
The variable EVTTYPE apparently suffered from sloppy data-entry. I have tried to do some cleaning up.
## a little cleanup of the EVTTYPE ##############################################################################
# all upper case & remove trailing blanks
storm$EventType <- toupper(storm$EVTYPE)
storm$EventType <- gsub("^\\s+|\\s+$", "", storm$EventType)
# group events together which seem the same
storm$EventType <- gsub("WND", "WIND", storm$EventType)
storm$EventType <- gsub("WINDS", "WIND", storm$EventType)
storm$EventType <- gsub("TSTM", "THUNDERSTORM", storm$EventType)
storm$EventType <-gsub("HURRICANE.*", "HURRICANE", storm$EventType)
storm$EventType <- gsub("^TORNADO.*", "TORNADO", storm$EventType)
storm$EventType <- gsub("^HAIL.*", "HAIL", storm$EventType)
storm$EventType <- gsub("^HIGH WIND.*", "HIGH WIND", storm$EventType)
storm$EventType <- gsub("^STRONG WIND.*", "HIGH WIND", storm$EventType)
storm$EventType <- gsub("HEAVY SURF/HIGH SURF", "HIGH SURF", storm$EventType)
storm$EventType <- gsub("WILD/FOREST FIRE", "WILDFIRE", storm$EventType)
storm$EventType <- gsub("^THUNDERSTORM WIND.*", "THUNDERSTORM WIND", storm$EventType)
storm$EventType <- gsub("^FLASH FLOOD.*", "FLASH FLOOD", storm$EventType)
storm$EventType <- gsub("EXCESSIVE HEAT|RECORD WARMTH|HIGH TEMPERATURE RECORD|RECORD HEAT|UNUSUAL/RECORD WARMTH|UNSEASONABLY WARM YEAR|HOT SPELL|UNUSUAL WARMTH|HOT WEATHER|UNUSUALLY WARM|RECORD WARM|WARM WEATHER|VERY WARM|ABNORMAL WARMTH|UNSEASONABLY HOT|PROLONG WARMTH|HEAT TEMPS.|HEAt WAVE|EXTREME HEAT|RECORD HIGH TEMPERATURES|HEAT WAVE|HEAT WAVES", "HEAT", storm$EventType)
storm$EventType <- gsub("EXCESSIVELY DRY|ABNORMALLY DRY|DRY WEATHER|RECORD DRYNESS|DRIEST MONTH|VERY DRY|DRY SPELL|UNSEASONABLY DRY", "DRYNESS", storm$EventType)
storm$EventType <- gsub("UNUSUALLY COLD|UNSEASONAL LOW TEMP|EXTREME COLD/WIND CHILL|COLD/WIND CHILL|RECORD COOL|COOL SPELL|COLD WIND CHILL TEMPERATURES|COLD TEMPERATURES|UNSEASONABLE COLD|COLD TEMPERATURE|COLD WEATHER|RECORD COLD|EXTENDED COLD|UNSEASONABLY COOL|EXTREME COLD|LOW TEMPERATURE RECORD|RECORD COLD/FROST|SEVERE COLD", "COLD", storm$EventType)
storm$EventType <- gsub("HEAVY SNOW SHOWER|LATE SEASON SNOWFALL|MOUNTAIN SNOWS|MODERATE SNOWFALL|FALLING SNOW/ICE|LATE SNOW|MONTHLY SNOWFALL|EXCESSIVE SNOW|LATE SEASON SNOW|RECORD WINTER SNOW|SNOW ADVISORY|MODERATE SNOW|LIGHT SNOWFALL|EARLY SNOWFALL|UNUSUALLY SNOW|RECORD SNOWFALL|HEAVY SNOW|RECORD MAY SNOW|RECORD SNOW|SNOW AND SNOW|SNOW AND", "SNOW", storm$EventType)
storm$EventType <- gsub("^RIP CURRENT.*","RIP CURRENTS", storm$EventType)
storm$EventType <- gsub("^URBAN","URBAN FLOODS", storm$EventType)
storm$EventType <- gsub("^STORM SURGE.*", "STORM SURGE", storm$EventType)
storm$EventType <- as.factor(storm$EventType)
## remove summary rows, irrelevant in the dataset
storm <- storm %>% filter(!grepl('SUMMARY',EventType))
## Warning: package 'bindrcpp' was built under R version 3.4.3
## keep only the columns useful for analysis
storm <- storm %>% select(YR,Region,EventType, FATALITIES, INJURIES, Damage)
First the worst 10 event types are presented, for each category (fatalities,injuries or damage(in million dollars))
# totals
TOTFAT <- sum(storm$FATALITIES)
TOTINJ <- sum(storm$INJURIES)
TOTDAM <- sum(storm$Damage)
agg <- storm %>%
select(EventType, FATALITIES, INJURIES, Damage) %>%
group_by(EventType) %>%
summarise(FATALITIES = sum(FATALITIES), FATALITIES_PERC = round(100*sum(FATALITIES)/TOTFAT,1),
INJURIES = sum(INJURIES), INJURIES_PERC = round(100*sum(INJURIES)/TOTINJ,1),
Damage = sum(Damage),Damage_PERC = round(100*sum(Damage)/TOTDAM,1),
frequency=n())
# Top deadly events (fatalities)
agg %>%
select(EventType, FATALITIES, FATALITIES_PERC, INJURIES_PERC, Damage_PERC,frequency) %>%
arrange(desc(FATALITIES))
# Top events with most injuries (injuries)
agg %>%
select(EventType, INJURIES, INJURIES_PERC,FATALITIES_PERC, Damage_PERC,frequency) %>%
arrange(desc(INJURIES))
# Top events with most (non-human) economical damage (damage)
agg %>%
select(EventType, Damage, Damage_PERC,INJURIES_PERC,FATALITIES_PERC,frequency) %>%
arrange(desc(Damage))
The plot below represents the types of severe weather events which contribute most to any of the three categories.
aggsmall <- agg[agg$FATALITIES_PERC+agg$INJURIES_PERC+agg$Damage_PERC>1,]
plot2 <- ggplot(aggsmall, aes(x=reorder(EventType,-FATALITIES), y=FATALITIES)) + geom_bar(stat = "identity", color="red", fill="white") + coord_flip() + xlab("")+theme(axis.text.y = element_blank())
plot1 <- ggplot(aggsmall, aes(x=reorder(EventType,-FATALITIES), y=INJURIES)) + geom_bar(stat = "identity", color="blue", fill="white") + coord_flip() + xlab("")
plot3 <- ggplot(aggsmall, aes(x=reorder(EventType,-FATALITIES), y=Damage)) + geom_bar(stat = "identity", color="green", fill="white") + coord_flip() + xlab("")+theme(axis.text.y = element_blank())
grid.arrange(plot1, plot2,plot3, ncol=3)
As an extra the same plot is presented below, per US Region, showing regional differences.
## AGGREGATES per REGION
aggREG <- storm %>%
select(Region,EventType, FATALITIES, INJURIES, Damage) %>%
group_by(Region,EventType) %>%
summarise(FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES),
Damage = sum(Damage), Frequency=n()) %>%
arrange(desc(FATALITIES))
aggsmall <- aggREG[aggREG$FATALITIES>40,]
plot4 <- ggplot(aggsmall, aes(x=reorder(EventType,-FATALITIES), y=FATALITIES,fill = Region)) + geom_bar(stat = "identity") + facet_grid(~ Region,scales="free") + xlab("") + coord_flip()+ guides(fill=FALSE) + theme(axis.text.x = element_text(angle=30))
aggsmall <- aggREG[aggREG$INJURIES>300,]
plot5 <- ggplot(aggsmall, aes(x=reorder(EventType,-INJURIES), y=INJURIES,fill = Region)) + geom_bar(stat = "identity") + facet_grid(~ Region,scales="free") + xlab("") + coord_flip() + guides(fill=FALSE)+ theme(axis.text.x = element_text(angle=30))
aggsmall <- aggREG[aggREG$Damage>1500,]
plot6 <- ggplot(aggsmall, aes(x=reorder(EventType,-Damage), y=Damage,fill = Region)) + geom_bar(stat = "identity") + facet_grid(~ Region,scales="free") + xlab("") + coord_flip()+ guides(fill=FALSE)+theme(axis.text.x = element_text(angle=30))
grid.arrange(plot4, plot5,plot6, nrow=3)