Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This analysis involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [47Mb] https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
National Climatic Data Center Storm Events FAQ https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Let’s get started by loading the appropriate packages.
setwd("C:/Users/johnarmistead/Data Sci Coursera/Class5/StormData")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(scales)
library(reshape)
##
## Attaching package: 'reshape'
## The following object is masked from 'package:lubridate':
##
## stamp
## The following object is masked from 'package:dplyr':
##
## rename
Next let’s load and take a look at the data. It’s a pretty big file, so we read it once into a “bare” data frame that we leave untouched; a working copy called “raw” is what we’ll actually munge.
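(An optional aside, not part of the original code: one way to make the load-once idea explicit is to guard the read with exists(), so that re-running the script never re-reads the 47 Mb file.)
#Optional guard (illustrative): skip the slow read if 'bare' already exists in the workspace
if (!exists("bare")) {
    bare <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
}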
bare <- read.csv("repdata%2Fdata%2FStormData.csv.bz2") #read.csv decompresses the .bz2 file directly and already returns a data frame
raw <- bare
str(raw)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(raw)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
OK, it looks like we need to clean up the dates first. We won’t actually need them for the “all time” totals below, but a proper date column and a year column might be helpful for later analysis.
raw$BGN_DATE <- mdy_hms(raw$BGN_DATE) #convert to date-readable format
raw$YEAR <- year(raw$BGN_DATE) #add column for year in case of further analysis later
head(raw)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 1950-04-18 0130 CST 97 MOBILE AL TORNADO
## 2 1 1950-04-18 0145 CST 3 BALDWIN AL TORNADO
## 3 1 1951-02-20 1600 CST 57 FAYETTE AL TORNADO
## 4 1 1951-06-08 0900 CST 89 MADISON AL TORNADO
## 5 1 1951-11-15 1500 CST 43 CULLMAN AL TORNADO
## 6 1 1951-11-15 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES
## 1 0 14.0 100 3 0 0 15
## 2 0 2.0 150 2 0 0 0
## 3 0 0.1 123 2 0 0 2
## 4 0 0.0 100 2 0 0 2
## 5 0 0.0 150 2 0 0 2
## 6 0 1.5 177 2 0 0 6
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## 1 25.0 K 0 3040
## 2 2.5 K 0 3042
## 3 25.0 K 0 3340
## 4 2.5 K 0 3458
## 5 2.5 K 0 3412
## 6 2.5 K 0 3450
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM YEAR
## 1 8812 3051 8806 1 1950
## 2 8755 0 0 2 1950
## 3 8742 0 0 3 1951
## 4 8626 0 0 4 1951
## 5 8642 0 0 5 1951
## 6 8748 0 0 6 1951
Now we need to get the property and crop damage values into consistent dollar amounts. In the exponent columns, K indicates thousands, M millions, and B billions. We create multiplier columns and scale the original damage values by the appropriate power of ten.
raw$PROPFACTOR <- 0
raw$PROPFACTOR[raw$PROPDMGEXP %in% c("k", "K")] <- 1000
raw$PROPFACTOR[raw$PROPDMGEXP %in% c("m", "M")] <- 1000000
raw$PROPFACTOR[raw$PROPDMGEXP %in% c("b", "B")] <- 1000000000
raw$PROPDMG <- raw$PROPDMG*raw$PROPFACTOR #multiply original variable by Exp factor
raw$CROPFACTOR <- 0
raw$CROPFACTOR[raw$CROPDMGEXP %in% c("k", "K")] <- 1000
raw$CROPFACTOR[raw$CROPDMGEXP %in% c("m", "M")] <- 1000000
raw$CROPFACTOR[raw$CROPDMGEXP %in% c("b", "B")] <- 1000000000
raw$CROPDMG <- raw$CROPDMG*raw$CROPFACTOR #multiply original variable by Exp factor
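One caveat worth flagging: the exponent columns also contain a handful of other codes (digits, “?”, “+”, “-”), which the mapping above leaves at a multiplier of zero, so those damage amounts drop out of the totals. An optional quick check of how many rows are affected:
#Optional check: rows whose exponent code is not blank or K/M/B keep a multiplier of 0 above
sum(!raw$PROPDMGEXP %in% c("", "k", "K", "m", "M", "b", "B"))
sum(!raw$CROPDMGEXP %in% c("", "k", "K", "m", "M", "b", "B"))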
Now let’s take a quick look at how often each event type appears.
#preliminary analysis
data <- raw %>%
group_by(EVTYPE) %>%
summarize(n=n()) %>% #count records by event type for a quick glance at the data
arrange(desc(n))
print(data)
## Source: local data frame [985 x 2]
##
## EVTYPE n
## (fctr) (int)
## 1 HAIL 288661
## 2 TSTM WIND 219940
## 3 THUNDERSTORM WIND 82563
## 4 TORNADO 60652
## 5 FLASH FLOOD 54277
## 6 FLOOD 25326
## 7 THUNDERSTORM WINDS 20843
## 8 HIGH WIND 20212
## 9 LIGHTNING 15754
## 10 HEAVY SNOW 15708
## .. ... ...
At first glance, Hail dominates by sheer count, but frequency alone says little about human or dollar impact. A more plausible hypothesis is that Storms or Tornadoes account for most of the harm to people.
There appear to be several acronyms and various classifications of events that should be cleaned up. Let’s take a look at the whole list of weather events.
eventlist <- sort(unique(raw$EVTYPE))
head(eventlist, n=100) #show only the first 100 because the full list has 985 unique values
## [1] HIGH SURF ADVISORY COASTAL FLOOD
## [3] FLASH FLOOD LIGHTNING
## [5] TSTM WIND TSTM WIND (G45)
## [7] WATERSPOUT WIND
## [9] ? ABNORMAL WARMTH
## [11] ABNORMALLY DRY ABNORMALLY WET
## [13] ACCUMULATED SNOWFALL AGRICULTURAL FREEZE
## [15] APACHE COUNTY ASTRONOMICAL HIGH TIDE
## [17] ASTRONOMICAL LOW TIDE AVALANCE
## [19] AVALANCHE BEACH EROSIN
## [21] Beach Erosion BEACH EROSION
## [23] BEACH EROSION/COASTAL FLOOD BEACH FLOOD
## [25] BELOW NORMAL PRECIPITATION BITTER WIND CHILL
## [27] BITTER WIND CHILL TEMPERATURES Black Ice
## [29] BLACK ICE BLIZZARD
## [31] BLIZZARD AND EXTREME WIND CHIL BLIZZARD AND HEAVY SNOW
## [33] Blizzard Summary BLIZZARD WEATHER
## [35] BLIZZARD/FREEZING RAIN BLIZZARD/HEAVY SNOW
## [37] BLIZZARD/HIGH WIND BLIZZARD/WINTER STORM
## [39] BLOW-OUT TIDE BLOW-OUT TIDES
## [41] BLOWING DUST blowing snow
## [43] Blowing Snow BLOWING SNOW
## [45] BLOWING SNOW- EXTREME WIND CHI BLOWING SNOW & EXTREME WIND CH
## [47] BLOWING SNOW/EXTREME WIND CHIL BREAKUP FLOODING
## [49] BRUSH FIRE BRUSH FIRES
## [51] COASTAL FLOODING/EROSION COASTAL EROSION
## [53] Coastal Flood COASTAL FLOOD
## [55] coastal flooding Coastal Flooding
## [57] COASTAL FLOODING COASTAL FLOODING/EROSION
## [59] Coastal Storm COASTAL STORM
## [61] COASTAL SURGE COASTAL/TIDAL FLOOD
## [63] COASTALFLOOD COASTALSTORM
## [65] Cold COLD
## [67] COLD AIR FUNNEL COLD AIR FUNNELS
## [69] COLD AIR TORNADO Cold and Frost
## [71] COLD AND FROST COLD AND SNOW
## [73] COLD AND WET CONDITIONS Cold Temperature
## [75] COLD TEMPERATURES COLD WAVE
## [77] COLD WEATHER COLD WIND CHILL TEMPERATURES
## [79] COLD/WIND CHILL COLD/WINDS
## [81] COOL AND WET COOL SPELL
## [83] CSTL FLOODING/EROSION DAM BREAK
## [85] DAM FAILURE Damaging Freeze
## [87] DAMAGING FREEZE DEEP HAIL
## [89] DENSE FOG DENSE SMOKE
## [91] DOWNBURST DOWNBURST WINDS
## [93] DRIEST MONTH Drifting Snow
## [95] DROUGHT DROUGHT/EXCESSIVE HEAT
## [97] DROWNING DRY
## [99] DRY CONDITIONS DRY HOT WEATHER
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
If we ran our analysis without consolidating these event types, we could undercount categories whose records are spread across many different spellings. Let’s collapse event types that contain certain key words or phrases into a single label.
raw$EVTYPE <- gsub(".*HEAT.*", "Heat", raw$EVTYPE) #Make every column that contains the word "HEAT" to "Heat"
raw$EVTYPE <- gsub(".*TSTM.*", "Storm", raw$EVTYPE) # and so on
raw$EVTYPE <- gsub(".*STORM.*", "Storm", raw$EVTYPE)
raw$EVTYPE <- gsub(".*FLOOD.*", "Flood", raw$EVTYPE)
raw$EVTYPE <- gsub(".*STRONG WIND.*", "Wind", raw$EVTYPE)
raw$EVTYPE <- gsub(".*HIGH WIND.*", "Wind", raw$EVTYPE)
raw$EVTYPE <- gsub(".*FIRE.*", "Fire", raw$EVTYPE)
raw$EVTYPE <- gsub(".*HAIL.*", "Hail", raw$EVTYPE)
raw$EVTYPE <- gsub(".*BLIZZARD.*", "Blizzard", raw$EVTYPE)
raw$EVTYPE <- gsub(".*HURRICANE.*", "Hurricane", raw$EVTYPE)
raw$EVTYPE <- gsub(".*COLD.*", "Cold", raw$EVTYPE)
raw$EVTYPE <- gsub(".*WINTER.*", "Cold", raw$EVTYPE)
raw$EVTYPE <- gsub(".*RIP.*", "Tide/Surf", raw$EVTYPE)
raw$EVTYPE <- gsub(".*FOG.*", "Fog", raw$EVTYPE)
raw$EVTYPE <- gsub(".*AVALANCHE.*", "Avalanche", raw$EVTYPE)
raw$EVTYPE <- gsub(".*RAIN.*", "Rain", raw$EVTYPE)
raw$EVTYPE <- gsub(".*SURF.*", "Tide/Surf", raw$EVTYPE)
raw$EVTYPE <- gsub(".*THUNDERSTORM.*", "Storm", raw$EVTYPE)
raw$EVTYPE <- gsub(".*CLOUD.*", "Storm", raw$EVTYPE)
raw$EVTYPE <- gsub(".*LIGHTNING.*", "Lightning", raw$EVTYPE)
raw$EVTYPE <- gsub(".*SNOW.*", "Snow", raw$EVTYPE)
raw$EVTYPE <- gsub(".*TORNADO.*", "Tornado", raw$EVTYPE)
raw$EVTYPE <- gsub(".*FROST.*", "Cold", raw$EVTYPE)
raw$EVTYPE <- gsub(".*DROUGHT.*", "Drought", raw$EVTYPE)
Now that we’ve done that, let’s take another look at the data in summary.
data <- raw %>%
group_by(EVTYPE) %>%
summarize(n=n()) %>% #quickly summarize again
arrange(desc(n))
print(data)
## Source: local data frame [431 x 2]
##
## EVTYPE n
## (chr) (int)
## 1 Storm 358793
## 2 Hail 289270
## 3 Flood 82659
## 4 Tornado 60698
## 5 Wind 25763
## 6 Snow 17543
## 7 Lightning 15760
## 8 Rain 12165
## 9 Cold 11974
## 10 Fire 4240
## .. ... ...
Clearly, Storms account for far more records than the first summary suggested. Lastly, let’s make sure the impact variables are stored as numeric.
raw$FATALITIES <- as.numeric(raw$FATALITIES)
raw$INJURIES <- as.numeric(raw$INJURIES)
raw$PROPDMG <- as.numeric(raw$PROPDMG)
raw$CROPDMG <- as.numeric(raw$CROPDMG)
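These conversions are defensive; the str() output above already reported all four columns as numeric. An optional one-liner to confirm:
#Optional check: all four impact columns should report class "numeric"
sapply(raw[, c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")], class)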
Now for the fun part. Here we take our first look at the human impact of weather events: fatalities and injuries.
humtest <- raw %>%
group_by(EVTYPE) %>%
mutate(Total=FATALITIES+INJURIES) %>% #add column for total for easy sorting and viewing as table if necessary
summarise(Fatalities=sum(FATALITIES),
Injuries=sum(INJURIES),
Total=sum(Total)) %>%
arrange(desc(Total))
head(humtest)
## Source: local data frame [6 x 4]
##
## EVTYPE Fatalities Injuries Total
## (chr) (dbl) (dbl) (dbl)
## 1 Tornado 5636 91407 97043
## 2 Storm 1177 13759 14936
## 3 Heat 3138 9154 12292
## 4 Flood 1523 8601 10124
## 5 Lightning 817 5231 6048
## 6 Wind 422 1831 2253
What are the mean fatalities and injuries per event type?
mean(humtest$Fatalities)
## [1] 35.13921
mean(humtest$Injuries)
## [1] 326.051
Let’s re-run the summary, keeping only event types with above-average fatalities or injuries, and then subset to the top 12 by total human impact.
humset <- raw %>%
group_by(EVTYPE) %>%
mutate(Total=FATALITIES+INJURIES) %>%
summarise(Fatalities=sum(FATALITIES),
Injuries=sum(INJURIES),
Total=sum(Total)) %>%
filter(Fatalities > 35 | Injuries > 326) %>% #filter above averages
arrange(desc(Total))
humset <- head(humset, n=12) #subset only Top 12
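Note that 35 and 326 are just the means printed above, typed in by hand. An effectively equivalent version can reference the computed means directly, working from humtest, which already holds the summarized totals (humset2, mean_fat, and mean_inj are illustrative names, not part of the original analysis):
#Same filter without hard-coded thresholds (illustrative alternative)
mean_fat <- mean(humtest$Fatalities) #about 35.1, as printed earlier
mean_inj <- mean(humtest$Injuries) #about 326.1, as printed earlier
humset2 <- humtest %>%
    filter(Fatalities > mean_fat | Injuries > mean_inj) %>%
    arrange(desc(Total)) %>%
    head(n=12)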
Finally, we’ll unpivot the table and plot it.
human <- cbind(humset[1], stack(humset[-1])) %>% #unpivot so all values are in same column
filter(ind!="Total")
humplot <- ggplot(human) +
aes(x=reorder(EVTYPE, +values), y=values, fill=ind) + # plot and order high to low
geom_bar(stat="identity")+coord_flip() + #geom_bar plots the pre-summed values directly
ggtitle("Total Human Impact by Top 12 Weather Events") +
theme(plot.title=element_text(lineheight =0.8, face="bold"))+xlab("Event Type")
print(humplot)
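As an aside, the reshape package loaded at the top goes otherwise unused; its melt() function would perform the same unpivot as the cbind()/stack() step. A sketch (human2 is an illustrative name; melt() names the value column “value” rather than “values”, so the plot code would need y=value):
#Equivalent unpivot using reshape::melt instead of cbind()/stack()
human2 <- melt(as.data.frame(humset), id.vars="EVTYPE", variable_name="ind") %>%
    filter(ind != "Total")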
Looks like Tornadoes by far injure and kill the most people. Heat causes the next-most fatalities, while Storms cause considerably more injuries than Heat. Investing in tornado and storm detection and warning may prove useful from a policy perspective for limiting the impact on human life.
What about property and crop damage?
doltest <- raw %>%
group_by(EVTYPE) %>%
mutate(Total=PROPDMG+CROPDMG) %>% #add column for total for easy sorting and viewing as table if necessary
summarise(Property=sum(PROPDMG),
Crop=sum(CROPDMG),
Total=sum(Total)) %>%
arrange(desc(Total))
Mean Property and Crop Damage ($US):
mean(doltest$Property)
## [1] 991458566
mean(doltest$Crop)
## [1] 113930840
Like before, we filter for only the most impactful events, then reshape and plot the data.
dolset <- raw %>%
group_by(EVTYPE) %>%
mutate(Total=PROPDMG+CROPDMG) %>%
summarise(Property=sum(PROPDMG),
Crop=sum(CROPDMG),
Total=sum(Total)) %>%
filter(Property > 991458566 | Crop > 113930840) %>% #filter above averages
arrange(desc(Total))
dolset <- head(dolset, n=12) #subset only Top 12
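#As with the human-impact table, 991458566 and 113930840 are the means printed
#above, copied in by hand; an illustrative alternative (dolset2 is a new name)
#references the computed means directly instead:
dolset2 <- doltest %>%
    filter(Property > mean(doltest$Property) | Crop > mean(doltest$Crop)) %>%
    arrange(desc(Total)) %>%
    head(n=12)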
dollar <- cbind(dolset[1], stack(dolset[-1])) %>% #unpivot
filter(ind!="Total")
dollar$values <- dollar$values/1000000 #convert to Millions for easy plotting
dolplot <- ggplot(dollar) +
aes(x=reorder(EVTYPE, +values), y=values, fill=ind) +
geom_bar(stat="identity")+coord_flip() +
ggtitle("Total Dollar Impact by Top 12 Weather Events ($US Millions)") +
theme(plot.title=element_text(lineheight =.8, face="bold")) +
xlab("Event Type")
print(dolplot)
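One optional polish, not part of the original figure: the scales package loaded at the top can add thousands separators to the damage axis.
#Optional: comma-format the damage axis using scales::comma (package loaded above)
print(dolplot + scale_y_continuous(labels=comma))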
From a dollar impact perspective, Floods are the most costly weather events. Hurricanes, Storms, and Tornadoes also cause significant losses. To reduce property damage costs, policy makers should invest more in flood prevention measures such as dams, canals, and other coastal defenses.
In all, this dataset provides significant insights into how weather events affect human life and property. More research is needed, however, to classify all of the original event types into appropriate buckets. Time-series or location-based analysis may also make the insights more relevant to individual policymakers. Documenting how this analysis is performed remains critical, though; that is the only way we can keep ourselves safe from the storm of data.