Using the dataset provided by the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, we analyzed the logged weather events and the impact each had on population health and the economy of the U.S. Because records prior to 1983 are sparse and of poor quality, that data was excluded from this analysis. A more thorough future study could include it, although damages should then be adjusted for inflation. All weather events were grouped into 19 distinct categories, and statistics were calculated for human fatalities, human injuries, and the total damage done to property and crops, expressed in U.S. dollars. Outcomes were counted whether they were directly or indirectly caused by the weather event.
Loss of human life was greatest for weather events related to extreme heat (the Hot Weather category, 3,178 fatalities), more than twice the number attributed to the third and fourth leading causes of weather-related fatalities, flooding (1,547) and high wind/dust storms (1,468). The second leading cause was tornadoes/water spouts at 2,164.
Injuries due to weather events were led by tornadoes/water spouts at 34,833, almost three times the number associated with the second leading cause, high winds and dust storms (11,974).
Economic impact was dominated by weather events involving water, led by floods at $179.5B, followed by hurricanes ($90.3B) and tide-related events ($48.5B).
The data file used for this analysis was downloaded from the Coursera Reproducible Research course cloud storage. The CSV file, repdata-data-StormData.csv.bz2, is compressed with the bzip2 algorithm to reduce its size. Since read.csv can read through a bzfile connection, we can load the compressed file directly into R without a separate extraction step.
storm_data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
str(storm_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
There are 902,297 observations of 37 variables. This analysis is concerned with the impacts these events have on public health, more specifically which event types are the most harmful to people and which have the greatest economic consequences.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
First we will remove all columns that are not relevant to this analysis, keeping BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
relevant_columns <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
storm_data_trimmed <- storm_data[, relevant_columns]
Next we convert the BGN_DATE strings into Date objects.
storm_data_trimmed$BGN_DATE <- as.Date(as.character(storm_data_trimmed$BGN_DATE), "%m/%d/%Y")
The data is known to be very sparse in the earlier years, most likely due to poor record keeping. Another issue, not mentioned in the documentation but worth considering, is the time value of money: to fairly compare economic impact across years, all dollar amounts would need to be adjusted for inflation. We want to drop the unreliable early data without throwing away a significant share of the observations; the histogram produced by the code below shows the justification for excluding data prior to 1983.
hist(storm_data_trimmed$BGN_DATE, breaks = 63)
storm_data_trimmed2 <- storm_data_trimmed[storm_data_trimmed$BGN_DATE >= '1983-01-01', ]
str(storm_data_trimmed2)
## 'data.frame': 809136 obs. of 8 variables:
## $ BGN_DATE : Date, format: "1983-02-01" "1983-02-01" ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 856 856 834 856 856 856 856 856 856 856 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 1 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 0 0 25 0 0 0 0 0 0 0 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 1 1 17 1 1 1 1 1 1 1 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
The code above removes all observations occurring prior to 1983. Our dataset is now down to 809,136 observations and 8 variables.
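As noted earlier, damages in this report are not adjusted for inflation. A minimal sketch of how such an adjustment could be added in a future revision, assuming the records run through 2011 and using a hypothetical CPI-style lookup table (the index values below are placeholders, not real CPI figures):

# hypothetical CPI-style index by year (placeholder values, NOT real CPI data)
cpi <- data.frame(year = 1983:2011,
                  index = seq(1.0, 2.4, length.out = 29))

# re-express a dollar amount reported in a given year in 2011 dollars
adjust_to_2011 <- function(amount, year) {
    amount * cpi$index[cpi$year == 2011] / cpi$index[cpi$year == year]
}

adjust_to_2011(1000, 1990)  # e.g. $1,000 of 1990 damage expressed in 2011 dollars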
Looking at the National Weather Service Storm Data Documentation (NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, available at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf), we can see several details of interest to this analysis.
Section 7 of the table of contents lists 48 event types, some of which are further subdivided; however, the summary output of our raw data above shows 985 event types.
EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY"
This total of 48 event types is also confirmed in section 2.1.1, Storm Data Event Table, on p. 6 of the document. To look at the unique levels (event types), you can run the R code below; the output is suppressed here due to its length.
levels(storm_data_trimmed2$EVTYPE)
Unfortunately, the dataset we were given to download does not fully match the linked Storm Data Documentation. For example, the documentation states, "The event name of Landslide was renamed to Debris Flow."; however, among the unique EVTYPE values in this dataset there is no "Debris Flow", but there are "LANDSLIDE", "LANDSLIDE/URBAN FLOOD", "LANDSLIDES", "Landslump", and "LANDSLUMP".
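One quick way to confirm this from the data itself (a small check using base R's grep with value = TRUE on the factor levels):

# list every raw event type that looks like a debris-flow / landslide label
grep("debris|landsl", levels(storm_data_trimmed2$EVTYPE), ignore.case = TRUE, value = TRUE)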
Here grepl can save the day: we can match the names entered into the Storm Database case-insensitively and group them by keyword. This greatly reduces the number of event categories.
storm_data_events <- storm_data_trimmed2
storm_data_events$EVENT <- NA
storm_data_events[grepl("fire|smoke", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Fire & Smoke"
storm_data_events[grepl("dam", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Dam Break"
storm_data_events[grepl("tornado|spout|microburst|funnel", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Tornado/Water Spout"
storm_data_events[grepl("fog", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Fog"
storm_data_events[grepl("volcan", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Volcano"
storm_data_events[grepl("summary", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Miscellaneous"
storm_data_events[grepl("sleet|hail", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Sleet Or Hail"
storm_data_events[grepl("mud|slide", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Mudslide"
storm_data_events[grepl("ice|freeze", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Ice"
storm_data_events[grepl("thunderstorm|tunderstorm|tstm|torrential|lightning", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Thunderstorm"
storm_data_events[grepl("winter|wintry|snow|cold|Blizzard", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Winter Weather"
storm_data_events[grepl("wind|blow|wnd|southeast|dust", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "High Wind & Dust Storms"
storm_data_events[grepl("flood|erosion|erosin|fld|stream", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Flooding"
storm_data_events[grepl("tide|surge|seas|coastal", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Tide"
storm_data_events[grepl("dry|drought|driest", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Drought"
storm_data_events[grepl("heat|hot|warm", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Hot Weather"
storm_data_events[grepl("hurricane", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Hurricane"
storm_data_events[grepl("tropical|wet|wettest", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Tropical Storm"
storm_data_events[grepl("avalanche", storm_data_trimmed2$EVTYPE, ignore.case=TRUE), "EVENT"] <- "Avalanche"
storm_data_events$EVENT <- as.factor(storm_data_events$EVENT)
#storm_data_events$EVENT <- as.character(storm_data_events$EVENT)
This grouping has reduced the 985 distinct event types to 19.
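A quick sanity check of the regrouping, counting the original event types, the new categories, and any observations that matched none of the keyword patterns:

length(levels(storm_data_events$EVTYPE))  # 985 raw event types
length(levels(storm_data_events$EVENT))   # 19 grouped categories (NA is not a level)
sum(is.na(storm_data_events$EVENT))       # observations left uncategorized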
Now we will summarize the fatality and injury data by finding the total for each event type:
storm_data_fatalities <- storm_data_events
storm_data_injuries <- storm_data_events
storm_data_fatalities$BGN_DATE <- NULL
storm_data_fatalities$EVTYPE <- NULL
storm_data_fatalities$INJURIES <- NULL
storm_data_fatalities$PROPDMG <- NULL
storm_data_fatalities$PROPDMGEXP <- NULL
storm_data_fatalities$CROPDMG <- NULL
storm_data_fatalities$CROPDMGEXP <- NULL
storm_data_injuries$BGN_DATE <- NULL
storm_data_injuries$EVTYPE <- NULL
storm_data_injuries$FATALITIES <- NULL
storm_data_injuries$PROPDMG <- NULL
storm_data_injuries$PROPDMGEXP <- NULL
storm_data_injuries$CROPDMG <- NULL
storm_data_injuries$CROPDMGEXP <- NULL
storm_data_fatal <- storm_data_fatalities %>% group_by (EVENT) %>% summarise(Fatal = sum(FATALITIES))
storm_data_injury <- storm_data_injuries %>% group_by (EVENT) %>% summarise(Injuries = sum(INJURIES))
storm_data_fatal <- na.omit(storm_data_fatal)
storm_data_injury <- na.omit(storm_data_injury)
From the Storm Data Documentation: “Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.”
We need a function that converts the magnitude letter into the power of 10 by which the value in PROPDMG or CROPDMG is multiplied. Each type of damage is calculated and the two are then added together for each event type to give the total damage by event category. The final total damage value is expressed in millions of U.S. dollars.
# convert the exponent code from PROPDMGEXP / CROPDMGEXP into a power of 10
exponent <- function(x){
    if(is.numeric(x)) {x <- x}                            # already numeric: use as-is
    else if(grepl("h", x, ignore.case=TRUE)) {x <- 2}     # "h"/"H" = hundreds
    else if(grepl("k", x, ignore.case=TRUE)) {x <- 3}     # "k"/"K" = thousands
    else if(grepl("m", x, ignore.case=TRUE)) {x <- 6}     # "m"/"M" = millions
    else if(grepl("b", x, ignore.case=TRUE)) {x <- 9}     # "b"/"B" = billions
    else if(x == "" || x == " "){x <- 0}                  # blank: no magnitude code, exponent 0
    else{x <- 0}                                          # any other code: treat as exponent 0
    x
}
# calculate total damage given the damage value and its exponent code
damage <- function(multiplier, exp){
    power <- exponent(exp)
    if(is.numeric(multiplier)){
        multiplier <- multiplier * (10 ^ power)   # scale the reported value by its magnitude
    }
    if(!is.numeric(multiplier)){
        multiplier <- 0                           # non-numeric damage values contribute nothing
    }
    multiplier
}
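# quick sanity check against the documentation's example: 1.55B should equal $1,550,000,000
damage(1.55, "B")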
# calculate the property and crop damages and combine them to give the total estimated damage
storm_data_events$PROPDAMAGE <- mapply(damage, storm_data_events$PROPDMG, storm_data_events$PROPDMGEXP)
storm_data_events$CROPDAMAGE <- mapply(damage, storm_data_events$CROPDMG, storm_data_events$CROPDMGEXP)
storm_data_events$TOTALDAMAGE = storm_data_events$PROPDAMAGE + storm_data_events$CROPDAMAGE
storm_data_damage <- storm_data_events
storm_data_damage$BGN_DATE <- NULL
storm_data_damage$EVTYPE <- NULL
storm_data_damage$FATALITIES <- NULL
storm_data_damage$INJURIES <- NULL
storm_data_damage$PROPDMG <- NULL
storm_data_damage$PROPDMGEXP <- NULL
storm_data_damage$CROPDMG <- NULL
storm_data_damage$CROPDMGEXP <- NULL
storm_data_damage$PROPDAMAGE <- NULL
storm_data_damage$CROPDAMAGE <- NULL
storm_data_damage <- storm_data_damage %>% group_by (EVENT) %>% summarise(Damage = (sum(TOTALDAMAGE)/1000000))
With respect to population health we consider two primary measurable impacts, the first being fatalities directly or indirectly attributed to the weather event, and the second being injuries directly or indirectly attributed to the weather event.
Loss of human life was greatest for weather events related to extreme heat (the Hot Weather category, 3,178 fatalities), more than twice the number attributed to the third and fourth leading causes of weather-related fatalities, flooding (1,547) and high wind/dust storms (1,468). The second leading cause was tornadoes/water spouts at 2,164.
Injuries due to weather events were led by tornadoes/water spouts at 34,833, almost three times the number associated with the second leading cause, high winds and dust storms (11,974).
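The leading causes cited above can be pulled directly from the summary tables, for example with dplyr's arrange (already loaded):

# rank event categories by fatalities and injuries
storm_data_fatal %>% arrange(desc(Fatal)) %>% head(4)
storm_data_injury %>% arrange(desc(Injuries)) %>% head(2)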
storm_data_fatal
## Source: local data frame [19 x 2]
##
## EVENT Fatal
## (fctr) (dbl)
## 1 Avalanche 224
## 2 Dam Break 0
## 3 Drought 3
## 4 Fire & Smoke 90
## 5 Flooding 1547
## 6 Fog 80
## 7 High Wind & Dust Storms 1468
## 8 Hot Weather 3178
## 9 Hurricane 135
## 10 Ice 98
## 11 Miscellaneous 0
## 12 Mudslide 44
## 13 Sleet Or Hail 17
## 14 Thunderstorm 818
## 15 Tide 59
## 16 Tornado/Water Spout 2164
## 17 Tropical Storm 66
## 18 Volcano 0
## 19 Winter Weather 754
ggplot(storm_data_fatal, aes(x = reorder(EVENT,-Fatal), y = Fatal)) + geom_bar(stat = "identity") +labs(title="Total Fatalities by Event Type",x = "Event", y = "Fatalities") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
storm_data_injury
## Source: local data frame [19 x 2]
##
## EVENT Injuries
## (fctr) (dbl)
## 1 Avalanche 171
## 2 Dam Break 0
## 3 Drought 33
## 4 Fire & Smoke 1608
## 5 Flooding 8676
## 6 Fog 1076
## 7 High Wind & Dust Storms 11974
## 8 Hot Weather 9243
## 9 Hurricane 1328
## 10 Ice 2152
## 11 Miscellaneous 0
## 12 Mudslide 55
## 13 Sleet Or Hail 1371
## 14 Thunderstorm 5275
## 15 Tide 85
## 16 Tornado/Water Spout 34833
## 17 Tropical Storm 383
## 18 Volcano 0
## 19 Winter Weather 4146
ggplot(storm_data_injury, aes(x = reorder(EVENT,-Injuries), y = Injuries)) + geom_bar(stat = "identity") +labs(title="Total Injuries by Event Type",x = "Event", y = "Injuries") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Here the total damages (property plus crop) in U.S. dollars per event category are displayed; values are in millions of U.S. dollars. Economic impact was dominated by weather events involving water, led by floods at $179.5B, followed by hurricanes ($90.3B) and tide-related events ($48.5B). Note that this table also includes an NA row for events that matched none of the keyword categories; unlike the fatality and injury tables, the damage table was not passed through na.omit.
storm_data_damage
## Source: local data frame [20 x 2]
##
## EVENT Damage
## (fctr) (dbl)
## 1 Avalanche 8.7218
## 2 Dam Break 1.0020
## 3 Drought 15025.4196
## 4 Fire & Smoke 8900.0101
## 5 Flooding 179544.0857
## 6 Fog 25.0115
## 7 High Wind & Dust Storms 19681.4159
## 8 Hot Weather 924.8050
## 9 Hurricane 90271.4728
## 10 Ice 10888.8647
## 11 Miscellaneous 0.0000
## 12 Mudslide 346.8431
## 13 Sleet Or Hail 19020.9310
## 14 Thunderstorm 2177.5381
## 15 Tide 48450.2002
## 16 Tornado/Water Spout 39771.4003
## 17 Tropical Storm 8624.0585
## 18 Volcano 0.5000
## 19 Winter Weather 10121.6581
## 20 NA 4992.4017
ggplot(storm_data_damage, aes(x = reorder(EVENT,-Damage), y = Damage)) + geom_bar(stat = "identity") +labs(title="Total Damage ($) by Event Type",x = "Event", y = "Total Damages in Millions ($)") + theme(axis.text.x = element_text(angle = 45, hjust = 1))