Course: Reproducible Research (Coursera), Assignment 2
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
# knitr configuration
library(knitr)
opts_knit$set(progress=FALSE)
opts_chunk$set(echo=TRUE, message=FALSE, tidy=TRUE, comment=NA,
cache=TRUE, fig.path="figure/", fig.keep="high",
fig.width=10, fig.height=6,
fig.align="center")
# load required libs
library(dplyr, quietly=TRUE, warn.conflicts=FALSE)
library(ggplot2, quietly=TRUE, warn.conflicts=FALSE)
library(pander, quietly=TRUE, warn.conflicts=FALSE)
library(gridExtra, quietly=TRUE, warn.conflicts=FALSE)
In terms of the effects of storms on human health, the results show that tornadoes are the most deleterious, causing about 62% of the deaths and injuries registered in the data set: 97,043 people died or were injured over the 1950-2011 time period, ~1,500 people/year. In fact, 10 of the event types (out of 50 types considered by NOAA) are responsible for over 92% of human victims, in descending order: tornadoes, lightning, excessive heat, flooding (including flash floods), thunderstorms, winter/ice storms, high winds and wildfire.
An analysis of the financial impact of storms indicates that floods are the number one threat to property, representing about 150.2 billion[^billions] USD over the 1950-2011 period (a rate of loss of ~2.5 billion USD a year).
A similar analysis indicates that drought and floods are responsible for more than half of the losses from damaged crops, for a total of 24.83 billion USD over the 1993-2011 period (~1.3 billion USD a year).
The data set and its documentation were downloaded using the following code:
datasrc <- "https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2"
download.file(url = datasrc, destfile = "StormData.csv.bz2", method = "curl")
datadoc <- "https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/pd01016005curr.pdf"
download.file(url = datadoc, destfile = "pd01016005curr.pdf", method = "curl")
datafaq <- "https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/NCDC%20Storm%20Events-FAQ%20Page.pdf"
download.file(url = datafaq, destfile = "NCDC Storm Events-FAQ Page.pdf", method = "curl")
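Because the chunks are cached and the files are large, it may be convenient to skip the download when a local copy already exists. The following is only a minimal sketch of that idea; `fetch_once` is a hypothetical helper that is not part of the original analysis, and its `downloader` argument exists only so that the function can be exercised without network access.

```r
# Hypothetical helper (not in the original report): download a file only
# when no local copy is present.
fetch_once <- function(url, destfile, downloader = utils::download.file) {
  if (file.exists(destfile)) {
    return(invisible(FALSE))  # local copy found, nothing downloaded
  }
  # mode = "wb" keeps binary files (bz2, pdf) intact on Windows
  downloader(url = url, destfile = destfile, mode = "wb")
  invisible(TRUE)             # a download was attempted
}
```

It could then be called as `fetch_once(datasrc, "StormData.csv.bz2")`, in place of the direct `download.file` call.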
To get an idea of the structure of the data set, the first 10 lines of the data file were read.
tmp1 <- read.csv(bzfile("StormData.csv.bz2"), nrows = 10)
str(tmp1)
'data.frame': 10 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1
$ BGN_DATE : Factor w/ 7 levels "1/22/1952 0:00:00",..: 6 6 5 7 2 2 3 1 4 4
$ BGN_TIME : int 130 145 1600 900 1500 2000 100 900 2000 2000
$ TIME_ZONE : Factor w/ 1 level "CST": 1 1 1 1 1 1 1 1 1 1
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57
$ COUNTYNAME: Factor w/ 9 levels "BALDWIN","BLOUNT",..: 7 1 4 6 3 5 2 8 9 4
$ STATE : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1
$ EVTYPE : Factor w/ 1 level "TORNADO": 1 1 1 1 1 1 1 1 1 1
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0
$ BGN_AZI : logi NA NA NA NA NA NA ...
$ BGN_LOCATI: logi NA NA NA NA NA NA ...
$ END_DATE : logi NA NA NA NA NA NA ...
$ END_TIME : logi NA NA NA NA NA NA ...
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0
$ COUNTYENDN: logi NA NA NA NA NA NA ...
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0
$ END_AZI : logi NA NA NA NA NA NA ...
$ END_LOCATI: logi NA NA NA NA NA NA ...
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100
$ F : int 3 2 2 2 2 2 2 1 3 3
$ MAG : num 0 0 0 0 0 0 0 0 0 0
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25
$ PROPDMGEXP: Factor w/ 1 level "K": 1 1 1 1 1 1 1 1 1 1
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0
$ CROPDMGEXP: logi NA NA NA NA NA NA ...
$ WFO : logi NA NA NA NA NA NA ...
$ STATEOFFIC: logi NA NA NA NA NA NA ...
$ ZONENAMES : logi NA NA NA NA NA NA ...
$ LATITUDE : num 3040 3042 3340 3458 3412 ...
$ LONGITUDE : num 8812 8755 8742 8626 8642 ...
$ LATITUDE_E: num 3051 0 0 0 0 ...
$ LONGITUDE_: num 8806 0 0 0 0 ...
$ REMARKS : logi NA NA NA NA NA NA ...
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10
The data set has 37 columns, several of which are relevant to the analysis at hand, namely those that indicate the date the event was reported (BGN_DATE), the US state in which the event occurred (STATE), the event type (EVTYPE), the number of people dying (FATALITIES) or being injured (INJURIES) due to the event, and the economic cost of the damage (PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP).
# removing temporary data frame
rm(tmp1)
To simplify the analysis (and save time and memory), only the relevant columns will be read from the data file:
storm <- read.csv("StormData.csv.bz2", colClasses = c("NULL", "character", rep("NULL",
4), rep("character", 2), rep("NULL", 14), rep("numeric", 3), "character",
"numeric", "character", rep("NULL", 9)), fileEncoding = "UTF-8")
Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : invalid input found on input connection 'StormData.csv.bz2'
Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
str(storm)
'data.frame': 192565 obs. of 9 variables:
$ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
$ STATE : chr "AL" "AL" "AL" "AL" ...
$ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ PROPDMGEXP: chr "K" "K" "K" "K" ...
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
$ CROPDMGEXP: chr "" "" "" "" ...
The data set, as read, contains 192565 rows and 9 columns.
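Counting column positions by hand, as in the `colClasses` vector above, is easy to get wrong. As a hedged alternative, the sketch below builds the vector from the column names instead: it reads the header, marks every column as "NULL" (skipped), and then fills in the classes of the wanted columns. The helper `read_selected` and the toy file are assumptions made for illustration only.

```r
# Hypothetical helper: derive colClasses from column names so that unwanted
# columns are skipped ("NULL") without counting positions manually.
read_selected <- function(path, wanted) {
  # Read the header plus one data row, just to learn the column names.
  header <- names(read.csv(path, nrows = 1, check.names = FALSE))
  missing_cols <- setdiff(names(wanted), header)
  if (length(missing_cols) > 0) {
    stop("columns not found: ", paste(missing_cols, collapse = ", "))
  }
  classes <- rep("NULL", length(header))
  names(classes) <- header
  classes[names(wanted)] <- wanted
  read.csv(path, colClasses = unname(classes), check.names = FALSE)
}

# Toy file standing in for the real data set.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("STATE__,BGN_DATE,EVTYPE,FATALITIES",
             "1,4/18/1950 0:00:00,TORNADO,0",
             "1,2/20/1951 0:00:00,TORNADO,1"), demo_file)
demo <- read_selected(demo_file,
                      c(EVTYPE = "character", FATALITIES = "numeric"))
```

With the toy file, `demo` keeps only the EVTYPE and FATALITIES columns.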
According to the original data source, the event type column (EVTYPE) comprises a definite and limited vocabulary of 50 valid event names.
The list of valid events was extracted from the file pd01016005curr.pdf (section 2.1.1 “Storm Data Event Table”), and saved to a csv file (valid_events.csv).
valid_events <- read.csv("valid_events.csv", stringsAsFactors = FALSE)
valid_events$Event.Name <- toupper(valid_events$Event.Name)
# number of valid event names
n_valid <- length(valid_events$Event.Name)
# number of unique event names in the data set
n_evtype <- length(unique(storm$EVTYPE))
goodrows <- subset(storm, EVTYPE %in% valid_events$Event.Name)$EVTYPE
fraction_good <- 100 * length(goodrows)/nrow(storm)
The crucial EVTYPE column contained more than the documented 50 unique values; in fact, it had 137 distinct values.
Moreover, only about 45.6% of the records in the data set correspond to the documented vocabulary for the event type.
There also seems to be great diversity in the way some events have been recorded over the years, including misspellings, mixed case, combinations of an event with some sort of numeric value, etc.
set.seed(567)
sample(unique(subset(storm, !EVTYPE %in% valid_events$Event.Name)$EVTYPE), 20)
[1] " Georgia.,823768.00\n48.00,7/3/2010 0:00:00,07:10:00 PM,CST,165.00,GAINES,TX,FLASH FLOOD,5.00,E,SEAGRAVES,7/3/2010 0:00:00,08:10:00 PM,0.00,,3.00,SE,SEAGRAVES ARPT,0.00,0.00,,0.00,0.00,0.00,0.00,K,0.00,K,MAF,TEXAS"
[2] " continuing into the early morning hours of May 26th. Several of these storms produced large hail and damaging winds."
[3] " 1.5 inches in Tatamy (Northampton County)"
[4] " south along the Interstate 95 corridor. Visibilities were reduced to one quarter mile or less.,740487.00\n51.00,4/10/2008 0:00:00,04:00:00 AM,EST,41.00,VAZ041 - 055 - 056,VA,DENSE FOG,0.00,,,4/10/2008 0:00:00,10:00:00 AM,0.00,,0.00,,,0.00,0.00,,0.00,0.00,0.00,0.00,K,0.00,K,LWX,VIRGINIA"
[5] " including Hwy 99 between Wamego and Louisville where one foot of water was observed to be flowing over the roadway. EPISODE NARRATIVE: Slow moving thunderstorms produced a significant amount of rainfall over portions of northeastern Kansas"
[6] " and extreme northeast Missouri on February 24"
[7] " crops had to be replanted even though the growing point was below the surface. This was due to the soil being a sandy loam which allowed freezing temperatures to penetrate into the ground.,572974.00\n19.00,5/6/2005 0:00:00,07:16:00 PM,CST,27.00,CARROLL,IA,HAIL,4.00,S,ARCADIA,5/6/2005 0:00:00,07:16:00 PM,0.00,,4.00,S,ARCADIA,0.00,0.00,,88.00,0.00,0.00,1.00,K,0.00,,DMX,IOWA"
[8] " East,HAMPSHIRE - HAMPSHIRE - MORGAN - BERKELEY - JEFFERSON - PENDLETON - HARDY - WESTERN GRANT,0.00,0.00,0.00,0.00,EPISODE NARRATIVE: An area of low pressure passed through the Ohio Valley spreading precipitation across Virginia on the 6th and 7th. Warmer air was drawn into the storm system aloft"
[9] " North and Central, ,3233.00,8309.00,3215.00,8252.00,EPISODE NARRATIVE: A large subtropical ridge across Texas was slowly expanding eastward. In advance of this upper ridge"
[10] " with a noteable lack of lightning"
[11] " West,WAYNE - WAYNE - CABELL - MASON - JACKSON - LINCOLN - PUTNAM - KANAWHA - ROANE - MINGO - LOGAN - BOONE - CLAY - MCDOWELL,0.00,0.00,0.00,0.00,EPISODE NARRATIVE: A rare October heat wave"
[12] " knocking out power to nearly 100"
[13] " was significantly higher. EPISODE NARRATIVE: There was widespread heavy rain in northern and western Arkansas between the 9th and the 11th of the month. A weather system in Texas dragged a cold front toward the region which produced 24-hour rainfall amounts of 3 to 6 inches by the morning of the 10th and an additional 1 to 3 inches by the morning of the 11th. Flooding was a problem along the Arkansas and White Rivers and tributaries. Releases from Lake Norfork in Baxter County in excess of 86"
[14] " southwest Arkansas and northeast Texas during the predawn hours of July 28th and spread southeast towards the Interstate 20 corridor of northeast Texas into northwest Louisiana during the day. The result was a few reports of wind damage across the region but the main result was excessive heavy rainfall. This rainfall resulted in numerous flooding reports across the region with several high water rescues reported and water into some homes and businesses.EVENT NARRATIVE: A 24 hour rainfall total of 5.25 inches was reported"
[15] "HIGH WIND 48"
[16] " blocking both directions on the Interstate. California Highway 89 was also closed due to a mudslide. Donner Creek caused flooding of some buildings in Truckee. This was the second largest flood of record on Donner Creek. California Highway 89 was closed across Nevada County from Truckee to the Sierra County line."
[17] " North, ,3209.00,9858.00,0.00,0.00,EPISODE NARRATIVE: Isolated elevated severe thunderstorms developed in association with an approaching shortwave. Hail up to the size of nickels was reported with these storms.EVENT NARRATIVE: Penny size hail fell north of Rising Star on HWY 183.,750911.00\n48.00,3/25/2009 0:00:00,11:35:00 AM,CST,133.00,EASTLAND,TX,HAIL,2.00,W,TIFFIN,3/25/2009 0:00:00,11:35:00 AM,0.00,,0.00,,,0.00,0.00,,1.00,0.00,0.00,0.00,K,0.00,K,FWD,TEXAS"
[18] " with several injuries at a campground"
[19] "MONTHLY PRECIPITATION"
[20] " 2.31 inches in Hopatcong (Sussex County)"
Therefore, some serious data cleanup needs to be done on this column.
# normalize all to uppercase
storm$EVTYPE <- toupper(storm$EVTYPE)
events <- storm$EVTYPE
# replace extraneous chars by a single space
events <- gsub("( ){1,}", " ", gsub("[^A-Z0-9 ]", " ", events))
# FLOOD related events
events[grepl("COASTAL|STORM SURGE", events)] <- "COASTAL FLOOD"
events[grepl("FLASH", events)] <- "FLASH FLOOD"
events[!grepl("FLASH|COASTAL", events) & grepl("FLOOD", events)] <- "FLOOD"
events[grepl("STREAM|URBAN", events)] <- "FLOOD"
# HEAT related events
events[grepl("HEAT|DRY", events)] <- "EXCESSIVE HEAT"
events[grepl("HOT|WARM", events)] <- "EXCESSIVE HEAT"
events[grepl("RECORD (HIGH|.*TEMP)|HIGH TEMPERA", events)] <- "EXCESSIVE HEAT"
# COLD related events
events[grepl("SLEET", events)] <- "SLEET"
events[grepl("BLIZZARD", events)] <- "BLIZZARD"
events[grepl("EXTREME", events) & grepl("CHILL|COLD", events)] <- "EXTREME COLD/WIND CHILL"
events[!grepl("EXTREME", events) & grepl("CHILL|COLD", events)] <- "COLD/WIND CHILL"
events[grepl("LAKE", events) & grepl("SNOW", events)] <- "LAKE-EFFECT SNOW"
events[!grepl("LAKE", events) & grepl("SNOW", events)] <- "HEAVY SNOW"
events[grepl("FROST|FREEZE", events)] <- "FROST/FREEZE"
events[!grepl("FROST", events) & grepl("FREEZE", events)] <- "SLEET"
events[grepl("FREEZ", events) & grepl("RAIN", events)] <- "SLEET"
events[grepl("DRIZZLE", events)] <- "SLEET"
events[grepl("(RECORD LOW|LOW TEMP)", events)] <- "EXTREME COLD/WIND CHILL"
events[grepl("GLAZE", events)] <- "EXTREME COLD/WIND CHILL"
events[grepl("ICE", events)] <- "ICE STORM"
events[grepl("WINT", events)] <- "WINTER STORM"
events[grepl("HAIL", events)] <- "HAIL"
# WIND, RAIN and LIGHTNING related events
events <- gsub("WINDS", "WIND", events)
events[!grepl("DERSTORM WIND", events) & grepl("THUN|TSTM", events)] <- "LIGHTNING"
events[grepl("LIGHT|LIGN", events)] <- "LIGHTNING"
events[grepl("DERSTORM WIND", events)] <- "THUNDERSTORM WIND"
events[grepl("TORN", events)] <- "TORNADO"
events[grepl("SPOUT", events)] <- "WATERSPOUT"
events[grepl("HURRICANE|TYPHOON", events)] <- "HURRICANE (TYPHOON)"
events[grepl("FIRE", events)] <- "WILDFIRE"
events[!grepl("MARINE", events) & grepl("HIGH WIND", events)] <- "HIGH WIND"
events[grepl("GUST", events)] <- "STRONG WIND"
events[!grepl("COLD|MARINE|THUNDER|STRONG|HIGH", events) & grepl("WIND", events)] <- "STRONG WIND"
events[grepl("FUNNEL", events)] <- "FUNNEL CLOUD"
events[grepl("TROPICAL STORM", events)] <- "TROPICAL STORM"
events[!grepl("FREEZIN", events) & grepl("FOG|VOG", events)] <- "DENSE FOG"
events[grepl("WET|RAIN|SHOWER|PRECIP", events)] <- "HEAVY RAIN"
# DUST related events
events[grepl("DUST DEVEL", events)] <- "DUST DEVIL"
events[!grepl("DEVIL", events) & grepl("DUST", events)] <- "DUST STORM"
# MARINE EVENTS
events[grepl("RIP CURRENT", events)] <- "RIP CURRENT"
events[!grepl("LOW", events) & grepl("TIDE|WAVE|SWELL", events)] <- "STORM SURGE/TIDE"
events[grepl("SURF", events)] <- "HIGH SURF"
# MISC events
events[grepl("VOLCAN", events)] <- "VOLCANIC ASH"
# Not a storm, but is there, so we will classify it
events[grepl("(MUD|LAND|ROCK).*SLIDE", events)] <- "LANDSLIDE"
# everything else
events[grepl("SUMMARY", events)] <- "OTHER/UNKNOWN"
events[!events %in% c("LANDSLIDE", "OTHER", valid_events$Event.Name)] <- "OTHER/UNKNOWN"
# re-assign the cleaned up column values
storm$EVTYPE <- events
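The long sequence of pattern assignments above could also be organised as a rule table, which makes the order of the rules explicit and easier to review. The sketch below is not the author's method: it uses a small, hypothetical subset of the patterns and a first-match-wins policy, whereas the original code lets later assignments overwrite earlier ones. The names `rules` and `classify_events` are assumptions.

```r
# Illustrative rule table (hypothetical subset of the patterns used above).
rules <- data.frame(
  pattern = c("FLASH", "FLOOD", "TORN", "HAIL"),
  label   = c("FLASH FLOOD", "FLOOD", "TORNADO", "HAIL"),
  stringsAsFactors = FALSE
)

# Apply the rules top to bottom; the first matching rule decides the label,
# and anything left unmatched falls into the default class.
classify_events <- function(x, rules, default = "OTHER/UNKNOWN") {
  x <- toupper(x)
  out <- rep(default, length(x))
  done <- rep(FALSE, length(x))
  for (i in seq_len(nrow(rules))) {
    hit <- !done & grepl(rules$pattern[i], x)
    out[hit] <- rules$label[i]
    done <- done | hit
  }
  out
}

labels <- classify_events(
  c("Flash Flooding", "RIVER FLOOD", "TORNADO F2", "small hail",
    "SUMMARY OF MAY 1"),
  rules
)
```

Because "FLASH" is listed before "FLOOD", a flash flood record is labelled once and is not reclassified by the more general flood rule.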
To be able to accumulate by year, a variable was created to store the year extracted from the BGN_DATE column:
storm$BGN_DATE <- as.Date(storm$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
storm$year <- as.POSIXlt(storm$BGN_DATE)$year + 1900
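As a quick check of the date handling, the same conversion can be tried on a couple of toy values laid out like BGN_DATE. This sketch uses `format(..., "%Y")` as an equivalent way to obtain the year; the variable names are illustrative and not part of the original analysis.

```r
# Toy values in the same layout as BGN_DATE.
raw_dates <- c("4/18/1950 0:00:00", "7/13/1995 0:00:00")

# Parse month/day/year; the time part is read by the format and then dropped.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y %H:%M:%S")

# Extract the calendar year as an integer.
years <- as.integer(format(parsed, "%Y"))
```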
Finally, to be able to estimate the monetary cost due to damage caused by the storms, we have to examine the appropriate columns.
# values found in the damage 'exponent' columns
tp <- table(storm$PROPDMGEXP)
tc <- table(storm$CROPDMGEXP)
pander(tp, caption = "*Property damage 'exponents'*")
156015 | 3 | 7 | 1 | 2 | 33665 | 1 | 2871 |
pander(tc, caption = "*Crop damage 'exponents'*")
192491 | 1 | 1 | 39 | 1 | 32 |
It would seem that there is a mixture of coding standards for these columns, and the great majority of the “exponents” (multipliers really) correspond to a coding such that H stands for hundreds, K for thousands, M for millions, and B for billions.
The meaning of the other codes is not clear. Even after checking the documentation on the site that was the source for the data, several incompatible definitions could be glimpsed:
PROPDMG column. Even if we accept this interpretation, there is the issue as to whether the units are in thousands, millions, or billions[^usbillions]. The bottom line is that it is not feasible to apply only one interpretation to the numeric codes in these columns.
To ascertain whether it would be possible to omit them from the analysis, we calculated the percentage of these codes in each column, among the records that have a value for PROPDMG (and likewise for CROPDMG):
storm$PROPDMGEXP <- toupper(storm$PROPDMGEXP)
storm$CROPDMGEXP <- toupper(storm$CROPDMGEXP)
pdmg_storm <- subset(storm, PROPDMG > 0)
cdmg_storm <- subset(storm, CROPDMG > 0)
undef_p <- 100 * sum(!pdmg_storm$PROPDMGEXP %in% c("B", "H", "K", "M"))/nrow(pdmg_storm)
undef_c <- 100 * sum(!cdmg_storm$CROPDMGEXP %in% c("B", "H", "K", "M"))/nrow(cdmg_storm)
badcode_storm <- data.frame(column = c("PROPDMGEXP", "CROPDMGEXP"), percent = c(paste0(round(undef_p,
3), "%"), paste0(round(undef_c, 3), "%")))
colnames(badcode_storm) <- c("Column", "Percent of undefined codes")
pander(badcode_storm)
Column | Percent of undefined codes |
---|---|
PROPDMGEXP | 0.036% |
CROPDMGEXP | 0% |
As can be seen, the fraction of records with uninterpretable codes is very small (<< 1%), thus we can safely drop them from the respective data frames.
pdmg_storm <- subset(pdmg_storm, PROPDMGEXP %in% c("B", "H", "K", "M"))
cdmg_storm <- subset(cdmg_storm, CROPDMGEXP %in% c("B", "H", "K", "M"))
In the cleaned-up storm data set, there is an unequal distribution of the reported events during the period under analysis, as can be seen from the table below:
# summary of all events
t_event <- storm %>% group_by(EVTYPE) %>% summarise(total = n()) %>% mutate(perc_total = 100 *
total/sum(total)) %>% arrange(desc(total))
# top 10
top10_events <- t_event[1:10, ]
top10_percent <- sum(top10_events$perc_total)
colnames(top10_events) <- c("Event Class", "Frequency", "Percentage of reports")
pander(top10_events, caption = "*Top 10 events reported in the storm data set*",
round = 2)
Event Class | Frequency | Percentage of reports |
---|---|---|
LIGHTNING | 91200 | 47.36 |
HAIL | 63439 | 32.94 |
TORNADO | 34921 | 18.13 |
THUNDERSTORM WIND | 1887 | 0.98 |
HIGH WIND | 340 | 0.18 |
FLASH FLOOD | 259 | 0.13 |
FUNNEL CLOUD | 153 | 0.08 |
HEAVY SNOW | 67 | 0.03 |
FLOOD | 65 | 0.03 |
BLIZZARD | 36 | 0.02 |
The top 10 events in the data set (vide supra) are responsible for 99.9% of the reports from 1950–2011.
The data contain columns that can help us measure the impact of storms on public health, understood in terms of the number of victims who suffer death or injury as a result of one of these events.
# records indicating impact on human health
hi_storm <- subset(storm, FATALITIES > 0 | INJURIES > 0)
hi_storm$victims <- hi_storm$FATALITIES + hi_storm$INJURIES
About 3.75% of the records in the Storm data indicate that there were human victims.
In the table below we can see the top 10 storm types (events) that most affected human health in the time period under consideration:
hi_table <- hi_storm %>% group_by(EVTYPE) %>% summarise(c_tot = sum(victims)) %>%
arrange(desc(c_tot)) %>% mutate(c_perc = 100 * c_tot/sum(c_tot), c_cummperc = cumsum(c_perc))
colnames(hi_table) <- c("Event type", "Deaths/Injuries", "Percent", "Cumm Percent")
pander(hi_table[1:10, ], caption = "Top 10 causes of death or injury due to storms [1950-2011]",
round = 1)
Event type | Deaths/Injuries | Percent | Cumm Percent |
---|---|---|---|
TORNADO | 72600 | 93.9 | 93.9 |
LIGHTNING | 3657 | 4.7 | 98.6 |
HAIL | 417 | 0.5 | 99.1 |
WILDFIRE | 153 | 0.2 | 99.3 |
DENSE FOG | 130 | 0.2 | 99.5 |
HIGH WIND | 123 | 0.2 | 99.7 |
WINTER STORM | 59 | 0.1 | 99.8 |
DUST STORM | 56 | 0.1 | 99.8 |
THUNDERSTORM WIND | 53 | 0.1 | 99.9 |
FLOOD | 16 | 0 | 99.9 |
We can see that the top 10 causes comprise about 99.9% of all the victims affected in all those years, and that Tornadoes are by far the most important cause of death or injuries to humans.
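The percentage and cumulative-percentage columns in the table above follow from a simple computation: sum the victims per event type, sort in descending order, and accumulate. The base R sketch below illustrates this on made-up numbers; `victims_by_type` is a hypothetical helper and the toy values are not taken from the storm data.

```r
# Hypothetical helper: totals per event type with percent and running percent.
victims_by_type <- function(evtype, victims) {
  totals <- sort(tapply(victims, evtype, sum), decreasing = TRUE)
  counts <- as.numeric(totals)
  data.frame(
    event       = names(totals),
    total       = counts,
    percent     = 100 * counts / sum(counts),
    cum_percent = 100 * cumsum(counts) / sum(counts),
    stringsAsFactors = FALSE
  )
}

# Made-up records: two tornado rows, one hail row, one lightning row.
toy_victims <- victims_by_type(
  c("TORNADO", "HAIL", "TORNADO", "LIGHTNING"),
  c(6, 1, 2, 3)
)
```

The last value of `cum_percent` is always 100, which is a convenient sanity check on the grouped totals.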
The impact on humans has not been constant over the years; in fact, there have been major events that fell outside the norm, as can be seen in the graph below.
t_ph_year <- hi_storm %>% group_by(year) %>% summarize(t_fatal = sum(FATALITIES),
t_injur = sum(INJURIES))
pdeath <- ggplot(t_ph_year, aes(x = year, y = t_fatal)) + geom_line(stat = "identity",
col = "black", size = 1.5) + xlab("") + ylab("Number of deaths") + theme_bw() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), plot.margin = unit(c(1,
1, -1, 1), "cm"))
pinjur <- ggplot(t_ph_year, aes(x = year, y = t_injur)) + geom_line(stat = "identity",
col = "red", size = 1.5) + xlab("Year") + ylab("Number of injuries") + theme_bw() +
theme(plot.margin = unit(c(0, 1, 0, 1), "cm"))
# gridExtra >= 2.0 takes the title in `top` (older releases used `main`)
grid.arrange(pdeath, pinjur, top = "Deaths and injuries due to storms [1950-2011]")
The graph shows events such as 1995’s Chicago Heat Wave[^1995chicago] (maximum value in the top chart) which, during July of that year, caused many deaths in a period of only five days[^1995paper].
chicago1995 <- subset(hi_storm, STATE == "IL" & (BGN_DATE >= "1995-07-12" &
BGN_DATE <= "1995-07-16"))
chicago1995$BGN_DATE <- as.character(chicago1995$BGN_DATE)
pander(as.list(chicago1995[, 1:5]))
Also of note are 1998’s South Texas floods[^1998texas][^1998noaatx], which in October of that year caused a great number of injuries and deaths. This event is responsible for the maximum value in the injuries plot.
# the month is compared numerically so the filter does not depend on the locale
texas1998 <- subset(hi_storm, STATE == "TX" & EVTYPE == "FLOOD" & year == 1998 &
format(BGN_DATE, "%m") == "10") %>% group_by(BGN_DATE,
STATE, EVTYPE) %>% summarise(tot_fatal = sum(FATALITIES), tot_injur = sum(INJURIES))
texas1998$BGN_DATE <- as.character(texas1998$BGN_DATE)
colnames(texas1998) <- c("Date", "State", "Event", "Total deaths", "Total injuries")
pander(texas1998[, 1:5], split.tables = 120)
Quitting from lines 378-385 (repdata_project2.Rmd) Error in value[[jvseq[[jjj]]]] : subscript out of bounds Calls:
The storms have also had a negative financial impact due to damage produced to property and crops.
About 14.24% of records in the data set include an estimate for the property damage, and 0.04% have data on the cost of damage to crops.
To evaluate the costs, we will add a column that translates the character code into a multiplier, which will allow us to calculate the appropriate amount for each event.
The top 10 events in terms of property damage are listed in the table below, with flooding being the number one source of property loss.
mults <- list(H = 10^2, K = 10^3, M = 10^6, B = 10^9)
pdmg_storm$multiplier <- sapply(pdmg_storm$PROPDMGEXP, function(x) {
return(mults[[as.character(x)]])
})
pdmg_storm$amount <- pdmg_storm$PROPDMG * pdmg_storm$multiplier
pdmg_storm$type <- "Property damage"
summ_pdmg <- pdmg_storm %>% group_by(EVTYPE) %>% summarise(total = sum(amount)/10^9) %>%
mutate(percent = 100 * total/sum(total)) %>% arrange(desc(total))
colnames(summ_pdmg) <- c("Event", "Cost (in Billions USD)", "Percent from total")
pander(summ_pdmg[1:10, ], round = 2)
Event | Cost (in Billions USD) | Percent from total |
---|---|---|
TORNADO | 30.73 | 81.97 |
WINTER STORM | 5.13 | 13.69 |
WILDFIRE | 0.62 | 1.66 |
HIGH WIND | 0.36 | 0.97 |
HURRICANE (TYPHOON) | 0.19 | 0.51 |
FLASH FLOOD | 0.12 | 0.31 |
FLOOD | 0.09 | 0.24 |
THUNDERSTORM WIND | 0.09 | 0.23 |
HEAVY RAIN | 0.06 | 0.15 |
HEAVY SNOW | 0.05 | 0.14 |
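The translation from exponent code to multiplier used for the table above can also be written as a vectorised lookup on a named numeric vector, which avoids calling a function once per row. This is only a sketch under the same H/K/M/B interpretation; `mult_lookup` and `damage_amount` are hypothetical names.

```r
# Same multipliers as in the analysis, stored as a named numeric vector.
mult_lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: amount in USD for a damage value and its code.
# Codes that are not in the lookup give NA rather than a guessed multiplier.
damage_amount <- function(value, exp_code) {
  value * unname(mult_lookup[toupper(exp_code)])
}

amounts <- damage_amount(c(25, 2.5, 1), c("K", "m", "B"))
```

Returning NA for unknown codes keeps uninterpretable records visible, so they can be counted or dropped explicitly.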
The corresponding table for crop damage shows that drought and flooding (two opposite atmospheric events) are responsible for more than 50% of the losses to crops.
cdmg_storm$multiplier <- sapply(cdmg_storm$CROPDMGEXP, function(x) {
return(mults[[as.character(x)]])
})
cdmg_storm$amount <- cdmg_storm$CROPDMG * cdmg_storm$multiplier
cdmg_storm$type <- "Crop damage"
summ_cdmg <- cdmg_storm %>% group_by(EVTYPE) %>% summarise(total = sum(amount)/10^9) %>%
mutate(percent = 100 * total/sum(total)) %>% arrange(desc(total))
colnames(summ_cdmg) <- c("Event", "Cost (in Billions USD)", "Percent from total")
pander(summ_cdmg[1:10, ], round = 2)
Event | Cost (in Billions USD) | Percent from total |
---|---|---|
FLOOD | 0.4 | 38.46 |
EXCESSIVE HEAT | 0.4 | 38.2 |
HAIL | 0.05 | 4.89 |
TORNADO | 0.05 | 4.8 |
THUNDERSTORM WIND | 0.04 | 3.35 |
HIGH WIND | 0.03 | 2.68 |
HURRICANE (TYPHOON) | 0.02 | 2.39 |
LIGHTNING | 0.02 | 1.62 |
FLASH FLOOD | 0.02 | 1.6 |
WINTER STORM | 0.02 | 1.48 |
When looking at the total losses per year over the study period, we observe a definite growth trend in property damage caused by storms, whereas for crops there has been a decrease, at least since 1993, which is the first year with such losses recorded in the data set.
For illustration purposes (because it might not be the best model for this data), we are superimposing a linear estimate, mainly to drive home the possible underlying trend.
dmg <- rbind(pdmg_storm %>% select(STATE, EVTYPE, year, amount, type), cdmg_storm %>%
select(STATE, EVTYPE, year, amount, type))
dmg$type <- as.factor(dmg$type)
summyr_dmg <- dmg %>% group_by(type, year) %>% summarise(yr_damage = sum(amount)) %>%
arrange(type, year)
dmg_yr_plot <- ggplot(summyr_dmg, aes(x = year, y = yr_damage/10^9, colour = type)) +
geom_point(aes(group = type)) + scale_y_log10() + geom_smooth(method = "lm") +
ylab("Damage in billions of USD") + xlab("Year") + ggtitle("Property and Crop Damage caused by storms [1950-2011]") +
scale_color_discrete(guide = "none") + facet_wrap(~type, nrow = 1) + theme_bw()
dmg_yr_plot
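Since the plot uses a log10 scale on the y axis, and ggplot2 applies scale transformations before computing statistics, the superimposed line amounts to a straight-line fit to the logarithm of the yearly damage, that is, a constant multiplicative growth per year. The sketch below shows the equivalent fit with `lm` on made-up numbers (not taken from the storm data); all variable names are illustrative.

```r
# Toy yearly totals (made-up values that double every year).
toy_dmg <- data.frame(year = 2000:2005, damage = c(1, 2, 4, 8, 16, 32))

# Fit a straight line to log10(damage), as a log-scaled smoother would.
fit <- lm(log10(damage) ~ year, data = toy_dmg)

# Back-transform the slope into a yearly growth factor.
growth_per_year <- 10^coef(fit)[["year"]]
```

For the toy data the growth factor is 2, that is, the damage doubles each year; a factor below 1 would correspond to a declining trend.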
Some non-exhaustive reasons could be advanced for these trends: