This report describes the exploration of the NOAA Storm Database and answers some questions about severe weather events with the greatest impact on the population health and economics between 1996 and 2011 years. Analysis was based on the standard of the weather events types declared in the NWS Storm Data Documentation.
The Data processing part of this report explains how the time period of exploration was chosen and describes how the weather event names form the source dataset was brought in accordance with the standard event names.
At the Summary part shown that the most dangerous weather events with impact on the populations health are tornadoes, excessive heat and floods. By the economic consequences the most hazard weather events are floods, hurricanes/typhoons, droughts, storm surges/tides and tornadoes.
data_url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
data_file <- "repdata-data-StormData.csv.bz2"
download.file(data_url, destfile = data_file, method = "auto")
data_date <- date()
The original dataset was downloaded Sun Dec 27 15:57:06 2015 as repdata-data-StormData.csv.bz2 archive:
Next code unpacks the downloaded .bz2 archive and reads the original dataset:
data <- read.csv(bzfile(data_file))
data_size <- dim(data)
Supplied dataset contains 902297 observations of weather events and 37 variables. Next variables were used in the present exploration:
data <- data[, c("EVTYPE", "BGN_DATE",
"FATALITIES", "INJURIES",
"PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
The supplied NOAA’s dataset contains information about storms and other severe weather events. In the purpose of data processing all event types were represented with 48 unique weather events and defined in National Weather Service Storm Data Documentation. This set of the standard types of weather events was established since 1996 year.
To provide further analysis in accordance with NWS standards I prepared a dataframe with standart names of weather events:
events <- data.frame(
EventName = c("Astronomical Low Tide", "Avalanche", "Blizzard",
"Coastal Flood", "Cold/Wind Chill", "Debris Flow",
"Dense Fog", "Dense Smoke", "Drought", "Dust Devil",
"Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill",
"Flash Flood", "Flood", "Freezing Fog", "Frost/Freeze",
"Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow",
"High Surf", "High Wind", "Hurricane/Typhoon", "Ice Storm",
"Lakeshore Flood", "Lake-Effect Snow", "Lightning",
"Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche",
"Sleet", "Storm Surge/Tide", "Strong Wind",
"Thunderstorm Wind", "Tornado", "Tropical Depression",
"Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout",
"Wildfire", "Winter Storm", "Winter Weather"
)
)
After that I converted all event names to uppercase:
data$EVTYPE <- toupper(data$EVTYPE)
events$EventName <- toupper(events$EventName)
The events in the supplied dataset start in the year 1950 and end in November 2011.
As it is defined in the task to the present assignment we should consider writing report “as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events”. On the one hand, this means that chosen data period should contain quite fresh data (old data could be irrelevant because of different climate changes or measures already taken to prevent the consequences of hazardous weather events). On the other hand, analyzed data should contain enough information for statistical analysis because frequency of the different weather events varies year to year.
Furthermore, as it shown at the [Supplimental Information page] website](http://www.ncdc.noaa.gov/stormevents/details.jsp?type=supplemental) at the NOAA’s the number of matched events till 1993 is only 61% and not all event types were recorded:
If take into account that the standard of weather event names was accepted since 1996 I made a decision to exclude older events from the analisis:
require(lubridate)
data$YEAR <- year(as.Date(data$BGN_DATE, "%m/%d/%Y"))
subdata <- data[data$YEAR >= 1996, ]
After this transformation the amount of events reduced to 653530 (about 72.4% of the source dataset).
Based on the goal of analysis I also filtered out all weather events with absence of population health and economic consequences:
subdata <- subdata[subdata$FATALITIES > 0 |
subdata$INJURIES > 0 |
subdata$PROPDMG > 0 |
subdata$CROPDMG > 0, ]
Information about property and crop damages is presented in the variables PROPDMG and CROPDMG respectively. Corresponding multipliers lie in the variables PROPDMGEXP and CROPDMGEXP:
table(subdata$PROPDMGEXP)[table(subdata$PROPDMGEXP) > 0]
##
## B K M
## 8448 32 185474 7364
table(subdata$CROPDMGEXP)[table(subdata$CROPDMGEXP) > 0]
##
## B K M
## 102767 2 96787 1762
To calculate absolute values of damages I used the next refactoring:
multiplier_table <- c("1" = 1, "K" = 1e3, "M" = 1e6, "B" = 1e9)
subdata$PROPDMGEXP <- ifelse(subdata$PROPDMGEXP == "",
"1", as.character(subdata$PROPDMGEXP))
subdata$CROPDMGEXP <- ifelse(subdata$CROPDMGEXP == "",
"1", as.character(subdata$CROPDMGEXP))
subdata$PROPDMG <- as.numeric(subdata$PROPDMG) *
multiplier_table[subdata$PROPDMGEXP]
subdata$CROPDMG <- as.numeric(subdata$CROPDMG) *
multiplier_table[subdata$CROPDMGEXP]
At the end of preprocessing I calculated total health and economic impacts and saved them into the corresponding variables (they will be useful for the further analysis):
fatalities_total = sum(subdata$FATALITIES)
injuries_total = sum(subdata$INJURIES)
propdmg_total = sum(subdata$PROPDMG)
cropdmg_total = sum(subdata$CROPDMG)
The supplied dataset was collected by people and as we know everybody makes mistakes. In our case it is possible to estimate the impact of difference between event names in the dataset and standard event names.
To understand the size of the problem I merged dataset and the list of standard event names:
tmp <- merge(subdata, events, by.x = "EVTYPE", by.y = "EventName")
…and calculated the losses in terms of health and economy impact:
fatalities_before = 100 * (1 - sum(tmp$FATALITIES) / fatalities_total)
injuries_before = 100 * (1 - sum(tmp$INJURIES) / injuries_total)
propdmg_before = 100 * (1 - sum(tmp$PROPDMG) / propdmg_total)
cropdmg_before = 100 * (1 - sum(tmp$CROPDMG) / cropdmg_total)
By the reason of inconsistent event names next amount of values could be lost:
As you can see this values are quite big and could have meaningfull influence on the results. Therefore I made a decision to bring the source event names in accordance with the standard ones as far as it is possible.
Looking through the source dataset and using of definitions from the NWS Storm Data Documentation I noticed three types of problem with event names:
The most frequent cases with using of abbreviations and non standard names I solved by direct substitution:
subdata$EVTYPE <- gsub("TSTM", "THUNDERSTORM", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^URBAN/SML STREAM FLD$", "HEAVY RAIN", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^WILD/FOREST FIRE$", "WILDFIRE", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^EXTREME COLD$", "EXTREME COLD/WIND CHILL", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^HURRICANE$", "HURRICANE/TYPHOON", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^TYPHOON$", "HURRICANE/TYPHOON", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^COLD$", "EXTREME COLD/WIND CHILL", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^LANDSLIDE$", "DEBRIS FLOW", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^FOG$", "DENSE FOG", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^GLAZE$", "FREEZING FOG", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^WIND$", "HIGH WIND", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^HEAVY SURF/HIGH SURF$", "HIGH SURF", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^COLD AND SNOW$", "EXTREME COLD/WIND CHILL", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^WINTER WEATHER/MIX$", "EXTREME COLD/WIND CHILL", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^THUNDERSTORM WIND/HAIL$", "THUNDERSTORM WIND", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^EXTREME WINDCHILL$", "EXTREME COLD/WIND CHILL", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^STORM SURGE$", "STORM SURGE/TIDE", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^RIVER FLOODING$", "FLOOD", subdata$EVTYPE)
subdata$EVTYPE <- gsub("^COASTAL FLOODING/EROSION$", "COASTAL FLOOD", subdata$EVTYPE)
The idea of spelling-errors correction was taken from the article by Gregory V. Bard “Spelling-Error Tolerant, Order-Independent Pass-Phrases via the Damerau-Levenshtein String-Edit Distance Metric”. But in place of algorithm described in the article I used the more simple method based on the Optimal String Aligment metric (restricted Damerau-Levenshtein distance) from the R-package stringdist.
At the begining I calculated the pairwise distances between event names from the dataset and standard event names:
require(stringdist)
raw_event_names <- as.data.frame(table(subdata$EVTYPE))
names(raw_event_names) <- c("EventOldName", "Count")
dist_matrix <- stringdistmatrix(raw_event_names$EventOldName, events$EventName)
After that for every original event name I chose the closest standard event name by the OSA-distaance metric:
raw_event_names$DistOSA <- apply(dist_matrix, 1, min)
raw_event_names$EventName <- events[apply(dist_matrix, 1, which.min), ]
If the OSA-distance is equal to zero the original event name is equal to standard one, otherwise there are three possible cases:
The third case was not frequent so I focused on the first and second cases. The good way to reduce the wrong error corrections is to limit the value of OSA-distance. Strong limitations (by setting the maximum of OSA-distans <= 2 ) could slightly reduce the recall of method but at the same time would highly reduce its fullness. To avoid this I calculate the normalized OSA-distance by dividing it to the length of corrected word:
raw_event_names$DistNorm <- raw_event_names$DistOSA /
nchar(as.character(raw_event_names$EventName))
It helped to avoid such wrong correction of short lexically-similar terms like “HEAVY SEAS” -> “HEAVY SNOW”, “RAIN” -> “HAIL” e t.c. After the set of attempts I have chosen the next pair of limitations:
max_osa_dist <- 4
max_norm_dist <- 0.3
…and used them to correct of the event names marked all residual names as “UNCLASSIFIED”:
raw_event_names$EventNewName <- ifelse(raw_event_names$DistOSA <= max_osa_dist &
raw_event_names$DistNorm < max_norm_dist,
raw_event_names$EventName,
"UNCLASSIFIED")
subdata <- merge(subdata, raw_event_names[, c("EventOldName", "EventNewName")],
by.x = "EVTYPE",
by.y = "EventOldName")
After that I calculated the losses in terms of health and economy impact again:
delta <- subdata[subdata$EventNewName != "UNCLASSIFIED", ]
fatalities_after = 100 * (1 - sum(delta$FATALITIES) / fatalities_total)
injuries_after = 100 * (1 - sum(delta$INJURIES) / injuries_total)
propdmg_after = 100 * (1 - sum(delta$PROPDMG) / propdmg_total)
cropdmg_after = 100 * (1 - sum(delta$CROPDMG) / cropdmg_total)
The new estimation is:
As we can see all cleaning procedures have redused the loses of data available for analysis more then 10 times. From my point of view the dataset is ready to the final investigation now.
To plot the final results I summed up all amounts of fatalities, injuries, property and crop damages grouped by event types and converted the values of property and crop damages into millions of dollars:
total_impact <- aggregate(cbind(FATALITIES, INJURIES, PROPDMG, CROPDMG) ~ EventNewName,
data = subdata, sum)
total_impact_table <- total_impact
total_impact$PROPDMG <- round(total_impact$PROPDMG / 1e6)
total_impact$CROPDMG <- round(total_impact$CROPDMG / 1e6)
Next my_barplot() function was written to represent the pairs of “fatalities/injuries” and “property/crop demages” in one plot. The idea of such type of plotting was taken from the site stackoverflow.com and modified to the conditions of present investigation. (For details, please, see Two horizontal bar charts with shared axis in ggplot2.
require(ggplot2)
require(scales)
require(grid)
require(gridExtra)
my_barplot = function(datatable, title = "", lsubtitle = "", rsubtitle = "") {
datatable <- as.data.frame(datatable)
names(datatable) <- c("Events", "Left", "Right")
datatable <- datatable[order(-(datatable$Left + datatable$Right)), ]
datatable$Events <- factor(datatable$Events,
levels = datatable[order(datatable$Left +
datatable$Right),
"Events"])
datatable <- datatable[1:15, ]
g.mid <- ggplot(datatable, aes(x = 1, y = Events)) +
geom_text(aes(label = Events), position = "identity", size = 3) +
ggtitle("") +
ylab(NULL) +
scale_x_continuous(expand = c(0,0),
limits = c(0.94,1.065)) +
theme(
axis.title = element_blank(),
panel.grid = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.background = element_blank(),
axis.text.x = element_text(color = NA),
axis.ticks.x = element_line(color = NA),
plot.margin = unit(c(1,-1,1,-1), "mm")
)
g1 <- ggplot(datatable, aes(x = Events, y = Left)) +
geom_bar(stat = "identity") +
geom_text(aes(label = Left), hjust = 1.4, size = 3) +
ggtitle(lsubtitle) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.margin = unit(c(1,-1,1,0), "mm")
) +
scale_y_reverse(limits = c(max(datatable$Left) * 1.4, 0)) +
coord_flip()
g2 <- ggplot(datatable, aes(x = Events, y = Right)) +
xlab(NULL) +
geom_bar(stat = "identity") +
ggtitle(rsubtitle) +
geom_text(aes(label = Right), hjust = -.3, size = 3) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.margin = unit(c(1,0,1,-1), "mm")
) +
scale_y_continuous(limits = c(0,max(datatable$Right) * 1.3)) +
coord_flip()
gg.mid <- ggplot_gtable(ggplot_build(g.mid))
gg1 <- ggplot_gtable(ggplot_build(g1))
gg2 <- ggplot_gtable(ggplot_build(g2))
grid.arrange(gg1,gg.mid,gg2,
top = textGrob(title, gp=gpar(cex=1.5)),
ncol = 3,
widths = c(0.35, 0.25, 0.35))
}
Figure 1 shows the most harmful weather events with respect to the population health.
Middle column explains the standatd names of events with the biggest sums of fatalities and injuries sorted by the decrease of values. The left and the right bar plots represent the amount of fatalities and injuries separately:
my_barplot(total_impact[, c("EventNewName", "FATALITIES", "INJURIES")],
"Fig 1. The most harmful weather events with respect to population health\n(for the period of 1996 - 2011 years)",
"Fatalities", "Injuries")
As we can see the most harmful weather event with the biggest total influence on the population health is TORNADO which causes the biggest amount of injuries. Then goes EXCESSIVE HEAT and FLOOD. At the same time the most fatal weather event is EXCESSIVE HEAT.
Figure 2 shows the most harmful weather events with respect to the property and crop damages.
Middle column represents the standatd names of events with the biggest sums of property and crop damages sorted by the decrease of values. The left and the right bar plots represent the amount of property damages and crop damages separately:
my_barplot(total_impact[, c("EventNewName", "PROPDMG", "CROPDMG")],
"Fig 2. The most harmful events with the biggest economic consequences\n(for the period of 1996 - 2011 years, in millions USD)",
"Property Demage", "Crop Demage")
As we can see the most harmful weather event with the biggest total economic consequence is FLOOD which also causes the largest property damages. Then goes HURRICANE/TYPHOON and STORM SURGE/TIDE. TORNADO is at the forth place. All this weather events also caused the the biggest property damages.
But the situation with the crop damages is different. The most harmfull weather event here is DROUGHT, then goes HURRICANE/TYPHOON and FLOOD.
Here is the final data-table of the aggregate values of the weather events consequences for the years between 1996 and 2001 sorted by the standard event names. The “UNCLASSIFIED” event corresponds to the all source events which names was not associated with the standard weather event names.
library(knitr)
total_impact_table$PROPDMG <- round(total_impact_table$PROPDMG / 1e6, digits = 3)
total_impact_table$CROPDMG <- round(total_impact_table$CROPDMG / 1e6, digits = 3)
names(total_impact_table) <- c("Event Fame", "Fatalities", "Injuries",
"Property Damages, mln USD", "Crop Damages, mln USD")
kable(total_impact_table, digits=3)
| Event Fame | Fatalities | Injuries | Property Damages, mln USD | Crop Damages, mln USD |
|---|---|---|---|---|
| ASTRONOMICAL LOW TIDE | 0 | 0 | 9.745 | 0.000 |
| AVALANCHE | 223 | 156 | 3.712 | 0.000 |
| BLIZZARD | 70 | 385 | 525.659 | 7.060 |
| COASTAL FLOOD | 6 | 7 | 375.240 | 0.000 |
| COLD/WIND CHILL | 95 | 12 | 1.990 | 0.600 |
| DEBRIS FLOW | 37 | 52 | 324.578 | 20.017 |
| DENSE FOG | 69 | 855 | 20.465 | 0.000 |
| DENSE SMOKE | 0 | 0 | 0.100 | 0.000 |
| DROUGHT | 0 | 4 | 1046.101 | 13367.566 |
| DUST DEVIL | 2 | 39 | 0.664 | 0.000 |
| DUST STORM | 11 | 376 | 5.474 | 3.100 |
| EXCESSIVE HEAT | 1797 | 6393 | 9.659 | 492.402 |
| EXTREME COLD/WIND CHILL | 317 | 192 | 36.089 | 1326.023 |
| FLASH FLOOD | 887 | 1674 | 15222.254 | 1334.902 |
| FLOOD | 415 | 6759 | 144050.989 | 5002.798 |
| FREEZING FOG | 1 | 212 | 2.332 | 0.000 |
| FROST/FREEZE | 0 | 0 | 10.480 | 1094.186 |
| FUNNEL CLOUD | 0 | 1 | 0.134 | 0.000 |
| HAIL | 7 | 713 | 14595.143 | 2476.029 |
| HEAT | 237 | 1222 | 1.520 | 0.176 |
| HEAVY RAIN | 122 | 309 | 643.174 | 736.658 |
| HEAVY SNOW | 107 | 698 | 634.418 | 71.122 |
| HIGH SURF | 132 | 198 | 93.775 | 0.000 |
| HIGH WIND | 253 | 1167 | 5250.650 | 633.861 |
| HURRICANE/TYPHOON | 125 | 1326 | 81718.889 | 5350.108 |
| ICE STORM | 82 | 318 | 3642.249 | 15.660 |
| LAKE-EFFECT SNOW | 0 | 0 | 40.182 | 0.000 |
| LAKESHORE FLOOD | 0 | 0 | 7.540 | 0.000 |
| LIGHTNING | 651 | 4141 | 743.077 | 6.898 |
| MARINE HAIL | 0 | 0 | 0.004 | 0.000 |
| MARINE HIGH WIND | 1 | 1 | 1.297 | 0.000 |
| MARINE STRONG WIND | 14 | 22 | 0.418 | 0.000 |
| MARINE THUNDERSTORM WIND | 19 | 34 | 5.857 | 0.050 |
| RIP CURRENT | 542 | 503 | 0.163 | 0.000 |
| SEICHE | 0 | 0 | 0.980 | 0.000 |
| STORM SURGE/TIDE | 13 | 42 | 47834.724 | 0.855 |
| STRONG WIND | 110 | 299 | 176.994 | 64.954 |
| THUNDERSTORM WIND | 376 | 5125 | 7913.198 | 1016.943 |
| TORNADO | 1511 | 20667 | 24616.946 | 283.425 |
| TROPICAL DEPRESSION | 0 | 0 | 1.737 | 0.000 |
| TROPICAL STORM | 57 | 338 | 7642.476 | 677.711 |
| TSUNAMI | 33 | 129 | 144.062 | 0.020 |
| UNCLASSIFIED | 97 | 445 | 92.130 | 340.405 |
| VOLCANIC ASH | 0 | 0 | 0.500 | 0.000 |
| WATERSPOUT | 2 | 2 | 5.730 | 0.000 |
| WILDFIRE | 87 | 1456 | 7760.449 | 402.255 |
| WINTER STORM | 191 | 1292 | 1532.743 | 11.944 |
| WINTER WEATHER | 33 | 411 | 20.926 | 15.000 |