NOAA Storm Data Analysis (1950-2011)
Wednesday, October 26, 2014
Analysis of the NOAA Storm Database to predict severe weather events
1. Summary
This study presents a high-level analysis of the consequences of severe weather events in the US from 1950 to 2011, in terms of the total damage caused and the impact on public health.
The analysis is based on the NOAA Storm Database. Details on the data may be found in the National Weather Service Storm Data Documentation and in the National Climatic Data Center Storm Events FAQ. The original data may be found at the Coursera Reproducible Research course web site: Storm Data [47Mb]
This study is divided into three parts: loading the data, processing the data, and presenting the results.
For reproducibility purposes, we have chosen not to supply the cleaned data, but to present a way to load and read the data from its original source, as described in the summary section.
library(reshape2)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.1
For reasons of efficiency, speed and memory management, we have opted to read only the columns of the data that we are going to work on.
Specifically, we need the event types, the fatalities, the injuries, the property cost, the property cost level (its order-of-magnitude code), the crop cost and the crop cost level. These are columns 8 and 23 to 28 of the data set.
# "NULL" skips a column; NA lets read.csv guess its type (columns 8 and 23 to 28)
DATA <- read.csv("repdata_data_StormData.csv",
                 colClasses = c(rep("NULL", 7), NA, rep("NULL", 14), rep(NA, 6), rep("NULL", 9)),
                 header = TRUE, as.is = TRUE)
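The column selection above can also be written by position, which may be easier to check against the column numbers quoted in the text. The following is a minimal sketch on a made-up five-column CSV; the toy data and the names `csv_text`, `n_cols`, `keep` and `col_classes` are assumptions for illustration only.

```r
# Toy CSV (hypothetical data): five columns, of which only 2 and 4 are wanted.
csv_text <- "ID,EVTYPE,STATE,FATALITIES,REMARKS
1,TORNADO,AL,3,none
2,HAIL,TX,0,none"

n_cols <- 5          # total number of columns in the file
keep   <- c(2, 4)    # positions of the columns to read

col_classes <- rep("NULL", n_cols)  # "NULL" tells read.csv to skip a column
col_classes[keep] <- NA             # NA lets read.csv guess the column type

toy <- read.csv(text = csv_text, colClasses = col_classes,
                header = TRUE, as.is = TRUE)
print(names(toy))
```

For the storm file, the same idea would use a vector of length 37 with positions `c(8, 23:28)` set to `NA`, which matches the counts in the `read.csv` call above (7 + 1 + 14 + 6 + 9 columns).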
2.2. Part 2: Processing the Data
In order to calculate the total cost, the cost values contained in the data must be converted to actual USD values. The notation used is:
• H = 100 USD
• K = 1,000 USD
• M = 1,000,000 USD
• B = 1,000,000,000 USD
DATA$PROPDMGVAL <- DATA$PROPDMG * sapply(toupper(DATA$PROPDMGEXP), function(x) {
switch(x, H = 10^2, K = 10^3, M = 10^6, B = 10^9, 1)
})
DATA$CROPDMGVAL <- DATA$CROPDMG * sapply(toupper(DATA$CROPDMGEXP), function(x) {
switch(x, H = 10^2, K = 10^3, M = 10^6, B = 10^9, 1)
})
DATA$COST <- DATA$PROPDMGVAL + DATA$CROPDMGVAL
DATA$HEALTHDMG <- DATA$FATALITIES + DATA$INJURIES
DATA$EVTYPE <- factor(DATA$EVTYPE, levels = unique(DATA$EVTYPE), ordered = T)
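The multiplier conversion can be illustrated on a few made-up records. This sketch swaps the `sapply`/`switch` construct for a lookup in a named vector, which is vectorised and should give the same result for the codes H, K, M and B; the toy data frame and the names `multipliers` and `mult` are assumptions, not part of the original analysis.

```r
# Hypothetical toy records; the column names mirror those of the storm data.
toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 10, 7),
                  PROPDMGEXP = c("K", "m", "B", "", "?"),
                  stringsAsFactors = FALSE)

# Multipliers for the documented magnitude codes.
multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Look each code up by name. Codes that are not H, K, M or B (including the
# empty string) yield NA, which is replaced by 1 to mirror the default
# branch of the switch() call.
mult <- unname(multipliers[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 1

toy$PROPDMGVAL <- toy$PROPDMG * mult
print(toy)
```

As in the original code, any unrecognised code is treated as a multiplier of 1, so the raw value is kept unchanged.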
In order to process the data, we have used the "melt" and "dcast" functions from the reshape2 R package to aggregate the health events and the damage cost over the event types, using the "sum" function.
mcost <- melt(DATA, id.vars = "EVTYPE", measure.vars = c("PROPDMGVAL", "CROPDMGVAL",
"COST"))
dcost <- dcast(mcost, EVTYPE ~ variable, sum)
dcost <- dcost[order(-dcost$COST), ]
dcost$EVTYPE <- factor(dcost$EVTYPE, levels = unique(dcost$EVTYPE), ordered = T)
mhealth <- melt(DATA, id.vars = "EVTYPE", measure.vars = c("FATALITIES", "INJURIES",
"HEALTHDMG"))
dhealth <- dcast(mhealth, EVTYPE ~ variable, sum)
dhealth <- dhealth[order(-dhealth$HEALTHDMG), ]
dhealth$EVTYPE <- factor(dhealth$EVTYPE, levels = unique(dhealth$EVTYPE), ordered = T)
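For readers without reshape2, the same kind of per-event-type sum can be sketched in base R with `aggregate()`. This is an alternative technique shown on made-up data, not the method used in this report; the toy data frame and the name `totals` are assumptions.

```r
# Hypothetical toy records; the column names mirror those of the storm data.
toy <- data.frame(EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
                  FATALITIES = c(3, 1, 2, 0, 4),
                  INJURIES   = c(10, 0, 5, 1, 2),
                  stringsAsFactors = FALSE)
toy$HEALTHDMG <- toy$FATALITIES + toy$INJURIES

# Sum the three measures within each event type.
totals <- aggregate(cbind(FATALITIES, INJURIES, HEALTHDMG) ~ EVTYPE,
                    data = toy, FUN = sum)

# Sort by total health impact, largest first, as done for dhealth above.
totals <- totals[order(-totals$HEALTHDMG), ]
print(totals)
```

The result has one row per event type, ordered from the highest to the lowest combined impact, which is the shape the plots and tables in the next part rely on.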
In this part we present the results of the analysis. For presentation purposes we have created two plots, using the ggplot2 graphics package for R.
3.1 Health Impact by Event Type
The health impact analysis presents the 20 event types with the highest impact on the population's health. The measure has been calculated by adding the fatalities and injuries caused by each event type. For each event type, the fatalities and the injuries are also shown in the same plot.
healthplot <- ggplot(data = dhealth[1:20, ], aes(x = reorder(EVTYPE, -HEALTHDMG),
y = HEALTHDMG)) + xlab("Event Type") + ylab("Total Impact")
healthplot <- healthplot + geom_bar(fill = "red", stat = "identity") + theme(axis.text.x = element_text(angle = 90,
hjust = 0.5, vjust = 1))
healthplot <- healthplot + geom_line(aes(x = reorder(EVTYPE, -HEALTHDMG), y = FATALITIES,
group = 1, colour = "Fatalities"))
healthplot <- healthplot + geom_point(aes(x = reorder(EVTYPE, -HEALTHDMG), y = FATALITIES,
group = 1, colour = "Fatalities"))
healthplot <- healthplot + geom_line(aes(x = reorder(EVTYPE, -HEALTHDMG), y = INJURIES,
group = 2, colour = "Injuries"))
healthplot <- healthplot + geom_point(aes(x = reorder(EVTYPE, -HEALTHDMG), y = INJURIES,
group = 2, colour = "Injuries"))
healthplot <- healthplot + scale_colour_manual("Type of Impact", breaks = c("Fatalities",
"Injuries"), values = c("yellow", "blue"))
healthplot <- healthplot + ggtitle("Top 20 Health Impacts by Event Type (1950-2011)")
print(healthplot)
[Plot: health impact by event type, chunk analyse_impact]
As the plot shows, tornadoes account for by far the largest share of the health impact, leading in both fatalities and injuries. The table of the top 20 event types is shown below:
print(dhealth[1:20, ])
##                 EVTYPE FATALITIES INJURIES HEALTHDMG
## 1              TORNADO       5633    91346     96979
## 99      EXCESSIVE HEAT       1903     6525      8428
## 2            TSTM WIND        504     6957      7461
## 36               FLOOD        470     6789      7259
## 15           LIGHTNING        816     5230      6046
## 27                HEAT        937     2100      3037
## 20         FLASH FLOOD        978     1777      2755
## 65           ICE STORM         89     1975      2064
## 16   THUNDERSTORM WIND        133     1488      1621
## 8         WINTER STORM        206     1321      1527
## 46           HIGH WIND        248     1137      1385
## 3                 HAIL         15     1361      1376
## 973  HURRICANE/TYPHOON         64     1275      1339
## 53          HEAVY SNOW        127     1021      1148
## 221           WILDFIRE         75      911       986
## 10  THUNDERSTORM WINDS         64      908       972
## 47            BLIZZARD        101      805       906
## 276                FOG         62      734       796
## 18         RIP CURRENT        368      232       600
## 227   WILD/FOREST FIRE         12      545       557
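To put the tornado figure in perspective, its share of the combined impact can be computed from the HEALTHDMG column printed above. This is a small sketch restricted to the 20 rows shown, so the share is relative to the top 20 only, not to the full data set; the names `healthdmg` and `tornado_share` are assumptions.

```r
# HEALTHDMG values of the top 20 event types, copied from the table above
# in the same order (TORNADO first).
healthdmg <- c(96979, 8428, 7461, 7259, 6046, 3037, 2755, 2064, 1621, 1527,
               1385, 1376, 1339, 1148, 986, 972, 906, 796, 600, 557)

# Share of the top-20 total that is attributed to tornadoes.
tornado_share <- healthdmg[1] / sum(healthdmg)
print(round(100 * tornado_share, 1))
```

The same calculation applied to the FATALITIES or INJURIES column would show how the share differs between the two kinds of impact.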
3.3 Damage by Event Type
The damage analysis presents the 20 event types responsible for the highest damage. The total damage has been calculated by adding the property damage and the crop damage caused by each event type. For each event type, the property and crop damage are also shown in the same plot.
costplot <- ggplot(data = dcost[1:20, ], aes(x = reorder(EVTYPE, -COST), y = COST)) +
xlab("Event Type") + ylab("Total Cost")
costplot <- costplot + geom_bar(fill = "blue", stat = "identity", group = 1) +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 1))
costplot <- costplot + geom_line(aes(x = reorder(EVTYPE, -COST), y = PROPDMGVAL,
group = 2, colour = "Property damage"))
costplot <- costplot + geom_point(aes(x = reorder(EVTYPE, -COST), y = PROPDMGVAL,
group = 2, colour = "Property damage"))
costplot <- costplot + geom_line(aes(x = reorder(EVTYPE, -COST), y = CROPDMGVAL,
group = 3, colour = "Crop damage"))
costplot <- costplot + geom_point(aes(x = reorder(EVTYPE, -COST), y = CROPDMGVAL,
group = 3, colour = "Crop damage"))
costplot <- costplot + scale_colour_manual("Type of damage", breaks = c("Property damage",
"Crop damage"), values = c("red", "green"))
costplot <- costplot + ggtitle("Top 20 Damages by Event Type (1950-2011)")
print(costplot)
[Plot: damage by event type, chunk analyse_damage]
As we may notice, the ranking of the event types that have caused the greatest damage is not the same as the ranking of those responsible for health events. The greatest damage has been caused by floods and hurricanes, and special mention should be made of drought, which has caused severe crop damage over the years. The table of the top 20 event types is shown below:
print(dcost[1:20, ])
##                        EVTYPE PROPDMGVAL CROPDMGVAL      COST
## 36                      FLOOD  1.447e+11  5.662e+09 1.503e+11
## 973         HURRICANE/TYPHOON  6.931e+10  2.608e+09 7.191e+10
## 1                     TORNADO  5.694e+10  4.150e+08 5.735e+10
## 204               STORM SURGE  4.332e+10  5.000e+03 4.332e+10
## 3                        HAIL  1.573e+10  3.026e+09 1.876e+10
## 20                FLASH FLOOD  1.614e+10  1.421e+09 1.756e+10
## 194                   DROUGHT  1.046e+09  1.397e+10 1.502e+10
## 226                 HURRICANE  1.187e+10  2.742e+09 1.461e+10
## 52                RIVER FLOOD  5.119e+09  5.029e+09 1.015e+10
## 65                  ICE STORM  3.945e+09  5.022e+09 8.967e+09
## 209            TROPICAL STORM  7.704e+09  6.783e+08 8.382e+09
## 8                WINTER STORM  6.688e+09  2.694e+07 6.715e+09
## 46                  HIGH WIND  5.270e+09  6.386e+08 5.909e+09
## 221                  WILDFIRE  4.765e+09  2.955e+08 5.061e+09
## 2                   TSTM WIND  4.485e+09  5.540e+08 5.039e+09
## 976          STORM SURGE/TIDE  4.641e+09  8.500e+05 4.642e+09
## 16          THUNDERSTORM WIND  3.483e+09  4.148e+08 3.898e+09
## 13             HURRICANE OPAL  3.173e+09  1.900e+07 3.192e+09
## 227          WILD/FOREST FIRE  3.002e+09  1.068e+08 3.109e+09
## 313 HEAVY RAIN/SEVERE WEATHER  2.500e+09  0.000e+00 2.500e+09