This project has as main objective to determine which meteorological phenomena have the greates impact in the United States. We used the the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database which contains information of major storms a weather events.
We used only data regarding injuries, fatalities, property and crop damage. We grouped the evnts in 15 categories for better handling. Also we had to create a strategy to manage the monetary variables since they were expresed in quatity and exponent format.
In order to rank events types according to the damage they produce we defined a ranking strategy that takes into account the historical ammount of damage, the mean damage per event and the maximum damage of a single event. We ranked the events separately for human and material impact and presented grafically the distributions of the ten most harmful events in each case, according to our defined ranking system
In this section we described the data adquisition and preprocessing.
First we get the file from the internet and unzip it with the function bunzip2() of the library R.utils.
setInternet2(T)
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"repdata-data-StormData.csv.bz2")
library(R.utils)
bunzip2("repdata-data-StormData.csv.bz2")
Now we read the file. Be aware that this might take several minutes
StormData <- read.csv("repdata-data-StormData.csv")
We transform the data into a data.table object because its operations are usually performed faster than the data.frame objects. We did not read the file directly in data.table format because one colum of the data has commas and the function fread() has problems processing the file.
library(data.table)
StormData <- data.table(StormData)
We want to analyze the impact of different types of meteorological events regarding human and material onsequences. For this, we will ignore the metheorological caharacteristics of the events and use only the data related to human and material impact.
We subset the data to get a smaller data set that contins only the event type, fatalities, injuries, and crop and property damages.
StormData2 <- StormData[, list(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,
CROPDMG, CROPDMGEXP)]
str(StormData2)
## Classes 'data.table' and 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
First let's check the different types of events:
head(levels(StormData2[, EVTYPE]), 30)
## [1] " HIGH SURF ADVISORY" " COASTAL FLOOD"
## [3] " FLASH FLOOD" " LIGHTNING"
## [5] " TSTM WIND" " TSTM WIND (G45)"
## [7] " WATERSPOUT" " WIND"
## [9] "?" "ABNORMAL WARMTH"
## [11] "ABNORMALLY DRY" "ABNORMALLY WET"
## [13] "ACCUMULATED SNOWFALL" "AGRICULTURAL FREEZE"
## [15] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [19] "AVALANCHE" "BEACH EROSIN"
## [21] "Beach Erosion" "BEACH EROSION"
## [23] "BEACH EROSION/COASTAL FLOOD" "BEACH FLOOD"
## [25] "BELOW NORMAL PRECIPITATION" "BITTER WIND CHILL"
## [27] "BITTER WIND CHILL TEMPERATURES" "Black Ice"
## [29] "BLACK ICE" "BLIZZARD"
length(levels(StormData2[, EVTYPE]))
## [1] 985
There are 985 types of registered events, moreover, form simple inspection, we can see that many of the labels refer to similar or even the same event. We decided to group our events into categories based on the information in http://www.ncdc.noaa.gov/oa/climate/sd/annsum2009.pdf
We create a new column maps the events in the StormData2 to a new group. This categorization is done based in the 2009 summary, and we decided to create 25 groups
StormData2[, `:=`(Group, "Other")]
StormData2[grepl("hail", EVTYPE, ignore.case = T), `:=`(Group, "Hail")]
StormData2[grepl("TUNDERSTORM|THUNDERTSORM|THUNDERSTROM|THUNDERTORM|THUNDERSTROM|TSTM|thunderstorm|THUDERSTORM|THUNDEERSTORM|THUNDERESTORM|THUNDESTORM|THUNERSTORM",
EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group, "Thunderstorm")]
StormData2[grepl("tornado", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Tornado")]
StormData2[grepl("lightning", EVTYPE, ignore.case = T) & (Group == "Other"),
`:=`(Group, "Lightning")]
StormData2[grepl("heat", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Heat")]
StormData2[grepl("cold", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Cold")]
StormData2[grepl("(flash flood)|(flashflood)|(flood/flash)", EVTYPE, ignore.case = T) &
(Group == "Other"), `:=`(Group, "Flash flood")]
StormData2[grepl("river", EVTYPE, ignore.case = T) & grepl("flood", EVTYPE,
ignore.case = T) & (Group == "Other"), `:=`(Group, "River flood")]
StormData2[grepl("urban|street", EVTYPE, ignore.case = T) & grepl("flood", EVTYPE,
ignore.case = T) & (Group == "Other"), `:=`(Group, "Urban flood")]
StormData2[grepl("coastal|beach|cstl", EVTYPE, ignore.case = T) & grepl("flood",
EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group, "Coastal flood")]
StormData2[grepl("coastal", EVTYPE, ignore.case = T) & grepl("storm", EVTYPE,
ignore.case = T) & (Group == "Other"), `:=`(Group, "Coastal storm")]
StormData2[grepl("tsunami", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Tsunami")]
StormData2[(grepl("tropical storm", EVTYPE, ignore.case = T) | grepl("hurricane",
EVTYPE, ignore.case = T)) & (Group == "Other"), `:=`(Group, "Tropical storm/Hurricanre")]
StormData2[grepl("rip current", EVTYPE, ignore.case = T) & (Group == "Other"),
`:=`(Group, "Rip current")]
StormData2[grepl("winter", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Winter Storm")]
StormData2[grepl("ice", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Ice")]
StormData2[grepl("avalanche", EVTYPE, ignore.case = T) & (Group == "Other"),
`:=`(Group, "Avalanche")]
StormData2[grepl("dust", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Dust")]
StormData2[grepl("rain", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Rain")]
StormData2[grepl("high|strong", EVTYPE, ignore.case = T) & grepl("wind|wnd",
EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group, "High Wind")]
StormData2[grepl("water", EVTYPE, ignore.case = T) & grepl("spout", EVTYPE,
ignore.case = T) & (Group == "Other"), `:=`(Group, "Water spout")]
StormData2[grepl("fire", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Fires")]
StormData2[grepl("mud", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Mudslides")]
StormData2[grepl("volcan", EVTYPE, ignore.case = T) & (Group == "Other"), `:=`(Group,
"Volcano")]
# transform the group into factor
StormData2[, `:=`(Group, as.factor(Group))]
We can check the number of groups we created
levels(StormData2[, Group])
## [1] "Avalanche" "Coastal flood"
## [3] "Coastal storm" "Cold"
## [5] "Dust" "Fires"
## [7] "Flash flood" "Hail"
## [9] "Heat" "High Wind"
## [11] "Ice" "Lightning"
## [13] "Mudslides" "Other"
## [15] "Rain" "Rip current"
## [17] "River flood" "Thunderstorm"
## [19] "Tornado" "Tropical storm/Hurricanre"
## [21] "Tsunami" "Urban flood"
## [23] "Volcano" "Water spout"
## [25] "Winter Storm"
The material damage is accounted as quantities with a second variable for expoenent. We can see that the exponent columns in the money quantities have several leveles:
levels(StormData2[, PROPDMGEXP])
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(StormData2[, CROPDMGEXP])
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
We will use the following rules to transform our numeric dollar quantities: -H/h means hundres -K/k means thousands -n/M means millions -B means billions -Numerical values indicates that the ammount has to be multiplied by the corresponding power of ten -Entries with other exponent will be innored and replaced by NA.
We code the rules above for both exponents by doing:
# For the properties
StormData2[PROPDMGEXP %in% c("h", "H"), `:=`(PROPDMG, PROPDMG * 100)]
StormData2[PROPDMGEXP %in% c("K", "k"), `:=`(PROPDMG, PROPDMG * 1000)]
StormData2[PROPDMGEXP %in% c("m", "M"), `:=`(PROPDMG, PROPDMG * 1e+06)]
StormData2[PROPDMGEXP %in% c("b", "B"), `:=`(PROPDMG, PROPDMG * 1e+09)]
StormData2[PROPDMGEXP %in% c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9"),
`:=`(PROPDMG, PROPDMG * 10^as.numeric(PROPDMGEXP))]
StormData2[PROPDMGEXP %in% c("", "?", "+", "-"), `:=`(PROPDMG, PROPDMG * 1e+09)]
# For the crops
StormData2[CROPDMGEXP %in% c("h", "H"), `:=`(CROPDMG, CROPDMG * 100)]
StormData2[CROPDMGEXP %in% c("K", "k"), `:=`(CROPDMG, CROPDMG * 1000)]
StormData2[CROPDMGEXP %in% c("m", "M"), `:=`(CROPDMG, CROPDMG * 1e+06)]
StormData2[CROPDMGEXP %in% c("b", "B"), `:=`(CROPDMG, CROPDMG * 1e+09)]
StormData2[CROPDMGEXP %in% c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9"),
`:=`(CROPDMG, CROPDMG * 10^as.numeric(CROPDMGEXP))]
StormData2[CROPDMGEXP %in% c("", "?", "+", "-"), `:=`(CROPDMG, CROPDMG * 1e+09)]
This is all the data preprocesing that we need for our analysis
In order to get an insight of effects of the events in the population we will create a new data set containing the mean, median, maximun and total of injuries and fatalities per event group.
HumanImpact <- StormData2[, list(MeanInjuries = mean(INJURIES, na.rm = T), TotalInjuries = sum(INJURIES,
na.rm = T), MaxInjuries = max(INJURIES, na.rm = T), MeanFatalities = mean(FATALITIES,
na.rm = T), TotalFatalities = sum(FATALITIES, na.rm = T), MaxFatalities = max(FATALITIES,
na.rm = T)), by = Group]
We will use a combination of mean, total and maximun for our analysis beacuse it could happen that few events of a specific type are very dangerous. For example, let's check the events that cause the most injuries in total:
head(HumanImpact[order(TotalInjuries, decreasing = T), ], 3)
## Group MeanInjuries TotalInjuries MaxInjuries MeanFatalities
## 1: Tornado 1.50591 91407 1700 0.09285
## 2: Other 0.16131 10768 800 0.01813
## 3: Thunderstorm 0.02814 9448 70 0.00216
## TotalFatalities MaxFatalities
## 1: 5636 158
## 2: 1210 29
## 3: 725 11
and know the events with larger mean number of Injuries
head(HumanImpact[order(MeanInjuries, decreasing = T), ], 3)
## Group MeanInjuries TotalInjuries MaxInjuries
## 1: Tsunami 6.450 129 129
## 2: Heat 3.483 9224 519
## 3: Tropical storm/Hurricanre 1.737 1711 780
## MeanFatalities TotalFatalities MaxFatalities
## 1: 1.6500 33 32
## 2: 1.1850 3138 583
## 3: 0.2041 201 22
The tornados cause in total the most number of injuries, but Tsunamis have the largest mean. But when we check the number of entries for tornado and tsunamis we get:
dim(StormData2[Group == "Tsunami"])[1]
## [1] 20
dim(StormData2[Group == "Tornado"])[1]
## [1] 60699
Only 20 tsunamis were registered, but 60699 tornados, therefore we cannot only use the mean or median value as sole mesure of impact. We should use a combination of the previously calculated values.
In order to get an insight of material effects of the events we will create a new data set containing the mean, maximun and total of damage in property and crops. This is very similar of what we have done in the previous section for human impact
MaterialImpact <- StormData2[, list(MeanPropDmg = mean(PROPDMG, na.rm = T),
TotalPropDmg = sum(PROPDMG, na.rm = T), MaxPropDmg = max(PROPDMG, na.rm = T),
MeanCropDmg = mean(CROPDMG, na.rm = T), TotalCropDmg = sum(CROPDMG, na.rm = T),
MaxCropDmg = max(CROPDMG, na.rm = T)), by = Group]
Again, we need to use a combination of the fetures in order to asses the event that causes the most damage.
We will use a combination of atributes to determine the most harmful events. We define a harmful scale as a combination of the Mean damage of an event group, its maximum damage and its total registered damage. We decided that harmfull scale takes into account the three parameters.
A simple way to achieve this is to rank the three parameters separately and then add the respective ranks in the three categories. The events with higher total rank will be considered as the most harmful.
Based on the previous discussion, we create separate harm ranks for Injury and Fatalities and one average rank:
HumanImpact[, `:=`(InjuryRank, rank(MeanInjuries) + rank(MaxInjuries) + rank(TotalInjuries))]
HumanImpact[, `:=`(FatalityRank, rank(MeanFatalities) + rank(MaxFatalities) +
rank(TotalFatalities))]
HumanImpact[, `:=`(Rank, (InjuryRank + FatalityRank)/2)]
The 10 most harmful events regarding injuries are:
HumanImpact[order(InjuryRank, decreasing = T)[1:10], list(InjuryRank, Group,
TotalInjuries, MeanInjuries, MaxInjuries)]
## InjuryRank Group TotalInjuries MeanInjuries
## 1: 72.0 Tornado 91407 1.50591
## 2: 67.0 Heat 9224 3.48338
## 3: 65.0 Ice 2164 0.98858
## 4: 61.0 Other 10768 0.16131
## 5: 61.0 Tropical storm/Hurricanre 1711 1.73706
## 6: 51.0 Winter Storm 1891 0.09646
## 7: 50.5 Fires 1608 0.37933
## 8: 49.5 Tsunami 129 6.45000
## 9: 49.0 Lightning 5231 0.33183
## 10: 44.5 Flash flood 1802 0.03237
## MaxInjuries
## 1: 1700
## 2: 519
## 3: 1568
## 4: 800
## 5: 780
## 6: 165
## 7: 150
## 8: 129
## 9: 51
## 10: 150
The 10 most harmful events regarding fatalities are:
HumanImpact[order(FatalityRank, decreasing = T)[1:10], list(FatalityRank, Group,
TotalFatalities, MeanFatalities, MaxFatalities)]
## FatalityRank Group TotalFatalities MeanFatalities
## 1: 73.0 Heat 3138 1.18505
## 2: 66.0 Tornado 5636 0.09285
## 3: 57.0 Tsunami 33 1.65000
## 4: 55.0 Other 1210 0.01813
## 5: 54.0 Tropical storm/Hurricanre 201 0.20406
## 6: 53.5 Cold 451 0.18318
## 7: 52.0 Flash flood 1035 0.01859
## 8: 52.0 Rip current 577 0.74260
## 9: 47.0 Avalanche 224 0.57881
## 10: 45.0 Lightning 817 0.05183
## MaxFatalities
## 1: 583
## 2: 158
## 3: 32
## 4: 29
## 5: 22
## 6: 14
## 7: 20
## 8: 6
## 9: 6
## 10: 5
In average, the 10 most harmful events are:
HumanImpact[order(Rank, decreasing = T)[1:10], list(Rank, Group)]
## Rank Group
## 1: 70.00 Heat
## 2: 69.00 Tornado
## 3: 58.00 Other
## 4: 57.50 Tropical storm/Hurricanre
## 5: 53.25 Tsunami
## 6: 51.00 Ice
## 7: 48.25 Flash flood
## 8: 47.00 Cold
## 9: 47.00 Lightning
## 10: 47.00 Rip current
We can see the distribution of events of the 10 most harmful types by plotting:
top10 <- HumanImpact[order(Rank, decreasing = T)[1:10], Group]
library(ggplot2)
ggplot(data = StormData2[Group %in% top10]) + geom_jitter(aes(x = Group, y = INJURIES,
color = "Injuries", alpha = INJURIES)) + geom_jitter(aes(x = Group, y = FATALITIES,
color = "Fatalities", alpha = FATALITIES)) + coord_flip() + ggtitle("Fatalities and Injuries of the ten most harmful events for humans") +
xlab("Event group") + ylab("Fatalities/Injuries") + scale_y_log10(breaks = as.vector((1:9) %*%
t(10^(1:3)))) + theme(axis.text.x = element_text(angle = -90, hjust = 1)) +
scale_x_discrete(limits = rev(top10)) + scale_alpha(guide = "none") + scale_colour_manual("",
breaks = c("Fatalities", "Injuries"), values = c("red", "black"))
The scale of the plot is logarithmic in order to have a better view of the fatalities and injury distribution. From the plot we can see that in general, the number of fatalities is lower than the number of injuries. Most individual events have less than 10 fatalities.
We can also easily see that the tornados and heat waves are the most common types of events since we have the more ammount of moints. Moreover, these are the only events with a very large ammount events with more than 10 injuries.
Only 3 tornadoes and an Ice related event provoqued more than 1000 injuries. The mostdeathly event was a tornado that caused more than 500 deaths.
The rank of Tsunamis and Tropical Storms/Hurricanes is interesting. These events are uncommon (few points in the plot), however, since our rank system takes into account not only the total fatalities and injuries but the means, they rank high in our scale. More specifically, tsunamis are very uncommon (only 20 registered), but a single tsunami caused around 35 deaths which is more than most of the tornados.
Following the same stepes as for human harm, we create separate damage ranks for property and crop damage:
MaterialImpact[, `:=`(PropDmgRank, rank(MeanPropDmg) + rank(MaxPropDmg) + rank(TotalPropDmg))]
MaterialImpact[, `:=`(CropDmgRank, rank(MeanCropDmg) + rank(MaxCropDmg) + rank(TotalCropDmg))]
MaterialImpact[, `:=`(Rank, (PropDmgRank + CropDmgRank)/2)]
The 10 most harmful events regarding property damage are:
MaterialImpact[order(PropDmgRank, decreasing = T)[1:10], list(PropDmgRank, Group,
TotalPropDmg, MeanPropDmg, MaxPropDmg)]
## PropDmgRank Group TotalPropDmg MeanPropDmg
## 1: 75.0 Flash flood 6.842e+13 1.229e+09
## 2: 71.0 Thunderstorm 2.108e+13 6.279e+07
## 3: 67.0 Tornado 1.142e+12 1.881e+07
## 4: 61.0 Tropical storm/Hurricanre 9.247e+10 9.388e+07
## 5: 61.0 Lightning 1.760e+11 1.116e+07
## 6: 59.0 Other 3.098e+11 4.642e+06
## 7: 58.0 Hail 3.986e+11 1.373e+06
## 8: 54.0 High Wind 5.907e+10 2.293e+06
## 9: 52.5 River flood 5.237e+09 2.555e+07
## 10: 47.0 Fires 8.497e+09 2.004e+06
## MaxPropDmg
## 1: 6.800e+13
## 2: 1.400e+13
## 3: 8.800e+11
## 4: 1.693e+10
## 5: 1.300e+11
## 6: 1.150e+11
## 7: 3.000e+11
## 8: 2.000e+10
## 9: 5.000e+09
## 10: 1.500e+09
The 10 most harmful events regarding fatalities are:
MaterialImpact[order(CropDmgRank, decreasing = T)[1:10], list(CropDmgRank, Group,
TotalCropDmg, MeanCropDmg, MaxCropDmg)]
## CropDmgRank Group TotalCropDmg MeanCropDmg
## 1: 70.5 River flood 5.057e+09 24670654
## 2: 68.0 Tropical storm/Hurricanre 6.210e+09 6304760
## 3: 67.5 Ice 5.027e+09 2296535
## 4: 65.0 Other 2.208e+10 330820
## 5: 62.0 Thunderstorm 9.207e+09 27426
## 6: 59.0 Cold 1.417e+09 575453
## 7: 58.0 Hail 6.114e+09 21054
## 8: 55.0 Heat 9.045e+08 341567
## 9: 51.5 Rain 9.193e+08 75181
## 10: 51.5 Flash flood 1.532e+09 27522
## MaxCropDmg
## 1: 5.000e+09
## 2: 1.510e+09
## 3: 5.000e+09
## 4: 1.000e+09
## 5: 4.000e+09
## 6: 5.960e+08
## 7: 3.000e+09
## 8: 4.924e+08
## 9: 2.000e+08
## 10: 2.000e+08
In average, the 10 most harmful events are:
MaterialImpact[order(Rank, decreasing = T)[1:10], list(Rank, Group)]
## Rank Group
## 1: 66.50 Thunderstorm
## 2: 64.50 Tropical storm/Hurricanre
## 3: 63.25 Flash flood
## 4: 62.00 Other
## 5: 61.50 River flood
## 6: 58.00 Hail
## 7: 54.25 Ice
## 8: 53.50 Tornado
## 9: 50.50 High Wind
## 10: 46.75 Rain
The most harmfull events are water related (floods, storm, rain, etc).
Similarly as before, we can see the distribution of events of the 10 most harmful types by plotting:
top10 <- MaterialImpact[order(Rank, decreasing = T)[1:10], Group]
ggplot(data = StormData2[Group %in% top10]) + geom_jitter(aes(x = Group, y = PROPDMG,
color = "Property Damage", alpha = PROPDMG)) + geom_jitter(aes(x = Group,
y = CROPDMG, color = "Crop Damage", alpha = CROPDMG)) + coord_flip() + ggtitle("Monetary cost of most damaging events") +
xlab("Event group") + ylab("Cost in US Dollars") + scale_y_log10(breaks = as.vector(c(1,
5) %*% t(10^(1:14)))) + theme(axis.text.x = element_text(angle = -90, hjust = 1)) +
scale_x_discrete(limits = rev(top10)) + scale_alpha(guide = "none") + scale_colour_manual("",
breaks = c("Property Damage", "Crop Damage"), values = c("black", "green"))
Since the damage costs spam a very large range, we plotted the money quantities in logarithmic scale.
Hail and Thunderstorms seem to be the most common damaging events for crops. While ice, huricanes, tornados and wind are more damaging for property.
The crop damages rarely surpase the 10.000.000 dollars. However, Thunderstorms and Hurricanes can reach much higher damage values slightly more often than the rest of the events.
In this plot we can again see traces of our multicriteria ranking system. For example, the hurricanes/Tropical storms are not as common as othe events, however, when they occur, the damage is comparable to thunderstorms or tornados which are much more common.