By Antonio Ferraro Date: 16 January 2016
Weather events such as flash floods, tornadoes, and thunderstorms have often caused many casualties and great economic damage.
This exploratory analysis assesses a few facts about these catastrophic events, following the Project 2 rubric of the Reproducible Research course (January 2016). The data come from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which contains events dating from 1950 to November 2011.
The analysis reads the data and reduces it by removing elements that are not useful for the specific tasks. Aggregations are then created and used to show the results in the form of bar plots and tables.
We load our data.
storms = read.csv(bzfile("repdata_data_StormData.csv.bz2"), header = TRUE)
It is always good to look at the raw data. Let us do so to see what type of database we are looking at (this is not displayed here; it is just part of my analysis).
str(storms)
summary(storms)
head(storms)
The data may also contain events which (luckily) did not cause damage, fatalities, or injuries. To get an idea of how many, let us count them.
nrow(storms[storms$INJURIES==0.0 & storms$PROPDMG==0.0 & storms$FATALITIES==0.0 & storms$CROPDMG==0.0 , ])
## [1] 647664
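The count above tests each impact column separately. As a minimal sketch of the same idea, the check can also be written once over a vector of column names. The data frame `toy` and its values are made up for illustration, and the sketch assumes the impact columns contain no missing values.

```r
# Toy data frame standing in for the storm data (values are made up).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(2, 0, 0, 1),
  INJURIES   = c(5, 0, 0, 0),
  PROPDMG    = c(25, 0, 1.5, 0),
  CROPDMG    = c(0, 0, 0, 0)
)

impact_cols <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# TRUE for rows where every impact column is zero
# (assumes no NA values in these columns).
no_impact <- rowSums(toy[, impact_cols] != 0) == 0

# Number of rows without any recorded impact.
sum(no_impact)

# Keep only the rows with at least one non-zero impact.
toy_kept <- toy[!no_impact, ]
```

Listing the columns in one place makes it harder for the counting step and the later filtering step to drift apart.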
We are not interested in these entries, as they neither affect public health nor have economic consequences, so we can remove them from the dataset (making it lighter). I will also remove columns that I think are not relevant for this analysis.
For the fatalities and the injuries, I simply aggregate the relevant events by event type. For the damages, I aggregate the property damages and the crop damages separately and then sum them into a single aggregated table. The plots are then generated in the same way for each aggregate.
storms <- storms[,c("REFNUM", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
storms <- storms[!(storms$INJURIES==0 & storms$PROPDMG==0.0 & storms$FATALITIES==0 & storms$CROPDMG==0.0), ]
# AGGREGATIONS: FATALITIES
stormfatalities <- aggregate(storms$FATALITIES, by=list(storms$EVTYPE), FUN=sum, na.rm=TRUE)
colnames(stormfatalities) = c("typeofstorm", "fatalities")
stormfatalities <- stormfatalities[order(-stormfatalities$fatalities),]
# This data frame may contain rows that relate to events that have no casualties
nrow(stormfatalities[stormfatalities$fatalities==0,])
## [1] 320
# I remove them as well
stormfatalities <- stormfatalities[stormfatalities$fatalities !=0,]
# Check how many rows remain to see whether a full plot is feasible
nrow(stormfatalities)
## [1] 168
# AGGREGATIONS: INJURIES
storminjuries <- aggregate(storms$INJURIES, by=list(storms$EVTYPE), FUN=sum, na.rm=TRUE)
colnames(storminjuries) = c("typeofstorm", "injuries")
storminjuries <- storminjuries[order(-storminjuries$injuries),]
# This data frame may contain rows that relate to events that have no injuries
nrow(storminjuries[storminjuries$injuries==0,])
## [1] 330
# I remove them as well
storminjuries <- storminjuries[storminjuries$injuries !=0,]
# Check how many rows remain to see whether a full plot is feasible
nrow(storminjuries)
## [1] 158
# Numerical multipliers for property damages (making assumptions based on table(storms$PROPDMGEXP))
# Add exp numerical columns
storms$PROPDMGEXPNum<-0
storms$CROPDMGEXPNum<-0
storms[storms$PROPDMGEXP=="-" | storms$PROPDMGEXP=="+" | storms$PROPDMGEXP=="?" | storms$PROPDMGEXP=="0" | storms$PROPDMGEXP=="", ]$PROPDMGEXPNum<-1
storms[storms$PROPDMGEXP=="H" | storms$PROPDMGEXP=="h" | storms$PROPDMGEXP=="2",]$PROPDMGEXPNum<-100
storms[storms$PROPDMGEXP=="K" | storms$PROPDMGEXP=="k" | storms$PROPDMGEXP=="3",]$PROPDMGEXPNum<-1000
storms[storms$PROPDMGEXP=="4",]$PROPDMGEXPNum<-10000
storms[storms$PROPDMGEXP=="5",]$PROPDMGEXPNum<-100000
storms[storms$PROPDMGEXP=="M" | storms$PROPDMGEXP=="m" | storms$PROPDMGEXP=="6",]$PROPDMGEXPNum<-1000000
storms[storms$PROPDMGEXP=="7",]$PROPDMGEXPNum<-10000000
storms[storms$PROPDMGEXP=="B" | storms$PROPDMGEXP=="b",]$PROPDMGEXPNum<-1000000000
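# No events in the database with exponent 8, 9, etc.
# Numerical multipliers for crop damages (making assumptions based on table(storms$CROPDMGEXP))
# Note: unlike PROPDMGEXP, a blank CROPDMGEXP is not mapped to 1 here, so it keeps the multiplier 0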
storms[storms$CROPDMGEXP=="?" | storms$CROPDMGEXP=="0", ]$CROPDMGEXPNum<-1
storms[storms$CROPDMGEXP=="K" | storms$CROPDMGEXP=="k" ,]$CROPDMGEXPNum<-1000
storms[storms$CROPDMGEXP=="M" | storms$CROPDMGEXP=="m",]$CROPDMGEXPNum<-1000000
storms[storms$CROPDMGEXP=="B" | storms$CROPDMGEXP=="b",]$CROPDMGEXPNum<-1000000000
# AGGREGATIONS: MATERIAL DAMAGES Using the exponentials
stormdamagesP <- aggregate(storms$PROPDMG*storms$PROPDMGEXPNum, by=list(storms$EVTYPE), FUN=sum, na.rm=TRUE)
colnames(stormdamagesP) = c("typeofstorm", "damages")
stormdamagesP <- stormdamagesP[order(-stormdamagesP$damages),]
# This data frame may contain rows that relate to events that have no damages
nrow(stormdamagesP[stormdamagesP$damages==0,])
## [1] 82
# I remove them as well
stormdamagesP <- stormdamagesP[stormdamagesP$damages !=0 ,]
# AGGREGATIONS: CROP DAMAGES
stormdamagesC <- aggregate(storms$CROPDMG * storms$CROPDMGEXPNum, by=list(storms$EVTYPE), FUN=sum, na.rm=TRUE)
colnames(stormdamagesC) = c("typeofstorm", "damages")
# This data frame may contain rows that relate to events that have no damages
nrow(stormdamagesC[stormdamagesC$damages==0,])
## [1] 352
# I remove them as well
stormdamagesC <- stormdamagesC[stormdamagesC$damages !=0 ,]
stormdamages <- rbind(stormdamagesP, stormdamagesC)
# And now aggregate by stormtype
stormdamages<-aggregate(stormdamages$damages, by=list(stormdamages$typeofstorm), FUN=sum, na.rm=TRUE)
# All damages
colnames(stormdamages) = c("typeofstorm", "damages")
stormdamages <- stormdamages[order(-stormdamages$damages),]
nrow(stormdamages)
## [1] 431
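The exponent handling above can also be expressed as a lookup table. The following is only a sketch on made-up data, not the method used in this report: `exp_lookup`, `exp_to_multiplier`, and `blank_value` are hypothetical names, and a shared table treats every code the same way in both columns, which is not identical to the column-by-column assignments above.

```r
# Assumed mapping from exponent codes to multipliers, modelled on the
# property-damage assignments above. Codes are upper-cased before lookup.
exp_lookup <- c("-" = 1, "+" = 1, "?" = 1, "0" = 1,
                "H" = 1e2, "2" = 1e2,
                "K" = 1e3, "3" = 1e3,
                "4" = 1e4, "5" = 1e5,
                "M" = 1e6, "6" = 1e6,
                "7" = 1e7, "B" = 1e9)

# Hypothetical helper: translate exponent codes into numeric multipliers.
# Unknown codes get 0; blank codes get `blank_value`.
exp_to_multiplier <- function(code, blank_value = 1) {
  code <- toupper(as.character(code))
  mult <- unname(exp_lookup[code])
  mult[is.na(mult)] <- 0
  mult[!is.na(code) & code == ""] <- blank_value
  mult
}

# Made-up events, for illustration only.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL", "TORNADO"),
  PROPDMG    = c(2, 5, 0, 1.5),
  PROPDMGEXP = c("M", "K", "", "B"),
  CROPDMG    = c(1, 0, 3, 0),
  CROPDMGEXP = c("k", "", "M", "")
)

# Property plus crop damage per event, in USD.
# blank_value = 0 mirrors how blank crop exponents are treated above.
toy$damages <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP) +
  toy$CROPDMG * exp_to_multiplier(toy$CROPDMGEXP, blank_value = 0)

# One aggregation over the combined column, sorted by decreasing damage.
toy_damages <- aggregate(damages ~ EVTYPE, data = toy, FUN = sum)
toy_damages <- toy_damages[order(-toy_damages$damages), ]
```

Summing the two damage columns per event first means a single `aggregate()` call is enough, instead of binding two aggregated tables and aggregating again.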
The number of storm types that have caused fatalities is too large for a readable bar plot, so I only consider the 15 types of storms that have cost the most human lives during the period covered by the database.
library(ggplot2)
library(colorspace)
mypal<-terrain_hcl(15)
ggplot(data=head(stormfatalities,15), aes(x=reorder(typeofstorm, -fatalities),
y=fatalities, fill=typeofstorm)) +
geom_bar(stat="identity", colour="darkblue") +
scale_fill_manual(values = mypal)+
scale_y_continuous(breaks=seq(0, 5500, 250))+
xlab("Type of storm") +
ylab("Total Fatalities" ) +
ggtitle("Fatalities By Event Type") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Tornadoes are the storms that cause the most fatalities. It is still worth noting that we have only considered the first 15. By doing this we left out:
sum(stormfatalities$fatalities)-sum(head(stormfatalities,15)$fatalities)
## [1] 2189
Out of a total number of fatalities of:
sum(stormfatalities$fatalities)
## [1] 15145
Corresponding to the following proportion of the total fatalities:
(sum(stormfatalities$fatalities)-sum(head(stormfatalities,15)$fatalities))/
sum(stormfatalities$fatalities)
## [1] 0.1445362
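The same "left out" proportion is computed again for injuries and damages. A small helper could express the calculation once. This is only a sketch: `share_outside_top` is a hypothetical name and the numbers below are made up.

```r
# Hypothetical helper: share of the total that is NOT covered by the
# n largest values.
share_outside_top <- function(values, n = 15) {
  sorted <- sort(values, decreasing = TRUE)
  1 - sum(head(sorted, n)) / sum(sorted)
}

# Made-up totals for five event types.
toy_fatalities <- c(50, 30, 10, 6, 4)

# The top two types account for 80 of 100, so about 0.2 is left out.
share_outside_top(toy_fatalities, n = 2)
```

Because the helper sorts its input, it does not depend on the aggregated table having been ordered beforehand.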
Every human life is important. Excluding 14% of the victims from the statistics does not seem right to me, so I feel it is right to publish the full table.
This table raises questions: it is hard to understand the difference between events like “ICE ON ROAD” and “ICY ROADS”. The database documentation does not help answer this type of question.
library(knitr)
colnames(stormfatalities) = c("Storm Type", "Fatalities")
kable(stormfatalities, row.names=FALSE, format = "html")
| Storm Type | Fatalities |
|---|---|
| TORNADO | 5633 |
| EXCESSIVE HEAT | 1903 |
| FLASH FLOOD | 978 |
| HEAT | 937 |
| LIGHTNING | 816 |
| TSTM WIND | 504 |
| FLOOD | 470 |
| RIP CURRENT | 368 |
| HIGH WIND | 248 |
| AVALANCHE | 224 |
| WINTER STORM | 206 |
| RIP CURRENTS | 204 |
| HEAT WAVE | 172 |
| EXTREME COLD | 160 |
| THUNDERSTORM WIND | 133 |
| HEAVY SNOW | 127 |
| EXTREME COLD/WIND CHILL | 125 |
| STRONG WIND | 103 |
| BLIZZARD | 101 |
| HIGH SURF | 101 |
| HEAVY RAIN | 98 |
| EXTREME HEAT | 96 |
| COLD/WIND CHILL | 95 |
| ICE STORM | 89 |
| WILDFIRE | 75 |
| HURRICANE/TYPHOON | 64 |
| THUNDERSTORM WINDS | 64 |
| FOG | 62 |
| HURRICANE | 61 |
| TROPICAL STORM | 58 |
| HEAVY SURF/HIGH SURF | 42 |
| LANDSLIDE | 38 |
| COLD | 35 |
| HIGH WINDS | 35 |
| TSUNAMI | 33 |
| WINTER WEATHER | 33 |
| UNSEASONABLY WARM AND DRY | 29 |
| URBAN/SML STREAM FLD | 28 |
| WINTER WEATHER/MIX | 28 |
| TORNADOES, TSTM WIND, HAIL | 25 |
| WIND | 23 |
| DUST STORM | 22 |
| FLASH FLOODING | 19 |
| DENSE FOG | 18 |
| EXTREME WINDCHILL | 17 |
| FLOOD/FLASH FLOOD | 17 |
| RECORD/EXCESSIVE HEAT | 17 |
| HAIL | 15 |
| COLD AND SNOW | 14 |
| FLASH FLOOD/FLOOD | 14 |
| MARINE STRONG WIND | 14 |
| STORM SURGE | 13 |
| WILD/FOREST FIRE | 12 |
| STORM SURGE/TIDE | 11 |
| UNSEASONABLY WARM | 11 |
| MARINE THUNDERSTORM WIND | 10 |
| WINTER STORMS | 10 |
| MARINE TSTM WIND | 9 |
| ROUGH SEAS | 8 |
| TROPICAL STORM GORDON | 8 |
| FREEZING RAIN | 7 |
| GLAZE | 7 |
| HEAVY SURF | 7 |
| LOW TEMPERATURE | 7 |
| MARINE MISHAP | 7 |
| STRONG WINDS | 7 |
| FLOODING | 6 |
| HURRICANE ERIN | 6 |
| ICE | 6 |
| COLD WEATHER | 5 |
| FLASH FLOODING/FLOOD | 5 |
| HEAT WAVES | 5 |
| HIGH SEAS | 5 |
| ICY ROADS | 5 |
| RIP CURRENTS/HEAVY SURF | 5 |
| SNOW | 5 |
| TSTM WIND/HAIL | 5 |
| GUSTY WINDS | 4 |
| HEAT WAVE DROUGHT | 4 |
| HIGH WIND/SEAS | 4 |
| Hypothermia/Exposure | 4 |
| Mudslide | 4 |
| RAIN/SNOW | 4 |
| ROUGH SURF | 4 |
| SNOW AND ICE | 4 |
| COASTAL FLOOD | 3 |
| COASTAL STORM | 3 |
| Cold | 3 |
| COLD WAVE | 3 |
| DRY MICROBURST | 3 |
| HEAVY SEAS | 3 |
| Heavy surf and wind | 3 |
| High Surf | 3 |
| HIGH WATER | 3 |
| HIGH WIND AND SEAS | 3 |
| HIGH WINDS/SNOW | 3 |
| HYPOTHERMIA/EXPOSURE | 3 |
| WATERSPOUT | 3 |
| WATERSPOUT/TORNADO | 3 |
| WILD FIRES | 3 |
| Coastal Flooding | 2 |
| Cold Temperature | 2 |
| DROUGHT/EXCESSIVE HEAT | 2 |
| DUST DEVIL | 2 |
| EXCESSIVE RAINFALL | 2 |
| Extreme Cold | 2 |
| FLASH FLOODS | 2 |
| FREEZING DRIZZLE | 2 |
| HEAVY SNOW AND HIGH WINDS | 2 |
| HURRICANE OPAL/HIGH WINDS | 2 |
| MIXED PRECIP | 2 |
| RECORD HEAT | 2 |
| RIVER FLOOD | 2 |
| RIVER FLOODING | 2 |
| SLEET | 2 |
| SNOW SQUALL | 2 |
| UNSEASONABLY COLD | 2 |
| AVALANCE | 1 |
| BLACK ICE | 1 |
| blowing snow | 1 |
| BLOWING SNOW | 1 |
| COASTAL FLOODING | 1 |
| COASTALSTORM | 1 |
| COLD/WINDS | 1 |
| DROWNING | 1 |
| Extended Cold | 1 |
| FALLING SNOW/ICE | 1 |
| FLOOD & HEAVY RAIN | 1 |
| FLOOD/RIVER FLOOD | 1 |
| FOG AND COLD TEMPERATURES | 1 |
| FREEZE | 1 |
| FREEZING RAIN/SNOW | 1 |
| Freezing Spray | 1 |
| FROST | 1 |
| GUSTY WIND | 1 |
| Heavy Surf | 1 |
| HIGH SWELLS | 1 |
| HIGH WAVES | 1 |
| HURRICANE FELIX | 1 |
| HURRICANE OPAL | 1 |
| HYPERTHERMIA/EXPOSURE | 1 |
| HYPOTHERMIA | 1 |
| ICE ON ROAD | 1 |
| LANDSLIDES | 1 |
| LIGHTNING. | 1 |
| LIGHT SNOW | 1 |
| Marine Accident | 1 |
| MARINE HIGH WIND | 1 |
| MINOR FLOODING | 1 |
| Mudslides | 1 |
| RAIN/WIND | 1 |
| RAPIDLY RISING WATER | 1 |
| RECORD COLD | 1 |
| SNOW/ BITTER COLD | 1 |
| Snow Squalls | 1 |
| Strong Winds | 1 |
| THUNDERSNOW | 1 |
| THUNDERSTORM | 1 |
| THUNDERSTORM WIND (G40) | 1 |
| THUNDERSTORM WIND G52 | 1 |
| THUNDERTORM WINDS | 1 |
| TSTM WIND (G35) | 1 |
| URBAN AND SMALL STREAM FLOODIN | 1 |
| Whirlwind | 1 |
| WINDS | 1 |
| WIND STORM | 1 |
| WINTER STORM HIGH WINDS | 1 |
| WINTRY MIX | 1 |
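Some labels in the table differ only in letter case, for example “COLD” and “Cold”. One possible clean-up, which is not applied in this analysis, is to normalise the labels before aggregating. The sketch below uses a few rows copied from the table; `toy` and `merged` are illustrative names.

```r
# A few event labels and fatality counts copied from the table above.
toy <- data.frame(
  typeofstorm = c("COLD", "Cold", "BLOWING SNOW", "blowing snow",
                  "Heavy Surf", "HEAVY SURF"),
  fatalities  = c(35, 3, 1, 1, 1, 7)
)

# Upper-case the labels and strip surrounding whitespace.
toy$typeofstorm <- toupper(trimws(toy$typeofstorm))

# Aggregate again so that labels differing only in case are merged.
merged <- aggregate(fatalities ~ typeofstorm, data = toy, FUN = sum)
merged <- merged[order(-merged$fatalities), ]
```

This only merges labels that differ in case or surrounding whitespace. Pairs such as “ICE ON ROAD” and “ICY ROADS” would still need a manual mapping, and deciding whether they really describe the same event is a judgment call.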
We take exactly the same approach for the injuries.
Here too there are too many rows for a bar plot with all types of storms. So the 15 event types that cause the most injuries are shown, followed by a brief evaluation of what is not shown.
ggplot(data=head(storminjuries,15), aes(x=reorder(typeofstorm, -injuries),
y=injuries, fill=typeofstorm)) +
geom_bar(stat="identity", colour="darkblue") +
scale_fill_manual(values = mypal)+
scale_y_continuous(breaks=seq(0, 100000, 4000))+
xlab("Type of storm") +
ylab("Total Injuries" ) +
ggtitle("Injuries By Event Type") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Here too, as expected, tornadoes are the most harmful type of storm event. Only the first 15 have been considered. By doing this, the following number of injuries has been left out:
sum(storminjuries$injuries)-sum(head(storminjuries,15)$injuries)
## [1] 9315
Out of a total number of injuries of:
sum(storminjuries$injuries)
## [1] 140528
Corresponding to the following proportion of the total injuries:
(sum(storminjuries$injuries)-sum(head(storminjuries,15)$injuries))/
sum(storminjuries$injuries)
## [1] 0.06628572
The same reasoning is applied here to the material damages (property and crop damages combined).
ggplot(data=head(stormdamages,15), aes(x=reorder(typeofstorm, -damages),
y=damages, fill=typeofstorm)) +
geom_bar(stat="identity", colour="darkblue") +
scale_fill_manual(values = mypal)+
scale_y_continuous(breaks=seq(0, 400000000000, 20000000000))+
xlab("Type of storm") +
ylab("Total Damages in USD" ) +
ggtitle("Material damages by event type") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
By plotting only the first 15 we left out the following sum:
sum(stormdamages$damages)-sum(head(stormdamages,15)$damages)
## [1] 37554416962
Out of a total sum in USD of:
sum(stormdamages$damages)
## [1] 477329060938
Corresponding to the following proportion of the total damages:
(sum(stormdamages$damages)-sum(head(stormdamages,15)$damages))/
sum(stormdamages$damages)
## [1] 0.07867616
Floods clearly cause more material damage than any other type of storm.