The goal of this project is to explore data contained in the Storm Event Database from 1950 to 2011 to investigate which types of weather events are most harmful with respect to population health and which have the greatest economic consequences. The project is organised in the following sections:
The National Oceanic and Atmospheric Administration (NOAA) is an US scientific agency that focuses on “conditions of the oceans, major waterways, and atmosphere” which one of the main activity is “Monitoring and observing Earth systems with instruments and data collection networks”. The Storm Event Database contains data from 1950 to date that tracks the caracteristics of major weather events including an estimates of any fatalities, injuries and property damage.
The original raw data file is a zipped ~50Mb file containing an about 500Mb database and so the strategy for saving computer space (and computing time) chosen for analysing the data is to load only the variable that are necessary for addressing the proposed questions.
The following code extracts the first lines from the .csv file to give us information about the variable it contains
destfile <- "./Data/repdata_data_StormData.csv.bz2"
if(!file.exists(destfile))
{
dir.create("./Data")
destfile <- "./Data/repdata_data_StormData.csv.bz2"
urlziplocation <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(urlziplocation, destfile)
names <- read.csv(destfile, nrows = 1, header = TRUE)
}else{
names <- read.csv(destfile, nrows = 1, header = TRUE)
}
names(names)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The database contains 37 variables, in order to address the questions, only 9 variables (namely BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) are extracted from the .csv file.
df.rawdata <- read.csv("./Data/repdata_data_StormData.csv.bz2",
colClasses = c("NULL", "character", rep("NULL",4),
rep("character",2), rep("NULL",14),
rep("numeric",3), "character", "numeric",
"character", rep("NULL",9)))
After loading the raw data, the database is further reduced by considering raws where at least one fatality/injury/property or crop damage have been recorder.
df.wdata <- df.rawdata[!(df.rawdata$FATALITIES == 0 & df.rawdata$INJURIES == 0 &
df.rawdata$PROPDMG == 0 & df.rawdata$CROPDMG == 0),]
In this way the original database consisting of 902297 \(\times\) 37 values has been pruned into a more handable 254633 \(\times\) 9 values database.
A new YEAR variable is added to the database to evaluate in which year starting to group the data as not only certain events have been recorded from the beginning of the database creation.
df.wdata$YEAR <- year(as.Date(sub(" .*", "", df.wdata$BGN_DATE), format("%m/%d/%Y")))
The agriculturals and properties damage costs are stored in the CROPDMG and PROPDMG variables as numerical value while in PROPDMGEXP and CROPDMGEXP there is their exponent. The following function the damage value by multiplying the registred costs for the recorded exponent.
convert.exp <- function(x)
{
if(x == "B")
return(1000000000)
else if(x == "M" | x == "m")
return(1000000)
else if(x == "K" | x == "k")
return(1000)
else if(x == "H" | x == "h")
return(100)
return(1)
}
Since I could not find the documentation about some exponent factor, I decided to set the unknown characters to 1. The function can be easily changed if further information is added. The new variables PROPDMGCONV and CROPDMGCONV contains the converted cost values.
df.wdata$PROPDMGCONV <- sapply(df.wdata$PROPDMGEXP, convert.exp) * df.wdata$PROPDMG
df.wdata$CROPDMGCONV <- sapply(df.wdata$CROPDMGEXP, convert.exp) * df.wdata$CROPDMG
The National Weather Service Storm Data Documentation document lists \(48\) different event type (ref. section 2.1.1 for the Storm Data Event Table and Chapter 7 for event type description) while EVTYPE database variable contains 488 different values. The following code reduce the number of unique Event Types (the full code is in Appendix and future works.
df.wdata[grep("Marine hail",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Hail"
df.wdata[grep("Marine High Wind",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine High Wind"
df.wdata[grep("Marine Strong Wind",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Strong Wind"
df.wdata[grep("Marine Tstm|Marine Thunde",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Thunderstorm Wind"
df.wdata[grep("Lightn|lighting|LIGNTNING|Lightning",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Lightning"
df.wdata[grep("Thunderstorm winds|Tstm w|tstmw|Thunderstormwinds|Thunderstorms w|
Thunderstormw|Thunderstom winds|Tunderstom winds",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Thunderstorm Wind"
df.wdata[grep("THUND|THUNE",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Thunderstorm"
Since the documentation is not perfectly clear and the renaming process is quite tedious, the proposed solution is far to be optimal but gives an idea on how to takle this issue. In this way, the number of unique Event Type has been reduced to 102.
The Storm Event Database webpage states that all different events started to be registred from 1996 onwards. For this reason, in this project only events happened after 1995 are considered.
The following code filter the database by grouping the data by Event Type and then calculating the sum of fatalities and injuries
df.casualties <- df.wdata[df.wdata$YEAR > 1995,] %>%
group_by(EVTYPE) %>%
summarise(sum_fatalities = sum(FATALITIES),
sum_injuries = sum(INJURIES)) %>%
mutate(total_sum = sum_fatalities + sum_injuries)
From this database, the top 10 most harmful events, with respect to fatalities, injuries and their sum, are extracted.
df.top10sumfatalities <-
as.data.frame(df.casualties[order(df.casualties$sum_fatalities,
decreasing = TRUE)[1:10],])
df.top10suminjuries <-
as.data.frame(df.casualties[order(df.casualties$sum_injuries,
decreasing = TRUE)[1:10],])
df.top10total <-
as.data.frame(df.casualties[order(df.casualties$total_sum,
decreasing = TRUE)[1:10],])
A similar approach has been used for the economic consequences. First grouping by event type and calculating their sum.
df.cost <- df.wdata[df.wdata$YEAR > 1995,] %>%
group_by(EVTYPE) %>%
summarise(cropdmg = sum(CROPDMGCONV),
propdmg = sum(PROPDMGCONV)) %>%
mutate(total_sum = cropdmg + propdmg)
And then extracting the top 10 events that have a major impact on agriculture (crop) and properties.
df.top10PROPcost <- as.data.frame(df.cost[order(df.cost$propdmg,
decreasing = TRUE)[1:10], ])
df.top10CROPcost <- as.data.frame(df.cost[order(df.cost$cropdmg,
decreasing = TRUE)[1:10], ])
df.top10cost <- as.data.frame(df.cost[order(df.cost$total_sum,
decreasing = TRUE)[1:10], ])
As for the most harmful events, the following plot shows the top 10 Events that caused more fatalities, injuries and the sum of both of them.
g1 <- ggplot(df.top10sumfatalities, aes(reorder(EVTYPE, -sum_fatalities), sum_fatalities)) +
geom_bar(stat = "identity", position = "dodge", fill = "salmon") +
labs(title = "Sum Fatalities", x = "Event Type", y = "Number of Casualties") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2 <- ggplot(df.top10suminjuries,aes(reorder(EVTYPE, -sum_injuries), sum_injuries)) +
geom_bar(stat = "identity", position = "dodge", fill = "springgreen3") +
labs(title = "Sum Injuries", x = "Event Type", y = "Number of Casualties") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
df.top10totalplot <- melt(df.top10total[,c("EVTYPE", "sum_fatalities", "sum_injuries", "total_sum")], id.vars = 1)
g3 <- ggplot(df.top10totalplot, aes(reorder(EVTYPE, -value), value)) +
geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
labs(title = "Fatalities + Injuries", x = "Event Type", y = "Number of Casualties") +
scale_fill_discrete(labels = c("Fatalities", "Injuries", "Fatalities + Injuries")) +
theme(legend.title = element_blank(),
legend.background = element_rect(fill="transparent"),
legend.position=c(.75,.85),
legend.key.height = unit(0.2, "cm"),legend.key.width = unit(0.2, "cm"),
axis.text.x = element_text(angle = 90, hjust = 1))
gridExtra::grid.arrange(g1, g2, g3, nrow = 1,
bottom=textGrob("Figure 1: 10 ten most harmful events", gp=gpar(fontsize=11)))
The plot shows that, as expected, fatalities are much more less than injuries (a 10 factor difference as depicted in the Fatalitie + Injuries plot) and while Heat is the weather event causing more fatalities (followed by Tornado, Flood and Lightnign), Tornado are, by far, the major cause of injuries in US (with Flood, Heat and Lighting other major causes).
The following table report the different casualties for event type.
| Event Type | Fatalities | Event Type | Injuries | Event Type | Total |
|---|---|---|---|---|---|
| Heat | 2036 | Tornado | 20667 | Tornado | 22178 |
| Tornado | 1511 | Flood | 8520 | Flood | 9857 |
| Flood | 1337 | Heat | 7683 | Heat | 9719 |
| Lightning | 651 | Lightning | 4141 | Lightning | 4792 |
| Rip Current | 542 | Thunderstorm Wind | 3729 | Thunderstorm Wind | 3976 |
| Strong Wind | 383 | Strong Wind | 1507 | Strong Wind | 1890 |
| Avalanche | 266 | Wildfire | 1458 | Wildfire | 1545 |
| Thunderstorm Wind | 247 | Thunderstorm | 1401 | Thunderstorm | 1533 |
| Cold Wind Chill | 237 | Hurricane Typhoon | 1328 | Winterstorm | 1488 |
| Winterstorm | 195 | Winterstorm | 1293 | Hurricane Typhoon | 1453 |
In the case of most costly events, we can directly aggregate agricultural and property damages since they are now comparable in terms of costs.
g1 <- ggplot(df.top10CROPcost, aes(reorder(EVTYPE, -cropdmg), cropdmg)) +
geom_bar(stat = "identity", position = "dodge", fill = "salmon") +
labs(title = "Crop Damages",x = "Event Type",y = "Cost in US $") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2 <- ggplot(df.top10PROPcost, aes(reorder(EVTYPE, -propdmg), propdmg)) +
geom_bar(stat = "identity", position = "dodge", fill = "springgreen3") +
labs(title = "Property Damages",x = "Event Type",y = "Cost in US $") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
df.top10costplot <- melt(df.top10cost[,c("EVTYPE", "cropdmg", "propdmg", "total_sum")], id.vars = 1)
g3 <- ggplot(df.top10costplot, aes(reorder(EVTYPE, -value), value)) +
geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
labs(title = "Property + Crop damages", x = "Event Type", y = "Cost in US $") +
scale_fill_discrete(labels = c("Crop", "Property", "Sum")) +
theme(legend.background = element_rect(fill="transparent", ),
legend.position=c(.75,.85),
legend.key.height = unit(0.2, "cm"), legend.key.width = unit(0.2, "cm"),
axis.text.x = element_text(angle = 90, hjust = 1))
gridExtra::grid.arrange(g1, g2, g3, nrow = 1,
bottom=textGrob("Figure 2: 10 ten most economically costly events",gp=gpar(fontsize=11)))
As we can see from the plot, properties damages are much higer than crop ones (approximately the 90% of registred costs comes from property damages). And if Drought is the most economically effective cost for agriculture (followed by Flood and Hurricanes), Flood is the major cause for property damages (followed by Hurricanes and Storm Surge).
The following table summarise the total costs (in US $).
| Event Type | Crop damages | Event Type | Property damages | Event Type | Total Cost |
|---|---|---|---|---|---|
| Drought | 13367566000 | Flood | 159765842670 | Flood | 166113905870 |
| Flood | 6348063200 | Hurricane Typhoon | 81718889010 | Hurricane Typhoon | 87068996810 |
| Hurricane Typhoon | 5350107800 | Storm Surge | 47834724000 | Storm Surge | 47835579000 |
| Frost Freeze | 2682776500 | Tornado | 24616945710 | Tornado | 24900370720 |
| Funnel Cloud | 2496822450 | Funnel Cloud | 14595347520 | Funnel Cloud | 17092169970 |
| Heavy Rain | 729669800 | Wildfire | 7760449500 | Drought | 14413667000 |
| Strong Wind | 699024800 | Tropical Storm | 7642475550 | Tropical Storm | 8320186550 |
| Tropical Storm | 677711000 | Strong Wind | 5430792430 | Wildfire | 8162704630 |
| Thunderstorm Wind | 618611600 | Thunderstorm Wind | 4530861440 | Strong Wind | 6129817230 |
| Heat | 492578500 | Ice | 3642260810 | Thunderstorm Wind | 5149473040 |
This analysis is fully available on my git repository: https://github.com/pacoraggio/ReproducibleResearchPA2 (copy the link if Rpubs doesn’t allow you to open a new window) as a public project. After the course evaluation, I would like to perform time and geographical analyses on the data to see if local authorities have worked to prevent further damage. Please feel free to consult the repository and contact me if you have any suggestion (contact details on GitHub).
The following is the complete code for renaming and grouping the different event types. As part of the future works I will redefine the way to group the event types.
df.wdata[grep("Marine hail",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Hail"
df.wdata[grep("Marine High Wind",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine High Wind"
df.wdata[grep("Marine Strong Wind",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Strong Wind"
df.wdata[grep("Marine Tstm|Marine Thunde",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Marine Thunderstorm Wind"
df.wdata[grep("Lightn|lighting|LIGNTNING|Lightning",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Lightning"
df.wdata[grep("Thunderstorm winds|Tstm w|tstmw|Thunderstormwinds|Thunderstorms w|Thunderstormw|Thunderstom winds|Tunderstom winds",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Thunderstorm Wind"
df.wdata[grep("THUND|THUNE",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Thunderstorm"
df.wdata[grep("Astronomical L",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Astronomical Low Tide"
df.wdata[grep("Avalanc|Slid",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Avalanche"
df.wdata[grep("Bliz",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Blizzard"
df.wdata[grep("Flood|Fld",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Flood"
df.wdata[grep("Chill",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Cold Wind Chill"
df.wdata[grep("Freezing Fog",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Freezing Fog"
df.wdata[grep("FOG",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Dense Fog"
df.wdata[grep("Dense Smoke",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Dense Smoke"
df.wdata[grep("Droug",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Drought"
df.wdata[grep("Devil",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Dust Devil"
df.wdata[grep("DUST",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Dust Storm"
df.wdata[grep("Heat",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Heat"
df.wdata[grep("Frost",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Frost"
df.wdata[grep("Funnel",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Funnel Cloud"
df.wdata[grep("Heavy Rain|hvy rain|Rainfall|Rainstorm",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Heavy Rain"
df.wdata[grep("snow",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Snow"
df.wdata[grep("Surf",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Surf"
df.wdata[grep("Hurr|Typh",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Hurricane Typhoon"
df.wdata[grep("Sleet",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Sleet"
df.wdata[grep("Ice",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Ice"
df.wdata[grep("Curre",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Rip Current"
df.wdata[grep("Seiche",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Seiche"
df.wdata[grep("Storm Surge",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Storm Surge"
df.wdata[grep("WIND",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Strong Wind"
df.wdata[grep("Strong Wind|Strong Winds|Whirlwind|Gusty Winds|Gusty wind|Wind Damage|Gradient wind|gradient wind",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Strong Wind"
df.wdata[grep("Torn",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tornado"
df.wdata[grep("Tropical dep",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tropical Depression"
df.wdata[grep("Tropical sto",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tropical Storm"
df.wdata[grep("Tsu",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Tsunami"
df.wdata[grep("Volcanic",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Volcanic Ash"
df.wdata[grep("Waterspou",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Waterspout"
df.wdata[grep("fire",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Wildfire"
df.wdata[grep("STORM",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Winterstorm"
df.wdata[grep("WINTER",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Winter Weather"
df.wdata[grep("HAIL",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Funnel Cloud"
df.wdata[grep("COLD|HYPO",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Frost Freeze"
df.wdata[grep("Freeze|Frost",
df.wdata$EVTYPE, ignore.case = TRUE),]$EVTYPE <- "Frost Freeze"
df.wdata[grep("Frost Freeze|FREEZING RAIN|FREEZING DRIZZLE|Freezing Spray|Freezing Drizzle|Freezing Rain|Freezing drizzle|LIGHT FREEZING RAIN",
df.wdata$EVTYPE, ignore.case = FALSE),]$EVTYPE <- "Frost Freeze"