In this report we analyze the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to identify the weather events that are most harmful across the United States, both to population health and to the economy. The analysis shows that tornado events have the greatest impact on population health, followed by storm events, while the worst economic consequences, measured as damage costs to crops and property, are caused by floods and tornados.
The data for this analysis come as a comma-separated-value (CSV) file compressed with the bzip2 algorithm to reduce its size. The file can be downloaded from the following website:
Storm Data (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2) [47 MB]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined:
National Weather Service Storm Data Documentation (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf).
National Climatic Data Center Storm Events FAQ (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf).
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
We use the following packages in this analysis:
library(dplyr)
library(ggplot2)
Because the CSV file is large, we save it as an RDS file to reduce the loading time on subsequent runs of the code:
# Load the cached RDS copy if it exists; otherwise read the CSV and cache it
if (file.exists("StormData.rds")) {
  noaa <- readRDS("StormData.rds")
} else {
  noaa <- read.csv("repdata-data-StormData.csv")
  saveRDS(noaa, "StormData.rds")
}
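The loading step above assumes the CSV has already been downloaded and decompressed by hand. As a minimal sketch, the helper below downloads the archive only if it is missing and reads it through a bzip2 connection, so no manual decompression is needed. The function name `load_storm_data`, the local file names, and the use of `stringsAsFactors = FALSE` are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: download (if needed), read, and cache the storm data.
# File names and stringsAsFactors = FALSE are assumptions, not the original setup.
load_storm_data <- function(url,
                            bz2_file = "StormData.csv.bz2",
                            rds_file = "StormData.rds") {
  if (file.exists(rds_file)) {
    return(readRDS(rds_file))  # fast path: cached copy
  }
  if (!file.exists(bz2_file)) {
    download.file(url, destfile = bz2_file, mode = "wb")
  }
  # bzfile() opens a bzip2 connection that read.csv() can read from directly
  data <- read.csv(bzfile(bz2_file), stringsAsFactors = FALSE)
  saveRDS(data, rds_file)
  data
}
```

It would be called with the Storm Data URL given above, for example `noaa <- load_storm_data(storm_url)`, where `storm_url` holds that address.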
Next, we convert the date columns to Date objects (the time of day is dropped):
noaa$BGN_DATE <- as.Date(as.character(noaa$BGN_DATE)
, format = "%m/%d/%Y %H:%M:%S")
noaa$END_DATE <- as.Date(as.character(noaa$END_DATE)
, format = "%m/%d/%Y %H:%M:%S")
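To make the conversion concrete, here is a small sketch on a single timestamp. The sample value is illustrative and only assumed to follow the same month/day/year layout as the format string used above.

```r
# Illustrative timestamp in the "%m/%d/%Y %H:%M:%S" layout
raw <- "4/18/1950 0:00:00"
parsed <- as.Date(raw, format = "%m/%d/%Y %H:%M:%S")
format(parsed, "%Y-%m-%d")

# A string that does not match the format yields NA rather than an error,
# so it is worth checking for NAs after the conversion
bad <- as.Date("1950-04-18", format = "%m/%d/%Y %H:%M:%S")
is.na(bad)
```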
The next step is to remove noisy rows. Here we drop rows whose damage exponent columns contain odd characters, as well as event types with no occurrences:
noaa <- filter(noaa, !grepl("\\-|\\?|\\+|3|4|6|7|8", noaa$PROPDMGEXP))
noaa <- filter(noaa, !grepl("\\-|\\?|\\+|3|2|m|4|6|7|8", noaa$CROPDMGEXP))
evtype_table <- as.data.frame(table(noaa$EVTYPE))
evtypes_to_remove <- filter(evtype_table, Freq == 0) %>% select(Var1)
noaa <- filter(noaa, !(EVTYPE %in% evtypes_to_remove$Var1))
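The zero-frequency step only has an effect when EVTYPE is stored as a factor, because only factors keep levels that no longer occur in the data. The following sketch, on a made-up factor, shows how such unused levels appear in a frequency table and how base R's `droplevels()` discards them. The toy values are assumptions for illustration.

```r
# Toy factor with one level ("?") that never occurs in the data
ev <- factor(c("HAIL", "HAIL", "FLOOD"), levels = c("FLOOD", "HAIL", "?"))

# Levels with zero occurrences show up in the frequency table
unused <- names(which(table(ev) == 0))
unused

# droplevels() removes the unused levels in one step
ev_clean <- droplevels(ev)
levels(ev_clean)
```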
We still have many redundant and noisy event types, so we group them into more general and readable categories. For each category we choose the most common keywords that appear in its event types:
WinterStrings <- "Icy|HYPOTHERMIA|BLIZZARD|FREEZE|AVALANCHE|ICE|SNOW|COLD|CHILL|WINDCHILL|FREEZING|FROST|GLAZE|HAIL|SLEET|WINTER|WINTRY|LAKE EFFECT|LAKE-EFFECT"
RainString <- "PRECIP|FALL|RAIN"
FloodString <- "FLOOD|FLASH|SURF|WATER|SEAS|LAKESHORE|TIDE|TSUNAMI"
TornadoString <- "TYPHOON|SPOUT|TORNADO|HURRICANE"
StromString <- "LIGHTNING|STORM|WIND|THUNDERSTORM|TROPICAL|TYPHOON|SPOUT|DUST|FUNNEL|GUST"
FireString <- "FIRE"
DroughtString <- "DRY|HOT|DROUGHT"
HeatString <- "WARM|WARMTH|HEAT"
FogString <- "FOG"
SlidesString <- "SLIDE"
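# Make sure EVTYPE is character (not factor) so the new labels can be assigned
noaa$EVTYPE <- as.character(noaa$EVTYPE)
noaa[grepl(WinterStrings,
noaa$EVTYPE,
ignore.case = TRUE), "EVTYPE"] <- "COLD/ICE/SNOW EVENT"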
noaa[grepl(RainString,
noaa$EVTYPE,
ignore.case = TRUE), "EVTYPE"] <- "RAIN EVENT"
noaa[grepl(FloodString,
noaa$EVTYPE,
ignore.case = TRUE), "EVTYPE"] <- "FLOOD EVENT"
noaa[grepl(TornadoString,
noaa$EVTYPE,
ignore.case = TRUE), "EVTYPE"] <- "TORNADO EVENT"
noaa[grepl(StromString,
noaa$EVTYPE,
ignore.case = TRUE), "EVTYPE"] <- "STORM EVENT"
noaa[grepl(FireString,
noaa$EVTYPE,
ignore.case = TRUE), "EVTYPE"] <- "FIRE EVENT"
noaa[grepl(DroughtString,
noaa$EVTYPE,
ignore.case = TRUE), "EVTYPE"] <- "DROUGHT EVENT"
noaa[grepl(HeatString,
noaa$EVTYPE,
ignore.case = TRUE), "EVTYPE"] <- "HEAT EVENT"
noaa[grepl(FogString,
noaa$EVTYPE,
ignore.case = TRUE), "EVTYPE"] <- "FOG EVENT"
noaa[grepl(SlidesString,
noaa$EVTYPE,
ignore.case = TRUE), "EVTYPE"] <- "LANDSLIDE EVENT"
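Because the replacements run one after another, the order of the keyword lists matters: once a row has been relabeled, it only changes again if the new label itself matches a later pattern. The sketch below illustrates this on a few made-up event names with shortened keyword lists. The helper `group_events` and the sample values are assumptions for illustration.

```r
# Shortened keyword lists, applied in the order given
patterns <- c(
  "COLD/ICE/SNOW EVENT" = "ICE|SNOW|HAIL",
  "FLOOD EVENT"         = "FLOOD|FLASH",
  "STORM EVENT"         = "STORM|WIND"
)

# Hypothetical helper: relabel event types by keyword, one pattern at a time
group_events <- function(evtype, patterns) {
  evtype <- as.character(evtype)
  for (label in names(patterns)) {
    evtype[grepl(patterns[[label]], evtype, ignore.case = TRUE)] <- label
  }
  evtype
}

sample_types <- c("Ice Storm", "FLASH FLOOD", "TSTM WIND", "VOLCANIC ASH")
grouped <- group_events(sample_types, patterns)
grouped
```

Here "Ice Storm" lands in the cold/ice/snow category even though it also contains "STORM", because the winter keywords are applied first. "VOLCANIC ASH" matches nothing and keeps its original name.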
Finally, we remove the rest of the rows that do not fall into the defined event categories, as they represent noise and errors:
NewEvents <- c("COLD/ICE/SNOW EVENT"
, "RAIN EVENT"
, "FLOOD EVENT"
, "TORNADO EVENT"
, "STORM EVENT"
, "FIRE EVENT"
, "DROUGHT EVENT"
, "HEAT EVENT"
, "FOG EVENT"
, "LANDSLIDE EVENT")
noaa <- filter(noaa, EVTYPE %in% NewEvents)
Once the data are cleaner, we analyze them to extract the information we need.
We take the health impact of the weather effects as the sum of injuries and deaths for each row. Then we calculate the sum for each event type defined previously:
noaa <- mutate(noaa, HealthImpact = INJURIES + FATALITIES)
SumPerEvent <- c()
for(i in 1:length(NewEvents))
{
temp <- filter(noaa, EVTYPE == NewEvents[i])
SumPerEvent <- c(SumPerEvent, sum(temp$HealthImpact))
}
sumE <- data.frame(SumPerEvent, NewEvents)
g <- ggplot(sumE, aes(x = NewEvents, y = SumPerEvent)) + geom_bar(stat = "identity") + coord_flip()
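# Axis labels follow the aesthetics: x holds the event type, y the health impact
g <- g + xlab("Weather Event") + ylab("Health Impact (injuries + fatalities)") + ggtitle("Health Impact per Weather Event Across U.S.")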
print(g)
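The per-event totals can also be computed without an explicit loop. As a minimal sketch, the example below uses dplyr's `group_by()` and `summarise()` on a small made-up data frame. The toy values are assumptions, and the column names mirror those used in the analysis.

```r
library(dplyr)

# Made-up rows with the same column names as the storm data
toy <- data.frame(
  EVTYPE     = c("TORNADO EVENT", "TORNADO EVENT", "HEAT EVENT"),
  INJURIES   = c(10, 5, 2),
  FATALITIES = c(1, 0, 3),
  stringsAsFactors = FALSE
)

# One row per event type, sorted from most to least harmful
health_by_event <- toy %>%
  group_by(EVTYPE) %>%
  summarise(HealthImpact = sum(INJURIES + FATALITIES)) %>%
  arrange(desc(HealthImpact))

health_by_event
```

The same pattern applies to the damage columns computed later, by summing a different column inside `summarise()`.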
As the graph shows, tornados have a very large impact on population health across the United States. Other important event types to consider are storm, heat, flood, and cold/ice related events.
We will now see how weather events affect crops and properties across the United States. For this we first translate the “crop damage exponent” (CROPDMGEXP) and “property damage exponent” (PROPDMGEXP) symbols into their corresponding multipliers. After that, we multiply each damage value by its multiplier and store the resulting cost in a new column:
noaa <- mutate(noaa, CROPmodifier = 1)
noaa <- mutate(noaa, PROPmodifier = 1)
noaa[grepl("^$",
noaa$CROPDMGEXP,
ignore.case = TRUE), "CROPmodifier"] <- 1
noaa[grepl("0",
noaa$CROPDMGEXP,
ignore.case = TRUE), "CROPmodifier"] <- 1
noaa[grepl("M",
noaa$CROPDMGEXP,
ignore.case = TRUE), "CROPmodifier"] <- 1000000
noaa[grepl("K",
noaa$CROPDMGEXP,
ignore.case = TRUE), "CROPmodifier"] <- 1000
noaa[grepl("B",
noaa$CROPDMGEXP,
ignore.case = TRUE), "CROPmodifier"] <- 1000000000
noaa[grepl("H",
noaa$CROPDMGEXP,
ignore.case = TRUE), "CROPmodifier"] <- 100
noaa[grepl("1",
noaa$CROPDMGEXP,
ignore.case = TRUE), "CROPmodifier"] <- 10
noaa[grepl("^$",
noaa$PROPDMGEXP,
ignore.case = TRUE), "PROPmodifier"] <- 1
noaa[grepl("0",
noaa$PROPDMGEXP,
ignore.case = TRUE), "PROPmodifier"] <- 1
noaa[grepl("M",
noaa$PROPDMGEXP,
ignore.case = TRUE), "PROPmodifier"] <- 1000000
noaa[grepl("K",
noaa$PROPDMGEXP,
ignore.case = TRUE), "PROPmodifier"] <- 1000
noaa[grepl("B",
noaa$PROPDMGEXP,
ignore.case = TRUE), "PROPmodifier"] <- 1000000000
noaa[grepl("H",
noaa$PROPDMGEXP,
ignore.case = TRUE), "PROPmodifier"] <- 100
noaa[grepl("1",
noaa$PROPDMGEXP,
ignore.case = TRUE), "PROPmodifier"] <- 10
noaa <- mutate(noaa, CROPDMGmod = CROPDMG*CROPmodifier)
noaa <- mutate(noaa, PROPDMGmod = PROPDMG*PROPmodifier)
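As a worked example of the conversion, the sketch below maps the same symbols to multipliers with a named lookup vector and applies them to a few made-up damage values. The helper `to_multiplier` and the toy rows are assumptions for illustration. Symbols that are not in the table default to a multiplier of 1, as in the code above.

```r
# Lookup table for the exponent symbols handled in this analysis
exp_multiplier <- c("0" = 1, "1" = 10, "H" = 100, "K" = 1e3, "M" = 1e6, "B" = 1e9)

# Hypothetical helper: map exponent symbols to multipliers (default 1)
to_multiplier <- function(exp_symbol) {
  # toupper() mirrors the ignore.case = TRUE matching used in the analysis
  m <- exp_multiplier[toupper(as.character(exp_symbol))]
  m[is.na(m)] <- 1  # empty or unknown symbols leave the value unchanged
  unname(m)
}

toy <- data.frame(CROPDMG    = c(25, 2.5, 3, 1),
                  CROPDMGEXP = c("K", "m", "", "B"),
                  stringsAsFactors = FALSE)
toy$CROPDMGmod <- toy$CROPDMG * to_multiplier(toy$CROPDMGEXP)
toy
```

For instance, a CROPDMG of 25 with exponent "K" becomes 25 * 1000 = 25000, while a value with an empty exponent is kept as is.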
With this done, we calculate the total damage cost on crops for each event type. The next graph shows the economic impact of each weather event on crops.
CropSum <- c()
for(i in 1:length(NewEvents))
{
temp <- filter(noaa, EVTYPE == NewEvents[i])
CropSum <- c(CropSum, sum(temp$CROPDMGmod))
}
CropSumDataframe <- data.frame(CropSum, NewEvents)
g <- ggplot(CropSumDataframe, aes(x = NewEvents, y = CropSum)) + geom_bar(stat = "identity") + coord_flip()
# Axis labels follow the aesthetics: x holds the event type, y the crop damage
g <- g + xlab("Weather Event") + ylab("Economic Impact on Crops (USD)") + ggtitle("Economic Impact on Crops per Weather Event Across U.S.")
print(g)
The most economically harmful weather events for crops are droughts, floods, and cold/ice related events. These are followed by tornados and storm related events.
Now we calculate the total property damage cost for each event. These totals are shown in the next graph:
PropSum <- c()
for(i in 1:length(NewEvents))
{
temp <- filter(noaa, EVTYPE == NewEvents[i])
PropSum <- c(PropSum, sum(temp$PROPDMGmod))
}
PropSumDataframe <- data.frame(PropSum, NewEvents)
g <- ggplot(PropSumDataframe, aes(x = NewEvents, y = PropSum)) + geom_bar(stat = "identity") + coord_flip()
# Axis labels follow the aesthetics: x holds the event type, y the property damage
g <- g + xlab("Weather Event") + ylab("Economic Impact on Properties (USD)") + ggtitle("Economic Impact on Properties per Weather Event Across U.S.")
print(g)
In this case, floods and tornados have the greatest economic impact, followed by other storm related events.
Finally, we calculate the total economic impact of each event:
noaa <- mutate(noaa, sumEconomic = CROPDMGmod + PROPDMGmod)
TotalSum <- c()
for(i in 1:length(NewEvents))
{
temp <- filter(noaa, EVTYPE == NewEvents[i])
TotalSum <- c(TotalSum, sum(temp$sumEconomic))
}
TotalSumDataframe <- data.frame(TotalSum, NewEvents)
g <- ggplot(TotalSumDataframe, aes(x = NewEvents, y = TotalSum)) + geom_bar(stat = "identity") + coord_flip()
# Axis labels follow the aesthetics: x holds the event type, y the total damage
g <- g + xlab("Weather Event") + ylab("Total Economic Impact (USD)") + ggtitle("Total Economic Impact per Weather Event Across U.S.")
print(g)
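Rankings like this are easier to read when the bars are sorted by value. As a minimal sketch, `reorder()` inside `aes()` orders the event types by their totals. The data frame and its numbers below are made up for illustration and are not results from the storm database.

```r
library(ggplot2)

# Made-up totals, only to illustrate the ordering
toy_totals <- data.frame(
  NewEvents = c("FLOOD EVENT", "FOG EVENT", "TORNADO EVENT"),
  TotalSum  = c(300, 5, 120),
  stringsAsFactors = FALSE
)

# reorder() sorts the event types by TotalSum, so the largest bar ends up on top
g_sorted <- ggplot(toy_totals, aes(x = reorder(NewEvents, TotalSum), y = TotalSum)) +
  geom_col() +
  coord_flip() +
  labs(x = "Weather Event", y = "Total Economic Impact")
print(g_sorted)
```

Here `geom_col()` is used as the equivalent of `geom_bar(stat = "identity")`.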
In conclusion, the results show that floods and tornados are the most harmful weather events with respect to economic loss, followed by storm and cold/ice related events.