Analysis of the impacts of the major storms and weather events to the public health and economics

Synopsis

Nowadays the temporal events and storms causing a lot of social problems, building destructions, lost of lives, environmental problems and economic problems. The study of temporal events and storms should be analysed to predict when other events will occur or to understand a cause of a temporal event or a storm.

For this study was used a dataset of the U.S National Oceanic and Atmospheric Administration’s (NOAA) with registers of storms that occurred between 1950 and 2011. This dataset contains information of when and where an event occurred, as well as estimates of any fatalities, injuries, and property damage.

In this analysis will be answered two questions:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data Processing

The dataset used for this analysis is available at bzip format and can be downloaded from this link, more information about this dataset can be found in National Weather Service and in National FAQ of Climatical Data Center Storm Events.

To download the dataset and load the data in the R environment, we use the following code:

# URL where dataset can be downloaded
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Download the dataset if necessary
if(!file.exists("data.csv.bz2")) {
    download.file(url = fileUrl, destfile = "data.csv.bz2")
}

# Read the dataset
data <- read.csv("data.csv.bz2")

This dataset contains 902297 registers and 37 variables, the following code demonstrate the variables of the dataset:

names(data)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

To figure out what events are the most harmful for the population, we selected only the EVTYPE, FATALITIES and INJURIES fields, where FATALITIES and INJURIES are summed to sort and select the top 10 events.

library(dplyr)
library(pander)

# Select EVTYPE, FATALITIES and INJURIES
popHealth <- data %>% 
    select(EVTYPE, FATALITIES, INJURIES) %>%
    filter(EVTYPE != "?")

# Sum FATALITIES and INJURIES
popHealth$TOTAL <- popHealth$FATALITIES + popHealth$INJURIES

# Order by most problematic for population health
popHealth <- group_by(popHealth, EVTYPE) %>% 
    summarise_each(funs(sum)) %>%
    as.data.frame() %>%
    arrange(-TOTAL)

# Generate a table of the top 10 results
popHealth <- popHealth[1:10,]

To figure out what events causes the most economical consequences we need to convert first the exponencial character used to describe the value a numerical variable that is codified, in this case the numerical values represent the following list:

(1, 2, 3, …, 9) = (10^1, 10^2, 10^3, …, 10^9)
H or h = 10^2
K or k = 10^3
M or m = 10^6
B or b = 10^9

The variable PROPDMG is multiplied by the base 10 with the expoent of PROPDMGEXP, the same for CROPDMG and CROPDMGEXP.

# Function to return the numerical equivalent to the exponencial character
parseExp <- function (ex) {
    cex <- toupper(as.character(ex))
    if(cex == "2" | cex == "H") return (10^2)
    if(cex == "3" | cex == "K") return (10^3)
    if(cex == "6" | cex == "M") return (10^6)
    if(cex == "9" | cex == "B") return (10^9)
    if(cex == "1") return (10^1)
    if(cex == "4") return (10^4)
    if(cex == "5") return (10^5)
    if(cex == "7") return (10^7)
    if(cex == "8") return (10^8)
    return (10^0)
}

# Select the fields: EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP
econConseq <- data %>% 
    select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# Convert the expoent characters for each column
econConseq$PROPDMGEXP <- sapply(econConseq$PROPDMGEXP, parseExp)
econConseq$CROPDMGEXP <- sapply(econConseq$CROPDMGEXP, parseExp)

# Multiply the amount of value by the expoent
econConseq$PROPDMG <- econConseq$PROPDMG * econConseq$PROPDMGEXP
econConseq$CROPDMG <- econConseq$CROPDMG * econConseq$CROPDMGEXP

# Remove the expoent column
econConseq <- select(econConseq, -c(PROPDMGEXP, CROPDMGEXP))
econConseq$TOTAL <- econConseq$PROPDMG + econConseq$CROPDMG

# Order by most expensive
econConseq <- group_by(econConseq, EVTYPE) %>% 
    summarise_each(funs(sum)) %>%
    as.data.frame() %>%
    arrange(-TOTAL)

# Generate a table of the top 10 results
econConseq <- econConseq[1:10,]

Results

We can see in the next table that the TORNADO is the most destructive event than the others, he causes more than eleven times more population harmful than the second that is EXCESSIVE HEAT.

# Create a table of the results
pandoc.table(popHealth)

EVTYPE	FATALITIES	INJURIES	TOTAL
TORNADO	5633	91346	96979
EXCESSIVE HEAT	1903	6525	8428
TSTM WIND	504	6957	7461
FLOOD	470	6789	7259
LIGHTNING	816	5230	6046
HEAT	937	2100	3037
FLASH FLOOD	978	1777	2755
ICE STORM	89	1975	2064
THUNDERSTORM WIND	133	1488	1621
WINTER STORM	206	1321	1527

In the next plot we can check the proportion that a TORNADO have in relationshion with others.

library(reshape2)
library(ggplot2)

# Verticalize the data frame
popHealth <- melt(popHealth, id=c("EVTYPE"))
# Rename the columns
names(popHealth) <- c("EVTYPE", "HEALTHTYPE", "VALUE")
# Convert categorical values to factors
popHealth$EVTYPE <- factor(popHealth$EVTYPE)
popHealth$HEALTHTYPE <- factor(popHealth$HEALTHTYPE)

# Create the plot instance
gg <- ggplot(popHealth, aes(x = reorder(EVTYPE, -VALUE), y = VALUE))
# Add facter grid
gg <- gg + facet_grid(HEALTHTYPE ~ .)
# Add bars to the plot
gg <- gg + geom_bar(stat = "identity")
# Add labels text
gg <- gg + xlab("Event type") + ylab("Quantity")
# Rotate labels 45 deggrees
gg <- gg + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Add legend values
gg <- gg + geom_text(aes(label = VALUE), vjust = -0.3)
# Show the plot
print(gg)

The next table show the most economical loss of properties and crops for weather events or storms, we can see that the flood is the most expensive disaster, but hurricane, tornado and storm has a big significantly participation in the values.

# Create a table to show the results, and format the number with comma as decimal separator
pandoc.table(data.frame(
    "Event" = econConseq$EVTYPE,
    "Property" = prettyNum(econConseq$PROPDMG, big.mark = ",", scientific=F),
    "Crops" = prettyNum(econConseq$CROPDMG, big.mark = ",", scientific=F),
    "Total" = prettyNum(econConseq$TOTAL, big.mark = ",", scientific=F)
))

Event	Property	Crops	Total
FLOOD	144,657,709,807	5,661,968,450	150,319,678,257
HURRICANE/TYPHOON	69,305,840,000	2,607,872,800	71,913,712,800
TORNADO	56,947,380,676	414,953,270	57,362,333,946
STORM SURGE	43,323,536,000	5,000	43,323,541,000
HAIL	15,735,267,513	3,025,954,473	18,761,221,986
FLASH FLOOD	16,822,673,978	1,421,317,100	18,243,991,078
DROUGHT	1,046,106,000	13,972,566,000	15,018,672,000
HURRICANE	11,868,319,010	2,741,910,000	14,610,229,010
RIVER FLOOD	5,118,945,500	5,029,459,000	10,148,404,500
ICE STORM	3,944,927,860	5,022,113,500	8,967,041,360

The next plot show this graphic visualization of the previous table.

library(reshape2)
library(ggplot2)

# Verticalize the variables
econConseqPlot <- melt(econConseq, id = c("EVTYPE"))

# Rename the columns
names(econConseqPlot) <- c("EVTYPE", "DMGTYPE", "VALUE")
# Convert the categorical values to factors
econConseqPlot$EVTYPE <- factor(econConseqPlot$EVTYPE)
econConseqPlot$DMGTYPE <- factor(econConseqPlot$DMGTYPE)

# Create the plot
gg <- ggplot(econConseqPlot, aes(x = reorder(EVTYPE, VALUE), y = VALUE))
# Add a facet configuration by damage type
gg <- gg + facet_grid(DMGTYPE ~ ., 
                 labeller = as_labeller(c(
                     "PROPDMG" = "Property",
                     "CROPDMG" = "Crop",
                     "TOTAL" = "Total")))
# Add the geometric bar model
gg <- gg + geom_bar(stat = "identity")
# Add texts for the bar with the values of the columns
gg <- gg + geom_text(aes(label = prettyNum(VALUE, big.mark = ",", scientific=F)), hjust = -0.1)
# Flip the bar to horizontal format and add X and Y labels
gg <- gg + coord_flip() + xlab("Event type") + ylab("Quantity")
# Increase the limit to show the geom_test
gg <- gg + ylim(0, 19*10^10)
# Print the plot
print(gg)

Conclusions

Based on the analised data we can answer the questions:

Across the United States, which types of events are most harmful with respect to population health?

The tornado is responsible for the mosts harmful weather events for the population
Across the United States, which types of events have the greatest economic consequences?

The flood is responsible for the greater costs with damage in weather events