Severe weather events and storms cause many social problems: destruction of buildings, loss of life, environmental damage and economic losses. Studying these events helps to predict when similar events will occur and to understand their causes.
This study uses a dataset from the U.S. National Oceanic and Atmospheric Administration (NOAA) with records of storms that occurred between 1950 and 2011. The dataset contains information on when and where each event occurred, as well as estimates of any fatalities, injuries, and property damage.
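This analysis answers two questions: across the United States, which types of events are most harmful with respect to population health, and which types of events have the greatest economic consequences?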
The dataset used for this analysis is available in bzip2 format and can be downloaded from this link. More information about the dataset can be found at the National Weather Service and in the National Climatic Data Center Storm Events FAQ.
To download the dataset and load the data in the R environment, we use the following code:
# URL where dataset can be downloaded
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Download the dataset if necessary
if(!file.exists("data.csv.bz2")) {
download.file(url = fileUrl, destfile = "data.csv.bz2")
}
# Read the dataset
data <- read.csv("data.csv.bz2")
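The call above passes the compressed file straight to read.csv, which works because R opens the path through a file connection that detects the compression. The following is a minimal sketch of that behaviour on a temporary toy file; the names toy, tmp and toyBack and the values in the data frame are made up for illustration and are not part of the storm dataset.

```r
# Toy data standing in for the storm dataset (values are made up)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(2, 1))

# Write the toy data to a temporary bzip2-compressed CSV file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read the compressed file back without decompressing it first
toyBack <- read.csv(tmp)
print(dim(toyBack))

# Remove the temporary file
unlink(tmp)
```

Checking dim() right after loading is a cheap way to confirm that the file was read completely.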
The dataset contains 902297 records and 37 variables. The following code lists the variables of the dataset:
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
To find out which events are most harmful to the population, we select only the EVTYPE, FATALITIES and INJURIES fields. FATALITIES and INJURIES are added into a TOTAL, the values are summed per event type, and the top 10 events by TOTAL are kept.
library(dplyr)
library(pander)
# Select EVTYPE, FATALITIES and INJURIES
popHealth <- data %>%
select(EVTYPE, FATALITIES, INJURIES) %>%
filter(EVTYPE != "?")
# Sum FATALITIES and INJURIES
popHealth$TOTAL <- popHealth$FATALITIES + popHealth$INJURIES
# Order by most problematic for population health
popHealth <- group_by(popHealth, EVTYPE) %>%
summarise(across(everything(), sum)) %>%
as.data.frame() %>%
arrange(-TOTAL)
# Generate a table of the top 10 results
popHealth <- popHealth[1:10,]
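To make the group-and-sum step concrete, here is a minimal sketch on a toy data frame. It uses base R (aggregate and order) instead of the dplyr pipeline used in this report, and the names toy and toySum as well as all values are assumptions chosen for illustration.

```r
# Toy records: one row per event, values are made up
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(1, 0, 2, 3),
  INJURIES = c(10, 4, 5, 0)
)

# Per-record total of people harmed
toy$TOTAL <- toy$FATALITIES + toy$INJURIES

# Sum every measure per event type
toySum <- aggregate(cbind(FATALITIES, INJURIES, TOTAL) ~ EVTYPE,
                    data = toy, FUN = sum)

# Sort from most to least harmful
toySum <- toySum[order(-toySum$TOTAL), ]
print(toySum)
```

In this toy example the two TORNADO rows collapse into a single row with a TOTAL of 18, which then sorts to the top.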
To find out which events have the greatest economic consequences, we first need to convert the exponent character that encodes the magnitude of each damage value into a number. As implemented in parseExp, the codes map as follows: H or 2 means 10^2, K or 3 means 10^3, M or 6 means 10^6, B or 9 means 10^9, the digits 1, 4, 5, 7 and 8 mean 10 raised to that digit, and any other value means 10^0 = 1.
The variable PROPDMG is then multiplied by 10 raised to the exponent given by PROPDMGEXP, and likewise CROPDMG by CROPDMGEXP.
# Function to return the numeric multiplier for an exponent character
parseExp <- function (ex) {
cex <- toupper(as.character(ex))
if(cex == "2" | cex == "H") return (10^2)
if(cex == "3" | cex == "K") return (10^3)
if(cex == "6" | cex == "M") return (10^6)
if(cex == "9" | cex == "B") return (10^9)
if(cex == "1") return (10^1)
if(cex == "4") return (10^4)
if(cex == "5") return (10^5)
if(cex == "7") return (10^7)
if(cex == "8") return (10^8)
return (10^0)
}
# Select the fields: EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP
econConseq <- data %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
# Convert the exponent characters of each column to multipliers
econConseq$PROPDMGEXP <- sapply(econConseq$PROPDMGEXP, parseExp)
econConseq$CROPDMGEXP <- sapply(econConseq$CROPDMGEXP, parseExp)
# Multiply each damage value by its multiplier
econConseq$PROPDMG <- econConseq$PROPDMG * econConseq$PROPDMGEXP
econConseq$CROPDMG <- econConseq$CROPDMG * econConseq$CROPDMGEXP
# Remove the exponent columns
econConseq <- select(econConseq, -c(PROPDMGEXP, CROPDMGEXP))
# Sum property and crop damage
econConseq$TOTAL <- econConseq$PROPDMG + econConseq$CROPDMG
# Order by most expensive
econConseq <- group_by(econConseq, EVTYPE) %>%
summarise(across(everything(), sum)) %>%
as.data.frame() %>%
arrange(-TOTAL)
# Generate a table of the top 10 results
econConseq <- econConseq[1:10,]
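The exponent conversion can also be written as a vectorized lookup, which avoids calling a function once per row. The following is a minimal sketch and not the method used in this report: the names expMultiplier and parseExpVec and the toy damage values are assumptions made for illustration, and the mapping simply mirrors parseExp.

```r
# Named lookup vector: exponent code -> multiplier (mirrors parseExp)
expMultiplier <- c(
  H = 1e2, K = 1e3, M = 1e6, B = 1e9,
  setNames(10^(1:9), as.character(1:9))
)

# Vectorized conversion; unknown, empty or missing codes become 10^0 = 1
parseExpVec <- function(ex) {
  mult <- unname(expMultiplier[toupper(as.character(ex))])
  mult[is.na(mult)] <- 1
  mult
}

# Toy example: damage values with their exponent codes (values are made up)
dmg <- c(25, 1.5, 3, 2)
dmgExp <- c("K", "M", "", "h")
dmgValue <- dmg * parseExpVec(dmgExp)
print(dmgValue)
```

Here 25 with code K becomes 25,000, 1.5 with code M becomes 1,500,000, the empty code leaves 3 unchanged, and the lower-case h is treated like H.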
The next table shows that TORNADO is by far the most harmful event type: it caused more than eleven times as many fatalities and injuries combined as the second, EXCESSIVE HEAT.
# Create a table of the results
pandoc.table(popHealth)
| EVTYPE | FATALITIES | INJURIES | TOTAL |
|---|---|---|---|
| TORNADO | 5633 | 91346 | 96979 |
| EXCESSIVE HEAT | 1903 | 6525 | 8428 |
| TSTM WIND | 504 | 6957 | 7461 |
| FLOOD | 470 | 6789 | 7259 |
| LIGHTNING | 816 | 5230 | 6046 |
| HEAT | 937 | 2100 | 3037 |
| FLASH FLOOD | 978 | 1777 | 2755 |
| ICE STORM | 89 | 1975 | 2064 |
| THUNDERSTORM WIND | 133 | 1488 | 1621 |
| WINTER STORM | 206 | 1321 | 1527 |
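The "more than eleven times" comparison can be checked directly from the first two rows of the table. This is a small sketch of the arithmetic; the variable names tornadoTotal, heatTotal and ratio are chosen here for illustration.

```r
# Fatalities + injuries, taken from the table above
tornadoTotal <- 5633 + 91346   # TORNADO
heatTotal <- 1903 + 6525       # EXCESSIVE HEAT

# How many times more people were harmed by tornadoes
ratio <- tornadoTotal / heatTotal
print(round(ratio, 1))
```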
The next plot shows how TORNADO compares with the other event types.
library(reshape2)
library(ggplot2)
# Verticalize the data frame
popHealth <- melt(popHealth, id=c("EVTYPE"))
# Rename the columns
names(popHealth) <- c("EVTYPE", "HEALTHTYPE", "VALUE")
# Convert categorical values to factors
popHealth$EVTYPE <- factor(popHealth$EVTYPE)
popHealth$HEALTHTYPE <- factor(popHealth$HEALTHTYPE)
# Create the plot instance
gg <- ggplot(popHealth, aes(x = reorder(EVTYPE, -VALUE), y = VALUE))
# Add a facet grid by health type
gg <- gg + facet_grid(HEALTHTYPE ~ .)
# Add bars to the plot
gg <- gg + geom_bar(stat = "identity")
# Add labels text
gg <- gg + xlab("Event type") + ylab("Quantity")
# Rotate the x-axis labels 45 degrees
gg <- gg + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Add value labels above the bars
gg <- gg + geom_text(aes(label = VALUE), vjust = -0.3)
# Show the plot
print(gg)
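The "verticalize" step in the plotting code turns one column per measure into one row per event type and measure, which is the shape the facets need. The following is a minimal sketch of that reshaping in base R rather than with reshape2::melt; the names wide, measures and long and the values are assumptions for illustration.

```r
# Wide format: one column per measure (values are made up)
wide <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD"),
  FATALITIES = c(3, 0),
  INJURIES = c(15, 4)
)

# Long format: one row per (event type, measure) pair
measures <- c("FATALITIES", "INJURIES")
long <- do.call(rbind, lapply(measures, function(m) {
  data.frame(EVTYPE = wide$EVTYPE, HEALTHTYPE = m, VALUE = wide[[m]])
}))
print(long)
```

Two event types and two measures give four rows, each carrying a single VALUE that a bar can represent.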
The next table shows the largest economic losses in property and crops caused by weather events. FLOOD is the most expensive event type, but HURRICANE/TYPHOON, TORNADO and STORM SURGE also account for a significant share of the losses.
# Create a table to show the results, formatting the numbers with a comma as thousands separator
pandoc.table(data.frame(
"Event" = econConseq$EVTYPE,
"Property" = prettyNum(econConseq$PROPDMG, big.mark = ",", scientific = FALSE),
"Crops" = prettyNum(econConseq$CROPDMG, big.mark = ",", scientific = FALSE),
"Total" = prettyNum(econConseq$TOTAL, big.mark = ",", scientific = FALSE)
))
| Event | Property | Crops | Total |
|---|---|---|---|
| FLOOD | 144,657,709,807 | 5,661,968,450 | 150,319,678,257 |
| HURRICANE/TYPHOON | 69,305,840,000 | 2,607,872,800 | 71,913,712,800 |
| TORNADO | 56,947,380,676 | 414,953,270 | 57,362,333,946 |
| STORM SURGE | 43,323,536,000 | 5,000 | 43,323,541,000 |
| HAIL | 15,735,267,513 | 3,025,954,473 | 18,761,221,986 |
| FLASH FLOOD | 16,822,673,978 | 1,421,317,100 | 18,243,991,078 |
| DROUGHT | 1,046,106,000 | 13,972,566,000 | 15,018,672,000 |
| HURRICANE | 11,868,319,010 | 2,741,910,000 | 14,610,229,010 |
| RIVER FLOOD | 5,118,945,500 | 5,029,459,000 | 10,148,404,500 |
| ICE STORM | 3,944,927,860 | 5,022,113,500 | 8,967,041,360 |
The next plot visualizes the previous table.
library(reshape2)
library(ggplot2)
# Verticalize the variables
econConseqPlot <- melt(econConseq, id = c("EVTYPE"))
# Rename the columns
names(econConseqPlot) <- c("EVTYPE", "DMGTYPE", "VALUE")
# Convert the categorical values to factors
econConseqPlot$EVTYPE <- factor(econConseqPlot$EVTYPE)
econConseqPlot$DMGTYPE <- factor(econConseqPlot$DMGTYPE)
# Create the plot
gg <- ggplot(econConseqPlot, aes(x = reorder(EVTYPE, VALUE), y = VALUE))
# Add a facet configuration by damage type
gg <- gg + facet_grid(DMGTYPE ~ .,
labeller = as_labeller(c(
"PROPDMG" = "Property",
"CROPDMG" = "Crop",
"TOTAL" = "Total")))
# Add the geometric bar model
gg <- gg + geom_bar(stat = "identity")
# Add text labels to the bars with the formatted values
gg <- gg + geom_text(aes(label = prettyNum(VALUE, big.mark = ",", scientific = FALSE)), hjust = -0.1)
# Flip the bar to horizontal format and add X and Y labels
gg <- gg + coord_flip() + xlab("Event type") + ylab("Quantity")
# Increase the y limit so the geom_text labels fit
gg <- gg + ylim(0, 19*10^10)
# Print the plot
print(gg)
Based on the analysed data, we can answer the questions:
Across the United States, which types of events are most harmful with respect to population health?
Tornadoes are the most harmful weather events for population health.
Across the United States, which types of events have the greatest economic consequences?
Floods cause the greatest economic damage among weather events.