Hello! I hope you and your family are in good health during this pandemic.
In this study we analyze the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database.
The main objective is to determine which meteorological events are most harmful to population health and which cause the greatest economic damage.
The study concludes that the worst economic losses are caused by floods while the greatest health impacts are caused by tornadoes.
Let’s go!
Over the years and with the development of science, human beings have become more concerned with the study of natural phenomena. Some of these phenomena are more severe than others, which is why it is essential to be able to characterize them and use that information to take planned and informed action.
In this work, we study the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which contains important information about injuries, fatalities, property damage, and event dates.
The data for the analysis can be downloaded from the website:
Dataset: Storm Data [47Mb]
With this dataset we try to answer the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The dataset covers events from 1950 to 2011. It comprises 902,297 observations (rows) and 37 variables (columns). Of these, the main variables needed to evaluate the economic and health consequences of weather events are:
EVTYPE - a factor variable giving the event type (e.g. tornado, flood, etc.)
FATALITIES - a numerical variable of the number of fatalities
INJURIES - a numerical variable of the number of injuries.
PROPDMG - a numerical variable giving the mantissa for the value of property damage in USD.
PROPDMGEXP - a factor variable giving the exponent for the value of property damage in USD.
CROPDMG - a numerical variable giving the mantissa for the value of crop damage in USD.
CROPDMGEXP - a factor variable giving the exponent for the value of crop damage in USD.
To guarantee reproducibility, which is one of the objectives of the course, we wrote code that creates a data directory under the working directory and downloads the database file into it for later analysis.
# Create the data directory if it doesn't already exist
Dir <- "./workf"
if (!dir.exists(Dir)) {
  dir.create(Dir)
}
# Define the destination path for the data file
dest_data <- paste(Dir, "StormData.csv.bz2", sep = "/")
# Download the file only if it is not already present
if (!file.exists(dest_data)) {
  dataUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(dataUrl, destfile = dest_data)
}
With the data downloaded, it is first loaded into R and tidied. To simplify further processing, a new data frame is created containing just the seven relevant variables. The libraries intended for data manipulation and plotting are also loaded at this stage.
library(plyr)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.0.3
#library(ggplot2)
# read.csv() reads the bz2-compressed file directly
Readstorm <- read.csv(dest_data)
# Keep only the seven variables needed for the analysis
keep <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG",
          "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
stormframe <- Readstorm[, keep]
We’ll explore and tidy the property and crop exponent variables, PROPDMGEXP and CROPDMGEXP.
unique(stormframe$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(stormframe$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
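The exponents are a mix of letter designators and numerical powers. Lower-case letters are converted to upper case since they denote the same prefix (H = 100, K = 1,000, M = 1,000,000, B = 1,000,000,000). Each letter is then replaced by its power of ten, blank entries are treated as a power of 0, and the symbols +, - and ? are set to NA.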
stormframe$PROPDMGEXP <- toupper(as.character(stormframe$PROPDMGEXP))
stormframe$CROPDMGEXP <- toupper(as.character(stormframe$CROPDMGEXP))
stormframe$CROPDMG[(stormframe$CROPDMG == "")] <- 0
stormframe$PROPDMGEXP[(stormframe$PROPDMGEXP == "")] <- 0
stormframe$CROPDMGEXP[(stormframe$CROPDMGEXP == "")] <- 0
stormframe$FATALITIES[(stormframe$FATALITIES == "")] <- 0
stormframe$INJURIES[(stormframe$INJURIES == "")] <- 0
stormframe$CROPDMGEXP[(stormframe$CROPDMGEXP == "H")] <- 2
stormframe$CROPDMGEXP[(stormframe$CROPDMGEXP == "K")] <- 3
stormframe$CROPDMGEXP[(stormframe$CROPDMGEXP == "M")] <- 6
stormframe$CROPDMGEXP[(stormframe$CROPDMGEXP == "B")] <- 9
stormframe$PROPDMGEXP[(stormframe$PROPDMGEXP == "H")] <- 2
stormframe$PROPDMGEXP[(stormframe$PROPDMGEXP == "K")] <- 3
stormframe$PROPDMGEXP[(stormframe$PROPDMGEXP == "M")] <- 6
stormframe$PROPDMGEXP[(stormframe$PROPDMGEXP == "B")] <- 9
stormframe$CROPDMGEXP[(stormframe$CROPDMGEXP == "+")] <- "NA"
stormframe$CROPDMGEXP[(stormframe$CROPDMGEXP == "?")] <- "NA"
stormframe$PROPDMGEXP[(stormframe$PROPDMGEXP == "?")] <- "NA"
stormframe$CROPDMGEXP[(stormframe$CROPDMGEXP == "-")] <- "NA"
stormframe$PROPDMGEXP[(stormframe$PROPDMGEXP == "-")] <- "NA"
stormframe$PROPDMGEXP[(stormframe$PROPDMGEXP == "+")] <- "NA"
stormframe$PROPDMGEXP <- as.integer(stormframe$PROPDMGEXP)
## Warning: NAs introduced by coercion
stormframe$CROPDMGEXP <- as.integer(stormframe$CROPDMGEXP)
## Warning: NAs introduced by coercion
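The recoding above could also be written as a single lookup, which may be easier to audit. The following is only a sketch under that assumption: `exp_to_power` is a hypothetical helper that is not part of the original analysis, and it assigns a real `NA` rather than the string "NA", so no coercion warning is raised.

```r
# Hypothetical helper (not in the original analysis): map an exponent code
# to an integer power of ten, following the same rules as the code above.
exp_to_power <- function(x) {
  x <- toupper(as.character(x))
  x[x == ""] <- "0"                         # blank exponent: 10^0 = 1
  letters_map <- c(H = "2", K = "3", M = "6", B = "9")
  is_letter <- x %in% names(letters_map)
  x[is_letter] <- letters_map[x[is_letter]] # letter prefix -> power of ten
  x[x %in% c("+", "-", "?")] <- NA          # symbols carry no usable power
  as.integer(x)                             # digits such as "5" pass through
}

exp_to_power(c("K", "m", "", "B", "5", "?", "h"))
```

It would be applied as `stormframe$PROPDMGEXP <- exp_to_power(stormframe$PROPDMGEXP)`, and likewise for CROPDMGEXP.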
We will find the total cost of the damage. To do this, we will calculate the costs associated with property and crop damage from the exponent and mantissa values.
stormframe$PROPDMGTOTAL <- stormframe$PROPDMG * 10^stormframe$PROPDMGEXP
stormframe$CROPDMGTOTAL <- stormframe$CROPDMG * 10^stormframe$CROPDMGEXP
stormframe$TOTALDMG <- stormframe$PROPDMGTOTAL + stormframe$CROPDMGTOTAL
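As a quick check of the mantissa and exponent arithmetic, the sketch below applies the same formula to three invented records (the values are made up for illustration and are not taken from the dataset). It also shows that a record whose exponent is NA yields an NA cost, and that the NA carries over into the total.

```r
# Invented example records: mantissa and power of ten for property and crops
prop_mantissa <- c(25, 1.5, 10)
prop_power    <- c(3L, 6L, NA)     # K, M, and an unusable symbol such as "?"
crop_mantissa <- c(0, 2, 5)
crop_power    <- c(0L, 3L, 3L)

prop_total <- prop_mantissa * 10^prop_power   # 25 * 10^3, 1.5 * 10^6, NA
crop_total <- crop_mantissa * 10^crop_power   # 0, 2 * 10^3, 5 * 10^3
total      <- prop_total + crop_total         # NA wherever either part is NA

data.frame(prop_total, crop_total, total)
```

The third record has a valid crop cost, yet its total is NA because the property exponent is missing.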
We’ll aggregate all the data to find the totals as a function of EVTYPE for each of the summary variables Fatalities, Injuries, Property Damage, Crop Damage and Total Financial Damage.
fatalities_EVTYPE <- aggregate(FATALITIES ~ EVTYPE, data = stormframe, FUN=sum)
injuries_EVTYPE <- aggregate(INJURIES ~ EVTYPE, data = stormframe, FUN=sum)
propdamage_EVTYPE <- aggregate(PROPDMGTOTAL ~ EVTYPE, data = stormframe, FUN=sum)
cropdamage_EVTYPE <- aggregate(CROPDMGTOTAL ~ EVTYPE, data = stormframe, FUN=sum)
sumdamage_EVTYPE <- aggregate(TOTALDMG ~ EVTYPE, data = stormframe, FUN=sum)
s.sum <- merge(fatalities_EVTYPE, injuries_EVTYPE, by="EVTYPE", all=TRUE)
s.sum <- merge(s.sum, propdamage_EVTYPE, by="EVTYPE", all=TRUE)
s.sum <- merge(s.sum, cropdamage_EVTYPE, by="EVTYPE", all=TRUE)
s.sum <- merge(s.sum, sumdamage_EVTYPE, by="EVTYPE", all=TRUE)
fatalities_EVTYPE <- s.sum[order(s.sum$FATALITIES, decreasing=TRUE),][1:15,]
injuries_EVTYPE <- s.sum[order(s.sum$INJURIES, decreasing=TRUE),][1:15,]
propdamage_EVTYPE <- s.sum[order(s.sum$PROPDMGTOTAL, decreasing=TRUE),][1:15,]
cropdamage_EVTYPE <- s.sum[order(s.sum$CROPDMGTOTAL, decreasing=TRUE),][1:15,]
sumdamage_EVTYPE <- s.sum[order(s.sum$TOTALDMG, decreasing=TRUE),][1:15,]
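To make the aggregation and ranking steps concrete, here is a minimal sketch on an invented toy table (the event totals are made up). It also illustrates one behaviour worth keeping in mind: with the formula interface, `aggregate()` omits rows containing NA by default, so events whose exponent could not be decoded do not contribute to the sums.

```r
# Invented toy data standing in for stormframe
toy <- data.frame(
  EVTYPE   = c("FLOOD", "TORNADO", "FLOOD", "HAIL", "TORNADO"),
  TOTALDMG = c(100, 40, 50, NA, 10)
)

# Sum damage per event type; the HAIL row is dropped because TOTALDMG is NA
totals <- aggregate(TOTALDMG ~ EVTYPE, data = toy, FUN = sum)

# Rank from most to least costly; head() returns at most 15 rows and,
# unlike [1:15, ], does not pad with NA rows when fewer are available
top <- head(totals[order(totals$TOTALDMG, decreasing = TRUE), ], 15)
top
```

Here FLOOD ranks first with 150 and TORNADO second with 50, while HAIL does not appear at all.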
In this section we present the results of the previous analysis, with some exploratory plots and the conclusions drawn from them.
The plots show that tornadoes cause both the most fatalities and the most injuries.
par(mfrow=c(1,2), mar=c(8,4,3,2), oma=c(4,2,2,2), cex=0.8)
barplot(fatalities_EVTYPE$FATALITIES, names.arg=fatalities_EVTYPE$EVTYPE, las=3,
cex.names=0.6, xlab="", ylab="TOTAL NUMBER OF FATALITIES", col="magenta",
main="WEATHER EVENTS WITH HIGHEST INCIDENCE OF FATALITIES")
barplot(injuries_EVTYPE$INJURIES, names.arg=injuries_EVTYPE$EVTYPE, las=3, cex.names=0.6,
xlab="", ylab="TOTAL NUMBER OF INJURIES", col="Orange", main="WEATHER EVENTS WITH HIGHEST INCIDENCE OF INJURIES")
The data show that flooding causes the greatest property damage, while drought causes the greatest crop damage.
par(mfrow=c(1,2), mar=c(8,4,3,2), oma=c(4,2,2,2), cex=0.8)
barplot(propdamage_EVTYPE$PROPDMGTOTAL/10^6, names.arg=propdamage_EVTYPE$EVTYPE, las=3, cex.names=0.6, xlab="", ylab="PROPERTY DAMAGE IN USD (Millions)",
col="blue", main="WEATHER EVENTS WITH HIGHEST COST IN PROPERTY DAMAGE")
barplot(cropdamage_EVTYPE$CROPDMGTOTAL/10^6, names.arg=cropdamage_EVTYPE$EVTYPE, las=3,
cex.names=0.6, xlab="", ylab="CROP DAMAGE IN USD (Millions)",
col="yellow", main="WEATHER EVENTS WITH HIGHEST COST IN CROP DAMAGE")
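The combined totals in sumdamage_EVTYPE could be plotted in the same style. The sketch below only shows the shape of such a call: `toy_total` and its values are invented placeholders, not results from the dataset, and the colour and title are arbitrary choices.

```r
# Invented placeholder totals in USD, standing in for sumdamage_EVTYPE
toy_total <- data.frame(
  EVTYPE   = c("EVENT A", "EVENT B", "EVENT C"),
  TOTALDMG = c(3e9, 2e9, 1e9)
)

# barplot() draws the chart and returns the x positions of the bar midpoints
mids <- barplot(toy_total$TOTALDMG / 10^9, names.arg = toy_total$EVTYPE, las = 3,
                cex.names = 0.6, xlab = "", ylab = "TOTAL DAMAGE IN USD (Billions)",
                col = "gray", main = "TOTAL DAMAGE BY EVENT TYPE (TOY DATA)")
```

Replacing `toy_total` with `sumdamage_EVTYPE` would give the corresponding plot for the real data.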