Synopsis

Harmful events for population health across the United States

Knowing which events are most harmful to population health across the United States is a first step toward allocating resources well. This document analyzes the Storm Data database from the National Climatic Data Center (NCDC) to address this issue.

This analysis answers the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

This section contains the code chunks that download, read, and process the data.

#Set the working directory to your own and download the files to it
setwd("F:/R codes/Projects/datasciencecoursera/Reproducible Research")

# File download
ulrfile1 <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dirfile1 <- "F:/R codes/Projects/datasciencecoursera/Reproducible Research/repdata%2Fdata%2FStormData.csv.bz2"

download.file(ulrfile1, destfile =dirfile1)

#Assigning the dataset to a variable
storm.data <- read.csv(dirfile1)

#Note: BGN_DATE is read as character; it is not converted to Date class here because the analysis below does not use it

#To get an idea of the size of the data set
data.dim <- dim(storm.data)
data.dim
## [1] 902297     37

The data set contains 902297 observations of 37 variables.
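The download step above fetches the file every time the document is knitted. A minimal sketch of a guard that downloads only when the file is missing is shown here; `fetch_once` and the relative file name in the usage comment are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` does not exist yet.
# `downloader` defaults to base R's download.file and can be swapped for testing.
fetch_once <- function(url, dest, downloader = download.file) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed file intact on Windows
    downloader(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Usage (assumed relative file name; not run here to avoid a network call):
# fetch_once("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#            "StormData.csv.bz2")
# storm.data <- read.csv("StormData.csv.bz2")
```

With a relative destination, the script no longer depends on a machine-specific working directory.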

To find the most harmful events, only a subset of the 37 variables is needed. The variables that are useful for that purpose are the following:

1. EVTYPE (character): type of natural event

2. FATALITIES (numeric): number of deaths

3. INJURIES (numeric): number of injuries

4. PROPDMG (numeric): mantissa of the property damage value in USD

5. PROPDMGEXP (character): exponent code for the property damage value in USD

6. CROPDMG (numeric): mantissa of the crop damage value in USD

7. CROPDMGEXP (character): exponent code for the crop damage value in USD

Subsetting the data to the variables that are useful for this analysis:

#Loading libraries
library(dplyr)
library(ggplot2)
library(cowplot)

#Subsetting variables using the select function
storm.data2 <- storm.data %>% select(EVTYPE,FATALITIES,INJURIES,
                             PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
head(storm.data2)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0

PROPDMGEXP and CROPDMGEXP hold the exponents for PROPDMG and CROPDMG, respectively. This means the raw damage values are not on a common scale, so they must be rescaled before they can be used in calculations and plots.

The exponents are stored as abbreviations of the magnitude. The codes present in the data are the following:

#Subsetting the two columns that hold the exponent abbreviations
subexp <- storm.data2 %>% select(PROPDMGEXP,CROPDMGEXP)

#What are the exponents in the data?
unique(subexp$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(subexp$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

The letters need to be replaced with numeric powers of ten so that each damage value can be multiplied by 10 raised to its exponent, which puts all values on the same scale for comparison in charts.

#Replacing letters with the corresponding power of ten and missing or symbol codes with zero
subexp <- replace(subexp, subexp=="K" | subexp=="k", "3")
subexp <- replace(subexp, subexp=="M" | subexp=="m", "6")
subexp <- replace(subexp, subexp=="B", "9")
#Note: "H"/"h" is set to "" here, so the next line maps it to exponent 0 (not 2)
subexp <- replace(subexp, subexp=="H" | subexp=="h", "")
subexp <- replace(subexp, subexp==""|subexp=="+"|subexp=="?"|subexp=="-", "0")

#This is how the columns look after the transformation
unique(subexp$PROPDMGEXP)
##  [1] "3" "6" "0" "9" "5" "4" "2" "7" "1" "8"
unique(subexp$CROPDMGEXP)
## [1] "0" "6" "3" "9" "2"
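The recoding above can also be sketched with a named lookup vector, which keeps the whole mapping in one place. The helper name `exp_to_power` is hypothetical. Note one deliberate difference, which is an assumption rather than the method used above: this sketch reads "H"/"h" as hundreds (exponent 2), whereas the code above maps it to 0.

```r
# Hypothetical helper: convert exponent codes to numeric powers of ten.
exp_to_power <- function(code) {
  lookup <- c(H = 2, K = 3, M = 6, B = 9)   # assumption: H means hundreds
  code <- toupper(as.character(code))
  power <- unname(lookup[code])             # NA for anything that is not a letter code
  is_digit <- grepl("^[0-9]$", code)
  power[is_digit] <- as.numeric(code[is_digit])  # digits are used as-is
  power[is.na(power)] <- 0                  # "", "+", "-", "?" become 0
  power
}

powers <- exp_to_power(c("K", "m", "B", "", "?", "5", "h"))
powers
```

Because the function returns numbers directly, the separate `as.integer` conversion would not be needed with this approach.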

For plotting purposes, the following code creates a new column called PROPCROPDM, the sum of property and crop damage in USD.

#Converting the exponent columns to integer and copying them back into storm.data2
storm.data2$PROPDMGEXP <- as.integer(subexp$PROPDMGEXP)
storm.data2$CROPDMGEXP <- as.integer(subexp$CROPDMGEXP)

#Multiplying each mantissa by 10 raised to its exponent so that all values are on the same scale
storm.data2$PROPDMG <- storm.data2$PROPDMG*10^storm.data2$PROPDMGEXP
storm.data2$CROPDMG <- storm.data2$CROPDMG*10^storm.data2$CROPDMGEXP

#Subsetting the data again, dropping the exponent columns
storm.data3 <- storm.data2 %>% select(EVTYPE,FATALITIES,INJURIES,
                             PROPDMG,CROPDMG)

#Creating a new column with the total damage in dollars (property plus crop damage)
storm.data3$PROPCROPDM <- storm.data3$PROPDMG+storm.data3$CROPDMG


head(storm.data3)
##    EVTYPE FATALITIES INJURIES PROPDMG CROPDMG PROPCROPDM
## 1 TORNADO          0       15   25000       0      25000
## 2 TORNADO          0        0    2500       0       2500
## 3 TORNADO          0        2   25000       0      25000
## 4 TORNADO          0        2    2500       0       2500
## 5 TORNADO          0        2    2500       0       2500
## 6 TORNADO          0        6    2500       0       2500
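As a quick check of the scaling step, the first two rows of the table can be reproduced by hand: a value of 25.0 with exponent code K (3) gives 25.0 x 10^3 = 25000. A minimal sketch, with illustrative variable names that are not part of the analysis:

```r
# Mantissas and exponents of the first two rows (PROPDMG 25.0 and 2.5, both "K")
mantissa <- c(25.0, 2.5)
exponent <- c(3L, 3L)

# Same arithmetic as the processing code: mantissa * 10^exponent
scaled <- mantissa * 10^exponent
scaled
```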

Results

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

To answer this question, only the events with the highest fatality and injury totals are needed, so the analysis keeps the top 10 events, ranked by fatalities and then by injuries.

#Grouping the data by EVTYPE and summing FATALITIES, INJURIES and PROPCROPDM for each event type
storm.data4 <- storm.data3 %>%
                group_by(EVTYPE) %>% 
                summarise_at(vars(FATALITIES,INJURIES,PROPCROPDM), sum)

#Ordering the data and keeping the top 10 rows
storm.data4 <- arrange(storm.data4, desc(FATALITIES), desc(INJURIES))
storm.data4 <- storm.data4[1:10,]
print(storm.data4)
## # A tibble: 10 x 4
##    EVTYPE         FATALITIES INJURIES    PROPCROPDM
##    <chr>               <dbl>    <dbl>         <dbl>
##  1 TORNADO              5633    91346  57362333946.
##  2 EXCESSIVE HEAT       1903     6525    500155700 
##  3 FLASH FLOOD           978     1777  18243991078.
##  4 HEAT                  937     2100    403258500 
##  5 LIGHTNING             816     5230    942471520.
##  6 TSTM WIND             504     6957   5038935845 
##  7 FLOOD                 470     6789 150319678257 
##  8 RIP CURRENT           368      232         1000 
##  9 HIGH WIND             248     1137   5908617595 
## 10 AVALANCHE             224      170      3721800
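The group, sum, sort and truncate steps above can be illustrated on a small example. This is a base R sketch of the same idea, not the dplyr code used in the analysis, and the numbers in `toy` are made up for illustration only.

```r
# Made-up data: five events of three types
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 5, 2, 1, 1),
  INJURIES   = c(10, 0, 4, 2, 1)
)

# Sum both outcome columns within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort by fatalities, then injuries (both descending), and keep the top 2
totals <- totals[order(-totals$FATALITIES, -totals$INJURIES), ]
top <- head(totals, 2)
top
```

Here HEAT (6 fatalities) ranks ahead of TORNADO (5 fatalities) even though TORNADO has more injuries, which shows that the ranking is driven by fatalities first.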

The next plot shows the total fatalities and injuries for these ten events.

#Using ggplot2: one horizontal bar chart per health outcome

plot1 <- ggplot(storm.data4, aes(x=FATALITIES, y=EVTYPE)) +
  geom_bar(stat="identity", width=0.9, fill="skyblue") +
  labs(x="Quantity", y="Events", title="Most harmful events for population health ~ DEATHS") +
  theme(plot.title = element_text(hjust = 0.5))

plot2 <- ggplot(storm.data4, aes(x=INJURIES, y=EVTYPE)) +
  geom_bar(stat="identity", width=0.9, fill="skyblue") +
  labs(x="Quantity", y="Events", title="Most harmful events for population health ~ INJURIES") +
  theme(plot.title = element_text(hjust = 0.5))

#Combining both plots into a single image, one above the other
plot_grid(plot1, plot2, align = "v", nrow = 2, rel_heights = c(1/2, 1/2))
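In the plots above, the event types on the y axis are not sorted by the plotted value. One possible refinement, offered as a sketch and not as part of the original analysis, is to turn EVTYPE into a factor ordered by the plotted value with base R's `reorder`. The three rows below are taken from the table above; the plotting part runs only if ggplot2 is installed.

```r
# Top three events by fatalities, copied from the table above
top3 <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD"),
  FATALITIES = c(5633, 1903, 978)
)

# reorder() sorts the factor levels by FATALITIES, smallest first;
# ggplot2 draws the first level at the bottom, so the largest bar ends up on top
top3$EVTYPE <- reorder(top3$EVTYPE, top3$FATALITIES)
levels(top3$EVTYPE)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(top3, aes(x = FATALITIES, y = EVTYPE)) +
    geom_col(fill = "skyblue")
}
```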

According to the table and the plot, the three most harmful events, ranked by fatalities, are:

  1. TORNADO with 5633 fatalities and 91346 injuries.

  2. EXCESSIVE HEAT with 1903 fatalities and 6525 injuries.

  3. FLASH FLOOD with 978 fatalities and 1777 injuries.

2. Across the United States, which types of events have the greatest economic consequences?

#Using the new variable PROPCROPDM, which is the sum of property and crop damage
storm.data5 <- storm.data3 %>%
                group_by(EVTYPE) %>% 
                summarise_at(vars(PROPCROPDM), sum)

#Ordering the data and keeping the top 10 rows
storm.data5 <- arrange(storm.data5, desc(PROPCROPDM))
storm.data5 <- storm.data5[1:10,]
print(storm.data5)
## # A tibble: 10 x 2
##    EVTYPE               PROPCROPDM
##    <chr>                     <dbl>
##  1 FLOOD             150319678257 
##  2 HURRICANE/TYPHOON  71913712800 
##  3 TORNADO            57362333946.
##  4 STORM SURGE        43323541000 
##  5 HAIL               18761221491.
##  6 FLASH FLOOD        18243991078.
##  7 DROUGHT            15018672000 
##  8 HURRICANE          14610229010 
##  9 RIVER FLOOD        10148404500 
## 10 ICE STORM           8967041360
#Dividing by 1e9 so that the x axis is in billions of dollars
plot3 <- ggplot(storm.data5, aes(x=PROPCROPDM/1e9, y=EVTYPE)) +
  geom_bar(stat="identity", width=0.9, fill="skyblue") +
  labs(x="Cost in billions of dollars", y="Events", title="Events with the greatest economic consequences") +
  theme(plot.title = element_text(hjust = 0.5))

plot3
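A note on the units of the x axis: in R, `10e8` means 10 x 10^8, which is the same number as `1e9`, so dividing a dollar total by it expresses the total in billions, not millions. A short sketch using the FLOOD total from the table above (the variable names are illustrative):

```r
flood_cost <- 150319678257      # FLOOD total from the table above, in dollars

same_divisor <- 10e8 == 1e9     # both literals denote one billion
in_billions  <- round(flood_cost / 1e9, 1)
in_millions  <- round(flood_cost / 1e6)

same_divisor
in_billions
in_millions
```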

As the table and the plot show, the events with the greatest economic consequences are:

  1. FLOOD, with a total cost of about 1.503 x 10^11 dollars (150.3 billion).

  2. HURRICANE/TYPHOON, with a total cost of about 7.191 x 10^10 dollars (71.9 billion).

  3. TORNADO, with a total cost of about 5.736 x 10^10 dollars (57.4 billion).