Synopsis

The following exersice has been requested within the course “Reproducible Research” (Coursera). The aim of it is to answer two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

To do so, it has been analysed data from NOAA Storm Database. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data used in this exercise can be downloaded here.

Documentation about the data can be downloaded here.

Data Processing.

Data reading.

Once datafile is in the working directory the data can be readed:

data <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Data cleaning.

During the exercise it has been remarked that there are several “misstakes” over the variable EVTYPE. The following steps have the purpose of cleaning duplicates, remove double spaces, last spaces, etc. After each of the processes, a message with the number of different Event types will be shown in order to see how data cleaning is useful to lighten data and get precise results.

data0 <- data

Event types in raw data

message(length(unique(data$EVTYPE))," different event types")
## 985 different event types

All to uppercase

data0$EVTYPE <- toupper(data0$EVTYPE)
message(length(unique(data0$EVTYPE))," different event types")
## 898 different event types

Remove “/” to avoid missmatch

data0$EVTYPE <- gsub("/"," ", data0$EVTYPE)
message(length(unique(data0$EVTYPE))," different event types")
## 882 different event types

Remove “-” to avoid missmatch

data0$EVTYPE <- gsub("-"," ", data0$EVTYPE)
message(length(unique(data0$EVTYPE))," different event types")
## 876 different event types

Eliminate spaces at the end and at the begining

data0$EVTYPE <- trimws(data0$EVTYPE)
message(length(unique(data0$EVTYPE))," different event types")
## 865 different event types

Eliminate those starting with “SUMMARY*“.

grepl create a logical vector to select values starting by SUMM. ! inverts the logical vector.

data0 <- data0[!grepl("^SUMM", data0$EVTYPE),]
message(length(unique(data0$EVTYPE))," different event types")
## 799 different event types

Eliminate double spaces

library(qdapRegex)
## Warning: package 'qdapRegex' was built under R version 3.3.3
data0$EVTYPE <- rm_white(data0$EVTYPE)
message(length(unique(data0$EVTYPE))," different event types")
## 784 different event types

Commit change over bad words find over the document (erratas.R list file have been created)

library(stringr)
## Warning: package 'stringr' was built under R version 3.3.3
source("erratas.R")
for (i in 1:length(erratas)){
    data0$EVTYPE <- gsub(names(erratas[i]),erratas[1], data0$EVTYPE)
}
message(length(unique(data0$EVTYPE))," different event types")
## 768 different event types

Efects of storm events on human health.

Prepare libraries for the whole exercise.

library(reshape)
## Warning: package 'reshape' was built under R version 3.3.3
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.3.3
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:reshape':
## 
##     colsplit, melt, recast
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:reshape':
## 
##     rename
## The following objects are masked from 'package:qdapRegex':
## 
##     escape, explain
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:qdapRegex':
## 
##     %+%

Preparing the data (creation of new variable result of the sum of FAT + INJ, selection of collums of interest and summary by event type):

data1 <- data0
data1 <-
    data0 %>%
    mutate(TFI = FATALITIES+INJURIES) %>%
    select(EVTYPE, FATALITIES, INJURIES, TFI) %>%
    group_by(EVTYPE) %>%
    summarize(
        FATALITIES = sum(FATALITIES),
        INJURIES = sum(INJURIES),
        TFI = sum(TFI)
    )

Selection of the top 10 most harmful on human health (fatalities+injuries)

data1sel <- head(arrange(data1, desc(TFI)),10)
data1sel <- as.data.frame(data1sel)
data1sel
##               EVTYPE FATALITIES INJURIES   TFI
## 1            TORNADO       5645    91376 97021
## 2     EXCESSIVE HEAT       1903     6525  8428
## 3          TSTM WIND        504     6957  7461
## 4              FLOOD        470     6789  7259
## 5          LIGHTNING        816     5230  6046
## 6               HEAT        937     2100  3037
## 7        FLASH FLOOD        978     1777  2755
## 8          ICE STORM         89     1975  2064
## 9  THUNDERSTORM WIND        133     1488  1621
## 10      WINTER STORM        206     1321  1527

The function melt will allow us to create stacked barplots in the results.

data1melt <- melt(data1sel, id.vars = "EVTYPE")

Efects of storm events on economic resources.

To get the exact amount of money expended on each EVTYPE it is necesary to multiply the cost (PROPDMG,CROPDMG) by their respective exponent (PROPDMGEXP,CROPDMGEXP) defined as “k”, “m”, “b”. To do so, it has been created a function to facilitate the consequent mutation of data (THIS FUNCTION COMES FROM THE RPUB MADE BY Sergio L. B. Dalge (nice idea to do fast mutations).

exp_mutate <- function (x){
  ifelse(x == "K" | x == "k", 1000, 
         ifelse(x == "M" | x == "m", 1000000, 
                ifelse(x == "B" | x =="b",1000000000, 1)))
}

Using the function before we can create a chain of actions over the data, using dplyr to do so. The aim here is to mutate the values to get the “real” money expended, select just the data of interest (TOTAL PROPERTIES DAMAGE, TOTAL CROP DAMAGE and TOTAL DAMAGE) and finally sumarize the values by EVTYPE.

data2 <- data0
data2 <-
  data2 %>%
  mutate(PROPDMGEXP=exp_mutate(PROPDMGEXP)) %>%
  mutate(CROPDMGEXP=exp_mutate(CROPDMGEXP)) %>%
  mutate(TPROPDMG = PROPDMG * PROPDMGEXP) %>%
  mutate(TCROPDMG = CROPDMG * CROPDMGEXP) %>%
  mutate(TOTALDMG = TPROPDMG + TCROPDMG) %>%
  select(EVTYPE, TPROPDMG, TCROPDMG, TOTALDMG) %>%
  group_by(EVTYPE) %>% 
  summarize(TPROPDMG = sum(TPROPDMG), TCROPDMG = sum(TCROPDMG), TOTALDMG=sum(TOTALDMG))

Arrange and select the top 10 costly (TOTALDMG) events.

data2sel <- head(arrange(data2, desc(TOTALDMG)),10)
data2sel<-as.data.frame(data2sel)
data2sel
##               EVTYPE     TPROPDMG    TCROPDMG     TOTALDMG
## 1              FLOOD 144657709807  5661968450 150319678257
## 2  HURRICANE TYPHOON  69305840000  2607872800  71913712800
## 3            TORNADO  57162361285   452344770  57614706055
## 4        STORM SURGE  43323536000        5000  43323541000
## 5               HAIL  15732267048  3025954473  18758221521
## 6        FLASH FLOOD  16141362067  1421317100  17562679167
## 7            DROUGHT   1046106000 13972566000  15018672000
## 8          HURRICANE  11868319010  2741910000  14610229010
## 9        RIVER FLOOD   5118945500  5029459000  10148404500
## 10         ICE STORM   3944927860  5022113500   8967041360

The function melt will allow us to create stacked barplots in the results.

data2melt <- melt(data2sel, id.vars = "EVTYPE")

Results

Effects on human health by event type.

Next plot shows the top 10 most harmful events for human health sorted by Event Type:

ggplot(data = filter(data1melt, variable != "TFI"),
       aes(
           x = reorder(EVTYPE,-value),
           y = value,
           fill = variable
       )) +
    geom_bar(position = "stack", stat = "identity") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(x = "Event type (top 10 of total efect on human health)", y = "Human damage", title =
             "Human health damage by Event Type") +
    scale_fill_manual(
        values = c("#CC0000", "#99CCFF"),
        name = "Type of damage",
        breaks = c("FATALITIES", "INJURIES"),
        labels = c("FATALITIES", "INJURIES")
    ) +
    guides(fill = guide_legend(title = NULL))

Economic effects by event type.

Next plot shows the top 10 most harmful events sorted by TOTAL DAMAGE:

ggplot(data=filter(data2melt,variable!="TOTALDMG"), 
       aes(x=reorder(EVTYPE,-value), y=value, fill=variable)) + 
  geom_bar(position="stack",stat="identity")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs(x= "Event type (top 10 of total cost)", 
       y="Cost damage", 
       title="Economic effects on Properties and Crops by Event Type") +
  scale_fill_manual(values=c("#999999", "#E69F00"), 
                         name="Type of expenses",
                         breaks=c("TPROPDMG", "TCROPDMG"),
                         labels=c("Total properties dmg", "Total crops dmg"))+
  guides(fill=guide_legend(title=NULL))