Peer-graded Assignment: Peer-graded Assignment: Course Project 2

1. Description

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States. The events in the database start in the year 1950 and end in November 2011.

The data is available to download from the NOAA’s web site: https://www.ncdc.noaa.gov/cdo-web/datasets. The NOAA’s data becomes through the National Climatic Data Center (NCDC) that regularly receives Storm Data from the National Weather Service (NWS) approximately 60-90 days after the end of the data month.

Generally, the data comes from different sources such as county, state and federal emergency management officials, local law enforcement officials, sky-warn spotters, NWS damage surveys, newspapers services, insurance industry and general public. It is important stating that some information appearing in the Storm Data may be provided by or gathered from sources outside the National Weather Service (NWS).

An effort is made to use the best available information but because of time and resource constraints, information from these sources may be unverified by the NWS; therefore customers should be cautious as the NWS does not guarantee the accuracy or validity of the information.

2. Data Processing

Load packages

The next libraries should be used in order to run the R code effectively:

library(dplyr)       #Used for doing transformations of the data (in a tidy way)
library(tidyr)       #A package that is also used for doing transformations

## Warning: package 'tidyr' was built under R version 3.4.4

library(lubridate)   #Package needed to operate dates in R
library(ggplot2)     #For doing graphics
library(lattice)     #Base package for doing graphics
library(gridExtra)   #For joining plots

Loading the data

Before loading the data, a zipped file must be downloaded and extracted in the same working directory used in R. The Zip file could be download from: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. If everything went all right, there must be a new Excel file in CSV format called: repdata_data_StormData.csv.

We are assigning the results to a variable called “NOAA”:

NOAA <- read.csv("repdata_data_StormData.csv", header = TRUE)

str(NOAA)  #describing the variable formats

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

head(NOAA) #describing the fist records to the dataset

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Raw data is well constructed with rows and columns, so we then proceed in analyzing the research questions that have been asked.

3. Research questions

Research question No.1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

In order to find an answer to this research question, it is important to select only the variables of interest in our data set. Since we are interested in studying in counting the number of human harms we need to select the next variables: FATALITIES and INJURIES. The variable EVTYPE contains the type of event registered in the NOAA’s dataset.

Once we have selected the interested variables, the first step is checking if there are any NAs values in our dataset:

NAS <- NOAA %>% 
        summarise(event=mean(is.na(EVTYPE)), 
                  fatalities = mean(is.na(FATALITIES)), 
                  injuries=mean(is.na(INJURIES))) %>%
        select(event, fatalities, injuries)
print(NAS)

##   event fatalities injuries
## 1     0          0        0

No NAs appear to be present in our selected variables. Our data has records starting from the year 1950 and ending in the year 2011. In earlier years technology wasn’t improved as in recent years and so we may find lack of useful data in order to be compared to recent years. For that, we need to select a sample from our dataset in which we could find it considerable to be used in our research questions.

NOAA.harmful <- NOAA %>%
        select(BGN_DATE, EVTYPE, FATALITIES, INJURIES) %>% #selecting variables of interest
        mutate(years = year(      #mutating a new column which contains the 'year'
                as.Date(sapply(
                        strsplit( #using strsplit function to separate 
                                  #and extracting only the 'years'
                                as.character(NOAA$BGN_DATE) ,
                                " ",fixed = TRUE),
                        "[[",1), 
                        "%m/%d/%Y"))) 
head(NOAA.harmful, 3)

##            BGN_DATE  EVTYPE FATALITIES INJURIES years
## 1 4/18/1950 0:00:00 TORNADO          0       15  1950
## 2 4/18/1950 0:00:00 TORNADO          0        0  1950
## 3 2/20/1951 0:00:00 TORNADO          0        2  1951

We proceed to do a small transformation in our new variable of the dataset. It should be useful to converting the columns FATALITIES and INJURIES into rows, so that could be easily to plotting our results:

NOAA.harmful.tendency <- NOAA.harmful %>%
        select(years, FATALITIES, INJURIES) %>%
        gather("Type", "total",2:3 ) %>%   #function gather to create new columns and merging
        group_by(years, Type) %>%          #the values of FATALITIES and INJURIES into them. 
        summarise(totals = sum(total))     #Summarises total of counts.
head(NOAA.harmful.tendency,3)

## # A tibble: 3 x 3
## # Groups:   years [2]
##   years Type       totals
##   <dbl> <chr>       <dbl>
## 1  1950 FATALITIES     70
## 2  1950 INJURIES      659
## 3  1951 FATALITIES     34

#plotting our new dataset

qplot(years, totals, data = NOAA.harmful.tendency, 
      geom ="line", color = Type, 
      xlab = "Year", ylab = "Total Harms", 
      main = "NOAA's Registered Harmful Weather Events"
           ) + theme(plot.title = element_text(hjust=0.5))

From the 50’s to the earlier 90’s there appears to be slight changes in respect to the total of fatalities. Data seems starting growing nearly at the beginning of 90’s. An interesting observation is representative from that year compared to the total of injuries as well; there appears to be a slight relationship in the behavior of both line graphs. For that reason, in order to answer the research questions, we are going to filter out the original dataset starting from the year 1990.

NOAA.year.1990.1 <- NOAA.harmful %>%
        select(years, FATALITIES, INJURIES, EVTYPE) %>%
        filter(as.integer(years) >= 1990)

harmful.total <- NOAA.year.1990.1 %>%
        select(EVTYPE, FATALITIES, INJURIES) %>%
        group_by(EVTYPE) %>%
        summarise(fatalities =sum(FATALITIES), injuries = sum(INJURIES))
head(harmful.total)

## # A tibble: 6 x 3
##   EVTYPE                  fatalities injuries
##   <fct>                        <dbl>    <dbl>
## 1 "   HIGH SURF ADVISORY"          0        0
## 2 " COASTAL FLOOD"                 0        0
## 3 " FLASH FLOOD"                   0        0
## 4 " LIGHTNING"                     0        0
## 5 " TSTM WIND"                     0        0
## 6 " TSTM WIND (G45)"               0        0

total.fatalities <- harmful.total %>% 
        select(EVTYPE, fatalities) %>% 
        arrange(desc(fatalities)) %>%
        head(5)
        
total.injuries   <- harmful.total %>% 
        select(EVTYPE, injuries) %>% 
        arrange(desc(injuries)) %>%
        head(5)



fatalities.sum <- barchart(fatalities~EVTYPE, data = total.fatalities, 
         col=rgb(0.5,0.4,0.6,0.6),
         xlab = "Event Type", ylab="Total Fatalities", main = "NOAA's Harmful Events (1990 - 2011)")
injuries.sum <- barchart(injuries~EVTYPE, data = total.injuries, 
         col=rgb(0.1,0.4,0.2,0.6),
         xlab = "Event Type", ylab="Total Injuries", main = "NOAA's Harmful Events (1990 - 2011)")

grid.arrange(fatalities.sum, injuries.sum)

Results of Research Question No.1

The results are showing that the reported NOAA’s events from the year 1990 to 2011 involve EXCESSIVE HEAT and TORNADOS considering that they are the most critical in respect to the population health of the U.S., since they are represented as fatalities. The reported cases show that excessive heat is the most representative reaching out 1,903 cases and 1,752 as for TORNADO events.

As for injury damage on people’s health, the most representative one is the TORNADO event as well, reaching out 26,674 cases reported.

Research question No.2: Across the United States, which types of events have the greatest economic consequences?

Just as we did with the first research question, we are going to filter the data starting from the year 1990. In order to answer this question, we are going to select only the next variables from interest: EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.

EVTYPE = Contains information about the type of the registered event PROPDMG = Involves property damage (\() that has a representative impact in the economy CROPDMG = Involves crops damage (\)) that has a representative impact in the economy as well PROPDMGEXP/CROPDMGEXP = Prefixes used to denote the multiplier of the PROPDMG and CROPDMG respectively (i.e. kilo - 1,000, M = 1,000,000)

NOAA.economic <- NOAA %>%
        select(BGN_DATE,EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>% #variables
        mutate(years = year(      #mutating a new column which contains the 'year'
                as.Date(sapply(
                        strsplit( #using strsplit function to separate 
                                  #and extracting only the 'years'
                                as.character(NOAA$BGN_DATE) ,
                                " ",fixed = TRUE),
                        "[[",1), 
                        "%m/%d/%Y"))) %>%
        filter(as.integer(years) >= 1990) %>%
        select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
        
head(NOAA.economic, 3)

##      EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1      HAIL       0                  0           
## 2 TSTM WIND       0                  0           
## 3 TSTM WIND       0                  0

We now proceed to explore the variables and checking for any NAs:

str(NOAA.economic)

## 'data.frame':    751740 obs. of  5 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 244 856 856 856 834 856 856 856 834 834 ...
##  $ PROPDMG   : num  0 0 0 0 2.5 0 0 0 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 19 1 1 1 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

NAS.economic <- NOAA.economic %>% 
        summarise(propdmg=mean(is.na(PROPDMG)), 
                  cropdmg = mean(is.na(CROPDMG))) %>%
        select(propdmg, cropdmg)
print(NAS.economic)

##   propdmg cropdmg
## 1       0       0

tapply(NOAA.economic$PROPDMG, NOAA.economic$PROPDMGEXP, sum) # count of prefixes in property

##                     -          ?          +          0          1 
##     527.41      15.00       0.00     117.00    7108.30       0.00 
##          2          3          4          5          6          7 
##      12.00      20.00      14.50     210.50      65.00      82.00 
##          8          B          h          H          K          m 
##       0.00     275.85       2.00      25.00 9136646.93      38.90 
##          M 
##  115814.45

tapply(NOAA.economic$CROPDMG, NOAA.economic$CROPDMGEXP, sum) # count of prefixes in crops

##                     ?          0          2          B          k 
##      11.00       0.00     260.00       0.00      13.61     436.00 
##          K          m          M 
## 1342955.91      10.00   34140.80

Important assumptions regarding the prefixes:

We are not quite sure what the number multipliers indicate, but since they represent a small part of the whole dataset, we are going to leave them out.
We are assuming that lower cases and upper cases are the same.
Value of prefixes:

“empty” (no multiplier): Multiply by 1
“0” (no multiplier): Multiply by 1
Hundred(h, H): Multiply by 100
Thousand(k,K): Multiply by 1000
Million (m, M): Multiply by 1000000
Billion(B): Multiply by 1000000000

Calculating a new variable called damage.dollars (for both Property and Crops cases) that contains the multiplication of the prefixes by the PROPDMG and CROPDMG variables respectively.

# Property Damage

NOAA.economic.prop <- NOAA.economic %>% 
        select(EVTYPE, PROPDMG, PROPDMGEXP) %>%
        filter(PROPDMG!=0) %>%
        mutate(damage.dollars=PROPDMG*(ifelse(PROPDMGEXP=="", 1, 
                                 ifelse(PROPDMGEXP=="k" | PROPDMGEXP=="K", 1000,
                                        ifelse(PROPDMGEXP=="m" | PROPDMGEXP=="M", 1000000,
                                               ifelse(PROPDMGEXP=="h" | PROPDMGEXP=="H", 100,
                                                      ifelse(PROPDMGEXP=="B",  1000000000,
                                                             ifelse(PROPDMGEXP=="0",  1,0)
                                                             ))))))) %>%
        mutate(type = "PROPERTY DAMAGE") %>%
        select(EVTYPE,damage.dollars, type)

# Crops Damage

NOAA.economic.crops <- NOAA.economic %>% 
        select(EVTYPE, CROPDMG, CROPDMGEXP) %>%
        filter(CROPDMG!=0) %>%
        mutate(damage.dollars=CROPDMG*(ifelse(CROPDMGEXP=="", 1,
                                     ifelse(CROPDMGEXP=="k" | CROPDMGEXP=="K", 1000,
                                            ifelse(CROPDMGEXP=="m" | CROPDMGEXP=="M", 1000000,
                                                          ifelse(CROPDMGEXP=="B",  1000000000,
                                                                 ifelse(CROPDMGEXP=="0",  1,0)
                                                          )))))) %>%
        mutate(type = "CROPS DAMAGE") %>%
        select(EVTYPE, damage.dollars, type)

Mutating and summarizing the results into a new table called ‘NOAA.economic.total.damage’

NOAA.economic.total.damage <- rbind(NOAA.economic.prop, NOAA.economic.crops)
NOAA.economic.total.damage <- NOAA.economic.total.damage %>%
        group_by(EVTYPE, type) %>%
        summarise(total.damage.dollars = sum(damage.dollars)) %>%
        select(EVTYPE, type, total.damage.dollars) %>%
        arrange(desc(type),desc(total.damage.dollars,desc))

Selecting only the top 5 events:

NOAA.economic.total.damage <- rbind(NOAA.economic.prop, NOAA.economic.crops)
NOAA.economic.total.damage <- NOAA.economic.total.damage %>%
        group_by(EVTYPE, type) %>%
        summarise(total.damage.dollars = sum(damage.dollars)) %>%
        select(EVTYPE, type, total.damage.dollars) %>%
        arrange(desc(total.damage.dollars)) %>%
        group_by(type) %>%
        slice(1:5)

Plotting the final result:

ggplot(NOAA.economic.total.damage, aes(x = EVTYPE, y = total.damage.dollars/1000000)) +
        geom_bar(aes(fill = type), position = "dodge", stat = "identity") +  coord_flip() +
        xlab("Reported Event Type") + ylab("Damage Amount (In Million Dollars)") +  
        labs(title = "Economic Loss by Climate Events in the U.S.",
             subtitle = "Reported from year 1990 - 2011") + scale_fill_discrete(name = "Economic Variable") +
             theme(plot.title = element_text(hjust=0.5, size = 15, face = "italic"))  + theme(plot.subtitle =  element_text(hjust=0.5)) +
             theme(plot.subtitle = element_text(hjust=0.5, size = 10, face = "italic"))  + theme(plot.subtitle =  element_text(hjust=0.5))

Results of Research Question No.2

FLOOD property damage clearly represents the most outstanding economic loss in the country, reaching out 144.65 Billion of dollars in losses reported. The second representative event is due to HURRICANE/TYPHOON events reaching out 69.30 Billion in dollars in losses.

Interestingly enough is that CROPS DAMAGES play also some contribution to the total economic loss, being DROUGHT the most outstanding event with 13.97 Billion of dollars in economic losses.

Peer-graded Assignment: Peer-graded Assignment: Course Project 2 - Reproducible Research

Jaime Paz

April 8, 2018