1. Synopsis

This report describes the analysis of the impact of Weather Events on the population health and Economy. The data used in this report was obtained on the the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database contain the characteristics of weather events in the United States, from 1950 to November 2011, including the location and date that they occur and estimates of any fatalities, injuries, property damage and crop damage.

The analysis aimed to answer two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

The data was loaded, inspected, cleaned, processed and analysed, using dplyr, tidyr, mgsub and ggplot2 packages. After processing and analyzing the data it was found that the Tornado is the most harmful weather event to population health , resulting in higher injuries and fatalities, while Floods have greatest economic impact, resulting in higher damages to property and crops in the United States.

2. Data Processing

As mentioned before, the data was obtained from the [U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database] (https://www.weather.gov/). The events on the database are from the year 1950 to November 2011.

2.1. Downloading the data

The data was downloaded from the following link:

[Storm Data] (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2)

2.2. Reading the data

The data came in the form of a comma-separated-value file compressed via the bzip2 algorithm. To read this data, the read.csv function was used.

data <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE, sep = ",")

The code above loaded the data into R for the following processing and analysis.

2.3. Processing the data

After reading the data, it is necessary to get an overview of how the data looks like. For this, the dim, head and summary function were used.

dim(data)
## [1] 902297     37
head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6
summary (data)
##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI          END_DATE        
##  Min.   :   0.000   Length:902297      Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :   1.484                                                           
##  3rd Qu.:   1.000                                                           
##  Max.   :3749.000                                                           
##                                                                             
##    END_TIME           COUNTY_END COUNTYENDN       END_RANGE       
##  Length:902297      Min.   :0    Mode:logical   Min.   :  0.0000  
##  Class :character   1st Qu.:0    NA's:902297    1st Qu.:  0.0000  
##  Mode  :character   Median :0                   Median :  0.0000  
##                     Mean   :0                   Mean   :  0.9862  
##                     3rd Qu.:0                   3rd Qu.:  0.0000  
##                     Max.   :0                   Max.   :925.0000  
##                                                                   
##    END_AZI           END_LOCATI            LENGTH              WIDTH         
##  Length:902297      Length:902297      Min.   :   0.0000   Min.   :   0.000  
##  Class :character   Class :character   1st Qu.:   0.0000   1st Qu.:   0.000  
##  Mode  :character   Mode  :character   Median :   0.0000   Median :   0.000  
##                                        Mean   :   0.2301   Mean   :   7.503  
##                                        3rd Qu.:   0.0000   3rd Qu.:   0.000  
##                                        Max.   :2315.0000   Max.   :4400.000  
##                                                                              
##        F               MAG            FATALITIES          INJURIES        
##  Min.   :0.0      Min.   :    0.0   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:0.0      1st Qu.:    0.0   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :1.0      Median :   50.0   Median :  0.0000   Median :   0.0000  
##  Mean   :0.9      Mean   :   46.9   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:1.0      3rd Qu.:   75.0   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :5.0      Max.   :22000.0   Max.   :583.0000   Max.   :1700.0000  
##  NA's   :843563                                                           
##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:902297      Min.   :  0.000   Length:902297     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  12.06                      Mean   :  1.527                     
##  3rd Qu.:   0.50                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000                     
##                                                                           
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 

As observed the data has 902297 rows and 37 variables (columns). The variables “COUNTYENDN”, “F”, “LATITUDE” and “LATITUDE_E” present NA values, therefore, none of the variables needed for this analysis.

To answer the questions of which types of events are most harmful to the population health and which have the greatest economic consequences, the data needed to be subsetted.

2.3.1. Health data

For the first question, only the variables event type (EVTYPE), injuries and fatalities (INJURIES, FATALITIES) were necessary, Therefore, the cbind function was used, selecting only these columns and using ["..."] to maintain the column names from the original dataset. The result was stores in another object called hlt.

hlt <- cbind(data["EVTYPE"], data["INJURIES"], data["FATALITIES"])

After this, the data was transformed. The previous dataset (hlt) was grouped by the event type. Then, for both injuries and fatalities columns, the sum was calculated, by its groups (EVTYPE). After calculating the sum, the dataset was ordered by injuries and fatalities, in descending order, therefore, the first row had the highest possible number of injuries and fatalities together. Finally, the data was sliced to select only the 10 first events with more injuries and fatalities together and transformed, by gathering the damage type into one column with two levels (INJURIES and FATALITIES), using the gather function.

With the summary function in was possible to observe that this dataset presented no NA’s, therefore there was no need to further clean and process this data.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
hltplot <- hlt %>% 
    # Grouping by event
    group_by(EVTYPE) %>%                            
    # Calculating total fatalities and injuries
    summarise(INJURIES = sum(INJURIES), 
              FATALITIES = sum(FATALITIES)) %>%
    # Sorting fatalities and injuries in descending order
    arrange(desc(INJURIES + FATALITIES)) %>%
    #selecting only 10 first records of injuries and fatalities
        slice(1:10) %>%
    # Melting injuries and fatalities 
    gather(TYPE, Value, c(INJURIES, FATALITIES))


summary(hltplot)
##     EVTYPE              TYPE               Value      
##  Length:20          Length:20          Min.   :   89  
##  Class :character   Class :character   1st Qu.:  738  
##  Mode  :character   Mode  :character   Median : 1632  
##                                        Mean   : 6859  
##                                        3rd Qu.: 5331  
##                                        Max.   :91346

2.3.1. Economic data

For the second question, only the variables of event type (EVYTYPE), property damage and crop damage (PROPDMG, CROPDMG) were necessary and the cbind function was also used in the same way as previously.

#subset
ecn <- cbind(data["EVTYPE"], 
             data["PROPDMG"], data["PROPDMGEXP"], 
             data["CROPDMG"], data["CROPDMGEXP"])

However, the variables PROPDMG and CROPDMG were in different dimensions. In some cases the property and crop damages were in the order of millions or thousands etc. The following columns “PRPDMGEXP” and ‘CROPDMGEXP’ disclaim the exponent of these columns.

By creating a vector with these columns as factors and using the function levels() it was possible to detect the levels representing an exponen, on each column. In both cases there were exponents like B (billions), M (millions) and K(thousands) and there was also levels like H (hundreds) or levels like “?”, “+”, etc, that did not represented any exponent.

lvlPROPDMGEXP <- as.factor(ecn$PROPDMGEXP)
levels(lvlPROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
lvlCROPDMGEXP <- as.factor(ecn$CROPDMGEXP)
levels(lvlCROPDMGEXP)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

Therefore, to properly subset the columns of interest for the following analysis, these exponents were substituted accordingly, using the mgsub functions, that allowed substitution of multiple patterns at one single line of code. The columns were transformed into numbers and the NA’s generated were removed. Subsequently, the columns “PROPDMG” and “CROPDMG” were multiplied, by row, using the arithmetic symbol (*) to its following exponent column, in order to obtain the actual cost of the damage, creating two new vectors. Then, these vectors along with the first column on event type (EVTYPE) were binded using cbind, with the results being stored in a new dataset.

# substitution K and M 
library(mgsub)

ecn$PROPDMGEXP <- mgsub(ecn$PROPDMGEXP, 
                        c("",  "-" ,"?", "+", "0", "1" ,"2" ,"3", "4" ,"5", "6", "7" ,"8" ,"B","h" ,"H","K", "m", "M"), 
                        c("NA", "NA", "NA", "NA", "NA", 1, 1, 1, 1, 1 , 1, 1, 1, 1000000000, 100, 100, 1000, 1000000, 1000000))
                        
 
ecn$CROPDMGEXP <- mgsub(ecn$CROPDMGEXP, 
                        c("", "?", "0", "2", "B","k","K", "m", "M"), 
                        c("NA", "NA", "NA", 1, 1000000000, 1000, 1000, 1000000, 1000000))
          
              
ecn$PROPDMGEXP <- as.numeric(ecn$PROPDMGEXP) 
ecn$CROPDMGEXP <- as.numeric(ecn$CROPDMGEXP)         

#remove NAs
ecnnona <- ecn[complete.cases(ecn), ]
summary(ecnnona)
##     EVTYPE             PROPDMG          PROPDMGEXP           CROPDMG       
##  Length:279540      Min.   :   0.00   Min.   :1.000e+00   Min.   :  0.000  
##  Class :character   1st Qu.:   0.00   1st Qu.:1.000e+03   1st Qu.:  0.000  
##  Mode  :character   Median :   0.00   Median :1.000e+03   Median :  0.000  
##                     Mean   :  14.44   Mean   :1.009e+05   Mean   :  4.074  
##                     3rd Qu.:   3.00   3rd Qu.:1.000e+03   3rd Qu.:  0.000  
##                     Max.   :5000.00   Max.   :1.000e+09   Max.   :978.000  
##    CROPDMGEXP       
##  Min.   :1.000e+03  
##  1st Qu.:1.000e+03  
##  Median :1.000e+03  
##  Mean   :2.443e+04  
##  3rd Qu.:1.000e+03  
##  Max.   :1.000e+09
#multiplication
PROPDAMAGE <- ecnnona$PROPDMG * ecnnona$PROPDMGEXP                         
CROPDAMAGE <- ecnnona$CROPDMG * ecnnona$CROPDMGEXP

#final dataset
dataecn <- cbind(ecnnona["EVTYPE"], PROPDAMAGE, CROPDAMAGE )

After cleaning the dataset, further processing was performed using the tidyr package and the dplyr package, similarly to what was performed with the previous dataset with injuries and fatalities (hltplot).

First, the previous dataset (dataecn) was grouped by the event type. Then, for both property and crop damage columns, the sum was calculated, by group. After calculating the sum, the dataset was ordered by property and crop damage, in descending order. Finally, the data was sliced and transformed, by selecting only the first 10 events with a higher impact and gathering the damage type into one column with two levels (PROPDAMAGE and CROPDAMAGE). The result was stores in a new dataset.

library(tidyr)
library(dplyr)


ecnplot <- dataecn %>% 
    # Grouping by event
    group_by(EVTYPE) %>%                            
    # Calculating total fatalities and injuries
    summarise(PROPDAMAGE = sum(PROPDAMAGE), 
              CROPDAMAGE = sum(CROPDAMAGE)) %>%
    # Sorting fatalities and injuries in descending order
    arrange(desc(PROPDAMAGE + CROPDAMAGE)) %>%
    #selecting only 10 first records of injuries and fatalities
        slice(1:10) %>%
    # Melting injuries and fatalities 
    gather(DMGTYPE, Value, c(PROPDAMAGE, CROPDAMAGE))

3. Results

The following section describes the generation of the plots to answer the two questions of interest on this report. For both, the ggplot2 package was used, along with the two new datasets created in the previous steps: hltplot and ecnplot, to answer the first and second questions respectively.

3.1. Across the United States, which types of events are most harmful with respect to population health?

The plot created to answer the first question was achieved using ggplot2 function. First, we defined the data and aesthetic. The x axis was the Event Type (EVTYPE) and we needed this variable to be ordered, in decresing order. Therefore, the reorder function was used, excluding the y variable (Value)

In addition, is was used the function geom_col as this function is more appropriate to display the values in the dataset, while geom_bar displays, typically, the counts.

The ggtitle function determines the title of the plot, xlab or ylab, the title of the axis and the theme function, along with the axis.text.x and element_text arguments, allow the rotation of the x.axis.

library(ggplot2)

ggplot(hltplot, aes(x = reorder(EVTYPE, -Value), 
                    y = Value, 
                    fill =  TYPE)) +                                            
        geom_col(position="stack") +
        ggtitle (" Impact of Weather Events on Population Health")+
        xlab ("Event Type")+
        theme(axis.text.x = element_text(angle = 90 ))

Therefore, as observed, the Tornados are responsible for the highest number of Injuries and Fatalities, therefore is the most harmful event to population health.

3.2. Across the United States, which types of events have the greatest economic consequences?

The second question (“Across the United States, which types of events have the greatest economic consequences?”) was answered using a similar code, to generate a similar plot.

library(ggplot2)

ggplot(ecnplot, aes(x = reorder(EVTYPE, -Value), 
                    y = Value, 
                    fill =  DMGTYPE)) +                                            
        geom_col(position="stack") +
        ggtitle ("Weather Events Impact on the Economy of United Stated")+
        xlab ("Event Type")+
        theme(axis.text.x = element_text(angle = 90 ))

Therefore, as observed, the Floods are responsible for the most damage in property and crops and therefore has the most impact on the economy of U.S.