Synopsis

The NOAA storm database available on the course website was analyzed. It contains data from 1950 to 2011. The database was cleaned by removing records classified as NA. Then, the values were filtered for each question, taking only the data that had a value greater than zero in both parameters of each question. For question 1, the values expressed in the INJURIES and FATALITIES columns were considered. For question 2, the values of the PROPDMG and CROPDMG columns were taken with their respective adjustment values for Thousands (K), Millions (M), and Billions (B). Tornadoes were found to be the event type that causes the most damage to people across the US, with a maximum that exceeds its successor by almost 5 times, as well as the highest mortality rate. Floods were found to be the event type that causes the greatest economic damage, with more damage to properties than crops.

Data processing

This section shows how the data was collected and cleaned for further analysis.

The data was obtained form the course “Reproducible Research” website from Coursera platform. The data is a csv document that containsstorm events since 1950 and ends in 2011.

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

data <- read.csv("repdata_data_StormData.csv.bz2")

head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Library

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Results

Q1 Across the United States, which types of events are most harmful with respect to population health?

Fatalities and injuries columns indicates this parameter

q1_df <- data[,c("STATE","COUNTY","COUNTYNAME","EVTYPE","FATALITIES","INJURIES")]

# Then I filter for all the events that present some injury or fatality
q1_df <- filter(q1_df, FATALITIES > 0 & INJURIES > 0)

# Grouped by event type:
ppl_dmg <- q1_df %>% group_by(EVTYPE) %>% 
        summarise(injuries = sum(INJURIES), mortality = sum(FATALITIES))

# and ordered to see which event is more harmful
ppl_dmg[with(ppl_dmg, order(-injuries,-mortality)), ]
## # A tibble: 86 × 3
##    EVTYPE            injuries mortality
##    <chr>                <dbl>     <dbl>
##  1 TORNADO              60187      5227
##  2 EXCESSIVE HEAT        4791       402
##  3 FLOOD                 2679       104
##  4 ICE STORM             1720        35
##  5 HEAT                  1420        73
##  6 HURRICANE/TYPHOON     1219        32
##  7 BLIZZARD               718        48
##  8 LIGHTNING              649       283
##  9 TSTM WIND              646       199
## 10 FLASH FLOOD            641       171
## # … with 76 more rows
# in order to show the damage in a plot, the data was filtered again considering the value of the 75% superior class for injuries (129.75)
ppl_maxdmg <- filter(ppl_dmg, injuries > 129)

plot1 <- ggplot()+ geom_point(data = ppl_maxdmg, aes(x=EVTYPE, y=injuries), color="red", shape=4)+
        geom_point(data = ppl_maxdmg, aes(x=EVTYPE, y=mortality),
                   color="blue", shape=5)+
  xlab("Event Type")+
  ylab("Total damage people health")+
  theme_bw()+ 
  theme(axis.text.x = element_text(angle = 90))

print(plot1)

Despite all analysis the graph and the filtered tables shows that the worst event type related to clime it is the Tornadoes, with a max amount of people injured (60187) and with a fatality amount of 5227. the next climate event is the Excessive heat, but this is way to far of the Tornado event, with max amount of injury and fatalities of 4791 and 402 respectly.

Q2 Across the US, wich type of events have the greatest economic consequences?

Columns to be analyzed: - PROPDMG: indicate the economic estimated amount per event - PROPDMGEXP: character that indicate thousand (K), Million(M) an Billion (B). Same structure for the parameter CROPDMG (amount of damages in crops)

q2_df <- data[,c("STATE","EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
q2_df <- filter(q2_df, PROPDMG > 0 & CROPDMG > 0) #This eliminates the NA values too in both columns

# normalization of values for the expansion rate
unique(q2_df$PROPDMGEXP)
## [1] "B" "M" "m" "K" "5" "0" ""  "3"
unique(q2_df$CROPDMGEXP)
## [1] "M" "K" "m" "k" "B" "0" ""
q2_df["PROPDMGEXP"][q2_df["PROPDMGEXP"] == "m"] <- "M" 
q2_df["PROPDMGEXP"][q2_df["PROPDMGEXP"] == ""] <- "0" 

q2_df["CROPDMGEXP"][q2_df["CROPDMGEXP"] == "m"] <- "M" 
q2_df["CROPDMGEXP"][q2_df["CROPDMGEXP"] == "k"] <- "K" 
q2_df["CROPDMGEXP"][q2_df["CROPDMGEXP"] == ""] <- "0" 

# empty values were transformed to exp "0" and other letter were transformed to uppercase

head(q2_df[, c("PROPDMG","PROPDMGEXP")])
##   PROPDMG PROPDMGEXP
## 1     0.1          B
## 2     5.0          M
## 3    25.0          M
## 4    48.0          M
## 5    20.0          M
## 6    50.0          K
mean(q2_df$PROPDMG)
## [1] 66.2253
# The "num" variable allows to identify the values out of range, and the number of these instances
num <- filter(q2_df, PROPDMGEXP == 5 | PROPDMGEXP == 3 )

# there are three values in the PROPDMGEXP column that indicates non-factor value
# for these values, the amount of PROPDMG is less than the mean of the same column
# so these values were removed from the total calculations.

q2_df <- q2_df %>% filter(!(PROPDMGEXP == 5 | PROPDMGEXP == 3))

# Iteration to replace values from Exp to number 

prop_vect <- numeric()
crop_vect <- numeric()

# this iteration creates two vector, the they are multiply to add new columns values
# that allows to group by event type
for (i in q2_df$PROPDMGEXP) {
        if (i == "K"){i <- 1000}
        else if (i == "M"){i <- 1000000}
        else if (i == "B"){i <-1000000000}
    prop_vect <- c(prop_vect, i)
}

for (i in q2_df$CROPDMGEXP) {
        if (i == "K"){i <- 1000}
        else if (i == "M"){i <- 1000000}
        else if (i == "B"){i <-1000000000}
        crop_vect <- c(crop_vect, i)
}

q2_df$prop_damage <- as.numeric(prop_vect) * q2_df$PROPDMG
q2_df$crop_damage <- as.numeric(crop_vect) * q2_df$CROPDMG
q2_df$total_damage <- q2_df$prop_damage + q2_df$crop_damage

# Group the event type and sum the estimated values of economic loss

eco_dmg <- q2_df %>% group_by(EVTYPE) %>% 
        summarise(p_dmg = sum(prop_damage), c_dmg = sum(crop_damage), t_dmg = sum(total_damage))

plot2 <- ggplot()+ geom_point(data = eco_dmg, aes(x=EVTYPE,y=t_dmg), 
                     color="red")+
        xlab("Event Type")+ylab("Total economic damage")+theme_bw()+
        theme(axis.text.x = element_text(angle = 90))
print(plot2)

# Filter the damage according to quantiles

quantile(eco_dmg$t_dmg)
##           0%          25%          50%          75%         100% 
##          100       423750      8755000    142077900 126044533500
mean(eco_dmg$t_dmg)
## [1] 2164341765
eco_maxdmg <- filter(eco_dmg, t_dmg > mean(eco_dmg$t_dmg))

# table with max values in Million USD$ 

eco_maxdmg$t_dmg_MUSD <- eco_maxdmg$t_dmg/1000000
eco_maxdmg[with(eco_maxdmg, order(-t_dmg_MUSD)), ]
## # A tibble: 9 × 5
##   EVTYPE                   p_dmg      c_dmg        t_dmg t_dmg_MUSD
##   <chr>                    <dbl>      <dbl>        <dbl>      <dbl>
## 1 FLOOD             121971090050 4073443450 126044533500    126045.
## 2 HURRICANE/TYPHOON  26740295000 2607822800  29348117800     29348.
## 3 HURRICANE           7809278000 2688910000  10498188000     10498.
## 4 RIVER FLOOD         5079635000 5028734000  10108369000     10108.
## 5 ICE STORM             86504000 5022113500   5108617500      5109.
## 6 FLASH FLOOD         2926981010 1381860350   4308841360      4309.
## 7 HAIL                1960393440 1853569100   3813962540      3814.
## 8 TORNADO             1987231000  398705950   2385936950      2386.
## 9 HURRICANE OPAL      2168000000   19000000   2187000000      2187
plot3 <- ggplot() + geom_point(data=eco_maxdmg, aes(x=EVTYPE, y=t_dmg_MUSD), color="red")+
        xlab("Event Type")+
        ylab("Total economic damage")+
        theme_bw()+
        theme(axis.text.x = element_text(angle = 90))

print(plot3)

max(eco_maxdmg$t_dmg_MUSD)
## [1] 126044.5
max(eco_maxdmg$p_dmg)/1000000 # Million USD$ for better readability
## [1] 121971.1
max(eco_maxdmg$c_dmg)/1000000 # Million USD$ for better readability
## [1] 5028.734

For the economic consequences, the Flood events are most harmful, with MUSD 126,044.5, with more severe damage to the property (MUSD 121,971.1) than the crops (MUSD 5,028.734)