Synopsis

In this report we describe the impact of different weather events in the United States since 1990. Our analysis shows that although some catastrophes, such as tornadoes, are very harmful to population health, their lower frequency means they are not the most important overall. We also found that some events that cause a lot of economic damage do not have a comparable impact on population health.

Reading the Storm data

We first read in the Storm data from the raw csv file included in the zip archive. We also load the packages we are going to use in the document.

storm <- read.csv("repdata_data_StormData.csv", header = T)

library(ggplot2)
library(gridExtra)
## Loading required package: grid
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

After reading the data in, we check the structure of the dataset, which has 37 variables and 902,297 rows.

storm <- tbl_df(storm)
str(storm)
## Classes 'tbl_df', 'tbl' and 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We then convert the beginning date to the Date class and keep the events from 01/01/1990 up to the last day in the dataset, 11/30/2011.

storm <- mutate(storm, BGN_DATE  = as.Date(as.character(BGN_DATE),"%m/%d/%Y"))
storm2 <- filter(storm, BGN_DATE >= as.Date("01/01/1990", "%m/%d/%Y")) # cutoff string now matches its format

storm2 <- mutate(storm2, EVTYPE = toupper(EVTYPE)) # convert the event names to upper case
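
The date handling can be checked on a few values before it is applied to the full dataset. The sketch below uses base R only and a made-up vector `bgn` in the same "m/d/Y H:M:S" style as BGN_DATE; the names `bgn`, `dates`, `cutoff` and `keep` are illustrative and not part of the analysis.

```r
# Toy stand-in for the BGN_DATE column (illustrative values only).
bgn <- c("4/18/1950 0:00:00", "12/31/1989 0:00:00",
         "1/1/1990 0:00:00", "11/28/2011 0:00:00")

# as.Date() reads only as far as the format requires,
# so the trailing time of day is ignored.
dates <- as.Date(bgn, format = "%m/%d/%Y")

# An ISO 8601 string needs no format argument at all,
# which rules out a mismatch between the string and its format.
cutoff <- as.Date("1990-01-01")

# TRUE for the rows that the filter would keep.
keep <- dates >= cutoff
print(data.frame(bgn, dates, keep))
```

Writing the cutoff in ISO form is one way to keep the comparison readable; the format-based call used in the report is equivalent as long as the string and the format agree.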

As the following frequency table shows, some events are recorded under different names even though they are the same, for example THUNDERSTORM WIND and TSTM WIND.

group_by(storm2, EVTYPE) %>%
        summarise(freq = n()) %>%
        arrange(desc(freq))
## Source: local data frame [898 x 2]
## 
##                EVTYPE   freq
## 1                HAIL 240945
## 2           TSTM WIND 147989
## 3   THUNDERSTORM WIND  82564
## 4         FLASH FLOOD  54277
## 5             TORNADO  29764
## 6               FLOOD  25327
## 7  THUNDERSTORM WINDS  20843
## 8           HIGH WIND  20214
## 9           LIGHTNING  15754
## 10         HEAVY SNOW  15708
## ..                ...    ...

To clean the data a little, we merge these variants so that each event is counted under a single name.

storm2[grep("(GLAZE|ICE STORM)", storm2$EVTYPE),"EVTYPE"] = "ICE STORM"
storm2[grep("FOG+", storm2$EVTYPE), "EVTYPE"] = "FOG"
storm2[grep("THUNDERSTORM WIND+", storm2$EVTYPE),"EVTYPE"] = "TSTM WIND"
storm2[grep("HURRICANE+", storm2$EVTYPE),"EVTYPE"] = "HURRICANE" 
storm2[grep("(WILD+|FOREST)", storm2$EVTYPE),"EVTYPE"] = "WILD FIRE"
storm2[grep("HEAT WAVE", storm2$EVTYPE),"EVTYPE"] = "HEAT"
storm2[grep("^HIGH WIND", storm2$EVTYPE),"EVTYPE"] = "HIGH WIND"
storm2[grep("^RIP CURR", storm2$EVTYPE),"EVTYPE"] = "RIP CURRENT"
storm2[grep("EXTREME.COLD", storm2$EVTYPE),"EVTYPE"] = "EXTREME COLD/ WIND CHILL"
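
The same clean-up can also be written as a small rule table, which keeps each pattern next to its replacement and makes repeated or mistyped rules easier to spot. This is only a sketch in base R: `rules`, `normalise_evtype` and the toy vector `ev` are hypothetical names, and the `trimws()` step is an addition that the code above does not perform.

```r
# Hypothetical rule table: regular expression -> canonical label.
# Rules are applied in order, so later rules see the result of earlier ones.
rules <- c(
  "GLAZE|ICE STORM"   = "ICE STORM",
  "THUNDERSTORM WIND" = "TSTM WIND",
  "^RIP CURR"         = "RIP CURRENT"
)

normalise_evtype <- function(x, rules) {
  # Upper-case and strip surrounding blanks (the trimming is an extra step).
  x <- toupper(trimws(x))
  for (pattern in names(rules)) {
    # Overwrite every name that matches the pattern with its canonical label.
    x[grepl(pattern, x)] <- rules[[pattern]]
  }
  x
}

# Toy event names, not taken from the dataset.
ev <- c("Thunderstorm Winds", "TSTM WIND", "glaze", "RIP CURRENTS", "HAIL")
clean <- normalise_evtype(ev, rules)
print(table(clean))
```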

Results

Population Health

The first question is which types of events were most harmful with respect to population health. To answer it we use two variables: INJURIES, the number of people injured, and FATALITIES, the number of people killed. These two variables measure different things, but we need to take both into account. Any weighting between them would be arbitrary, so we simply add them up.

group_by(storm2, EVTYPE) %>%
        summarise(sumTOT = sum(INJURIES + FATALITIES),
                  freq = n()) %>%
        arrange(desc(sumTOT), desc(freq))
## Source: local data frame [798 x 3]
## 
##            EVTYPE sumTOT   freq
## 1         TORNADO  28426  29764
## 2  EXCESSIVE HEAT   8428   1678
## 3       TSTM WIND   7986 257435
## 4           FLOOD   7259  25327
## 5       LIGHTNING   6046  15754
## 6            HEAT   3612    846
## 7     FLASH FLOOD   2755  54277
## 8       ICE STORM   2304   2077
## 9       WILD FIRE   1696   4232
## 10   WINTER STORM   1527  11433
## ..            ...    ...    ...

As we can see, tornadoes appear to be the most destructive in terms of human lives, but they are far less frequent than thunderstorm winds (roughly 1 to 9 in the table above). To answer the first question properly we therefore consider not only the total number of people injured or killed but also an expected total. To compute it, we divide the frequency of each event type by the total number of events, and then multiply this share by the total number of injured and dead people for that event type. The result is the expected value of health damage.

g <- group_by(storm2, EVTYPE) 
g <- summarise(g, sumTOT = sum(INJURIES + FATALITIES),
                  freq = n())
g <- mutate(g, freq_per = round(freq / sum(freq)*100, digits = 2))
g <- mutate(g, HealthEXP = sumTOT * freq_per / 100)
g <- arrange(g, desc(HealthEXP)) # we sort them by the Total Expected Value of health destruction

print(g)
## Source: local data frame [798 x 5]
## 
##            EVTYPE sumTOT   freq freq_per HealthEXP
## 1       TSTM WIND   7986 257435    34.25 2735.2050
## 2         TORNADO  28426  29764     3.96 1125.6696
## 3            HAIL   1154 240945    32.05  369.8570
## 4           FLOOD   7259  25327     3.37  244.6283
## 5     FLASH FLOOD   2755  54277     7.22  198.9110
## 6       LIGHTNING   6046  15754     2.10  126.9660
## 7       HIGH WIND   1385  20214     2.69   37.2565
## 8      HEAVY SNOW   1148  15708     2.09   23.9932
## 9    WINTER STORM   1527  11433     1.52   23.2104
## 10 EXCESSIVE HEAT   8428   1678     0.22   18.5416
## ..            ...    ...    ...      ...       ...
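
As a check on the arithmetic, the first two rows of the table can be reproduced by hand. The snippet below copies `sumTOT` and `freq_per` from the printed output and applies the same formula as the `mutate()` call; the data frame name `chk` is only illustrative.

```r
# Values copied from the first two rows of the printed table.
chk <- data.frame(
  EVTYPE   = c("TSTM WIND", "TORNADO"),
  sumTOT   = c(7986, 28426),
  freq_per = c(34.25, 3.96)
)

# Same formula as in the analysis: total casualties times the frequency share.
chk$HealthEXP <- chk$sumTOT * chk$freq_per / 100
print(chk)
```

Because `freq_per` is rounded to two decimals before it is used, event types with a share below 0.005% receive a weight of zero under this formula.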

We plot the results.

g$EVTYPE <- factor(g$EVTYPE, levels = as.character(g[order(g$HealthEXP,decreasing = T),]$EVTYPE))

hlthplot <- qplot(x = factor(EVTYPE), y = HealthEXP , geom = "bar", 
      data = filter(g, HealthEXP > 20), stat = "identity",
      xlab = "Event Type",
      ylab = "Expected Number of people",
      main = "Expected Number of People Injured or Dead by Weather Events")
print(hlthplot)

Measured this way, thunderstorm winds were the most harmful weather event in the United States between 1990 and 2011, with an expected number of dead and injured people more than twice that of tornadoes, which now rank second.

Economic Damage

The second question is which types of events had the greatest economic consequences.

We use two variables: PROPDMG (property damage) and CROPDMG (crop damage).

Their units are given in “PROPDMGEXP” and “CROPDMGEXP”. We only use the codes “B”, “b”, “M”, “m”, “K”, “k” and blank, where B or b stands for billions, M or m for millions, and K or k for thousands. Amounts with a blank code are taken as plain dollars.

We discarded the remaining codes because we did not know their meaning. As the next tables show, there were not many of them.

table(storm2$CROPDMGEXP) 
## 
##             ?      0      2      B      k      K      m      M 
## 467856      7     19      1      9     21 281832      1   1994
table(storm2$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 346265      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 396151      7   8956

We filter on these codes and compute the new property damage and crop damage variables.

storm3 <- filter(storm2, PROPDMGEXP %in% c("B","b","M","m","k","K",""))
storm3 <- filter(storm3, CROPDMGEXP %in% c("B","b","M","m","k","K",""))

cropvalue <- ifelse(storm3$CROPDMGEXP %in% c("B","b"),10^9,
                 ifelse(storm3$CROPDMGEXP %in% c("M","m"),10^6,
                        ifelse(storm3$CROPDMGEXP %in% c("K","k"), 10^3,1)))
propvalue <- ifelse(storm3$PROPDMGEXP %in% c("B","b"),10^9,
                    ifelse(storm3$PROPDMGEXP %in% c("M","m"),10^6,
                           ifelse(storm3$PROPDMGEXP %in% c("K","k"), 10^3,1)))

table(cropvalue);table(propvalue) # check results
## cropvalue
##      1   1000  1e+06  1e+09 
## 467543 281847   1993      9
## propvalue
##      1   1000  1e+06  1e+09 
## 346259 396131   8962     40
storm3 <- mutate(storm3, PROPDMG = PROPDMG * propvalue,
                         CROPDMG = CROPDMG * cropvalue)
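
The nested `ifelse()` calls can also be expressed as a named lookup vector, which may be easier to extend if more codes are ever decoded. The following is a minimal sketch in base R under that assumption; `multiplier`, `exp_to_value` and the toy inputs are hypothetical names and values, not part of the analysis.

```r
# Multipliers for the exponent codes kept in the analysis.
multiplier <- c(B = 1e9, M = 1e6, K = 1e3)

exp_to_value <- function(code) {
  code <- toupper(as.character(code))
  # Codes that are not in the lookup come back as NA.
  value <- unname(multiplier[code])
  # A blank code means the amount is already in dollars.
  value[code == ""] <- 1
  value
}

# Toy inputs, not taken from the dataset.
dmg  <- c(25, 2.5, 1.2, 500)
code <- c("K", "m", "B", "")
print(dmg * exp_to_value(code))
```

Unlike the `ifelse()` version, this sketch returns `NA` for an unknown code such as "?" instead of a multiplier of 1, so unexpected codes stay visible if the filtering step is skipped.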

Then, as before, we sum these two variables and compute their total expected value.

h <- group_by(storm3, EVTYPE) 
h <- summarise(h, sumDMG = sum(PROPDMG + CROPDMG),
                  freq = n())
h <- mutate(h, freq_per = round(freq / sum(freq)*100, digits = 2),
               sumDMGEXP = sumDMG * freq_per / 100)

h <- arrange(h, desc(sumDMGEXP)) # we sort them by the Total Expected Damage
print(h)
## Source: local data frame [794 x 5]
## 
##          EVTYPE       sumDMG   freq freq_per  sumDMGEXP
## 1          HAIL  18733216230 240909    32.06 6005869123
## 2         FLOOD 150319678257  25326     3.37 5065773157
## 3     TSTM WIND  10896687315 257213    34.23 3729936068
## 4   FLASH FLOOD  17561538817  54261     7.22 1267943103
## 5       TORNADO  30823290423  29740     3.96 1220602301
## 6     HIGH WIND   5908617560  20212     2.69  158941812
## 7  WINTER STORM   6715441250  11432     1.52  102074707
## 8     WILD FIRE   8899845130   4232     0.56   49839133
## 9       DROUGHT  15018672000   2487     0.33   49561618
## 10    HURRICANE  90271472810    288     0.04   36108589
## ..          ...          ...    ...      ...        ...
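
To see how much the frequency weighting changes the picture, the ten printed rows can be ranked both by total damage and by expected damage. The values below are copied from the table above; the ranks are relative to these ten rows only, since the remaining event types are not shown. The names `top`, `rank_total` and `rank_weighted` are illustrative.

```r
# The ten rows printed above (sumDMG and sumDMGEXP in dollars).
top <- data.frame(
  EVTYPE = c("HAIL", "FLOOD", "TSTM WIND", "FLASH FLOOD", "TORNADO",
             "HIGH WIND", "WINTER STORM", "WILD FIRE", "DROUGHT", "HURRICANE"),
  sumDMG = c(18733216230, 150319678257, 10896687315, 17561538817, 30823290423,
             5908617560, 6715441250, 8899845130, 15018672000, 90271472810),
  sumDMGEXP = c(6005869123, 5065773157, 3729936068, 1267943103, 1220602301,
                158941812, 102074707, 49839133, 49561618, 36108589)
)

# Rank 1 is the largest value within these ten rows.
top$rank_total    <- rank(-top$sumDMG)
top$rank_weighted <- rank(-top$sumDMGEXP)

print(top[order(top$rank_total), c("EVTYPE", "rank_total", "rank_weighted")])
```

Among these rows, hurricanes have the second largest total damage but the lowest expected damage, because they account for only 0.04% of the recorded events.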

We plot the results. As in the first part, the ranking changes once frequency is taken into account: the expected values show that hail was the most destructive weather event between 1990 and 2011.

h$EVTYPE <- factor(h$EVTYPE, levels = as.character(h[order(h$sumDMGEXP,decreasing = T),]$EVTYPE))

dmgplot <- qplot(x = factor(EVTYPE), y = sumDMGEXP / 10^6 , geom = "bar", 
      data = h[1:10,], stat = "identity",
      xlab = "Event Type",
      ylab = "Expected M Dollars Damage Cost",
      main = "Expected M Dollars Damage Cost by Weather Events")
print(dmgplot)

Finally, we plot the two expected values against each other to get an idea of the total damage of these events. As the plot shows, thunderstorm winds and tornadoes were the main destructive weather events during the period.

m.gh <- merge(x = g[,c(1,5)] , y = h[,c(1,5)], by = "EVTYPE",sort = F)

ggplot(m.gh[1:10,], aes(x = log(HealthEXP), y = log(sumDMGEXP))) +
        geom_point(size = 2.5, aes(color = EVTYPE)) +
        geom_vline(xintercept = mean(log(m.gh[1:10,"HealthEXP"]))) +
        geom_hline(yintercept = mean(log(m.gh[1:10,"sumDMGEXP"]))) +
        xlab("Ln Expected Number of people") +
        ylab("Ln Expected Dollars Damage Cost") +
        theme(legend.position = "none") +
        geom_text(aes(label = EVTYPE), hjust = 0.6, vjust = -0.8, size = 2.5)
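
One detail of the merge is worth a note: the documentation of `merge()` states that with `sort = FALSE` the rows come back in an unspecified order, so taking rows `1:10` afterwards depends on that order. A possible safeguard, sketched here on toy data with base R only, is to order the merged table explicitly before selecting the top rows. The tables `g_toy` and `h_toy` and their values are made up for illustration.

```r
# Toy stand-ins for the two summary tables (illustrative values only).
g_toy <- data.frame(EVTYPE = c("A", "B", "C"), HealthEXP = c(5, 50, 500))
h_toy <- data.frame(EVTYPE = c("C", "A", "D"), sumDMGEXP = c(1e6, 1e8, 1e7))

# merge() keeps only the event types present in both tables (an inner join),
# so "B" and "D" are dropped here.
m_toy <- merge(g_toy, h_toy, by = "EVTYPE")

# Order explicitly before taking the top rows instead of relying on merge order.
m_toy <- m_toy[order(m_toy$HealthEXP, decreasing = TRUE), ]
top_n <- head(m_toy, 2)
print(top_n)
```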