Analysis of the impact of storm events on the American economy and population health

Synopsis

In this project we aim to explore the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. We aim to understand which types of events are most harmful with respect to population health and which types of events have the greatest economic consequences.

Read and inspect the data

We first read the data from the National Weather Service about storms and their effects on the U.S. economy. The data come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. We download it and read it into a data frame, decompressing it on the fly.

#Download the data into a temporary file
my.file <- tempfile()
fileUrl <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#mode = "wb" keeps the compressed (binary) file intact on Windows
download.file(fileUrl, destfile = my.file, mode = "wb")

#read and decompress the file into a data frame
storm <- read.csv(bzfile(my.file))
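The reading step relies on read.csv() accepting a bzfile() connection, so no separate decompression step is needed. The following is a minimal, offline sketch of that behaviour: it writes a tiny made-up data frame (toy) to a bzip2-compressed temporary file and reads it back. The object names and values are illustrative only and are not part of the storm data.

```r
# Toy data frame with made-up values, standing in for the real storm file
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))

# Write it as a bzip2-compressed CSV to a temporary path
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses on the fly when given a bzfile() connection
round.trip <- read.csv(bzfile(path), stringsAsFactors = FALSE)
print(round.trip)
```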

After reading we check the data to see its structure:

#look at the dimensions of the data, i.e. number of observations and variables
dim(storm)
## [1] 902297     37
#look at the structure through the initial observations
head(storm)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
#a look at the structure of the data
str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The columns we are interested in are those that indicate the type of the event and its effects. They are stored in the following variables: “EVTYPE” (event type), “FATALITIES” (fatalities), “INJURIES” (injuries), “PROPDMG” (property damage) and “CROPDMG” (crop damage).

We will also keep the following variables in case we want to take the analysis further: “BGN_DATE” (day, month and year of the event), “COUNTYNAME” (county of the event) and “STATE” (U.S. state of the event).

We subset the data and transform the date variable of interest:

library(lubridate)
## Warning: package 'lubridate' was built under R version 3.0.3
storm[, 2] <- mdy_hms(storm[, 2])
sum.storm <- storm[, c(2, 4, 5, 6, 7, 8, 22:27)]
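According to the str() output above, positions 2, 4 to 8 and 22:27 correspond to BGN_DATE, TIME_ZONE, COUNTY, COUNTYNAME, STATE, EVTYPE, MAG, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP and CROPDMG (CROPDMGEXP sits at position 28 and is therefore not included). Selecting by name rather than by position can make such a subset easier to check. The sketch below shows the idea on a small made-up data frame; toy and keep are illustrative names, not objects from the analysis.

```r
# Toy data frame with a few of the storm columns (values are made up)
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "6/8/1951 0:00:00"),
  BGN_TIME   = c("0130", "0900"),
  STATE      = c("AL", "AL"),
  EVTYPE     = c("TORNADO", "TORNADO"),
  FATALITIES = c(0, 0),
  INJURIES   = c(15, 2),
  stringsAsFactors = FALSE
)

# Columns to keep, listed by name instead of by position
keep <- c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES")

# Fail early if a requested column does not exist
stopifnot(all(keep %in% names(toy)))

toy.subset <- toy[, keep]
print(names(toy.subset))
```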

Missing values are a common problem with environmental data, so we check what proportion of the observations are missing (i.e. coded as NA).

mean(is.na(sum.storm))
## [1] 0

As we can see, there are no missing values in the selected variables, so we move on to the analysis of the data.
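mean(is.na(...)) gives a single proportion over every cell of the data frame. When that number is not zero, a per-column breakdown shows where the gaps are. A minimal sketch on made-up data (toy is a stand-in, not the storm subset):

```r
# Toy stand-in for the subset; the values and NAs are made up
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "TORNADO"),
  FATALITIES = c(0, NA, 2, 1),
  INJURIES   = c(15, 0, NA, NA),
  stringsAsFactors = FALSE
)

# Overall share of missing cells, as computed in the report
overall.na <- mean(is.na(toy))

# Share of missing values in each column
na.by.column <- colMeans(is.na(toy))
print(na.by.column)
```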

Results

We now examine the effects of storms on the health of Americans and on the economy. For health effects we look at the number of fatalities and injuries caused by storms; for economic effects we look at the damage caused to property and agriculture.

Across the United States, which types of events are most harmful with respect to population health?

Using bar plots, we look at the 10 event types with the highest total number of people killed or injured across all states of the United States. We restrict the plots to the most harmful event types because there are many types and most of them have very low impact.

The data are aggregated by event type, so that each type is represented by its total number of fatalities and injuries.

#aggregate the data by event type and sort it in decreasing order
agr.FATALITIES <- aggregate(FATALITIES ~ EVTYPE, data = sum.storm, FUN = sum)
agr.INJ <- aggregate(INJURIES ~ EVTYPE, data = sum.storm, FUN = sum)
sorted.FAT <- agr.FATALITIES[order(agr.FATALITIES$FATALITIES, decreasing = TRUE), ]
sorted.INJ <- agr.INJ[order(agr.INJ$INJURIES, decreasing = TRUE), ]
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.0.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.0.3
## Loading required package: grid
#barplots of deaths and injuries by storm type
plot1 <- ggplot(sorted.FAT[1:10, ], aes(x=EVTYPE, y=FATALITIES)) + 
  geom_bar(stat="identity", fill = "steelblue", color = "black") + 
  ggtitle("Number of deaths in the U.S. by Storm Type") + 
  xlab("Storm Type")+ylab("Number of Fatalities") + 
  geom_text(aes(label = FATALITIES), size = 4, vjust = 1)

plot2 <- ggplot(sorted.INJ[1:10, ], aes(x=EVTYPE, y=INJURIES)) + 
  geom_bar(stat="identity", fill = "lightblue", color = "black") + 
  ggtitle("Number of injuries caused by storms in the U.S.") + 
  xlab("Storm Type")+ylab("Number of Injuries") + 
  geom_text(aes(label = INJURIES), size = 4, vjust = -1)

grid.arrange(plot1, plot2)

Figure: bar plots of the number of fatalities and the number of injuries for the 10 most harmful storm types.
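ggplot2 draws a discrete x axis in factor-level order, which for EVTYPE is alphabetical, so sorting the rows of the data frame does not by itself sort the bars by height. One possible way to order the bars by value is to reorder the factor levels with reorder() from the stats package. The sketch below uses made-up totals (top is an illustrative object) and only inspects the resulting level order; in a plot, the same expression could be used as aes(x = reorder(EVTYPE, -FATALITIES), y = FATALITIES).

```r
# Made-up totals for three event types
top <- data.frame(
  EVTYPE     = c("FLOOD", "HEAT", "TORNADO"),
  FATALITIES = c(10, 40, 90),
  stringsAsFactors = FALSE
)

# Order the levels by decreasing FATALITIES (hence the minus sign)
top$EVTYPE <- reorder(top$EVTYPE, -top$FATALITIES)

# The level order is the order in which ggplot2 would draw the bars
print(levels(top$EVTYPE))
```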

paste("The most fatal type of storm is the", agr.FATALITIES[which.max(agr.FATALITIES$FATALITIES), "EVTYPE"], "with a total of", max(agr.FATALITIES$FATALITIES), "fatalities" )
## [1] "The most fatal type of storm is the TORNADO with a total of 5633 fatalities"
paste("The most harmful type of storm is the", agr.INJ[which.max(agr.INJ$INJURIES), "EVTYPE"], "with a total of", max(agr.INJ$INJURIES), "injuries" )
## [1] "The most harmful type of storm is the TORNADO with a total of 91346 injuries"

From the figures and the results we can see that the most harmful type of event across the United States is the tornado, which causes the largest number of both deaths and injuries. Tornadoes caused 5633 fatalities, about 3 times the number caused by the second most fatal event type, excessive heat. They also caused the most injuries, 91346 in the period analysed, about 13 times as many as TSTM WIND. Overall, tornadoes, TSTM WIND and floods are the biggest threats to the health of the American people.
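The str() output lists 985 distinct EVTYPE levels, some with leading spaces (for example "   HIGH SURF ADVISORY"), so totals for one kind of event can be split across several labels. The sketch below shows one possible clean-up before aggregating: trim whitespace, convert to upper case, and merge labels judged to be the same event. The merge of "TSTM WIND" into "THUNDERSTORM WIND" is an assumption made for illustration, and the data are made up; the totals reported above were computed without any such clean-up.

```r
# Made-up records with inconsistent event labels
toy <- data.frame(
  EVTYPE   = c("TSTM WIND", " tstm wind", "THUNDERSTORM WIND", "TORNADO", "Tornado "),
  INJURIES = c(5, 1, 3, 10, 2),
  stringsAsFactors = FALSE
)

# Normalise case and surrounding whitespace
clean <- toupper(trimws(toy$EVTYPE))

# Hypothetical merge rule: treat the abbreviation as the full name
clean[clean == "TSTM WIND"] <- "THUNDERSTORM WIND"
toy$EVTYPE.clean <- clean

# Aggregate on the cleaned label
agr <- aggregate(INJURIES ~ EVTYPE.clean, data = toy, FUN = sum)
print(agr)
```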

We now turn to the effects of storms on the American economy.

Across the United States, which types of events have the greatest economic consequences?

Using bar plots, we look at the 10 event types that cause the most property damage and the most crop damage across all states of the United States. Again, we look only at a small subset of the event types (the most harmful ones) because there are so many of them.

Again, the data are aggregated by event type, so that each type is represented by its total damage.

#aggregate the data by event type and sort it in decreasing order
agr.PROPDMG <- aggregate(PROPDMG ~ EVTYPE, data = sum.storm, FUN = sum)
agr.CROPDMG <- aggregate(CROPDMG ~ EVTYPE, data = sum.storm, FUN = sum)
sorted.PROPDMG <- agr.PROPDMG[order(agr.PROPDMG$PROPDMG, decreasing = TRUE), ]
sorted.CROPDMG <- agr.CROPDMG[order(agr.CROPDMG$CROPDMG, decreasing = TRUE), ]
#barplot for property damage and crop damage
plot3<- ggplot(sorted.PROPDMG[1:10, ], aes(x=EVTYPE, y=PROPDMG)) + 
  geom_bar(stat="identity", fill = "red", color = "black") + 
  ggtitle("Property damage caused in the U.S. by Storm Type") + 
  xlab("Storm Type")+ylab("Damage") + 
  geom_text(aes(label = PROPDMG), size = 4, vjust = -1)

plot4 <- ggplot(sorted.CROPDMG[1:10, ], aes(x=EVTYPE, y=CROPDMG)) + 
    geom_bar(stat="identity", fill = "#D55E00", color = "black") + 
    ggtitle("Crop damage caused in the U.S. by Storm Type") + 
    xlab("Storm Type")+ylab("Damage") + 
    geom_text(aes(label = CROPDMG), size = 4, vjust = -1)


grid.arrange(plot3, plot4)

Figure: bar plots of property damage and crop damage for the 10 most damaging storm types.

paste("The ratio between Agriculture Damage and Property Damage is :", sum(agr.CROPDMG$CROPDMG)/sum(agr.PROPDMG$PROPDMG))
## [1] "The ratio between Agriculture Damage and Property Damage is : 0.126586183906853"

With regard to economic damage, tornadoes are again the most serious problem, but TSTM WIND and floods also cause a lot of damage to property. Here the damage is spread far more evenly across event types than the injuries were. The event type that causes the most harm to agriculture, however, is not the tornado but hail. Since total crop damage is only about 13% of total property damage, we can still consider tornadoes the greatest economic threat to Americans.
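The totals above sum the PROPDMG and CROPDMG columns as they are stored. The head() output also shows a PROPDMGEXP column holding codes such as "K" next to each amount. The sketch below shows one possible way to scale amounts by such a code before aggregating. It assumes that "K", "M" and "B" stand for thousands, millions and billions and that any other code (including the empty string) means no scaling; both assumptions should be checked against the NOAA documentation. The data frame is made up, and multiplier and PROPDMG.USD are illustrative names.

```r
# Made-up records with a damage amount and an exponent code
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL", "FLOOD"),
  PROPDMG    = c(25, 2.5, 10, 1),
  PROPDMGEXP = c("K", "M", "", "B"),
  stringsAsFactors = FALSE
)

# Assumed meaning of the codes (not confirmed by the report itself)
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up the multiplier for each row; unknown codes fall back to 1
mult <- multiplier[match(toupper(toy$PROPDMGEXP), names(multiplier))]
mult[is.na(mult)] <- 1

# Scaled damage, then totals by event type
toy$PROPDMG.USD <- toy$PROPDMG * unname(mult)
agr <- aggregate(PROPDMG.USD ~ EVTYPE, data = toy, FUN = sum)
print(agr)
```

With scaling of this kind, a single event recorded in billions can outweigh many events recorded in thousands, so a ranking by scaled damage may differ from a ranking by raw PROPDMG.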

Summary

As our study shows, of all kinds of storms the tornado is the one that causes the most harm both to the American economy and to the health of Americans. Further analysis is needed to investigate whether there are patterns and recurrences over the years and/or across the states that may suggest better policies to deal with the damage done by these events.
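As a starting point for that further analysis, the retained BGN_DATE and STATE variables could be used to total the effects by year and state. The sketch below does this on a made-up data frame using base R date parsing; the values, the toy object and the YEAR column are illustrative and are not results from the storm data.

```r
# Made-up events in the same date format as BGN_DATE
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
                 "6/8/1951 0:00:00", "11/15/1951 0:00:00"),
  STATE      = c("AL", "AL", "AL", "TX"),
  FATALITIES = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Parse the date part (the trailing time is ignored) and extract the year
dates <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")
toy$YEAR <- as.integer(format(dates, "%Y"))

# Total fatalities per year and state
by.year.state <- aggregate(FATALITIES ~ YEAR + STATE, data = toy, FUN = sum)
print(by.year.state)
```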