Among the many factors that contribute to disease and morbidity are those related to the environment in which susceptible individuals live. Storms and other severe weather events are among them.

This analysis examines which types of these events, across the United States, are most harmful to public health and which cause the greatest economic losses.

The results show that, among these events, tornadoes appear to be associated with the highest numbers of fatalities, floods with high numbers of injuries, hurricanes/typhoons with the greatest damage to agriculture, and storms with the greatest damage to property.

Project Objectives

Identify the weather events associated with the highest numbers of fatalities and injuries and the greatest property damage, in order to support public prevention planning and risk reduction.

Data Processing

The data analyzed come from the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA), which can be downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 (the code below downloads it automatically if the file is not already present). For more information, see the National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ.

After the data file was downloaded and uncompressed, an initial exploration showed 902,297 observations of 37 variables, of which 8 variables of interest were selected:

filename = "repdata-data-StormData.csv.bz2"
if (!file.exists(filename)) {
        fileURL = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileURL, filename)
}
if (!file.exists("repdata-data-StormData.csv")) {
        library(R.utils)
        bunzip2(filename)
}
storm = read.csv("repdata-data-StormData.csv")
dim(storm)
## [1] 902297     37
str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
library(dplyr)
s = select(storm, 
           BGN_DATE,
           EVTYPE, 
           FATALITIES,
           INJURIES, 
           PROPDMG, 
           CROPDMG, 
           PROPDMGEXP, 
           CROPDMGEXP)
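
As a side note on the loading step above, base R can usually read a bzip2-compressed CSV without a separate decompression step, because read.csv() opens the path through file(), which detects the compression. The sketch below illustrates this on a small temporary file; the file contents and the name `small` are invented for the example and are not part of the original analysis.

```r
# Illustration only: a tiny bzip2-compressed CSV in a temporary file
# stands in for the real storm data file.
tmp = tempfile(fileext = ".csv.bz2")
con = bzfile(tmp, "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "HAIL,0"), con)
close(con)

# read.csv() reads the compressed file directly
small = read.csv(tmp)
dim(small)
```

If this holds for the full data set, the R.utils dependency and the uncompressed copy on disk would be optional.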

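In particular, the loss measures had to be adjusted: each damage amount (PROPDMG, CROPDMG) is stored together with a separate exponent code (PROPDMGEXP, CROPDMGEXP), which was converted to a power of ten before computing the value in USD.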

table(s$PROPDMGEXP, s$CROPDMGEXP)
##    
##                 ?      0      2      B      k      K      m      M
##     461616      2      3      1      4      0   3865      0    443
##   -      1      0      0      0      0      0      0      0      0
##   ?      8      0      0      0      0      0      0      0      0
##   +      5      0      0      0      0      0      0      0      0
##   0    211      0      0      0      0      0      4      0      1
##   1     25      0      0      0      0      0      0      0      0
##   2     13      0      0      0      0      0      0      0      0
##   3      3      0      0      0      0      0      1      0      0
##   4      4      0      0      0      0      0      0      0      0
##   5     26      0      0      0      0      0      1      0      1
##   6      4      0      0      0      0      0      0      0      0
##   7      5      0      0      0      0      0      0      0      0
##   8      1      0      0      0      0      0      0      0      0
##   B     16      0      0      0      2      0     11      0     11
##   h      1      0      0      0      0      0      0      0      0
##   H      6      0      0      0      0      0      0      0      0
##   K 149067      4     16      0      3     21 274690      0    864
##   m      6      0      0      0      0      0      0      1      0
##   M   7395      1      0      0      0      0   3260      0    674
s$PROPDMGEXP = as.character(s$PROPDMGEXP)
s$PROPDMGEXP = gsub("\\-|\\+|\\?","0",s$PROPDMGEXP)
s$PROPDMGEXP = gsub("B|b", "9", s$PROPDMGEXP)
s$PROPDMGEXP = gsub("M|m", "6", s$PROPDMGEXP)
s$PROPDMGEXP = gsub("K|k", "3", s$PROPDMGEXP)
s$PROPDMGEXP = gsub("H|h", "2", s$PROPDMGEXP)
s$PROPDMGEXP = as.numeric(s$PROPDMGEXP)
s$PROPDMGEXP[is.na(s$PROPDMGEXP)] = 0
s$PROPDAM = s$PROPDMG * 10 ^ s$PROPDMGEXP
s$CROPDMGEXP = as.character(s$CROPDMGEXP)
s$CROPDMGEXP = gsub("\\-|\\+|\\?","0",s$CROPDMGEXP)
s$CROPDMGEXP = gsub("B|b", "9", s$CROPDMGEXP)
s$CROPDMGEXP = gsub("M|m", "6", s$CROPDMGEXP)
s$CROPDMGEXP = gsub("K|k", "3", s$CROPDMGEXP)
s$CROPDMGEXP = gsub("H|h", "2", s$CROPDMGEXP)
s$CROPDMGEXP = as.numeric(s$CROPDMGEXP)
s$CROPDMGEXP[is.na(s$CROPDMGEXP)] = 0
s$CROPDAM = s$CROPDMG * 10 ^ s$CROPDMGEXP
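
The two conversion blocks above apply the same rules to both exponent columns. A minimal sketch of how those rules could be wrapped in a single function is shown here; `exp_to_power` and `codes` are hypothetical names introduced for illustration, not part of the original analysis.

```r
# Hypothetical helper: map an exponent code to a power of ten,
# following the same rules as the gsub() calls above.
exp_to_power = function(code) {
        code = toupper(as.character(code))
        lookup = c(H = 2, K = 3, M = 6, B = 9)
        # Digit codes ("0".."8") convert directly; everything else becomes NA
        power = suppressWarnings(as.numeric(code))
        is_letter = code %in% names(lookup)
        power[is_letter] = unname(lookup[code[is_letter]])
        # "", "-", "+" and "?" are treated as exponent 0
        power[is.na(power)] = 0
        power
}

codes = c("K", "m", "B", "", "?", "5", "h")
exp_to_power(codes)
```

Under this sketch, a record with PROPDMG of 25 and code "K" would be computed as 25 * 10 ^ exp_to_power("K"), that is 25,000 USD.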

The variables were transformed (yearly totals expressed per billion, event-level values on a log scale) and the data were rearranged to enable basic exploration, including clearer plots.

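Higher values were defined as those above the 99th percentile of the non-zero data. For the economic losses, a stricter cutoff of 18 on the log scale (about USD 65.7 million) was used, which lies above their 99th percentile of 16.12.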

# Drop the raw PROPDMG, CROPDMG, PROPDMGEXP and CROPDMGEXP columns
s = s[,-(5:8)]
s$BGN_DATE = as.Date(s$BGN_DATE, "%m/%d/%Y %H:%M:%S")
class(s$BGN_DATE)
## [1] "Date"
library(lubridate)
s$BGN_DATE = year(s$BGN_DATE)
library(reshape2)
dates = melt(s[,-2], id = "BGN_DATE",
              measure.vars = c("FATALITIES", "INJURIES", "PROPDAM" ,"CROPDAM"))
dates = aggregate(value ~ BGN_DATE + variable, dates, sum)
summary(dates)
##     BGN_DATE          variable      value          
##  Min.   :1950   FATALITIES:62   Min.   :0.000e+00  
##  1st Qu.:1965   INJURIES  :62   1st Qu.:6.600e+01  
##  Median :1980   PROPDAM   :62   Median :1.236e+03  
##  Mean   :1980   CROPDAM   :62   Mean   :1.925e+09  
##  3rd Qu.:1996                   3rd Qu.:3.169e+08  
##  Max.   :2011                   Max.   :1.219e+11
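# Rescale to billions. Note: this divides all four variables, so the
# FATALITIES and INJURIES counts are also expressed per billion.
dates$value = dates$value / 1e9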

health = melt(s[,-1], id = "EVTYPE", measure.vars = c("FATALITIES", "INJURIES"))
summary(health)
##                EVTYPE             variable          value         
##  HAIL             :577322   FATALITIES:902297   Min.   :0.00e+00  
##  TSTM WIND        :439880   INJURIES  :902297   1st Qu.:0.00e+00  
##  THUNDERSTORM WIND:165126                       Median :0.00e+00  
##  TORNADO          :121304                       Mean   :8.63e-02  
##  FLASH FLOOD      :108554                       3rd Qu.:0.00e+00  
##  FLOOD            : 50652                       Max.   :1.70e+03  
##  (Other)          :341756
health = health[health$value > 0, ]
quantile(health$value, c(.75, .8, .9, .95, .99, 1))
##  75%  80%  90%  95%  99% 100% 
##    3    4   10   20   90 1700
health = health[(health$value > quantile(health$value, .99)), ]
health$value = log(health$value)

econ = melt(s[,-1], id = "EVTYPE", measure.vars = c("PROPDAM", "CROPDAM"))
summary(econ)
##                EVTYPE          variable          value          
##  HAIL             :577322   PROPDAM:902297   Min.   :0.000e+00  
##  TSTM WIND        :439880   CROPDAM:902297   1st Qu.:0.000e+00  
##  THUNDERSTORM WIND:165126                    Median :0.000e+00  
##  TORNADO          :121304                    Mean   :2.645e+05  
##  FLASH FLOOD      :108554                    3rd Qu.:0.000e+00  
##  FLOOD            : 50652                    Max.   :1.150e+11  
##  (Other)          :341756
econ = econ[econ$value > 0, ]
econ$value = log(econ$value)
quantile(econ$value, c(.75, .8, .9, .95, .99, 1))
##      75%      80%      90%      95%      99%     100% 
## 10.81978 11.00210 12.42922 13.65299 16.11810 25.46820
econ = econ[econ$value > 18, ]
summary(econ$value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.01   18.42   18.83   19.15   19.47   25.47
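
The percentile-based selection used above can be sketched on a small vector. The numbers in `value` are invented for illustration; only the sequence of steps (drop zeros, take the 99th percentile, keep the upper tail, take logs) mirrors the code above.

```r
# Toy values standing in for one melted 'value' column (assumed data)
value = c(0, 0, 1, 2, 3, 5, 8, 13, 21, 200)

nonzero = value[value > 0]            # drop zero records first
cutoff = quantile(nonzero, 0.99)      # 99th percentile of the non-zero data
extreme = nonzero[nonzero > cutoff]   # keep only the upper tail
log(extreme)                          # log scale, as used for plotting
```

Because the cutoff is computed after removing zeros, it reflects only the events that caused some harm or loss, which is the majority-zero situation seen in the summaries above.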

The yearly totals show that the most recent decade has more complete data.

library(ggplot2)
ggplot(dates, 
       aes(BGN_DATE, value)) + 
        geom_bar(stat = "identity") +
        facet_grid(. ~ variable) +
        xlab("Year") +
        ylab("Value per billion (USD for PROPDAM and CROPDAM)") +
        ggtitle("Total Damage per Consequence Type Caused by Severe Weather Events\nAcross the United States from 1950 to 2011") +
        theme(axis.text.x = element_text(angle = 90))

Results

Below we can see that tornadoes were the major cause of fatalities and also had a strong impact on the number of injuries. For injuries, however, tornadoes are surpassed in median value by floods, hurricanes/typhoons, and blizzards. Floods, high winds, and hurricanes were the main causes of crop damage, while severe thunderstorms and storm surges caused the greatest losses to property.

ggplot(health,
       aes(EVTYPE, value, fill = variable)) +
        geom_boxplot() +
        xlab("Events Type") +
        ylab("Number of fatalities or injuries (log scale)") +
        ggtitle("Association Between Higher Health Damage Values and Type of Weather Event\nAcross the United States from 1950 to 2011") +
        theme( axis.text.x = element_text(angle = 90)) +
        guides(fill = guide_legend(title = "Type Damage", title.position = "top"))

ggplot(econ,
       aes(EVTYPE, value, fill = variable)) +
        geom_boxplot() +
        xlab("Events Type") +
        ylab("Value in USD (log scale)") +
        ggtitle("Association Between Higher Economic Damage Values and Type of Weather Event\nAcross the United States from 1950 to 2011") +
        theme(axis.text.x = element_text(angle = 90)) +
        guides(fill = guide_legend(title = "Type Damage", title.position = "top"))
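
As a complement to the boxplots, totals per event type could also be tabulated and ranked. The sketch below uses a toy data frame with the same column names as the analysis data; the values are invented and do not reproduce the results reported above.

```r
# Toy data with the same column names as the analysis data (values invented)
toy = data.frame(
        EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
        FATALITIES = c(5, 2, 3, 0, 1),
        INJURIES = c(40, 10, 25, 1, 60)
)

# Sum each measure per event type, then sort by fatalities (descending)
totals = aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, toy, sum)
totals = totals[order(-totals$FATALITIES), ]
totals
```

Applied to the full data, the same pattern would rank event types by total rather than by the distribution of their most extreme records, so the two views may not agree.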