The most harmful weather events.(PA2-Reproducible Research).

Author: Jose A. Ariño

Date 25/04/2015

Synopsis

This document contains a Exploratory Data Analysis of a dataset from NOAA Storm database across United States in the period 1996-2011. The goal is getting a ranking of the types of weather events, according to the damage produced by each one, in terms of health and economic consequencies. This ranking will be able to answer to two questions: One, which are the most harmful weather events in terms of population health? And two, which are the most harmful in terms of economic consequencies? The answer to the first question is a group of four events, Excessive Heat, Tornado, Flash flood and Flood. The answer to the second is a group of three events, Hurricane (Typhoon), Storm Surge/Tide and Flood.

Data Processing

1- Opening required libraries

The libraries that require installation in R are: dplyr,reshape2,xtable and ggplot2.

library(dplyr)
library(reshape2)
library(xtable)
library(ggplot2)

2- Getting data

The data is downloaded from the course web site:(https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). After, the data is readed in R directly from a bz2 file.

if (!file.exists("./data")) {
    dir.create("data")
}

url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
zip2file <- "./data/StormData.csv.bz2"

if (!file.exists(zip2file)){
    download.file(url,zip2file)
    write.csv(date(),"./data/date_download.txt")
}

if (!exists("data_raw")){
    data_raw<-read.csv(zip2file,colClasses="character",stringsAsFactors = FALSE,
                 na.strings=c("NA",""),header= TRUE)    # bz2 file read directly
}

data<-data_raw

3- Cleaning data

3.1- Select variables for the analysis

Date (col 2),Event Type (col 8),problems of the event (cols 23 to 28),remarks and ref (cols 36-37), are selected.

data<-data[c(2,8,23:28,36:37)]

3.2- Variable names

Adjusting the variable names to regular names.

names(data)<-tolower(names(data)) # lowercase
names(data)<-gsub("_","",names(data)) # not points

3.3- Variable formats

data$bgndate<-as.Date(data$bgndate,"%m/%d/%Y")
data$fatalities<-as.integer(data$fatalities)
data$injuries<-as.integer(data$injuries)
data$propdmg<-as.numeric(data$propdmg)
data$propdmgexp<-as.factor(data$propdmgexp)
data$cropdmg<-as.numeric(data$cropdmg)
data$cropdmgexp<-as.factor(data$cropdmgexp)
data$refnum<-as.integer(data$refnum)

3.4- Select dates: date > 01-01-1996.

Dates before contain incomplete records, NOAA website. Dropping them we can reduce the observations from 902,297 in raw data to 653,530.

data<-filter(data,bgndate>="1996-01-01")
# splitting remarks by eficiency
data_remarks<-data[c(9,10)]
data<-select(data,-remarks)

3.5- Adjusting a wrong data

The event that has the maximum amount of $ ( Ref = 605953 ) has a wrong data: It has a “B” into the propdmgexp column, when it is actually a “M”.In the current database of NOAA is corrected.

data[data$refnum == 605943,6]<-"M"

3.6- Dropping rows with null effects.

Dropping rows that does no effects, neither health damages nor economics. This allows to reduce observations from 653,530 to 201,318.

summary(data)

##     bgndate              evtype            fatalities       
##  Min.   :1996-01-01   Length:653530      Min.   :  0.00000  
##  1st Qu.:2000-11-21   Class :character   1st Qu.:  0.00000  
##  Median :2005-05-14   Mode  :character   Median :  0.00000  
##  Mean   :2004-10-25                      Mean   :  0.01336  
##  3rd Qu.:2008-08-22                      3rd Qu.:  0.00000  
##  Max.   :2011-11-30                      Max.   :158.00000  
##                                                             
##     injuries           propdmg          propdmgexp        cropdmg       
##  Min.   :0.00e+00   Min.   :   0.00   K      :369938   Min.   :  0.000  
##  1st Qu.:0.00e+00   1st Qu.:   0.00   M      :  7375   1st Qu.:  0.000  
##  Median :0.00e+00   Median :   0.00   B      :    31   Median :  0.000  
##  Mean   :8.87e-02   Mean   :  11.69   0      :     1   Mean   :  1.839  
##  3rd Qu.:0.00e+00   3rd Qu.:   1.00   -      :     0   3rd Qu.:  0.000  
##  Max.   :1.15e+03   Max.   :5000.00   (Other):     0   Max.   :990.000  
##                                       NA's   :276185                    
##    cropdmgexp         refnum      
##  K      :278686   Min.   :248768  
##  M      :  1771   1st Qu.:412150  
##  B      :     4   Median :575533  
##  ?      :     0   Mean   :575533  
##  0      :     0   3rd Qu.:738915  
##  (Other):     0   Max.   :902297  
##  NA's   :373069

in the four effect variables there are not either negative values -min=0- or NA values; then

data<-filter(data,fatalities > 0 | injuries > 0 | propdmg > 0 | cropdmg > 0)

3.7- Figuring out total damages in dolars

Values of the exponent factor: We can see it in propdmgexp and cropdmgexp:

summary(data)

##     bgndate              evtype            fatalities       
##  Min.   :1996-01-01   Length:201318      Min.   :  0.00000  
##  1st Qu.:2000-06-21   Class :character   1st Qu.:  0.00000  
##  Median :2005-04-22   Mode  :character   Median :  0.00000  
##  Mean   :2004-09-25                      Mean   :  0.04337  
##  3rd Qu.:2009-02-11                      3rd Qu.:  0.00000  
##  Max.   :2011-11-30                      Max.   :158.00000  
##                                                             
##     injuries           propdmg          propdmgexp        cropdmg      
##  Min.   :   0.000   Min.   :   0.00   K      :185474   Min.   :  0.00  
##  1st Qu.:   0.000   1st Qu.:   2.00   M      :  7365   1st Qu.:  0.00  
##  Median :   0.000   Median :   6.00   B      :    31   Median :  0.00  
##  Mean   :   0.288   Mean   :  37.96   -      :     0   Mean   :  5.97  
##  3rd Qu.:   0.000   3rd Qu.:  25.00   ?      :     0   3rd Qu.:  0.00  
##  Max.   :1150.000   Max.   :5000.00   (Other):     0   Max.   :990.00  
##                                       NA's   :  8448                   
##    cropdmgexp         refnum      
##  K      : 96787   Min.   :248768  
##  M      :  1762   1st Qu.:394793  
##  B      :     2   Median :570208  
##  ?      :     0   Mean   :573211  
##  0      :     0   3rd Qu.:749106  
##  (Other):     0   Max.   :902260  
##  NA's   :102767

Seeing the above summary of the data, we can observe that the values for exp variable are NA,K,M,B. This values have consistency, because they aren’t any NA in exp with positive values in propdmg or cropdmg (189268 and 18691 obs respectively). But there are, however, a lot of values of exp associated with values = 0 in propdmg or cropdmg. We assume that this is a typo that doesn’t affect to the calculation.

summarize(group_by(filter(data,propdmg>0),propdmgexp),num=n())

## Source: local data frame [3 x 2]
## 
##   propdmgexp    num
## 1          B     31
## 2          K 181873
## 3          M   7364

summarize(group_by(filter(data,cropdmg>0),cropdmgexp),num=n())

## Source: local data frame [3 x 2]
## 
##   cropdmgexp   num
## 1          B     2
## 2          K 16994
## 3          M  1695

There isn’t any strange value of exp. So, we can operate like the codebook says:

First, let’s create a table with the values of exp.

table_exp<-data.frame(exp=c(NA,"K","M","B"),imp=c(0,10^3,10^6,10^9))

Adding new columns for the import of costs of propdmg and cropdmg, and for total cost.

data$propexp<-table_exp[match(data$propdmgexp,table_exp$exp),2]
data$propimp<-data$propdmg*data$propexp
data$cropexp<-table_exp[match(data$cropdmgexp,table_exp$exp),2]
data$cropimp<-data$cropdmg*data$cropexp
data$econ.cost<-data$propimp+data$cropimp

Finally, let’s select the relevant variables

data<-select(data,bgndate,evtype,fatalities,injuries,econ.cost,refnum)

And the dataset is:

head(data)

##      bgndate       evtype fatalities injuries econ.cost refnum
## 1 1996-01-06 WINTER STORM          0        0    418000 248768
## 2 1996-01-11      TORNADO          0        0    100000 248769
## 3 1996-01-11    TSTM WIND          0        0      3000 248770
## 4 1996-01-11    TSTM WIND          0        0      5000 248771
## 5 1996-01-11    TSTM WIND          0        0      2000 248772
## 6 1996-01-18    HIGH WIND          0        0    400000 248774

3.8- Cleaning the values of evtype variable

3.8.1 The “good list” and the goal of this part

There is a list of types of events, that is considered “the good list”:Pag 6,Table 1 hear. The goal list is written in the next vector:

evtype.good<- c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood",
                "Cold/Wind Chill", "Debris Flow", "Dense Fog","Dense Smoke", "Drought",
                "Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill",
                "Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail",
                "Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane (Typhoon)",
                "Ice Storm","Lake-Effect Snow","Lakeshore Flood","Lightning","Marine Hail",
                "Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind",
                "Rip Current","Seiche","Sleet","Storm Surge/Tide","Strong Wind",
                "Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm",
                "Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm",
                "Winter Weather")

Adjusting the values of the list with the tolower and make.names functions:

data$evtype<-tolower(make.names(data$evtype))
evtype.good<-tolower(make.names(evtype.good))

The list above will be the goal of the classification of evtypes. We must assign one of the values of this list to each of the values of evtype. For doing it we create an auxiliar data frame(evtype_raw) with two columns:

evtype : each one of the types of evtype.
evtype.good : evtype ( from the list of evtype.good) that will replace the evtype raw. This column is empty before the analysis.

evtype_raw<-summarize(group_by(data,evtype))
evtype_raw$evtype.good <-as.character(rep("",nrow(evtype_raw)))

The number of evtypes to classify is 183

3.8.2. First step: evtype = evtype.good.
The first classification is easy: Assign the evtype.good that was exactly the same that evtype.

evtype_raw$evtype.good<-evtype.good[match(evtype_raw$evtype,evtype.good)]
nr<-nrow(evtype_raw[is.na(evtype_raw$evtype.good),])

After this step it remains 138 evtype without class.

3.8.3. Second step: classification with “patterns”

We will use the grepl function and regex -Regular expressions- like patterns. Instead of use one grepl for each pattern, we create a data frame (evtype_assign ) with all the patterns. After that, we do a loop, passing each element of this data frame by the grepl function.

a) Creating the data frame with the “patterns”

pattern: Regular expression to pass by grepl.If evtype is associated with the pattern, it is considered.
no.pattern: This column is a negative pattern. If evtype is not associated wiht the pattern, It will be considered
evtype.good: evtype to replace to all the evtypes that are considered.

pattern<-c("coastal|flood","cold|wind.chill","dense|fog","dust|storm",                                         
"excessive|heat|glaze|warm|hypert","extreme|cold|hypot|unreas","flash|flood",
"flood|drowning|dam","freezing|fog","frost|freeze|wintry|icy","hail","heavy|rain|urban",                                   
"heavy|snow|late","high|surf|rough|rogue|d.and.w|f.and.w|beach","high|wind",
"hurricane|typhoon","ice|storm","marine|strong.wind","marine|thunderstorm",
"rip|current","storm|surge.tide","gradient|wind|non.tstm","strong.winds",
"thunderstorm|wind|burst","winter|weather","astronomical",
"other|spout|slump|slide|precip","fire")  


no.pattern<-c("flash|river|tidal","extreme|snow|unseason|weather","X",                                                                    
"coastal|storm|thunder","snow","X","castal|erosion|ice|river|tidal",
"castal|erosion|ice|flash","fog|rain","X","tstm","shower|snow|surf",
"surf","astrono|wind",
"astrono|extreme|gradient|marine|rain|seas|strong|surf|swell|thunder|tstm|water|wave",
"X","coas|falling|surge|thunder","accident|g.w.|tstm","g40","X","coastal|thunder",
"X","X",
"damage|extreme|gradient|gusty|high|marine|non|winds|severe|strong|surf|weave|^wind",
"X","X","X","X")
  
evtype.good.a<- c("coastal.flood","cold.wind.chill", "dense.fog",       
"dust.storm","excessive.heat","extreme.cold.wind.chill", 
"flash.flood","flood","freezing.fog",            
"frost.freeze","hail","heavy.rain",              
"heavy.snow","high.surf","high.wind",               
"hurricane..typhoon.","ice.storm","marine.strong.wind",      
"marine.thunderstorm.wind", "rip.current","storm.surge.tide",        
"strong.wind","strong.wind","thunderstorm.wind",       
"winter.weather","astronomical.low.tide","debris.flow","wildfire")  
   

evtype_assign<-data.frame(pattern,no.pattern,evtype.good.a,stringsAsFactors = F)

The patterns table is displayed now, where we can see, in a clear format, the patterns and no patterns to use in the next paragraph b)

xt<-xtable(evtype_assign)
print(xt,type="html")

	pattern	no.pattern	evtype.good.a
1	coastal\|flood	flash\|river\|tidal	coastal.flood
2	cold\|wind.chill	extreme\|snow\|unseason\|weather	cold.wind.chill
3	dense\|fog	X	dense.fog
4	dust\|storm	coastal\|storm\|thunder	dust.storm
5	excessive\|heat\|glaze\|warm\|hypert	snow	excessive.heat
6	extreme\|cold\|hypot\|unreas	X	extreme.cold.wind.chill
7	flash\|flood	castal\|erosion\|ice\|river\|tidal	flash.flood
8	flood\|drowning\|dam	castal\|erosion\|ice\|flash	flood
9	freezing\|fog	fog\|rain	freezing.fog
10	frost\|freeze\|wintry\|icy	X	frost.freeze
11	hail	tstm	hail
12	heavy\|rain\|urban	shower\|snow\|surf	heavy.rain
13	heavy\|snow\|late	surf	heavy.snow
14	high\|surf\|rough\|rogue\|d.and.w\|f.and.w\|beach	astrono\|wind	high.surf
15	high\|wind	astrono\|extreme\|gradient\|marine\|rain\|seas\|strong\|surf\|swell\|thunder\|tstm\|water\|wave	high.wind
16	hurricane\|typhoon	X	hurricane..typhoon.
17	ice\|storm	coas\|falling\|surge\|thunder	ice.storm
18	marine\|strong.wind	accident\|g.w.\|tstm	marine.strong.wind
19	marine\|thunderstorm	g40	marine.thunderstorm.wind
20	rip\|current	X	rip.current
21	storm\|surge.tide	coastal\|thunder	storm.surge.tide
22	gradient\|wind\|non.tstm	X	strong.wind
23	strong.winds	X	strong.wind
24	thunderstorm\|wind\|burst	damage\|extreme\|gradient\|gusty\|high\|marine\|non\|winds\|severe\|strong\|surf\|weave\|^wind	thunderstorm.wind
25	winter\|weather	X	winter.weather
26	astronomical	X	astronomical.low.tide
27	other\|spout\|slump\|slide\|precip	X	debris.flow
28	fire	X	wildfire

’

b) Passing each pattern by the grepl function

evtype_raw1<-filter(evtype_raw,!is.na(evtype.good))
evtype_raw2<-filter(evtype_raw,is.na(evtype.good))

for (i in 1:nrow(evtype_assign)){
   x<- nrow(evtype_raw2[grepl(evtype_assign[i,1],evtype_raw2$evtype) &
                        !grepl(evtype_assign[i,2],evtype_raw2$evtype),2])
   a<- rep(evtype_assign[i,3],x)
   evtype_raw2[grepl(evtype_assign[i,1],evtype_raw2$evtype) &
                   !grepl(evtype_assign[i,2],evtype_raw2$evtype),2]<-a
}
evtype_raw<-rbind(evtype_raw1,evtype_raw2)

nr<-nrow(evtype_raw[is.na(evtype_raw$evtype.good),])

After this step it remains 0 evtype without class.

3.8.4. Updating evtype in data

Now the evtype_raw table is complete -all the evtype have one class assigned-. We can see the result listing the 183 rows of this table.

Finally, we will replace each of the raw evtype with the good evtype assigned.

evtype_raw<-as.data.frame(evtype_raw) 
data$evtype<-evtype_raw[match(data$evtype,evtype_raw$evtype),2]

Now, the data is a tidy data. Let’s see a sample:

head(data)

##      bgndate            evtype fatalities injuries econ.cost refnum
## 1 1996-01-06      winter.storm          0        0    418000 248768
## 2 1996-01-11           tornado          0        0    100000 248769
## 3 1996-01-11 thunderstorm.wind          0        0      3000 248770
## 4 1996-01-11 thunderstorm.wind          0        0      5000 248771
## 5 1996-01-11 thunderstorm.wind          0        0      2000 248772
## 6 1996-01-18         high.wind          0        0    400000 248774

Analysing data

First, we will aggregate the data by type of event, obtaining the totals for each one of the relevant variables:

fat_tot<-sum(data$fatalities)
inj_tot<-sum(data$injuries)
eco_tot<-sum(data$econ.cost)/1000000
data_ag<-summarize(group_by(data,evtype),count=n(),
                   fatalities=sum(fatalities),
                   fat.perc.s.Tot=round(fatalities*100/fat_tot,digit=2),
                   injuries=sum(injuries),
                   inj.perc.s.Tot=round(injuries*100/inj_tot,digit=2),
                   econ.cost=sum(econ.cost)/1000000,
                   eco.perc.s.Tot=round(econ.cost*100/eco_tot),digit=2)

Fixing the labels of eventype for the tables and plots in the results.

data_ag$evtype=sub("\\."," ",data_ag$evtype)  
data_ag$evtype=sub("\\."," ",data_ag$evtype)  
data_ag$evtype=sub("^cold wind chill","cold/wind chill",data_ag$evtype)  
data_ag$evtype=sub("extreme cold wind.chill","extreme cold/wind chill",data_ag$evtype)  
data_ag$evtype=sub("frost freeze","frost/frize",data_ag$evtype) 
data_ag$evtype=sub("hurricane  typhoon.","hurricane (typhoon)",data_ag$evtype) 
data_ag$evtype=sub("storm surge tide","storm/surge tide",data_ag$evtype)

Arranging the data by the three variable required in the rubric: fatalities, injuries and economic.cost.

arrange1<-arrange(data_ag,desc(fatalities))
arrange2<-arrange(data_ag,desc(injuries))
arrange3<-arrange(data_ag,desc(econ.cost))

Creating the lists of the more 10 harmful events.

list1<-arrange1[1:10,c(1,2,3,4)]
list2<-arrange2[1:10,c(1,2,5,6)]
list3<-arrange3[1:10,c(1,2,7,8)]

Finally, we create the three relevant plots.

# rearranging 
arrange1<-arrange(data_ag,fatalities)
arrange2<-arrange(data_ag,injuries)
arrange3<-arrange(data_ag,econ.cost)
# plot 1
data_ag1<-filter(data_ag,fatalities>1)
data_ag1$evtype<-factor(data_ag1$evtype,levels=arrange1$evtype)
p1<-ggplot(data_ag1,aes(x=evtype,y=fatalities))+
    geom_bar(stat="identity")+
    coord_flip()+
    ylab("Number of fatalities")+
    xlab("Type of event")+
    ggtitle("Type of storm ordered by fatalities produced. \n USA: 1971-2011") + 
    theme(plot.title = element_text(lineheight=.7, face="bold"))
# plot 2
data_ag2<-filter(data_ag,injuries>1)
data_ag2$evtype<-factor(data_ag2$evtype,levels=arrange2$evtype)
p2<-ggplot(data_ag2,aes(x=evtype,y=injuries))+
    geom_bar(stat="identity")+
    coord_flip()+
    ylab("number of injuries")+
    xlab("Type of event")+
    ggtitle("Type of storm ordered by injuries produced. \n USA: 1971-2011") + 
    theme(plot.title = element_text(lineheight=.7, face="bold"))
# plot 3
data_ag3<-filter(data_ag,econ.cost>1)
data_ag3$evtype<-factor(data_ag3$evtype,levels=arrange3$evtype)
p3<-ggplot(data_ag3,aes(x=evtype,y=econ.cost))+
    geom_bar(stat="identity")+
    coord_flip()+
    ylab("Economic costs ( Millions of $)")+
    xlab("Type of event")+
    ggtitle("Type of storm ordered by economic costs produced. \n USA: 1996-2011") + 
    theme(plot.title = element_text(lineheight=.7, face="bold"))

Results

After analising the dataset of the whole of the severe weather events happened across the United States in the period january of 1996 to November 2011, the following conclussions can be drown: They will be split in two parts, answering the two questions of the requirements of the study:

1- Type of Events more harmful with respect to population health

The variables that explaino the population health are the number of fatalities and the number of injuries. In terms of each one, the results are:

Table 1: List of the most harmful events in terms of fatalities.

print(list1)

## Source: local data frame [10 x 4]
## 
##                     evtype  count fatalities fat.perc.s.Tot
## 1           excessive heat    709       1801          20.63
## 2                  tornado  12366       1511          17.30
## 3              flash flood  19014        887          10.16
## 4                lightning  11152        651           7.46
## 5              rip current    603        542           6.21
## 6                    flood   9641        420           4.81
## 7        thunderstorm wind 105455        383           4.39
## 8  extreme cold/wind chill    304        261           2.99
## 9                     heat    164        237           2.71
## 10               high wind   5402        235           2.69

Plot 1:

print(p1)

Table 2: List of the most harmful events in terms of injuries.

print(list2)

## Source: local data frame [10 x 4]
## 
##                 evtype  count injuries inj.perc.s.Tot
## 1              tornado  12366    20667          35.65
## 2                flood   9641     6760          11.66
## 3       excessive heat    709     6690          11.54
## 4    thunderstorm wind 105455     5154           8.89
## 5            lightning  11152     4141           7.14
## 6          flash flood  19014     1674           2.89
## 7             wildfire   1229     1458           2.51
## 8  hurricane (typhoon)    208     1328           2.29
## 9         winter storm   1460     1292           2.23
## 10                heat    164     1222           2.11

Plot 2:

print(p2)

As the last tables and plots say, the types of events with the most harmful consequences for the population health are the Excessive Heat, the Tornado and the Flash flood in the case of fatalities, and the Tornado, the Flood and the Excessive heat in the case of the injuries. The first group rise the 48,07 % of the total fatalities and the second the 58,85 % of the total injuries in the period.

2- Type of Events with the greatest economic consequences.

The variable associated to the economic consequences is the total economic costs.In terms of this one the results are:

Table 3: List of events with the most economic consequences (Economic costs in Millions of $)

print(list3)

## Source: local data frame [10 x 4]
## 
##                 evtype  count econ.cost eco.perc.s.Tot
## 1  hurricane (typhoon)    208 87068.997             30
## 2     storm/surge tide    216 47835.579             17
## 3                flood   9641 34295.768             12
## 4              tornado  12366 24900.371              9
## 5                 hail  22690 17092.036              6
## 6          flash flood  19014 16557.171              6
## 7              drought    258 14413.667              5
## 8    thunderstorm wind 105455  8932.315              3
## 9       tropical storm    410  8320.187              3
## 10            wildfire   1229  8162.705              3

Plot 3:

print(p3)

According to this table and plot, the types of events with the most economic costs are the Hurricane (Typhoon), the Storm Surge/Tide and the Flood. This ones have caused the 59.02 % of the total costs of the storms in the period.

Conclussions

The adequate prioritization of economic resources, to decide the best politics to prevent severe weather events, should have to consider the order defined in the above results.