This document contains a Exploratory Data Analysis of a dataset from NOAA Storm database across United States in the period 1996-2011. The goal is getting a ranking of the types of weather events, according to the damage produced by each one, in terms of health and economic consequencies. This ranking will be able to answer to two questions: One, which are the most harmful weather events in terms of population health? And two, which are the most harmful in terms of economic consequencies? The answer to the first question is a group of four events, Excessive Heat, Tornado, Flash flood and Flood. The answer to the second is a group of three events, Hurricane (Typhoon), Storm Surge/Tide and Flood.
The libraries that require installation in R are: dplyr,reshape2,xtable and ggplot2.
library(dplyr)
library(reshape2)
library(xtable)
library(ggplot2)
The data is downloaded from the course web site:(https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). After, the data is readed in R directly from a bz2 file.
if (!file.exists("./data")) {
dir.create("data")
}
url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
zip2file <- "./data/StormData.csv.bz2"
if (!file.exists(zip2file)){
download.file(url,zip2file)
write.csv(date(),"./data/date_download.txt")
}
if (!exists("data_raw")){
data_raw<-read.csv(zip2file,colClasses="character",stringsAsFactors = FALSE,
na.strings=c("NA",""),header= TRUE) # bz2 file read directly
}
data<-data_raw
Date (col 2),Event Type (col 8),problems of the event (cols 23 to 28),remarks and ref (cols 36-37), are selected.
data<-data[c(2,8,23:28,36:37)]
Adjusting the variable names to regular names.
names(data)<-tolower(names(data)) # lowercase
names(data)<-gsub("_","",names(data)) # not points
data$bgndate<-as.Date(data$bgndate,"%m/%d/%Y")
data$fatalities<-as.integer(data$fatalities)
data$injuries<-as.integer(data$injuries)
data$propdmg<-as.numeric(data$propdmg)
data$propdmgexp<-as.factor(data$propdmgexp)
data$cropdmg<-as.numeric(data$cropdmg)
data$cropdmgexp<-as.factor(data$cropdmgexp)
data$refnum<-as.integer(data$refnum)
Dates before contain incomplete records, NOAA website. Dropping them we can reduce the observations from 902,297 in raw data to 653,530.
data<-filter(data,bgndate>="1996-01-01")
# splitting remarks by eficiency
data_remarks<-data[c(9,10)]
data<-select(data,-remarks)
The event that has the maximum amount of $ ( Ref = 605953 ) has a wrong data: It has a “B” into the propdmgexp column, when it is actually a “M”.In the current database of NOAA is corrected.
data[data$refnum == 605943,6]<-"M"
Dropping rows that does no effects, neither health damages nor economics. This allows to reduce observations from 653,530 to 201,318.
summary(data)
## bgndate evtype fatalities
## Min. :1996-01-01 Length:653530 Min. : 0.00000
## 1st Qu.:2000-11-21 Class :character 1st Qu.: 0.00000
## Median :2005-05-14 Mode :character Median : 0.00000
## Mean :2004-10-25 Mean : 0.01336
## 3rd Qu.:2008-08-22 3rd Qu.: 0.00000
## Max. :2011-11-30 Max. :158.00000
##
## injuries propdmg propdmgexp cropdmg
## Min. :0.00e+00 Min. : 0.00 K :369938 Min. : 0.000
## 1st Qu.:0.00e+00 1st Qu.: 0.00 M : 7375 1st Qu.: 0.000
## Median :0.00e+00 Median : 0.00 B : 31 Median : 0.000
## Mean :8.87e-02 Mean : 11.69 0 : 1 Mean : 1.839
## 3rd Qu.:0.00e+00 3rd Qu.: 1.00 - : 0 3rd Qu.: 0.000
## Max. :1.15e+03 Max. :5000.00 (Other): 0 Max. :990.000
## NA's :276185
## cropdmgexp refnum
## K :278686 Min. :248768
## M : 1771 1st Qu.:412150
## B : 4 Median :575533
## ? : 0 Mean :575533
## 0 : 0 3rd Qu.:738915
## (Other): 0 Max. :902297
## NA's :373069
in the four effect variables there are not either negative values -min=0- or NA values; then
data<-filter(data,fatalities > 0 | injuries > 0 | propdmg > 0 | cropdmg > 0)
Values of the exponent factor: We can see it in propdmgexp and cropdmgexp:
summary(data)
## bgndate evtype fatalities
## Min. :1996-01-01 Length:201318 Min. : 0.00000
## 1st Qu.:2000-06-21 Class :character 1st Qu.: 0.00000
## Median :2005-04-22 Mode :character Median : 0.00000
## Mean :2004-09-25 Mean : 0.04337
## 3rd Qu.:2009-02-11 3rd Qu.: 0.00000
## Max. :2011-11-30 Max. :158.00000
##
## injuries propdmg propdmgexp cropdmg
## Min. : 0.000 Min. : 0.00 K :185474 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 2.00 M : 7365 1st Qu.: 0.00
## Median : 0.000 Median : 6.00 B : 31 Median : 0.00
## Mean : 0.288 Mean : 37.96 - : 0 Mean : 5.97
## 3rd Qu.: 0.000 3rd Qu.: 25.00 ? : 0 3rd Qu.: 0.00
## Max. :1150.000 Max. :5000.00 (Other): 0 Max. :990.00
## NA's : 8448
## cropdmgexp refnum
## K : 96787 Min. :248768
## M : 1762 1st Qu.:394793
## B : 2 Median :570208
## ? : 0 Mean :573211
## 0 : 0 3rd Qu.:749106
## (Other): 0 Max. :902260
## NA's :102767
Seeing the above summary of the data, we can observe that the values for exp variable are NA,K,M,B. This values have consistency, because they aren’t any NA in exp with positive values in propdmg or cropdmg (189268 and 18691 obs respectively). But there are, however, a lot of values of exp associated with values = 0 in propdmg or cropdmg. We assume that this is a typo that doesn’t affect to the calculation.
summarize(group_by(filter(data,propdmg>0),propdmgexp),num=n())
## Source: local data frame [3 x 2]
##
## propdmgexp num
## 1 B 31
## 2 K 181873
## 3 M 7364
summarize(group_by(filter(data,cropdmg>0),cropdmgexp),num=n())
## Source: local data frame [3 x 2]
##
## cropdmgexp num
## 1 B 2
## 2 K 16994
## 3 M 1695
There isn’t any strange value of exp. So, we can operate like the codebook says:
First, let’s create a table with the values of exp.
table_exp<-data.frame(exp=c(NA,"K","M","B"),imp=c(0,10^3,10^6,10^9))
Adding new columns for the import of costs of propdmg and cropdmg, and for total cost.
data$propexp<-table_exp[match(data$propdmgexp,table_exp$exp),2]
data$propimp<-data$propdmg*data$propexp
data$cropexp<-table_exp[match(data$cropdmgexp,table_exp$exp),2]
data$cropimp<-data$cropdmg*data$cropexp
data$econ.cost<-data$propimp+data$cropimp
Finally, let’s select the relevant variables
data<-select(data,bgndate,evtype,fatalities,injuries,econ.cost,refnum)
And the dataset is:
head(data)
## bgndate evtype fatalities injuries econ.cost refnum
## 1 1996-01-06 WINTER STORM 0 0 418000 248768
## 2 1996-01-11 TORNADO 0 0 100000 248769
## 3 1996-01-11 TSTM WIND 0 0 3000 248770
## 4 1996-01-11 TSTM WIND 0 0 5000 248771
## 5 1996-01-11 TSTM WIND 0 0 2000 248772
## 6 1996-01-18 HIGH WIND 0 0 400000 248774
`
3.8.1 The “good list” and the goal of this part
There is a list of types of events, that is considered “the good list”:Pag 6,Table 1 hear. The goal list is written in the next vector:
evtype.good<- c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood",
"Cold/Wind Chill", "Debris Flow", "Dense Fog","Dense Smoke", "Drought",
"Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill",
"Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail",
"Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane (Typhoon)",
"Ice Storm","Lake-Effect Snow","Lakeshore Flood","Lightning","Marine Hail",
"Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind",
"Rip Current","Seiche","Sleet","Storm Surge/Tide","Strong Wind",
"Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm",
"Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm",
"Winter Weather")
Adjusting the values of the list with the tolower and make.names functions:
data$evtype<-tolower(make.names(data$evtype))
evtype.good<-tolower(make.names(evtype.good))
The list above will be the goal of the classification of evtypes. We must assign one of the values of this list to each of the values of evtype. For doing it we create an auxiliar data frame(evtype_raw) with two columns:
evtype_raw<-summarize(group_by(data,evtype))
evtype_raw$evtype.good <-as.character(rep("",nrow(evtype_raw)))
The number of evtypes to classify is 183
3.8.2. First step: evtype = evtype.good.
The first classification is easy: Assign the evtype.good that was exactly the same that evtype.
evtype_raw$evtype.good<-evtype.good[match(evtype_raw$evtype,evtype.good)]
nr<-nrow(evtype_raw[is.na(evtype_raw$evtype.good),])
After this step it remains 138 evtype without class.
3.8.3. Second step: classification with “patterns”
We will use the grepl function and regex -Regular expressions- like patterns. Instead of use one grepl for each pattern, we create a data frame (evtype_assign ) with all the patterns. After that, we do a loop, passing each element of this data frame by the grepl function.
a) Creating the data frame with the “patterns”
pattern<-c("coastal|flood","cold|wind.chill","dense|fog","dust|storm",
"excessive|heat|glaze|warm|hypert","extreme|cold|hypot|unreas","flash|flood",
"flood|drowning|dam","freezing|fog","frost|freeze|wintry|icy","hail","heavy|rain|urban",
"heavy|snow|late","high|surf|rough|rogue|d.and.w|f.and.w|beach","high|wind",
"hurricane|typhoon","ice|storm","marine|strong.wind","marine|thunderstorm",
"rip|current","storm|surge.tide","gradient|wind|non.tstm","strong.winds",
"thunderstorm|wind|burst","winter|weather","astronomical",
"other|spout|slump|slide|precip","fire")
no.pattern<-c("flash|river|tidal","extreme|snow|unseason|weather","X",
"coastal|storm|thunder","snow","X","castal|erosion|ice|river|tidal",
"castal|erosion|ice|flash","fog|rain","X","tstm","shower|snow|surf",
"surf","astrono|wind",
"astrono|extreme|gradient|marine|rain|seas|strong|surf|swell|thunder|tstm|water|wave",
"X","coas|falling|surge|thunder","accident|g.w.|tstm","g40","X","coastal|thunder",
"X","X",
"damage|extreme|gradient|gusty|high|marine|non|winds|severe|strong|surf|weave|^wind",
"X","X","X","X")
evtype.good.a<- c("coastal.flood","cold.wind.chill", "dense.fog",
"dust.storm","excessive.heat","extreme.cold.wind.chill",
"flash.flood","flood","freezing.fog",
"frost.freeze","hail","heavy.rain",
"heavy.snow","high.surf","high.wind",
"hurricane..typhoon.","ice.storm","marine.strong.wind",
"marine.thunderstorm.wind", "rip.current","storm.surge.tide",
"strong.wind","strong.wind","thunderstorm.wind",
"winter.weather","astronomical.low.tide","debris.flow","wildfire")
evtype_assign<-data.frame(pattern,no.pattern,evtype.good.a,stringsAsFactors = F)
The patterns table is displayed now, where we can see, in a clear format, the patterns and no patterns to use in the next paragraph b)
xt<-xtable(evtype_assign)
print(xt,type="html")
| pattern | no.pattern | evtype.good.a | |
|---|---|---|---|
| 1 | coastal|flood | flash|river|tidal | coastal.flood |
| 2 | cold|wind.chill | extreme|snow|unseason|weather | cold.wind.chill |
| 3 | dense|fog | X | dense.fog |
| 4 | dust|storm | coastal|storm|thunder | dust.storm |
| 5 | excessive|heat|glaze|warm|hypert | snow | excessive.heat |
| 6 | extreme|cold|hypot|unreas | X | extreme.cold.wind.chill |
| 7 | flash|flood | castal|erosion|ice|river|tidal | flash.flood |
| 8 | flood|drowning|dam | castal|erosion|ice|flash | flood |
| 9 | freezing|fog | fog|rain | freezing.fog |
| 10 | frost|freeze|wintry|icy | X | frost.freeze |
| 11 | hail | tstm | hail |
| 12 | heavy|rain|urban | shower|snow|surf | heavy.rain |
| 13 | heavy|snow|late | surf | heavy.snow |
| 14 | high|surf|rough|rogue|d.and.w|f.and.w|beach | astrono|wind | high.surf |
| 15 | high|wind | astrono|extreme|gradient|marine|rain|seas|strong|surf|swell|thunder|tstm|water|wave | high.wind |
| 16 | hurricane|typhoon | X | hurricane..typhoon. |
| 17 | ice|storm | coas|falling|surge|thunder | ice.storm |
| 18 | marine|strong.wind | accident|g.w.|tstm | marine.strong.wind |
| 19 | marine|thunderstorm | g40 | marine.thunderstorm.wind |
| 20 | rip|current | X | rip.current |
| 21 | storm|surge.tide | coastal|thunder | storm.surge.tide |
| 22 | gradient|wind|non.tstm | X | strong.wind |
| 23 | strong.winds | X | strong.wind |
| 24 | thunderstorm|wind|burst | damage|extreme|gradient|gusty|high|marine|non|winds|severe|strong|surf|weave|^wind | thunderstorm.wind |
| 25 | winter|weather | X | winter.weather |
| 26 | astronomical | X | astronomical.low.tide |
| 27 | other|spout|slump|slide|precip | X | debris.flow |
| 28 | fire | X | wildfire |
’
b) Passing each pattern by the grepl function
evtype_raw1<-filter(evtype_raw,!is.na(evtype.good))
evtype_raw2<-filter(evtype_raw,is.na(evtype.good))
for (i in 1:nrow(evtype_assign)){
x<- nrow(evtype_raw2[grepl(evtype_assign[i,1],evtype_raw2$evtype) &
!grepl(evtype_assign[i,2],evtype_raw2$evtype),2])
a<- rep(evtype_assign[i,3],x)
evtype_raw2[grepl(evtype_assign[i,1],evtype_raw2$evtype) &
!grepl(evtype_assign[i,2],evtype_raw2$evtype),2]<-a
}
evtype_raw<-rbind(evtype_raw1,evtype_raw2)
nr<-nrow(evtype_raw[is.na(evtype_raw$evtype.good),])
After this step it remains 0 evtype without class.
3.8.4. Updating evtype in data
Now the evtype_raw table is complete -all the evtype have one class assigned-. We can see the result listing the 183 rows of this table.
Finally, we will replace each of the raw evtype with the good evtype assigned.
evtype_raw<-as.data.frame(evtype_raw)
data$evtype<-evtype_raw[match(data$evtype,evtype_raw$evtype),2]
Now, the data is a tidy data. Let’s see a sample:
head(data)
## bgndate evtype fatalities injuries econ.cost refnum
## 1 1996-01-06 winter.storm 0 0 418000 248768
## 2 1996-01-11 tornado 0 0 100000 248769
## 3 1996-01-11 thunderstorm.wind 0 0 3000 248770
## 4 1996-01-11 thunderstorm.wind 0 0 5000 248771
## 5 1996-01-11 thunderstorm.wind 0 0 2000 248772
## 6 1996-01-18 high.wind 0 0 400000 248774
First, we will aggregate the data by type of event, obtaining the totals for each one of the relevant variables:
fat_tot<-sum(data$fatalities)
inj_tot<-sum(data$injuries)
eco_tot<-sum(data$econ.cost)/1000000
data_ag<-summarize(group_by(data,evtype),count=n(),
fatalities=sum(fatalities),
fat.perc.s.Tot=round(fatalities*100/fat_tot,digit=2),
injuries=sum(injuries),
inj.perc.s.Tot=round(injuries*100/inj_tot,digit=2),
econ.cost=sum(econ.cost)/1000000,
eco.perc.s.Tot=round(econ.cost*100/eco_tot),digit=2)
Fixing the labels of eventype for the tables and plots in the results.
data_ag$evtype=sub("\\."," ",data_ag$evtype)
data_ag$evtype=sub("\\."," ",data_ag$evtype)
data_ag$evtype=sub("^cold wind chill","cold/wind chill",data_ag$evtype)
data_ag$evtype=sub("extreme cold wind.chill","extreme cold/wind chill",data_ag$evtype)
data_ag$evtype=sub("frost freeze","frost/frize",data_ag$evtype)
data_ag$evtype=sub("hurricane typhoon.","hurricane (typhoon)",data_ag$evtype)
data_ag$evtype=sub("storm surge tide","storm/surge tide",data_ag$evtype)
Arranging the data by the three variable required in the rubric: fatalities, injuries and economic.cost.
arrange1<-arrange(data_ag,desc(fatalities))
arrange2<-arrange(data_ag,desc(injuries))
arrange3<-arrange(data_ag,desc(econ.cost))
Creating the lists of the more 10 harmful events.
list1<-arrange1[1:10,c(1,2,3,4)]
list2<-arrange2[1:10,c(1,2,5,6)]
list3<-arrange3[1:10,c(1,2,7,8)]
Finally, we create the three relevant plots.
# rearranging
arrange1<-arrange(data_ag,fatalities)
arrange2<-arrange(data_ag,injuries)
arrange3<-arrange(data_ag,econ.cost)
# plot 1
data_ag1<-filter(data_ag,fatalities>1)
data_ag1$evtype<-factor(data_ag1$evtype,levels=arrange1$evtype)
p1<-ggplot(data_ag1,aes(x=evtype,y=fatalities))+
geom_bar(stat="identity")+
coord_flip()+
ylab("Number of fatalities")+
xlab("Type of event")+
ggtitle("Type of storm ordered by fatalities produced. \n USA: 1971-2011") +
theme(plot.title = element_text(lineheight=.7, face="bold"))
# plot 2
data_ag2<-filter(data_ag,injuries>1)
data_ag2$evtype<-factor(data_ag2$evtype,levels=arrange2$evtype)
p2<-ggplot(data_ag2,aes(x=evtype,y=injuries))+
geom_bar(stat="identity")+
coord_flip()+
ylab("number of injuries")+
xlab("Type of event")+
ggtitle("Type of storm ordered by injuries produced. \n USA: 1971-2011") +
theme(plot.title = element_text(lineheight=.7, face="bold"))
# plot 3
data_ag3<-filter(data_ag,econ.cost>1)
data_ag3$evtype<-factor(data_ag3$evtype,levels=arrange3$evtype)
p3<-ggplot(data_ag3,aes(x=evtype,y=econ.cost))+
geom_bar(stat="identity")+
coord_flip()+
ylab("Economic costs ( Millions of $)")+
xlab("Type of event")+
ggtitle("Type of storm ordered by economic costs produced. \n USA: 1996-2011") +
theme(plot.title = element_text(lineheight=.7, face="bold"))
After analising the dataset of the whole of the severe weather events happened across the United States in the period january of 1996 to November 2011, the following conclussions can be drown: They will be split in two parts, answering the two questions of the requirements of the study:
The variables that explaino the population health are the number of fatalities and the number of injuries. In terms of each one, the results are:
Table 1: List of the most harmful events in terms of fatalities.
print(list1)
## Source: local data frame [10 x 4]
##
## evtype count fatalities fat.perc.s.Tot
## 1 excessive heat 709 1801 20.63
## 2 tornado 12366 1511 17.30
## 3 flash flood 19014 887 10.16
## 4 lightning 11152 651 7.46
## 5 rip current 603 542 6.21
## 6 flood 9641 420 4.81
## 7 thunderstorm wind 105455 383 4.39
## 8 extreme cold/wind chill 304 261 2.99
## 9 heat 164 237 2.71
## 10 high wind 5402 235 2.69
Plot 1:
print(p1)
Table 2: List of the most harmful events in terms of injuries.
print(list2)
## Source: local data frame [10 x 4]
##
## evtype count injuries inj.perc.s.Tot
## 1 tornado 12366 20667 35.65
## 2 flood 9641 6760 11.66
## 3 excessive heat 709 6690 11.54
## 4 thunderstorm wind 105455 5154 8.89
## 5 lightning 11152 4141 7.14
## 6 flash flood 19014 1674 2.89
## 7 wildfire 1229 1458 2.51
## 8 hurricane (typhoon) 208 1328 2.29
## 9 winter storm 1460 1292 2.23
## 10 heat 164 1222 2.11
Plot 2:
print(p2)
As the last tables and plots say, the types of events with the most harmful consequences for the population health are the Excessive Heat, the Tornado and the Flash flood in the case of fatalities, and the Tornado, the Flood and the Excessive heat in the case of the injuries. The first group rise the 48,07 % of the total fatalities and the second the 58,85 % of the total injuries in the period.
The variable associated to the economic consequences is the total economic costs.In terms of this one the results are:
Table 3: List of events with the most economic consequences (Economic costs in Millions of $)
print(list3)
## Source: local data frame [10 x 4]
##
## evtype count econ.cost eco.perc.s.Tot
## 1 hurricane (typhoon) 208 87068.997 30
## 2 storm/surge tide 216 47835.579 17
## 3 flood 9641 34295.768 12
## 4 tornado 12366 24900.371 9
## 5 hail 22690 17092.036 6
## 6 flash flood 19014 16557.171 6
## 7 drought 258 14413.667 5
## 8 thunderstorm wind 105455 8932.315 3
## 9 tropical storm 410 8320.187 3
## 10 wildfire 1229 8162.705 3
Plot 3:
print(p3)
According to this table and plot, the types of events with the most economic costs are the Hurricane (Typhoon), the Storm Surge/Tide and the Flood. This ones have caused the 59.02 % of the total costs of the storms in the period.
The adequate prioritization of economic resources, to decide the best politics to prevent severe weather events, should have to consider the order defined in the above results.