This analysis identifies nature event types that are most harmful to population health and have greatest economic consequences. The harm was measured by fatality and injury numbers. Economic consequences were measured by property and crops damage. The original provided data has almost one thousand unique nature type of event names. Reason for that was mixing small and large cap letters, duplicate names and spelling errors. To make sure that this research remains reproducible no manual data cleaning was involved. Two algorithms were created in deployed three times each to clean EVTYPE variable data. The first algorithm based on grep(), the second based on amatch() functions. Among nature events tornadoes were on top of the list to cause harm to population health, flood type events were leaders in causing property damage.
loading libraries:
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Lets load data into R:
dat<-read.csv("repdata_data_StormData.csv.bz2")
Two questions to be answered by this data set: 1. Across the US which types of events ( as indicated in the EVTYPE variable) are most harmful with respect to population health? 2. Across the US, which types of events have the greatest economic consequences?
For the question 1, lets focus on variables FATALITIES AND INJURIES as representatives of harm to public health. For the question 2, PROPDMG, PROPDMGEXP and CROPDMG, CROPDMEXP are of interest. Of course variable EVTYPE is needed in both cases.
Original data file has 902297 lines To reduce amount of data that is loaded into memory, two data subsets were created: first, q1, is meant to address question 1, and the second data subset, q2, was aimed at question 2. Selected columns were described above. After creating q1 and q2 data sets, the original data set, dat was removed, to preserve computational memory.
q1<-select(dat, EVTYPE:INJURIES)
q1<-select(q1, -(BGN_RANGE:MAG))
q2<-select(dat, EVTYPE:CROPDMGEXP)
q2<-select(q2, -(BGN_RANGE:INJURIES))
rm(dat)
Both questions asks to qualify data versus type of events as described in variable EVTYPE. EVTYPE has 985 unique names in the original format. However, that number drops to 898 if letters are capitalized in texts describing event types. Therefore, EVTYPE column data was capitalized (the same letters but letters capitalized).
q1$EVTYPE<-toupper(q1$EVTYPE)
q2$EVTYPE<-toupper(q2$EVTYPE)
The last number, 898 is still too large to implement reasonable plotting. In the following section the report focuses on answering question 1: Across the US which types of events ( as indicated in the EVTYPE variable) are most harmful with respect to population health? Note, that we are not interested in events that were not harmful, that had zero fatalities or zero injuries. Therefore, to data sets are being created below, q1f and q1i, for fatalities and injuries, which had non-zero fatalities or injuries:
q1f<-filter(q1, FATALITIES>0)
q1i<-filter(q1, INJURIES>0)
There were total 1.514510^{4} fatalities and 1.4052810^{5} injuries. Next step is to group data by event type:
q1f0<-q1f %>%group_by(as.character(EVTYPE)) %>% summarise(totaln=sum(FATALITIES))
q1i0<-q1i %>%group_by(as.character(EVTYPE)) %>% summarise(totaln=sum(INJURIES))
names(q1f0)<-c("EVTYPE", "FATALITIES")
names(q1i0)<-c("EVTYPE", "INJURIES")
Here number of event types reduced to 160 and 150 for fatalities and injuries data sets.
There were two variables in original data set describing economic damage: property damage and crops damage (PROPDMG and CROPDMG, respectively). Here we only interested in events when either of those is non zero. New data set q20 contains the subset of economic damage data:
q20<-filter(q2, q2$PROPDMG!=0 | q2$CROPDMG!=0)
Two exponential variables PROPDMGEXP and CROPDMGEXP consist a multiplier in alphabetic form, K=1000, M=1000000 and B=1000000000 according to the manual provided with the data set. However, those variables contains messy data, there are many empty spaces, and question marks. Two new numerical variables will be created, named PROPM and CROPM, where unknowns “?” will be zeroed, and letters K, M, B replaced with their multiplier. Empty spaces will be filled with zeros, numerical multiplayers replaced accordingly:
q20$PROPM<-1
q20$PROPM[which(q20$PROPDMGEXP=="h")]<-100
q20$PROPM[which(q20$PROPDMGEXP=="H")]<-100
q20$PROPM[which(q20$PROPDMGEXP=="K")]<-1000
q20$PROPM[which(q20$PROPDMGEXP=="k")]<-1000
q20$PROPM[which(q20$PROPDMGEXP=="M")]<-1000000
q20$PROPM[which(q20$PROPDMGEXP=="m")]<-1000000
q20$PROPM[which(q20$PROPDMGEXP=="B")]<-1000000000
q20$PROPM[which(q20$PROPDMGEXP=="b")]<-1000000000
q20$PROPM[which(q20$PROPDMGEXP==1)]<-10
q20$PROPM[which(q20$PROPDMGEXP==2)]<-100
q20$PROPM[which(q20$PROPDMGEXP==3)]<-1000
q20$PROPM[which(q20$PROPDMGEXP==4)]<-10000
q20$PROPM[which(q20$PROPDMGEXP==5)]<-100000
q20$PROPM[which(q20$PROPDMGEXP==6)]<-1000000
q20$PROPM[which(q20$PROPDMGEXP==7)]<-10000000
q20$PROPM[which(q20$PROPDMGEXP==8)]<-100000000
q20$PROPM[which(q20$PROPDMGEXP==9)]<-1000000000
q20$CROPM<-1
q20$CROPM[which(q20$CROPDMGEXP=="h")]<-100
q20$CROPM[which(q20$CROPDMGEXP=="H")]<-100
q20$CROPM[which(q20$CROPDMGEXP=="K")]<-1000
q20$CROPM[which(q20$CROPDMGEXP=="k")]<-1000
q20$CROPM[which(q20$CROPDMGEXP=="M")]<-1000000
q20$CROPM[which(q20$CROPDMGEXP=="m")]<-1000000
q20$CROPM[which(q20$CROPDMGEXP=="B")]<-1000000000
q20$CROPM[which(q20$CROPDMGEXP=="b")]<-1000000000
q20$CROPM[which(q20$CROPDMGEXP==1)]<-10
q20$CROPM[which(q20$CROPDMGEXP==2)]<-100
q20$CROPM[which(q20$CROPDMGEXP==3)]<-1000
q20$CROPM[which(q20$CROPDMGEXP==4)]<-10000
q20$CROPM[which(q20$CROPDMGEXP==5)]<-100000
q20$CROPM[which(q20$CROPDMGEXP==6)]<-1000000
q20$CROPM[which(q20$CROPDMGEXP==7)]<-10000000
q20$CROPM[which(q20$CROPDMGEXP==8)]<-100000000
q20$CROPM[which(q20$CROPDMGEXP==9)]<-1000000000
Total cost, variable DAMAGE was calculated as the sum of property damage and crop damage, each multiplied with their corresponding multiplier:
q20$DAMAGE<-q20$PROPDMG*q20$PROPM+q20$CROPDMG*q20$CROPM
q2D0<-q20 %>%group_by(as.character(EVTYPE)) %>% summarise(total=sum(DAMAGE))
names(q2D0)<-c("EVTYPE", "DAMAGE")
Data cleaning will be required in EVTYPE data. Lines with “?” were deleted. Total 1 lines had that. That resulted reduction in total damage by: 5000
questionable_damage<-q2D0$DAMAGE[which(q2D0$EVTYPE=="?")]
q2D0<-q2D0[-which(q2D0$EVTYPE=="?"),]
As in FATALITIES and INJURIES scenarios, first step is apply grep function to reduce duplicate titles that contained the same strings. Prior to that cleaning, injuries data file contained 396 event types:
orig<-q2D0
c<-0
q2D00<-data.frame()
l<-length(orig$DAMAGE)
while (l>0){
c<-c+1
x<-orig$EVTYPE[1]
i<-grep(x, orig$EVTYPE)
q2D00[c,1]<-x
q2D00[c,2]<-sum(orig$DAMAGE[i])
orig<-orig[-i,]
l<-length(orig$DAMAGE)
rm(i)
}
names(q2D00)<-names(q2D0)
rm(orig, l, c)
Second step, using amatch function to combine event types that are different by spelling errors:
q2D0<-q2D00
rm(q2D00)
c<-0
cc<-0
l<-data.frame()
paired_eventsD<-as.character()
for (x in q2D0$EVTYPE){
cc<-cc+1
y<-q2D0[-(1:cc),]
linen<-which(q2D0$EVTYPE==x)
if (nchar(x)<6) {b<-1}
else {b<-2}
i<-amatch(x, y$EVTYPE, maxDist = b)
if (is.na(i)==FALSE){
c<-c+1
l[c,1]<-linen
l[c,2]<-i+cc
y0<-y$EVTYPE[i]
paired_eventsD[c]<-paste(x,'-', y0)
}
}
rm(linen, c, cc)
q2D00<-q2D0
q2D00$DAMAGE[l[,1]]<-q2D00$DAMAGE[l[,1]]+q2D00$DAMAGE[l[,2]]
q2D00<-q2D00[-(l[,2]),]
The last data cleaning algorithm did not handle well “THUNDEEERSTORM WINDS-THUNDERSTORM WIND-THUNDERSTORM WINDS-THUNDERTORM WINDS, ETC”. However, it handle reasonably well in 99.999872 percent scenarios defined as total damage prior to post data cleaning. The difference is minuscule. Here we end up with a DAMAGE data set, q2D00 that has 121 lines. Sorting it out descending order:
DAMAGE<-arrange(q2D00, desc(DAMAGE))
Here results presented in graphic format using stacked bar plots. An advantage in using stacked bars is visibility of the dominance by the specific event type. Data in plots were normalized to total fatalities, total injuries or total economic damage. Total numbers were taken from data that was saved prior to EVTYPE data cleaning. Event types that caused the most fatalities:
barplot(as.matrix(fatal$FATALITIES [1:10]/sum(q1f$FATALITIES)), col = heat.colors(10), legend.text=fatal$EVTYPE[1:10], ylab="Fraction of total fatalities", width=1, xlim=c(0,2), main="Most fatal nature events")
First 20 most fatal event types are printed out. They contributed to 94.1168703 percent of all fatalities in original data set:
fatal[1:20,]
## EVTYPE FATALITIES
## 1 TORNADO 5636
## 2 EXCESSIVE HEAT 1920
## 3 HEAT 1120
## 4 FLASH FLOOD 1035
## 5 LIGHTNING 817
## 6 RIP CURRENT 572
## 7 TSTM WIND 505
## 8 FLOOD 484
## 9 COLD 451
## 10 HIGH WIND 297
## 11 AVALANCE 225
## 12 WINTER STORM 216
## 13 THUNDERSTORM 200
## 14 HURRICANE 133
## 15 HEAVY SNOW 129
## 16 STRONG WIND 111
## 17 HIGH SURF 104
## 18 BLIZZARD 101
## 19 ICE 100
## 20 HEAVY RAIN 98
Events that cause the most of injuries:
barplot(as.matrix(injur$INJURIES [1:7]/sum(q1i$INJURIES)), col = heat.colors(7), legend.text=injur$EVTYPE[1:7], ylab="Fraction of total injuries", width=1, xlim=c(0,2), main="Nature events causing the most of injuries")
First 20 most fatal event types are printed out. They contributed to 97.4026529 percent of all fatalities in original data set:
injur[1:20,]
## EVTYPE INJURIES
## 1 TORNADO 91407
## 2 TSTM WIND 6961
## 3 FLOOD 6795
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5232
## 6 HEAT 2529
## 7 THUNDERSTORM 2451
## 8 ICE 2114
## 9 FLASH FLOOD 1802
## 10 HIGH 1683
## 11 HAIL 1467
## 12 WINTER STORM 1338
## 13 HURRICANE 1328
## 14 WILD FIRES 1061
## 15 HEAVY SNOW 1034
## 16 BLIZZARD 805
## 17 FOG 734
## 18 WILD/FOREST FIRE 545
## 19 WINTER WEATHER 538
## 20 RIP CURRENT 529
Events that cause the most economic damage:
barplot(as.matrix(DAMAGE$DAMAGE[1:7]/(questionable_damage+sum(q2D0$DAMAGE))), col = heat.colors(7), legend.text=DAMAGE$EVTYPE[1:7], ylab="Fraction of total damage", main="Nature events causing the most economic damage", width=1, xlim=c(0,2))
First 10 lines describe 90.917798 percent of all recorded damage:
DAMAGE[1:10,]
## EVTYPE DAMAGE
## 1 FLOOD 161022463609
## 2 HURRICANE 90161472810
## 3 TORNADO 57418279847
## 4 STORM SURGE 47965579000
## 5 HAIL 19134684410
## 6 FLASH FLOOD 19120989246
## 7 DROUGHT 15018927780
## 8 ICE 8989448410
## 9 TROPICAL STORM 8409286550
## 10 HIGH WINDS 6735939953