SYNOPSIS

This analysis identifies nature event types that are most harmful to population health and have greatest economic consequences. The harm was measured by fatality and injury numbers. Economic consequences were measured by property and crops damage. The original provided data has almost one thousand unique nature type of event names. Reason for that was mixing small and large cap letters, duplicate names and spelling errors. To make sure that this research remains reproducible no manual data cleaning was involved. Two algorithms were created in deployed three times each to clean EVTYPE variable data. The first algorithm based on grep(), the second based on amatch() functions. Among nature events tornadoes were on top of the list to cause harm to population health, flood type events were leaders in causing property damage.

loading libraries:

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

DATA PROCESSING

Lets load data into R:

dat<-read.csv("repdata_data_StormData.csv.bz2")

Two questions to be answered by this data set: 1. Across the US which types of events ( as indicated in the EVTYPE variable) are most harmful with respect to population health? 2. Across the US, which types of events have the greatest economic consequences?

For the question 1, lets focus on variables FATALITIES AND INJURIES as representatives of harm to public health. For the question 2, PROPDMG, PROPDMGEXP and CROPDMG, CROPDMEXP are of interest. Of course variable EVTYPE is needed in both cases.

Original data file has 902297 lines To reduce amount of data that is loaded into memory, two data subsets were created: first, q1, is meant to address question 1, and the second data subset, q2, was aimed at question 2. Selected columns were described above. After creating q1 and q2 data sets, the original data set, dat was removed, to preserve computational memory.

 q1<-select(dat, EVTYPE:INJURIES)
 q1<-select(q1, -(BGN_RANGE:MAG))
 q2<-select(dat, EVTYPE:CROPDMGEXP)
 q2<-select(q2, -(BGN_RANGE:INJURIES))
 rm(dat)

Both questions asks to qualify data versus type of events as described in variable EVTYPE. EVTYPE has 985 unique names in the original format. However, that number drops to 898 if letters are capitalized in texts describing event types. Therefore, EVTYPE column data was capitalized (the same letters but letters capitalized).

 q1$EVTYPE<-toupper(q1$EVTYPE)
 q2$EVTYPE<-toupper(q2$EVTYPE)

Question Nr1 data preparation

The last number, 898 is still too large to implement reasonable plotting. In the following section the report focuses on answering question 1: Across the US which types of events ( as indicated in the EVTYPE variable) are most harmful with respect to population health? Note, that we are not interested in events that were not harmful, that had zero fatalities or zero injuries. Therefore, to data sets are being created below, q1f and q1i, for fatalities and injuries, which had non-zero fatalities or injuries:

q1f<-filter(q1, FATALITIES>0)
q1i<-filter(q1, INJURIES>0)

There were total 1.514510^{4} fatalities and 1.4052810^{5} injuries. Next step is to group data by event type:

q1f0<-q1f %>%group_by(as.character(EVTYPE)) %>% summarise(totaln=sum(FATALITIES))
q1i0<-q1i %>%group_by(as.character(EVTYPE)) %>% summarise(totaln=sum(INJURIES))
names(q1f0)<-c("EVTYPE", "FATALITIES")
names(q1i0)<-c("EVTYPE", "INJURIES")

Here number of event types reduced to 160 and 150 for fatalities and injuries data sets.

Question Nr2. data preparation. Data cleaning for the events that show the greatest economic consequencies

There were two variables in original data set describing economic damage: property damage and crops damage (PROPDMG and CROPDMG, respectively). Here we only interested in events when either of those is non zero. New data set q20 contains the subset of economic damage data:

q20<-filter(q2, q2$PROPDMG!=0 | q2$CROPDMG!=0)

Two exponential variables PROPDMGEXP and CROPDMGEXP consist a multiplier in alphabetic form, K=1000, M=1000000 and B=1000000000 according to the manual provided with the data set. However, those variables contains messy data, there are many empty spaces, and question marks. Two new numerical variables will be created, named PROPM and CROPM, where unknowns “?” will be zeroed, and letters K, M, B replaced with their multiplier. Empty spaces will be filled with zeros, numerical multiplayers replaced accordingly:

q20$PROPM<-1
q20$PROPM[which(q20$PROPDMGEXP=="h")]<-100
q20$PROPM[which(q20$PROPDMGEXP=="H")]<-100
q20$PROPM[which(q20$PROPDMGEXP=="K")]<-1000
q20$PROPM[which(q20$PROPDMGEXP=="k")]<-1000
q20$PROPM[which(q20$PROPDMGEXP=="M")]<-1000000
q20$PROPM[which(q20$PROPDMGEXP=="m")]<-1000000
q20$PROPM[which(q20$PROPDMGEXP=="B")]<-1000000000
q20$PROPM[which(q20$PROPDMGEXP=="b")]<-1000000000
q20$PROPM[which(q20$PROPDMGEXP==1)]<-10
q20$PROPM[which(q20$PROPDMGEXP==2)]<-100
q20$PROPM[which(q20$PROPDMGEXP==3)]<-1000
q20$PROPM[which(q20$PROPDMGEXP==4)]<-10000
q20$PROPM[which(q20$PROPDMGEXP==5)]<-100000
q20$PROPM[which(q20$PROPDMGEXP==6)]<-1000000
q20$PROPM[which(q20$PROPDMGEXP==7)]<-10000000
q20$PROPM[which(q20$PROPDMGEXP==8)]<-100000000
q20$PROPM[which(q20$PROPDMGEXP==9)]<-1000000000
q20$CROPM<-1
q20$CROPM[which(q20$CROPDMGEXP=="h")]<-100
q20$CROPM[which(q20$CROPDMGEXP=="H")]<-100
q20$CROPM[which(q20$CROPDMGEXP=="K")]<-1000
q20$CROPM[which(q20$CROPDMGEXP=="k")]<-1000
q20$CROPM[which(q20$CROPDMGEXP=="M")]<-1000000
q20$CROPM[which(q20$CROPDMGEXP=="m")]<-1000000
q20$CROPM[which(q20$CROPDMGEXP=="B")]<-1000000000
q20$CROPM[which(q20$CROPDMGEXP=="b")]<-1000000000
q20$CROPM[which(q20$CROPDMGEXP==1)]<-10
q20$CROPM[which(q20$CROPDMGEXP==2)]<-100
q20$CROPM[which(q20$CROPDMGEXP==3)]<-1000
q20$CROPM[which(q20$CROPDMGEXP==4)]<-10000
q20$CROPM[which(q20$CROPDMGEXP==5)]<-100000
q20$CROPM[which(q20$CROPDMGEXP==6)]<-1000000
q20$CROPM[which(q20$CROPDMGEXP==7)]<-10000000
q20$CROPM[which(q20$CROPDMGEXP==8)]<-100000000
q20$CROPM[which(q20$CROPDMGEXP==9)]<-1000000000

Total cost, variable DAMAGE was calculated as the sum of property damage and crop damage, each multiplied with their corresponding multiplier:

q20$DAMAGE<-q20$PROPDMG*q20$PROPM+q20$CROPDMG*q20$CROPM
q2D0<-q20 %>%group_by(as.character(EVTYPE)) %>% summarise(total=sum(DAMAGE))
names(q2D0)<-c("EVTYPE", "DAMAGE")

Data cleaning will be required in EVTYPE data. Lines with “?” were deleted. Total 1 lines had that. That resulted reduction in total damage by: 5000

questionable_damage<-q2D0$DAMAGE[which(q2D0$EVTYPE=="?")]
q2D0<-q2D0[-which(q2D0$EVTYPE=="?"),]

As in FATALITIES and INJURIES scenarios, first step is apply grep function to reduce duplicate titles that contained the same strings. Prior to that cleaning, injuries data file contained 396 event types:

orig<-q2D0
c<-0
q2D00<-data.frame()
l<-length(orig$DAMAGE)
while (l>0){
  c<-c+1
  x<-orig$EVTYPE[1]
  i<-grep(x, orig$EVTYPE)
  q2D00[c,1]<-x
  q2D00[c,2]<-sum(orig$DAMAGE[i])
  orig<-orig[-i,]
  l<-length(orig$DAMAGE)
  rm(i)
} 
names(q2D00)<-names(q2D0)
rm(orig, l, c)

Second step, using amatch function to combine event types that are different by spelling errors:

q2D0<-q2D00
rm(q2D00)
c<-0
cc<-0
l<-data.frame()
paired_eventsD<-as.character()
for (x in q2D0$EVTYPE){
  cc<-cc+1
  y<-q2D0[-(1:cc),]
  linen<-which(q2D0$EVTYPE==x)
   if (nchar(x)<6) {b<-1}
    else {b<-2}
  i<-amatch(x, y$EVTYPE, maxDist = b)
  if (is.na(i)==FALSE){
    c<-c+1
    l[c,1]<-linen
    l[c,2]<-i+cc
    y0<-y$EVTYPE[i]
    paired_eventsD[c]<-paste(x,'-', y0)
  }
}
rm(linen, c, cc)
q2D00<-q2D0
q2D00$DAMAGE[l[,1]]<-q2D00$DAMAGE[l[,1]]+q2D00$DAMAGE[l[,2]]
q2D00<-q2D00[-(l[,2]),]

The last data cleaning algorithm did not handle well “THUNDEEERSTORM WINDS-THUNDERSTORM WIND-THUNDERSTORM WINDS-THUNDERTORM WINDS, ETC”. However, it handle reasonably well in 99.999872 percent scenarios defined as total damage prior to post data cleaning. The difference is minuscule. Here we end up with a DAMAGE data set, q2D00 that has 121 lines. Sorting it out descending order:

DAMAGE<-arrange(q2D00, desc(DAMAGE))

RESULTS

Here results presented in graphic format using stacked bar plots. An advantage in using stacked bars is visibility of the dominance by the specific event type. Data in plots were normalized to total fatalities, total injuries or total economic damage. Total numbers were taken from data that was saved prior to EVTYPE data cleaning. Event types that caused the most fatalities:

barplot(as.matrix(fatal$FATALITIES [1:10]/sum(q1f$FATALITIES)), col = heat.colors(10), legend.text=fatal$EVTYPE[1:10], ylab="Fraction of total fatalities", width=1, xlim=c(0,2), main="Most fatal nature events")

First 20 most fatal event types are printed out. They contributed to 94.1168703 percent of all fatalities in original data set:

fatal[1:20,]
##            EVTYPE FATALITIES
## 1         TORNADO       5636
## 2  EXCESSIVE HEAT       1920
## 3            HEAT       1120
## 4     FLASH FLOOD       1035
## 5       LIGHTNING        817
## 6     RIP CURRENT        572
## 7       TSTM WIND        505
## 8           FLOOD        484
## 9            COLD        451
## 10      HIGH WIND        297
## 11       AVALANCE        225
## 12   WINTER STORM        216
## 13   THUNDERSTORM        200
## 14      HURRICANE        133
## 15     HEAVY SNOW        129
## 16    STRONG WIND        111
## 17      HIGH SURF        104
## 18       BLIZZARD        101
## 19            ICE        100
## 20     HEAVY RAIN         98

Events that cause the most of injuries:

barplot(as.matrix(injur$INJURIES [1:7]/sum(q1i$INJURIES)), col = heat.colors(7), legend.text=injur$EVTYPE[1:7], ylab="Fraction of total injuries", width=1, xlim=c(0,2), main="Nature events causing the most of injuries")

First 20 most fatal event types are printed out. They contributed to 97.4026529 percent of all fatalities in original data set:

injur[1:20,]
##              EVTYPE INJURIES
## 1           TORNADO    91407
## 2         TSTM WIND     6961
## 3             FLOOD     6795
## 4    EXCESSIVE HEAT     6525
## 5         LIGHTNING     5232
## 6              HEAT     2529
## 7      THUNDERSTORM     2451
## 8               ICE     2114
## 9       FLASH FLOOD     1802
## 10             HIGH     1683
## 11             HAIL     1467
## 12     WINTER STORM     1338
## 13        HURRICANE     1328
## 14       WILD FIRES     1061
## 15       HEAVY SNOW     1034
## 16         BLIZZARD      805
## 17              FOG      734
## 18 WILD/FOREST FIRE      545
## 19   WINTER WEATHER      538
## 20      RIP CURRENT      529

Events that cause the most economic damage:

barplot(as.matrix(DAMAGE$DAMAGE[1:7]/(questionable_damage+sum(q2D0$DAMAGE))), col = heat.colors(7), legend.text=DAMAGE$EVTYPE[1:7], ylab="Fraction of total damage", main="Nature events causing the most economic damage", width=1, xlim=c(0,2))

First 10 lines describe 90.917798 percent of all recorded damage:

DAMAGE[1:10,]
##            EVTYPE       DAMAGE
## 1           FLOOD 161022463609
## 2       HURRICANE  90161472810
## 3         TORNADO  57418279847
## 4     STORM SURGE  47965579000
## 5            HAIL  19134684410
## 6     FLASH FLOOD  19120989246
## 7         DROUGHT  15018927780
## 8             ICE   8989448410
## 9  TROPICAL STORM   8409286550
## 10    HIGH  WINDS   6735939953

CONCLUSIONS

  1. 5 nature event types that are most deadly in were (in descending order): tornado, excessive heat, heat, flash flood and lightning.
  2. 5 nature events that causing the most of injuries were (in descending order): tornado, TSTM wind, flood, excessive heat, lightning.
  3. 5 nature event types that caused the most economic damage (in descending order): flood, hurricane, tornado, storm surge, hail.