Synopsis

This analysis was prepared with the aim to find the event types that have the most drastic effects on population. The assumption is that these events are the ones with the greatest fatalities and injuries and property and crop damages.Natioanal Oceanic and Atmospheric Administration’s (NOAA) Data spanning from 1950 and 2011 is preprocessed and analyzed. Analysis show that tornados and excessive heat are the event types with most injuries and fatalities while flood, hurricane typhoon and tornados again causing the biggest economic damage.

Loading and Processing the Raw Data

For this review the data from U.S. National Oceanic and Atmospheric Administration’s (NOAA) is used. It covers the events from 1950 until November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The data is downloaded and uploaded via the open source R Program ( version 3.1.2 ) with the below code.

if(!file.exists("Storm.bz2")){
    URL <- ("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")
    download.file(URL,"Storm.bz2")
    Storm <- read.csv("Storm.bz2")
    }
      Storm <-read.csv("Storm.bz2")

After reading the data we check the structure of the data

str(Storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Since the Beginning Date and End DAte are factor variables, we are turning them into to a Date format.

Storm$BGN_DATE <- as.Date(Storm$BGN_DATE,"%m/%d/%Y")
Storm$END_DATE <- as.Date(Storm$END_DATE,"%m/%d/%Y")

Since Time and Time Zone variables are not particular interest for this paper, we do not process them even though their format could be converted.

A look to the EVTYPE variable shows that there are encoding problems.

head(sort(table(Storm$EVTYPE),decreasing=TRUE),n=60)
## 
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288661                   219940                    82563 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60652                    54277                    25326 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20843                    20212                    15754 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15708                    11723                    11433 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7026                     6839                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3796                     3566 
##     URBAN/SML STREAM FLD                 WILDFIRE                 BLIZZARD 
##                     3392                     2761                     2719 
##                  DROUGHT                ICE STORM           EXCESSIVE HEAT 
##                     2488                     2006                     1678 
##               HIGH WINDS         WILD/FOREST FIRE             FROST/FREEZE 
##                     1533                     1457                     1342 
##                DENSE FOG       WINTER WEATHER/MIX           TSTM WIND/HAIL 
##                     1293                     1104                     1028 
##  EXTREME COLD/WIND CHILL                     HEAT                HIGH SURF 
##                     1002                      767                      725 
##           TROPICAL STORM           FLASH FLOODING             EXTREME COLD 
##                      690                      682                      655 
##            COASTAL FLOOD         LAKE-EFFECT SNOW        FLOOD/FLASH FLOOD 
##                      650                      636                      624 
##                LANDSLIDE                     SNOW          COLD/WIND CHILL 
##                      600                      587                      539 
##                      FOG              RIP CURRENT              MARINE HAIL 
##                      538                      470                      442 
##               DUST STORM                AVALANCHE                     WIND 
##                      427                      386                      340 
##             RIP CURRENTS              STORM SURGE            FREEZING RAIN 
##                      304                      261                      250 
##              URBAN FLOOD     HEAVY SURF/HIGH SURF        EXTREME WINDCHILL 
##                      249                      228                      204 
##             STRONG WINDS           DRY MICROBURST    ASTRONOMICAL LOW TIDE 
##                      196                      186                      174 
##                HURRICANE              RIVER FLOOD               LIGHT SNOW 
##                      174                      173                      154

First we created another variable for cleaned Event Types. With only targeting the high frequency event types some degree of data cleaning is performed. Such as all uppercase converted to lower case, space trimming, “/” replacing, winds" converted into “wind”, “tstm” into “thunderstorm”. Even though there are other event types to be consolidated (e.g “Excessive Heat” vs. “Heat”, “Hail” vs. “hail 1.5”) documentation doesn’t make it clear that they can be classified as the same.So they were left out.

Storm$CL_EVTYPE <- Storm$EVTYPE
Storm$CL_EVTYPE <- tolower(Storm$CL_EVTYPE)
Storm$CL_EVTYPE <- gsub("/"," ", Storm$CL_EVTYPE)
Storm$CL_EVTYPE <- gsub("  "," ", Storm$CL_EVTYPE)
Storm$CL_EVTYPE <- gsub("winds","wind", Storm$CL_EVTYPE)
Storm$CL_EVTYPE <- gsub("tstm","thunderstorm", Storm$CL_EVTYPE)

Finally we convert our old EVTYPE variable with our new variable and delete the CL_EVTYPE variable and take a look to the cleaned event types

Storm$EVTYPE <- Storm$CL_EVTYPE
Storm$CL_EVTYPE <- NULL
head(sort(table(Storm$EVTYPE),decreasing=TRUE),n=60)
## 
##        thunderstorm wind                     hail                  tornado 
##                   323362                   288661                    60652 
##              flash flood                    flood                high wind 
##                    54277                    25327                    21748 
##                lightning               heavy snow marine thunderstorm wind 
##                    15754                    15708                    11987 
##               heavy rain             winter storm           winter weather 
##                    11742                    11433                     7045 
##             funnel cloud               waterspout              strong wind 
##                     6844                     3796                     3773 
##     urban sml stream fld                 wildfire                 blizzard 
##                     3392                     2761                     2719 
##                  drought                ice storm           excessive heat 
##                     2488                     2006                     1678 
##         wild forest fire             frost freeze                dense fog 
##                     1457                     1343                     1293 
##   thunderstorm wind hail       winter weather mix  extreme cold wind chill 
##                     1115                     1110                     1002 
##                     heat                high surf           tropical storm 
##                      767                      734                      690 
##           flash flooding             extreme cold            coastal flood 
##                      682                      657                      656 
##         lake-effect snow        flood flash flood                     snow 
##                      636                      626                      617 
##                landslide          cold wind chill                      fog 
##                      600                      539                      538 
##              rip current              marine hail               dust storm 
##                      470                      442                      427 
##                avalanche                     wind             rip currents 
##                      386                      382                      304 
##              storm surge            freezing rain              urban flood 
##                      261                      260                      251 
##     heavy surf high surf        extreme windchill           dry microburst 
##                      228                      204                      186 
##         coastal flooding               light snow    astronomical low tide 
##                      183                      176                      174 
##                hurricane              river flood            record warmth 
##                      174                      173                      154 
##               dust devil         storm surge tide         marine high wind 
##                      149                      148                      135

Data Analysis

Events Harmful to Population Health

Inspecting most harmful events, total fatalities and injuries could be revealing.

head(sort(tapply(Storm$FATALITIES,Storm$EVTYPE,sum),decreasing=TRUE))
##           tornado    excessive heat       flash flood              heat 
##              5633              1903               978               937 
##         lightning thunderstorm wind 
##               816               701
head(sort(tapply(Storm$INJURIES,Storm$EVTYPE,sum),decreasing=TRUE))
##           tornado thunderstorm wind             flood    excessive heat 
##             91346              9363              6789              6525 
##         lightning              heat 
##              5230              2100

In order to have separete lists with top ten total values we can do the following subsetting

FatalitiesByEvent <- aggregate(FATALITIES ~EVTYPE,Storm,sum)
TopTenFatalities <- FatalitiesByEvent[order(FatalitiesByEvent$FATALITIES,decreasing=TRUE),]
TopTenFatalities <- TopTenFatalities[1:10,]
TopTenFatalities
##                EVTYPE FATALITIES
## 732           tornado       5633
## 115    excessive heat       1903
## 137       flash flood        978
## 234              heat        937
## 401         lightning        816
## 660 thunderstorm wind        701
## 152             flood        470
## 505       rip current        368
## 306         high wind        283
## 19          avalanche        224

The same for Injuries:

InjuriesByEvent <- aggregate(INJURIES ~EVTYPE,Storm,sum)
TopTenInjuries <- InjuriesByEvent[order(InjuriesByEvent$INJURIES,decreasing=TRUE),]
TopTenInjuries <- TopTenInjuries[1:10,]
TopTenInjuries
##                EVTYPE INJURIES
## 732           tornado    91346
## 660 thunderstorm wind     9363
## 152             flood     6789
## 115    excessive heat     6525
## 401         lightning     5230
## 234              heat     2100
## 371         ice storm     1975
## 137       flash flood     1777
## 306         high wind     1439
## 204              hail     1361

To see the the total picture of most harmful event types with respect to population health (priorizing fatalities) we can look to a combined table

HealthDamage <- merge(TopTenFatalities,TopTenInjuries,by.x="EVTYPE",by.y="EVTYPE",all=TRUE)
HealthDamage <- HealthDamage[order(HealthDamage$FATALITIES,decreasing=TRUE),]
HealthDamage
##               EVTYPE FATALITIES INJURIES
## 12           tornado       5633    91346
## 2     excessive heat       1903     6525
## 3        flash flood        978     1777
## 6               heat        937     2100
## 9          lightning        816     5230
## 11 thunderstorm wind        701     9363
## 4              flood        470     6789
## 10       rip current        368       NA
## 7          high wind        283     1439
## 1          avalanche        224       NA
## 5               hail         NA     1361
## 8          ice storm         NA     1975

With this table wee see that tornados and excessive heat are the leading events for fatalities and injuries.

To show the results side by side we have utilized the reshape2 package and created long data frame with the melt function. And then created a bar plot with ggplot2.

library(reshape2)
## Warning: package 'reshape2' was built under R version 3.1.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
TopFive <- HealthDamage[1:5,]
LongTable <- melt(TopFive,id="EVTYPE", measure.vars=c("FATALITIES","INJURIES"))


ggplot(LongTable, aes(x=EVTYPE,y=value,fill=factor(variable)))+geom_bar(stat="identity",position="dodge",width=.5)+ scale_fill_discrete(name="") + xlab("Event Type") + ylab("Total Incidents")

Economic Damage

Another point of interest would be the economic impact. In order to calculate the damage, the exponent variables are used.To understand both variables are summarizied with table function.

table(Storm$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
table(Storm$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

With this short look to the variables and the help of Storm Data Documentation, a series of steps are applied to data and number symbols such as “k”, “m” converted to the appropriate numbers for multiplication. All the remaining ambiguous characters converted into 1 and other numbers left as they are.

Storm$PropEXP <- Storm$PROPDMGEXP
Storm$PropEXP <- gsub("[Kk]","1000",Storm$PropEXP)
Storm$PropEXP <- gsub("[Mm]","1000000",Storm$PropEXP)
Storm$PropEXP <- gsub("B","1000000000",Storm$PropEXP)
Storm$PropEXP <- gsub("[Hh]","1",Storm$PropEXP)
Storm$PropEXP <- gsub("^0","1",Storm$PropEXP)
Storm$PropEXP <- as.integer(Storm$PropEXP)
## Warning: NAs introduced by coercion
Storm$PropEXP[is.na(Storm$PropEXP)] <- 1
Storm$PROPDMGEXP <- Storm$PropEXP
Storm$PropEXP <-NULL
table(Storm$PROPDMGEXP)
## 
##      1      2      3      4      5      6      7      8   1000  1e+06 
## 466196     13      4      4     28      4      5      1 424665  11337 
##  1e+09 
##     40

The same was done for the CROPEXP variable.

Storm$CropEXP <- Storm$CROPDMGEXP
Storm$CropEXP <- gsub("[Kk]","1000",Storm$CropEXP)
Storm$CropEXP <- gsub("[Mm]","1000000",Storm$CropEXP)
Storm$CropEXP <- gsub("B","1000000000",Storm$CropEXP)
Storm$CropEXP <- gsub("^0","1",Storm$CropEXP)
Storm$CropEXP <- as.integer(Storm$CropEXP)
## Warning: NAs introduced by coercion
Storm$CropEXP[is.na(Storm$CropEXP)] <- 1
Storm$CROPDMGEXP <- Storm$CropEXP
Storm$CropEXP <-NULL
table(Storm$CROPDMGEXP)
## 
##      1      2   1000  1e+06  1e+09 
## 618439      1 281853   1995      9

To see the total damage two variables added into a new Total Damage Variable after multiplying with the respective “..EXP variable.

Storm$TOTDAM <- (Storm$PROPDMG *Storm$PROPDMGEXP ) +(Storm$CROPDMG * Storm$CROPDMGEXP)

As the next step We are aggregating the total damages by Event Types:

DamageByEvent <- aggregate(TOTDAM ~EVTYPE,Storm,sum)

Checking the dimension of this data shows there are a lot of type of events, a closer look to top ten would be more meaningful.

 TopFiveDamage <- DamageByEvent[order(DamageByEvent$TOTDAM,decreasing=TRUE),]
TopFiveDamage <- TopFiveDamage[1:5,]
TopFiveDamage
##                EVTYPE       TOTDAM
## 152             flood 150319678257
## 355 hurricane typhoon  71913712800
## 732           tornado  57352114458
## 575       storm surge  43323541000
## 204              hail  18758221641

Plot showing the top five event types with most total economic damages :

ggplot(TopFiveDamage,aes(x=EVTYPE,y=TOTDAM))+ geom_bar(stat="identity",position="dodge",fill="blue",width=.5)+xlab("Event Type") + ylab("Total Damage") 

Results

Analysis show that tornados and excessive heat are the event types with most injuries and fatalities while flood, hurricane typhoon and tornados causing the biggest economic damage. Even though there are common event types in both economic and health damages they do not overlap totally. Excessive heat and heat are some of top causes of fatalities while they are causing considerably less economic damage. This could be one of the areas that needs further analysis.