Natural Disasters in the United States, 1950-2011

Synopsis

This report contains my analysis of the Storm Data set collected by the NOAA National Weather Service between 1950 and 2011. The results show that tornadoes have the greatest impact on population health in the United States. Excessive heat, although it occurs less often than tornadoes, also causes a relatively high number of injuries and fatalities. In terms of economic impact, floods cause the greatest damage, followed by hurricanes and typhoons.

Assumptions

To analyze the data, the plyr, reshape2, ggplot2, tm, and wordcloud packages need to be installed on the machine.

    library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.2.2
## Loading required package: RColorBrewer
## Warning: package 'RColorBrewer' was built under R version 3.2.2
    library(RColorBrewer)
    library(plyr)
## Warning: package 'plyr' was built under R version 3.2.2
    library(reshape2)
## Warning: package 'reshape2' was built under R version 3.2.2
    library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.2

Loading and Processing the Raw Data

The data were obtained from the Storm Data file, which contains the US National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States from the beginning of 1950 to the end of November 2011, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Documentation of the database is also available and describes how some of the variables are constructed/defined.

Reading data

The data for this project come in the form of a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. It can be read simply with the read.csv() function in R (or the more general read.table()), which handles compressed files automatically. Loading the data is a time-consuming process, so the cache=TRUE chunk option is used to avoid re-reading the file every time the report is built.

  stormData <- read.csv(bzfile("StormData.csv.bz2"))
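
For reference, caching can be enabled with a knitr chunk header along these lines (the chunk label readData is purely illustrative):

  ```{r readData, cache=TRUE}
  stormData <- read.csv(bzfile("StormData.csv.bz2"))
  ```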

After reading the storm data we can check the dimensions and the first few rows of the dataset (there are 902,297 rows in total).

  dim(stormData)
## [1] 902297     37
  head(stormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

It is also useful to take a look at the structure and types of the variables in this dataset.

  str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The column we are interested in is EVTYPE, which records the type of each event that occurred across the US between 1950 and 2011. The values are converted to upper case so that differently-cased duplicates are treated as the same event type; note that this also turns the factor into a plain character vector.

  stormData$EVTYPE <- toupper(stormData$EVTYPE)
  events <- stormData$EVTYPE
  summary(events)
##    Length     Class      Mode 
##    902297 character character
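
The summary above only reports the length and class of the character vector. To see how often the most common event types actually occur, a frequency table can be used (a minimal sketch; output not shown):

  # Tabulate event types and list the ten most frequent ones
  head(sort(table(stormData$EVTYPE), decreasing = TRUE), 10)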

We can also draw a fancier picture of the event types using the wordcloud package.

    wordcloud(as.character(unique(stormData$EVTYPE)), random.order=FALSE)
## Loading required package: tm
## Warning: package 'tm' was built under R version 3.2.2
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.2.2
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate

Results

Events vs. Population Health

To determine which types of events are the most harmful in the US, we consider their impact on population health, which can be measured by the number of fatalities and injuries. To do this, we compare the number of fatalities and injuries caused by the most common types of events (top 20). First we need to subset and aggregate the data.

Fatalities

  fatality <- aggregate(FATALITIES ~ EVTYPE, data = stormData, FUN = "sum")
  fatality <- fatality[order(fatality$FATALITIES, decreasing = TRUE), ]
  head(fatality,20)
##                      EVTYPE FATALITIES
## 758                 TORNADO       5633
## 116          EXCESSIVE HEAT       1903
## 138             FLASH FLOOD        978
## 243                    HEAT        937
## 418               LIGHTNING        816
## 779               TSTM WIND        504
## 154                   FLOOD        470
## 524             RIP CURRENT        368
## 320               HIGH WIND        248
## 19                AVALANCHE        224
## 888            WINTER STORM        206
## 525            RIP CURRENTS        204
## 245               HEAT WAVE        172
## 125            EXTREME COLD        162
## 685       THUNDERSTORM WIND        133
## 274              HEAVY SNOW        127
## 126 EXTREME COLD/WIND CHILL        125
## 312               HIGH SURF        104
## 604             STRONG WIND        103
## 28                 BLIZZARD        101

To compare different events, it’s nice to display the aggregated data on a bar chart.

  topFatality <- fatality[1:20, ]
  ggplot(data = topFatality, aes(EVTYPE, FATALITIES, fill = FATALITIES)) + geom_bar(stat = "identity") + 
    xlab("Event") + ylab("Fatalities") + ggtitle("Fatalities Caused by Different Events (top 20)") + 
    coord_flip() + theme(legend.position = "none")
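
Note that ggplot2 draws the bars in alphabetical order of EVTYPE by default. If a ranking by count is preferred, the event type can be turned into a factor ordered by fatalities before plotting (a minimal sketch; the ggplot call itself stays the same):

  # Reorder event types by total fatalities so the bars appear ranked
  topFatality$EVTYPE <- reorder(factor(topFatality$EVTYPE), topFatality$FATALITIES)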

The graph clearly shows that TORNADO was the deadliest event type over the years 1950 to 2011.

Injuries

  injury <- aggregate(INJURIES ~ EVTYPE, data = stormData, FUN = "sum")
  injury <- injury[order(injury$INJURIES, decreasing = TRUE), ]
  head(injury,20)
##                 EVTYPE INJURIES
## 758            TORNADO    91346
## 779          TSTM WIND     6957
## 154              FLOOD     6789
## 116     EXCESSIVE HEAT     6525
## 418          LIGHTNING     5230
## 243               HEAT     2100
## 387          ICE STORM     1975
## 138        FLASH FLOOD     1777
## 685  THUNDERSTORM WIND     1488
## 212               HAIL     1361
## 888       WINTER STORM     1321
## 372  HURRICANE/TYPHOON     1275
## 320          HIGH WIND     1137
## 274         HEAVY SNOW     1021
## 875           WILDFIRE      911
## 711 THUNDERSTORM WINDS      908
## 28            BLIZZARD      805
## 171                FOG      734
## 873   WILD/FOREST FIRE      545
## 105         DUST STORM      440

To compare different events, it’s nice to display the aggregated data on a bar chart.

  topInjury <- injury[1:20, ]
  ggplot(data = topInjury, aes(EVTYPE, INJURIES, fill = INJURIES)) + geom_bar(stat = "identity") + 
    xlab("Event") + ylab("Injuries") + ggtitle("Injuries Caused by Different Events (top 20)") + 
    coord_flip() + theme(legend.position = "none")

As the chart above shows, TORNADO also causes by far the most injuries in the US.

Events vs. Economic Consequences

Here we want to determine which types of events have the greatest economic impact. First we should check the exponent codes stored in the PROPDMGEXP variable of the storm dataset.

  unique(stormData$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

It shows that the exponent codes appear in mixed case, for example h and H or m and M. To fix this, it is a good idea to convert them all to upper case (or all to lower case).

  stormData$PROPDMGEXP <- toupper(stormData$PROPDMGEXP)
  unique(stormData$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "+" "0" "5" "6" "?" "4" "2" "3" "H" "7" "-" "1" "8"

Now we need to convert the exponent characters to numeric multipliers. The calcExp() function performs this conversion for a single value, and applyCalcExp() applies it across whole vectors.

  calcExp <- function(x, exp = "") {
        switch(exp, `-` = x * -1, `?` = x, `+` = x, `1` = x*10, `2` = x*(10^2), `3` = x*(10^3),
               `4` = x * (10^4), `5` = x * (10^5), `6` = x * (10^6), `7` = x *(10^7), 
               `8` = x * (10^8), H = x * 100, K = x * 1000, M = x * 1e+06, B = x * 1e+09, x)
  }
  applyCalcExp <- function(vx, vexp) {
    if (length(vx) != length(vexp)) 
        stop("Not same size")
    result <- rep(0, length(vx))
    for (i in 1:length(vx)) {
        result[i] <- calcExp(vx[i], vexp[i])
    }
    result
  }
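
A few quick sanity checks (illustrative values only, not taken from the dataset) show how the exponent codes are interpreted; output not shown:

  calcExp(25, "K")    # thousands:  25 * 1000    = 25000
  calcExp(2.5, "M")   # millions:   2.5 * 1e+06  = 2500000
  calcExp(1.15, "B")  # billions:   1.15 * 1e+09 = 1150000000
  calcExp(10, "")     # unrecognised code: value is returned unchanged

Since applyCalcExp() loops over roughly 900,000 rows, a vectorised call such as mapply(calcExp, vx, vexp) would likely be faster, but the explicit loop is kept here for clarity.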

Now, we are able to calculate the cost of damage caused by each event.

  stormData$damageCosts <- applyCalcExp(as.numeric(stormData$PROPDMG), stormData$PROPDMGEXP)
  summary(stormData$damageCosts)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.500e+01  0.000e+00  0.000e+00  4.746e+05  5.000e+02  1.150e+11

Here, we calculate the total damage cost for each event type.

  tDamagecosts <- subset(aggregate(damageCosts ~ EVTYPE, data = stormData, FUN = "sum"), damageCosts > 0)
  tDamagecosts <- tDamagecosts[order(tDamagecosts$damageCosts, decreasing = TRUE), ]
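
As with the fatality and injury tables above, the top rows of the aggregated costs can be inspected with head() before plotting (output omitted here):

  head(tDamagecosts, 10)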

To compare different events, it’s nice to display the aggregated data on a bar chart.

  tDamagecosts_top20 <- tDamagecosts[1:20, ]
  ggplot(data = tDamagecosts_top20, aes(EVTYPE, damageCosts, fill = damageCosts)) + geom_bar(stat = "identity") + 
  xlab("Event") + ylab("Economic costs in $US") + 
  ggtitle("Economic costs caused by Different Events (top 20) ") + coord_flip() + theme(legend.position = "none")

As the chart above makes clear, FLOOD has the greatest economic impact, followed by HURRICANE/TYPHOON, TORNADO and STORM SURGE respectively.