Economic and Public Health Impact of Weather Events (Storms) in the United States, 1950-2011

Sree June 20th 2014

1.0 Purpose

This project explores the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The purpose of this report is to answer two questions:

  1. Across the United States, which types of events are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

2.0 Data Processing

The data for this assignment come as a comma-separated-value (CSV) file compressed with the bzip2 algorithm to reduce its size. The file (Storm Data, 47 MB) was downloaded from the Coursera Reproducible Research web site.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

2.1 Prepare for loading the data

Set the working directory

setwd("C://Users/sreekantha/Documents/data-science/Assignments/Sub5/feed")

Load the required packages (they must already be installed)

library("data.table")
library("knitr")

2.2 Import the data

Download the data from the Coursera peer assessment location and read it as a CSV file in a single script fragment. The file is decompressed in the working directory defined above. The download link is: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

Load the data into a data frame

stormdata <- read.csv("/Users/sreekantha/Documents/data-science/Assignments/Sub5/feed/repdata-data-StormData.csv", sep = ",", stringsAsFactors = FALSE)
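
As a possible alternative to unpacking the archive by hand, the download and the read can be combined in one helper. This is only a sketch: the function name load_storm_data and the destination file name are assumptions made for illustration, not part of the original script. It relies on read.csv being able to open bzip2-compressed files directly.

```r
# Hypothetical helper (not in the original script): download the archive once,
# then read the compressed CSV directly.
load_storm_data <- function(url, dest) {
    # Skip the download if the file is already present locally
    if (!file.exists(dest)) {
        download.file(url, destfile = dest, mode = "wb")
    }
    # read.csv opens bzip2-compressed files transparently
    read.csv(dest, stringsAsFactors = FALSE)
}

# Example call (the destination file name is an assumption):
# stormdata <- load_storm_data(
#     "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#     "repdata-data-StormData.csv.bz2")
```

Checking for the file first keeps repeated runs of the report from downloading the 47 MB archive again.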

2.3 Analyse the data

Analyse the structure of the stormdata using the str command

str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Analyse the summary of the data fields using the summary command

summary(stormdata)
##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY     COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31   Class :character   Class :character   Class :character  
##  Median : 75   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :101                                                           
##  3rd Qu.:131                                                           
##  Max.   :873                                                           
##                                                                        
##    BGN_RANGE      BGN_AZI           BGN_LOCATI          END_DATE        
##  Min.   :   0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:   0   Class :character   Class :character   Class :character  
##  Median :   0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :   1                                                           
##  3rd Qu.:   1                                                           
##  Max.   :3749                                                           
##                                                                         
##    END_TIME           COUNTY_END COUNTYENDN       END_RANGE  
##  Length:902297      Min.   :0    Mode:logical   Min.   :  0  
##  Class :character   1st Qu.:0    NA's:902297    1st Qu.:  0  
##  Mode  :character   Median :0                   Median :  0  
##                     Mean   :0                   Mean   :  1  
##                     3rd Qu.:0                   3rd Qu.:  0  
##                     Max.   :0                   Max.   :925  
##                                                              
##    END_AZI           END_LOCATI            LENGTH           WIDTH     
##  Length:902297      Length:902297      Min.   :   0.0   Min.   :   0  
##  Class :character   Class :character   1st Qu.:   0.0   1st Qu.:   0  
##  Mode  :character   Mode  :character   Median :   0.0   Median :   0  
##                                        Mean   :   0.2   Mean   :   8  
##                                        3rd Qu.:   0.0   3rd Qu.:   0  
##                                        Max.   :2315.0   Max.   :4400  
##                                                                       
##        F               MAG          FATALITIES     INJURIES     
##  Min.   :0        Min.   :    0   Min.   :  0   Min.   :   0.0  
##  1st Qu.:0        1st Qu.:    0   1st Qu.:  0   1st Qu.:   0.0  
##  Median :1        Median :   50   Median :  0   Median :   0.0  
##  Mean   :1        Mean   :   47   Mean   :  0   Mean   :   0.2  
##  3rd Qu.:1        3rd Qu.:   75   3rd Qu.:  0   3rd Qu.:   0.0  
##  Max.   :5        Max.   :22000   Max.   :583   Max.   :1700.0  
##  NA's   :843563                                                 
##     PROPDMG      PROPDMGEXP           CROPDMG       CROPDMGEXP       
##  Min.   :   0   Length:902297      Min.   :  0.0   Length:902297     
##  1st Qu.:   0   Class :character   1st Qu.:  0.0   Class :character  
##  Median :   0   Mode  :character   Median :  0.0   Mode  :character  
##  Mean   :  12                      Mean   :  1.5                     
##  3rd Qu.:   0                      3rd Qu.:  0.0                     
##  Max.   :5000                      Max.   :990.0                     
##                                                                      
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 

Analyse the first lines of the dataset using the head command

head(stormdata)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
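
The BGN_DATE strings shown above can be used to check the earlier statement that fewer events were recorded in the early years. The sketch below counts records per year; the function name events_per_year is an assumption for illustration, and the sample vector simply repeats the first three dates from the output above.

```r
# Hypothetical helper: count records per calendar year from BGN_DATE strings
# of the form "4/18/1950 0:00:00".
events_per_year <- function(bgn_date) {
    # Drop the time part, keeping only the "m/d/Y" date
    date_part <- sub(" .*$", "", bgn_date)
    dates <- as.Date(date_part, format = "%m/%d/%Y")
    # Tabulate by four-digit year
    table(format(dates, "%Y"))
}

sample_dates <- c("4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:00")
events_per_year(sample_dates)
```

Applied to the full data set, the call would be events_per_year(stormdata$BGN_DATE).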

2.4 Prepare the data

To aggregate fatalities and injuries by event type, both fields must be numeric. The str output above shows that they were already read as numeric, so the conversions below act only as a safeguard.

Convert fatalities field to numeric

stormdata$FATALITIES <- as.numeric(stormdata$FATALITIES)

Convert injuries field to numeric

stormdata$INJURIES <- as.numeric(stormdata$INJURIES)

Clean up the event type

The event type (EVTYPE) contains duplicate categories that differ only in letter case. Converting all values to upper case merges them.

stormdata$EVTYPE <- toupper(stormdata$EVTYPE)
eventtype <- sort(unique(stormdata$EVTYPE))

Show first 50 event types

eventtype[1:50]
##  [1] "   HIGH SURF ADVISORY"          " COASTAL FLOOD"                
##  [3] " FLASH FLOOD"                   " LIGHTNING"                    
##  [5] " TSTM WIND"                     " TSTM WIND (G45)"              
##  [7] " WATERSPOUT"                    " WIND"                         
##  [9] "?"                              "ABNORMAL WARMTH"               
## [11] "ABNORMALLY DRY"                 "ABNORMALLY WET"                
## [13] "ACCUMULATED SNOWFALL"           "AGRICULTURAL FREEZE"           
## [15] "APACHE COUNTY"                  "ASTRONOMICAL HIGH TIDE"        
## [17] "ASTRONOMICAL LOW TIDE"          "AVALANCE"                      
## [19] "AVALANCHE"                      "BEACH EROSIN"                  
## [21] "BEACH EROSION"                  "BEACH EROSION/COASTAL FLOOD"   
## [23] "BEACH FLOOD"                    "BELOW NORMAL PRECIPITATION"    
## [25] "BITTER WIND CHILL"              "BITTER WIND CHILL TEMPERATURES"
## [27] "BLACK ICE"                      "BLIZZARD"                      
## [29] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"       
## [31] "BLIZZARD SUMMARY"               "BLIZZARD WEATHER"              
## [33] "BLIZZARD/FREEZING RAIN"         "BLIZZARD/HEAVY SNOW"           
## [35] "BLIZZARD/HIGH WIND"             "BLIZZARD/WINTER STORM"         
## [37] "BLOW-OUT TIDE"                  "BLOW-OUT TIDES"                
## [39] "BLOWING DUST"                   "BLOWING SNOW"                  
## [41] "BLOWING SNOW- EXTREME WIND CHI" "BLOWING SNOW & EXTREME WIND CH"
## [43] "BLOWING SNOW/EXTREME WIND CHIL" "BREAKUP FLOODING"              
## [45] "BRUSH FIRE"                     "BRUSH FIRES"                   
## [47] "COASTAL  FLOODING/EROSION"      "COASTAL EROSION"               
## [49] "COASTAL FLOOD"                  "COASTAL FLOODING"

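The event types still contain many near-duplicates, such as leading spaces (" TSTM WIND"), spelling variants ("AVALANCE" and "AVALANCHE") and singular and plural forms ("BRUSH FIRE" and "BRUSH FIRES"). These would ultimately need to be merged: some can be converted automatically, while many others need manual mapping.

The next step is to convert the event type to a factor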
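
A minimal sketch of the automatic part of that clean-up is given below. The function name normalize_evtype and the two-entry typo table are assumptions for illustration; the report itself only applies toupper. The sample values are taken from the event type listing above.

```r
# Hypothetical clean-up helper (not part of the original analysis)
normalize_evtype <- function(x) {
    x <- toupper(x)
    x <- gsub("\\s+", " ", x)  # collapse runs of whitespace to one space
    x <- trimws(x)             # drop leading and trailing spaces
    # Manual fixes for known misspellings; extend this table as needed
    typo_fixes <- c("AVALANCE" = "AVALANCHE", "BEACH EROSIN" = "BEACH EROSION")
    hit <- x %in% names(typo_fixes)
    x[hit] <- unname(typo_fixes[x[hit]])
    x
}

raw <- c("   HIGH SURF ADVISORY", " TSTM WIND", "TSTM WIND",
         "COASTAL  FLOODING/EROSION", "AVALANCE", "AVALANCHE")
unique(normalize_evtype(raw))
```

Whitespace and case can be handled mechanically, whereas misspellings and overlapping categories need an explicit lookup table like typo_fixes.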

stormdata$EVTYPE <- as.factor(stormdata$EVTYPE)

2.5 Aggregating event data

Consolidate all lethal events.

fatalities <- as.data.table(subset(aggregate(FATALITIES ~ EVTYPE, data = stormdata, 
    FUN = "sum"), FATALITIES > 0))
fatalities <- fatalities[order(-FATALITIES), ]
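
To illustrate what the aggregate, subset and order steps do, here is the same pattern on a small made-up data frame. The toy values are invented for the example and are not taken from the storm database.

```r
# Toy data (invented values) with the same two columns used above
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
                  FATALITIES = c(3, 1, 2, 0, 1),
                  stringsAsFactors = FALSE)

# Sum fatalities per event type
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
# Keep only event types with at least one fatality (drops HAIL)
totals <- subset(totals, FATALITIES > 0)
# Sort from most to fewest fatalities
totals <- totals[order(-totals$FATALITIES), ]
totals
```

The result has one row per event type, with TORNADO (5) ahead of FLOOD (2).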

Plot the 20 event types with the most fatalities

top20 <- fatalities[1:20, ]
library(ggplot2)
ggplot(data = top20, aes(EVTYPE, FATALITIES, fill = FATALITIES)) + geom_bar(stat = "identity") + 
    xlab("Event") + ylab("Fatalities") + ggtitle("Fatalities caused by Events (top 20) ") + 
    coord_flip() + theme(legend.position = "none")

[Plot: Fatalities caused by Events (top 20)]

The graph clearly shows that tornadoes are by far the deadliest event type over the period covered.

Consolidate all events with injuries

injuries <- as.data.table(subset(aggregate(INJURIES ~ EVTYPE, data = stormdata, 
    FUN = "sum"), INJURIES > 0))
injuries <- injuries[order(-INJURIES), ]

Again, plot the 20 event types with the highest totals

top20i <- injuries[1:20, ]
ggplot(data = top20i, aes(EVTYPE, INJURIES, fill = INJURIES)) + geom_bar(stat = "identity") + 
    xlab("Event") + ylab("Injuries") + ggtitle("Injuries caused by Events (top 20) ") + 
    coord_flip() + theme(legend.position = "none")

[Plot: Injuries caused by Events (top 20)]

Again, tornadoes cause by far the most injuries of any event type.

2.6 Economic impact of events

First, check which exponent values occur in the data

unique(stormdata$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"

Again the values are in mixed case, for example h and H or m and M, so convert them to upper case.

stormdata$PROPDMGEXP <- toupper(stormdata$PROPDMGEXP)
unique(stormdata$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "+" "0" "5" "6" "?" "4" "2" "3" "H" "7" "-" "1" "8"
table(stormdata$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      H      K      M 
##      4      5      1     40      7 424665  11337

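Now that the exponents are cleaned, let's convert them to numeric multipliers. In the mapping below, the digits 2 to 8 and the letters H, K, M and B are treated as powers of ten, "-" negates the value, and every other symbol (including the empty string and "0") leaves the value unchanged.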

calcExp <- function(x, exp = "") {
    switch(exp, `-` = x * -1, `?` = x, `+` = x, `1` = x, `2` = x * (10^2), `3` = x * 
        (10^3), `4` = x * (10^4), `5` = x * (10^5), `6` = x * (10^6), `7` = x * 
        (10^7), `8` = x * (10^8), H = x * 100, K = x * 1000, M = x * 1e+06, 
        B = x * 1e+09, x)
}

applyCalcExp <- function(vx, vexp) {
    if (length(vx) != length(vexp)) 
        stop("Not same size")
    result <- rep(0, length(vx))
    for (i in 1:length(vx)) {
        result[i] <- calcExp(vx[i], vexp[i])
    }
    result
}
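
The element-by-element loop above can be slow on roughly 900,000 rows. A vectorized lookup is one possible alternative; the sketch below is meant to reproduce the same mapping as calcExp, and the names exp_multiplier and applyCalcExpVec are assumptions introduced for illustration.

```r
# Hypothetical vectorized variant using the same multipliers as calcExp
exp_multiplier <- c("-" = -1, "?" = 1, "+" = 1, "1" = 1,
                    "2" = 1e2, "3" = 1e3, "4" = 1e4, "5" = 1e5,
                    "6" = 1e6, "7" = 1e7, "8" = 1e8,
                    H = 1e2, K = 1e3, M = 1e6, B = 1e9)

applyCalcExpVec <- function(vx, vexp) {
    if (length(vx) != length(vexp)) 
        stop("Not same size")
    # Look up each exponent symbol; unknown symbols, "" and "0" give NA
    mult <- unname(exp_multiplier[vexp])
    # Default case: leave the value unchanged
    mult[is.na(mult)] <- 1
    vx * mult
}

applyCalcExpVec(c(25, 2.5, 1, 15), c("K", "M", "", "-"))
```

Indexing a named vector by the exponent column replaces the per-row switch call with a single vector operation.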

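Now we can calculate the property damage costs caused by the events and store them as EconomicCosts. Only property damage (PROPDMG) is included; crop damage (CROPDMG) is not. The negative minimum in the summary below comes from the record with the "-" exponent.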

stormdata$EconomicCosts <- applyCalcExp(as.numeric(stormdata$PROPDMG), stormdata$PROPDMGEXP)
summary(stormdata$EconomicCosts)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.50e+01  0.00e+00  0.00e+00  4.75e+05  5.00e+02  1.15e+11

Consolidate the economic costs by event type

costs <- as.data.table(subset(aggregate(EconomicCosts ~ EVTYPE, data = stormdata, 
    FUN = "sum"), EconomicCosts > 0))
costs <- costs[order(-EconomicCosts), ]

Plot the 20 event types with the highest economic costs

library(scales)
top20c <- costs[1:20, ]
ggplot(data = top20c, aes(EVTYPE, EconomicCosts, fill = EconomicCosts)) + geom_bar(stat = "identity") + 
    scale_y_continuous(labels = comma) + xlab("Event") + ylab("Economic costs in $") + 
    ggtitle("Economic costs caused by Events (top 20) ") + coord_flip() + theme(legend.position = "none")

[Plot: Economic costs caused by Events (top 20)]