Coursera Reproducible Research Peer Assessment 2

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

The analysis of the storm event database shows that tornadoes are the weather event most harmful to population health, followed by excessive heat. The economic impact of weather events was also analyzed. Flash floods and thunderstorm winds caused billions of dollars in property damage between 1950 and 2011. The largest crop damage was caused by drought, followed by flood and hail.

Data Processing

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file, Storm Data (47 MB), was downloaded from the Coursera Reproducible Research web site.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
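The claim about sparser early records can be checked by counting events per year. The following is a minimal sketch on a small made-up data frame (`toy` and its rows are illustrative assumptions, not the real data); it relies only on the BGN_DATE format shown in the output further down.

```r
# Toy stand-in for storm.data; the rows are illustrative, not real counts.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
               "6/8/1951 0:00:00", "11/15/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse the month/day/year prefix; the trailing time part is ignored.
years <- as.integer(format(as.Date(toy$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

# Number of recorded events per year.
events_per_year <- table(years)
print(events_per_year)
```

Applied to the full data set, the same table would show how the number of recorded events changes over the years.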

#setting the environment 
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=French_France.1252      LC_CTYPE=French_France.1252       
## [3] LC_MONETARY=French_France.1252     LC_NUMERIC=C                      
## [5] LC_TIME=English_United States.1252
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    tools_3.2.0     htmltools_0.2.6 yaml_2.1.13    
##  [5] stringi_0.4-1   rmarkdown_0.7   knitr_1.10.5    stringr_1.0.0  
##  [9] digest_0.6.8    evaluate_0.7
#set  working directory
setwd("C:/--Coursera/assessments/")

#load required packages

library(plyr)
library(data.table)
library(ggplot2)


# path to the bzip2-compressed file; read.csv() decompresses it on the fly
storm.file <- "C:/--Coursera/assessments/repdata_data_StormData.csv.bz2"

#read file
storm.data <- read.csv(storm.file, sep = ",", stringsAsFactors = FALSE)

head(storm.data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
summary(storm.data)
##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI       
##  Min.   :   0.000   Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character  
##  Mean   :   1.484                                        
##  3rd Qu.:   1.000                                        
##  Max.   :3749.000                                        
##                                                          
##    END_DATE           END_TIME           COUNTY_END COUNTYENDN    
##  Length:902297      Length:902297      Min.   :0    Mode:logical  
##  Class :character   Class :character   1st Qu.:0    NA's:902297   
##  Mode  :character   Mode  :character   Median :0                  
##                                        Mean   :0                  
##                                        3rd Qu.:0                  
##                                        Max.   :0                  
##                                                                   
##    END_RANGE          END_AZI           END_LOCATI       
##  Min.   :  0.0000   Length:902297      Length:902297     
##  1st Qu.:  0.0000   Class :character   Class :character  
##  Median :  0.0000   Mode  :character   Mode  :character  
##  Mean   :  0.9862                                        
##  3rd Qu.:  0.0000                                        
##  Max.   :925.0000                                        
##                                                          
##      LENGTH              WIDTH                F               MAG         
##  Min.   :   0.0000   Min.   :   0.000   Min.   :0.0      Min.   :    0.0  
##  1st Qu.:   0.0000   1st Qu.:   0.000   1st Qu.:0.0      1st Qu.:    0.0  
##  Median :   0.0000   Median :   0.000   Median :1.0      Median :   50.0  
##  Mean   :   0.2301   Mean   :   7.503   Mean   :0.9      Mean   :   46.9  
##  3rd Qu.:   0.0000   3rd Qu.:   0.000   3rd Qu.:1.0      3rd Qu.:   75.0  
##  Max.   :2315.0000   Max.   :4400.000   Max.   :5.0      Max.   :22000.0  
##                                         NA's   :843563                    
##    FATALITIES          INJURIES            PROPDMG       
##  Min.   :  0.0000   Min.   :   0.0000   Min.   :   0.00  
##  1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:   0.00  
##  Median :  0.0000   Median :   0.0000   Median :   0.00  
##  Mean   :  0.0168   Mean   :   0.1557   Mean   :  12.06  
##  3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:   0.50  
##  Max.   :583.0000   Max.   :1700.0000   Max.   :5000.00  
##                                                          
##   PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Length:902297      Min.   :  0.000   Length:902297     
##  Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :  0.000   Mode  :character  
##                     Mean   :  1.527                     
##                     3rd Qu.:  0.000                     
##                     Max.   :990.000                     
##                                                         
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 
str(storm.data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
#the storm data set has 902297 rows and 37 columns

Clean up storm.data by converting PROPDMG and CROPDMG to dollar values using their exponent columns. According to the documentation: “Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.” In the code below, blank, "+", "-" and "?" codes are treated as a multiplier of 1, "H" as hundreds, and digit codes are used as-is as the multiplier.

storm.data$PROPDMGEXP <- as.character(storm.data$PROPDMGEXP)
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP == "" | storm.data$PROPDMGEXP == "+" | storm.data$PROPDMGEXP == "?" | storm.data$PROPDMGEXP == "-"] <- "1"
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP == "H" | storm.data$PROPDMGEXP == "h"] <- "100"
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP == "K" | storm.data$PROPDMGEXP == "k"] <- "1000"
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP == "M" | storm.data$PROPDMGEXP == "m"] <- "1000000"
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP == "B" | storm.data$PROPDMGEXP == "b"] <- "1000000000"
storm.data$PROPDMGEXP <- as.numeric(storm.data$PROPDMGEXP)
storm.data$PROPDMGUSD <- storm.data$PROPDMG * storm.data$PROPDMGEXP


storm.data$CROPDMGEXP <- as.character(storm.data$CROPDMGEXP)
storm.data$CROPDMGEXP[storm.data$CROPDMGEXP == "" | storm.data$CROPDMGEXP == "?"] <- "1"
storm.data$CROPDMGEXP[storm.data$CROPDMGEXP == "B" | storm.data$CROPDMGEXP == "b"] <- "1000000000"
storm.data$CROPDMGEXP[storm.data$CROPDMGEXP == "M" | storm.data$CROPDMGEXP == "m"] <- "1000000"
storm.data$CROPDMGEXP[storm.data$CROPDMGEXP == "K" | storm.data$CROPDMGEXP == "k"] <- "1000"
storm.data$CROPDMGEXP <- as.numeric(storm.data$CROPDMGEXP)
storm.data$CROPDMGUSD <- storm.data$CROPDMG * storm.data$CROPDMGEXP
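The exponent conversion above can also be written with a single lookup vector. This is only a sketch: `exp_to_multiplier` is a hypothetical helper, not part of the original analysis, and it assumes that every code other than H, K, M or B (blank, symbols, and also digits) maps to a multiplier of 1, which differs from the code above where digit codes are used as-is.

```r
# Hypothetical helper: map exponent codes to numeric multipliers.
# Assumption: anything other than H/K/M/B (any case) counts as 1.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])  # unknown codes become NA
  mult[is.na(mult)] <- 1
  mult
}

# Toy rows for illustration only.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.55, 3),
  PROPDMGEXP = c("K", "m", "B", ""),
  stringsAsFactors = FALSE
)
toy$PROPDMGUSD <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
print(toy)
```

The same helper could be applied to CROPDMGEXP, which avoids repeating one assignment per code.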
# aggregate storm.data by EVTYPE
storm.data.evtype <- aggregate(cbind(FATALITIES, INJURIES, PROPDMGUSD, CROPDMGUSD) ~ EVTYPE, data = storm.data, FUN = sum)
# Add calculated column 'health' as a sum of FATALITIES and INJURIES
storm.data.evtype$health <- storm.data.evtype$FATALITIES + storm.data.evtype$INJURIES
# Add calculated column 'damage' as a sum of PROPDMGUSD and CROPDMGUSD
storm.data.evtype$damage <- storm.data.evtype$PROPDMGUSD + storm.data.evtype$CROPDMGUSD
#Clean up event type (EVTYPE) duplicates that differ only in letter case.
#Note: storm.data.evtype above was aggregated before this step.
storm.data$EVTYPE <- toupper(storm.data$EVTYPE)
event.type <- sort(unique(storm.data$EVTYPE))
## Show first 50 event types
event.type[1:50]
##  [1] "   HIGH SURF ADVISORY"          " COASTAL FLOOD"                
##  [3] " FLASH FLOOD"                   " LIGHTNING"                    
##  [5] " TSTM WIND"                     " TSTM WIND (G45)"              
##  [7] " WATERSPOUT"                    " WIND"                         
##  [9] "?"                              "ABNORMAL WARMTH"               
## [11] "ABNORMALLY DRY"                 "ABNORMALLY WET"                
## [13] "ACCUMULATED SNOWFALL"           "AGRICULTURAL FREEZE"           
## [15] "APACHE COUNTY"                  "ASTRONOMICAL HIGH TIDE"        
## [17] "ASTRONOMICAL LOW TIDE"          "AVALANCE"                      
## [19] "AVALANCHE"                      "BEACH EROSIN"                  
## [21] "BEACH EROSION"                  "BEACH EROSION/COASTAL FLOOD"   
## [23] "BEACH FLOOD"                    "BELOW NORMAL PRECIPITATION"    
## [25] "BITTER WIND CHILL"              "BITTER WIND CHILL TEMPERATURES"
## [27] "BLACK ICE"                      "BLIZZARD"                      
## [29] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"       
## [31] "BLIZZARD SUMMARY"               "BLIZZARD WEATHER"              
## [33] "BLIZZARD/FREEZING RAIN"         "BLIZZARD/HEAVY SNOW"           
## [35] "BLIZZARD/HIGH WIND"             "BLIZZARD/WINTER STORM"         
## [37] "BLOW-OUT TIDE"                  "BLOW-OUT TIDES"                
## [39] "BLOWING DUST"                   "BLOWING SNOW"                  
## [41] "BLOWING SNOW- EXTREME WIND CHI" "BLOWING SNOW & EXTREME WIND CH"
## [43] "BLOWING SNOW/EXTREME WIND CHIL" "BREAKUP FLOODING"              
## [45] "BRUSH FIRE"                     "BRUSH FIRES"                   
## [47] "COASTAL  FLOODING/EROSION"      "COASTAL EROSION"               
## [49] "COASTAL FLOOD"                  "COASTAL FLOODING"
#event type to a factor
storm.data$EVTYPE <- as.factor(storm.data$EVTYPE)
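The list above still contains near-duplicates such as " TSTM WIND" with leading spaces or doubled spaces inside a name. A possible further cleanup is sketched below; `normalize_evtype` is a hypothetical helper that was not used to produce the results in this report, and treating "TSTM" as an abbreviation of "THUNDERSTORM" is an assumption.

```r
# Hypothetical helper: normalize case and whitespace in event type names.
normalize_evtype <- function(x) {
  x <- toupper(x)
  x <- gsub("\\s+", " ", x)             # collapse runs of whitespace
  x <- trimws(x)                        # drop leading/trailing spaces
  x <- sub("^TSTM", "THUNDERSTORM", x)  # assumption: TSTM means THUNDERSTORM
  x
}

raw <- c(" TSTM WIND", "TSTM WIND", "Thunderstorm Wind",
         "COASTAL  FLOODING/EROSION")
cleaned <- normalize_evtype(raw)
print(unique(cleaned))
```

Running such a step before the aggregation would merge these variants into a single event type, which would change the totals reported below.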
#top 10 fatalities by Event 

fatalities <- as.data.table(subset(aggregate(FATALITIES ~ EVTYPE, data = storm.data.evtype, 
    FUN = "sum"), FATALITIES > 0))
fatalities <- fatalities[order(-FATALITIES), ]

fatalities[1:10,]
##             EVTYPE FATALITIES
##  1:        TORNADO       5633
##  2: EXCESSIVE HEAT       1903
##  3:    FLASH FLOOD        978
##  4:           HEAT        937
##  5:      LIGHTNING        816
##  6:      TSTM WIND        504
##  7:          FLOOD        470
##  8:    RIP CURRENT        368
##  9:      HIGH WIND        248
## 10:      AVALANCHE        224
#top 10 injuries by Event
injuries <- as.data.table(subset(aggregate(INJURIES ~ EVTYPE, data = storm.data.evtype, 
    FUN = "sum"), INJURIES > 0))
injuries <- injuries[order(-INJURIES), ]

injuries[1:10, ]
##                EVTYPE INJURIES
##  1:           TORNADO    91346
##  2:         TSTM WIND     6957
##  3:             FLOOD     6789
##  4:    EXCESSIVE HEAT     6525
##  5:         LIGHTNING     5230
##  6:              HEAT     2100
##  7:         ICE STORM     1975
##  8:       FLASH FLOOD     1777
##  9: THUNDERSTORM WIND     1488
## 10:              HAIL     1361
#Tornadoes have by far the highest health consequences, both for fatalities and injuries. Excessive heat ranks second for fatalities, and thunderstorm wind (TSTM WIND) ranks second for injuries.


fatalities.plot <- ggplot(data = fatalities[1:10,], aes(EVTYPE, FATALITIES, fill = FATALITIES)) + geom_bar(stat = "identity") + ggtitle("Fatalities by Event") + 
    xlab("Event") + ylab("Fatalities") + 
    coord_flip() 

fatalities.plot

injuries.plot <- ggplot(data = injuries[1:10, ], aes(EVTYPE, INJURIES, fill = INJURIES)) + geom_bar(stat = "identity") + 
   ggtitle("Injuries by Event") +  xlab("Event") + ylab("Injuries") + 
    coord_flip() 

injuries.plot
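In the bar charts above the bars follow the alphabetical order of EVTYPE. A sketch of ordering them by value with `reorder()` is shown below, using the top three fatality counts from the table above; the data frame `top` is introduced only for illustration, and the plot is built only if ggplot2 is installed.

```r
# Top three rows taken from the fatalities table above.
top <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD"),
  FATALITIES = c(5633, 1903, 978),
  stringsAsFactors = FALSE
)

# Turn EVTYPE into a factor whose levels are sorted by FATALITIES (ascending).
top$EVTYPE <- reorder(top$EVTYPE, top$FATALITIES)
print(levels(top$EVTYPE))

# With coord_flip() the last level is drawn at the top of the chart.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ordered.plot <- ggplot(top, aes(EVTYPE, FATALITIES)) +
    geom_bar(stat = "identity") +
    coord_flip()
}
```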

Economic Damage

storm.damage <- storm.data.evtype[order(storm.data.evtype$damage, decreasing = TRUE),][1:10,]

storm.damage.property <- storm.damage[,c(1,4)]
names(storm.damage.property)[2] <- "DAMAGE"
storm.damage.property$TYPE <- "PROPERTY"
#property damage in US$ million
storm.damage.property$DAMAGE <- storm.damage.property$DAMAGE / 1000000

storm.damage.property [1:10,c("EVTYPE","DAMAGE")]
##                EVTYPE     DAMAGE
## 170             FLOOD 144657.710
## 411 HURRICANE/TYPHOON  69305.840
## 834           TORNADO  56937.161
## 670       STORM SURGE  43323.536
## 244              HAIL  15732.267
## 153       FLASH FLOOD  16140.812
## 95            DROUGHT   1046.106
## 402         HURRICANE  11868.319
## 590       RIVER FLOOD   5118.945
## 427         ICE STORM   3944.928
storm.damage.crop <- storm.damage[,c(1,5)]
names(storm.damage.crop)[2] <- "DAMAGE"
storm.damage.crop$TYPE <- "CROP"
#crop damage in US$ million
storm.damage.crop$DAMAGE <- storm.damage.crop$DAMAGE / 1000000

storm.damage.crop [1:10,c("EVTYPE","DAMAGE")]
##                EVTYPE     DAMAGE
## 170             FLOOD  5661.9685
## 411 HURRICANE/TYPHOON  2607.8728
## 834           TORNADO   414.9531
## 670       STORM SURGE     0.0050
## 244              HAIL  3025.9545
## 153       FLASH FLOOD  1421.3171
## 95            DROUGHT 13972.5660
## 402         HURRICANE  2741.9100
## 590       RIVER FLOOD  5029.4590
## 427         ICE STORM  5022.1135
damage.property.plot <- ggplot(data = storm.damage.property, aes(EVTYPE,DAMAGE, fill="TYPE")) + 
        geom_bar(stat="identity") + coord_flip()+
        ggtitle("Property Damage") +
        ylab("Damage in US$ million")+ 
        xlab("Event Type")+
        theme(legend.position = "none")

damage.property.plot 

damage.crop.plot <- ggplot(data = storm.damage.crop, aes(EVTYPE,DAMAGE, fill="TYPE")) + 
        geom_bar(stat="identity") + coord_flip()+
        ggtitle("Crop Damage") +
        ylab("Damage in US$ million")+ 
        xlab("Event Type")+
        theme(legend.position = "none")

damage.crop.plot

Top Events that caused the highest damage

storm.damage.plot <- rbind(storm.damage.property , storm.damage.crop)
storm.damage.plot <- transform(storm.damage.plot, EVTYPE=reorder(EVTYPE, -DAMAGE) ) 

# stacked bars: property and crop damage per event type
ggplot(storm.damage.plot, aes(EVTYPE, DAMAGE, fill = TYPE)) +
  geom_bar(stat = "identity") +
  ggtitle("Economic damage") +
  ylab("Damage in US$ million") +
  xlab("") +
  scale_fill_discrete("") +
  theme(axis.text.x = element_text(angle = 90))

Conclusions

Tornadoes are by far the leading cause of both fatalities and injuries, followed by excessive heat for fatalities and thunderstorm winds for injuries.

Floods are the most economically damaging event type, with roughly US$150 billion in combined property and crop damage.