The following analysis is based on data included in Storm Data which is an official publication of the National Oceanic and Atmospheric Administration (NOAA).

Synopsis

*Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Tornadoes are the most harmful with respect to United States population health (see Figure 1 below). Tornadoes are the third most frequent event recorded, but are responsible for the most number of fatalities (5633) and the most number of injuries (91346). Excessive Heat has the second highest fatalities recorded (2171). Thunderstorm Winds cause the second most injuries (9353) followed by Excessive Heat at third highest injuries (7059).

*Across the United States, which types of events have the greatest economic consequences?

Floods cause the highest property damage ($144,657,709,800) in the United States, and also cause the second highest crop damage ($5,661,968,450) (see Figure 2 below). Hurricanes ($81,174,159,010) and Tornadoes ($56,937,160,480) cause the second and third highest property damage. Drought causes the most crop damage ($13,972,566,000), while Hurricanes cause the third highest crop damages ($5,349,782,800).

Overall Tornadoes are the most harmful events in the United States.

Data Processing

Obtain and load the dataset from the National Weather Service.

#Download the file
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormData.csv.bz2")
#Read the file
stormData <- read.csv("stormData.csv.bz2")

Explore the dataset.

dim(stormData)
## [1] 902297     37
head(stormData, 10)
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1        1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2        1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3        1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4        1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5        1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6        1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
## 7        1 11/16/1951 0:00:00     0100       CST      9     BLOUNT    AL
## 8        1  1/22/1952 0:00:00     0900       CST    123 TALLAPOOSA    AL
## 9        1  2/13/1952 0:00:00     2000       CST    125 TUSCALOOSA    AL
## 10       1  2/13/1952 0:00:00     2000       CST     57    FAYETTE    AL
##     EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1  TORNADO         0                                               0
## 2  TORNADO         0                                               0
## 3  TORNADO         0                                               0
## 4  TORNADO         0                                               0
## 5  TORNADO         0                                               0
## 6  TORNADO         0                                               0
## 7  TORNADO         0                                               0
## 8  TORNADO         0                                               0
## 9  TORNADO         0                                               0
## 10 TORNADO         0                                               0
##    COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1          NA         0                      14.0   100 3   0          0
## 2          NA         0                       2.0   150 2   0          0
## 3          NA         0                       0.1   123 2   0          0
## 4          NA         0                       0.0   100 2   0          0
## 5          NA         0                       0.0   150 2   0          0
## 6          NA         0                       1.5   177 2   0          0
## 7          NA         0                       1.5    33 2   0          0
## 8          NA         0                       0.0    33 1   0          0
## 9          NA         0                       3.3   100 3   0          1
## 10         NA         0                       2.3   100 3   0          0
##    INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1        15    25.0          K       0                                    
## 2         0     2.5          K       0                                    
## 3         2    25.0          K       0                                    
## 4         2     2.5          K       0                                    
## 5         2     2.5          K       0                                    
## 6         6     2.5          K       0                                    
## 7         1     2.5          K       0                                    
## 8         0     2.5          K       0                                    
## 9        14    25.0          K       0                                    
## 10        0    25.0          K       0                                    
##    LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1      3040      8812       3051       8806              1
## 2      3042      8755          0          0              2
## 3      3340      8742          0          0              3
## 4      3458      8626          0          0              4
## 5      3412      8642          0          0              5
## 6      3450      8748          0          0              6
## 7      3405      8631          0          0              7
## 8      3255      8558          0          0              8
## 9      3334      8740       3336       8738              9
## 10     3336      8738       3337       8737             10

Since we are interested in the Health and Property variables, explore these specific variables in more detail.

storm <- stormData[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
summary(storm)
##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9
mean(is.na(storm))
## [1] 0

There seem to be data quality issues in variables PROPDMGEXP and CROPDMGEXP. Explore and cleanse these variables.

summary(storm$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
summary(storm$CROPDMGEXP)
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

The EVTYPE field does not have standard values. Since most of the events have zero or few impacts, cleanse only those event types with significant impacts.

Also since there are comparatively few instances of deviations in PROPDMGEXP and CROPDMGEXP, set any factors not equal to “H”, “K”, “M”, or “B” equal to zero.

#clean EVTYPE
storm$EVTYPE <- toupper(storm$EVTYPE)
storm$EVTYPE[storm$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "THUNDERSTORM WINDS"] <- "THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "HIGH WINDS"] <- "HIGH WIND"
storm$EVTYPE[storm$EVTYPE == "RIP CURRENTS"] <- "RIP CURRENT"
storm$EVTYPE[storm$EVTYPE == "HURRICANE/TYPHOON"] <- "HURRICANE"
storm$EVTYPE[storm$EVTYPE == "HEAT WAVE"] <- "EXCESSIVE HEAT"
storm$EVTYPE[storm$EVTYPE == "EXTREME HEAT"] <- "EXCESSIVE HEAT"

#clean PROPDMGEXP
storm$PROPDMGEXP <- factor(toupper(storm$PROPDMGEXP))
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)==""] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="-"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="+"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="?"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="1"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="2"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="3"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="4"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="5"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="6"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="7"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="8"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="H"] <- "100"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="K"] <- "1000"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="M"] <- "1000000"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="B"] <- "1000000000"

#clean CROPDMGEXP
storm$CROPDMGEXP <- factor(toupper(storm$CROPDMGEXP))
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)==""] <- "0"
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)=="?"] <- "0"
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)=="2"] <- "0"
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)=="K"] <- "1000"
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)=="M"] <- "1000000"
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)=="B"] <- "1000000000"

Calculate the $ amount of damage expenses and assign to new variables.

#convert expense variables to numeric
storm$PROPDMGEXP <- as.numeric(as.character(storm$PROPDMGEXP))
storm$CROPDMGEXP <- as.numeric(as.character(storm$CROPDMGEXP))

storm$PROPDMGAMT <- storm$PROPDMG * storm$PROPDMGEXP
storm$CROPDMGAMT <- storm$CROPDMG * storm$CROPDMGEXP

Results

Check which events are most frequent.

library(plyr)
storm.events.count <- count(storm, 'EVTYPE')
head(storm.events.count[order(storm.events.count$freq, decreasing = T), ],20)
##                       EVTYPE   freq
## 680        THUNDERSTORM WIND 323349
## 211                     HAIL 288661
## 752                  TORNADO  60652
## 137              FLASH FLOOD  54277
## 153                    FLOOD  25327
## 318                HIGH WIND  21747
## 414                LIGHTNING  15754
## 272               HEAVY SNOW  15708
## 252               HEAVY RAIN  11742
## 881             WINTER STORM  11433
## 886           WINTER WEATHER   7045
## 189             FUNNEL CLOUD   6844
## 440         MARINE TSTM WIND   6175
## 439 MARINE THUNDERSTORM WIND   5812
## 848               WATERSPOUT   3796
## 599              STRONG WIND   3569
## 832     URBAN/SML STREAM FLD   3392
## 868                 WILDFIRE   2761
## 28                  BLIZZARD   2719
## 84                   DROUGHT   2488

To find out which type of events are more harmful to population health, aggregate the harmful health impacts by event type.

#For Health Impact
agg.storm.health.sum <-aggregate(storm[,c(2,3)], by=list(storm$EVTYPE), FUN=sum, na.rm=TRUE)
colnames(agg.storm.health.sum) <- c("Event.Type", "Fatalities.Sum", "Injuries.Sum")
agg.storm.health.mean <-aggregate(storm[,c(2,3)], by=list(storm$EVTYPE), FUN=mean, na.rm=TRUE)
colnames(agg.storm.health.mean) <- c("Event.Type", "Fatalities.Mean", "Injuries.Mean")
agg.storm.health <- merge(agg.storm.health.sum, agg.storm.health.mean, by="Event.Type")
#Take top 10 Entries
agg.storm.health.TopFatalities <- head(agg.storm.health[order(agg.storm.health$Fatalities.Sum, decreasing = T), ],10)
agg.storm.health.TopInjuries <- head(agg.storm.health[order(agg.storm.health$Injuries.Sum, decreasing = T), ],10)

Check Top Fatalities Data:

agg.storm.health.TopFatalities
##            Event.Type Fatalities.Sum Injuries.Sum Fatalities.Mean
## 752           TORNADO           5633        91346     0.092874101
## 116    EXCESSIVE HEAT           2171         7059     1.223098592
## 137       FLASH FLOOD            978         1777     0.018018682
## 242              HEAT            937         2100     1.221642764
## 414         LIGHTNING            816         5230     0.051796369
## 680 THUNDERSTORM WIND            701         9353     0.002167936
## 520       RIP CURRENT            572          529     0.739018088
## 153             FLOOD            470         6789     0.018557271
## 318         HIGH WIND            283         1439     0.013013289
## 19          AVALANCHE            224          170     0.580310881
##     Injuries.Mean
## 752    1.50606740
## 116    3.97690141
## 137    0.03273947
## 242    2.73794003
## 414    0.33197918
## 680    0.02892540
## 520    0.68346253
## 153    0.26805386
## 318    0.06617005
## 19     0.44041451

Check Top Injuries Data:

agg.storm.health.TopInjuries
##            Event.Type Fatalities.Sum Injuries.Sum Fatalities.Mean
## 752           TORNADO           5633        91346    9.287410e-02
## 680 THUNDERSTORM WIND            701         9353    2.167936e-03
## 116    EXCESSIVE HEAT           2171         7059    1.223099e+00
## 153             FLOOD            470         6789    1.855727e-02
## 414         LIGHTNING            816         5230    5.179637e-02
## 242              HEAT            937         2100    1.221643e+00
## 383         ICE STORM             89         1975    4.436690e-02
## 137       FLASH FLOOD            978         1777    1.801868e-02
## 318         HIGH WIND            283         1439    1.301329e-02
## 211              HAIL             15         1361    5.196407e-05
##     Injuries.Mean
## 752   1.506067401
## 680   0.028925403
## 116   3.976901408
## 153   0.268053856
## 414   0.331979180
## 242   2.737940026
## 383   0.984546361
## 137   0.032739466
## 318   0.066170046
## 211   0.004714873

For population and health impact, create barplot of Fatalities and Injuries.

library(ggplot2)
library(grid)
#Fatalities plot
h1 <- ggplot(agg.storm.health.TopFatalities, aes(x = factor(agg.storm.health.TopFatalities$Event.Type), y = agg.storm.health.TopFatalities$Fatalities.Sum)) + geom_bar(fill="red", stat = "identity")+ labs(x = "Event Type", y = "Fatalities Sum") + theme(axis.text.x = element_text(angle=45, hjust=1))

#Injuries Plot
h2 <- ggplot(agg.storm.health.TopInjuries, aes(x = factor(agg.storm.health.TopInjuries$Event.Type), y = agg.storm.health.TopInjuries$Injuries.Sum)) + geom_bar(fill="orange", stat = "identity")+ labs(x = "Event Type", y = "Injuries Sum") + theme(axis.text.x = element_text(angle=45, hjust=1))

grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 1)))
print(h1, vp = viewport(layout.pos.row = 1,layout.pos.col = 1))
print(h2, vp = viewport(layout.pos.row = 2,layout.pos.col = 1))

Figure 1: Top 10 Storm Events Types by Sum Fatalities and Top 10 Storm Events by Sum of Injuries. Tornadoes cause the most fatalities and injuries.

To find out which type of events have greatest economic impact, aggregate the harmful health impacts by event type.

#For Damage Impact
agg.storm.damages.sum <-aggregate(storm[,c(8,9)], by=list(storm$EVTYPE), FUN=sum, na.rm=TRUE)
colnames(agg.storm.damages.sum) <- c("Event.Type", "Property.Damages.Sum", "Crop.Damages.Sum")
agg.storm.damages.mean <-aggregate(storm[,c(8,9)], by=list(storm$EVTYPE), FUN=mean, na.rm=TRUE)
colnames(agg.storm.damages.mean) <- c("Event.Type", "Property.Damages.Mean", "Crop.Damages.Mean")
agg.storm.damages <- merge(agg.storm.damages.sum, agg.storm.damages.mean, by="Event.Type")
#Take top 10 entries
agg.storm.damages.TopProperty <- head(agg.storm.damages[order(agg.storm.damages$Property.Damages.Sum, decreasing = T), ],10)
agg.storm.damages.TopCrop <- head(agg.storm.damages[order(agg.storm.damages$Crop.Damages.Sum, decreasing = T), ],10)

Check Top Property Damage Data:

agg.storm.damages.TopProperty
##            Event.Type Property.Damages.Sum Crop.Damages.Sum
## 153             FLOOD         144657709800       5661968450
## 360         HURRICANE          81174159010       5349782800
## 752           TORNADO          56937160480        414953110
## 594       STORM SURGE          43323536000             5000
## 137       FLASH FLOOD          16140811510       1421317100
## 211              HAIL          15732267220       3025954450
## 680 THUNDERSTORM WIND           9704034430       1159505100
## 766    TROPICAL STORM           7703890550        678346000
## 881      WINTER STORM           6688497250         26944000
## 318         HIGH WIND           5878369910        679291900
##     Property.Damages.Mean Crop.Damages.Mean
## 153            5711600.66      2.235546e+05
## 360          309825034.39      2.041902e+07
## 752             938751.57      6.841540e+03
## 594          165990559.39      1.915709e+01
## 137             297378.48      2.618636e+04
## 211              54500.84      1.048273e+04
## 680              30011.02      3.585924e+03
## 766           11165058.77      9.831101e+05
## 881             585016.82      2.356687e+03
## 318             270307.16      3.123612e+04

Check Top Property Damage Data:

agg.storm.damages.TopCrop
##            Event.Type Property.Damages.Sum Crop.Damages.Sum
## 84            DROUGHT           1046106000      13972566000
## 153             FLOOD         144657709800       5661968450
## 360         HURRICANE          81174159010       5349782800
## 524       RIVER FLOOD           5118945500       5029459000
## 383         ICE STORM           3944927810       5022113500
## 211              HAIL          15732267220       3025954450
## 137       FLASH FLOOD          16140811510       1421317100
## 125      EXTREME COLD             67737400       1312973000
## 680 THUNDERSTORM WIND           9704034430       1159505100
## 186      FROST/FREEZE             10480000       1094186000
##     Property.Damages.Mean Crop.Damages.Mean
## 84           4.204606e+05       5615983.119
## 153          5.711601e+06        223554.643
## 360          3.098250e+08      20419018.321
## 524          2.958928e+07      29072017.341
## 383          1.966564e+06       2503546.112
## 211          5.450084e+04         10482.727
## 137          2.973785e+05         26186.361
## 125          1.031011e+05       1998436.834
## 680          3.001102e+04          3585.924
## 186          7.803425e+03        814732.688

For economic impact, create barplot of damage expenses for property and for crops.

#Property Damages Plot
d1 <- ggplot(agg.storm.damages.TopProperty, aes(x = factor(agg.storm.damages.TopProperty$Event.Type), y = agg.storm.damages.TopProperty$Property.Damages.Sum)) + geom_bar(fill="blue", stat = "identity")+ labs(x = "Event Type", y = "Property Damages Sum") + theme(axis.text.x = element_text(angle=45, hjust=1))

#Crop Damages Plot
d2 <- ggplot(agg.storm.damages.TopCrop, aes(x = factor(agg.storm.damages.TopCrop$Event.Type), y = agg.storm.damages.TopCrop$Crop.Damages.Sum)) + geom_bar(fill="green", stat = "identity")+ labs(x = "Event Type", y = "Crop Damages Sum") + theme(axis.text.x = element_text(angle=45, hjust=1))

grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 1)))
print(d1, vp = viewport(layout.pos.row = 1,layout.pos.col = 1))
print(d2, vp = viewport(layout.pos.row = 2,layout.pos.col = 1))

Figure 2: Top 10 Storm Events Types by Property Damage Amount(USD) and Top 10 Storm Events by Crop Damage Amount(USD). Floods cause the most property damage while Droughts cause the most crop damage.

Documentation

#dependencies platform version
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.2 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ggplot2_1.0.1 plyr_1.8.3   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.0      digest_0.6.8     MASS_7.3-40      gtable_0.1.2    
##  [5] formatR_1.2      magrittr_1.5     scales_0.2.5     evaluate_0.7    
##  [9] stringi_0.5-5    reshape2_1.4.1   rmarkdown_0.7    labeling_0.3    
## [13] proto_0.3-10     tools_3.2.1      stringr_1.0.0    munsell_0.4.2   
## [17] yaml_2.1.13      colorspace_1.2-6 htmltools_0.2.6  knitr_1.10.5