The following analysis is based on data included in Storm Data which is an official publication of the National Oceanic and Atmospheric Administration (NOAA).
*Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Tornadoes are the most harmful with respect to United States population health (see Figure 1 below). Tornadoes are the third most frequent event recorded, but are responsible for the most number of fatalities (5633) and the most number of injuries (91346). Excessive Heat has the second highest fatalities recorded (2171). Thunderstorm Winds cause the second most injuries (9353) followed by Excessive Heat at third highest injuries (7059).
*Across the United States, which types of events have the greatest economic consequences?
Floods cause the highest property damage ($144,657,709,800) in the United States, and also cause the second highest crop damage ($5,661,968,450) (see Figure 2 below). Hurricanes ($81,174,159,010) and Tornadoes ($56,937,160,480) cause the second and third highest property damage. Drought causes the most crop damage ($13,972,566,000), while Hurricanes cause the third highest crop damages ($5,349,782,800).
Overall Tornadoes are the most harmful events in the United States.
Obtain and load the dataset from the National Weather Service.
#Download the file
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormData.csv.bz2")
#Read the file
stormData <- read.csv("stormData.csv.bz2")
Explore the dataset.
dim(stormData)
## [1] 902297 37
head(stormData, 10)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## 7 1 11/16/1951 0:00:00 0100 CST 9 BLOUNT AL
## 8 1 1/22/1952 0:00:00 0900 CST 123 TALLAPOOSA AL
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## 7 TORNADO 0 0
## 8 TORNADO 0 0
## 9 TORNADO 0 0
## 10 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## 7 NA 0 1.5 33 2 0 0
## 8 NA 0 0.0 33 1 0 0
## 9 NA 0 3.3 100 3 0 1
## 10 NA 0 2.3 100 3 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## 7 1 2.5 K 0
## 8 0 2.5 K 0
## 9 14 25.0 K 0
## 10 0 25.0 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
## 7 3405 8631 0 0 7
## 8 3255 8558 0 0 8
## 9 3334 8740 3336 8738 9
## 10 3336 8738 3337 8737 10
Since we are interested in the Health and Property variables, explore these specific variables in more detail.
storm <- stormData[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
summary(storm)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
mean(is.na(storm))
## [1] 0
There seem to be data quality issues in variables PROPDMGEXP and CROPDMGEXP. Explore and cleanse these variables.
summary(storm$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
summary(storm$CROPDMGEXP)
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
The EVTYPE field does not have standard values. Since most of the events have zero or few impacts, cleanse only those event types with significant impacts.
Also since there are comparatively few instances of deviations in PROPDMGEXP and CROPDMGEXP, set any factors not equal to “H”, “K”, “M”, or “B” equal to zero.
#clean EVTYPE
storm$EVTYPE <- toupper(storm$EVTYPE)
storm$EVTYPE[storm$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "THUNDERSTORM WINDS"] <- "THUNDERSTORM WIND"
storm$EVTYPE[storm$EVTYPE == "HIGH WINDS"] <- "HIGH WIND"
storm$EVTYPE[storm$EVTYPE == "RIP CURRENTS"] <- "RIP CURRENT"
storm$EVTYPE[storm$EVTYPE == "HURRICANE/TYPHOON"] <- "HURRICANE"
storm$EVTYPE[storm$EVTYPE == "HEAT WAVE"] <- "EXCESSIVE HEAT"
storm$EVTYPE[storm$EVTYPE == "EXTREME HEAT"] <- "EXCESSIVE HEAT"
#clean PROPDMGEXP
storm$PROPDMGEXP <- factor(toupper(storm$PROPDMGEXP))
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)==""] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="-"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="+"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="?"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="1"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="2"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="3"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="4"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="5"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="6"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="7"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="8"] <- "0"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="H"] <- "100"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="K"] <- "1000"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="M"] <- "1000000"
levels(storm$PROPDMGEXP)[levels(storm$PROPDMGEXP)=="B"] <- "1000000000"
#clean CROPDMGEXP
storm$CROPDMGEXP <- factor(toupper(storm$CROPDMGEXP))
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)==""] <- "0"
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)=="?"] <- "0"
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)=="2"] <- "0"
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)=="K"] <- "1000"
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)=="M"] <- "1000000"
levels(storm$CROPDMGEXP)[levels(storm$CROPDMGEXP)=="B"] <- "1000000000"
Calculate the $ amount of damage expenses and assign to new variables.
#convert expense variables to numeric
storm$PROPDMGEXP <- as.numeric(as.character(storm$PROPDMGEXP))
storm$CROPDMGEXP <- as.numeric(as.character(storm$CROPDMGEXP))
storm$PROPDMGAMT <- storm$PROPDMG * storm$PROPDMGEXP
storm$CROPDMGAMT <- storm$CROPDMG * storm$CROPDMGEXP
Check which events are most frequent.
library(plyr)
storm.events.count <- count(storm, 'EVTYPE')
head(storm.events.count[order(storm.events.count$freq, decreasing = T), ],20)
## EVTYPE freq
## 680 THUNDERSTORM WIND 323349
## 211 HAIL 288661
## 752 TORNADO 60652
## 137 FLASH FLOOD 54277
## 153 FLOOD 25327
## 318 HIGH WIND 21747
## 414 LIGHTNING 15754
## 272 HEAVY SNOW 15708
## 252 HEAVY RAIN 11742
## 881 WINTER STORM 11433
## 886 WINTER WEATHER 7045
## 189 FUNNEL CLOUD 6844
## 440 MARINE TSTM WIND 6175
## 439 MARINE THUNDERSTORM WIND 5812
## 848 WATERSPOUT 3796
## 599 STRONG WIND 3569
## 832 URBAN/SML STREAM FLD 3392
## 868 WILDFIRE 2761
## 28 BLIZZARD 2719
## 84 DROUGHT 2488
To find out which type of events are more harmful to population health, aggregate the harmful health impacts by event type.
#For Health Impact
agg.storm.health.sum <-aggregate(storm[,c(2,3)], by=list(storm$EVTYPE), FUN=sum, na.rm=TRUE)
colnames(agg.storm.health.sum) <- c("Event.Type", "Fatalities.Sum", "Injuries.Sum")
agg.storm.health.mean <-aggregate(storm[,c(2,3)], by=list(storm$EVTYPE), FUN=mean, na.rm=TRUE)
colnames(agg.storm.health.mean) <- c("Event.Type", "Fatalities.Mean", "Injuries.Mean")
agg.storm.health <- merge(agg.storm.health.sum, agg.storm.health.mean, by="Event.Type")
#Take top 10 Entries
agg.storm.health.TopFatalities <- head(agg.storm.health[order(agg.storm.health$Fatalities.Sum, decreasing = T), ],10)
agg.storm.health.TopInjuries <- head(agg.storm.health[order(agg.storm.health$Injuries.Sum, decreasing = T), ],10)
Check Top Fatalities Data:
agg.storm.health.TopFatalities
## Event.Type Fatalities.Sum Injuries.Sum Fatalities.Mean
## 752 TORNADO 5633 91346 0.092874101
## 116 EXCESSIVE HEAT 2171 7059 1.223098592
## 137 FLASH FLOOD 978 1777 0.018018682
## 242 HEAT 937 2100 1.221642764
## 414 LIGHTNING 816 5230 0.051796369
## 680 THUNDERSTORM WIND 701 9353 0.002167936
## 520 RIP CURRENT 572 529 0.739018088
## 153 FLOOD 470 6789 0.018557271
## 318 HIGH WIND 283 1439 0.013013289
## 19 AVALANCHE 224 170 0.580310881
## Injuries.Mean
## 752 1.50606740
## 116 3.97690141
## 137 0.03273947
## 242 2.73794003
## 414 0.33197918
## 680 0.02892540
## 520 0.68346253
## 153 0.26805386
## 318 0.06617005
## 19 0.44041451
Check Top Injuries Data:
agg.storm.health.TopInjuries
## Event.Type Fatalities.Sum Injuries.Sum Fatalities.Mean
## 752 TORNADO 5633 91346 9.287410e-02
## 680 THUNDERSTORM WIND 701 9353 2.167936e-03
## 116 EXCESSIVE HEAT 2171 7059 1.223099e+00
## 153 FLOOD 470 6789 1.855727e-02
## 414 LIGHTNING 816 5230 5.179637e-02
## 242 HEAT 937 2100 1.221643e+00
## 383 ICE STORM 89 1975 4.436690e-02
## 137 FLASH FLOOD 978 1777 1.801868e-02
## 318 HIGH WIND 283 1439 1.301329e-02
## 211 HAIL 15 1361 5.196407e-05
## Injuries.Mean
## 752 1.506067401
## 680 0.028925403
## 116 3.976901408
## 153 0.268053856
## 414 0.331979180
## 242 2.737940026
## 383 0.984546361
## 137 0.032739466
## 318 0.066170046
## 211 0.004714873
For population and health impact, create barplot of Fatalities and Injuries.
library(ggplot2)
library(grid)
#Fatalities plot
h1 <- ggplot(agg.storm.health.TopFatalities, aes(x = factor(agg.storm.health.TopFatalities$Event.Type), y = agg.storm.health.TopFatalities$Fatalities.Sum)) + geom_bar(fill="red", stat = "identity")+ labs(x = "Event Type", y = "Fatalities Sum") + theme(axis.text.x = element_text(angle=45, hjust=1))
#Injuries Plot
h2 <- ggplot(agg.storm.health.TopInjuries, aes(x = factor(agg.storm.health.TopInjuries$Event.Type), y = agg.storm.health.TopInjuries$Injuries.Sum)) + geom_bar(fill="orange", stat = "identity")+ labs(x = "Event Type", y = "Injuries Sum") + theme(axis.text.x = element_text(angle=45, hjust=1))
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 1)))
print(h1, vp = viewport(layout.pos.row = 1,layout.pos.col = 1))
print(h2, vp = viewport(layout.pos.row = 2,layout.pos.col = 1))
Figure 1: Top 10 Storm Events Types by Sum Fatalities and Top 10 Storm Events by Sum of Injuries. Tornadoes cause the most fatalities and injuries.
To find out which type of events have greatest economic impact, aggregate the harmful health impacts by event type.
#For Damage Impact
agg.storm.damages.sum <-aggregate(storm[,c(8,9)], by=list(storm$EVTYPE), FUN=sum, na.rm=TRUE)
colnames(agg.storm.damages.sum) <- c("Event.Type", "Property.Damages.Sum", "Crop.Damages.Sum")
agg.storm.damages.mean <-aggregate(storm[,c(8,9)], by=list(storm$EVTYPE), FUN=mean, na.rm=TRUE)
colnames(agg.storm.damages.mean) <- c("Event.Type", "Property.Damages.Mean", "Crop.Damages.Mean")
agg.storm.damages <- merge(agg.storm.damages.sum, agg.storm.damages.mean, by="Event.Type")
#Take top 10 entries
agg.storm.damages.TopProperty <- head(agg.storm.damages[order(agg.storm.damages$Property.Damages.Sum, decreasing = T), ],10)
agg.storm.damages.TopCrop <- head(agg.storm.damages[order(agg.storm.damages$Crop.Damages.Sum, decreasing = T), ],10)
Check Top Property Damage Data:
agg.storm.damages.TopProperty
## Event.Type Property.Damages.Sum Crop.Damages.Sum
## 153 FLOOD 144657709800 5661968450
## 360 HURRICANE 81174159010 5349782800
## 752 TORNADO 56937160480 414953110
## 594 STORM SURGE 43323536000 5000
## 137 FLASH FLOOD 16140811510 1421317100
## 211 HAIL 15732267220 3025954450
## 680 THUNDERSTORM WIND 9704034430 1159505100
## 766 TROPICAL STORM 7703890550 678346000
## 881 WINTER STORM 6688497250 26944000
## 318 HIGH WIND 5878369910 679291900
## Property.Damages.Mean Crop.Damages.Mean
## 153 5711600.66 2.235546e+05
## 360 309825034.39 2.041902e+07
## 752 938751.57 6.841540e+03
## 594 165990559.39 1.915709e+01
## 137 297378.48 2.618636e+04
## 211 54500.84 1.048273e+04
## 680 30011.02 3.585924e+03
## 766 11165058.77 9.831101e+05
## 881 585016.82 2.356687e+03
## 318 270307.16 3.123612e+04
Check Top Property Damage Data:
agg.storm.damages.TopCrop
## Event.Type Property.Damages.Sum Crop.Damages.Sum
## 84 DROUGHT 1046106000 13972566000
## 153 FLOOD 144657709800 5661968450
## 360 HURRICANE 81174159010 5349782800
## 524 RIVER FLOOD 5118945500 5029459000
## 383 ICE STORM 3944927810 5022113500
## 211 HAIL 15732267220 3025954450
## 137 FLASH FLOOD 16140811510 1421317100
## 125 EXTREME COLD 67737400 1312973000
## 680 THUNDERSTORM WIND 9704034430 1159505100
## 186 FROST/FREEZE 10480000 1094186000
## Property.Damages.Mean Crop.Damages.Mean
## 84 4.204606e+05 5615983.119
## 153 5.711601e+06 223554.643
## 360 3.098250e+08 20419018.321
## 524 2.958928e+07 29072017.341
## 383 1.966564e+06 2503546.112
## 211 5.450084e+04 10482.727
## 137 2.973785e+05 26186.361
## 125 1.031011e+05 1998436.834
## 680 3.001102e+04 3585.924
## 186 7.803425e+03 814732.688
For economic impact, create barplot of damage expenses for property and for crops.
#Property Damages Plot
d1 <- ggplot(agg.storm.damages.TopProperty, aes(x = factor(agg.storm.damages.TopProperty$Event.Type), y = agg.storm.damages.TopProperty$Property.Damages.Sum)) + geom_bar(fill="blue", stat = "identity")+ labs(x = "Event Type", y = "Property Damages Sum") + theme(axis.text.x = element_text(angle=45, hjust=1))
#Crop Damages Plot
d2 <- ggplot(agg.storm.damages.TopCrop, aes(x = factor(agg.storm.damages.TopCrop$Event.Type), y = agg.storm.damages.TopCrop$Crop.Damages.Sum)) + geom_bar(fill="green", stat = "identity")+ labs(x = "Event Type", y = "Crop Damages Sum") + theme(axis.text.x = element_text(angle=45, hjust=1))
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 1)))
print(d1, vp = viewport(layout.pos.row = 1,layout.pos.col = 1))
print(d2, vp = viewport(layout.pos.row = 2,layout.pos.col = 1))
Figure 2: Top 10 Storm Events Types by Property Damage Amount(USD) and Top 10 Storm Events by Crop Damage Amount(USD). Floods cause the most property damage while Droughts cause the most crop damage.
#dependencies platform version
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.2 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_1.0.1 plyr_1.8.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.0 digest_0.6.8 MASS_7.3-40 gtable_0.1.2
## [5] formatR_1.2 magrittr_1.5 scales_0.2.5 evaluate_0.7
## [9] stringi_0.5-5 reshape2_1.4.1 rmarkdown_0.7 labeling_0.3
## [13] proto_0.3-10 tools_3.2.1 stringr_1.0.0 munsell_0.4.2
## [17] yaml_2.1.13 colorspace_1.2-6 htmltools_0.2.6 knitr_1.10.5