In this report we aim to describe the effects of Storms and other Severe Weather events on public health and economy. In order to achiveve this goal, we are exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The work will mainly consist on exploratory analysis. We start with the intuition that many severe events can result in fatalities, injuries, and property damage but do not initially set an overall hypothesis upfront.
All the work is done on the following platform
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Mac OSX 10.1.1 Yosemite
Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 2,2 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 16 GB
The data set is downloaded from the Course Web Site Storm Data
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The file is compressed bzip2. We uncompress with bzfile utility and read into R with read.csv()
#De-compress and read in the csv file. This may take a while, hence cache = TRUE
storm <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
After reading the data, we check the general structure of the dataset, There are 902297 row of data of 37 different variables.
dim(storm)
## [1] 902297 37
We need to get the relevant columns out of the 37 listed
names(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Our analysis is mainly Based on the event type column (EVTYPE). For population health effects we also grab the relevant columns.
We define the health related columns as "FATALITIES" and "INJURIES" . Since the analysis is done Across the US we do not grab any location specific info (County, State or Longitude - Latitude etc.)
StormHealth <- (storm[, c("EVTYPE","FATALITIES","INJURIES")])
# A quick look
head(StormHealth)
## EVTYPE FATALITIES INJURIES
## 1 TORNADO 0 15
## 2 TORNADO 0 0
## 3 TORNADO 0 2
## 4 TORNADO 0 2
## 5 TORNADO 0 2
## 6 TORNADO 0 6
Now we aggregate accross the event type to get total fatalities and injuries.
StormHealthAgg<- aggregate(cbind(FATALITIES,INJURIES) ~ EVTYPE, data = StormHealth, FUN = sum)
We define the most harmful to be the highest 10 events with total number of fatalities and injuries combined. So we add another column named TOTAL and sort our list by TOTAL column descending.
StormHealthAgg$TOTAL <- StormHealthAgg$FATALITIES + StormHealthAgg$INJURIES
Sorted<-StormHealthAgg[order(StormHealthAgg$TOTAL, decreasing = TRUE),]
# a Quick Look
head(Sorted, n = 30)
## EVTYPE FATALITIES INJURIES TOTAL
## 826 TORNADO 5633 91346 96979
## 124 EXCESSIVE HEAT 1903 6525 8428
## 846 TSTM WIND 504 6957 7461
## 167 FLOOD 470 6789 7259
## 453 LIGHTNING 816 5230 6046
## 271 HEAT 937 2100 3037
## 151 FLASH FLOOD 978 1777 2755
## 422 ICE STORM 89 1975 2064
## 753 THUNDERSTORM WIND 133 1488 1621
## 962 WINTER STORM 206 1321 1527
## 343 HIGH WIND 248 1137 1385
## 241 HAIL 15 1361 1376
## 393 HURRICANE/TYPHOON 64 1275 1339
## 299 HEAVY SNOW 127 1021 1148
## 949 WILDFIRE 75 911 986
## 779 THUNDERSTORM WINDS 64 908 972
## 28 BLIZZARD 101 805 906
## 183 FOG 62 734 796
## 572 RIP CURRENT 368 232 600
## 947 WILD/FOREST FIRE 12 545 557
## 573 RIP CURRENTS 204 297 501
## 273 HEAT WAVE 172 309 481
## 112 DUST STORM 22 440 462
## 967 WINTER WEATHER 33 398 431
## 839 TROPICAL STORM 58 340 398
## 19 AVALANCHE 224 170 394
## 132 EXTREME COLD 160 231 391
## 661 STRONG WIND 103 280 383
## 86 DENSE FOG 18 342 360
## 281 HEAVY RAIN 98 251 349
After a quick look at our data it seems there are 985 unique event types where the original data book describes 48 unique events. That means there are still duplicate rows with slightly different names (i.e THUNDERSTORM WIND vs TSTM WIND or THUNDERSTORM WINDS) that we need to combine for more accurate data.
We will only combine from the top of the list since most of the list is consisting of low numbers of damages or 0 damages and would not effect our overall analysis of the top 10 most harmful event list.
Here are the duplicate EVTYPE rows that matters and we combine them
Sorted[Sorted$EVTYPE == "TSTM WIND" | Sorted$EVTYPE == "THUNDERSTORM WINDS" | Sorted$EVTYPE == "TSTM WIND/HAIL",]$EVTYPE = "THUNDERSTORM WIND";
Sorted[Sorted$EVTYPE == "FOG",]$EVTYPE = "DENSE FOG";
Sorted[Sorted$EVTYPE == "WILD FIRES" | Sorted$EVTYPE == "WILD/FOREST FIRE",]$EVTYPE = "WILDFIRE";
Sorted[Sorted$EVTYPE == "RIP CURRENTS",]$EVTYPE = "RIP CURRENT";
Sorted[Sorted$EVTYPE == "HIGH WINDS",]$EVTYPE = "HIGH WIND";
Sorted[grep("FLASH FLOOD", Sorted$EVTYPE), ]$EVTYPE = "FLASH FLOOD";
Sorted[grep("HURRICANE", Sorted$EVTYPE), ]$EVTYPE = "HURRICANE/TYPHOON";
Sorted[Sorted$EVTYPE == "ICE",]$EVTYPE = "ICE STORM";
Sorted[Sorted$EVTYPE == "HEAT WAVE" | Sorted$EVTYPE == "EXTREME HEAT" | Sorted$EVTYPE == "Heat Wave",]$EVTYPE = "EXCESSIVE HEAT";
Sorted[Sorted$EVTYPE == "WINTER STORMS",]$EVTYPE = "WINTER STORM";
Sorted[Sorted$EVTYPE == "RIVER FLOOD",]$EVTYPE = "FLOOD";
Sorted[grep("STORM SURGE", Sorted$EVTYPE), ]$EVTYPE = "STORM SURGE";
# Aggregate and sort again
Sorted <- aggregate(cbind(FATALITIES, INJURIES, TOTAL) ~ EVTYPE, data = Sorted, FUN = sum)
Sorted<-Sorted[order(Sorted$TOTAL, decreasing = TRUE),]
Here is our final top 10 list
head(Sorted,n=10)
## EVTYPE FATALITIES INJURIES TOTAL
## 788 TORNADO 5633 91346 96979
## 717 THUNDERSTORM WIND 706 9448 10154
## 123 EXCESSIVE HEAT 2171 7059 9230
## 151 FLOOD 472 6791 7263
## 421 LIGHTNING 816 5230 6046
## 252 HEAT 937 2100 3037
## 149 FLASH FLOOD 1035 1802 2837
## 391 ICE STORM 95 2112 2207
## 323 HIGH WIND 283 1439 1722
## 907 WILDFIRE 90 1606 1696
And the appropriate stacked barchart
library(lattice)
# Reorder the factor for plot bar ordering from max to min
Sorted$EVTYPE <- reorder(Sorted$EVTYPE , Sorted$TOTAL)
# And plot a stacked bar chart
barchart(EVTYPE ~ FATALITIES + INJURIES,data=Sorted[1:10,], horizontal = TRUE, auto.key = TRUE, stack = TRUE, ylab="Weather Event", par.settings=list(superpose.polygon=list(col=c("tomato1","royalblue2"))))
Similarly we start with subsetting relevant columns EVTYPE","PROPDMG", "PROPDMGEXP","CROPDMG","CROPDMGEXP"
These match to Property Damage , Property Damage Multiplier (Exponent?), Crop Damage and Crop Damage Multiplier (Exponent?), respectively.
StormEcon <- (storm[, c("EVTYPE","PROPDMG", "PROPDMGEXP","CROPDMG","CROPDMGEXP")])
str(StormEcon)
## 'data.frame': 902297 obs. of 5 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 826 826 826 826 826 826 826 826 826 826 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","+","-","0",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","0","2","?",..: 1 1 1 1 1 1 1 1 1 1 ...
Looking at levels of CROPDMGEXP and PROPDMGEXP factors:
unique(StormEcon$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: 0 2 ? B K M k m
unique(StormEcon$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: + - 0 1 2 3 4 5 6 7 8 ? B H K M h m
Since there is no actual databook, we can only guess variables like [K|k], [M|m], B , [H|h] would match to Kilo, Million ,Billion and Hundreds. Similarly we will assume numerics represent exponents of 10 . So 0 -> 1 , 3 -> 1000 etc.
We will ignore variables like ? + - and treat them as multipliers 0
We assume following multipliers for the CROPDMGEXP values(? 0 2 B k K m M) <=>(0, 1, 100, 10^9, 1000, 1000, 10^6, 10^6)
And we assume following multipliers for PROPDMGEXP values
(+ - 0 1 2 3 4 5 6 7 8 ? B H K M h m) <==> (0, 0, 1, 10, 10^2, 10^ 3, 10^4, 10^5, 10^6, 10^7, 10^8, 0, 10^9, 100, 10^3, 10^6, 100, 10^6 )
Lets modify the table accordingly by adding Multiplier Columns
# Add CROPMULTIPLIER column based on CROPDMGEXP factors
StormEcon$CROPMULTIPLIER <- 0
StormEcon[StormEcon$CROPDMGEXP == "K" | StormEcon$CROPDMGEXP == "k",]$CROPMULTIPLIER = 10^3;
StormEcon[StormEcon$CROPDMGEXP == "M" | StormEcon$CROPDMGEXP == "m",]$CROPMULTIPLIER = 10^6;
StormEcon[StormEcon$CROPDMGEXP == "B" ,]$CROPMULTIPLIER = 10^9;
StormEcon[StormEcon$CROPDMGEXP == "0" ,]$CROPMULTIPLIER = 1;
StormEcon[StormEcon$CROPDMGEXP == "2" ,]$CROPMULTIPLIER = 100;
# Do the same thing for PROPDMGEXP
StormEcon$PROPMULTIPLIER <- 0
StormEcon[StormEcon$PROPDMGEXP == "K" | StormEcon$PROPDMGEXP == "k",]$PROPMULTIPLIER = 10^3;
StormEcon[StormEcon$PROPDMGEXP == "M" | StormEcon$PROPDMGEXP == "m",]$PROPMULTIPLIER = 10^6;
StormEcon[StormEcon$PROPDMGEXP == "H" | StormEcon$PROPDMGEXP == "h",]$PROPMULTIPLIER = 10^2;
StormEcon[StormEcon$PROPDMGEXP == "B" ,]$PROPMULTIPLIER = 10^9;
StormEcon[StormEcon$PROPDMGEXP == "0" ,]$PROPMULTIPLIER = 1;
StormEcon[StormEcon$PROPDMGEXP == "1" ,]$PROPMULTIPLIER = 10;
StormEcon[StormEcon$PROPDMGEXP == "2" ,]$PROPMULTIPLIER = 100;
StormEcon[StormEcon$PROPDMGEXP == "3" ,]$PROPMULTIPLIER = 10^3;
StormEcon[StormEcon$PROPDMGEXP == "4" ,]$PROPMULTIPLIER = 10^4;
StormEcon[StormEcon$PROPDMGEXP == "5" ,]$PROPMULTIPLIER = 10^5;
StormEcon[StormEcon$PROPDMGEXP == "6" ,]$PROPMULTIPLIER = 10^6;
StormEcon[StormEcon$PROPDMGEXP == "7" ,]$PROPMULTIPLIER = 10^7;
StormEcon[StormEcon$PROPDMGEXP == "8" ,]$PROPMULTIPLIER = 10^8;
Now calculate the actual damages in USD by multiplying the DMG with DMGMULTIPLIER and aggregate across EVTYPE. Similar to health data order by total damages.
StormEconAgg <- aggregate(cbind("PropertyDamage_USD" = PROPDMG * PROPMULTIPLIER, "CropDamage_USD" = CROPDMG * CROPMULTIPLIER) ~ EVTYPE, data = StormEcon, FUN = sum)
StormEconAgg$TOTAL_DMG <- StormEconAgg$CropDamage_USD + StormEconAgg$PropertyDamage_USD
SortedEcon <- StormEconAgg[order(StormEconAgg$TOTAL_DMG, decreasing = TRUE),]
# A quick look
head(SortedEcon[,c(1,4)], n=30)
## EVTYPE TOTAL_DMG
## 167 FLOOD 150319678250
## 393 HURRICANE/TYPHOON 71913712800
## 826 TORNADO 57362333884
## 656 STORM SURGE 43323541000
## 241 HAIL 18761221926
## 151 FLASH FLOOD 18243990872
## 91 DROUGHT 15018672000
## 385 HURRICANE 14610229010
## 577 RIVER FLOOD 10148404500
## 422 ICE STORM 8967041360
## 839 TROPICAL STORM 8382236550
## 962 WINTER STORM 6715441251
## 343 HIGH WIND 5908617560
## 949 WILDFIRE 5060586800
## 846 TSTM WIND 5038935845
## 657 STORM SURGE/TIDE 4642038000
## 753 THUNDERSTORM WIND 3897965520
## 390 HURRICANE OPAL 3191846000
## 947 WILD/FOREST FIRE 3108626330
## 287 HEAVY RAIN/SEVERE WEATHER 2500000000
## 779 THUNDERSTORM WINDS 2135245438
## 834 TORNADOES, TSTM WIND, HAIL 1602500000
## 281 HEAVY RAIN 1427647890
## 132 EXTREME COLD 1360710400
## 599 SEVERE THUNDERSTORM 1205560000
## 198 FROST/FREEZE 1103566000
## 299 HEAVY SNOW 1067412240
## 453 LIGHTNING 942471516
## 28 BLIZZARD 771273950
## 360 HIGH WINDS 649044330
Similarly we have the data duplicate problem for EVTYPE column.
Here are the duplicate EVTYPE rows that matters and we combine them.
SortedEcon[SortedEcon$EVTYPE == "TSTM WIND" | SortedEcon$EVTYPE == "THUNDERSTORM WINDS" | SortedEcon$EVTYPE == "TSTM WIND/HAIL",]$EVTYPE = "THUNDERSTORM WIND";
SortedEcon[SortedEcon$EVTYPE == "FOG",]$EVTYPE = "DENSE FOG";
SortedEcon[SortedEcon$EVTYPE == "WILD FIRES" | SortedEcon$EVTYPE == "WILD/FOREST FIRE",]$EVTYPE = "WILDFIRE";
SortedEcon[SortedEcon$EVTYPE == "RIP CURRENTS",]$EVTYPE = "RIP CURRENT";
SortedEcon[SortedEcon$EVTYPE == "HIGH WINDS",]$EVTYPE = "HIGH WIND";
SortedEcon[grep("FLASH FLOOD", SortedEcon$EVTYPE), ]$EVTYPE = "FLASH FLOOD";
SortedEcon[grep("HURRICANE", SortedEcon$EVTYPE), ]$EVTYPE = "HURRICANE/TYPHOON";
SortedEcon[SortedEcon$EVTYPE == "ICE",]$EVTYPE = "ICE STORM";
SortedEcon[SortedEcon$EVTYPE == "HEAT WAVE" | SortedEcon$EVTYPE == "EXTREME HEAT" | SortedEcon$EVTYPE == "Heat Wave",]$EVTYPE = "EXCESSIVE HEAT";
SortedEcon[SortedEcon$EVTYPE == "WINTER STORMS",]$EVTYPE = "WINTER STORM";
SortedEcon[SortedEcon$EVTYPE == "RIVER FLOOD",]$EVTYPE = "FLOOD";
SortedEcon[grep("STORM SURGE", SortedEcon$EVTYPE), ]$EVTYPE = "STORM SURGE";
# Aggregate and sort again
SortedEcon <- aggregate(cbind(CropDamage_USD, PropertyDamage_USD, TOTAL_DMG) ~ EVTYPE, data = SortedEcon, FUN = sum)
SortedEcon<-SortedEcon[order(SortedEcon$TOTAL_DMG, decreasing = TRUE),]
Again our strategy is the top 10 total damages are the most harmful weather events in terms of economy.
Here is our final top 10 list
head(SortedEcon[,c(1,4)],n=10)
## EVTYPE TOTAL_DMG
## 151 FLOOD 160468082750
## 364 HURRICANE/TYPHOON 90271472810
## 788 TORNADO 57362333884
## 621 STORM SURGE 47965579000
## 149 FLASH FLOOD 19120989028
## 222 HAIL 18761221926
## 90 DROUGHT 15018672000
## 717 THUNDERSTORM WIND 11181178553
## 391 ICE STORM 8979696360
## 907 WILDFIRE 8793313130
And the appropriate stacked barchart
library(lattice)
# Reorder the factor for plot bar ordering from max to min
SortedEcon$EVTYPE <- reorder(SortedEcon$EVTYPE , SortedEcon$TOTAL_DMG)
# And plot a stacked bar chart
barchart(EVTYPE ~ CropDamage_USD + PropertyDamage_USD ,data=SortedEcon[1:10,], horizontal = TRUE, auto.key = TRUE, stack = TRUE, ylab="Weather Event", xlab = "Total Damages in US Dollars",par.settings=list(superpose.polygon=list(col=c("tomato1","royalblue2"))))
As stated before with the appropriate graphs, the most harmful 10 events for health and economy are again listed below
Top 10 Most Harmful Events for Public Health:
head(as.character(Sorted[,1]),n=10)
## [1] "TORNADO" "THUNDERSTORM WIND" "EXCESSIVE HEAT"
## [4] "FLOOD" "LIGHTNING" "HEAT"
## [7] "FLASH FLOOD" "ICE STORM" "HIGH WIND"
## [10] "WILDFIRE"
Top 10 Most Harmful Events for Economy:
head(as.character(SortedEcon[,1]),n=10)
## [1] "FLOOD" "HURRICANE/TYPHOON" "TORNADO"
## [4] "STORM SURGE" "FLASH FLOOD" "HAIL"
## [7] "DROUGHT" "THUNDERSTORM WIND" "ICE STORM"
## [10] "WILDFIRE"