Synopsis

In this report we aim to describe the effects of Storms and other Severe Weather events on public health and economy. In order to achiveve this goal, we are exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The work will mainly consist on exploratory analysis. We start with the intuition that many severe events can result in fatalities, injuries, and property damage but do not initially set an overall hypothesis upfront.

Data Processing

System Info

All the work is done on the following platform

  R version 3.1.2 (2014-10-31)
  Platform: x86_64-apple-darwin13.4.0 (64-bit)
  Mac OSX 10.1.1 Yosemite  
  Model Name:  MacBook Pro  
  Processor Name:  Intel Core i7  
  Processor Speed:  2,2 GHz  
  Number of Processors: 1  
  Total Number of Cores:    4  
  L2 Cache (per Core):  256 KB  
  L3 Cache: 6 MB  
  Memory:   16 GB  

Loading and preprocessing the data

The data set is downloaded from the Course Web Site Storm Data
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The file is compressed bzip2. We uncompress with bzfile utility and read into R with read.csv()

#De-compress and read in the csv file. This may take a while, hence cache = TRUE
storm <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))

After reading the data, we check the general structure of the dataset, There are 902297 row of data of 37 different variables.

dim(storm)
## [1] 902297     37

We need to get the relevant columns out of the 37 listed

names(storm)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Analysis 1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Our analysis is mainly Based on the event type column (EVTYPE). For population health effects we also grab the relevant columns.
We define the health related columns as "FATALITIES" and "INJURIES" . Since the analysis is done Across the US we do not grab any location specific info (County, State or Longitude - Latitude etc.)

StormHealth <- (storm[, c("EVTYPE","FATALITIES","INJURIES")])
# A quick look
head(StormHealth)
##    EVTYPE FATALITIES INJURIES
## 1 TORNADO          0       15
## 2 TORNADO          0        0
## 3 TORNADO          0        2
## 4 TORNADO          0        2
## 5 TORNADO          0        2
## 6 TORNADO          0        6

Now we aggregate accross the event type to get total fatalities and injuries.

StormHealthAgg<- aggregate(cbind(FATALITIES,INJURIES) ~  EVTYPE, data = StormHealth, FUN = sum)

We define the most harmful to be the highest 10 events with total number of fatalities and injuries combined. So we add another column named TOTAL and sort our list by TOTAL column descending.

StormHealthAgg$TOTAL <- StormHealthAgg$FATALITIES + StormHealthAgg$INJURIES
Sorted<-StormHealthAgg[order(StormHealthAgg$TOTAL, decreasing = TRUE),]
# a Quick Look
head(Sorted, n = 30)
##                 EVTYPE FATALITIES INJURIES TOTAL
## 826            TORNADO       5633    91346 96979
## 124     EXCESSIVE HEAT       1903     6525  8428
## 846          TSTM WIND        504     6957  7461
## 167              FLOOD        470     6789  7259
## 453          LIGHTNING        816     5230  6046
## 271               HEAT        937     2100  3037
## 151        FLASH FLOOD        978     1777  2755
## 422          ICE STORM         89     1975  2064
## 753  THUNDERSTORM WIND        133     1488  1621
## 962       WINTER STORM        206     1321  1527
## 343          HIGH WIND        248     1137  1385
## 241               HAIL         15     1361  1376
## 393  HURRICANE/TYPHOON         64     1275  1339
## 299         HEAVY SNOW        127     1021  1148
## 949           WILDFIRE         75      911   986
## 779 THUNDERSTORM WINDS         64      908   972
## 28            BLIZZARD        101      805   906
## 183                FOG         62      734   796
## 572        RIP CURRENT        368      232   600
## 947   WILD/FOREST FIRE         12      545   557
## 573       RIP CURRENTS        204      297   501
## 273          HEAT WAVE        172      309   481
## 112         DUST STORM         22      440   462
## 967     WINTER WEATHER         33      398   431
## 839     TROPICAL STORM         58      340   398
## 19           AVALANCHE        224      170   394
## 132       EXTREME COLD        160      231   391
## 661        STRONG WIND        103      280   383
## 86           DENSE FOG         18      342   360
## 281         HEAVY RAIN         98      251   349

After a quick look at our data it seems there are 985 unique event types where the original data book describes 48 unique events. That means there are still duplicate rows with slightly different names (i.e THUNDERSTORM WIND vs TSTM WIND or THUNDERSTORM WINDS) that we need to combine for more accurate data.

We will only combine from the top of the list since most of the list is consisting of low numbers of damages or 0 damages and would not effect our overall analysis of the top 10 most harmful event list.

Here are the duplicate EVTYPE rows that matters and we combine them

Sorted[Sorted$EVTYPE == "TSTM WIND" | Sorted$EVTYPE == "THUNDERSTORM WINDS" | Sorted$EVTYPE == "TSTM WIND/HAIL",]$EVTYPE = "THUNDERSTORM WIND";
Sorted[Sorted$EVTYPE == "FOG",]$EVTYPE = "DENSE FOG";
Sorted[Sorted$EVTYPE == "WILD FIRES" | Sorted$EVTYPE == "WILD/FOREST FIRE",]$EVTYPE = "WILDFIRE";
Sorted[Sorted$EVTYPE == "RIP CURRENTS",]$EVTYPE = "RIP CURRENT";
Sorted[Sorted$EVTYPE == "HIGH WINDS",]$EVTYPE = "HIGH WIND";
Sorted[grep("FLASH FLOOD", Sorted$EVTYPE), ]$EVTYPE = "FLASH FLOOD";
Sorted[grep("HURRICANE", Sorted$EVTYPE), ]$EVTYPE  = "HURRICANE/TYPHOON";
Sorted[Sorted$EVTYPE == "ICE",]$EVTYPE = "ICE STORM";
Sorted[Sorted$EVTYPE == "HEAT WAVE" | Sorted$EVTYPE == "EXTREME HEAT" | Sorted$EVTYPE == "Heat Wave",]$EVTYPE = "EXCESSIVE HEAT";
Sorted[Sorted$EVTYPE == "WINTER STORMS",]$EVTYPE = "WINTER STORM";
Sorted[Sorted$EVTYPE == "RIVER FLOOD",]$EVTYPE = "FLOOD";
Sorted[grep("STORM SURGE", Sorted$EVTYPE), ]$EVTYPE  = "STORM SURGE";

# Aggregate and sort again
Sorted <- aggregate(cbind(FATALITIES, INJURIES, TOTAL) ~  EVTYPE, data = Sorted, FUN = sum)
Sorted<-Sorted[order(Sorted$TOTAL, decreasing = TRUE),]

Here is our final top 10 list

head(Sorted,n=10)
##                EVTYPE FATALITIES INJURIES TOTAL
## 788           TORNADO       5633    91346 96979
## 717 THUNDERSTORM WIND        706     9448 10154
## 123    EXCESSIVE HEAT       2171     7059  9230
## 151             FLOOD        472     6791  7263
## 421         LIGHTNING        816     5230  6046
## 252              HEAT        937     2100  3037
## 149       FLASH FLOOD       1035     1802  2837
## 391         ICE STORM         95     2112  2207
## 323         HIGH WIND        283     1439  1722
## 907          WILDFIRE         90     1606  1696

And the appropriate stacked barchart

library(lattice)
# Reorder the factor for plot bar ordering from max to min
Sorted$EVTYPE <- reorder(Sorted$EVTYPE , Sorted$TOTAL)
# And plot a stacked bar chart
barchart(EVTYPE ~ FATALITIES + INJURIES,data=Sorted[1:10,], horizontal = TRUE, auto.key = TRUE, stack = TRUE, ylab="Weather Event", par.settings=list(superpose.polygon=list(col=c("tomato1","royalblue2"))))

Analysis 2. Across the United States, which types of events have the greatest economic consequences?

Similarly we start with subsetting relevant columns EVTYPE","PROPDMG", "PROPDMGEXP","CROPDMG","CROPDMGEXP"

These match to Property Damage , Property Damage Multiplier (Exponent?), Crop Damage and Crop Damage Multiplier (Exponent?), respectively.

StormEcon <- (storm[, c("EVTYPE","PROPDMG", "PROPDMGEXP","CROPDMG","CROPDMGEXP")])
str(StormEcon)
## 'data.frame':    902297 obs. of  5 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 826 826 826 826 826 826 826 826 826 826 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","+","-","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","0","2","?",..: 1 1 1 1 1 1 1 1 1 1 ...

Looking at levels of CROPDMGEXP and PROPDMGEXP factors:

unique(StormEcon$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  0 2 ? B K M k m
unique(StormEcon$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  + - 0 1 2 3 4 5 6 7 8 ? B H K M h m

Since there is no actual databook, we can only guess variables like [K|k], [M|m], B , [H|h] would match to Kilo, Million ,Billion and Hundreds. Similarly we will assume numerics represent exponents of 10 . So 0 -> 1 , 3 -> 1000 etc.

We will ignore variables like ? + - and treat them as multipliers 0

We assume following multipliers for the CROPDMGEXP values
(? 0 2 B k K m M) <=>(0, 1, 100, 10^9, 1000, 1000, 10^6, 10^6)

And we assume following multipliers for PROPDMGEXP values

(+ - 0 1 2 3 4 5 6 7 8 ? B H K M h m) <==> (0, 0, 1, 10, 10^2, 10^ 3, 10^4, 10^5, 10^6, 10^7, 10^8, 0, 10^9, 100, 10^3, 10^6, 100, 10^6 )

Lets modify the table accordingly by adding Multiplier Columns

# Add CROPMULTIPLIER column based on CROPDMGEXP factors
StormEcon$CROPMULTIPLIER <- 0
StormEcon[StormEcon$CROPDMGEXP == "K" | StormEcon$CROPDMGEXP == "k",]$CROPMULTIPLIER = 10^3;
StormEcon[StormEcon$CROPDMGEXP == "M" | StormEcon$CROPDMGEXP == "m",]$CROPMULTIPLIER = 10^6;
StormEcon[StormEcon$CROPDMGEXP == "B" ,]$CROPMULTIPLIER = 10^9;
StormEcon[StormEcon$CROPDMGEXP == "0" ,]$CROPMULTIPLIER = 1;
StormEcon[StormEcon$CROPDMGEXP == "2" ,]$CROPMULTIPLIER = 100;

# Do the same thing for PROPDMGEXP
StormEcon$PROPMULTIPLIER <- 0
StormEcon[StormEcon$PROPDMGEXP == "K" | StormEcon$PROPDMGEXP == "k",]$PROPMULTIPLIER = 10^3;
StormEcon[StormEcon$PROPDMGEXP == "M" | StormEcon$PROPDMGEXP == "m",]$PROPMULTIPLIER = 10^6;
StormEcon[StormEcon$PROPDMGEXP == "H" | StormEcon$PROPDMGEXP == "h",]$PROPMULTIPLIER = 10^2;
StormEcon[StormEcon$PROPDMGEXP == "B" ,]$PROPMULTIPLIER = 10^9;
StormEcon[StormEcon$PROPDMGEXP == "0" ,]$PROPMULTIPLIER = 1;
StormEcon[StormEcon$PROPDMGEXP == "1" ,]$PROPMULTIPLIER = 10;
StormEcon[StormEcon$PROPDMGEXP == "2" ,]$PROPMULTIPLIER = 100;
StormEcon[StormEcon$PROPDMGEXP == "3" ,]$PROPMULTIPLIER = 10^3;
StormEcon[StormEcon$PROPDMGEXP == "4" ,]$PROPMULTIPLIER = 10^4;
StormEcon[StormEcon$PROPDMGEXP == "5" ,]$PROPMULTIPLIER = 10^5;
StormEcon[StormEcon$PROPDMGEXP == "6" ,]$PROPMULTIPLIER = 10^6;
StormEcon[StormEcon$PROPDMGEXP == "7" ,]$PROPMULTIPLIER = 10^7;
StormEcon[StormEcon$PROPDMGEXP == "8" ,]$PROPMULTIPLIER = 10^8;

Now calculate the actual damages in USD by multiplying the DMG with DMGMULTIPLIER and aggregate across EVTYPE. Similar to health data order by total damages.

StormEconAgg <- aggregate(cbind("PropertyDamage_USD" = PROPDMG * PROPMULTIPLIER, "CropDamage_USD" = CROPDMG * CROPMULTIPLIER) ~  EVTYPE, data = StormEcon, FUN = sum)

StormEconAgg$TOTAL_DMG <- StormEconAgg$CropDamage_USD + StormEconAgg$PropertyDamage_USD
SortedEcon <- StormEconAgg[order(StormEconAgg$TOTAL_DMG, decreasing = TRUE),]
# A quick look
head(SortedEcon[,c(1,4)], n=30)
##                         EVTYPE    TOTAL_DMG
## 167                      FLOOD 150319678250
## 393          HURRICANE/TYPHOON  71913712800
## 826                    TORNADO  57362333884
## 656                STORM SURGE  43323541000
## 241                       HAIL  18761221926
## 151                FLASH FLOOD  18243990872
## 91                     DROUGHT  15018672000
## 385                  HURRICANE  14610229010
## 577                RIVER FLOOD  10148404500
## 422                  ICE STORM   8967041360
## 839             TROPICAL STORM   8382236550
## 962               WINTER STORM   6715441251
## 343                  HIGH WIND   5908617560
## 949                   WILDFIRE   5060586800
## 846                  TSTM WIND   5038935845
## 657           STORM SURGE/TIDE   4642038000
## 753          THUNDERSTORM WIND   3897965520
## 390             HURRICANE OPAL   3191846000
## 947           WILD/FOREST FIRE   3108626330
## 287  HEAVY RAIN/SEVERE WEATHER   2500000000
## 779         THUNDERSTORM WINDS   2135245438
## 834 TORNADOES, TSTM WIND, HAIL   1602500000
## 281                 HEAVY RAIN   1427647890
## 132               EXTREME COLD   1360710400
## 599        SEVERE THUNDERSTORM   1205560000
## 198               FROST/FREEZE   1103566000
## 299                 HEAVY SNOW   1067412240
## 453                  LIGHTNING    942471516
## 28                    BLIZZARD    771273950
## 360                 HIGH WINDS    649044330

Similarly we have the data duplicate problem for EVTYPE column.

Here are the duplicate EVTYPE rows that matters and we combine them.

SortedEcon[SortedEcon$EVTYPE == "TSTM WIND" | SortedEcon$EVTYPE == "THUNDERSTORM WINDS" | SortedEcon$EVTYPE == "TSTM WIND/HAIL",]$EVTYPE = "THUNDERSTORM WIND";
SortedEcon[SortedEcon$EVTYPE == "FOG",]$EVTYPE = "DENSE FOG";
SortedEcon[SortedEcon$EVTYPE == "WILD FIRES" | SortedEcon$EVTYPE == "WILD/FOREST FIRE",]$EVTYPE = "WILDFIRE";
SortedEcon[SortedEcon$EVTYPE == "RIP CURRENTS",]$EVTYPE = "RIP CURRENT";
SortedEcon[SortedEcon$EVTYPE == "HIGH WINDS",]$EVTYPE = "HIGH WIND";
SortedEcon[grep("FLASH FLOOD", SortedEcon$EVTYPE), ]$EVTYPE  = "FLASH FLOOD";
SortedEcon[grep("HURRICANE", SortedEcon$EVTYPE), ]$EVTYPE  = "HURRICANE/TYPHOON";
SortedEcon[SortedEcon$EVTYPE == "ICE",]$EVTYPE = "ICE STORM";
SortedEcon[SortedEcon$EVTYPE == "HEAT WAVE" | SortedEcon$EVTYPE == "EXTREME HEAT" | SortedEcon$EVTYPE == "Heat Wave",]$EVTYPE = "EXCESSIVE HEAT";
SortedEcon[SortedEcon$EVTYPE == "WINTER STORMS",]$EVTYPE = "WINTER STORM";
SortedEcon[SortedEcon$EVTYPE == "RIVER FLOOD",]$EVTYPE = "FLOOD";
SortedEcon[grep("STORM SURGE", SortedEcon$EVTYPE), ]$EVTYPE  = "STORM SURGE";

# Aggregate and sort again
SortedEcon <- aggregate(cbind(CropDamage_USD, PropertyDamage_USD, TOTAL_DMG) ~  EVTYPE, data = SortedEcon, FUN = sum)
SortedEcon<-SortedEcon[order(SortedEcon$TOTAL_DMG, decreasing = TRUE),]

Again our strategy is the top 10 total damages are the most harmful weather events in terms of economy.
Here is our final top 10 list

head(SortedEcon[,c(1,4)],n=10)
##                EVTYPE    TOTAL_DMG
## 151             FLOOD 160468082750
## 364 HURRICANE/TYPHOON  90271472810
## 788           TORNADO  57362333884
## 621       STORM SURGE  47965579000
## 149       FLASH FLOOD  19120989028
## 222              HAIL  18761221926
## 90            DROUGHT  15018672000
## 717 THUNDERSTORM WIND  11181178553
## 391         ICE STORM   8979696360
## 907          WILDFIRE   8793313130

And the appropriate stacked barchart

library(lattice)
# Reorder the factor for plot bar ordering from max to min
SortedEcon$EVTYPE <- reorder(SortedEcon$EVTYPE , SortedEcon$TOTAL_DMG)
# And plot a stacked bar chart
barchart(EVTYPE ~ CropDamage_USD + PropertyDamage_USD ,data=SortedEcon[1:10,], horizontal = TRUE, auto.key = TRUE, stack = TRUE, ylab="Weather Event", xlab = "Total Damages in US Dollars",par.settings=list(superpose.polygon=list(col=c("tomato1","royalblue2"))))

Results

As stated before with the appropriate graphs, the most harmful 10 events for health and economy are again listed below

Top 10 Most Harmful Events for Public Health:

head(as.character(Sorted[,1]),n=10)
##  [1] "TORNADO"           "THUNDERSTORM WIND" "EXCESSIVE HEAT"   
##  [4] "FLOOD"             "LIGHTNING"         "HEAT"             
##  [7] "FLASH FLOOD"       "ICE STORM"         "HIGH WIND"        
## [10] "WILDFIRE"

Top 10 Most Harmful Events for Economy:

head(as.character(SortedEcon[,1]),n=10)
##  [1] "FLOOD"             "HURRICANE/TYPHOON" "TORNADO"          
##  [4] "STORM SURGE"       "FLASH FLOOD"       "HAIL"             
##  [7] "DROUGHT"           "THUNDERSTORM WIND" "ICE STORM"        
## [10] "WILDFIRE"