Storms and their impact on the USA’s healt of population and economic damage

Synopsis

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

In this short project we:

  • will research which type of event is the most harmfull to the health of the population;
  • will determine which type of event have the greater economic impact.

To determine the most harmful event we use a simple weighted system for adding fatalities and injuries (based on http://www.rssb.co.uk/rgs/standards/GEGN8642%20Iss%202.pdf)

For the economic conqequences we evaluate the damage, looking at damage to both properties and crops. The sum of both damages is then used as an indicator of the total damage were both types of damage have the same weight.


First we have to ensure the locale is correct, lets ensure all is in english….

Sys.setlocale("LC_ALL","English");
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Data loading

To prevent downloading for each run this analysis uses the download and unzipped file, we’ve downloaded it once earlier. We assume the unzipped file is in the working directory. We left the name as is.The data is downloaded from https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2 at 2 juli 2016, 23:38 local time, Europe.

stormdata <- read.csv("repdata%2Fdata%2FStormData.csv", header=TRUE, sep=",");

First inspection

To get some impression of the storm data we have a quick peek.

# look at first few rows
head(stormdata);
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
# Lets see what the data looks like
summary(stormdata);
##     STATE__                  BGN_DATE             BGN_TIME     
##  Min.   : 1.0   5/25/2011 0:00:00:  1202   12:00:00 AM: 10163  
##  1st Qu.:19.0   4/27/2011 0:00:00:  1193   06:00:00 PM:  7350  
##  Median :30.0   6/9/2011 0:00:00 :  1030   04:00:00 PM:  7261  
##  Mean   :31.2   5/30/2004 0:00:00:  1016   05:00:00 PM:  6891  
##  3rd Qu.:45.0   4/4/2011 0:00:00 :  1009   12:00:00 PM:  6703  
##  Max.   :95.0   4/2/2006 0:00:00 :   981   03:00:00 PM:  6700  
##                 (Other)          :895866   (Other)    :857229  
##    TIME_ZONE          COUNTY           COUNTYNAME         STATE       
##  CST    :547493   Min.   :  0.0   JEFFERSON :  7840   TX     : 83728  
##  EST    :245558   1st Qu.: 31.0   WASHINGTON:  7603   KS     : 53440  
##  MST    : 68390   Median : 75.0   JACKSON   :  6660   OK     : 46802  
##  PST    : 28302   Mean   :100.6   FRANKLIN  :  6256   MO     : 35648  
##  AST    :  6360   3rd Qu.:131.0   LINCOLN   :  5937   IA     : 31069  
##  HST    :  2563   Max.   :873.0   MADISON   :  5632   NE     : 30271  
##  (Other):  3631                   (Other)   :862369   (Other):621339  
##                EVTYPE         BGN_RANGE           BGN_AZI      
##  HAIL             :288661   Min.   :   0.000          :547332  
##  TSTM WIND        :219940   1st Qu.:   0.000   N      : 86752  
##  THUNDERSTORM WIND: 82563   Median :   0.000   W      : 38446  
##  TORNADO          : 60652   Mean   :   1.484   S      : 37558  
##  FLASH FLOOD      : 54277   3rd Qu.:   1.000   E      : 33178  
##  FLOOD            : 25326   Max.   :3749.000   NW     : 24041  
##  (Other)          :170878                      (Other):134990  
##          BGN_LOCATI                  END_DATE             END_TIME     
##               :287743                    :243411              :238978  
##  COUNTYWIDE   : 19680   4/27/2011 0:00:00:  1214   06:00:00 PM:  9802  
##  Countywide   :   993   5/25/2011 0:00:00:  1196   05:00:00 PM:  8314  
##  SPRINGFIELD  :   843   6/9/2011 0:00:00 :  1021   04:00:00 PM:  8104  
##  SOUTH PORTION:   810   4/4/2011 0:00:00 :  1007   12:00:00 PM:  7483  
##  NORTH PORTION:   784   5/30/2004 0:00:00:   998   11:59:00 PM:  7184  
##  (Other)      :591444   (Other)          :653450   (Other)    :622432  
##    COUNTY_END COUNTYENDN       END_RANGE           END_AZI      
##  Min.   :0    Mode:logical   Min.   :  0.0000          :724837  
##  1st Qu.:0    NA's:902297    1st Qu.:  0.0000   N      : 28082  
##  Median :0                   Median :  0.0000   S      : 22510  
##  Mean   :0                   Mean   :  0.9862   W      : 20119  
##  3rd Qu.:0                   3rd Qu.:  0.0000   E      : 20047  
##  Max.   :0                   Max.   :925.0000   NE     : 14606  
##                                                 (Other): 72096  
##            END_LOCATI         LENGTH              WIDTH         
##                 :499225   Min.   :   0.0000   Min.   :   0.000  
##  COUNTYWIDE     : 19731   1st Qu.:   0.0000   1st Qu.:   0.000  
##  SOUTH PORTION  :   833   Median :   0.0000   Median :   0.000  
##  NORTH PORTION  :   780   Mean   :   0.2301   Mean   :   7.503  
##  CENTRAL PORTION:   617   3rd Qu.:   0.0000   3rd Qu.:   0.000  
##  SPRINGFIELD    :   575   Max.   :2315.0000   Max.   :4400.000  
##  (Other)        :380536                                         
##        F               MAG            FATALITIES          INJURIES        
##  Min.   :0.0      Min.   :    0.0   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:0.0      1st Qu.:    0.0   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :1.0      Median :   50.0   Median :  0.0000   Median :   0.0000  
##  Mean   :0.9      Mean   :   46.9   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:1.0      3rd Qu.:   75.0   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :5.0      Max.   :22000.0   Max.   :583.0000   Max.   :1700.0000  
##  NA's   :843563                                                           
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9  
##       WFO                                       STATEOFFIC    
##         :142069                                      :248769  
##  OUN    : 17393   TEXAS, North                       : 12193  
##  JAN    : 13889   ARKANSAS, Central and North Central: 11738  
##  LWX    : 13174   IOWA, Central                      : 11345  
##  PHI    : 12551   KANSAS, Southwest                  : 11212  
##  TSA    : 12483   GEORGIA, North and Central         : 11120  
##  (Other):690738   (Other)                            :595920  
##                                                                                                                                                                                                     ZONENAMES     
##                                                                                                                                                                                                          :594029  
##                                                                                                                                                                                                          :205988  
##  GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M                                                                                                                                         :   639  
##  GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA                                                                                                                                                       :   592  
##  JEFFERSON - JEFFERSON                                                                                                                                                                                   :   303  
##  MADISON - MADISON                                                                                                                                                                                       :   302  
##  (Other)                                                                                                                                                                                                 :100444  
##     LATITUDE      LONGITUDE        LATITUDE_E     LONGITUDE_    
##  Min.   :   0   Min.   :-14451   Min.   :   0   Min.   :-14455  
##  1st Qu.:2802   1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0  
##  Median :3540   Median :  8707   Median :   0   Median :     0  
##  Mean   :2875   Mean   :  6940   Mean   :1452   Mean   :  3509  
##  3rd Qu.:4019   3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735  
##  Max.   :9706   Max.   : 17124   Max.   :9706   Max.   :106220  
##  NA's   :47                      NA's   :40                     
##                                            REMARKS           REFNUM      
##                                                :287433   Min.   :     1  
##                                                : 24013   1st Qu.:225575  
##  Trees down.\n                                 :  1110   Median :451149  
##  Several trees were blown down.\n              :   569   Mean   :451149  
##  Trees were downed.\n                          :   446   3rd Qu.:676723  
##  Large trees and power lines were blown down.\n:   432   Max.   :902297  
##  (Other)                                       :588294
# What is the type of a column?
sapply(stormdata, class);
##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME 
##  "numeric"   "factor"   "factor"   "factor"  "numeric"   "factor" 
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE 
##   "factor"   "factor"  "numeric"   "factor"   "factor"   "factor" 
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI 
##   "factor"  "numeric"  "logical"  "numeric"   "factor"   "factor" 
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES 
##  "numeric"  "numeric"  "integer"  "numeric"  "numeric"  "numeric" 
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC 
##  "numeric"   "factor"  "numeric"   "factor"   "factor"   "factor" 
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS 
##   "factor"  "numeric"  "numeric"  "numeric"  "numeric"   "factor" 
##     REFNUM 
##  "numeric"
# Which events are in our data? List of distinct values?
# unique(stormdata$EVTYPE)

# Are there NA's?
sum(is.na(stormdata$EVTYPE));
## [1] 0
# What are the damage exp values, for property and crops
unique(stormdata$PROPDMGEXP);
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(stormdata$CROPDMGEXP);
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

Data processing

To see which events are most harmfull we need to express the harm. The data contains information about injuries and fatalities. One issue is how to aggregate to one value; what’s the ‘weight’ of an injury compared to a fatality? There is no column of weighted injuries or something like that. To keep it simple we treat all injuries as major with weight 0.1, as derived from the britsich railroad standards (http://www.rssb.co.uk/rgs/standards/GEGN8642%20Iss%202.pdf)

First we aggregate for both the fatalities and injuries and order the data by decreasing weighted harm.

# Only aggregate for the fatalitieshead and injuries
stoda.harm.aggregated <- aggregate(stormdata[c("FATALITIES","INJURIES")], by=list(stormdata$EVTYPE), sum);
names(stoda.harm.aggregated)[1]<-"StormEvent";

# weighted order, injury is 0.1 of a fatality (britsch railroad standards...)
stoda.harm.aggregated$Weighted <- stoda.harm.aggregated$FATALITIES + stoda.harm.aggregated$INJURIES * 0.1;
stoda.harm.aggr.order.indexes  <- order(stoda.harm.aggregated$Weighted, decreasing = TRUE);

To see which events have the greatest economic impact we need to express the costs. We need to map the EXP columns to a simple multiplier. In the analysis we saw the list of unique Exponents…

Exponent
For properties - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
For crops ? 0 2 B k K m M
Mapping 10^x, x 0 0 0 0 1 2 3 4 5 6 7 8 9 2 2 3 3 6 6

Where B is billion, h is hundreds, k is thousands and m is millsions, by converting all to upper we can shorten the lookup a bit.

# Create lookup table <todo smarter/shorter>
library(plyr);
base.list <- c(0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 6);
exp.list  <- c("", "-", "?", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8", "B", "H", "K", "M");

# now use lookup factor and determine cost <todo in one step>
# Mmm. wanted something simpler.

lookup.table      <- data.frame(prop.dmg.factor = base.list, PROPDMGEXP = exp.list);
stormdata.factors <- join(stormdata, lookup.table, by='PROPDMGEXP');

lookup.table      <- data.frame(crop.dmg.factor = base.list, CROPDMGEXP = exp.list);
stormdata.factors <- join(stormdata.factors, lookup.table, by='CROPDMGEXP');

# We have the factors now determine the amount of $
stormdata.factors$CROPDMGAMOUNT <- stormdata.factors$CROPDMG * 10 ^ stormdata.factors$crop.dmg.factor;
stormdata.factors$PROPDMGAMOUNT <- stormdata.factors$PROPDMG * 10 ^ stormdata.factors$prop.dmg.factor;
stormdata.factors$TOTDMGAMOUNT  <- stormdata.factors$CROPDMGAMOUNT + stormdata.factors$PROPDMGAMOUNT;

Results

Using the aggregated and weigthed injuries we show the top 7 events resulting in the most harm.

# show first 7 rows
stoda.harm.aggregated[stoda.harm.aggr.order.indexes[1:7],];
##         StormEvent FATALITIES INJURIES Weighted
## 834        TORNADO       5633    91346  14767.6
## 130 EXCESSIVE HEAT       1903     6525   2555.5
## 464      LIGHTNING        816     5230   1339.0
## 856      TSTM WIND        504     6957   1199.7
## 153    FLASH FLOOD        978     1777   1155.7
## 170          FLOOD        470     6789   1148.9
## 275           HEAT        937     2100   1147.0

To get a better insight the top 7 is shown in an ordered bar plot. Note that the barplot is only for a quick relative overview.

library(ggplot2);
library(reshape2);

stoda.harm.aggregated.top7 <- stoda.harm.aggregated[stoda.harm.aggr.order.indexes[1:7],];
# reset factor
stoda.harm.aggregated.top7$StormEvent <- factor(stoda.harm.aggregated.top7$StormEvent);

# One figure
# with barplot order by weighted value
stoda.harm.aggregated.top7.melted <- melt(stoda.harm.aggregated.top7, id.vars=c("StormEvent", "Weighted"));

#stoda.harm.aggregated.top7.melted
ggplot(stoda.harm.aggregated.top7.melted, aes(x=reorder(StormEvent, -Weighted), y=value, fill=variable))+geom_bar(stat = "identity") + labs(title="Top 7 Storm related events", x = "Storm Event", y = "#people invloved (weighted value)")+ theme(axis.text.x = element_text(face="bold", color="#993333", size=8, angle=45), axis.text.y = element_text(face="bold", color="#993333", size=12, angle=45));

Using the amount of $ in damages for both property and crops we can have a look at the top7.

stoda.economic.aggregated <- aggregate(stormdata.factors[c("CROPDMGAMOUNT","PROPDMGAMOUNT", "TOTDMGAMOUNT")], by=list(stormdata.factors$EVTYPE), sum);
names(stoda.economic.aggregated)[1]<-"StormEvent"

stoda.economic.aggr.ordered.indexes <- order(stoda.economic.aggregated$TOTDMGAMOUNT, decreasing = TRUE);
head(stoda.economic.aggr.ordered.indexes);
## [1] 170 411 670 153  95 402
# show first 7 rows
stoda.economic.aggr.top7 <- stoda.economic.aggregated[stoda.economic.aggr.ordered.indexes[1:7], ];

# Refactor in M$ (i.e. divide by 1000000)
stoda.economic.aggr.top7$CROPDMGAMOUNT <- stoda.economic.aggr.top7$CROPDMGAMOUNT / 1000000;
stoda.economic.aggr.top7$PROPDMGAMOUNT <- stoda.economic.aggr.top7$PROPDMGAMOUNT / 1000000;
stoda.economic.aggr.top7$TOTDMGAMOUNT  <- stoda.economic.aggr.top7$TOTDMGAMOUNT  / 1000000;
stoda.economic.aggr.top7
##            StormEvent CROPDMGAMOUNT PROPDMGAMOUNT TOTDMGAMOUNT
## 170             FLOOD      5661.968    144657.710    150319.68
## 411 HURRICANE/TYPHOON      2607.873     69305.840     71913.71
## 670       STORM SURGE         0.005     43323.536     43323.54
## 153       FLASH FLOOD      1421.317     16822.674     18243.99
## 95            DROUGHT     13972.566      1046.106     15018.67
## 402         HURRICANE      2741.910     11868.319     14610.23
## 590       RIVER FLOOD      5029.459      5118.945     10148.40

To get an idea of the top 7 events with respect to economic damage we use the same type of diagram as we used for harm. To be able to handle the damagecost they’re scaled to million dollar figures.

stoda.economic.aggr.top7.melted <- melt(stoda.economic.aggr.top7, id.vars=c("StormEvent", "TOTDMGAMOUNT"));

ggplot(stoda.economic.aggr.top7.melted, aes(x=reorder(StormEvent, -TOTDMGAMOUNT), y=value, fill=variable)) + 
  geom_bar(stat = "identity") + labs(title="Top 7 Storm related events", x = "Storm Event", y = "Economic damage in m$") +
  theme(axis.text.x = element_text(face="bold", color="#993333", size=8, angle=45), axis.text.y = element_text(face="bold", color="#993333", size=12, angle=45));

Conclusion

Looking at the two major factors, harm and economic costs we see that tornado’s lead to most harmed people while flood leads to the greatest economic impact on properties.