Torando is most harmful to populational health and Flood has most economy impact in US from 1951 to 2011

Synopsis

In this report, I answered the following two questions by analyze NOAA storm data from 1951 to 2011: 1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences? And the conclusions are: 1) Tornado generated most fatalities and injuries, thus are most harmful with respect to population health. 2) Flood has made total lost about 1000b usd for property and corps, which has the greatest economic consequences.

Intro

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Raw Data

In this study, we are using the following dataset:

Storm Data:a comma-separated-value file compressed via the bzip2

The documentations for the dataset are as below:

National Weather Service Storm Data Documentation:
National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

I first download and unzip the data.

if(!file.exists("./data")){dir.create("./data")}
fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,destfile="./data/data.csv.bz2",method="curl")
system("bunzip2 ./data/data.csv.bz2 ")

Then I read the data, which took a long time.

#colClasses = sapply(read.csv("./data/data.csv",nrows=200),class)
system.time(DF <- read.csv("./data/data.csv"))

##    user  system elapsed 
## 214.522   3.387 230.349

I also tried to use faster fread() from data.table package, but this mechod causes some unknown errors.

require(data.table) #fast read

## Loading required package: data.table

system.time(DT<-fread("./data/data.csv"))

## 
Read 0.0% of 908280 rows

## Warning in fread("./data/data.csv"): Bumped column 37 to type character on
## data row 208121, field contains ' as it was so called by the media'.
## Coercing previously read values in this column from logical, integer or
## numeric back to character which may not be lossless; e.g., if '00' and
## '000' occurred before they will now be just '0', and there may be
## inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in
## this column before the bump). If this matters please rerun and set
## 'colClasses' to 'character' for this column. Please note that column type
## detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so
## hopefully this message should be very rare. If reporting to
## datatable-help, please rerun and include the output from verbose=TRUE.

## 
Read 25.3% of 908280 rows

## Error in fread("./data/data.csv"): Expected sep (',') but new line or EOF ends field 36 on line 249453 when reading data: 1.00,12/18/1996 0:00:00,"02:00:00 PM","CST",28.00,"ALZ028>029 - 035>038 - 040>049","AL","WINTER STORM",0.00,,,12/18/1996 0:00:00,"07:00:00 PM",0.00,,0.00,,,0.00,0.00,,0.00,0.00,0.00,240.00,"K",320.00,"K","BMX","ALABAMA, Central","CLAY - CLAY - RANDOLPH - CHILTON - COOSA - TALLAPOOSA - CHAMBERS - DALLAS - AUTAUGA - LOWNDES - ELMORE - MONTGOMERY - MACON - BULLOCK - LEE - RUSSELL - PIKE",0.00,0.00,0.00,0.00,"A snow storm that began in the early afternoon hours across the central sections of the state dumped 1 to 3 inches of snow on parts of the state.  It was over by early evening.  Schools and businesses let out early on the 18th across much of the area affected.  A few roads became slick but there were no major travel problems reported.  The snow remained on the ground in some areas for about two days.  Here is a list of snowfall totals by county:
##   Autauga    2-3""     Bullock   1""      Chambers

## Timing stopped at: 2.762 0.635 13.251

Check the dim of the data : 902297 by 37

dim(DF)

## [1] 902297     37

The column names are good

head(DF)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Result

Which event are most harmful with respect to population health?

First, we look at the column names/fields of the data:

colnames(DF)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

We found FATALITIES and INJURIES should be the aspects to define how harmful with respect to populaton health. Then we sum these fields by event type (EVTYPE) and put them in a new variable called harmness.

#death_number <- sapply(levels(DF2[,"EVTYPE"]),function(x)(sum(subset(DF2,EVTYPE==x,FATALITIES)))) ## took a long time 
require("plyr")

## Loading required package: plyr

system.time(harmness<-ddply(DF,.(EVTYPE),
      summarise,
      FATALITIES=sum(FATALITIES),
      INJURIES=sum(INJURIES)))

##    user  system elapsed 
##   1.695   0.408   2.128

We wanted to see how the harmness looks like:

summary(harmness)

##                    EVTYPE      FATALITIES         INJURIES      
##     HIGH SURF ADVISORY:  1   Min.   :   0.00   Min.   :    0.0  
##   COASTAL FLOOD       :  1   1st Qu.:   0.00   1st Qu.:    0.0  
##   FLASH FLOOD         :  1   Median :   0.00   Median :    0.0  
##   LIGHTNING           :  1   Mean   :  15.38   Mean   :  142.7  
##   TSTM WIND           :  1   3rd Qu.:   0.00   3rd Qu.:    0.0  
##   TSTM WIND (G45)     :  1   Max.   :5633.00   Max.   :91346.0  
##  (Other)              :979

Because there are many event has none death or injury, we need filter them out and find the index/rows with at least 1 death or injury.

idx <- which(harmness[,2]+harmness[,3]>0)
length(idx)

## [1] 220

Then we will plot the boxplot for both death and injury, also highlight the max points with red points.

boxplot(log10(harmness[idx,2:3]),ylab="log10(#)")

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn

points(1,log10(max(harmness[,2])),col="red")        
text(1.2,log10(max(harmness[,2])),harmness[which.max(harmness[,2]),1],col="red")        
points(2,log10(max(harmness[,3])),col="red")        
text(2.2,log10(max(harmness[,3])),harmness[which.max(harmness[,2]),1],col="red")

plot of chunk unnamed-chunk-10

As you can seen from the figure, both death and injury data suggests tornado is the most harmful event to population health.

Types of Events are greatest economic consequences

For economic consequences, we need look at the damage fields of the data. Through field doc, we know these fields are:

fields	Comment
PROPDMG	Property damage in whole numbers and hundredths
PROPDMGEXP	multiplier where Hundred (H), Thousand (K), Million (M), Billion (B)
CROPDMG	Crop damage in whole numbers and hundredths
CROPDMGEXP	multiplier where Hundred (H), Thousand (K), Million (M), Billion (B)

I first get the useful data.

system.time(economy<-ddply(DF,.(EVTYPE,PROPDMGEXP,CROPDMGEXP),
      summarise,
      PROPDMG=sum(PROPDMG),
      CROPDMG=sum(CROPDMG)))

##    user  system elapsed 
##   2.980   0.503   3.920

head(economy)

##                  EVTYPE PROPDMGEXP CROPDMGEXP PROPDMG CROPDMG
## 1    HIGH SURF ADVISORY          K                200       0
## 2         COASTAL FLOOD                             0       0
## 3           FLASH FLOOD          K                 50       0
## 4             LIGHTNING                             0       0
## 5             TSTM WIND                             0       0
## 6             TSTM WIND          K                100       0

Let's filter out zero damages.

economy <- subset(economy,PROPDMG+CROPDMG>0)
head(economy)

##                   EVTYPE PROPDMGEXP CROPDMGEXP PROPDMG CROPDMG
## 1     HIGH SURF ADVISORY          K                200       0
## 3            FLASH FLOOD          K                 50       0
## 6              TSTM WIND          K                100       0
## 7              TSTM WIND          M                  8       0
## 8        TSTM WIND (G45)          K                  8       0
## 11                     ?          K                  5       0

dim(economy)

## [1] 881   5

length(levels(DF$EVTYPE))

## [1] 985

Construct the look-up table which will be useful in calculation sum. As there are atypical multiplier, I just made there unit coresonding to 0. This step should not impact the result, because the one has most influence should be the unit “B”.

prop_table <-rep(0,length(levels(economy$PROPDMGEXP)))
names(prop_table)<-levels(economy$PROPDMGEXP)                 
prop_table[c("B","h","H","K","m","M")]<-c(10^9,10^2,10^2,10^3,10^6,10^6)
prop_table

##           -     ?     +     0     1     2     3     4     5     6     7 
## 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 
##     8     B     h     H     K     m     M 
## 0e+00 1e+09 1e+02 1e+02 1e+03 1e+06 1e+06

crop_table <-rep(0,length(levels(economy$CROPDMGEXP)))
names(crop_table)<-levels(economy$CROPDMGEXP)                 
crop_table

##   ? 0 2 B k K m M 
## 0 0 0 0 0 0 0 0 0

crop_table[c("B","k","K","m","M")]<-c(10^9,10^3,10^3,10^6,10^6)
crop_table

##           ?     0     2     B     k     K     m     M 
## 0e+00 0e+00 0e+00 0e+00 1e+09 1e+03 1e+03 1e+06 1e+06

Now calculate the compressed data for economy lost.

head(economy)

##                   EVTYPE PROPDMGEXP CROPDMGEXP PROPDMG CROPDMG
## 1     HIGH SURF ADVISORY          K                200       0
## 3            FLASH FLOOD          K                 50       0
## 6              TSTM WIND          K                100       0
## 7              TSTM WIND          M                  8       0
## 8        TSTM WIND (G45)          K                  8       0
## 11                     ?          K                  5       0

economy$Sum <- economy$PROPDMG*prop_table[economy$PROPDMGEXP]+
  economy$CROPDMG*crop_table[economy$CROPDMGEXP]
economy2 <- ddply(economy,.(EVTYPE),
                  summarise,
                  SumLog10 = log10(sum(Sum+1))) # +1 for the low limit 
head(economy2)

##                  EVTYPE SumLog10
## 1    HIGH SURF ADVISORY 5.301032
## 2           FLASH FLOOD 4.698979
## 3             TSTM WIND 6.908485
## 4       TSTM WIND (G45) 3.903144
## 5                     ? 3.699057
## 6   AGRICULTURAL FREEZE 7.459694

Plot boxplot of the lost and highlight the max. The result is: Flood generated largest economy lost (~ 1000B)

boxplot(economy2$SumLog10,ylab = "Economy Lost (log10)",main="total economy damage")
points(1,max(economy2$SumLog10),col="red")
text(1.2,max(economy2$SumLog10),paste(economy2[which.max(economy2$SumLog10),1],
                                      "10^",
                                  sprintf("%.1f",max(economy2$SumLog10)),
                                          "$Lost",sep=""),      
             col="red")

plot of chunk unnamed-chunk-15

Session

sessionInfo()

## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plyr_1.8.1       data.table_1.9.4 knitr_1.8       
## 
## loaded via a namespace (and not attached):
## [1] chron_2.3-45    codetools_0.2-9 digest_0.6.4    evaluate_0.5.5 
## [5] formatR_1.0     Rcpp_0.11.3     reshape2_1.4    stringr_0.6.2  
## [9] tools_3.1.1