In this report, I answered the following two questions by analyze NOAA storm data from 1951 to 2011: 1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences? And the conclusions are: 1) Tornado generated most fatalities and injuries, thus are most harmful with respect to population health. 2) Flood has made total lost about 1000b usd for property and corps, which has the greatest economic consequences.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
In this study, we are using the following dataset:
The documentations for the dataset are as below:
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
I first download and unzip the data.
if(!file.exists("./data")){dir.create("./data")}
fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,destfile="./data/data.csv.bz2",method="curl")
system("bunzip2 ./data/data.csv.bz2 ")
Then I read the data, which took a long time.
#colClasses = sapply(read.csv("./data/data.csv",nrows=200),class)
system.time(DF <- read.csv("./data/data.csv"))
## user system elapsed
## 214.522 3.387 230.349
I also tried to use faster fread() from data.table package, but this mechod causes some unknown errors.
require(data.table) #fast read
## Loading required package: data.table
system.time(DT<-fread("./data/data.csv"))
##
Read 0.0% of 908280 rows
## Warning in fread("./data/data.csv"): Bumped column 37 to type character on
## data row 208121, field contains ' as it was so called by the media'.
## Coercing previously read values in this column from logical, integer or
## numeric back to character which may not be lossless; e.g., if '00' and
## '000' occurred before they will now be just '0', and there may be
## inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in
## this column before the bump). If this matters please rerun and set
## 'colClasses' to 'character' for this column. Please note that column type
## detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so
## hopefully this message should be very rare. If reporting to
## datatable-help, please rerun and include the output from verbose=TRUE.
##
Read 25.3% of 908280 rows
## Error in fread("./data/data.csv"): Expected sep (',') but new line or EOF ends field 36 on line 249453 when reading data: 1.00,12/18/1996 0:00:00,"02:00:00 PM","CST",28.00,"ALZ028>029 - 035>038 - 040>049","AL","WINTER STORM",0.00,,,12/18/1996 0:00:00,"07:00:00 PM",0.00,,0.00,,,0.00,0.00,,0.00,0.00,0.00,240.00,"K",320.00,"K","BMX","ALABAMA, Central","CLAY - CLAY - RANDOLPH - CHILTON - COOSA - TALLAPOOSA - CHAMBERS - DALLAS - AUTAUGA - LOWNDES - ELMORE - MONTGOMERY - MACON - BULLOCK - LEE - RUSSELL - PIKE",0.00,0.00,0.00,0.00,"A snow storm that began in the early afternoon hours across the central sections of the state dumped 1 to 3 inches of snow on parts of the state. It was over by early evening. Schools and businesses let out early on the 18th across much of the area affected. A few roads became slick but there were no major travel problems reported. The snow remained on the ground in some areas for about two days. Here is a list of snowfall totals by county:
## Autauga 2-3"" Bullock 1"" Chambers
## Timing stopped at: 2.762 0.635 13.251
Check the dim of the data : 902297 by 37
dim(DF)
## [1] 902297 37
The column names are good
head(DF)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
First, we look at the column names/fields of the data:
colnames(DF)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We found FATALITIES and INJURIES should be the aspects to define how harmful with respect to populaton health. Then we sum these fields by event type (EVTYPE) and put them in a new variable called harmness.
#death_number <- sapply(levels(DF2[,"EVTYPE"]),function(x)(sum(subset(DF2,EVTYPE==x,FATALITIES)))) ## took a long time
require("plyr")
## Loading required package: plyr
system.time(harmness<-ddply(DF,.(EVTYPE),
summarise,
FATALITIES=sum(FATALITIES),
INJURIES=sum(INJURIES)))
## user system elapsed
## 1.695 0.408 2.128
We wanted to see how the harmness looks like:
summary(harmness)
## EVTYPE FATALITIES INJURIES
## HIGH SURF ADVISORY: 1 Min. : 0.00 Min. : 0.0
## COASTAL FLOOD : 1 1st Qu.: 0.00 1st Qu.: 0.0
## FLASH FLOOD : 1 Median : 0.00 Median : 0.0
## LIGHTNING : 1 Mean : 15.38 Mean : 142.7
## TSTM WIND : 1 3rd Qu.: 0.00 3rd Qu.: 0.0
## TSTM WIND (G45) : 1 Max. :5633.00 Max. :91346.0
## (Other) :979
Because there are many event has none death or injury, we need filter them out and find the index/rows with at least 1 death or injury.
idx <- which(harmness[,2]+harmness[,3]>0)
length(idx)
## [1] 220
Then we will plot the boxplot for both death and injury, also highlight the max points with red points.
boxplot(log10(harmness[idx,2:3]),ylab="log10(#)")
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn
points(1,log10(max(harmness[,2])),col="red")
text(1.2,log10(max(harmness[,2])),harmness[which.max(harmness[,2]),1],col="red")
points(2,log10(max(harmness[,3])),col="red")
text(2.2,log10(max(harmness[,3])),harmness[which.max(harmness[,2]),1],col="red")
As you can seen from the figure, both death and injury data suggests tornado is the most harmful event to population health.
For economic consequences, we need look at the damage fields of the data. Through field doc, we know these fields are:
| fields | Comment |
|---|---|
| PROPDMG | Property damage in whole numbers and hundredths |
| PROPDMGEXP | multiplier where Hundred (H), Thousand (K), Million (M), Billion (B) |
| CROPDMG | Crop damage in whole numbers and hundredths |
| CROPDMGEXP | multiplier where Hundred (H), Thousand (K), Million (M), Billion (B) |
I first get the useful data.
system.time(economy<-ddply(DF,.(EVTYPE,PROPDMGEXP,CROPDMGEXP),
summarise,
PROPDMG=sum(PROPDMG),
CROPDMG=sum(CROPDMG)))
## user system elapsed
## 2.980 0.503 3.920
head(economy)
## EVTYPE PROPDMGEXP CROPDMGEXP PROPDMG CROPDMG
## 1 HIGH SURF ADVISORY K 200 0
## 2 COASTAL FLOOD 0 0
## 3 FLASH FLOOD K 50 0
## 4 LIGHTNING 0 0
## 5 TSTM WIND 0 0
## 6 TSTM WIND K 100 0
Let's filter out zero damages.
economy <- subset(economy,PROPDMG+CROPDMG>0)
head(economy)
## EVTYPE PROPDMGEXP CROPDMGEXP PROPDMG CROPDMG
## 1 HIGH SURF ADVISORY K 200 0
## 3 FLASH FLOOD K 50 0
## 6 TSTM WIND K 100 0
## 7 TSTM WIND M 8 0
## 8 TSTM WIND (G45) K 8 0
## 11 ? K 5 0
dim(economy)
## [1] 881 5
length(levels(DF$EVTYPE))
## [1] 985
Construct the look-up table which will be useful in calculation sum. As there are atypical multiplier, I just made there unit coresonding to 0. This step should not impact the result, because the one has most influence should be the unit “B”.
prop_table <-rep(0,length(levels(economy$PROPDMGEXP)))
names(prop_table)<-levels(economy$PROPDMGEXP)
prop_table[c("B","h","H","K","m","M")]<-c(10^9,10^2,10^2,10^3,10^6,10^6)
prop_table
## - ? + 0 1 2 3 4 5 6 7
## 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00
## 8 B h H K m M
## 0e+00 1e+09 1e+02 1e+02 1e+03 1e+06 1e+06
crop_table <-rep(0,length(levels(economy$CROPDMGEXP)))
names(crop_table)<-levels(economy$CROPDMGEXP)
crop_table
## ? 0 2 B k K m M
## 0 0 0 0 0 0 0 0 0
crop_table[c("B","k","K","m","M")]<-c(10^9,10^3,10^3,10^6,10^6)
crop_table
## ? 0 2 B k K m M
## 0e+00 0e+00 0e+00 0e+00 1e+09 1e+03 1e+03 1e+06 1e+06
Now calculate the compressed data for economy lost.
head(economy)
## EVTYPE PROPDMGEXP CROPDMGEXP PROPDMG CROPDMG
## 1 HIGH SURF ADVISORY K 200 0
## 3 FLASH FLOOD K 50 0
## 6 TSTM WIND K 100 0
## 7 TSTM WIND M 8 0
## 8 TSTM WIND (G45) K 8 0
## 11 ? K 5 0
economy$Sum <- economy$PROPDMG*prop_table[economy$PROPDMGEXP]+
economy$CROPDMG*crop_table[economy$CROPDMGEXP]
economy2 <- ddply(economy,.(EVTYPE),
summarise,
SumLog10 = log10(sum(Sum+1))) # +1 for the low limit
head(economy2)
## EVTYPE SumLog10
## 1 HIGH SURF ADVISORY 5.301032
## 2 FLASH FLOOD 4.698979
## 3 TSTM WIND 6.908485
## 4 TSTM WIND (G45) 3.903144
## 5 ? 3.699057
## 6 AGRICULTURAL FREEZE 7.459694
Plot boxplot of the lost and highlight the max. The result is: Flood generated largest economy lost (~ 1000B)
boxplot(economy2$SumLog10,ylab = "Economy Lost (log10)",main="total economy damage")
points(1,max(economy2$SumLog10),col="red")
text(1.2,max(economy2$SumLog10),paste(economy2[which.max(economy2$SumLog10),1],
"10^",
sprintf("%.1f",max(economy2$SumLog10)),
"$Lost",sep=""),
col="red")
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plyr_1.8.1 data.table_1.9.4 knitr_1.8
##
## loaded via a namespace (and not attached):
## [1] chron_2.3-45 codetools_0.2-9 digest_0.6.4 evaluate_0.5.5
## [5] formatR_1.0 Rcpp_0.11.3 reshape2_1.4 stringr_0.6.2
## [9] tools_3.1.1