Synopsis

Storm Data documents the occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce.
From the human health point of view, tornade was the most harmful phenomenon. It occupied 37% of fatality and 65% of injury.
As to economic losses, property damage was caused by flood. It took 34% of all cause of property damage and 32% of all economic loss.
On the contrary, the crop damage was mainly (28%) caused by a drought. But this was only 3% of economic loss caused by a natural disaster.
We should take a heavy warning to Wind and Water!

Data Processing

Loading data

The data is downloded from the course web site: “https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2” and decompressed as

file<-"repdata-data-StormData.csv"

and readed as a data.frame named sd.

library(data.table)
sd<-fread(file,verbose=TRUE)
## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.523066 GB.
## Memory mapping ... ok
## Detected eol as \r\n (CRLF) in that order, the Windows standard.
## Positioned on line 1 after skip or autostart
## This line is the autostart and not blank so searching up for the last non-blank ... line 1
## Detecting sep ... ','
## Detected 37 columns. Longest stretch was from line 1 to line 30
## Starting data input on line 1 (either column names or first row of data). First 10 characters: "STATE__",
## All the fields on line 1 are character fields. Treating as the column names.
## Count of eol: 1307675 (including 1 at the end)
## Count of sep: 34819802
## nrow = MIN( nsep [34819802] / ncol [37] -1, neol [1307675] - nblank [1] ) = 967216
## Type codes (   first 5 rows): 3444344430000303003343333430000333303
## Type codes (+ middle 5 rows): 3444344434444303443343333434440333343
## Type codes (+   last 5 rows): 3444344434444303443343333434444333343
## Type codes: 3444344434444303443343333434444333343 (after applying colClasses and integer64)
## Type codes: 3444344434444303443343333434444333343 (after applying drop or select (if supplied)
## Allocating 37 column slots (37 - 0 dropped)
## 
Read 30.0% of 967216 rows
Read 51.7% of 967216 rows
Read 69.3% of 967216 rows
Read 81.7% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:06
## Warning in fread(file, verbose = TRUE): Read less rows (902297) than were
## allocated (967216). Run again with verbose=TRUE and please report.
##    0.000s (  0%) Memory map (rerun may be quicker)
##    0.000s (  0%) sep and header detection
##    0.544s (  9%) Count rows (wc -l)
##    0.001s (  0%) Column type detection (first, middle and last 5 rows)
##    0.487s (  8%) Allocation of 902297x37 result (xMB) in RAM
##    4.914s ( 82%) Reading data
##    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
##    0.000s (  0%) Coercing data already read in type bumps (if any)
##    0.016s (  0%) Changing na.strings to NA
##    5.962s        Total

From the calculation as
## Count of eol: 1307675 (including 1 at the end)
## Count of sep: 34819802
## nrow = MIN( nsep [34819802] / ncol [37] -1, neol [1307675] - nblank [1] ) = 967216
fread insists some troubles with number of rows.
However, if you see the contents of REMARKS, you see many sep’s and eol’s. So you can use fread as fast and safe function to read StormData.

str(sd)
## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : chr  "3" "2" "2" "2" ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Making subset

Subset should include
EVTYPE; event type
FATALITIES; number of people died
INJURIES; number of people injuured
PROPDMG; amount of property damage
PROPDMGEXP; unit of damage (B,M,K,H,…)
CROPDMG; amount of corp damage
CROPDMGEXP - unit of damage (B,M,K,…)

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
##  d;%d8c.c*cc8c'c/cc/ 'package:data.table' cc   cc9c/ccc&cc>c: 
## 
##      between, last 
## 
##  d;%d8c.c*cc8c'c/cc/ 'package:stats' cc    cc9c/ccc&cc>c: 
## 
##      filter, lag 
## 
##  d;%d8c.c*cc8c'c/cc/ 'package:base' cc cc9c/ccc&cc>c: 
## 
##      intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
##  d;%d8c.c*cc8c'c/cc/ 'package:dplyr' cc    cc9c/ccc&cc>c: 
## 
##      arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
#####
#Select columns with ...EXP columns
sd.sub<-select(sd,c(EVTYPE,FATALITIES,INJURIES,PROPDMG:CROPDMGEXP))
#PROPDMEXP contains unit of PROP damage  
unitpc<-names(table(sd.sub$PROPDMGEXP))
#""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
unitpn<-c(0,0,0,0,0,1,1,1,1,1,1,1,1,1e9,1e2,1e2,1e3,1e6,1e6)
#
tmp<-as.numeric(mapvalues(sd.sub$PROPDMGEXP,unitpc,unitpn))
sd.sub$PROPDMG<-sd.sub$PROPDMG*tmp
#####
#In the case of CROPDMG
unitcc<-names(table(sd.sub$CROPDMGEXP))
#""  "?" "0" "2" "B" "k" "K" "m" "M"
unitcn<-c(0,0,0,1,1e9,1e3,1e3,1e6,1e6)
tmp<-as.numeric(mapvalues(sd.sub$CROPDMGEXP,unitcc,unitcn))
sd.sub$CROPDMG<-sd.sub$CROPDMG*tmp
## There are no use of ...EXP culomn.
sd.sub<-select(sd.sub,EVTYPE,FATALITIES,INJURIES,PROPDMG,CROPDMG)
by.event<-sd.sub %>% group_by(EVTYPE)

Analysis 1; Which type of weather events are most harmful to population health?

#Summarize fatal and injured number by events, Harm is a sum of the both.
health.sum<-by.event %>% summarise_each(funs(sum),FATALITIES,INJURIES) %>% mutate(Harm=FATALITIES+INJURIES)
#Each total number of fatal, injury and harm.
health.total<-as.data.frame(summarise_each(health.sum,funs(sum),FATALITIES,INJURIES,Harm))
#Ordered by number of the top 5 fatal events
fatal.ord<-as.data.frame(arrange(health.sum,desc(FATALITIES)))[1:5,]
fatalsubsum<-colSums(fatal.ord[,2:4])
fatal.ord5<-rbind(fatal.ord,cbind(EVTYPE="Others",(health.total-fatalsubsum)))
#Ordered by number of the top 5 injury events
injur.ord<-as.data.frame(arrange(health.sum,desc(INJURIES)))[1:5,]
injursubsum<-colSums(injur.ord[,2:4])
injur.ord5<-rbind(injur.ord,cbind(EVTYPE="Others",(health.total-injursubsum)))
#Ordered by number of the top 5 harm events
harm.ord<-as.data.frame(arrange(health.sum,desc(Harm)))[1:5,]
harmsubsum<-colSums(harm.ord[,2:4])
harm.ord5<-rbind(harm.ord,cbind(EVTYPE="Others",(health.total-harmsubsum)))

Top five events ordered by fatal, injured or harm category.

fatal.ord5
##           EVTYPE FATALITIES INJURIES  Harm
## 1        TORNADO       5633    91346 96979
## 2 EXCESSIVE HEAT       1903     6525  8428
## 3    FLASH FLOOD        978     1777  2755
## 4           HEAT        937     2100  3037
## 5      LIGHTNING        816     5230  6046
## 6         Others       4878    33550 38428
#per cent
p<-mutate(fatal.ord5,fatalpercent=round(fatal.ord5[,2]/as.numeric(health.total[1])*100,2))
p<-mutate(p,injurpercent=round(fatal.ord5[,3]/as.numeric(health.total[2])*100,2))
p<-mutate(p,harmpercent=round(fatal.ord5[,4]/as.numeric(health.total[3])*100,2))
p[,c(1,5:7)]
##           EVTYPE fatalpercent injurpercent harmpercent
## 1        TORNADO        37.19        65.00       62.30
## 2 EXCESSIVE HEAT        12.57         4.64        5.41
## 3    FLASH FLOOD         6.46         1.26        1.77
## 4           HEAT         6.19         1.49        1.95
## 5      LIGHTNING         5.39         3.72        3.88
## 6         Others        32.21        23.87       24.69
injur.ord5
##           EVTYPE FATALITIES INJURIES  Harm
## 1        TORNADO       5633    91346 96979
## 2      TSTM WIND        504     6957  7461
## 3          FLOOD        470     6789  7259
## 4 EXCESSIVE HEAT       1903     6525  8428
## 5      LIGHTNING        816     5230  6046
## 6         Others       5819    23681 29500
p<-mutate(injur.ord5,fatalpercent=round(injur.ord5[,2]/as.numeric(health.total[1])*100,2))
p<-mutate(p,injurpercent=round(injur.ord5[,3]/as.numeric(health.total[2])*100,2))
p<-mutate(p,harmpercent=round(injur.ord5[,4]/as.numeric(health.total[3])*100,2))
p[,c(1,5:7)]
##           EVTYPE fatalpercent injurpercent harmpercent
## 1        TORNADO        37.19        65.00       62.30
## 2      TSTM WIND         3.33         4.95        4.79
## 3          FLOOD         3.10         4.83        4.66
## 4 EXCESSIVE HEAT        12.57         4.64        5.41
## 5      LIGHTNING         5.39         3.72        3.88
## 6         Others        38.42        16.85       18.95
harm.ord5
##           EVTYPE FATALITIES INJURIES  Harm
## 1        TORNADO       5633    91346 96979
## 2 EXCESSIVE HEAT       1903     6525  8428
## 3      TSTM WIND        504     6957  7461
## 4          FLOOD        470     6789  7259
## 5      LIGHTNING        816     5230  6046
## 6         Others       5819    23681 29500
p<-mutate(harm.ord5,fatalpercent=round(harm.ord5[,2]/as.numeric(health.total[1])*100,2))
p<-mutate(p,injurpercent=round(harm.ord5[,3]/as.numeric(health.total[2])*100,2))
p<-mutate(p,harmpercent=round(harm.ord5[,4]/as.numeric(health.total[3])*100,2))
p[,c(1,5:7)]
##           EVTYPE fatalpercent injurpercent harmpercent
## 1        TORNADO        37.19        65.00       62.30
## 2 EXCESSIVE HEAT        12.57         4.64        5.41
## 3      TSTM WIND         3.33         4.95        4.79
## 4          FLOOD         3.10         4.83        4.66
## 5      LIGHTNING         5.39         3.72        3.88
## 6         Others        38.42        16.85       18.95

Analysis 2; Which types of events have the greatest economic consequences?

#In the case of property, crop and economic
economic.sum<-by.event %>% summarise_each(funs(sum),PROPDMG,CROPDMG) %>% mutate(Economic=PROPDMG+CROPDMG)
#
economic.total<-as.data.frame(summarise_each(economic.sum,funs(sum),PROPDMG,CROPDMG,Economic))
##
prop.ord<-as.data.frame(arrange(economic.sum,desc(PROPDMG)))[1:5,]
propsubsum<-colSums(prop.ord[,2:4])
prop.ord5<-rbind(prop.ord,cbind(EVTYPE="Others",(economic.total-propsubsum)))
#
crop.ord<-as.data.frame(arrange(economic.sum,desc(CROPDMG)))[1:5,]
cropsubsum<-colSums(crop.ord[,2:4])
crop.ord5<-rbind(crop.ord,cbind(EVTYPE="Others",(economic.total-cropsubsum)))
#
economic.ord<-as.data.frame(arrange(economic.sum,desc(Economic)))[1:5,]
economicsubsum<-colSums(economic.ord[,2:4])
economic.ord5<-rbind(economic.ord,cbind(EVTYPE="Others",(economic.total-economicsubsum)))

Top five events ordered by propaty, crop or economic category.

prop.ord5
##              EVTYPE      PROPDMG     CROPDMG     Economic
## 1             FLOOD 144657709800  5661968450 150319678250
## 2 HURRICANE/TYPHOON  69305840000  2607872800  71913712800
## 3           TORNADO  56937160582   414953110  57352113692
## 4       STORM SURGE  43323536000        5000  43323541000
## 5       FLASH FLOOD  16140811599  1421317100  17562128699
## 6            Others  96953587223 38998075450 135951662673
#per cent
p<-mutate(prop.ord5,proppercent=round(prop.ord5[,2]/as.numeric(economic.total[1])*100,2))
p<-mutate(p,croppercent=round(prop.ord5[,3]/as.numeric(economic.total[2])*100,2))
p<-mutate(p,econopercent=round(prop.ord5[,4]/as.numeric(economic.total[3])*100,2))
p[,c(1,5:7)]
##              EVTYPE proppercent croppercent econopercent
## 1             FLOOD       33.85       11.53        31.55
## 2 HURRICANE/TYPHOON       16.22        5.31        15.09
## 3           TORNADO       13.32        0.85        12.04
## 4       STORM SURGE       10.14        0.00         9.09
## 5       FLASH FLOOD        3.78        2.89         3.69
## 6            Others       22.69       79.42        28.54
crop.ord5
##        EVTYPE      PROPDMG     CROPDMG     Economic
## 1     DROUGHT   1046106000 13972566000  15018672000
## 2       FLOOD 144657709800  5661968450 150319678250
## 3 RIVER FLOOD   5118945500  5029459000  10148404500
## 4   ICE STORM   3944927810  5022113500   8967041310
## 5        HAIL  15732267250  3025954450  18758221700
## 6      Others 256818688844 16392130510 273210819354
p<-mutate(crop.ord5,proppercent=round(crop.ord5[,2]/as.numeric(economic.total[1])*100,2))
p<-mutate(p,croppercent=round(crop.ord5[,3]/as.numeric(economic.total[2])*100,2))
p<-mutate(p,econopercent=round(crop.ord5[,4]/as.numeric(economic.total[3])*100,2))
p[,c(1,5:7)]
##        EVTYPE proppercent croppercent econopercent
## 1     DROUGHT        0.24       28.45         3.15
## 2       FLOOD       33.85       11.53        31.55
## 3 RIVER FLOOD        1.20       10.24         2.13
## 4   ICE STORM        0.92       10.23         1.88
## 5        HAIL        3.68        6.16         3.94
## 6      Others       60.10       33.38        57.35
economic.ord5
##              EVTYPE      PROPDMG     CROPDMG     Economic
## 1             FLOOD 144657709800  5661968450 150319678250
## 2 HURRICANE/TYPHOON  69305840000  2607872800  71913712800
## 3           TORNADO  56937160582   414953110  57352113692
## 4       STORM SURGE  43323536000        5000  43323541000
## 5              HAIL  15732267250  3025954450  18758221700
## 6            Others  97362131572 37393438100 134755569672
p<-mutate(economic.ord5,proppercent=round(economic.ord5[,2]/as.numeric(economic.total[1])*100,2))
p<-mutate(p,croppercent=round(economic.ord5[,3]/as.numeric(economic.total[2])*100,2))
p<-mutate(p,econopercent=round(economic.ord5[,4]/as.numeric(economic.total[3])*100,2))
p[,c(1,5:7)]
##              EVTYPE proppercent croppercent econopercent
## 1             FLOOD       33.85       11.53        31.55
## 2 HURRICANE/TYPHOON       16.22        5.31        15.09
## 3           TORNADO       13.32        0.85        12.04
## 4       STORM SURGE       10.14        0.00         9.09
## 5              HAIL        3.68        6.16         3.94
## 6            Others       22.78       76.15        28.28

Results:

Result 1; The most harmful weather events to population health.

Total number of fatal, injury and harm as the summation of the both.

health.total
##   FATALITIES INJURIES   Harm
## 1      15145   140528 155673
round(health.total/health.total$Harm*100,2)
##   FATALITIES INJURIES Harm
## 1       9.73    90.27  100

These were distributed as followed.

#####
library(ggplot2)
library(gridExtra)
library(grid)
#####
graph.data<-data.frame(fatal.ord5$EVTYPE,fatal.ord5$FATALITIES,injur.ord5$EVTYPE,injur.ord5$INJURIES,harm.ord5$EVTYPE,harm.ord5$Harm)
name.data<-c("Fatal","Ingury","Harm")
graph.list<-lapply(1:3,function(i){
        d <- data.frame(event=graph.data[,i*2-1],count=graph.data[,i*2])
        g<- ggplot(d,aes(x=6:1,y=count,fill=event))+ 
        geom_bar(stat="identity")+    
        coord_flip()+
        theme(legend.position="right", legend.text = element_text(size=8))+
        theme(legend.key.height=unit(0.7,"line"))+
        theme(axis.text.y = element_blank())+
        scale_fill_discrete(breaks=d$event)+
        xlab("Event") +
        ggtitle(name.data[i])})
args<-c(graph.list,list(ncol=1,nrow=3))
do.call(grid.arrange,args)

Result 2; Weather events that make the greatest economic consequences.

Total number of property, crop and economic as the summation of the both.

economic.total
##        PROPDMG     CROPDMG     Economic
## 1 427318645204 49104191910 476422837114
round(economic.total/economic.total$Economic*100,2)
##   PROPDMG CROPDMG Economic
## 1   89.69   10.31      100

These were distributed as followed.

graph.data<-data.frame(prop.ord5$EVTYPE,prop.ord5$PROPDMG,crop.ord5$EVTYPE,crop.ord5$CROPDMG,economic.ord5$EVTYPE,economic.ord5$Economic)
name.data<-c("Property","Crop","Economic")
graph.list<-lapply(1:3,function(i){
        d <- data.frame(event=graph.data[,i*2-1],count=graph.data[,i*2])
        g<- ggplot(d,aes(x=6:1,y=count,fill=event))+ 
        geom_bar(stat="identity")+    
        coord_flip()+
        theme(legend.position="right", legend.text = element_text(size=8))+
        theme(legend.key.height=unit(0.7,"line"))+
        theme(axis.text.y = element_blank())+
        scale_fill_discrete(breaks=d$event)+
        xlab("Event") +
        ggtitle(name.data[i])})
args<-c(graph.list,list(ncol=1,nrow=3))
do.call(grid.arrange,args)