Tornado and Flood are the most injurious events

Note: deer peers, I sorry for my English.

Questions

1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

Synopsis

For each types of events we’ve calculated total fatalities, injuries and damage, and mean fatalities, injuries and damage too. (we used whole period of observations 1950-2011). Such types of events how “TORNADO”, “FLASH FLOOD”, “FLOOD” and “HIGH WIND” are in both tables “TOP-10 total fatalities”" and “TOP-10 total damage”. So, we can say, that they are the most injurious types of events. In addition we’ve discovered interesting association between mean fatalities per event and number of events. (because we are investigating fix period, we can say about frequency of event instead number of event). I would note, that there are types of events, which occur more seldom, than for example tornados or floods, but they are much more harmful (for example EXTREME HEAT).

Data Processing

I’ve downloaded data by URL Storm Data
I’ve read bz2-file

df <- read.csv(bzfile("./repdata-data-StormData.csv.bz2"))

See first 3 transposed rows

t(head(df, n = 3))

##            1                   2                   3                  
## STATE__    "1"                 "1"                 "1"                
## BGN_DATE   "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00"
## BGN_TIME   "0130"              "0145"              "1600"             
## TIME_ZONE  "CST"               "CST"               "CST"              
## COUNTY     "97"                " 3"                "57"               
## COUNTYNAME "MOBILE"            "BALDWIN"           "FAYETTE"          
## STATE      "AL"                "AL"                "AL"               
## EVTYPE     "TORNADO"           "TORNADO"           "TORNADO"          
## BGN_RANGE  "0"                 "0"                 "0"                
## BGN_AZI    ""                  ""                  ""                 
## BGN_LOCATI ""                  ""                  ""                 
## END_DATE   ""                  ""                  ""                 
## END_TIME   ""                  ""                  ""                 
## COUNTY_END "0"                 "0"                 "0"                
## COUNTYENDN NA                  NA                  NA                 
## END_RANGE  "0"                 "0"                 "0"                
## END_AZI    ""                  ""                  ""                 
## END_LOCATI ""                  ""                  ""                 
## LENGTH     "14.0"              " 2.0"              " 0.1"             
## WIDTH      "100"               "150"               "123"              
## F          "3"                 "2"                 "2"                
## MAG        "0"                 "0"                 "0"                
## FATALITIES "0"                 "0"                 "0"                
## INJURIES   "15"                " 0"                " 2"               
## PROPDMG    "25.0"              " 2.5"              "25.0"             
## PROPDMGEXP "K"                 "K"                 "K"                
## CROPDMG    "0"                 "0"                 "0"                
## CROPDMGEXP ""                  ""                  ""                 
## WFO        ""                  ""                  ""                 
## STATEOFFIC ""                  ""                  ""                 
## ZONENAMES  ""                  ""                  ""                 
## LATITUDE   "3040"              "3042"              "3340"             
## LONGITUDE  "8812"              "8755"              "8742"             
## LATITUDE_E "3051"              "   0"              "   0"             
## LONGITUDE_ "8806"              "   0"              "   0"             
## REMARKS    ""                  ""                  ""                 
## REFNUM     "1"                 "2"                 "3"

check NAs for df

sapply(df, function(e) sum(is.na(e)))

##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME 
##          0          0          0          0          0          0 
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE 
##          0          0          0          0          0          0 
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI 
##          0          0     902297          0          0          0 
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES 
##          0          0     843563          0          0          0 
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC 
##          0          0          0          0          0          0 
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS 
##          0         47          0         40          0          0 
##     REFNUM 
##          0

There isn’t NA-values in interesting columns - EVTYPE, FATALITIES, INJURIES, *DMG

Thus, the data are ready for analysis.

Results

In first, I,ve calculated total fatalities and injuries (see variables FATALtotal and INJURtotal), and mean fatalities and injuries (variables FATALmean and INJURmean).

library(dplyr)
tmp <- group_by(df, EVTYPE)
tmp2 <- summarise(.data=tmp, FATALtotal=sum(FATALITIES), FATALmean=mean(FATALITIES),
                  EVTYPEn=n(), INJURtotal=sum(INJURIES), INJURmean=mean(INJURIES))
# order by FATALtotal
tmp2ord <- tmp2[order(tmp2$FATALtotal, decreasing = T),]

Here we see TOP-10 types of events by variable FATALtotal

library(xtable)
xt <- xtable(head(tmp2ord, n=10))
print(xt, type="html")

	EVTYPE	FATALtotal	FATALmean	EVTYPEn	INJURtotal	INJURmean
1	TORNADO	5633.00	0.09	60652	91346.00	1.51
2	EXCESSIVE HEAT	1903.00	1.13	1678	6525.00	3.89
3	FLASH FLOOD	978.00	0.02	54277	1777.00	0.03
4	HEAT	937.00	1.22	767	2100.00	2.74
5	LIGHTNING	816.00	0.05	15754	5230.00	0.33
6	TSTM WIND	504.00	0.00	219940	6957.00	0.03
7	FLOOD	470.00	0.02	25326	6789.00	0.27
8	RIP CURRENT	368.00	0.78	470	232.00	0.49
9	HIGH WIND	248.00	0.01	20212	1137.00	0.06
10	AVALANCHE	224.00	0.58	386	170.00	0.44

Next, we see TOP-10 types of events by variable FATALmean

library(xtable)
xt <- xtable(head(tmp2[order(tmp2$FATALmean, decreasing = T),], n=10))
print(xt, type="html")

	EVTYPE	FATALtotal	FATALmean	EVTYPEn	INJURtotal	INJURmean
1	TORNADOES, TSTM WIND, HAIL	25.00	25.00	1	0.00	0.00
2	COLD AND SNOW	14.00	14.00	1	0.00	0.00
3	TROPICAL STORM GORDON	8.00	8.00	1	43.00	43.00
4	RECORD/EXCESSIVE HEAT	17.00	5.67	3	0.00	0.00
5	EXTREME HEAT	96.00	4.36	22	155.00	7.05
6	HEAT WAVE DROUGHT	4.00	4.00	1	15.00	15.00
7	HIGH WIND/SEAS	4.00	4.00	1	0.00	0.00
8	MARINE MISHAP	7.00	3.50	2	5.00	2.50
9	WINTER STORMS	10.00	3.33	3	17.00	5.67
10	Heavy surf and wind	3.00	3.00	1	0.00	0.00

Next, I’ve created new factor FATALtotalFac for FATALtotal (It’s necessary for plot)

tmp2ord$FATALtotalFac <- character(length = nrow(tmp2ord))
tmp2ord[tmp2ord$FATALtotal>=500,]$FATALtotalFac<-"500+" 
tmp2ord[tmp2ord$FATALtotal>=100 & tmp2ord$FATALtotal<500,]$FATALtotalFac<-"100-500" 
tmp2ord[tmp2ord$FATALtotal>=50 & tmp2ord$FATALtotal<100,]$FATALtotalFac<-"50-100" 
tmp2ord[tmp2ord$FATALtotal<50,]$FATALtotalFac<-"0-50" 
tmp2ord$FATALtotalFac <- factor(x=tmp2ord$FATALtotalFac,levels = c("0-50", "50-100",
                                                                   "100-500", "500+"),
                                ordered = T)

some transformations for adding labels on plot

tmp3 <- tmp2ord[tmp2ord$FATALtotalFac!="0-50" & tmp2ord$FATALtotalFac!="50-100",]
tmp3$EVTYPE <- as.character(tmp3$EVTYPE)
tmp3[tmp3$FATALtotalFac=="500+",]$EVTYPE <- paste0(tmp3[tmp3$FATALtotalFac=="500+",]$EVTYPE," (", tmp3[tmp3$FATALtotalFac=="500+",]$FATALtotal, ")")

Next, I’ve made informative plot, which describe association between mean fatalities per event and number of events.

library(ggplot2)

ggplot(data = tmp2ord, aes(x=(FATALmean)^(1/4), y=(EVTYPEn)^(1/4), col=FATALtotalFac, size=(FATALtotal)^(1/4))) + 
    geom_point()+xlim(0,1.5)+
    ylim(0,24)+
    annotate("text", label = tmp3$EVTYPE, x = (tmp3$FATALmean)^(1/4), y = (tmp3$EVTYPEn)^(1/4), size = 3, colour = "#2e3a23",  hjust=-0.1)+
    xlab("Mean fatalities per event, pcs^(1/4)")+
    ylab("Number of events, pcs^(1/4)")+
    labs(title="Relation between 'Mean fatalities per event' and 'Number of events'")+
    scale_size_continuous(breaks=NULL)+
    scale_color_discrete(name="Fatalities levels")

We can see negative association between mean fatalities per event and number of events (or frequency of event). All evets from TOP-10 are lying near centre of the plot.

It’s interesting: there is positive correlation between FATALtotal and INJURtotal

fit <- lm(FATALtotal~INJURtotal, data = tmp2ord)
summary(fit)

## 
## Call:
## lm(formula = FATALtotal ~ INJURtotal, data = tmp2ord)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -181.2   -6.3   -6.3   -6.3 1481.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.3045903  2.0897235   3.017  0.00262 ** 
## INJURtotal  0.0635815  0.0007099  89.558  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.51 on 983 degrees of freedom
## Multiple R-squared:  0.8908, Adjusted R-squared:  0.8907 
## F-statistic:  8021 on 1 and 983 DF,  p-value: < 2.2e-16

There is strong correlation between FATALtotal and INJURtotal. If INJURtotal increased on 100 men then FATALtotal increased on 6 men.

Mean injuries in time series

# preparation data for plot
df$BGN_DATE <- as.Date(df$BGN_DATE, "%m/%d/%Y")
library(lubridate)
df$year <- year(df$BGN_DATE)
tmp <- group_by(df, year)
tmp2 <- summarise(.data=tmp, INJURtotal=sum(INJURIES), INJURmean=mean(INJURIES))
#make plot
ggplot(data = tmp2, aes(x=year, y=INJURmean))+geom_line()+geom_smooth()+
    ylab("Mean injuries per event")+
    xlab("Years")+
    labs(title="Mean injuries decrease")

We see, that mean injuries decrease.

Next, I’ve made table TOP-10 total damage

df$PROPDMGdoll <- df$PROPDMG
df[df$PROPDMGEXP=="K" | df$PROPDMGEXP=="k",]$PROPDMGdoll <- df[df$PROPDMGEXP=="K" | df$PROPDMGEXP=="k",]$PROPDMG*10^3
df[df$PROPDMGEXP=="M" | df$PROPDMGEXP=="m",]$PROPDMGdoll <- df[df$PROPDMGEXP=="M" | df$PROPDMGEXP=="m",]$PROPDMG*10^6
df[df$PROPDMGEXP=="B" | df$PROPDMGEXP=="b",]$PROPDMGdoll <- df[df$PROPDMGEXP=="B" | df$PROPDMGEXP=="b",]$PROPDMG*10^9

tmp <- group_by(df, EVTYPE)
tmp3 <- summarise(.data=tmp, PROPDMGtotal=sum(PROPDMGdoll), PROPDMGmean=mean(PROPDMGdoll),
                  FATALtotal=sum(FATALITIES))
# order by PROPDMGtotal
tmp3ord <- tmp3[order(tmp3$PROPDMGtotal, decreasing = T),]

Here we see TOP-10 types of events by variable PROPDMGtotal

xt <- xtable(head(tmp3ord, n=10))
print(xt, type="html")

	EVTYPE	PROPDMGtotal	PROPDMGmean	FATALtotal
1	FLOOD	144657709807.00	5711826.18	470.00
2	HURRICANE/TYPHOON	69305840000.00	787566363.64	64.00
3	TORNADO	56937160778.70	938751.58	5633.00
4	STORM SURGE	43323536000.00	165990559.39	13.00
5	FLASH FLOOD	16140812067.10	297378.49	978.00
6	HAIL	15732267047.70	54500.84	15.00
7	HURRICANE	11868319010.00	68208729.94	61.00
8	TROPICAL STORM	7703890550.00	11165058.77	58.00
9	WINTER STORM	6688497251.00	585016.82	206.00
10	HIGH WIND	5270046295.00	260738.49	248.00

Intersection TOP-10 total fatalities and TOP-10 total damage

intersect(tmp2ord$EVTYPE[1:10], tmp3ord$EVTYPE[1:10])

## [1] "TORNADO"     "FLASH FLOOD" "FLOOD"       "HIGH WIND"

So, such types of events how “TORNADO”, “FLASH FLOOD”, “FLOOD” and “HIGH WIND” are in TOP-10 total fatalities and TOP-10 total damage simultaneously. We can say, that they are the most injurious types of events.

Tornado and Flood are the most injurious events

Grigory Miholap

15/11/2015

Questions

Synopsis

Data Processing

Results