Note: deer peers, I sorry for my English.

Questions

1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

Synopsis

For each types of events we’ve calculated total fatalities, injuries and damage, and mean fatalities, injuries and damage too. (we used whole period of observations 1950-2011). Such types of events how “TORNADO”, “FLASH FLOOD”, “FLOOD” and “HIGH WIND” are in both tables “TOP-10 total fatalities”" and “TOP-10 total damage”. So, we can say, that they are the most injurious types of events. In addition we’ve discovered interesting association between mean fatalities per event and number of events. (because we are investigating fix period, we can say about frequency of event instead number of event). I would note, that there are types of events, which occur more seldom, than for example tornados or floods, but they are much more harmful (for example EXTREME HEAT).

Data Processing

  1. I’ve downloaded data by URL Storm Data

  2. I’ve read bz2-file

df <- read.csv(bzfile("./repdata-data-StormData.csv.bz2"))

See first 3 transposed rows

t(head(df, n = 3))
##            1                   2                   3                  
## STATE__    "1"                 "1"                 "1"                
## BGN_DATE   "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00"
## BGN_TIME   "0130"              "0145"              "1600"             
## TIME_ZONE  "CST"               "CST"               "CST"              
## COUNTY     "97"                " 3"                "57"               
## COUNTYNAME "MOBILE"            "BALDWIN"           "FAYETTE"          
## STATE      "AL"                "AL"                "AL"               
## EVTYPE     "TORNADO"           "TORNADO"           "TORNADO"          
## BGN_RANGE  "0"                 "0"                 "0"                
## BGN_AZI    ""                  ""                  ""                 
## BGN_LOCATI ""                  ""                  ""                 
## END_DATE   ""                  ""                  ""                 
## END_TIME   ""                  ""                  ""                 
## COUNTY_END "0"                 "0"                 "0"                
## COUNTYENDN NA                  NA                  NA                 
## END_RANGE  "0"                 "0"                 "0"                
## END_AZI    ""                  ""                  ""                 
## END_LOCATI ""                  ""                  ""                 
## LENGTH     "14.0"              " 2.0"              " 0.1"             
## WIDTH      "100"               "150"               "123"              
## F          "3"                 "2"                 "2"                
## MAG        "0"                 "0"                 "0"                
## FATALITIES "0"                 "0"                 "0"                
## INJURIES   "15"                " 0"                " 2"               
## PROPDMG    "25.0"              " 2.5"              "25.0"             
## PROPDMGEXP "K"                 "K"                 "K"                
## CROPDMG    "0"                 "0"                 "0"                
## CROPDMGEXP ""                  ""                  ""                 
## WFO        ""                  ""                  ""                 
## STATEOFFIC ""                  ""                  ""                 
## ZONENAMES  ""                  ""                  ""                 
## LATITUDE   "3040"              "3042"              "3340"             
## LONGITUDE  "8812"              "8755"              "8742"             
## LATITUDE_E "3051"              "   0"              "   0"             
## LONGITUDE_ "8806"              "   0"              "   0"             
## REMARKS    ""                  ""                  ""                 
## REFNUM     "1"                 "2"                 "3"

check NAs for df

sapply(df, function(e) sum(is.na(e)))
##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME 
##          0          0          0          0          0          0 
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE 
##          0          0          0          0          0          0 
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI 
##          0          0     902297          0          0          0 
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES 
##          0          0     843563          0          0          0 
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC 
##          0          0          0          0          0          0 
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS 
##          0         47          0         40          0          0 
##     REFNUM 
##          0

There isn’t NA-values in interesting columns - EVTYPE, FATALITIES, INJURIES, *DMG

Thus, the data are ready for analysis.

Results

In first, I,ve calculated total fatalities and injuries (see variables FATALtotal and INJURtotal), and mean fatalities and injuries (variables FATALmean and INJURmean).

library(dplyr)
tmp <- group_by(df, EVTYPE)
tmp2 <- summarise(.data=tmp, FATALtotal=sum(FATALITIES), FATALmean=mean(FATALITIES),
                  EVTYPEn=n(), INJURtotal=sum(INJURIES), INJURmean=mean(INJURIES))
# order by FATALtotal
tmp2ord <- tmp2[order(tmp2$FATALtotal, decreasing = T),]

Here we see TOP-10 types of events by variable FATALtotal

library(xtable)
xt <- xtable(head(tmp2ord, n=10))
print(xt, type="html")
EVTYPE FATALtotal FATALmean EVTYPEn INJURtotal INJURmean
1 TORNADO 5633.00 0.09 60652 91346.00 1.51
2 EXCESSIVE HEAT 1903.00 1.13 1678 6525.00 3.89
3 FLASH FLOOD 978.00 0.02 54277 1777.00 0.03
4 HEAT 937.00 1.22 767 2100.00 2.74
5 LIGHTNING 816.00 0.05 15754 5230.00 0.33
6 TSTM WIND 504.00 0.00 219940 6957.00 0.03
7 FLOOD 470.00 0.02 25326 6789.00 0.27
8 RIP CURRENT 368.00 0.78 470 232.00 0.49
9 HIGH WIND 248.00 0.01 20212 1137.00 0.06
10 AVALANCHE 224.00 0.58 386 170.00 0.44

Next, we see TOP-10 types of events by variable FATALmean

library(xtable)
xt <- xtable(head(tmp2[order(tmp2$FATALmean, decreasing = T),], n=10))
print(xt, type="html")
EVTYPE FATALtotal FATALmean EVTYPEn INJURtotal INJURmean
1 TORNADOES, TSTM WIND, HAIL 25.00 25.00 1 0.00 0.00
2 COLD AND SNOW 14.00 14.00 1 0.00 0.00
3 TROPICAL STORM GORDON 8.00 8.00 1 43.00 43.00
4 RECORD/EXCESSIVE HEAT 17.00 5.67 3 0.00 0.00
5 EXTREME HEAT 96.00 4.36 22 155.00 7.05
6 HEAT WAVE DROUGHT 4.00 4.00 1 15.00 15.00
7 HIGH WIND/SEAS 4.00 4.00 1 0.00 0.00
8 MARINE MISHAP 7.00 3.50 2 5.00 2.50
9 WINTER STORMS 10.00 3.33 3 17.00 5.67
10 Heavy surf and wind 3.00 3.00 1 0.00 0.00

Next, I’ve created new factor FATALtotalFac for FATALtotal (It’s necessary for plot)

tmp2ord$FATALtotalFac <- character(length = nrow(tmp2ord))
tmp2ord[tmp2ord$FATALtotal>=500,]$FATALtotalFac<-"500+" 
tmp2ord[tmp2ord$FATALtotal>=100 & tmp2ord$FATALtotal<500,]$FATALtotalFac<-"100-500" 
tmp2ord[tmp2ord$FATALtotal>=50 & tmp2ord$FATALtotal<100,]$FATALtotalFac<-"50-100" 
tmp2ord[tmp2ord$FATALtotal<50,]$FATALtotalFac<-"0-50" 
tmp2ord$FATALtotalFac <- factor(x=tmp2ord$FATALtotalFac,levels = c("0-50", "50-100",
                                                                   "100-500", "500+"),
                                ordered = T)

some transformations for adding labels on plot

tmp3 <- tmp2ord[tmp2ord$FATALtotalFac!="0-50" & tmp2ord$FATALtotalFac!="50-100",]
tmp3$EVTYPE <- as.character(tmp3$EVTYPE)
tmp3[tmp3$FATALtotalFac=="500+",]$EVTYPE <- paste0(tmp3[tmp3$FATALtotalFac=="500+",]$EVTYPE," (", tmp3[tmp3$FATALtotalFac=="500+",]$FATALtotal, ")")

Next, I’ve made informative plot, which describe association between mean fatalities per event and number of events.

library(ggplot2)

ggplot(data = tmp2ord, aes(x=(FATALmean)^(1/4), y=(EVTYPEn)^(1/4), col=FATALtotalFac, size=(FATALtotal)^(1/4))) + 
    geom_point()+xlim(0,1.5)+
    ylim(0,24)+
    annotate("text", label = tmp3$EVTYPE, x = (tmp3$FATALmean)^(1/4), y = (tmp3$EVTYPEn)^(1/4), size = 3, colour = "#2e3a23",  hjust=-0.1)+
    xlab("Mean fatalities per event, pcs^(1/4)")+
    ylab("Number of events, pcs^(1/4)")+
    labs(title="Relation between 'Mean fatalities per event' and 'Number of events'")+
    scale_size_continuous(breaks=NULL)+
    scale_color_discrete(name="Fatalities levels")

We can see negative association between mean fatalities per event and number of events (or frequency of event). All evets from TOP-10 are lying near centre of the plot.

It’s interesting: there is positive correlation between FATALtotal and INJURtotal

fit <- lm(FATALtotal~INJURtotal, data = tmp2ord)
summary(fit)
## 
## Call:
## lm(formula = FATALtotal ~ INJURtotal, data = tmp2ord)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -181.2   -6.3   -6.3   -6.3 1481.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.3045903  2.0897235   3.017  0.00262 ** 
## INJURtotal  0.0635815  0.0007099  89.558  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.51 on 983 degrees of freedom
## Multiple R-squared:  0.8908, Adjusted R-squared:  0.8907 
## F-statistic:  8021 on 1 and 983 DF,  p-value: < 2.2e-16

There is strong correlation between FATALtotal and INJURtotal. If INJURtotal increased on 100 men then FATALtotal increased on 6 men.

Mean injuries in time series

# preparation data for plot
df$BGN_DATE <- as.Date(df$BGN_DATE, "%m/%d/%Y")
library(lubridate)
df$year <- year(df$BGN_DATE)
tmp <- group_by(df, year)
tmp2 <- summarise(.data=tmp, INJURtotal=sum(INJURIES), INJURmean=mean(INJURIES))
#make plot
ggplot(data = tmp2, aes(x=year, y=INJURmean))+geom_line()+geom_smooth()+
    ylab("Mean injuries per event")+
    xlab("Years")+
    labs(title="Mean injuries decrease")

We see, that mean injuries decrease.

Next, I’ve made table TOP-10 total damage

df$PROPDMGdoll <- df$PROPDMG
df[df$PROPDMGEXP=="K" | df$PROPDMGEXP=="k",]$PROPDMGdoll <- df[df$PROPDMGEXP=="K" | df$PROPDMGEXP=="k",]$PROPDMG*10^3
df[df$PROPDMGEXP=="M" | df$PROPDMGEXP=="m",]$PROPDMGdoll <- df[df$PROPDMGEXP=="M" | df$PROPDMGEXP=="m",]$PROPDMG*10^6
df[df$PROPDMGEXP=="B" | df$PROPDMGEXP=="b",]$PROPDMGdoll <- df[df$PROPDMGEXP=="B" | df$PROPDMGEXP=="b",]$PROPDMG*10^9
tmp <- group_by(df, EVTYPE)
tmp3 <- summarise(.data=tmp, PROPDMGtotal=sum(PROPDMGdoll), PROPDMGmean=mean(PROPDMGdoll),
                  FATALtotal=sum(FATALITIES))
# order by PROPDMGtotal
tmp3ord <- tmp3[order(tmp3$PROPDMGtotal, decreasing = T),]

Here we see TOP-10 types of events by variable PROPDMGtotal

xt <- xtable(head(tmp3ord, n=10))
print(xt, type="html")
EVTYPE PROPDMGtotal PROPDMGmean FATALtotal
1 FLOOD 144657709807.00 5711826.18 470.00
2 HURRICANE/TYPHOON 69305840000.00 787566363.64 64.00
3 TORNADO 56937160778.70 938751.58 5633.00
4 STORM SURGE 43323536000.00 165990559.39 13.00
5 FLASH FLOOD 16140812067.10 297378.49 978.00
6 HAIL 15732267047.70 54500.84 15.00
7 HURRICANE 11868319010.00 68208729.94 61.00
8 TROPICAL STORM 7703890550.00 11165058.77 58.00
9 WINTER STORM 6688497251.00 585016.82 206.00
10 HIGH WIND 5270046295.00 260738.49 248.00

Intersection TOP-10 total fatalities and TOP-10 total damage

intersect(tmp2ord$EVTYPE[1:10], tmp3ord$EVTYPE[1:10])
## [1] "TORNADO"     "FLASH FLOOD" "FLOOD"       "HIGH WIND"

So, such types of events how “TORNADO”, “FLASH FLOOD”, “FLOOD” and “HIGH WIND” are in TOP-10 total fatalities and TOP-10 total damage simultaneously. We can say, that they are the most injurious types of events.