Note: deer peers, I sorry for my English.
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
For each types of events we’ve calculated total fatalities, injuries and damage, and mean fatalities, injuries and damage too. (we used whole period of observations 1950-2011). Such types of events how “TORNADO”, “FLASH FLOOD”, “FLOOD” and “HIGH WIND” are in both tables “TOP-10 total fatalities”" and “TOP-10 total damage”. So, we can say, that they are the most injurious types of events. In addition we’ve discovered interesting association between mean fatalities per event and number of events. (because we are investigating fix period, we can say about frequency of event instead number of event). I would note, that there are types of events, which occur more seldom, than for example tornados or floods, but they are much more harmful (for example EXTREME HEAT).
I’ve downloaded data by URL Storm Data
I’ve read bz2-file
df <- read.csv(bzfile("./repdata-data-StormData.csv.bz2"))
See first 3 transposed rows
t(head(df, n = 3))
## 1 2 3
## STATE__ "1" "1" "1"
## BGN_DATE "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00"
## BGN_TIME "0130" "0145" "1600"
## TIME_ZONE "CST" "CST" "CST"
## COUNTY "97" " 3" "57"
## COUNTYNAME "MOBILE" "BALDWIN" "FAYETTE"
## STATE "AL" "AL" "AL"
## EVTYPE "TORNADO" "TORNADO" "TORNADO"
## BGN_RANGE "0" "0" "0"
## BGN_AZI "" "" ""
## BGN_LOCATI "" "" ""
## END_DATE "" "" ""
## END_TIME "" "" ""
## COUNTY_END "0" "0" "0"
## COUNTYENDN NA NA NA
## END_RANGE "0" "0" "0"
## END_AZI "" "" ""
## END_LOCATI "" "" ""
## LENGTH "14.0" " 2.0" " 0.1"
## WIDTH "100" "150" "123"
## F "3" "2" "2"
## MAG "0" "0" "0"
## FATALITIES "0" "0" "0"
## INJURIES "15" " 0" " 2"
## PROPDMG "25.0" " 2.5" "25.0"
## PROPDMGEXP "K" "K" "K"
## CROPDMG "0" "0" "0"
## CROPDMGEXP "" "" ""
## WFO "" "" ""
## STATEOFFIC "" "" ""
## ZONENAMES "" "" ""
## LATITUDE "3040" "3042" "3340"
## LONGITUDE "8812" "8755" "8742"
## LATITUDE_E "3051" " 0" " 0"
## LONGITUDE_ "8806" " 0" " 0"
## REMARKS "" "" ""
## REFNUM "1" "2" "3"
check NAs for df
sapply(df, function(e) sum(is.na(e)))
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 0 0 0 0 0 0
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 0 0 0 0 0 0
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 0 0 902297 0 0 0
## LENGTH WIDTH F MAG FATALITIES INJURIES
## 0 0 843563 0 0 0
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## 0 0 0 0 0 0
## ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## 0 47 0 40 0 0
## REFNUM
## 0
There isn’t NA-values in interesting columns - EVTYPE, FATALITIES, INJURIES, *DMG
Thus, the data are ready for analysis.
In first, I,ve calculated total fatalities and injuries (see variables FATALtotal and INJURtotal), and mean fatalities and injuries (variables FATALmean and INJURmean).
library(dplyr)
tmp <- group_by(df, EVTYPE)
tmp2 <- summarise(.data=tmp, FATALtotal=sum(FATALITIES), FATALmean=mean(FATALITIES),
EVTYPEn=n(), INJURtotal=sum(INJURIES), INJURmean=mean(INJURIES))
# order by FATALtotal
tmp2ord <- tmp2[order(tmp2$FATALtotal, decreasing = T),]
Here we see TOP-10 types of events by variable FATALtotal
library(xtable)
xt <- xtable(head(tmp2ord, n=10))
print(xt, type="html")
| EVTYPE | FATALtotal | FATALmean | EVTYPEn | INJURtotal | INJURmean | |
|---|---|---|---|---|---|---|
| 1 | TORNADO | 5633.00 | 0.09 | 60652 | 91346.00 | 1.51 |
| 2 | EXCESSIVE HEAT | 1903.00 | 1.13 | 1678 | 6525.00 | 3.89 |
| 3 | FLASH FLOOD | 978.00 | 0.02 | 54277 | 1777.00 | 0.03 |
| 4 | HEAT | 937.00 | 1.22 | 767 | 2100.00 | 2.74 |
| 5 | LIGHTNING | 816.00 | 0.05 | 15754 | 5230.00 | 0.33 |
| 6 | TSTM WIND | 504.00 | 0.00 | 219940 | 6957.00 | 0.03 |
| 7 | FLOOD | 470.00 | 0.02 | 25326 | 6789.00 | 0.27 |
| 8 | RIP CURRENT | 368.00 | 0.78 | 470 | 232.00 | 0.49 |
| 9 | HIGH WIND | 248.00 | 0.01 | 20212 | 1137.00 | 0.06 |
| 10 | AVALANCHE | 224.00 | 0.58 | 386 | 170.00 | 0.44 |
Next, we see TOP-10 types of events by variable FATALmean
library(xtable)
xt <- xtable(head(tmp2[order(tmp2$FATALmean, decreasing = T),], n=10))
print(xt, type="html")
| EVTYPE | FATALtotal | FATALmean | EVTYPEn | INJURtotal | INJURmean | |
|---|---|---|---|---|---|---|
| 1 | TORNADOES, TSTM WIND, HAIL | 25.00 | 25.00 | 1 | 0.00 | 0.00 |
| 2 | COLD AND SNOW | 14.00 | 14.00 | 1 | 0.00 | 0.00 |
| 3 | TROPICAL STORM GORDON | 8.00 | 8.00 | 1 | 43.00 | 43.00 |
| 4 | RECORD/EXCESSIVE HEAT | 17.00 | 5.67 | 3 | 0.00 | 0.00 |
| 5 | EXTREME HEAT | 96.00 | 4.36 | 22 | 155.00 | 7.05 |
| 6 | HEAT WAVE DROUGHT | 4.00 | 4.00 | 1 | 15.00 | 15.00 |
| 7 | HIGH WIND/SEAS | 4.00 | 4.00 | 1 | 0.00 | 0.00 |
| 8 | MARINE MISHAP | 7.00 | 3.50 | 2 | 5.00 | 2.50 |
| 9 | WINTER STORMS | 10.00 | 3.33 | 3 | 17.00 | 5.67 |
| 10 | Heavy surf and wind | 3.00 | 3.00 | 1 | 0.00 | 0.00 |
Next, I’ve created new factor FATALtotalFac for FATALtotal (It’s necessary for plot)
tmp2ord$FATALtotalFac <- character(length = nrow(tmp2ord))
tmp2ord[tmp2ord$FATALtotal>=500,]$FATALtotalFac<-"500+"
tmp2ord[tmp2ord$FATALtotal>=100 & tmp2ord$FATALtotal<500,]$FATALtotalFac<-"100-500"
tmp2ord[tmp2ord$FATALtotal>=50 & tmp2ord$FATALtotal<100,]$FATALtotalFac<-"50-100"
tmp2ord[tmp2ord$FATALtotal<50,]$FATALtotalFac<-"0-50"
tmp2ord$FATALtotalFac <- factor(x=tmp2ord$FATALtotalFac,levels = c("0-50", "50-100",
"100-500", "500+"),
ordered = T)
some transformations for adding labels on plot
tmp3 <- tmp2ord[tmp2ord$FATALtotalFac!="0-50" & tmp2ord$FATALtotalFac!="50-100",]
tmp3$EVTYPE <- as.character(tmp3$EVTYPE)
tmp3[tmp3$FATALtotalFac=="500+",]$EVTYPE <- paste0(tmp3[tmp3$FATALtotalFac=="500+",]$EVTYPE," (", tmp3[tmp3$FATALtotalFac=="500+",]$FATALtotal, ")")
Next, I’ve made informative plot, which describe association between mean fatalities per event and number of events.
library(ggplot2)
ggplot(data = tmp2ord, aes(x=(FATALmean)^(1/4), y=(EVTYPEn)^(1/4), col=FATALtotalFac, size=(FATALtotal)^(1/4))) +
geom_point()+xlim(0,1.5)+
ylim(0,24)+
annotate("text", label = tmp3$EVTYPE, x = (tmp3$FATALmean)^(1/4), y = (tmp3$EVTYPEn)^(1/4), size = 3, colour = "#2e3a23", hjust=-0.1)+
xlab("Mean fatalities per event, pcs^(1/4)")+
ylab("Number of events, pcs^(1/4)")+
labs(title="Relation between 'Mean fatalities per event' and 'Number of events'")+
scale_size_continuous(breaks=NULL)+
scale_color_discrete(name="Fatalities levels")
We can see negative association between mean fatalities per event and number of events (or frequency of event). All evets from TOP-10 are lying near centre of the plot.
It’s interesting: there is positive correlation between FATALtotal and INJURtotal
fit <- lm(FATALtotal~INJURtotal, data = tmp2ord)
summary(fit)
##
## Call:
## lm(formula = FATALtotal ~ INJURtotal, data = tmp2ord)
##
## Residuals:
## Min 1Q Median 3Q Max
## -181.2 -6.3 -6.3 -6.3 1481.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.3045903 2.0897235 3.017 0.00262 **
## INJURtotal 0.0635815 0.0007099 89.558 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.51 on 983 degrees of freedom
## Multiple R-squared: 0.8908, Adjusted R-squared: 0.8907
## F-statistic: 8021 on 1 and 983 DF, p-value: < 2.2e-16
There is strong correlation between FATALtotal and INJURtotal. If INJURtotal increased on 100 men then FATALtotal increased on 6 men.
Mean injuries in time series
# preparation data for plot
df$BGN_DATE <- as.Date(df$BGN_DATE, "%m/%d/%Y")
library(lubridate)
df$year <- year(df$BGN_DATE)
tmp <- group_by(df, year)
tmp2 <- summarise(.data=tmp, INJURtotal=sum(INJURIES), INJURmean=mean(INJURIES))
#make plot
ggplot(data = tmp2, aes(x=year, y=INJURmean))+geom_line()+geom_smooth()+
ylab("Mean injuries per event")+
xlab("Years")+
labs(title="Mean injuries decrease")
We see, that mean injuries decrease.
Next, I’ve made table TOP-10 total damage
df$PROPDMGdoll <- df$PROPDMG
df[df$PROPDMGEXP=="K" | df$PROPDMGEXP=="k",]$PROPDMGdoll <- df[df$PROPDMGEXP=="K" | df$PROPDMGEXP=="k",]$PROPDMG*10^3
df[df$PROPDMGEXP=="M" | df$PROPDMGEXP=="m",]$PROPDMGdoll <- df[df$PROPDMGEXP=="M" | df$PROPDMGEXP=="m",]$PROPDMG*10^6
df[df$PROPDMGEXP=="B" | df$PROPDMGEXP=="b",]$PROPDMGdoll <- df[df$PROPDMGEXP=="B" | df$PROPDMGEXP=="b",]$PROPDMG*10^9
tmp <- group_by(df, EVTYPE)
tmp3 <- summarise(.data=tmp, PROPDMGtotal=sum(PROPDMGdoll), PROPDMGmean=mean(PROPDMGdoll),
FATALtotal=sum(FATALITIES))
# order by PROPDMGtotal
tmp3ord <- tmp3[order(tmp3$PROPDMGtotal, decreasing = T),]
Here we see TOP-10 types of events by variable PROPDMGtotal
xt <- xtable(head(tmp3ord, n=10))
print(xt, type="html")
| EVTYPE | PROPDMGtotal | PROPDMGmean | FATALtotal | |
|---|---|---|---|---|
| 1 | FLOOD | 144657709807.00 | 5711826.18 | 470.00 |
| 2 | HURRICANE/TYPHOON | 69305840000.00 | 787566363.64 | 64.00 |
| 3 | TORNADO | 56937160778.70 | 938751.58 | 5633.00 |
| 4 | STORM SURGE | 43323536000.00 | 165990559.39 | 13.00 |
| 5 | FLASH FLOOD | 16140812067.10 | 297378.49 | 978.00 |
| 6 | HAIL | 15732267047.70 | 54500.84 | 15.00 |
| 7 | HURRICANE | 11868319010.00 | 68208729.94 | 61.00 |
| 8 | TROPICAL STORM | 7703890550.00 | 11165058.77 | 58.00 |
| 9 | WINTER STORM | 6688497251.00 | 585016.82 | 206.00 |
| 10 | HIGH WIND | 5270046295.00 | 260738.49 | 248.00 |
Intersection TOP-10 total fatalities and TOP-10 total damage
intersect(tmp2ord$EVTYPE[1:10], tmp3ord$EVTYPE[1:10])
## [1] "TORNADO" "FLASH FLOOD" "FLOOD" "HIGH WIND"
So, such types of events how “TORNADO”, “FLASH FLOOD”, “FLOOD” and “HIGH WIND” are in TOP-10 total fatalities and TOP-10 total damage simultaneously. We can say, that they are the most injurious types of events.