The goal of this report is to give a government or municipal manager responsible for preparing for severe weather events enough information to prioritize resources across different types of events.
To that end, the report considers only the most recent years in the dataset (2001-2011), as they are the most representative of what can be expected in the coming years.
The analysis relies on the dplyr, lubridate, reshape2 and ggplot2 packages, which are loaded first.
library(dplyr)      # select(), filter(), mutate(), group_by(), summarise()
library(lubridate)  # mdy_hms(), ymd(), year()
library(reshape2)   # melt()
library(ggplot2)    # plotting
First, the data is downloaded if it is not already present locally.
# Download the raw archive only once
if (!file.exists("repdata-data-StormData.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "repdata-data-StormData.bz2", method = "curl")
}
Then it is loaded from the “.bz2” file and cached.
dfStorm <- read.csv(bzfile("repdata-data-StormData.bz2"), header = TRUE, stringsAsFactors = FALSE)
Columns are selected and formatted appropriately: we keep the BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP columns.
The data is then filtered to keep only the events recorded from 2001 through 2011.
tbldfStorm <- as_tibble(dfStorm) %>%
  select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG,
         PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  mutate(BGN_DATE = mdy_hms(BGN_DATE)) %>%   # parsed as POSIXct in UTC
  # Compare POSIXct with POSIXct; the strict ">" drops records stamped
  # exactly 2001-01-01 00:00:00 (use ">=" to keep them)
  filter(BGN_DATE > ymd("2001-01-01", tz = "UTC"))
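The effect of the cutoff comparison can be illustrated on a few hand-written timestamps. The sketch below is only an illustration: it uses base R (`as.POSIXct` rather than lubridate) and three made-up sample values in the same `month/day/year hour:minute:second` layout as BGN_DATE.

```r
# Made-up sample timestamps, in the same layout as BGN_DATE
stamps <- c("12/31/2000 0:00:00", "1/1/2001 0:00:00", "1/2/2001 0:00:00")

# Parse as POSIXct in UTC so that both sides of the comparison share a type
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
cutoff <- as.POSIXct("2001-01-01 00:00:00", tz = "UTC")

strict    <- parsed >  cutoff   # excludes the record stamped exactly at the cutoff
inclusive <- parsed >= cutoff   # keeps it

print(data.frame(stamps, strict, inclusive))
```

A strict comparison leaves out records stamped exactly at midnight on the cutoff date, whereas the inclusive form keeps them.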
The PROPDMGEXP and CROPDMGEXP columns hold magnitude codes (K, M, B); they are converted to numeric multipliers in order to compute the economic damage.
# Magnitude codes and their multipliers ("" means no multiplier)
map <- c("", "B", "K", "M")
value <- c(1, 1e9, 1e3, 1e6)
for (i in seq_along(map)) {
  tbldfStorm$PROPDMGEXP[which(tbldfStorm$PROPDMGEXP == map[i])] <- value[i]
  tbldfStorm$CROPDMGEXP[which(tbldfStorm$CROPDMGEXP == map[i])] <- value[i]
}
# The columns are still character at this point; convert them to numeric
tbldfStorm <- mutate(tbldfStorm, PROPDMGEXP = as.numeric(PROPDMGEXP),
                     CROPDMGEXP = as.numeric(CROPDMGEXP))
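The same code-to-multiplier conversion can also be written as a vectorized lookup. The sketch below is an alternative and not the code used in this report: `exp_to_multiplier` is a hypothetical helper name, and unlike the loop above it accepts lower-case codes and returns `NA` for any code it does not recognise.

```r
# Hypothetical helper: map magnitude codes to numeric multipliers
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(trimws(code))
  mult <- unname(lookup[code])             # unknown codes (and "") give NA
  mult[!is.na(code) & code == ""] <- 1     # empty code means no multiplier
  mult
}

codes <- c("", "K", "M", "B", "k", "?")
multipliers <- exp_to_multiplier(codes)
print(data.frame(codes, multipliers))
```

Returning `NA` for unrecognised codes makes them visible in later sums (unless `na.rm = TRUE` is used) instead of silently treating them as a number.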
Finally some summaries are printed:
summary(select(tbldfStorm, FATALITIES, INJURIES))
## FATALITIES INJURIES
## Min. : 0.00000 Min. :0.00e+00
## 1st Qu.: 0.00000 1st Qu.:0.00e+00
## Median : 0.00000 Median :0.00e+00
## Mean : 0.01129 Mean :6.61e-02
## 3rd Qu.: 0.00000 3rd Qu.:0.00e+00
## Max. :158.00000 Max. :1.15e+03
summary(select(tbldfStorm, PROPDMGEXP, CROPDMGEXP))
## PROPDMGEXP CROPDMGEXP
## Min. :0.000e+00 Min. :1.000e+00
## 1st Qu.:1.000e+00 1st Qu.:1.000e+00
## Median :1.000e+03 Median :1.000e+03
## Mean :6.873e+04 Mean :1.096e+04
## 3rd Qu.:1.000e+03 3rd Qu.:1.000e+03
## Max. :1.000e+09 Max. :1.000e+09
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
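# Total fatalities and injuries per event type, sorted by combined casualties
dfSum <- group_by(tbldfStorm, EVTYPE) %>%
  summarise(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES)) %>%
  filter(Fatalities + Injuries > 0) %>%
  mutate(totalH = Fatalities + Injuries) %>%
  arrange(totalH)
# Keep the 20 most harmful event types and fix the factor order for plotting
dfPlot <- tail(select(dfSum, EVTYPE, Fatalities, Injuries), n = 20)
dfPlot$EVTYPE <- factor(dfPlot$EVTYPE, levels = dfPlot$EVTYPE)
# Long format: one row per event type and casualty kind
dfPlot <- melt(dfPlot, id.vars = "EVTYPE",
               measure.vars = c("Fatalities", "Injuries"),
               variable.name = "Casualties")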
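The aggregation and ranking steps can be checked by hand on a toy data frame. The sketch below uses base R (`aggregate` and `order`) instead of dplyr so that it runs without extra packages; the four rows are made-up values, not records from the storm dataset.

```r
# Made-up toy records using the same column names as the storm data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 1, 3, 0),
  INJURIES   = c(10, 0, 5, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined casualties; drop harmless event types and sort ascending
totals$totalH <- totals$FATALITIES + totals$INJURIES
totals <- totals[totals$totalH > 0, ]
totals <- totals[order(totals$totalH), ]

print(totals)
```

As in the report's pipeline, the most harmful event type ends up in the last row, which is why `tail()` is used to pick the top entries.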
The data is plotted in order to compare fatalities with injuries.
ggplot(dfPlot, aes(EVTYPE, value)) +
  geom_bar(aes(fill = Casualties), stat = "identity") +
  coord_flip() +
  theme_bw() +
  ylab("Total injured or killed") +
  xlab("Weather events") +
  ggtitle("Human casualties between 2001 and 2011") +
  theme(legend.position = c(0.7, 0.7),
        legend.text = element_text(size = 16),
        legend.title = element_blank())
Figure 1: Human casualties between 2001 and 2011 for the 20 most harmful weather events
The plot shows that Tornado and Excessive heat are the two most harmful events for both injuries and fatalities. Beyond these two, the ranking depends on whether injuries or fatalities are considered. As expected, injuries outnumber fatalities.
Finally, we check whether casualties show an upward trend over the 2001-2011 period.
print(group_by(tbldfStorm, Year = year(BGN_DATE)) %>%
        summarise(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES)) %>%
        mutate(Total = Fatalities + Injuries))
## Source: local data frame [11 x 4]
##
## Year Fatalities Injuries Total
## (dbl) (dbl) (dbl) (dbl)
## 1 2001 469 2716 3185
## 2 2002 498 3155 3653
## 3 2003 443 2931 3374
## 4 2004 370 2426 2796
## 5 2005 469 1834 2303
## 6 2006 599 3368 3967
## 7 2007 421 2191 2612
## 8 2008 488 2703 3191
## 9 2009 333 1354 1687
## 10 2010 425 1855 2280
## 11 2011 1002 7792 8794
There is no clear upward trend over the period; only 2011 stands out, with markedly more casualties than any other year.
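As a rough check of this reading, a straight line can be fitted to the yearly totals from the table above, once with all years and once without 2011. This is a minimal sketch: a linear fit on eleven points is only indicative, and the variable names are chosen here for illustration.

```r
# Yearly casualty totals copied from the table above
yearly <- data.frame(
  Year  = 2001:2011,
  Total = c(3185, 3653, 3374, 2796, 2303, 3967, 2612, 3191, 1687, 2280, 8794)
)

# Slope of a least-squares line: change in total casualties per year
slope_all <- coef(lm(Total ~ Year, data = yearly))[["Year"]]
slope_without_2011 <- coef(lm(Total ~ Year,
                              data = subset(yearly, Year < 2011)))[["Year"]]

print(c(all_years = slope_all, without_2011 = slope_without_2011))
```

The fitted slope is positive when 2011 is included and negative when it is left out, so any apparent increase is driven by that single year.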
Across the United States, which types of events have the greatest economic consequences?
# Property and crop damage in dollars, then summed per event type
dfSumEc <- mutate(tbldfStorm, totalPROP = PROPDMG * PROPDMGEXP,
                  totalCROP = CROPDMG * CROPDMGEXP) %>%
  select(EVTYPE, totalPROP, totalCROP) %>%
  group_by(EVTYPE) %>%
  summarise(Properties = sum(totalPROP), Crop = sum(totalCROP)) %>%
  filter(Properties + Crop > 0) %>%
  # Express the totals in billions of dollars
  mutate(Properties = Properties / 1e9, Crop = Crop / 1e9, total = Properties + Crop) %>%
  arrange(total)
# Keep the 20 costliest event types and fix the factor order for plotting
dfPlotEc <- tail(select(dfSumEc, EVTYPE, Properties, Crop), n = 20)
dfPlotEc$EVTYPE <- factor(dfPlotEc$EVTYPE, levels = dfPlotEc$EVTYPE)
dfPlotEc <- melt(dfPlotEc, id.vars = "EVTYPE",
                 measure.vars = c("Properties", "Crop"),
                 variable.name = "Type")
The data is plotted in order to compare property and crop damage.
ggplot(dfPlotEc, aes(EVTYPE, value)) +
  geom_bar(aes(fill = Type), stat = "identity") +
  coord_flip() +
  theme_bw() +
  ylab("Total damages in billion $") +
  xlab("Weather events") +
  ggtitle("Damages in billion dollars between 2001 and 2011") +
  theme(legend.position = c(0.7, 0.7),
        legend.text = element_text(size = 16),
        legend.title = element_blank())
Figure 2: Damages in billion dollars between 2001 and 2011 for the 20 most harmful weather events
The plot shows that Flood and Hurricane/Typhoon are the most economically devastating events. In general, crop damage is lower than property damage, except for the Drought event.
Finally, we check whether damage shows an upward trend over the 2001-2011 period.
group_by(tbldfStorm, Year = year(BGN_DATE)) %>%
  summarise(Properties = sum(PROPDMG * PROPDMGEXP), Crop = sum(CROPDMG * CROPDMGEXP)) %>%
  mutate(Total = Properties + Crop) %>%
  # Format as text with an apostrophe as thousands separator
  mutate(Properties = format(Properties, big.mark = "'"),
         Crop = format(Crop, big.mark = "'"),
         Total = format(Total, big.mark = "'"))
## Source: local data frame [11 x 4]
##
## Year Properties Crop Total
## (dbl) (chr) (chr) (chr)
## 1 2001 10'026'988'670 1'780'588'100 11'807'576'770
## 2 2002 4'100'882'450 1'410'368'140 5'511'250'590
## 3 2003 10'254'548'240 1'143'070'350 11'397'618'590
## 4 2004 25'346'598'870 1'452'177'850 26'798'776'720
## 5 2005 96'789'791'170 4'035'202'300 100'824'993'470
## 6 2006 121'937'434'190 3'534'238'700 125'471'672'890
## 7 2007 5'788'934'160 1'691'152'000 7'480'086'160
## 8 2008 15'568'383'080 2'209'793'000 17'778'176'080
## 9 2009 5'227'204'130 522'220'000 5'749'424'130
## 10 2010 9'246'487'640 1'785'286'000 11'031'773'640
## 11 2011 20'888'981'960 666'742'000 21'555'723'960
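The years 2005 and 2006 stand out, with total damage of roughly $101 billion and $125 billion respectively. The total for 2011 is also relatively high at about $21.6 billion, although it remains below the 2004 total of about $26.8 billion.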
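To put these two years in perspective, their share of the damage over the whole period can be computed directly from the totals in the table above. This is a small sketch with variable names chosen for illustration.

```r
# Yearly total damage in dollars, copied from the table above
total_damage <- c(
  "2001" = 11807576770,  "2002" = 5511250590,   "2003" = 11397618590,
  "2004" = 26798776720,  "2005" = 100824993470, "2006" = 125471672890,
  "2007" = 7480086160,   "2008" = 17778176080,  "2009" = 5749424130,
  "2010" = 11031773640,  "2011" = 21555723960
)

# Fraction of the 2001-2011 total that falls in 2005 and 2006
share <- sum(total_damage[c("2005", "2006")]) / sum(total_damage)

print(round(share, 3))
```

By this measure, 2005 and 2006 together account for well over half of the damage recorded over the eleven years.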