Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The analysis addresses two questions: (1) which types of events are most harmful to population health, and (2) which types of events have the greatest economic consequences? The dataset provides the fatalities, injuries, property damage, and crop damage estimates for each event. The analysis concludes that tornadoes are the most harmful to population health and that floods have the greatest economic consequences.
1. Download the storm data from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 and save it to the /data folder.
2. Read the bz2 file directly.
3. The download command cannot be cached, so I commented it out after running it once.
4. Reading a CSV file of 40+ MB is time-consuming, so I cached it.
library(dplyr)  # provides %>%, group_by(), summarise(), arrange(), filter(), bind_rows()
## download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","data/storm_data.csv.bz2",cacheOK=TRUE)
storm <- read.csv('./data/storm_data.csv.bz2')
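The one-time download described in step 3 could also be guarded by a file-existence check instead of being commented out. The sketch below is only an illustration: `download_once` and its `fetch` parameter are hypothetical names that do not appear in the original analysis.

```r
# Hypothetical helper: fetch `url` into `dest` only when `dest` is missing.
# `fetch` defaults to download.file; it is a parameter so the logic can be
# exercised without network access.
download_once <- function(url, dest, fetch = download.file) {
  if (file.exists(dest)) {
    return(invisible(FALSE))  # already on disk, nothing to do
  }
  dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
  fetch(url, dest, mode = "wb")  # binary mode keeps the bz2 archive intact
  invisible(TRUE)
}

# Usage with the URL and path from the steps above (not run here):
# download_once(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "data/storm_data.csv.bz2"
# )
```

With a guard like this the call can stay in the script, because repeated runs skip the download once the file exists.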
This analysis uses fatalities and injuries as the measure of harm to population health. It includes both direct and indirect fatalities and injuries.
# desc() takes a single vector, so each sort key gets its own desc() call
storm %>% group_by(EVTYPE) %>% summarise(total_fatalities = sum(FATALITIES, na.rm = TRUE), total_injuries = sum(INJURIES, na.rm = TRUE)) %>% arrange(desc(total_fatalities), desc(total_injuries)) -> a
# Grouped Bar Plot
b<-a[1:4,]
barplot(t(as.matrix(b[, 2:3])), main="Top 4 Fatalities/Injuries Distribution by Type of Events",names.arg=b$EVTYPE,
xlab="Type of Events", legend=colnames(b[,2:3]),col=c("darkblue","red"),beside=TRUE)
The grouped bar plot shows the top 4 event types ranked by total fatalities and injuries. Tornadoes are the most harmful with respect to population health, causing 5,633 fatalities and 91,346 injuries.
There are two types of damage estimates available in the dataset: property damage and crop damage. The variables are PROPDMG and CROPDMG. PROPDMGEXP and CROPDMGEXP give their respective magnitudes.
According to the Storm Data preparation documentation (10-1605_StormDataPrep.doc), DMGEXP should be an alphabetical character signifying the magnitude of the number, e.g., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
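As a worked example of the documented convention, the following sketch converts an amount and its magnitude code to dollars. The names `documented_multiplier` and `to_dollars` are hypothetical, and treating lowercase letters the same as uppercase is an assumption; the documentation quoted above lists only “K”, “M”, and “B”.

```r
# Multipliers for the three documented magnitude codes.
documented_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a reported amount and its code to dollars.
# Codes outside K/M/B yield NA, because the documentation does not define them.
to_dollars <- function(amount, code) {
  amount * unname(documented_multiplier[toupper(as.character(code))])
}

to_dollars(1.55, "B")                # the documentation's example, $1,550,000,000
to_dollars(c(25, 2.5), c("K", "M"))  # vectorised over amounts and codes
```

Returning NA for undocumented codes keeps them visible, so they can be examined one by one rather than silently counted.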
storm %>% group_by(EVTYPE) %>% select(PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,EVTYPE,REMARKS) -> cost
summary(cost)
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
## EVTYPE
## HAIL :288661
## TSTM WIND :219940
## THUNDERSTORM WIND: 82563
## TORNADO : 60652
## FLASH FLOOD : 54277
## FLOOD : 25326
## (Other) :170878
## REMARKS
## :287433
## : 24013
## Trees down.\n : 1110
## Several trees were blown down.\n : 568
## Trees were downed.\n : 446
## Large trees and power lines were blown down.\n: 432
## (Other) :588295
As the summary indicates, PROPDMGEXP and CROPDMGEXP are not complete.
There are three major issues:
1. Missing DMGEXP data
2. Category “5” is not clear
3. Category “0” is not clear
Then, I’ll try to map each category to the correct unit.
cost %>% filter(PROPDMGEXP=="") -> outliers_na
summary(outliers_na)
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00000 :465934 Min. : 0.0000 :461616
## 1st Qu.: 0.00000 - : 0 1st Qu.: 0.0000 K : 3865
## Median : 0.00000 ? : 0 Median : 0.0000 M : 443
## Mean : 0.00113 + : 0 Mean : 0.5121 B : 4
## 3rd Qu.: 0.00000 0 : 0 3rd Qu.: 0.0000 0 : 3
## Max. :75.00000 1 : 0 Max. :990.0000 ? : 2
## (Other): 0 (Other): 1
## EVTYPE REMARKS
## HAIL :196662 :245306
## TSTM WIND :157095 : 16990
## FLASH FLOOD : 21319 Trees down.\n : 590
## THUNDERSTORM WINDS: 8951 Penny size hail was observed.\n: 315
## TORNADO : 8805 Trees were downed.\n : 279
## HEAVY SNOW : 8695 Trees were blown down.\n : 234
## (Other) : 64407 (Other) :202220
Because the mean PROPDMG in these rows is close to zero and the maximum is only 75, I treat the damage estimate as zero whenever DMGEXP is blank.
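A rough bound on what is lost by dropping these rows can be computed from the summary figures alone. The sketch below assumes, purely for illustration, that every blank code actually meant “K”; the variable names are hypothetical, and the mean is the rounded value printed in the summary.

```r
n_blank    <- 465934   # rows with a blank PROPDMGEXP (from the summary above)
mean_blank <- 0.00113  # mean PROPDMG in those rows (rounded in the summary)

# Total of the raw PROPDMG values in the blank rows, roughly 527 units.
raw_total <- n_blank * mean_blank

# Illustrative assumption: if every blank stood for thousands of dollars,
# the property damage ignored would still be well under one million dollars.
upper_bound_dollars <- raw_total * 1e3
upper_bound_dollars
```

Compared with the roughly 150 billion dollars this report attributes to floods alone, an omission of this size does not change the ranking.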
cost %>% filter(PROPDMGEXP=='5') -> outliers_5
tail(outliers_5)
## # A tibble: 6 x 6
## # Groups: EVTYPE [6]
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP EVTYPE REMARKS
## <dbl> <fct> <dbl> <fct> <fct> <fct>
## 1 0. 5 0. "" HAIL " "
## 2 0. 5 0. "" THUNDERSTORM WINDS "A large awnin~
## 3 0.200 5 0. "" TORNADO " "
## 4 0.700 5 0. "" FLOODING " "
## 5 13.0 5 0. "" LIGHTNING "Lightning set~
## 6 6.40 5 430. K FLASH FLOOD " "
After examining the remarks in the “5” category, I concluded that “5” represents 5K. See row 5 of the output above.
cost %>% filter(PROPDMGEXP=='0') -> outliers_0
summary(outliers_0)
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 0 :216 Min. : 0.000 :211
## 1st Qu.: 10.00 : 0 1st Qu.: 0.000 K : 4
## Median : 30.00 - : 0 Median : 0.000 M : 1
## Mean : 32.91 ? : 0 Mean : 1.002 ? : 0
## 3rd Qu.: 50.00 + : 0 3rd Qu.: 0.000 0 : 0
## Max. :150.00 1 : 0 Max. :160.000 2 : 0
## (Other): 0 (Other): 0
## EVTYPE
## THUNDERSTORM WINDS:158
## LIGHTNING : 13
## HAIL : 12
## FLASH FLOOD : 10
## TORNADO : 9
## FLOOD/FLASH FLOOD : 2
## (Other) : 12
## REMARKS
## : 27
## Thunderstorm winds blew down a large tree east of Hampton and knocked power lines down in Hampton and McDonough. : 3
## Thunderstorm winds knocked down a couple of trees. : 2
## Thunderstorm winds knocked down a pine tree near Starrs Mill and Bradford pear tree west of Hampton. : 2
## Thunderstorm winds knocked down trees and power lines. : 2
## Thunderstorm winds knocked numerous trees down on power lines. : 2
## (Other) :178
I cannot find any information about this category in the documentation or the remarks. Since the majority of these events are Thunderstorm Winds, I make an educated guess that the unit is “M”.
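The unit choices made so far can be gathered into a single lookup table. The sketch below uses base R and a few made-up rows only to show the shape of the calculation; `report_multiplier`, `damage_dollars`, and `toy` are hypothetical names, and applying one table to both the property and the crop column is a simplification of the per-category steps in this report.

```r
# Multipliers as interpreted in this report. "5" -> 5,000 and "0" -> 1e6 are
# the report's own guesses, not documented values; blank and any other code
# count as zero here.
report_multiplier <- c(K = 1e3, k = 1e3, M = 1e6, B = 1e9, "5" = 5e3, "0" = 1e6)

# Hypothetical helper: dollars for a vector of amounts and magnitude codes.
damage_dollars <- function(amount, code) {
  mult <- unname(report_multiplier[as.character(code)])
  mult[is.na(mult)] <- 0  # blank or unmapped code contributes nothing
  amount * mult
}

# Toy rows with made-up numbers, not taken from the storm database.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "TORNADO", "HAIL"),
  PROPDMG    = c(1.5, 20, 3, 75),
  PROPDMGEXP = c("B", "K", "M", ""),
  CROPDMG    = c(0, 10, 0, 5),
  CROPDMGEXP = c("", "k", "", "M"),
  stringsAsFactors = FALSE
)
toy$total <- damage_dollars(toy$PROPDMG, toy$PROPDMGEXP) +
  damage_dollars(toy$CROPDMG, toy$CROPDMGEXP)

# Sum per event type and rank from most to least costly.
totals <- aggregate(total ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(-totals$total), ]
totals
```

Keeping the guesses in one named vector makes it easy to rerun the ranking under a different assumption, for example by changing the multiplier for “0”.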
cost %>% group_by(EVTYPE) %>% filter(PROPDMGEXP=='K') %>% summarise(total=sum(PROPDMG*1000)) -> cost_p_k
cost %>% group_by(EVTYPE) %>% filter(PROPDMGEXP=='M') %>% summarise(total=sum(PROPDMG*1000000)) -> cost_p_m
cost %>% group_by(EVTYPE) %>% filter(PROPDMGEXP=='0') %>% summarise(total=sum(PROPDMG*1000000)) -> cost_p_0
cost %>% group_by(EVTYPE) %>% filter(PROPDMGEXP=='B') %>% summarise(total=sum(PROPDMG*1000000000)) -> cost_p_b
cost %>% group_by(EVTYPE) %>% filter(PROPDMGEXP=='5') %>% summarise(total=sum(PROPDMG*5000)) -> cost_p_5k
cost %>% group_by(EVTYPE) %>% filter(CROPDMGEXP=='K') %>% summarise(total=sum(CROPDMG*1000)) -> cost_c_k1
cost %>% group_by(EVTYPE) %>% filter(CROPDMGEXP=='M') %>% summarise(total=sum(CROPDMG*1000000)) -> cost_c_m
cost %>% group_by(EVTYPE) %>% filter(CROPDMGEXP=='k') %>% summarise(total=sum(CROPDMG*1000)) -> cost_c_k2
cost %>% group_by(EVTYPE) %>% filter(CROPDMGEXP=='0') %>% summarise(total=sum(CROPDMG*1000000)) -> cost_c_0
cost %>% group_by(EVTYPE) %>% filter(CROPDMGEXP=='B') %>% summarise(total=sum(CROPDMG*1000000000)) -> cost_c_b
bind_rows(cost_p_k,cost_p_m,cost_p_0,cost_p_b,cost_p_5k,cost_c_k1,cost_c_m,cost_c_k2,cost_c_0,cost_c_b) -> result
result %>% group_by(EVTYPE) %>% summarise(total_exp=sum(total)) %>% arrange(desc(total_exp)) -> result2
top5<-result2[1:5,]
x <- barplot(top5$total_exp, main="Top 5 Event Types with the Greatest Economic Consequences",names.arg=top5$EVTYPE,
xlab="Type of Events",ylab="Total Damage Cost",beside=TRUE, las=2,xaxt="n",yaxt="n")
text(cex=0.6, x=x-.25, y=-2.25, top5$EVTYPE, xpd=TRUE, srt=45)
axis(2, at=top5$total_exp, labels=format(paste(round(top5$total_exp/1e9,1),"B"), scientific=FALSE), hadj=0.9, cex.axis=0.8, las=2)
As the two charts show, tornadoes caused the most harm to population health, with 5,633 deaths and 91,346 injuries. Floods caused the greatest economic loss, about 150.3 billion dollars.