Coursera Reproducible Research Peer Assessment 2
Introduction
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Synopsis
The analysis of the storm event database shows that tornadoes are the weather event most harmful to population health, followed by excessive heat. The economic impact of weather events was also analyzed. Flash floods and thunderstorm winds caused billions of dollars in property damage between 1950 and 2011. The largest crop damage was caused by drought, followed by floods and hail.
Data Processing
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The data were downloaded from the Coursera Reproducible Research web site (Storm Data, 47 MB).
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
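One way to check the claim about sparser early records is to count events per year from the BGN_DATE column. The sketch below is a minimal example, assuming the date strings follow the month/day/year layout used in this dataset; the helper name `events_per_year` and the small `toy` data frame are illustrative and not part of the original analysis.

```r
# Hypothetical helper: count events per calendar year from BGN_DATE strings.
events_per_year <- function(bgn_date) {
  # Parse only the date part; the trailing " 0:00:00" is ignored.
  dates <- as.Date(bgn_date, format = "%m/%d/%Y")
  years <- as.integer(format(dates, "%Y"))
  table(years)
}

# Toy input in the same layout as the BGN_DATE column.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
               "6/8/1951 0:00:00", "11/15/1951 0:00:00"),
  stringsAsFactors = FALSE
)

counts <- events_per_year(toy$BGN_DATE)
print(counts)
```

Once the full data set is loaded, `events_per_year(storm.data$BGN_DATE)` would give the yearly counts for the whole database.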
#setting the environment
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 tools_3.2.0 htmltools_0.2.6 yaml_2.1.13
## [5] stringi_0.4-1 rmarkdown_0.7 knitr_1.10.5 stringr_1.0.0
## [9] digest_0.6.8 evaluate_0.7
#set working directory
setwd("C:/--Coursera/assessments/")
#load required packages
library(plyr)
library(data.table)
library(ggplot2)
#open the bzip2-compressed file as a connection; read.csv opens and closes it itself
StormData.csv <- bzfile("C:/--Coursera/assessments/repdata_data_StormData.csv.bz2")
#read file
storm.data <- read.csv(StormData.csv, sep = ",", stringsAsFactors = FALSE)
head(storm.data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
summary(storm.data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI
## Min. : 0.000 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character
## Median : 0.000 Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_DATE END_TIME COUNTY_END COUNTYENDN
## Length:902297 Length:902297 Min. :0 Mode:logical
## Class :character Class :character 1st Qu.:0 NA's:902297
## Mode :character Mode :character Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
##
## END_RANGE END_AZI END_LOCATI
## Min. : 0.0000 Length:902297 Length:902297
## 1st Qu.: 0.0000 Class :character Class :character
## Median : 0.0000 Mode :character Mode :character
## Mean : 0.9862
## 3rd Qu.: 0.0000
## Max. :925.0000
##
## LENGTH WIDTH F MAG
## Min. : 0.0000 Min. : 0.000 Min. :0.0 Min. : 0.0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.:0.0 1st Qu.: 0.0
## Median : 0.0000 Median : 0.000 Median :1.0 Median : 50.0
## Mean : 0.2301 Mean : 7.503 Mean :0.9 Mean : 46.9
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.:1.0 3rd Qu.: 75.0
## Max. :2315.0000 Max. :4400.000 Max. :5.0 Max. :22000.0
## NA's :843563
## FATALITIES INJURIES PROPDMG
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00
## Median : 0.0000 Median : 0.0000 Median : 0.00
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50
## Max. :583.0000 Max. :1700.0000 Max. :5000.00
##
## PROPDMGEXP CROPDMG CROPDMGEXP
## Length:902297 Min. : 0.000 Length:902297
## Class :character 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Mode :character
## Mean : 1.527
## 3rd Qu.: 0.000
## Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
str(storm.data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
#the storm data set has 902297 rows and 37 columns
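As a side note on the loading step: base R's read.csv() can decompress bzip2 files on the fly, either through a bzfile() connection or when given the file path directly. The following is a minimal, self-contained sketch using a temporary file; the file and the `toy` data frame are made up for illustration.

```r
# Write a small CSV through a bzip2 connection, then read it back.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv.bz2")   # hypothetical temporary file
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the unopened connection, reads it, and closes it again.
via.connection <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)

# Passing the path works too: the compression is detected when reading.
via.path <- read.csv(tmp, stringsAsFactors = FALSE)

unlink(tmp)  # unlink() takes a file path, not a connection
```

Either form avoids having to decompress the file to disk first.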
Clean up storm.data by converting PROPDMG and CROPDMG to dollar values using their exponent columns. According to the documentation: “Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.”
storm.data$PROPDMGEXP <- as.character(storm.data$PROPDMGEXP)
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP == "" | storm.data$PROPDMGEXP == "+" | storm.data$PROPDMGEXP == "?" | storm.data$PROPDMGEXP == "-"] <- "1"
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP == "H" | storm.data$PROPDMGEXP == "h"] <- "100"
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP == "K" | storm.data$PROPDMGEXP == "k"] <- "1000"
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP == "M" | storm.data$PROPDMGEXP == "m"] <- "1000000"
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP == "B" | storm.data$PROPDMGEXP == "b"] <- "1000000000"
storm.data$PROPDMGEXP <- as.numeric(storm.data$PROPDMGEXP)
storm.data$PROPDMGUSD <- storm.data$PROPDMG * storm.data$PROPDMGEXP
storm.data$CROPDMGEXP <- as.character(storm.data$CROPDMGEXP)
storm.data$CROPDMGEXP[storm.data$CROPDMGEXP == "" | storm.data$CROPDMGEXP == "?"] <- "1"
storm.data$CROPDMGEXP[storm.data$CROPDMGEXP == "B" | storm.data$CROPDMGEXP == "b"] <- "1000000000"
storm.data$CROPDMGEXP[storm.data$CROPDMGEXP == "M" | storm.data$CROPDMGEXP == "m"] <- "1000000"
storm.data$CROPDMGEXP[storm.data$CROPDMGEXP == "K" | storm.data$CROPDMGEXP == "k"] <- "1000"
storm.data$CROPDMGEXP <- as.numeric(storm.data$CROPDMGEXP)
storm.data$CROPDMGUSD <- storm.data$CROPDMG * storm.data$CROPDMGEXP
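The exponent handling above can also be expressed as a single lookup table. The following is a minimal sketch, not the code used for the results in this report: the helper name `exp_to_multiplier` is hypothetical, and treating every unrecognised code as a multiplier of 1 is an assumption that differs from the code above, which keeps digit codes as their numeric value.

```r
# Hypothetical helper: map damage exponent codes to numeric multipliers.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  # Assumption: blank or unrecognised codes ("", "+", "?", "-", digits)
  # mean "no scaling".
  mult[is.na(mult)] <- 1
  mult
}

codes  <- c("K", "m", "B", "", "?", "h")
values <- c(25, 2.5, 1.55, 10, 3, 4)
usd <- values * exp_to_multiplier(codes)
print(usd)
```

A lookup keeps the mapping in one place, so the same helper could serve both PROPDMGEXP and CROPDMGEXP.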
# aggregate storm.data by EVTYPE
storm.data.evtype <- aggregate(cbind(FATALITIES, INJURIES, PROPDMGUSD, CROPDMGUSD) ~ EVTYPE, data = storm.data, FUN = sum)
# Add calculated column 'health' as a sum of FATALITIES and INJURIES
storm.data.evtype$health <- storm.data.evtype$FATALITIES + storm.data.evtype$INJURIES
# Add calculated column 'damage' as a sum of PROPDMGUSD and CROPDMGUSD
storm.data.evtype$damage <- storm.data.evtype$PROPDMGUSD + storm.data.evtype$CROPDMGUSD
#Clean up event type (EVTYPE) duplicates by converting to upper case. Note that storm.data.evtype above was aggregated before this step.
storm.data$EVTYPE <- toupper(storm.data$EVTYPE)
event.type <- sort(unique(storm.data$EVTYPE))
#show first 50 event types
event.type[1:50]
## [1] " HIGH SURF ADVISORY" " COASTAL FLOOD"
## [3] " FLASH FLOOD" " LIGHTNING"
## [5] " TSTM WIND" " TSTM WIND (G45)"
## [7] " WATERSPOUT" " WIND"
## [9] "?" "ABNORMAL WARMTH"
## [11] "ABNORMALLY DRY" "ABNORMALLY WET"
## [13] "ACCUMULATED SNOWFALL" "AGRICULTURAL FREEZE"
## [15] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [19] "AVALANCHE" "BEACH EROSIN"
## [21] "BEACH EROSION" "BEACH EROSION/COASTAL FLOOD"
## [23] "BEACH FLOOD" "BELOW NORMAL PRECIPITATION"
## [25] "BITTER WIND CHILL" "BITTER WIND CHILL TEMPERATURES"
## [27] "BLACK ICE" "BLIZZARD"
## [29] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"
## [31] "BLIZZARD SUMMARY" "BLIZZARD WEATHER"
## [33] "BLIZZARD/FREEZING RAIN" "BLIZZARD/HEAVY SNOW"
## [35] "BLIZZARD/HIGH WIND" "BLIZZARD/WINTER STORM"
## [37] "BLOW-OUT TIDE" "BLOW-OUT TIDES"
## [39] "BLOWING DUST" "BLOWING SNOW"
## [41] "BLOWING SNOW- EXTREME WIND CHI" "BLOWING SNOW & EXTREME WIND CH"
## [43] "BLOWING SNOW/EXTREME WIND CHIL" "BREAKUP FLOODING"
## [45] "BRUSH FIRE" "BRUSH FIRES"
## [47] "COASTAL FLOODING/EROSION" "COASTAL EROSION"
## [49] "COASTAL FLOOD" "COASTAL FLOODING"
#event type to a factor
storm.data$EVTYPE <- as.factor(storm.data$EVTYPE)
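Upper-casing alone leaves variants such as leading spaces (" TSTM WIND") or abbreviations ("TSTM WIND" versus "THUNDERSTORM WIND") as separate event types. Below is a minimal sketch of a stricter normalisation, applied before aggregating so that the variants are summed together. The helper name `normalize_evtype` and the single synonym rule are assumptions for illustration; they were not applied to the results in this report.

```r
# Hypothetical helper: trim, upper-case, and merge one common abbreviation.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("\\s+", " ", x)                       # collapse repeated spaces
  x <- sub("^TSTM WIND", "THUNDERSTORM WIND", x)  # assumed synonym rule
  x
}

toy <- data.frame(
  EVTYPE = c(" TSTM WIND", "Tstm Wind", "THUNDERSTORM WIND", "TORNADO"),
  INJURIES = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
toy$EVTYPE <- normalize_evtype(toy$EVTYPE)

# Aggregate only after normalising, so the variants fall into one group.
by.type <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)
print(by.type)
```

In this toy example the three thunderstorm wind spellings collapse into one row, which is the effect a fuller clean-up would aim for on the real EVTYPE values.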
#top 10 fatalities by Event
fatalities <- as.data.table(subset(aggregate(FATALITIES ~ EVTYPE, data = storm.data.evtype,
FUN = "sum"), FATALITIES > 0))
fatalities <- fatalities[order(-FATALITIES), ]
fatalities[1:10,]
## EVTYPE FATALITIES
## 1: TORNADO 5633
## 2: EXCESSIVE HEAT 1903
## 3: FLASH FLOOD 978
## 4: HEAT 937
## 5: LIGHTNING 816
## 6: TSTM WIND 504
## 7: FLOOD 470
## 8: RIP CURRENT 368
## 9: HIGH WIND 248
## 10: AVALANCHE 224
#top 10 injuries by Event
injuries <- as.data.table(subset(aggregate(INJURIES ~ EVTYPE, data = storm.data.evtype,
FUN = "sum"), INJURIES > 0))
injuries <- injuries[order(-INJURIES), ]
injuries[1:10, ]
## EVTYPE INJURIES
## 1: TORNADO 91346
## 2: TSTM WIND 6957
## 3: FLOOD 6789
## 4: EXCESSIVE HEAT 6525
## 5: LIGHTNING 5230
## 6: HEAT 2100
## 7: ICE STORM 1975
## 8: FLASH FLOOD 1777
## 9: THUNDERSTORM WIND 1488
## 10: HAIL 1361
#Tornadoes have by far the highest health consequences for both fatalities and injuries, followed by excessive heat for fatalities and thunderstorm wind (TSTM WIND) for injuries.
fatalities.plot <- ggplot(data = fatalities[1:10,], aes(EVTYPE, FATALITIES, fill = FATALITIES)) + geom_bar(stat = "identity") + ggtitle("Fatalities by Event") +
xlab("Event") + ylab("Fatalities") +
coord_flip()
fatalities.plot
injuries.plot <- ggplot(data = injuries[1:10, ], aes(EVTYPE, INJURIES, fill = INJURIES)) + geom_bar(stat = "identity") +
ggtitle("Injuries by Event") + xlab("Event") + ylab("Injuries") +
coord_flip()
injuries.plot
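ggplot2 draws bars in factor-level order, which is alphabetical by default, so the bars in the two charts above are not sorted by value. One common option is reorder() from the stats package, which reorders factor levels by a numeric variable. The sketch below uses the first three rows of the fatalities table; the plotting call is guarded in case ggplot2 is not installed.

```r
# First three rows of the fatalities table above.
top <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD"),
  FATALITIES = c(5633, 1903, 978),
  stringsAsFactors = FALSE
)

# Order the factor levels by FATALITIES (ascending); after coord_flip()
# the largest value ends up at the top of the chart.
top$EVTYPE <- reorder(top$EVTYPE, top$FATALITIES)
print(levels(top$EVTYPE))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(top, aes(EVTYPE, FATALITIES)) +
    geom_bar(stat = "identity") +
    coord_flip()
  # print(p) would draw the chart
}
```

The same idea would apply to the injuries chart by reordering on INJURIES instead.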
Economic Damage
storm.damage <- storm.data.evtype[order(storm.data.evtype$damage, decreasing = TRUE),][1:10,]
storm.damage.property <- storm.damage[,c(1,4)]
names(storm.damage.property)[2] <- "DAMAGE"
storm.damage.property$TYPE <- "PROPERTY"
#property damage in US$ million
storm.damage.property$DAMAGE <- storm.damage.property$DAMAGE / 1000000
storm.damage.property [1:10,c("EVTYPE","DAMAGE")]
## EVTYPE DAMAGE
## 170 FLOOD 144657.710
## 411 HURRICANE/TYPHOON 69305.840
## 834 TORNADO 56937.161
## 670 STORM SURGE 43323.536
## 244 HAIL 15732.267
## 153 FLASH FLOOD 16140.812
## 95 DROUGHT 1046.106
## 402 HURRICANE 11868.319
## 590 RIVER FLOOD 5118.945
## 427 ICE STORM 3944.928
storm.damage.crop <- storm.damage[,c(1,5)]
names(storm.damage.crop)[2] <- "DAMAGE"
storm.damage.crop$TYPE <- "CROP"
#crop damage in US$ million
storm.damage.crop$DAMAGE <- storm.damage.crop$DAMAGE / 1000000
storm.damage.crop [1:10,c("EVTYPE","DAMAGE")]
## EVTYPE DAMAGE
## 170 FLOOD 5661.9685
## 411 HURRICANE/TYPHOON 2607.8728
## 834 TORNADO 414.9531
## 670 STORM SURGE 0.0050
## 244 HAIL 3025.9545
## 153 FLASH FLOOD 1421.3171
## 95 DROUGHT 13972.5660
## 402 HURRICANE 2741.9100
## 590 RIVER FLOOD 5029.4590
## 427 ICE STORM 5022.1135
damage.property.plot <- ggplot(data = storm.damage.property, aes(EVTYPE, DAMAGE, fill = TYPE)) +
geom_bar(stat="identity") + coord_flip()+
ggtitle("Property Damage") +
ylab("Damage in US$ million")+
xlab("Event Type")+
theme(legend.position = "none")
damage.property.plot
damage.crop.plot <- ggplot(data = storm.damage.crop, aes(EVTYPE, DAMAGE, fill = TYPE)) +
geom_bar(stat="identity") + coord_flip()+
ggtitle("Crop Damage") +
ylab("Damage in US$ million")+
xlab("Event Type")+
theme(legend.position = "none")
damage.crop.plot
Top Events that caused the highest damage
storm.damage.plot <- rbind(storm.damage.property , storm.damage.crop)
storm.damage.plot <- transform(storm.damage.plot, EVTYPE=reorder(EVTYPE, -DAMAGE) )
#stacked bars of property and crop damage per event type
ggplot(data = storm.damage.plot, aes(EVTYPE, DAMAGE, fill = TYPE)) +
geom_bar(stat = "identity") +
ggtitle("Economic damage") +
ylab("Damage in US$ million") +
xlab("") +
scale_fill_discrete("") + theme(axis.text.x = element_text(angle = 90))
Conclusions
Tornadoes are by far the leading cause of both fatalities and injuries, followed by excessive heat for fatalities and thunderstorm winds for injuries.
Floods are the most economically damaging event type, with about US$150 billion in combined property and crop damage.
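The ranking behind this conclusion can be re-derived from the two damage tables above by adding property and crop damage per event type. Below is a minimal sketch using four rows of those tables, with values in US$ million copied from the output above; the column name `TOTAL.BN` is introduced here for illustration.

```r
# Property and crop damage in US$ million, copied from the tables above.
damage <- data.frame(
  EVTYPE   = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "DROUGHT"),
  PROPERTY = c(144657.710, 69305.840, 56937.161, 1046.106),
  CROP     = c(5661.9685, 2607.8728, 414.9531, 13972.5660),
  stringsAsFactors = FALSE
)

# Combined damage, converted from US$ million to US$ billion.
damage$TOTAL.BN <- (damage$PROPERTY + damage$CROP) / 1000
damage <- damage[order(-damage$TOTAL.BN), ]
print(damage[, c("EVTYPE", "TOTAL.BN")])
```

Drought ranks last of these four on combined damage even though it has the largest crop damage, because its property damage is comparatively small.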