Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, property damage, and crop damage and preventing such outcomes to the extent possible is a key concern.
So, the goal of this report is to show which types of storms and other severe weather events are most harmful to population health and have greatest economic consequences across the United Sates in the year 2010. The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database was used for this report and specifically obtained data for the year 2010 (the most recent complete year available).
Data showed that among storms and other weather events in the U.S., most fatalities were linked to Flash Flood followed by Rip Current while most injuries were caused by Tornado. Furthermore, results indicated that the combined estimated amount of Property Damage and Crop Damage in 2010 was due to Flood and Hail.
Load the file from the course web site
setwd("C:/Users/jggomez/Desktop/Data Science/Course 5")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"Week4/StormData.csv")
setwd("C:/Users/jggomez/Desktop/Data Science/Course 5/Week4")
Read the raw file, which is a CSV file, included in the zip archive. Blank lines and all comments are ignored.
StormData = read.csv("StormData.csv", na.strings = "",
blank.lines.skip = TRUE, comment.char = "")
str(StormData) ##Check the internal structure of StormData dataset
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 34 levels " N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
## $ BGN_LOCATI: Factor w/ 54428 levels "- 1 N Albion",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_DATE : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_TIME : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_LOCATI: Factor w/ 34505 levels "- .5 NNW","- 11 ESE Jay",..: NA NA NA NA NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
## $ WFO : Factor w/ 541 levels " CI","$AC","$AG",..: NA NA NA NA NA NA NA NA NA NA ...
## $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
## $ ZONENAMES : Factor w/ 25111 levels " "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436780 levels "-2 at Deer Park\n",..: NA NA NA NA NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
It was noted that events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
With that, events from 1 Jan 2010 to 31 Dec 2010 were used for this report to capture the most recent complete year available. The BGN_DATE variable was used as date of event.
StormData$BGN_DATE <- as.POSIXct(strptime(x = as.character(StormData$BGN_DATE),
format = "%m/%d/%Y")) ##transform BGN_DATE to date class
class(StormData$BGN_DATE)
## [1] "POSIXct" "POSIXt"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Storm2010 <- filter(StormData, BGN_DATE > "2009-12-31" &
BGN_DATE < "2011-01-01") ## filter 2010 events
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
Fatalities=ddply(Storm2010,~EVTYPE,summarise,
SumFatalities=sum(FATALITIES,na.rm=TRUE))
##summarize fatalities by type of events
HighFatalities=arrange(Fatalities,desc(SumFatalities))
HighFatalities$Percentage <- round(HighFatalities$SumFatalities /
sum(HighFatalities$SumFatalities) * 100,2)
HighFatalities = head(HighFatalities,10)
HighFatalities
## EVTYPE SumFatalities Percentage
## 1 FLASH FLOOD 67 15.76
## 2 RIP CURRENT 49 11.53
## 3 TORNADO 45 10.59
## 4 HEAT 44 10.35
## 5 FLOOD 41 9.65
## 6 EXCESSIVE HEAT 39 9.18
## 7 LIGHTNING 29 6.82
## 8 COLD/WIND CHILL 26 6.12
## 9 AVALANCHE 22 5.18
## 10 HIGH SURF 17 4.00
Show the top 10 storms and weather events with most number of fatalities in bar plot.
a=HighFatalities$EVTYPE
b=HighFatalities$SumFatalities
c = matrix (b, nrow=1, byrow=TRUE)
colnames(c)=a
y1=round(max(b),-1) ## to be set as y-axis limit
par(mar = c(7, 4, 2, 2) + 0.2)
end_point = 0.5 + ncol(c) + ncol(c)-1
barplot(b,main = "Storms and Weather Events with High Fatalities in 2010",
xlab = "Type of Event", ylab="Fatalities",col = c("red"),
ylim = c(0,y1), cex.axis=0.7,cex.names=0.7,width=1.75)
text(seq(1.5,end_point,by=2), par("usr")[3]-0.25,
srt = 60, adj= 1, xpd = TRUE,
labels = paste(colnames(c)), cex=0.5) ##60 deg x-axis label
FLASH FLOOD showed the highest fatalities with 67 cases or 15.76% of total fatalities in 2010.
This is followed by RIP CURRENT with 49 cases or 11.53% of total fatalities.
Injuries=ddply(Storm2010,~EVTYPE,summarise,
SumInjuries=sum(INJURIES,na.rm=TRUE))
##summarize injuries by type of events
HighInjuries=arrange(Injuries,desc(SumInjuries))
HighInjuries$Percentage <- round(HighInjuries$SumInjuries /
sum(HighInjuries$SumInjuries) * 100,2)
HighInjuries = head(HighInjuries,10)
HighInjuries
## EVTYPE SumInjuries Percentage
## 1 TORNADO 699 37.68
## 2 THUNDERSTORM WIND 323 17.41
## 3 FLASH FLOOD 188 10.13
## 4 LIGHTNING 182 9.81
## 5 FLOOD 127 6.85
## 6 EXCESSIVE HEAT 63 3.40
## 7 RIP CURRENT 45 2.43
## 8 HAIL 42 2.26
## 9 HEAT 39 2.10
## 10 STRONG WIND 28 1.51
Show Top 10 storms and weather events with most cases of injuries in bar plot
d=HighInjuries$EVTYPE
e=HighInjuries$SumInjuries
f = matrix (e, nrow=1, byrow=TRUE)
colnames(f)=d
y2=round(max(f),-2) ## to be set as y-axis limit
par(mar = c(7, 4, 2, 2) + 0.2)
end_point = 0.5 + ncol(f) + ncol(f)-1
barplot(e,main = "Storms and Weather Events with High Injuries in 2010",
xlab = "Type of Event", ylab="Injuries",col = c("orange"),
ylim = c(0,y2), cex.axis=0.7,cex.names=0.7,width=1.65)
text(seq(1.5,end_point,by=2), par("usr")[3]-0.25,
srt = 60, adj= 1, xpd = TRUE,
labels = paste(colnames(f)), cex=0.5)##60 deg x-axis label
TORNADO showed the highest injuries with 699 cases or 37.68% of total injuries in 2010.
This was followed by THUNDERSTORM WIND with 323 cases or 17.41% of total Injuries.
FatalitiesInjuries <- merge(Fatalities,Injuries,by="EVTYPE")
FatalitiesInjuries$Combined <- (FatalitiesInjuries$SumFatalities +
FatalitiesInjuries$SumInjuries)
HighFatalitiesInjuries <- arrange(FatalitiesInjuries,desc(Combined))
HighFatalitiesInjuries$Percentage <- round(HighFatalitiesInjuries$Combined /
sum(HighFatalitiesInjuries$Combined) * 100,2)
HighFatalitiesInjuries=head(HighFatalitiesInjuries,10)
HighFatalitiesInjuries
## EVTYPE SumFatalities SumInjuries Combined Percentage
## 1 TORNADO 45 699 744 32.63
## 2 THUNDERSTORM WIND 15 323 338 14.82
## 3 FLASH FLOOD 67 188 255 11.18
## 4 LIGHTNING 29 182 211 9.25
## 5 FLOOD 41 127 168 7.37
## 6 EXCESSIVE HEAT 39 63 102 4.47
## 7 RIP CURRENT 49 45 94 4.12
## 8 HEAT 44 39 83 3.64
## 9 AVALANCHE 22 20 42 1.84
## 10 HAIL 0 42 42 1.84
TORNADO showed the highest combined fatalities and injuries with 744 cases or 32.63% of total combined fatalities and injuries in 2010.
According to the National Weather Service Storm Data Documentation, the estimated property damage amount in dollar should be entered as actual dollar amount and estimates should be rounded off to three significant digits (PROPDMG) followed by an alphabetical character signifying the magnitude of the number “K” for thousands, “M” for Millions, “B” for billions (PROPDMGEXP).
Now, check if all data entries in PROPDMG are in three significant digits as stated in the documentation.
g=filter(Storm2010, PROPDMG > 999.99)
g$PROPDMG
## [1] 5000 5000 1000 3200 1000 1000 1000 1000
g$PROPDMGEXP
## [1] K K K K K K K K
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
nrow(g) / nrow(Storm2010) ## % of invalid entries
## [1] 0.0001661095
Remove invalid PROPDMG entries since total frequency is insignificant. Then, estimate the amount of property damage
PropertyDamage = filter(Storm2010, PROPDMG < 1000)
h = data.frame(table(PropertyDamage$PROPDMGEXP))
sum(h$Freq) ## check if tallies with the nrows of PropertyDamage
## [1] 48153
nrow(PropertyDamage)
## [1] 48153
h$PROPDMGEXP=c(0,0,0,0,0,0,0,0,0,0,0,0,1000000000,0,0,1000,1000000,1000000)
colnames(h)=c("PROPDMGEXP","f","PropDmgValue")
i <- (merge(PropertyDamage, h, by = 'PROPDMGEXP')) ##to add column to PropertyDamage that indicates the corresponding values of PROPDMGEXP
summary(PropertyDamage$PROPDMGEXP)
## - ? + 0 1 2 3 4 5 6 7 8
## 0 0 0 0 0 0 0 0 0 0 0 0
## B h H K m M
## 2 0 0 47735 0 416
data.frame(table(i$PropDmgValue)) ##check if results tally
## Var1 Freq
## 1 1000 47735
## 2 1e+06 416
## 3 1e+09 2
nrow(PropertyDamage)
## [1] 48153
nrow(i)
## [1] 48153
i$PropDmg = i$PROPDMG*i$PropDmgValue/1000000 ## to show amount in million dollars
summary(i$PropDmg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1916 0.0020 1800.0000
Process CROPDMG and CROPDMGEXP to estimate amount of Crop Damage in the same manner in which PROPDMG and PROPDMGEXP were processed
j=filter(Storm2010, CROPDMG > 999.99)
j$CROPDMG
## numeric(0)
CropDamage=Storm2010 ##since there is no invalid CROPDMG entry amount
k = data.frame(table(CropDamage$CROPDMGEXP))
sum(k$Freq) ## check if tallies with the nrows of CropDamage
## [1] 48161
nrow(CropDamage)
## [1] 48161
k$CROPDMGEXP=c(0,0,0,1000000000,1000,1000,1000000,1000000)
colnames(k)=c("CROPDMGEXP","f","CropDmgValue")
l <- (merge(CropDamage, k, by = 'CROPDMGEXP'))
summary(CropDamage$CROPDMGEXP)
## ? 0 2 B k K m M
## 0 0 0 0 0 48021 0 140
data.frame(table(l$CropDmgValue)) ##check of result tallies
## Var1 Freq
## 1 1000 48021
## 2 1e+06 140
nrow(CropDamage)
## [1] 48161
nrow(l)
## [1] 48161
l$CropDmg = l$CROPDMG*l$CropDmgValue/1000000 ## to show amount in million dollars
summary(l$CropDmg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03707 0.00000 286.00000
library(plyr)
PropertyDamage=ddply(i,~EVTYPE,summarise,
SumPropDmg=sum(PropDmg,na.rm=TRUE))
##summarize Property Damage by type of events
HighPropertyDamage=arrange(PropertyDamage,desc(SumPropDmg))
HighPropertyDamage$Percentage <- round(HighPropertyDamage$SumPropDmg /
sum(HighPropertyDamage$SumPropDmg) * 100,2)
HighPropertyDamage = head(HighPropertyDamage,10)
HighPropertyDamage
## EVTYPE SumPropDmg Percentage
## 1 HAIL 3404.10995 36.89
## 2 FLOOD 3094.02854 33.53
## 3 TORNADO 1135.92045 12.31
## 4 FLASH FLOOD 826.04090 8.95
## 5 WILDFIRE 244.88867 2.65
## 6 THUNDERSTORM WIND 213.20100 2.31
## 7 LIGHTNING 70.14870 0.76
## 8 WINTER STORM 61.94800 0.67
## 9 HIGH SURF 49.11150 0.53
## 10 HIGH WIND 27.65992 0.30
HAIL showed the highest amount of Property Damage of $3404.10995M or 36.89% of total Property Damage in 2010.
CropDamage=ddply(l,~EVTYPE,summarise,
SumCropDmg=sum(CropDmg,na.rm=TRUE))
##summarize Crop Damage by type of events
HighCropDamage=arrange(CropDamage,desc(SumCropDmg))
HighCropDamage$Percentage <- round(HighCropDamage$SumCropDmg /
sum(HighCropDamage$SumCropDmg) * 100,2)
HighCropDamage = head(HighCropDamage,10)
HighCropDamage
## EVTYPE SumCropDmg Percentage
## 1 FLOOD 1104.709 61.88
## 2 FROST/FREEZE 354.240 19.84
## 3 FLASH FLOOD 152.680 8.55
## 4 HAIL 101.980 5.71
## 5 TORNADO 27.645 1.55
## 6 WINTER WEATHER 15.000 0.84
## 7 THUNDERSTORM WIND 10.812 0.61
## 8 DROUGHT 5.807 0.33
## 9 TROPICAL STORM 5.010 0.28
## 10 HEAVY RAIN 4.101 0.23
FLOOD showed the highest amount of Crop Damage of $1104.709M or 61.88% of total Crop Damage in 2010.
PropertyCropDamage <- merge(PropertyDamage,CropDamage,by="EVTYPE")
PropertyCropDamage$Combined <- (PropertyCropDamage$SumPropDmg +
PropertyCropDamage$SumCropDmg)
HighPropertyCropDamage <- arrange(PropertyCropDamage,desc(Combined))
HighPropertyCropDamage$Percentage <- round(HighPropertyCropDamage$Combined /
sum(HighPropertyCropDamage$Combined) * 100,2)
HighPropertyCropDamage=head(HighPropertyCropDamage,10)
HighPropertyCropDamage
## EVTYPE SumPropDmg SumCropDmg Combined Percentage
## 1 FLOOD 3094.0285 1104.709 4198.7375 38.12
## 2 HAIL 3404.1100 101.980 3506.0900 31.83
## 3 TORNADO 1135.9205 27.645 1163.5655 10.56
## 4 FLASH FLOOD 826.0409 152.680 978.7209 8.89
## 5 FROST/FREEZE 3.1750 354.240 357.4150 3.25
## 6 WILDFIRE 244.8887 2.050 246.9387 2.24
## 7 THUNDERSTORM WIND 213.2010 10.812 224.0130 2.03
## 8 LIGHTNING 70.1487 0.452 70.6007 0.64
## 9 WINTER STORM 61.9480 0.000 61.9480 0.56
## 10 HIGH SURF 49.1115 0.000 49.1115 0.45
FLOOD showed the highest amount of combined Property Damage and Crop Damage of $4198.73754M or 38.12% of total combined Property Damage and Crop Damage in 2010.
Show first 10 storms and weather events with highest amount (in million dollars) of combined Property Damage and Crop Damage in stacked bar plot
m=HighPropertyCropDamage$EVTYPE
PropertyDmg=HighPropertyCropDamage$SumPropDmg
CropDmg=HighPropertyCropDamage$SumCropDmg
n=rbind(PropertyDmg,CropDmg)
colnames(n)=m
y3=max(HighPropertyCropDamage$Combined)+1000 ## to be set as y-axis limit
par(mar = c(7, 6, 2, 2) + 0.2)
barplot(n, main = "Storms and Whether Events with Highest Economic Impact in 2010",
xlab = "Type of Event", ylab="Combined Property Damage and Crops Damage
in MM", col = c("red","orange"), ylim=c(0,y3),cex.axis=0.5,
cex.names=0.6, cex.main = 0.8, width=1.70, xaxt='n', ann=FALSE)
text(seq(1.5,end_point,by=2), par("usr")[3]-0.25, srt = 60, adj= 1,
xpd = TRUE, labels = paste(colnames(n)), cex=0.5)
legend("topright", c("Property Damage","Crop Damage"), fill = c("red","orange"),
cex = 0.6)