Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
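This data analysis addresses two questions: which types of events have the greatest economic consequences, and which are most harmful to population health?
Download the data from the Coursera Data Science course site: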
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "storms.csv.bz2", method="curl")
storms <- read.csv("storms.csv.bz2", header=TRUE)
str(storms)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
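The download step above fetches the archive every time the analysis is run. A minimal sketch of a guard that skips the download when the file is already present is shown here; the helper name `fetch_once` and its `downloader` argument are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: download a file only if it is not already on disk.
# `downloader` defaults to utils::download.file and can be swapped out for testing.
fetch_once <- function(url, destfile, downloader = utils::download.file) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the compressed archive byte-for-byte intact
    downloader(url, destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage (not run here, since it needs network access):
# fetch_once("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#            "storms.csv.bz2")
```

Re-running the report then reuses the local copy, which also keeps the input stable between runs.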
Remove data from before 1996; the NCDC documentation notes that this older data is of lower quality. The filter is applied to END_DATE, so records without a recorded end date are dropped as well:
storms$END_DATE = as.Date(storms$END_DATE, "%m/%d/%Y")
storms = subset(storms, END_DATE >= "1996-01-01")
dim(storms)
## [1] 653529 37
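The effect of the date filter can be illustrated on a small made-up data frame. This is a sketch with hypothetical rows (`toy` and `kept` are illustrative names): an empty END_DATE parses to `NA`, and `subset()` drops rows whose condition evaluates to `NA`.

```r
# Hypothetical toy data mimicking the END_DATE format of the storm database
toy <- data.frame(
  EVTYPE   = c("TORNADO", "HAIL", "FLOOD"),
  END_DATE = c("", "6/30/1995 0:00:00", "1/15/2001 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse the date part; the trailing time is ignored and "" becomes NA
toy$END_DATE <- as.Date(toy$END_DATE, format = "%m/%d/%Y")

# Keep events that ended in 1996 or later; NA comparisons are dropped by subset()
kept <- subset(toy, END_DATE >= as.Date("1996-01-01"))
print(kept)
```

Only the row with the 2001 end date survives; both the 1995 row and the row with no end date are removed.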
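The following variables are important for this analysis: EVTYPE (event type), FATALITIES, INJURIES, PROPDMG and PROPDMGEXP (property damage and its multiplier), and CROPDMG and CROPDMGEXP (crop damage and its multiplier).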
Put all event types to lowercase:
storms$EVTYPE = tolower(storms$EVTYPE)
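Lowercasing alone leaves some labels distinct that probably describe the same kind of event: the raw levels include entries with leading spaces, and "tstm wind" and "thunderstorm wind" appear as separate rows in the tables of this report. A possible extra clean-up step is sketched below; it is not applied in this analysis, the helper name `normalize_evtype` is hypothetical, and treating "tstm" as an abbreviation of "thunderstorm" is an assumption.

```r
# Hypothetical helper: normalise event type labels beyond lowercasing
normalize_evtype <- function(x) {
  x <- tolower(as.character(x))
  x <- trimws(x)                                          # drop leading/trailing spaces
  x <- gsub("\\s+", " ", x, perl = TRUE)                  # collapse repeated whitespace
  gsub("\\btstm\\b", "thunderstorm", x, perl = TRUE)      # assumed abbreviation
}

labels <- c(" HIGH SURF ADVISORY", "TSTM WIND", "Thunderstorm  Wind")
print(normalize_evtype(labels))
```

Merging labels this way would change the per-event totals, so it is a modelling choice rather than a neutral formatting step.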
Remove the observations where there is no damage reported, nor fatalities or injuries:
storms = subset(storms, storms$CROPDMG != 0 | storms$PROPDMG != 0 | storms$FATALITIES != 0 | storms$INJURIES != 0)
dim(storms)
## [1] 201318 37
The NWS documentation describes the *DMGEXP values as empty, “K” (thousands), “M” (millions), or “B” (billions), representing multipliers for the *DMG variables.
Check the *DMGEXP values:
table(storms$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
##   8448      0      0      0      0      0      0      0      0      0
##      6      7      8      B      h      H      K      m      M
##      0      0      0     32      0      0 185474      0   7364
table(storms$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 102767      0      0      0      2      0  96787      0   1762
Calculate cost in billion dollars using the *DMG and *DMGEXP variables for property damage and crop damage:
# Property Damage Cost in billion USD, using PROPDMG and PROPDMGEXP variables
# (the right-hand side is subset with the same condition so values stay aligned with their rows)
storms$PROPDMGCost = NA
storms$PROPDMGCost[storms$PROPDMGEXP == ""] = storms$PROPDMG[storms$PROPDMGEXP == ""] / 10^9
storms$PROPDMGCost[storms$PROPDMGEXP == "K"] = storms$PROPDMG[storms$PROPDMGEXP == "K"] / 10^6
storms$PROPDMGCost[storms$PROPDMGEXP == "M"] = storms$PROPDMG[storms$PROPDMGEXP == "M"] / 10^3
storms$PROPDMGCost[storms$PROPDMGEXP == "B"] = storms$PROPDMG[storms$PROPDMGEXP == "B"]
# Crop Damage Cost in billion USD, using CROPDMG and CROPDMGEXP variables
storms$CROPDMGCost = NA
storms$CROPDMGCost[storms$CROPDMGEXP == ""] = storms$CROPDMG[storms$CROPDMGEXP == ""] / 10^9
storms$CROPDMGCost[storms$CROPDMGEXP == "K"] = storms$CROPDMG[storms$CROPDMGEXP == "K"] / 10^6
storms$CROPDMGCost[storms$CROPDMGEXP == "M"] = storms$CROPDMG[storms$CROPDMGEXP == "M"] / 10^3
storms$CROPDMGCost[storms$CROPDMGEXP == "B"] = storms$CROPDMG[storms$CROPDMGEXP == "B"]
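The multiplier logic can also be written as a single vectorised lookup, which keeps every value aligned with its own row. The sketch below assumes only the four codes that remain after filtering (empty, "K", "M", "B"); the function name `to_billions` is illustrative.

```r
# Hypothetical helper: convert a damage value and its exponent code to billion USD
to_billions <- function(value, exp) {
  multiplier <- c(K = 1e3, M = 1e6, B = 1e9)   # dollars per unit of `value`
  exp <- as.character(exp)
  m <- unname(multiplier[exp])                 # NA for "" and any unknown code
  m[exp == ""] <- 1                            # empty code: value is in plain dollars
  value * m / 1e9
}

# Example with made-up values: 25K, 2.5M, 1B and 3 dollars
print(to_billions(c(25, 2.5, 1, 3), c("K", "M", "B", "")))
```

Unknown codes yield `NA` rather than a silently wrong number, which makes unexpected exponent values easy to spot with `anyNA()`.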
Summarise and calculate damage cost for both Property and Crop damage:
sum(storms$PROPDMGCost)
## [1] 2427.098
summary(storms$PROPDMGCost)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0121 0.0000 595.0000
sum(storms$CROPDMGCost)
## [1] 44.09622
summary(storms$CROPDMGCost)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0e+00 0.0e+00 0.0e+00 2.2e-04 0.0e+00 3.8e+01
# Use qcc for pareto charts
library(qcc)
## Package 'qcc', version 2.6
## Type 'citation("qcc")' for citing this R package in publications.
Reported Property Damages are about 50 times higher than the Crop Damages.
Top 10 event types with highest property damage cost (in billion dollars):
top10prop = head(sort(tapply(storms$PROPDMGCost, storms$EVTYPE, sum), decreasing=TRUE), n=10)
top10prop.df = data.frame(eventType = names(top10prop), propDamage = round(top10prop), row.names = NULL)
top10prop.df
## eventType propDamage
## 1 flood 674
## 2 storm surge/tide 595
## 3 flash flood 456
## 4 hurricane 258
## 5 hurricane/typhoon 142
## 6 high wind 92
## 7 tornado 69
## 8 hail 32
## 9 tstm wind 25
## 10 wildfire 18
pareto.chart(top10prop, main="US storm event types with greatest economic consequences (1996-2011)", ylab = "Property Damage (billion USD)")
##
## Pareto chart analysis for top10prop
## Frequency Cum.Freq. Percentage Cum.Percent.
## flood 673.91126 673.9113 28.5418426 28.54184
## storm surge/tide 595.20319 1269.1144 25.2083574 53.75020
## flash flood 456.22914 1725.3436 19.3224554 73.07266
## hurricane 257.66367 1983.0073 10.9127068 83.98536
## hurricane/typhoon 142.16847 2125.1757 6.0211935 90.00656
## high wind 91.73748 2216.9132 3.8853136 93.89187
## tornado 68.79707 2285.7103 2.9137294 96.80560
## hail 31.97069 2317.6810 1.3540396 98.15964
## tstm wind 25.45813 2343.1391 1.0782159 99.23785
## wildfire 17.99529 2361.1344 0.7621459 100.00000
Top 10 event types with highest crop damage cost (in billion dollars):
top10crop = head(sort(tapply(storms$CROPDMGCost, storms$EVTYPE, sum), decreasing=TRUE), n=10)
top10crop.df = data.frame(eventType = names(top10crop), cropDamage = round(top10crop, 2), row.names = NULL)
top10crop.df
## eventType cropDamage
## 1 hurricane/typhoon 38.01
## 2 hail 1.91
## 3 flood 1.16
## 4 tstm wind 0.84
## 5 drought 0.78
## 6 high wind 0.28
## 7 thunderstorm wind 0.28
## 8 flash flood 0.22
## 9 extreme cold 0.15
## 10 tropical storm 0.10
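Both damage rankings use the same idiom: `tapply()` to total a value per event type, `sort()` to order the totals, and `head()` to keep the largest. A small sketch that wraps the idiom in a reusable function follows; the name `top_n_by_event` and the toy inputs are hypothetical.

```r
# Hypothetical helper: total `values` per `event` and return the n largest totals
top_n_by_event <- function(values, event, n = 10) {
  totals <- tapply(values, event, sum)          # one total per event type
  head(sort(totals, decreasing = TRUE), n = n)  # largest totals first
}

# Toy example with made-up numbers
cost  <- c(1, 2, 3, 5)
event <- c("hail", "flood", "hail", "drought")
print(top_n_by_event(cost, event, n = 2))
```

With the storm data, a call such as `top_n_by_event(storms$CROPDMGCost, storms$EVTYPE)` would be equivalent to the expression used for `top10crop`.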
Injuries are considered here as the main measure of harm to population health; fatalities are presented second.
Top 10 event types with highest number of injuries (in thousands):
top10injuries = head(sort(tapply(storms$INJURIES/1000, storms$EVTYPE, sum), decreasing=TRUE), n=10)
top10injuries.df = data.frame(eventType = names(top10injuries), injuries = top10injuries, row.names = NULL)
top10injuries.df
## eventType injuries
## 1 tornado 20.667
## 2 flood 6.758
## 3 excessive heat 6.391
## 4 lightning 4.141
## 5 tstm wind 3.629
## 6 flash flood 1.674
## 7 thunderstorm wind 1.400
## 8 winter storm 1.292
## 9 hurricane/typhoon 1.275
## 10 heat 1.222
pareto.chart(top10injuries, main="US storm event types most harmful for population health (1996-2011)", ylab="injuries (thousands)")
##
## Pareto chart analysis for top10injuries
## Frequency Cum.Freq. Percentage Cum.Percent.
## tornado 20.667 20.667 42.657227 42.65723
## flood 6.758 27.425 13.948688 56.60592
## excessive heat 6.391 33.816 13.191191 69.79711
## lightning 4.141 37.957 8.547132 78.34424
## tstm wind 3.629 41.586 7.490351 85.83459
## flash flood 1.674 43.260 3.455180 89.28977
## thunderstorm wind 1.400 44.660 2.889637 92.17941
## winter storm 1.292 45.952 2.666722 94.84613
## hurricane/typhoon 1.275 47.227 2.631633 97.47776
## heat 1.222 48.449 2.522240 100.00000
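The percentage columns of the Pareto output can be reproduced by hand, which clarifies what the chart shows: each bar's share of the top-10 total and the running sum of those shares. The sketch below re-enters the injury totals (in thousands) from the table above; the variable names are illustrative.

```r
# Top-10 injury totals in thousands, copied from the table above
injuries <- c(
  "tornado" = 20.667, "flood" = 6.758, "excessive heat" = 6.391,
  "lightning" = 4.141, "tstm wind" = 3.629, "flash flood" = 1.674,
  "thunderstorm wind" = 1.400, "winter storm" = 1.292,
  "hurricane/typhoon" = 1.275, "heat" = 1.222
)

pct     <- 100 * injuries / sum(injuries)  # share of the top-10 total
cum_pct <- cumsum(pct)                     # cumulative share, ends at 100

print(round(cbind(Frequency = injuries, Percentage = pct, Cum.Percent = cum_pct), 2))
```

Note that the percentages are relative to the ten listed event types only, not to all injuries in the data set.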
Top 10 event types with highest number of fatalities:
top10fatalities = head(sort(tapply(storms$FATALITIES, storms$EVTYPE, sum), decreasing=TRUE), n=10)
top10fatalities.df = data.frame(eventType = names(top10fatalities), fatalities = top10fatalities, row.names = NULL)
top10fatalities.df
## eventType fatalities
## 1 excessive heat 1797
## 2 tornado 1511
## 3 flash flood 887
## 4 lightning 651
## 5 flood 414
## 6 rip current 340
## 7 tstm wind 241
## 8 heat 237
## 9 high wind 235
## 10 avalanche 223
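Because the injury and fatality rankings differ, it can help to view both measures side by side per event type. A minimal sketch using `aggregate()` on made-up data is shown below; the `toy` rows are hypothetical and do not come from the storm database.

```r
# Hypothetical toy data with the same column names as the storm data
toy <- data.frame(
  EVTYPE     = c("tornado", "tornado", "excessive heat", "flood"),
  FATALITIES = c(1, 0, 3, 2),
  INJURIES   = c(15, 2, 4, 0)
)

# Total both health measures per event type in one step
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank by injuries, the main measure used in this report
health <- health[order(health$INJURIES, decreasing = TRUE), ]
print(health)
```

Applied to `storms`, the same two calls would give a single table from which either ranking can be read.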