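The aim of this report is to perform exploratory data analyses of the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm data to address two questions: which types of weather events are most harmful to population health across the United States, and which types have the greatest economic consequences?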
From the exploratory data analyses of the NOAA storm data, I found that, across the United States, tornadoes are the weather event most harmful to population health: they cause the highest numbers of both fatalities and injuries. Floods and droughts have the greatest economic consequences: floods cause the highest property damage, about 145 billion dollars in total, and droughts cause the highest crop damage, about 14 billion dollars in total.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded from the course web site; the URL is the one passed to download.file() in the code below.
Some documentation of the database is also available, describing how some of the variables are constructed and defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
# Download the compressed data file only if it is not already present
if (!file.exists("./repdata-data-StormData.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "./repdata-data-StormData.csv.bz2", method = "curl")
}
# Decompress it only if the CSV has not been extracted yet
if (!file.exists("./repdata-data-StormData.csv")) {
  library(R.utils)  # provides bunzip2()
  bunzip2("repdata-data-StormData.csv.bz2", "repdata-data-StormData.csv", remove = FALSE)
}
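As a side note on the bzip2 format mentioned above: base R's read.csv() can usually read a bzip2-compressed file directly, because file connections opened for reading detect the compression. The following is a minimal sketch on a small temporary file. The toy data frame and the names toy, tmp and roundTrip are illustrative assumptions, not part of the original analysis, and the sketch does not replace the explicit bunzip2() step used in this report.

```r
# Illustrative toy data, not taken from the NOAA file
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

# Write the toy data to a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path in read mode, where the compression is detected
roundTrip <- read.csv(tmp, stringsAsFactors = FALSE)
print(roundTrip)
```

Reading the compressed file directly would avoid keeping an uncompressed copy on disk, at the cost of decompressing on every read.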
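# Load storm data into an R data frame. The str() output below shows factor
# columns, so strings are read as factors explicitly (the default changed to
# FALSE in R 4.0.0).
stormData <- read.csv("repdata-data-StormData.csv", stringsAsFactors = TRUE)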
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(stormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
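The earlier remark that the first years of the database contain fewer events can be checked by counting records per year. The following is a minimal sketch that uses the six BGN_DATE values from the head() output above as a stand-in for stormData$BGN_DATE. The names bgnDate, eventYear and eventsPerYear are illustrative assumptions, and the counts shown by this toy input say nothing about the full dataset.

```r
# Stand-in for stormData$BGN_DATE: the six values shown in the head() output
bgnDate <- c("4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:00",
             "6/8/1951 0:00:00", "11/15/1951 0:00:00", "11/15/1951 0:00:00")

# Parse the month/day/year part (the trailing time is ignored) and keep the year
eventYear <- as.integer(format(as.Date(bgnDate, format = "%m/%d/%Y"), "%Y"))

# Count the number of records in each year
eventsPerYear <- table(eventYear)
print(eventsPerYear)
```

On the real data, replacing bgnDate with as.character(stormData$BGN_DATE) should give one count per year from 1950 to 2011.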
fatalitiesData <- stormData[stormData$FATALITIES > 0, c("EVTYPE", "FATALITIES")]
str(fatalitiesData)
## 'data.frame': 6974 obs. of 2 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 1 1 4 1 6 7 2 5 25 2 ...
injuriesData <- stormData[stormData$INJURIES > 0, c("EVTYPE", "INJURIES")]
str(injuriesData)
## 'data.frame': 17604 obs. of 2 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ INJURIES: num 15 2 2 2 6 1 14 3 3 26 ...
propertyDamageData <- stormData[stormData$PROPDMG > 0, c("EVTYPE", "PROPDMG", "PROPDMGEXP")]
str(propertyDamageData)
## 'data.frame': 239174 obs. of 3 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
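PROPDMGEXP holds a magnitude code for each PROPDMG value. Convert each code to a numeric multiplier and store it in a new variable called PRODMGCOST in the propertyDamageData dataset: blank, -, ? and + map to 0; the digits 0 through 8 map to the corresponding power of ten; h and H map to 100; K maps to 1,000; m and M map to 1,000,000; and B maps to 1,000,000,000.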
summary(propertyDamageData$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5
##     76      1      0      5    209      0      1      1      4     18
##      6      7      8      B      h      H      K      m      M
##      3      2      0     40      1      6 227481      7  11319
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == ""] <- 0
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "-"] <- 0
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "?"] <- 0
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "+"] <- 0
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "0"] <- 1
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "1"] <- 10
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "2"] <- 100
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "3"] <- 1000
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "4"] <- 10000
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "5"] <- 1e+05
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "6"] <- 1e+06
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "7"] <- 1e+07
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "8"] <- 1e+08
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "B"] <- 1e+09
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "h"] <- 100
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "H"] <- 100
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "K"] <- 1000
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "m"] <- 1e+06
propertyDamageData$PRODMGCOST[propertyDamageData$PROPDMGEXP == "M"] <- 1e+06
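The same code-to-multiplier mapping can be written more compactly as a lookup. The following is a minimal sketch, not the method used in this report: the helper name expToMultiplier and the vectors codes and multipliers are hypothetical, and the mapping simply mirrors the assignments above (plus the lowercase k that appears in CROPDMGEXP). It uses match() on character values because indexing a named vector with a factor would use the factor's integer codes, and an empty-string name cannot be looked up by name.

```r
# Hypothetical helper: map exponent codes to numeric multipliers
expToMultiplier <- function(exp) {
  codes <- c("", "-", "?", "+", as.character(0:8),
             "B", "h", "H", "k", "K", "m", "M")
  multipliers <- c(0, 0, 0, 0, 10^(0:8),
                   1e9, 100, 100, 1e3, 1e3, 1e6, 1e6)
  # as.character() matters when exp is a factor; unknown codes give NA
  multipliers[match(as.character(exp), codes)]
}

# Toy example values (not rows from the dataset)
exampleCodes <- factor(c("K", "M", "", "5", "B"))
exampleAmounts <- c(25, 2.5, 10, 1, 2)
exampleCost <- exampleAmounts * expToMultiplier(exampleCodes)
print(exampleCost)
```

A code outside the listed set yields NA rather than a silent zero, which makes unexpected codes easier to spot.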
Calculate the cost of property damage as PROPDMG multiplied by PRODMGCOST, and add a new variable called PROPDMGEXPCOST to the propertyDamageData dataset containing the calculated property damage cost.
propertyDamageData$PROPDMGEXPCOST <- propertyDamageData$PROPDMG * propertyDamageData$PRODMGCOST
str(propertyDamageData)
## 'data.frame': 239174 obs. of 5 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP : Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ PRODMGCOST : num 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
## $ PROPDMGEXPCOST: num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
cropDamageData <- stormData[stormData$CROPDMG > 0, c("EVTYPE", "CROPDMG","CROPDMGEXP")]
str(cropDamageData)
## 'data.frame': 22099 obs. of 3 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 409 786 405 408 408 786 786 834 834 812 ...
## $ CROPDMG : num 10 500 1 4 10 50 50 5 50 15 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 9 7 9 9 8 7 7 7 7 7 ...
Convert the codes that occur in CROPDMGEXP (blank, ?, 0, 2, B, k, K, m and M) to numeric multipliers in the same way, and add a new variable called CROPDMGCOST to the cropDamageData dataset containing these values.
summary(cropDamageData$CROPDMGEXP)
##            ?     0     2     B     k     K     m     M
##      3     0    12     0     7    21 20137     1  1918
cropDamageData$CROPDMGCOST[cropDamageData$CROPDMGEXP == ""] <- 0
cropDamageData$CROPDMGCOST[cropDamageData$CROPDMGEXP == "?"] <- 0
cropDamageData$CROPDMGCOST[cropDamageData$CROPDMGEXP == "0"] <- 1
cropDamageData$CROPDMGCOST[cropDamageData$CROPDMGEXP == "2"] <- 100
cropDamageData$CROPDMGCOST[cropDamageData$CROPDMGEXP == "B"] <- 1e+09
cropDamageData$CROPDMGCOST[cropDamageData$CROPDMGEXP == "k"] <- 1000
cropDamageData$CROPDMGCOST[cropDamageData$CROPDMGEXP == "K"] <- 1000
cropDamageData$CROPDMGCOST[cropDamageData$CROPDMGEXP == "m"] <- 1e+06
cropDamageData$CROPDMGCOST[cropDamageData$CROPDMGEXP == "M"] <- 1e+06
Calculate the cost of crop damage as CROPDMG multiplied by CROPDMGCOST, and add a new variable called CROPDMGEXPCOST to the cropDamageData dataset containing the calculated crop damage cost.
cropDamageData$CROPDMGEXPCOST <- cropDamageData$CROPDMG * cropDamageData$CROPDMGCOST
str(cropDamageData)
## 'data.frame': 22099 obs. of 5 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 409 786 405 408 408 786 786 834 834 812 ...
## $ CROPDMG : num 10 500 1 4 10 50 50 5 50 15 ...
## $ CROPDMGEXP : Factor w/ 9 levels "","?","0","2",..: 9 7 9 9 8 7 7 7 7 7 ...
## $ CROPDMGCOST : num 1e+06 1e+03 1e+06 1e+06 1e+06 1e+03 1e+03 1e+03 1e+03 1e+03 ...
## $ CROPDMGEXPCOST: num 1.0e+07 5.0e+05 1.0e+06 4.0e+06 1.0e+07 5.0e+04 5.0e+04 5.0e+03 5.0e+04 1.5e+04 ...
totalFatalities <- aggregate(FATALITIES ~ EVTYPE, data = fatalitiesData, FUN = sum)
totalInjuries <- aggregate(INJURIES ~ EVTYPE, data = injuriesData, FUN = sum)
str(totalFatalities)
## 'data.frame': 168 obs. of 2 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 18 19 29 30 42 44 54 56 57 60 ...
## $ FATALITIES: num 1 224 1 101 1 1 3 2 1 3 ...
head(totalFatalities)
## EVTYPE FATALITIES
## 1 AVALANCE 1
## 2 AVALANCHE 224
## 3 BLACK ICE 1
## 4 BLIZZARD 101
## 5 blowing snow 1
## 6 BLOWING SNOW 1
str(totalInjuries)
## 'data.frame': 158 obs. of 2 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 19 29 30 42 44 49 54 58 59 60 ...
## $ INJURIES: num 170 24 805 1 13 2 2 5 1 1 ...
head(totalInjuries)
## EVTYPE INJURIES
## 1 AVALANCHE 170
## 2 BLACK ICE 24
## 3 BLIZZARD 805
## 4 blowing snow 1
## 5 BLOWING SNOW 13
## 6 BRUSH FIRE 2
totalFatalitiesTopTen <- totalFatalities[order(totalFatalities$FATALITIES, decreasing = TRUE), ][1:10, ]
totalInjuriesTopTen <- totalInjuries[order(totalInjuries$INJURIES, decreasing = TRUE), ][1:10, ]
totalFatalitiesTopTen
## EVTYPE FATALITIES
## 141 TORNADO 5633
## 26 EXCESSIVE HEAT 1903
## 35 FLASH FLOOD 978
## 57 HEAT 937
## 97 LIGHTNING 816
## 145 TSTM WIND 504
## 40 FLOOD 470
## 116 RIP CURRENT 368
## 75 HIGH WIND 248
## 2 AVALANCHE 224
totalInjuriesTopTen
## EVTYPE INJURIES
## 129 TORNADO 91346
## 135 TSTM WIND 6957
## 30 FLOOD 6789
## 20 EXCESSIVE HEAT 6525
## 85 LIGHTNING 5230
## 47 HEAT 2100
## 79 ICE STORM 1975
## 28 FLASH FLOOD 1777
## 121 THUNDERSTORM WIND 1488
## 45 HAIL 1361
par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.8)
barplot(totalFatalitiesTopTen$FATALITIES,
main = "Event Types With Top 10 Highest Fatalities",
ylab = "Total Fatalities",
las = 3, names.arg = totalFatalitiesTopTen$EVTYPE)
barplot(totalInjuriesTopTen$INJURIES,
main = "Event Types With Top 10 Highest Injuries",
ylab = "Total Injuries",
las = 3, names.arg = totalInjuriesTopTen$EVTYPE)
Based on the above barplots, tornadoes cause the highest numbers of both fatalities (5,633) and injuries (91,346).
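The aggregated tables above contain labels that differ only in case, such as blowing snow and BLOWING SNOW, and the EVTYPE factor includes levels with leading spaces. The following is a minimal sketch of normalizing labels before aggregating, assuming that trimming whitespace and upper-casing is an acceptable cleanup. It is not part of the original analysis, the data frame toy holds illustrative values only, and it does not merge different spellings such as TSTM WIND and THUNDERSTORM WIND.

```r
# Illustrative toy data; the values are not totals from the dataset
toy <- data.frame(
  EVTYPE = c("blowing snow", "BLOWING SNOW", " TSTM WIND", "TSTM WIND"),
  FATALITIES = c(1, 1, 2, 3)
)

# Normalize labels: drop surrounding whitespace and use upper case
toy$EVTYPE <- toupper(trimws(as.character(toy$EVTYPE)))

# Aggregate and sort as in the analysis above
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(totals$FATALITIES, decreasing = TRUE), ]
print(totals)
```

With the labels normalized, the four toy rows collapse into two event types instead of four.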
totalPropertyDamageCost <- aggregate(PROPDMGEXPCOST ~ EVTYPE, data = propertyDamageData, FUN = sum)
totalCropDamageCost <- aggregate(CROPDMGEXPCOST ~ EVTYPE, data = cropDamageData, FUN = sum)
str(totalPropertyDamageCost)
## 'data.frame': 406 obs. of 2 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 1 3 5 6 9 15 16 17 19 21 ...
## $ PROPDMGEXPCOST: num 200000 50000 8100000 8000 5000 ...
head(totalPropertyDamageCost)
## EVTYPE PROPDMGEXPCOST
## 1 HIGH SURF ADVISORY 200000
## 2 FLASH FLOOD 50000
## 3 TSTM WIND 8100000
## 4 TSTM WIND (G45) 8000
## 5 ? 5000
## 6 APACHE COUNTY 5000
str(totalCropDamageCost)
## 'data.frame': 136 obs. of 2 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 14 30 57 69 73 79 81 86 87 95 ...
## $ CROPDMGEXPCOST: num 2.88e+07 1.12e+08 5.60e+04 5.00e+01 6.60e+07 ...
head(totalCropDamageCost)
## EVTYPE CROPDMGEXPCOST
## 1 AGRICULTURAL FREEZE 28820000
## 2 BLIZZARD 112060000
## 3 COASTAL FLOODING 56000
## 4 COLD AIR TORNADO 50
## 5 COLD AND WET CONDITIONS 66000000
## 6 COLD/WIND CHILL 600000
totalPropertyDamageCostTopTen <- totalPropertyDamageCost[order(totalPropertyDamageCost$PROPDMGEXPCOST, decreasing = TRUE), ][1:10, ]
totalCropDamageCostTopTen <- totalCropDamageCost[order(totalCropDamageCost$CROPDMGEXPCOST, decreasing = TRUE), ][1:10, ]
totalPropertyDamageCostTopTen
## EVTYPE PROPDMGEXPCOST
## 64 FLOOD 1.447e+11
## 182 HURRICANE/TYPHOON 6.931e+10
## 334 TORNADO 5.695e+10
## 282 STORM SURGE 4.332e+10
## 51 FLASH FLOOD 1.682e+10
## 106 HAIL 1.574e+10
## 174 HURRICANE 1.187e+10
## 342 TROPICAL STORM 7.704e+09
## 399 WINTER STORM 6.688e+09
## 159 HIGH WIND 5.270e+09
totalCropDamageCostTopTen
## EVTYPE CROPDMGEXPCOST
## 10 DROUGHT 1.397e+10
## 27 FLOOD 5.662e+09
## 78 RIVER FLOOD 5.029e+09
## 72 ICE STORM 5.022e+09
## 42 HAIL 3.026e+09
## 64 HURRICANE 2.742e+09
## 69 HURRICANE/TYPHOON 2.608e+09
## 23 FLASH FLOOD 1.421e+09
## 19 EXTREME COLD 1.293e+09
## 37 FROST/FREEZE 1.094e+09
par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.8)
barplot(totalPropertyDamageCostTopTen$PROPDMGEXPCOST / (10^9),
main = "Event Types With Top 10 \n Highest Property Damage Cost",
ylab = "Total Property Damage Cost ($billions)",
las = 3, names.arg = totalPropertyDamageCostTopTen$EVTYPE)
barplot(totalCropDamageCostTopTen$CROPDMGEXPCOST / (10^9),
main = "Event Types With Top 10 \n Highest Crop Damage Cost",
ylab = "Total Crop Damage Cost ($billions)",
las = 3, names.arg = totalCropDamageCostTopTen$EVTYPE)
Based on the above barplots, floods cause the highest property damage (about 145 billion dollars), while droughts cause the highest crop damage (about 14 billion dollars).
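Property and crop damage are reported separately above; one possible way to view them together is to join the two totals by event type and add them. The following is a minimal sketch, not part of the original analysis. It uses only three event types that appear in both top-ten tables, with the rounded figures printed there, and the names propertyTop, cropTop, combined and TOTALCOST are illustrative assumptions.

```r
# Rounded totals copied from the two top-ten tables above
propertyTop <- data.frame(
  EVTYPE = c("FLOOD", "HURRICANE/TYPHOON", "HAIL"),
  PROPDMGEXPCOST = c(1.447e11, 6.931e10, 1.574e10)
)
cropTop <- data.frame(
  EVTYPE = c("FLOOD", "HAIL", "HURRICANE/TYPHOON"),
  CROPDMGEXPCOST = c(5.662e9, 3.026e9, 2.608e9)
)

# Join on event type and add the two cost columns
combined <- merge(propertyTop, cropTop, by = "EVTYPE")
combined$TOTALCOST <- combined$PROPDMGEXPCOST + combined$CROPDMGEXPCOST
combined <- combined[order(combined$TOTALCOST, decreasing = TRUE), ]
print(combined)
```

On the full aggregated tables, merge() with all = TRUE would keep event types that appear in only one table, with the resulting NA costs needing to be treated as zero before adding.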
From the above exploratory data analyses, across the United States, tornadoes are the weather event most harmful to population health: they cause the highest numbers of both fatalities and injuries. Floods and droughts have the greatest economic consequences: floods cause the highest property damage, about 145 billion dollars in total, and droughts cause the highest crop damage, about 14 billion dollars in total.