Weather events can harm population health and can also have negative economic consequences. I address the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
This report uses the Pareto principle to focus on the small set of weather event types that have the largest negative effect on population health (fatalities and injuries) and on the economy (property and crop damage).
Excessive heat and tornadoes cause the most harm to population health, followed by different types of flooding and by lightning. Floods and heavy storms are the most economically damaging.
The data available for this report come from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The compressed file is downloaded from the URL given in the code below.
Check whether the file is already in the local directory; if not, download it. Then load the data into R.
if (!file.exists("repdata_data_StormData.csv.bz2")) {
  fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(fileURL, destfile = "repdata_data_StormData.csv.bz2")
}
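# Read text columns as factors: the later levels() calls rely on this, and it is no longer the default since R 4.0
rawData <- read.csv("repdata_data_StormData.csv.bz2", stringsAsFactors = TRUE)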
library(dplyr)
We are interested in the types of events that are harmful to population health and to the economy. First we need to identify the event types and decide which variables measure population health and which measure economic damage. The data set contains the following columns.
names(rawData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We wish to keep the following variables: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. BGN_DATE is kept for now so that the year can be extracted from it.
eventData <- select( rawData, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
Far fewer events were recorded in the earlier years, so including them would skew the results. First we extract the year component of the begin date.
eventData$year <- as.numeric(format(as.Date(eventData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
nbins <- max(eventData$year) - min(eventData$year)
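The year extraction and the later cut-off can be tried on a few toy values first. This is only a sketch: the three date strings are made up and assume the month/day/year layout that the format string above expects.

```r
# Made-up BGN_DATE strings in the assumed "month/day/year hour:min:sec" layout
bgn <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00")

# Same conversion as in the report: parse to Date, then format the year as a number
years <- as.numeric(format(as.Date(bgn, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
print(years)

# Logical filter that keeps only the records from 1996 onwards
keep <- years >= 1996
print(bgn[keep])
```

If the parse returned NA for some rows, the format string would not match those rows and they would need a separate look before filtering.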
Plot the events per year to choose a cut-off point of data.
hist(eventData$year, breaks = nbins, xlab = "Year", main = "Plot 1: Histogram of number of events per year")
Based on this plot I will remove all data from before 1996. First remove the BGN_DATE variable, which is no longer needed.
eventData <- select(eventData, -BGN_DATE)
eventData <- eventData[eventData$year >= 1996, ]
First we will check how many types of weather events are documented.
numEvents <- length(levels(eventData$EVTYPE))
numEvents
## [1] 985
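There are 985 event type labels. This counts every factor level, including labels that only occur before 1996, because filtering rows does not drop unused levels. To get an idea of the labels, let’s look at a random sample of event names: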
set.seed(144)
sample(levels(eventData$EVTYPE), size = 20)
## [1] "BRUSH FIRE" "Summary of April 13"
## [3] "Summary of June 24" "Microburst"
## [5] "HEAVY SNOW FREEZING RAIN" "Sml Stream Fld"
## [7] "BLIZZARD/HEAVY SNOW" "Summary of June 18"
## [9] "COLD WAVE" "THUNDERSTORM WINDS LIGHTNING"
## [11] "RECORD COLD" "FLOOD/FLASH"
## [13] "PROLONG COLD/SNOW" "EXTREME HEAT"
## [15] "DRY MICROBURST 50" "Coastal Flood"
## [17] "RAIN AND WIND" "MONTHLY TEMPERATURE"
## [19] "THUNDERSTORM WINDS/FLASH FLOOD" "HIGH WIND/BLIZZARD/FREEZING RA"
The event names are inconsistent: they mix upper and lower case, abbreviations, combined events and summary rows. They should be cleaned, but due to time constraints I will not do so in this project. Cleaning all event names would take too long, and cleaning only some of them would bias the results in favour of the part that I clean.
Fatality and injury data are both numeric. Let’s look at the number of NA values and at the minimum and maximum values.
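For reference, a first cleaning pass could look like the sketch below. It is not applied in this report, and the sample labels and the choice of steps are my own assumptions. Case folding and whitespace trimming treat every label the same way, so they do not favour any particular event type.

```r
# A few labels in the style of the sample above (the second one is a made-up variant)
raw_names <- c("Coastal Flood", " COASTAL FLOOD", "Summary of June 24",
               "TSTM WIND", "Sml Stream Fld")

# Uniform steps: trim surrounding whitespace and fold everything to upper case
clean <- toupper(trimws(raw_names))

# Summary rows are not weather events, so flag and drop them
is_summary <- grepl("^SUMMARY", clean)
clean <- clean[!is_summary]

print(unique(clean))
```

Merging abbreviations such as "TSTM WIND" and "THUNDERSTORM WIND" would need a hand-built mapping, which is the time-consuming part.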
sum(is.na(eventData$FATALITIES))
## [1] 0
sum(is.na(eventData$INJURIES))
## [1] 0
summary(eventData$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01336 0.00000 158.00000
summary(eventData$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0887 0.0000 1150.0000
No problems are expected from these two columns.
Calculate the total number of recorded fatalities and injuries.
TotFat <- sum(eventData$FATALITIES)
TotInj <- sum(eventData$INJURIES)
The total number of fatalities is 8732 and the total number of injuries is 57975. We need these totals to express the harm caused by each event type as a share of the whole.
The property and crop damage figures are each split over two columns: the amount is in PROPDMG and CROPDMG, and a multiplier for the amount is in PROPDMGEXP and CROPDMGEXP. We want one column each for property and crop damage, containing the full numerical value of the damage.
First of all look at the levels of PROPDMGEXP and CROPDMGEXP.
levels(eventData$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(eventData$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
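The multiplier codes are converted to exponents of ten as follows:
- any digit is kept and used directly as the exponent
- h or H is 100
- k or K is 1,000
- m or M is 1,000,000
- b or B is 1,000,000,000
- all other levels get exponent 0, so the amount is left unchanged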
eventData$PROPDMGEXP <- gsub("[Hh]", "2", eventData$PROPDMGEXP)
eventData$PROPDMGEXP <- gsub("[Kk]", "3", eventData$PROPDMGEXP)
eventData$PROPDMGEXP <- gsub("[Mm]", "6", eventData$PROPDMGEXP)
eventData$PROPDMGEXP <- gsub("[Bb]", "9", eventData$PROPDMGEXP)
# "+", "-" and "?" carry no usable exponent, so map each of them to 0
eventData$PROPDMGEXP <- gsub("[-+?]", "0", eventData$PROPDMGEXP)
eventData$PROPDMGEXP <- as.numeric(eventData$PROPDMGEXP)
eventData$PROPDMGEXP[is.na(eventData$PROPDMGEXP)] <- 0
eventData$CROPDMGEXP <- gsub("[Hh]", "2", eventData$CROPDMGEXP)
eventData$CROPDMGEXP <- gsub("[Kk]", "3", eventData$CROPDMGEXP)
eventData$CROPDMGEXP <- gsub("[Mm]", "6", eventData$CROPDMGEXP)
eventData$CROPDMGEXP <- gsub("[Bb]", "9", eventData$CROPDMGEXP)
# "+", "-" and "?" carry no usable exponent, so map each of them to 0
eventData$CROPDMGEXP <- gsub("[-+?]", "0", eventData$CROPDMGEXP)
eventData$CROPDMGEXP <- as.numeric(eventData$CROPDMGEXP)
eventData$CROPDMGEXP[is.na(eventData$CROPDMGEXP)] <- 0
Now combine the numbers with the exponents.
eventData$PROPDMG <- eventData$PROPDMG * (10 ^ eventData$PROPDMGEXP)
eventData$CROPDMG <- eventData$CROPDMG * (10 ^ eventData$CROPDMGEXP)
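The same conversion can be written with a lookup table, which makes the mapping easier to check on a few values. The helper `to_exponent` and the toy amounts below are hypothetical and only illustrate the rule; they are not part of the analysis.

```r
# Letter codes and the exponent of ten they stand for
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

# Hypothetical helper: turn a vector of multiplier codes into numeric exponents
to_exponent <- function(code) {
  code <- toupper(as.character(code))
  out <- suppressWarnings(as.numeric(code))      # digits are used as they are
  is_letter <- code %in% names(exp_lookup)
  out[is_letter] <- unname(exp_lookup[code[is_letter]])
  out[is.na(out)] <- 0                           # "", "?", "+", "-" get exponent 0
  out
}

# Toy amounts with their codes
dmg  <- c(25, 2.5, 1, 10, 5)
code <- c("K", "M", "B", "", "?")
value <- dmg * 10^to_exponent(code)
print(value)
```

An amount with an empty or unknown code is kept at face value, which matches what the gsub chain above does.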
Discard the unnecessary variables.
eventData <- select(eventData, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
Calculate total economic consequences for property damage and crop damage.
TotPROPDMG <- sum(eventData$PROPDMG)
TotCROPDMG <- sum(eventData$CROPDMG)
Total property damage is 366767615380 and total crop damage is 34752728730.
In this section we will create subsets of the raw data that can be used for meaningful visualisations in the results section.
First we ‘bin’ the fatalities and injuries: for each event type we sum the number of fatalities and the number of injuries.
FatPerEve <- tapply(eventData$FATALITIES, eventData$EVTYPE, sum)
InjPerEve <- tapply(eventData$INJURIES, eventData$EVTYPE, sum)
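One detail of tapply is worth keeping in mind here. When the grouping factor has levels that no longer occur in the data, as can happen after filtering rows, the result contains NA for those levels. A toy sketch with made-up numbers:

```r
# Factor with three levels, of which "TORNADO" has no rows left
ev  <- factor(c("FLOOD", "FLOOD", "HEAT"), levels = c("FLOOD", "HEAT", "TORNADO"))
fat <- c(1, 2, 5)

# Sum per level: the unused level yields NA rather than 0
per_type <- tapply(fat, ev, sum)
print(per_type)

# Dropping unused levels first avoids the NA entries
per_type_used <- tapply(fat, droplevels(ev), sum)
print(per_type_used)
```

Rows with NA are removed by a later subset() call, because subset() drops rows where the condition evaluates to NA.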
Next we create a new data frame that has a row for each event type. It has three columns: the total fatalities for that event type, the total injuries for that event type, and the event type name. We immediately filter the rows and keep only the event types that caused at least one fatality and at least one injury. Note that this also drops event types that have fatalities but no injuries, or injuries but no fatalities. We store the result in PHDF, which stands for population health data frame.
PHDFAll <- data.frame(FatPerEve, InjPerEve, levels(eventData$EVTYPE))
PHDF <- subset(PHDFAll, FatPerEve > 0 & InjPerEve > 0)
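The choice between `&` and `|` in the filter matters. The toy data frame below, with made-up numbers, shows which rows each version keeps.

```r
# Four hypothetical event types with different fatality and injury totals
toy <- data.frame(type = c("A", "B", "C", "D"),
                  fat  = c(3, 2, 0, 0),
                  inj  = c(10, 0, 4, 0),
                  stringsAsFactors = FALSE)

# "&" keeps only types with both fatalities and injuries (as in the code above)
both <- subset(toy, fat > 0 & inj > 0)
print(both$type)

# "|" keeps every type with at least one fatality or one injury
either <- subset(toy, fat > 0 | inj > 0)
print(either$type)
```

With `&`, types B and C are lost even though they did cause harm. With `|`, only the all-zero type D is removed.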
Next we use arrange from the dplyr package to arrange the event types in order of most fatalities to least fatalities. Event types that have equal number of fatalities are ordered by number of injuries.
PHDFord <- arrange(PHDF, desc(FatPerEve), desc(InjPerEve))
Next we calculate each event type’s fatalities as a percentage of the total number of fatalities and store it in the FatPerc variable. We also calculate the cumulative sum of fatalities down the rows (FatCumSum variable), and the percentage of the total that this sum corresponds to (FatCumPerc variable). We do the same for injuries.
The reason for the CumSum variables is to see the percentage of total harm accounted for as we add the next-worst event type. This will make more sense when we print the data frame.
for (i in 1:dim(PHDFord)[1]) {
  PHDFord$FatPerc[i] <- round(100*PHDFord$FatPerEve[i]/TotFat, digits = 2)
  PHDFord$FatCumSum[i] <- sum(head(PHDFord$FatPerEve,i))
  PHDFord$FatCumPerc[i] <- round(100*PHDFord$FatCumSum[i]/TotFat)
  PHDFord$InjPerc[i] <- round(100*PHDFord$InjPerEve[i]/TotInj, digits = 2)
  PHDFord$InjCumSum[i] <- sum(head(PHDFord$InjPerEve,i))
  PHDFord$InjCumPerc[i] <- round(100*PHDFord$InjCumSum[i]/TotInj)
}
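The loop can also be written without explicit indexing, because cumsum() returns the running total for a whole vector at once. The sketch below uses the five largest fatality counts and the total of 8732 reported in this document. The variable names are my own.

```r
# Fatalities of the five deadliest event types, in descending order
fat_per_type <- c(1797, 1511, 887, 651, 414)
tot_fat <- 8732

# Share of each type, running total, and running share of the total
fat_perc     <- round(100 * fat_per_type / tot_fat, digits = 2)
fat_cum_sum  <- cumsum(fat_per_type)
fat_cum_perc <- round(100 * fat_cum_sum / tot_fat)

print(data.frame(fat_per_type, fat_perc, fat_cum_sum, fat_cum_perc))
```

Besides being shorter, this avoids recomputing the sum of the first i rows on every pass through the loop.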
Now we will print the top 20 event types.
head(PHDFord, n=20)
## FatPerEve InjPerEve levels.eventData.EVTYPE. FatPerc FatCumSum
## 1 1797 6391 EXCESSIVE HEAT 20.58 1797
## 2 1511 20667 TORNADO 17.30 3308
## 3 887 1674 FLASH FLOOD 10.16 4195
## 4 651 4141 LIGHTNING 7.46 4846
## 5 414 6758 FLOOD 4.74 5260
## 6 340 209 RIP CURRENT 3.89 5600
## 7 241 3629 TSTM WIND 2.76 5841
## 8 237 1222 HEAT 2.71 6078
## 9 235 1083 HIGH WIND 2.69 6313
## 10 223 156 AVALANCHE 2.55 6536
## 11 202 294 RIP CURRENTS 2.31 6738
## 12 191 1292 WINTER STORM 2.19 6929
## 13 130 1400 THUNDERSTORM WIND 1.49 7059
## 14 125 24 EXTREME COLD/WIND CHILL 1.43 7184
## 15 113 79 EXTREME COLD 1.29 7297
## 16 107 698 HEAVY SNOW 1.23 7404
## 17 103 278 STRONG WIND 1.18 7507
## 18 95 12 COLD/WIND CHILL 1.09 7602
## 19 94 230 HEAVY RAIN 1.08 7696
## 20 87 146 HIGH SURF 1.00 7783
## FatCumPerc InjPerc InjCumSum InjCumPerc
## 1 21 11.02 6391 11
## 2 38 35.65 27058 47
## 3 48 2.89 28732 50
## 4 55 7.14 32873 57
## 5 60 11.66 39631 68
## 6 64 0.36 39840 69
## 7 67 6.26 43469 75
## 8 70 2.11 44691 77
## 9 72 1.87 45774 79
## 10 75 0.27 45930 79
## 11 77 0.51 46224 80
## 12 79 2.23 47516 82
## 13 81 2.41 48916 84
## 14 82 0.04 48940 84
## 15 84 0.14 49019 85
## 16 85 1.20 49717 86
## 17 86 0.48 49995 86
## 18 87 0.02 50007 86
## 19 88 0.40 50237 87
## 20 89 0.25 50383 87
We can see that the top twenty fatality-causing event types account for 89 percent of fatalities (FatCumPerc) and 87 percent of injuries (InjCumPerc). These twenty event types make up only about 2% of all event types (remember that the total number of event types is almost a thousand). In the results section we will plot the results.
We will apply the same process as we did for population health, with a few small differences. When we arrange the results we will not order them hierarchically, as we did with fatalities first and injuries second. Instead we will arrange by the sum of property and crop damage. We will also look at each event type’s share of the combined damage.
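To make the Pareto reading concrete, we can look up how many event types are needed before the cumulative share of fatalities reaches 80 percent. The sketch below copies the FatCumPerc column from the table above. The variable names are my own.

```r
# Cumulative percentage of fatalities for the top 20 event types (from the table)
fat_cum_perc <- c(21, 38, 48, 55, 60, 64, 67, 70, 72, 75,
                  77, 79, 81, 82, 84, 85, 86, 87, 88, 89)

# Index of the first row at which the running share reaches 80 percent
first_80 <- which(fat_cum_perc >= 80)[1]
print(first_80)

# The same number expressed as a share of the 985 event type labels
share_of_types <- round(100 * first_80 / 985, digits = 1)
print(share_of_types)
```

So a little over 1 percent of the labels covers 80 percent of the fatalities, which is even more concentrated than the usual 80/20 rule of thumb.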
First, find the sum of both property and crop damage per event type.
CROPDMGPerEve <- tapply(eventData$CROPDMG, eventData$EVTYPE, sum)
PROPDMGPerEve <- tapply(eventData$PROPDMG, eventData$EVTYPE, sum)
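Next, create a data frame with one row per event type and the corresponding property and crop damage as columns. Then keep only the event types that have both property damage and crop damage above zero. Note that this also drops event types with damage in only one of the two categories.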
ECDFAll <- data.frame(PROPDMGPerEve, CROPDMGPerEve, levels(eventData$EVTYPE))
ECDF <- subset(ECDFAll, PROPDMGPerEve > 0 & CROPDMGPerEve > 0)
Here, we apply the only difference in process. We arrange the rows by the sum of property and crop damage together.
ECDFord <- arrange(ECDF, desc((PROPDMGPerEve+CROPDMGPerEve)))
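Ordering by the combined damage can change the ranking compared with ordering by property damage alone. The base R sketch below uses three rows from the results in this report. The data frame and column names are my own.

```r
# Property and crop damage for three event types, taken from this report's results
econ <- data.frame(type = c("FLASH FLOOD", "HAIL", "DROUGHT"),
                   prop = c(15222203910, 14595143420, 1046101000),
                   crop = c(1334901700, 2476029450, 13367566000),
                   stringsAsFactors = FALSE)

# Combined damage per event type
econ$total <- econ$prop + econ$crop

# Sort from most to least combined damage
econ_ord <- econ[order(econ$total, decreasing = TRUE), ]
print(econ_ord)
```

Hail ends up above flash flood although its property damage is lower, because its crop damage is higher. Drought ranks close behind almost entirely on crop damage.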
We repeat the same calculation of percentage of total, cumulative sum and cumulative percentage. This time we also calculate each event type’s share of the total of property and crop damage combined. We do this because, unlike fatalities and injuries (which can’t really be compared or summed), the economic damage to property and to crops can be added together.
for (i in 1:dim(ECDFord)[1]) {
  ECDFord$PROPPerc[i] <- round(100*ECDFord$PROPDMGPerEve[i]/TotPROPDMG, digits = 2)
  ECDFord$PROPCumSum[i] <- sum(head(ECDFord$PROPDMGPerEve,i))
  ECDFord$PROPCumPerc[i] <- round(100*ECDFord$PROPCumSum[i]/TotPROPDMG)
  ECDFord$CROPPerc[i] <- round(100*ECDFord$CROPDMGPerEve[i]/TotCROPDMG, digits = 2)
  ECDFord$CROPCumSum[i] <- sum(head(ECDFord$CROPDMGPerEve,i))
  ECDFord$CROPCumPerc[i] <- round(100*ECDFord$CROPCumSum[i]/TotCROPDMG)
  ECDFord$TotPerc[i] <- round(100*(ECDFord$PROPDMGPerEve[i]+ECDFord$CROPDMGPerEve[i])/(TotPROPDMG+TotCROPDMG), digits = 2)
}
Again we explore the top 20 event types.
head(ECDFord, n=20)
## PROPDMGPerEve CROPDMGPerEve levels.eventData.EVTYPE. PROPPerc
## 1 143944833550 4974778400 FLOOD 39.25
## 2 69305840000 2607872800 HURRICANE/TYPHOON 18.90
## 3 43193536000 5000 STORM SURGE 11.78
## 4 24616945710 283425010 TORNADO 6.71
## 5 14595143420 2476029450 HAIL 3.98
## 6 15222203910 1334901700 FLASH FLOOD 4.15
## 7 11812819010 2741410000 HURRICANE 3.22
## 8 1046101000 13367566000 DROUGHT 0.29
## 9 7642475550 677711000 TROPICAL STORM 2.08
## 10 5247860360 633561300 HIGH WIND 1.43
## 11 4758667000 295472800 WILDFIRE 1.30
## 12 4478026440 553915350 TSTM WIND 1.22
## 13 4641188000 850000 STORM SURGE/TIDE 1.27
## 14 3382654440 398331000 THUNDERSTORM WIND 0.92
## 15 3642248810 15660000 ICE STORM 0.99
## 16 3001782500 106782330 WILD/FOREST FIRE 0.82
## 17 1532743250 11944000 WINTER STORM 0.42
## 18 584864440 728169800 HEAVY RAIN 0.16
## 19 19760400 1288973000 EXTREME COLD 0.01
## 20 9480000 1094086000 FROST/FREEZE 0.00
## PROPCumSum PROPCumPerc CROPPerc CROPCumSum CROPCumPerc TotPerc
## 1 143944833550 39 14.31 4974778400 14 37.09
## 2 213250673550 58 7.50 7582651200 22 17.91
## 3 256444209550 70 0.00 7582656200 22 10.76
## 4 281061155260 77 0.82 7866081210 23 6.20
## 5 295656298680 81 7.12 10342110660 30 4.25
## 6 310878502590 85 3.84 11677012360 34 4.12
## 7 322691321600 88 7.89 14418422360 41 3.62
## 8 323737422600 88 38.46 27785988360 80 3.59
## 9 331379898150 90 1.95 28463699360 82 2.07
## 10 336627758510 92 1.82 29097260660 84 1.46
## 11 341386425510 93 0.85 29392733460 85 1.26
## 12 345864451950 94 1.59 29946648810 86 1.25
## 13 350505639950 96 0.00 29947498810 86 1.16
## 14 353888294390 96 1.15 30345829810 87 0.94
## 15 357530543200 97 0.05 30361489810 87 0.91
## 16 360532325700 98 0.31 30468272140 88 0.77
## 17 362065068950 99 0.03 30480216140 88 0.38
## 18 362649933390 99 2.10 31208385940 90 0.33
## 19 362669693790 99 3.71 32497358940 94 0.33
## 20 362679173790 99 3.15 33591444940 97 0.27
When we look at the 20 weather event types causing the most economic damage, we see that only the top 13 each account for more than 1% of the combined damage (TotPerc). These thirteen cause 96% of property damage and 86% of crop damage. In the results section there will be a bar plot of the top thirteen with their percentage impact.
This section shows the results in a visually attractive manner. If you prefer details, see the previous section.
The next figure shows a bar plot of the 20 deadliest weather event types and the percentage of all fatalities and injuries caused by each of them.
par(mfrow = c(2,1),mar = c(0.5,4,1,1), oma = c(13,1,2,0))
barplot(head(PHDFord$FatPerc, n=20), col = rainbow(20), ylab = "Percentage (%)", main = "Plot 2: Percentage of all fatalities per weather event")
barplot(head(PHDFord$InjPerc, n=20), names.arg = PHDFord$levels.eventData.EVTYPE.[1:20], las =2, col = rainbow(20), main = "Percentage of all injuries per weather event", ylab = "Percentage (%)")
Excessive heat and tornadoes cause the most harm to population health, followed by different types of flooding and by lightning. You can also see several event types that cause few injuries but are very deadly, such as rip currents and avalanches.
The next figure shows a bar plot of the 13 most economically destructive weather event types and the percentage of the combined property and crop damage caused by each of them.
par(mfrow = c(1,1),mar = c(14,4,4,1))
barplot(head(ECDFord$TotPerc, n=13), col = rainbow(13), ylab = "Percentage (%)", main = "Plot 3: Percentage of combined property and crop damage per weather event", names.arg = ECDFord$levels.eventData.EVTYPE.[1:13], las = 2)
Floods and heavy storms are the most economically damaging. Excessive heat is not near the top. That can be expected, because heat should not damage property much, and crops will be watered unless the heat is too extreme.