Author: JayEnAar
Date: 20 June 2014
Context: Part of an assignment for the Coursera ‘Reproducible Research’ online course, run by Prof RD Peng of Johns Hopkins School of Public Health
Weather events such as storms, hurricanes, tornadoes and extremes of temperature can be expected to result in loss of life and limb, as well as economic loss.
Is it possible to:
* estimate the extent of these adverse effects on human health and economic welfare? and
* determine the types of extreme weather events that result in the most serious adverse consequences?
The US Government has a system for systematically collecting data on weather events that allows these questions to be answered with a reasonable degree of certainty. The dataset spans a period of about 60 years, from 1950 to 2011. Data quality has almost certainly improved in more recent times, so temporal comparisons may not reflect real changes in the effects of extreme weather events. Deaths and injuries resulting from weather events are also influenced by the number of people and their degree of exposure. These too have changed over the last half century. Economic loss is equally influenced by the growth in the relative wealth of Americans. More people live in coastal areas and in areas prone to high-temperature episodes and wildfires, and more people now have expensive homes, cars and boats to lose than was the case 30-60 years ago.
Therefore temporal analysis was not performed.
The analysis shows that between 1950 and 2011 there were 15,145 deaths and 140,528 injuries due to weather events. Property damage amounted to an estimated 10.88 billion USD, and damage to crops totalled 1.38 billion USD. Tornadoes were the single biggest cause of fatalities, resulting in 5,633 deaths. Counting tornadoes, high winds, storms, hurricanes, typhoons and lightning together as TWISTR - an acronym that encompasses atmospheric disturbances associated with precipitation - the total over the period is 7,599 deaths (50.2% of the total).
TWISTR also accounted for 82% of all injuries.
Tornadoes were also the biggest cause of property damage. The top 10 weather events causing property damage were all part of the TWISTR constellation and accounted for 9.9 billion USD of damage, or 91% of all weather-related property damage. Similarly for crop losses, the top cause was hail, but the top 10 taken together (all part of TWISTR) accounted for 1.25 billion USD, or 90.7% of all crop losses due to weather events.
The data come from the U.S. National Oceanic and Atmospheric Administration (NOAA), which maintains a storm database; the data were obtained from it as a zipped data file. The csv data file is compressed using bzip2, a free data compression program.
zipfile <- "repdata-data-StormData.csv.bz2"
stormdata <- read.csv(zipfile)
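The call above hands a .bz2 file name straight to read.csv() with no separate decompression step. A minimal sketch of why that works, using a made-up two-row table and a temporary file (both are illustrative assumptions, not part of the storm data):

```r
# Toy round trip: write a small CSV through a bzip2 connection, then read it
# back by file name alone. The table contents here are made up.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(3, 0))
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the compression
roundtrip <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)
roundtrip
```

Because the compression is detected when the file is opened for reading, the same read.csv() call should work whether the file is compressed or plain.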
The data file consists of 902,297 observations of 37 variables. The data dictionary is available here.
The key variables of interest for this analysis are as follows:
EVTYPE : Event type. There are 985 different types of events; a factor variable with 985 levels
FATALITIES : the number of deaths recorded for each weather event and presumably directly attributable to it. A numeric variable
INJURIES : as above but for non-lethal injuries. A numeric variable
PROPDMG : A numeric variable that estimates in thousands of USD the cost of property damage. The estimates are somewhat rough and ready - see Appendix B of the data dictionary for the approximations used.
CROPDMG : A numeric variable that estimates the cost of lost or damaged crops. This too is an estimate, see above.
STATE : This appears to be the usual two-letter code for US states. However, this is a factor variable with, surprisingly, 72 levels; one would have expected 50 or 51. There may be some mis-recording of data, or other areas of the wider North America / Central America / Caribbean region may have been included in the data set. Weather, after all, does not respect state boundaries.
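PROPDMG and CROPDMG are treated in this report as plain figures in thousands of USD. Copies of the NOAA file are generally distributed with companion magnitude columns (PROPDMGEXP and CROPDMGEXP) holding codes such as K, M or B; those columns are not among the variables used here, so the sketch below is an assumption about the file rather than part of this analysis. The helper name scale_damage and the toy rows are hypothetical.

```r
# ASSUMPTION: magnitude codes K/M/B (thousands, millions, billions) sit in a
# companion column; any other code is treated here as "no multiplier".
scale_damage <- function(value, exp_code) {
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[toupper(as.character(exp_code))]
  mult[is.na(mult)] <- 1
  value * unname(mult)
}

# Made-up rows for illustration only
toy <- data.frame(PROPDMG = c(25, 2.5, 1.2, 10),
                  PROPDMGEXP = c("K", "M", "B", ""))
toy$prop.usd <- scale_damage(toy$PROPDMG, toy$PROPDMGEXP)
toy
```

If the magnitude codes vary from record to record, totals built from the unscaled figures would differ from totals built from scaled ones, so this is worth checking against the data dictionary.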
The other variables are not of particular interest for this analysis and so will be dropped when creating a smaller data set with just the variables above (selected by column position):
stormdata <- stormdata[, c(7,8,23,24,25,27)]
This is the data set that will be used for further analysis
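Selecting columns by position depends on the column order of the file never changing. A hedged alternative is to select by name, as in this sketch on a one-row stand-in table (the UNUSED column is hypothetical and only marks something to be dropped):

```r
# Toy stand-in for the full table; only the column names matter here.
toy <- data.frame(STATE = "AL", EVTYPE = "TORNADO", UNUSED = NA,
                  FATALITIES = 1, INJURIES = 14, PROPDMG = 25, CROPDMG = 0)

keep <- c("STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")
stopifnot(all(keep %in% names(toy)))  # fail early if a column is missing
small <- toy[, keep]
names(small)
```

The up-front check turns a silently wrong selection into an immediate error if the file layout differs from what is expected.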
The total deaths across the United States are: 15,145
total.deaths <- sum(stormdata$FATALITIES)
total.deaths
## [1] 15145
The total number of non-fatal injuries is: 140,528
total.injuries <- sum(stormdata$INJURIES)
total.injuries
## [1] 140528
The total cost of adverse weather in terms of damage to property is (in thousands of dollars): 10,884,500 (or 10.88 billion USD)
total.propdmg <- sum(stormdata$PROPDMG)
total.propdmg
## [1] 10884500
The total cost of adverse weather in terms of damage to crops is (in thousands of dollars): 1,377,827 (or 1.38 billion USD)
total.cropdmg <- sum(stormdata$CROPDMG)
total.cropdmg
## [1] 1377827
A quick examination of the data shows that a large number of records have zero fatalities and zero injuries. It might therefore be useful to create subsets of the data where there are a) 1 or more fatalities and b) 1 or more injuries, and to use just these cut-down versions of the data set for specific analyses.
require(plyr)
## Loading required package: plyr
fatalevents <- subset(stormdata, FATALITIES > 0)
injuryevents <- subset(stormdata, INJURIES > 0)
save(fatalevents, file="fatalevents.Rda")
save(injuryevents, file="injuryevents.Rda")
str(fatalevents)
## 'data.frame': 6974 obs. of 6 variables:
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 1 1 4 1 6 7 2 5 25 2 ...
## $ INJURIES : num 14 26 50 8 195 12 3 20 200 90 ...
## $ PROPDMG : num 25 250 25 25 2.5 250 25 2.5 2.5 0.25 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
str(injuryevents)
## 'data.frame': 17604 obs. of 6 variables:
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 1 0 0 1 ...
## $ INJURIES : num 15 2 2 2 6 1 14 3 3 26 ...
## $ PROPDMG : num 25 25 2.5 2.5 2.5 2.5 25 2.5 2.5 250 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
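The row counts above give a sense of how sparse the health outcomes are, and dropping the zero rows should leave the totals unchanged. A small sketch of both checks, using the counts reported above and a made-up five-row table:

```r
# Share of all records with at least one death or injury, from the counts above
100 * 6974 / 902297    # percent of records with fatalities
100 * 17604 / 902297   # percent of records with injuries

# Made-up rows: subsetting on FATALITIES > 0 keeps the total intact
toy <- data.frame(FATALITIES = c(0, 0, 1, 0, 4), INJURIES = c(0, 2, 14, 0, 0))
fatal.toy <- subset(toy, FATALITIES > 0)
stopifnot(sum(fatal.toy$FATALITIES) == sum(toy$FATALITIES))
nrow(fatal.toy)
```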
Fatalities and injuries are a good measure of the adverse effect of weather events on human health.
The following code attempts to identify the event types that are most harmful to health. The plan is to create a summary of the total number of deaths and injuries by event type and then rearrange each new data table in decreasing order.
deaths.by.event <- ddply(fatalevents, "EVTYPE", summarise, deaths = sum(FATALITIES),
proploss = sum(PROPDMG), croploss = sum(CROPDMG) )
str(deaths.by.event)
## 'data.frame': 168 obs. of 4 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 18 19 29 30 42 44 54 56 57 60 ...
## $ deaths : num 1 224 1 101 1 1 3 2 1 3 ...
## $ proploss: num 0 660 0 4136 15 ...
## $ croploss: num 0 0 0 112 0 0 0 0 0 0 ...
injuries.by.event <- ddply(injuryevents, .(EVTYPE), summarise, injuries = sum(INJURIES),
proploss = sum(PROPDMG), croploss = sum(CROPDMG))
str(injuries.by.event)
## 'data.frame': 158 obs. of 4 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 19 29 30 42 44 49 54 58 59 60 ...
## $ injuries: num 170 24 805 1 13 2 2 5 1 1 ...
## $ proploss: num 677 0 4208 15 0 ...
## $ croploss: num 0 0 155 0 0 0 0 0 0 0 ...
The above shows that 168 different types of weather events capture ALL the fatalities, and 158 capture all the injuries. Even this is too large a number, so we’ll take just the top 20.
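The summarise, sort and truncate steps can also be sketched in base R without plyr. The six-row table below is made up, and aggregate() is swapped in for ddply() purely for illustration:

```r
# Made-up records for illustration
toy <- data.frame(
  EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "TORNADO"),
  FATALITIES = c(4, 2, 1, 3, 3, 2)
)

# Sum fatalities within each event type (base R counterpart of ddply + summarise)
by.event <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
names(by.event)[2] <- "deaths"

# Sort in decreasing order of deaths and keep the first n rows
top <- head(by.event[order(-by.event$deaths), ], 2)
top
```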
deaths.by.topevents <- deaths.by.event[with(deaths.by.event, order(-deaths)), ]
deaths.by.top20events <- deaths.by.topevents[1:20, ]
deaths.by.top20events
## EVTYPE deaths proploss croploss
## 141 TORNADO 5633 143960.3 6613.80
## 26 EXCESSIVE HEAT 1903 203.3 492.40
## 35 FLASH FLOOD 978 36411.6 4578.25
## 57 HEAT 937 135.0 450.80
## 97 LIGHTNING 816 1848.5 0.00
## 145 TSTM WIND 504 10843.9 115.00
## 40 FLOOD 470 20561.5 6375.30
## 116 RIP CURRENT 368 0.0 0.00
## 75 HIGH WIND 248 12294.3 617.93
## 2 AVALANCHE 224 660.5 0.00
## 163 WINTER STORM 206 6935.8 30.00
## 117 RIP CURRENTS 204 0.0 0.00
## 58 HEAT WAVE 172 666.8 200.00
## 30 EXTREME COLD 160 2603.4 1.75
## 136 THUNDERSTORM WIND 133 6102.5 600.00
## 63 HEAVY SNOW 127 3944.7 40.00
## 31 EXTREME COLD/WIND CHILL 125 0.0 0.00
## 131 STRONG WIND 103 2014.1 63.40
## 4 BLIZZARD 101 4136.5 112.00
## 71 HIGH SURF 101 765.0 0.00
barplot(deaths.by.top20events$deaths[1:10], names.arg = deaths.by.top20events$EVTYPE[1:10], cex.axis= 0.8, cex.names=0.35, xlab = "10 weather events that lead to the most deaths", ylab="deaths")
The table and chart show that tornadoes are by far the weather event that results in the most deaths. However, there is a lot of overlap between categories. Tornadoes cause 5,633 deaths, but including TSTM WIND (504 deaths, plus a further 133 classed as THUNDERSTORM WIND), HIGH WIND (248), STRONG WIND (103), heavy rain (98), hurricane and typhoon (64), and LIGHTNING (816) brings the total for a broad category of weather event that could be referred to as ‘Tornadoes, Wind, Storm and Rain’ (‘TWISTR’, an acronym I just made up) to:
t <- 5633+504+133+248+103+98+64 +816
t
## [1] 7599
As a percentage of the total weather-related deaths, TWISTR accounts for 50% of all deaths:
t*100/total.deaths
## [1] 50.17
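The TWISTR total above is typed in by hand. A sketch of the same sum driven by a list of event labels, which makes the category membership explicit; the counts are copied from the table and text above, and the two labels flagged in the comment are assumptions:

```r
# Death counts copied from the table and text above.
# ASSUMPTION: "HEAVY RAIN" and "HURRICANE/TYPHOON" are the labels behind the
# 98 and 64 deaths quoted in the text.
deaths <- c("TORNADO" = 5633, "TSTM WIND" = 504, "THUNDERSTORM WIND" = 133,
            "HIGH WIND" = 248, "STRONG WIND" = 103, "HEAVY RAIN" = 98,
            "HURRICANE/TYPHOON" = 64, "LIGHTNING" = 816,
            "EXCESSIVE HEAT" = 1903, "RIP CURRENT" = 368)

# Labels counted as part of the TWISTR category
twistr <- c("TORNADO", "TSTM WIND", "THUNDERSTORM WIND", "HIGH WIND",
            "STRONG WIND", "HEAVY RAIN", "HURRICANE/TYPHOON", "LIGHTNING")

twistr.deaths <- sum(deaths[names(deaths) %in% twistr])
twistr.deaths
round(100 * twistr.deaths / 15145, 2)   # 15145 = total deaths reported above
```

Changing the membership of the category then only requires editing the twistr vector.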
injuries.by.topevents <- injuries.by.event[with(injuries.by.event, order(-injuries)), ]
injuries.by.top20events <- injuries.by.topevents[1:20, ]
injuries.by.top20events
## EVTYPE injuries proploss croploss
## 129 TORNADO 91346 851910.7 25101.69
## 135 TSTM WIND 6957 101786.6 3075.75
## 30 FLOOD 6789 11679.2 5741.05
## 20 EXCESSIVE HEAT 6525 207.5 492.40
## 85 LIGHTNING 5230 18819.3 13.55
## 47 HEAT 2100 145.0 485.80
## 79 ICE STORM 1975 5147.1 1015.00
## 28 FLASH FLOOD 1777 33275.7 5414.70
## 121 THUNDERSTORM WIND 1488 36077.6 1405.50
## 45 HAIL 1361 10564.8 3463.00
## 152 WINTER STORM 1321 12151.9 293.00
## 76 HURRICANE/TYPHOON 1275 672.2 301.51
## 63 HIGH WIND 1137 34672.3 1450.59
## 53 HEAVY SNOW 1021 8389.9 170.00
## 149 WILDFIRE 911 17762.0 1068.20
## 122 THUNDERSTORM WINDS 908 29464.3 1291.55
## 3 BLIZZARD 805 4208.1 155.00
## 33 FOG 734 6680.9 0.00
## 148 WILD/FOREST FIRE 545 8365.9 506.00
## 19 DUST STORM 440 1629.0 100.00
barplot(injuries.by.top20events$injuries[1:10], names.arg = injuries.by.top20events$EVTYPE[1:10], cex.axis= 0.8, cex.names=0.35, xlab = "10 weather events that lead to the most injuries", ylab="injuries")
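With cex.names = 0.35 the event labels are very small. One possible alternative, sketched here with the ten injury counts from the table above, is a horizontal bar chart with a wider left margin; the margin and size values are guesses that may need adjusting for a given device:

```r
# Injury counts for the ten leading event types, copied from the table above
injuries <- c("TORNADO" = 91346, "TSTM WIND" = 6957, "FLOOD" = 6789,
              "EXCESSIVE HEAT" = 6525, "LIGHTNING" = 5230, "HEAT" = 2100,
              "ICE STORM" = 1975, "FLASH FLOOD" = 1777,
              "THUNDERSTORM WIND" = 1488, "HAIL" = 1361)

op <- par(mar = c(4, 10, 1, 1))   # widen the left margin for the labels
# rev() puts the largest bar at the top of a horizontal chart
bp <- barplot(rev(injuries), horiz = TRUE, las = 1, cex.names = 0.8,
              xlab = "injuries")
par(op)                           # restore the previous margins
```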
The table and chart above show that tornadoes cause by far the most injuries. Here too, using my TWISTR category of weather event, the total number of cases of non-fatal injury would be:
i <- 91346+6957+6789+5230+1488+1137+908+340+302+280
i
## [1] 114777
and this would amount to the following percentage of all injuries:
i*100/total.injuries
## [1] 81.68
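The tables above list near-duplicate labels separately, for example RIP CURRENT and RIP CURRENTS, or TSTM WIND and THUNDERSTORM WIND. A minimal sketch of merging such labels before summing, using death counts from the table above; the helper normalise_event and its rewrite rules are hypothetical and far from a complete cleaning scheme:

```r
# ASSUMPTION: these rewrite rules are illustrative only.
normalise_event <- function(x) {
  x <- toupper(trimws(x))
  x <- sub("^TSTM", "THUNDERSTORM", x)                       # expand abbreviation
  x <- sub("^(THUNDERSTORM WIND|RIP CURRENT)S$", "\\1", x)   # drop plural form
  x
}

# Death counts copied from the top-20 table above
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "RIP CURRENT", "RIP CURRENTS"),
  deaths = c(504, 133, 368, 204)
)

merged <- tapply(toy$deaths, normalise_event(toy$EVTYPE), sum)
merged
```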
It is a reasonable assumption that the weather events that result in the biggest economic damage will be the same events that cause loss of life and limb. On the basis of this reasoning I constructed a list of the top 20 weather events (as recorded in the database - these are not the same as in the weather events table in the NOAA manual) that account (as above) for the most deaths and injuries.
Using this list, the plan is to create a subset of records in the original data file that record the property damage and crop damage from these top 20 weather events. It is a reasonable expectation that tornadoes will, as in the case of health effects, account for the largest economic loss. There is considerable overlap between these two lists of top 20 events, with 12 events common to both lists and a total of 28 unique events in either or both lists.
eventsA <- deaths.by.top20events$EVTYPE
eventsB <- injuries.by.top20events$EVTYPE
eventsA.and.B <- intersect(eventsA, eventsB)
eventsA.or.B <- union(eventsA,eventsB)
Looking at the output from the above code, I decided to use eventsA.or.B. The 28 events included are:
events <- eventsA.or.B
events
## [1] "TORNADO" "EXCESSIVE HEAT"
## [3] "FLASH FLOOD" "HEAT"
## [5] "LIGHTNING" "TSTM WIND"
## [7] "FLOOD" "RIP CURRENT"
## [9] "HIGH WIND" "AVALANCHE"
## [11] "WINTER STORM" "RIP CURRENTS"
## [13] "HEAT WAVE" "EXTREME COLD"
## [15] "THUNDERSTORM WIND" "HEAVY SNOW"
## [17] "EXTREME COLD/WIND CHILL" "STRONG WIND"
## [19] "BLIZZARD" "HIGH SURF"
## [21] "ICE STORM" "HAIL"
## [23] "HURRICANE/TYPHOON" "WILDFIRE"
## [25] "THUNDERSTORM WINDS" "FOG"
## [27] "WILD/FOREST FIRE" "DUST STORM"
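The overlap between the two top-20 lists can be checked directly. This sketch retypes the labels from the two tables above as character vectors and counts the intersection and the union:

```r
# Labels copied from the top-20 deaths table above
deaths.top20 <- c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT",
                  "LIGHTNING", "TSTM WIND", "FLOOD", "RIP CURRENT",
                  "HIGH WIND", "AVALANCHE", "WINTER STORM", "RIP CURRENTS",
                  "HEAT WAVE", "EXTREME COLD", "THUNDERSTORM WIND",
                  "HEAVY SNOW", "EXTREME COLD/WIND CHILL", "STRONG WIND",
                  "BLIZZARD", "HIGH SURF")
# Labels copied from the top-20 injuries table above
injuries.top20 <- c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT",
                    "LIGHTNING", "HEAT", "ICE STORM", "FLASH FLOOD",
                    "THUNDERSTORM WIND", "HAIL", "WINTER STORM",
                    "HURRICANE/TYPHOON", "HIGH WIND", "HEAVY SNOW",
                    "WILDFIRE", "THUNDERSTORM WINDS", "BLIZZARD", "FOG",
                    "WILD/FOREST FIRE", "DUST STORM")

common <- intersect(deaths.top20, injuries.top20)
either <- union(deaths.top20, injuries.top20)
length(common)   # 12 events appear in both lists
length(either)   # 28 events appear in at least one list

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
stopifnot(length(either) ==
            length(deaths.top20) + length(injuries.top20) - length(common))
```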
Now to select from the main data file those records that have one of these weather events recorded.
econdmg.events <- subset(stormdata, EVTYPE %in% events, select = c(STATE,EVTYPE,PROPDMG,CROPDMG))
str(econdmg.events)
## 'data.frame': 834992 obs. of 4 variables:
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ PROPDMG: num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ CROPDMG: num 0 0 0 0 0 0 0 0 0 0 ...
The total value of the loss due to these selected weather events (in thousands of USD) is, for damage to property:
sum(econdmg.events$PROPDMG)
## [1] 10379197
and for damage to crops
#sum(econdmg.events$CROPDMG)
Next, sum the property damage and crop damage by event type and create two data frames:
propdmg.events <- aggregate(econdmg.events$PROPDMG, list(Event = econdmg.events$EVTYPE), sum)
str(propdmg.events)
## 'data.frame': 28 obs. of 2 variables:
## $ Event: Factor w/ 985 levels " HIGH SURF ADVISORY",..: 19 30 117 130 140 141 153 170 188 244 ...
## $ x : num 1624 25318 5050 1460 7658 ...
colnames(propdmg.events)[2] <- "PROPDMG"
propdmg.by.events <- propdmg.events[with(propdmg.events, order(-PROPDMG)), ]
cropdmg.events <- aggregate(econdmg.events$CROPDMG, list(Event = econdmg.events$EVTYPE), sum)
colnames(cropdmg.events)[2] <- "CROPDMG"
cropdmg.by.events <- cropdmg.events[with(cropdmg.events, order(-CROPDMG)), ]
head(propdmg.by.events, 10)
## Event PROPDMG
## 24 TORNADO 3212258
## 7 FLASH FLOOD 1420125
## 25 TSTM WIND 1335966
## 8 FLOOD 899938
## 22 THUNDERSTORM WIND 876844
## 10 HAIL 688693
## 18 LIGHTNING 603352
## 23 THUNDERSTORM WINDS 446293
## 15 HIGH WIND 324732
## 28 WINTER STORM 132721
head(cropdmg.by.events, 10)
## Event CROPDMG
## 10 HAIL 579596
## 7 FLASH FLOOD 179200
## 8 FLOOD 168038
## 25 TSTM WIND 109203
## 24 TORNADO 100019
## 22 THUNDERSTORM WIND 66791
## 23 THUNDERSTORM WINDS 18685
## 15 HIGH WIND 17283
## 5 EXTREME COLD 6121
## 16 HURRICANE/TYPHOON 4798
barplot(propdmg.by.events$PROPDMG[1:10], names.arg = propdmg.by.events$Event[1:10], cex.axis= 0.8, cex.names=0.35, xlab = "10 weather events that result in the most property damage", ylab ="Thousand USD")
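The two aggregate() calls above build the property and crop summaries separately. As a possible shortcut, the formula interface can sum both columns in one pass; this is a sketch on made-up rows, not the code used for the results in this report:

```r
# Made-up records for illustration
toy <- data.frame(
  EVTYPE  = c("HAIL", "TORNADO", "HAIL", "FLOOD", "TORNADO"),
  PROPDMG = c(10, 250, 5, 40, 25),
  CROPDMG = c(50, 0, 20, 10, 1)
)

# One row per event type, with both damage columns summed
dmg <- aggregate(cbind(PROPDMG, CROPDMG) ~ EVTYPE, data = toy, FUN = sum)

dmg[order(-dmg$PROPDMG), ]   # ranked by property damage
dmg[order(-dmg$CROPDMG), ]   # ranked by crop damage
```

Note that the formula interface drops rows with missing values in any of the listed columns by default, which may or may not be wanted.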
As shown by the tables and the graph above, the weather event that causes the most property damage is the tornado. The top 10 weather events - all part of TWISTR - together cause property damage of (in millions of dollars):
proploss <- sum(propdmg.by.events$PROPDMG[1:10])/ 1000
proploss
## [1] 9941
In percentage terms this amounts to
proploss*10^5/total.propdmg
## [1] 91.33
The weather events that cause the most crop loss are hail, followed by floods, thunderstorm winds, tornadoes and high winds. The top 10 events for crop losses - all part of the TWISTR category - together cause crop losses of (in millions of dollars):
croploss <- sum(cropdmg.by.events$CROPDMG[1:10]) /1000
croploss
## [1] 1250
In percentage terms this amounts to
croploss*10^5/total.cropdmg
## [1] 90.7
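The percentage calculations above multiply by 10^5 because the top-10 sums were first converted to millions while the totals are in thousands. A sketch that keeps both figures in the same unit, using the values printed above (share_pct is a hypothetical helper name):

```r
# All figures in thousands of USD, copied from the output above
top10.prop <- c(3212258, 1420125, 1335966, 899938, 876844,
                688693, 603352, 446293, 324732, 132721)
top10.crop <- c(579596, 179200, 168038, 109203, 100019,
                66791, 18685, 17283, 6121, 4798)
total.prop <- 10884500
total.crop <- 1377827

# Same unit in for part and whole, percent out
share_pct <- function(part, whole) 100 * sum(part) / whole

round(share_pct(top10.prop, total.prop), 2)
round(share_pct(top10.crop, total.crop), 2)
```

Keeping part and whole in the same unit removes the need to remember the 10^5 conversion factor.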
End of report