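In this analysis we use data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, collected between 1950 and 2011, to understand which types of severe weather events are most harmful to population health (injuries and fatalities) and which have the greatest economic consequences (property and crop damage).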
The raw data used for this analysis can be downloaded here. The National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ contain useful documentation on how variables in the original dataset have been constructed and defined.
To process the raw CSV input data, we import the contents of the compressed file repdata_data_StormData.csv.bz2 into R:
stormdata <- read.csv("repdata_data_StormData.csv.bz2")
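When only a handful of columns are needed, the import can skip the rest. The following is a minimal sketch on a toy file, not the code used for this report: the file contents, the names toy, path and slim, and the choice of columns are assumptions made for illustration. It relies on read.csv() reading bzip2-compressed files directly, as the call above does, and on the colClasses argument, where the string "NULL" drops a column.

```r
# Toy stand-in for the storm data file (assumption: the real file has many more columns).
toy <- data.frame(EVTYPE   = c("TORNADO", "HAIL"),
                  INJURIES = c(15, 0),
                  REMARKS  = c("long free text", "more free text"),
                  stringsAsFactors = FALSE)

# Write the toy data to a temporary bzip2-compressed CSV file.
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# One entry per column, in file order; "NULL" (a string) skips that column.
slim <- read.csv(path, colClasses = c("character", "numeric", "NULL"))
names(slim)   # the REMARKS column is not imported
```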
Here we print the first few rows to get an idea of the structure of the data:
head(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
As we can see, the INJURIES and FATALITIES columns report how many injuries and fatalities, respectively, have been linked to a particular weather event. The event type is reported in the EVTYPE column.
We start our analysis by exploring the number of injuries linked to severe weather events. In particular, we total the injuries for each event type, calculate the mean of those totals, and then keep only the event types whose total is above that mean:
total_injuries_per_event_type <- aggregate(x = stormdata$INJURIES, by = list(stormdata$EVTYPE), FUN = sum, na.rm=TRUE)
names(total_injuries_per_event_type) <- c("EVTYPE","INJURIES")
avg_injuries <- mean(total_injuries_per_event_type$INJURIES,na.rm=TRUE)
avg_injuries
## [1] 142.668
injuries_above_avg <- total_injuries_per_event_type[which(total_injuries_per_event_type$INJURIES > avg_injuries),]
# sort by descending order for better readability
injuries_above_avg <- injuries_above_avg[order(-injuries_above_avg$INJURIES),]
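To make the threshold concrete, here is a small worked example on made-up numbers (the data frame toy and its values are assumptions for illustration only). It follows the same steps: sum per event type, take the mean of those sums, keep the types above it, and sort.

```r
# Four toy events across three event types.
toy <- data.frame(EVTYPE   = c("A", "A", "B", "C"),
                  INJURIES = c(10, 20, 3, 0))

# Total injuries per event type: A = 30, B = 3, C = 0.
totals <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)

# Mean of the per-type totals: (30 + 3 + 0) / 3 = 11.
threshold <- mean(totals$INJURIES)

# Keep the types above the threshold and sort in descending order.
above <- totals[totals$INJURIES > threshold, ]
above <- above[order(-above$INJURIES), ]
above   # only type A remains
```

Note that event types with a total of zero still count towards the mean, which lowers the threshold.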
The full list of event types with an above-average number of injuries, in descending order, is reported here:
injuries_above_avg
## EVTYPE INJURIES
## 830 TORNADO 91346
## 854 TSTM WIND 6957
## 164 FLOOD 6789
## 123 EXCESSIVE HEAT 6525
## 452 LIGHTNING 5230
## 269 HEAT 2100
## 424 ICE STORM 1975
## 147 FLASH FLOOD 1777
## 759 THUNDERSTORM WIND 1488
## 238 HAIL 1361
## 972 WINTER STORM 1321
## 406 HURRICANE/TYPHOON 1275
## 354 HIGH WIND 1137
## 304 HEAVY SNOW 1021
## 953 WILDFIRE 911
## 783 THUNDERSTORM WINDS 908
## 22 BLIZZARD 805
## 182 FOG 734
## 956 WILD/FOREST FIRE 545
## 111 DUST STORM 440
## 978 WINTER WEATHER 398
## 82 DENSE FOG 342
## 844 TROPICAL STORM 340
## 274 HEAT WAVE 309
## 368 HIGH WINDS 302
## 582 RIP CURRENTS 297
## 672 STRONG WIND 280
## 284 HEAVY RAIN 251
## 581 RIP CURRENT 232
## 133 EXTREME COLD 231
## 216 GLAZE 216
## 11 AVALANCHE 170
## 135 EXTREME HEAT 155
## 342 HIGH SURF 152
## 955 WILD FIRES 150
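The table lists several labels that appear to describe the same kind of event, for example TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS, or RIP CURRENT and RIP CURRENTS. The analysis in this report keeps them separate. As a minimal sketch, assuming such labels should be merged, EVTYPE could be normalised before aggregating. The function normalise_evtype, its rewriting rules and the toy counts below are illustrative assumptions, not part of the original analysis.

```r
# Toy per-event records: labels follow the table above, counts are made up.
events <- data.frame(
  EVTYPE   = c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS",
               "RIP CURRENT", "RIP CURRENTS", " tornado "),
  INJURIES = c(5, 3, 2, 1, 4, 7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a few illustrative clean-up rules.
normalise_evtype <- function(x) {
  x <- toupper(trimws(x))                  # unify case, drop surrounding whitespace
  x <- gsub("\\s+", " ", x)                # collapse repeated spaces
  x <- sub("^TSTM", "THUNDERSTORM", x)     # expand the abbreviation
  x <- sub("(WIND|CURRENT)S$", "\\1", x)   # fold selected plurals into the singular
  x
}

events$EVTYPE <- normalise_evtype(events$EVTYPE)
merged <- aggregate(INJURIES ~ EVTYPE, data = events, FUN = sum)
merged   # three rows: RIP CURRENT, THUNDERSTORM WIND, TORNADO
```

With merged labels the ranking can change, because totals that were split across spellings are added together.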
The steps for this analysis are very similar to those in the previous section. In this instance we need to process the data in the FATALITIES column. We calculate the mean of the total fatalities per event type and then focus on the event types whose total is above that mean:
total_fatalities_per_event_type <- aggregate(x = stormdata$FATALITIES, by = list(stormdata$EVTYPE), FUN = sum, na.rm=TRUE)
names(total_fatalities_per_event_type) <- c("EVTYPE","FATALITIES")
avg_fatalities <- mean(total_fatalities_per_event_type$FATALITIES,na.rm=TRUE)
avg_fatalities
## [1] 15.37563
fatalities_above_avg <- total_fatalities_per_event_type[which(total_fatalities_per_event_type$FATALITIES > avg_fatalities),]
# Sort by descending order for better readability:
fatalities_above_avg <- fatalities_above_avg[order(-fatalities_above_avg$FATALITIES),]
The full list of event types with an above-average number of fatalities, in descending order, is reported here:
fatalities_above_avg
## EVTYPE FATALITIES
## 830 TORNADO 5633
## 123 EXCESSIVE HEAT 1903
## 147 FLASH FLOOD 978
## 269 HEAT 937
## 452 LIGHTNING 816
## 854 TSTM WIND 504
## 164 FLOOD 470
## 581 RIP CURRENT 368
## 354 HIGH WIND 248
## 11 AVALANCHE 224
## 972 WINTER STORM 206
## 582 RIP CURRENTS 204
## 274 HEAT WAVE 172
## 133 EXTREME COLD 160
## 759 THUNDERSTORM WIND 133
## 304 HEAVY SNOW 127
## 134 EXTREME COLD/WIND CHILL 125
## 672 STRONG WIND 103
## 22 BLIZZARD 101
## 342 HIGH SURF 101
## 284 HEAVY RAIN 98
## 135 EXTREME HEAT 96
## 71 COLD/WIND CHILL 95
## 424 ICE STORM 89
## 953 WILDFIRE 75
## 406 HURRICANE/TYPHOON 64
## 783 THUNDERSTORM WINDS 64
## 182 FOG 62
## 397 HURRICANE 61
## 844 TROPICAL STORM 58
## 336 HEAVY SURF/HIGH SURF 42
## 437 LANDSLIDE 38
## 59 COLD 35
## 368 HIGH WINDS 35
## 875 TSUNAMI 33
## 978 WINTER WEATHER 33
## 885 UNSEASONABLY WARM AND DRY 29
## 917 URBAN/SML STREAM FLD 28
## 980 WINTER WEATHER/MIX 28
## 833 TORNADOES, TSTM WIND, HAIL 25
## 959 WIND 23
## 111 DUST STORM 22
## 154 FLASH FLOODING 19
## 82 DENSE FOG 18
## 138 EXTREME WINDCHILL 17
## 169 FLOOD/FLASH FLOOD 17
## 552 RECORD/EXCESSIVE HEAT 17
The original dataset provides two main variables we can use to estimate the damage to the economy caused by severe weather events: PROPDMG (property damage) and CROPDMG (crop damage).
Both variables come with exponential multipliers, PROPDMGEXP and CROPDMGEXP respectively. Unfortunately, not all values and multipliers are provided (or follow a predictable pattern) in the original dataset, so we first clean up the data to isolate the meaningful rows.
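One way to see which multipliers need special handling is to tabulate the multiplier column. The sketch below uses a made-up vector, exp_codes, whose values are assumptions rather than data from the file; on the real data the same calls could be applied to stormdata$PROPDMGEXP or stormdata$CROPDMGEXP.

```r
# Toy multiplier column (illustrative values, not taken from the dataset).
exp_codes <- c("K", "k", "M", "", "B", "?", "K", "5")

# Count each distinct code, including the empty string.
table(exp_codes)

# Flag the codes that a K/M/B conversion would recognise, ignoring case.
recognised <- toupper(exp_codes) %in% c("K", "M", "B")
sum(!recognised)   # rows that would need separate handling: "", "?" and "5"
```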
We’ll start with PROPDMG values by filtering out zeros:
propdmg <- subset(stormdata, PROPDMG !=0)
For convenience, figures are reported in millions of USD. A helper function converts each value accordingly; when the exponential multiplier is missing or not recognised, it returns 0:
damage2M <- function(value,exp){
output = 0
if (exp == "K" || exp == "k" )
output = value / 1000
if (exp == "M" || exp == "m" )
output = value
if (exp == "B" || exp == "b")
output = value * 1000
return(output)
}
propdmg$PROPDMG <- damage2M(propdmg$PROPDMG,propdmg$PROPDMGEXP)
propdmg <- subset(propdmg, PROPDMG != 0) # Filter out zeros returned by the damage2M function
propdmg <- propdmg[c("EVTYPE","PROPDMG")]
total_propdmg_per_event_type <- aggregate(x = propdmg$PROPDMG, by = list(propdmg$EVTYPE), FUN = sum, na.rm=TRUE)
names(total_propdmg_per_event_type) <- c("EVTYPE","PROPDMG")
avg_propdmg <- mean(total_propdmg_per_event_type$PROPDMG,na.rm=TRUE)
avg_propdmg
## [1] 26.80911
propdmg_above_avg <- total_propdmg_per_event_type[which(total_propdmg_per_event_type$PROPDMG > avg_propdmg),]
# Finally, sort by descending order for convenience:
propdmg_above_avg <- propdmg_above_avg[order(-propdmg_above_avg$PROPDMG),]
propdmg_above_avg
## EVTYPE PROPDMG
## 332 TORNADO 3212.25816
## 48 FLASH FLOOD 1420.12459
## 348 TSTM WIND 1335.96561
## 61 FLOOD 899.93848
## 297 THUNDERSTORM WIND 876.84417
## 103 HAIL 688.69338
## 203 LIGHTNING 603.35178
## 308 THUNDERSTORM WINDS 446.29318
## 156 HIGH WIND 324.73156
## 399 WINTER STORM 132.72059
## 130 HEAVY SNOW 122.25199
## 386 WILDFIRE 84.45934
## 188 ICE STORM 66.00067
## 283 STRONG WIND 62.99381
## 163 HIGH WINDS 55.62500
## 120 HEAVY RAIN 50.84214
## 340 TROPICAL STORM 48.42368
## 389 WILD/FOREST FIRE 39.34495
## 53 FLASH FLOODING 28.49715
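One caveat about the helper: damage2M tests exp with if and ||, and both work on a single value. Recent R releases raise an error when || receives a vector longer than one, and older releases consult only its first element, so a call on whole columns cannot pick a multiplier row by row. The sketch below is a hypothetical vectorised variant, damage2M_vec, which assumes the same K/M/B rules and the same fallback to 0. It is not the function that produced the figures above, and totals computed with it may differ from them.

```r
# Hypothetical vectorised variant of damage2M: same K/M/B rules, applied per row.
damage2M_vec <- function(value, exp) {
  exp <- toupper(as.character(exp))
  # Factor that converts each unit into millions of USD.
  factors <- c(K = 1 / 1000, M = 1, B = 1000)
  multiplier <- unname(factors[exp])   # NA where the code is not K, M or B
  multiplier[is.na(multiplier)] <- 0   # missing or unrecognised multiplier -> 0
  value * multiplier
}

# 25 K -> 0.025, 2.5 m -> 2.5, 1.2 B -> 1200, no multiplier -> 0
damage2M_vec(c(25, 2.5, 1.2, 10), c("K", "m", "B", ""))
```

The named lookup vector keeps the unit rules in one place, which makes it easy to extend if other multiplier codes are given a meaning.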
We can now repeat the same steps for CROPDMG values:
cropdmg <- subset(stormdata, CROPDMG !=0)
cropdmg$CROPDMG <- damage2M(cropdmg$CROPDMG,cropdmg$CROPDMGEXP)
cropdmg <- subset(cropdmg, CROPDMG !=0) #Filter out zeros if any
cropdmg <- cropdmg[c("EVTYPE","CROPDMG")]
total_cropdmg_per_event_type <- aggregate(x = cropdmg$CROPDMG, by = list(cropdmg$EVTYPE), FUN = sum, na.rm=TRUE)
names(total_cropdmg_per_event_type) <- c("EVTYPE","CROPDMG")
avg_cropdmg <- mean(total_cropdmg_per_event_type$CROPDMG,na.rm=TRUE)
avg_cropdmg
## [1] 10131.08
cropdmg_above_avg <- total_cropdmg_per_event_type[which(total_cropdmg_per_event_type$CROPDMG > avg_cropdmg),]
# Sort by descending order for better readability:
cropdmg_above_avg <- cropdmg_above_avg[order(-cropdmg_above_avg$CROPDMG),]
cropdmg_above_avg
## EVTYPE CROPDMG
## 42 HAIL 579596.28
## 23 FLASH FLOOD 179200.46
## 27 FLOOD 168037.88
## 115 TSTM WIND 109202.60
## 107 TORNADO 100018.52
## 97 THUNDERSTORM WIND 66791.45
## 10 DROUGHT 33898.62
## 100 THUNDERSTORM WINDS 18684.93
## 60 HIGH WIND 17283.21
## 54 HEAVY RAIN 11122.80
With regard to population health, our analysis shows that tornadoes are the type of event that has caused the highest number of injuries (91346) and fatalities (5633). Other notable events that have caused a considerable number of injuries and fatalities are thunderstorm winds, floods, excessive heat and lightning.
A summary of the report highlighting the 10 most harmful events is shown here:
library(ggplot2) # provides ggplot(), aes(), geom_bar() and theme()
high_ft <- head(fatalities_above_avg,10)
p1 <- ggplot(high_ft,aes(x=EVTYPE,y=FATALITIES,fill=EVTYPE)) + geom_bar(stat="identity") + theme(legend.position="none") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("") + ylab("Total fatalities")
high_inj <- head(injuries_above_avg,10)
p2 <- ggplot(high_inj,aes(x=EVTYPE,y=INJURIES,fill=EVTYPE)) + geom_bar(stat="identity") + theme(legend.position="none") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("") + ylab("Total injuries")
# multiplot() is not part of ggplot2; it must be defined in the session before this call.
# Credits to the Cookbook for R website for helper function multiplot:
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_%28ggplot2%29/
multiplot(p1,p2,cols=2)
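ggplot2 draws a discrete axis in the order of the factor levels, which for plain labels is alphabetical, so the bars do not follow the descending order of the tables. A minimal sketch of one way to order the bars by value, using reorder() from base R, is shown below. The three rows are copied from the fatalities table above, and the plotting call is only an illustration that runs if ggplot2 is installed.

```r
# Three rows from the fatalities table above, deliberately out of order.
high_ft <- data.frame(EVTYPE     = c("FLASH FLOOD", "TORNADO", "EXCESSIVE HEAT"),
                      FATALITIES = c(978, 5633, 1903),
                      stringsAsFactors = FALSE)

# reorder() returns a factor whose levels are sorted by the second argument;
# negating the totals puts the largest one first.
high_ft$EVTYPE <- reorder(high_ft$EVTYPE, -high_ft$FATALITIES)
levels(high_ft$EVTYPE)   # TORNADO, EXCESSIVE HEAT, FLASH FLOOD

# The bars now follow the factor levels instead of alphabetical order.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(high_ft, ggplot2::aes(x = EVTYPE, y = FATALITIES)) +
    ggplot2::geom_bar(stat = "identity")
}
```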
With regard to economic damage, our analysis shows that tornadoes have had the highest impact on property damage (estimated 3212m USD) while hail has had the highest impact on crop damage (estimated 579596m USD).
Thunderstorm winds, floods and, in the case of property damage, lightning are also events with a significant impact.
A summary of the report highlighting the 10 most significant events is shown here:
high_pd <- head(propdmg_above_avg,10)
p3 <- ggplot(high_pd,aes(x=EVTYPE,y=PROPDMG,fill=EVTYPE)) + geom_bar(stat="identity") + theme(legend.position="none") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("") + ylab("Property damage (M)")
high_cd <- head(cropdmg_above_avg,10)
p4 <- ggplot(high_cd,aes(x=EVTYPE,y=CROPDMG,fill=EVTYPE)) + geom_bar(stat="identity") + theme(legend.position="none") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("") + ylab("Agricultural damage (M)")
# Credits to the Cookbook for R website for helper function multiplot:
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_%28ggplot2%29/
multiplot(p3,p4,cols=2)