Extreme weather events often cause economic damage as well as injuries and deaths among the people affected. Preparing for these events by prioritizing resources, and preventing such outcomes to the extent possible, is therefore a key concern of policy makers.
This project tries to answer two related questions:
* Which types of events are most harmful with respect to population health?
* Which types of events have the greatest economic consequences?
To answer these questions, the project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any injuries and damage.
In a first step we will load all R packages required in further steps.
library(lubridate) # make dealing with dates easier
library(stringr) # make dealing with text strings easier
library(dplyr) # manipulate data frames (summarize, filter, select...)
library(ggplot2) # grammar of graphics
library(gridExtra) # grid graphics
library(RColorBrewer) # color palettes
From the course web site - https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 - we obtained the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, in which NOAA tracks characteristics of major storms and weather events in the United States. The data start in the year 1950 and end in November 2011.
The data have been provided by the course as a comma-separated-value file compressed via the bzip2 algorithm. We first read in the data from this raw text file by wrapping R’s read.csv function around the bzfile function, which decompresses the data.
storm <- read.csv(bzfile("~/Dropbox/RepData_Peer_Assessment2/repdata_data_StormData.csv.bz2"), stringsAsFactors=TRUE)
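As a small self-contained illustration of this reading step, the following sketch writes a tiny made-up CSV to a bzip2-compressed temporary file and reads it back in the same way. The file name `toy_storm.csv.bz2` and the two toy rows are assumptions for illustration only and are not part of the storm data.

```r
# Toy round trip: write a bzip2-compressed CSV, then read it back.
# The path and the rows below are hypothetical, for illustration only.
toy_path <- file.path(tempdir(), "toy_storm.csv.bz2")

toy <- data.frame(EVTYPE = c("HAIL", "TORNADO"),
                  FATALITIES = c(0, 2))

con <- bzfile(toy_path, open = "w")   # compressed connection opened for writing
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() reads through the bzfile() connection, which decompresses the data
toy_read <- read.csv(bzfile(toy_path), stringsAsFactors = TRUE)
dim(toy_read)
```

With stringsAsFactors=TRUE the text column comes back as a factor, which is why summary() later reports level counts for EVTYPE and the exponent columns.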
After reading in the data we check the dimensions of the data set.
dim(storm)
## [1] 902297 37
As we see the data set has 902297 rows and 37 attributes.
The attributes we are most interested in are EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
To get a better feeling for these data, we inspect the last few records in the relevant subset and then apply R’s summary function to the same subset.
storm_select<-storm %>%
select(EVTYPE,FATALITIES, INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
tail(storm_select)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 902292 WINTER WEATHER 0 0 0 K 0
## 902293 HIGH WIND 0 0 0 K 0
## 902294 HIGH WIND 0 0 0 K 0
## 902295 HIGH WIND 0 0 0 K 0
## 902296 BLIZZARD 0 0 0 K 0
## 902297 HEAVY SNOW 0 0 0 K 0
## CROPDMGEXP
## 902292 K
## 902293 K
## 902294 K
## 902295 K
## 902296 K
## 902297 K
summary(storm_select)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
We see that none of these attributes contains missing (NA) values, although many records have an empty damage exponent.
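The summary output only shows NA counts implicitly (by their absence). A more explicit check could look like the following minimal sketch, which uses a made-up three-row data frame rather than the real storm data and counts NA values and empty strings separately.

```r
# Toy data frame (made-up values) mimicking two of the columns
toy <- data.frame(FATALITIES = c(0, 1, 0),
                  PROPDMGEXP = c("K", "", "M"),
                  stringsAsFactors = TRUE)

# NA values per column
na_counts <- colSums(is.na(toy))

# Empty strings per column; these are not NA and need their own check
blank_counts <- sapply(toy, function(col) sum(as.character(col) == ""))

na_counts
blank_counts
```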
We also see from the attributes PROPDMGEXP and CROPDMGEXP that the damage values are coded on different scales: alphabetical characters signify the magnitude, with “K” for thousands, “M” for millions, and “B” for billions. If the attribute is empty, the value is taken ‘as is’. We therefore create new attributes that hold the values on a common scale. Other characters found, such as “0”, may be data-entry errors, so we discard these rows.
storm_cor <- data.frame(storm %>%
  # scale property damage by its exponent code; an empty code means 'as is'
  mutate(PROPDMG_CORRECTED=ifelse(PROPDMGEXP=="K",PROPDMG*1000,ifelse(PROPDMGEXP=="M",PROPDMG*1000000,ifelse(PROPDMGEXP=="B",PROPDMG*1000000000,PROPDMG)))) %>%
  # same for crop damage; fall back to the unscaled CROPDMG value when the code is empty
  mutate(CROPDMG_CORRECTED=ifelse(CROPDMGEXP=="K",CROPDMG*1000,ifelse(CROPDMGEXP=="M",CROPDMG*1000000,ifelse(CROPDMGEXP=="B",CROPDMG*1000000000,CROPDMG)))) %>%
  # keep only rows whose exponent codes are K, M, B or empty
  filter(PROPDMGEXP=="K"|PROPDMGEXP=="M"|PROPDMGEXP=="B"|PROPDMGEXP=="") %>%
  filter(CROPDMGEXP=="K"|CROPDMGEXP=="M"|CROPDMGEXP=="B"|CROPDMGEXP==""))
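The scaling rule described above (“K”, “M”, “B”, or empty for ‘as is’) can also be expressed as a lookup table instead of nested ifelse calls. The following is only a sketch of that alternative, in base R and on made-up values; the helper name `scale_damage` and the choice to mark unknown codes as NA are assumptions of this example, not part of the original analysis.

```r
# Multipliers for the documented exponent codes; "" means the value is 'as is'.
# Codes outside this table yield NA (an assumption of this sketch).
exp_codes   <- c("", "K", "M", "B")
multipliers <- c(1, 1e3, 1e6, 1e9)

# Hypothetical helper: scale a damage value by its exponent code
scale_damage <- function(value, exp_code) {
  idx <- match(as.character(exp_code), exp_codes)  # NA for unknown codes
  value * multipliers[idx]
}

# Made-up example values, including one unrecognised code ("0")
toy <- data.frame(CROPDMG    = c(5, 2.5, 10, 1, 3),
                  CROPDMGEXP = c("K", "M", "", "B", "0"),
                  stringsAsFactors = TRUE)

toy$CROPDMG_CORRECTED <- scale_damage(toy$CROPDMG, toy$CROPDMGEXP)

# Drop rows with unknown codes, mirroring the filter step
toy_valid <- toy[!is.na(toy$CROPDMG_CORRECTED), ]
toy_valid
```

Because the fallback is part of the table, the numeric value and its exponent code cannot be mixed up in the final branch.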
We finally add both costs to obtain the total damage.
storm_cor <- storm_cor %>%
mutate(TOTALDMG_CORRECTED=PROPDMG_CORRECTED+CROPDMG_CORRECTED)
And we do the same for injuries and fatalities.
storm_cor <- storm_cor %>%
mutate(PEOPLE_AFFECTED=INJURIES+FATALITIES)
Another attribute we will most likely use is the date of the event. We simply extract the year, which is sufficient for the questions we need to answer and spares us from converting dates that were recorded in different time zones to a common one.
storm_cor <- data.frame(storm_cor %>%
mutate(BGIN=as.Date(BGN_DATE,format="%m/%d/%Y %H:%M:%S")) %>%
mutate(YEAR=year(BGIN)))
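To show what this step does on individual values, here is a minimal base R sketch. The two timestamps are made-up strings in the layout that the format string above implies; format(..., "%Y") is used here in place of lubridate's year() only to keep the example free of package dependencies.

```r
# Made-up timestamps in the textual layout implied by the format string
bgn <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# as.Date() keeps only the calendar date, so no time zone handling is needed
bgin <- as.Date(bgn, format = "%m/%d/%Y %H:%M:%S")

# Extract the year as an integer
yr <- as.integer(format(bgin, "%Y"))
yr
```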
If we take a closer look at the event types, we find that many types have little impact on fatalities and/or cost. We can further see that some event types are mistyped or recorded in several similar variants.
storm_cor %>%
group_by(EVTYPE) %>%
summarize(TOTAL_FATALITIES=sum(FATALITIES),TOTAL_DAMAGE=sum(TOTALDMG_CORRECTED),NUMBER_OF_EVENTS=n()) %>%
arrange(desc(TOTAL_FATALITIES)) %>%
head(50)
## Source: local data frame [50 x 4]
##
## EVTYPE TOTAL_FATALITIES TOTAL_DAMAGE
## 1 TORNADO 5630 57290486627
## 2 EXCESSIVE HEAT 1903 500156665
## 3 FLASH FLOOD 978 17561571402
## 4 HEAT 937 403258636
## 5 LIGHTNING 816 940762889
## 6 TSTM WIND 504 5039149129
## 7 FLOOD 470 150319689960
## 8 RIP CURRENT 368 1193
## 9 HIGH WIND 246 5908626269
## 10 AVALANCHE 224 3722032
## 11 WINTER STORM 206 6715445967
## 12 RIP CURRENTS 204 162304
## 13 HEAT WAVE 172 16010119
## 14 EXTREME COLD 160 1360710920
## 15 THUNDERSTORM WIND 133 3897965326
## 16 EXTREME COLD/WIND CHILL 125 8698283
## 17 HEAVY SNOW 125 1067251927
## 18 STRONG WIND 103 240195941
## 19 BLIZZARD 101 771274927
## 20 HIGH SURF 101 89575324
## 21 HEAVY RAIN 98 1427654303
## 22 EXTREME HEAT 96 5115021
## 23 COLD/WIND CHILL 95 2590019
## 24 ICE STORM 89 8967038855
## 25 WILDFIRE 75 5060587764
## 26 HURRICANE/TYPHOON 64 71913712855
## 27 THUNDERSTORM WINDS 64 1923857456
## 28 FOG 62 13156037
## 29 HURRICANE 61 14610229114
## 30 TROPICAL STORM 58 8382236815
## 31 HEAVY SURF/HIGH SURF 42 9870224
## 32 LANDSLIDE 38 344613280
## 33 COLD 35 500072
## 34 HIGH WINDS 35 649045766
## 35 TSUNAMI 33 144082001
## 36 WINTER WEATHER 33 35866367
## 37 UNSEASONABLY WARM AND DRY 29 13
## 38 URBAN/SML STREAM FLD 28 66800976
## 39 WINTER WEATHER/MIX 28 6373104
## 40 TORNADOES, TSTM WIND, HAIL 25 1602500000
## 41 WIND 23 8984839
## 42 DUST STORM 22 8649225
## 43 FLASH FLOODING 19 322868271
## 44 DENSE FOG 18 9674373
## 45 EXTREME WINDCHILL 17 17755202
## 46 FLOOD/FLASH FLOOD 17 268173605
## 47 RECORD/EXCESSIVE HEAT 17 3
## 48 HAIL 15 18727909547
## 49 COLD AND SNOW 14 1
## 50 FLASH FLOOD/FLOOD 14 273005016
## Variables not shown: NUMBER_OF_EVENTS (int)
To overcome this we group the event types into classes. As input for our taxonomy we used the following source: http://c40-production-images.s3.amazonaws.com/researches/images/33_C40_Arup_Climate_Hazard_Typology.original.pdf?1426352208. As you can see, we have modified the proposed grouping.
storm_cor <- storm_cor %>%
mutate(EVENT=toupper(EVTYPE)) %>%
mutate(EVENT=str_replace_all(EVENT,"[[:punct:]]",""))
storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*TOR.*|.*WIND.*|.*STORM.*|.*HURRI.*|.*TYPH.*", EVENT)==T,"Wind/Storm","Other"))
storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*SNOW.*|.*RAIN.*|.*BLIZZ.*|.*PRECIP.*", EVENT)==T,"Precipitation",EVENTCLASS))
storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*FLOOD.*|.*TSUNAMI.*", EVENT)==T,"Flood",EVENTCLASS))
storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*COLD.*|.*WINT.*|.*CHILL.*|.*FREEZ.*|.*LOW.*|.*COOL.*", EVENT)==T,"Cold Weather",EVENTCLASS))
storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*HEAT.*|.*HOT.*|.*DROUGHT.*|.*HIGH TEMP.*|.*RECORD HIGH.*|.*DRY.*", EVENT)==T,"Hot Weather",EVENTCLASS))
storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*THUNDER.*|.*LIGHT.*", EVENT)==T,"Thunder/Lightning",EVENTCLASS))
storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*AVALA.*|.*SLIDE.*", EVENT)==T,"Avalanche/Land Slide",EVENTCLASS))
storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*FIRE.*", EVENT)==T,"Fire",EVENTCLASS))
storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*HAIL.*", EVENT)==T,"Hail",EVENTCLASS))
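The classification above applies its rules in sequence, so a later rule overrides an earlier one whenever both match the same event name. The following sketch makes that ordering explicit with a small rule table in base R. The patterns are a reduced, illustrative subset of the ones above, and the names `rules` and `classify_event` are hypothetical helpers introduced only for this example.

```r
# Ordered rules: a later match overrides an earlier one.
# Patterns and labels are a reduced, illustrative subset.
rules <- data.frame(
  pattern = c("TOR|WIND|STORM", "FLOOD|TSUNAMI", "THUNDER|LIGHT", "HAIL"),
  label   = c("Wind/Storm", "Flood", "Thunder/Lightning", "Hail"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assign a class to each event name
classify_event <- function(event) {
  cls <- rep("Other", length(event))          # default class
  for (i in seq_len(nrow(rules))) {
    cls[grepl(rules$pattern[i], event)] <- rules$label[i]
  }
  cls
}

events <- c("TORNADO", "THUNDERSTORM WIND", "FLASH FLOOD", "HAIL", "FOG")
classes <- classify_event(events)
classes
```

In this toy run "THUNDERSTORM WIND" first matches the wind/storm rule and is then reassigned by the later thunder/lightning rule, which is the same override behaviour as in the chained mutate calls.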
We finally select only the attributes used in subsequent analysis.
storm_cor<-storm_cor %>%
select(STATE,EVENT,EVENTCLASS,YEAR,PEOPLE_AFFECTED,FATALITIES,INJURIES,TOTALDMG_CORRECTED)
We can see the data set has reduced dimensions: 901921 rows and 8 attributes.
To analyse the effects of extreme weather events on people’s health we will look at the total number of people affected, which is the sum of injuries and fatalities.
yearly_event<-storm_cor %>%
group_by(YEAR,EVENTCLASS) %>%
summarize(TOTAL_AFFECTED=sum(PEOPLE_AFFECTED))
ggplot(data=yearly_event,aes(x=YEAR,y=TOTAL_AFFECTED,fill=EVENTCLASS))+
  geom_bar(stat="identity")+
  scale_fill_brewer(palette="Paired",name="Event Class")+
  xlab("Year")+
  ylab("Injuries and Fatalities")+
  ggtitle("Effects of Extreme Weather Events - US")+
  scale_x_continuous(breaks=c(1950,1960,1970,1980,1990,2000,2010))
The first observation we can make from the figure is that before 1993 the database holds almost exclusively records for wind/storm related events. Other events are recorded only for more recent years.
The second very clear observation is that, over the years, wind/storm related events cause most of the injuries and fatalities, followed by hot weather. Flood appears to have been a one-time occurrence.
To analyse the biggest economic effects of extreme weather events we will look at the total cost of damages to both property and crops.
yearly_event<-storm_cor %>%
group_by(YEAR,EVENTCLASS) %>%
summarize(TOTAL_DAMAGE=sum(TOTALDMG_CORRECTED))
ggplot(data=yearly_event,aes(x=YEAR,y=TOTAL_DAMAGE,fill=EVENTCLASS))+
  geom_bar(stat="identity")+
  scale_fill_brewer(palette="Paired",name="Event Class")+
  xlab("Year")+
  ylab("Total Damage - Costs in US $")+
  ggtitle("Effects of Extreme Weather Events - US")+
  scale_x_continuous(breaks=c(1950,1960,1970,1980,1990,2000,2010))
Even if the absolute peaks in 2005/2006 were partly due to wrong data entry or scale coding for one or more events in these years, the overall picture shows that wind/storm and flood events have the greatest economic consequences.
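One way to follow up on possible coding errors is to list the individual records with the largest damage and see how much of the total each one accounts for. The sketch below shows the idea in base R on made-up records; the values are hypothetical and do not reproduce any figure from the storm database.

```r
# Made-up damage records; the third row mimics a suspiciously large entry
toy <- data.frame(
  YEAR  = c(2005, 2005, 2006, 2006),
  EVENT = c("HURRICANE", "FLOOD", "FLOOD", "HAIL"),
  TOTALDMG_CORRECTED = c(3.0e10, 2.0e8, 1.15e11, 5.0e6)
)

# Sort by damage, largest first, and compute each record's share of the total
top <- toy[order(-toy$TOTALDMG_CORRECTED), ]
top$SHARE <- top$TOTALDMG_CORRECTED / sum(toy$TOTALDMG_CORRECTED)

head(top, 3)
```

A single record that dominates the total would be a natural candidate for checking against its original exponent code.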