Analysis of National Weather Events USA

Synopsis

Extreme weather events often cause economic damage and result in injuries and/or deaths among the people affected. Preparing for these weather events by prioritizing resources, and preventing such outcomes to the extent possible, is therefore a key concern of policy makers.

This project tries to answer two related questions:
* Which types of events are most harmful with respect to population health?
* Which types of events have the greatest economic consequences?

To answer the questions the project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any injuries, damages etc.

Load Packages Required

In a first step we will load all R packages required in further steps.

library(lubridate) # make dealing with dates easier
library(stringr) # make dealing with text strings easier
library(dplyr) # manipulate data frames (summarize, filter, select...)
library(ggplot2) # grammar of graphics
library(gridExtra) # grid graphics
library(RColorBrewer) # color palettes

Load and Process Raw Data

From the course web site (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2) we obtained the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, in which NOAA tracks characteristics of major storms and weather events in the United States. The data start in the year 1950 and end in November 2011.

Data Load

The data have been provided by the course as a comma-separated-value file compressed via the bzip2 algorithm. We first read in the data from this raw text file by wrapping R’s read.csv function around the bzfile function, which decompresses the data.

storm <- read.csv(bzfile("~/Dropbox/RepData_Peer_Assessment2/repdata_data_StormData.csv.bz2"), stringsAsFactors=TRUE)
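
To illustrate how read.csv and bzfile work together, the following minimal sketch writes a tiny CSV through a bzip2 connection and reads it back. The temporary file `tmp` and the toy data frame `toy` are hypothetical and only stand in for the real storm file.

```r
# Hypothetical toy data; not part of the storm database
toy <- data.frame(EVTYPE = c("HAIL", "TORNADO"), FATALITIES = c(0, 2))

# Write the toy data through a bzip2-compressing connection
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# bzfile() decompresses on the fly, so read.csv() sees plain CSV text
toy_back <- read.csv(bzfile(tmp), stringsAsFactors = TRUE)
dim(toy_back)
```

With stringsAsFactors=TRUE the text column comes back as a factor, which is what later allows summary to print counts per level.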

After reading in the data we check the dimensions of the data set:

dim(storm)

## [1] 902297     37

As we see the data set has 902297 rows and 37 attributes.

Data Preprocessing - General Remarks

The attributes we are most interested in are:

* Event Type (EVTYPE)
* Fatalities and Injuries (FATALITIES, INJURIES), as they describe the effect on population health
* Property and crop related damage (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP), as they contain the costs involved

To get a better feeling for this data, we inspect the last few records in the relevant subset and then use R’s summary function on the same subset.

storm_select<-storm %>% 
  select(EVTYPE,FATALITIES, INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
  
tail(storm_select)

##                EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 902292 WINTER WEATHER          0        0       0          K       0
## 902293      HIGH WIND          0        0       0          K       0
## 902294      HIGH WIND          0        0       0          K       0
## 902295      HIGH WIND          0        0       0          K       0
## 902296       BLIZZARD          0        0       0          K       0
## 902297     HEAVY SNOW          0        0       0          K       0
##        CROPDMGEXP
## 902292          K
## 902293          K
## 902294          K
## 902295          K
## 902296          K
## 902297          K

summary(storm_select)

##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9

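We see that none of these attributes contains missing values (NA), although the exponent attributes PROPDMGEXP and CROPDMGEXP are blank for many records.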
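
A more explicit check than reading the summary output could count NA values and blank exponents directly. The sketch below uses base R on a small hypothetical data frame `toy`; on the real data the same calls would be applied to storm_select.

```r
# Hypothetical toy subset with one blank exponent
toy <- data.frame(PROPDMG = c(0, 2.5, 10),
                  PROPDMGEXP = c("", "K", "M"),
                  stringsAsFactors = TRUE)

# Number of NA values per attribute
na_counts <- colSums(is.na(toy))

# Number of records whose exponent is blank (not NA, just empty)
blank_exp <- sum(toy$PROPDMGEXP == "")

na_counts
blank_exp
```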

Data Preprocessing - Costs

We also see from the attributes PROPDMGEXP and CROPDMGEXP that the damage values are coded on different scales: alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. If the attribute is empty, the value is taken ‘as is’. We therefore create new attributes that hold the values on a common scale. Other characters found, such as “0”, may be data-capture errors, so we discard these rows.

storm_cor <- data.frame(storm %>%
  mutate(PROPDMG_CORRECTED=ifelse(PROPDMGEXP=="K",PROPDMG*1000,ifelse(PROPDMGEXP=="M",PROPDMG*1000000,ifelse(PROPDMGEXP=="B",PROPDMG*1000000000,PROPDMG)))) %>%
  mutate(CROPDMG_CORRECTED=ifelse(CROPDMGEXP=="K",CROPDMG*1000,ifelse(CROPDMGEXP=="M",CROPDMG*1000000,ifelse(CROPDMGEXP=="B",CROPDMG*1000000000,CROPDMG)))) %>%
  filter(PROPDMGEXP=="K"|PROPDMGEXP=="M"|PROPDMGEXP=="B"|PROPDMGEXP=="") %>%
  filter(CROPDMGEXP=="K"|CROPDMGEXP=="M"|CROPDMGEXP=="B"|CROPDMGEXP==""))
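
The same scaling rule can also be expressed as a lookup table instead of nested ifelse calls. This is only a sketch in base R: the names `exp_multiplier` and `scale_damage` are hypothetical, and it assumes that an empty exponent means the value is taken as is, as described above.

```r
# Multipliers for the documented exponent codes
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scale a damage value by its exponent code
scale_damage <- function(value, exponent) {
  exponent <- as.character(exponent)
  mult <- unname(exp_multiplier[exponent])  # NA for codes not in the table
  mult[exponent == ""] <- 1                 # empty exponent: value as is
  value * mult
}

scaled <- scale_damage(c(25, 2.5, 1, 10, 3), c("K", "M", "B", "", "0"))
scaled
```

Undocumented codes such as “0” yield NA here, which marks exactly the rows that the filter steps discard.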

Finally, we add both costs to obtain the total damage.

storm_cor <- storm_cor %>%
  mutate(TOTALDMG_CORRECTED=PROPDMG_CORRECTED+CROPDMG_CORRECTED)

And we do the same for injuries and fatalities.

storm_cor <- storm_cor %>%
  mutate(PEOPLE_AFFECTED=INJURIES+FATALITIES)

Data Preprocessing - Date

Another attribute we will most likely use is the date of the event. We simply extract the year, which is sufficient for the questions we need to answer and spares us converting dates that were captured in different time zones to a common one.

storm_cor <- data.frame(storm_cor %>%
  mutate(BGIN=as.Date(BGN_DATE,format="%m/%d/%Y %H:%M:%S")) %>% 
  mutate(YEAR=year(BGIN)))
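
The year extraction can be tried in isolation on a couple of date strings. The two values in `bgn` are illustrative strings in the month/day/year layout assumed by the format above, and the sketch uses base R’s format function in place of lubridate’s year.

```r
# Illustrative date strings in the assumed "%m/%d/%Y %H:%M:%S" layout
bgn <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# Parse to Date; the time-of-day part is read and then dropped
parsed <- as.Date(bgn, format = "%m/%d/%Y %H:%M:%S")

# Extract the calendar year as an integer
years <- as.integer(format(parsed, "%Y"))
years
```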

Data Preprocessing - Event Types

Looking deeper into the event types, we find that many types have a low impact on fatalities and/or cost. We can further see that event types are partially mistyped or recorded in several similar variants (for example RIP CURRENT and RIP CURRENTS).

storm_cor %>%
  group_by(EVTYPE) %>%
  summarize(TOTAL_FATALITIES=sum(FATALITIES),TOTAL_DAMAGE=sum(TOTALDMG_CORRECTED),NUMBER_OF_EVENTS=n()) %>%
  arrange(desc(TOTAL_FATALITIES)) %>%
  head(50)

## Source: local data frame [50 x 4]
## 
##                        EVTYPE TOTAL_FATALITIES TOTAL_DAMAGE
## 1                     TORNADO             5630  57290486627
## 2              EXCESSIVE HEAT             1903    500156665
## 3                 FLASH FLOOD              978  17561571402
## 4                        HEAT              937    403258636
## 5                   LIGHTNING              816    940762889
## 6                   TSTM WIND              504   5039149129
## 7                       FLOOD              470 150319689960
## 8                 RIP CURRENT              368         1193
## 9                   HIGH WIND              246   5908626269
## 10                  AVALANCHE              224      3722032
## 11               WINTER STORM              206   6715445967
## 12               RIP CURRENTS              204       162304
## 13                  HEAT WAVE              172     16010119
## 14               EXTREME COLD              160   1360710920
## 15          THUNDERSTORM WIND              133   3897965326
## 16    EXTREME COLD/WIND CHILL              125      8698283
## 17                 HEAVY SNOW              125   1067251927
## 18                STRONG WIND              103    240195941
## 19                   BLIZZARD              101    771274927
## 20                  HIGH SURF              101     89575324
## 21                 HEAVY RAIN               98   1427654303
## 22               EXTREME HEAT               96      5115021
## 23            COLD/WIND CHILL               95      2590019
## 24                  ICE STORM               89   8967038855
## 25                   WILDFIRE               75   5060587764
## 26          HURRICANE/TYPHOON               64  71913712855
## 27         THUNDERSTORM WINDS               64   1923857456
## 28                        FOG               62     13156037
## 29                  HURRICANE               61  14610229114
## 30             TROPICAL STORM               58   8382236815
## 31       HEAVY SURF/HIGH SURF               42      9870224
## 32                  LANDSLIDE               38    344613280
## 33                       COLD               35       500072
## 34                 HIGH WINDS               35    649045766
## 35                    TSUNAMI               33    144082001
## 36             WINTER WEATHER               33     35866367
## 37  UNSEASONABLY WARM AND DRY               29           13
## 38       URBAN/SML STREAM FLD               28     66800976
## 39         WINTER WEATHER/MIX               28      6373104
## 40 TORNADOES, TSTM WIND, HAIL               25   1602500000
## 41                       WIND               23      8984839
## 42                 DUST STORM               22      8649225
## 43             FLASH FLOODING               19    322868271
## 44                  DENSE FOG               18      9674373
## 45          EXTREME WINDCHILL               17     17755202
## 46          FLOOD/FLASH FLOOD               17    268173605
## 47      RECORD/EXCESSIVE HEAT               17            3
## 48                       HAIL               15  18727909547
## 49              COLD AND SNOW               14            1
## 50          FLASH FLOOD/FLOOD               14    273005016
## Variables not shown: NUMBER_OF_EVENTS (int)

To overcome this we group the event types into classes. As inspiration for our taxonomy we used the following source: http://c40-production-images.s3.amazonaws.com/researches/images/33_C40_Arup_Climate_Hazard_Typology.original.pdf?1426352208. As you can see, we have modified the proposed grouping.

storm_cor <- storm_cor %>%
mutate(EVENT=toupper(EVTYPE)) %>%
mutate(EVENT=str_replace_all(EVENT,"[[:punct:]]",""))

storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*TOR.*|.*WIND.*|.*STORM.*|.*HURRI.*|.*TYPH.*", EVENT)==T,"Wind/Storm","Other"))

storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*SNOW.*|.*RAIN.*|.*BLIZZ.*|.*PRECIP.*", EVENT)==T,"Precipitation",EVENTCLASS))

storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*FLOOD.*|.*TSUNAMI.*", EVENT)==T,"Flood",EVENTCLASS))

storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*COLD.*|.*WINT.*|.*CHILL.*|.*FREEZ.*|.*LOW.*|.*COOL.*", EVENT)==T,"Cold Weather",EVENTCLASS))

storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*HEAT.*|.*HOT.*|.*DROUGHT.*|.*HIGH TEMP.*|.*RECORD HIGH.*|.*DRY.*", EVENT)==T,"Hot Weather",EVENTCLASS))

storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*THUNDER.*|.*LIGHT.*", EVENT)==T,"Thunder/Lightning",EVENTCLASS))

storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*AVALA.*|.*SLIDE.*", EVENT)==T,"Avalanche/Land Slide",EVENTCLASS))

storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*FIRE.*", EVENT)==T,"Fire",EVENTCLASS))

storm_cor <- storm_cor %>%
mutate(EVENTCLASS=ifelse(grepl(".*HAIL.*", EVENT)==T,"Hail",EVENTCLASS))
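
Because each step overwrites EVENTCLASS, an event matching several patterns ends up in the class assigned last. The following base R sketch makes this ordering explicit with a shortened rule table; the names `rules` and `classify` are hypothetical, and only three of the classes are included.

```r
# Shortened, ordered rule table: later entries override earlier ones
rules <- c("Wind/Storm" = "TOR|WIND|STORM|HURRI|TYPH",
           "Flood"      = "FLOOD|TSUNAMI",
           "Hail"       = "HAIL")

# Hypothetical helper: apply the rules in order, starting from "Other"
classify <- function(event) {
  cls <- rep("Other", length(event))
  for (label in names(rules)) {
    cls[grepl(rules[[label]], event)] <- label
  }
  cls
}

classes <- classify(c("TORNADO", "FLASH FLOOD", "HAIL", "FOG", "TSTM WINDHAIL"))
classes
```

The last toy event matches both the wind and the hail pattern and is therefore classified as Hail, since that rule is applied last.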

Data Preprocessing - Relevant Subset

We finally select only the attributes used in subsequent analysis.

storm_cor<-storm_cor %>%
  select(STATE,EVENT,EVENTCLASS,YEAR,PEOPLE_AFFECTED,FATALITIES,INJURIES,TOTALDMG_CORRECTED)

We can see the data set has reduced dimensions: 901921 rows and 8 attributes.

Results

Effect of Weather Events on People’s Health

To analyse the effects of extreme weather events on people’s health we will look at the total number of people affected, which is the sum of injuries and fatalities.

yearly_event<-storm_cor %>%
  group_by(YEAR,EVENTCLASS) %>%
  summarize(TOTAL_AFFECTED=sum(PEOPLE_AFFECTED))


ggplot(data=yearly_event,aes(x=YEAR,y=TOTAL_AFFECTED,fill=EVENTCLASS))+geom_bar(stat="identity")+scale_fill_brewer(palette="Paired",name="Event Class")+xlab("Year")+ylab("Injuries and Fatalities")+ggtitle("Effects of Extreme Weather Events - US")+scale_x_continuous(breaks=c(1950,1960,1970,1980,1990,2000,2010))

The first observation we can make from the figure is that, before 1993, the database holds almost exclusively records for wind/storm related events. Other events are recorded only for more recent years.

The second, very clear observation is that over the years wind/storm related events cause most of the injuries and fatalities, followed by hot weather. The flood-related peak seems to have been a one-time occurrence.
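
A reading of the figure like this could be cross-checked by summing the people affected per event class over all years and sorting the result. The sketch below does this in base R on a hypothetical data frame `toy` with made-up numbers; these are not results from the storm data.

```r
# Made-up toy records; not taken from the storm database
toy <- data.frame(EVENTCLASS = c("Wind/Storm", "Flood", "Wind/Storm", "Hot Weather"),
                  PEOPLE_AFFECTED = c(10, 3, 5, 7))

# Total people affected per event class
totals <- aggregate(PEOPLE_AFFECTED ~ EVENTCLASS, data = toy, FUN = sum)

# Sort classes from most to least affected
totals <- totals[order(-totals$PEOPLE_AFFECTED), ]
totals
```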

Effect of Weather Events on Damages/Costs

To analyse the biggest economic effects of extreme weather events we will look at the total cost of damages to both property and crops.

yearly_event<-storm_cor %>%
  group_by(YEAR,EVENTCLASS) %>%
  summarize(TOTAL_DAMAGE=sum(TOTALDMG_CORRECTED))


ggplot(data=yearly_event,aes(x=YEAR,y=TOTAL_DAMAGE,fill=EVENTCLASS))+geom_bar(stat="identity")+scale_fill_brewer(palette="Paired",name="Event Class")+xlab("Year")+ylab("Total Damage - Costs in US $")+ggtitle("Effects of Extreme Weather Events - US")+scale_x_continuous(breaks=c(1950,1960,1970,1980,1990,2000,2010))

Even if the absolute highs in 2005/2006 could be due to wrong data entry or coding of the scale of one or more events in these years, the overall picture shows that wind/storm and flood events have the greatest economic consequences.