Synopsis

Weather events can have dramatic consequences. In this analysis, we examine the effects of weather events on two main indicators: “HEALTH” (fatalities and injuries) and “ECONOMIC” (damage to property and crops).

Data Processing

Loading the dataset

The file is downloaded directly from this repository.
Documentation is provided here.

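library(data.table) # provides fread(); reading a .bz2 file also needs the R.utils package installed
df <- fread("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")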

After reading in the data using fread, we check which columns and types were loaded.

str(df)
## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, ".internal.selfref")=<externalptr>

There are 902297 events listed in the data. The column names are not the best, but they are partially descriptive. As this is an external data set, we don’t know exactly what each name means, so keeping the names ‘as is’ is the safest bet.

Transformation of data

The EVTYPE column is currently of type ‘chr’. We will convert its values to uppercase, turn the column into a factor, and view part of the result:

df$EVTYPE <- as.factor(toupper(df$EVTYPE)) # Converting all values in EVTYPE to UPPERCASE
head(table(df$EVTYPE), 24)
## 
##          HIGH SURF ADVISORY               COASTAL FLOOD 
##                           1                           1 
##                 FLASH FLOOD                   LIGHTNING 
##                           1                           1 
##                   TSTM WIND             TSTM WIND (G45) 
##                           4                           1 
##                  WATERSPOUT                        WIND 
##                           1                           1 
##                           ?             ABNORMAL WARMTH 
##                           1                           4 
##              ABNORMALLY DRY              ABNORMALLY WET 
##                           2                           1 
##        ACCUMULATED SNOWFALL         AGRICULTURAL FREEZE 
##                           4                           6 
##               APACHE COUNTY      ASTRONOMICAL HIGH TIDE 
##                           1                         103 
##       ASTRONOMICAL LOW TIDE                    AVALANCE 
##                         174                           1 
##                   AVALANCHE                BEACH EROSIN 
##                         386                           1 
##               BEACH EROSION BEACH EROSION/COASTAL FLOOD 
##                           4                           1 
##                 BEACH FLOOD  BELOW NORMAL PRECIPITATION 
##                           2                           2

As we can see, this is not an ideal grouping. There are still 898 factor levels, although in practice many of them overlap. For example, ‘AVALANCE’ is a misspelling of ‘AVALANCHE’, and ‘BEACH EROSIN’ of ‘BEACH EROSION’. Another example is the set of levels with TSTM in the name: there are quite a few, some plain and some with a suffix such as ‘(G45)’, but they probably describe similar events (more examples are in the full table). For now, we will leave this as it is, but in the results we should account for these inconsistencies and not draw conclusions too broadly.
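The overlap described above could be reduced with a small clean-up step before building the factor. The following is a minimal sketch, not part of the original analysis: `normalize_evtype` is a hypothetical helper, the typo map covers only the two misspellings visible in the table, and the input is a toy vector. The first few labels in the table sort ahead of ‘?’, which suggests they carry leading whitespace, so the sketch trims that as well.

```r
# Hypothetical helper (an assumption, not the author's method): tidy raw EVTYPE labels.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))                                   # trim whitespace, unify case
  x <- gsub("\\s+", " ", x, perl = TRUE)                    # collapse repeated whitespace
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)   # expand the TSTM abbreviation
  # Hand-written typo map; only the two typos visible in the table are covered
  typos <- c("AVALANCE" = "AVALANCHE", "BEACH EROSIN" = "BEACH EROSION")
  hit <- x %in% names(typos)
  x[hit] <- unname(typos[x[hit]])
  x
}

# Toy input mimicking labels from the table (not read from the data set)
raw <- c("   HIGH SURF ADVISORY", " TSTM WIND", "TSTM WIND (G45)",
         "AVALANCE", "Avalanche", "BEACH EROSIN")
cleaned <- normalize_evtype(raw)
print(table(cleaned))
```

Variants such as ‘(G45)’ are deliberately left alone here; whether they should be merged into the plain label is a judgment call that depends on the question being asked.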

Results

Which types of events are most harmful to population health?

To answer this question, we must first define population health. We have two variables per event that relate to ‘population health’:

  • FATALITIES
  • INJURIES

It is difficult to compare these two, especially without proper understanding of what the National Weather Service (NWS) considers an ‘injury’. Is a scratch the same as a broken leg?
Another issue is comparing across events. Is an event that consistently has 10 fatalities as harmful as an event that has a 1% probability of 1000 fatalities (and 99% chance of no fatalities)? On average we would rate them the same, but in terms of impact the latter would probably be more devastating.
The simplest approach is to assume that a fatality is as harmful to population health as an injury, and not to correct for the spread in outcomes.
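The trade-off between the average and the spread can be made concrete with a small worked example. The numbers below are hypothetical and mirror the two cases in the text: one event type that always causes 10 fatalities, and one that causes 1000 fatalities in 1 out of 100 occurrences.

```r
# Hypothetical fatality counts for 100 occurrences of two event types
steady <- rep(10, 100)          # every occurrence has 10 fatalities
rare   <- c(1000, rep(0, 99))   # one occurrence has 1000 fatalities, the rest none

# Same mean, very different spread and worst case
summary_stats <- data.frame(
  event = c("steady", "rare"),
  mean  = c(mean(steady), mean(rare)),
  sd    = c(sd(steady), sd(rare)),
  max   = c(max(steady), max(rare)),
  stringsAsFactors = FALSE
)
print(summary_stats)
```

Both rows have a mean of 10, but the standard deviation is 0 for the steady case and 100 for the rare one. Summing or averaging by event type, as done in this analysis, hides that difference.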

library(tidyr)   # pivot_longer() and %>%
library(ggplot2) # plotting
# Collapse the data to one row per EVTYPE, and sum all FATALITIES/INJURIES
df_fatalities_injuries <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = df, sum)
# To compare events, add injuries and fatalities (also to limit results, as there are too many unique events for legible graphs)
df_fatalities_injuries$SUM_FATALITIES_INJURIES <- rowSums(cbind(df_fatalities_injuries$FATALITIES, df_fatalities_injuries$INJURIES), na.rm = TRUE)
df_fatalities_injuries <- df_fatalities_injuries[order(df_fatalities_injuries$SUM_FATALITIES_INJURIES, decreasing = TRUE), ]
df_fatalities_injuries <- head(df_fatalities_injuries, 10)
# coord_flip() draws x (EVTYPE) on the vertical axis, so x is labelled 'Event' and y 'Count'
ggplot(data = df_fatalities_injuries %>% pivot_longer(cols = c(FATALITIES, INJURIES), names_to = 'TYPE', values_to = 'COUNT'), aes(x = EVTYPE, y = COUNT, fill = TYPE)) + geom_bar(stat = 'identity', position = 'stack') + labs(title = 'Sum of fatalities/injuries by event', x = 'Event', y = 'Count') + coord_flip()

From this figure, tornadoes are by far the most harmful event type, followed by thunderstorms. Strangely enough, hurricanes aren’t in the top 10 (or even close), although this may partly reflect hurricane records being spread over several EVTYPE labels.
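One way to check whether an event family is under-counted because it is split across labels is to sum over every label that contains a keyword. This is a rough sketch on invented numbers: `sum_matching` is a hypothetical helper, and the labels and totals in `totals` are illustrative rather than taken from the data set.

```r
# Toy per-label totals; the numbers are invented for illustration only
totals <- data.frame(
  EVTYPE = c("HURRICANE", "HURRICANE/TYPHOON", "HURRICANE OPAL", "TORNADO"),
  SUM_FATALITIES_INJURIES = c(10, 20, 5, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: combine all labels that contain a fixed keyword
sum_matching <- function(data, keyword) {
  hit <- grepl(keyword, data$EVTYPE, fixed = TRUE)
  list(labels = data$EVTYPE[hit],
       total  = sum(data$SUM_FATALITIES_INJURIES[hit]))
}

hurricane <- sum_matching(totals, "HURRICANE")
print(hurricane)
```

Applied to the full per-EVTYPE table (before it is cut down to the top 10), the same call would show how many labels contribute to a family and what their combined total is.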

Which types of events have the greatest economic consequences?

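This time, we will take a slightly different approach. We will compare the mean of CROPDMG and the mean of PROPDMG by EVTYPE, as those variables are the economic data points. Both columns are used as recorded, without applying the magnitude codes in CROPDMGEXP and PROPDMGEXP, so the means below are not dollar amounts.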

library(dplyr) # filter() and %>%
# Mean CROPDMG and PROPDMG per EVTYPE, keeping event types where either mean exceeds 300
df_economic <- aggregate(cbind(CROPDMG, PROPDMG) ~ EVTYPE, data = df, FUN = mean) %>%
    filter(CROPDMG > 300 | PROPDMG > 300)
ggplot(data = df_economic, aes(x = CROPDMG, y = PROPDMG, color = EVTYPE)) + geom_point(position = 'jitter') + labs(title = 'Overview of events with significant (>300) damage to either crops or property')

The graph is not the clearest, but it does show two things: a few event types do a lot of damage to crops while most do not, and it is in damage to property that most of the selected event types have a significant impact. We will briefly examine the two different types of economic damage and their respective outliers.

head(df_economic[order(df_economic$CROPDMG, decreasing = TRUE), ], 4)
##                   EVTYPE CROPDMG PROPDMG
## 3  DUST STORM/HIGH WINDS     500      50
## 8           FOREST FIRES     500       5
## 22 TROPICAL STORM GORDON     500     500
## 14       HIGH WINDS/COLD     401     122
head(df_economic[order(df_economic$PROPDMG, decreasing = TRUE), ], 4)
##                    EVTYPE CROPDMG PROPDMG
## 2         COASTAL EROSION       0     766
## 10   HEAVY RAIN AND FLOOD       0     600
## 16 RIVER AND STREAM FLOOD       0     600
## 1   BLIZZARD/WINTER STORM       0     500

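Tropical storm ‘Gordon’ is the only event type that ranks near the top on both measures, with a mean of 500 for both CROPDMG and PROPDMG. It ties for the highest mean CROPDMG, but on PROPDMG it is exceeded by coastal erosion (766) and the two flood categories (600). Coastal erosion is responsible for a lot of property damage, but no crop damage (this makes sense, as people live close to the beach/coast, but crops rarely thrive under saltwater conditions).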
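The means above use PROPDMG and CROPDMG as recorded. In the Storm Data documentation, the accompanying PROPDMGEXP and CROPDMGEXP columns give the magnitude of each figure (K for thousands, M for millions, B for billions), so a dollar-based comparison would need to scale the values first. The following is a minimal sketch on toy rows, not the original analysis: `exp_multiplier` is a hypothetical helper, and treating every code other than K, M or B as a multiplier of 1 is an assumption made only for this example.

```r
# Hypothetical helper: turn a magnitude code into a numeric multiplier.
# K, M and B follow the Storm Data documentation; mapping every other code
# (blank, digits, symbols) to 1 is an assumption of this sketch.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  mult[is.na(mult)] <- 1
  mult
}

# Toy rows with the same column names as the data set (values are invented)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD"),
  PROPDMG    = c(25, 2.5, 1.2),
  PROPDMGEXP = c("K", "K", "M"),
  CROPDMG    = c(0, 5, 3),
  CROPDMGEXP = c("", "K", "k"),
  stringsAsFactors = FALSE
)

# Scale to dollars, then total per event type
toy$PROP_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy$CROP_USD <- toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)
damage_by_event <- aggregate(cbind(PROP_USD, CROP_USD) ~ EVTYPE, data = toy, FUN = sum)
print(damage_by_event)
```

With scaled values, totals per event type are usually more informative than means of the raw figures, since a recorded 500 can stand for 500 dollars, 500 thousand or 500 million depending on the code.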