Weather events can have dramatic consequences. In this analysis, we examine the effects of weather events on two main indicators: population health (“HEALTH”) and economic damage (“ECONOMIC”).
The data file is downloaded directly from this repository.
Documentation is provided here.
library(data.table) # provides fread
df <- fread("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")
After reading in the data using fread, we check what data and types we loaded.
str(df)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
There are 902297 events listed in the data. The column names are not ideal, but they are partially descriptive. As this is an external data set, we don’t know exactly what each name means, so keeping the names ‘as is’ is the safest bet.
The EVTYPE column is currently of type ‘chr’. We will convert this to a factor, and view parts of the result:
df$EVTYPE <- as.factor(toupper(df$EVTYPE)) # Converting all values in EVTYPE to UPPERCASE
head(table(df$EVTYPE), 24)
##
## HIGH SURF ADVISORY COASTAL FLOOD
## 1 1
## FLASH FLOOD LIGHTNING
## 1 1
## TSTM WIND TSTM WIND (G45)
## 4 1
## WATERSPOUT WIND
## 1 1
## ? ABNORMAL WARMTH
## 1 4
## ABNORMALLY DRY ABNORMALLY WET
## 2 1
## ACCUMULATED SNOWFALL AGRICULTURAL FREEZE
## 4 6
## APACHE COUNTY ASTRONOMICAL HIGH TIDE
## 1 103
## ASTRONOMICAL LOW TIDE AVALANCE
## 174 1
## AVALANCHE BEACH EROSIN
## 386 1
## BEACH EROSION BEACH EROSION/COASTAL FLOOD
## 4 1
## BEACH FLOOD BELOW NORMAL PRECIPITATION
## 2 2
As we can see, this is not an ideal grouping. There are still 898 factor levels, although in practice there is considerable overlap. For example, ‘AVALANCE’ is a typo for ‘AVALANCHE’, and ‘BEACH EROSIN’ for ‘BEACH EROSION’. Another example is the set of levels with TSTM in the name: there are quite a few, some plain and some with a suffix such as ‘(G45)’, but they probably describe similar events (more examples are in the full table). For now, we leave this as it is, but when interpreting the results we should account for these inconsistencies and avoid drawing conclusions too broadly.
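Should the labels need consolidating later, the overlap described above could be reduced with a few string substitutions. The following is a minimal sketch and not part of this analysis: `clean_evtype` is a hypothetical helper, the input vector is a toy example, and the rule that every label starting with TSTM WIND belongs to one group is an assumption.

```r
# Hypothetical helper: normalise a few of the label variants mentioned above.
clean_evtype <- function(x) {
  x <- toupper(trimws(x))                          # drop leading/trailing blanks, unify case
  x <- gsub("\\s+", " ", x)                        # collapse repeated whitespace
  x <- sub("^AVALANCE$", "AVALANCHE", x)           # fix known typo
  x <- sub("^BEACH EROSIN$", "BEACH EROSION", x)   # fix known typo
  x <- sub("^TSTM WIND.*$", "TSTM WIND", x)        # ASSUMPTION: all TSTM WIND variants are one group
  x
}

# Toy input, not read from the data set
labels <- c("   HIGH SURF ADVISORY", "Avalance", "BEACH EROSIN", "TSTM WIND (G45)")
cleaned <- clean_evtype(labels)
print(cleaned)
```

Each substitution is anchored with `^` and `$` so that only whole labels are rewritten; a broader pattern would merge more levels but also risks grouping events that are genuinely different.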
To determine which event types are most harmful to population health, we must first define population health. We have two data points per event that relate to ‘population health’: FATALITIES and INJURIES.
It is difficult to compare these two, especially without a proper
understanding of what the National Weather Service (NWS) considers an
‘injury’. Is a scratch the same as a broken leg?
Another issue is comparing across events. Is an event that consistently
has 10 fatalities as harmful as an event that has a 1% probability of
1000 fatalities (and 99% chance of no fatalities)? On average we would
rate them the same, but in terms of impact the latter would probably be
more devastating.
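The comparison above can be made concrete with a small calculation. The sketch below uses the two hypothetical events from the text; `expected_value` and `spread` are illustrative helper names, and the standard deviation is only one possible measure of the spread in outcomes.

```r
# Event A: always 10 fatalities.
outcomes_a <- c(10)
probs_a    <- c(1)

# Event B: 1000 fatalities with probability 1%, otherwise none.
outcomes_b <- c(1000, 0)
probs_b    <- c(0.01, 0.99)

# Probability-weighted mean of the outcomes
expected_value <- function(outcomes, probs) sum(outcomes * probs)

# Standard deviation of the outcomes around that mean
spread <- function(outcomes, probs) {
  mu <- expected_value(outcomes, probs)
  sqrt(sum(probs * (outcomes - mu)^2))
}

mean_a <- expected_value(outcomes_a, probs_a)
mean_b <- expected_value(outcomes_b, probs_b)
sd_a   <- spread(outcomes_a, probs_a)
sd_b   <- spread(outcomes_b, probs_b)

print(c(mean_a = mean_a, mean_b = mean_b, sd_a = sd_a, sd_b = sd_b))
```

Both events have the same expected number of fatalities, but only the second has a non-zero spread, which is the difference that a plain sum or mean per event type does not capture.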
The simplest approach is to assume that a fatality is as harmful as an
injury to population health, and not to correct for the spread in
outcomes.
# Collapse the data to one row per EVTYPE, summing all FATALITIES and INJURIES
df_fatalities_injuries <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = df, sum)
# To compare events, add injuries and fatalities into a single total
df_fatalities_injuries$SUM_FATALITIES_INJURIES <- rowSums(cbind(df_fatalities_injuries$FATALITIES, df_fatalities_injuries$INJURIES), na.rm = TRUE)
# Keep only the 10 highest totals, as there are too many unique events for legible graphs
df_fatalities_injuries <- df_fatalities_injuries[order(df_fatalities_injuries$SUM_FATALITIES_INJURIES, decreasing = TRUE), ]
df_fatalities_injuries <- head(df_fatalities_injuries, 10)
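library(ggplot2) # plotting
library(dplyr)   # %>% and filter
library(tidyr)   # pivot_longer
# Reshape to one row per EVTYPE and TYPE so fatalities and injuries can be stacked.
# coord_flip() draws the x aesthetic (EVTYPE) on the vertical axis; axis labels follow the aesthetic.
ggplot(data = df_fatalities_injuries %>% pivot_longer(cols = c(FATALITIES, INJURIES), names_to = 'TYPE', values_to = 'COUNT'),
       aes(x = EVTYPE, y = COUNT, fill = TYPE)) +
  geom_bar(stat = 'identity', position = 'stack') +
  labs(title = 'Sum of fatalities/injuries by event') +
  xlab('Event') + ylab('Count') +
  coord_flip()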
From this figure, tornadoes are clearly by far the most harmful event type. They are followed by thunderstorms. Strangely enough, hurricanes aren’t in the top 10 (or even close).
For the economic consequences, we take a slightly different approach: we compare the mean of CROPDMG and the mean of PROPDMG by EVTYPE, as those variables are the economic data points.
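# Mean crop and property damage per EVTYPE; keep event types where either mean exceeds 300
df_economic <- aggregate(cbind(CROPDMG, PROPDMG) ~ EVTYPE, data = df, FUN = mean) %>%
  filter(CROPDMG > 300 | PROPDMG > 300)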
ggplot(data = df_economic, aes(x = CROPDMG, y = PROPDMG, color = EVTYPE)) + geom_point(position = 'jitter') + labs(title = 'Overview of events with significant (>300) damage to either crops or property')
This is not the clearest graph, but it does show two things: a few events do a lot of damage to crops, while most do not, and it is in damage to property that most events have a significant impact. We will briefly examine the two different types of economic damage and their respective outliers.
head(df_economic[order(df_economic$CROPDMG, decreasing = TRUE), ], 4)
## EVTYPE CROPDMG PROPDMG
## 3 DUST STORM/HIGH WINDS 500 50
## 8 FOREST FIRES 500 5
## 22 TROPICAL STORM GORDON 500 500
## 14 HIGH WINDS/COLD 401 122
head(df_economic[order(df_economic$PROPDMG, decreasing = TRUE), ], 4)
## EVTYPE CROPDMG PROPDMG
## 2 COASTAL EROSION 0 766
## 10 HEAVY RAIN AND FLOOD 0 600
## 16 RIVER AND STREAM FLOOD 0 600
## 1 BLIZZARD/WINTER STORM 0 500
Tropical storm ‘Gordon’ stands out as the only event type in these lists with a high mean for both CROPDMG and PROPDMG (500 each). Coastal erosion has the highest mean property damage (766) but no crop damage. This makes sense, as people live close to the beach/coast, but crops rarely thrive under saltwater conditions.
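The means above are computed on the raw PROPDMG and CROPDMG values. The data also contains the columns PROPDMGEXP and CROPDMGEXP (see the `str` output), which hold a magnitude code such as ‘K’. A minimal sketch of scaling the damage values by that code before averaging could look as follows. It rests on assumptions: that K, M and B stand for thousand, million and billion, and that any other code is treated as a multiplier of 1. `exp_multiplier`, `PROPDMG_SCALED` and the toy rows are hypothetical and not part of the analysis.

```r
# ASSUMPTION: K, M and B stand for thousand, million and billion.
# Any other code (blank, digits, symbols) is treated as a multiplier of 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])  # unknown or empty codes give NA
  mult[is.na(mult)] <- 1
  mult
}

# Toy rows, not taken from the data set
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL", "HAIL"),
  PROPDMG    = c(25, 2.5, 1.5, 3),
  PROPDMGEXP = c("K", "M", "", "k"),
  stringsAsFactors = FALSE
)

# Scale each damage value by its multiplier, then average per event type
toy$PROPDMG_SCALED <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
scaled_means <- aggregate(PROPDMG_SCALED ~ EVTYPE, data = toy, FUN = mean)
print(scaled_means)
```

With such a scaling, a raw value of 2.5 with code ‘M’ outweighs a raw value of 25 with code ‘K’, so the ranking of event types may differ from the one based on raw values.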