Synopsis

The purpose of this study is to analyze the health and economic impact of different weather events. The dataset that was used, Storm Data, is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents significant weather phenomena. Events documented in the dataset start in 1950 and end in 2011. The accuracy of the data is greater for later records.

The analysis identified floods as the most harmful weather event in terms of economic impact. In terms of negative health effect, tornadoes are the most harmful, causing the greatest number of injuries and fatalities.

Data Processing

Reading data

We start by downloading and processing the data.

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2")
#accesed on 09.02.2018
bunzip2("StormData.csv.bz2")

#using fread from data.table to save time (compared to read.csv)
sd <- fread("StormData.csv")
## 
Read 30.0% of 967216 rows
Read 51.7% of 967216 rows
Read 70.3% of 967216 rows
Read 82.7% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:06
str(sd)
## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : chr  "3" "2" "2" "2" ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Recoding the exponent variables

There are some important variables, that need recoding. The PROPDMGEXP and CROPDMGEXP variables should be exponents for the appropriate damage variables. However, as seen below, it is not clear. To determine the meaning of each exponent, I have used this publication

table(sd$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(sd$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
#recoding the PROPDMG_EXP variable
sd$PROPDMGEXP_n <- recode(tolower(sd$PROPDMGEXP), 
                          'h' = 10^2,
                          'k' = 10^3,
                          'm' = 10^6,
                          'b' = 10^9,
                          '+' = 1,
                          '1' = 10,
                          '2' = 10,
                          '3' = 10,
                          '4' = 10,
                          '5' = 10,
                          '6' = 10,
                          '7' = 10,
                          '8' = 10,
                          .default = 0)
#creating a variable with real damages (in million USD)
sd$PROPDMG_real <- sd$PROPDMG * sd$PROPDMGEXP_n / 10^6

#recoding the CROPDMG_EXP variable
sd$CROPDMGEXP_n <- recode(tolower(sd$CROPDMGEXP), 
                          'h' = 10^2,
                          'k' = 10^3,
                          'm' = 10^6,
                          'b' = 10^9,
                          '+' = 1,
                          '1' = 10,
                          '2' = 10,
                          '3' = 10,
                          '4' = 10,
                          '5' = 10,
                          '6' = 10,
                          '7' = 10,
                          '8' = 10,
                          .default = 0)
sd$CROPDMG_real <- sd$CROPDMG * sd$CROPDMGEXP_n  / 10^6

#to keep the data tidy, we keep only the important variables
sd <- sd %>% select(EVTYPE, PROPDMG_real, CROPDMG_real, INJURIES, FATALITIES)

Cleaning the event types

First, lets prepare a summary of the economic and health effect by event types. Here the economic effect is defined as sum of property and crop damage, and health as sum of injuries and death related to the event.

sum_sd <- sd %>%
        group_by(EVTYPE) %>%
        summarise(econ_effect=sum(PROPDMG_real, CROPDMG_real),
                  health_effect=sum(INJURIES, FATALITIES))

str(sum_sd)
## Classes 'tbl_df', 'tbl' and 'data.frame':    985 obs. of  3 variables:
##  $ EVTYPE       : chr  "   HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD" " LIGHTNING" ...
##  $ econ_effect  : num  0.2 0 0.05 0 8.1 0.008 0 0 0.005 0 ...
##  $ health_effect: num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

It can be observed, that the event type variable is very messy - it contains many more values, than the 48 types listed in the documentation. Let’s take a look at the 50 event types with the highest economic impact and 50 with the highest impact on health

highest_econ <- sum_sd[order(sum_sd$econ_effect, decreasing = TRUE),][1:50,1]
highest_health <- sum_sd[order(sum_sd$health_effect, decreasing = TRUE),][1:50,1]
unique(rbind(highest_econ, highest_health))
## # A tibble: 73 x 1
##               EVTYPE
##                <chr>
##  1             FLOOD
##  2 HURRICANE/TYPHOON
##  3           TORNADO
##  4       STORM SURGE
##  5              HAIL
##  6       FLASH FLOOD
##  7           DROUGHT
##  8         HURRICANE
##  9       RIVER FLOOD
## 10         ICE STORM
## # ... with 63 more rows

Now, let’s see which of them are not on the list from the documentation and recode them. Full analysis would require to check and recode all the event types, however it would be a tedious manual task, so we will only concentrate on the most harmful event types.

sum_sd$EVTYPE <- recode(sum_sd$EVTYPE,
                        "STORM SURGE" = "STORM SURGE/TIDE",
                        "HURRICANE" = "HURRICANE/TYPHOON",
                        "RIVER FLOOD" = "FLOOD",
                        "TSTM WIND" = "THUNDERSTORM WIND",
                        "HURRICANE OPAL" = "HURRICANE/TYPHOON",
                        "WILD/FOREST FIRE" = "WILDFIRE",
                        "HEAVY RAIN/SEVERE WEATHER" = "HEAVY RAIN",
                        "THUNDERSTORM WINDS" = "THUNDERSTORM WIND",
                        "TORNADOES, TSTM WIND, HAIL" = "TORNADO",
                        "EXTREME COLD" = "EXTREME COLD/WIND CHILL",
                        "SEVERE THUNDERSTORM" = "THUNDERSTORM WIND",
                        "HIGH WINDS" = "HIGH WIND",
                        "WILD FIRES" = "WILDFIRE",
                        "TYPHOON" = "HURRICANE/TYPHOON",
                        "FREEZE" = "FROST/FREEZE",
                        "HEAT" = "EXCESSIVE HEAT",
                        "HURRICANE ERIN" = "HURRICANE/TYPHOON",
                        "LANDSLIDE" = "DEBRIS FLOW",
                        "FLASH FLOODING" = "FLASH FLOOD",
                        "FLASH FLOOD/FLOOD" = "FLASH FLOOD",
                        "DAMAGING FREEZE" = "FROST/FREEZE",
                        "FLOOD/FLASH FLOOD" = "FLASH FLOOD",
                        "HAILSTORM" = "HAIL",
                        "EXCESSIVE WETNESS" = "FLOOD",
                        "River Flooding" = "FLOOD",
                        "COASTAL FLOODING" = "COASTAL FLOOD",
                        "HIGH WINDS/COLD" = "EXTREME COLD/WIND CHILL",
                        "FLOODING" = "FLOOD",
                        "FOG" = "DENSE FOG",
                        "RIP CURRENTS" = "RIP CURRENT",
                        "HEAT WAVE" = "EXCESSIVE HEAT",
                        "EXTREME HEAT" = "EXCESSIVE HEAT",
                        "GLAZE" = "FROST/FREEZE",
                        "ICE" = "FROST/FREEZE",
                        "WIND" = "STRONG WIND",
                        "COLD/WIND CHILL" = "EXTREME COLD/WIND CHILL",
                        "URBAN/SML STREAM FLD" = "FLOOD",
                        "TSTM WIND/HAIL" = "THUNDERSTORM WIND",
                        "WINTER WEATHER/MIX" = "WINTER WEATHER",
                        "HEAVY SURF/HIGH SURF" = "HIGH SURF",
                        "COLD" = "EXTREME COLD/WIND CHILL",
                        "WINTRY MIX" = "WINTER WEATHER",
                        "Heat Wave" = "EXCESSIVE HEAT",
                        "WINTER WEATHER MIX" = "WINTER WEATHER")

Now, that the event types were recoded, we need to summarize the data again.

sum_sd <- sum_sd %>%
        group_by(EVTYPE) %>%
        summarise(econ_effect=sum(econ_effect),
                  health_effect=sum(health_effect))

str(sum_sd)
## Classes 'tbl_df', 'tbl' and 'data.frame':    942 obs. of  3 variables:
##  $ EVTYPE       : chr  "   HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD" " LIGHTNING" ...
##  $ econ_effect  : num  0.2 0 0.05 0 8.1 0.008 0 0 0.005 0 ...
##  $ health_effect: num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Results

Now that the data is clean, an analysis can be performed.

Economic effect

Let’s start by plotting the events, that are most economically harmful.

e <- ggplot(sum_sd[order(sum_sd$econ_effect, decreasing = TRUE),][1:10,], aes(EVTYPE, econ_effect))
e + geom_bar(stat = 'identity') +
        theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
        xlab("Event Type") +
        ylab("Damage (in millions USD)") +
        ggtitle("Economic effect")

It is clearly visible, that floods are the most economically harmful weather events. In the analysed period (1950-2011) floods have caused over 150 billion USD damages in property and crops.

Health effect

We can plot the health impact (number of injuries and fatalities) of each event type.

h <- ggplot(sum_sd[order(sum_sd$health_effect, decreasing = TRUE),][1:10,], aes(EVTYPE, health_effect))
h + geom_bar(stat = 'identity') +
        theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
        xlab("Event Type") +
        ylab("Number of injuries and fatalities") +
        ggtitle("Health effect")

It is clear, that tornadoes are the most harmful weather events in terms of fatalities and injuries.