Synopsis

The NOAA storm data analysis shows, first, that the 10 event types most harmful to population health are (in descending order): Tornado, Excessive heat, TSTM wind, Flood, Lightning, Heat, Flash flood, Ice storm, Thunderstorm wind, and Winter storm.

Looking at the most harmful event type, Tornado, the monthly averages show that, compared with other months, tornadoes have a relatively lower impact on population health during May, June, July, August, and September.

Second, the event types that cause the most economic damage are (in descending order): Flood, Hurricane/Typhoon, Tornado, Storm surge, Hail, Flash flood, Drought, Hurricane, River flood, and Ice storm.

Assignment objective

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.

Questions

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

NOAA storm data analysis

Load and process the data

I downloaded the NOAA storm data from the course website, unzipped it, and saved it in the same directory as this Rmd file.

I use the read.csv() function to load the data, then inspect its structure with the str() function:

noaa_data <- read.csv("repdata_data_StormData.csv")
str(noaa_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Both questions ask about the impact of different types of storm events, and the str() output shows that the event type variable (EVTYPE) is stored as character. I therefore convert it to a factor, which makes grouping by event type in the later analysis easier.

I use the as.factor() function for the conversion:

noaa_data$EVTYPE <- as.factor(noaa_data$EVTYPE)
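
To illustrate what the conversion does, here is a minimal sketch on a made-up vector of labels (the vector `ev` is hypothetical, not read from the NOAA file). It also shows that as.factor() keeps differently spelled labels, such as "TSTM WIND" and "THUNDERSTORM WIND", as separate levels, which is why both appear in the ranking later on.

```r
# Toy event labels (hypothetical values for illustration only)
ev <- c("TORNADO", "TSTM WIND", "THUNDERSTORM WIND", "TORNADO")

ev_factor <- as.factor(ev)

levels(ev_factor)   # the distinct labels, sorted alphabetically
nlevels(ev_factor)  # number of distinct labels
table(ev_factor)    # number of records per label
```

Merging such near-duplicate labels would need an explicit recoding step, which this analysis does not perform.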

The first question asks which types of events are most harmful with respect to population health. Based on the data, I measure harm to population health as the sum of fatalities and injuries, so I create a new variable PH (population health) that adds the FATALITIES and INJURIES counts for each record.

noaa_data$PH <- noaa_data$FATALITIES + noaa_data$INJURIES

Finally, I sum the PH values for each type of event and sort the totals to see which events have the largest PH values.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
PH_max <- noaa_data %>%
  group_by(EVTYPE) %>%
  summarise(total_PH = sum(PH, na.rm = TRUE)) %>%
  arrange(desc(total_PH))

# display the top 10 most harmful events
head(PH_max,10)
## # A tibble: 10 × 2
##    EVTYPE            total_PH
##    <fct>                <dbl>
##  1 TORNADO              96979
##  2 EXCESSIVE HEAT        8428
##  3 TSTM WIND             7461
##  4 FLOOD                 7259
##  5 LIGHTNING             6046
##  6 HEAT                  3037
##  7 FLASH FLOOD           2755
##  8 ICE STORM             2064
##  9 THUNDERSTORM WIND     1621
## 10 WINTER STORM          1527
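
The grouping step can be cross-checked without dplyr. The following is a minimal base R sketch on a made-up data frame (`toy` and its values are hypothetical, not NOAA data) that builds PH, totals it per event type, and sorts the result in descending order.

```r
# Toy records (hypothetical values, not NOAA data)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(1, 0, 2, 3),
  INJURIES   = c(10, 4, 5, 0)
)

# Population health impact per record
toy$PH <- toy$FATALITIES + toy$INJURIES

# Sum PH per event type, then sort from most to least harmful
totals <- aggregate(PH ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(-totals$PH), ]
totals
```

Applied to the full data set, the same two steps should reproduce the ranking in the table above.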

In which month do tornadoes, on average, cause the most damage to population health?

First, I convert the date variable from character to a date-time format (using only the begin date, BGN_DATE):

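# POSIXct stores one number per date, which is safer than POSIXlt as a data frame column
noaa_data$BGN_DATE <- as.POSIXct(noaa_data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")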
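
As a quick check of the format string, the sketch below parses one date string copied from the str() output above and extracts its month. The use of as.POSIXct and tz = "UTC" are assumptions of this sketch; the second call shows that a string in a different layout parses to NA rather than raising an error.

```r
fmt <- "%m/%d/%Y %H:%M:%S"

# A value in the same layout as BGN_DATE
d <- as.POSIXct("4/18/1950 0:00:00", format = fmt, tz = "UTC")
month <- format(d, "%m")  # two-digit month as a character string
month

# A string that does not match the format becomes NA
bad <- as.POSIXct("1950-04-18", format = fmt, tz = "UTC")
is.na(bad)
```

Counting the NA values after conversion, for example with sum(is.na(...)), is a simple way to confirm that every record was parsed.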

Next, calculate the monthly average of population health damage and plot monthly variation:

library(ggplot2)
library(dplyr)

noaa_data <- noaa_data %>%
  mutate(month = format(noaa_data$BGN_DATE, "%m"))

# turn month into factor variable
noaa_data$month <- as.factor(noaa_data$month)

PH_avg_pmonth <- noaa_data %>%
  group_by(month) %>%
  summarise(avg_PH = mean(PH, na.rm = TRUE))

ggplot(PH_avg_pmonth, aes(x = month, y = avg_PH)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Tornado monthly avg population health damage",
       x = "month",
       y = "Population health damage")
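
Note that the summary above groups every record by month; restricting it to tornado records requires a filter on EVTYPE before grouping (in the dplyr pipeline, a filter(EVTYPE == "TORNADO") step). The sketch below uses base R and a made-up data frame (`toy` and its values are hypothetical) to show how the two averages can differ.

```r
# Toy records (hypothetical values, not NOAA data)
toy <- data.frame(
  EVTYPE = c("TORNADO", "HAIL", "TORNADO", "HAIL", "TORNADO"),
  month  = c("04", "04", "04", "06", "06"),
  PH     = c(10, 0, 20, 0, 4)
)

# Monthly average over every record, regardless of event type
avg_all <- tapply(toy$PH, toy$month, mean)

# Monthly average over tornado records only
tornado <- subset(toy, EVTYPE == "TORNADO")
avg_tornado <- tapply(tornado$PH, tornado$month, mean)

avg_all
avg_tornado
```

In the toy data, the many zero-harm hail records pull the all-event average down, so the two summaries answer different questions.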

Result

We can see that the 10 event types most harmful to population health are (in descending order): Tornado, Excessive heat, TSTM wind, Flood, Lightning, Heat, Flash flood, Ice storm, Thunderstorm wind, and Winter storm.

Looking at the most harmful event type, Tornado, the monthly averages show that, compared with other months, tornadoes have a relatively lower impact on population health during May, June, July, August, and September.

Economic impact of different types of events

Next, I explore which types of events have the greatest economic consequences. The main variables that indicate economic consequences are property damage (PROPDMG) and crop damage (CROPDMG). Each has an associated variable, PROPDMGEXP and CROPDMGEXP, which gives the magnitude of the damage figure: K stands for thousands of dollars, M for millions of dollars, and B for billions of dollars. However, the associated variables also contain values other than “K”, “M”, and “B”:

print(unique(noaa_data$PROPDMGEXP))
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
print(unique(noaa_data$CROPDMGEXP))
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

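Examples are “+”, “-”, “?”, and digits. I assume that these values mean the corresponding damage is less than 100 dollars, and thus far smaller than the thousands of dollars indicated by “K”, so I exclude them from the analysis (they are converted to NA). Records with an empty code are treated the same way, while “h” and “H” are read as hundreds of dollars.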
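
Before excluding those codes, it can help to count how many records they affect. The sketch below does this on a made-up vector of codes (`codes` is hypothetical; on the real data it would be noaa_data$PROPDMGEXP or noaa_data$CROPDMGEXP), and the vector `known` simply lists the codes that the conversion treats as valid.

```r
# Toy exponent codes (hypothetical values for illustration only)
codes <- c("K", "K", "M", "", "?", "K", "5")

# Codes that are given a multiplier in this analysis
known <- c("h", "H", "k", "K", "m", "M", "B")

dropped <- !(codes %in% known)

n_dropped <- sum(dropped)       # number of records that would become NA
share_dropped <- mean(dropped)  # share of records that would become NA
table(codes[dropped])           # which codes are responsible
```

A small share on the real data would support treating the excluded records as negligible.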

library(dplyr)

noaa_data <- noaa_data %>%
  mutate(propdm_adjusted = case_when(
    PROPDMGEXP == "h" ~ PROPDMG * 100,
    PROPDMGEXP == "H" ~ PROPDMG * 100,
    PROPDMGEXP == "K" ~ PROPDMG * 1000,
    PROPDMGEXP == "m" ~ PROPDMG * 1000000,
    PROPDMGEXP == "M" ~ PROPDMG * 1000000,
    PROPDMGEXP == "B" ~ PROPDMG * 1000000000,
    TRUE ~ NA_real_  # any other code (including blank) returns NA
  )) %>%
  mutate(cropdm_adjusted = case_when(
    CROPDMGEXP == "K" ~ CROPDMG * 1000,
    CROPDMGEXP == "k" ~ CROPDMG * 1000,
    CROPDMGEXP == "M" ~ CROPDMG * 1000000,
    CROPDMGEXP == "m" ~ CROPDMG * 1000000,
    CROPDMGEXP == "B" ~ CROPDMG * 1000000000,
    TRUE ~ NA_real_  # any other code (including blank) returns NA
  ))
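
The same conversion can also be written as a lookup in a named vector, which keeps all the code-to-multiplier pairs in one place. This is a minimal sketch of that alternative, not the method used above; the names `multiplier`, `dmg`, and `exp_code` are hypothetical, and the table includes a lowercase "k" for both variables as an assumption.

```r
# Multiplier for each recognised exponent code
multiplier <- c(h = 1e2, H = 1e2, k = 1e3, K = 1e3,
                m = 1e6, M = 1e6, B = 1e9)

# Toy damage figures and their codes (hypothetical values)
dmg      <- c(25, 2.5, 1, 3, 7)
exp_code <- c("K", "M", "B", "", "?")

# Codes that are not in the table ("" and "?" here) index to NA,
# so the adjusted damage for those records is NA as well
adjusted <- dmg * unname(multiplier[exp_code])
adjusted
```

Adding or changing a code then means editing one entry of the vector rather than a branch of case_when().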

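Then I add property and crop damage together as an indicator of economic impact, after replacing the NA values with 0 so that a missing value in one variable does not wipe out the other: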

noaa_data$propdm_adjusted[is.na(noaa_data$propdm_adjusted)] <- 0
noaa_data$cropdm_adjusted[is.na(noaa_data$cropdm_adjusted)] <- 0

noaa_data$eco_dam <- noaa_data$propdm_adjusted + noaa_data$cropdm_adjusted
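
The NA values are replaced first because adding NA to a number gives NA in R. A minimal sketch with made-up damage vectors (`prop` and `crop` are hypothetical) shows the difference:

```r
# Toy adjusted damage values (hypothetical), with some entries missing
prop <- c(1000, NA, 5000)
crop <- c(NA, 200, 300)

# Without the replacement, any record with one missing part is lost
naive <- prop + crop

# With the replacement, the known part of each record is kept
prop[is.na(prop)] <- 0
crop[is.na(crop)] <- 0
eco <- prop + crop

naive
eco
```

Only the last record survives the naive sum, whereas the replaced version keeps all three.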

Then I calculate the total economic impact for each type of event and display the top 10:

library(dplyr)

ED_max <- noaa_data %>%
  group_by(EVTYPE) %>%
  summarise(total_ED = sum(eco_dam, na.rm = TRUE)) %>%
  arrange(desc(total_ED))

# display the top 10 most costly events
head(ED_max,10)
## # A tibble: 10 × 2
##    EVTYPE                total_ED
##    <fct>                    <dbl>
##  1 FLOOD             150319678250
##  2 HURRICANE/TYPHOON  71913712800
##  3 TORNADO            57352113590
##  4 STORM SURGE        43323541000
##  5 HAIL               18758221670
##  6 FLASH FLOOD        17562128610
##  7 DROUGHT            15018672000
##  8 HURRICANE          14610229010
##  9 RIVER FLOOD        10148404500
## 10 ICE STORM           8967041310

Result

As shown in the table above, the event types that cause the most economic damage are (in descending order): Flood, Hurricane/Typhoon, Tornado, Storm surge, Hail, Flash flood, Drought, Hurricane, River flood, and Ice storm.