1. Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to answer two questions: (1) Across the United States, which types of events are most harmful with respect to population health? (2) Across the United States, which types of events have the greatest economic consequences? The answers are tornadoes and floods, respectively.

2. Data Processing

2.1 Importing Data

  1. Download the data file from the url: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
  2. Unzip the file and rename the document as “StormData.csv”
  3. Read in the csv file using read_csv
  4. Store the dataset in a data frame called “StormData”
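library(tidyverse)  # assumed setup: supplies read_csv(), %>%, str_to_upper() and ggplot()
StormData <- read_csv("StormData.csv")
dimension <- dim(StormData)                         # rows and columns
total_events <- length(table(StormData$EVTYPE))     # number of distinct event labels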
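
Steps 1 and 2 above can also be scripted. The following is a minimal sketch, not the code used for this report: the helper name fetch_storm_data and the local file name are assumptions made for illustration. It skips the download when the file is already present. readr's read_csv can read a .bz2 file directly, so the manual unzip step is optional.

```r
# Hypothetical helper: download the compressed file once and return its path.
fetch_storm_data <- function(url, dest = "StormData.csv.bz2") {
        if (!file.exists(dest)) {
                # mode = "wb" keeps the binary archive intact on Windows
                download.file(url, destfile = dest, mode = "wb")
        }
        dest
}

storm_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Usage (commented out so the sketch does not touch the network when sourced):
# path <- fetch_storm_data(storm_url)
# StormData <- readr::read_csv(path)   # readr decompresses .bz2 files itself
```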

The dataset contains 902297 observations and 37 variables. According to the National Weather Service Storm Data Documentation, Section 2.1.1 Storm Data Event Table (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf), storm data should contain 48 different event types. However, the imported dataset contains 977 distinct event labels. Hence, rows whose event type is not one of the documented types have to be removed.

2.2 Observations Filtering and Variables Selection

  1. A variable, event_name, was created to store the 48 storm event types. The table’s “Hurricane (Typhoon)” entry is stored as two separate strings, “Hurricane” and “Typhoon”, so the vector has 49 elements.
  2. Since the names of storm events in the dataset are in capital letters, all characters in the variable, event_name, were converted to upper case.
  3. The original dataset, StormData, was filtered to keep only rows whose EVTYPE exactly matches one of these event types.
  4. Since we are only concerned with the events’ effects on population health and the economy, only the related variables were selected from the original dataset, StormData, and the result was stored in a new data frame, StormData2.
event_name <- c(
        "Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
        "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke",
        "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", 
        "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", 
        "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", 
        "Heavy Snow", "High Surf", "High Wind", "Hurricane", "Typhoon", 
        "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", 
        "Marine Hail", "Marine High Wind", "Marine Strong Wind", 
        "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", 
        "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", 
        "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", 
        "Waterspout", "Wildfire", "Winter Storm", "Winter Weather"
)

event_name <- str_to_upper(event_name)

StormData2 <- StormData %>% 
        filter(EVTYPE %in% event_name) %>% 
        select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

dimension2 <- dim(StormData2)

After the data transformation, the new data frame, StormData2, contains 635464 observations and 7 variables.
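
The filter keeps 635464 of the 902297 rows, so 266833 rows (about 30%) are dropped. Because %in% is an exact string match, any label that is abbreviated or written in a different case is discarded along with genuinely invalid labels. The toy vector below is an illustration only; its values are made up for the example and it uses base R so that it runs without extra packages.

```r
# Three of the official event types, upper-cased as in the report
official <- toupper(c("Thunderstorm Wind", "Tornado", "Flood"))

# Toy EVTYPE values: an abbreviation and a mixed-case label do not match
evtype <- c("TORNADO", "TSTM WIND", "THUNDERSTORM WIND", "Flood", "FLOOD")

keep <- evtype %in% official   # exact, case-sensitive comparison
evtype[keep]                   # labels that survive the filter
evtype[!keep]                  # labels that are dropped
```

Mapping such variants to their official names before filtering would retain more rows, at the cost of extra cleaning rules.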

In order to tackle the two questions, two subsets were created from StormData2: storm_health and storm_econ.

2.3 Creating the first data subset, storm_health:

It contains the variables EVTYPE, FATALITIES and INJURIES, representing event types, fatalities and injuries respectively. A new variable, HARM_SUM, representing the total harm to population health, was created by adding up FATALITIES and INJURIES. The subset was then grouped by event type, and the total number of people killed or injured was calculated for each event type.

storm_health <- StormData2 %>% 
        select(EVTYPE, FATALITIES, INJURIES) %>% 
        mutate(HARM_SUM = FATALITIES + INJURIES) %>% 
        select(EVTYPE, HARM_SUM) %>% 
        group_by(EVTYPE) %>% 
        summarise(TOTAL_HARM = sum(HARM_SUM)) %>% 
        arrange(desc(TOTAL_HARM))
head(storm_health)
## # A tibble: 6 x 2
##           EVTYPE TOTAL_HARM
##            <chr>      <dbl>
## 1        TORNADO      96979
## 2 EXCESSIVE HEAT       8428
## 3          FLOOD       7259
## 4      LIGHTNING       6046
## 5           HEAT       3037
## 6    FLASH FLOOD       2755
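
The aggregation can be checked by hand on a small example. The data frame below is made up for illustration, and the sketch uses base R (aggregate and order) rather than the dplyr pipeline above so that it runs without extra packages.

```r
# Toy data: two tornado records, one flood record, one heat record
toy <- data.frame(
        EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT"),
        FATALITIES = c(2, 0, 1, 3),
        INJURIES   = c(10, 5, 0, 4)
)

# Harm per record, then the total per event type, largest first
toy$HARM_SUM <- toy$FATALITIES + toy$INJURIES
totals <- aggregate(HARM_SUM ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(-totals$HARM_SUM), ]
totals
```

Here TORNADO totals (2 + 10) + (0 + 5) = 17, HEAT totals 7 and FLOOD totals 1. Note that the sum weights one fatality the same as one injury.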

2.4 Creating the second data subset, storm_econ:

It contains the variables EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP, representing event types, property damage, the property damage exponent code, crop damage and the crop damage exponent code respectively.

storm_econ <- StormData2 %>% 
        select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

na_prop <- sum(is.na(StormData2$PROPDMGEXP))
na_crop <- sum(is.na(StormData2$CROPDMGEXP))
unique(StormData2$PROPDMGEXP)
##  [1] "K" "M" NA  "B" "+" "0" "5" "m" "2" "4" "7" "?" "-" "6" "3" "1" "8"
## [18] "H"
unique(StormData2$CROPDMGEXP)
## [1] NA  "K" "M" "B" "0" "k"

According to the National Weather Service Storm Data Documentation, Section 2.7, both PROPDMGEXP and CROPDMGEXP should take only three values: K for thousand, M for million, and B for billion. However, in the dataset, they contain other values and many NA values (PROPDMGEXP has 279370 NAs; CROPDMGEXP has 360444 NAs).

Since the actual amount of property damage is PROPDMG multiplied by the value that PROPDMGEXP stands for, each exponent code was replaced by its multiplier: K (or 3) by 10^3, M and m (or 6) by 10^6, B by 10^9, H (or 2) by 10^2, and the digits 4, 5, 7 and 8 by the corresponding power of ten. NAs and all remaining values (such as +, -, ?, 0 and 1) were replaced by 1, so that the recorded damage figure is left unchanged. The same strategy was applied to crop damage, which is CROPDMG multiplied by the value of CROPDMGEXP; there, only K, k, M and B carry a multiplier. The resulting multipliers were stored as numeric values in order to carry out the multiplication later.

# Multipliers for the exponent codes handled in PROPDMGEXP;
# NA and any other code ("+", "-", "?", "0", "1") fall back to 1
prop_multiplier <- c(K = 1e3, M = 1e6, m = 1e6, B = 1e9, H = 1e2,
                     "2" = 1e2, "3" = 1e3, "4" = 1e4, "5" = 1e5,
                     "6" = 1e6, "7" = 1e7, "8" = 1e8)
propdmgexp <- unname(prop_multiplier[as.character(storm_econ$PROPDMGEXP)])
propdmgexp[is.na(propdmgexp)] <- 1

# CROPDMGEXP only needs K/k, M and B; everything else falls back to 1
crop_multiplier <- c(K = 1e3, k = 1e3, M = 1e6, B = 1e9)
cropdmgexp <- unname(crop_multiplier[as.character(storm_econ$CROPDMGEXP)])
cropdmgexp[is.na(cropdmgexp)] <- 1
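
A quick way to check the conversion is to list each code next to the multiplier it receives. The sketch below does this on a toy vector of codes taken from the unique() output above; the lookup-table approach and the names codes, lookup and mult are choices made for this illustration.

```r
# A sample of the codes seen in PROPDMGEXP, including NA and a stray symbol
codes <- c("K", "M", NA, "B", "+", "0", "5", "m", "H")

# Codes with a known meaning; anything else is treated as a multiplier of 1
lookup <- c(K = 1e3, M = 1e6, m = 1e6, B = 1e9, H = 1e2, "5" = 1e5)
mult <- unname(lookup[codes])   # unknown codes and NA come back as NA
mult[is.na(mult)] <- 1

data.frame(code = codes, multiplier = mult)

# Worked value: PROPDMG = 25 with code "K" means 25 * 1000 = 25000 dollars
25 * mult[codes %in% "K"]
```

Replacing unknown codes with 1 keeps those records in the totals at their face value, which may understate the true damage if the intended unit was larger.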

After handling the NAs and unexpected values, the subset, storm_econ, was grouped by event type. The total amount of economic damage (property damage plus crop damage, in dollars) was then summed for each event type.

storm_econ <- storm_econ %>% 
        mutate(PROPDMGEXP = propdmgexp,
               CROPDMGEXP = cropdmgexp,
               TOTAL_DMG = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP) %>% 
        group_by(EVTYPE) %>% 
        summarize(TOTAL = sum(TOTAL_DMG)) %>% 
        arrange(desc(TOTAL))
head(storm_econ)
## # A tibble: 6 x 2
##        EVTYPE        TOTAL
##         <chr>        <dbl>
## 1       FLOOD 150319678257
## 2     TORNADO  57362333947
## 3        HAIL  18761221986
## 4 FLASH FLOOD  18244041079
## 5     DROUGHT  15018672000
## 6   HURRICANE  14610229010
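
The totals are easier to compare when expressed in billions of dollars. The sketch below simply re-enters the six totals printed above and rescales them; the variable names are chosen for this illustration.

```r
# Totals copied from the head(storm_econ) output, in dollars
total <- c(FLOOD = 150319678257, TORNADO = 57362333947, HAIL = 18761221986,
           `FLASH FLOOD` = 18244041079, DROUGHT = 15018672000,
           HURRICANE = 14610229010)

billions <- round(total / 1e9, 1)   # dollars to billions, one decimal place
billions

# How many times larger is the flood total than the tornado total?
round(total[["FLOOD"]] / total[["TORNADO"]], 1)
```

Floods account for roughly 150.3 billion dollars, about 2.6 times the tornado total of roughly 57.4 billion dollars.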

3. Results

3.1 Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

ggplot(head(storm_health, 10), aes(reorder(EVTYPE, TOTAL_HARM), TOTAL_HARM)) +
        geom_col() +
        coord_flip() +
        labs(title = "Top 10 Storm Events' Consequences on Population Health",
             y = "Total Number",
             x = "Storm Event Types")

Because of coord_flip(), the graph’s vertical axis lists the 10 storm event types with the highest totals, while its horizontal axis displays the total number of fatalities and injuries for each type.

According to the graph, tornado, excessive heat, flood and lightning are most harmful with respect to population health, with tornado having the most harmful effects.
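
In the plotting code, reorder(EVTYPE, TOTAL_HARM) is what sorts the bars by size. The sketch below shows its effect in isolation, using the first four totals from the head(storm_health) output; the variable names are chosen for this illustration.

```r
evtype <- c("TORNADO", "EXCESSIVE HEAT", "FLOOD", "LIGHTNING")
harm   <- c(96979, 8428, 7259, 6046)

# reorder() returns a factor whose levels are sorted by the second argument
ordered_evtype <- reorder(evtype, harm)
levels(ordered_evtype)   # smallest total first, largest total last
```

ggplot2 draws the first factor level closest to the origin, so after coord_flip() the event type with the largest total appears at the top of the chart.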

3.2 Across the United States, which types of events have the greatest economic consequences?

ggplot(head(storm_econ, 10), aes(reorder(EVTYPE, TOTAL), TOTAL)) +
        geom_col() +
        coord_flip() +
        labs(title = "Top 10 Economic Damage Caused by Storm Events",
             y = "Damage Amount($)",
             x = "Storm Event Types")

Because of coord_flip(), the graph’s vertical axis lists the 10 storm event types with the highest totals, while its horizontal axis displays the total amount of damage, in dollars, for each type.

According to the graph, flood, tornado, hail, flash flood, drought and hurricane have the greatest economic consequences, with flood having the greatest consequences.