Synopsis

Severe weather events and storms can cause extensive property and crop damage and harm public health. Identifying which event types do the most harm can help prioritize efforts to reduce losses in future events. This project uses data from the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database. The analysis is performed in the R programming language, using RStudio as the programming environment. The data was subset to the relevant columns (property damage, crop damage, fatalities, and injuries), summed by event type, and sorted to find the top 25 event types with respect to public health and economic damage. The analysis revealed that tornadoes had the greatest impact on public health, based on the number of fatalities and injuries, and that floods had the greatest economic impact, based on combined property and crop damage.

Data Processing

Raw data is often very large and messy, so the provided dataset must undergo some processing prior to being analyzed.

Loading required packages

R has many “packages” (libraries of code) that are not loaded into the environment by default; dplyr and ggplot2 must also be installed before first use. The packages this analysis depends on are loaded up front so that the code that uses them runs without errors.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
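
If either package is not installed, library() stops with an error. The following is a minimal sketch of a pre-flight check, assuming the packages would be installed from CRAN. The names required_pkgs, is_installed, and missing_pkgs are illustrative and not part of the original analysis.

```r
# Packages this analysis loads with library()
required_pkgs <- c("dplyr", "ggplot2")

# requireNamespace() returns FALSE instead of stopping when a package is absent
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

# Report what still needs to be installed, if anything
if (length(missing_pkgs) > 0) {
  message("Install before running: ", paste(missing_pkgs, collapse = ", "))
}
```

Running install.packages() on the names reported would then make the library() calls above succeed.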

Reading and Subsetting the Data

Since working with large datasets is very resource intensive, we will extract just the columns that we are interested in.

data <- read.csv("repdata_data_StormData.csv.bz2")
col_select <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", 
                "CROPDMG", "CROPDMGEXP")
tidy_data <- data[, col_select]
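
The call above loads every column before the subset is taken. As a hedged alternative, the unwanted columns can be skipped while reading. The sketch below demonstrates the idea on a small temporary file with made-up values; toy_path, keep_cols, and col_classes are hypothetical names, and any speed-up on the real compressed file is an assumption that has not been measured here.

```r
# Toy CSV standing in for the storm file (values are made up)
toy_path <- tempfile(fileext = ".csv")
write.csv(data.frame(STATE = c("AL", "TX"),
                     EVTYPE = c("TORNADO", "FLOOD"),
                     FATALITIES = c(0, 2),
                     REMARKS = c("a", "b")),
          toy_path, row.names = FALSE)

keep_cols <- c("EVTYPE", "FATALITIES")

# Read a single row just to learn the column names
header <- names(read.csv(toy_path, nrows = 1))

# "NULL" tells read.csv() to skip a column; NA lets it guess the type
col_classes <- ifelse(header %in% keep_cols, NA, "NULL")

toy_data <- read.csv(toy_path, colClasses = col_classes)
names(toy_data)
```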

Correct Damage Values

Each damage figure is split across two columns. The PROPDMG and CROPDMG columns hold a numerical value, and the PROPDMGEXP and CROPDMGEXP columns hold an exponent code: the power of ten by which that value must be multiplied to obtain the actual dollar amount. The two columns must be combined before the damage can be analyzed.
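
As a small worked example of that arithmetic, the sketch below uses made-up damage values with exponents 3, 6, and 9 (thousands, millions, and billions). The vector names are illustrative only.

```r
# Made-up damage values and their powers of ten
damage_value <- c(25, 2.5, 1)
damage_exp   <- c(3, 6, 9)

# Actual dollar amount = value * 10^exponent
corrected <- damage_value * 10^damage_exp
corrected
```

The first entry, for instance, is 25 * 10^3 = 25,000 dollars.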

Substitute Letters for Numerical Values

The exponent columns contain letters and other characters in place of numerical exponents. These values need to be converted to numerical values in order to be able to perform mathematical operations.

## Letter codes: H = hundreds, K = thousands, M = millions, B = billions
tidy_data$PROPDMGEXP <- gsub("[Hh]", "2", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("[Kk]", "3", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("[Mm]", "6", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("[Bb]", "9", tidy_data$PROPDMGEXP)
## Symbols: "+" is treated as 10^1; "-", "?", and blank are treated as 10^0
tidy_data$PROPDMGEXP <- gsub("[\\+]", "1", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("-", "0", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("[\\?]", "0", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("^$", "0", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- as.numeric(tidy_data$PROPDMGEXP)

## Same letter codes as for property damage
tidy_data$CROPDMGEXP <- gsub("[Kk]", "3", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- gsub("[Mm]", "6", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- gsub("[Bb]", "9", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- gsub("[Hh]", "2", tidy_data$CROPDMGEXP)
## "?" and blank are treated as 10^0
tidy_data$CROPDMGEXP <- gsub("[\\?]", "0", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- gsub("^$", "0", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- as.numeric(tidy_data$CROPDMGEXP)
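
The same substitutions can also be written as a single lookup, which keeps the mapping in one place and applies it to either column. This is a sketch of an alternative, not the method used above; convert_exponent and letter_map are hypothetical names, and the mapping simply mirrors the gsub() calls (H = 2, K = 3, M = 6, B = 9, "+" = 1, and "-", "?", and blank = 0).

```r
convert_exponent <- function(codes) {
  codes <- toupper(codes)
  letter_map <- c(H = 2, K = 3, M = 6, B = 9, "+" = 1, "-" = 0, "?" = 0)

  # Digit codes convert directly; everything else becomes NA for now
  out <- suppressWarnings(as.numeric(codes))

  # Fill in the letter and symbol codes from the lookup
  is_symbol <- codes %in% names(letter_map)
  out[is_symbol] <- letter_map[codes[is_symbol]]

  # Blank codes cannot be matched by name, so handle them separately
  out[codes %in% ""] <- 0
  out
}

convert_exponent(c("K", "m", "B", "", "5", "+", "?"))
```

Any code not covered by the mapping stays NA, which makes unexpected values easy to spot before multiplying.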

Multiply Damage Values by Exponents

Now that all of the values are numerical and prepared, we can multiply the damage values by their exponents to get the accurate values.

tidy_data <- tidy_data %>% mutate(PROPDMG = PROPDMG * 10^PROPDMGEXP)
tidy_data <- tidy_data %>% mutate(CROPDMG = CROPDMG * 10^CROPDMGEXP)

Keep Only Rows with Non-Zero Values

Many rows in the dataset record no fatalities, no injuries, and no damage. These rows contribute nothing to the totals, so we keep only the rows in which at least one of the four values is non-zero, which reduces the amount of data carried through the remaining steps.

tidy_data <- tidy_data[tidy_data$FATALITIES != 0 |
                       tidy_data$INJURIES != 0 |
                       tidy_data$PROPDMG != 0 |
                       tidy_data$CROPDMG != 0, ]
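
The row filter above can be checked on a toy data frame. The sketch below uses made-up rows and an equivalent rowSums() formulation; toy, impact_cols, and has_impact are illustrative names.

```r
# Made-up rows: one with a fatality, one with no impact, one with damage only
toy <- data.frame(EVTYPE     = c("TORNADO", "HAIL", "FLOOD"),
                  FATALITIES = c(1, 0, 0),
                  INJURIES   = c(0, 0, 0),
                  PROPDMG    = c(0, 0, 5000),
                  CROPDMG    = c(0, 0, 0))

impact_cols <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# TRUE for rows where at least one of the four columns is non-zero
has_impact <- rowSums(toy[, impact_cols] != 0) > 0

toy_nonzero <- toy[has_impact, ]
toy_nonzero
```

Only the TORNADO and FLOOD rows remain; the HAIL row, with zeros in all four columns, is dropped.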

Aggregate Data by Event Types, Subset Top 25

Now that the data has been filtered, we can sum each of the four measures by event type, combine property and crop damage into a total economic damage figure, and sort each result in descending order to keep the top 25 event types in each category.

fatalities <- aggregate(FATALITIES ~ EVTYPE, data = tidy_data, sum)
injuries <- aggregate(INJURIES ~ EVTYPE, data = tidy_data, sum)
propdmg <- aggregate(PROPDMG ~ EVTYPE, data = tidy_data, sum)
cropdmg <- aggregate(CROPDMG ~ EVTYPE, data = tidy_data, sum)

## Combining economic data together
## (cbind() assumes both tables list the same event types in the same order)
econ_dmg <- cbind(propdmg,cropdmg$CROPDMG)
colnames(econ_dmg)[3] <- "CROPDMG"
econ_dmg <- mutate(econ_dmg, TOTAL_DMG = PROPDMG + CROPDMG)

fatalities <- arrange(fatalities, desc(FATALITIES))[1:25,]
injuries <- arrange(injuries, desc(INJURIES))[1:25,]
econ_dmg <- arrange(econ_dmg, desc(TOTAL_DMG))[1:25,]

## Renaming column headers for clarity
colnames(fatalities)[1] <- "EVENT_TYPE"
colnames(injuries)[1] <- "EVENT_TYPE"
colnames(econ_dmg) <- c("EVENT_TYPE",
                        "PROPERTY_DAMAGE",
                        "CROP_DAMAGE",
                        "TOTAL_DAMAGE")

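The aggregation step combines the property and crop totals with cbind(), which depends on both tables having their rows in the same order. A sketch of a more defensive variant joins the two tables on the event type with merge() instead. It is shown on made-up data, and toy, prop_by_type, crop_by_type, and econ are hypothetical names.

```r
# Made-up damage records
toy <- data.frame(EVTYPE  = c("FLOOD", "TORNADO", "FLOOD", "HAIL"),
                  PROPDMG = c(100, 50, 200, 10),
                  CROPDMG = c(5, 0, 15, 30))

# Sum each damage column by event type
prop_by_type <- aggregate(PROPDMG ~ EVTYPE, data = toy, sum)
crop_by_type <- aggregate(CROPDMG ~ EVTYPE, data = toy, sum)

# Join on the event type rather than relying on row order
econ <- merge(prop_by_type, crop_by_type, by = "EVTYPE")
econ$TOTAL_DMG <- econ$PROPDMG + econ$CROPDMG

# Sort by total damage, largest first, and keep the top 2
econ <- econ[order(-econ$TOTAL_DMG), ]
head(econ, 2)
```

With the real data, head(econ, 25) would play the role of the [1:25,] subset used in the analysis.
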
Results

The data is now ready for analysis. First we will answer the question:

1. Across the United States, which types of events (as indicated in the EVTYPE) are most harmful with respect to population health?

To answer this question, let’s look at the top 25 event types for fatalities and injuries. Figure 1 below shows the data in both tabular and graphical formats.

fatalities
##                 EVENT_TYPE FATALITIES
## 1                  TORNADO       5633
## 2           EXCESSIVE HEAT       1903
## 3              FLASH FLOOD        978
## 4                     HEAT        937
## 5                LIGHTNING        816
## 6                TSTM WIND        504
## 7                    FLOOD        470
## 8              RIP CURRENT        368
## 9                HIGH WIND        248
## 10               AVALANCHE        224
## 11            WINTER STORM        206
## 12            RIP CURRENTS        204
## 13               HEAT WAVE        172
## 14            EXTREME COLD        160
## 15       THUNDERSTORM WIND        133
## 16              HEAVY SNOW        127
## 17 EXTREME COLD/WIND CHILL        125
## 18             STRONG WIND        103
## 19                BLIZZARD        101
## 20               HIGH SURF        101
## 21              HEAVY RAIN         98
## 22            EXTREME HEAT         96
## 23         COLD/WIND CHILL         95
## 24               ICE STORM         89
## 25                WILDFIRE         75
injuries
##            EVENT_TYPE INJURIES
## 1             TORNADO    91346
## 2           TSTM WIND     6957
## 3               FLOOD     6789
## 4      EXCESSIVE HEAT     6525
## 5           LIGHTNING     5230
## 6                HEAT     2100
## 7           ICE STORM     1975
## 8         FLASH FLOOD     1777
## 9   THUNDERSTORM WIND     1488
## 10               HAIL     1361
## 11       WINTER STORM     1321
## 12  HURRICANE/TYPHOON     1275
## 13          HIGH WIND     1137
## 14         HEAVY SNOW     1021
## 15           WILDFIRE      911
## 16 THUNDERSTORM WINDS      908
## 17           BLIZZARD      805
## 18                FOG      734
## 19   WILD/FOREST FIRE      545
## 20         DUST STORM      440
## 21     WINTER WEATHER      398
## 22          DENSE FOG      342
## 23     TROPICAL STORM      340
## 24          HEAT WAVE      309
## 25         HIGH WINDS      302
ggplot(fatalities, aes(EVENT_TYPE, FATALITIES)) +
        geom_bar(stat = "identity", fill = "red3") +
        guides(x=guide_axis(angle = 30)) +
        xlab("Event Type") +
        ylab("Fatalities")

ggplot(injuries, aes(EVENT_TYPE, INJURIES)) +
        geom_bar(stat = "identity", fill = "red3") +
        guides(x=guide_axis(angle = 30)) +
        xlab("Event Type") +
        ylab("Injuries")
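
Because EVENT_TYPE is plain text, ggplot2 places the bars in alphabetical order rather than in order of magnitude. A sketch of one way to sort the bars uses reorder() to order the event types by descending count. It uses three rows taken from the fatalities table above; toy_fatalities is an illustrative name, and the plot is only drawn if ggplot2 is available.

```r
# Three rows taken from the fatalities table
toy_fatalities <- data.frame(EVENT_TYPE = c("FLOOD", "TORNADO", "HEAT"),
                             FATALITIES = c(470, 5633, 937))

# Order the factor levels by descending fatalities
toy_fatalities$EVENT_TYPE <- reorder(toy_fatalities$EVENT_TYPE,
                                     -toy_fatalities$FATALITIES)
levels(toy_fatalities$EVENT_TYPE)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy_fatalities, aes(EVENT_TYPE, FATALITIES)) +
          geom_col(fill = "red3") +
          guides(x = guide_axis(angle = 30)) +
          xlab("Event Type") +
          ylab("Fatalities")
  print(p)
}
```

The bars then appear from the largest total to the smallest, matching the order of the table.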

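As seen above, tornadoes are by far the most harmful type of event with respect to population health, with 5,633 fatalities and 91,346 injuries. The next-highest totals are 1,903 fatalities (excessive heat) and 6,957 injuries (TSTM wind).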

Next, let’s answer the second question:

2. Across the United States, which types of events have the greatest economic consequences?

To answer this question, let’s look at the economic damage data. The values for property damage and crop damage have been combined to calculate the event types with the highest combined cost. Figure 2 below shows the top 25 event types with the highest damage costs in both tabular and graphical formats.

econ_dmg
##                    EVENT_TYPE PROPERTY_DAMAGE CROP_DAMAGE TOTAL_DAMAGE
## 1                       FLOOD    144657709807  5661968450 150319678257
## 2           HURRICANE/TYPHOON     69305840000  2607872800  71913712800
## 3                     TORNADO     56947381217   414953270  57362334487
## 4                 STORM SURGE     43323536000        5000  43323541000
## 5                        HAIL     15735267513  3025954473  18761221986
## 6                 FLASH FLOOD     16822673979  1421317100  18243991079
## 7                     DROUGHT      1046106000 13972566000  15018672000
## 8                   HURRICANE     11868319010  2741910000  14610229010
## 9                 RIVER FLOOD      5118945500  5029459000  10148404500
## 10                  ICE STORM      3944927860  5022113500   8967041360
## 11             TROPICAL STORM      7703890550   678346000   8382236550
## 12               WINTER STORM      6688497251    26944000   6715441251
## 13                  HIGH WIND      5270046475   638571300   5908617775
## 14                   WILDFIRE      4765114000   295472800   5060586800
## 15                  TSTM WIND      4484928495   554007350   5038935845
## 16           STORM SURGE/TIDE      4641188000      850000   4642038000
## 17          THUNDERSTORM WIND      3483122472   414843050   3897965522
## 18             HURRICANE OPAL      3172846000    19000000   3191846000
## 19           WILD/FOREST FIRE      3001829500   106796830   3108626330
## 20  HEAVY RAIN/SEVERE WEATHER      2500000000           0   2500000000
## 21         THUNDERSTORM WINDS      1944590859   190654788   2135245647
## 22 TORNADOES, TSTM WIND, HAIL      1600000000     2500000   1602500000
## 23                 HEAVY RAIN       694248090   733399800   1427647890
## 24               EXTREME COLD        67737400  1292973000   1360710400
## 25        SEVERE THUNDERSTORM      1205360000      200000   1205560000
ggplot(econ_dmg, aes(EVENT_TYPE, TOTAL_DAMAGE)) +
        geom_bar(stat = "identity", fill = "green4") +
        guides(x=guide_axis(angle = 30)) +
        xlab("Event Type") +
        ylab("Total Damage (in dollars)")

As shown above, floods cause the most economic damage by far, at about $150.3 billion in combined property and crop damage. Hurricanes/typhoons come second at about $71.9 billion, a little less than half that value.