Synopsis

This course project challenged us to analyze and explore the National Oceanic and Atmospheric Administration (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The events in the database start in 1950 and end in November 2011. Earlier years generally have fewer recorded events, most likely due to a lack of good records; more recent years should be considered more complete. The link to the data can be found here, additional information about the data and its documentation can be found here, and an FAQ about the National Climatic Data Center Storm Events can be found here.

We were tasked to answer two questions.

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

After analyzing and exploring the data, we found that tornadoes caused the most harm to population health (96,979 combined fatalities and injuries), while floods had the greatest economic consequences (about 150 billion US dollars in combined property and crop damage).

Data Processing

Load the libraries and the data so they can be processed.

library(data.table)
library(ggplot2)

data <- read.csv("./repdata_data_StormData.csv.bz2", header = TRUE, sep = ",")
head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

Next, we subset the data to the columns that describe health and economic impact. We also drop rows that record no fatalities, injuries, property damage, or crop damage, along with rows whose event type is "?". This keeps only the events that contribute to the totals calculated below.

# Columns needed for the health and economic analysis
subset_data <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data <- data[, subset_data]

data <- as.data.table(data)
# Keep rows with a known event type and at least one non-zero impact value
data <- data[EVTYPE != "?" & (INJURIES > 0 | FATALITIES > 0 | PROPDMG > 0 | CROPDMG > 0)]
summary(data)
##     EVTYPE            FATALITIES           INJURIES            PROPDMG       
##  Length:254632      Min.   :  0.00000   Min.   :   0.0000   Min.   :   0.00  
##  Class :character   1st Qu.:  0.00000   1st Qu.:   0.0000   1st Qu.:   2.00  
##  Mode  :character   Median :  0.00000   Median :   0.0000   Median :   5.00  
##                     Mean   :  0.05948   Mean   :   0.5519   Mean   :  42.75  
##                     3rd Qu.:  0.00000   3rd Qu.:   0.0000   3rd Qu.:  25.00  
##                     Max.   :583.00000   Max.   :1700.0000   Max.   :5000.00  
##   PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Length:254632      Min.   :  0.000   Length:254632     
##  Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :  0.000   Mode  :character  
##                     Mean   :  5.411                     
##                     3rd Qu.:  0.000                     
##                     Max.   :990.000
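The row filter above keeps an event when at least one of the four impact columns is positive. A minimal sketch of the same condition on a made-up table (the name `toy` and all of its rows are invented for illustration and are not taken from the storm data):

```r
library(data.table)

# Hypothetical rows, invented for illustration only
toy <- data.table(
  EVTYPE     = c("TORNADO", "HAIL", "?", "FLOOD"),
  FATALITIES = c(0, 0, 1, 0),
  INJURIES   = c(15, 0, 0, 0),
  PROPDMG    = c(25, 0, 0, 0),
  CROPDMG    = c(0, 0, 0, 3)
)

# Keep rows with a known event type and at least one non-zero impact value
kept <- toy[EVTYPE != "?" &
              (INJURIES > 0 | FATALITIES > 0 | PROPDMG > 0 | CROPDMG > 0)]
print(kept)
```

In this sketch the HAIL row is dropped because all of its impact values are zero, and the "?" row is dropped despite its fatality because its event type is unknown.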

Next, we convert the exponent codes in PROPDMGEXP and CROPDMGEXP into numeric multipliers (for example, "K" for thousands, "M" for millions, and "B" for billions) and multiply them by PROPDMG and CROPDMG to get the damage amounts. This gives plain numeric values that can be summed, plotted, and quickly read.

damage <- c("PROPDMGEXP", "CROPDMGEXP")
# Upper-case the exponent codes so lower-case variants match the lookups below
data[, (damage) := lapply(.SD, toupper), .SDcols = damage]

# Multiplier for each exponent code. Codes not listed here, including the
# empty string, look up as NA and are set to 10^0 below.
prop_damage <- c("-" = 10^0, "+" = 10^0, "0" = 10^0, "1" = 10^1, "2" = 10^2, "3" = 10^3,
                 "4" = 10^4, "5" = 10^5, "6" = 10^6, "7" = 10^7, "8" = 10^8, "9" = 10^9,
                 "H" = 10^2, "K" = 10^3, "M" = 10^6, "B" = 10^9)
crop_damage <- c("?" = 10^0, "0" = 10^0, "K" = 10^3, "M" = 10^6, "B" = 10^9)

# Replace each code with its multiplier, defaulting to 10^0
data[, PROPDMGEXP := unname(prop_damage[PROPDMGEXP])]
data[is.na(PROPDMGEXP), PROPDMGEXP := 10^0]

data[, CROPDMGEXP := unname(crop_damage[CROPDMGEXP])]
data[is.na(CROPDMGEXP), CROPDMGEXP := 10^0]

data <- data[, .(EVTYPE, FATALITIES, INJURIES,
                 PROPDMG, PROPDMGEXP, PROPCOST = PROPDMG * PROPDMGEXP,
                 CROPDMG, CROPDMGEXP, CROPCOST = CROPDMG * CROPDMGEXP)]
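To make the conversion concrete, here is a small worked sketch of the lookup-and-multiply step in base R. The vectors `dmg` and `code` hold invented values for illustration, and `multiplier` is only a subset of the lookup used above:

```r
# Subset of the multiplier lookup used above
multiplier <- c("H" = 10^2, "K" = 10^3, "M" = 10^6, "B" = 10^9)

# Hypothetical damage figures and exponent codes, invented for illustration
dmg  <- c(25, 2.5, 1.2, 7)
code <- c("K", "M", "B", "")

# Look up each code; unknown or empty codes come back as NA
mult <- unname(multiplier[code])
mult[is.na(mult)] <- 1  # treat unknown codes as a multiplier of 1

# Damage amount = recorded figure times its multiplier
cost <- dmg * mult
print(cost)
```

So a recorded value of 25 with code "K" becomes 25,000, while a value with an empty code is left unchanged.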

Analysis

Estimating the total fatalities and injuries

harmful_events <- data[, .(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), 
                           TOTAL_HARMFUL_EVENTS = sum(FATALITIES) + sum(INJURIES)), by = .(EVTYPE)]
harmful_events <- harmful_events[order(-TOTAL_HARMFUL_EVENTS),]
harmful_events <- harmful_events[1:10,]
head(harmful_events, 10)
##                EVTYPE FATALITIES INJURIES TOTAL_HARMFUL_EVENTS
##                <char>      <num>    <num>                <num>
##  1:           TORNADO       5633    91346                96979
##  2:    EXCESSIVE HEAT       1903     6525                 8428
##  3:         TSTM WIND        504     6957                 7461
##  4:             FLOOD        470     6789                 7259
##  5:         LIGHTNING        816     5230                 6046
##  6:              HEAT        937     2100                 3037
##  7:       FLASH FLOOD        978     1777                 2755
##  8:         ICE STORM         89     1975                 2064
##  9: THUNDERSTORM WIND        133     1488                 1621
## 10:      WINTER STORM        206     1321                 1527
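The grouped sums above follow the usual data.table pattern of aggregating in `j` with `by`, then ordering by the combined total. A minimal sketch on invented rows (the table `toy` and the column name `TOTAL` are hypothetical and used only for illustration):

```r
library(data.table)

# Hypothetical rows, invented for illustration only
toy <- data.table(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(2, 1, 3, 0, 0),
  INJURIES   = c(10, 4, 20, 1, 2)
)

# Sum fatalities and injuries within each event type
totals <- toy[, .(FATALITIES = sum(FATALITIES),
                  INJURIES   = sum(INJURIES)),
              by = .(EVTYPE)]

# Combined count, then sort from most to least harmful
totals[, TOTAL := FATALITIES + INJURIES]
totals <- totals[order(-TOTAL)]
print(totals)
```

Note that the grouping is on the exact EVTYPE string, so two spellings of the same kind of event would be summed separately.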

Estimating the total property and crop cost

eco_result <- data[, .(PROPCOST = sum(PROPCOST), CROPCOST = sum(CROPCOST),
                       TOTAL_ECO_RESULT = sum(PROPCOST) + sum(CROPCOST)), by = .(EVTYPE)]
eco_result <- eco_result[order(-TOTAL_ECO_RESULT),]
eco_result <- eco_result[1:10,]
head(eco_result, 10)
##                EVTYPE     PROPCOST    CROPCOST TOTAL_ECO_RESULT
##                <char>        <num>       <num>            <num>
##  1:             FLOOD 144657709807  5661968450     150319678257
##  2: HURRICANE/TYPHOON  69305840000  2607872800      71913712800
##  3:           TORNADO  56947380677   414953270      57362333947
##  4:       STORM SURGE  43323536000        5000      43323541000
##  5:              HAIL  15735267513  3025954473      18761221986
##  6:       FLASH FLOOD  16822673979  1421317100      18243991079
##  7:           DROUGHT   1046106000 13972566000      15018672000
##  8:         HURRICANE  11868319010  2741910000      14610229010
##  9:       RIVER FLOOD   5118945500  5029459000      10148404500
## 10:         ICE STORM   3944927860  5022113500       8967041360

Results

population_health <- melt(harmful_events, id.vars = "EVTYPE", variable.name = "Labels")
ggplot(population_health, aes(x = reorder(EVTYPE, -value), y = value)) +
  geom_bar(stat = "identity", aes(fill = Labels), position = "dodge") +
  ylab("Total Fatalities/Injuries") + xlab("Event") +
  theme(axis.text.x = element_text(angle=30, hjust=0.5)) +
  ggtitle("Weather Events That Are Most Harmful With Respect To Population Health") 

This graph shows that tornadoes cause the most fatalities and injuries.

eco_consequences <- melt(eco_result, id.vars = "EVTYPE", variable.name = "Labels")
ggplot(eco_consequences, aes(x = reorder(EVTYPE, -value), y = value/1e9)) + 
  geom_bar(stat = "identity", aes(fill = Labels), position = "dodge") + 
  ylab("Cost/Damage In Billions") + xlab("Event") + 
  theme(axis.text.x = element_text(angle=30, hjust=0.5)) + 
  ggtitle("Weather Events That Have The Greatest Economic Consequence") 

This graph shows that floods cause the most property and crop damage.
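Both plots depend on `melt` to reshape the summary tables from one column per measure into one row per event and measure, which is the layout that `aes(fill = Labels)` with `position = "dodge"` expects. A minimal sketch of that reshaping, using an invented two-row table (`wide` and its values are hypothetical):

```r
library(data.table)

# Hypothetical summary table, invented for illustration only
wide <- data.table(
  EVTYPE     = c("TORNADO", "FLOOD"),
  FATALITIES = c(5, 1),
  INJURIES   = c(30, 5)
)

# One row per (event, measure) pair; the measure name goes into "Labels"
long <- melt(wide, id.vars = "EVTYPE", variable.name = "Labels")
print(long)
```

Because every non-id column is melted, the total columns in the tables above are reshaped as well, so each event is drawn with three bars: the two components and their total.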