Synopsis

This documents explores the US National Oceanic and Atmospheric Administration (NOAA) storm data. It aims to identify which weather event has affected the population health and the economy by visualization. The data consists all the information collected from year 1950 to the end of November 2011. R packages “data.table” and “dplyr” were used for data processing while “ggplot2” and “scales” were used for creating the visuals.

Loading the necessary R packages

library(data.table)
library(ggplot2)
library(dplyr)
library(scales)

Data Processing

Using the package data.table, the data is loaded by calling the function fread().

data <- fread("./stormData")
## 
Read 0.0% of 967216 rows
Read 22.7% of 967216 rows
Read 38.3% of 967216 rows
Read 51.7% of 967216 rows
Read 60.0% of 967216 rows
Read 73.4% of 967216 rows
Read 78.6% of 967216 rows
Read 85.8% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:11

It is important to see the structure first to get to know the data. Function str() is used here.

str(data)
## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : chr  "3" "2" "2" "2" ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, ".internal.selfref")=<externalptr>

The data is composed of 902,297 observations and 37 variables.

It is also important to see if some variables has NAs to know if the information is reliable enough.

NAtotal <- colSums((is.na(data)))
NAtotal
##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME 
##          0          0          0          0          0          0 
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE 
##          0          0          0          0          0          0 
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI 
##          0          0     902297          0          0          0 
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES 
##          0          0          0          0          0          0 
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC 
##          0          0          0          0          0          0 
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS 
##          0         47          0         40          0          0 
##     REFNUM 
##          0

Only COUNTYENDN and LATITUDE has NAs. These variables are not our concern for this analysis thus it should not affect the results.

Creating Subsets

To proceed, we need to create a data subset. First is the subset for the event type and the number of fatalities and injuries it caused. The package “dplyr” is very handy on this part.

harmfulEvent <- data %>% group_by(EVTYPE) %>% summarise(total.Fatal = sum(FATALITIES), total.Injury = sum(INJURIES))
harmfulEvent_1 <- harmfulEvent[order(-harmfulEvent$total.Fatal),]
harmfulEvent_2 <- harmfulEvent[order(-harmfulEvent$total.Injury),]

Another subset is created for the event type and its damage cost both in property (PROPDMG) and crop (CROPDMG). The variables PROPDMG and CROPDMG were abbreviated in units given in PROPDMGEXP/CROPDMGEXP. This means the numbers need to be recalculated by multiplying the units in PROPDMGEXP/CROPDMGEXP and the figures in PROPDMG/CROPDMG.

## For Property

data <- data[PROPDMGEXP == "K", PropExp := 1000]
data <- data[PROPDMGEXP == "M", PropExp := 1000000]
data <- data[PROPDMGEXP == "", PropExp := 1]
data <- data[PROPDMGEXP == "B", PropExp := 1000000000]
data <- data[PROPDMGEXP == "m", PropExp := 1000000]
data <- data[PROPDMGEXP == "0", PropExp := 1]
data <- data[PROPDMGEXP == "5", PropExp := 100000]
data <- data[PROPDMGEXP == "6", PropExp := 1000000]
data <- data[PROPDMGEXP == "4", PropExp := 10000]
data <- data[PROPDMGEXP == "2", PropExp := 100]
data <- data[PROPDMGEXP == "3", PropExp := 1000]
data <- data[PROPDMGEXP == "h", PropExp := 100]
data <- data[PROPDMGEXP == "7", PropExp := 10000000]
data <- data[PROPDMGEXP == "H", PropExp := 100]
data <- data[PROPDMGEXP == "1", PropExp := 10]
data <- data[PROPDMGEXP == "8", PropExp := 100000000]

# unknown unit
data <- data[PROPDMGEXP == "+", PropExp := 0]
data <- data[PROPDMGEXP == "-", PropExp := 0]
data <- data[PROPDMGEXP == "?", PropExp := 0]



## Calculating the property damage value
data$total.PROPDMG <- data$PROPDMG * data$PropExp


## For Crops
data <- data[CROPDMGEXP == "M", CropExp := 1000000]
data <- data[CROPDMGEXP == "K", CropExp := 1000]
data <- data[CROPDMGEXP == "m", CropExp := 1000000]
data <- data[CROPDMGEXP == "B", CropExp := 1000000000]
data <- data[CROPDMGEXP == "0", CropExp := 1]
data <- data[CROPDMGEXP == "k", CropExp := 1000]
data <- data[CROPDMGEXP == "2", CropExp := 100]
data <- data[CROPDMGEXP == "", CropExp := 1]

## 3 unknown unit
data <- data[CROPDMGEXP == "?", CropExp := 0]

## Calculating the crop damage value
data$total.CROPDMG <- data$CROPDMG * data$CropExp

## create a data subset
economicDamage <- data %>% group_by(EVTYPE) %>% summarise(total.PropDMG = sum(total.PROPDMG), total.CropDMG = sum(total.CROPDMG))
economicDamage$total.Damage <- economicDamage$total.PropDMG + economicDamage$total.CropDMG
economicDamage <- economicDamage[order(-economicDamage$total.Damage),]

topRank <- economicDamage[1:10, 1]
economicDamage_1 <- melt(economicDamage[, c(1:3)])
economicDamage_2 <- economicDamage_1[economicDamage_1$EVTYPE %in% topRank$EVTYPE, ]

Results

1. Population Health

The following results show the top 10 most fatal event on record.

plot1 <- ggplot(harmfulEvent_1[1:10,], aes(x = reorder(EVTYPE, -total.Fatal), y = total.Fatal)) + 
        geom_col(fill = "red", alpha = 0.9, width = .60) +
        labs(x = "EVENT TYPE \n", y = "\n TOTAL FATALITIES \n", title = "\n TOP 10 FATAL EVENTS \n") +
        scale_y_continuous(expand = c(0,0), limits = c(-10, 6000), labels = comma) +
        theme_minimal() + theme(axis.text = element_text(family = 'Trebuchet MS'), axis.text.x = element_text(angle = 45, hjust = 1)) +
        theme(plot.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 16, hjust = 0)) +
        theme(axis.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 12)) 
plot1

The figure shows that TORNADO is the most fatal among the events with more than 5,500 recorded deaths followed by EXCESSIVE HEAT and FLASHFLOOD at under 2,000 deaths.

The following results show the top 10 event with most injuries on record.

plot2 <- ggplot(harmfulEvent_2[1:10,], aes(x = reorder(EVTYPE, - total.Injury), y = total.Injury)) +
        geom_col(fill = "royalblue", alpha = 0.9, width = .60) +
        labs(x = "\n EVENT TYPE \n", y = "\n TOTAL INJURIES \n", title = "\n TOP 10 INJURY CAUSING EVENTS \n") +
        scale_y_continuous(expand = c(0,0), limits = c(-10, 100000), labels = comma) +
        theme_minimal() + theme(axis.text = element_text(family = 'Trebuchet MS'), axis.text.x = element_text(angle = 45, hjust = 1)) +
        theme(plot.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 16, hjust = 0)) +
        theme(axis.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 12)) 
plot2

Again, TORNADO is the clear leader in this criteria with more than 80,000 injuries recorded. TSTM WIND is far behind on 2nd with less than 20,000 followed by FLOOD on 3rd.

2. Economic Cost

The 2nd part of the results show the Top 10 events with most total cost of damage to property and crops. The chart shows the aggregate of the two.

plot3 <- ggplot(economicDamage_2, aes(x = reorder(EVTYPE,-value), y = value, fill = variable)) +
        geom_bar(stat = "identity", alpha = 0.9) +
        labs(x = "\n EVENT TYPE \n", y = "\n TOTAL DAMAGE \n", title = "\n TOP 10 MOST DAMAGING EVENTS \n", fill = " ") +
        scale_y_continuous(expand = c(0,0), limits = c(-10, 180000000000), labels = comma) +
        scale_fill_manual(values = c("navy", "royalblue"), labels = c("PROPERTY", "CROP")) +
        theme_minimal() + theme(axis.text = element_text(family = 'Trebuchet MS'), axis.text.x = element_text(angle = 45, hjust = 1)) +
        theme(plot.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 16, hjust = 0)) +
        theme(axis.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 12))
plot3

The chart shows that the most costly event is FLOOD with around 150B USD followed by HURRICANE/TYPHOON.