This documents explores the US National Oceanic and Atmospheric Administration (NOAA) storm data. It aims to identify which weather event has affected the population health and the economy by visualization. The data consists all the information collected from year 1950 to the end of November 2011. R packages “data.table” and “dplyr” were used for data processing while “ggplot2” and “scales” were used for creating the visuals.
library(data.table)
library(ggplot2)
library(dplyr)
library(scales)
Using the package data.table, the data is loaded by calling the function fread().
data <- fread("./stormData")
##
Read 0.0% of 967216 rows
Read 22.7% of 967216 rows
Read 38.3% of 967216 rows
Read 51.7% of 967216 rows
Read 60.0% of 967216 rows
Read 73.4% of 967216 rows
Read 78.6% of 967216 rows
Read 85.8% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:11
It is important to see the structure first to get to know the data. Function str() is used here.
str(data)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : chr "3" "2" "2" "2" ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
The data is composed of 902,297 observations and 37 variables.
It is also important to see if some variables has NAs to know if the information is reliable enough.
NAtotal <- colSums((is.na(data)))
NAtotal
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 0 0 0 0 0 0
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 0 0 0 0 0 0
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 0 0 902297 0 0 0
## LENGTH WIDTH F MAG FATALITIES INJURIES
## 0 0 0 0 0 0
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## 0 0 0 0 0 0
## ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## 0 47 0 40 0 0
## REFNUM
## 0
Only COUNTYENDN and LATITUDE has NAs. These variables are not our concern for this analysis thus it should not affect the results.
To proceed, we need to create a data subset. First is the subset for the event type and the number of fatalities and injuries it caused. The package “dplyr” is very handy on this part.
harmfulEvent <- data %>% group_by(EVTYPE) %>% summarise(total.Fatal = sum(FATALITIES), total.Injury = sum(INJURIES))
harmfulEvent_1 <- harmfulEvent[order(-harmfulEvent$total.Fatal),]
harmfulEvent_2 <- harmfulEvent[order(-harmfulEvent$total.Injury),]
Another subset is created for the event type and its damage cost both in property (PROPDMG) and crop (CROPDMG). The variables PROPDMG and CROPDMG were abbreviated in units given in PROPDMGEXP/CROPDMGEXP. This means the numbers need to be recalculated by multiplying the units in PROPDMGEXP/CROPDMGEXP and the figures in PROPDMG/CROPDMG.
## For Property
data <- data[PROPDMGEXP == "K", PropExp := 1000]
data <- data[PROPDMGEXP == "M", PropExp := 1000000]
data <- data[PROPDMGEXP == "", PropExp := 1]
data <- data[PROPDMGEXP == "B", PropExp := 1000000000]
data <- data[PROPDMGEXP == "m", PropExp := 1000000]
data <- data[PROPDMGEXP == "0", PropExp := 1]
data <- data[PROPDMGEXP == "5", PropExp := 100000]
data <- data[PROPDMGEXP == "6", PropExp := 1000000]
data <- data[PROPDMGEXP == "4", PropExp := 10000]
data <- data[PROPDMGEXP == "2", PropExp := 100]
data <- data[PROPDMGEXP == "3", PropExp := 1000]
data <- data[PROPDMGEXP == "h", PropExp := 100]
data <- data[PROPDMGEXP == "7", PropExp := 10000000]
data <- data[PROPDMGEXP == "H", PropExp := 100]
data <- data[PROPDMGEXP == "1", PropExp := 10]
data <- data[PROPDMGEXP == "8", PropExp := 100000000]
# unknown unit
data <- data[PROPDMGEXP == "+", PropExp := 0]
data <- data[PROPDMGEXP == "-", PropExp := 0]
data <- data[PROPDMGEXP == "?", PropExp := 0]
## Calculating the property damage value
data$total.PROPDMG <- data$PROPDMG * data$PropExp
## For Crops
data <- data[CROPDMGEXP == "M", CropExp := 1000000]
data <- data[CROPDMGEXP == "K", CropExp := 1000]
data <- data[CROPDMGEXP == "m", CropExp := 1000000]
data <- data[CROPDMGEXP == "B", CropExp := 1000000000]
data <- data[CROPDMGEXP == "0", CropExp := 1]
data <- data[CROPDMGEXP == "k", CropExp := 1000]
data <- data[CROPDMGEXP == "2", CropExp := 100]
data <- data[CROPDMGEXP == "", CropExp := 1]
## 3 unknown unit
data <- data[CROPDMGEXP == "?", CropExp := 0]
## Calculating the crop damage value
data$total.CROPDMG <- data$CROPDMG * data$CropExp
## create a data subset
economicDamage <- data %>% group_by(EVTYPE) %>% summarise(total.PropDMG = sum(total.PROPDMG), total.CropDMG = sum(total.CROPDMG))
economicDamage$total.Damage <- economicDamage$total.PropDMG + economicDamage$total.CropDMG
economicDamage <- economicDamage[order(-economicDamage$total.Damage),]
topRank <- economicDamage[1:10, 1]
economicDamage_1 <- melt(economicDamage[, c(1:3)])
economicDamage_2 <- economicDamage_1[economicDamage_1$EVTYPE %in% topRank$EVTYPE, ]
The following results show the top 10 most fatal event on record.
plot1 <- ggplot(harmfulEvent_1[1:10,], aes(x = reorder(EVTYPE, -total.Fatal), y = total.Fatal)) +
geom_col(fill = "red", alpha = 0.9, width = .60) +
labs(x = "EVENT TYPE \n", y = "\n TOTAL FATALITIES \n", title = "\n TOP 10 FATAL EVENTS \n") +
scale_y_continuous(expand = c(0,0), limits = c(-10, 6000), labels = comma) +
theme_minimal() + theme(axis.text = element_text(family = 'Trebuchet MS'), axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 16, hjust = 0)) +
theme(axis.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 12))
plot1
The figure shows that TORNADO is the most fatal among the events with more than 5,500 recorded deaths followed by EXCESSIVE HEAT and FLASHFLOOD at under 2,000 deaths.
The following results show the top 10 event with most injuries on record.
plot2 <- ggplot(harmfulEvent_2[1:10,], aes(x = reorder(EVTYPE, - total.Injury), y = total.Injury)) +
geom_col(fill = "royalblue", alpha = 0.9, width = .60) +
labs(x = "\n EVENT TYPE \n", y = "\n TOTAL INJURIES \n", title = "\n TOP 10 INJURY CAUSING EVENTS \n") +
scale_y_continuous(expand = c(0,0), limits = c(-10, 100000), labels = comma) +
theme_minimal() + theme(axis.text = element_text(family = 'Trebuchet MS'), axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 16, hjust = 0)) +
theme(axis.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 12))
plot2
Again, TORNADO is the clear leader in this criteria with more than 80,000 injuries recorded. TSTM WIND is far behind on 2nd with less than 20,000 followed by FLOOD on 3rd.
The 2nd part of the results show the Top 10 events with most total cost of damage to property and crops. The chart shows the aggregate of the two.
plot3 <- ggplot(economicDamage_2, aes(x = reorder(EVTYPE,-value), y = value, fill = variable)) +
geom_bar(stat = "identity", alpha = 0.9) +
labs(x = "\n EVENT TYPE \n", y = "\n TOTAL DAMAGE \n", title = "\n TOP 10 MOST DAMAGING EVENTS \n", fill = " ") +
scale_y_continuous(expand = c(0,0), limits = c(-10, 180000000000), labels = comma) +
scale_fill_manual(values = c("navy", "royalblue"), labels = c("PROPERTY", "CROP")) +
theme_minimal() + theme(axis.text = element_text(family = 'Trebuchet MS'), axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 16, hjust = 0)) +
theme(axis.title = element_text(family = "Trebuchet MS", color = "#666666", face = "bold", size = 12))
plot3
The chart shows that the most costly event is FLOOD with around 150B USD followed by HURRICANE/TYPHOON.