Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

The objective of this project is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to answer the following 2 key questions:
1) Across the United States, which types of events are most harmful with respect to population health?
2) Across the United States, which types of events have the greatest economic consequences?

The key steps of the data analysis are as follows:
1) Data Processing - Loading and transforming the database
2) Exploratory Data Analysis - Understanding the data
3) Results - Analysing the data and plotting charts to address the 2 key questions

Data Processing Part 1 (Loading Data)

Firstly, we read in the raw data file, which has been downloaded and saved in the R working directory. The data is stored as the object “storm_raw”.

storm_raw <- read.csv("repdata-data-StormData.csv.bz2")
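As a side note on the loading step, read.csv() can read a bzip2-compressed file directly, so the .bz2 file does not need to be decompressed by hand. The minimal sketch below illustrates this with a tiny made-up file written to a temporary path; the file contents and the names demo_path and demo_raw are illustrative assumptions and are not part of the NOAA data.

```r
# Write a tiny, made-up CSV to a bzip2-compressed temporary file
demo_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(demo_path, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                     FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# read.csv() detects the compression and reads the file like a plain CSV
demo_raw <- read.csv(demo_path, stringsAsFactors = FALSE)
print(demo_raw)
```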

As the objectives of the analysis focus on the event types which result in the worst health/economic consequences, we eliminate the irrelevant data columns relating to geographical location and timing to reduce the data size and improve the data processing efficiency. The new dataset is saved as “storm”.

storm <- storm_raw[,c(8,23:28)]
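Selecting columns by position works, but it depends on the column order of the raw file. A possible alternative, sketched here on a small made-up data frame, is to select the same seven columns by name (the names are those shown in the str() output of this report). The object toy_raw and its values are hypothetical stand-ins for storm_raw.

```r
# Toy stand-in for storm_raw, with one extra column (STATE) that is not needed
toy_raw <- data.frame(
  STATE      = c("AL", "AL"),
  EVTYPE     = c("TORNADO", "TORNADO"),
  FATALITIES = c(0, 1),
  INJURIES   = c(15, 14),
  PROPDMG    = c(25, 25),
  PROPDMGEXP = c("K", "K"),
  CROPDMG    = c(0, 0),
  CROPDMGEXP = c("", ""),
  stringsAsFactors = FALSE
)

# Keep only the columns used in the analysis, selected by name
keep_cols <- c("EVTYPE", "FATALITIES", "INJURIES",
               "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
toy_storm <- toy_raw[, keep_cols]
str(toy_storm)
```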

Exploratory Data Analysis

We get a better understanding of the remaining data by calling str().

str(storm)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

From this output, we see that EVTYPE has 985 distinct levels, i.e., 985 different event type labels as recorded in the data. Furthermore, the PROPDMGEXP & CROPDMGEXP variables need further processing, as they currently contain blank and unexpected values.
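The first EVTYPE level in the str() output carries leading spaces, which suggests that some of the 985 labels may be formatting variants of one another. The analysis in this report keeps the labels exactly as recorded. Purely as an illustration, the sketch below shows how trimming whitespace and upper-casing could collapse such variants; apart from “   HIGH SURF ADVISORY”, the labels in toy_labels are made up.

```r
# Toy labels: the first is taken from the str() output, the rest are made up
toy_labels <- c("   HIGH SURF ADVISORY", "HIGH SURF ADVISORY",
                "TORNADO", "tornado", " Tornado")

# Trim surrounding whitespace and convert to upper case
clean_labels <- toupper(trimws(toy_labels))

cat("distinct raw labels:    ", length(unique(toy_labels)), "\n")
cat("distinct cleaned labels:", length(unique(clean_labels)), "\n")
```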

Data Processing Part 2 (Pre-Processing Damage Variables)

Based on the National Weather Service Storm Data Documentation (“https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf”), we understand that the property and crop damage estimates are rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, e.g., 1.55B for $1,550,000,000.

The documentation further states that “Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.” However, when calling str() on storm, we noted that the magnitude characters for property and crop damages are factor variables containing more than these 3 values (refer to the table() calls below).

table(storm$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(storm$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

As seen above, there are a total of 19 factor levels for PROPDMGEXP and 9 factor levels for CROPDMGEXP, of which several are not part of the defined values of “K”, “M” & “B”.
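To gauge how many records are affected, the counts from the table(storm$PROPDMGEXP) output above can be re-entered by hand and summed. The sketch below does this; the vector prop_exp_counts is simply a transcription of that table, and the split into blank, documented, and other codes is an illustrative assumption about how to group them.

```r
# Counts transcribed from the table(storm$PROPDMGEXP) output above
prop_exp_counts <- setNames(
  c(465934, 1, 8, 5, 216, 25, 13, 4, 4, 28,
    4, 5, 1, 40, 1, 6, 424665, 7, 11330),
  c("", "-", "?", "+", "0", "1", "2", "3", "4", "5",
    "6", "7", "8", "B", "h", "H", "K", "m", "M")
)

documented    <- c("K", "M", "B")
is_blank      <- names(prop_exp_counts) == ""
is_documented <- names(prop_exp_counts) %in% documented

n_total <- sum(prop_exp_counts)
n_other <- sum(prop_exp_counts[!is_blank & !is_documented])

cat("total records:                ", n_total, "\n")
cat("non-blank, undocumented codes:", n_other, "\n")
cat("share of all records (%):     ", round(100 * n_other / n_total, 3), "\n")
```

The total matches the 902297 observations reported by str(), and the non-blank codes outside “K”, “M” & “B” (including the lower-case “m”) cover 328 records, so the handling of these codes affects only a small part of the data.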

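Therefore, two new columns, “PROPMAGNITUDE” and “CROPMAGNITUDE”, are created to hold the numeric multiplier for each damage estimate. Lower-case “k” and “m” are treated the same as “K” and “M”, while all other values (including blanks) keep a multiplier of 0, which excludes those damage estimates from the totals.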

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Default multiplier of 0 excludes records with blank or undefined codes
storm$PROPMAGNITUDE <- 0
storm$CROPMAGNITUDE <- 0
# Property damage multipliers
storm$PROPMAGNITUDE[storm$PROPDMGEXP == "K"] <- 1000
storm$PROPMAGNITUDE[storm$PROPDMGEXP == "m"] <- 1000000
storm$PROPMAGNITUDE[storm$PROPDMGEXP == "M"] <- 1000000
storm$PROPMAGNITUDE[storm$PROPDMGEXP == "B"] <- 1000000000
# Crop damage multipliers
storm$CROPMAGNITUDE[storm$CROPDMGEXP == "K"] <- 1000
storm$CROPMAGNITUDE[storm$CROPDMGEXP == "k"] <- 1000
storm$CROPMAGNITUDE[storm$CROPDMGEXP == "m"] <- 1000000
storm$CROPMAGNITUDE[storm$CROPDMGEXP == "M"] <- 1000000
storm$CROPMAGNITUDE[storm$CROPDMGEXP == "B"] <- 1000000000
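The same mapping can also be expressed as a single lookup, which keeps the code-to-multiplier rules in one place. The following is a minimal sketch of that idea, not the code used for the results in this report; the names exp_multiplier and to_magnitude are hypothetical.

```r
# Named lookup vector: exponent code -> numeric multiplier
exp_multiplier <- c(K = 1e3, k = 1e3, M = 1e6, m = 1e6, B = 1e9)

# Map a vector of exponent codes to multipliers; any code that is not
# in the lookup (blank, "?", digits, ...) gets a multiplier of 0
to_magnitude <- function(exp_codes) {
  mult <- exp_multiplier[as.character(exp_codes)]
  mult[is.na(mult)] <- 0
  unname(mult)
}

# Small demonstration on a factor with valid, blank and invalid codes
demo_codes <- factor(c("K", "m", "", "?", "B", "5"))
print(to_magnitude(demo_codes))
```

With such a helper, the two magnitude columns could be filled with to_magnitude(storm$PROPDMGEXP) and to_magnitude(storm$CROPDMGEXP). The as.character() call matters when the input is a factor, since indexing by a factor would otherwise use its integer codes.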

Data Processing Part 3 (Create Health & Economic Variables)

In order to answer the 2 questions on health & economic consequences, we create 2 columns representing the total HEALTH & ECONOMIC (damages) consequences for the listed events.

For the HEALTH variable, we assign an arbitrary weight of 10 to FATALITIES and a weight of 1 to INJURIES. Note that these weights can be adjusted depending on how the health consequence of a death is judged relative to that of an injury.

storm$HEALTH_CONSEQUENCE <- (storm$FATALITIES * 10) + (storm$INJURIES * 1)
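Because the 10:1 weighting is arbitrary, it can help to wrap the formula in a function so that other weights are easy to try. The sketch below assumes a hypothetical helper named health_score and uses a few made-up fatality and injury counts.

```r
# Weighted health score; the default weights match the 10:1 choice above
health_score <- function(fatalities, injuries,
                         fatality_weight = 10, injury_weight = 1) {
  fatalities * fatality_weight + injuries * injury_weight
}

# Made-up counts for three events
fatalities <- c(0, 1, 0)
injuries   <- c(15, 14, 2)

print(health_score(fatalities, injuries))                       # 10:1 weighting
print(health_score(fatalities, injuries, fatality_weight = 5))  # 5:1 weighting
```

Re-running the ranking under a few different weights is one way to check whether the top event types are sensitive to this choice.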

For the ECONOMIC variable, we sum the property damages and crop damages, each multiplied by its magnitude.

storm$ECONOMIC_CONSEQUENCE <- (storm$PROPDMG * storm$PROPMAGNITUDE) + (storm$CROPDMG * storm$CROPMAGNITUDE)
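To make the arithmetic concrete, the sketch below applies the same formula to three rows. The first two use PROPDMG values shown in the str() output (25 and 2.5) and assume that their exponent code is “K”, as the factor level ordering suggests; the third row is hypothetical and combines the 1.55B example from the documentation with a made-up crop damage of 2M.

```r
# Three example rows with the multipliers already resolved
demo <- data.frame(
  PROPDMG       = c(25, 2.5, 1.55),
  PROPMAGNITUDE = c(1e3, 1e3, 1e9),   # K, K, B
  CROPDMG       = c(0, 0, 2),
  CROPMAGNITUDE = c(0, 0, 1e6)        # blank, blank, M
)

# Same formula as used for storm$ECONOMIC_CONSEQUENCE
demo$ECONOMIC_CONSEQUENCE <- (demo$PROPDMG * demo$PROPMAGNITUDE) +
  (demo$CROPDMG * demo$CROPMAGNITUDE)

print(format(demo$ECONOMIC_CONSEQUENCE, big.mark = ",", scientific = FALSE))
```

The rows evaluate to $25,000, $2,500 and $1,552,000,000 respectively.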

Data Processing Part 4 (Group by Event Type)

We create a new object “storm_final” by grouping the HEALTH and ECONOMIC consequences by event types:

storm_event <- group_by(storm, EVTYPE)
storm_final <- summarise(storm_event, HEALTH = sum(HEALTH_CONSEQUENCE), ECONOMIC = sum(ECONOMIC_CONSEQUENCE))
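For readers who want to check the grouping logic without dplyr, the same per-event totals can be computed with base R's aggregate(). The sketch below runs on a small made-up data frame; toy and toy_final are hypothetical names and the values are not taken from the storm data.

```r
# Made-up per-record consequences for two event types
toy <- data.frame(
  EVTYPE               = c("TORNADO", "FLOOD", "TORNADO", "FLOOD"),
  HEALTH_CONSEQUENCE   = c(15, 0, 24, 10),
  ECONOMIC_CONSEQUENCE = c(25000, 1e6, 2500, 5e5)
)

# Sum both consequence columns within each event type
toy_final <- aggregate(
  cbind(HEALTH_CONSEQUENCE, ECONOMIC_CONSEQUENCE) ~ EVTYPE,
  data = toy, FUN = sum
)
names(toy_final) <- c("EVTYPE", "HEALTH", "ECONOMIC")
print(toy_final)
```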

Results

Question 1) Across the United States, which types of events are most harmful with respect to population health?

We plot the following bar chart to represent the top 10 event types that are most harmful to population health:

library(ggplot2)

storm_final_health <- arrange(storm_final, desc(HEALTH))
storm_10_health <- head(storm_final_health, 10)
ggplot(data = storm_10_health, aes(x = factor(EVTYPE), y = HEALTH)) +
  geom_bar(stat = "identity", fill = "white", colour = "darkgreen") +
  coord_flip() +
  labs(title = "Top 10 Event Types Most Harmful to Population Health",
       x = "Event Types",
       y = "HEALTH FACTOR (10 for Fatalities & 1 for Injuries)")

Therefore, we can conclude that tornadoes (EVTYPE “TORNADO”) are the most harmful to population health under the 10:1 weighting used here.
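One note on reading the chart: ggplot2 draws a discrete axis in factor level order, so factor(EVTYPE) arranges the bars alphabetically rather than by value. If bars sorted by size are preferred, one option is to reorder the factor levels by the plotted value, for example with aes(x = reorder(EVTYPE, HEALTH), y = HEALTH). The sketch below shows only the reordering step in base R on made-up scores; the labels and values in toy_top are hypothetical.

```r
# Made-up event scores (not taken from the storm data)
toy_top <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "HEAT"),
  HEALTH = c(100, 50, 20)
)

# Default factor levels are alphabetical
print(levels(factor(toy_top$EVTYPE)))

# reorder() sorts the levels by the score, smallest first; after
# coord_flip() the largest bar would then appear at the top
ordered_evtype <- reorder(toy_top$EVTYPE, toy_top$HEALTH)
print(levels(ordered_evtype))
```

The same approach would apply to the economic chart by reordering on ECONOMIC instead of HEALTH.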

Question 2) Across the United States, which types of events have the greatest economic consequences?

We plot the following bar chart to represent the top 10 event types that have the greatest economic consequences (most damages in value):

storm_final_eco <- arrange(storm_final, desc(ECONOMIC))
storm_10_economic <- head(storm_final_eco, 10)
ggplot(data = storm_10_economic, aes(x = factor(EVTYPE), y = ECONOMIC / 1.0e9)) +
  geom_bar(stat = "identity", fill = "white", colour = "darkgreen") +
  coord_flip() +
  labs(title = "Top 10 Event Types with Worst Economic Consequences",
       x = "Event Types",
       y = "Economic Consequences ($ in billions)")

Therefore, we can conclude that floods (EVTYPE “FLOOD”) have the greatest economic consequences.