Reproducible Research Course Project 2: Extreme Weather Events Impact on Public Health & Economy

Synopsis

There were numbers of severe weather events & storms in the U.S. These heavily affected both the level of well-being and economy of the Americans. The data is taken from the NOAA (National Oceanic and Atmospheric Administration) and can be downloaded from this link. This data describes the content on storms and other weather events in the U.S (including the time & place) and also the estimates of any fatalities, injuries, and property damage.

Data Processing

Before we start, let’s import the libraries used in this project.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

If you have downloaded the dataset, we can directly load it. More info can be found on this link and this link

raw_data <- read.csv('./repdata_data_StormData.csv.bz2')
colnames(raw_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
dim(raw_data)
## [1] 902297     37

Now we can clearly see that there are over 900,000 number of observations (rows) of data and 37 variables (columns). However, we will not be using all columns as the relevant analysis only includes date (BGN_DATE), event types (EVTYPE), FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. If you open one of those links I mentioned above, all of the 48 event types occur during and after 1996. This means that the years before that are not related to our task.

# cast date column as datetime
raw_data$BGN_DATE <- as.Date(raw_data$BGN_DATE, format='%m/%d/%Y')
# include only the related columns to observe health and economy impact
subset_data <- select(raw_data, BGN_DATE, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)
# subset data >= 1996
subset_data$year <- year(subset_data$BGN_DATE)
subset_data <- subset(subset_data, year >= 1996)
# some of the values on PROPDMG, CROPDMG, FATALITIES, and INJURIES are 0, filter them out to get the data we wanted
subset_data <- subset(subset_data, PROPDMG >= 1 | CROPDMG >= 1 | FATALITIES >= 1 | INJURIES >= 1)

Interestingly, columns like PROPDMG and CROPDMG contain non-numeric values. Intuitively, these values represent the values in the order of “thousands”.

  1. "" -> 0
  2. “K” -> 1,000
  3. “M” -> 1,000,000
  4. “B” -> 1,000,000,000

First, let’s look at PROPDMGEXP column

unique(subset_data$PROPDMGEXP)
## [1] "K" ""  "M" "B"

Same thing as the CROPDMGEXP column

unique(subset_data$CROPDMGEXP)
## [1] "K" ""  "M" "B"

Simply just fill in the “real numeric” values according to the categories (’‘, ’B’, ‘K’, or ‘M’)

subset_data$CROPDMGVAL[(subset_data$CROPDMGEXP == '')] <- 10^0
subset_data$CROPDMGVAL[(subset_data$CROPDMGEXP == 'B')] <- 10^9
subset_data$CROPDMGVAL[(subset_data$CROPDMGEXP == 'K')] <- 10^3
subset_data$CROPDMGVAL[(subset_data$CROPDMGEXP == 'M')] <- 10^6

subset_data$PROPDMGVAL[(subset_data$PROPDMGEXP == '')] <- 10^0
subset_data$PROPDMGVAL[(subset_data$PROPDMGEXP == 'B')] <- 10^9
subset_data$PROPDMGVAL[(subset_data$PROPDMGEXP == 'K')] <- 10^3
subset_data$PROPDMGVAL[(subset_data$PROPDMGEXP == 'M')] <- 10^6

When observing extreme weather events on health impact, we can divide it into FATALITIES and INJURIES as they are closely related to “health”. The same goes to economic impact. We can take PROPDMG, PROPDMGVAL, CROPDMG, and CROPDMGVAL to estimate it. From here, we could do a process called feature engineering to generate a new variable based on the existing and related variables.

subset_data <- mutate(subset_data, health_impact = FATALITIES + INJURIES)
subset_data <- mutate(subset_data, eco_impact = PROPDMG*PROPDMGVAL + CROPDMG*CROPDMGVAL)

Results

Finally, plot the “cleaned” data grouped by the weather events in a descending order for health impact.

health_plot <- subset_data %>% group_by(EVTYPE) %>% summarise(health_impact = sum(health_impact)) %>%
                arrange(desc(health_impact))

ggplot(health_plot[1:10,], aes(x=reorder(EVTYPE, -health_impact), y=health_impact, color=EVTYPE)) + 
      geom_bar(stat="identity") + 
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
      xlab("Event") + ylab("Number of fatalities and injuries") +
      theme(legend.position="none") +
      ggtitle("Biggest Health Impact in the U.S. grouped by Event Types")

We can see that ‘TORNADO’ had the biggest impact on people’s health in the U.S.

Next, plot for the economy impact

eco_plot <- subset_data %>% group_by(EVTYPE) %>% summarise(eco_impact = sum(eco_impact)) %>%
                arrange(desc(eco_impact))

ggplot(eco_plot[1:10,], aes(x=reorder(EVTYPE, -eco_impact), y=eco_impact, color=EVTYPE)) + 
      geom_bar(stat="identity") + 
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
      xlab("Event") + ylab("Number of fatalities and injuries") +
      theme(legend.position="none") +
      ggtitle("Biggest Economy Impact in the U.S. grouped by Event Types")

For the biggest economy impact, “FLOOD” tops the chart followed by “HURRICANE/TYPHOON” and “STORM SURGE”.