Synopsis

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to address the following questions:

  1. Across the United States, which types of events are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Population health impact is determined by the number of fatalities and injuries caused by the severe weather, while economic damage is calculated by the total cost damage to crops and properties of the public.

The analysis shows that, across US, excessive heat caused the greatest number of fatalities, tornado caused the greatest number of injuries and flood was the most damaging severe weather in terms of economic consequences.

Data description

The database and variables used in the analysis are comprehensive described at:

Data Processing

Load the required libraries

library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)

Load the data

Download the compressed data file (bz2 file) and using read.csv to read ‘bz2’ directly without the need to decompress it. Note that, to prevent repeatedly download the compressed file everytime you run the code, add a command that make R check if the file does locally exist. Make sure the data file was downloaded correctly. It must be 46.8 MB (49,177,144 bytes) with 902297 records** on 37 variables.

if(!exists("./data")){dir.create("./data")}
stormDataFileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
stormDataFile <- "data/stormData.csv.bz2"
if (!file.exists(stormDataFile)) {
  download.file(url = stormDataFileUrl, destfile = stormDataFile)
}
stopifnot(file.size(stormDataFile) == 49177144)
stormData <- read.csv(stormDataFile, header = TRUE)
stopifnot(dim(stormData) == c(902297,37))

Seeing structure of the storm dataset

str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Data processing

The dataset is big and contains many data irrelevant to the purposes of the analysis. So subsetting only those that are needed.

  1. Subsetting relevant variables

The analysis want to address two questions:

After reading the Storm Data Documentation and inspecting all the variables of the storm dataset, I decided that, variables listed below are those needed for doing the analysis:

Variable Description
EVTYPE Event type (Flood, Heat, Hurricane, Tornado, …)
FATALITIES Number of fatalities resulting from event
INJURIES Number of injuries resulting from event
PROPDMG Property damage in USD
PROPDMGEXP Unit multiplier for property damage
CROPDMG Crop damage in USD
CROPDMGEXP Unit multiplier for property damage
BGN_DATE Begin date of the event
  1. Filter the records from the Jan. 1996 onwards.

According to NOAA the data recording start from Jan. 1950. At that time they recorded one event type, tornado. They add more events gradually and only from Jan. 1996 they start recording all events type. Since my objective is comparing the effects of different weather events, I only included records from Jan. 1996 onwards.

  1. Remove records that contain no information about health or economic damages

The dataset contains a lot of observations without any information about health and/or economic damages. These observations are excluded from the analysis.

stormData$BGN_DATE <- mdy_hms(stormData$BGN_DATE)
interestedData <- stormData %>% 
  select(c("EVTYPE",
           "FATALITIES",
           "INJURIES",
           "PROPDMG",
           "PROPDMGEXP",
           "CROPDMG",
           "CROPDMGEXP",
           "BGN_DATE")) %>% 
  filter(year(BGN_DATE) >= 1996) %>% 
  filter(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)

Cleaning the EVTYPE column

The official events type are 48. However, there are total 222 unique event types in the current dataset.

length(unique(interestedData$EVTYPE))
## [1] 222

That is because, many values are very similar. Just some changes in the case, spelling that make them become different event type. Solving this problem by changing all the words in EVTYPE column to uppercase and using gsub to combine similar event type into unique one.

interestedData$EVTYPE <- toupper(interestedData$EVTYPE)
## create a new variable to store the event types after cleaning
interestedData$CLEAN_EVTYPE <- NULL

## combine similar event types into unique group
interestedData$CLEAN_EVTYPE <- gsub('.*HAIL.*', 'HAIL', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*HEAT.*', 'HEAT', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*FLOOD.*', 'FLOOD', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*WIND.*', 'WIND', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*STORM.*', 'STORM', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*SNOW.*', 'SNOW', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*TORNADO.*', 'TORNADO', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*WINTER.*', 'WINTER', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*RAIN.*', 'RAIN', interestedData$EVTYPE)
length(unique(interestedData$CLEAN_EVTYPE))
## [1] 177

The current data now has 177 event types.

Handle exponent value of PROPDMGEXP and CROPDMGEXP

According to National Weather Service Storm Data Documentation, the property damage and crop damages prodivded in the storm dataset comes with a seperate exponent that is PROPDMGEXP and CROPDMGEXP. The values of those variables are shown as below:

table(stormData$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
table(stormData$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

This in-depth work has shown us how to interpret those values.

Possible values of PROPDMGEXP and CROPDMGEXP:

In order to calculate costs, the PROPDMGEXP and CROPDMGEXP variables will be mapped to a multiplier factor which will then be used to calculate the actual costs for both property and crop damage. Two new variables will be created to store damage costs:

# function to get multiplier factor
getMultiplier <- function(exp) {
    exp <- toupper(exp);
    if (exp == "")  return (10^0);
    if (exp == "-") return (10^0);
    if (exp == "?") return (10^0);
    if (exp == "+") return (10^0);
    if (exp == "0") return (10^0);
    if (exp == "1") return (10^1);
    if (exp == "2") return (10^2);
    if (exp == "3") return (10^3);
    if (exp == "4") return (10^4);
    if (exp == "5") return (10^5);
    if (exp == "6") return (10^6);
    if (exp == "7") return (10^7);
    if (exp == "8") return (10^8);
    if (exp == "9") return (10^9);
    if (exp == "H") return (10^2);
    if (exp == "K") return (10^3);
    if (exp == "M") return (10^6);
    if (exp == "B") return (10^9);
    return (NA);
}

# calculate property damage and crop damage costs 
interestedData$PROP_COST <- with(interestedData, as.numeric(PROPDMG) * sapply(PROPDMGEXP, getMultiplier))
interestedData$CROP_COST <- with(interestedData, as.numeric(CROPDMG) * sapply(CROPDMGEXP, getMultiplier))

Data Analysis

Tidy data for analysing health impact

healthImpact <- interestedData %>% 
  group_by(EVTYPE) %>% 
  summarise(FATALITIES = sum(FATALITIES),
            INJURIES = sum(INJURIES)) 

tidyHealthImpact <- pivot_longer(healthImpact,
                             cols = -EVTYPE,
                             names_to = "health.impact",
                             values_to = "values") 

fatalityData <- tidyHealthImpact %>% 
  filter(health.impact == "FATALITIES") %>% 
  arrange(desc(values)) %>% 
  slice(1:10)

injuryData <- tidyHealthImpact %>% 
  filter(health.impact == "INJURIES") %>% 
  arrange(desc(values)) %>% 
  slice(1:10)

Tidy data for analysing economic impact

economicImpactData <- aggregate(x = list(DAMAGE_IMPACT = interestedData$PROP_COST + interestedData$CROP_COST), 
                                  by = list(EVENT_TYPE = interestedData$EVTYPE), 
                                  FUN = sum,
                                  na.rm = TRUE)
economicImpactData <- economicImpactData[order(economicImpactData$DAMAGE_IMPACT, decreasing = TRUE),]

Results

Impact of severve weather on population health

fatalityPlot <- ggplot(fatalityData)+
  geom_bar(aes(x = EVTYPE, y = values), stat = "identity", fill = "red", alpha = 0.5)+
  coord_flip()+
  ggtitle ("Top 10 weather events with greatest number of fatalities")+
  ylab("Number of fatalities")+
  xlab("Event type")

injuryPlot <- ggplot(injuryData)+
  geom_bar(aes(x = EVTYPE, y = values), stat = "identity", fill = "blue", alpha = 0.5)+
  coord_flip()+
  ggtitle ("Top 10 weather events with greatest number of injuries")+
  ylab("Number of injuries")+
  xlab("Event type")

gridExtra::grid.arrange(fatalityPlot, injuryPlot)

So far, excessive heat caused the greatest number of fatalities in the US, while tornado caused the greatest number of injuries.

Impact of severve weather on population economy

economicImpactPlot <- ggplot(head(economicImpactData, 10),
                            aes(x = EVENT_TYPE, y = DAMAGE_IMPACT, fill = EVENT_TYPE)) +
                            coord_flip() +
                            geom_bar(stat = "identity") + 
                            xlab("Event Type") +
                            ylab("Total Property / Crop Damage Cost") +
                            ggtitle("Top 10 Weather Events with\nGreatest Economic Consequences")+
  theme(legend.position = "none")
print(economicImpactPlot)

Across the United States, flood have the greatest economic consequences.