Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Preventing such outcomes to the extent possible is a key concern. This report addresses two questions: 1. Across the United States, which types of events are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences?

Data processing

Load data from zip file

The data for this analysis come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The source data file is downloaded from this link

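# Load the packages used below (dplyr for %>%, select, filter, mutate; ggplot2 for the plots)
  library(dplyr)
  library(ggplot2)

# Load data file into R (read.csv reads the bzip2-compressed file directly)
  StormData <- read.csv("C:/Users/Admin/Desktop/RStudio_et_Github/Reproducible Research/Final assigment/repdata_data_StormData.csv.bz2")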

# Summarize the loaded data frame
  str(StormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
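
As a small self-contained check of the loading step, the sketch below writes a toy data frame (values invented for illustration) to a temporary bzip2-compressed CSV and reads it back. It relies on read.csv() detecting the compression by itself, so no manual decompression step is needed; the names toy, tmp and toy_back are placeholders, not part of the analysis.

```r
# Toy data, made up for this illustration only.
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

# Write the CSV through a bzip2 connection to a temporary file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the file and decompresses it transparently.
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
str(toy_back)
```

Using a relative path or tempfile() in this way also avoids tying the report to one machine's directory layout.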

Subset data

  • Explanations of the column names exist in this link. However, for the scope of this analysis, only the variables related to public health and the economy are kept:

    • BGN_DATE Beginning date
    • EVTYPE Weather event type
    • FATALITIES and INJURIES Number of people killed or injured by the event, which measure the impact on public health
    • PROPDMG and CROPDMG Property damage and crop damage, which measure the impact on the economy
    • PROPDMGEXP and CROPDMGEXP Magnitude codes (multipliers, in USD) for the property and crop damage figures
  • According to the NOAA (https://www.ncdc.noaa.gov/stormevents/details.jsp), all event types have been recorded only since 1996. To make events comparable, data from before 1996 are removed.

  • Any observations with NA values are removed.

# Subset needed data
  data <- StormData %>% select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# Convert BGN_DATE to Date class (the trailing time part of the string is ignored)
  data$BGN_DATE <- as.Date(data$BGN_DATE, format = "%m/%d/%Y")
  
# Filter data 
  data <- data %>% filter(BGN_DATE >= as.Date("1996-01-01"))  # Data since 1996
  data <- data %>% filter(complete.cases(.))  # Drop rows containing any NA
  data <- data %>% select(-BGN_DATE)        # Drop time data
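
To make the date handling concrete, here is a minimal base-R sketch on three invented rows that use the same date-time string format as BGN_DATE. It illustrates the idea of the filtering step rather than reproducing the report's exact pipeline; toy and recent are hypothetical names.

```r
# Toy rows in the same date-time string format as BGN_DATE (values invented).
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/1/1996 0:00:00", "7/4/2005 0:00:00"),
  EVTYPE   = c("TORNADO", "FLOOD", "HAIL"),
  stringsAsFactors = FALSE
)

# as.Date() parses the leading month/day/year and ignores the trailing time.
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep events from 1996 onwards, then drop any row that contains an NA.
recent <- toy[toy$BGN_DATE >= as.Date("1996-01-01"), ]
recent <- recent[complete.cases(recent), ]
recent$EVTYPE
```

Comparing against an explicit Date object, rather than a bare string, keeps the comparison independent of how R would guess the string's format.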

Process unit for Property and Crop damage

  • The variables PROPDMGEXP and CROPDMGEXP hold the magnitude (a multiplier, in USD) for PROPDMG and CROPDMG, respectively. Their values are coded as:
    • “”, “?”, “+”, “-”: 1
    • “0”: 1
    • “1”: 10
    • “2”: 100
    • “3”: 1,000
    • “4”: 10,000
    • “5”: 100,000
    • “6”: 1,000,000
    • “7”: 10,000,000
    • “8”: 100,000,000
    • “9”: 1,000,000,000
    • “H”: 100
    • “K”: 1,000
    • “M”: 1,000,000
    • “B”: 1,000,000,000
    => To put all damage figures on the same scale, the character codes must be converted to numbers
# For PROPDMGEXP
  ## Check which codes are in use
    table(data$PROPDMGEXP)
## 
##             0      B      K      M 
## 276185      1     32 369938   7374
  ## => The codes appearing in this column are 0, B, K, M

  ## Replace each code with an exponent of 10 (B -> 9, K -> 3, M -> 6)
  data$PROPDMGEXP <- gsub("0","1",data$PROPDMGEXP)  # "0" becomes exponent 1, i.e. a multiplier of 10
  data$PROPDMGEXP <- gsub("B","9",data$PROPDMGEXP) 
  data$PROPDMGEXP <- gsub("K","3",data$PROPDMGEXP) 
  data$PROPDMGEXP <- gsub("M","6",data$PROPDMGEXP)
  data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
  
# For CROPDMGEXP
  ## Check which codes are in use
    table(data$CROPDMGEXP)
## 
##             B      K      M 
## 373069      4 278686   1771
  ## => The codes appearing in this column are B, K, M
    
  ## Replace each code with an exponent of 10 (B -> 9, K -> 3, M -> 6)
  data$CROPDMGEXP <- gsub("B","9",data$CROPDMGEXP) 
  data$CROPDMGEXP <- gsub("K","3",data$CROPDMGEXP) 
  data$CROPDMGEXP <- gsub("M","6",data$CROPDMGEXP)
  data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
  
# Blank codes become NA after as.numeric(). They are values recorded without a unit, so the
# resulting NA damage figures are ignored in the sums below (na.rm = TRUE)
  • Calculating economy impact
  data[,"PROPERTY"] <- with(data,PROPDMG*10^PROPDMGEXP)
  data[,"CROP"] <- with(data,CROPDMG*10^CROPDMGEXP)
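
A lookup vector is one alternative to chained gsub() calls for this kind of recoding. The sketch below is an illustration under stated assumptions, not the method used in this report: exp_multiplier and to_usd are hypothetical names, the multipliers follow the coding list above, and blank or unlisted codes are treated as a multiplier of 1. Because the report's own code leaves blank codes as NA, totals computed this way may not match the tables shown in this report.

```r
# Multipliers for the exponent codes: digits 0-9 are powers of ten,
# letters are hundreds, thousands, millions and billions.
exp_multiplier <- c(setNames(10^(0:9), 0:9),
                    H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a damage figure and its code to USD.
to_usd <- function(amount, code) {
  mult <- unname(exp_multiplier[toupper(code)])
  # Assumption: blank or unlisted codes ("", "?", "+", "-") mean "no multiplier".
  mult[is.na(mult)] <- 1
  amount * mult
}

to_usd(c(25, 2.5, 1.2, 7), c("K", "M", "B", ""))
```

Keeping the whole mapping in one named vector makes it easy to compare against the coding list and to extend if other codes appear in the data.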

Analysis

Calculate health impact

Deaths and injuries are given the same weight in their impact on public health. There are many event types, so we look only at those with the highest impact (above the 98th percentile).

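# PH is a new variable representing public health impact
  Health <- data %>% mutate(PH = FATALITIES + INJURIES)
# Calculate health impact by events 

  Health <- aggregate(Health$PH,by=list(Health$EVTYPE),sum,na.rm=TRUE)
  Health <- subset(Health,x>quantile(x,probs=0.98))   # Keep event types above the 98th percentile
  Health <- Health[order(-Health$x),]                 # Sort from highest to lowest impact
  colnames(Health) <- c("Event","Impact")
  Health 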
##                 Event Impact
## 426           TORNADO  22178
## 81     EXCESSIVE HEAT   8188
## 102             FLOOD   7172
## 224         LIGHTNING   4792
## 434         TSTM WIND   3870
## 98        FLASH FLOOD   2561
## 421 THUNDERSTORM WIND   1530
## 507      WINTER STORM   1483
## 147              HEAT   1459
## 185 HURRICANE/TYPHOON   1339
## 177         HIGH WIND   1318
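
The aggregation pattern used here (sum per event type, keep the upper tail, sort) can be traced by hand on a few invented rows. This is a minimal sketch with hypothetical names (toy, toy_health, top) and a 50th-percentile cut-off chosen only so that the tiny example has something to drop.

```r
# Toy events, invented for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(1, 0, 2, 4, 1),
  INJURIES   = c(10, 2, 5, 0, 0),
  stringsAsFactors = FALSE
)

# PH weights each fatality and each injury equally.
toy$PH <- toy$FATALITIES + toy$INJURIES

# Sum PH per event type and sort from highest to lowest.
toy_health <- aggregate(PH ~ EVTYPE, data = toy, FUN = sum)
toy_health <- toy_health[order(-toy_health$PH), ]
toy_health

# Keep only the event types strictly above a chosen percentile of the totals.
top <- subset(toy_health, PH > quantile(PH, probs = 0.5))
top
```

With these numbers the totals are 18 for TORNADO, 4 for HEAT and 3 for FLOOD, and only TORNADO lies above the median total.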

Calculate economic impact

Property damage and crop damage are given the same weight in their impact on the economy.

# EC is a new variable representing economic impact
  Economy <- data %>% mutate(EC = PROPERTY + CROP) 
# Calculate economic impact by events 

  Economy <- aggregate(Economy$EC,by=list(Economy$EVTYPE),sum,na.rm=TRUE)
  Economy <- subset(Economy,x>quantile(x,probs=0.98))   # Keep event types above the 98th percentile
  Economy <- Economy[order(-Economy$x),]                # Sort from highest to lowest impact
  colnames(Economy) <- c("Event","Impact")
  Economy
##                 Event       Impact
## 102             FLOOD 137278823900
## 185 HURRICANE/TYPHOON  29348167800
## 426           TORNADO  16308770350
## 183         HURRICANE  12404268000
## 142              HAIL   9331288590
## 98        FLASH FLOOD   8402099530
## 343  STORM SURGE/TIDE   4641493000
## 421 THUNDERSTORM WIND   3780985440
## 496          WILDFIRE   3684468370
## 177         HIGH WIND   3057106640
## 63            DROUGHT   1868412000
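
One detail of the sum PROPERTY + CROP is worth seeing on a small example. A blank exponent code becomes NA during conversion, so the sum is NA whenever either component is NA, and na.rm = TRUE then leaves that whole row out of the event total. The sketch below (invented values, hypothetical column names EC_plain and EC_filled) contrasts plain addition, which is what the figures in this report use, with treating a missing component as zero. Which of the two is appropriate is a judgment call.

```r
# Toy damage figures; NA marks a value whose exponent code was blank.
toy <- data.frame(
  EVTYPE   = c("FLOOD", "FLOOD", "HAIL"),
  PROPERTY = c(1000, 5000, NA),
  CROP     = c(NA, 200, 300),
  stringsAsFactors = FALSE
)

# Plain addition: a row with an NA on either side becomes NA.
toy$EC_plain <- toy$PROPERTY + toy$CROP

# Alternative: treat a missing component as zero before adding.
toy$EC_filled <- rowSums(toy[, c("PROPERTY", "CROP")], na.rm = TRUE)

# Totals per event type under both conventions.
plain  <- aggregate(toy$EC_plain,  by = list(Event = toy$EVTYPE), sum, na.rm = TRUE)
filled <- aggregate(toy$EC_filled, by = list(Event = toy$EVTYPE), sum, na.rm = TRUE)
plain
filled
```

Here plain addition gives FLOOD 5200 and HAIL 0, while the zero-filled version gives FLOOD 6200 and HAIL 300.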

Results

Events with biggest impact on public health

# Keep the 10 events with the biggest impact for plotting
Health <- head(Health,10)

# Plot Event by impact

health.plot <- ggplot(Health, aes(x = Event, y = Impact, fill = Event)) +
              geom_bar(stat = "identity") +
              coord_flip() +
              ylab("Total fatalities and injuries") +
              ggtitle("Impact of weather event types on public health") +
              theme(plot.title = element_text(hjust = 0.5))

print(health.plot)

According to the graph, tornadoes are the event type with the biggest impact on public health (fatalities and injuries combined).
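
A bar chart like this is often easier to read when the bars are sorted by size. The sketch below shows one way to do that with reorder() on invented totals; it is an optional variation, not the plot used in this report, and toy and toy.plot are placeholder names. The chart object is built only if ggplot2 is installed.

```r
# Toy totals, invented for illustration.
toy <- data.frame(Event  = c("FLOOD", "TORNADO", "HEAT"),
                  Impact = c(30, 120, 75),
                  stringsAsFactors = FALSE)

# reorder() turns Event into a factor whose levels follow Impact (ascending),
# so after coord_flip() the largest bar ends up at the top.
toy$Event <- reorder(toy$Event, toy$Impact)
levels(toy$Event)

# Build the chart only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  toy.plot <- ggplot(toy, aes(x = Event, y = Impact)) +
    geom_bar(stat = "identity") +
    coord_flip()
}
```

Calling print(toy.plot) renders the chart, in the same way as the plots in this report.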

Events with biggest impact on economy

# Keep the 10 events with the biggest impact for plotting
Economy <- head(Economy,10)

# Plot Event by impact

economy.plot <- ggplot(Economy, aes(x = Event, y = Impact, fill = Event)) +
              geom_bar(stat = "identity") +
              coord_flip() +
              ylab("Total economic damage (USD)") +
              ggtitle("Impact of weather event types on the economy") +
              theme(plot.title = element_text(hjust = 0.5))

print(economy.plot)

According to the graph, floods are the event type with the biggest impact on the economy (property and crop damage combined).