Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data

The data for this assignment come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. The file can be downloaded from the course web site; the download URL used here is given in the Data Processing section below.

There is also some documentation of the database available, which explains how some of the variables are constructed and defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

Downloading data

Set up directory for downloading data

dir.create("./data", showWarnings = FALSE)

downloadURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
downloadedFile <- "./data/repdata_data_StormData.csv.bz2"

Download and unzip data

library(R.utils)
if(!file.exists(downloadedFile)) {
    download.file(downloadURL, downloadedFile, method = "curl")
    bunzip2(downloadedFile, destname = "./data/stormdata.csv", remove = FALSE)
}

Check if required data is downloaded

file.exists(downloadedFile)
## [1] TRUE
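
As an aside, base R can read bzip2-compressed CSV files directly, so the explicit bunzip2 step is a convenience rather than a requirement. The following is a minimal sketch on a toy file; the temporary file and its two rows are made up for illustration and are not part of the storm data.

```r
# Toy round trip: write a small CSV through a bzip2 connection, then read it back.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(0, 1),
                  stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv.bz2")   # hypothetical temporary file
con <- bzfile(tmp, open = "wt")         # text connection that compresses with bzip2
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects bzip2 compression itself.
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
toy_back
```

Reading the compressed file directly saves disk space, at the cost of decompressing on every read.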

Loading data

The full data set has 902,297 rows and 37 columns. To reduce the loading time, we read only the columns needed for the analysis. These columns and their column indexes are:

Column name   Column index
EVTYPE        8
FATALITIES    23
INJURIES      24
PROPDMG       25
PROPDMGEXP    26
CROPDMG       27
CROPDMGEXP    28

To load only these columns, we set their classes to numeric or character and set the class of every other column to "NULL", which tells read.csv to skip it:

stormdata <- read.csv("./data/stormdata.csv",
                      colClasses = c(rep("NULL", 7),
                                     "character",
                                     rep("NULL", 14),
                                     rep("numeric", 3),
                                     "character",
                                     "numeric",
                                     "character",
                                     rep("NULL", 9)),
                      sep = ",",
                      header = TRUE)
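
Counting column positions by hand is easy to get wrong. As a hedged alternative, the colClasses vector can be built from column names by reading the header first. The sketch below uses a made-up four-column CSV string; `csv_text`, `wanted`, and `col_classes` are illustrative names, not part of the analysis above.

```r
# Toy CSV standing in for the real file (contents are made up).
csv_text <- "STATE,EVTYPE,MAG,FATALITIES\nAL,TORNADO,0,1\nTX,HAIL,175,0"

# Columns to keep, with the class each should be read as.
wanted <- c(EVTYPE = "character", FATALITIES = "numeric")

# Read one row only to obtain the header names.
header <- names(read.csv(text = csv_text, nrows = 1))

# Default every column to "NULL" (skip), then fill in the wanted ones by name.
col_classes <- rep("NULL", length(header))
col_classes[match(names(wanted), header)] <- unname(wanted)

toy <- read.csv(text = csv_text, colClasses = col_classes)
str(toy)
```

With the real file, the same idea would apply by passing the file path instead of `text =`.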

Data summary:

Take a quick look at the data:

head(stormdata)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0

Summary:

summary(stormdata)
##     EVTYPE            FATALITIES          INJURIES            PROPDMG       
##  Length:902297      Min.   :  0.0000   Min.   :   0.0000   Min.   :   0.00  
##  Class :character   1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:   0.00  
##  Mode  :character   Median :  0.0000   Median :   0.0000   Median :   0.00  
##                     Mean   :  0.0168   Mean   :   0.1557   Mean   :  12.06  
##                     3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:   0.50  
##                     Max.   :583.0000   Max.   :1700.0000   Max.   :5000.00  
##   PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Length:902297      Min.   :  0.000   Length:902297     
##  Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :  0.000   Mode  :character  
##                     Mean   :  1.527                     
##                     3rd Qu.:  0.000                     
##                     Max.   :990.000

Handling the exponent values of PROPDMGEXP and CROPDMGEXP

These are the possible values of PROPDMGEXP and CROPDMGEXP, with the multiplier each one stands for:

  • B or b = billion = 10^9
  • M or m = million = 10^6
  • K or k = thousand = 10^3
  • H or h = hundred = 10^2
  • The symbol “-” means “less than”; it is treated as 10^0 (ignored)
  • The symbol “+” means “greater than”; it is treated as 10^0 (ignored)
  • The symbol “?” means low certainty; it is treated as 10^0 (ignored)
  • A number from 0 to 10 is the power of ten = 10^TheNumber
  • A blank/empty string = 10^0


1. Convert the exponent columns to numbers:

Create a lookup data frame that maps each symbol to its value:

symbol <- c("B", "b", "M", "m", "K", "k", "H", "h",
            "-", "+", "?", as.character(0:10), "")
value <- c(rep(10^9, 2), rep(10^6, 2), rep(10^3, 2), rep(10^2, 2), 
           rep(10^0, 3), 10^c(0:10), 10^0)

converter <- data.frame(Symbol = symbol, Value = value)
head(converter)
##   Symbol Value
## 1      B 1e+09
## 2      b 1e+09
## 3      M 1e+06
## 4      m 1e+06
## 5      K 1e+03
## 6      k 1e+03

Replace each symbol with its value in stormdata:

## look up the multiplier for every PROPDMGEXP symbol
temp <- sapply(stormdata$PROPDMGEXP,
               function(ele) converter[converter$Symbol == ele, 2])
stormdata$PROPDMGEXP <- unlist(temp, use.names = FALSE)

## look up the multiplier for every CROPDMGEXP symbol
temp <- sapply(stormdata$CROPDMGEXP,
               function(ele) converter[converter$Symbol == ele, 2])
stormdata$CROPDMGEXP <- unlist(temp, use.names = FALSE)
head(stormdata)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0       1000       0          1
## 2 TORNADO          0        0     2.5       1000       0          1
## 3 TORNADO          0        2    25.0       1000       0          1
## 4 TORNADO          0        2     2.5       1000       0          1
## 5 TORNADO          0        2     2.5       1000       0          1
## 6 TORNADO          0        6     2.5       1000       0          1
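
The row-by-row lookup above is slow on roughly 900,000 rows. A vectorised version with match() should give the same multipliers in one pass, and it returns NA for an unknown symbol instead of silently dropping it. This is a sketch on made-up sample vectors (`exp_toy`, `dmg_toy`), not the code used to produce the output shown here.

```r
# Rebuild the same symbol-to-value table as above.
symbol <- c("B", "b", "M", "m", "K", "k", "H", "h",
            "-", "+", "?", as.character(0:10), "")
value <- c(rep(10^9, 2), rep(10^6, 2), rep(10^3, 2), rep(10^2, 2),
           rep(10^0, 3), 10^(0:10), 10^0)
converter <- data.frame(Symbol = symbol, Value = value)

# Made-up sample of exponent symbols and damage figures.
exp_toy <- c("K", "", "m", "5", "?")
dmg_toy <- c(25, 0, 1.5, 2, 10)

# match() finds the row of each symbol; indexing Value returns the multiplier.
mult <- converter$Value[match(exp_toy, converter$Symbol)]

# Damage in plain units: figure times multiplier.
dmg_toy * mult
```

For example, a PROPDMG of 25 with exponent "K" becomes 25 × 10^3 = 25000, which matches the first row of the output above.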

2. Recalculate the property and crop damage in the PROPDMG and CROPDMG columns:

Multiply each damage figure by its multiplier:

stormdata$PROPDMG <- stormdata$PROPDMG * stormdata$PROPDMGEXP
stormdata$CROPDMG <- stormdata$CROPDMG * stormdata$CROPDMGEXP
head(stormdata)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15   25000       1000       0          1
## 2 TORNADO          0        0    2500       1000       0          1
## 3 TORNADO          0        2   25000       1000       0          1
## 4 TORNADO          0        2    2500       1000       0          1
## 5 TORNADO          0        2    2500       1000       0          1
## 6 TORNADO          0        6    2500       1000       0          1

Remove PROPDMGEXP and CROPDMGEXP columns:

stormdata$PROPDMGEXP <- NULL
stormdata$CROPDMGEXP <- NULL
head(stormdata)
##    EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1 TORNADO          0       15   25000       0
## 2 TORNADO          0        0    2500       0
## 3 TORNADO          0        2   25000       0
## 4 TORNADO          0        2    2500       0
## 5 TORNADO          0        2    2500       0
## 6 TORNADO          0        6    2500       0

Analysis

In this part, we group the records that share the same EVTYPE and calculate the sum of each remaining column within each group:

library(dplyr)
group <- stormdata %>% 
    group_by(EVTYPE) %>%
    summarise_all(sum)
head(group)
## # A tibble: 6 × 5
##   EVTYPE                  FATALITIES INJURIES PROPDMG CROPDMG
##   <chr>                        <dbl>    <dbl>   <dbl>   <dbl>
## 1 "   HIGH SURF ADVISORY"          0        0  200000       0
## 2 " COASTAL FLOOD"                 0        0       0       0
## 3 " FLASH FLOOD"                   0        0   50000       0
## 4 " LIGHTNING"                     0        0       0       0
## 5 " TSTM WIND"                     0        0 8100000       0
## 6 " TSTM WIND (G45)"               0        0    8000       0
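
The first rows of the output show event types with leading spaces, such as " TSTM WIND", which are counted separately from "TSTM WIND". The analysis above keeps the labels as recorded. One possible refinement, shown here only as a sketch on made-up rows and using base R's aggregate() rather than dplyr, is to trim whitespace and unify case before grouping.

```r
# Made-up records with inconsistent labels for the same event type.
toy <- data.frame(
  EVTYPE     = c(" TSTM WIND", "TSTM WIND", "Tstm Wind", "HAIL"),
  FATALITIES = c(0, 2, 1, 0),
  PROPDMG    = c(8100000, 5000, 0, 2500),
  stringsAsFactors = FALSE
)

# Normalise labels: strip surrounding whitespace and convert to upper case.
toy$EVTYPE <- toupper(trimws(toy$EVTYPE))

# Sum every numeric column within each normalised event type.
totals <- aggregate(cbind(FATALITIES, PROPDMG) ~ EVTYPE, data = toy, FUN = sum)
totals
```

Whether such merging is appropriate depends on how strictly the official event categories should be followed, so treat it as an assumption rather than a correction.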

Results

Function to generate a bar plot:

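The function generate.bar takes a data frame (df), the vectors to plot on the x and y axes (x, y), the axis labels (x.lab, y.lab), and the plot title (title):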

library(ggplot2)
generate.bar <- function(df, x, y, x.lab, y.lab, title) {
    ## x and y are vectors (e.g. df$EVTYPE); each bar is filled by its x value
    p <- ggplot(df, aes(x = x, y = y, fill = x)) +
      geom_bar(stat = "identity") +
      xlab(x.lab) +
      ylab(y.lab) +
      ggtitle(title) +
      theme(legend.position = "none") +  ## remove legends
      theme(text = element_text(size = 12)) +  ## resize text
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))  ## rotate labels
    p  ## return the plot object explicitly
}
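
When the x values are a character vector, ggplot2 orders a discrete axis alphabetically, so the bars do not appear in ranked order. One possible adjustment, which is not part of the function above, is to convert the labels to a factor whose levels follow the ranking before plotting. A small sketch using three totals taken from the tables in this report:

```r
# Three event types with their fatality totals from this report.
toy <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "HAIL"),
                  FATALITIES = c(470, 5633, 15),
                  stringsAsFactors = FALSE)

# Set factor levels in descending order of FATALITIES; a discrete axis
# follows the level order rather than the alphabet.
toy$EVTYPE <- factor(toy$EVTYPE,
                     levels = toy$EVTYPE[order(-toy$FATALITIES)])

levels(toy$EVTYPE)
```

Passing such a factor as the x argument would then draw the tallest bar first.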

Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

In this part, we generate bar plots showing the five weather events most harmful to population health, separately for fatalities and injuries:

In terms of fatalities:

Extract the five events with the most fatalities:

fatalities <- group[order(-group$FATALITIES), ]
fatalities <- fatalities[1:5, ]
head(fatalities)
## # A tibble: 5 × 5
##   EVTYPE         FATALITIES INJURIES      PROPDMG    CROPDMG
##   <chr>               <dbl>    <dbl>        <dbl>      <dbl>
## 1 TORNADO              5633    91346 56947380676.  414953270
## 2 EXCESSIVE HEAT       1903     6525     7753700   492402000
## 3 FLASH FLOOD           978     1777 16822673978. 1421317100
## 4 HEAT                  937     2100     1797000   401461500
## 5 LIGHTNING             816     5230   930379430.   12092090
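
The same order-then-subset pattern is repeated for each measure in this report. It could be wrapped in a small helper, sketched below. The name `top_events` is hypothetical, and the sample data frame reuses six totals from the tables in this report.

```r
# Return the n rows of df with the largest values in the named column.
top_events <- function(df, column, n = 5) {
  ranked <- df[order(-df[[column]]), , drop = FALSE]  # sort descending
  head(ranked, n)                                     # keep the first n rows
}

# Six event types with totals taken from this report.
toy_group <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD",
                 "HEAT", "LIGHTNING", "TSTM WIND"),
  FATALITIES = c(5633, 1903, 978, 937, 816, 504),
  INJURIES   = c(91346, 6525, 1777, 2100, 5230, 6957),
  stringsAsFactors = FALSE
)

top_injuries <- top_events(toy_group, "INJURIES", n = 3)
top_injuries$EVTYPE
```

Using head() rather than indexing with 1:n also behaves sensibly when the data frame has fewer than n rows.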

Create the plot for fatalities:

fatalities.plot <- generate.bar(fatalities, 
                                fatalities$EVTYPE, fatalities$FATALITIES,
                                "Type of event", "Fatalities damage", 
                                "Top 5 most harmful events damaged in Fatalities")

In terms of injuries:

Extract the five events with the most injuries:

injuries <- group[order(-group$INJURIES), ]
injuries <- injuries[1:5, ]
head(injuries)
## # A tibble: 5 × 5
##   EVTYPE         FATALITIES INJURIES       PROPDMG    CROPDMG
##   <chr>               <dbl>    <dbl>         <dbl>      <dbl>
## 1 TORNADO              5633    91346  56947380676.  414953270
## 2 TSTM WIND             504     6957   4484928495   554007350
## 3 FLOOD                 470     6789 144657709807  5661968450
## 4 EXCESSIVE HEAT       1903     6525      7753700   492402000
## 5 LIGHTNING             816     5230    930379430.   12092090

Create the plot for injuries:

injuries.plot <- generate.bar(injuries, 
                              injuries$EVTYPE, injuries$INJURIES,
                              "Type of event", "Injuries damage",
                              "Top 5 most harmful events damaged in Injuries")

Plot both types of harm side by side in one panel:

library(gridExtra)
grid.arrange(fatalities.plot, injuries.plot, ncol = 2)

→ As can be seen from the plot, TORNADO is the most harmful type of event for both fatalities and injuries. Its totals are far higher than those of any other event type: about 3 times the fatalities of the second-ranked event (5633 against 1903 for EXCESSIVE HEAT) and about 13 times the injuries of the second-ranked event (91346 against 6957 for TSTM WIND).
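
The ratios between the first and second ranked events can be checked directly from the totals in the two tables above. A short sketch, with the totals typed in by hand:

```r
# Top two totals for each measure, copied from the tables above.
fatalities_top2 <- c("TORNADO" = 5633, "EXCESSIVE HEAT" = 1903)
injuries_top2   <- c("TORNADO" = 91346, "TSTM WIND" = 6957)

# Ratio of the first-ranked total to the second-ranked total.
fatality_ratio <- unname(fatalities_top2[1] / fatalities_top2[2])
injury_ratio   <- unname(injuries_top2[1] / injuries_top2[2])

round(c(fatalities = fatality_ratio, injuries = injury_ratio), 1)
```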

Question 2: Across the United States, which types of events have the greatest economic consequences?

In this part, we generate bar plots showing the five weather events with the greatest economic consequences, separately for property damage and crop damage:

In terms of property damage:

Extract the five events with the highest property damage:

property <- group[order(-group$PROPDMG), ]
property <- property[1:5, ]
head(property)
## # A tibble: 5 × 5
##   EVTYPE            FATALITIES INJURIES       PROPDMG    CROPDMG
##   <chr>                  <dbl>    <dbl>         <dbl>      <dbl>
## 1 FLOOD                    470     6789 144657709807  5661968450
## 2 HURRICANE/TYPHOON         64     1275  69305840000  2607872800
## 3 TORNADO                 5633    91346  56947380676.  414953270
## 4 STORM SURGE               13       38  43323536000        5000
## 5 FLASH FLOOD              978     1777  16822673978. 1421317100

Create the plot for property damage:

property.plot <- generate.bar(property, 
                              property$EVTYPE, property$PROPDMG,
                              "Type of event", "Property consequences",
                              "Top 5 most harmful events damaged in Property")

In terms of crop damage:

Extract the five events with the highest crop damage:

crop <- group[order(-group$CROPDMG), ]
crop <- crop[1:5, ]
head(crop)
## # A tibble: 5 × 5
##   EVTYPE      FATALITIES INJURIES       PROPDMG     CROPDMG
##   <chr>            <dbl>    <dbl>         <dbl>       <dbl>
## 1 DROUGHT              0        4   1046106000  13972566000
## 2 FLOOD              470     6789 144657709807   5661968450
## 3 RIVER FLOOD          2        2   5118945500   5029459000
## 4 ICE STORM           89     1975   3944927860   5022113500
## 5 HAIL                15     1361  15735267513.  3025954473

Create the plot for crop damage:

crop.plot <- generate.bar(crop, 
                          crop$EVTYPE, crop$CROPDMG,
                          "Type of event", "Crop damage",
                          "Top 5 most harmful events damaged in Crop")

Plot both types of economic consequence side by side in one panel:

grid.arrange(property.plot, crop.plot, ncol = 2)

→ As can be seen from the plot, FLOOD and DROUGHT are the event types with the greatest economic consequences for property and crops, with total damage of approximately 1.45 × 10^11 and 1.4 × 10^10 dollars, respectively.
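
Totals of this size are easier to compare in billions. A brief sketch, with the two leading totals copied by hand from the tables above:

```r
# Largest property and crop damage totals from the tables above.
damage <- c(FLOOD_property = 144657709807, DROUGHT_crop = 13972566000)

# Express each total in billions, rounded to one decimal place.
damage_billions <- round(damage / 1e9, 1)
damage_billions
```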