i. Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

ii. Research Questions

The research questions that this study is trying to address is as follows:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

iii. Loading and Data Processing

We first read in the from the raw csv file included in the zip archive. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

A. Load data from internet

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.csv.bz2")
data <- read.csv(bzfile("StormData.csv.bz2"))

B. Pre-Process and understand data

Next, we understand the structure and summary of data.

str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
summary(data)
##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI          END_DATE        
##  Min.   :   0.000   Length:902297      Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :   1.484                                                           
##  3rd Qu.:   1.000                                                           
##  Max.   :3749.000                                                           
##                                                                             
##    END_TIME           COUNTY_END COUNTYENDN       END_RANGE       
##  Length:902297      Min.   :0    Mode:logical   Min.   :  0.0000  
##  Class :character   1st Qu.:0    NA's:902297    1st Qu.:  0.0000  
##  Mode  :character   Median :0                   Median :  0.0000  
##                     Mean   :0                   Mean   :  0.9862  
##                     3rd Qu.:0                   3rd Qu.:  0.0000  
##                     Max.   :0                   Max.   :925.0000  
##                                                                   
##    END_AZI           END_LOCATI            LENGTH              WIDTH         
##  Length:902297      Length:902297      Min.   :   0.0000   Min.   :   0.000  
##  Class :character   Class :character   1st Qu.:   0.0000   1st Qu.:   0.000  
##  Mode  :character   Mode  :character   Median :   0.0000   Median :   0.000  
##                                        Mean   :   0.2301   Mean   :   7.503  
##                                        3rd Qu.:   0.0000   3rd Qu.:   0.000  
##                                        Max.   :2315.0000   Max.   :4400.000  
##                                                                              
##        F               MAG            FATALITIES          INJURIES        
##  Min.   :0.0      Min.   :    0.0   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:0.0      1st Qu.:    0.0   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :1.0      Median :   50.0   Median :  0.0000   Median :   0.0000  
##  Mean   :0.9      Mean   :   46.9   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:1.0      3rd Qu.:   75.0   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :5.0      Max.   :22000.0   Max.   :583.0000   Max.   :1700.0000  
##  NA's   :843563                                                           
##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:902297      Min.   :  0.000   Length:902297     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  12.06                      Mean   :  1.527                     
##  3rd Qu.:   0.50                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000                     
##                                                                           
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 

The results above show that the dataset contains 902,297 observations and 37 variables.

We will now select only the important variables to conduct the next steps of analysis. 7 variables will be chosen which are:
1. Event Type - EVTYPE
2. Number of Fatalities - FATALITIES
3. Number of Injuries - INJURIES
4. Property Damage - PROPDMG
5. Property Damage Multiplier - PROPDMGEXP
6. Crop Damage - CROPDMG
7. Crop Damange Multiplier - CROPDMGEXP

There are no missing values/NAs in the 7 variables that we are interested in, as seen in the summary data above.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df1 <- data %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

We will now create a new variable called TOTAL_POPN_HARM, which sums up the total FATALITIES plus total INJURIES.

df1 <- df1 %>% mutate(TOTAL_POPN_HARM = FATALITIES + INJURIES)
str(df1)
## 'data.frame':    902297 obs. of  8 variables:
##  $ EVTYPE         : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES     : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES       : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG        : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP     : chr  "K" "K" "K" "K" ...
##  $ CROPDMG        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP     : chr  "" "" "" "" ...
##  $ TOTAL_POPN_HARM: num  15 0 2 2 2 6 1 0 15 0 ...

C. Generalise events type categories

Next, we will now go through the Event Type categories and use the gsub command combine categories to create less categories that are unique and have more general terms.

# Standardising all naming to Uppercase
df1$EVTYPE <- toupper(df1$EVTYPE)

# Factor levels within each variables
df1$EVTYPE <- factor(df1$EVTYPE)
df1$PROPDMGEXP <- factor(df1$PROPDMGEXP)
df1$CROPDMGEXP <- factor(df1$CROPDMGEXP)

# Categorise events by more general terms
# Avalanche
df1$EVTYPE <- gsub("^.*AVALAN.*$", "AVALANCHE", df1$EVTYPE)

# Blizzard
df1$EVTYPE <- gsub("^.*BLIZZARD.*$", "BLIZZARD", df1$EVTYPE)

# Cloud
df1$EVTYPE <- gsub("^.*CLOUD.*$", "CLOUD", df1$EVTYPE)

# Cold Weather
df1$EVTYPE <- gsub("^.*COLD.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*ICE.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*ICY.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*FREEZ.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*COOL.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*LOW TEMP.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*WINTER.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*FROS.*$", "COLD WEATHER", df1$EVTYPE)

# Dry Weather
df1$EVTYPE <- gsub("^.*DRY.*$", "DRY WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*DRIE.*$", "DRY WEATHER", df1$EVTYPE)

# Dust
df1$EVTYPE <- gsub("^.*DUST.*$", "DUST", df1$EVTYPE)

# Flood
df1$EVTYPE <- gsub("^.*FLOO.*$", "FLOOD", df1$EVTYPE)

# Fog
df1$EVTYPE <- gsub("^.*FOG.*$", "FOG", df1$EVTYPE)

# Hail
df1$EVTYPE <- gsub("^.*HAIL.*$", "HAIL", df1$EVTYPE)

# High Tide/Waves
df1$EVTYPE <- gsub("^.*SEAS.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*SURF.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*TIDE.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*WAVE.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*SWELL.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*HIGH WATER.*$", "HIGH TIDE/WAVES", df1$EVTYPE)

# Hot Weather
df1$EVTYPE <- gsub("^.*HOT.*$", "HOT WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*WARM.*$", "HOT WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*HEAT.*$", "HOT WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*HIGH TEMP.*$", "HOT WEATHER", df1$EVTYPE)

# Hurricane
df1$EVTYPE <- gsub("^.*HURRICANE.*$", "HURRICANE", df1$EVTYPE)

# Hypothermia
df1$EVTYPE <- gsub("^.*HYPOTHERMIA.*$", "HYPOTHERMIA", df1$EVTYPE)

# Landslide
df1$EVTYPE <- gsub("^.*LANDSLIDE.*$", "LANDSLIDE", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*EROSI.*$", "LANDSLIDE", df1$EVTYPE)

# Mud slide
df1$EVTYPE <- gsub("^.*MUD.*$", "MUDSLIDE", df1$EVTYPE)

# Precipitation
df1$EVTYPE <- gsub("^.*PRECIP.*$", "PRECIPITATION RELATED", df1$EVTYPE)

# Rain/Lightning
df1$EVTYPE <- gsub("^.*RAIN.*$", "RAIN/LIGHTNING", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*LIGHTNING.*$", "RAIN/LIGHTNING", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*SHOWER.*$", "RAIN/LIGHTNING", df1$EVTYPE)

# Rip Current
df1$EVTYPE <- gsub("^.*RIP CURR.*$", "RIP CURRENT", df1$EVTYPE)

# Snow
df1$EVTYPE <- gsub("^.*SNOW.*$", "SNOW", df1$EVTYPE)

# Strong Winds
df1$EVTYPE <- gsub("^.*WIND.*$", "WINDS", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*WINDS.*$", "WINDS", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*TSTM.*$", "WINDS", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*WND.*$", "WINDS", df1$EVTYPE)

# Storm
df1$EVTYPE <- gsub("^.*STORM.*$", "STORM", df1$EVTYPE)

# Storm
df1$EVTYPE <- gsub("^.*SUMMARY.*$", "SUMMARY", df1$EVTYPE)

# Tornado
df1$EVTYPE <- gsub("^.*TORNADO.*$", "TORNADO", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*GUSTNADO.*$", "TORNADO", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*TORNDAO.*$", "TORNADO", df1$EVTYPE)

# Volcano
df1$EVTYPE <- gsub("^.*VOLCANIC.*$", "VOLCANIC", df1$EVTYPE)

# Urban Small Stream
df1$EVTYPE <- gsub("^.*URBAN.*$", "URBAN/SMALL STREAM", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*STREAM.*$", "URBAN/SMALL STREAM", df1$EVTYPE)

# Wet Weather
df1$EVTYPE <- gsub("^.*WET.*$", "WET WEATHER", df1$EVTYPE)

# Water Spout
df1$EVTYPE <- gsub("^.*SPOUT.*$", "WATERSPOUT", df1$EVTYPE)

# Wild Fires
df1$EVTYPE <- gsub("^.*FIRE.*$", "WILD FIRES", df1$EVTYPE)

df1$EVTYPE <- factor(df1$EVTYPE)

D. Calculate Property and Crop damage values

The property and crop damage multiplier will need to be redefined next. Here is how it will be categorised:
1. Non numbers or 0 = 0
2. 1 = 10
3. 2 = 100
4. h/H = 100
5. 3 = 1,000
6. K = 1,000
7. 4 = 10,000
8. 5 = 100,000
9. 6 = 1,000,000
10. m/M = 1,000,000
11. 7 = 10,000,000
12. 8 = 100,000,000
13. B = 1,000,000,000

The first will be for property damage

df1$PROPDMGEXP <- factor(df1$PROPDMGEXP,
                         levels = c("","-","?","+","0","1","2","3","4","5","6","7","8","B","h","H","K","m","M"),
                         labels = c("0","0","0","0","0","10","100","1000","10000","100000","1000000","10000000", 
                                    "100000000","1000000000","100","100","1000","1000000","1000000"))

df1$PROPDMGEXP <- as.character(df1$PROPDMGEXP)
df1$PROPDMGEXP <- as.integer(df1$PROPDMGEXP)

df1 <- df1 %>% mutate(PROP_DMG_VALUE = PROPDMG * PROPDMGEXP)

The second will be for crop damage (CROP_DMG_VALUE). Both the CROP_DMG_VALUE and property damagePROP_DMG_VALUEwill be mutated and summed up into a new variable calledTOTAL_DMG.

df1$CROPDMGEXP <- factor(df1$CROPDMGEXP,
                         levels = c("","?","0","2","B","k","K","m","M"),
                         labels = c("0","0","0","100","1000000000","1000","1000","1000000","1000000"))

df1$CROPDMGEXP <- as.character(df1$CROPDMGEXP)
df1$CROPDMGEXP <- as.integer(df1$CROPDMGEXP)

df1 <- df1 %>% mutate(CROP_DMG_VALUE = CROPDMG * CROPDMGEXP)
df1 <- df1 %>% mutate(TOTAL_DMG = CROP_DMG_VALUE * PROP_DMG_VALUE)
str(df1)
## 'data.frame':    902297 obs. of  11 variables:
##  $ EVTYPE         : Factor w/ 75 levels "?","APACHE COUNTY",..: 64 64 64 64 64 64 64 64 64 64 ...
##  $ FATALITIES     : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES       : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG        : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP     : int  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
##  $ CROPDMG        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TOTAL_POPN_HARM: num  15 0 2 2 2 6 1 0 15 0 ...
##  $ PROP_DMG_VALUE : num  25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
##  $ CROP_DMG_VALUE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TOTAL_DMG      : num  0 0 0 0 0 0 0 0 0 0 ...

iv. Results

In this section, we will plot some graphs to answer the Research Questions.

A. Q1: Which types of events are most harmful with respect to population health?

The graph and table below show that Tornados cause the most harm to the population, with 97,043 fatalities plus injuries.

df2 <- df1 %>% group_by(EVTYPE) %>% summarise(SUM_HARM = sum(TOTAL_POPN_HARM)) %>% arrange(desc(SUM_HARM))
head(df2, 10)
## # A tibble: 10 × 2
##    EVTYPE         SUM_HARM
##    <fct>             <dbl>
##  1 TORNADO           97043
##  2 WINDS             12464
##  3 HOT WEATHER       11789
##  4 FLOOD             10127
##  5 RAIN/LIGHTNING     6438
##  6 COLD WEATHER       5305
##  7 WILD FIRES         1698
##  8 HAIL               1512
##  9 HURRICANE          1461
## 10 SNOW               1293
df3 <- head(df2, 10)

barplot(df3$SUM_HARM, names.arg = df3$EVTYPE, 
        main = "Event types most harmful to population health (Fatalities + Injuries)",
        col = "green",
        las = 2)

B. Q2: Which types of events have the greatest economic consequences?

From the graph and table below, it can be seen that Floods cause the most economic damage.

df4 <- df1 %>% group_by(EVTYPE) %>% summarise(SUM_DMG = sum(TOTAL_DMG)) %>% arrange(desc(SUM_DMG))
head(df4, 10)
## # A tibble: 10 × 2
##    EVTYPE       SUM_DMG
##    <fct>          <dbl>
##  1 FLOOD        2.91e19
##  2 HURRICANE    1.36e19
##  3 WINDS        1.80e17
##  4 HAIL         1.50e16
##  5 DROUGHT      1.43e16
##  6 WILD FIRES   1.34e16
##  7 STORM        1.14e16
##  8 COLD WEATHER 3.72e15
##  9 TORNADO      3.08e15
## 10 BLIZZARD     2.56e15
df5 <- head(df4, 10)

barplot(df5$SUM_DMG, names.arg = df5$EVTYPE, 
        main = "Event types that caused the most damage",
        col = "blue",
        las = 2)