Weather Analysis

1. Synopsis

Storms and other severe weather events are a major contributor to both public health and economic problems in the US. Many of these events cause fatalities and injuries, and often millions of dollars in property damage. This is why preventing such outcomes, to the extent possible, is a key concern for many nations, including the United States.

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major weather events in the United States, including when and where they occur, estimates of any fatalities, injuries, and property damage, as well as brief descriptions of each event. The database dates back to about 1950, and it should be noted that less information is available for the earlier years.

This report studies the effect of severe weather events on the number of fatalities, the number of injuries, the amount of property damage, and the amount of damage done to crops. Three plots were generated, showing the event types with the highest mean fatalities, property damage, and crop damage, respectively.

2. Data Processing

2.1 The Data

There is only one dataset for this project. It comes as a comma-separated-value (CSV) file compressed with the bzip2 algorithm, and it can be downloaded from the course website. There is also a very useful FAQ, as well as some documentation you can consult if anything about the data is unclear.

2.2 The Assignment

The basic goal of this assignment is to explore the NOAA Storm Database and answer the following basic questions about severe weather events.

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

2. Across the United States, which types of events have the greatest economic consequences?

2.3 Cleaning Process

2.3.1 Loading in The Data

The data was downloaded from the course website, as described above, and was then loaded into R using the following code.

setwd("~/Documents/GitHub/Repro_Research/RepData_PeerAssessment2/")
storm0 <- read.csv("repdata_data_StormData.csv", header = TRUE)
head(storm0)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
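
Since the file is distributed bzip2-compressed, it may be convenient to read it without decompressing it first. The following is a minimal sketch of that idea, not the method used in this report: it builds a tiny compressed CSV in a temporary file (a stand-in for the real download, whose name and location are not assumed here) and reads it back through a bzfile() connection.

```r
# Build a tiny bzip2-compressed CSV in a temporary file.
# This is a stand-in for the real storm data file, used only for illustration.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "HEAT"),
                     FATALITIES = c(0, 3)),
          con, row.names = FALSE)
close(con)

# read.csv() accepts a connection; bzfile() decompresses while reading.
demo <- read.csv(bzfile(tmp), header = TRUE, stringsAsFactors = FALSE)
print(demo)
```

With stringsAsFactors = FALSE the EVTYPE column is read as plain character strings, which makes later string matching and replacement simpler.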

2.3.2 Extract Necessary Data

Most of the variables in this dataset are not required for the analysis being run. Furthermore, the dataset is fairly large and takes up a lot of memory, so removing unnecessary variables speeds up any further analysis. This is done using the following code.

event <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
storm <- storm0[event]
head(storm)

##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0

2.3.3 Group Similar Events

Running length(unique(storm$EVTYPE)) returns 985, meaning there are 985 distinct event type labels in the dataset. Clearly, many of these are duplicates: event types were not recorded uniformly when data was entered into this database. For instance, Unseasonably Warm and Dry, EXTREME HEAT, and EXCESSIVE HEAT are all treated as different event types. Such synonymous labels occur throughout the dataset. To remedy this, the grepl() function is used to find common words in the event types and group matching entries into one single event type. Note that grepl() is case-sensitive by default, so the patterns below match only upper-case labels. The code is shown below.

for (i in 1:length(storm$EVTYPE)){
    if (grepl("HEAT", storm$EVTYPE[i])){
        storm$EVTYPE[i] <- "HEAT"
    } else if (grepl("TORNADO", storm$EVTYPE[i])){
        storm$EVTYPE[i] <- "TORNADO"
    } else if (grepl("WARM", storm$EVTYPE[i])){
        storm$EVTYPE[i] <- "HEAT"
    } else {
        next
    }
}

for (i in 1:length(storm$EVTYPE)){
    if (grepl("COLD", storm$EVTYPE[i])){
        storm$EVTYPE[i] <- "COLD"
    } else {
        next
    }
}
for (i in 1:length(storm$EVTYPE)){
    if (grepl("FLOOD", storm$EVTYPE[i])){
        storm$EVTYPE[i] <- "FLOOD"
    } else {
        next
    }
}

The code is broken down into several for-loops because it simply takes a very long time to run (~15 min). However, it does exactly what is needed for this dataset.
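
Because grepl() is vectorized, the same grouping can be expressed without row-by-row loops. The sketch below is one possible alternative, not the code that was run for this report; the helper name group_events is hypothetical. It reproduces the priority of the if / else-if chain (HEAT first, then TORNADO, then WARM) and applies the COLD and FLOOD passes to the already-updated labels, as the separate loops do.

```r
# Hypothetical helper: collapse event labels that contain a keyword into one label.
group_events <- function(evtype) {
  x <- as.character(evtype)  # work on plain strings rather than factor codes

  # First pass mirrors the if / else-if chain: HEAT wins, then TORNADO, then WARM.
  is_heat    <- grepl("HEAT", x, fixed = TRUE)
  is_tornado <- !is_heat & grepl("TORNADO", x, fixed = TRUE)
  is_warm    <- !is_heat & !is_tornado & grepl("WARM", x, fixed = TRUE)
  x[is_heat | is_warm] <- "HEAT"
  x[is_tornado] <- "TORNADO"

  # Later passes run on the updated labels, like the separate loops.
  x[grepl("COLD", x, fixed = TRUE)] <- "COLD"
  x[grepl("FLOOD", x, fixed = TRUE)] <- "FLOOD"
  x
}

demo <- c("EXCESSIVE HEAT", "TORNADO F2", "UNSEASONABLY WARM", "EXTREME COLD",
          "FLASH FLOOD", "Unseasonably Warm and Dry", "HAIL")
grouped <- group_events(demo)
print(grouped)
```

The mixed-case label is left unchanged because the matching is case-sensitive, just as in the loops; wrapping the input in toupper() would catch it, but that would be a change from the original behavior. Unlike the loops, this helper returns a character vector rather than modifying a factor in place.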

2.3.4 Remove Anomalous Events

The next problem with analyzing this data is that many entries use the full name of a specific event. For instance, what should have been classified simply as HURRICANE is instead classified as HURRICANE KATRINA. While it is important to note which hurricane occurred at which time, one very large hurricane is still just a hurricane, and not a completely different kind of severe weather event. This was remedied using the code below, which removes all rows whose EVTYPE occurs 8 times or fewer in the dataset (only types with a count strictly greater than thresh are kept). The threshold is somewhat arbitrary; it was chosen because the highest-frequency anomalous event occurred 8 times. It can be changed by editing the value of thresh.

type.lengths <- tapply(storm$EVTYPE, storm$EVTYPE, length)
type <- names(as.data.frame(t(type.lengths)))
type.0 <- cbind(type,type.lengths)
thresh <- 8   #edit this if threshold must be changed
valu <- as.numeric(type.lengths>thresh)
valu[is.na(valu)] <- 0
test.0 <- cbind(type.0,valu)
type.1 <- test.0[valu==1,]
storm.d <- cbind(storm, as.numeric(storm$EVTYPE %in% type.1[,1]))
storm.1 <- storm.d[storm.d[,8]==1,]
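
The same frequency filter can be written more compactly with table(). The following is a sketch under the same rule (keep event types whose count is strictly greater than the threshold), shown on toy data; the helper name keep_frequent_types is hypothetical and this is not the code used to produce storm.1.

```r
# Hypothetical helper: keep only rows whose EVTYPE occurs more than `thresh` times.
keep_frequent_types <- function(df, thresh = 8) {
  counts <- table(as.character(df$EVTYPE))      # occurrences per event type
  frequent <- names(counts)[counts > thresh]    # types above the threshold
  df[df$EVTYPE %in% frequent, , drop = FALSE]
}

# Toy data: one common type (9 rows) and one rare, named event (2 rows).
toy <- data.frame(EVTYPE = c(rep("HURRICANE", 9), rep("HURRICANE KATRINA", 2)),
                  FATALITIES = 0,
                  stringsAsFactors = FALSE)
filtered <- keep_frequent_types(toy)
print(unique(filtered$EVTYPE))
```

Unlike the original block, this version does not append a flag column to the data frame, so the result has the same columns as the input.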

2.3.5 Normalizing Damages

The last aspect of the data that needed cleaning was the way property and crop damage are represented. Each damage amount is stored as a value in one variable and a magnitude code (the exponent) in another variable. To remedy this, the exponent codes were replaced with their numeric multipliers, which were then multiplied by the corresponding damage values (PROPDMG and CROPDMG). The code is shown below.

# cleaning damage data
storm$PROPEXP[storm$PROPDMGEXP == "K"] <- 1000
storm$PROPEXP[storm$PROPDMGEXP == "M"] <- 1e+06
storm$PROPEXP[storm$PROPDMGEXP == ""] <- 1
storm$PROPEXP[storm$PROPDMGEXP == "B"] <- 1e+09
storm$PROPEXP[storm$PROPDMGEXP == "m"] <- 1e+06
storm$PROPEXP[storm$PROPDMGEXP == "0"] <- 1
storm$PROPEXP[storm$PROPDMGEXP == "5"] <- 1e+05
storm$PROPEXP[storm$PROPDMGEXP == "6"] <- 1e+06
storm$PROPEXP[storm$PROPDMGEXP == "4"] <- 10000
storm$PROPEXP[storm$PROPDMGEXP == "2"] <- 100
storm$PROPEXP[storm$PROPDMGEXP == "3"] <- 1000
storm$PROPEXP[storm$PROPDMGEXP == "h"] <- 100
storm$PROPEXP[storm$PROPDMGEXP == "7"] <- 1e+07
storm$PROPEXP[storm$PROPDMGEXP == "H"] <- 100
storm$PROPEXP[storm$PROPDMGEXP == "1"] <- 10
storm$PROPEXP[storm$PROPDMGEXP == "8"] <- 1e+08

# Assigning '0' to invalid exponent data
storm$PROPEXP[storm$PROPDMGEXP == "+"] <- 0
storm$PROPEXP[storm$PROPDMGEXP == "-"] <- 0
storm$PROPEXP[storm$PROPDMGEXP == "?"] <- 0

# Calculating the property damage value
storm$PROPDMGVAL <- storm$PROPDMG * storm$PROPEXP

# Assigning values for the crop exponent data 
storm$CROPEXP[storm$CROPDMGEXP == "M"] <- 1e+06
storm$CROPEXP[storm$CROPDMGEXP == "K"] <- 1000
storm$CROPEXP[storm$CROPDMGEXP == "m"] <- 1e+06
storm$CROPEXP[storm$CROPDMGEXP == "B"] <- 1e+09
storm$CROPEXP[storm$CROPDMGEXP == "0"] <- 1
storm$CROPEXP[storm$CROPDMGEXP == "k"] <- 1000
storm$CROPEXP[storm$CROPDMGEXP == "2"] <- 100
storm$CROPEXP[storm$CROPDMGEXP == ""] <- 1

# Assigning '0' to invalid exponent data
storm$CROPEXP[storm$CROPDMGEXP == "?"] <- 0

# calculating the crop damage value
storm$CROPDMGVAL <- storm$CROPDMG * storm$CROPEXP
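
The one-assignment-per-code pattern above can also be expressed as a single lookup. The sketch below uses the same mapping as the code in this section (letters H, K, M, B in either case; a digit d read as 10^d; an empty code as 1; "+", "-" and "?" as 0). The helper name exp_multiplier is hypothetical, and any code outside this mapping is returned as NA rather than guessed.

```r
# Hypothetical helper: translate exponent codes into numeric multipliers.
exp_multiplier <- function(codes) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  codes <- toupper(as.character(codes))         # treat "k" like "K", "m" like "M"
  mult <- unname(lookup[codes])                 # NA for anything not in the lookup

  digit <- grepl("^[0-9]$", codes)              # single digits mean a power of ten
  mult[digit] <- 10 ^ as.numeric(codes[digit])

  mult[codes %in% ""] <- 1                      # no code: value is taken as is
  mult[codes %in% c("+", "-", "?")] <- 0        # invalid codes contribute nothing
  mult
}

codes <- c("K", "m", "", "5", "?", "B", "h")
multipliers <- exp_multiplier(codes)
print(multipliers)
```

Assuming the columns used in this report, the damage values could then be computed as storm$PROPDMG * exp_multiplier(storm$PROPDMGEXP), and likewise for the crop columns.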

3. Data Analysis

3.1 Find the Highest Mean of Each Incident by Event Type

The events "most harmful to population health" were taken to be those causing the most fatalities. Therefore, only the events with the highest fatalities were selected.

Likewise, the events with the “greatest economic consequences” were taken to be those causing the most property and crop damage, so only the events with the highest property and crop damage were selected.

Then, for each outcome (fatalities, property damage, and crop damage), the mean value per event type was computed. The code is shown below.

means <- aggregate(FATALITIES ~ EVTYPE, storm.1, FUN = mean)
means <- means[order(-means$FATALITIES),]
top.means <- head(means, n=11L)

propdmg <- aggregate(PROPDMGVAL ~ EVTYPE, storm, FUN = mean)
cropdmg <- aggregate(CROPDMGVAL ~ EVTYPE, storm, FUN = mean)
propdmg <- propdmg[order(-propdmg$PROPDMGVAL),]
cropdmg <- cropdmg[order(-cropdmg$CROPDMGVAL),]

3.2 Post-Data Cleaning

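The damage means above are computed on storm rather than on the filtered storm.1, so some anomalous events are still present in cropdmg and propdmg. To remedy this, the following code was run. The rows are removed by position, and each removal shifts the rows that follow, so please run this code in order and exactly as written.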

head(propdmg, n = 10L)

##                        EVTYPE PROPDMGVAL
## 226 HEAVY RAIN/SEVERE WEATHER 1250000000
## 330         HURRICANE/TYPHOON  787566364
## 327            HURRICANE OPAL  352538444
## 548               STORM SURGE  165990559
## 785                WILD FIRES  156025000
## 328 HURRICANE OPAL/HIGH WINDS  100000000
## 493       SEVERE THUNDERSTORM   92720000
## 207                 HAILSTORM   80333333
## 321                 HURRICANE   68208730
## 804   WINTER STORM HIGH WINDS   60000000

propdmg <- propdmg[-3,]
propdmg <- propdmg[-5,]
propdmg <- propdmg[-10,]
propdmg <- propdmg[-10,]
head(cropdmg, n = 10L)

##                        EVTYPE CROPDMGVAL
## 108         EXCESSIVE WETNESS  142000000
## 62            DAMAGING FREEZE   43683333
## 95                Early Frost   42000000
## 330         HURRICANE/TYPHOON   29634918
## 324            HURRICANE ERIN   19430000
## 61            Damaging Freeze   17065000
## 321                 HURRICANE   15758103
## 111              Extreme Cold   10000000
## 328 HURRICANE OPAL/HIGH WINDS   10000000
## 129                    FREEZE    6030068

cropdmg <- cropdmg[-5,]
cropdmg <- cropdmg[-8,]
propdmg <- head(propdmg, n = 10L)
cropdmg <- head(cropdmg, n = 10L)
propdmg

##                        EVTYPE PROPDMGVAL
## 226 HEAVY RAIN/SEVERE WEATHER 1250000000
## 330         HURRICANE/TYPHOON  787566364
## 548               STORM SURGE  165990559
## 785                WILD FIRES  156025000
## 493       SEVERE THUNDERSTORM   92720000
## 207                 HAILSTORM   80333333
## 321                 HURRICANE   68208730
## 804   WINTER STORM HIGH WINDS   60000000
## 739                   TYPHOON   54566364
## 549          STORM SURGE/TIDE   31359378

cropdmg

##                        EVTYPE CROPDMGVAL
## 108         EXCESSIVE WETNESS  142000000
## 62            DAMAGING FREEZE   43683333
## 95                Early Frost   42000000
## 330         HURRICANE/TYPHOON   29634918
## 61            Damaging Freeze   17065000
## 321                 HURRICANE   15758103
## 111              Extreme Cold   10000000
## 129                    FREEZE    6030068
## 494 SEVERE THUNDERSTORM WINDS    5800000
## 70                    DROUGHT    5615983
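
Removing rows by position depends on the exact row order at each step. A less fragile alternative is to drop rows by event name. The sketch below illustrates the idea on a few rows copied from the output above; the helper name drop_event_types is hypothetical, and the list of names covers only the named storms visible in that output (the original code also removes two rows that are not displayed, which this sketch does not attempt to identify).

```r
# Hypothetical helper: drop rows whose EVTYPE is in a given set of names.
drop_event_types <- function(df, drop) {
  df[!(df$EVTYPE %in% drop), , drop = FALSE]
}

# A few rows taken from the propdmg output shown above.
toy <- data.frame(
  EVTYPE = c("HURRICANE/TYPHOON", "HURRICANE OPAL",
             "STORM SURGE", "HURRICANE OPAL/HIGH WINDS"),
  PROPDMGVAL = c(787566364, 352538444, 165990559, 100000000),
  stringsAsFactors = FALSE
)

# Named storms visible in the outputs above.
named_storms <- c("HURRICANE OPAL", "HURRICANE OPAL/HIGH WINDS", "HURRICANE ERIN")
cleaned <- drop_event_types(toy, named_storms)
print(cleaned)
```

Because the filter works on names, it gives the same result regardless of the order in which it is applied or how the rows are sorted.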

3.3 Plot Findings

3.3.1 Shorten Variable Names

Many of the event type names are exceedingly long. This greatly affects the aesthetics of the plots, so the names are shortened to make the plots fit onto the page while retaining the necessary information.

propdmg[1,1] <- "HEAVY RAIN"
propdmg[8,1] <- "WINTER STORM"
propdmg[5,1] <- "THUNDERSTORM"
cropdmg[9,1] <- "THUNDERSTORM"
propdmg

##                EVTYPE PROPDMGVAL
## 226        HEAVY RAIN 1250000000
## 330 HURRICANE/TYPHOON  787566364
## 548       STORM SURGE  165990559
## 785        WILD FIRES  156025000
## 493      THUNDERSTORM   92720000
## 207         HAILSTORM   80333333
## 321         HURRICANE   68208730
## 804      WINTER STORM   60000000
## 739           TYPHOON   54566364
## 549  STORM SURGE/TIDE   31359378

cropdmg

##                EVTYPE CROPDMGVAL
## 108 EXCESSIVE WETNESS  142000000
## 62    DAMAGING FREEZE   43683333
## 95        Early Frost   42000000
## 330 HURRICANE/TYPHOON   29634918
## 61    Damaging Freeze   17065000
## 321         HURRICANE   15758103
## 111      Extreme Cold   10000000
## 129            FREEZE    6030068
## 494      THUNDERSTORM    5800000
## 70            DROUGHT    5615983

3.3.2 Plot

The event types with the highest mean fatalities, property damage, and crop damage are graphed using the code below.

# Fatality plot
require(ggplot2)

## Loading required package: ggplot2

gplot <- ggplot(data = top.means, aes(x=reorder(EVTYPE, -FATALITIES), y=FATALITIES))
gplot <- gplot + geom_bar(stat = "identity", color="skyblue", fill = "steelblue")
gplot <- gplot + theme(axis.text.x=element_text(angle=45, hjust=1))
gplot <- gplot + labs(title="Mean Number of Fatalities by Event Type",
                      x ="EVENT", y = "AVERAGE FATALITY RATE")
gplot

# Property damage and crop damage plots
par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.8)
barplot(propdmg$PROPDMGVAL/(10^9), las = 3, names.arg = propdmg$EVTYPE, 
        main = "Events with Highest Average Property Damage", ylab = "Damage Cost ($ billions)", 
        col = "steelblue")
barplot(cropdmg$CROPDMGVAL/(10^9), las = 3, names.arg = cropdmg$EVTYPE, 
        main = "Events With Highest Average Crop Damage", ylab = "Damage Cost ($ billions)", 
        col = "steelblue")

4. Results

Based on this data, tsunamis had the highest average fatality rate, closely followed by heat, rip currents, and avalanches. Clearly, the “CURRENT” term should have been included in my word grouping algorithm, because rip currents appear twice, which is an unintended result. Moreover, heavy rain had the highest average property damage and “excessive wetness” the highest average crop damage. Although the data do not confirm it, it seems reasonable to assume that excessive wetness is a result of heavy rain and can therefore be classified as heavy rain. In second place for property damage, hurricanes/typhoons averaged about $788 million per event, compared with $1.25 billion for heavy rain. Similarly, damaging freezes ranked second for crop damage, averaging about $44 million per event compared with $142 million for excessive wetness.