US NOAA Storm Data Analysis (1950-2011)

Synopsis:

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

This analysis focuses on answering two basic questions:
1- Which types of events are most harmful with respect to population health? 2- Which types of events have the greatest economic consequences?

Storm dataset

The data for this project come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from this web site:

Storm Data(47Mb)

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data transformation

Download and read the storm dataset. This code chunck is time-consuming and we will be using the cache = TRUE option. This way, after reading the dataset, it will be saved in cache.

# Set working directory
setwd("~/R_coursera/Reproducible Research/Assignment2")
# If file is not in the working directory then download it.
if (!file.exists("stormdata.csv.bz2")) {
      download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                    destfile = "stormdata.csv.bz2")
}
# Read storm dataset
stormdata <- read.csv("stormdata.csv.bz2", header = TRUE, sep = "," )

This is the structure of the raw dataset that will be used:

# Display stormdata dataset structure
str(stormdata)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The first 6 records of the Storm dataset look like this:

# Display first 6 records of the stormdata dataset
head(stormdata)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

In order to make the analysis less time-comsuming, a subset of the Storm dataset will be created. The new structure of the subset dataset will be displayed :

# Subset stormdata with only the required columns
cols <- c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
stormsub <- stormdata[c(8, 23:28)]
# Display stormdata subset head records
head(stormsub)

##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0

Results:

Question 1:

Which types of events are most harmful with respect to population health?

To answer this questions is necessary to aggregate the dataset by type of event(EVTYPE) and add the variables FATALITIES and INJURIES. After, only the 15 most harmfull types of events will selected to be later displayed in a bar chart to better illustrate the answer.

# Aggregate number of harmed people by storm type of event
harmfull <- aggregate(FATALITIES+INJURIES ~ EVTYPE, stormsub, sum)
# Change data frame columns names
names(harmfull) = c("EVTYPE","FATALINJUR")
# Get only the top 15 most harmfull storm events
topharmfull <- head(harmfull[order(harmfull$FATALINJUR, decreasing= T),], n = 15)
tevent <- topharmfull$EVTYPE[1]
tinjur <- topharmfull$FATALINJUR[1]

topharmfull$FATALINJUR <- topharmfull$FATALINJUR/1000

Here is the bar chart to better illustrate the 15 most harmfull types of events:

# Plot a bar chart total number of harmed people by storm type of event
barplot(topharmfull$FATALINJUR,
        names.arg = topharmfull$EVTYPE,
        ylab = "Total number fatalities and injuries(in thousands)", main = "Top 15 Most Harmfull Types of Events",
        cex.axis = 0.8, cex.names = 0.6, col = "blue", axes = T, las = 2, ylim = c(0,100))

Answer 1:

The most harmfull type of event is TORNADO and it caused 9.697910^{4} fatalities and injuries.

.

Question 2:

Which types of events have the greatest economic consequences?

To better answer this question, some extra data processing is necessary:

The exponents are stored in the dataset in many different characters:

# In order to get the total of economic damage, it is necessary to multiply the damage by its exponent
unique(stormsub$PROPDMGEXP)

##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

Property and crop damage are calculated multiplying the value of the damage(“PROPDMG” and “CROPDMG”) by its exponent(“PROPDMGEXP” and “CROPDMGEXP”). The exponents will be converted to a numeric based on its value and multiplied accordingly.

# create a column with numeric proprety damage exponent 
stormsub$PROPDMGEXP <- as.character(stormsub$PROPDMGEXP)
stormsub$PROPDMGEXP[(stormsub$PROPDMGEXP=="")|(stormsub$PROPDMGEXP=="+")|(stormsub$PROPDMGEXP=="?")|
                (stormsub$PROPDMGEXP=="-")|(stormsub$PROPDMGEXP=="0")] <- 0
stormsub$PROPDMGEXP[(stormsub$PROPDMGEXP=="h")|(stormsub$PROPDMGEXP=="H")] <- 2
stormsub$PROPDMGEXP[(stormsub$PROPDMGEXP=="K")] <- 3
stormsub$PROPDMGEXP[(stormsub$PROPDMGEXP=="M")|(stormsub$PROPDMGEXP=="m")] <- 6
stormsub$PROPDMGEXP[(stormsub$PROPDMGEXP=="B")] <- 9

stormsub$CROPDMGEXP <- as.character(stormsub$CROPDMGEXP)
stormsub$CROPDMGEXP[(stormsub$CROPDMGEXP=="")|(stormsub$CROPDMGEXP=="?")|(stormsub$CROPDMGEXP=="0")] <- 0
stormsub$CROPDMGEXP[(stormsub$CROPDMGEXP=="K")|(stormsub$CROPDMGEXP=="k")] <- 3
stormsub$CROPDMGEXP[(stormsub$CROPDMGEXP=="M")|(stormsub$CROPDMGEXP=="m")] <- 6
stormsub$CROPDMGEXP[(stormsub$CROPDMGEXP=="B")] <- 9

stormsub$PROPDMGEXP <- as.numeric(stormsub$PROPDMGEXP)
stormsub$CROPDMGEXP <- as.numeric(stormsub$CROPDMGEXP)

stormsub$TPROPDMG <- stormsub$PROPDMG*10^stormsub$PROPDMGEXP
stormsub$TCROPDMG <- stormsub$CROPDMG*10^stormsub$CROPDMGEXP
stormsub$TOTDMG <- stormsub$TPROPDMG+stormsub$TCROPDMG

It is also necessary to aggregate the types of events by the total damage it causes. Then, only the top 15 types of events with the heghest economic damages are selected. To better illustrate the answer , the total damage is divided by one billion and displayed in a bar chart in US$ billions.

# Aggregate number of harmed people by storm type of event
ecodmg <- aggregate(TOTDMG ~ EVTYPE, stormsub, sum)
# Get only the top 15 highest economic damage storm events
topecodmg <- head(ecodmg[order(ecodmg$TOTDMG, decreasing= T),], n = 15)
# divide totals by 1,000,000,000 to convert in units of billion
topecodmg$TOTDMG <- topecodmg$TOTDMG/1000000000
#variables with top event and its value
tedmg <- topecodmg$EVTYPE[1]
tvdmg <- topecodmg$TOTDMG[1]

Here is the bar plot that illustrates the top 15 types of events with highest economic damage:

# Plot a bar chart total economic damage by storm type of event
barplot(topecodmg$TOTDMG,
        names.arg = topecodmg$EVTYPE,
        ylab = "Total economic damage(in US$ billions)", main = "Top 15 highest economic damage Types of Events",
        cex.axis = 0.8, cex.names = 0.6, col = "green", axes = T, las = 2)

Answer 2:

The type of event with greatest economic consequence is FLOOD and it causes an economic damage of 150.3196783 in billions of US dollars.