Synopsis

The preparation for severe weather events implies the need to prioritize resources for different types of events. This is critical for all government or municipal managers who might be responsible for that task. This report describes the analyses of the data present in the NOAA Storm Database, from 1950 to Nov. 2011 and the main conclusions are that tornados are the principal cause of economic and health damages across the USA states. So, special resources should be allocated to the damage prevention and repair of this kind of weather event. In the case of economic damages, floods are nearly as important as tornados, so deserve also great attention. Much more detailed information is avaiable for each individual state, by request to the author of this study throught the Coursera Platform.

Data Processing

The first step is the aquisition of the raw data, the foundation of all the posterior work.

# Create folder for storm data
if(!file.exists("data")) {
      dir.create("data")
}

# Download the storm data
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "./data/storm.csv", method = "curl")

dateDownload <- date()
dateDownload
## [1] "Sat May 23 19:31:45 2015"
# Load the data set
stormData <- read.table("./data/storm.csv", sep = ",", header = TRUE, na.strings = "")

Next, we get a global perspective of the data set.

# Look at the data
str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 34 levels "  N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BGN_LOCATI: Factor w/ 54428 levels " Christiansburg",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_DATE  : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_TIME  : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_LOCATI: Factor w/ 34505 levels " CANTON"," TULIA",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 541 levels " CI","%SD","$AC",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ZONENAMES : Factor w/ 25111 levels "                                                                                                                               "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436780 levels "\t","\t\t","\t\t\t\t",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We see that we have a total of 902297 observations of 37 variables, which is a lot, but not all the variables are of interest for our objectives, so we will eliminate them.

# Select the 9 variables of interest
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
myStormData <- select(stormData, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES,
                      PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# Look again at the data
str(myStormData)
## 'data.frame':    902297 obs. of  9 variables:
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...

The number of variables was reduced to 9, but two of them, PROPDMGEXP and CROPDMGEXP, look intriguing, so a closer examination is necessary.

# Look closer at PROPDMGEXP and CROPDMGEXP
table(myStormData$PROPDMGEXP)
## 
##      -      ?      +      0      1      2      3      4      5      6 
##      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
table(myStormData$CROPDMGEXP)
## 
##      ?      0      2      B      k      K      m      M 
##      7     19      1      9     21 281832      1   1994

PROPDMGEXP and CROPDMGEXP should be the base 10 exponentials that multiply the correspondent values of PROPDMG and CROPDMG, respectively. I interpret the levels of PROPDMGEXP and CROPDMGEXP like this: -, ? and + (no meaning); i in 0:8 (* 10exp(i)); h or H ( * 10^2); k or K (* 10^3); m or M (* 10^6) and B (* 10^9). My strategy will be to multiply this two sets of columns, obtaining PROPDMG and CROPDMG with full values. But first, I have to convert the values of the EXP columns to class numeric.

To alleviate the weight of the data set on future computations, I begin to get rid of observations with no fatalities, injuries, property damages and crop damages.

# Remove rows with useless data (e.g. FATALITIES = INJURIES = PROPDMG = CROPDMG = 0)
filterStormData <- filter(myStormData, FATALITIES != 0 | INJURIES != 0 | 
                                PROPDMG != 0 | CROPDMG != 0)
dim(filterStormData)
## [1] 254633      9

There are now 254633 observations, instead of the previous 902297, a decrease of 72%. The conversion of the PROPDMGEXP and CROPDMGEXP to the appropriate values and format is complex, it will be executed in multiple steps.

# Convert factor levels of PROPDMGEXP and CROPDMGEXP to "character"
filterStormData$PROPDMGEXP <- as.character(levels(filterStormData$PROPDMGEXP)[filterStormData$PROPDMGEXP])
filterStormData$CROPDMGEXP <- as.character(levels(filterStormData$CROPDMGEXP)[filterStormData$CROPDMGEXP])

# function to modify (translate) values of PROPDMGEXP and CROPDMGEXP
conv <- function(var) {
      for(i in 1:length(var)) {
            if(var[i] %in% c("-", "+", "?")) {
            var[i] <- 1
             }
            if(var[i] %in% c("0", "1", "2", "3", "4", "5", "6", "7", "8")) {
            var[i] <- 10^as.numeric(var[i])
            }
            if(var[i] %in% c("h", "H")) {
            var[i] <- 10^2
            }
            if(var[i] %in% c("k", "K")) {
            var[i] <- 10^3
            }
            if(var[i] %in% c("m", "M")) {
            var[i] <- 10^6
            }
            if(var[i] %in% c("b", "B")) {
            var[i] <- 10^9
            }
      }
      return(var)
}

# Convert to numeric values, PROPDMGEXP and CROPDMGEXP
filterStormData$PROPDMGEXP <- as.numeric(conv(filterStormData$PROPDMGEXP))
filterStormData$CROPDMGEXP <- as.numeric(conv(filterStormData$CROPDMGEXP))

# Check the data
table(filterStormData$PROPDMGEXP)
## 
##      1     10    100   1000  10000  1e+05  1e+06  1e+07  1e+09 
##    210      6      8 231429      4     18  11330      3     40
table(filterStormData$CROPDMGEXP)
## 
##     1    10  1000 1e+06 1e+09 
##    17     6 99953  1986     7
sum(is.na(filterStormData$PROPDMGEXP))
## [1] 11585
sum(is.na(filterStormData$CROPDMGEXP))
## [1] 152664

We can see that there are still a lot of NAs in this variables. To make them neutral and permit the upcomming columns multiplication, they will be “translated” to 1.

# Fill NAs of CROPDMGEXP and PROPDMGEXP with 1
for(i in 1:length(filterStormData$CROPDMGEXP)) {
      if(is.na(filterStormData$CROPDMGEXP[i])) {
            filterStormData$CROPDMGEXP[i] <- 1
      }
}

for(i in 1:length(filterStormData$PROPDMGEXP)) {
      if(is.na(filterStormData$PROPDMGEXP[i])) {
            filterStormData$PROPDMGEXP[i] <- 1
      }
}

# Check the data
table(filterStormData$PROPDMGEXP)
## 
##      1     10    100   1000  10000  1e+05  1e+06  1e+07  1e+09 
##  11795      6      8 231429      4     18  11330      3     40
table(filterStormData$CROPDMGEXP)
## 
##      1     10   1000  1e+06  1e+09 
## 152681      6  99953   1986      7
sum(is.na(filterStormData$PROPDMGEXP))
## [1] 0
sum(is.na(filterStormData$CROPDMGEXP))
## [1] 0

PROPDMGEXP and CROPDMGEXP are now ready to multiply PROPDMG and CROPDMG, respectively.

# Multiply PROPDMG and CROPDMG by their values in PROPDMGEXP and CROPDMGEXP
mutStormData <- mutate(filterStormData, PROPDMG = PROPDMG * PROPDMGEXP, 
                       CROPDMG = CROPDMG * CROPDMGEXP )

# Get rid of PROPDMGEXP and CROPDMGEXP
selectStormData <- select(mutStormData, c(BGN_DATE:PROPDMG, CROPDMG))

The processing of the data is near the end, but a final operation is needed to transform the BGN_DATE variable, formated as factor, to a more proper date format.

# Convert BGN_DATE from "Factor" to "character"
selectStormData$BGN_DATE <- as.character(levels(selectStormData$BGN_DATE)[selectStormData$BGN_DATE])

# Transform variable BGN_DATE from class "character" to class "Date"
library(lubridate)
selectStormData$BGN_DATE <- mdy_hms(selectStormData$BGN_DATE)

A final look at the data that will be analysed.

# Check data
str(selectStormData)
## 'data.frame':    254633 obs. of  7 variables:
##  $ BGN_DATE  : POSIXct, format: "1950-04-18" "1950-04-18" ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...

The data is tidy, ready to deliver insight.

tidyStorm <- selectStormData

Results

For our purposes, the crop damages and the property damages can be aggregated in one unique new variable (ECONOMICDMG) and similarly, the fatalities and injuries are both health damages, so they will be aggregated also in one unique new variable (HEALTHDMG). ECONOMICDMG values will be in dollars and HEALTHDMG values will be counts of events.

# Aggregate PROPDMG and CROPDMG in one unique variable ECONOMICDMG and
# FATALITIES and INJURIES in one unique variable HEALTHDMG
tidyStorm <- mutate(tidyStorm, ECONOMICDMG = PROPDMG + CROPDMG, 
                    HEALTHDMG = FATALITIES + INJURIES)
tidyStorm <- select(tidyStorm, -c(PROPDMG, CROPDMG, FATALITIES, INJURIES))

The strategy, at this stage, is to sum all the damages in the years of the observations (1950 - Nov 2011) by event type and state. The damages will be divided in two sectors: economic and health.

# Convert tidyStorm to a data.table format
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, mday, month, quarter, wday, week, yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, last
stormDT <- as.data.table(tidyStorm)

# Sum ECONOMICDMG by STATE and by EVTYPE 
economicStormDT <- stormDT[ , .(SUMDMG = sum(ECONOMICDMG)), by = .(STATE, EVTYPE) ]
head(economicStormDT)
##    STATE    EVTYPE     SUMDMG
## 1:    AL   TORNADO 6378094060
## 2:    AL TSTM WIND   63444000
## 3:    AZ   TORNADO   48005330
## 4:    AZ TSTM WIND  383488100
## 5:    AZ      HAIL 2829059700
## 6:    AR   TORNADO 2591514320
dim(economicStormDT)
## [1] 2518    3
# Sum HEALTHDMG by STATE and by EVTYPE 
healthStormDT <- stormDT[ , .(SUMDMG = sum(HEALTHDMG)), by = .(STATE, EVTYPE) ]
head(healthStormDT)
##    STATE    EVTYPE SUMDMG
## 1:    AL   TORNADO   8546
## 2:    AL TSTM WIND    367
## 3:    AZ   TORNADO    150
## 4:    AZ TSTM WIND    162
## 5:    AZ      HAIL      9
## 6:    AR   TORNADO   5495
dim(healthStormDT)
## [1] 2518    3

As there are 985 event types refered in the data set, I will make the option to identify, for each state, the event type with greatest economic consequences and the event type with more harmful health consequences.

# Choose the event type with greatest economic consequences in each state
economicDMG <- group_by(economicStormDT, STATE)
topEconomicDMG <- top_n(economicDMG, 1, SUMDMG)
head(topEconomicDMG)
##   STATE  EVTYPE       SUMDMG
## 1    AL TORNADO   6378094060
## 2    AZ    HAIL   2829059700
## 3    AR TORNADO   2591514320
## 4    CA   FLOOD 117377795000
## 5    CO    HAIL   1543434755
## 6    CT TORNADO    596236620
# Choose the most harmful event type for health in each state
healthDMG <- group_by(healthStormDT, STATE)
topHealthDMG <- top_n(healthDMG, 1, SUMDMG)
head(topHealthDMG)
##   STATE      EVTYPE SUMDMG
## 1    AL     TORNADO   8546
## 2    AZ FLASH FLOOD    212
## 3    AR     TORNADO   5495
## 4    CA    WILDFIRE    655
## 5    CO   LIGHTNING    308
## 6    CT     TORNADO    707

Data interpretation is much more easier if made with the help of visual representations. I will make two histograms that give the frequencies of the principal natural disaster causes of economic and health damages.

# Plot principal origin of economic damages in each state
library(ggplot2)

plot1 <- qplot(EVTYPE, data = topEconomicDMG, xlab = "", ylab = "State Count", 
           main = "Principal Origin of Economic Damages in each State (1950-2011)" ) 
plot1 + coord_flip() + geom_histogram(colour="blue", fill="blue")

We can see that tornados are the principal cause of economic damages in 17 states, followed closely by floods (12 states, but if we add flash floods, that will give a total of 16 for floods altogheter). Tied in third place, we have marine thunderstorm wind and hail, that are the primary cause of economic damages in 5 states. It’s interesting to notice that only 21 in 985 of the possible types of events appear in this chart, telling us that a minority of events dominated the economic damages in those years.

# Plot principal origin of health damages in each state
plot2 <- qplot(EVTYPE, data = topHealthDMG, xlab = "", ylab = "State Count", 
               main = "Principal Origin of Health Damages in each State (1950-2011)" ) 
plot2 + coord_flip() + geom_histogram(colour="red", fill="red")

What strikes most in this chart is the absolute dominiom of tornados as the principal cause of health damages in nothing less than 33 of the 72 states. Tied in second place, also related with big movements of air masses, we have marine thunderstorm winds and marine strong winds, booth with 5 states. Like in the previous plot, a minority of events, 19 in this case,
dominated the health damages in those years.