Reproducible Research Peer Assessment 2: Analysis of Storm Data

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. Here we present an analysis based on the NOAA Storm Database. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The analysis adresses the following two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

The data come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded from here.

There is also some documentation of the database available. There it is explained how some of the variables are constructed/defined.

Using R, the data can be downloaded using the link given above and loaded to R using the following code chunk. The data are store in a variable called data.

url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile="repdata_data_StormData.csv.bz2")
dateDownloaded <- date()
data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))

By executing the following code in R we get a summary of the data

str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

As we can see from the summary above, the data frame contains 902297 records with 37 variables. For our analysis we need only the following 7 variables:

  1. EVTYPE
  2. PROPDMG
  3. PROPDMGEXP
  4. FATALITIES
  5. INJURIES
  6. CROPDMG
  7. CROPDMGEXP

The EVTYPE variable contains 985 unique values. According to the documentation of the database, there should be only 48 Storm event types. In order to be consistent, we are going to use the data records of the 48 designated event types given in the documentation of the database.

Also, to be consistent we are going to make all the string values of the dataset to upper case to avoid any duplicates.

The code below executes the steps described above to create a new dataset.

tidy <- data[, c("EVTYPE", "PROPDMG", "PROPDMGEXP", "FATALITIES", "INJURIES", "CROPDMG", "CROPDMGEXP")]

approvedTypes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Freezing Fog", "Frost/Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon", "Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")

approvedTypes <- toupper(approvedTypes)
tidy$EVTYPE <- toupper(tidy$EVTYPE)
tidy <- tidy[tidy$EVTYPE %in% approvedTypes,]

The data now contains 635291 records with 7 variables.

In the sequence, we are going to deal with the PROPDMGEXP and CROPDMGEXP variables. Below we can see the summary of the PROPDMGEXP and CROPDMGEXP variables respectively.

summary(tidy$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5 
## 279394      1      6      2     51      8      8      1      1     18 
##      6      7      8      B      h      H      K      m      M 
##      1      4      1     26      0      1 345724      4  10040
summary(tidy$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5 
## 279394      1      6      2     51      8      8      1      1     18 
##      6      7      8      B      h      H      K      m      M 
##      1      4      1     26      0      1 345724      4  10040

According to the documentation of the database, the accepted values are H, K, M and B standing for hundrends, thousands, millions and billions, respectively. The empty values are also acceptable. The lower case h, k, m and b will be transformed to upper case and will be accepted. All the records with the rest of the values are going to be removed from the dataset and the calculation. The impact is negligible.

In the sequence, the values from the PROPDMGEXP and CROPDMGEXP are going to be multiplied with the values of PROPDMG and CROPDMG to get the complete amount of dollars in the new columns PROPDMGTOTAL and CROPDMGTOTAL by using the respective values for the exponentials H, K, M and B (100, 1000, 10^6 and 10^9). The final tidy dataset is going to include the following 5 variables:

  1. EVTYPE
  2. PROPDMGTOTAL
  3. FATALITIES
  4. INJURIES
  5. CROPDMGTOTAL

In the code below executes the steps mentioned above.

tidy$PROPDMGEXP <- toupper(tidy$PROPDMGEXP)
tidy$CROPDMGEXP <- toupper(tidy$CROPDMGEXP)

exponentials <- c("H","K","M","B","")
tidy <- tidy[tidy$PROPDMGEXP %in% exponentials,]
tidy <- tidy[tidy$CROPDMGEXP %in% exponentials,]

tidy$PROPDMGEXP[tidy$PROPDMGEXP==""] <- 0
tidy$PROPDMGEXP[tidy$PROPDMGEXP=="H"] <- 100
tidy$PROPDMGEXP[tidy$PROPDMGEXP=="K"] <- 1000
tidy$PROPDMGEXP[tidy$PROPDMGEXP=="M"] <- 1000000
tidy$PROPDMGEXP[tidy$PROPDMGEXP=="B"] <- 1000000000
tidy$PROPDMGTOTAL <- as.numeric(tidy$PROPDMGEXP)*tidy$PROPDMG

tidy$CROPDMGEXP[tidy$CROPDMGEXP==""] <- 0
tidy$CROPDMGEXP[tidy$CROPDMGEXP=="K"] <- 1000
tidy$CROPDMGEXP[tidy$CROPDMGEXP=="M"] <- 1000000
tidy$CROPDMGEXP[tidy$CROPDMGEXP=="B"] <- 1000000000
tidy$CROPDMGTOTAL <- as.numeric(tidy$CROPDMGEXP)*tidy$CROPDMG

tidy <- tidy[, c("EVTYPE", "PROPDMGTOTAL", "FATALITIES", 
                 "INJURIES", "CROPDMGTOTAL")]

At this point the data are tidy and clean ready for our analysis. Below is the summary of the data.

str(tidy)
## 'data.frame':    635181 obs. of  5 variables:
##  $ EVTYPE      : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ PROPDMGTOTAL: num  25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
##  $ FATALITIES  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES    : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ CROPDMGTOTAL: num  0 0 0 0 0 0 0 0 0 0 ...

In order to perform our analysis and provide the results, the following packages are used:

library(ggplot2)

Results

The top 10 most harmful types of events that cause fatalities to the population can be calculated by using the following R code:

totalFatalitiesPerEvent <- aggregate(tidy$FATALITIES, 
                                     by=list(tidy$EVTYPE), sum)
colnames(totalFatalitiesPerEvent) <- c("EVTYPE","FATALITIES")
totalFatalitiesPerEvent <- totalFatalitiesPerEvent[
  order(totalFatalitiesPerEvent$FATALITIES, decreasing=TRUE),]

and are presented below

print(totalFatalitiesPerEvent[1:10,1:2], row.names=FALSE)
##          EVTYPE FATALITIES
##         TORNADO       5630
##  EXCESSIVE HEAT       1903
##     FLASH FLOOD        978
##            HEAT        937
##       LIGHTNING        816
##           FLOOD        470
##     RIP CURRENT        368
##       HIGH WIND        246
##       AVALANCHE        224
##    WINTER STORM        206

The top 10 most harmful types of events that cause injuries to the population can be calculated by using the following R code:

totalInjuriesPerEvent <- aggregate(tidy$INJURIES, 
                                   by=list(tidy$EVTYPE), sum)
colnames(totalInjuriesPerEvent) <- c("EVTYPE","INJURIES")
totalInjuriesPerEvent <- totalInjuriesPerEvent[
  order(totalInjuriesPerEvent$INJURIES, decreasing=TRUE),]

and are presented below

print(totalInjuriesPerEvent[1:10,1:2], row.names=FALSE)
##             EVTYPE INJURIES
##            TORNADO    91321
##              FLOOD     6789
##     EXCESSIVE HEAT     6525
##          LIGHTNING     5230
##               HEAT     2100
##          ICE STORM     1975
##        FLASH FLOOD     1777
##  THUNDERSTORM WIND     1488
##               HAIL     1360
##       WINTER STORM     1321

Below is a graph that further summarizes the top 10 most harmful events in term of fatalities

plot1 <- ggplot(totalFatalitiesPerEvent[1:10,], 
             aes(x=reorder(EVTYPE, FATALITIES) ,y=FATALITIES)) + 
  geom_bar(stat="identity",fill="dark red") + coord_flip() +
  labs(x = "Event type", 
       y="Fatalities", 
       title="Top 10 events by Fatalities")

plot1

Here is the graph that summarizes the top 10 most harmful events in term of injuries

plot2 <- ggplot(totalInjuriesPerEvent[1:10,], 
             aes(x=reorder(EVTYPE, INJURIES) ,y=INJURIES)) + 
  geom_bar(stat="identity",fill="red") + coord_flip() +
  labs(x = "Event type", 
       y="Injuries", 
       title="Top 10 events by Injuries")

plot2

It is obvious that the Storm Event that is most harmful for the population is the Tornado Storm Event.

The top 10 most harmful types of events for the economy can be calculated by using the following R code:

EconomyDamagePROP <- aggregate(tidy$PROPDMGTOTAL, 
                               by=list(tidy$EVTYPE), sum)
colnames(EconomyDamagePROP) <- c("EVTYPE","PROPDAMAGES")
EconomyDamageCROP <- aggregate(tidy$CROPDMGTOTAL, 
                               by=list(tidy$EVTYPE), sum)
colnames(EconomyDamageCROP) <- c("EVTYPE","CROPDAMAGES")
totalEconomyDamages <- merge(EconomyDamagePROP,EconomyDamageCROP,
                             by.x ="EVTYPE", by.y ="EVTYPE")
totalEconomyDamages$DAMAGES <- totalEconomyDamages$PROPDAMAGES + 
  totalEconomyDamages$CROPDAMAGES
totalEconomyDamages <- totalEconomyDamages[
  order(totalEconomyDamages$DAMAGES, decreasing=TRUE),]

and are presented below

print(totalEconomyDamages[1:10, c("EVTYPE","DAMAGES")], 
      row.names=FALSE)
##             EVTYPE      DAMAGES
##              FLOOD 150319678250
##  HURRICANE/TYPHOON  71913712800
##            TORNADO  57301935590
##               HAIL  18733216670
##        FLASH FLOOD  17561538610
##            DROUGHT  15018672000
##          ICE STORM   8967037810
##     TROPICAL STORM   8382236550
##       WINTER STORM   6715441250
##          HIGH WIND   5908617560

Below is a graph that further summarizes the top 10 most harmful events in term of economy

plot3 <- ggplot(totalEconomyDamages[1:10,], 
             aes(x=reorder(EVTYPE, DAMAGES) ,y=DAMAGES/10^9)) + 
  geom_bar(stat="identity",fill="dark orange") + coord_flip() +
  labs(x = "Event type", 
       y="Billions of dollars", 
       title="Top 10 events by combined economical damages")

plot3

It is obvious that the Storm Event that is most harmful to the economy in total is the Flood.