An analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database

SYNOPSIS

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) STORM database, and finding out which climate events are most harmful in terms of human and economic damage.

The NOOA STORM database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The database contains records of major events starting in 1950 to actual date.

The database was firstly analysed in relation to its size, characteristics and variables of interest to the analysis. Data was then processed so that the analysis could be performed. Details of the data processing performed are detailed in this report.

The analysis of the database shows that tornadoes are by far the most harmful weather event, in terms of both fatalities and injuries. This is followed by thunderstorm wind and then by flooding. In relation to their economic impact to both property & crops, flooding is the most harmful weather event, followed by hurricanes and then by tornadoes.

DATA PROCESSING

First we load the dabatase into our computer, and load required R programming libraries:

### Load the database ###
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
                  dest="tmp.bz2", 
                  method="curl")

NOOA_STORM_DB <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")

### Load required R language libraries ###
library (dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)

We then have a look at the STORM DB structure and contents:

str(NOOA_STORM_DB)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The database has 902,297 weather events, each observation has 37 variables.

Our variables of interest are:

  • EVTYPE: describes the event type (tornado, flood, hurricane, etc). It is a factor variable.
  • FATALITIES: numeric variable, contains # of fatalities for each weather event recorded in the DB.
  • INJURIES: numeric variable, contains # injuried at each weather event recorded in the DB.
  • PROPDMG: numeric variable containing numeric estimation of property damage caused by each event (in USD)
  • PROPDMGEXP: factor variable, contains exponential to be applied to PROPDMG in order to have the total cost of damage.
  • CROPDMG: numeric variable containing numeric estimation of damage to crops caused by each event (in USD)
  • PROPDMGEXP: factor variable, contains exponential to be applied to CROPDMG in order to have the total cost of damage to crops.

We then proceed to find out if there are any N/A o “NAN” (not a number) values in FATALITIES and INJURIES, for all of the events present/recorded in the DB.

isna <- is.na(NOOA_STORM_DB$FATALITIES)
table(isna)
## isna
##  FALSE 
## 902297

No missing values for FATALITIES.

isna <- is.na(NOOA_STORM_DB$INJURIES)
table(isna)
## isna
##  FALSE 
## 902297

And no missing values for INJURIES.

Therefore, we do not need to account or make any data transformations to the DB in relation to this.

Now we proceed to aggregate/sum up all fatalities and injuries, by Event Type (EVTYPE):

totalcasualties <- aggregate(list(NOOA_STORM_DB$FATALITIES, NOOA_STORM_DB$INJURIES), by=list(NOOA_STORM_DB$EVTYPE), FUN=sum)
colnames(totalcasualties) <- c("Event", "Fatalities", "Injuries")

We sort the results in descending order for both FATALITIES and INJURIES

totalcasualties <- arrange(totalcasualties, desc(Fatalities), desc(Injuries))

Moving on to economic impact (damage) data in the database, we firstly analyse values of PROPDMG and CROPDMG variables:

### Review of damage info in database
table(NOOA_STORM_DB$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
print (table(NOOA_STORM_DB$PROPDMGEXP))
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(NOOA_STORM_DB$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
print(table(NOOA_STORM_DB$CROPDMGEXP))
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

As per the NOOA STORM database documentation, exponential/magnitudes for dollar amounts related to damage estimates can have 3 values: K, M and B. We do see here that there are additional values in the DB.

We review some of these cases to try understand what do these non-described values could mean:

oddEXPvalues <- grep("1", NOOA_STORM_DB$PROPDMGEXP) 
oddEXPvalues <- NOOA_STORM_DB [oddEXPvalues, ]     
print(head(oddEXPvalues, 10))
##        STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 233829      47 5/10/1995 0:00:00     2140       CST      9     BLOUNT
## 233830      47 5/10/1995 0:00:00     2145       CST      9     BLOUNT
## 233833      47 5/18/1995 0:00:00     2105       CST      9     BLOUNT
## 233876      47 5/14/1995 0:00:00     0545       CST     13   CAMPBELL
## 233898      47 5/18/1995 0:00:00     0840       CST     17    CARROLL
## 233955      47 5/18/1995 0:00:00     1730       CST     31     COFFEE
## 233971      47 5/14/1995 0:00:00     0400       CST     35 CUMBERLAND
## 233972      47 5/18/1995 0:00:00     1707       CST     35 CUMBERLAND
## 233974      47 5/18/1995 0:00:00     1930       CST     35 CUMBERLAND
## 234019      47 5/18/1995 0:00:00     1345       CST     39    DECATUR
##        STATE             EVTYPE BGN_RANGE BGN_AZI         BGN_LOCATI
## 233829    TN THUNDERSTORM WINDS         0               Friendsville
## 233830    TN THUNDERSTORM WINDS         0                 Louisville
## 233833    TN THUNDERSTORM WINDS         0              Fountain City
## 233876    TN THUNDERSTORM WINDS         0                  Jacksboro
## 233898    TN THUNDERSTORM WINDS         0                 Huntingdon
## 233955    TN            TORNADO         0                 Manchester
## 233971    TN THUNDERSTORM WINDS         0                           
## 233972    TN               HAIL         0                 Crossville
## 233974    TN            TORNADO         0         Lake Tanzi Village
## 234019    TN               HAIL         0               Decaturville
##        END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI
## 233829                            0         NA         0        
## 233830           2150CST          0         NA         0        
## 233833                            0         NA         0        
## 233876                            0         NA         0        
## 233898                            0         NA         0        
## 233955                            0         NA         0        
## 233971           0500CST          0         NA         0        
## 233972                            0         NA         0        
## 233974                            0         NA         0        
## 234019           1350CST          0         NA         0        
##        END_LOCATI LENGTH WIDTH  F MAG FATALITIES INJURIES PROPDMG
## 233829                 0     0 NA   0          0        0       0
## 233830                 0     0 NA  57          0        0       0
## 233833                 0     0 NA   0          0        0       0
## 233876                 0     0 NA   0          0        0       0
## 233898                 0     0 NA   0          0        0       0
## 233955                 1    20  0   0          0        0       0
## 233971                 0     0 NA   0          0        0       0
## 233972                 0     0 NA 175          0        0       0
## 233974                 2    25  1   0          0        0       0
## 234019                 0     0 NA 275          0        0       0
##        PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## 233829          1       0                                            0
## 233830          1       0                                            0
## 233833          1       0                                            0
## 233876          1       0                                            0
## 233898          1       0                                            0
## 233955          1       0                                         3529
## 233971          1       0                                            0
## 233972          1       0                                            0
## 233974          1       0                                         3558
## 234019          1       0                                            0
##        LONGITUDE LATITUDE_E LONGITUDE_
## 233829         0          0          0
## 233830         0          0          0
## 233833         0          0          0
## 233876         0          0          0
## 233898         0          0          0
## 233955      8605          0          0
## 233971         0          0          0
## 233972         0          0          0
## 233974      8459          0          0
## 234019         0          0          0
##                                                                                                                                               REMARKS
## 233829                                                                                                            Numerous trees wwere knocked down. 
## 233830 A 65 mph wind gust was measured at the Knoxville National Weather Service Office.  Numerous trees and power lines were knocked down in Alcoa. 
## 233833                                                                       Several trees were uprooted. One tree fell and heavily damaged a house. 
## 233876                                                       A roof was blown off of a barn.  Several trees and power lines were blown down as well. 
## 233898                                                                       Several trees were knocked down. One fallen tree damaged a church roof. 
## 233955                                                                         A house trailer was overturned.  Several large trees were blown down. 
## 233971                                                                                           Numerous telephone and power lines were blown down. 
## 233972                                                                                                                                               
## 233974                                                                                            At least 30 homes and house trailers were damaged. 
## 234019                                                                                             Several cars had their windshield broken by hail. 
##        REFNUM
## 233829 233790
## 233830 233791
## 233833 233794
## 233876 233837
## 233898 233859
## 233955 233942
## 233971 233958
## 233972 233959
## 233974 233961
## 234019 234006

There are not monetary/USD values associated with the non-described PRODMGEXP and CROPDMGEXP exponentials. Maybe if we remove events that do no have DAMAGE COSTs associated we get rid of strange factors in PROPDMEXP & CROPDMGEXP So let´s take out all rows where PROPERTY or CROP DAMAGE = 0

NOOA_fordamage_Est <- filter(NOOA_STORM_DB, PROPDMG != 0 | CROPDMG != 0)
table(NOOA_fordamage_Est$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
##   4357      1      0      5    209      0      1      1      4     18 
##      6      7      8      B      h      H      K      m      M 
##      3      2      0     40      1      6 229057      7  11319
table(NOOA_fordamage_Est$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 145037      6     17      0      7     21  97960      1   1982

No luck. Non-described exponential values are still present in the database.

In reviewing frequency and value of PROPERTY & CROP DAMAGE descriptive statistics values, I assume we can discard the odd exponential values since their contribution to damage costing estimates are negligible.

We continue the processing so that we can analyse damage impact of events. We will use the subsetted database that we generated to extract only weather events that caused damage.

In order to discard the odd exponential values –> anything that is not Billion (B), Million (M) or Hundred (H), is set to “0” so will not add to the sum of costs.

NOOA_fordamage_Est$PROPDMGEXP <- toupper(NOOA_fordamage_Est$PROPDMGEXP)
NOOA_fordamage_Est$CROPDMGEXP <- toupper(NOOA_fordamage_Est$CROPDMGEXP)

for (i in seq_along(NOOA_fordamage_Est$PROPDMGEXP))
{ 
    if (NOOA_fordamage_Est [i,26] == "K")  
        {NOOA_fordamage_Est [i,26] =1000}
    else if (NOOA_fordamage_Est [i,26] == "M")  
    {NOOA_fordamage_Est [i,26 ] =1e+06}
    else if (NOOA_fordamage_Est [i,26] == "B")  
    {NOOA_fordamage_Est [i,26 ] =1e+09}
    else {NOOA_fordamage_Est [i,26] =0}
}

for (i in seq_along(NOOA_fordamage_Est$CROPDMGEXP))
{ 
    if (NOOA_fordamage_Est [i,28] == "K")  
    {NOOA_fordamage_Est [i,28] =1000}
    else if (NOOA_fordamage_Est [i,28] == "M")  
    {NOOA_fordamage_Est [i,28 ] =1e+06}
    else if (NOOA_fordamage_Est [i,28] == "B")  
    {NOOA_fordamage_Est [i,28 ] =1e+09}
    else {NOOA_fordamage_Est [i,28] =0}
}

NOOA_fordamage_Est$PROPDMGEXP <- as.numeric(NOOA_fordamage_Est$PROPDMGEXP)
NOOA_fordamage_Est$CROPDMGEXP <- as.numeric(NOOA_fordamage_Est$CROPDMGEXP)

Now we: * Multiply the dollar amounts by their exponential, so we have Total Property Damge & Total Crop Damage per event * Calculate the TOTAL DAMAGE (sum of property and crop values) of each event * Aggregate/sum up TOTAL DAMAGE, per event (new dataframe called “Damagecosts”)

NOOA_fordamage_Est <- mutate(NOOA_fordamage_Est, Damages_cost = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP)
Damagecosts <- select(NOOA_fordamage_Est, EVTYPE, Damages_cost)
Damagecosts <- aggregate(Damagecosts$Damages_cost, by=list(Damagecosts$EVTYPE), FUN=sum)
colnames(Damagecosts) <- c("Event", "Damages_cost")
Damagecosts <- arrange(Damagecosts, desc(Damages_cost))

RESULTS

Review of the Top 10 most harmful weather events in terms of fatalities and injuries:

head(totalcasualties, 10)
##             Event Fatalities Injuries
## 1         TORNADO       5633    91346
## 2  EXCESSIVE HEAT       1903     6525
## 3     FLASH FLOOD        978     1777
## 4            HEAT        937     2100
## 5       LIGHTNING        816     5230
## 6       TSTM WIND        504     6957
## 7           FLOOD        470     6789
## 8     RIP CURRENT        368      232
## 9       HIGH WIND        248     1137
## 10      AVALANCHE        224      170

And here is the plot of these results:

### First melt the DF so that ggplot can be used
topcasualties <- melt (totalcasualties, id.vars = "Event")
colnames(topcasualties) <- c("Event", "Type_of_harm", "Value")
topcasualties <- arrange(topcasualties, desc(Value))

Casualties_plot <- ggplot(topcasualties [1:20, ], aes(x=reorder(Event, -Value), y=Value))
Casualties_plot + geom_bar(stat="identity", aes(fill=Type_of_harm), position="dodge") + theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + ggtitle("10 Most Harmful Events") + xlab("Type of Event") + ylab("# of occurences")

Tornadoes are by far the most harmful weather event, in terms of both fatalities and injuries. This is followed by thunderstorm wind, and then by flooding.


Review of the Top 10 most harmful weather events in terms of economic impact:

head(Damagecosts, 10)
##                Event Damages_cost
## 1              FLOOD 150319678250
## 2  HURRICANE/TYPHOON  71913712800
## 3            TORNADO  57352113590
## 4        STORM SURGE  43323541000
## 5               HAIL  18758221170
## 6        FLASH FLOOD  17562128610
## 7            DROUGHT  15018672000
## 8          HURRICANE  14610229010
## 9        RIVER FLOOD  10148404500
## 10         ICE STORM   8967041310

And here is the corresponding plot of these results:

damageplot <- ggplot(Damagecosts[1:10, ], aes(x=reorder(Event, -Damages_cost), y=Damages_cost))
damageplot + geom_bar(stat = "identity", fill="Blue") + theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + ggtitle("Top 10 Events with Highest Economic Impact") + ylab("Damages Cost in USD") + xlab("Event type")

Flooding is responsible for the largest proportion of total economic impact out of all event types, followed by hurricanes and then by tornadoes.