Weather phenomena with significant consequences in the United States

Synopsis

In this report we aim to identify the most significant weather phenomena that are considered most harmful in respect of public health and economy for the United States.

Analysing the data originated from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, including events registered from 1955 to 2011 in the United States, we can see that TORNADOs and FLOODs are the type of events with significat consequence in the United States.

TORNADOs have been the main cause of loss of life and injuries with 96915 victims (62% of the total number of victims caused by weather phenomena). While FLOODs have been the main cause of damages/ economic consequences with a circa 150 billion dollars estimated value (31% of the total estimated damages caused by weather phenomena).

Data Processing

The data is originated from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The data covers the period from 1955 to 2011.

The data can be download from the following link in the form of a comma-separated-value file compressed via the bzip2 algorithm.

Data Processing Steps

  • Acquire the data
  • Reading in the data
  • Cleaning the data
  • Adding extra features
  • Aggregating the data

Acquire the data

Downloading the data as a compressed file in the current working directory.

theUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
theFile <- "./stormData.csv.bz2"
download.file(theUrl, theFile, method = "curl", quiet = TRUE, mode = "wb")

Reading in the data

Read the data included in the compressed file using read.csv. When reading the file the options to transfor strings to factor is turned off.

rawData <- read.csv(theFile, stringsAsFactors = FALSE)

Cleaning the data

The raw dataset includes 902297 observations (rows), each observations including 37 features (columns) with the following names.

dim(rawData)
## [1] 902297     37
names(rawData)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Removing not necessary features (columns)

Not all of the features (columns) are needed in order to answer the questions so the first step is to try to identify a subset of the raw dataset containing only the relevant data.

Based on the available data documentation and the “questions”, we are going to create a simplified raw dataset including the following features (columns):

  • type of event (EVTYPE)
  • Fatalities/ injuries
    • direct (FATALITIES, INJURIES)
  • damage
    • property damage information (PROPDMG, PROPDMGEXP)
    • crop damage information (CROPDMG, CROPDMGEXP)

The simplified raw dataset allows us to reduce reduce the level complexity, to remove features (columns) not important for the questions we are trying to answer and to increase the performace during the data clean up and transformation.

#simplified raw dataset
sRawData <- rawData[, c(8,23,24,25,26,27,28)]
names(sRawData)
## [1] "EVTYPE"     "FATALITIES" "INJURIES"   "PROPDMG"    "PROPDMGEXP"
## [6] "CROPDMG"    "CROPDMGEXP"

Preparing selected features (columns)

Using the summary it is possible to have an overview of the content of the raw dataset and NAs.

summary(sRawData)
##     EVTYPE            FATALITIES          INJURIES        
##  Length:902297      Min.   :  0.0000   Min.   :   0.0000  
##  Class :character   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Mode  :character   Median :  0.0000   Median :   0.0000  
##                     Mean   :  0.0168   Mean   :   0.1557  
##                     3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##                     Max.   :583.0000   Max.   :1700.0000  
##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:902297      Min.   :  0.000   Length:902297     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  12.06                      Mean   :  1.527                     
##  3rd Qu.:   0.50                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000
#Number of NAs by feature (column)
theCounter <- function(x){
    return(sum(is.na(x)))
}
apply(sRawData, 2, theCounter)
##     EVTYPE FATALITIES   INJURIES    PROPDMG PROPDMGEXP    CROPDMG 
##          0          0          0          0          0          0 
## CROPDMGEXP 
##          0
Considerations around EVTYPE feature

Checking the possible values of the EVTYPE features (column) we can see that there are 985 possible values (over 902297 observations).

tmp <- unique(sRawData$EVTYPE)
length(tmp)
## [1] 985
head(tmp)
## [1] "TORNADO"               "TSTM WIND"             "HAIL"                 
## [4] "FREEZING RAIN"         "SNOW"                  "ICE STORM/FLASH FLOOD"
tail(tmp)
## [1] "DENSE SMOKE"              "LAKESHORE FLOOD"         
## [3] "MARINE THUNDERSTORM WIND" "MARINE STRONG WIND"      
## [5] "ASTRONOMICAL LOW TIDE"    "VOLCANIC ASHFALL"

Transforming ENVTYPE into a factor and showing the 10 most frequent event type.

sRawData$EVTYPE <- as.factor(sRawData$EVTYPE)
summary(sRawData$EVTYPE, maxsum = 10)
##               HAIL          TSTM WIND  THUNDERSTORM WIND 
##             288661             219940              82563 
##            TORNADO        FLASH FLOOD              FLOOD 
##              60652              54277              25326 
## THUNDERSTORM WINDS          HIGH WIND          LIGHTNING 
##              20843              20212              15754 
##            (Other) 
##             114069
Considerations around PROPDMGEXP, CRPDMGEXP features

Note! From available data documentation: * ‘Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include K for thousands, M for millions, and B for billions.’

Checking the possible values for the PROPDMGEXP feature we can see that it is set to values other that “K”, “M” and “B” and (assumption) “” (without magnitude).

tmp <- unique(sRawData$PROPDMGEXP)
length(tmp)
## [1] 19
tmp
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"

The actual distribution of such possible values is summarized in the table below.

table(sRawData$PROPDMGEXP)
## 
##             +      -      0      1      2      3      4      5      6 
## 465934      5      1    216     25     13      4      4     28      4 
##      7      8      ?      B      H      K      M      h      m 
##      5      1      8     40      6 424665  11330      1      7

Removing the unexpected values for PROPDMGEXP from the dataset.

sRawData <- sRawData[sRawData$PROPDMGEXP == "" | sRawData$PROPDMGEXP == "K" | sRawData$PROPDMGEXP == "M" | sRawData$PROPDMGEXP == "B",]
dim(sRawData)
## [1] 901969      7
unique(sRawData$PROPDMGEXP)
## [1] "K" "M" ""  "B"

Same processing is applied for CRPDMGEXP.

table(sRawData$CROPDMGEXP)
## 
##             0      2      ?      B      K      M      k 
## 618094     19      1      7      9 281826   1992     21
sRawData <- sRawData[sRawData$CROPDMGEXP == "" | sRawData$CROPDMGEXP == "K" | sRawData$CROPDMGEXP == "M" | sRawData$CROPDMGEXP == "B",]
dim(sRawData)
## [1] 901921      7
unique(sRawData$CROPDMGEXP)
## [1] ""  "M" "K" "B"

Removing the unexpected values for PROPDMGEXP, CROPDMGEXP we have a dataset of 901921 obs, starting from an original dataset of 902297 obs.

Adding extra features

In order to make simplify the calculation, around impact on public heath and economy, 2 new features, PROP_DOLLAR and CROP_DOLLAR, have been added to the dataset reprting the damage value in dollars (considering the magnitude).

sRawData$PROP_DOLLAR <- 0
sRawData$CROP_DOLLAR <- 0

sRawData[sRawData$PROPDMGEXP == "","PROP_DOLLAR"] <- sRawData[sRawData$PROPDMGEXP == "","PROPDMG"]
sRawData[sRawData$PROPDMGEXP == "K","PROP_DOLLAR"] <- sRawData[sRawData$PROPDMGEXP == "K","PROPDMG"] * 1000
sRawData[sRawData$PROPDMGEXP == "M","PROP_DOLLAR"] <- sRawData[sRawData$PROPDMGEXP == "M","PROPDMG"]* 1000000
sRawData[sRawData$PROPDMGEXP == "B","PROP_DOLLAR"] <- sRawData[sRawData$PROPDMGEXP == "B","PROPDMG"] * 1000000000

sRawData[sRawData$CROPDMGEXP == "","CROP_DOLLAR"] <- sRawData[sRawData$CROPDMGEXP == "","CROPDMG"]
sRawData[sRawData$CROPDMGEXP == "K","CROP_DOLLAR"] <- sRawData[sRawData$CROPDMGEXP == "K","CROPDMG"] * 1000
sRawData[sRawData$CROPDMGEXP == "M","CROP_DOLLAR"] <- sRawData[sRawData$CROPDMGEXP == "M","CROPDMG"] * 1000000
sRawData[sRawData$CROPDMGEXP == "B","CROP_DOLLAR"] <- sRawData[sRawData$CROPDMGEXP == "B","CROPDMG"] * 1000000000

sRawData$TOTAL_DOLLAR <- sRawData$PROP_DOLLAR + sRawData$CROP_DOLLAR 
sRawData$TOTAL_HEALTH <- sRawData$INJURIES + sRawData$FATALITIES 

Aggregating the data

Finally the dataset is ready to be aggregated by EVTYPE. Specifically the data is aggregated calculating the total population health and total economic consequences by type of event, and ordered in ascending order by total economic and health impacts respectively.

healthPerEventType  <- aggregate(TOTAL_HEALTH ~ EVTYPE, data=sRawData, sum)
economyPerEventType  <- aggregate(TOTAL_DOLLAR ~ EVTYPE, data=sRawData, sum)

healthPerEventType <- healthPerEventType[order(healthPerEventType$TOTAL_HEALTH, decreasing=TRUE),]
economyPerEventType <- economyPerEventType[order(economyPerEventType$TOTAL_DOLLAR, decreasing=TRUE),]

Results

Most harmful events with respect to population health

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Lets print out the first total number of victims (injuries + casualties) for the different event types in our dataset, limiting the focus to the 10 top most harmful event types.

healthPerEventType$PERCENTAGE <- (healthPerEventType$TOTAL_HEALTH/ sum(healthPerEventType$TOTAL_HEALTH)) * 100 
healthPerEventType[1:10,]
##                EVTYPE TOTAL_HEALTH PERCENTAGE
## 822           TORNADO        96915 62.2963149
## 123    EXCESSIVE HEAT         8428  5.4174621
## 842         TSTM WIND         7461  4.7958810
## 165             FLOOD         7259  4.6660367
## 449         LIGHTNING         6046  3.8863284
## 268              HEAT         3037  1.9521633
## 150       FLASH FLOOD         2755  1.7708956
## 418         ICE STORM         2064  1.3267254
## 749 THUNDERSTORM WIND         1621  1.0419680
## 958      WINTER STORM         1527  0.9815454

Plotting the 5 top harmful events types

par(mfrow=c(1,2), mar=c(4,4,2,4), cex = 0.6)
barplot(healthPerEventType[1:5,"TOTAL_HEALTH"], main = "Total number of victims by Event Type (top 5)", names.arg = healthPerEventType[1:5,"EVTYPE"], xlab = "Event Type", ylab = "Total number of victims")
barplot(healthPerEventType[1:5,"PERCENTAGE"], main = "Percentage on overall victims by Event Type (top 5)", names.arg = healthPerEventType[1:5,"EVTYPE"], xlab = "Event Type", ylab = "Percentage", ylim = c(0, 100))

We can see that across the United States the most harmful event type with respect to population health is the TORNADO.

Events with greatest economic consequences

Across the United States, which types of events have the greatest economic consequences?

Lets print out the first total cost (economic consequences) for the different event types in our dataset, limiting the focus to the 10 top most expensive event types.

economyPerEventType$PERCENTAGE <- (economyPerEventType$TOTAL_DOLLAR / sum(economyPerEventType$TOTAL_DOLLAR)) * 100
economyPerEventType[1:10,]
##                EVTYPE TOTAL_DOLLAR PERCENTAGE
## 165             FLOOD 150319678257  31.560119
## 389 HURRICANE/TYPHOON  71913712800  15.098524
## 822           TORNADO  57290435593  12.028318
## 652       STORM SURGE  43323541000   9.095922
## 238              HAIL  18727703230   3.931944
## 150       FLASH FLOOD  17561538817   3.687104
## 90            DROUGHT  15018672000   3.153220
## 381         HURRICANE  14610229010   3.067466
## 573       RIVER FLOOD  10148404500   2.130691
## 418         ICE STORM   8967037810   1.882660

Plotting the 5 top most expensive event types

tmp <- economyPerEventType[1:5,]
tmp$TOTAL_DOLLAR <- tmp$TOTAL_DOLLAR / 1000000000 #Convert to Billions of dollar

par(mfrow=c(1,2), mar=c(4,4,2,4), cex = 0.6)
barplot(tmp[,"TOTAL_DOLLAR"], main = "Total cost by Event Type (top 5)", names.arg = tmp[,"EVTYPE"], xlab = "Event Type", ylab = "Billions of dollars")
barplot(tmp[,"PERCENTAGE"], main = "Percentage on overall cost by Event Type (top 5)", names.arg = tmp[,"EVTYPE"], xlab = "Event Type", ylab = "Percentage", ylim = c(0, 100))

We can see that across the United States the event type with the greatest economic consequences is the FLOOD.

Software Environment Information

## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] C/C/C/C/C/no_NO.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.8    evaluate_0.8    formatR_1.2.1   htmltools_0.2.6
##  [5] knitr_1.10.5    magrittr_1.5    rmarkdown_0.7   stringi_0.5-5  
##  [9] stringr_1.0.0   tools_3.1.2