Title

This project is an analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm data that were recorded from 1950 to November 2011. Storm events and impact on people, property and crop causing injuries, death and damages have been estimated based on reports and records from different sources all over the country including county, state and federal agencies, law enforcement, storm trackers and individual contributions from citizens.

Synopsis

The purpose of this project is to explore the NOAA dataset to determine possible scenarios that could happen in future of similar events. Data will be downloaded from the course website and processed using R Studio for analysis of the types of storm events and its characteristics. The results will determine the preparation to be taken in anticipation of the impact on human population based on occurence of injuries and fatalities. It will determine the potential magnitude in costs for property and crop damages. The information will be used by an agency official for preparation of severe weather events and prioritization of resources.

Data Processing

Loading the dataset

The stormdata.csv file is extracted from a url hyperlink and loaded into R using the read.csv function which created the dataset, df, with 902,297 observations from 37 variables. The initial data was reviewed using str and summary functions. It was subsetted into a new dataset, dfsub, to be used for further analysis. Subsequently, a library of functions, dplyr, sqldf and scales were attached to assist in processing in R Studio.

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dstfile <- "stormdata.csv"
download.file(url, dstfile)
df <- read.csv("stormdata.csv")
str(df)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Subsetting the dataset

dfsub <- subset(df, select = c(7:8, 23:28))
str(dfsub)
## 'data.frame':    902297 obs. of  8 variables:
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
library(dplyr)
## Warning: Installed Rcpp (0.12.10) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(scales)

Summarizing the data

Based on review of the initial subset, dfsub, it was necessary to create a separate dataset, named newdf, for summarization. The property and crop damages values corresponded to different labels with similar magnitude, i.e. terms of billion, million and hundreds. For example, m and M, both represented millions. The variables were then multiplied with the number of occurrences to create a new column PROP_AND_CROP_DAMAGES, using the mutate function of the dplyr package.

Subsequently, the sqldf function created a new dataset, stormdata. In this step, zero values were removed to reduce dimensionality grouped by event types.

newdf <- dfsub %>% mutate(pdmg = ifelse(PROPDMGEXP == 'H' | PROPDMGEXP == 'h', 100
 , ifelse(PROPDMGEXP == 'K' | PROPDMGEXP == 'k', 1000
 , ifelse(PROPDMGEXP == 'B' | PROPDMG == 'b', 1000000000
 , ifelse(PROPDMGEXP == 'M' | PROPDMGEXP == 'm', 1000000, 0))))) %>% mutate(cdmg = ifelse(CROPDMGEXP == 'H' | CROPDMGEXP == 'h', 100
 , ifelse(CROPDMGEXP == 'K' | CROPDMGEXP == 'k', 1000
 , ifelse(CROPDMGEXP == 'B' | CROPDMG == 'b', 1000000000
 , ifelse(CROPDMGEXP == 'M' | CROPDMGEXP == 'm', 1000000, 0))))) 

stormdata <- sqldf("select EVTYPE, (FATALITIES + INJURIES) AS FATAL_INJURIES,  (PROPDMG*pdmg + CROPDMG*cdmg) AS PROP_CROP_DAMAGES from newdf where FATALITIES > 0 or INJURIES > 0 or PROPDMG > 0 or CROPDMG > 0 ") 
## Loading required package: tcltk
## Warning: Quoted identifiers should have class SQL, use DBI::SQL() if the
## caller performs the quoting.

Aggregation

Since the stormdataset is large, it is desired to reduce dimensionality by aggregating the sum of the fatalities, injuries, property and crop damages per event. An events dataset was created by grouping similar events into categories using R’s sqldf function. An example is TORNADO and TORNDAO to be the same event type. This step is not perfect because it is based on intuition and assumptions which included descriptions that had incorrect spelling, inconsistent alpha-case labelling and in general, ambiguity of entries.

Two sets of summaries were calculated, namely; (1) fatalities and injuries and, (2) property and crop damages. First, an events dataset is created and a subsequent dataset, aggs.

events <- sqldf("select case when EVTYPE like 'AVAL%E' or EVTYPE in ('mudslides', 'MUDSLIDES')
        Or EVTYPE like '%MUD%SLIDE%' or EVTYPE like '%ROCK%SLIDE%'
        Or EVTYPE  like 'LANDSLIDE%'
        then 'AVALANCHE'  
    when EVTYPE like '%BLIZZARD%' or EVTYPE like 'BLOWING SNOW'  or EVTYPE like upper('%SNOW%')
        or EVTYPE like 'WINTER%' or EVTYPE like 'SLEET%'  
        Or EVTYPE like upper('%FOG%') Or EVTYPE like 'MIXED PRECIP%' 
        or EVTYPE like 'BLACK ICE%' or EVTYPE like 'ICY ROADS' or EVTYPE like 'ICE%' or EVTYPE like '%HAIL%'
        or EVTYPE like 'WINTRY%' 
        then 'SNOW STORM/WINTRY MIX'
    when EVTYPE like '%COLD%' or EVTYPE like '%WINDCHILL%' or EVTYPE like upper('%HYPOTHERMIA%')
        or EVTYPE like upper('%FREEZ%') or EVTYPE like 'LOW TEMP%' or EVTYPE like 'Early Frost' 
        then 'EXTREME COLD' 
    when EVTYPE like '%EXCESSIVE%HEAT%'  or EVTYPE like '%DROUGHT%' 
        or EVTYPE like '%HEAT%' or EVTYPE like '%WARM%' or EVTYPE like upper('%HYPERTHERMIA%')
        then 'EXCESSIVE HEAT/DROUGHT'  
    when EVTYPE like '%DUST%' or EVTYPE like '%WIND%' then 'DUST STORM'
    when EVTYPE like '%HURRICANE%' or EVTYPE like '%TROPICAL%' 
        Or EVTYPE like 'FUNNEL CLOUD'  
        then 'HURRICANE/TYPHOON'
    when EVTYPE in ('Microburst', 'MICROBURST') or EVTYPE like '%THUNDER%'  or EVTYPE = 'DRY MICROBURST'
        Then 'THUNDERSTORM'
    when EVTYPE like '%COLD%' or EVTYPE like 'FROST%' or EVTYPE like 'GLAZE%' or EVTYPE like 'COOL AND WET%'
        then 'EXTREME COLD'
    when EVTYPE like '%FLOOD%' or EVTYPE like '%RAIN%' or EVTYPE like '%STREAM FLD%' 
        or EVTYPE like '%STORM%SURGE%' or EVTYPE like 'WATERSPOUT'
        Or EVTYPE like '%RISING WATER%' or EVTYPE like 'EXCESSIVE WETNESS' 
        or EVTYPE in ('HEAVY MIX', 'HEAVY PRECIPITATION', 'HEAVY SHOWER', 'HEAVY SWELLS') 
        then 'FLOODS'
    when EVTYPE like '%WILD%FIRE%' or EVTYPE like 'BRUSH FIRE' or EVTYPE like 'DENSE SMOKE' 
        or EVTYPE like 'GRASS FIRES' then 'WILDFIRES'
    when EVTYPE in ('LIGHTNING','LIGHTING', 'LIGNTNING') or EVTYPE like 'LIGHTNING%' then 'LIGHTNING'
    when EVTYPE like '%MARINE%' or EVTYPE like upper('%SURF%') or EVTYPE like 'COASTAL%' 
        or EVTYPE like '%RIP CURRENT%' or EVTYPE like 'ROUGH SEAS' or EVTYPE like 'HEAVY SEAS'
        or EVTYPE like 'SEICHE%' or EVTYPE like 'SEVERE TURBULENCE' 
        then 'MARINE STORM'
    When EVTYPE like 'HIGH SEAS%' or EVTYPE like 'HIGH WAVE%' or EVTYPE like 'HIGH SWELLS'
        or EVTYPE like 'HIGH SWELLS' or EVTYPE like 'HIGH%' or EVTYPE like 'ROGUE WAVE'
        or EVTYPE like '%HIGH TIDE'   
         then 'TSUNAMI'
    when EVTYPE like '%TORNADO%'  or EVTYPE like 'TORNDAO'  then 'TORNADOES'
    When EVTYPE in ('Other', 'OTHER') then 'OTHER'
    else EVTYPE END AS EVTYPE
    , FATAL_INJURIES, PROP_CROP_DAMAGES 
from stormdata")

Summarizing the events dataset in this step to result in a smaller set named aggs, with 34 observations.

aggs <- aggregate(. ~ EVTYPE, events, sum)
View(aggs)

Results

Hierarchical clustering is going to be employed to identify clusters/groups and variability for impact of the different events on population health and damages to property and crop.

This methodology will result in a dendrogram, which is a tree-representation of the observations. It is the result of an agglomerative clustering described as bottom up by starting with single element cluster (one leaf) which continuously combine with the nearest neighbor until a big single root is formed.

The dist() function is used to compute the Euclidean distance between observations. The ward.D2 method is subsequently used in clustering the observations which minimizes the within-cluster variance.

Fatalities and Injuries

As an unsupervised learning algorithm, the analysis provided 3 clusters purporting close similarities between events. These are potentially the top storm categories causing most harm to the human population. A table is provided listing the top 10 event types and numbers of casualties, which is sorted in descending magnitude. It shows that TORNADOES are causing the highest number of fatalities and injuries, followed by DUST STORM, EXCESSIVE HEAT/DROUGHT, FLOODS and SNOW STORM/WINTRY MIX.

fi_df <- aggs[1:2]
fi <- aggs[2]
fi <- scale(fi)
Storm_Events_d <- dist(fi, method = "euclidean")
res.hc <- hclust(Storm_Events_d, method = "ward.D2")
plot(res.hc, labels = aggs$EVTYPE, cex=0.5, hang = -1)
rect.hclust(hclust(Storm_Events_d, method = "ward.D2"), h=1)

grp <- cutree(res.hc, k = 3)
table(grp)
## grp
##  1  2  3 
## 27  5  1

Table shows 3 groups where group 2 is referenced in the sorted top 10 list below.

rownames(aggs) [grp == 2]
## [1] "9"  "10" "12" "18" "21"
sorted_fi <- fi_df[ order(-fi_df[,2]),]
head(sorted_fi,10)
##                    EVTYPE FATAL_INJURIES
## 23              TORNADOES          97043
## 9              DUST STORM          12983
## 10 EXCESSIVE HEAT/DROUGHT          12426
## 12                 FLOODS          10714
## 21  SNOW STORM/WINTRY MIX           9460
## 18              LIGHTNING           6048
## 15      HURRICANE/TYPHOON           1913
## 33              WILDFIRES           1698
## 19           MARINE STORM           1547
## 11           EXTREME COLD           1082

Property and Crop Damages

Similarly, a dendrogram was produced showing 4 clusters of events that caused property and crop damages in order of magnitude. The table lists FLOODS, HURRICANE/TYPHOONS, TORNADOES, SNOW STORM/WINTRY MIX AND EXCESSIVE HEAT/DROUGHT on the top 10 list.

dmg_df <- aggs[c(1,3)]
dmg <- aggs[3]
dmg <- scale(dmg)
Storm_Events_dd <- dist(dmg, method = "euclidean")
res.hcddd <- hclust(Storm_Events_dd, method = "ward.D2")

plot(res.hcddd, labels = aggs$EVTYPE, cex=0.5, hang = -1)
rect.hclust(hclust(Storm_Events_dd, method = "ward.D2"), h=1)

grpddd <- cutree(res.hcddd, k = 3)
table(grpddd)
## grpddd
##  1  2  3 
## 29  1  3
rownames(aggs) [grpddd == 2]
## [1] "12"
sorted_dmg <- dmg_df[ order(-dmg_df[,2]),]
head(sorted_dmg,10)
##                    EVTYPE PROP_CROP_DAMAGES
## 12                 FLOODS      231996527860
## 15      HURRICANE/TYPHOON       98572690960
## 23              TORNADOES       57408059490
## 21  SNOW STORM/WINTRY MIX       38517885620
## 9              DUST STORM       17979337850
## 10 EXCESSIVE HEAT/DROUGHT       15943477030
## 33              WILDFIRES        8894510130
## 11           EXTREME COLD        3711408800
## 22           THUNDERSTORM        1233946150
## 18              LIGHTNING         945791370

Conclusion

This project provided information on the potential storm events which caused highest impact on human health and damages to the population. The data source is the most comprehensive compilation from NOAA from 1950 to 2011 but during the analysis, it is found to include typographic errors, spelling and ambiguity in the records. Although there are limitations, if the analysis can be proven to result in close estimates and is reproducible, it offers an option which can be used for preparation and prioritization of resources for protection from major storm events.