Reproducible Research 2: Analyzing NOAA Events to Map and Rank Top US Weather Threats.

Synopsys

In a reproducible way, and using R-Studio tools, this analysis script retrieves and process NOAA catalogued weather events to extract most harmful or damaging threats affecting the United States. The analysis will attempt to answer: “Where, How Often and What Type of Events constitute for each US state his major threats?” After documenting the tools, retrieval and cleaning/reformatting of the data, an analysis is conducted to extract, rank the Top5 and display the major weather threats catalogued up to 2011. Four different categories representing facets of this threat affecting Fatalities, Injuries, Property Damages and Crop Damages, quantify these threats as catalogued in the NOAA database as EVTYPE. The Top5 threats of each category are extracted for each US State, and their yearly rate is normalized and combined. Visualization is made easy thru mapping across the US, for both relative impact and relative frequency. This is performed on a spectrum of time windows, ranging between 5 and 25 years span to discern trends, and presented side-by-side. Having assessed cumulative risk impact and frequency by location, the last step of this analysis quantifies and compares each US State side-by-side with their respective cumulated threat index, with most impacting individual event types quantified and color coded. It is expected that this type of analysis will contribute in event preparation, based on prior events occurrence, impact, frequency and localization.

System and Platform documentation

Sys.info()[1:5]                         # system info but exclude login and user info

##                      sysname                      release 
##                    "Windows"                      "7 x64" 
##                      version                     nodename 
## "build 7601, Service Pack 1"                   "STALLION" 
##                      machine 
##                     "x86-64"

project<-'PA2'                          # create new project
if (!file.exists(project)){dir.create(project)}           
userdir<-getwd()                        # user-defined startup directory (Drive:/~/PA2/)
projectdir<-paste(userdir,project,sep="/")
setwd(projectdir)                       # work from the project directory
datadir<-"./data" ; if (!file.exists("data")) { dir.create("data") } # data will reside in subdir 
library(dplyr)                          # provides data manipulating functions

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)                        # for graphics
library(magrittr)                       # ceci n'est pas une pipe
library(scales)                         # for scaling maps in ggplot
library(RCurl)                          # handles bzip2 files as well

## Loading required package: bitops

library(tidyr)                          # to tidy the data

## 
## Attaching package: 'tidyr'
## 
## The following object is masked from 'package:magrittr':
## 
##     extract

library(ggmap)                          # for maps

## 
## Attaching package: 'ggmap'
## 
## The following object is masked from 'package:magrittr':
## 
##     inset

library(maps)
library(reshape2)
sessionInfo()                           # to document platform

## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.1 maps_2.3-9     ggmap_2.4      tidyr_0.2.0   
##  [5] RCurl_1.95-4.6 bitops_1.0-6   scales_0.2.4   magrittr_1.5  
##  [9] ggplot2_1.0.1  dplyr_0.4.1   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.5         knitr_1.10          MASS_7.3-40        
##  [4] munsell_0.4.2       lattice_0.20-31     geosphere_1.3-13   
##  [7] colorspace_1.2-6    rjson_0.2.15        jpeg_0.1-8         
## [10] stringr_0.6.2       plyr_1.8.2          tools_3.2.0        
## [13] parallel_3.2.0      grid_3.2.0          gtable_0.1.2       
## [16] png_0.1-7           DBI_0.3.1           htmltools_0.2.6    
## [19] yaml_2.1.13         assertthat_0.1      digest_0.6.8       
## [22] RJSONIO_1.3-0       mapproj_1.2-2       formatR_1.2        
## [25] evaluate_0.7        rmarkdown_0.5.1     sp_1.1-0           
## [28] RgoogleMaps_1.2.0.7 proto_0.3-10

Notice this markdown was developed on a Windows 64-bit platform with the English US locale. The project resides in the PA1 subdirectory and data will reside in PA/data subdir. We document the libraries used and the session info provides reproducibility on the documented R Studio Platform and the Windows 7 x64 version.

It must be noted that the download implemented here is using a non-secured link, i.e. with reduced confidence compared to a secured https protocol. However, knittr will encounter an error when attempting to download a binary (.zip) file with an https.protocol. Changing to a non-secured http url and setting file.download=“wb” allows knittr to progress. This is strictly observed when knitting and the following error message is generated:

Error in download.file(url, dest = filename, mode = “wb”) : unsupported URL scheme

Note also that R studio can handle https on the same binary file without error.

Data Processing

Loading and pre-processing the data into the activity data frame is automated with this R chunk and the data file is copied in the /data subdirectory.

url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filename <- paste(datadir,strsplit(url,"%2F")[[1]][length(strsplit(url,"%2F")[[1]])],sep="/")
download.file(url, dest=filename, mode="wb")
filename <-(list.files(path=datadir,full.names=T,recursive=T))          # now holds the csv filename
StormData <- read.csv(filename,header=TRUE,stringsAsFactors=FALSE)      # populate data frame 
unlink(filename)                                                        # remove the .bz2 file
setwd(userdir)                                                          # return to original work dir

We observe basic information and missing data NA as indicated with basic data.frame characteristics.

str(StormData)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

summary(StormData)

##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI       
##  Min.   :   0.000   Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character  
##  Mean   :   1.484                                        
##  3rd Qu.:   1.000                                        
##  Max.   :3749.000                                        
##                                                          
##    END_DATE           END_TIME           COUNTY_END COUNTYENDN    
##  Length:902297      Length:902297      Min.   :0    Mode:logical  
##  Class :character   Class :character   1st Qu.:0    NA's:902297   
##  Mode  :character   Mode  :character   Median :0                  
##                                        Mean   :0                  
##                                        3rd Qu.:0                  
##                                        Max.   :0                  
##                                                                   
##    END_RANGE          END_AZI           END_LOCATI       
##  Min.   :  0.0000   Length:902297      Length:902297     
##  1st Qu.:  0.0000   Class :character   Class :character  
##  Median :  0.0000   Mode  :character   Mode  :character  
##  Mean   :  0.9862                                        
##  3rd Qu.:  0.0000                                        
##  Max.   :925.0000                                        
##                                                          
##      LENGTH              WIDTH                F               MAG         
##  Min.   :   0.0000   Min.   :   0.000   Min.   :0.0      Min.   :    0.0  
##  1st Qu.:   0.0000   1st Qu.:   0.000   1st Qu.:0.0      1st Qu.:    0.0  
##  Median :   0.0000   Median :   0.000   Median :1.0      Median :   50.0  
##  Mean   :   0.2301   Mean   :   7.503   Mean   :0.9      Mean   :   46.9  
##  3rd Qu.:   0.0000   3rd Qu.:   0.000   3rd Qu.:1.0      3rd Qu.:   75.0  
##  Max.   :2315.0000   Max.   :4400.000   Max.   :5.0      Max.   :22000.0  
##                                         NA's   :843563                    
##    FATALITIES          INJURIES            PROPDMG       
##  Min.   :  0.0000   Min.   :   0.0000   Min.   :   0.00  
##  1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:   0.00  
##  Median :  0.0000   Median :   0.0000   Median :   0.00  
##  Mean   :  0.0168   Mean   :   0.1557   Mean   :  12.06  
##  3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:   0.50  
##  Max.   :583.0000   Max.   :1700.0000   Max.   :5000.00  
##                                                          
##   PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Length:902297      Min.   :  0.000   Length:902297     
##  Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :  0.000   Mode  :character  
##                     Mean   :  1.527                     
##                     3rd Qu.:  0.000                     
##                     Max.   :990.000                     
##                                                         
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
##

head(StormData,5)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5

We will work on a copy of this pristine dataset in a dataframe df, which can take some time to load… We’ll use cache=TRUE option for this chunk, and we’ll start the data cleaning process. Note this script will rely heavily on the magrittr package, so analysts not familiar with this tool are referred to the vignette for further details.

df<-as.data.frame(StormData)            # holds the df to analyze
names(df) %<>% tolower
data(state.fips)

We convert the exponent prefix using generic Regex so we’ll be able to tally amounts of damages both for property and crop, and combine the result as a numeric value (multiplier).

df$propdmgexp %<>%
gsub("[Kk]","3",.) %>%
gsub("[Mm]","6",.) %>%
gsub("[Hh]","2",.) %>%
gsub("B","9",.) %>%
gsub("[[:punct:]]","0",.) %>% 
sub("","0",.) 
df$propdmgexp<-10^as.numeric(df$propdmgexp)

df$cropdmgexp %<>%
gsub("[Kk]","3",.) %>%
gsub("[Mm]","6",.) %>%
gsub("B","9",.) %>%
gsub("[[:punct:]]","0",.) %>%
sub("","0",.) 
df$cropdmgexp<-10^as.numeric(df$cropdmgexp)

We now only extract vectors we intend to process, combine the property and crop damages with their respective multiplier and retain only data entries that present a non-zero value in any of the 4 categories, and then eliminate redundant exponents.

df %<>%
select(bgn_date,bgn_time,evtype,fatalities,injuries,propdmg,propdmgexp,cropdmg,cropdmgexp,state__,state,county,countyname,longitude,latitude) %>%
rename(date=bgn_date,time=bgn_time,state.FIBS=state__,county.FIBS=county,county=countyname) %>%
mutate(propdmg=propdmg*propdmgexp,cropdmg=cropdmg*cropdmgexp) %>%
filter(fatalities > 0 || injuries > 0|| propdmg>0 || cropdmg>0) %>%
rename(propdamages=propdmg,cropdamages=cropdmg)
df<-df[,-c(7,9)]    # remove propdmgexp,cropdmgexp as they are redundant

The next step is to cleanup and format date and time.

df$date %<>%
gsub(" 0:00:00","",.) %>%
as.Date("%m/%d/%Y") 
df$time %<>% strptime("%H%M") %>%
format("%H:%M")

The next step is to cleanup event types for similar events based on shorthand, vocabulary or grammar as many manual entries show obvious occurrences (e.g. evtype ending with s).

df$evtype %<>% toupper %>%
gsub("WINDS","WIND",.) %>%
gsub("STORMS","STORM",.) %>%
gsub("FIRES","FIRE",.) %>%
gsub("LANDSLIDES","LANDSLIDE",.) %>%
gsub("FLD","FLOOD",.) %>%
gsub("GLAZE","ICE",.) %>%
gsub("CURRENTS","CURRENT",.) %>%
replace(grepl("(?i)WIND",.,perl=TRUE),"WIND") %>%
replace(grepl("(?i)SNOW",.,perl=TRUE),"SNOW") %>%
replace(grepl("(?i)DUST",.,perl=TRUE),"DUST") %>%
replace(grepl("(?i)HAIL",.,perl=TRUE),"HAIL") %>%
replace(grepl("(?i)TIDE",.,perl=TRUE),"TIDE") %>%
replace(grepl("(?i)FLOOD",.,perl=TRUE),"FLOOD") %>%
replace(grepl("(?i)SURF",.,perl=TRUE),"SURF") %>%
replace(grepl("(?i)ICE",.,perl=TRUE),"ICE") %>%
replace(grepl("(?i)COLD",.,perl=TRUE),"COLD") %>%
replace(grepl("(?i)HEAT",.,perl=TRUE),"HEAT") %>%
replace(grepl("(?i)TROPICAL STORM",.,perl=TRUE),"TROPICAL STORM") %>%
replace(grepl("(?i)FLOOD",.,perl=TRUE),"FLOOD") %>%
replace(grepl("(?i)FOG",.,perl=TRUE),"FOG") %>%
replace(grepl("(?i)SEAS",.,perl=TRUE),"HIGH SEAS") %>%
replace(grepl("(?i)RAIN",.,perl=TRUE),"RAIN") %>%
replace(grepl("(?i)MIX",.,perl=TRUE),"MIXED PRECIP") %>%
replace(grepl("(?i)FUNNEL",.,perl=TRUE),"TORNADO") %>%
replace(grepl("(?i)TORNADO",.,perl=TRUE),"TORNADO") %>%
replace(grepl("(?i)TROPICAL",.,perl=TRUE),"TROPICAL") %>%
replace(grepl("(?i)WINTER",.,perl=TRUE),"WINTER") %>%
replace(grepl("(?i)HURRICANE",.,perl=TRUE),"HURRICANE") %>%
replace(grepl("(?i)ICY",.,perl=TRUE),"ICE") %>%
replace(grepl("(?i)MISHAP",.,perl=TRUE),"MARINE ACCIDENT") %>%
replace(grepl("(?i)FIRE",.,perl=TRUE),"WILD FIRE")

As we intend to analyze both Impact (i.e. Magnitude) and Frequency, from the data retained in df, build a high impact (hi) and a high frequency (hf) datasets. These sets have same structure. Only difference is that we retain counts only for frequency, not magnitude of the events.

df %>%
mutate(year=as.numeric(format(df$date,"%Y"))) %>%
select(-date,-time) %>%
group_by(state.FIBS,state,year,evtype,longitude,latitude) %>%
summarize(total.fatalities=sum(fatalities),total.injuries=sum(injuries),total.propdamages=sum(propdamages),total.cropdamages=sum(cropdamages)) %>%
arrange(desc(total.fatalities),desc(total.injuries),desc(total.propdamages),desc(total.cropdamages)) %>%
as.data.frame -> hi

For the frequency dataset, start with a copy of hi and set all missing entries to 0, then substitute positive entries with 1 (count only).

hf<-hi
hf$total.fatalities[is.na(hf$total.fatalities)]<-0
hf$total.injuries[is.na(hf$total.injuries)]<-0
hf$total.propdamages[is.na(hf$total.propdamages)]<-0
hf$total.cropdamages[is.na(hf$total.cropdamages)]<-0

hf$total.fatalities[hf$total.fatalities>0]<-1
hf$total.injuries[hf$total.injuries>0]<-1
hf$total.propdamages[hf$total.propdamages>0]<-1
hf$total.cropdamages[hf$total.cropdamages>0]<-1

The final cleaning step leads to initializing data frames for last decade’s statistics on impact (ldhi) and frequency (ldhf) where we will accumulate windowing data. We also prepare our slicing labels.

ldhi.stat<-data.frame(fips=0,state="alabama",cum.fatalities=0,cum.injuries=0,cum.propdamages=0,cum.cropdamages=0,ssa=0,region=0,division=0,abb="AL",polyname="alabama",span=0,slice=0)  # inititialize data.frame for tallying stats
ldhf.stat<-data.frame(fips=0,state="alabama",cum.fatalities=0,cum.injuries=0,cum.propdamages=0,cum.cropdamages=0,ssa=0,region=0,division=0,abb="AL",polyname="alabama",span=0,slice=0)  # inititialize data.frame for tallying stats
sliced<-c("30YR","25YR","20YR","15YR","10YR","05YR")

Analysis Steps

Process Description

We will repeat for increasing window size of 5, 10, 15, and up to 25 years and each case extract similar data we will compile in data.frames (ldhi.stat for impact and ldhf.stat for frequency), at the state level. We will split each type in top 5 and all-other groups. Because not all threat types have same top5, it is necessary after merging to readjust / rebalance the all-others categories accordingly. The code to do this is not complicated but tedious but extracts at most (4x5) top threats plus their ties… and the results will be placed in ldhi and ldhf data frames after the ranking is complete. The data is then reduced to annual rates, and then normalized (between 0 and 1) in each category of threat so comparisons and tally can be presented in a homogeneous scale. Then mapping can be produced to locate the threat impact and frequency across the US and finally the last 5-year period Top5 events will be compared, side-by-side, for each US state.

Step1

Step1: human (fatalities and injuries) and damages (prop and crop) impact grouped by state.FIBS over last 25,20,15,10,5 years… this is a big chunk and loop with comments describing the process key sub-steps.

for (slice in 6:1) {
        window<-5*slice
        span<-2011-window
        hi %>%
        filter(year>span) %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.total.fatalities=sum(total.fatalities),state.total.injuries=sum(total.injuries),state.total.propdamages=sum(total.propdamages),state.total.cropdamages=sum(total.cropdamages)) %>%
        arrange(desc(state.total.fatalities),desc(state.total.injuries),desc(state.total.propdamages),desc(state.total.cropdamages)) %>%
        as.data.frame -> ldhi

# human impact:fatalities ranked (top5+all others) and grouped by state.FIBS over time span

        ldhi %>%
        select(-state.total.injuries,-state.total.propdamages,-state.total.cropdamages) %>%
        group_by(state.FIBS,state,evtype) %>%
        tally(state.total.fatalities,sort=TRUE) %>%
        
        top_n(5) %>%
        
        as.data.frame -> ldhi.top5.fatalities

        ldhi %>%
        select(-state.total.injuries,-state.total.propdamages,-state.total.cropdamages) %>%
        rename(n=state.total.fatalities) %>%
        setdiff(ldhi.top5.fatalities) %>%
        as.data.frame -> dftemp

# rename all non-top5 as "ALL OTHERS" and bind them back to ldhi.fatalities dataframe

        dftemp$evtype<-"ALL OTHERS" 
        dftemp %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(n=sum(n)) %>%
        as.data.frame %>%
        rbind(ldhi.top5.fatalities) %>%
        rename(state.fatalities=n) %>%
        arrange(state.FIBS,state,desc(state.fatalities)) %>%
        as.data.frame -> ldhi.fatalities

# human impact:injuries ranked (top5+all others) and grouped by state.FIBS over time span

        ldhi %>%
        select(-state.total.fatalities,-state.total.propdamages,-state.total.cropdamages) %>%
        group_by(state.FIBS,state,evtype) %>%
        tally(state.total.injuries,sort=TRUE) %>%
        
        top_n(5) %>%
        
        as.data.frame -> ldhi.top5.injuries

        ldhi %>%
        select(-state.total.fatalities,-state.total.propdamages,-state.total.cropdamages) %>%
        rename(n=state.total.injuries) %>%
        setdiff(ldhi.top5.injuries) %>%
        as.data.frame -> dftemp

# rename all non-top5 as "ALL OTHERS" and bind them back to ldhi.injuries dataframe

        dftemp$evtype<-"ALL OTHERS" 
        dftemp %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(n=sum(n)) %>%
        as.data.frame %>%
        rbind(ldhi.top5.injuries) %>%
        rename(state.injuries=n) %>%
        arrange(state.FIBS,state,desc(state.injuries)) %>%
        as.data.frame -> ldhi.injuries

# damages impact: props ranked (top5+all others) and grouped by state.FIBS over time span

        ldhi %>%
        group_by(state.FIBS,state,evtype) %>%
        tally(state.total.propdamages,sort=TRUE) %>%
        
        top_n(5) %>%
        
        as.data.frame -> ldhi.top5.propdamages

        ldhi %>%
        select(-state.total.fatalities,-state.total.injuries,-state.total.cropdamages) %>%
        rename(n=state.total.propdamages) %>%
        setdiff(ldhi.top5.propdamages) %>%
        as.data.frame -> dftemp

# rename all non-top5 as "ALL OTHERS" and bind them back to ldhi.propdamages dataframe

        dftemp$evtype<-"ALL OTHERS" 
        dftemp %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(n=sum(n)) %>%
        as.data.frame %>%
        rbind(ldhi.top5.propdamages) %>%
        rename(state.propdamages=n) %>%
        arrange(state.FIBS,state,desc(state.propdamages)) %>%
        as.data.frame -> ldhi.propdamages

# damages impact: crops ranked (top5+all others) and grouped by state.FIBS over time span

        ldhi %>%
        group_by(state.FIBS,state,evtype) %>%
        tally(state.total.cropdamages,sort=TRUE) %>%
        
        top_n(5) %>%
        
        as.data.frame -> ldhi.top5.cropdamages

        ldhi %>%
        select(-state.total.fatalities,-state.total.injuries,-state.total.propdamages) %>%
        rename(n=state.total.cropdamages) %>%
        setdiff(ldhi.top5.cropdamages) %>%
        as.data.frame -> dftemp

# rename all non-top5 as "ALL OTHERS" and bind them back to ldhi.cropdamages dataframe

        dftemp$evtype<-"ALL OTHERS" 
        dftemp %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(n=sum(n)) %>%
        as.data.frame %>%
        rbind(ldhi.top5.cropdamages) %>%
        rename(state.cropdamages=n) %>%
        arrange(state.FIBS,state,desc(state.cropdamages)) %>%
        as.data.frame -> ldhi.cropdamages

# now merge back all categories together into the last decades high impact ranked dataframe ldhi.ranked

        ldhi.fatalities %>%
        merge(ldhi.injuries, all=T) %>%
        merge(ldhi.propdamages, all=T) %>%
        merge(ldhi.cropdamages, all=T) %>%
        arrange(state.FIBS,state,desc(state.fatalities),desc(state.injuries),desc(state.propdamages),desc(state.cropdamages)) %>%
        as.data.frame -> ldhi.ranked

# after renaming the cumulative counts in each category, save the results in the high impact ldh dataframe

        ldhi %>% 
        rename(state.fatalities=state.total.fatalities,state.injuries=state.total.injuries,state.propdamages=state.total.propdamages,state.cropdamages=state.total.cropdamages) %>%
        as.data.frame -> ldhi

# since the top 5 in each categories do not necessarily overlap, we obtain some NAs after merging, which
# we must replace with values in ldhi and compensate the "ALL OTHERS" counts for state.fatalities.
# Potentially we could endup with more than 20 aggregated top5, not counting the ties...
# we will repeat the same process for all categories: fatalities, injuries, propdamages and cropdamages

        subset(ldhi.ranked[,c(-5,-6,-7)],is.na(ldhi.ranked$state.fatalities)) %>%  # retain only the missing activity steps data subset filter(is.na(state.fatalities)) %>%
        rename(missing.state.fatalities=state.fatalities) %>%
        merge(ldhi[,c(-5,-6,-7)], all=FALSE) %>%
        select(-missing.state.fatalities)-> ldhi.ranked.m # NAs have been substituted

# retain only the fatalities and tally the deductions for events that need readjusted in ldhi.ranked.agg

        ldhi.ranked.nm<-subset(ldhi.ranked[,c(-5,-6,-7)],!is.na(ldhi.ranked$state.fatalities))

        ldhi.ranked.m %>%
        mutate(evtype="ALL OTHERS") %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.fatalities=-sum(state.fatalities))-> ldhi.ranked.agg # and aggregated for deducting from "ALL OTHERS"

        ldhi.ranked.nm %>%
        filter(evtype=="ALL OTHERS") %>%
        rename(target.state.fatalities=state.fatalities) %>%
        merge(ldhi.ranked.agg,all=FALSE) %>%
        mutate(state.fatalities=target.state.fatalities+state.fatalities) %>%
        select(-target.state.fatalities) %>%    # we'll need to keep the records that show > 0 values after combining
        merge(ldhi.ranked.m,all=TRUE) %>%
        merge(ldhi.ranked.nm,all=TRUE) %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.fatalities=min(state.fatalities)) %>%
        arrange(state.FIBS,state,desc(state.fatalities)) %>%
        as.data.frame -> ldhi.ranked.fatalities

        rm(ldhi.ranked.m,ldhi.ranked.nm,ldhi.ranked.agg)   # cleanup    

# must replace NAs with values in ldhi and compensate the "ALL OTHERS" counts for state.injuries

        subset(ldhi.ranked[,c(-4,-6,-7)],is.na(ldhi.ranked$state.injuries)) %>%  # retain only the missing activity steps data subsetfilter(is.na(state.fatalities)) %>%
        rename(missing.state.injuries=state.injuries) %>%
        merge(ldhi[,c(-4,-6,-7)], all=FALSE) %>%
        select(-missing.state.injuries)-> ldhi.ranked.m # NAs have been substituted

        ldhi.ranked.nm<-subset(ldhi.ranked[,c(-4,-6,-7)],!is.na(ldhi.ranked$state.injuries))

        ldhi.ranked.m %>%
        mutate(evtype="ALL OTHERS") %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.injuries=-sum(state.injuries))-> ldhi.ranked.agg # and aggregated for deducting from "ALL OTHERS"

        ldhi.ranked.nm %>%
        filter(evtype=="ALL OTHERS") %>%
        rename(target.state.injuries=state.injuries) %>%
        merge(ldhi.ranked.agg,all=FALSE) %>%
        mutate(state.injuries=target.state.injuries+state.injuries) %>%
        select(-target.state.injuries) %>%    # we'll need to keep the records that show > 0 values after combining
        merge(ldhi.ranked.m,all=TRUE) %>%
        merge(ldhi.ranked.nm,all=TRUE) %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.injuries=min(state.injuries)) %>%
        arrange(state.FIBS,state,desc(state.injuries)) %>%
        as.data.frame -> ldhi.ranked.injuries

        rm(ldhi.ranked.m,ldhi.ranked.nm,ldhi.ranked.agg)   # cleanup    

# must replace NAs with values in ldhi and compensate the "ALL OTHERS" counts for state.propdamages

        subset(ldhi.ranked[,c(-4,-5,-7)],is.na(ldhi.ranked$state.propdamages)) %>%  # retain only the missing activity steps data subset filter(is.na(state.propdamages)) %>%
        rename(missing.state.propdamages=state.propdamages) %>%
        merge(ldhi[,c(-4,-5,-7)], all=FALSE) %>%
        select(-missing.state.propdamages)-> ldhi.ranked.m # NAs have been substituted

        ldhi.ranked.nm<-subset(ldhi.ranked[,c(-4,-5,-7)],!is.na(ldhi.ranked$state.propdamages))

        ldhi.ranked.m %>%
        mutate(evtype="ALL OTHERS") %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.propdamages=-sum(state.propdamages))-> ldhi.ranked.agg # and aggregated for deducting from "ALL OTHERS"

        ldhi.ranked.nm %>%
        filter(evtype=="ALL OTHERS") %>%
        rename(target.state.propdamages=state.propdamages) %>%
        merge(ldhi.ranked.agg,all=FALSE) %>%
        mutate(state.propdamages=target.state.propdamages+state.propdamages) %>%
        select(-target.state.propdamages) %>%    # we'll need to keep the records that show > 0 values after combining
        merge(ldhi.ranked.m,all=TRUE) %>%
        merge(ldhi.ranked.nm,all=TRUE) %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.propdamages=min(state.propdamages)) %>%
        arrange(state.FIBS,state,desc(state.propdamages)) %>%
        as.data.frame -> ldhi.ranked.propdamages

        rm(ldhi.ranked.m,ldhi.ranked.nm,ldhi.ranked.agg)   # cleanup    

# must replace NAs with values in ldhi and compensate the "ALL OTHERS" counts for state.cropdamages

        subset(ldhi.ranked[,c(-4,-5,-6)],is.na(ldhi.ranked$state.cropdamages)) %>%  # retain only the missing activity steps data subsetfilter(is.na(state.fatalities)) %>%
        rename(missing.state.cropdamages=state.cropdamages) %>%
        merge(ldhi[,c(-4-5,-6)], all=FALSE) %>%
        select(-missing.state.cropdamages)-> ldhi.ranked.m # NAs have been substituted

        ldhi.ranked.nm<-subset(ldhi.ranked[,c(-4,-5,-6)],!is.na(ldhi.ranked$state.cropdamages))

        ldhi.ranked.m %>%
        mutate(evtype="ALL OTHERS") %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.cropdamages=-sum(state.cropdamages))-> ldhi.ranked.agg # and aggregated for deducting from "ALL OTHERS"

        ldhi.ranked.nm %>%
        filter(evtype=="ALL OTHERS") %>%
        rename(target.state.cropdamages=state.cropdamages) %>%
        merge(ldhi.ranked.agg,all=FALSE) %>%
        mutate(state.cropdamages=target.state.cropdamages+state.cropdamages) %>%
        select(-target.state.cropdamages) %>%    # we'll need to keep the records that show > 0 values after combining
        merge(ldhi.ranked.m,all=TRUE) %>%
        merge(ldhi.ranked.nm,all=TRUE) %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.cropdamages=min(state.cropdamages)) %>%
        arrange(state.FIBS,state,desc(state.cropdamages)) %>%
        as.data.frame -> ldhi.ranked.cropdamages

        rm(ldhi.ranked.m,ldhi.ranked.nm,ldhi.ranked.agg)   # cleanup    

# we now aggregate back the adjusted ranked categories into ldhi.
# this appears complex but is only lengthy codewise due to my inexperience...

        ldhi.ranked.fatalities %>%
        merge(ldhi.ranked.injuries,all=TRUE) %>%
        merge(ldhi.ranked.propdamages,all=TRUE) %>%
        merge(ldhi.ranked.cropdamages,all=TRUE) %>%
        filter(state.fatalities>0 || state.injuries>0 || state.propdamages>0 || state.cropdamages>0) %>%  # now is the time to eliminate the 0 values we introduced
        arrange(state.FIBS,state,desc(state.fatalities),desc(state.injuries),desc(state.propdamages),desc(state.cropdamages)) %>%
        as.data.frame -> ldhi

# at this stage, ldhi contains for each state and each event category the last decades top events that had highest impact

        rm(dftemp,ldhi.fatalities,ldhi.injuries,ldhi.propdamages,ldhi.cropdamages,ldhi.ranked.fatalities,ldhi.ranked.injuries,ldhi.ranked.propdamages,ldhi.ranked.cropdamages)

# Summarize by state.FIBS total casualities into ldhi.cum, which holds the data relative to the time slice
# 

        ldhi %>% group_by(state.FIBS,state) %>%
        summarize(cum.fatalities=sum(state.fatalities),cum.injuries=sum(state.injuries),cum.propdamages=sum(state.propdamages),cum.cropdamages=sum(state.cropdamages)) %>%
        arrange(state,desc(cum.fatalities),desc(cum.injuries),desc(cum.propdamages),desc(cum.cropdamages)) %>%
        rename(fips=state.FIBS) %>%
        merge(state.fips,all=T) %>%
        as.data.frame -> ldhi.cum

# complete the records with the span (number of years we analyzed back in time) and the slice (a variable we loop on)

        ldhi.cum$span<-2011-span
        ldhi.cum$slice<-slice

# populate the dataframe we'll use to analyze ldhi.stat

        ldhi.stat <- rbind(ldhi.stat,ldhi.cum)

# now tackle the Threat Frequency Index. This is a repeat but addresses the counts of events rather than the impact

        hf %>%
        filter(year>span) %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.total.fatalities=sum(total.fatalities),state.total.injuries=sum(total.injuries),state.total.propdamages=sum(total.propdamages),state.total.cropdamages=sum(total.cropdamages)) %>%
        arrange(desc(state.total.fatalities),desc(state.total.injuries),desc(state.total.propdamages),desc(state.total.cropdamages)) %>%
        as.data.frame -> ldhf

# human impact:fatalities ranked (top5+all others) and grouped by state.FIBS over time span

        ldhf %>%
        select(-state.total.injuries,-state.total.propdamages,-state.total.cropdamages) %>%
        group_by(state.FIBS,state,evtype) %>%
        tally(state.total.fatalities,sort=TRUE) %>%
        
        top_n(5) %>%
        
        as.data.frame -> ldhf.top5.fatalities

        ldhf %>%
        select(-state.total.injuries,-state.total.propdamages,-state.total.cropdamages) %>%
        rename(n=state.total.fatalities) %>%
        setdiff(ldhf.top5.fatalities) %>%
        as.data.frame -> dftemp
        dftemp$evtype<-"ALL OTHERS" 

        dftemp %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(n=sum(n)) %>%
        as.data.frame %>%
        rbind(ldhf.top5.fatalities) %>%
        rename(state.fatalities=n) %>%
        arrange(state.FIBS,state,desc(state.fatalities)) %>%
        as.data.frame -> ldhf.fatalities

        rm(ldhf.top5.fatalities)        

# human impact:injuries ranked (top5+all others) and grouped by state.FIBS over time span

        ldhf %>%
        select(-state.total.fatalities,-state.total.propdamages,-state.total.cropdamages) %>%
        group_by(state.FIBS,state,evtype) %>%
        tally(state.total.injuries,sort=TRUE) %>%
        
        top_n(5) %>%
        
        as.data.frame -> ldhf.top5.injuries

        ldhf %>%
        select(-state.total.fatalities,-state.total.propdamages,-state.total.cropdamages) %>%
        rename(n=state.total.injuries) %>%
        setdiff(ldhf.top5.injuries) %>%
        as.data.frame -> dftemp

        dftemp$evtype<-"ALL OTHERS" 

        dftemp %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(n=sum(n)) %>%
        as.data.frame %>%
        rbind(ldhf.top5.injuries) %>%
        rename(state.injuries=n) %>%
        arrange(state.FIBS,state,desc(state.injuries)) %>%
        as.data.frame -> ldhf.injuries

        rm(ldhf.top5.injuries)        

# damages impact: props ranked (top5+all others) and grouped by state.FIBS over time span

        ldhf %>%
        group_by(state.FIBS,state,evtype) %>%
        tally(state.total.propdamages,sort=TRUE) %>%
        
        top_n(5) %>%
        
        as.data.frame -> ldhf.top5.propdamages

        ldhf %>%
        select(-state.total.fatalities,-state.total.injuries,-state.total.cropdamages) %>%
        rename(n=state.total.propdamages) %>%
        setdiff(ldhf.top5.propdamages) %>%
        as.data.frame -> dftemp

        dftemp$evtype<-"ALL OTHERS" 

        dftemp %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(n=sum(n)) %>%
        as.data.frame %>%
        rbind(ldhf.top5.propdamages) %>%
        rename(state.propdamages=n) %>%
        arrange(state.FIBS,state,desc(state.propdamages)) %>%
        as.data.frame -> ldhf.propdamages

        rm(ldhf.top5.propdamages)

# damages impact: crops ranked (top5+all others) and grouped by state.FIBS over time span

        ldhf %>%
        group_by(state.FIBS,state,evtype) %>%
        tally(state.total.cropdamages,sort=TRUE) %>%
        
        top_n(5) %>%
        
        as.data.frame -> ldhf.top5.cropdamages

        ldhf %>%
        select(-state.total.fatalities,-state.total.injuries,-state.total.propdamages) %>%
        rename(n=state.total.cropdamages) %>%
        setdiff(ldhf.top5.cropdamages) %>%
        as.data.frame -> dftemp

        dftemp$evtype<-"ALL OTHERS" 

        dftemp %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(n=sum(n)) %>%
        as.data.frame %>%
        rbind(ldhf.top5.cropdamages) %>%
        rename(state.cropdamages=n) %>%
        arrange(state.FIBS,state,desc(state.cropdamages)) %>%
        as.data.frame -> ldhf.cropdamages

        rm(ldhf.top5.cropdamages)        

        ldhf.fatalities %>%
        merge(ldhf.injuries, all=T) %>%
        merge(ldhf.propdamages, all=T) %>%
        merge(ldhf.cropdamages, all=T) %>%
        arrange(state.FIBS,state,desc(state.fatalities),desc(state.injuries),desc(state.propdamages),desc(state.cropdamages)) %>%
        as.data.frame -> ldhf.ranked

        ldhf %>% rename(state.fatalities=state.total.fatalities,state.injuries=state.total.injuries,state.propdamages=state.total.propdamages,state.cropdamages=state.total.cropdamages) %>%
        as.data.frame -> ldhf

# must replace NAs with values in ldhf and compensate the "ALL OTHERS" counts for state.fatalities

        subset(ldhf.ranked[,c(-5,-6,-7)],is.na(ldhf.ranked$state.fatalities)) %>%  # retain only the missing activity steps data subset filter(is.na(state.fatalities)) %>%
        rename(missing.state.fatalities=state.fatalities) %>%
        merge(ldhf[,-c(5:7)], all=FALSE) %>%
        select(-missing.state.fatalities)-> ldhf.ranked.m # NAs have been substituted

        ldhf.ranked.nm<-subset(ldhf.ranked[,c(-5,-6,-7)],!is.na(ldhf.ranked$state.fatalities))

        ldhf.ranked.m %>%
        mutate(evtype="ALL OTHERS") %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.fatalities=-sum(state.fatalities))-> ldhf.ranked.agg # and aggregated for deducting from "ALL OTHERS"

        ldhf.ranked.nm %>%
        filter(evtype=="ALL OTHERS") %>%
        rename(target.state.fatalities=state.fatalities) %>%
        merge(ldhf.ranked.agg,all=FALSE) %>%
        mutate(state.fatalities=target.state.fatalities+state.fatalities) %>%
        select(-target.state.fatalities) %>%    # we'll need to keep the records that show > 0 values after combining
        merge(ldhf.ranked.m,all=TRUE) %>%
        merge(ldhf.ranked.nm,all=TRUE) %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.fatalities=min(state.fatalities)) %>%
        arrange(state.FIBS,state,desc(state.fatalities)) %>%
        as.data.frame -> ldhf.ranked.fatalities

        rm(ldhf.ranked.m,ldhf.ranked.nm,ldhf.ranked.agg)   # cleanup    

# must replace NAs with values in ldhf and compensate the "ALL OTHERS" counts for state.injuries

        subset(ldhf.ranked[,c(-4,-6,-7)],is.na(ldhf.ranked$state.injuries)) %>%  # retain only the missing activity steps data subsetfilter(is.na(state.fatalities)) %>%
        rename(missing.state.injuries=state.injuries) %>%
        merge(ldhf[,c(-4,-6,-7)], all=FALSE) %>%
        select(-missing.state.injuries)-> ldhf.ranked.m # NAs have been substituted

        ldhf.ranked.nm<-subset(ldhf.ranked[,c(-4,-6,-7)],!is.na(ldhf.ranked$state.injuries))

        ldhf.ranked.m %>%
        mutate(evtype="ALL OTHERS") %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.injuries=-sum(state.injuries))-> ldhf.ranked.agg # and aggregated for deducting from "ALL OTHERS"

        ldhf.ranked.nm %>%
        filter(evtype=="ALL OTHERS") %>%
        rename(target.state.injuries=state.injuries) %>%
        merge(ldhf.ranked.agg,all=FALSE) %>%
        mutate(state.injuries=target.state.injuries+state.injuries) %>%
        select(-target.state.injuries) %>%    # we'll need to keep the records that show > 0 values after combining
        merge(ldhf.ranked.m,all=TRUE) %>%
        merge(ldhf.ranked.nm,all=TRUE) %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.injuries=min(state.injuries)) %>%
        arrange(state.FIBS,state,desc(state.injuries)) %>%
        as.data.frame -> ldhf.ranked.injuries

        rm(ldhf.ranked.m,ldhf.ranked.nm,ldhf.ranked.agg)   # cleanup    

# must replace NAs with values in ldhf and compensate the "ALL OTHERS" counts for state.propdamages

        subset(ldhf.ranked[,c(-4,-5,-7)],is.na(ldhf.ranked$state.propdamages)) %>%  # retain only the missing activity steps data subset filter(is.na(state.propdamages)) %>%
        rename(missing.state.propdamages=state.propdamages) %>%
        merge(ldhf[,c(-4,-5,-7)], all=FALSE) %>%
        select(-missing.state.propdamages)-> ldhf.ranked.m # NAs have been substituted

        ldhf.ranked.nm<-subset(ldhf.ranked[,c(-4,-5,-7)],!is.na(ldhf.ranked$state.propdamages))

        ldhf.ranked.m %>%
        mutate(evtype="ALL OTHERS") %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.propdamages=-sum(state.propdamages))-> ldhf.ranked.agg # and aggregated for deducting from "ALL OTHERS"

        ldhf.ranked.nm %>%
        filter(evtype=="ALL OTHERS") %>%
        rename(target.state.propdamages=state.propdamages) %>%
        merge(ldhf.ranked.agg,all=FALSE) %>%
        mutate(state.propdamages=target.state.propdamages+state.propdamages) %>%
        select(-target.state.propdamages) %>%    # we'll need to keep the records that show > 0 values after combining
        merge(ldhf.ranked.m,all=TRUE) %>%
        merge(ldhf.ranked.nm,all=TRUE) %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.propdamages=min(state.propdamages)) %>%
        arrange(state.FIBS,state,desc(state.propdamages)) %>%
        as.data.frame -> ldhf.ranked.propdamages

        rm(ldhf.ranked.m,ldhf.ranked.nm,ldhf.ranked.agg)   # cleanup    

# must replace NAs with values in ldhf and compensate the "ALL OTHERS" counts for state.cropdamages

        subset(ldhf.ranked[,c(-4,-5,-6)],is.na(ldhf.ranked$state.cropdamages)) %>%  # retain only the missing activity steps data subsetfilter(is.na(state.fatalities)) %>%
        rename(missing.state.cropdamages=state.cropdamages) %>%
        merge(ldhf[,c(-4-5,-6)], all=FALSE) %>%
        select(-missing.state.cropdamages)-> ldhf.ranked.m # NAs have been substituted

        ldhf.ranked.nm<-subset(ldhf.ranked[,c(-4,-5,-6)],!is.na(ldhf.ranked$state.cropdamages))

        ldhf.ranked.m %>%
        mutate(evtype="ALL OTHERS") %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.cropdamages=-sum(state.cropdamages))-> ldhf.ranked.agg # and aggregated for deducting from "ALL OTHERS"

        ldhf.ranked.nm %>%
        filter(evtype=="ALL OTHERS") %>%
        rename(target.state.cropdamages=state.cropdamages) %>%
        merge(ldhf.ranked.agg,all=FALSE) %>%
        mutate(state.cropdamages=target.state.cropdamages+state.cropdamages) %>%
        select(-target.state.cropdamages) %>%    # we'll need to keep the records that show > 0 values after combining
        merge(ldhf.ranked.m,all=TRUE) %>%
        merge(ldhf.ranked.nm,all=TRUE) %>%
        group_by(state.FIBS,state,evtype) %>%
        summarize(state.cropdamages=min(state.cropdamages)) %>%
        arrange(state.FIBS,state,desc(state.cropdamages)) %>%
        as.data.frame -> ldhf.ranked.cropdamages

        rm(ldhf.ranked.m,ldhf.ranked.nm,ldhf.ranked.agg)   # cleanup    

        ldhf.ranked.fatalities %>%
        merge(ldhf.ranked.injuries,all=TRUE) %>%
        merge(ldhf.ranked.propdamages,all=TRUE) %>%
        merge(ldhf.ranked.cropdamages,all=TRUE) %>%
        filter(state.fatalities>0 || state.injuries>0 || state.propdamages>0 || state.cropdamages>0) %>%  # now is the time to eliminate the 0 values we introduced
        arrange(state.FIBS,state,desc(state.fatalities),desc(state.injuries),desc(state.propdamages),desc(state.cropdamages)) %>%
        as.data.frame -> ldhf

        rm(dftemp,ldhf.fatalities,ldhf.injuries,ldhf.propdamages,ldhf.cropdamages,ldhf.ranked.fatalities,ldhf.ranked.injuries,ldhf.ranked.propdamages,ldhf.ranked.cropdamages)


# Summarize by state.FIBS total casualities

        ldhf %>% group_by(state.FIBS,state) %>%
        summarize(cum.fatalities=sum(state.fatalities),cum.injuries=sum(state.injuries),cum.propdamages=sum(state.propdamages),cum.cropdamages=sum(state.cropdamages)) %>%
        arrange(state,desc(cum.fatalities),desc(cum.injuries),desc(cum.propdamages),desc(cum.cropdamages)) %>%
        rename(fips=state.FIBS) %>%
        merge(state.fips,all=T) %>%
        as.data.frame -> ldhf.cum

        ldhf.cum$span<-2011-span
        ldhf.cum$slice<-slice

        ldhf.stat <- rbind(ldhf.stat,ldhf.cum)
}

## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n

Upon completion of the loop, we cleanup and strip the initial data row which served as template.

rm(ldhi.cum,ldhi,ldhf.cum,ldhf)    
ldhi.stat<-ldhi.stat[-1,] 
ldhf.stat<-ldhf.stat[-1,]

We have now populated ldhi.stat and ldhf.stat with highest impact and highest frequency data over progressively smaller slicing periods. The next step will reduce this data so it can be mapped.

Step 2: Obtain Annual Rates

We start now the analysis to produce map ready data, starting with ldhi.stat, the impact dataframe, and will repeat for the frequency data ldhf.stat in an identical way. First obtain the rate of impact by dividing cumulative values by time span (in years).

ldhi.stat %>% 
mutate(fatality.rate=cum.fatalities/ldhi.stat$span,injury.rate=cum.injuries/ldhi.stat$span,propdamage.rate=cum.propdamages/ldhi.stat$span,cropdamage.rate=cum.cropdamages/ldhi.stat$span) %>%
as.data.frame ->ldhi

We continue by setting all missing values to 0 and populate the top vector with the max values, then normalize by simply dividing each rate by the respective maximum value, and then produce the dataframe threati, in which we combine the localization data by merging with the state.fibs dataframe. This step is essential as mapping requires full name of the states, while the original dataset only provided the abbreviated states names.

It should be noted that to render AK and HI in this map requires additional work, using probably as best candidate a Chropleths approach as described in Vignette. We will leave it as a recommended future work to perform the adaptation.

ldhi$fatality.rate[is.na(ldhi$fatality.rate)]<-0
ldhi$injury.rate[is.na(ldhi$injury.rate)]<-0
ldhi$propdamage.rate[is.na(ldhi$propdamage.rate)]<-0
ldhi$cropdamage.rate[is.na(ldhi$cropdamage.rate)]<-0

# determine maximum to normalize the data between 0 and 1 so we can compare between impact threat types

ldhi %>%
select(fatality.rate,injury.rate,propdamage.rate,cropdamage.rate) %>%
summarize(mfr=max(fatality.rate),mir=max(injury.rate),mpr=max(propdamage.rate),mcr=max(cropdamage.rate)) -> top

ldhi %>% 
mutate(Fatality=fatality.rate/top$mfr,Injury=injury.rate/top$mir,Property=propdamage.rate/top$mpr,Crop=cropdamage.rate/top$mcr) %>%
merge(state.fips,all=F) %>%
rename(state.abb=state,state=polyname) %>%
subset(slice > 1) %>%
as.data.frame -> threati
threati$slice<-as.factor(sliced[threati$slice])

At this stage, threati can be mapped, but we prefer the tidy data and use melt to reshape and set all missing values to 0, realizing the 30 year data will not bring significant departure to our observations.

threati %>%
droplevels %>%
select(state,Fatality,Injury,Property,Crop,slice) %>%
melt %>%

as.data.frame -> threatim

## Using state, slice as id variables

threatim$value[is.na(threatim$value)]<-0

We have completed the reduction of the data and threati and threatim (melted) contain the impact data for mapping. The next step will be to display these observations.

Results 1: Impact Data Display

We now produce the facet maps for impact, in orange with a modified fill guide colorbar at bottom, using as template the chunk proposed in the ggplot documentation.

if (require(maps)) {
        states_map <- map_data("state")
        ggplot(threati, aes(map_id = state))+
                geom_map(aes(fill = Fatality), map = states_map)+
                expand_limits(x = states_map$long, y = states_map$lat)
        last_plot()+coord_map()
        ggplot(threatim, aes(map_id = state))+
                geom_map(aes(fill = value), map = states_map)+
                expand_limits(x = states_map$long, y = states_map$lat)+
                theme(legend.position = "bottom")+
                guides(fill=guide_legend(title="Impact \nIndex"),guide_colorbar(barwidth = 10, barheight = .5))+
                scale_fill_gradient(low="white", high="orange")+
                facet_wrap(variable ~ slice)+
                labs(title="Fig.1 - US States Annual Weather-Dependent Aggregated Threat Impact Index 1986-2011 By Year Span\nBased on NOAA Reported Data")+
                xlab("Longitude")+ylab("Lattitude")
}

Fig.1 US States Annual Weather-Dependent Aggregated Threat Impact Index 1986-2011

Fig.1 points out the localization in time and space the most impactful events and establish which states have been repeatedly observing them. Each category is normed, as an Index should, between 0 and 1. The higher the Index, the higher the impact observed in the respective time span. Noticeable is Texas for recurrence of impact events and concomitance of impact on almost all categories. Less severe impact is observed in California and Illinoi. The Eastern/North-Eastern corridor sometimes referred as “Tornado Alley” is quite remarkable on this chart, and quite contrasted with the northern plains.

Let’s now look at the frequency of the same events. We prepare the frequency chart in totally similar way to generate threatf and threatfm.

ldhf.stat %>%
mutate(fatality.rate=cum.fatalities/ldhf.stat$span,injury.rate=cum.injuries/ldhf.stat$span,propdamage.rate=cum.propdamages/ldhf.stat$span,cropdamage.rate=cum.cropdamages/ldhf.stat$span) %>%
as.data.frame ->ldhf
ldhf$fatality.rate[is.na(ldhf$fatality.rate)]<-0
ldhf$injury.rate[is.na(ldhf$injury.rate)]<-0
ldhf$propdamage.rate[is.na(ldhf$propdamage.rate)]<-0
ldhf$cropdamage.rate[is.na(ldhf$cropdamage.rate)]<-0

ldhf %>%
select(fatality.rate,injury.rate,propdamage.rate,cropdamage.rate) %>%
summarize(mfr=max(fatality.rate),mir=max(injury.rate),mpr=max(propdamage.rate),mcr=max(cropdamage.rate)) -> top

ldhf %>%
mutate(Fatality=fatality.rate/top$mfr,Injury=injury.rate/top$mir,Property=propdamage.rate/top$mpr,Crop=cropdamage.rate/top$mcr) %>%
merge(.,state.fips,all=F) %>%
rename(state.abb=state,state=polyname) %>%
subset(slice >1) %>%
as.data.frame -> threatf
threatf$slice<-as.factor(sliced[threatf$slice])

# we prefer tidy data and use melt to reshape and set all missing values to 0

threatf %>%
droplevels %>%
select(state,Fatality,Injury,Property,Crop,slice) %>%
        
melt %>%

as.data.frame -> threatfm

## Using state, slice as id variables

threatfm$value[is.na(threatfm$value)]<-0

We have completed the reduction of the data and threati and threatim contain the impact data for mapping. The next step will be to display these observations. The data frames threatf and threatfm contain the frequency data for mapping.

Results 2: Frequency Data Display

We now produce the facet maps for frequency, in blue with a modified fill guide colorbar at bottom, using as template the chunk proposed in the ggplot vignette.

if (require(maps)) {
        states_map <- map_data("state")
        ggplot(threatf, aes(map_id = state))+
                geom_map(aes(fill = Fatality), map = states_map)+
                expand_limits(x = states_map$long, y = states_map$lat)
        last_plot()+coord_map()
        ggplot(threatfm, aes(map_id = state))+
                geom_map(aes(fill = value), map = states_map)+
                expand_limits(x = states_map$long, y = states_map$lat)+
                theme(legend.position = "bottom")+
                guides(fill=guide_legend(title="Frequency\nIndex"),guide_colorbar(barwidth = 10, barheight = .5))+
                scale_fill_gradient(low="white", high="blue")+
                facet_wrap(variable ~ slice)+
                labs(title="Fig.2 - US States Annual Weather-Dependent Aggregated Threat Frequency Index 1986-2011\nBased on NOAA Reported Data")+
                xlab("Longitude")+ylab("Lattitude")
}

Fig.2 US States Annual Weather-Dependent Aggregated Threat Frequency Index 1986-2011

Fig.2 points out the localization in time and space of the most frequent events and establishes which states have been repeatedly observing them. Each category is normed, as an Index should, between 0 and 1. The higher the Index, the higher the frequency observed in the respective time span. Noticeable are Texas and generally the South East of the US for recurrence of events and concomitance of frequency on almost all categories. Less frequent events are observed in California and in New Mexico. The Eastern/North-Eastern half of the US is quite remarkable on this chart, and quite differentiated with the Pacific Northwest, especially when comparing crop and property damages. Overall, we do not observe drastic differences between impact and frequency; also obviously a multiplier might be related to the local occupancy or activity. This is again left for further analysis.

Results 3: Threat Index 2006-2011 By US States

Having assessed Impact and Frequency, the nature of the threat will be now the focus of the latter part of the analysis. To this goal, we have chosen to perform the analysis on the data obtained from the last 5 years, assessing the categories leading to top5 threats in all 4 categories: Fatalities, Injuries, Property and Crop, and having a major impact (i.e. amongst top5 Impact).

We start with the top5 data frames we cumulated for each categories, add a type and calculate the normalized index similarly in each category, then bind all into the 4 threats data frames into the threats data frame.

ldhi.top5.fatalities$type<-"Fatality"   # add a type
ldhi.top5.injuries$type<-"Injury"
ldhi.top5.propdamages$type<-"Property"
ldhi.top5.cropdamages$type<-"Crop"

ldhi.top5.fatalities$n<-ldhi.top5.fatalities$n/(2011-span)      # calculate annual rate over last 5 years
ldhi.top5.injuries$n<-ldhi.top5.injuries$n/(2011-span)
ldhi.top5.propdamages$n<-ldhi.top5.propdamages$n/(2011-span)
ldhi.top5.cropdamages$n<-ldhi.top5.cropdamages$n/(2011-span) 
        
ldhi.top5.fatalities$n<-ldhi.top5.fatalities$n/max(ldhi.top5.fatalities$n)      # normalize annual rate is index
ldhi.top5.injuries$n<-ldhi.top5.injuries$n/max(ldhi.top5.injuries$n)
ldhi.top5.propdamages$n<-ldhi.top5.propdamages$n/max(ldhi.top5.propdamages$n)
ldhi.top5.cropdamages$n<-ldhi.top5.cropdamages$n/max(ldhi.top5.cropdamages$n)         
 
ldhi.top5.fatalities %>%        # now bind all normalized indexes 
rbind(ldhi.top5.injuries) %>%
rbind(ldhi.top5.propdamages) %>%
rbind(ldhi.top5.cropdamages) %>%
as.data.frame -> threats

threats$evtype %<>% as.factor
threats$type %<>% as.factor

Let’s determine what evtype are listed and observe…

unique(threats$evtype)

##  [1] TORNADO          RIP CURRENT      LIGHTNING        WIND            
##  [5] FLOOD            HEAT             BLIZZARD         AVALANCHE       
##  [9] DENSE SMOKE      FROST/FREEZE     HAIL             ICE             
## [13] LANDSLIDE        RAIN             SNOW             SURF            
## [17] TIDE             WILD FIRE        WINTER           DROUGHT         
## [21] FOG              TROPICAL         DUST             TSUNAMI         
## [25] HURRICANE        VOLCANIC ASHFALL WATERSPOUT       SEICHE          
## [29] SLEET           
## 29 Levels: AVALANCHE BLIZZARD DENSE SMOKE DROUGHT DUST FLOOD ... WINTER

We note it is time to reclassify by threats using Wikipedia’s classification as guide…so a color scheme might regroup visually similar threats…

threats$evtype %<>% 
        gsub("WIND","1.0 WIND",.) %>%
        gsub("TORNADO","1.1 TORNADO",.) %>%
        gsub("HURRICANE","1.2 HURRICANE",.) %>%
        gsub("TROPICAL","1.4 TROPICAL STORM",.) %>%
        gsub("WATERSPOUT","1.5 WATERSPOUT",.) %>%
        gsub("DUST","1.6 DUST",.) %>%
        gsub("LIGHTNING","1.7 LIGHTNING",.) %>%
        gsub("WILD FIRE","1.8 WILD FIRE",.) %>%
        gsub("DENSE SMOKE","1.9 DENSE SMOKE",.) %>%
        gsub("HAIL","2.0 HAIL",.) %>%
        gsub("TSUNAMI","3.0 TSUNAMI",.) %>%
        gsub("FLOOD","3.1 FLOOD",.) %>%
        gsub("RAIN","3.2 RAIN",.) %>%
        gsub("TIDE","3.3 TIDE",.) %>%
        gsub("SURF","3.4 SURF",.) %>%
        gsub("SEICHE","3.5 SEICHE",.) %>%
        gsub("RIP CURRENT","3.6 RIP CURRENTS",.) %>%
        gsub("LANDSLIDE","3.7 LANDSLIDE",.) %>%
        gsub("WINTER","4.0 WINTER",.) %>%
        gsub("ICE","4.1 ICE",.) %>%
        gsub("BLIZZARD","4.2 BLIZZARD",.) %>%
        gsub("AVALANCHE","4.3 AVALANCHE",.) %>%
        gsub("SNOW","4.4 SNOW",.) %>%
        gsub("COLD","4.5 EXTREME COLD",.) %>%
        gsub("SLEET","4.6 SLEET",.) %>%
        gsub("FROST/FREEZE","4.7 FROST/FREEZE",.) %>%
        gsub("FOG","4.8 FOG",.) %>%
        gsub("DROUGHT","5.0 DROUGHT",.) %>%
        gsub("HEAT","5.1 HEAT",.) %>%
        gsub("VOLCANIC ASHFALL","5.2 VOLCANIC ASHFALL",.)

We observe the results of our rearrangement with a significant reduction in event categories, but with improved regrouping.

unique(threats$evtype)

##  [1] "1.1 TORNADO"          "3.6 RIP CURRENTS"     "1.7 LIGHTNING"       
##  [4] "1.0 WIND"             "3.1 FLOOD"            "5.1 HEAT"            
##  [7] "4.2 BLIZZARD"         "4.3 AVALANCHE"        "1.9 DENSE SMOKE"     
## [10] "4.7 FROST/FREEZE"     "2.0 HAIL"             "4.1 ICE"             
## [13] "3.7 LANDSLIDE"        "3.2 RAIN"             "4.4 SNOW"            
## [16] "3.4 SURF"             "3.3 TIDE"             "1.8 WILD FIRE"       
## [19] "4.0 WINTER"           "5.0 DROUGHT"          "4.8 FOG"             
## [22] "1.4 TROPICAL STORM"   "1.6 DUST"             "3.0 TSUNAMI"         
## [25] "1.2 HURRICANE"        "5.2 VOLCANIC ASHFALL" "1.5 WATERSPOUT"      
## [28] "3.5 SEICHE"           "4.6 SLEET"

These are the threats we will now categorize by state and type, and perform some re-ordering, so we maintain consistency in our display charts.

threats %<>% select(-state.FIBS) %>%
mutate (AVG.Index=n/4) %>% # to balance all unweighted
group_by(state,type,evtype) %>%
arrange(desc(AVG.Index)) %>%
as.data.frame

threats$type<-factor(threats$type,levels(threats$type)[c(2,3,4,1)]) # reorder factor for display

We now have the combined top threats categorized by state and types, ready for plot, and will display a bar chart to compare them side-by-side.

ggplot(threats, aes(x=state,y=AVG.Index,fill=evtype))+geom_bar(stat="identity")+
        guides(fill=guide_legend(title="Categorized Event Type",keywidth=2,keyheight=0.75))+
        theme(text = element_text(size=8) , axis.text.x = element_text(angle=90, vjust=1))+
        facet_wrap(~type , nrow=4, ncol=1)+
        labs(title="Fig.3 - US States Annual Weather-Dependent Averaged Threat Index 2006-2011 By US States\nBased on NOAA Reported Data")+
        xlab("Abbreviated United States")+ylab("Averaged Annual Threat Index")

Fig.3 - US States Annual Weather-Dependent Averaged Threat Index 2006-2011

Fig.3 displays on an indexed categorized event type the comparative averaged annual threats for each US state, aligned on the x-axis. It is important to note on this chart events associated with Iowa, Texas, Louisiana and Missouri, confirming some earlier discussed patterns.

Conclusions

This analysis is preliminary and the strategy employed is simply based on separating Impact, Frequency associated with an event occurring in a US State… Describing an event occurrence can be addressed in terms of matching location, impact (or magnitude) and frequency patterns for categorized threats. Some trends are apparent but the analysis on yearly index will improve with more data, as is predicted by the central limit theorem. However, determining trending data on this scale remains an open challenge, and the data extracted from this analysis should only be used as a guide to help prepare for future occurrences, should future weather patterns mimic prior observations.