A descriptive analysis to identify which storm types are most harmful to
population health (injuries, fatalities)
the economy (crops, property)
Using the raw data available from NOAA (documented here), and after some laborious computations to make the raw data easier to analyze, we performed a simple analysis to identify the 10 storm types most harmful to population health and the 10 with the greatest economic consequences.
The 10 storm types producing the most economic damage are:
FLOOD
HURRICANE/TYPHOON
STORM SURGE
TORNADO
HAIL
FLASH FLOOD
DROUGHT
HURRICANE
RIVER FLOOD
ICE STORM
To rank harm to population health we built an index that, in a certain way, gives weight 10 to fatalities and 4 to injuries. The ranking of "harm to population health" is based on this index, and the index is what is shown in the plot.
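For concreteness, a minimal worked example of the score (the figures are hypothetical, not taken from the data):
2 * 10 + 5 * 4   # an event with 2 fatalities and 5 injuries scores 40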
The 10 storm types that caused the most harm to population health are
1 TORNADO
2 EXCESSIVE HEAT
3 FLOOD
4 LIGHTNING
5 TSTM WIND
6 HEAT
7 FLASH FLOOD
8 ICE STORM
9 WINTER STORM
10 THUNDERSTORM WIND
For both economic damage and harm to population health we show the 10 most harmful storm types.
The years for which fewer data were available (<10k observations) were not included in the analysis.
Due to resource constraints (the low-power hardware available, and the fact that data preprocessing consumed a lot of time), the analysis was rather direct: there was no room for more refined analysis or for preprocessing that might have enabled additional analyses.
Example of analysis not performed:
Example of preprocessing not performed:
The data presented some peculiarities that required laborious preprocessing.
All the preprocessing steps, in the form of R code, are included in this document, so it is possible to go from the (URL to download the) raw data to the analytic data in a reproducible way.
Some of the peculiarities of the data were:
a high number of observations, relative to the student's background and to the capabilities of the hardware available (a laptop); reading the data was extremely slow, and in several places a "lazy" approach was taken
only a small number of variables are actually necessary to answer the "questions" that are the objective of this study
the two numeric quantities needed to answer the questions were each spread across two columns: one with a figure, the other with a multiplier expressed as an exponent of 10
the exponent data were "dirty", i.e. in a mixed format (a digit, or 'h', 'k', 'M', 'B'), and often missing
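As an illustration of the last point (hypothetical figures; the 'k' code is mapped to 10^3 by the multiplier function defined later in this document):
25 * 10^3   # PROPDMG = 25 with PROPDMGEXP = "k" encodes 25,000 dollars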
By "technical" we mean that NO data-model transformation is performed, only modifications to the storage format.
The data are relatively large and take several minutes to load. An attempt to reduce this loading time has been made, based on:
lazy loading: download from the internet only if the local .csv.bz2 file is not present
avoid reading the compressed file directly: decompress the .bz2 and read the .csv, under the assumption that reading uncompressed files is faster (the assumption was not rigorously verified, but first attempts to read the compressed file directly took much longer or hung, and were all manually terminated after running for an extremely long time)
lazy decompression: decompress the .bz2 only if the .csv is not present
save to .rds, under the assumption, supported by previous experience of the author and some internet posts, that .rds files are faster to read (Stack Overflow post 1, Stack Overflow post 2)
destfilestem <- "data/StormData"
csvfile <- paste(destfilestem, ".csv", sep = "")
bz2file <- paste(csvfile, ".bz2", sep = "")
RDSFile <- paste(destfilestem, ".rds", sep = "")
if (!file.exists("data")) {
    dir.create("data")
}
# lazy download: fetch the remote file only if it is not already on disk
if (!file.exists(bz2file)) {
    url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    es <- tryCatch(download.file(url, bz2file, method = "auto", quiet = FALSE),
                   error = function(e) 1)
}
# lazy decompression: not strictly necessary, hoping that it makes reading faster
if (!file.exists(csvfile)) {
    library(R.utils)
    bunzip2(bz2file, csvfile, remove = FALSE, skip = TRUE)
}
# read the .csv once, then cache it as a (faster to load) .rds file
if (!file.exists(RDSFile)) {
    tempo <- system.time({
        data_org <- read.csv(csvfile)
    })
    saveRDS(data_org, RDSFile, compress = FALSE)
}
# lazy load: read the cached .rds only if data_org is not already in memory
if (!exists("data_org")) {
    data_org <- readRDS(RDSFile)
}
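As a rough check of the .rds-is-faster assumption, one could compare the two load paths (a sketch, not part of the original pipeline; both files are re-read purely for timing):
system.time(read.csv(csvfile))["elapsed"]   # parse the full .csv (slow)
system.time(readRDS(RDSFile))["elapsed"]    # deserialize the cached .rds (fast)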
Check which columns contain NAs
countNAs <- sapply(1:ncol(data_org), function(x) sum(is.na(data_org[x])))
colsWithNAs <- which(countNAs > 0)
names(data_org)[colsWithNAs]
## [1] "COUNTYENDN" "F" "LATITUDE" "LATITUDE_E"
We have many variables:
str(data_org)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Keep only the few variables needed to answer the "questions"
data <- data_org[ , c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
Upper-case variable names are error-prone and not aligned with my personal standards, so I convert them to lower case
names(data) <- tolower(names(data))
Create the score used to rank harm to population health, and a year column
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- dplyr::mutate(data
    # index for harm to population health
    ,people_dmg_score = fatalities * 10 + injuries * 4
    # event year, extracted from the begin date
    ,year = as.numeric(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y")))
We know that the data tend to be of low quality and that there are many observations relative to our hardware's processing power, so we ignore all years with fewer than 10,000 observations
obs_threshold <- 10000
years_grp <- group_by(data, year)
year_obs <- summarize(years_grp, obs_year = n())
years_ok <- year_obs[year_obs$obs_year > obs_threshold, ]$year
data <- data[data$year %in% years_ok, ]
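To inspect which years survive the filter (an optional check, not in the original analysis):
sort(years_ok)      # the retained years
range(data$year)    # first and last retained year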
Transform the exponent columns from factors to characters to simplify processing
data$propdmgexp <- as.character(data$propdmgexp)
data$cropdmgexp <- as.character(data$cropdmgexp)
The value for property damage and for crop damage is, in each case, split across two columns:
propdmg, propdmgexp
cropdmg, cropdmgexp
I build two new columns, property and crops, containing the actual values
# translate an exponent code into the corresponding power-of-10 multiplier
multiplier <- function(x){
    # empty or missing exponent: multiplier of 1
    if (is.na(x) || nchar(x) == 0)
        return(1)
    # already numeric: use it directly as the exponent
    if (is.numeric(x)) {
        return(10^x)
    }
    # map the "dirty" codes: digits, h(undred), k(ilo), m(illion), b(illion);
    # anything unrecognized defaults to exponent 0, i.e. multiplier 1
    myexp <- switch(tolower(x),
        " " = 0, "-" = -1, "+" = 1
        ,"0" = 0, "1" = 1, "2" = 2, "3" = 3, "4" = 4, "5" = 5
        ,"6" = 6, "7" = 7, "8" = 8, "9" = 9,
        "h" = 2, "k" = 3, "m" = 6, "b" = 9,
        0
    )
    10^myexp
}
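A few spot checks of the mapping (illustrative calls, not present in the original document):
multiplier("K")   # "k" -> exponent 3 -> 1000
multiplier("m")   # exponent 6 -> 1e+06
multiplier("5")   # exponent 5 -> 1e+05
multiplier("")    # empty string -> multiplier 1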
# dollar value = figure times its power-of-10 multiplier, computed column-wise
data$property <- data$propdmg * sapply(data$propdmgexp, multiplier)
data$crops <- data$cropdmg * sapply(data$cropdmgexp, multiplier)
data$econ_dmg <- data$property + data$crops
# tactical :-) cache the analytic data so later runs can skip the preprocessing
saveRDS(data, "data_ok.rds")
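In a later session the whole preprocessing can then be skipped by reloading the cached analytic data (a usage sketch based on the file name above):
data <- readRDS("data_ok.rds")   # restore the preprocessed data set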
Plot the 10 event types most harmful to population health,
and save the plot to disk to use it at the beginning of the document
evtype_grp <- group_by(data,evtype)
people_dmg_sums <- summarize(evtype_grp,people_sum = sum(people_dmg_score))
people_order <- people_dmg_sums[order(-people_dmg_sums$people_sum), ]
# people_order[(people_order$people_sum > 0),]
library(ggplot2)
toPlot <- people_order[1:10, ]
p <- ggplot(toPlot, aes(evtype, people_sum))
p <- p + geom_bar(stat = 'identity')
p <- p + theme(axis.text.x = element_text(angle = 90))
p <- p + ggtitle("The 10 Storm Types Most Harmful To Population Health")
p <- p + xlab("Storm Type")
p <- p + ylab("People's Health Harmfulness Score")
peopledmg_plot <- "peopledmg_plot.png"   # destination file (name assumed)
ggsave(peopledmg_plot, p)
## Saving 7 x 5 in image
Plot the 10 storm types with the greatest economic consequences,
and save the plot to disk to use it at the beginning of the document
library(dplyr)
library(ggplot2)
# economic damage
evtype_grp <- group_by(data,evtype)
eco_dmg_sums <- summarize(evtype_grp,eco_sum = sum(econ_dmg))
eco_order <- eco_dmg_sums[order(-eco_dmg_sums$eco_sum), ]
eco_order[(eco_order$eco_sum > 0),]
## # A tibble: 431 × 2
## evtype eco_sum
## <fctr> <dbl>
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 STORM SURGE 43323541000
## 4 TORNADO 32643740367
## 5 HAIL 18761221986
## 6 FLASH FLOOD 18243991079
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041360
## # ... with 421 more rows
eco_order <- eco_order[1:10, ]
e <- ggplot(eco_order, aes(evtype, eco_sum))
e <- e + geom_bar(stat = 'identity')
e <- e + theme(axis.text.x = element_text(angle = 90))
e <- e + ggtitle("The 10 Storm Types That Cause The Most Economic Damage")
e <- e + xlab("Storm Type")
e <- e + ylab("Economic Damage")
ecodmg_plot <- "ecodmg_plot.png"   # destination file (name assumed)
ggsave(ecodmg_plot, e)
## Saving 7 x 5 in image