A descriptive analysis to identify which storm types are most harmful to
population health (injuries, fatalities)
the economy (crops, property)
Using the raw data available from NOAA (documented here), and after some laborious computations to make the raw data easier to analyze, we performed a simple analysis to identify the 10 storm types most harmful to population health and the 10 with the greatest economic consequences.
The 10 storm types producing the most economic damage are:
FLOOD
HURRICANE/TYPHOON
STORM SURGE
TORNADO
HAIL
FLASH FLOOD
DROUGHT
HURRICANE
RIVER FLOOD
ICE STORM
To rank harm to population health we built an index that, in a certain way, gives weight 10 to fatalities and 4 to injuries. The ranking of "harm to population health" is based on this index, and the index is what is shown in the plot.
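For concreteness, a minimal worked example of the score (the figures are hypothetical, not taken from the data):
2 * 10 + 5 * 4   # an event with 2 fatalities and 5 injuries scores 40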
The 10 storm types that caused the most harm to population health are
1 TORNADO
2 EXCESSIVE HEAT
3 FLOOD
4 LIGHTNING
5 TSTM WIND
6 HEAT
7 FLASH FLOOD
8 ICE STORM
9 WINTER STORM
10 THUNDERSTORM WIND
For both economic damage and harm to population health we show the 10 most harmful storm types.
The years for which fewer data were available (<10k observations) were not included in the analysis.
Due to resource constraints (the low-power hardware available, and the fact that data preprocessing consumed a lot of time), the analysis was rather direct: there was no room for more refined analysis or for preprocessing that might have enabled additional analyses.
Example of analysis not performed:
Example of preprocessing not performed:
The data presented some peculiarities that required laborious preprocessing.
All the preprocessing steps, in the form of R code, are included in this document, so it is possible to go from the (URL to download the) raw data to the analytic data in a reproducible way.
Some of the peculiarities of the data were:
a high number of observations, relative to the student's background and to the capabilities of the hardware available (a laptop); reading the data was extremely slow, and in several places a "lazy" approach was taken
only a small number of variables are actually necessary to answer the "questions" that are the objective of this study
the two numeric quantities needed to answer the questions were each spread across two columns: one with a figure, the other with a multiplier expressed as an exponent of 10
the exponent data were "dirty", i.e. in a mixed format (a digit, or 'h', 'k', 'M', 'B'), and often missing
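As an illustration of the last point (hypothetical figures; the 'k' code is mapped to 10^3 by the multiplier function defined later in this document):
25 * 10^3   # PROPDMG = 25 with PROPDMGEXP = "k" encodes 25,000 dollars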
By "technical" we mean that NO data-model transformation is performed, only modifications to the storage format.
The data are relatively large and take several minutes to load. An attempt to reduce this loading time has been made, based on:
lazy loading: download from the internet only if the local .csv.bz2 file is not present
avoid reading the compressed file directly: decompress the .bz2 and read the .csv, under the assumption that reading uncompressed files is faster (the assumption was not rigorously verified, but first attempts to read the compressed file directly took much longer or hung, and were all manually terminated after running for an extremely long time)
lazy decompression: decompress the .bz2 only if the .csv is not present
save to .rds, under the assumption, supported by previous experience of the author and some internet posts, that .rds files are faster to read (Stack Overflow post 1, Stack Overflow post 2)
destfilestem <- "data/StormData"
csvfile <- paste(destfilestem, ".csv", sep = "")
bz2file <- paste(csvfile, ".bz2", sep = "")
RDSFile <- paste(destfilestem, ".rds", sep = "")
if (!file.exists("data")) {
    dir.create("data")
}
# lazy download: fetch the remote file only if it is not already on disk
if (!file.exists(bz2file)) {
    url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    es <- tryCatch(download.file(url, bz2file, method = "auto", quiet = FALSE),
                   error = function(e) 1)
}
# lazy decompression: not strictly necessary, hoping that it makes reading faster
if (!file.exists(csvfile)) {
    library(R.utils)
    bunzip2(bz2file, csvfile, remove = FALSE, skip = TRUE)
}
# read the .csv once, then cache it as a (faster to load) .rds file
if (!file.exists(RDSFile)) {
    tempo <- system.time({
        data_org <- read.csv(csvfile)
    })
    saveRDS(data_org, RDSFile, compress = FALSE)
}
# lazy load: read the cached .rds only if data_org is not already in memory
if (!exists("data_org")) {
    data_org <- readRDS(RDSFile)
}
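As a rough check of the .rds-is-faster assumption, one could compare the two load paths (a sketch, not part of the original pipeline; both files are re-read purely for timing):
system.time(read.csv(csvfile))["elapsed"]   # parse the full .csv (slow)
system.time(readRDS(RDSFile))["elapsed"]    # deserialize the cached .rds (fast)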
Check which columns contain NAs
countNAs <- sapply(1:ncol(data_org), function(x) sum(is.na(data_org[x])))
colsWithNAs <- which(countNAs > 0)
names(data_org)[colsWithNAs]
## [1] "COUNTYENDN" "F" "LATITUDE" "LATITUDE_E"
We have many variables:
str(data_org)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Keep only the few variables needed to answer the "questions"
data <- data_org[ , c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
Upper-case variable names are error-prone and not aligned with my personal standards, so I convert them to lower case
names(data) <- tolower(names(data))
Create the score used to rank harm to population health, and a year column
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- dplyr::mutate(data
    # index for harm to population health
    ,people_dmg_score = fatalities * 10 + injuries * 4
    # event year, extracted from the begin date
    ,year = as.numeric(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y")))
We know that the data tend to be of low quality and that there are many observations relative to our hardware's processing power, so we ignore all years with fewer than 10,000 observations
obs_threshold <- 10000
years_grp <- group_by(data, year)
year_obs <- summarize(years_grp, obs_year = n())
years_ok <- year_obs[year_obs$obs_year > obs_threshold, ]$year
data <- data[data$year %in% years_ok, ]
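To inspect which years survive the filter (an optional check, not in the original analysis):
sort(years_ok)      # the retained years
range(data$year)    # first and last retained year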
Transform the exponent columns from factors to characters to simplify processing
data$propdmgexp <- as.character(data$propdmgexp)
data$cropdmgexp <- as.character(data$cropdmgexp)
The value for property damage and for crop damage is, in each case, split across two columns:
propdmg, propdmgexp
cropdmg, cropdmgexp
I build two new columns, property and crops, containing the actual values
# translate an exponent code into the corresponding power-of-10 multiplier
multiplier <- function(x){
    # empty or missing exponent: multiplier of 1
    if (is.na(x) || nchar(x) == 0)
        return(1)
    # already numeric: use it directly as the exponent
    if (is.numeric(x)) {
        return(10^x)
    }
    # map the "dirty" codes: digits, h(undred), k(ilo), m(illion), b(illion);
    # anything unrecognized defaults to exponent 0, i.e. multiplier 1
    myexp <- switch(tolower(x),
        " " = 0, "-" = -1, "+" = 1
        ,"0" = 0, "1" = 1, "2" = 2, "3" = 3, "4" = 4, "5" = 5
        ,"6" = 6, "7" = 7, "8" = 8, "9" = 9,
        "h" = 2, "k" = 3, "m" = 6, "b" = 9,
        0
    )
    10^myexp
}
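A few spot checks of the mapping (illustrative calls, not present in the original document):
multiplier("K")   # "k" -> exponent 3 -> 1000
multiplier("m")   # exponent 6 -> 1e+06
multiplier("5")   # exponent 5 -> 1e+05
multiplier("")    # empty string -> multiplier 1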
# dollar value = figure times its power-of-10 multiplier, computed column-wise
data$property <- data$propdmg * sapply(data$propdmgexp, multiplier)
data$crops <- data$cropdmg * sapply(data$cropdmgexp, multiplier)
data$econ_dmg <- data$property + data$crops
# tactical :-) cache the analytic data so later runs can skip the preprocessing
saveRDS(data, "data_ok.rds")
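In a later session the whole preprocessing can then be skipped by reloading the cached analytic data (a usage sketch based on the file name above):
data <- readRDS("data_ok.rds")   # restore the preprocessed data set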
Plot the 10 event types most harmful to population health,
and save the plot to disk to use it at the beginning of the document
evtype_grp <- group_by(data,evtype)
people_dmg_sums <- summarize(evtype_grp,people_sum = sum(people_dmg_score))
people_order <- people_dmg_sums[order(-people_dmg_sums$people_sum), ]
# people_order[(people_order$people_sum > 0),]
library(ggplot2)
toPlot <- people_order[1:10, ]
p <- ggplot(toPlot, aes(evtype, people_sum))
p <- p + geom_bar(stat = 'identity')
p <- p + theme(axis.text.x = element_text(angle = 90))
p <- p + ggtitle("The 10 Storm Types Most Harmful To Population Health")
p <- p + xlab("Storm Type")
p <- p + ylab("People's Health Harmfulness Score")
peopledmg_plot <- "peopledmg_plot.png"   # destination file (name assumed)
ggsave(peopledmg_plot, p)
## Saving 7 x 5 in image
Plot the 10 storm types with the greatest economic consequences,
and save the plot to disk to use it at the beginning of the document
library(dplyr)
library(ggplot2)
# economic damage
evtype_grp <- group_by(data,evtype)
eco_dmg_sums <- summarize(evtype_grp,eco_sum = sum(econ_dmg))
eco_order <- eco_dmg_sums[order(-eco_dmg_sums$eco_sum), ]
eco_order[(eco_order$eco_sum > 0),]
## # A tibble: 431 × 2
## evtype eco_sum
## <fctr> <dbl>
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 STORM SURGE 43323541000
## 4 TORNADO 32643740367
## 5 HAIL 18761221986
## 6 FLASH FLOOD 18243991079
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041360
## # ... with 421 more rows
eco_order <- eco_order[1:10, ]
e <- ggplot(eco_order, aes(evtype, eco_sum))
e <- e + geom_bar(stat = 'identity')
e <- e + theme(axis.text.x = element_text(angle = 90))
e <- e + ggtitle("The 10 Storm Types That Cause The Most Economic Damage")
e <- e + xlab("Storm Type")
e <- e + ylab("Economic Damage")
ecodmg_plot <- "ecodmg_plot.png"   # destination file (name assumed)
ggsave(ecodmg_plot, e)
## Saving 7 x 5 in image