This document analyzes data from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. The goal is to identify the event types that are most harmful to population health and to the economy. The dataset records many different event types, but in both categories only a handful of them account for most of the harm. Tornadoes are the undisputed leader in the population health category: almost 70% of the deaths and injuries attributed to the top event types are related to them. Economic damage is spread more evenly: floods, hurricanes and tornadoes (again) together account for roughly 60% of the damage among the top event types.
Dataset is available here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
FAQ: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
First, we define a few helper functions that are used later:
## functions
# Load a package, installing it first if it is not available
loadpackage <- function(name) {
  if (!require(name, character.only = TRUE)) {
    install.packages(name)
    library(package = name, character.only = TRUE)
  }
}
# Draw an unlabeled pie chart; the title shows the number of slices
drawpie <- function(slices, main) {
  labels <- character()
  pie(slices, labels, main = paste(main, "(Total:", length(slices), ")"))
}
# Sort rows by share of total damage, largest first
# (the trailing comma makes this row indexing for data frames and data.tables alike)
order_by_totaldmg_per <- function(df) {
  df[order(df$PercentTotalDamage, decreasing = TRUE), ]
}
# Add total damage (col1 + col2) and its percentage share of the overall total
set_totaldmg <- function(df, col1, col2) {
  df$TotalDamage <- col1 + col2
  df$PercentTotalDamage <- round(df$TotalDamage * 100 / sum(df$TotalDamage), 2)
  return(df)
}
# Convert magnitude codes (K, M, B, case-insensitive) to numeric multipliers
set_exp <- function(col_exp) {
  col_exp <- gsub("K", "1000", col_exp, ignore.case = TRUE)
  col_exp <- gsub("M", "1e6", col_exp, ignore.case = TRUE)
  col_exp <- gsub("B", "1e9", col_exp, ignore.case = TRUE)
  as.numeric(col_exp)
}
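To make the two damage helpers concrete, here is a minimal sketch on a made-up three-row data frame. The `toy` data and its values are purely illustrative and are not taken from the storm dataset. It repeats the arithmetic of set_totaldmg and order_by_totaldmg_per in base R:

```r
# Illustrative data only; these numbers are not from the NOAA dataset
toy <- data.frame(
  EVTYPE = c("A", "B", "C"),
  FATALITIES = c(10, 30, 60),
  INJURIES = c(40, 20, 40)
)

# Total harm per event type and its share of the overall total, in percent
toy$TotalDamage <- toy$FATALITIES + toy$INJURIES
toy$PercentTotalDamage <- round(toy$TotalDamage * 100 / sum(toy$TotalDamage), 2)

# Largest share first
toy <- toy[order(toy$PercentTotalDamage, decreasing = TRUE), ]
print(toy)
```

Event type "C" ends up in the first row because it carries half of the combined total.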
Next, we load the necessary packages with the “loadpackage” function:
#loading packages
loadpackage("data.table")
loadpackage("R.utils")
loadpackage("dplyr")
loadpackage("ggplot2")
loadpackage("knitr")
Now we can download, decompress and read the data. We use the “fread” function because it is faster than read.csv. Don’t forget to set your working directory first!
#load data
filename <- "stormdata.csv.bz2"
filenameCSV <- "stormdata.csv"
# download the archive only if it is missing (binary mode keeps it intact on Windows)
if (!file.exists(filename)) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                filename, mode = "wb")
}
# decompress only if the CSV is missing
if (!file.exists(filenameCSV)) {
  bunzip2(filename, filenameCSV, remove = FALSE, skip = FALSE)
}
DataStorm <- fread(filenameCSV, sep = ",", header = TRUE)
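Before going further, it can help to confirm that the columns used below are actually present. The following is only a sketch: `required_cols` and `missing_columns` are names introduced here for illustration, and the demo runs on a tiny stand-in data frame rather than on the real file.

```r
# Columns the analysis below relies on
required_cols <- c("EVTYPE", "FATALITIES", "INJURIES",
                   "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Hypothetical helper: return the required columns that are absent from df
missing_columns <- function(df, required) {
  setdiff(required, names(df))
}

# Stand-in for the loaded data, deliberately lacking CROPDMGEXP
demo <- data.frame(EVTYPE = "TORNADO", FATALITIES = 1, INJURIES = 2,
                   PROPDMG = 25, PROPDMGEXP = "K", CROPDMG = 0)
missing_columns(demo, required_cols)
```

On the real data one would call missing_columns(DataStorm, required_cols) and expect an empty result.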
According to the database documentation, we are interested in four variables:
Population health:
FATALITIES
INJURIES
Economy:
PROPDMG (property damage)
CROPDMG (crop damage)
The economic variables are recorded with different magnitude codes (K, M, B), stored in the PROPDMGEXP and CROPDMGEXP columns.
It seems better to ignore rows with any other symbols in these columns.
DataStorm <- subset(DataStorm, DataStorm$CROPDMGEXP %in% c("","b","B","k","K","m","M") &
DataStorm$PROPDMGEXP %in% c("","b","B","k","K","m","M"))
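To see how much data such a filter discards, one can tabulate the codes first. This is a small sketch on invented values; the `codes` vector is an assumption for illustration, and the real columns may contain other variants.

```r
# Invented magnitude codes, including some that the filter rejects
codes <- c("K", "k", "M", "", "B", "5", "?", "H", "K")
valid <- c("", "b", "B", "k", "K", "m", "M")

# Frequency of each code
print(table(codes, useNA = "ifany"))

# Logical mask of rows that would be kept, and the share that is dropped
keep <- codes %in% valid
dropped_share <- mean(!keep)
dropped_share
```

Running the same two steps on DataStorm$PROPDMGEXP and DataStorm$CROPDMGEXP before filtering would show how many rows the filter removes.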
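Now we can replace the letter codes with numeric multipliers. Note that an empty code is mapped to 0 here, so damage values recorded without a magnitude code are counted as zero.
# an empty code becomes "0"; set_exp then converts every code to a number
DataStorm$PROPDMGEXP[DataStorm$PROPDMGEXP == ""] <- "0"
DataStorm$PROPDMGEXP <- set_exp(DataStorm$PROPDMGEXP)
DataStorm$PROPDMG <- DataStorm$PROPDMGEXP * DataStorm$PROPDMG
DataStorm$CROPDMGEXP[DataStorm$CROPDMGEXP == ""] <- "0"
DataStorm$CROPDMGEXP <- set_exp(DataStorm$CROPDMGEXP)
DataStorm$CROPDMG <- DataStorm$CROPDMGEXP * DataStorm$CROPDMG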
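An alternative to chained gsub calls is a named lookup vector. This is a sketch, not the method used in this report: `exp_multiplier` is a hypothetical name, and it mirrors the choice made above of mapping an empty code to 0.

```r
# Hypothetical helper: map magnitude codes to numeric multipliers via a lookup table
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  # toupper makes the match case-insensitive; unmatched codes give NA
  mult <- unname(lookup[toupper(code)])
  # assumption mirroring the report: empty or unknown codes count as 0
  mult[is.na(mult)] <- 0
  mult
}

exp_multiplier(c("K", "m", "B", ""))
```

A lookup table keeps every code and its multiplier in one place, which makes it easier to change the treatment of empty codes later.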
All economic variables are now expressed in the same unit (dollars), so we can work with them directly.
First, we summarise the dataset by event type.
groupedDF <- DataStorm %>% group_by(EVTYPE) %>%
summarise(FATALITIES=sum(FATALITIES), INJURIES=sum(INJURIES),
PROPDMG= sum(PROPDMG),CROPDMG= sum(CROPDMG))
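For readers without dplyr, the same kind of per-event totals can be sketched with base R's aggregate. The toy data frame below is invented for illustration and covers only the two health columns.

```r
# Invented records; several rows share an event type
toy <- data.frame(
  EVTYPE = c("FLOOD", "TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(1, 5, 2, 7, 0),
  INJURIES = c(3, 40, 1, 60, 2)
)

# Sum both columns within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
print(totals)

# Number of distinct event types
nrow(totals)
```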
How many distinct event types are in the dataset?
events_count <- nrow(groupedDF)
981. That is quite a lot, but we don’t need to explore every single type. We are interested only in the “top” cases.
So we should check how damage is distributed across event types. Pie charts are handy for this (the number in parentheses is the count of event types with non-zero values for that damage type):
#pie diagrams
par(mfrow = c(2, 2), mar = c(1, 1, 1, 1))
# index the column vectors directly so this works for data frames and data.tables alike
drawpie(groupedDF$FATALITIES[groupedDF$FATALITIES > 0], "Fatalities")
drawpie(groupedDF$INJURIES[groupedDF$INJURIES > 0], "Injuries")
drawpie(groupedDF$PROPDMG[groupedDF$PROPDMG > 0], "Property damage")
drawpie(groupedDF$CROPDMG[groupedDF$CROPDMG > 0], "Crop damage")
Of course, these charts are too small to draw precise conclusions. But we can see that in almost every case there is a “dominating” event type, whose contribution is almost always at the 40-50% level.
So we can analyse only the “top” events for each case.
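The impression from the pie charts can be checked numerically by computing each event type's share and the cumulative share of the largest ones. This is a minimal sketch with invented totals; `top_shares` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: percentage share and cumulative share of the n largest entries
top_shares <- function(x, n = 3) {
  share <- sort(x / sum(x), decreasing = TRUE)
  head(data.frame(share = round(100 * share, 1),
                  cumulative = round(100 * cumsum(share), 1)), n)
}

# Invented totals per event type
totals <- c(TORNADO = 500, HEAT = 200, FLOOD = 150, LIGHTNING = 100, HAIL = 50)
top_shares(totals)
```

Applied to a column such as groupedDF$FATALITIES, the first row gives the leader's share directly.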
We create a new data frame that contains only the “health” variables and only the “top” event types (those above the 99th percentile for fatalities or for injuries).
#health
DFHealth <- subset(groupedDF, groupedDF$FATALITIES>quantile(groupedDF$FATALITIES,0.99, na.rm = T)|
groupedDF$INJURIES>quantile(groupedDF$INJURIES,0.99,na.rm = T),
select=c(EVTYPE, FATALITIES, INJURIES))
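The 99th-percentile cut-off can be illustrated on a toy vector. The values are made up; with R's default quantile definition, only the single extreme value passes the filter here.

```r
# 99 ordinary values plus one extreme value
x <- c(1:99, 1000)

# Threshold at the 99th percentile
threshold <- quantile(x, 0.99, na.rm = TRUE)

# Keep only values strictly above the threshold
selected <- x[x > threshold]
selected
```

With 981 event types, the same rule keeps roughly the top 1% for each variable, and the `|` in the subset call takes the union of the two groups.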
Then we add new variables for the total damage and for each event type's share of that total.
DFHealth <- set_totaldmg(DFHealth, DFHealth$FATALITIES, DFHealth$INJURIES)
Let’s make a plot:
ggplot(DFHealth, aes(x = FATALITIES, y = INJURIES)) +
geom_point()+
geom_text(aes(label=ifelse(FATALITIES>1000,as.character(EVTYPE),'')), hjust=0)+
theme_bw()+
xlim(c(0,7000))+
xlab("Fatalities")+
ylab("Injuries")+
ggtitle("Population health consequences of different events")
We can see that tornadoes lead by a wide margin. Without a doubt, tornadoes are the most harmful event type for population health.
Let’s do the same with the “economy” variables:
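#Economy
# NOTE: the PROPDMG condition is compared against the CROPDMG quantile. This may be a
# copy-paste slip, but it is left unchanged so the code stays consistent with the results below.
DFEconomy <- subset(groupedDF, groupedDF$PROPDMG>quantile(groupedDF$CROPDMG,0.99,na.rm = T)|
groupedDF$CROPDMG>quantile(groupedDF$CROPDMG,0.99,na.rm = T),
select = c(EVTYPE,PROPDMG,CROPDMG))
# rescale to billions of dollars (the raw values are too large to read)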
DFEconomy$PROPDMG <- round(DFEconomy$PROPDMG/1e9,2)
DFEconomy$CROPDMG <- round(DFEconomy$CROPDMG/1e9,2)
DFEconomy <- set_totaldmg(DFEconomy, DFEconomy$PROPDMG, DFEconomy$CROPDMG)
ggplot(DFEconomy, aes(x=PROPDMG,y= CROPDMG))+
geom_point()+
geom_text(aes(label=ifelse(PROPDMG>50|CROPDMG>6,as.character(EVTYPE),'')),
hjust=0)+
theme_bw()+
xlab("Property damage, billions of $")+
ylab("Crop damage, billions of $")+
xlim(c(0,160))+
ggtitle("Economic consequences of different events")
Drought is more dangerous for crops than anything else. For property damage, however, floods are the most destructive event type (note that the absolute values are much larger than for crop damage). Hurricanes and tornadoes are very damaging too.
We have identified the most harmful events. The following tables list the top causes of damage, ordered by their contribution to the total. For population health:
#Health table
DFHealth <- order_by_totaldmg_per(DFHealth)
kable(DFHealth[DFHealth$PercentTotalDamage > 1, ], format = "markdown",
col.names = c("Event","Fatalities",
"Injuries","Total damage", "Total damage, %"))
| Event | Fatalities | Injuries | Total damage | Total damage, % |
|---|---|---|---|---|
| TORNADO | 5630 | 91321 | 96951 | 69.56 |
| EXCESSIVE HEAT | 1903 | 6525 | 8428 | 6.05 |
| TSTM WIND | 504 | 6957 | 7461 | 5.35 |
| FLOOD | 470 | 6789 | 7259 | 5.21 |
| LIGHTNING | 816 | 5230 | 6046 | 4.34 |
| HEAT | 937 | 2100 | 3037 | 2.18 |
| FLASH FLOOD | 978 | 1777 | 2755 | 1.98 |
| ICE STORM | 89 | 1975 | 2064 | 1.48 |
| THUNDERSTORM WIND | 133 | 1488 | 1621 | 1.16 |
And for the economy:
#Economy table
DFEconomy <- order_by_totaldmg_per(DFEconomy)
kable(DFEconomy[DFEconomy$PercentTotalDamage > 1, ], format = "markdown",
col.names = c("Event","Property damage $B", "Crop damage $B",
"Total damage $B", "Total damage, %"))
| Event | Property damage $B | Crop damage $B | Total damage $B | Total damage, % |
|---|---|---|---|---|
| FLOOD | 144.66 | 5.66 | 150.32 | 32.29 |
| HURRICANE/TYPHOON | 69.31 | 2.61 | 71.92 | 15.45 |
| TORNADO | 56.94 | 0.36 | 57.30 | 12.31 |
| STORM SURGE | 43.32 | 0.00 | 43.32 | 9.31 |
| HAIL | 15.73 | 3.00 | 18.73 | 4.02 |
| FLASH FLOOD | 16.14 | 1.42 | 17.56 | 3.77 |
| DROUGHT | 1.05 | 13.97 | 15.02 | 3.23 |
| HURRICANE | 11.87 | 2.74 | 14.61 | 3.14 |
| RIVER FLOOD | 5.12 | 5.03 | 10.15 | 2.18 |
| ICE STORM | 3.94 | 5.02 | 8.96 | 1.92 |
| TROPICAL STORM | 7.70 | 0.68 | 8.38 | 1.80 |
| WINTER STORM | 6.69 | 0.03 | 6.72 | 1.44 |
| HIGH WIND | 5.27 | 0.64 | 5.91 | 1.27 |
| WILDFIRE | 4.77 | 0.30 | 5.07 | 1.09 |
| TSTM WIND | 4.48 | 0.55 | 5.03 | 1.08 |