U.S.A goverment have published data about weather events. This data set (called “Storm Data”) covers weather events observations from year 1950 to 2011. It also have information about casulties and material damages. Following document is my simple approach to determine most dangerous type of weather event and also most expensive (in terms of meterial damage) type. It is organised as explorary data analysis, no model will be built, there will be simple remarks and comments for results, and everything should be fully reproducible. Data was downloaded at 21 July 2014, and md5sum for data file is df4aa61fff89427db6b7f7b1113b5553.
Data was opbtained form CloudFront data storage. URLs are included in following R code. Data and documentation files are downloaded only if main data file (StormData.csv.bz2) is missing.
if (!file.exists("StormData.csv.bz2")) {
download.file(url = "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "StormData.csv.bz2")
download.file(url = "http://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf",
destfile = "StormDataDoc.pdf")
download.file(url = "http://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf",
destfile = "StormDataFaq.pdf")
}
File with data is just CSV file compressed with bz2 method. We can load it into data frame directly. Lets also observe structure of loaded data.
stormdata <- read.csv("StormData.csv.bz2")
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4371 10087 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 826 826 826 826 826 826 826 826 826 826 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels "","5NM E OF FAIRPORT MI TO ROCK I - 5NM E OF FAIRPORT MI TO ROCK I",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "",".","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Some variables are not assigned properly, lets fitx that.
stormdata$BGN_DATE <- as.Date(stormdata$BGN_DATE, format = "%m/%d/%Y")
stormdata$END_DATE <- as.Date(stormdata$END_DATE, format = "%m/%d/%Y")
Names in data frame are written using big leters, lets chante it to lowercase.
names(stormdata) <- tolower(names(stormdata))
stormdata$evtype <- tolower(as.character(stormdata$evtype))
Data contains fatalities or injuries:
fi.stormdata <- stormdata[stormdata$injuries != 0 | stormdata$fatalities != 0, ]
Number of injuries and fatalities:
injuries <- sum(fi.stormdata$injuries)
fatalities <- sum(fi.stormdata$fatalities)
So, based on this data, we have 1.4053 × 105 injuries and 1.5145 × 104 fatalities. Lets order dataframe by number of fatalities and injuries.
fi.stormdata <- fi.stormdata[with(fi.stormdata, order(-fatalities, -injuries)), ]
head(fi.stormdata$fatalities)
## [1] 583 158 116 114 99 90
There are many different *evtype*s. Lets take 16 (abitrary number) types which are causing most of fatalities.
killer.evtypes <- NULL
for (event in 1:nrow(fi.stormdata)) {
if (fi.stormdata$evtype[event] %in% killer.evtypes) {NULL}
else {
killer.evtypes <- c(killer.evtypes, fi.stormdata$evtype[event])
}
if (length(killer.evtypes) == 16) {break}
}
killer.evtypes
## [1] "heat" "tornado"
## [3] "excessive heat" "extreme heat"
## [5] "heat wave" "tsunami"
## [7] "unseasonably warm and dry" "tornadoes, tstm wind, hail"
## [9] "tropical storm" "flash flood"
## [11] "heavy rain" "record/excessive heat"
## [13] "hurricane/typhoon" "flood"
## [15] "wildfire" "landslide"
Some evtype occurs only once so maybe it is good idea to filter data with them.
single.evtype <- names(table(fi.stormdata$evtype)[table(fi.stormdata$evtype) == 1])
head(single.evtype)
## [1] "avalance" "black ice"
## [3] "brush fire" "coastalstorm"
## [5] "coastal flooding/erosion" "cold/winds"
Lets check if type of event that occured only once is not in type of most lethal type of events.
intersect(single.evtype, killer.evtypes)
## [1] "record/excessive heat" "tornadoes, tstm wind, hail"
## [3] "unseasonably warm and dry"
So there are some types of events that are unique but also important in terms of number of casulties. So filtering is not good option.
Ltes chceck how much of total number of fatalities and injuries is covered by number of most lethal type of events (calculated above as killer.evtypes):
injuries.top <- sum(fi.stormdata$injuries[fi.stormdata$evtype %in% killer.evtypes])
fatalities.top <- sum(fi.stormdata$fatalities[fi.stormdata$evtype %in% killer.evtypes])
We have 1.4053 × 105 injuries and 1.5145 × 104 fatalities in whole data set, and 1.1203 × 105 injuries and 1.0626 × 104 fatalities in filtered dataset which is covering 79.7201 percent of injuries and 70.1618 percent of fatalites.
top.fi.stormdata <- fi.stormdata[fi.stormdata$evtype %in% killer.evtypes, ]
total.rows <- nrow(stormdata)
top.fi.rows <- nrow(top.fi.stormdata)
By doing thise type of filtering our working data set (related to fatalities and injuries) has 1.176 percent of original size.
Property and crop damage is located in two types of columns. One column is having numerical vale (e.g propdmg) and second related column suppose to have order of magnitude (e.g propdmgexp) represented by letter from following list: h - hundred, k - tousand, m - milion, b - bilion, and “” (empty) for plain value. Since there are some different symbols in this type of columns lets examine effect of filtering.
valid.exp <- c("","B","b","K","k","M","m","h","H")
da.stormdata <- stormdata[stormdata$propdmgexp %in% valid.exp, ]
da.stormdata <- da.stormdata[da.stormdata$cropdmgexp %in% valid.exp, ]
da.stormdata$propdmgexp <- factor(da.stormdata$propdmgexp)
da.stormdata$cropdmgexp <- factor(da.stormdata$cropdmgexp)
Lets filter out data which contains information about 0 damages.
da.stormdata <- da.stormdata[da.stormdata$propdmg != 0 | da.stormdata$cropdmg != 0, ]
Since value and factor are in two different columns, lets create additional columns with proper values
converter <- function(letter) {
if (letter == "h" || letter == "H" ) {
multiplier <- 100
} else if (letter == "M" || letter == "m") {
multiplier <- 1000000
} else if (letter == "K" || letter == "k") {
multiplier <- 1000
} else if (letter == "B" || letter == "b") {
multiplier <- 1000000000
} else {
multiplier <- 1
}
multiplier
}
da.stormdata$propdmgtot <- sapply(da.stormdata$propdmgexp, converter)
da.stormdata$propdmgtot <- da.stormdata$propdmgtot * da.stormdata$propdmg
da.stormdata$cropdmgtot <- sapply(da.stormdata$cropdmgexp, converter)
da.stormdata$cropdmgtot <- da.stormdata$cropdmgtot * da.stormdata$cropdmg
I will assume the same methodology of analysis as in case with fatalities and injuries. I think that cropdmg is more important that propdmg. Lets calculate total costs of damages and sort dataframe descending by cropdmg and propdmg.
costproperties <- sum(da.stormdata$propdmgtot)
costcrops <- sum(da.stormdata$cropdmgtot)
da.stormdata <- da.stormdata[with(da.stormdata, order(-cropdmgtot, -propdmgtot)), ]
head(da.stormdata$propdmgtot)
## [1] 5.00e+09 5.00e+05 5.88e+09 0.00e+00 0.00e+00 0.00e+00
There are many different *evtype*s. Lets take 16 (abitrary number) types which are causing most of costs of crops damages.
crops.evtypes <- NULL
for (event in 1:nrow(da.stormdata)) {
if (da.stormdata$evtype[event] %in% crops.evtypes) {NULL}
else {
crops.evtypes <- c(crops.evtypes, da.stormdata$evtype[event])
}
if (length(crops.evtypes) == 16) {break}
}
crops.evtypes
## [1] "river flood" "ice storm" "hurricane/typhoon"
## [4] "drought" "extreme cold" "hurricane"
## [7] "flood" "excessive heat" "heat"
## [10] "frost/freeze" "damaging freeze" "flash flood"
## [13] "freeze" "tropical storm" "heavy rain"
## [16] "high wind"
Lets calculate costs for top types of evens
costcrops.top <- sum(da.stormdata$cropdmgtot[da.stormdata$evtype %in% crops.evtypes])
costproperties.top <- sum(da.stormdata$propdmgtot[da.stormdata$evtype %in% crops.evtypes])
We have 4.2732 × 1011 cost of properties damages and 4.9029 × 1010 cost of crops damages in whole data set, and 2.6585 × 1011 cost of properties damages and 4.2561 × 1010 cost of crops damages in filtered dataset which is covering 62.213 percent of costs of properties and 86.8083 percent of costs of crops.
top.da.stormdata <- da.stormdata[da.stormdata$evtype %in% crops.evtypes, ]
top.da.rows <- nrow(top.da.stormdata)
By doing thise type of filtering our working data set (related to damages) has 4.3227 percent of original size.
Lets make barplot to visualise fatalities and injuries according to event type. Data preparation:
people <- data.frame(type = character(length(killer.evtypes)), fatalities = numeric(length(killer.evtypes)), injuries = numeric(length(killer.evtypes)))
people$type <- killer.evtypes
counter <- 1
for (type in killer.evtypes) {
people[counter, "injuries"] <- sum(top.fi.stormdata$injuries[top.fi.stormdata$evtype == type])
people[counter, "fatalities"] <- sum(top.fi.stormdata$fatalities[top.fi.stormdata$evtype == type])
counter <- counter + 1
}
people$sum <- people$injuries + people$fatalities
maximum <- max(people$sum)
max.people <- people$type[people$sum == maximum]
head(people)
## type fatalities injuries sum
## 1 heat 937 2100 3037
## 2 tornado 5633 91346 96979
## 3 excessive heat 1903 6525 8428
## 4 extreme heat 96 155 251
## 5 heat wave 172 379 551
## 6 tsunami 33 129 162
Plot:
par(mai=c(1,3,1,1))
barplot(t(as.matrix(people[,c(2,3)])), names.arg = people$type, cex.names = 1, horiz = TRUE, las = 1, legend = TRUE, col = c("red","blue"), xlab = "Number of people affected")
Lets make barplot to visualise costs of damages according to event type. Data preparation:
damages <- data.frame(type = character(length(crops.evtypes)), crops = numeric(length(crops.evtypes)), properties = numeric(length(crops.evtypes)))
damages$type <- crops.evtypes
counter <- 1
for (type in crops.evtypes) {
damages[counter, "properties"] <- sum(top.da.stormdata$propdmgtot[top.da.stormdata$evtype == type])
damages[counter, "crops"] <- sum(top.da.stormdata$cropdmgtot[top.da.stormdata$evtype == type])
counter <- counter + 1
}
damages$sum <- damages$properties + damages$crops
maximum <- max(damages$sum)
max.damage <- damages$type[damages$sum == maximum]
head(damages)
## type crops properties sum
## 1 river flood 5.029e+09 5.119e+09 1.015e+10
## 2 ice storm 5.022e+09 3.945e+09 8.967e+09
## 3 hurricane/typhoon 2.608e+09 6.931e+10 7.191e+10
## 4 drought 1.397e+10 1.046e+09 1.502e+10
## 5 extreme cold 1.313e+09 6.774e+07 1.381e+09
## 6 hurricane 2.742e+09 1.187e+10 1.461e+10
Plot:
par(mai=c(1,3,1,1))
barplot(t(as.matrix(damages[,c(2,3)])), names.arg = damages$type, cex.names = 1, horiz = TRUE, las = 1, legend = TRUE, col = c("red","blue"), xlab = "Total value of damages")
We can observe based on this simple analysis, that most dangerous type of event for people is tornado and most costly type of events is flood. We can take intersection of this two sets and examine worst yeas based on those type of events.
worst.events <- intersect(crops.evtypes, killer.evtypes)
worst.years <- stormdata$bgn_date[stormdata$evtype %in% worst.events]
hist(worst.years, breaks = "years", main = "Density of worst events in different years", xlab = "Years")
Weather events that are causing most dangerous situations and are most devastating: hurricane/typhoon, flood, excessive heat, heat, flash flood, tropical storm, heavy rain.
Following versions of R and packages were used.
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=pl_PL.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=pl_PL.UTF-8 LC_COLLATE=pl_PL.UTF-8
## [5] LC_MONETARY=pl_PL.UTF-8 LC_MESSAGES=pl_PL.UTF-8
## [7] LC_PAPER=pl_PL.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.6
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.4 evaluate_0.5.5 formatR_0.10 stringr_0.6.2
## [5] tools_3.1.0