This report analyzes storm data from the U.S. National Oceanic and Atmospheric Administration (NOAA) to identify which types of weather events are most harmful to population health and which cause the greatest economic damage. Tornadoes are the most harmful to population health, while floods cause the greatest economic damage.
library(R.utils)
library(ggplot2)
library(scales)
library(reshape2)
First, we download the data file, unzip it, and read it into R.
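url <- 'http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
file <- 'storm-data.csv.bz2'
# Download and decompress only once; keep the compressed archive on disk.
if (!file.exists(file)) {
  download.file(url, file, mode = 'wb')
  bunzip2(file, overwrite = TRUE, remove = FALSE)
}
# Cache the raw data so that re-running the report does not re-read the CSV.
if (!exists('originalData')) {
  originalData <- read.csv(sub('\\.bz2$', '', file))
}
data <- originalData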
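As an aside, base R detects bzip2 compression when it opens a file for reading, so the explicit unzip step is a convenience rather than a requirement. The sketch below writes a tiny compressed CSV to a temporary file and reads it back directly; the file and the `roundTrip` name are illustrative assumptions, not part of the original analysis.

```r
# Write a two-row CSV through a bzip2 connection (temporary, illustrative file).
tmp <- tempfile(fileext = '.csv.bz2')
con <- bzfile(tmp, 'w')
write.csv(data.frame(EVTYPE = c('TORNADO', 'FLOOD'), FATALITIES = c(3, 1)),
          con, row.names = FALSE)
close(con)

# read.csv opens the path with file(), which detects the compression itself.
roundTrip <- read.csv(tmp)
roundTrip
```

Decompressing once, as the report does, still has the advantage that later reads skip the decompression cost.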
Then we take a brief look at the data.
dim(data)
## [1] 902297 37
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
The data set has 902,297 rows and 37 columns.
Next, we keep only the fields needed for this analysis and remove rows that contain NAs. Dropping the unnecessary fields makes the data set lighter.
# BGN_DATE carries a dummy time part; as.Date() keeps only the calendar date.
data$date <- as.Date(data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
usefulFields <- c('date', 'EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG',
                  'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')
data <- data[usefulFields]
data <- data[complete.cases(data), ]
The data span 1950-01-03 to 2011-11-30.
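A minimal sketch of how such a date range can be obtained, using two BGN_DATE values taken from the head() output above. The `raw` and `dates` names are illustrative; in the report itself the same calls would be applied to data$date.

```r
# Two raw BGN_DATE strings copied from the first rows of the data set.
raw <- c('4/18/1950 0:00:00', '11/15/1951 0:00:00')

# Parse with the same format string as the report; the time part is discarded.
dates <- as.Date(raw, format = '%m/%d/%Y %H:%M:%S')

# range() returns the earliest and latest date.
format(range(dates))
```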
Each damage variable is split into two fields: one holds the numeric value, and the other holds a character that encodes a base-10 exponent (for example, K for thousands). Here is a function that returns the multiplier corresponding to that character:
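getMultiplier <- function(char) {
  # Only uppercase codes are recognized; anything else (blank, digits,
  # lowercase letters, symbols) falls through to an exponent of 0.
  exp <- if(is.null(char)) { 0 }
    else if(char == 'B') { 9 }  # billions
    else if(char == 'M') { 6 }  # millions
    else if(char == 'K') { 3 }  # thousands
    else if(char == 'H') { 2 }  # hundreds
    else { 0 }
  10^exp
}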
We apply the getMultiplier function to convert the property and crop damages to dollars, and store their sum in a new column.
data$PROPDMG <- data$PROPDMG * sapply(data$PROPDMGEXP, getMultiplier)
data$CROPDMG <- data$CROPDMG * sapply(data$CROPDMGEXP, getMultiplier)
data$TOTALDMG <- data$PROPDMG + data$CROPDMG
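Calling a function once per row with sapply can be slow on roughly 900,000 rows. A vectorized sketch of the same mapping, using a named lookup vector, is shown below. The names `exponentTable` and `getMultipliers` are hypothetical and not part of the original report; the mapping is meant to mirror getMultiplier, including treating every unrecognized code as a multiplier of 1.

```r
# Hypothetical lookup table: exponent code -> power of ten.
exponentTable <- c(B = 9, M = 6, K = 3, H = 2)

getMultipliers <- function(codes) {
  # as.character() guards against factor input.
  exps <- exponentTable[as.character(codes)]
  # Codes not in the table (blank, digits, lowercase, NA) index to NA;
  # treat them as exponent 0, as getMultiplier does.
  exps[is.na(exps)] <- 0
  unname(10^exps)
}

# Example: 2.5 K and 25 M expressed in dollars, plus a blank code.
c(2.5, 25, 7) * getMultipliers(c('K', 'M', ''))
```

In the report this could replace the two sapply calls, for example `data$PROPDMG * getMultipliers(data$PROPDMGEXP)`.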
We aggregate the data by event type and extract the ten event types with the highest totals for each measure (fatalities, injuries, and total damages).
# Sum every measure per event type.
data.sum <- aggregate(cbind(FATALITIES, INJURIES, PROPDMG, CROPDMG, TOTALDMG) ~ EVTYPE,
                      data = data, FUN = sum)
# Keep the ten event types with the highest totals for each measure.
fatalities <- data.sum[order(data.sum$FATALITIES, decreasing = TRUE)[1:10],
                       c('EVTYPE', 'FATALITIES')]
injuries <- data.sum[order(data.sum$INJURIES, decreasing = TRUE)[1:10],
                     c('EVTYPE', 'INJURIES')]
damages <- data.sum[order(data.sum$TOTALDMG, decreasing = TRUE)[1:10],
                    c('EVTYPE', 'PROPDMG', 'CROPDMG', 'TOTALDMG')]
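To make the aggregate-then-rank pattern concrete, here is a small sketch on made-up data. The `toy` data frame and its numbers are invented for illustration only and do not come from the NOAA data set.

```r
# Invented example data: five events across three event types.
toy <- data.frame(
  EVTYPE     = c('TORNADO', 'FLOOD', 'TORNADO', 'HAIL', 'FLOOD'),
  FATALITIES = c(3, 1, 4, 0, 2),
  INJURIES   = c(10, 0, 5, 1, 2)
)

# One row per event type, with each measure summed.
toy.sum <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank by fatalities and keep the top two event types.
topFatalities <- toy.sum[order(toy.sum$FATALITIES, decreasing = TRUE)[1:2],
                         c('EVTYPE', 'FATALITIES')]
topFatalities
```

Because the subset keeps the row names of the aggregated table, the printed row labels (such as 834 for TORNADO in the real data) refer to positions in data.sum, not to ranks.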
fatalities$EVTYPE <- factor(fatalities$EVTYPE,
                            levels = fatalities$EVTYPE[order(fatalities$FATALITIES, decreasing = TRUE)])
ggplot(fatalities, aes(EVTYPE, FATALITIES)) + geom_bar(stat = 'identity') +
  xlab("Weather Event") +
  ylab("Fatalities") +
  ggtitle(paste('Total Fatalities by Weather Events in the U.S.\n from ',
                format(min(data$date), "%Y"), ' - ', format(max(data$date), "%Y"), sep = ''))
fatalities
## EVTYPE FATALITIES
## 834 TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153 FLASH FLOOD 978
## 275 HEAT 937
## 464 LIGHTNING 816
## 856 TSTM WIND 504
## 170 FLOOD 470
## 585 RIP CURRENT 368
## 359 HIGH WIND 248
## 19 AVALANCHE 224
Tornadoes are the deadliest event type, accounting for around 37% of total fatalities.
injuries$EVTYPE <- factor(injuries$EVTYPE,
                          levels = injuries$EVTYPE[order(injuries$INJURIES, decreasing = TRUE)])
ggplot(injuries, aes(EVTYPE, INJURIES)) + geom_bar(stat = 'identity') +
  xlab("Weather Event") +
  ylab("Injuries") +
  ggtitle(paste('Total Injuries by Weather Events in the U.S.\n from ',
                format(min(data$date), "%Y"), ' - ', format(max(data$date), "%Y"), sep = ''))
injuries
## EVTYPE INJURIES
## 834 TORNADO 91346
## 856 TSTM WIND 6957
## 170 FLOOD 6789
## 130 EXCESSIVE HEAT 6525
## 464 LIGHTNING 5230
## 275 HEAT 2100
## 427 ICE STORM 1975
## 153 FLASH FLOOD 1777
## 760 THUNDERSTORM WIND 1488
## 244 HAIL 1361
Tornadoes also cause the most injuries, accounting for around 65% of total injuries.
# Keep EVTYPE, PROPDMG and CROPDMG, and give them readable legend labels.
damagesMelt <- damages[1:3]
names(damagesMelt) <- c('EVTYPE', 'Property damages', 'Crop damages')
damagesMelt$EVTYPE <- factor(damages$EVTYPE,
                             levels = damages$EVTYPE[order(damages$TOTALDMG, decreasing = TRUE)])
# Reshape to long format: one row per event type and damage kind.
damagesMelt <- melt(damagesMelt, id.vars = 'EVTYPE')
ggplot(damagesMelt, aes(EVTYPE, value, fill = variable)) + geom_bar(stat = 'identity') +
  xlab("Weather Event") +
  scale_y_continuous('Damages in dollars') +
  ggtitle(paste('Damages by Weather Events in the U.S.\n from ',
                format(min(data$date), "%Y"), ' - ', format(max(data$date), "%Y"), sep = ''))
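The stacked bars rely on the damages being in long format, with one row per event type and damage kind. A minimal sketch of that reshaping on made-up numbers follows; the `wide` and `long` names and their values are illustrative assumptions, and the `variable.name` and `value.name` arguments are used here only to make the result easier to read.

```r
library(reshape2)

# Invented wide-format data: one row per event type, one column per damage kind.
wide <- data.frame(EVTYPE  = c('FLOOD', 'DROUGHT'),
                   PROPDMG = c(10, 1),
                   CROPDMG = c(2, 5))

# Long format: EVTYPE stays as the id column; the other columns are stacked.
long <- melt(wide, id.vars = 'EVTYPE', variable.name = 'type', value.name = 'dollars')
long
```

Each event type now appears once per damage kind, which is what lets ggplot2 map the damage kind to the fill color.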
damages
## EVTYPE PROPDMG CROPDMG TOTALDMG
## 170 FLOOD 144657709807 5661968450 150319678257
## 411 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 834 TORNADO 56925660790 414953270 57340614060
## 670 STORM SURGE 43323536000 5000 43323541000
## 244 HAIL 15727367548 3025537890 18752905438
## 153 FLASH FLOOD 16140812067 1421317100 17562129167
## 95 DROUGHT 1046106000 13972566000 15018672000
## 402 HURRICANE 11868319010 2741910000 14610229010
## 590 RIVER FLOOD 5118945500 5029459000 10148404500
## 427 ICE STORM 3944927860 5022113500 8967041360
Floods are the most expensive event type, accounting for around 37% of the combined damages of the ten event types listed above, but hurricanes/typhoons, tornadoes, and storm surges also cause great destruction.
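As a quick check, the flood share can be recomputed from the TOTALDMG column printed above. The vector name `topDamages` is illustrative, and the share is relative to the ten listed event types only, not to every event type in the data set.

```r
# TOTALDMG values copied from the table above (dollars).
topDamages <- c("FLOOD" = 150319678257, "HURRICANE/TYPHOON" = 71913712800,
                "TORNADO" = 57340614060, "STORM SURGE" = 43323541000,
                "HAIL" = 18752905438, "FLASH FLOOD" = 17562129167,
                "DROUGHT" = 15018672000, "HURRICANE" = 14610229010,
                "RIVER FLOOD" = 10148404500, "ICE STORM" = 8967041360)

# Percentage share of each event type within the top ten.
share <- 100 * topDamages / sum(topDamages)
round(share[['FLOOD']])
```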
Run away from these weather events.