In this report we aim to identify the most significant weather phenomena that are considered most harmful in respect of public health and economy for the United States.
Analysing the data originated from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, including events registered from 1955 to 2011 in the United States, we can see that TORNADOs and FLOODs are the type of events with significat consequence in the United States.
TORNADOs have been the main cause of loss of life and injuries with 96915 victims (62% of the total number of victims caused by weather phenomena). While FLOODs have been the main cause of damages/ economic consequences with a circa 150 billion dollars estimated value (31% of the total estimated damages caused by weather phenomena).
The data is originated from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The data covers the period from 1955 to 2011.
The data can be download from the following link in the form of a comma-separated-value file compressed via the bzip2 algorithm.
Data Processing Steps
Downloading the data as a compressed file in the current working directory.
theUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
theFile <- "./stormData.csv.bz2"
download.file(theUrl, theFile, method = "curl", quiet = TRUE, mode = "wb")
Read the data included in the compressed file using read.csv. When reading the file the options to transfor strings to factor is turned off.
rawData <- read.csv(theFile, stringsAsFactors = FALSE)
The raw dataset includes 902297 observations (rows), each observations including 37 features (columns) with the following names.
dim(rawData)
## [1] 902297 37
names(rawData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Not all of the features (columns) are needed in order to answer the questions so the first step is to try to identify a subset of the raw dataset containing only the relevant data.
Based on the available data documentation and the “questions”, we are going to create a simplified raw dataset including the following features (columns):
EVTYPE)FATALITIES, INJURIES)PROPDMG, PROPDMGEXP)CROPDMG, CROPDMGEXP)The simplified raw dataset allows us to reduce reduce the level complexity, to remove features (columns) not important for the questions we are trying to answer and to increase the performace during the data clean up and transformation.
#simplified raw dataset
sRawData <- rawData[, c(8,23,24,25,26,27,28)]
names(sRawData)
## [1] "EVTYPE" "FATALITIES" "INJURIES" "PROPDMG" "PROPDMGEXP"
## [6] "CROPDMG" "CROPDMGEXP"
Using the summary it is possible to have an overview of the content of the raw dataset and NAs.
summary(sRawData)
## EVTYPE FATALITIES INJURIES
## Length:902297 Min. : 0.0000 Min. : 0.0000
## Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000
## Mode :character Median : 0.0000 Median : 0.0000
## Mean : 0.0168 Mean : 0.1557
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :583.0000 Max. :1700.0000
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
#Number of NAs by feature (column)
theCounter <- function(x){
return(sum(is.na(x)))
}
apply(sRawData, 2, theCounter)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 0 0 0 0 0 0
## CROPDMGEXP
## 0
EVTYPE featureChecking the possible values of the EVTYPE features (column) we can see that there are 985 possible values (over 902297 observations).
tmp <- unique(sRawData$EVTYPE)
length(tmp)
## [1] 985
head(tmp)
## [1] "TORNADO" "TSTM WIND" "HAIL"
## [4] "FREEZING RAIN" "SNOW" "ICE STORM/FLASH FLOOD"
tail(tmp)
## [1] "DENSE SMOKE" "LAKESHORE FLOOD"
## [3] "MARINE THUNDERSTORM WIND" "MARINE STRONG WIND"
## [5] "ASTRONOMICAL LOW TIDE" "VOLCANIC ASHFALL"
Transforming ENVTYPE into a factor and showing the 10 most frequent event type.
sRawData$EVTYPE <- as.factor(sRawData$EVTYPE)
summary(sRawData$EVTYPE, maxsum = 10)
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## (Other)
## 114069
PROPDMGEXP, CRPDMGEXP featuresNote! From available data documentation: * ‘Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include K for thousands, M for millions, and B for billions.’
Checking the possible values for the PROPDMGEXP feature we can see that it is set to values other that “K”, “M” and “B” and (assumption) “” (without magnitude).
tmp <- unique(sRawData$PROPDMGEXP)
length(tmp)
## [1] 19
tmp
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
The actual distribution of such possible values is summarized in the table below.
table(sRawData$PROPDMGEXP)
##
## + - 0 1 2 3 4 5 6
## 465934 5 1 216 25 13 4 4 28 4
## 7 8 ? B H K M h m
## 5 1 8 40 6 424665 11330 1 7
Removing the unexpected values for PROPDMGEXP from the dataset.
sRawData <- sRawData[sRawData$PROPDMGEXP == "" | sRawData$PROPDMGEXP == "K" | sRawData$PROPDMGEXP == "M" | sRawData$PROPDMGEXP == "B",]
dim(sRawData)
## [1] 901969 7
unique(sRawData$PROPDMGEXP)
## [1] "K" "M" "" "B"
Same processing is applied for CRPDMGEXP.
table(sRawData$CROPDMGEXP)
##
## 0 2 ? B K M k
## 618094 19 1 7 9 281826 1992 21
sRawData <- sRawData[sRawData$CROPDMGEXP == "" | sRawData$CROPDMGEXP == "K" | sRawData$CROPDMGEXP == "M" | sRawData$CROPDMGEXP == "B",]
dim(sRawData)
## [1] 901921 7
unique(sRawData$CROPDMGEXP)
## [1] "" "M" "K" "B"
Removing the unexpected values for PROPDMGEXP, CROPDMGEXP we have a dataset of 901921 obs, starting from an original dataset of 902297 obs.
In order to make simplify the calculation, around impact on public heath and economy, 2 new features, PROP_DOLLAR and CROP_DOLLAR, have been added to the dataset reprting the damage value in dollars (considering the magnitude).
sRawData$PROP_DOLLAR <- 0
sRawData$CROP_DOLLAR <- 0
sRawData[sRawData$PROPDMGEXP == "","PROP_DOLLAR"] <- sRawData[sRawData$PROPDMGEXP == "","PROPDMG"]
sRawData[sRawData$PROPDMGEXP == "K","PROP_DOLLAR"] <- sRawData[sRawData$PROPDMGEXP == "K","PROPDMG"] * 1000
sRawData[sRawData$PROPDMGEXP == "M","PROP_DOLLAR"] <- sRawData[sRawData$PROPDMGEXP == "M","PROPDMG"]* 1000000
sRawData[sRawData$PROPDMGEXP == "B","PROP_DOLLAR"] <- sRawData[sRawData$PROPDMGEXP == "B","PROPDMG"] * 1000000000
sRawData[sRawData$CROPDMGEXP == "","CROP_DOLLAR"] <- sRawData[sRawData$CROPDMGEXP == "","CROPDMG"]
sRawData[sRawData$CROPDMGEXP == "K","CROP_DOLLAR"] <- sRawData[sRawData$CROPDMGEXP == "K","CROPDMG"] * 1000
sRawData[sRawData$CROPDMGEXP == "M","CROP_DOLLAR"] <- sRawData[sRawData$CROPDMGEXP == "M","CROPDMG"] * 1000000
sRawData[sRawData$CROPDMGEXP == "B","CROP_DOLLAR"] <- sRawData[sRawData$CROPDMGEXP == "B","CROPDMG"] * 1000000000
sRawData$TOTAL_DOLLAR <- sRawData$PROP_DOLLAR + sRawData$CROP_DOLLAR
sRawData$TOTAL_HEALTH <- sRawData$INJURIES + sRawData$FATALITIES
Finally the dataset is ready to be aggregated by EVTYPE. Specifically the data is aggregated calculating the total population health and total economic consequences by type of event, and ordered in ascending order by total economic and health impacts respectively.
healthPerEventType <- aggregate(TOTAL_HEALTH ~ EVTYPE, data=sRawData, sum)
economyPerEventType <- aggregate(TOTAL_DOLLAR ~ EVTYPE, data=sRawData, sum)
healthPerEventType <- healthPerEventType[order(healthPerEventType$TOTAL_HEALTH, decreasing=TRUE),]
economyPerEventType <- economyPerEventType[order(economyPerEventType$TOTAL_DOLLAR, decreasing=TRUE),]
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Lets print out the first total number of victims (injuries + casualties) for the different event types in our dataset, limiting the focus to the 10 top most harmful event types.
healthPerEventType$PERCENTAGE <- (healthPerEventType$TOTAL_HEALTH/ sum(healthPerEventType$TOTAL_HEALTH)) * 100
healthPerEventType[1:10,]
## EVTYPE TOTAL_HEALTH PERCENTAGE
## 822 TORNADO 96915 62.2963149
## 123 EXCESSIVE HEAT 8428 5.4174621
## 842 TSTM WIND 7461 4.7958810
## 165 FLOOD 7259 4.6660367
## 449 LIGHTNING 6046 3.8863284
## 268 HEAT 3037 1.9521633
## 150 FLASH FLOOD 2755 1.7708956
## 418 ICE STORM 2064 1.3267254
## 749 THUNDERSTORM WIND 1621 1.0419680
## 958 WINTER STORM 1527 0.9815454
Plotting the 5 top harmful events types
par(mfrow=c(1,2), mar=c(4,4,2,4), cex = 0.6)
barplot(healthPerEventType[1:5,"TOTAL_HEALTH"], main = "Total number of victims by Event Type (top 5)", names.arg = healthPerEventType[1:5,"EVTYPE"], xlab = "Event Type", ylab = "Total number of victims")
barplot(healthPerEventType[1:5,"PERCENTAGE"], main = "Percentage on overall victims by Event Type (top 5)", names.arg = healthPerEventType[1:5,"EVTYPE"], xlab = "Event Type", ylab = "Percentage", ylim = c(0, 100))
We can see that across the United States the most harmful event type with respect to population health is the TORNADO.
Across the United States, which types of events have the greatest economic consequences?
Lets print out the first total cost (economic consequences) for the different event types in our dataset, limiting the focus to the 10 top most expensive event types.
economyPerEventType$PERCENTAGE <- (economyPerEventType$TOTAL_DOLLAR / sum(economyPerEventType$TOTAL_DOLLAR)) * 100
economyPerEventType[1:10,]
## EVTYPE TOTAL_DOLLAR PERCENTAGE
## 165 FLOOD 150319678257 31.560119
## 389 HURRICANE/TYPHOON 71913712800 15.098524
## 822 TORNADO 57290435593 12.028318
## 652 STORM SURGE 43323541000 9.095922
## 238 HAIL 18727703230 3.931944
## 150 FLASH FLOOD 17561538817 3.687104
## 90 DROUGHT 15018672000 3.153220
## 381 HURRICANE 14610229010 3.067466
## 573 RIVER FLOOD 10148404500 2.130691
## 418 ICE STORM 8967037810 1.882660
Plotting the 5 top most expensive event types
tmp <- economyPerEventType[1:5,]
tmp$TOTAL_DOLLAR <- tmp$TOTAL_DOLLAR / 1000000000 #Convert to Billions of dollar
par(mfrow=c(1,2), mar=c(4,4,2,4), cex = 0.6)
barplot(tmp[,"TOTAL_DOLLAR"], main = "Total cost by Event Type (top 5)", names.arg = tmp[,"EVTYPE"], xlab = "Event Type", ylab = "Billions of dollars")
barplot(tmp[,"PERCENTAGE"], main = "Percentage on overall cost by Event Type (top 5)", names.arg = tmp[,"EVTYPE"], xlab = "Event Type", ylab = "Percentage", ylim = c(0, 100))
We can see that across the United States the event type with the greatest economic consequences is the FLOOD.
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
##
## locale:
## [1] C/C/C/C/C/no_NO.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.8 evaluate_0.8 formatR_1.2.1 htmltools_0.2.6
## [5] knitr_1.10.5 magrittr_1.5 rmarkdown_0.7 stringi_0.5-5
## [9] stringr_1.0.0 tools_3.1.2