Extreme weather events can be harmful and costly for communities through causing fatalities, injuries, property and crop damage. This analysis uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database, characterizes major storms and weather events in the Unitied States from 1950 to 2011, to explore which types of events (1) are most harmful with respect to population health and (2) have the greatest economic consequences.
After classifying the event names to their respective groups, and analysing the top 14 groups, Flood is the most damaging of all the events in terms of total property damage and total crop damage. Heat Related events have the most devastating effects to the crop.
After classifying the event names to their respective groups, and analysing the top 14 groups, Tornado is the most damaging of all the events in terms of total fatalities and total injuries.
Data Accessed: September 19, 2014
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [47Mb] There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The following code checks if the data file is already existing on the ./data folder, if not then it downloads the data file and save it as repdata-data-StormData.csv.bz2 in the ./data directory.
fileNameZip <- "repdata-data-StormData.csv.bz2"
fileDestZip <- paste(".","data", fileNameZip, sep = "/")
fileURL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" ## Be sure to use http, instead of https in the file URL since it gives errors in .Rmd files.
if (file.exists(fileNameZip)) {
message("Data file already exists in the current working directory")
} else {
message("Download the zipped data file into './data' directory")
if (!file.exists("./data")) {
dir.create("./data")
}
download.file(fileURL, fileDestZip, method = "auto")
}
## Data file already exists in the current working directory
Since the file compressed via the bzip2 algorithm, the size of unzipped data file is large (548Mb). Thus, it is better to read the data directly from the zipped file.
read.csv is used to read the data, since it is in a csv format with first row containing the column names so no additional parameters needed to be added.
Observing the dimension of the data, and the column names so that the required columns can be selected for the analysis.
dataRaw<- read.csv("./repdata-data-StormData.csv.bz2")
dim(dataRaw)
## [1] 902297 37
colnames(dataRaw)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Before applying the transformations, we include the libraries which may be needed for subsetting and transformations.
library(plyr)
library(reshape2)
library(xtable)
library(ggplot2)
Since the analysis concerns majorily with a few set of columns so subsetting the whole data with the selected columns i.e. "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"
requiredCOl <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data<-dataRaw[,requiredCOl]
Looking for the data if it consist of any NA values
colSums(is.na(data))
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 0 0 0 0 0 0
## CROPDMGEXP
## 0
Since data does not contains any NA values so we can proceed with the further analysis
Some of the columns, which need to be transformed include * PROPDMG: Property damange amount. * CROPDMG: Crop damange amount. * PROPDMGEXP: Unit of property damage. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, “B” for billions and “” for non. If additional precision is available, it may be provided in the narrative part of the entry. * CROPDMGEXP: Unit of crop damange. Same logical alphabetical units used in PROPDMGEXP applies. + According to the original data documentation there ar 4 units for PROPDMG and CROPDMG; however, computations clearly shows that there are some other alphabetical units such as +, -, ?, h and numbers (instead of letter). So, this analysis interprests signs as non units, and h as hundreds.
Mapping the character symbols to their numerical counterparts for PROPDMGEXP variable
## visualizing the characters in the PROPDMGEXP variable
table(data$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
## mapping each character to its numeric counterpart and replacing them by `chartr`
## function
data$PROPDMGEXP<-chartr("-?+hHkKmMbB", "11122336699", data$PROPDMGEXP)
## Some rows have empty space as the value so that is assumed to be 0
data$PROPDMGEXP[data$PROPDMGEXP==""]<-0
Similar mapping is done for ``
## visualizing the characters in the CROPDMGEXP variable
table(data$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
## mapping each character to its numeric counterpart and replacing them by `chartr`
## function
data$CROPDMGEXP<-chartr("?kKmMbB", "1336699", data$CROPDMGEXP)
## Some rows have empty space as the value so that is assumed to be 0
data$CROPDMGEXP[data$CROPDMGEXP==""]<-0
Create variable PropDamage and CropDamage which contains the values after multiplying \(variableDMG * 10^variableDMGEXP\). PropDamage and CropDamage values are calculated in millions.
data$PropDamage<-data$PROPDMG*(10^(as.numeric(data$PROPDMGEXP)-6))
data$CropDamage<-data$CROPDMG*(10^(as.numeric(data$CROPDMGEXP)-6))
On visualizing the column EVTYPE in the data, it is noted it contains 985 different unique levels, so all levels were analysed.
length(levels(data$EVTYPE))
## [1] 985
So a set of keywords were selected and were mapped to a Event group that describe them the best
The mapping of the keywords were done as follows
keywords1 <- c("(.*FLOOD.*|.*fld.*)", "(.*TORN.*|.*WATERSPOUT.*)", "(.*HURRICANE.*|.*TYPHOON.*)", "(.*THUNDE.*|.*STORM.*|.*TROPICAL DEPRESSION.*)", ".*FIRE.*", "(.*ICY.*|.*ICE.*|.*FROST.*|.*SNOW.*|.*HAIL.*|.*FREEZ.*|.*BLIZ.*|.*WINT.*|.*GLAZE.*|.*COLD.*)", "(.*WIND.*|.*DUST*.)", "(.*RAIN.*|.*PRECIPITATION.*|.*WET.*|.*HEAVY MIX.*|.*PRECIP.*)", "(.*HEAT.*|.*DRY.*|.*DROUGHT.*|.*WARM.*)", ".*FOG.*", "(.*TIDE.*|.*SURF.*|.*SEICHE.*|.*TSUNAMI.*|.*COASTAL SURGE.*|.*HIGH WATER.*|.*CURRENT.*|.*SEAS.*|.*MARINE.*)", "(.*AVALANCHE.*|.*EROSION.*|.*SLIDE.*|.*Landslump.*)","(.*LIGHTNING.*)", "(.*DAM BREAK.*|.*VOLCANIC ASH.*|.*OTHER.*)")
replace1 <- c("FLOOD", "TORNADO", "HURRICANE", "STORM", "FIRE", "WINTER RELATED", "WIND", "RAIN", "HEAT RELATED", "FOG", "OCEAN RELATED", "LAND RELATED","LIGHTNING RELATED", "OTHER")
table1 <- xtable(cbind(keywords1, replace1))
print(table1, type = "html")
| keywords1 | replace1 | |
|---|---|---|
| 1 | (.FLOOD.|.fld.) | FLOOD |
| 2 | (.TORN.|.WATERSPOUT.) | TORNADO |
| 3 | (.HURRICANE.|.TYPHOON.) | HURRICANE |
| 4 | (.THUNDE.|.STORM.|.TROPICAL DEPRESSION.) | STORM |
| 5 | .FIRE. | FIRE |
| 6 | (.ICY.|.ICE.|.FROST.|.SNOW.|.HAIL.|.FREEZ.|.BLIZ.|.WINT.|.GLAZE.|.COLD.) | WINTER RELATED |
| 7 | (.WIND.|.DUST.) | WIND |
| 8 | (.RAIN.|.PRECIPITATION.|.WET.|.HEAVY MIX.|.PRECIP.) | RAIN |
| 9 | (.HEAT.|.DRY.|.DROUGHT.|.WARM.) | HEAT RELATED |
| 10 | .FOG. | FOG |
| 11 | (.TIDE.|.SURF.|.SEICHE.|.TSUNAMI.|.COASTAL SURGE.|.HIGH WATER.|.CURRENT.|.SEAS.|.MARINE.) | OCEAN RELATED |
| 12 | (.AVALANCHE.|.EROSION.|.SLIDE.|.Landslump.) | LAND RELATED |
| 13 | (.LIGHTNING.) | LIGHTNING RELATED |
| 14 | (.DAM BREAK.|.VOLCANIC ASH.|.OTHER.) | OTHER |
This mapping was run on the EVTYPE column to reduce the levels of the event, which was reduced to 162 seperate events.
eventType<-as.character(data$EVTYPE)
for (i in seq_along(keywords1)){
found<-grepl(keywords1[i], eventType, ignore.case = TRUE)
eventType[found]<-replace1[i]
}
data$EVTYPE<-as.factor(eventType)
length(levels(data$EVTYPE))
## [1] 162
Now these rows are summarized by based on their Event type by summing their PropDamage, CropDamage, FATALITIES and INJURIES as totalPropDamage, totalCropDamage, totalFatalities and totalInjuries. Two new columns are added totalDamage (as sum of totalPropDamage and totalCropDamage) and healthHarm (as sum of totalFatalities and totalInjuries)
finalCol<-c("EVTYPE")
summarizedRes<-ddply(data, finalCol, summarize, totalPropDamage=sum(PropDamage), totalCropDamage=sum(CropDamage), totalFatalities=sum(FATALITIES), totalInjuries=sum(INJURIES))
summarizedRes$totalDamage <- summarizedRes$totalPropDamage + summarizedRes$totalCropDamage
summarizedRes$healthHarm<-summarizedRes$totalFatalities + summarizedRes$totalInjuries
hist(log(summarizedRes$totalDamage), breaks = 10, main = "Histogram for Distribution of Total Economic Damage", xlab = expression("log(Total Economic Damage(in million Dollar scale))"), col="lightblue4")
hist(log(summarizedRes$healthHarm), breaks = 10, , main = "Histogram for Distribution of Total Health Damage", xlab = expression("log(Total Health Damage)"), col="lightblue4")
Visualizing from the bimodal structure of the graph for total damage and health harm the event type cause, the event type can be divided to two classifications. ### 5) Analysis #### 5.1) Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
First order the summarized data on descending order based on the healthHarm attribute. It is noticed that maximum harm is visible in the first 14 type of event types.
So melt the data based on the totalFatalities and totalInjuries and then plot EVTYpe with the healthHarm, filling in according to the melted column.
harm<-summarizedRes[order(summarizedRes$healthHarm, decreasing = TRUE),]
head(harm[,c("EVTYPE","totalFatalities","totalInjuries", "healthHarm")], 14)
## EVTYPE totalFatalities totalInjuries healthHarm
## 145 TORNADO 5664 91436 97100
## 24 HEAT RELATED 3181 9275 12456
## 14 FLOOD 1553 8683 10236
## 160 WIND 983 8909 9892
## 77 STORM 635 6691 7326
## 42 LIGHTNING RELATED 817 5231 6048
## 161 WINTER RELATED 819 4821 5640
## 53 OCEAN RELATED 795 924 1719
## 12 FIRE 90 1608 1698
## 36 HURRICANE 135 1333 1468
## 15 FOG 80 1076 1156
## 38 LAND RELATED 268 225 493
## 55 RAIN 102 306 408
## 54 OTHER 8 4 12
harmFinal<- melt(harm[1:14,], id=c("EVTYPE", "healthHarm"), measure.vars=c("totalFatalities", "totalInjuries"))
ggplot(harmFinal, aes(x = reorder(EVTYPE, healthHarm), y = value, fill=variable)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Harm on Public Health by Generally Equivalent Event Types", y = "Total Harm Puplic Health", x = "Generally Equivalent Events Types") +
scale_fill_manual(values = c("lightblue4", "lightblue3"), name = "Harm Type", breaks = c("totalFatalities", "totalInjuries"), labels = c("Fatalities", "Injuries"))
As observed, Tornado has the biggest impact on public health with the total harm of 97100, with 5664 fatalities and 91436 injuries
First order the summarized data on descending order based on the healthHarm attribute. It is noticed that maximum harm is visible in the first 14 type of event types.
So melt the data based on the totalPropDamage and totalPropDamage and then plot EVTYpe with the totalDamage, filling in according to the melted column.
Damage<-summarizedRes[order(summarizedRes$totalDamage, decreasing = TRUE),]
head(harm[,c("EVTYPE","totalPropDamage","totalCropDamage", "totalDamage")], 14)
## EVTYPE totalPropDamage totalCropDamage totalDamage
## 145 TORNADO 5.861e+04 417.462 5.903e+04
## 24 HEAT RELATED 1.073e+03 14877.060 1.595e+04
## 14 FLOOD 1.683e+05 12388.597 1.807e+05
## 160 WIND 1.058e+04 1321.333 1.190e+04
## 77 STORM 7.327e+04 6406.889 7.967e+04
## 42 LIGHTNING RELATED 9.304e+02 12.092 9.425e+02
## 161 WINTER RELATED 1.779e+04 6787.134 2.457e+04
## 53 OCEAN RELATED 2.575e+02 0.020 2.575e+02
## 12 FIRE 8.502e+03 403.282 8.905e+03
## 36 HURRICANE 8.536e+04 5516.118 9.087e+04
## 15 FOG 2.283e+01 0.000 2.283e+01
## 38 LAND RELATED 3.320e+02 20.017 3.520e+02
## 55 RAIN 3.233e+03 953.153 4.186e+03
## 54 OTHER 1.558e+00 1.034 2.592e+00
DamageFinal<- melt(Damage[1:14,], id=c("EVTYPE", "totalDamage"), measure.vars=c("totalPropDamage", "totalCropDamage"))
ggplot(DamageFinal, aes(x = reorder(EVTYPE, totalDamage), y = value, fill=variable)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Total number in Million dollars") +
xlab("Generally Equivalent Events Types") +
ggtitle("Total number of damages per weather event") +
scale_fill_manual(values = c("lightblue4", "lightblue3"), name = "Damage Type", breaks = c("totalPropDamage", "totalCropDamage"), labels = c("Property Damage", "Crop Damage"))
As observed, Flood are the cause of maximum economic damages of 180700 Million dollar