This report analyzes severe weather events and their impact on public health and the economy of the US between 1950 and 2011.
Our hypothesis is that some weather events cause more injuries, whereas other weather phenomena have a higher impact on fatalities. Furthermore, the weather events with the greatest economic consequences are not always the same as those causing the most injuries and fatalities.
To accomplish our task we obtained data from the National Oceanic and Atmospheric Administration (NOAA) covering the years 1950 to 2011.
Analyzing the underlying data set, I found that Tornados have the highest impact on both injuries and fatalities.
However, Floods cause the highest property damage costs and Droughts cause the highest crop damage costs. Looking at the combined costs of property and crop damage, Flood has the highest economic consequences.
I focused on the Top 20 of all covered weather events.
First I load the necessary libraries and external scripts and set useful options.
library(plyr)
library(ggplot2)
library(scales)
library(reshape2)
library(knitr)
opts_chunk$set(fig.path='figures/')
# source the helper functions if either of them is missing
if(!exists("convertExpCharToExpValue", mode = "function") || !exists("multiplot", mode = "function")) {
source("functions.R")
}
Now I load the data set, which is a large CSV file, and take a quick look at its dimensions and structure. The data set consists of 902297 observations and 37 features.
# stop execution if no data file could be found
if(!file.exists("data/repdata-data-StormData.csv")) {
stop("No data file found in data path!")
}
stormDataRaw <- read.csv("data/repdata-data-StormData.csv", na.strings=c("NA","NaN", "", " ", "-", "?", "+"), stringsAsFactors=FALSE)
dim(stormDataRaw)
## [1] 902297 37
head(stormDataRaw, 5)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 2 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 3 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 4 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 5 TORNADO 0 <NA> <NA> <NA> <NA> 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 <NA> <NA> 14.0 100 3 0 0
## 2 NA 0 <NA> <NA> 2.0 150 2 0 0
## 3 NA 0 <NA> <NA> 0.1 123 2 0 0
## 4 NA 0 <NA> <NA> 0.0 100 2 0 0
## 5 NA 0 <NA> <NA> 0.0 150 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0 <NA> <NA> <NA> <NA>
## 2 0 2.5 K 0 <NA> <NA> <NA> <NA>
## 3 2 25.0 K 0 <NA> <NA> <NA> <NA>
## 4 2 2.5 K 0 <NA> <NA> <NA> <NA>
## 5 2 2.5 K 0 <NA> <NA> <NA> <NA>
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 <NA> 1
## 2 3042 8755 0 0 <NA> 2
## 3 3340 8742 0 0 <NA> 3
## 4 3458 8626 0 0 <NA> 4
## 5 3412 8642 0 0 <NA> 5
In the next step I filter the data set and extract only the columns I really need for the task. The features I need are: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
# extract only needed features
stormData <- stormDataRaw[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
# convert some data to uppercase for easier processing
stormData$PROPDMGEXP <- toupper(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- toupper(stormData$CROPDMGEXP)
write.csv(stormData, file="data/storm_data_small.csv")
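Because the raw file is large and only seven columns are needed, the unwanted columns could also be skipped while reading. The following is only a sketch of that idea using the `colClasses` argument of `read.csv`; the helper name `readSelectedColumns` and the toy file are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper, not part of the original analysis: read only the
# columns listed in `keep` by marking every other column as "NULL".
readSelectedColumns <- function(path, keep, ...) {
  # read just the header and one row to learn the column names
  header <- names(read.csv(path, nrows = 1, check.names = FALSE))
  classes <- rep("NULL", length(header))   # "NULL" = skip this column
  classes[header %in% keep] <- NA          # NA = let R guess the type
  read.csv(path, colClasses = classes, ...)
}

# toy demonstration on a temporary file
toyFile <- tempfile(fileext = ".csv")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                     REMARKS = c("a", "b"),
                     INJURIES = c(15, 0)),
          toyFile, row.names = FALSE)
toySmall <- readSelectedColumns(toyFile, c("EVTYPE", "INJURIES"),
                                stringsAsFactors = FALSE)
names(toySmall)   # only the two requested columns remain
```

Additional arguments such as `na.strings` are passed through `...`, so the same call style as in the loading step above should still work.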
Unfortunately, the observations in the property damage and crop damage exponent columns (PROPDMGEXP and CROPDMGEXP) are not very tidy. They contain a mixture of characters and numbers, and there are many NAs.
For property damage about 52% of the values are NA and for crop damage about 69%. This is a lot.
# shows the messiness of data for property damage
unique(stormDataRaw$PROPDMGEXP)
## [1] "K" "M" NA "B" "m" "0" "5" "6" "4" "2" "3" "h" "7" "H" "1" "8"
# shows the messiness of data for crop damage
unique(stormDataRaw$CROPDMGEXP)
## [1] NA "M" "K" "m" "B" "0" "k" "2"
# how many NAs of property damage ?
mean(is.na(stormData$PROPDMGEXP))
## [1] 0.5164
# how many NAs of crop damage ?
mean(is.na(stormData$CROPDMGEXP))
## [1] 0.6854
The documentation describes the numeric values of the property and crop damage columns as categories:
- Cat. 1 --> Less than $50
- Cat. 2 --> $50 to $500
- Cat. 3 --> $500 to $5,000
- Cat. 4 --> $5,000 to $50,000
- Cat. 5 --> $50,000 to $500,000
- Cat. 6 --> $500,000 to $5,000,000
- Cat. 7 --> $5,000,000 to $50,000,000
- Cat. 8 --> $50,000,000 to $500,000,000
- Cat. 9 --> $500,000,000 to $5,000,000,000
However, some entries appear to contain typos in their values, and because their count is low (85 entries) I decided to remove these rows. Furthermore, some observations of the EVTYPE feature are just summaries and also have a low count (73), so I removed them too. The remaining values are characters and NAs.
The characters are exponents of the form 10^x:
- H stands for hundred --> 10^2
- K stands for thousand --> 10^3
- M stands for million --> 10^6
- B stands for billion --> 10^9
NAs were converted to 0, although imputing them may be a better idea in a real-world scenario. I added two new columns holding the computed costs of property and crop damage.
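The conversion itself is done by `convertExpCharToExpValue` from `functions.R`, which is not shown in this report. As a rough illustration of the mapping described above, a lookup-based version might look like the sketch below. The names `expLookup` and `expCharToPower` are hypothetical, and treating every unknown code like NA (power 0) is an assumption, not necessarily what the real helper does.

```r
# Hypothetical lookup: letter code -> power of ten
# (assumed; the real helper lives in functions.R and may differ)
expLookup <- c(H = 2, K = 3, M = 6, B = 9)

expCharToPower <- function(x) {
  # look up each code case-insensitively; unknown codes and NA yield NA
  power <- unname(expLookup[toupper(x)])
  # assumption: NA, "0" and any other unknown code fall back to 10^0
  power[is.na(power)] <- 0
  power
}

# worked example: damage value times 10^power gives the cost in USD
damage  <- c(25.0, 2.5, 1.5, 3)
expChar <- c("K", "m", "B", NA)
cost <- damage * 10^expCharToPower(expChar)
cost
```

In this toy example 25.0 with code K becomes 25,000 USD, 2.5 with code m becomes 2,500,000 USD, 1.5 with code B becomes 1,500,000,000 USD, and the entry without a code stays at 3 USD.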
# remove summary rows and rows with unclear numeric exponents because their count is low
# logical indexing keeps all rows if a condition happens to match nothing
stormData <- stormData[!grepl("Summary", stormData$EVTYPE), ]
stormData <- stormData[!(stormData$PROPDMGEXP %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9")),]
stormData <- stormData[!(stormData$CROPDMGEXP %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9")),]
# convert all remaining property damage values to exponents so we can use 10^x
stormData <- convertExpCharToExpValue(stormData, "PROPDMGEXP")
stormData$PROPDMGEXP <- as.integer(stormData$PROPDMGEXP)
# compute cost of property damage
stormData$PropDmgCost <- stormData$PROPDMG * 10^stormData$PROPDMGEXP
stormData$PROPDMGEXP <- as.factor(stormData$PROPDMGEXP)
# convert all remaining crop damage values to exponents so we can use 10^x
stormData <- convertExpCharToExpValue(stormData, "CROPDMGEXP")
stormData$CROPDMGEXP <- as.integer(stormData$CROPDMGEXP)
# compute cost of crop damage
stormData$CropDmgCost <- stormData$CROPDMG * 10^stormData$CROPDMGEXP
stormData$CROPDMGEXP <- as.factor(stormData$CROPDMGEXP)
stormData$CombinedDamageCost <- stormData$PropDmgCost + stormData$CropDmgCost
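A side note on removing rows by condition, as in the filtering step above: negative indexing with `which()` has a known pitfall in R when nothing matches, because `-integer(0)` selects zero rows instead of all rows. The toy data frame below is made up purely to illustrate this and is not taken from the storm data.

```r
# made-up toy data: none of the exponent codes is a digit from 1 to 9
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "HAIL"),
                  PROPDMGEXP = c("K", "M", NA),
                  stringsAsFactors = FALSE)
badCodes <- as.character(1:9)

hits <- which(toy$PROPDMGEXP %in% badCodes)   # integer(0): nothing matches
viaWhich <- toy[-hits, ]                      # -integer(0) selects no rows at all
viaLogical <- toy[!(toy$PROPDMGEXP %in% badCodes), ]  # keeps every row

nrow(viaWhich)
nrow(viaLogical)
```

Here `viaWhich` ends up with 0 rows while `viaLogical` keeps all 3, so the logical form is the safer choice whenever a filter might not match anything.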
Now let's prepare the data for the fatality and injury inspection. I limited the data to the Top 20 most harmful types of events with respect to population health.
# sum injuries for every weather event
injuries <- ddply(stormData, .(EVTYPE), summarise, Injuries = sum(INJURIES))
# sort injuries (highest on top)
injuriesSorted <- arrange(injuries, desc(Injuries))
# limiting the results to the top 20
injuriesSorted <- injuriesSorted[1:20,]
injuriesSorted$Type <- "Injuries"
colnames(injuriesSorted) <- c("Eventtype", "Count", "Type")
injuriesSortedLevels <- injuriesSorted$Eventtype[order(injuriesSorted$Count, injuriesSorted$Eventtype)]
injuriesSorted$Eventtype <- factor(injuriesSorted$Eventtype, ordered = TRUE, levels = injuriesSortedLevels)
# sum fatalities for every weather event
fatalities <- ddply(stormData, .(EVTYPE), summarise, Fatalities = sum(FATALITIES))
# sort fatalities (highest on top)
fatalitiesSorted <- arrange(fatalities, desc(Fatalities))
# limiting the results to the top 20
fatalitiesSorted <- fatalitiesSorted[1:20,]
fatalitiesSorted$Type <- "Fatalities"
colnames(fatalitiesSorted) <- c("Eventtype", "Count", "Type")
fatalitiesSortedLevels <- fatalitiesSorted$Eventtype[order(fatalitiesSorted$Count, fatalitiesSorted$Eventtype)]
fatalitiesSorted$Eventtype <- factor(fatalitiesSorted$Eventtype, ordered = TRUE, levels = fatalitiesSortedLevels)
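The factor re-leveling at the end of each block above controls the bar order in the later plots: levels are sorted by ascending count, and with `coord_flip()` the first level is drawn at the bottom, so the event with the highest count ends up at the top. A small sketch with made-up numbers (the `toy` data frame is purely illustrative):

```r
# made-up counts, only to show how the levels are ordered
toy <- data.frame(Eventtype = c("FLOOD", "TORNADO", "HEAT"),
                  Count = c(50, 900, 300),
                  stringsAsFactors = FALSE)

# sort event names by ascending count (ties broken alphabetically)
toyLevels <- toy$Eventtype[order(toy$Count, toy$Eventtype)]
toy$Eventtype <- factor(toy$Eventtype, ordered = TRUE, levels = toyLevels)

levels(toy$Eventtype)   # lowest count first, highest count last
```

Without this step the levels of a factor default to alphabetical order, and the bars would be drawn in that order rather than sorted by size.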
A look at the injury and fatality summaries:
# shows, among other things, the maximum number of injuries
summary(stormData$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 0.2 0.0 1700.0
# shows a summary of the top 20 inj.
summary(injuriesSorted)
## Eventtype Count Type
## DUST STORM : 1 Min. : 440 Length:20
## WILD/FOREST FIRE : 1 1st Qu.: 910 Class :character
## FOG : 1 Median : 1341 Mode :character
## BLIZZARD : 1 Mean : 6732
## THUNDERSTORM WINDS: 1 3rd Qu.: 2882
## WILDFIRE : 1 Max. :91345
## (Other) :14
# shows, among other things, the maximum number of fatalities
summary(stormData$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 583
# shows a summary of the top 20 fat.
summary(fatalitiesSorted)
## Eventtype Count Type
## BLIZZARD : 1 Min. : 101 Length:20
## HIGH SURF : 1 1st Qu.: 131 Class :character
## STRONG WIND : 1 Median : 215 Mode :character
## EXTREME COLD/WIND CHILL: 1 Mean : 676
## HEAVY SNOW : 1 3rd Qu.: 582
## THUNDERSTORM WIND : 1 Max. :5633
## (Other) :14
In this step I prepare the data on the events with the greatest economic consequences. I limited the data to the Top 20 types of events.
# sum property damages for every weather event
propDamages <- ddply(stormData, .(EVTYPE), summarise, PropertyDamage = sum(PropDmgCost))
# sort, highest first
propDamagesSorted <- arrange(propDamages, desc(PropertyDamage))
# limiting the results to the top 20
propDamagesSorted <- propDamagesSorted[1:20,]
colnames(propDamagesSorted) <- c("Eventtype", "PropertyDamage")
propDamagesSortedLevels <- propDamagesSorted$Eventtype[order(propDamagesSorted$PropertyDamage, propDamagesSorted$Eventtype)]
propDamagesSorted$Eventtype <- factor(propDamagesSorted$Eventtype, ordered = TRUE, levels = propDamagesSortedLevels)
# sum crop damage for every weather event
cropDamages <- ddply(stormData, .(EVTYPE), summarise, CropDamage = sum(CropDmgCost))
# sort, highest first
cropDamagesSorted <- arrange(cropDamages, desc(CropDamage))
# limiting the results to the top 20
cropDamagesSorted <- cropDamagesSorted[1:20,]
colnames(cropDamagesSorted) <- c("Eventtype", "CropDamage")
cropDamagesSortedLevels <- cropDamagesSorted$Eventtype[order(cropDamagesSorted$CropDamage, cropDamagesSorted$Eventtype)]
cropDamagesSorted$Eventtype <- factor(cropDamagesSorted$Eventtype, ordered = TRUE, levels = cropDamagesSortedLevels)
Finally, I calculated the global costs by adding property and crop damage together for every weather event type.
# compute global costs
costs <- ddply(stormData, .(EVTYPE), summarise, PropertyDamageCost = sum(PropDmgCost), CropDamageCost = sum(CropDmgCost), GlobalCost=sum(CombinedDamageCost))
costs <- arrange(costs, desc(GlobalCost), PropertyDamageCost, CropDamageCost)
#limit to top 20
costs <- costs[1:20,]
# turn columns into rows
costsForPlot <- melt(costs, id.vars = "EVTYPE", variable.name = "CostType", value.name = "Cost")
costsLevels <- costsForPlot$EVTYPE[order(costsForPlot$Cost, costsForPlot$EVTYPE, decreasing = TRUE)]
costsLevels <- unique(costsLevels)
costsForPlot$EVTYPE <- factor(costsForPlot$EVTYPE, ordered = TRUE, levels = costsLevels)
A look at the summaries of the economic consequences:
# shows a summary of the top 20 property- and crop damage cost and global cost
summary(costs)
## EVTYPE PropertyDamageCost CropDamageCost
## Length:20 Min. :1.05e+09 Min. :0.00e+00
## Class :character 1st Qu.:3.83e+09 1st Qu.:8.68e+07
## Mode :character Median :5.19e+09 Median :5.96e+08
## Mean :2.07e+10 Mean :2.13e+09
## 3rd Qu.:1.58e+10 3rd Qu.:2.81e+09
## Max. :1.45e+11 Max. :1.40e+10
## GlobalCost
## Min. :2.50e+09
## 1st Qu.:4.94e+09
## Median :8.67e+09
## Mean :2.28e+10
## 3rd Qu.:1.79e+10
## Max. :1.50e+11
In this section I answer the following two questions:
* Across the United States, which types of events are most harmful with respect to population health?
* Across the United States, which types of events have the greatest economic consequences?
I plotted the top 20 most harmful weather events and the corresponding injury and fatality numbers together in one figure.
p1 <- ggplot(injuriesSorted, aes(x = Eventtype, y = Count, fill = Type)) +
geom_bar(stat = "identity") +
scale_fill_manual(values=c("#CC6666")) +
coord_flip() +
labs(x="Weather event type") +
labs(y="Number of Injuries") +
ggtitle("Top 20 most harmful weather events with respect to population health \n in the US (1950-2011)") +
theme(plot.title = element_text(color="blue", size=12, vjust=1.0))
p2 <- ggplot(fatalitiesSorted, aes(x = Eventtype, y = Count, fill = Type)) +
geom_bar(stat = "identity") +
scale_fill_manual(values=c("#9999CC")) +
coord_flip() +
labs(x="Weather event type") +
labs(y="Number of Fatalities")
multiplot(p1, p2, cols=1)
The figure above shows that Tornados, followed by TSTM Wind and Flood, cause the most injuries, whereas Tornados, Excessive Heat and Flash Flood cause the most fatalities.
To sum up: Tornados have the biggest impact on population health.
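The `multiplot()` helper used to stack the two plots comes from `functions.R`, which is not shown in this report. A minimal stand-in, assuming the helper simply arranges ggplot objects on a grid layout, could be sketched with the `grid` package as follows. The names `multiplotSketch` and `cellFor` are hypothetical, and unlike some published versions of `multiplot` this sketch fills the layout row by row.

```r
library(grid)

# Row and column of the i-th plot when the layout is filled row by row.
cellFor <- function(i, cols) {
  c(row = (i - 1) %/% cols + 1, col = (i - 1) %% cols + 1)
}

# Hypothetical stand-in for multiplot(): draw several ggplot objects
# on one page, arranged in `cols` columns.
multiplotSketch <- function(..., cols = 1) {
  plots <- list(...)
  n <- length(plots)
  rows <- ceiling(n / cols)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(rows, cols)))
  for (i in seq_len(n)) {
    cell <- cellFor(i, cols)
    # print() for ggplot objects accepts a viewport via `vp`
    print(plots[[i]],
          vp = viewport(layout.pos.row = cell[["row"]],
                        layout.pos.col = cell[["col"]]))
  }
  invisible(NULL)
}

# usage with the plots from this report (needs ggplot2 objects):
# multiplotSketch(p1, p2, cols = 1)
```

With `cols = 1` the plots are stacked vertically, which matches how the figures in this report are laid out.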
Here I plotted the top 20 weather events and the corresponding costs of property and crop damage together in one figure.
p3 <- ggplot(propDamagesSorted, aes(x = Eventtype, y = PropertyDamage)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = dollar) +
#scale_colour_manual(values=c("#CC6666")) +
coord_flip() +
labs(x="Weather event type") +
labs(y="Property damage in USD") +
ggtitle("Top 20 of weather events having the greatest economic consequences \n in the US between the years 1950-2011") +
theme(plot.title = element_text(color="blue", size=12, vjust=1.0))
p4 <- ggplot(cropDamagesSorted, aes(x = Eventtype, y = CropDamage)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = dollar) +
#scale_colour_manual(values=c("#9999CC")) +
coord_flip() +
labs(x="Weather event type") +
labs(y="Crop damage in USD")
multiplot(p3, p4, cols=1)
As we can see in the figure above, Floods, Hurricanes/Typhoons, Tornados and Storm Surges produce the highest costs of property damage.
On the other hand, weather events like Drought, Flood, River Flood and Ice Storm cause the highest costs of crop damage. River Flood and Ice Storm are very close to each other. Notably, Flood causes high costs for both property and crops.
To see what causes the highest global costs, I added the costs of property and crop damage together and plotted them in the following figure:
p5 <- ggplot(costsForPlot, aes(x = EVTYPE, y = Cost, fill = CostType)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(labels = dollar) +
coord_flip() +
labs(x="Weather event type") +
labs(y="Costs in USD") +
ggtitle("Comparing the overall costs of the economic consequences \n and their fractions of property damage and crop damage \n in the US between the years 1950-2011") +
theme(plot.title = element_text(color="blue", size=12, vjust=1.0))
print(p5)
This figure shows that the overall economic consequences are highest for Flood, Hurricanes/Typhoons, Tornados and Storm Surge.
Comparatively low costs can be expected from weather events like Heavy Rain or Wild/Forest Fire.