Synopsis

This report analyzes severe weather events and their impact on public health and the economy of the US between 1950 and 2011.

Our hypothesis is that some weather events cause more injuries, whereas other weather phenomena have a higher impact on fatalities. Furthermore, the weather events with the greatest economic consequences are not always the same as those causing the most injuries and fatalities.

To accomplish this task we obtained storm data from the National Oceanic and Atmospheric Administration (NOAA) covering the years 1950 to 2011.

Analyzing the underlying data set, I found that Tornados have the highest impact on both injuries and fatalities.

However, Floods cause the highest property damage costs, and Droughts cause the highest crop damage costs. Looking at the combined costs of property and crop damage, Flood has the highest economic consequences.

I focused on the top 20 of all covered weather events.

Data Processing

First I load the necessary libraries and external scripts and set useful options.

library(plyr)
library(ggplot2)
library(scales)
library(reshape2)
library(knitr)

opts_chunk$set(fig.path='figures/')

# source the helper script if either helper function is missing
if(!exists("convertExpCharToExpValue", mode = "function") || !exists("multiplot", mode = "function")) {
  source("functions.R")
}

Load and quick overview

Now I load the data set, which is a large CSV file, and take a quick look at its dimensions and structure. The data set consists of 902297 observations and 37 features.

# stop execution if no data file could be found
if(!file.exists("data/repdata-data-StormData.csv")) {
  stop("No data file found in data path!")
}

stormDataRaw <- read.csv("data/repdata-data-StormData.csv", na.strings=c("NA","NaN", "", " ", "-", "?", "+"), stringsAsFactors=FALSE)
dim(stormDataRaw)
## [1] 902297     37
head(stormDataRaw, 5)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 2 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 3 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 4 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 5 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0    <NA>       <NA>   14.0   100 3   0          0
## 2         NA         0    <NA>       <NA>    2.0   150 2   0          0
## 3         NA         0    <NA>       <NA>    0.1   123 2   0          0
## 4         NA         0    <NA>       <NA>    0.0   100 2   0          0
## 5         NA         0    <NA>       <NA>    0.0   150 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP  WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0       <NA> <NA>       <NA>      <NA>
## 2        0     2.5          K       0       <NA> <NA>       <NA>      <NA>
## 3        2    25.0          K       0       <NA> <NA>       <NA>      <NA>
## 4        2     2.5          K       0       <NA> <NA>       <NA>      <NA>
## 5        2     2.5          K       0       <NA> <NA>       <NA>      <NA>
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806    <NA>      1
## 2     3042      8755          0          0    <NA>      2
## 3     3340      8742          0          0    <NA>      3
## 4     3458      8626          0          0    <NA>      4
## 5     3412      8642          0          0    <NA>      5
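The effect of the `na.strings` argument can be illustrated on a tiny in-memory CSV. This is only a sketch: the three-column sample and the name `toy` are made up for illustration and are not taken from the storm data file.

```r
# Made-up sample: empty fields, "?" and "-" should all be read as NA
csvText <- "EVTYPE,PROPDMG,PROPDMGEXP\nTORNADO,25,K\nHAIL,0,\nFLOOD,5,?\nWIND,1,-"

toy <- read.csv(text = csvText,
                na.strings = c("NA", "", "?", "-"),
                stringsAsFactors = FALSE)

# only the first row keeps a usable exponent code
naFlags <- is.na(toy$PROPDMGEXP)
print(naFlags)
```

Treating such placeholder symbols as NA at read time means they never have to be handled as separate exponent codes later on.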

Cleanup

In the next step I reduce the data set to only the columns needed for this analysis: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.

# extract only needed features
stormData <- stormDataRaw[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
# convert some data to uppercase for easier processing
stormData$PROPDMGEXP <- toupper(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- toupper(stormData$CROPDMGEXP)
write.csv(stormData, file="data/storm_data_small.csv")

Unfortunately the exponent values for property damage and crop damage are not very tidy. They are a mixture of characters and numbers, and there are lots of NAs.

For property damage ~52% of the exponent values are NA, and for crop damage ~69%. That is a lot.

# shows the messiness of data for property damage
unique(stormDataRaw$PROPDMGEXP)
##  [1] "K" "M" NA  "B" "m" "0" "5" "6" "4" "2" "3" "h" "7" "H" "1" "8"
# shows the messiness of data for crop damage
unique(stormDataRaw$CROPDMGEXP)
## [1] NA  "M" "K" "m" "B" "0" "k" "2"
# how many NAs of property damage ?
mean(is.na(stormData$PROPDMGEXP))
## [1] 0.5164
# how many NAs of crop damage ?
mean(is.na(stormData$CROPDMGEXP))
## [1] 0.6854
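Besides listing the unique codes, it can help to count how often each one occurs. The following is a minimal sketch on a made-up vector (`exps` is hypothetical, not the real column):

```r
# Made-up exponent codes, including lower case, a digit and NAs
exps <- c("K", "M", NA, "k", "5", NA, "B", "K")

# frequency of each code after upper-casing; NAs are counted too
counts <- table(toupper(exps), useNA = "ifany")
print(counts)

# share of missing values, computed the same way as in the report
naShare <- mean(is.na(exps))
print(naShare)
```

A frequency table like this makes it easier to judge whether rare codes are few enough to drop.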

The documentation describes the numeric values of the property and crop damage exponents as categories:

- Cat. 1 --> Less than $50
- Cat. 2 --> $50 to $500
- Cat. 3 --> $500 to $5,000
- Cat. 4 --> $5,000 to $50,000
- Cat. 5 --> $50,000 to $500,000
- Cat. 6 --> $500,000 to $5,000,000
- Cat. 7 --> $5,000,000 to $50,000,000
- Cat. 8 --> $50,000,000 to $500,000,000
- Cat. 9 --> $500,000,000 to $5,000,000,000

However, some records show typos in their values, and because the count is low (85 entries) I decided to remove these rows. Furthermore, some entries of the EVTYPE feature are just summaries and also have a low count (73), so I removed them too. The remaining values are characters and NAs.

The characters are exponents of the form 10^x:

- H stands for hundred --> 10^2
- K stands for thousand --> 10^3
- M stands for million --> 10^6
- B stands for billion --> 10^9

NAs were converted to 0, although imputation may be a better idea in a real-world scenario. I added two new columns holding the computed costs of property and crop damage.
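The letter-to-exponent mapping can be sketched with a named lookup vector. Note that `convertExpCharToExpValue` lives in functions.R, which is not shown here; the helper `toExponent` and the `demo` data below are hypothetical stand-ins and not necessarily how that function works.

```r
# Hypothetical lookup: exponent letter -> power of ten
expLookup <- c(H = 2, K = 3, M = 6, B = 9)

# Assumed behaviour: NA and unknown codes fall back to exponent 0
toExponent <- function(x) {
  e <- unname(expLookup[toupper(x)])
  e[is.na(e)] <- 0
  e
}

# Made-up damage values with their exponent codes
demo <- data.frame(DMG = c(25, 2.5, 1, 3),
                   EXP = c("K", "m", NA, "B"),
                   stringsAsFactors = FALSE)

# cost = value * 10^exponent
demo$Cost <- demo$DMG * 10^toExponent(demo$EXP)
print(demo)
```

With an exponent of 0 the damage value is multiplied by 1, so rows without an exponent code keep their raw value.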

# remove summary entries and unclear numeric exponent codes because their count is low
stormData <- stormData[!grepl("Summary", stormData$EVTYPE), ]
stormData <- stormData[!(stormData$PROPDMGEXP %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9")), ]
stormData <- stormData[!(stormData$CROPDMGEXP %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9")), ]
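One caveat applies when dropping rows in R: if `which()` finds no match, negative indexing with the resulting empty vector selects zero rows instead of all rows. Logical indexing does not have this problem. A minimal sketch on made-up data (`df` is hypothetical):

```r
# Made-up data without any numeric exponent code
df <- data.frame(id = 1:3,
                 code = c("K", "M", "B"),
                 stringsAsFactors = FALSE)

unwanted <- c("1", "2", "3")

# which() returns integer(0), and df[-integer(0), ] keeps no rows at all
viaWhich <- df[-which(df$code %in% unwanted), ]

# a negated logical vector keeps every row that does not match
viaLogical <- df[!(df$code %in% unwanted), ]

print(nrow(viaWhich))
print(nrow(viaLogical))
```

Filtering with a negated logical condition is therefore the safer choice whenever the set of matches might be empty.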

# convert all remaining property damage values to exponents so we can use 10^x
stormData <- convertExpCharToExpValue(stormData, "PROPDMGEXP")
stormData$PROPDMGEXP <- as.integer(stormData$PROPDMGEXP)
# compute cost of property damage
stormData$PropDmgCost <- stormData$PROPDMG * 10^stormData$PROPDMGEXP
stormData$PROPDMGEXP <- as.factor(stormData$PROPDMGEXP)

# convert all remaining crop damage values to exponents so we can use 10^x
stormData <- convertExpCharToExpValue(stormData, "CROPDMGEXP")
stormData$CROPDMGEXP <- as.integer(stormData$CROPDMGEXP)
# compute cost of crop damage
stormData$CropDmgCost <- stormData$CROPDMG * 10^stormData$CROPDMGEXP
stormData$CROPDMGEXP <- as.factor(stormData$CROPDMGEXP)
# combined cost of property and crop damage
stormData$CombinedDamageCost <- stormData$PropDmgCost + stormData$CropDmgCost

Preprocessing Fatalities and Injuries

Now let's prepare the data for inspecting fatalities and injuries. I limited the data to the top 20 most harmful event types with respect to population health.

# sum injuries for every weather event
injuries <- ddply(stormData, .(EVTYPE), summarise, Injuries = sum(INJURIES))
# sort injuries (highest on top)
injuriesSorted <- arrange(injuries, desc(Injuries))
# limiting the results to the top 20
injuriesSorted <- injuriesSorted[1:20,]
injuriesSorted$Type <- "Injuries"
colnames(injuriesSorted) <- c("Eventtype", "Count", "Type")
injuriesSortedLevels <- injuriesSorted$Eventtype[order(injuriesSorted$Count, injuriesSorted$Eventtype)] 
injuriesSorted$Eventtype <- factor(injuriesSorted$Eventtype, ordered = TRUE, levels = injuriesSortedLevels)

# sum fatalities for every weather event
fatalities <- ddply(stormData, .(EVTYPE), summarise, Fatalities = sum(FATALITIES))
# sort fatalities (highest on top)
fatalitiesSorted <- arrange(fatalities, desc(Fatalities))
# limiting the results to the top 20
fatalitiesSorted <- fatalitiesSorted[1:20,]
fatalitiesSorted$Type <- "Fatalities"
colnames(fatalitiesSorted) <- c("Eventtype", "Count", "Type")
fatalitiesSortedLevels <- fatalitiesSorted$Eventtype[order(fatalitiesSorted$Count, fatalitiesSorted$Eventtype)] 
fatalitiesSorted$Eventtype <- factor(fatalitiesSorted$Eventtype, ordered = TRUE, levels = fatalitiesSortedLevels)
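The sum, sort and top-N pattern used above can also be written in base R. The sketch below uses made-up numbers and `aggregate()` instead of `ddply()`; the names `toy`, `totals` and `top2` are hypothetical.

```r
# Made-up injury records per event
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
                  INJURIES = c(15, 2, 30, 7, 1, 0),
                  stringsAsFactors = FALSE)

# sum injuries per event type
totals <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)
totals$EVTYPE <- as.character(totals$EVTYPE)

# sort with the highest total first and keep the top 2
totals <- totals[order(-totals$INJURIES), ]
top2 <- head(totals, 2)

# order the factor levels by count so a flipped bar chart shows the largest bar on top
top2$EVTYPE <- factor(top2$EVTYPE, ordered = TRUE,
                      levels = top2$EVTYPE[order(top2$INJURIES)])
print(top2)
```

Setting the factor levels explicitly matters because ggplot2 draws discrete axes in level order, not in row order.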

A look at the injury and fatality summaries:

# shows, among other statistics, the maximum injury count
summary(stormData$INJURIES)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     0.2     0.0  1700.0
# shows a summary of the top 20 events by injuries
summary(injuriesSorted)
##               Eventtype      Count           Type          
##  DUST STORM        : 1   Min.   :  440   Length:20         
##  WILD/FOREST FIRE  : 1   1st Qu.:  910   Class :character  
##  FOG               : 1   Median : 1341   Mode  :character  
##  BLIZZARD          : 1   Mean   : 6732                     
##  THUNDERSTORM WINDS: 1   3rd Qu.: 2882                     
##  WILDFIRE          : 1   Max.   :91345                     
##  (Other)           :14
# shows, among other statistics, the maximum fatality count
summary(stormData$FATALITIES)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0     583
# shows a summary of the top 20 events by fatalities
summary(fatalitiesSorted)
##                    Eventtype      Count          Type          
##  BLIZZARD               : 1   Min.   : 101   Length:20         
##  HIGH SURF              : 1   1st Qu.: 131   Class :character  
##  STRONG WIND            : 1   Median : 215   Mode  :character  
##  EXTREME COLD/WIND CHILL: 1   Mean   : 676                     
##  HEAVY SNOW             : 1   3rd Qu.: 582                     
##  THUNDERSTORM WIND      : 1   Max.   :5633                     
##  (Other)                :14

Preprocessing economic consequences

In this step I prepare the data on the events with the greatest economic consequences. I limited the data to the top 20 event types.

# sum property damages for every weather event
propDamages <- ddply(stormData, .(EVTYPE), summarise, PropertyDamage = sum(PropDmgCost))
# sort, highest first
propDamagesSorted <- arrange(propDamages, desc(PropertyDamage))
# limiting the results to the top 20
propDamagesSorted <- propDamagesSorted[1:20,]
colnames(propDamagesSorted) <- c("Eventtype", "PropertyDamage")
propDamagesSortedLevels <- propDamagesSorted$Eventtype[order(propDamagesSorted$PropertyDamage, propDamagesSorted$Eventtype)]
propDamagesSorted$Eventtype <- factor(propDamagesSorted$Eventtype, ordered = TRUE, levels = propDamagesSortedLevels)

# sum crop damage for every weather event
cropDamages <- ddply(stormData, .(EVTYPE), summarise, CropDamage = sum(CropDmgCost))
# sort, highest first
cropDamagesSorted <- arrange(cropDamages, desc(CropDamage))
# limiting the results to the top 20
cropDamagesSorted <- cropDamagesSorted[1:20,]
colnames(cropDamagesSorted) <- c("Eventtype", "CropDamage")
cropDamagesSortedLevels <- cropDamagesSorted$Eventtype[order(cropDamagesSorted$CropDamage, cropDamagesSorted$Eventtype)]
cropDamagesSorted$Eventtype <- factor(cropDamagesSorted$Eventtype, ordered = TRUE, levels = cropDamagesSortedLevels)

Finally I calculated the global costs by adding property and crop damage together for every weather event type.

# compute global costs
costs <- ddply(stormData, .(EVTYPE), summarise, PropertyDamageCost = sum(PropDmgCost), CropDamageCost = sum(CropDmgCost), GlobalCost=sum(CombinedDamageCost))

costs <- arrange(costs, desc(GlobalCost), PropertyDamageCost, CropDamageCost)
#limit to top 20
costs <- costs[1:20,]
# turn columns into rows
costsForPlot <- melt(costs, id.vars = "EVTYPE", variable.name = "CostType", value.name = "Cost")
costsLevels <- costsForPlot$EVTYPE[order(costsForPlot$Cost, costsForPlot$EVTYPE, decreasing = TRUE)]
costsLevels <- unique(costsLevels)
costsForPlot$EVTYPE <- factor(costsForPlot$EVTYPE, ordered = TRUE, levels = costsLevels)
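To make the reshaping step concrete, here is what turning columns into rows means, written in base R on made-up numbers. This is a hand-built sketch of the long format, not a replacement for `melt()`; `costsToy` and `long` are hypothetical names.

```r
# Made-up wide table: one row per event type, one column per cost type
costsToy <- data.frame(EVTYPE = c("FLOOD", "DROUGHT"),
                       PropertyDamageCost = c(100, 10),
                       CropDamageCost = c(5, 50),
                       stringsAsFactors = FALSE)

costCols <- c("PropertyDamageCost", "CropDamageCost")

# long table: one row per (event type, cost type) pair
long <- data.frame(EVTYPE = rep(costsToy$EVTYPE, times = length(costCols)),
                   CostType = rep(costCols, each = nrow(costsToy)),
                   Cost = unlist(costsToy[costCols], use.names = FALSE),
                   stringsAsFactors = FALSE)
print(long)
```

The long layout is what lets a single `fill = CostType` aesthetic draw one bar per cost type for each event.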

A look at the summaries of the economic consequences:

# shows a summary of the top 20 property- and crop damage cost and global cost
summary(costs)
##     EVTYPE          PropertyDamageCost CropDamageCost    
##  Length:20          Min.   :1.05e+09   Min.   :0.00e+00  
##  Class :character   1st Qu.:3.83e+09   1st Qu.:8.68e+07  
##  Mode  :character   Median :5.19e+09   Median :5.96e+08  
##                     Mean   :2.07e+10   Mean   :2.13e+09  
##                     3rd Qu.:1.58e+10   3rd Qu.:2.81e+09  
##                     Max.   :1.45e+11   Max.   :1.40e+10  
##    GlobalCost      
##  Min.   :2.50e+09  
##  1st Qu.:4.94e+09  
##  Median :8.67e+09  
##  Mean   :2.28e+10  
##  3rd Qu.:1.79e+10  
##  Max.   :1.50e+11

Results

In this section I will answer the following two questions:

* Across the United States, which types of events are most harmful with respect to population health?
* Across the United States, which types of events have the greatest economic consequences?

Most harmful events with respect to population health

I plotted the top 20 most harmful weather events and the corresponding injury and fatality numbers together in one figure.

p1 <- ggplot(injuriesSorted, aes(x = Eventtype, y = Count, fill = Type)) + 
  geom_bar(stat = "identity") +
  scale_fill_manual(values=c("#CC6666")) +
  coord_flip() +
  labs(x="Weather event type") +
  labs(y="Number of Injuries") +
  ggtitle("Top 20 most harmful weather events with respect to population health \n in the US (1950-2011)") +
  theme(plot.title = element_text(color="blue", size=12, vjust=1.0))

p2 <- ggplot(fatalitiesSorted, aes(x = Eventtype, y = Count, fill = Type)) + 
  geom_bar(stat = "identity") +
  scale_fill_manual(values=c("#9999CC")) +
  coord_flip() +
  labs(x="Weather event type") +
  labs(y="Number of Fatalities")

multiplot(p1, p2, cols=1)

[Figure: top 20 weather event types by number of injuries (upper panel) and number of fatalities (lower panel)]

The figure above shows that Tornados, followed by TSTM Wind and Flood, cause the most injuries, whereas Tornados, Excessive Heat and Flash Flood cause the most fatalities.

In summary: Tornados have the biggest impact on population health.

Events having the greatest economic consequences

Here I plotted the top 20 weather events and the corresponding costs of property and crop damage together in one figure.

p3 <- ggplot(propDamagesSorted, aes(x = Eventtype, y = PropertyDamage)) + 
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = dollar) +
  #scale_colour_manual(values=c("#CC6666")) +
  coord_flip() +
  labs(x="Weather event type") +
  labs(y="Property damage in USD") +
  ggtitle("Top 20 of weather events having the greatest economic consequences \n in the US between the years 1950-2011") +
  theme(plot.title = element_text(color="blue", size=12, vjust=1.0))

p4 <- ggplot(cropDamagesSorted, aes(x = Eventtype, y = CropDamage)) + 
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = dollar) +
  #scale_colour_manual(values=c("#9999CC")) +
  coord_flip() +
  labs(x="Weather event type") +
  labs(y="Crop damage in USD")

multiplot(p3, p4, cols=1)

[Figure: top 20 weather event types by property damage (upper panel) and crop damage (lower panel), in USD]

As the figure above shows, Floods, Hurricanes/Typhoons, Tornados and Storm Surges produce the highest property damage costs.

On the other hand, weather events like Drought, Flood, River Flood and Ice Storm cause the highest crop damage costs. River Flood and Ice Storm are very close to each other. Flood stands out by causing high costs for both property and crops.

To see what causes the highest global costs, I added the costs of property and crop damage together and plotted them in the following figure:

p5 <- ggplot(costsForPlot, aes(x = EVTYPE, y = Cost, fill = CostType)) + 
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(labels = dollar) +
  coord_flip() +
  labs(x="Weather event type") +
  labs(y="Costs in USD") +
  ggtitle("Comparing the overall costs of the economic consequences \n and their fractions of property damage and crop damage  \n in the US between the years 1950-2011") +
  theme(plot.title = element_text(color="blue", size=12, vjust=1.0))
print(p5)

[Figure: property damage cost, crop damage cost and global cost in USD for the top 20 weather event types]

This figure shows that the overall economic consequences are highest for Flood, Hurricanes/Typhoons, Tornados and Storm Surge.

Comparatively low costs can be expected from weather events like Heavy Rain or Wild/Forest Fire.