Weather Event Damage Analysis

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The goal of this analysis is to answer the following 2 questions:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data Processing

Getting Data:

Download Link
Storm Data Documentation
Date: Dec 20th, 2014

Needed libraries

library(plyr)
library(ggplot2)
library(scales)
library(gridExtra)

## Loading required package: grid

Download and Extract the file:

setwd("~/Repos/RepData_PeerAssessment2/")

if (!"repdata-data-StormData.csv.bz2" %in% dir("./data/")) {
  download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "data/repdata-data-StormData.csv.bz2")
  bunzip2("./data/repdata-data-StormData.csv.bz2", overwrite=TRUE, remove=FALSE)
}else{
  print("File already Downloaded and extracted.")
}

## [1] "File already Downloaded and extracted."

data <- read.csv("./data/repdata-data-StormData.csv")
dim(data)

## [1] 902297     37

names(data)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

There are a total of 902297 observations of 37 variables. For this analysis I will only consider the variables related to Event Type, Fatalities, Injuries and Damage.

Data Processing:

First I will remove the unused variables to have a more readable console output:

data <- subset(data, select=c("EVTYPE","FATALITIES","INJURIES",
                              "PROPDMG","PROPDMGEXP", "CROPDMG","CROPDMGEXP"))
summary(data)

##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9

str(data)

## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

In a first look we can say that there aren’t NA but both PROPDMGEXP and CROPDMGEXP have trash values that will need a further clean.

To answer the first question I will take a look to the columns FATALITIES and INJURIES. I will aggregate EVTYPE by these 2 variables and plot it.

pop.healt <- ddply(data, .(EVTYPE),summarize,
                   TOTAL = sum(FATALITIES) + sum(INJURIES))
head(pop.healt)

##                 EVTYPE TOTAL
## 1                    ?     0
## 2       ABNORMALLY DRY     0
## 3       ABNORMALLY WET     0
## 4      ABNORMAL WARMTH     0
## 5 ACCUMULATED SNOWFALL     0
## 6  AGRICULTURAL FREEZE     0

dim(pop.healt)

## [1] 985   2

I’m not interested in events with zero casualties so I will remove it from the DataSet

pop.healt <- subset(pop.healt, TOTAL != 0)
dim(pop.healt)

## [1] 220   2

220 obs is still too much for a quick overview of the top events. Let’s look at the quantiles to see if we can narrow to the top events.

quantile(pop.healt$TOTAL)

##       0%      25%      50%      75%     100% 
##     1.00     1.75     5.00    44.25 96979.00

It seems like the 25% of the Top events has more than 44.25 casualties. Let’s narrow it a little bit more to have a clear big picture of the top events…

pop.healt <- subset(pop.healt, TOTAL > 44.25)
dim(pop.healt)

## [1] 55  2

Let’s look at the top event:

pop.healt[which.max(pop.healt$TOTAL),]

##      EVTYPE TOTAL
## 830 TORNADO 96979

Now let’s plot the Top 10 Events ordered by TOTAL

pop.healt <- arrange(pop.healt, TOTAL, decreasing = TRUE)[1:10,]
# Re-arrange the levels is mandatory for ggplot to plot the bars in the same
# order as the dataframe
pop.healt$EVTYPE <- factor(pop.healt$EVTYPE,levels = pop.healt$EVTYPE)

# Custom ggplot theme
theme <- theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme_minimal()

ggplot(pop.healt) + geom_bar(aes(EVTYPE,TOTAL),stat = "identity") + 
  xlab("Weather Events") + ylab("Number of people affected") +
  ggtitle("Numbers of people Affected in the U.S. by Weather Events from 1995 to 2011") +
  theme

That looks great. But I can’t sum apples and bananas (FATALITIES and INJURIES) so let’s look a little bit into the two variables by separated and how they impact in the number of people affected

fatalities <- ddply(data, .(EVTYPE),summarize,
                   FATALITIES = sum(FATALITIES))
injuries <- ddply(data, .(EVTYPE),summarize,
                   INJURIES = sum(INJURIES))

fatalities <- arrange(fatalities, FATALITIES, decreasing = TRUE)[1:10,]
injuries <- arrange(injuries, INJURIES, decreasing = TRUE)[1:10,]

fatalities$EVTYPE <- factor(fatalities$EVTYPE,levels = fatalities$EVTYPE)
injuries$EVTYPE <- factor(injuries$EVTYPE, levels = injuries$EVTYPE)

fatalities

##            EVTYPE FATALITIES
## 1         TORNADO       5633
## 2  EXCESSIVE HEAT       1903
## 3     FLASH FLOOD        978
## 4            HEAT        937
## 5       LIGHTNING        816
## 6       TSTM WIND        504
## 7           FLOOD        470
## 8     RIP CURRENT        368
## 9       HIGH WIND        248
## 10      AVALANCHE        224

injuries

##               EVTYPE INJURIES
## 1            TORNADO    91346
## 2          TSTM WIND     6957
## 3              FLOOD     6789
## 4     EXCESSIVE HEAT     6525
## 5          LIGHTNING     5230
## 6               HEAT     2100
## 7          ICE STORM     1975
## 8        FLASH FLOOD     1777
## 9  THUNDERSTORM WIND     1488
## 10              HAIL     1361

Perfect! A little bit of Copy-Paste but it’s fine so far. Now lets try to plot it together

injuries.plot <- ggplot(injuries) + geom_bar(aes(EVTYPE,INJURIES), stat = "identity")+
  ggtitle("Numbers of people injured\n in the U.S.\n by Weather Events\n from 1995 to 2011") +
  theme

fatalities.plot <- ggplot(fatalities) + geom_bar(aes(EVTYPE,FATALITIES), stat = "identity")+
  ggtitle("Numbers of fatalities\n in the U.S.\n by Weather Events\n from 1995 to 2011") +
  theme

grid.arrange(fatalities.plot, injuries.plot, ncol = 2)

Now let’s dig a little bit about the economic consequences. We have 4 columns:
- PROPDMG
- PROPDMGEXP
- CROPDMG
- CROPDMGEXP

Let’s try to unify the columns in just 1 numeric Value: TOTALDMG. Both PROPDMGEXP and CROPDMGEXP columns have a multiplier for each value in PROPDMG and CROPDMG. We have Hundred (H), Thousand (K), Million (M) and Billion (B). We also have trash values like -, +, ?

data$PROPDMGEXP <- tolower(as.character(data$PROPDMGEXP))
data$CROPDMGEXP <- tolower(as.character(data$CROPDMGEXP))

data[data$CROPDMGEXP == "b", "CROPDMGEXP"] <- 9
data[data$CROPDMGEXP == "m", "CROPDMGEXP"] <- 6
data[data$CROPDMGEXP == "k", "CROPDMGEXP"] <- 3
data[data$CROPDMGEXP == "", "CROPDMGEXP"] <- 0
data[data$CROPDMGEXP == "?", "CROPDMGEXP"] <- 0
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
table(data$CROPDMGEXP)

## 
##      0      2      3      6      9 
## 618439      1 281853   1995      9

data[data$PROPDMGEXP == "b", "PROPDMGEXP"] <- 9
data[data$PROPDMGEXP == "m", "PROPDMGEXP"] <- 6
data[data$PROPDMGEXP == "k", "PROPDMGEXP"] <- 3
data[data$PROPDMGEXP == "h", "PROPDMGEXP"] <- 2
data[data$PROPDMGEXP == "", "PROPDMGEXP"] <- 0
data[data$PROPDMGEXP == "?", "PROPDMGEXP"] <- 0
data[data$PROPDMGEXP == "-", "PROPDMGEXP"] <- 0
data[data$PROPDMGEXP == "+", "PROPDMGEXP"] <- 0
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
table(data$PROPDMGEXP)

## 
##      0      1      2      3      4      5      6      7      8      9 
## 466164     25     20 424669      4     28  11341      5      1     40

Excellent! Now we have to calculate the TOTALDMG:

crop.total.dmg <- data$CROPDMG * (10 ^ data$CROPDMGEXP)
prop.total.dmg <- data$PROPDMG * (10 ^ data$PROPDMGEXP)

total.dmg <- crop.total.dmg + prop.total.dmg

data$TOTALDMG <- total.dmg

Now we just need to summarize the data and create a nice plot.

event.dmg <- ddply(data, .(EVTYPE), summarize,
                   TOTALDMG = sum(TOTALDMG))
event.dmg <- arrange(event.dmg, TOTALDMG, decreasing = TRUE)[1:10,]

event.dmg$EVTYPE <- factor(event.dmg$EVTYPE, levels = event.dmg$EVTYPE)

ggplot(event.dmg) + geom_bar(aes(EVTYPE,TOTALDMG), stat="identity") +
  xlab("Weather Events") + 
  scale_y_continuous("Total Damage in Dollars",labels = comma) + 
  ggtitle("Total Damage in Dollars in the U.S,\n by Weather Events from 1995 to 2011") + 
  theme

Results

Across the United States, which types of events are most harmful with respect to population health?

It’s safe to say that TORNADOS are the most harmful events, followed by EXCESIVE HEAT.

Across the United States, which types of events have the greatest economic consequences?

It’s safe to say that FLOODS are the most harmful events, followed by HURRICANE/TYPHOON