Synopsis

This document will show that RIP CURRENT and EXCEESIVE HEAT are the events that most people killed in average by each time they happen. Also, the events that have killed more people between 2002 and November 2011 are TORNADO and again EXCEESIVE HEAT.

On the other hand, the event that more economic damage produces each time it occurs is HURRICANE/TYPHOON with $817 million of damage, followed by STORM SURGE and HURRICANE. However, summarizing the data from 2002 up to November 2011, the event FLOOD leads the damage with a total economic damage of $13.7 billions.

Data Processing

The data for this analysis was downloaded from the NOAA web site (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). Then it was extracted and preprocessed using R Studio. In the next section we will see how these tasks are performed in detail:

Load the libraries that will help us with the process and set a couple of options for the R environment

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(xtable)
## Warning: package 'xtable' was built under R version 3.1.3
library(reshape)
## Warning: package 'reshape' was built under R version 3.1.3
## 
## Attaching package: 'reshape'
## 
## The following object is masked from 'package:dplyr':
## 
##     rename
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.3
Sys.setlocale(category = "LC_ALL", locale = "C") 
## [1] "C"
options(scipen = 20)

Loading the data

Here we have downloaded the file from NOAA and copied to the project folder. Then we load the file using R:

# download.file("https://d396qusza40orc.cloudfront.net/repdata-data-StormData.csv.bz2", "repdata-data-StormData.csv.bz2")
data= read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors=FALSE)

Let’s look at the data loaded:

str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Preprocessing the data

In the following section, we will convert the date column to date format. Also we will filter our data set to include only the records from 2002. In this way, we will make sure that we work with relatively recent data as well as avoid some normalization of old data. Also, we will work only with events that have 30 or more observations in order to have a better statistical relevance of our analysis.

#converting dates:
data$BGN_DATE=as.Date(data$BGN_DATE, "%m/%d/%Y")

#Using only roughly 10 years for the analysis
data=dplyr::filter(data, BGN_DATE>='2002-01-01')

#also we will remove the events where we have less than 30 observations:
enoughtDataEvents=dplyr::group_by(data, EVTYPE) %>% dplyr::summarise(n=n()) %>% filter(n>=30)
data=dplyr::inner_join(data, enoughtDataEvents, by="EVTYPE")

Let’s see now how is the data related to economic damage:

table(data$PROPDMGEXP)
## 
##             0      B      K      M 
## 141242      1     27 307171   4911
summary(data$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   11.42    1.00 5000.00
table(data$CROPDMGEXP)
## 
##             B      K      M 
## 185500      4 266829   1019
summary(data$CROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.689   0.000 990.000

Looking at the economic damage columns (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP), we see that we can intuitively understand the values (K=thousands, M=millions, B=billions). However, we can see an odd value of “0” in the PROPDMGEXP column, for 1 observation. We will remove this observation. Also, we normalize the event “RIP CURRENT”, which is also written as “RIP CURRENTS” in the database.

data=dplyr::filter(data, PROPDMGEXP!="0")
data$EVTYPE=ifelse(grepl("RIP CURRENT", data$EVTYPE), "RIP CURRENT", data$EVTYPE)

We can assume that the total economic damage can be expressed by the sum of the different types of damages: PROPDMG and CROPDMG. For this analysis we will transform those values according to the columns PROPDMGEXP and CROPDMGEXP. The values in those columns (“K”, “M”, and “B”) indicate that the damage is in thousands, millions or billions of dollars, respectively. Also, we will scale the final values to a magnitude of thousands of dollars.

data$PROPDMG2=ifelse(data$PROPDMGEXP=="K", data$PROPDMG, 
                ifelse(data$PROPDMGEXP=="M", data$PROPDMG*1000,
                       ifelse(data$PROPDMGEXP=="B", data$PROPDMG*1000000, 
                              data$PROPDMG/1000 #normalize to thousands of dollars
                              )))

data$CROPDMG2=ifelse(data$CROPDMGEXP=="K", data$CROPDMG, 
                ifelse(data$CROPDMGEXP=="M", data$CROPDMG*1000,
                       ifelse(data$CROPDMGEXP=="B", data$CROPDMG*1000000, 
                              data$CROPDMG/1000 #normalize to thousands of dollars 
                              )))

Results

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Lets look at the events with most fatalities per each occurrence:

tmp=dplyr::group_by(data, EVTYPE) %>%
    dplyr::summarize(FATALITIES=sum(FATALITIES, na.rm=T), 
                     INJURIES=sum(INJURIES, na.rm=T), 
                     Events=n()) %>%
    dplyr::mutate(FatalitiesPerEvent=(FATALITIES/Events))

tmp=as.data.frame(tmp)
tmp=dplyr::arrange(tmp, -FatalitiesPerEvent)

print(xtable(head(tmp,10)),type = "html", comment=F)
EVTYPE FATALITIES INJURIES Events FatalitiesPerEvent
1 RIP CURRENT 384.00 290.00 492 0.78
2 EXCESSIVE HEAT 691.00 2797.00 901 0.77
3 HURRICANE/TYPHOON 64.00 1275.00 88 0.73
4 AVALANCHE 145.00 103.00 274 0.53
5 HEAT 229.00 1222.00 709 0.32
6 MARINE STRONG WIND 14.00 22.00 48 0.29
7 FOG 22.00 146.00 84 0.26
8 EXTREME COLD 13.00 0.00 69 0.19
9 HEAVY SURF/HIGH SURF 42.00 48.00 228 0.18
10 COLD/WIND CHILL 95.00 12.00 539 0.18

Now, lets see the events that contribute more to fatalities, regardless the number of events:

tmp=dplyr::arrange(tmp, -FATALITIES)
print(xtable(head(tmp,10)),type = "html", comment=F)
EVTYPE FATALITIES INJURIES Events FatalitiesPerEvent
1 TORNADO 1112.00 13588.00 15167 0.07
2 EXCESSIVE HEAT 691.00 2797.00 901 0.77
3 FLASH FLOOD 539.00 517.00 36031 0.01
4 RIP CURRENT 384.00 290.00 492 0.78
5 LIGHTNING 370.00 2250.00 7899 0.05
6 FLOOD 247.00 301.00 18071 0.01
7 HEAT 229.00 1222.00 709 0.32
8 AVALANCHE 145.00 103.00 274 0.53
9 THUNDERSTORM WIND 130.00 1400.00 81402 0.00
10 EXTREME COLD/WIND CHILL 125.00 24.00 1002 0.12

We can see that tornado events have taken the biggest number of lives between 2002 and the end of 2011. However, if we look at the deaths per each occurrence of the events, the more dangerous in terms of fatalities are rip currents.

Now, lets see the events with most injuries per each occurrence:

tmp=dplyr::mutate(tmp, InjuriesPerEvent=(INJURIES/Events)) %>%
    dplyr::arrange(-InjuriesPerEvent) %>%
    dplyr::select(-FatalitiesPerEvent)
print(xtable(head(tmp,10)),type = "html", comment=F)
EVTYPE FATALITIES INJURIES Events InjuriesPerEvent
1 HURRICANE/TYPHOON 64.00 1275.00 88 14.49
2 EXCESSIVE HEAT 691.00 2797.00 901 3.10
3 FOG 22.00 146.00 84 1.74
4 HEAT 229.00 1222.00 709 1.72
5 TORNADO 1112.00 13588.00 15167 0.90
6 DUST STORM 9.00 209.00 342 0.61
7 RIP CURRENT 384.00 290.00 492 0.59
8 MARINE STRONG WIND 14.00 22.00 48 0.46
9 TROPICAL STORM 27.00 261.00 584 0.45
10 WILD/FOREST FIRE 1.00 140.00 345 0.41

And also, the injuries regardless the number of occurrence of the events

tmp=dplyr::arrange(tmp, -INJURIES)
print(xtable(head(tmp,10)),type = "html", comment=F)
EVTYPE FATALITIES INJURIES Events InjuriesPerEvent
1 TORNADO 1112.00 13588.00 15167 0.90
2 EXCESSIVE HEAT 691.00 2797.00 901 3.10
3 LIGHTNING 370.00 2250.00 7899 0.28
4 THUNDERSTORM WIND 130.00 1400.00 81402 0.02
5 HURRICANE/TYPHOON 64.00 1275.00 88 14.49
6 HEAT 229.00 1222.00 709 1.72
7 TSTM WIND 77.00 1146.00 61122 0.02
8 WILDFIRE 75.00 911.00 2732 0.33
9 FLASH FLOOD 539.00 517.00 36031 0.01
10 HIGH WIND 97.00 486.00 14749 0.03

2. Across the United States, which types of events have the greatest economic consequences?

Now we will see which type of events have the greatest damage in terms of property and corp damage.

Lets look at the events with the biggest total damage per each occurrence of the event. The damage is expressed in thousands of dollars :

tmp=dplyr::mutate(data, TotalDamage=PROPDMG2+CROPDMG2) %>%
    dplyr::group_by(EVTYPE) %>%
    dplyr::summarize(PropertyDamage=sum(PROPDMG2, na.rm=T), 
                     CropDamage=sum(CROPDMG2, na.rm=T),
                     TotalDamage=sum(TotalDamage, na.rm=T),
                     Events=n()) %>%
    dplyr::mutate(TotalDamagePerEvent=(TotalDamage/Events))

tmp=as.data.frame(tmp)
tmp=dplyr::arrange(tmp, -TotalDamagePerEvent)

print(xtable(head(tmp,10)),type = "html", comment=F)
EVTYPE PropertyDamage CropDamage TotalDamage Events TotalDamagePerEvent
1 HURRICANE/TYPHOON 69305840.00 2607872.80 71913712.80 88 817201.28
2 STORM SURGE 43168315.00 0.00 43168315.00 103 419109.85
3 HURRICANE 3036855.01 448510.00 3485365.01 40 87134.13
4 STORM SURGE/TIDE 4641188.00 850.00 4642038.00 148 31365.12
5 FLOOD 133387648.53 3591907.40 136979555.93 18071 7580.08
6 TROPICAL STORM 2008360.55 410061.00 2418421.55 584 4141.13
7 DROUGHT 845958.00 5423625.00 6269583.00 1845 3398.15
8 WILDFIRE 4758667.00 295472.80 5054139.80 2732 1849.98
9 ICE STORM 1964717.80 8665.00 1973382.80 1267 1557.52
10 TORNADO 18406922.66 220589.91 18627512.57 15167 1228.16

Now, lets see the events that contribute more to the total damage, regardless the number of events. The damage is expressed in thousands of dollars:

tmp=dplyr::arrange(tmp, -TotalDamage)
print(xtable(head(tmp,10)),type = "html", comment=F)
EVTYPE PropertyDamage CropDamage TotalDamage Events TotalDamagePerEvent
1 FLOOD 133387648.53 3591907.40 136979555.93 18071 7580.08
2 HURRICANE/TYPHOON 69305840.00 2607872.80 71913712.80 88 817201.28
3 STORM SURGE 43168315.00 0.00 43168315.00 103 419109.85
4 TORNADO 18406922.66 220589.91 18627512.57 15167 1228.16
5 FLASH FLOOD 10709402.71 812514.00 11521916.71 36031 319.78
6 HAIL 9174277.52 1393287.90 10567565.42 142257 74.29
7 DROUGHT 845958.00 5423625.00 6269583.00 1845 3398.15
8 HIGH WIND 4833335.91 494272.00 5327607.91 14749 361.22
9 WILDFIRE 4758667.00 295472.80 5054139.80 2732 1849.98
10 STORM SURGE/TIDE 4641188.00 850.00 4642038.00 148 31365.12

Here we can see that in total, the type of event that produced more damage from 2002 and the end of 2011 were floods. They damage almost twice than hurricane/typhons. However, the last ones produced a lot more damage if we look at them as the average damage for each occuerrence of the event.

Finally, we will compare the economical damage in thousands of dollars, and the fatalities for a selected group of events:

mainEvents=c("AVALANCHE","DROUGHT","DUST STORM","EXCESSIVE HEAT",
             "EXTREME COLD/WIND CHILL","FLASH FLOOD","FLOOD","FOG",
             "HAIL","HEAT","HIGH WIND","HURRICANE/TYPHOON","LIGHTNING",
             "MARINE STRONG WIND","RIP CURRENT","RIP CURRENTS","STORM SURGE",
             "STORM SURGE/TIDE","THUNDERSTORM WIND","TORNADO","TROPICAL STORM","WILDFIRE")

tmp=data[,]

tmp <- melt(tmp[,c("EVTYPE", "PROPDMG2", "CROPDMG2")], id.vars=c("EVTYPE"))
colnames(tmp)[2:3]=c("Type", "T.Dollars")

tmp=dplyr::group_by(tmp, EVTYPE, Type) %>%
    dplyr::summarize(T.Dollars=sum(T.Dollars, na.rm=T))

tmp=tmp[tmp$EVTYPE %in% mainEvents,]


g1 = ggplot(tmp, aes(x = EVTYPE, y=T.Dollars, fill=Type)) + 
    geom_bar(alpha=0.7, stat="identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    coord_flip()+
    ggtitle("Economic cost per type of event (Thousands U$)")

#Now the human cost:
tmp=data[,]

tmp <- melt(tmp[,c("EVTYPE", "FATALITIES")], id.vars=c("EVTYPE"))
colnames(tmp)[2:3]=c("Type", "T.Dollars")

tmp=dplyr::group_by(data, EVTYPE) %>%
    dplyr::summarize(FATALITIES=sum(FATALITIES, na.rm=T), 
                     INJURIES=sum(INJURIES, na.rm=T))

tmp=tmp[tmp$EVTYPE %in% mainEvents,]

g2 = ggplot(tmp, aes(x = EVTYPE, y=FATALITIES)) + 
    geom_bar(alpha=0.7, stat="identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    coord_flip()+
    ggtitle("Fatalities per type of event")

g3 = ggplot(tmp, aes(x = EVTYPE, y=INJURIES)) + 
    geom_bar(alpha=0.7, stat="identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    coord_flip()+
    ggtitle("Injuries per type of event")

g1 #Economic Cost

g2 #Fatalities

g3 #Injuries