Exploring the NOAA storm database

Synopsis

This document will show that RIP CURRENT and EXCEESIVE HEAT are the events that most people killed in average by each time they happen. Also, the events that have killed more people between 2002 and November 2011 are TORNADO and again EXCEESIVE HEAT.

On the other hand, the event that more economic damage produces each time it occurs is HURRICANE/TYPHOON with $817 million of damage, followed by STORM SURGE and HURRICANE. However, summarizing the data from 2002 up to November 2011, the event FLOOD leads the damage with a total economic damage of $13.7 billions.

Data Processing

The data for this analysis was downloaded from the NOAA web site (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). Then it was extracted and preprocessed using R Studio. In the next section we will see how these tasks are performed in detail:

Load the libraries that will help us with the process and set a couple of options for the R environment

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(xtable)

## Warning: package 'xtable' was built under R version 3.1.3

library(reshape)

## Warning: package 'reshape' was built under R version 3.1.3

## 
## Attaching package: 'reshape'
## 
## The following object is masked from 'package:dplyr':
## 
##     rename

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.1.3

Sys.setlocale(category = "LC_ALL", locale = "C")

## [1] "C"

options(scipen = 20)

Loading the data

Here we have downloaded the file from NOAA and copied to the project folder. Then we load the file using R:

# download.file("https://d396qusza40orc.cloudfront.net/repdata-data-StormData.csv.bz2", "repdata-data-StormData.csv.bz2")
data= read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors=FALSE)

Let’s look at the data loaded:

str(data)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Preprocessing the data

In the following section, we will convert the date column to date format. Also we will filter our data set to include only the records from 2002. In this way, we will make sure that we work with relatively recent data as well as avoid some normalization of old data. Also, we will work only with events that have 30 or more observations in order to have a better statistical relevance of our analysis.

#converting dates:
data$BGN_DATE=as.Date(data$BGN_DATE, "%m/%d/%Y")

#Using only roughly 10 years for the analysis
data=dplyr::filter(data, BGN_DATE>='2002-01-01')

#also we will remove the events where we have less than 30 observations:
enoughtDataEvents=dplyr::group_by(data, EVTYPE) %>% dplyr::summarise(n=n()) %>% filter(n>=30)
data=dplyr::inner_join(data, enoughtDataEvents, by="EVTYPE")

Let’s see now how is the data related to economic damage:

table(data$PROPDMGEXP)

## 
##             0      B      K      M 
## 141242      1     27 307171   4911

summary(data$PROPDMG)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   11.42    1.00 5000.00

table(data$CROPDMGEXP)

## 
##             B      K      M 
## 185500      4 266829   1019

summary(data$CROPDMG)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.689   0.000 990.000

Looking at the economic damage columns (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP), we see that we can intuitively understand the values (K=thousands, M=millions, B=billions). However, we can see an odd value of “0” in the PROPDMGEXP column, for 1 observation. We will remove this observation. Also, we normalize the event “RIP CURRENT”, which is also written as “RIP CURRENTS” in the database.

data=dplyr::filter(data, PROPDMGEXP!="0")
data$EVTYPE=ifelse(grepl("RIP CURRENT", data$EVTYPE), "RIP CURRENT", data$EVTYPE)

We can assume that the total economic damage can be expressed by the sum of the different types of damages: PROPDMG and CROPDMG. For this analysis we will transform those values according to the columns PROPDMGEXP and CROPDMGEXP. The values in those columns (“K”, “M”, and “B”) indicate that the damage is in thousands, millions or billions of dollars, respectively. Also, we will scale the final values to a magnitude of thousands of dollars.

data$PROPDMG2=ifelse(data$PROPDMGEXP=="K", data$PROPDMG, 
                ifelse(data$PROPDMGEXP=="M", data$PROPDMG*1000,
                       ifelse(data$PROPDMGEXP=="B", data$PROPDMG*1000000, 
                              data$PROPDMG/1000 #normalize to thousands of dollars
                              )))

data$CROPDMG2=ifelse(data$CROPDMGEXP=="K", data$CROPDMG, 
                ifelse(data$CROPDMGEXP=="M", data$CROPDMG*1000,
                       ifelse(data$CROPDMGEXP=="B", data$CROPDMG*1000000, 
                              data$CROPDMG/1000 #normalize to thousands of dollars 
                              )))

Results

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Lets look at the events with most fatalities per each occurrence:

tmp=dplyr::group_by(data, EVTYPE) %>%
    dplyr::summarize(FATALITIES=sum(FATALITIES, na.rm=T), 
                     INJURIES=sum(INJURIES, na.rm=T), 
                     Events=n()) %>%
    dplyr::mutate(FatalitiesPerEvent=(FATALITIES/Events))

tmp=as.data.frame(tmp)
tmp=dplyr::arrange(tmp, -FatalitiesPerEvent)

print(xtable(head(tmp,10)),type = "html", comment=F)

	EVTYPE	FATALITIES	INJURIES	Events	FatalitiesPerEvent
1	RIP CURRENT	384.00	290.00	492	0.78
2	EXCESSIVE HEAT	691.00	2797.00	901	0.77
3	HURRICANE/TYPHOON	64.00	1275.00	88	0.73
4	AVALANCHE	145.00	103.00	274	0.53
5	HEAT	229.00	1222.00	709	0.32
6	MARINE STRONG WIND	14.00	22.00	48	0.29
7	FOG	22.00	146.00	84	0.26
8	EXTREME COLD	13.00	0.00	69	0.19
9	HEAVY SURF/HIGH SURF	42.00	48.00	228	0.18
10	COLD/WIND CHILL	95.00	12.00	539	0.18

Now, lets see the events that contribute more to fatalities, regardless the number of events:

tmp=dplyr::arrange(tmp, -FATALITIES)
print(xtable(head(tmp,10)),type = "html", comment=F)

	EVTYPE	FATALITIES	INJURIES	Events	FatalitiesPerEvent
1	TORNADO	1112.00	13588.00	15167	0.07
2	EXCESSIVE HEAT	691.00	2797.00	901	0.77
3	FLASH FLOOD	539.00	517.00	36031	0.01
4	RIP CURRENT	384.00	290.00	492	0.78
5	LIGHTNING	370.00	2250.00	7899	0.05
6	FLOOD	247.00	301.00	18071	0.01
7	HEAT	229.00	1222.00	709	0.32
8	AVALANCHE	145.00	103.00	274	0.53
9	THUNDERSTORM WIND	130.00	1400.00	81402	0.00
10	EXTREME COLD/WIND CHILL	125.00	24.00	1002	0.12

We can see that tornado events have taken the biggest number of lives between 2002 and the end of 2011. However, if we look at the deaths per each occurrence of the events, the more dangerous in terms of fatalities are rip currents.

Now, lets see the events with most injuries per each occurrence:

tmp=dplyr::mutate(tmp, InjuriesPerEvent=(INJURIES/Events)) %>%
    dplyr::arrange(-InjuriesPerEvent) %>%
    dplyr::select(-FatalitiesPerEvent)
print(xtable(head(tmp,10)),type = "html", comment=F)

	EVTYPE	FATALITIES	INJURIES	Events	InjuriesPerEvent
1	HURRICANE/TYPHOON	64.00	1275.00	88	14.49
2	EXCESSIVE HEAT	691.00	2797.00	901	3.10
3	FOG	22.00	146.00	84	1.74
4	HEAT	229.00	1222.00	709	1.72
5	TORNADO	1112.00	13588.00	15167	0.90
6	DUST STORM	9.00	209.00	342	0.61
7	RIP CURRENT	384.00	290.00	492	0.59
8	MARINE STRONG WIND	14.00	22.00	48	0.46
9	TROPICAL STORM	27.00	261.00	584	0.45
10	WILD/FOREST FIRE	1.00	140.00	345	0.41

And also, the injuries regardless the number of occurrence of the events

tmp=dplyr::arrange(tmp, -INJURIES)
print(xtable(head(tmp,10)),type = "html", comment=F)

	EVTYPE	FATALITIES	INJURIES	Events	InjuriesPerEvent
1	TORNADO	1112.00	13588.00	15167	0.90
2	EXCESSIVE HEAT	691.00	2797.00	901	3.10
3	LIGHTNING	370.00	2250.00	7899	0.28
4	THUNDERSTORM WIND	130.00	1400.00	81402	0.02
5	HURRICANE/TYPHOON	64.00	1275.00	88	14.49
6	HEAT	229.00	1222.00	709	1.72
7	TSTM WIND	77.00	1146.00	61122	0.02
8	WILDFIRE	75.00	911.00	2732	0.33
9	FLASH FLOOD	539.00	517.00	36031	0.01
10	HIGH WIND	97.00	486.00	14749	0.03

2. Across the United States, which types of events have the greatest economic consequences?

Now we will see which type of events have the greatest damage in terms of property and corp damage.

Lets look at the events with the biggest total damage per each occurrence of the event. The damage is expressed in thousands of dollars :

tmp=dplyr::mutate(data, TotalDamage=PROPDMG2+CROPDMG2) %>%
    dplyr::group_by(EVTYPE) %>%
    dplyr::summarize(PropertyDamage=sum(PROPDMG2, na.rm=T), 
                     CropDamage=sum(CROPDMG2, na.rm=T),
                     TotalDamage=sum(TotalDamage, na.rm=T),
                     Events=n()) %>%
    dplyr::mutate(TotalDamagePerEvent=(TotalDamage/Events))

tmp=as.data.frame(tmp)
tmp=dplyr::arrange(tmp, -TotalDamagePerEvent)

print(xtable(head(tmp,10)),type = "html", comment=F)

	EVTYPE	PropertyDamage	CropDamage	TotalDamage	Events	TotalDamagePerEvent
1	HURRICANE/TYPHOON	69305840.00	2607872.80	71913712.80	88	817201.28
2	STORM SURGE	43168315.00	0.00	43168315.00	103	419109.85
3	HURRICANE	3036855.01	448510.00	3485365.01	40	87134.13
4	STORM SURGE/TIDE	4641188.00	850.00	4642038.00	148	31365.12
5	FLOOD	133387648.53	3591907.40	136979555.93	18071	7580.08
6	TROPICAL STORM	2008360.55	410061.00	2418421.55	584	4141.13
7	DROUGHT	845958.00	5423625.00	6269583.00	1845	3398.15
8	WILDFIRE	4758667.00	295472.80	5054139.80	2732	1849.98
9	ICE STORM	1964717.80	8665.00	1973382.80	1267	1557.52
10	TORNADO	18406922.66	220589.91	18627512.57	15167	1228.16

Now, lets see the events that contribute more to the total damage, regardless the number of events. The damage is expressed in thousands of dollars:

tmp=dplyr::arrange(tmp, -TotalDamage)
print(xtable(head(tmp,10)),type = "html", comment=F)

	EVTYPE	PropertyDamage	CropDamage	TotalDamage	Events	TotalDamagePerEvent
1	FLOOD	133387648.53	3591907.40	136979555.93	18071	7580.08
2	HURRICANE/TYPHOON	69305840.00	2607872.80	71913712.80	88	817201.28
3	STORM SURGE	43168315.00	0.00	43168315.00	103	419109.85
4	TORNADO	18406922.66	220589.91	18627512.57	15167	1228.16
5	FLASH FLOOD	10709402.71	812514.00	11521916.71	36031	319.78
6	HAIL	9174277.52	1393287.90	10567565.42	142257	74.29
7	DROUGHT	845958.00	5423625.00	6269583.00	1845	3398.15
8	HIGH WIND	4833335.91	494272.00	5327607.91	14749	361.22
9	WILDFIRE	4758667.00	295472.80	5054139.80	2732	1849.98
10	STORM SURGE/TIDE	4641188.00	850.00	4642038.00	148	31365.12

Here we can see that in total, the type of event that produced more damage from 2002 and the end of 2011 were floods. They damage almost twice than hurricane/typhons. However, the last ones produced a lot more damage if we look at them as the average damage for each occuerrence of the event.

Finally, we will compare the economical damage in thousands of dollars, and the fatalities for a selected group of events:

mainEvents=c("AVALANCHE","DROUGHT","DUST STORM","EXCESSIVE HEAT",
             "EXTREME COLD/WIND CHILL","FLASH FLOOD","FLOOD","FOG",
             "HAIL","HEAT","HIGH WIND","HURRICANE/TYPHOON","LIGHTNING",
             "MARINE STRONG WIND","RIP CURRENT","RIP CURRENTS","STORM SURGE",
             "STORM SURGE/TIDE","THUNDERSTORM WIND","TORNADO","TROPICAL STORM","WILDFIRE")

tmp=data[,]

tmp <- melt(tmp[,c("EVTYPE", "PROPDMG2", "CROPDMG2")], id.vars=c("EVTYPE"))
colnames(tmp)[2:3]=c("Type", "T.Dollars")

tmp=dplyr::group_by(tmp, EVTYPE, Type) %>%
    dplyr::summarize(T.Dollars=sum(T.Dollars, na.rm=T))

tmp=tmp[tmp$EVTYPE %in% mainEvents,]


g1 = ggplot(tmp, aes(x = EVTYPE, y=T.Dollars, fill=Type)) + 
    geom_bar(alpha=0.7, stat="identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    coord_flip()+
    ggtitle("Economic cost per type of event (Thousands U$)")

#Now the human cost:
tmp=data[,]

tmp <- melt(tmp[,c("EVTYPE", "FATALITIES")], id.vars=c("EVTYPE"))
colnames(tmp)[2:3]=c("Type", "T.Dollars")

tmp=dplyr::group_by(data, EVTYPE) %>%
    dplyr::summarize(FATALITIES=sum(FATALITIES, na.rm=T), 
                     INJURIES=sum(INJURIES, na.rm=T))

tmp=tmp[tmp$EVTYPE %in% mainEvents,]

g2 = ggplot(tmp, aes(x = EVTYPE, y=FATALITIES)) + 
    geom_bar(alpha=0.7, stat="identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    coord_flip()+
    ggtitle("Fatalities per type of event")

g3 = ggplot(tmp, aes(x = EVTYPE, y=INJURIES)) + 
    geom_bar(alpha=0.7, stat="identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    coord_flip()+
    ggtitle("Injuries per type of event")

g1 #Economic Cost

g2 #Fatalities

g3 #Injuries