Synopsis

Study Goal

Natural hazards such as tornadoes, floods and hurricanes cause severe human and economic damage. The U.S. National Oceanic and Atmospheric Administration (NOAA) has maintained a database of these events since the 1950s.

The goal of this study is to answer two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation, URL: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

National Climatic Data Center Storm Events FAQ, URL: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

Loading and preprocessing the data

  1. The data first need to be downloaded from the course web site, at the URL https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
WebURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(WebURL,destfile="./repdata-data-StormData.csv.bz2",method="curl")
  2. Let’s load the data into R
StormData <- read.table(bzfile("./repdata-data-StormData.csv.bz2"),sep=",", header=TRUE, stringsAsFactors=FALSE, na.strings="")
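
Since the archive is large, it can be convenient to skip the download when the file is already on disk. The following is only a minimal sketch of that idea; `fetch_once` is a hypothetical helper name introduced here and is not part of the original analysis.

```r
## Hypothetical helper: download the archive only when it is not already on disk.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    ## mode = "wb" keeps the compressed file intact on every platform
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

## Intended use with the URL and file name from this report (not run here):
## fetch_once("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
##            "./repdata-data-StormData.csv.bz2")
```

With this guard, re-running the report does not fetch the file again unless it has been deleted.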

Process/transform the data (if necessary) into a format suitable for your analysis

Let’s have a look at the data with the str function

str(StormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  NA NA NA NA ...
##  $ BGN_LOCATI: chr  NA NA NA NA ...
##  $ END_DATE  : chr  NA NA NA NA ...
##  $ END_TIME  : chr  NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  NA NA NA NA ...
##  $ END_LOCATI: chr  NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...
##  $ WFO       : chr  NA NA NA NA ...
##  $ STATEOFFIC: chr  NA NA NA NA ...
##  $ ZONENAMES : chr  NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

For our analysis, we will need the following variables:

  • BGN_DATE: date of the hazard (needs to be formatted as a date)
  • EVTYPE: the type of hazard (tornado, hurricane, flood, etc.)
  • FATALITIES: number of deaths related to a specific event
  • INJURIES: number of injured people related to a specific event
  • PROPDMG: property damage in USD related to a specific event (needs to be multiplied by the factor in the PROPDMGEXP column)
  • PROPDMGEXP: property damage factor (2=100, K=1000, 6=1000000, M=1000000, etc.; this needs to be reformatted as well)
  • CROPDMG: crop damage in USD related to a specific event (needs to be multiplied by the factor in the CROPDMGEXP column)
  • CROPDMGEXP: crop damage factor (same coding as PROPDMGEXP)

Let’s load the libraries we will need

require(lubridate)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
require(plyr)
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:lubridate':
## 
##     here
require(ggplot2)
## Loading required package: ggplot2

We can also reduce the number of variables to keep only the ones we are interested in

## Keep only the columns we are interested in
StormDataShort <- StormData[,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
head(StormDataShort)
##             BGN_DATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1  4/18/1950 0:00:00 TORNADO          0       15    25.0          K
## 2  4/18/1950 0:00:00 TORNADO          0        0     2.5          K
## 3  2/20/1951 0:00:00 TORNADO          0        2    25.0          K
## 4   6/8/1951 0:00:00 TORNADO          0        2     2.5          K
## 5 11/15/1951 0:00:00 TORNADO          0        2     2.5          K
## 6 11/15/1951 0:00:00 TORNADO          0        6     2.5          K
##   CROPDMG CROPDMGEXP
## 1       0       <NA>
## 2       0       <NA>
## 3       0       <NA>
## 4       0       <NA>
## 5       0       <NA>
## 6       0       <NA>

As the ‘BGN_DATE’ variable is not formatted as a date, let’s convert it. We will keep only the year information.

##Add a new column 'year' to ease the analysis
StormDataShort$year <-  year(strptime(StormDataShort$BGN_DATE,format='%m/%d/%Y %H:%M:%S'))
head(StormDataShort)
##             BGN_DATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1  4/18/1950 0:00:00 TORNADO          0       15    25.0          K
## 2  4/18/1950 0:00:00 TORNADO          0        0     2.5          K
## 3  2/20/1951 0:00:00 TORNADO          0        2    25.0          K
## 4   6/8/1951 0:00:00 TORNADO          0        2     2.5          K
## 5 11/15/1951 0:00:00 TORNADO          0        2     2.5          K
## 6 11/15/1951 0:00:00 TORNADO          0        6     2.5          K
##   CROPDMG CROPDMGEXP year
## 1       0       <NA> 1950
## 2       0       <NA> 1950
## 3       0       <NA> 1951
## 4       0       <NA> 1951
## 5       0       <NA> 1951
## 6       0       <NA> 1951
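
A quick sanity check after this kind of conversion is to count the entries that failed to parse. The sketch below uses base R only and a few invented strings in the same layout as BGN_DATE; the vector names `bgn`, `parsed` and `years` are illustrative and do not come from the analysis above.

```r
## Toy values in the same "month/day/year hour:minute:second" layout as BGN_DATE
bgn <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00", "not a date")

## as.Date() stops reading once the format is satisfied, so the trailing time is ignored
parsed <- as.Date(bgn, format = "%m/%d/%Y")
years  <- as.integer(format(parsed, "%Y"))

## Entries that could not be parsed become NA and can be counted
n_failed <- sum(is.na(years))
n_failed
```

A non-zero count would indicate rows whose date does not follow the expected layout.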

The number of victims is obtained by summing ‘FATALITIES’ and ‘INJURIES’ variables.

## Add a new column 'victims' which is the sum of injuries + fatalities
StormDataShort$victims <- StormDataShort$INJURIES+StormDataShort$FATALITIES

The calculation of the economic impact is a little more complicated. Costs are split into two categories: property damage (PROPDMG) and crop damage (CROPDMG). For both categories, the amounts have to be multiplied by an order-of-magnitude factor stored in the variables ‘PROPDMGEXP’ and ‘CROPDMGEXP’. These factors are coded in several ways and need to be converted to numbers:

  • digits (from 0 to 9), representing the number of zeros (a power of ten)
  • “H” for hundred, “K” for thousand, “M” for million and “B” for billion (some of these letters also appear in lower case)
  • other symbols (“+”, “-”, “?”) and missing values, which are treated here as a factor of 1

Finally we will re-arrange the dataset to make it easily readable

## Let's store the unique 'PROPDMGEXP' codes in a variable (they come in order of first appearance):
## "K" "M" NA "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
PropDmgMagn <- unique(StormDataShort$PROPDMGEXP)
## Let's store the unique 'CROPDMGEXP' codes in a variable:
## NA "M" "K" "m" "B" "?" "0" "k" "2"
CropDmgMagn <- unique(StormDataShort$CROPDMGEXP)
## We can use the 'mapvalues' function to replace these mixed codes by numeric factors.
## The factors must be listed in the same order as the codes above ("K" and "k" are 10^3).
PropDmgMagnNew <- c(10^3,10^6,10^0,10^9,10^6,10^0,10^0,10^5,10^6,10^0,10^4,10^2,10^3,10^2,10^7,10^2,10^0,10^1,10^8)
StormDataShort$PROPDMGEXP <- as.numeric(mapvalues(StormDataShort$PROPDMGEXP,PropDmgMagn, PropDmgMagnNew))
CropDmgMagnNew <- c(10^0,10^6,10^3,10^6,10^9,10^0,10^0,10^3,10^2)
StormDataShort$CROPDMGEXP <- as.numeric(mapvalues(StormDataShort$CROPDMGEXP,CropDmgMagn, CropDmgMagnNew))

## We can then calculate the costs as new variables (amount multiplied by its factor)
StormDataShort$PropertyDamageCost <- StormDataShort$PROPDMG*StormDataShort$PROPDMGEXP
StormDataShort$CropCost <- StormDataShort$CROPDMG*StormDataShort$CROPDMGEXP
StormDataShort$TotalCost <- StormDataShort$PropertyDamageCost+StormDataShort$CropCost

## Finally, re-arrange the table (columns selected by name rather than by position)
StormDatTidy<-StormDataShort[,c("year","EVTYPE","victims","PropertyDamageCost","CropCost","TotalCost")]
colnames(StormDatTidy)[2] <- "eventType"
head(StormDatTidy)
##   year eventType victims PropertyDamageCost CropCost TotalCost
## 1 1950   TORNADO      15              25000        0     25000
## 2 1950   TORNADO       0               2500        0      2500
## 3 1951   TORNADO       2              25000        0     25000
## 4 1951   TORNADO       2               2500        0      2500
## 5 1951   TORNADO       2               2500        0      2500
## 6 1951   TORNADO       6               2500        0      2500
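
Listing the multipliers by position depends on the order in which the codes happen to appear in the data. As a possible alternative, the conversion can be written as a lookup by code. This is only a sketch under the conventions described above (digits are powers of ten, unknown symbols and missing values count as 1); `exp_to_multiplier` and the data frame `toy` are hypothetical names with invented values.

```r
## Hypothetical helper: turn an exponent code into a numeric multiplier.
## Assumption: unknown symbols ("+", "-", "?") and missing codes count as 1.
exp_to_multiplier <- function(code) {
  letters_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  known <- !is.na(code)
  code  <- toupper(code)
  out   <- rep(1, length(code))
  ## Single digits are read as powers of ten
  is_digit <- known & grepl("^[0-9]$", code)
  out[is_digit] <- 10^as.numeric(code[is_digit])
  ## Letters are looked up by name, regardless of case
  is_letter <- known & code %in% names(letters_map)
  out[is_letter] <- letters_map[code[is_letter]]
  out
}

## Toy data with invented values
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3),
                  PROPDMGEXP = c("K", "m", NA, "5"),
                  stringsAsFactors = FALSE)
toy$PropertyDamageCost <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy$PropertyDamageCost
```

The lookup does not change if rows are reordered or if a code is absent from the data, which makes the mapping easier to audit.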

Results

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

## Let's sum the number of victims per event type
VictimsPerEventType <- aggregate(victims ~ eventType, data = StormDatTidy, sum)
VictimsPerEventType <- VictimsPerEventType[order(-VictimsPerEventType$victims),]
nrow(VictimsPerEventType)
## [1] 985
## As there are 985 event types, let's focus on the 20 most harmful ones
Top20Victim <- head(VictimsPerEventType,20)
## Let's make a barplot

g <- ggplot(data=Top20Victim,aes(eventType,victims))
g <- g+geom_bar(stat="identity")+ 
        scale_x_discrete(limits=Top20Victim$eventType)
g <- g+labs(y="Cumulated number of victims", x="Event type",title="20 most Harmful Events for Population Health in the US (1950-2011)") + theme(axis.text.x = element_text(angle=90))
g

We can now answer the question: tornadoes are by far the most harmful natural hazard in the US in terms of the number of victims (injuries plus fatalities), with nearly 100,000 victims. Tornado damage has been reported since 1950, which is most probably not the case for other types of natural hazards; this longer record might explain part of the contrast in the results.
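
One way to probe whether the length of the reporting period drives this ranking is to repeat the aggregation on recent years only. The sketch below uses a small invented data frame with the same columns as the tidy table; the cut-off year 1996 and all the numbers are assumptions chosen for illustration, not results of this study.

```r
## Toy stand-in for the tidy table (invented numbers, for illustration only)
toy <- data.frame(
  year      = c(1950, 1975, 1998, 2005, 1998, 2011),
  eventType = c("TORNADO", "TORNADO", "TORNADO", "EXCESSIVE HEAT", "FLOOD", "EXCESSIVE HEAT"),
  victims   = c(500, 300, 100, 250, 80, 60),
  stringsAsFactors = FALSE
)

## Assumption: 1996 is an arbitrary cut-off year for this sketch
recent <- subset(toy, year >= 1996)

## Same aggregation as above, restricted to the recent period
byType <- aggregate(victims ~ eventType, data = recent, sum)
byType <- byType[order(-byType$victims), ]
byType
```

If the ranking on the restricted period differs from the ranking on the full period, the difference points to uneven historical coverage rather than to the hazards themselves.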

Across the United States, which types of events have the greatest economic consequences?

Regarding economic consequences, the sum of property damage and crop damage costs is taken into account here. As for the previous question, we focus on the 20 most costly natural hazards in the US for the 1950-2011 period.

## Let's sum the total costs per event type
CostsPerEventType <- aggregate(TotalCost ~ eventType, data = StormDatTidy, sum)
CostsPerEventType <- CostsPerEventType[order(-CostsPerEventType$TotalCost),]
nrow(CostsPerEventType)
## [1] 985
## As there are 985 event types, let's focus on the 20 most costly ones
Top20Costs <- head(CostsPerEventType,20)
## Let's make a barplot

g <- ggplot(data=Top20Costs,aes(eventType,TotalCost))
g <- g+geom_bar(stat="identity")+ 
        scale_x_discrete(limits=Top20Costs$eventType)
g <- g+labs(y="Cumulated Damage Costs (Property + Crop, USD)", x="Event type",title="20 Most Economically Harmful Natural Events in the US (1950-2011)") + theme(axis.text.x = element_text(angle=90))
g

To answer the question: hurricanes have the greatest economic consequences, followed by floods and droughts. Tornadoes rank fourth, although they rank first in terms of cumulated recorded victims.

Conclusions and Perspectives

These datasets are most probably incomplete. The variables were probably not all recorded in the same way from 1950 to 2011. This is a common problem with historical data, and it is not always possible to complete a dataset in the absence of concrete records. With nearly 1000 different event types, we might also consider simplifying the classification of natural hazards (10 to 20 classes, not more).
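
Such a simplification could start from pattern matching on the event labels. The following is a minimal sketch only; `simplify_event`, the patterns and the class names are illustrative assumptions and not an official classification.

```r
## Hypothetical helper: normalise free-text event labels and map them to broad classes.
## The patterns and class names are illustrative assumptions.
simplify_event <- function(x) {
  x <- toupper(trimws(x))
  out <- rep("OTHER", length(x))
  out[grepl("TORNADO", x)] <- "TORNADO"
  out[grepl("FLOOD", x)] <- "FLOOD"
  out[grepl("HURRICANE|TYPHOON", x)] <- "HURRICANE"
  out[grepl("HEAT", x)] <- "HEAT"
  out
}

## Invented labels showing mixed case, stray spaces and compound names
labels <- c("TORNADO", " Flash Flood", "HURRICANE/TYPHOON", "excessive heat", "HAIL")
simplified <- simplify_event(labels)
simplified
```

Applying such a function to the event type column before aggregating would merge spelling variants into one class, at the cost of choices (which patterns, which classes) that would need to be justified.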