Natural hazards such as tornadoes, floods and hurricanes cause severe human and economic damage. The U.S. National Oceanic and Atmospheric Administration (NOAA) has maintained a database of these events since the 1950s.
The goal of this study is to answer two questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded from the course web site, using the URL given in the download code below.
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation, url: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
National Climatic Data Center Storm Events FAQ, url: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
WebURL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(WebURL,destfile="./repdata-data-StormData.csv.bz2",method="curl")
StormData <- read.table(bzfile("./repdata-data-StormData.csv.bz2"),sep=",", header=TRUE, stringsAsFactors=FALSE, na.strings="")
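Since the compressed file is large, it may be worth skipping the download when a local copy already exists. The following is a minimal sketch of that idea; `fetch_if_missing` is a hypothetical helper name that does not appear in the original analysis.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is absent.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed file byte-for-byte intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Usage (not run here, since it needs network access):
# fetch_if_missing(WebURL, "./repdata-data-StormData.csv.bz2")
```

Re-running the report then reuses the cached file instead of fetching it again.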
Let’s have a look at the data with the str function.
str(StormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr NA NA NA NA ...
## $ BGN_LOCATI: chr NA NA NA NA ...
## $ END_DATE : chr NA NA NA NA ...
## $ END_TIME : chr NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr NA NA NA NA ...
## $ END_LOCATI: chr NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
## $ WFO : chr NA NA NA NA ...
## $ STATEOFFIC: chr NA NA NA NA ...
## $ ZONENAMES : chr NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
For our analysis, we will need the following variables: BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
Let’s load the libraries we will need.
require(lubridate)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
require(plyr)
## Loading required package: plyr
##
## Attaching package: 'plyr'
## The following object is masked from 'package:lubridate':
##
## here
require(ggplot2)
## Loading required package: ggplot2
We can also reduce the number of variables to keep only the ones we are interested in.
## Keep only the columns we are interested in
StormDataShort <- StormData[,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
head(StormDataShort)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 4/18/1950 0:00:00 TORNADO 0 15 25.0 K
## 2 4/18/1950 0:00:00 TORNADO 0 0 2.5 K
## 3 2/20/1951 0:00:00 TORNADO 0 2 25.0 K
## 4 6/8/1951 0:00:00 TORNADO 0 2 2.5 K
## 5 11/15/1951 0:00:00 TORNADO 0 2 2.5 K
## 6 11/15/1951 0:00:00 TORNADO 0 6 2.5 K
## CROPDMG CROPDMGEXP
## 1 0 <NA>
## 2 0 <NA>
## 3 0 <NA>
## 4 0 <NA>
## 5 0 <NA>
## 6 0 <NA>
As the ‘BGN_DATE’ variable is not formatted as a date, let’s convert it. We will keep only the year information.
## Add a new column 'year' to ease the analysis
StormDataShort$year <- year(strptime(StormDataShort$BGN_DATE,format='%m/%d/%Y %H:%M:%S'))
head(StormDataShort)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 4/18/1950 0:00:00 TORNADO 0 15 25.0 K
## 2 4/18/1950 0:00:00 TORNADO 0 0 2.5 K
## 3 2/20/1951 0:00:00 TORNADO 0 2 25.0 K
## 4 6/8/1951 0:00:00 TORNADO 0 2 2.5 K
## 5 11/15/1951 0:00:00 TORNADO 0 2 2.5 K
## 6 11/15/1951 0:00:00 TORNADO 0 6 2.5 K
## CROPDMG CROPDMGEXP year
## 1 0 <NA> 1950
## 2 0 <NA> 1950
## 3 0 <NA> 1951
## 4 0 <NA> 1951
## 5 0 <NA> 1951
## 6 0 <NA> 1951
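Once a year column is available, the claim that early years contain fewer records can be checked by counting events per year. The sketch below uses a small invented data frame (`toy`, not real records) and base R only, so it runs without the full dataset.

```r
# Toy stand-in for StormDataShort (values are illustrative, not real records)
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
               "6/8/1951 0:00:00", "11/15/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# Base-R year extraction; trailing time characters are ignored by the format
toy$year <- as.integer(format(as.Date(toy$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

# Number of recorded events per year
events_per_year <- table(toy$year)
events_per_year
```

Applied to the real data, the same `table()` call on `StormDataShort$year` would show how the number of records changes over time.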
The number of victims is obtained by summing the ‘FATALITIES’ and ‘INJURIES’ variables.
## Add a new column 'victims' which is the sum of injuries + fatalities
StormDataShort$victims <- StormDataShort$INJURIES+StormDataShort$FATALITIES
The calculation of the economic impact is a little more complicated. Costs are split into two categories: property damage (PROPDMG) and crop damage (CROPDMG). For both categories, the numbers have to be multiplied by a magnitude factor stored in the variables ‘PROPDMGEXP’ and ‘CROPDMGEXP’. These factors are recorded as a mix of letters, digits and symbols, so they first need to be converted to numeric multipliers.
Finally, we will re-arrange the dataset to make it easily readable.
## let's store the unique 'PROPDMGEXP' values in a variable
PropDmgMagn <- unique(StormDataShort$PROPDMGEXP)
## let's store the unique 'CROPDMGEXP' values in a variable
CropDmgMagn <- unique(StormDataShort$CROPDMGEXP)
## We can use the 'mapvalues' function to replace these mixed values by numeric ones as follows
## (the multipliers are listed in the order returned by unique(); "K"/"k" means thousands, i.e. 10^3):
PropDmgMagnNew <- c(10^3,10^6,10^0,10^9,10^6,10^0,10^0,10^5,10^6,10^0,10^4,10^2,10^3,10^2,10^7,10^2,10^0,10^1,10^8)
StormDataShort$PROPDMGEXP <- as.numeric(mapvalues(StormDataShort$PROPDMGEXP,PropDmgMagn, PropDmgMagnNew))
CropDmgMagnNew <- c(10^0,10^6,10^3,10^6,10^9,10^0,10^0,10^3,10^2)
StormDataShort$CROPDMGEXP <- as.numeric(mapvalues(StormDataShort$CROPDMGEXP,CropDmgMagn, CropDmgMagnNew))
## We can then calculate the costs as new variables (damage value * multiplier)
StormDataShort$PropertyDamageCost <- StormDataShort$PROPDMG*StormDataShort$PROPDMGEXP
StormDataShort$CropCost <- StormDataShort$CROPDMG*StormDataShort$CROPDMGEXP
StormDataShort$TotalCost <- StormDataShort$PropertyDamageCost+StormDataShort$CropCost
## Finally, re-arrange the table
StormDatTidy<-StormDataShort[,c(9,2,10,11,12,13)]
colnames(StormDatTidy)[2] <- "eventType"
head(StormDatTidy)
## year eventType victims PropertyDamageCost CropCost TotalCost
## 1 1950 TORNADO 15 25000 0 25000
## 2 1950 TORNADO 0 2500 0 2500
## 3 1951 TORNADO 2 25000 0 25000
## 4 1951 TORNADO 2 2500 0 2500
## 5 1951 TORNADO 2 2500 0 2500
## 6 1951 TORNADO 6 2500 0 2500
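A positional mapping depends on the order in which `unique()` happens to return the exponent codes. As a hedged alternative, the sketch below maps each code by name instead. `exp_lookup` and `exp_to_multiplier` are hypothetical names. H, K, M and B are taken here as hundred, thousand, million and billion. Treating digits as powers of ten and symbols or missing values as a multiplier of 1 is an assumption that mirrors the mapping used above.

```r
# Named lookup from letter code to multiplier (assumed convention:
# H = hundred, K = thousand, M = million, B = billion)
exp_lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert exponent codes to numeric multipliers.
exp_to_multiplier <- function(code) {
  code <- toupper(code)
  # Default multiplier of 1 covers NA, "+", "-", "?" and "0" (an assumption)
  mult <- rep(1, length(code))
  is_letter <- !is.na(code) & code %in% names(exp_lookup)
  mult[is_letter] <- exp_lookup[code[is_letter]]
  # Single digits 1-9 are read as powers of ten (an assumption)
  is_digit <- !is.na(code) & grepl("^[1-9]$", code)
  mult[is_digit] <- 10^as.numeric(code[is_digit])
  mult
}

exp_to_multiplier(c("K", "m", NA, "B", "h", "5", "?"))
```

Because the lookup is keyed by name, it gives the same result regardless of the order in which codes first appear in the data.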
## Let's sum the number of victims per event type
VictimsPerEventType <- aggregate(victims ~ eventType, data = StormDatTidy, sum)
VictimsPerEventType <- VictimsPerEventType[order(-VictimsPerEventType$victims),]
nrow(VictimsPerEventType)
## [1] 985
## as there are 985 event types, let's focus on the 20 most harmful ones
Top20Victim <- head(VictimsPerEventType,20)
## Let's make a barplot
g <- ggplot(data=Top20Victim,aes(eventType,victims))
g <- g+geom_bar(stat="identity")+
scale_x_discrete(limits=Top20Victim$eventType)
g <- g+labs(y="Cumulated number of victims", x="Event type",title="20 most Harmful Events for Population Health in the US (1950-2011)") + theme(axis.text.x = element_text(angle=90))
g
We can then answer the question: tornadoes are by far the most harmful natural hazard in the US in terms of number of victims (injuries and fatalities), with nearly 100,000 victims. Tornado damage has been reported since 1950, which is most probably not the case for other types of natural hazards. This might explain such a high contrast in the results.
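One way to probe whether different hazards have been recorded over different periods is to look at the first year in which each event type appears. The sketch below does this on an invented data frame (`toy`, with illustrative values only) that uses the same column names as `StormDatTidy`.

```r
# Toy stand-in for StormDatTidy (years and types are invented for illustration)
toy <- data.frame(
  year      = c(1950, 1951, 1993, 1995, 1996, 2005),
  eventType = c("TORNADO", "TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  stringsAsFactors = FALSE
)

# Earliest recorded year for each event type
first_year <- aggregate(year ~ eventType, data = toy, FUN = min)
first_year[order(first_year$year), ]
```

On the real data, event types whose first year is much later than 1950 would have had less time to accumulate victims, which is relevant when comparing cumulated totals.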
Regarding economic consequences, the sum of property damage and crop damage costs is taken into account here. As for the previous question, we focus on the 20 most costly natural hazards in the US for the 1950-2011 period.
## Let's sum the total costs per event type
CostsPerEventType <- aggregate(TotalCost ~ eventType, data = StormDatTidy, sum)
CostsPerEventType <- CostsPerEventType[order(-CostsPerEventType$TotalCost),]
nrow(CostsPerEventType)
## [1] 985
## as there are 985 event types, let's focus on the 20 most costly ones
Top20Costs <- head(CostsPerEventType,20)
## Let's make a barplot
g <- ggplot(data=Top20Costs,aes(eventType,TotalCost))
g <- g+geom_bar(stat="identity")+
scale_x_discrete(limits=Top20Costs$eventType)
g <- g+labs(y="Cumulated Damage Costs (Property + Crop)", x="Event type",title="20 Most Economically Harmful Natural Events in the US (1950-2011)") + theme(axis.text.x = element_text(angle=90))
g
To answer the question: hurricanes rank first in terms of economic consequences, followed by floods and droughts. Tornadoes rank fourth (and first in terms of cumulated recorded victims).
These datasets are most probably not complete. The variables were probably not all recorded in the same way from 1950 to 2011. This is a common problem with historical data, and it is not always possible to complete the datasets in the absence of concrete records. With nearly 1000 different event types, we might consider simplifying the natural hazards classification (10 to 20 classes, not more).
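A simplified classification could be sketched with a few pattern rules applied to the raw event labels. The helper name `classify_event`, the patterns and the class names below are all illustrative assumptions, not an official grouping.

```r
# Hypothetical helper: collapse raw event labels into a few coarse classes.
# Patterns and class names are illustrative assumptions.
classify_event <- function(evtype) {
  x <- toupper(trimws(evtype))          # normalize case and stray whitespace
  out <- rep("OTHER", length(x))        # anything unmatched stays "OTHER"
  out[grepl("TORNADO|FUNNEL|WATERSPOUT", x)] <- "TORNADO"
  out[grepl("HURRICANE|TYPHOON|TROPICAL", x)] <- "TROPICAL CYCLONE"
  out[grepl("FLOOD|FLD", x)] <- "FLOOD"
  out[grepl("HEAT|WARM", x)] <- "HEAT"
  out
}

classify_event(c("TORNADO", " Flash Flood", "HURRICANE/TYPHOON", "tstm wind"))
```

Later rules override earlier ones when a label matches several patterns, so the order of the rules is part of the design and would need review before being used on the full list of event types.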