Adrian Lim June 2015
This Project analyses the storm data from NOAA and answers 2 main questions:
- Across the United States, which Event Types causes the most harm to population health
- Across the Unites States, which Event Types causes the most economic damage.
Load the required R libraries.
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.0 (2015-02-19) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.19.0 (2015-02-27) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v2.1.0 (2015-05-27) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## The following object is masked from 'package:utils':
##
## timestamp
##
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
The storm data is downloaded from Storm Data.
The zipped file is assummed downloaded into the current working directory for further processing.
Documentation describing the data is downloaded from Storm Data Documentation.
There is a useful FAQ.
bunzip2("./repdata-data-StormData.csv.bz2", "./storm3.csv", remove = FALSE)
stormdf <- read.csv("./storm3.csv",header=TRUE,sep=",")
Dimensions of the data are:
dim(stormdf)
## [1] 902297 37
Basic information on the data:
str(stormdf)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Lets subset the data file to include only the pertinent columns for our analysis. We require only information about the event type, number of fatalities and injuries, amount of property and crop damage.
df <- select(stormdf,STATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
head(df)
## STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 AL TORNADO 0 15 25.0 K 0
## 2 AL TORNADO 0 0 2.5 K 0
## 3 AL TORNADO 0 2 25.0 K 0
## 4 AL TORNADO 0 2 2.5 K 0
## 5 AL TORNADO 0 2 2.5 K 0
## 6 AL TORNADO 0 6 2.5 K 0
Check for any missing values:
sum(is.na(df))
## [1] 0
There are no missing values.
Table 2.1.1 (Storm Event Table) describing the event types sourced from the below URL https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf lists 48 event types.
eventypenames <- unique(df$EVTYPE)
numeventtypes <- length(eventypenames)
head(eventypenames)
## [1] TORNADO TSTM WIND HAIL
## [4] FREEZING RAIN SNOW ICE STORM/FLASH FLOOD
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
However the data currently has 985 event types so clearly there is a lot of EVTYPE data which requires cleaning. Due to the lack of time though I am leaving this to follow up work.
First lets group the events by type, sum the total number of fatalities for each event type then order by highest to the lowest
fatalitiesByEventType <- df %>% group_by(EVTYPE) %>% summarise(TotalFatalities = sum(FATALITIES)) %>% arrange(desc(TotalFatalities))
head(fatalitiesByEventType)
## Source: local data frame [6 x 2]
##
## EVTYPE TotalFatalities
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
Now lets plot a graph of the top eight highest fatality count by event
barplot(height=fatalitiesByEventType$TotalFatalities[1:8],names.arg = fatalitiesByEventType$EVTYPE[1:8],col=rainbow(10,start=0,end=1),main="Top 8 Fatality Events",ylab="Total Number of Fatalities",las=2,cex.axis=0.75)
Clearly Tornadoes are the leading cause of fatalities by far.
Similiarly for Injuries, lets first lets group the events by type, sum the total number of injuries for each event type then order by highest to the lowest
InjuriesByEventType <- df %>% group_by(EVTYPE) %>% summarise(TotalInjuries = sum(INJURIES)) %>% arrange(desc(TotalInjuries))
head(InjuriesByEventType)
## Source: local data frame [6 x 2]
##
## EVTYPE TotalInjuries
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
Now lets plot a graph of the top eight highest injuries count by event
barplot(height=InjuriesByEventType$TotalInjuries[1:8],names.arg = InjuriesByEventType$EVTYPE[1:8],col=rainbow(10,start=0,end=1),main="Top 8 Injury Events",ylab="Total Number of Injuries",las=2,cex.axis=0.75)
Clearly Tornadoes again are the leading cause of injuries by far compared to other event types.
The PROPDMGEXP and CROPDMGEXP fiels contain the multipliers for the damages for property damage and crop damage respectively. The unique values of those fields are:
unique(df$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(df$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
We need to replace the non numeric values like K,M etc with their numeric equivalents eg 1000, 1,000,000. “”,“+”,“-”,“?” will be replaced with numeric
0,1,-1,0 respectively. Firstly lets convert to all upper case.
df$PROPDMGEXP <- toupper(df$PROPDMGEXP)
df$CROPDMGEXP <- toupper(df$CROPDMGEXP)
symbol <- c("","+","-","?",0:9,"H","K","M","B")
replacement <- c(0,1,-1,0,0:9,100,1000,1000000,1000000000)
multiplier <-data.frame(symbol,replacement)
head(multiplier)
## symbol replacement
## 1 0
## 2 + 1
## 3 - -1
## 4 ? 0
## 5 0 0
## 6 1 1
Now we calculate the actual economic damage value per event using the multipliers and covert to the scale to Billions of Dollars
df$PROPDMG <- df$PROPDMG*multiplier[match(df$PROPDMGEXP,multiplier$symbol),2]
df$CROPDMG <- df$CROPDMG*multiplier[match(df$CROPDMGEXP,multiplier$symbol),2]
df$PROPDMG <- df$PROPDMG/1000000000
df$CROPDMG <- df$CROPDMG/1000000000
For economic impact sum the total property damage and total crop damage for each event type then calculate the total Damage value which is the sum of the proporty damage plus crop damage and then arrange from highest to lowest total Damage value
EconomicImpactByEventType <- df %>% group_by(EVTYPE) %>% summarise(TotalProp = sum(PROPDMG), TotalCrop = sum(CROPDMG)) %>% mutate(TotalDamageValue = TotalProp + TotalCrop) %>% arrange(desc(TotalDamageValue))
head(EconomicImpactByEventType)
## Source: local data frame [6 x 4]
##
## EVTYPE TotalProp TotalCrop TotalDamageValue
## 1 FLOOD 144.65771 5.6619684 150.31968
## 2 HURRICANE/TYPHOON 69.30584 2.6078728 71.91371
## 3 TORNADO 56.93716 0.4149531 57.35211
## 4 STORM SURGE 43.32354 0.0000050 43.32354
## 5 HAIL 15.73227 3.0259544 18.75822
## 6 FLASH FLOOD 16.14081 1.4213171 17.56213
options(scipen=999)
barplot(height=EconomicImpactByEventType$TotalDamageValue[1:8],names.arg = EconomicImpactByEventType$EVTYPE[1:8],col=rainbow(8,start=0,end=1),main="Top 8 Economic Damage Event Types",ylab="Total Economic Damage (Billion$)",las=2)
The most dangerous event type is clearly tornadoes as it causes by far the most number of fatalities and injuries.
From the last graph, it is quite clear that the the top 3 event types causing the most economic damage is:
1.Floods
2.Typhoon
3.Tornados
More work can be done to clean up the EVTYPE fields and reanalyse the result to see whether there is a significant impact. This shall be left to a follow up project.