This report seeks to answer 2 questions:
Data used in this analysis was sourced from the U.S. National Weather Services from the period 1950 to Nov 2011.
From the analysis of this data, the following answers were concluded:
1. The most harmful weather type in terms of population health are Tornados, contributing 5,633 fatalities (37.2% of total) and 91,346 injured (65% of total).
Other events of note on health were Excessive Heat, Flash Floods, Heat, Lightning, TSTM Wind, and Floods.
2, The most damaging weather type in terms of economic consequences are Floods, resulting in Damages amounting to US$144.7B (33.9% of total) of property and US$5,7B (65% of total) of crops.
Other events of note on economic consequences were Hurricanes/Typhoons, Tornados, Storm Surges, Flash Floods, Droughts, River Floods, Ice Storms and Hail.
various libraries required for analysis using R were loaded.
library(data.table)
library(R.utils)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
The data used in this analysis was based on data from the link provided : Storm Data (47Mb)
The data is a copy of the StormData.CSV file recorded between 1950 and Nov 2011 originally sourced from the US National Weather Service.
We first create a download sub directory and download the file from the URL into it on the first run only.
Documentation on the data is available from these links:
* National Weather Service Storm Data Documentation
* National Climatic Data Center Storm Events FAQ
# Download source files from URL into and create download sub-directory on 1st run only.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "./download/StormData.csv.bz2"
if(!file.exists("./download/StormData.csv.bz2")) {
dir.create("./download")
download.file(url,destfile=destfile)
}
The data was then read from the file into a data table.
data <- fread("./download/StormData.csv.bz2")
Examining the data we find the following:
str(data)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
The data set to be examined to answer the above questions only required selecting the following variables:
* EVTYPE (CHR)
* FATALITIES (NUM)
* INJURIES (NUM)
* PROPDMG (NUM)
* PROPDMGEXP (CHR)
* CROPDMG (NUM)
* CROPDMGEXP (CHR)
# select required variables
data <- data %>% select(EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
The data was found to have no clear NA missing values.
mean(is.na(data))
## [1] 0
Further examination of the selected data was performed.
This variable had an extremely large variance in values as compared to the 48 types documented.
# Check unique variables in EVTYPE
length(unique(data$EVTYPE))
## [1] 985
Only minor cleaning was performed on this variable which were in order:
# Cleaning EVTYPE
data$EVTYPE <- toupper(data$EVTYPE)
data$EVTYPE <- trimws(data$EVTYPE)
searchString <- ' '
replacementString <- ' '
data$EVTYPE = sub(searchString,replacementString,data$EVTYPE)
# Check unique variables in EVTYPE
length(unique(data$EVTYPE))
## [1] 883
This somewhat reduced the unique values.
Notes:
1. The Weather event types in the data was highly contextual and any matching / correction to the documented types required weather services expertise.
2. It was targeted that the top 5 event types should represent at least 60% of the total from the data set or else further cleaning would be required.
The order of magnitude variables for property damage, PROPDMGEXP and crop damage, CROPDMGEXP had more unique values than stated in the documentation. i.e. (H= hundreds, K= thousands, M= millions, B= billions).
# Checking prop dmg. exp
rd <- 4
PDE <- data %>%
count(PROPDMGEXP, name='PDEcount') %>%
mutate (PDE_pct = round(100*PDEcount/sum(PDEcount),rd))
# Checking crop dmg. exp
CDE <- data %>%
count(CROPDMGEXP, name='CDEcount') %>%
mutate (CDE_pct = round(100*CDEcount/sum(CDEcount),rd))
print(PDE, row.names= NULL)
## PROPDMGEXP PDEcount PDE_pct
## 465934 51.6387
## - 1 0.0001
## ? 8 0.0009
## + 5 0.0006
## 0 216 0.0239
## 1 25 0.0028
## 2 13 0.0014
## 3 4 0.0004
## 4 4 0.0004
## 5 28 0.0031
## 6 4 0.0004
## 7 5 0.0006
## 8 1 0.0001
## B 40 0.0044
## h 1 0.0001
## H 6 0.0007
## K 424665 47.0649
## m 7 0.0008
## M 11330 1.2557
print(CDE, row.names= NULL)
## CROPDMGEXP CDEcount CDE_pct
## 618413 68.5376
## ? 7 0.0008
## 0 19 0.0021
## 2 1 0.0001
## B 9 0.0010
## k 21 0.0023
## K 281832 31.2349
## m 1 0.0001
## M 1994 0.2210
Although the erroneous EXP values for both EXP components were hardly significant it still had to be cleaned up to calculate the amount of damage.
The erroneous EXP values were converted according to the following assumptions in order:
Property & crop damage values were then calculated and combined into variables “PropDmg” & “CropDmg” respectively in the clean data set.
# Replace erroneous values of Property & Crop Damages EXP.
## Cleaning prop & crop dmg. exp
data$PROPDMGEXP <- gsub("[^[:alnum:]]","",data$PROPDMGEXP)
data$CROPDMGEXP <- gsub("[^[:alnum:]]","",data$CROPDMGEXP)
data <-data %>%
mutate(PROPDMGEXP = toupper(PROPDMGEXP),CROPDMGEXP = toupper(CROPDMGEXP)) %>%
mutate(PROPDMGEXP = replace(PROPDMGEXP,PROPDMGEXP=="",0),
CROPDMGEXP = replace(CROPDMGEXP,CROPDMGEXP=="",0)) %>%
mutate(PROPDMGEXP = replace(PROPDMGEXP,PROPDMGEXP=='H', 100),
CROPDMGEXP = replace(CROPDMGEXP,CROPDMGEXP=='H', 100)) %>%
mutate(PROPDMGEXP = replace(PROPDMGEXP,PROPDMGEXP=='K', 1000),
CROPDMGEXP = replace(CROPDMGEXP,CROPDMGEXP=='K', 1000)) %>%
mutate(PROPDMGEXP = replace(PROPDMGEXP,PROPDMGEXP=='M', 1000000),
CROPDMGEXP = replace(CROPDMGEXP,CROPDMGEXP=='M', 1000000)) %>%
mutate(PROPDMGEXP = replace(PROPDMGEXP,PROPDMGEXP=='B', 1000000000),
CROPDMGEXP = replace(CROPDMGEXP,CROPDMGEXP=='B', 1000000000)) %>%
mutate(PROPDMGEXP = as.double(PROPDMGEXP),CROPDMGEXP = as.double(CROPDMGEXP))
## Calculating Damage Values and combining into new variables.
cleandata <- data %>%
mutate(PropDmg = PROPDMG*PROPDMGEXP, CropDmg = CROPDMG*CROPDMGEXP) %>%
select("EVTYPE", "FATALITIES", "INJURIES", "PropDmg", "CropDmg" )
head(cleandata)
## EVTYPE FATALITIES INJURIES PropDmg CropDmg
## 1: TORNADO 0 15 25000 0
## 2: TORNADO 0 0 2500 0
## 3: TORNADO 0 2 25000 0
## 4: TORNADO 0 2 2500 0
## 5: TORNADO 0 2 2500 0
## 6: TORNADO 0 6 2500 0
Showing top of the clean data set.
The clean data set was filtered and grouped to show the required categorical summaries. Only the top 5 events in each category were plotted accordingly.
Initial options for the plots:
# Options for Figures
mytheme <- ttheme_default(base_size=10,
core= list(fg_params= list(hjust=0,x=0.1)),
colhead= list(fg_params=list(hjust=0, x=0.1)))
top <-5
## Fatalities
fatal <- cleandata %>%
group_by(EVTYPE) %>%
summarise(totfatal = sum(FATALITIES)) %>%
arrange(desc(totfatal)) %>%
filter(totfatal !=0) %>%
mutate(pct=round(100*totfatal/sum(totfatal),1)) %>%
slice_max(totfatal, n=top)
p1f <- ggplot(fatal,aes(x=totfatal, y=reorder(EVTYPE, totfatal)))
p1f <- p1f + geom_col(fill ='red')
p1f <- p1f + labs(title= paste('Fatal (Top',top, '=', sum(fatal$pct),'%)'),
y=NULL, x= 'Count')
fatal$totfatal <- format(fatal$totfatal, big.mark=",",core.just= "right")
tb1 <- tableGrob(fatal, theme= mytheme, rows = NULL, cols= c("Event Type","Fatalities","%"))
## Injured
injured <- cleandata %>%
group_by(EVTYPE) %>%
summarise(totinjured = sum(INJURIES)) %>%
arrange(desc(totinjured)) %>%
filter(totinjured !=0) %>%
mutate(pct=round(100*totinjured/sum(totinjured),1)) %>%
slice_max(totinjured,n=top)
p2i <- ggplot(injured, aes(y=reorder(EVTYPE, totinjured), x=totinjured))
p2i <- p2i + geom_col(fill ='orange')
p2i <- p2i + labs(title = paste('Injured (Top',top, '=', sum(fatal$pct),'%)'),
y=NULL, x= 'Count')
injured$totinjured <- format(injured$totinjured, big.mark=",",core.just= "right")
tb2 <- tableGrob(injured, theme= mytheme, rows = NULL, cols= c("Event Type","Injured","%"))
grid.arrange(p1f, p2i, tb1, tb2, heights= unit(c(120,80), c("mm", "mm")),
widths= unit(c(120,120), c("mm", "mm")),
top= "Weather Events Harmful to Population Health")
PDmg <- cleandata %>%
group_by(EVTYPE) %>%
summarise(totPDmg = round(sum(PropDmg)/1e9,3)) %>%
arrange(desc(totPDmg)) %>%
filter(totPDmg !=0) %>%
mutate(pct=round(100*totPDmg/sum(totPDmg),1)) %>%
slice_max(totPDmg,n=top)
p3p <- ggplot(PDmg, aes(y=reorder(EVTYPE, totPDmg), x= totPDmg))
p3p <- p3p + geom_col(fill ='Purple')
p3p <- p3p + labs(title= paste('Property Damage (Top', top, '=', sum(PDmg$pct),'%)'),
y= NULL, x= 'Value ($B)')
PDmg$totPDmg <- format(PDmg$totPDmg, big.mark=",",core.just= "right")
tb3 <- tableGrob(PDmg, theme= mytheme, rows = NULL, cols= c("Event Type","Value ($B)","%"))
## Crop Damage
CDmg <- cleandata %>%
group_by(EVTYPE) %>%
summarise(totCDmg = round(sum(CropDmg)/1e9,3)) %>%
arrange(desc(totCDmg)) %>%
filter(totCDmg !=0) %>%
mutate(pct=round(100*totCDmg/sum(totCDmg),1)) %>%
slice_max(totCDmg,n=top)
p4c <- ggplot(CDmg,aes(y=reorder(EVTYPE, totCDmg), x=totCDmg))
p4c <- p4c + geom_col(fill ='Steel Blue')
p4c <- p4c + labs(title= paste('Crop Damage (Top',top, '=', sum(CDmg$pct),'%)'),
y= NULL, x= 'Value ($B)')
CDmg$totCDmg <- format(CDmg$totCDmg, big.mark=",",core.just= "right")
tb4 <- tableGrob(CDmg, theme= mytheme, rows = NULL, cols= c("Event Type","Value ($B)","%"))
grid.arrange(p3p,p4c, tb3, tb4, heights= unit(c(120,80), c("mm", "mm")),
widths= unit(c(120,120), c("mm", "mm")),
top= "Weather Events with Damaging Economic Consequences")