The data analysis looks briefly at weather events impact to the U.S. population. It is based on data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database from 1950 to 2011. Two main areas are investigated. Firstly, which types of weather events are the most harmful to the United States population and secondly which types of events have the greatest economic impact.
Firstly the raw data was loaded in from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Some supporting documentation was also downloaded to understand the most relevant variables. There are fewer events recorded for the earlier years (most likely due to a lack of good records) and as a result more recent years records should be considered more complete.
library(dplyr)
library(data.table)
library(ggplot2)
library(grDevices)
library(RColorBrewer)
library(plotly)
## Setup the working directory where the data is located
setwd("C:/Users/paddy/Documents/Coursera/Assignments/Reproducable Research/Week 4/")
#setwd("C:/Users/paddy/Documents/Coursera/Data Science/Course Material/5. Reproducable Research/Assignments/Week 2/Reproducible Research Course Project 1")
## Download the raw archive file from the web:
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "./repdata%2Fdata%2FStormData.csv.bz2")
file2Url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(file2Url, destfile = "./repdata%2Fpeer2_doc%2Fpd01016005curr.pdf",mode = "wb")
file3Url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
download.file(file3Url, destfile = "./repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf",mode = "wb")
dateDownloaded <- date()
dateDownloaded
# unzip(zipfile = "./repdata%2Fdata%2FStormData.csv.bz2")
As read.csv function can handle the .bz2 format it was read in directly.
## Load the test data
data <- read.csv("./repdata%2Fdata%2FStormData.csv.bz2")
## Take a quick look at the data
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
## Load data into dplyr
data1 <- tbl_df(data)
rm(data)
To investigate the most harmful weather events with respect to population health the FATALITIES variable was investigated and summarised per event type (EVTYPE).
This analysis was carried out using the dplyr package to determine the most impactful weather events.
## summarise FATALITIES by EVTYPE in dplyr
pop_health<-
data1%>%
select(BGN_DATE,BGN_TIME,EVTYPE,FATALITIES,INJURIES)%>%
filter(FATALITIES>0) %>%
group_by(EVTYPE) %>%
summarise(total_fatalities=sum(FATALITIES),total_injuries=sum(INJURIES))%>%
arrange(desc(total_fatalities))
TopN_fatalities<-head(pop_health,20)
These results were then plotted using ggplot2 package and can be seen in the results section.
p1<-ggplot(TopN_fatalities,
aes(x=reorder(EVTYPE,-total_fatalities), y=total_fatalities)) +
geom_col(colour="steelblue2",fill="lightsteelblue1",width = 0.75) +
xlab("Weather Event Type") +
ylab(expression("Total Number of Fatalities")) +
ggtitle("U.S. Top 20 Weather Events for Total Fatalities 1950-2011") +
geom_text(aes(label = round(total_fatalities,0)),colour = "indianred3",nudge_y = 200,size=3)+
theme_minimal()+
theme(axis.text.y = element_text(size = 8,angle=0,color = "indianred3"))+
theme(axis.text.x = element_text(size = 8,angle=90,color = "indianred3",vjust=0,hjust = 1)) # changes axis labels
Following on from this, in order to continue looking at the weather events most harmful to population health, the INJURIES variable was also investigated and again summarised per event type (EVTYPE). This analysis was again carried out using the dplyr package and the results were plotted using ggplot2 package.
## Load data into dplyr
pop_health_i<-
data1%>%
select(BGN_DATE,BGN_TIME,EVTYPE,FATALITIES,INJURIES)%>%
filter(INJURIES>0) %>%
group_by(EVTYPE) %>%
summarise(total_fatalities=sum(FATALITIES),total_injuries=sum(INJURIES)) %>%
arrange(desc(total_injuries))
TopN_injuries<-head(pop_health_i,20)
# pop_health$INJURIES<-as.numeric(pop_health$INJURIES)
The resulting Top 20 most impactful events for injury can be seen in the results section later on in the report.
p2<- ggplot(TopN_injuries,
aes(x=reorder(EVTYPE,-total_injuries), y=total_injuries)) +
geom_col(colour="darkslateblue",fill="steelblue",width = 0.75) +
xlab("Weather Event Type") +
ylab(expression("Total Number of Injuries")) +
ggtitle("U.S. Top 20 Weather Events for Total Injuries 1950-2011") +
geom_text(aes(label = round(total_fatalities,0)),colour = "indianred3",nudge_y = 2500,size=3)+
theme_minimal()+
theme(axis.text.y = element_text(size = 8,angle=0,color = "indianred3"))+
theme(axis.text.x = element_text(size = 8,angle=90,color = "indianred3",vjust=0,hjust = 1)) # changes axis labels
A function found on this page was used to combine 2x ggplots onto a single figure.
## Multiple plot function
## ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
## - cols: Number of columns in layout
## - layout: A matrix specifying the layout. If present, 'cols' is ignored.
## If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
## then plot 1 will go in the upper left, 2 will go in the upper right, and
## 3 will go all the way across the bottom.
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
require(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
}
else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
The second part of the analysis attmpted to understand which weather events had the the greatest economic consequences across the United States between 1950 and 2011.
For damage there are 2x relevant variables (PROPDMG,PROPDMG) corresponding to property damage and crop damage respectively. According to the support documentation from NOAA the damage estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. These characters were recorded in the associated (PROPDMGEXP,CROPDMGEXP) columns.
First the economic extent of damage inflicted on property was analysed. Entries with zero value in the property damage (PROPDMG) variable were removed as they would not contribute to the total cost.
The entries with incorrect values for PROPDMGEXP were checked to see their contribution. As they only represent 0.13% of the total samples they were ignored as they were not deemed to have a major influence on the end result.
##remove the zero value rows from data1 and summarise property damage by EVTYPE
prop<-
data1%>%
select(BGN_DATE,BGN_TIME,EVTYPE,PROPDMG,PROPDMGEXP)%>%
filter(PROPDMG>0)
#as.character(prop$PROPDMGEXP)
prop$PROPDMGEXP <- toupper(prop$PROPDMGEXP)
unique(prop$PROPDMGEXP)
## [1] "K" "M" "B" "+" "0" "" "5" "6" "4" "H" "2" "7" "3" "-"
##review the non-styandard entries of PROPDMGEXP:
prop_typo<-prop%>%
select(BGN_DATE,BGN_TIME,EVTYPE,PROPDMG,PROPDMGEXP)%>%
filter(PROPDMGEXP=="+"|PROPDMGEXP=="0"|PROPDMGEXP==""|PROPDMGEXP=="5"|PROPDMGEXP=="6"|PROPDMGEXP=="4"|PROPDMGEXP=="2"|PROPDMGEXP=="7"|PROPDMGEXP=="-"|PROPDMGEXP=="3")
100*dim(prop_typo)/dim(prop) ##to understand % of entries impactped by typos
## [1] 0.1337938 100.0000000
prop_edit<-prop%>%
select(BGN_DATE,BGN_TIME,EVTYPE,PROPDMG,PROPDMGEXP)%>%
filter(!(PROPDMGEXP=="+"|PROPDMGEXP=="0"|PROPDMGEXP==""|PROPDMGEXP=="5"|PROPDMGEXP=="6"|PROPDMGEXP=="4"|PROPDMGEXP=="2"|PROPDMGEXP=="7"|PROPDMGEXP=="-"|PROPDMGEXP=="3"))
dim(prop_edit)
## [1] 238854 5
unique(prop_edit$PROPDMGEXP)
## [1] "K" "M" "B" "H"
##check the distribution of correct entries to understand the makeup
PROPDMGEXP_count<-
prop_edit%>%
select(BGN_DATE,BGN_TIME,EVTYPE,PROPDMGEXP)%>%
group_by(PROPDMGEXP)%>%
summarise(count_PROPDMGEXP=length(PROPDMGEXP))
PROPDMGEXP_count
## # A tibble: 4 x 2
## PROPDMGEXP count_PROPDMGEXP
## <chr> <int>
## 1 B 40
## 2 H 7
## 3 K 227481
## 4 M 11326
#class(prop_edit$PROPDMGEXP)
#as.character(prop_edit$PROPDMGEXP)
prop_edit$PROPDMGEXP <- toupper(prop_edit$PROPDMGEXP)
unique(prop_edit$PROPDMGEXP)
## [1] "K" "M" "B" "H"
##replace the PROPDMGEXP variable with a numeric equvalent:
prop_edit$PROPDMGEXP[prop_edit$PROPDMGEXP == "B"] <- as.numeric("1000000000")*1
prop_edit$PROPDMGEXP[prop_edit$PROPDMGEXP == "M"] <- as.numeric("1000000")*1
prop_edit$PROPDMGEXP[prop_edit$PROPDMGEXP == "K"] <- as.numeric("1000")*1
prop_edit$PROPDMGEXP[prop_edit$PROPDMGEXP == "H"] <- as.numeric("100")*1
prop_edit$PROPDMGEXP<-as.numeric(prop_edit$PROPDMGEXP)
class(prop_edit$PROPDMGEXP)
## [1] "numeric"
unique(prop_edit$PROPDMGEXP)
## [1] 1e+03 1e+06 1e+09 1e+02
prop_dmg<-prop_edit%>%
select(BGN_DATE,BGN_TIME,EVTYPE,PROPDMG,PROPDMGEXP)%>%
group_by(EVTYPE)%>%
summarise(Tot_PROPDMG=sum(PROPDMG*PROPDMGEXP))%>%
arrange(desc(Tot_PROPDMG))
This was then followed by similar analysis of the ecomonic impact to crop damage. Entries with zero value in the crop damage (CROPDMG) variable were removed as again they would not contribute to the total cost. The entries with incorrect entries for CROPDMGEXP were also checked and found to represent 0.016% of the overall samples so were ignored and not deemed to have significant impact in the end result.
##remove the zero value rows from data1 and summarise crop damage by EVTYPE
crop<-
data1%>%
select(BGN_DATE,BGN_TIME,EVTYPE,CROPDMG,CROPDMGEXP)%>%
filter(CROPDMG>0)
crop$CROPDMGEXP <- toupper(crop$CROPDMGEXP)
unique(crop$CROPDMGEXP)
## [1] "M" "K" "B" "0" ""
##review the non-styandard entries of PROPDMGEXP:
crop_typo<-crop%>%
select(BGN_DATE,BGN_TIME,EVTYPE,CROPDMG,CROPDMGEXP)%>%
filter(CROPDMGEXP==""|CROPDMGEXP=="0")
100*dim(crop_typo)/dim(crop) ##to understand % of entries impactped by typos
## [1] 0.06787637 100.00000000
crop_edit<-crop%>%
select(BGN_DATE,BGN_TIME,EVTYPE,CROPDMG,CROPDMGEXP)%>%
filter(!(CROPDMGEXP==""|CROPDMGEXP=="0"))
dim(crop_edit)
## [1] 22084 5
unique(crop_edit$CROPDMGEXP)
## [1] "M" "K" "B"
##check the distribution of correct entries to understand the makeup
CROPDMGEXP_count<-
crop_edit%>%
select(BGN_DATE,BGN_TIME,EVTYPE,CROPDMGEXP)%>%
group_by(CROPDMGEXP)%>%
summarise(count_CROPDMGEXP=length(CROPDMGEXP))
class(crop_edit$CROPDMGEXP)
## [1] "character"
crop_edit$CROPDMGEXP <- toupper(crop_edit$CROPDMGEXP)
unique(crop_edit$CROPDMGEXP)
## [1] "M" "K" "B"
##replace the PROPDMGEXP variable with a numeric equvalent:
crop_edit$CROPDMGEXP[crop_edit$CROPDMGEXP == "B"] <- "1000000000"
crop_edit$CROPDMGEXP[crop_edit$CROPDMGEXP == "M"] <- "1000000"
crop_edit$CROPDMGEXP[crop_edit$CROPDMGEXP == "K"] <- "1000"
crop_edit$CROPDMGEXP<-as.numeric(crop_edit$CROPDMGEXP)
class(crop_edit$CROPDMGEXP)
## [1] "numeric"
crop_dmg<-crop_edit%>%
select(BGN_DATE,BGN_TIME,EVTYPE,CROPDMG,CROPDMGEXP)%>%
group_by(EVTYPE)%>%
summarise(Tot_CROPDMG=sum(CROPDMG*CROPDMGEXP))%>%
arrange(desc(Tot_CROPDMG))
The total economic cost was based on the sum of both the crop damage and property damage.
Tot_dmg <- merge(crop_dmg, prop_dmg, by.x = "EVTYPE", by.y = "EVTYPE")
Total_dmg<-
Tot_dmg%>%
select(EVTYPE,Tot_CROPDMG,Tot_PROPDMG)%>%
group_by(EVTYPE)%>%
summarise(Tot_DMG=sum(Tot_CROPDMG,Tot_PROPDMG))%>%
arrange(desc(Tot_DMG))
Total_dmg_summary<-
Total_dmg%>%
select(EVTYPE,Tot_DMG)%>%
group_by(EVTYPE)%>%
arrange(desc(Tot_DMG))%>%
mutate("Cost (Billion$)"=format(paste("$",round(Tot_DMG / 1e9, 1), "B"), trim = TRUE))%>%
select(EVTYPE,"Cost (Billion$)")
A pie chart was then created to show which weather events impacted most economically. This chart itself can be seen in the results section below.
dfD <- data.frame(CategoryD = Total_dmg$EVTYPE, TotD=Total_dmg$Tot_DMG)
##dfD$TotD<- format(paste("$",round(dfD$TotD / 1e9, 1), "B"), trim = TRUE)
colors <- c('rgb(211,94,96)', 'rgb(128,133,133)', 'rgb(144,103,167)', 'rgb(171,104,87)', 'rgb(114,147,203)')
pie <- plot_ly(dfD, labels = ~CategoryD, values = ~TotD, type = 'pie',
textposition = 'inside',
textinfo = 'label+percent',
insidetextfont = list(color = '#FFFFFF'),
hoverinfo = 'text',
text = format(paste("$",round(dfD$TotD / 1e9, 1), "B"), trim = TRUE),
marker = list(colors = colors,
line = list(color = '#FFFFFF', width = 1)),
#The 'pull' attribute can also be used to create space between the sectors
showlegend = FALSE) %>%
layout(title = 'United States Economic Impact per Weather Event',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
The results of the analysis can be seen in the plots below.
In summary from a population health impact, tornadoes have been found to have the most significant impact. From the economic impact, floods have the most impact.
The tables below show to most impactful weather events on population health. These are shown by fatalities and injuries. In both instances tornadoes come out on top.
TopN_fatalities
## # A tibble: 20 x 3
## EVTYPE total_fatalities total_injuries
## <fct> <dbl> <dbl>
## 1 TORNADO 5633 60187
## 2 EXCESSIVE HEAT 1903 4791
## 3 FLASH FLOOD 978 641
## 4 HEAT 937 1420
## 5 LIGHTNING 816 649
## 6 TSTM WIND 504 646
## 7 FLOOD 470 2679
## 8 RIP CURRENT 368 87
## 9 HIGH WIND 248 308
## 10 AVALANCHE 224 79
## 11 WINTER STORM 206 599
## 12 RIP CURRENTS 204 67
## 13 HEAT WAVE 172 269
## 14 EXTREME COLD 160 199
## 15 THUNDERSTORM WIND 133 130
## 16 HEAVY SNOW 127 225
## 17 EXTREME COLD/WIND CHILL 125 15
## 18 STRONG WIND 103 40
## 19 BLIZZARD 101 718
## 20 HIGH SURF 101 39
TopN_injuries
## # A tibble: 20 x 3
## EVTYPE total_fatalities total_injuries
## <fct> <dbl> <dbl>
## 1 TORNADO 5227 91346
## 2 TSTM WIND 199 6957
## 3 FLOOD 104 6789
## 4 EXCESSIVE HEAT 402 6525
## 5 LIGHTNING 283 5230
## 6 HEAT 73 2100
## 7 ICE STORM 35 1975
## 8 FLASH FLOOD 171 1777
## 9 THUNDERSTORM WIND 54 1488
## 10 HAIL 3 1361
## 11 WINTER STORM 85 1321
## 12 HURRICANE/TYPHOON 32 1275
## 13 HIGH WIND 102 1137
## 14 HEAVY SNOW 51 1021
## 15 WILDFIRE 55 911
## 16 THUNDERSTORM WINDS 25 908
## 17 BLIZZARD 48 805
## 18 FOG 38 734
## 19 WILD/FOREST FIRE 7 545
## 20 DUST STORM 19 440
##plot histograms in a single figure
multiplot(p1, p2, cols=1)
## Loading required package: grid
Combining both the calcuations from both property and crop damage variables it was possible to determine the total cost of damage per weather event. The table below shows the impact of the top 20 events. Taking this data we can get an understanding of the overall impact of the main weather events contributing to the ecomonomic impact in the United States during the measurement period. The top 5 events account for 72% of the ecomonic damage.
head(Total_dmg_summary,20)
## # A tibble: 20 x 2
## # Groups: EVTYPE [20]
## EVTYPE `Cost (Billion$)`
## <fct> <chr>
## 1 FLOOD $ 150.3 B
## 2 HURRICANE/TYPHOON $ 71.9 B
## 3 TORNADO $ 57.4 B
## 4 STORM SURGE $ 43.3 B
## 5 HAIL $ 18.8 B
## 6 FLASH FLOOD $ 17.6 B
## 7 DROUGHT $ 15 B
## 8 HURRICANE $ 14.6 B
## 9 RIVER FLOOD $ 10.1 B
## 10 ICE STORM $ 9 B
## 11 TROPICAL STORM $ 8.4 B
## 12 WINTER STORM $ 6.7 B
## 13 HIGH WIND $ 5.9 B
## 14 WILDFIRE $ 5.1 B
## 15 TSTM WIND $ 5 B
## 16 STORM SURGE/TIDE $ 4.6 B
## 17 THUNDERSTORM WIND $ 3.9 B
## 18 HURRICANE OPAL $ 3.2 B
## 19 WILD/FOREST FIRE $ 3.1 B
## 20 THUNDERSTORM WINDS $ 1.9 B
pie