This data analysis addresses the following questions:
Data is taken from NOAA’s storm database, with the relevant variables processed from 1993 onwards, and plotted.
Tornadoes have resulted in the greatest number of victims reported since 1993: 24931 victims. While flooding has resulted in the greatest damage cost-wise across the USA since 1993, with a total cost of $150,319,678,257.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
library(plyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(reshape2)
The computing environment:
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.3 ggplot2_3.1.0 lubridate_1.7.4 dplyr_0.7.8
## [5] plyr_1.8.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 knitr_1.21 bindr_0.1.1 magrittr_1.5
## [5] munsell_0.5.0 tidyselect_0.2.5 colorspace_1.4-0 R6_2.3.0
## [9] rlang_0.3.1 stringr_1.3.1 tools_3.5.2 grid_3.5.2
## [13] gtable_0.2.0 xfun_0.4 withr_2.1.2 htmltools_0.3.6
## [17] lazyeval_0.2.1 yaml_2.2.0 assertthat_0.2.0 digest_0.6.18
## [21] tibble_2.0.1 crayon_1.3.4 bindrcpp_0.2.2 purrr_0.3.0
## [25] glue_1.3.0 evaluate_0.12 rmarkdown_1.11 stringi_1.2.4
## [29] compiler_3.5.2 pillar_1.3.1 scales_1.0.0 pkgconfig_2.0.2
## Download and unzip the data, provided it doesn't already exist
if(!file.exists("FStormData.csv.bz2")){
fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,destfile="FStormData.csv.bz2",method="curl")
}
## Read in the data
if(!"FStormData.csv.bz2" %in% ls()){
data <- read.csv("FStormData.csv.bz2")
}
The next step is to perform exploratory data analysis to determine what is relevant. The ‘str’ function is first used for this purpose.
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
As can be seen, there are 902297 observations from 37 variables.
So as to visualize the data prior to cleaning and processing, a bar plot was drawn showing the number of different event types which were reported in the US per year. So as to be able to group the data by year, an extra variable was added with only the year of the reported event.
data$year <- year(mdy_hms(data$BGN_DATE))
## Creating a data frame where the event types reported are grouped by year
cutoffYear <- data %>% group_by(year) %>% summarise(n=n_distinct(EVTYPE))
## Bar plot
plot(x=cutoffYear$year,y=cutoffYear$n, type = "h",
main = "Number of Different Weather Events Reported per Year across the US",
xlab = "Year", ylab = "Events Types")
Figure 1: Barplot showing the number of storm event types reported per year in the US
From the plot, there is a large jump in the number of different reported incidents from 1993 onwards, while pre 1993, the number remains close to 0. It has been determined that the pre 1993 data will be insufficient to provide any satisfactory conclusions, therefore it will be cut.
## Filtering the data to only include data from post 1992
S_Data <- filter(data, year > 1992)
For the first question, to determine which type of event is most harmful with respect to population health, the first step is to create a data frame consisting of only the relevant variables: EVTYPE (event type), FATALITIES, and INJURIES. In addition, an extra variable is created by adding the values for the FATALITIES and INJURIEs columns together, so that the top 10 most harmful events can be determined.
## Relevant variables selected and placed in new data frame
healthData <- S_Data %>% select(EVTYPE,FATALITIES,INJURIES)
## New column added to help with ordering
healthData$victims <- healthData$FATALITIES + healthData$INJURIES
The sum of the fatalities and injuries per event type were calculated using the ‘aggregate’ function, then ordered by the decreasing number of victims. The data was then melted to allow the variables to be plotted simultaneously.
## Aggregate the fatalities, injuries, and victims columns by event type
health <- aggregate(cbind(FATALITIES,INJURIES,victims)~EVTYPE,healthData,sum)
## Data frame created by ordering by the top 10 victim numbers
healthTop <- head(health[order(-health$victims),],10)
## Melt the fatalities and injuries variables so that they can be plotted together
healthM <- melt(healthTop,id.vars="EVTYPE",
measure.vars=c("INJURIES","FATALITIES"))
Bar plot created to showcase results. Please jump to Results section.
healthPlot <- ggplot(healthM, aes(EVTYPE, value, fill = variable)) +
geom_bar(stat="identity") +
labs(title = "Victim (Injury & Fatality) Numbers per Event Type Reported across the USA since 1993",
x = "Event Type", y = "Reported Incidents") +
theme(axis.text.x = element_text(hjust = 1, angle = 20),
legend.title=element_blank()) +
scale_fill_manual(values = c("steelblue1","red"))
For the second question, to determine which types of events have the greatest economic consequences, the first step is to create a data frame consisting of only the relevant variables: EVTYPE (event type), PROPDMG (property damage), PROPDMGEXP (property damage exponent), CROPDMG (crop damage), CROPDMGEXP (crop damage exponent).
economData <- S_Data %>% select(EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
To be able to calculate the damage done to property and crops, the exponents in the PROPDMGEXP and CROPDMGEXP columns needed to be replaced. This as done using the ‘revalue’ function. Symbols were assumed to be 0.
## Determining the values that compromise PROPDMGEXP and CROPDMGEXP
unique(economData$PROPDMGEXP)
## [1] B K M m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(economData$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
## Dictionary created with keys and their respective values
dmgOrder <- c("-"=0,"?"=0,"+"=0,"h"=2,"H"=2,"k"=3,"K"=3,"m"=6,"M"=6,"B"=9)
## Empty cells were assigned a zero
economData[economData==''] <- 0
## The PROPDMGEXP and CROPDMGEXP values replaced using 'dmgOrder' dictionary
economData$PROPDMGEXP <- revalue(economData$PROPDMGEXP, dmgOrder, warn_missing = FALSE)
economData$CROPDMGEXP <- revalue(economData$CROPDMGEXP, dmgOrder, warn_missing = FALSE)
## Values converted from factor to numeric so that they can be further processed
economData$PROPDMGEXP <- as.numeric(as.character(economData$PROPDMGEXP))
economData$CROPDMGEXP <- as.numeric(as.character(economData$CROPDMGEXP))
With the completion of the previous step, the next step was to calculate the actual damage costs by multiplying the damage cost base by the damage exponent for both properties and crops. Then just like with the Health Impact analysis, an extra variable is created by adding the values for the total property damage cost (PROPDMGTOT) and the total crop damage cost (CROPDMGTOT) columns together, so that the top 10 most costly events can be determined.
## New variable added to contain damage costs base multiplied by exponent
economData$PROPDMGTOT <- economData$PROPDMG * 10^economData$PROPDMGEXP
economData$CROPDMGTOT <- economData$CROPDMG * 10^economData$CROPDMGEXP
## New column added to help with ordering
economData$TotalCost <- economData$PROPDMGTOT + economData$CROPDMGTOT
The sum of the PROPDMGTOT and CROPDMGTOT per event types were calculated using the ‘aggregate’ function, then ordered by the decreasing amount of damage. The data was then melted to allow the variables to be plotted simultaneously.
## Aggregate the total property damage, total crop damage, and the total added
## together columns by event type
cost <- aggregate(cbind(PROPDMGTOT,CROPDMGTOT,TotalCost)~EVTYPE,economData,sum)
## Data frame created by ordering by the top 10 costliest
costTop <- head(cost[order(-cost$TotalCost),],10)
## Melt the total property damage, total crop damage variables so that they can be plotted together
costM <- melt(costTop,id.vars="EVTYPE",
measure.vars=c("PROPDMGTOT","CROPDMGTOT"))
Bar plot created to showcase results. Please jump to Results section.
costPlot <- ggplot(costM, aes(EVTYPE, value, fill = variable)) +
geom_bar(stat="identity") +
labs(title = "Economic Cost (Property & Crop Damage) per Event Type Reported across the USA since 1993",
x = "Event Type", y = "Damage in USD") +
theme(axis.text.x = element_text(hjust = 1, angle = 20)) +
scale_fill_manual(values = c("steelblue1","red"),
name = "Type of Damage",
labels = c("Property","Crops"))
healthTop
## EVTYPE FATALITIES INJURIES victims
## 834 TORNADO 1621 23310 24931
## 130 EXCESSIVE HEAT 1903 6525 8428
## 170 FLOOD 470 6789 7259
## 464 LIGHTNING 816 5230 6046
## 856 TSTM WIND 241 3631 3872
## 275 HEAT 937 2100 3037
## 153 FLASH FLOOD 978 1777 2755
## 427 ICE STORM 89 1975 2064
## 760 THUNDERSTORM WIND 133 1488 1621
## 972 WINTER STORM 206 1321 1527
healthPlot
Figure 2: Top 10 most harmful (in terms of number of deaths and injuries) storm events to the US since 1993
The figure shows that Tornadoes are the most harmful event to the population of the US in terms of the total number of victims affected - 24931 victims. This is due to the large number of injuries that result from the event, more than three times the number of injuries that result from the second most harmful event - Excessive Heat. However, in terms of the number of fatalities, the most harmful event to the population is Excessive Heat, with 1903 deaths, to tornadoes’ 1621.
costTop
## EVTYPE PROPDMGTOT CROPDMGTOT TotalCost
## 170 FLOOD 144657709807 5661968450 150319678257
## 411 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 670 STORM SURGE 43323536000 5000 43323541000
## 834 TORNADO 26349182107 414953270 26764135377
## 244 HAIL 15735267513 3025954473 18761221986
## 153 FLASH FLOOD 16822673979 1421317100 18243991079
## 95 DROUGHT 1046106000 13972566000 15018672000
## 402 HURRICANE 11868319010 2741910000 14610229010
## 590 RIVER FLOOD 5118945500 5029459000 10148404500
## 427 ICE STORM 3944927860 5022113500 8967041360
costPlot
Figure 3: Top 10 most economically damaging (in terms of cost to properties and crops) storm events to the US since 1993
The figure shows that Floods do the most economic damage in the US in terms of the total cost of $150,319,678,257. This is due to the large amount of property damage done; which is more than twice that of Hurricane/Typhoon damage. However, in terms of the cost of the damage done to crops, the most harmful event is Drought, with $13972566000 worth of damage, to flood’s $5661968450.
From 1993, tornadoes have been the most harmful event to the population of the US, while floods have done the greatest damage economically.