The goal of the assignment is to explore the NOAA Storm Database and assess the effects of severe weather events on both population and economy. The database covers the period from 1950 to November 2011.
The analysis addresses two questions: which types of events are most harmful to population health, and which have the greatest economic consequences?
Information on the Data: Documentation
It is assumed that the raw data are available in the working directory.
The dplyr package is used for reading and manipulating the data.
While the data are read, BGN_DATE and END_DATE are converted to dates and the year of BGN_DATE is stored in the new column BGN_YEAR:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(reshape2)
library(ggplot2)
sdb <-
read.csv("repdata_data_StormData.csv",sep=",",na.strings=NULL,colClasses =c("numeric","character","character","character","numeric","character","character","character","numeric","character","character","character","character","numeric","character","numeric","character","character","numeric","numeric","character","numeric","numeric","numeric","numeric","character","numeric","character","character","character","character","numeric","numeric","numeric","numeric","character","numeric")) %>%
mutate(BGN_DATE=as.Date(BGN_DATE,format="%m/%d/%Y"),END_DATE=as.Date(END_DATE,format="%m/%d/%Y"),BGN_YEAR=year(BGN_DATE))
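As a small illustration of the date conversion, the sketch below parses two date strings in the month/day/year layout that the format string expects. The sample values are made up for illustration and are assumed to carry a trailing time part. Base R's format() stands in for lubridate's year() so that the snippet runs without extra packages.

```r
# Hypothetical sample values in the month/day/year layout the format string expects
raw_dates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# as.Date() reads the part that matches the format and ignores the trailing time
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Extract the year as a number (the report itself uses lubridate::year() for this)
years <- as.numeric(format(parsed, "%Y"))

print(parsed)
print(years)
```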
In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The number of available events over time is explored using a simple bar plot:
sdb_cnt<-
sdb %>%
group_by(BGN_YEAR) %>%
summarise(n(),.groups="keep")
plot(sdb_cnt, type="h", col="red", xlab="Years", ylab="Number of cases", main="Number of cases over Time")
Based on these counts, the years 1989-2011 are selected for further analysis.
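One way to see how much of the data such a cut-off keeps is to compute the share of events that fall in the selected period. The sketch below does this on a made-up vector of years standing in for BGN_YEAR. The values are purely illustrative and say nothing about the real database.

```r
# Hypothetical event years standing in for sdb$BGN_YEAR
bgn_year <- c(1950, 1975, 1989, 1995, 1995, 2003, 2011, 2011)

# Number of events per year
counts <- table(bgn_year)

# Share of events recorded in 1989 or later
share_kept <- mean(bgn_year >= 1989)

print(counts)
print(share_kept)
```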
Property and crop damage are each stored as a base value and an exponent in separate columns (PROPDMG with PROPDMGEXP, CROPDMG with CROPDMGEXP). The exponent notation is mixed, however: it may contain digits such as “2”, which translates to 10^2, or letters such as “M”, which translates to 10^6.
unique(sdb[,"PROPDMGEXP"])
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(sdb[,"CROPDMGEXP"])
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
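The mapping from exponent codes to multipliers can also be written as a small lookup function. The sketch below follows the interpretation stated above (a digit d means 10^d, and the letters H, K, M, B mean 10^2, 10^3, 10^6, 10^9) and treats every other code as a multiplier of 1. The function name exp_multiplier and the sample values are hypothetical and are not part of the analysis code.

```r
# Map exponent codes to multipliers; unknown codes ("", "+", "-", "?") fall back to 1
exp_multiplier <- function(code) {
  code <- toupper(code)
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  out <- rep(1, length(code))

  # Single digits are read as powers of ten
  is_digit <- grepl("^[0-9]$", code)
  out[is_digit] <- 10^as.numeric(code[is_digit])

  # Letters are looked up case-insensitively
  is_letter <- code %in% names(lookup)
  out[is_letter] <- unname(lookup[code[is_letter]])
  out
}

# Made-up codes and base values for illustration
codes <- c("K", "m", "2", "", "?", "B")
multipliers <- exp_multiplier(codes)
damage <- c(25, 2.5, 3, 10, 1, 1.5) * multipliers

print(multipliers)
print(damage)
```

A lookup like this keeps the rule in one place, which can help when the same conversion is applied to both the property and the crop columns.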
In the next step a tidy dataset is created:
sdb_tidy <-
sdb %>%
filter(BGN_YEAR >= 1989) %>%
mutate(PROPDMG_VAL=PROPDMG*
case_when(
toupper(PROPDMGEXP) == "1" ~ 10^1,
toupper(PROPDMGEXP) == "2" ~ 10^2,
toupper(PROPDMGEXP) == "3" ~ 10^3,
toupper(PROPDMGEXP) == "4" ~ 10^4,
toupper(PROPDMGEXP) == "5" ~ 10^5,
toupper(PROPDMGEXP) == "6" ~ 10^6,
toupper(PROPDMGEXP) == "7" ~ 10^7,
toupper(PROPDMGEXP) == "8" ~ 10^8,
toupper(PROPDMGEXP) == "9" ~ 10^9,
toupper(PROPDMGEXP) == "H" ~ 10^2,
toupper(PROPDMGEXP) == "K" ~ 10^3,
toupper(PROPDMGEXP) == "M" ~ 10^6,
toupper(PROPDMGEXP) == "B" ~ 10^9,
.default = 10^0
),
CROPDM_VAL=CROPDMG*
case_when(
toupper(CROPDMGEXP) == "2" ~ 10^2,
toupper(CROPDMGEXP) == "K" ~ 10^3,
toupper(CROPDMGEXP) == "M" ~ 10^6,
toupper(CROPDMGEXP) == "B" ~ 10^9,
.default = 10^0
)
)
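In the last step, two further datasets are created for the graphical analysis, one for population health and one for economic damage.
For the health dataset, events with at least one fatality or injury are kept, fatalities and injuries are summed per event type, and the ten event types with the highest combined total are retained in long format: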
sdb_health<-
sdb_tidy %>%
filter(FATALITIES > 0|INJURIES > 0) %>%
mutate(TOTAL_VAL=FATALITIES+INJURIES,EVTYPE=factor(EVTYPE)) %>%
select(EVTYPE,FATALITIES,INJURIES,TOTAL_VAL) %>%
group_by(EVTYPE) %>%
summarise(across(where(is.numeric),sum),.groups="keep") %>%
ungroup() %>%
slice_max(TOTAL_VAL,n = 10) %>%
melt(id.vars="EVTYPE", variable.name="CASUALTY_TYPE", value.name="CASUALTY_VALUE") %>%
filter(CASUALTY_TYPE!="TOTAL_VAL")
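The filter, aggregate, and top-n logic can be checked on a toy data frame. The sketch below uses base R (aggregate(), order(), head()) instead of the dplyr pipeline, so it is an illustration of the idea rather than the code used in the report. The event counts are invented.

```r
# Invented events for illustration only
events <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(2, 1, 0, 5, 0, 0),
  INJURIES   = c(10, 0, 4, 1, 3, 0)
)

# Keep only events with at least one fatality or injury (drops HAIL here)
harmful <- events[events$FATALITIES > 0 | events$INJURIES > 0, ]

# Sum fatalities and injuries per event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = harmful, FUN = sum)
totals$TOTAL_VAL <- totals$FATALITIES + totals$INJURIES

# Keep the two event types with the highest combined total
top <- head(totals[order(-totals$TOTAL_VAL), ], 2)
print(top)
```

Note that head() cuts off at exactly n rows, whereas slice_max() keeps ties by default, so the two can differ when totals are equal at the cut-off.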
The following graph shows the 10 event types that are most harmful to population health:
plot_health <- ggplot(sdb_health) +
geom_bar(aes(x = reorder(EVTYPE, CASUALTY_VALUE), y = CASUALTY_VALUE, fill=CASUALTY_TYPE),
position = "stack", stat = "identity") +
coord_flip() +
labs(x = "Event Type",y = "Casualty Cases", title="Top 10 most harmful events (Health)")
print(plot_health)
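For the economic dataset, events with property or crop damage are kept, both damage values are summed per event type, and the ten event types with the highest total damage are retained in long format: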
sdb_eco<-
sdb_tidy %>%
filter(PROPDMG_VAL > 0|CROPDM_VAL > 0) %>%
mutate(TOTAL_VAL=PROPDMG_VAL+CROPDM_VAL) %>%
select(EVTYPE,PROPDMG_VAL,CROPDM_VAL,TOTAL_VAL) %>%
group_by(EVTYPE) %>%
summarise(across(where(is.numeric),sum),.groups="keep") %>%
ungroup() %>%
slice_max(TOTAL_VAL,n = 10) %>%
melt(id.vars="EVTYPE", variable.name="DAMAGE_TYPE", value.name="DAMAGE_VALUE") %>%
filter(DAMAGE_TYPE!="TOTAL_VAL")
The following graph shows the 10 event types that have the greatest economic consequences:
plot_eco <- ggplot(sdb_eco) +
geom_bar(aes(x = reorder(EVTYPE, DAMAGE_VALUE), y = DAMAGE_VALUE, fill=DAMAGE_TYPE),
position = "stack", stat = "identity") +
coord_flip() +
labs(x = "Event Type",y = "Damage Value",title = "Top 10 most harmful events (Economy)")
print(plot_eco)