This R markdown comprises an analysis made for the course “Reproducible Research” on the Data Science Specialization in Coursera. It aims to gauge the effects of major weather events, such as storms and heat waves, in the United States. For this analysis, the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database was used. Two plots were created, showing which events cause the greatest effect on population health and on the Economy.
To perform this analysis some R libraries need to be imported.
library(R.utils)
## Warning: package 'R.utils' was built under R version 4.3.3
## Carregando pacotes exigidos: R.oo
## Warning: package 'R.oo' was built under R version 4.3.3
## Carregando pacotes exigidos: R.methodsS3
## Warning: package 'R.methodsS3' was built under R version 4.3.3
## R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
##
## throw
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, load, save
## R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:R.utils':
##
## extract
library(stringdist)
##
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
##
## extract
## The following object is masked from 'package:R.utils':
##
## extract
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.3.3
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
library(scales)
Then, download the data given on the project instructions as a zip file, and extract the .csv file using the bunzip2 function.
# Download the data
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "./stormdata.csv.bz2")
bunzip2("stormdata.csv.bz2", remove = FALSE)
Next, the data should be read into R:
# Read the file into R
storm_df = read.csv("stormdata.csv")
After importing the dataset, some cleaning needs to be performed on the dataset. The dataset starts registering events from 1950, however, all the differents types were only measured from 1996 onwards. Since we would like to have a fair basis of comparison, we will filter only records that begin from that point in time.
All unnecessary columns will be removed from the dataset, remaining only those with event types and their health and economic results.
Since property and crop damage contain an exponent column, we will convert its value to numeric to generate a single number of comparison.
# Register of all types of occurences only started in 1996, so we will filter only those
# Remove unnecessary columns
storm_df$BGN_DATE = as.Date(storm_df$BGN_DATE, "%m/%d/%Y")
cutdate = as.Date("1996-01-01")
storm_df_1996 = storm_df %>%
filter(BGN_DATE > cutdate) %>%
select(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,
CROPDMG, CROPDMGEXP)
# Convert data for property damage and crop damage using exponents
storm_df_1996 = storm_df_1996 %>%
mutate(PROPDMGEXP_NUM = case_when(PROPDMGEXP == "" ~ 1, PROPDMGEXP == "K" ~ 1E3,
PROPDMGEXP == "M" ~ 1E6, PROPDMGEXP == "B" ~ 1E9),
CROPDMGEXP_NUM = case_when(CROPDMGEXP == "" ~ 1, CROPDMGEXP == "K" ~ 1E3,
CROPDMGEXP == "M" ~ 1E6, CROPDMGEXP == "B" ~ 1E9)) %>%
mutate(PROPDMG_act = PROPDMG * PROPDMGEXP_NUM, CROPDMG_act = CROPDMG * CROPDMGEXP_NUM)
# Create column for year
storm_df_1996 = storm_df_1996 %>%
mutate(YEAR = year(BGN_DATE))
The list of registered event types in the dataset is quite larger than the actual official list, most likely due to mislabeling and typos. To adjust this, the stringdist library will be used. It provides a function called amatch, that is able to approximate the list value to an official event type. These steps are performed below.
# Create vector of official event types
offic_evtypes = c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill",
"Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm",
"Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze",
"Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf",
"High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood",
"Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind",
"Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind",
"Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout",
"Wildfire", "Winter Storm", "Winter Weather")
# Perform approximate matching of official list and recorded events
storm_df_1996$cleaned_EVTYPE = offic_evtypes[amatch(storm_df_1996$EVTYPE, offic_evtypes, maxDist = 30)]
With a cleaned dataset, dataframes can be prepared to analyse the effect of the weather events. Two dataframes will be created, one for analysing the effects on population health, through the number of injuries and fatalities, and the other on the economic malaise due to property and crop damage. This variables were grouped by each event type, considering the total number of fatalities, injuries and damage caused since 1996, and also the average effect for a single event.
# Group data by event type and compute the sums
storm_grouped = storm_df_1996 %>%
group_by(cleaned_EVTYPE) %>%
summarise(total_fatalities = sum(FATALITIES, na.rm = TRUE),
total_injuries = sum(INJURIES, na.rm = TRUE),
total_fatplusinj = sum(FATALITIES, na.rm = TRUE) + sum(INJURIES, na.rm = TRUE),
average_fatalities = mean(FATALITIES, na.rm = TRUE),
average_injuries = mean(INJURIES, na.rm = TRUE),
average_fatplusinj = mean(FATALITIES + INJURIES, na.rm = TRUE)) %>%
arrange(desc(total_fatplusinj))
# Reshape the data to long format for plotting
storm_grouped_long <- storm_grouped %>%
pivot_longer(cols = c(total_fatalities, total_injuries,
average_fatalities, average_injuries),
names_to = c("Measure", "Type"),
names_sep = "_",
values_to = "Total") %>%
mutate(Type = factor(ifelse(Type == "fatalities", "Fatality", "Injury"))) %>%
pivot_wider(names_from = "Measure", values_from = "Total")
# Group data by event type and compute the sums
storm_damage_grouped = storm_df_1996 %>%
group_by(cleaned_EVTYPE) %>%
summarise(total_propertydmg = sum(PROPDMG_act, na.rm = TRUE),
total_cropdmg = sum(CROPDMG_act, na.rm = TRUE),
total_propcrop = sum(PROPDMG_act, na.rm = TRUE) + sum(CROPDMG_act, na.rm = TRUE),
average_propertydmg = mean(PROPDMG_act, na.rm = TRUE),
average_cropdmg = mean(CROPDMG_act, na.rm = TRUE),
average_propcrop = mean(PROPDMG_act + CROPDMG_act, na.rm = TRUE)) %>%
arrange(desc(total_propcrop))
# Reshape the data to long format for plotting
storm_damage_grouped_long <- storm_damage_grouped %>%
pivot_longer(cols = c(total_propertydmg, total_cropdmg,
average_propertydmg, average_cropdmg),
names_to = c("Measure", "Type"),
names_sep = "_",
values_to = "Total") %>%
mutate(Type = factor(ifelse(Type == "propertydmg", "Property", "Crop"))) %>%
pivot_wider(names_from = "Measure", values_from = "Total")
Now the data can be analysed with the aid of ggplot2. First of all, we will check with a bar plot which 5 events caused the most human damage with deaths and injuries since 1996. Additionally, the side plot will show which 5 events have on average the greatest effect.
# Create bar chart of total fatalities and injuries per event type for the top 5 events
g = ggplot(storm_grouped_long[1:10, ], aes(x = reorder(cleaned_EVTYPE, desc(total_fatplusinj)), y = total, fill = Type)) +
geom_bar(stat = "identity") +
geom_text(aes(label = total), position = position_stack(vjust = 0.5), color = "black") +
labs(title = "Total Fatalities + Injuries due to Weather Events since 1996", x = "Event Type", y = "Total") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Reorder according to the average occurences
storm_grouped_long = storm_grouped_long %>% arrange(desc(average_fatplusinj))
# Create bar chart of average fatalities and injuries per event type for the top 5 events
h = ggplot(storm_grouped_long[1:10, ], aes(x = reorder(cleaned_EVTYPE, desc(average_fatplusinj)), y = average, fill = Type)) +
geom_bar(stat = "identity") +
geom_text(aes(label = sprintf("%.2f", average)), position = position_stack(vjust = 0.5), color = "black") +
labs(title = "Average Fatalities + Injuries due to Weather Events since 1996", x = "Event Type", y = "Average") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot the chart
print(g + h)
From the plot, we can clearly see that tornados are the most damaging event, especially in terms of injuries. Excessive heat causes a larger number of fatalities, and comes in second overall. Interestingly, heat is also in the top 5. On the average per occurrence, the heat events are also among the most degrading to human health.
The next plot will look at the economic effect of weather events. The first bar chart shows the accumulated property and crop damage by the top 5 most destructive events, while the second chart provides the average effect for the top 5.
# Create bar chart of total property and crop damage per event type for the top 10 events
g2 = ggplot(storm_damage_grouped_long[1:10, ], aes(x = reorder(cleaned_EVTYPE, desc(total_propcrop)), y = total, fill = Type)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scientific(total)), position = position_stack(vjust = 0.5), color = "black") +
labs(title = "Total Property + Crop Damage due to Weather Events since 1996", x = "Event Type", y = "Total") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Reorder according to the average occurrences
storm_damage_grouped_long = storm_damage_grouped_long %>% arrange(desc(average_propcrop))
# Create bar chart of average fatalities and injuries per event type for the top 5 events
h2 = ggplot(storm_damage_grouped_long[1:10, ], aes(x = reorder(cleaned_EVTYPE, desc(average_propcrop)), y = average, fill = Type)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scientific(average)), position = position_stack(vjust = 0.5), color = "black") +
labs(title = "Average Property + Crop Damage due to Weather Events since 1996", x = "Event Type", y = "Average") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot the chart
print(g2 + h2)
The results show on both plots floods, avalanche and dense smoke, which indicates how they cause great damage on a single event, and also how they happen frequently. Floods have by a large margin been the most damaging economic event in the US since 1996.