Reproducible research project 2

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The data analysis addressed the following questions:

- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
- Across the United States, which types of events have the greatest economic consequences?

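In conclusion: for events recorded since January 1996, tornadoes caused the most fatalities and injuries combined (22,178), and floods caused the greatest combined property and crop damage (about $148.9 billion).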

Download URL for the storm data: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

Further documentation of the database is available, with some description of how variables are constructed or defined:

- National Weather Service: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
- National Climatic Data Center Storm Events: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

## Data Processing

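# Download the compressed storm data unless a local copy already exists
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "StormData.csv.bz2"
if (!file.exists(destfile)) curl::curl_download(url, destfile)
# read.csv() reads the bz2-compressed file directly
Raw_data <- read.csv(file = destfile, header = TRUE, sep = ",")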

Mentor comments on the Coursera discussion forum give extra guidance that makes the task easier (https://www.coursera.org/learn/reproducible-research/discussions/weeks/4/threads/IdtP_JHzEeaePQ71AQUtYw).

These comments explain that although data collection started in 1950, all event types have been recorded only since January 1996. The analysis therefore uses only the data from 1996 onward and discards earlier records.

# subsetting by date
Main_data <- Raw_data
Main_data$BGN_DATE <- as.POSIXct(Raw_data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
Main_data <- subset(Main_data, BGN_DATE > as.POSIXct("1995-12-31"))
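
To illustrate the date filter above, here is a minimal sketch on a hypothetical three-row data frame. The name `toy` and its values are made up for illustration and are not rows from the database. The sketch assumes the BGN_DATE strings follow the month/day/year pattern of the format string above, and it fixes the time zone to UTC (an assumption) so the comparison does not depend on the local setting.

```r
# Hypothetical sample of BGN_DATE strings (not taken from the real file)
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "12/31/1995 0:00:00", "1/1/1996 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse month/day/year plus time; tz = "UTC" is an assumption for reproducibility
toy$BGN_DATE <- as.POSIXct(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Keep only events that begin on or after 1 January 1996
toy_recent <- subset(toy, BGN_DATE >= as.POSIXct("1996-01-01", tz = "UTC"))
print(nrow(toy_recent))
```

Only the 1996 row should survive the filter; the 1950 and 1995 rows are dropped.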

To answer the two questions we need the event types together with their health and economic consequences. The analysis therefore focuses on the following 7 variables: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.

Main_data <- subset(Main_data, select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
num_unique_events <- length(unique(Main_data$EVTYPE))
print(num_unique_events)

## [1] 516

There are 516 unique event types in the EVTYPE variable. We only need the ones that are most harmful to population health and the ones with the greatest economic consequences.

To answer the first question, we start with the events most harmful to population health, measured by the variables FATALITIES and INJURIES.

Health_data <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = Main_data, FUN=sum)
Health_data$PEOPLE_LOSS <- Health_data$FATALITIES + Health_data$INJURIES
Health_data <- Health_data[order(Health_data$PEOPLE_LOSS, decreasing = TRUE), ]
Top10_events_people <- Health_data[1:10,]
#knitr::kable(Top10_events_people, format = "markdown")
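
The aggregation step above can be sketched on a small example. The data frame `toy` and all of its values are hypothetical and invented for illustration; only the pattern (sum per event type, add the two columns, sort in descending order) mirrors the code above.

```r
# Hypothetical mini data set (values invented for illustration)
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(1, 2, 0, 3, 0),
  INJURIES   = c(4, 10, 1, 5, 2)
)

# Sum both health variables within each event type
toy_health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combine them into one measure and sort from most to least harmful
toy_health$PEOPLE_LOSS <- toy_health$FATALITIES + toy_health$INJURIES
toy_health <- toy_health[order(toy_health$PEOPLE_LOSS, decreasing = TRUE), ]
print(toy_health)
```

When fewer than ten event types exist, `head(toy_health, 10)` returns only the available rows, whereas indexing with `[1:10, ]` pads the result with NA rows.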

Now let's look at the economic consequences.

The values in the PROPDMGEXP and CROPDMGEXP columns represent exponents that indicate powers of ten. This means the total damage is calculated by multiplying the PROPDMG or CROPDMG value by 10 raised to the power specified in the exponent column.

### Exponent values are:

Letters, which correspond to specific magnitudes:

- B or b = Billion (10^9)
- M or m = Million (10^6)
- K or k = Thousand (10^3)
- H or h = Hundred (10^2)

Symbols:

- “-” = Indicates a value less than the stated amount.
- “+” = Suggests a value greater than the stated amount.
- “?” = Represents uncertainty or low confidence in the value.

These symbols (-, +, and ?) can be optionally ignored if they do not provide meaningful information.
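
As a worked example of the multiplication described above, the following sketch converts a few hypothetical damage records into full amounts. The lookup vector `exp_power` and the sample values are assumptions made up for illustration; this is a simplified variant of the idea, not the conversion function used in this analysis.

```r
# Hypothetical lookup from exponent letter to power of ten
exp_power <- c(H = 2, K = 3, M = 6, B = 9)

# Hypothetical damage records (invented values)
dmg     <- c(25, 1.5, 2)
dmg_exp <- c("K", "m", "B")

# Look up the power for each letter (case-insensitive) and scale the damage
power <- unname(exp_power[toupper(dmg_exp)])
total <- dmg * 10^power
print(total)
```

Here 25 with exponent K becomes 25 * 10^3, 1.5 with exponent m becomes 1.5 * 10^6, and 2 with exponent B becomes 2 * 10^9.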

# Function to convert damage exponents to numeric values
convert_dmg_exp <- function(exp_column) {
  # Replace letter and symbol codes with corresponding numbers
  exp_column <- gsub("[Hh]", "2", exp_column)  # Hundreds -> 10^2
  exp_column <- gsub("[Kk]", "3", exp_column)  # Thousands -> 10^3
  exp_column <- gsub("[Mm]", "6", exp_column)  # Millions -> 10^6
  exp_column <- gsub("[Bb]", "9", exp_column)  # Billions -> 10^9
  exp_column <- gsub("\\+", "1", exp_column)   # '+' -> 1 (Positive adjustment)
  exp_column <- gsub("[\\?\\-\\ ]", "0", exp_column)  # '?' '-' and empty space -> 0

  # Convert to numeric and handle any NAs by replacing them with 0
  exp_column <- as.numeric(exp_column)
  exp_column[is.na(exp_column)] <- 0
  return(exp_column)
}

# Apply the function to both PROPDMGEXP and CROPDMGEXP columns
Main_data$PROPDMGEXP <- convert_dmg_exp(Main_data$PROPDMGEXP)
Main_data$CROPDMGEXP <- convert_dmg_exp(Main_data$CROPDMGEXP)

Create the total property damage, the total crop damage, and the total economic loss.

# Load necessary library
library(dplyr)

# Create new columns for total property damage and total crop damage
Main_data <- mutate(Main_data, 
                    PROPDMGTOTAL = PROPDMG * (10 ^ PROPDMGEXP), 
                    CROPDMGTOTAL = CROPDMG * (10 ^ CROPDMGEXP))

# Aggregate total property and crop damage by event type (EVTYPE)
Economic_data <- aggregate(cbind(PROPDMGTOTAL, CROPDMGTOTAL) ~ EVTYPE, 
                           data = Main_data, FUN = sum)

# Create a new column for total economic loss (property + crop damage)
Economic_data$ECONOMIC_LOSS <- Economic_data$PROPDMGTOTAL + Economic_data$CROPDMGTOTAL

# Sort the data by total economic loss in descending order
Economic_data <- Economic_data[order(Economic_data$ECONOMIC_LOSS, decreasing = TRUE), ]

# Extract the top 10 events with the highest economic loss
Top10_events_economy <- Economic_data[1:10,]

# Display the top 10 events in a markdown table
#knitr::kable(Top10_events_economy, format = "markdown")

The grading criteria allow at most three figures, so the top-10 results for both questions are combined into a single table that still shows all the necessary data. Each event's rank within its top 10 is stored in the variables Rank_People_Loss and Rank_Economic_Loss. An event that appears in only one of the two top-10 lists has NA in the columns that belong to the other list.

# Load necessary library for data manipulation
library(dplyr)

# Add rank for people loss (descending order)
Top10_events_people <- Top10_events_people %>%
  mutate(Rank_People_Loss = row_number(-PEOPLE_LOSS))

# Add rank for economic loss (descending order)
Top10_events_economy <- Top10_events_economy %>%
  mutate(Rank_Economic_Loss = row_number(-ECONOMIC_LOSS))

# Combine the two tables by 'EVTYPE'
Combined_events <- full_join(Top10_events_people, Top10_events_economy, by = "EVTYPE")

# Display the combined table
knitr::kable(Combined_events, format = "markdown")

| EVTYPE | FATALITIES | INJURIES | PEOPLE_LOSS | Rank_People_Loss | PROPDMGTOTAL | CROPDMGTOTAL | ECONOMIC_LOSS | Rank_Economic_Loss |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|
| TORNADO | 1511 | 20667 | 22178 | 1 | 24616945710 | 283425010 | 24900370720 | 4 |
| EXCESSIVE HEAT | 1797 | 6391 | 8188 | 2 | NA | NA | NA | NA |
| FLOOD | 414 | 6758 | 7172 | 3 | 143944833550 | 4974778400 | 148919611950 | 1 |
| LIGHTNING | 651 | 4141 | 4792 | 4 | NA | NA | NA | NA |
| TSTM WIND | 241 | 3629 | 3870 | 5 | NA | NA | NA | NA |
| FLASH FLOOD | 887 | 1674 | 2561 | 6 | 15222203910 | 1334901700 | 16557105610 | 6 |
| THUNDERSTORM WIND | 130 | 1400 | 1530 | 7 | NA | NA | NA | NA |
| WINTER STORM | 191 | 1292 | 1483 | 8 | NA | NA | NA | NA |
| HEAT | 237 | 1222 | 1459 | 9 | NA | NA | NA | NA |
| HURRICANE/TYPHOON | 64 | 1275 | 1339 | 10 | 69305840000 | 2607872800 | 71913712800 | 2 |
| STORM SURGE | NA | NA | NA | NA | 43193536000 | 5000 | 43193541000 | 3 |
| HAIL | NA | NA | NA | NA | 14595143420 | 2476029450 | 17071172870 | 5 |
| HURRICANE | NA | NA | NA | NA | 11812819010 | 2741410000 | 14554229010 | 7 |
| DROUGHT | NA | NA | NA | NA | 1046101000 | 13367566000 | 14413667000 | 8 |
| TROPICAL STORM | NA | NA | NA | NA | 7642475550 | 677711000 | 8320186550 | 9 |
| HIGH WIND | NA | NA | NA | NA | 5247860360 | 633561300 | 5881421660 | 10 |
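
The NA entries in the table above come from the full join: an event that is present in only one of the two top-10 lists has no matching row in the other list. A minimal sketch with hypothetical two-row tables shows the effect. It uses base R's `merge()` with `all = TRUE`, swapped in here for `full_join()` so the example needs no packages; the table names and values are invented for illustration.

```r
# Hypothetical top lists (values invented for illustration)
people  <- data.frame(EVTYPE = c("TORNADO", "HEAT"),  Rank_People_Loss   = 1:2)
economy <- data.frame(EVTYPE = c("FLOOD", "TORNADO"), Rank_Economic_Loss = 1:2)

# all = TRUE keeps events from both tables, filling unmatched cells with NA
combined <- merge(people, economy, by = "EVTYPE", all = TRUE)
print(combined)
```

Only TORNADO appears in both hypothetical lists, so it is the only row with both ranks filled in.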

## Results

To answer questions 1 and 2, we look at the events most harmful to population health and at the events with the greatest economic consequences.

#Analyzing harmful events on population health
# Load necessary library for plotting
library(ggplot2)

# Plot the total people loss by event type
ggplot(Top10_events_people, aes(x = reorder(EVTYPE, PEOPLE_LOSS), y = PEOPLE_LOSS)) +
  geom_bar(stat = "identity", colour = "red") +
  labs(title = "Total People Loss in USA by Weather Events (1996-2011)",
       y = "Number of Fatalities and Injuries", 
       x = "Event Type") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

#Analyzing the total economic consequences of harmful events
# Plot the total economic loss by event type
ggplot(Top10_events_economy, aes(x = reorder(EVTYPE, ECONOMIC_LOSS), y = ECONOMIC_LOSS)) +
  geom_bar(stat = "identity", colour = "green") +
  labs(title = "Total Economic Loss in USA by Weather Events (1996-2011)",
       y = "Property and Crop Damage (USD)", 
       x = "Event Type") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

BoasW

2024-09-29