library(tidyverse)
library(openintro)

The assignment requires completion of exercises in Chapter 4 of the Diez et al. (2019) textbook: Exercises 4.2, 4.4, and 4.8. Each exercise has its own section and heading within this report. The assignment also requires Lab Chapter 4 from the OpenIntro textbook: https://openintrostat.github.io/oilabs-tidy/04_normal_distribution/normal_distribution.html

# Module 4 Homework

## Exercise 4.2

Explanation

The solution calculates the percentage of a standard normal distribution, N(0, 1), that falls within specific regions using the cumulative distribution function (CDF). According to Elassaiss-Schaap and Duisters (2020), the normal distribution’s standard deviation cannot exceed half the mean without extending into negative values, whereas the lognormal distribution spans beyond this range without producing negative values. For Z greater than -1.13, the percentage is 87.08%, calculated as 1 minus the CDF of -1.13, which gives the area to the right of Z = -1.13. For Z less than 0.18, the percentage is 57.14%, which is the CDF of 0.18, the area to the left of Z = 0.18. For Z greater than 8, the percentage is effectively 0.00%: 1 minus the CDF of 8 is negligible because the value lies far in the right tail. Absolute values function as a universal measure that standardizes preferences, allowing for consistent comparisons across different contexts with varying options (Solomyak et al., 2022). For the absolute value of Z less than 0.5, the percentage is 38.29%, calculated as the CDF of 0.5 minus the CDF of -0.5, which covers the area between Z = -0.5 and Z = 0.5. The graph below represents these regions with shaded areas.

Visualization and R code

# Begin code Exercise 4.2

# Load ggplot2 library, if not already done; for creating plots
library(ggplot2)

# Define parameters for the standard normal distribution
mu <- 0         # Mean of the standard normal distribution
sigma <- 1      # Standard deviation of the standard normal distribution

# PART_a: Calculate the percentage of the distribution where Z > -1.13
# pnorm calculates the cumulative probability up to Z = -1.13
PART_a <- 1 - pnorm(-1.13, mean = mu, sd = sigma) 
# Subtracting from 1 gives the probability of Z being greater than -1.13
cat("Percentage for Z > -1.13: ", PART_a * 100, "%\n")  # Print the result in percentage
## Percentage for Z > -1.13:  87.07619 %
# PART_b: Calculate the percentage of the distribution where Z < 0.18
# pnorm calculates the cumulative probability up to Z = 0.18
PART_b <- pnorm(0.18, mean = mu, sd = sigma)
# This directly gives the probability of Z being less than 0.18
cat("Percentage for Z < 0.18: ", PART_b * 100, "%\n")  # Print the result in percentage
## Percentage for Z < 0.18:  57.14237 %
# PART_c: Calculate the percentage of the distribution where Z > 8
# pnorm calculates the cumulative probability up to Z = 8
PART_c <- 1 - pnorm(8, mean = mu, sd = sigma)
# Subtracting from 1 gives the probability of Z being greater than 8
cat("Percentage for Z > 8: ", PART_c * 100, "%\n")  # Print the result in percentage
## Percentage for Z > 8:  6.661338e-14 %
# PART_d: Calculate the percentage of the distribution where |Z| < 0.5
# pnorm calculates the cumulative probability up to Z = 0.5
# Subtracting pnorm for Z = -0.5 gives the probability between -0.5 and 0.5
PART_d <- pnorm(0.5, mean = mu, sd = sigma) - pnorm(-0.5, mean = mu, sd = sigma)
cat("Percentage for |Z| < 0.5: ", PART_d * 100, "%\n")  # Print the result in percentage
## Percentage for |Z| < 0.5:  38.29249 %
# Create a sequence of Z values for plotting
z_values <- seq(-4, 4, length.out = 1000)  # Range of Z values from -4 to 4
density_values <- dnorm(z_values, mean = mu, sd = sigma)  # Compute density values for the Z sequence

# Create a data frame for ggplot2
df <- data.frame(Z = z_values, Density = density_values)

# Generate the plot with ggplot2
ggplot(df, aes(x = Z, y = Density)) +
  geom_line(color = "lightblue") +  # Line for the standard normal distribution density
  geom_area(data = df[df$Z > -1.13, ], 
            aes(x = Z, y = Density), fill = "#B3CDE0", alpha = 0.5) +  # Light blue area for Z > -1.13
  geom_area(data = df[df$Z < 0.18, ], 
            aes(x = Z, y = Density), fill = "#A2C2E0", alpha = 0.5) +  # Slightly darker blue for Z < 0.18
  geom_area(data = df[df$Z > 8, ], 
            aes(x = Z, y = Density), fill = "#7D9AC3", alpha = 0.5) +  # Even darker blue for Z > 8
  geom_area(data = df[abs(df$Z) < 0.5, ], 
            aes(x = Z, y = Density), fill = "#6B9AC3", alpha = 0.5) +  # Dark blue for |Z| < 0.5
  labs(title = "4.2 Area under the curve, Part II",  # Title of the plot
       x = "Z", y = "Density") +  # Labels for x and y axes
  theme_dark() +  # Apply dark theme for the plot
  theme(
    plot.title = element_text(hjust = 0.5, size = 18, face = "bold", color = "navy"),  # Title formatting
    axis.title.x = element_text(size = 14, color = "navy"),  # X-axis title formatting
    axis.title.y = element_text(size = 14, color = "navy"),  # Y-axis title formatting
    axis.text = element_text(size = 12, color = "navy"),  # Axis text formatting
    panel.grid = element_blank(),  # Remove gridlines for a cleaner look
    panel.border = element_rect(color = "white", fill = NA, linewidth = 0.5)  # White border around the plot
  )

# End of code Exercise 4.2; summary of visualizations:
# The R code calculates and visualizes the percentages of a standard normal distribution within specific regions.
# It shows that 87.08% of the distribution lies to the right of Z = -1.13, while 57.14% is to the left of Z = 0.18.
# For Z > 8, the percentage is effectively 0.00%, indicating that this Z-score is far into the tail with minimal area.
# The percentage for |Z| < 0.5 is 38.29%, covering the area between Z = -0.5 and Z = 0.5.
# The graph uses overlapping shaded regions to depict these areas. The region Z > 8 lies outside the plotted range of -4 to 4, so it does not appear in the graph.
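For a tail as extreme as Z > 8, subtracting the CDF from 1 runs into floating-point rounding, because `pnorm(8)` is within a few machine epsilons of 1. A minimal sketch of one alternative, using the `lower.tail = FALSE` argument of base R's `pnorm()`, follows. The variable names `naive_tail` and `direct_tail` are illustrative and are not part of the original solution.

```r
# Upper-tail area for Z > 8 computed two ways (illustrative variable names).
naive_tail  <- 1 - pnorm(8)                  # subtracts two nearly equal numbers
direct_tail <- pnorm(8, lower.tail = FALSE)  # requests the upper tail directly

# Both values are effectively zero, but they differ in their leading digits
# because the subtraction loses precision.
cat("1 - pnorm(8):                ", naive_tail, "\n")
cat("pnorm(8, lower.tail = FALSE):", direct_tail, "\n")
```

Either way, the percentage rounds to 0.00%, so the conclusion for Z > 8 is unchanged; the direct form only matters when the tiny tail value itself is of interest.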

## Exercise 4.4

Explanation

At the Hermosa Beach Triathlon, competitors were categorized by gender and age. According to Moen et al. (2021), categorizing populations into groups based on shared characteristics, such as race, ethnicity, social class, employment, and gender, allows for intersectional analysis of the complexities within those populations. Racers Leo and Mary want to evaluate their performance within these categories. Leo competed in the men, ages 30-34 group and finished in 4948 seconds. Mary competed in the women, ages 25-29 group and finished in 5513 seconds. The men's group has a mean finishing time of 4313 seconds with a standard deviation of 583 seconds. The women's group has a mean of 5261 seconds with a standard deviation of 807 seconds. The finishing times in both groups are approximately normally distributed.

Exercise 4.4: Part A The finishing times of each group can be written in normal distribution notation. For the men, ages 30-34 group, the distribution is N(4313, 583), where 4313 seconds is the mean and 583 seconds is the standard deviation. For the women, ages 25-29 group, the distribution is N(5261, 807), with a mean of 5261 seconds and a standard deviation of 807 seconds. This notation describes the expected distribution of finishing times in each group, assuming a normal distribution.

Exercise 4.4: Part B The Z score is the observed time minus the group mean, divided by the group standard deviation. For Leo: Z equals (4948 minus 4313) divided by 583, which is approximately 1.09. Leo's time was 1.09 standard deviations above the mean of his group, so he finished slower than the group average. For Mary: Z equals (5513 minus 5261) divided by 807, which is approximately 0.31. Mary's time was 0.31 standard deviations above the mean of her group, so she also finished slower than average, but by less. The Z scores quantify how Leo's and Mary's times compare to their group averages.

Exercise 4.4: Part C Mary ranked better in her group than Leo did in his. Because a lower finishing time is better, a lower Z score indicates a better performance relative to the group mean. Mary's Z score of approximately 0.31 is lower than Leo's Z score of approximately 1.09, so her time was closer to her group's average than his was to his group's average.

Exercise 4.4: Part D Leo finished faster than approximately 13.79% of the men in his group. The CDF at Z equals 1.09 is approximately 0.8621, which is the proportion of men with a lower (faster) time than Leo. The proportion Leo finished faster than is therefore 1 minus 0.8621, or about 0.1379.

Exercise 4.4: Part E Mary finished faster than approximately 37.83% of the women in her group. The CDF at Z equals 0.31 is approximately 0.6217, which is the proportion of women with a lower (faster) time than Mary. The proportion Mary finished faster than is therefore 1 minus 0.6217, or about 0.3783.

Exercise 4.4: Part F If the distributions of finishing times were not nearly normal, the Z scores in Part B would not change, because a Z score can be calculated for any distribution. The answers to Parts C through E could change, however, because converting a Z score to a percentile with the normal CDF is valid only under the normality assumption. Bono et al. (2017) suggest that when data do not conform to a normal distribution, researchers should consider alternative approaches, such as generalized additive models for location, scale, and shape, to determine a more appropriate distribution for the response variable. For example, if the finishing times for the men, ages 30-34 group were right skewed, the mean and standard deviation might not accurately describe the distribution, and the percentage of triathletes Leo finished faster than, as calculated from the normal model, could be incorrect.

Similarly, if Mary's group had a strong left skew, the normal model could misstate how her performance compares to others. Non-normal distributions require methods that do not rely on the normal CDF to assess performance accurately.
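The Z scores and percentages for Parts B through E can be reproduced with base R. The sketch below assumes the normal model stated in the exercise and rounds each Z score to two decimals to mirror a normal probability table. The variable names `z_leo`, `z_mary`, `share_beaten_leo`, and `share_beaten_mary` are illustrative.

```r
# Group parameters and finishing times as stated in the exercise (seconds).
mu_men   <- 4313; sigma_men   <- 583; time_leo  <- 4948
mu_women <- 5261; sigma_women <- 807; time_mary <- 5513

# Z scores, rounded to two decimals to mirror a normal probability table.
z_leo  <- round((time_leo  - mu_men)   / sigma_men,   2)
z_mary <- round((time_mary - mu_women) / sigma_women, 2)

# A lower time is better, so the share of the group each racer beat is the
# area ABOVE their time (the upper tail of the normal curve).
share_beaten_leo  <- pnorm(z_leo,  lower.tail = FALSE)
share_beaten_mary <- pnorm(z_mary, lower.tail = FALSE)

cat("Leo:  Z =", z_leo,  "| finished faster than",
    round(share_beaten_leo  * 100, 2), "% of his group\n")
cat("Mary: Z =", z_mary, "| finished faster than",
    round(share_beaten_mary * 100, 2), "% of her group\n")
```

Using the unrounded Z scores instead would shift the percentages slightly in the first decimal place; the ranking of the two racers is the same either way.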

#Begin code Exercise 4.4

# Load libraries, if needed
library(ggplot2)   # For creating plots
library(gridExtra) # For arranging multiple plots
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Define the parameters for each group
# Men, Ages 30 - 34
mu_men <- 4313     # Mean finishing time in seconds for men, ages 30-34
sigma_men <- 583   # Standard deviation of finishing times for men, ages 30-34
time_leo <- 4948   # Leo's finishing time in seconds

# Women, Ages 25 - 29
mu_women <- 5261   # Mean finishing time in seconds for women, ages 25-29
sigma_women <- 807 # Standard deviation of finishing times for women, ages 25-29
time_mary <- 5513  # Mary's finishing time in seconds

# Create a sequence of finishing times for plotting
# This sequence will be used to plot the density functions for both groups
finishing_times_men <- seq(mu_men - 4*sigma_men, mu_men + 4*sigma_men, length.out = 1000)
finishing_times_women <- seq(mu_women - 4*sigma_women, mu_women + 4*sigma_women, length.out = 1000)

# Calculate densities for the normal distributions
# This is used to plot the normal distribution curves
density_men <- dnorm(finishing_times_men, mean = mu_men, sd = sigma_men)
density_women <- dnorm(finishing_times_women, mean = mu_women, sd = sigma_women)

# Create data frames for plotting
df_men <- data.frame(FinishingTime = finishing_times_men, Density = density_men)
df_women <- data.frame(FinishingTime = finishing_times_women, Density = density_women)

# Create the plot for Leo
plot_leo <- ggplot(df_men, aes(x = FinishingTime, y = Density)) +
  geom_line(color = "lightblue") +  # Line representing the density function
  geom_area(data = df_men[df_men$FinishingTime > time_leo, ],
            aes(x = FinishingTime, y = Density), fill = "#B3CDE0", alpha = 0.5) +  # Shaded area: times slower than Leo's (the racers he beat)
  geom_vline(xintercept = time_leo, color = "darkblue", linetype = "dashed") +  # Vertical line for Leo's finishing time
  labs(title = "Leo's Time: Men 30-34",  # Title of the plot
       x = "Finishing Time (seconds) \nMen, Ages 30-34",  # X-axis label
       y = "Density of Finishing Times") +  # Y-axis label
  theme_dark() +  # Dark theme for the plot background
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold", color = "darkblue"),  # Formatting for the title
    axis.title.x = element_text(size = 12, color = "darkblue"),  # X-axis title formatting
    axis.title.y = element_text(size = 12, color = "darkblue"),  # Y-axis title formatting
    axis.text = element_text(size = 10, color = "darkblue"),  # Axis text formatting
    panel.grid = element_blank(),  # Remove gridlines for a cleaner look
    panel.border = element_rect(color = "white", fill = NA, linewidth = 0.5),  # Border around the plot
    plot.margin = margin(20, 10, 10, 10)  # Increased top margin to avoid overlap
  )

# Create the plot for Mary
plot_mary <- ggplot(df_women, aes(x = FinishingTime, y = Density)) +
  geom_line(color = "lightpink") +  # Line representing the density function
  geom_area(data = df_women[df_women$FinishingTime > time_mary, ],
            aes(x = FinishingTime, y = Density), fill = "#F5B0B3", alpha = 0.5) +  # Shaded area: times slower than Mary's (the racers she beat)
  geom_vline(xintercept = time_mary, color = "darkred", linetype = "dashed") +  # Vertical line for Mary's finishing time
  labs(title = "Mary's Time: Women 25-29",  # Title of the plot
       x = "Finishing Time (seconds) \nWomen, Ages 25-29",  # X-axis label
       y = "Density of Finishing Times") +  # Y-axis label
  theme_dark() +  # Dark theme for the plot background
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold", color = "darkred"),  # Formatting for the title
    axis.title.x = element_text(size = 12, color = "darkred"),  # X-axis title formatting
    axis.title.y = element_text(size = 12, color = "darkred"),  # Y-axis title formatting
    axis.text = element_text(size = 10, color = "darkred"),  # Axis text formatting
    panel.grid = element_blank(),  # Remove gridlines for a cleaner look
    panel.border = element_rect(color = "white", fill = NA, linewidth = 0.5),  # Border around the plot
    plot.margin = margin(20, 10, 10, 10)  # Increased top margin to avoid overlap
  )

# Combine the plots into a single view
grid.arrange(plot_leo, plot_mary, ncol = 2, widths = c(1, 1), top = "Triathletes' Performance")

# End of code Exercise 4.4; summary of visualizations:
# The plots visualize the finishing times of Leo and Mary within their respective triathlon age groups.
# For Leo, competing in the Men, Ages 30-34 group, his finishing time of 4948 seconds is marked by a vertical dashed line.
# The shaded area to the right of this line is the proportion of men who finished slower than Leo, that is, the share of his group he outperformed.
# For Mary, competing in the Women, Ages 25-29 group, her finishing time of 5513 seconds is marked the same way, and the shaded area to the right of her line is the proportion of women who finished slower than Mary.
# Leo's shaded area is small (about 14% of his group), whereas Mary's is larger (about 38% of her group).
# Although Leo's raw time was faster than Mary's, Mary performed better relative to her group than Leo did relative to his.

## Exercise 4.8

Explanation

The Capital Asset Pricing Model (CAPM) is a financial model that assumes returns on a portfolio are normally distributed. The single-period, one-factor CAPM, developed by Sharpe (1964), is the most basic and widely used benchmark asset pricing model in both practice and research, according to a professional organization’s survey report (Pham & Phuoc, 2020). The portfolio in this exercise has an average annual return of 14.7% and a standard deviation of 33%. A return of 0% means the portfolio value remains unchanged. Negative returns show losses, while positive returns show growth. According to Thonon et al. (2023), financial models support diverse methodologies for calculating the return on investment, and more standardized designs make it easier to compare positive and negative returns. The CAPM provides insight into the distribution of returns and helps evaluate the probability of various financial outcomes.

Exercise 4.8: Part A The percentage of years in which the portfolio loses money is the probability of a return below 0%. The portfolio returns follow a normal distribution with a mean of 14.7% and a standard deviation of 33%. The Z score for a 0% return is (0 minus 0.147) divided by 0.33, which is approximately -0.445. The cumulative probability at this Z score is about 0.328, so the portfolio loses money in about 32.8% of years.

Exercise 4.8: Part B The cutoff for the highest 15% of annual returns is the 85th percentile of the distribution. The Z score corresponding to the 85th percentile is approximately 1.036. Converting this Z score back to a return gives 0.147 plus (1.036 times 0.33), which is approximately 0.489. The cutoff for the highest 15% of annual returns is therefore about 48.9%.
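The arithmetic for Parts A and B can be written out as a short worked calculation. The symbols below are notation introduced here for illustration: X is the annual return, Φ is the standard normal CDF, and z with subscript 0.85 is the 85th percentile of the standard normal distribution. All values are rounded.

```latex
\[
\begin{aligned}
Z &= \frac{0 - 0.147}{0.33} \approx -0.445,
  & P(X < 0) &= \Phi(-0.445) \approx 0.328, \\
z_{0.85} &\approx 1.036,
  & x_{0.85} &= 0.147 + 1.036 \times 0.33 \approx 0.489 .
\end{aligned}
\]
```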

# Load the necessary library
library(ggplot2)

# Define the parameters for the normal distribution
# Mean annual return (14.7%) is converted to decimal form (0.147)
mu <- 0.147  # Mean return
# Standard deviation (33%) is converted to decimal form (0.33)
sigma <- 0.33  # Standard deviation

# Create a sequence of return values from -0.5 to 0.5 with 1000 points
returns <- seq(-0.5, 0.5, length.out = 1000)
# Compute the probability density function for these return values
density <- dnorm(returns, mean = mu, sd = sigma)

# Create a data frame for plotting
df <- data.frame(Return = returns, Density = density)

# Calculate the Z-score for a return of 0% (i.e., when Return = 0)
z_score_below_zero <- (0 - mu) / sigma
# Compute the probability of the return being less than 0% using the Z-score
probability_below_zero <- pnorm(z_score_below_zero)

# Calculate the Z-score corresponding to the 85th percentile
z_score_85th <- qnorm(0.85)
# Compute the cutoff return value for the top 15% of returns
cutoff_return <- mu + z_score_85th * sigma

# Create the plot using ggplot2
p <- ggplot(df, aes(x = Return, y = Density)) +
  # Plot the normal distribution curve in dark grey
  geom_line(color = "#333333") +  # Dark grey color for the curve
  
  # Highlight the area where returns are less than 0%
  geom_area(data = df[df$Return < 0, ], aes(x = Return, y = Density), fill = "#CC3333", alpha = 0.6) +  # Deep red color
  
  # Add a vertical dashed line at 0% return
  geom_vline(xintercept = 0, linetype = "dashed", color = "#FF0000") +  # Red line for 0% return
  
  # Add a vertical dashed line at the cutoff for the highest 15% of returns
  geom_vline(xintercept = cutoff_return, linetype = "dashed", color = "#009900") +  # Dark green line for 85th percentile
  
  # Add descriptive labels and a caption to the plot
  labs(
    title = "4.8 CAPM",
    x = "Annual Return (as a proportion)",
    y = "Probability Density",
    caption = sprintf("Mean Return: %.2f%% | Std Dev: %.2f%%\nCutoff for Top 15%%: %.2f%% | Percentage of Years with Return < 0%%: %.2f%%",
                      mu * 100, sigma * 100, cutoff_return * 100, probability_below_zero * 100)
  ) +
  
  # Annotate the plot with text for clarity
  annotate("text", x = -0.4, y = max(df$Density) * 0.7, label = sprintf("Return < 0%%: %.2f%%", probability_below_zero * 100), color = "#FF0000") +
  annotate("text", x = cutoff_return, y = max(df$Density) * 0.7, label = sprintf("85th Percentile: %.2f%%", cutoff_return * 100), color = "#009900") +
  
  # Adjust the x-axis and y-axis limits to ensure the entire plot is visible
  xlim(-0.5, 0.5) +  # Extend x-axis limits
  ylim(0, max(df$Density) * 1.2) +  # Extend y-axis limits
  
  # Apply a minimal theme with customized text and plot appearance
  theme_minimal(base_size = 15) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12),
    plot.margin = margin(10, 10, 10, 10)  # Increase margins to prevent cut-off
  )

# Print the plot
print(p)

# End of code Exercise 4.8; summary of visualizations:
# The plot shows portfolio returns under the CAPM assumption that returns are normally distributed.
# The dark grey curve is the normal density for annual returns with a mean of 14.7% and a standard deviation of 33%, drawn over the range -50% to 50%.
# The red shaded area covers returns below 0%. A red dashed line marks the 0% return threshold, and a dark green dashed line marks the cutoff for the highest 15% of returns.
# Annotations give the percentage of years with a return below 0% and the cutoff return for the top 15% of returns.
# Together these show how returns are distributed and where the portfolio's loss and top-15% thresholds fall.

Lab Chapter 4 from OpenIntro Textbook

# Set CRAN mirror
options(repos = c(CRAN = "https://cran.rstudio.com"))

# Install the packages only if they are missing, so knitting does not try to reinstall packages that are already in use
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("openintro", quietly = TRUE)) install.packages("openintro")
library(tidyverse)
library(openintro)

Exercise 1

# Exercise 1: Filter Data
# Filter data for McDonald's and Dairy Queen
mcdonalds <- fastfood %>% filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>% filter(restaurant == "Dairy Queen")

Exercise 2

# Exercise 2: Visualize Distributions

# McDonald's Data Visualization
# Plot histogram of calories from fat at McDonald's
ggplot(mcdonalds, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  labs(title = "Calories from Fat at McDonald's", x = "Calories from Fat", y = "Density") +
  theme_minimal() +
  theme(axis.text = element_text(color = "green"), axis.title = element_text(color = "green"))

# Dairy Queen Data Visualization
# Plot histogram of calories from fat at Dairy Queen
ggplot(dairy_queen, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  labs(title = "Calories from Fat at Dairy Queen", x = "Calories from Fat", y = "Density") +
  theme_minimal() +
  theme(axis.text = element_text(color = "green"), axis.title = element_text(color = "green"))

Exercise 3

# Exercise 3: Normal Distribution

# Calculate mean and standard deviation for Dairy Queen
dqmean <- mean(dairy_queen$cal_fat, na.rm = TRUE)
dqsd <- sd(dairy_queen$cal_fat, na.rm = TRUE)

# Create a density histogram with a normal distribution curve for Dairy Queen
ggplot(dairy_queen, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd), color = "tomato", linewidth = 1) +
  labs(title = "Calories from Fat at Dairy Queen with Normal Curve", x = "Calories from Fat", y = "Density") +
  theme_minimal() +
  theme(axis.text = element_text(color = "green"), axis.title = element_text(color = "green"))

# Create a normal Q-Q plot for Dairy Queen’s calories from fat
ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Q-Q Plot of Calories from Fat at Dairy Queen", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal() +
  theme(axis.text = element_text(color = "red"), axis.title = element_text(color = "orange"))

Exercise 4

# Exercise 4: Simulate Normal Data and Create Q-Q Plot

# Simulate normal data
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

# Create a Q-Q plot of simulated normal data
ggplot(data.frame(sim_norm), aes(sample = sim_norm)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Q-Q Plot of Simulated Normal Data", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal() +
  theme(axis.text = element_text(color = "yellow"), axis.title = element_text(color = "pink"))

Exercise 5

# Exercise 5: Probability Calculations

# Calculate the theoretical probability for Dairy Queen items having more than 600 calories from fat
prob_theoretical <- 1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
print(paste("Theoretical probability of more than 600 calories from fat:", prob_theoretical))
## [1] "Theoretical probability of more than 600 calories from fat: 0.0150152297382053"
# Calculate the empirical probability for Dairy Queen items having more than 600 calories from fat
prob_empirical <- dairy_queen %>%
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
print(prob_empirical)
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1  0.0476

Exercise 6

# Exercise 6: McDonald's Probability Calculations

# Calculate the mean and standard deviation for McDonald's calories from fat
mcmean <- mean(mcdonalds$cal_fat, na.rm = TRUE)
mcsd <- sd(mcdonalds$cal_fat, na.rm = TRUE)

# Calculate the theoretical probability for McDonald's items having more than 500 calories from fat
prob_mc_theoretical <- 1 - pnorm(q = 500, mean = mcmean, sd = mcsd)
print(paste("Theoretical probability for McDonald's with more than 500 calories from fat:", prob_mc_theoretical))
## [1] "Theoretical probability for McDonald's with more than 500 calories from fat: 0.16589503676763"
# Calculate the empirical probability for McDonald's items having more than 500 calories from fat
prob_mc_empirical <- mcdonalds %>%
  filter(cal_fat > 500) %>%
  summarise(percent = n() / nrow(mcdonalds))
print(prob_mc_empirical)
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.105
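Exercises 5 and 6 repeat the same comparison: a tail probability from a fitted normal model against the observed share of items above a cutoff. One way to package that pattern is sketched below in base R. The helper `compare_tail()` and the vector `example_cal_fat` are hypothetical and made up for illustration; the values are not taken from the `fastfood` data.

```r
# Hypothetical helper: compare a normal-model tail probability with the
# observed share of values above a cutoff.
compare_tail <- function(x, cutoff) {
  x <- x[!is.na(x)]  # drop missing values before summarizing
  c(
    # Upper-tail area under a normal curve fitted with the sample mean and sd
    theoretical = pnorm(cutoff, mean = mean(x), sd = sd(x), lower.tail = FALSE),
    # Share of observed values strictly above the cutoff
    empirical   = mean(x > cutoff)
  )
}

# Made-up, right-skewed example values (NOT the fastfood data).
example_cal_fat <- c(60, 90, 120, 150, 180, 200, 220, 250, 300, 340, 410, 520, 670)
result <- compare_tail(example_cal_fat, cutoff = 600)
print(result)
```

With the lab data, the call would look like `compare_tail(dairy_queen$cal_fat, cutoff = 600)`, assuming the `dairy_queen` data frame from Exercise 1 is available. A large gap between the two numbers suggests the normal model fits the upper tail poorly.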
# End of Lab Chapter 4 from OpenIntro Textbook

References

Bono, R., Blanca, M., Arnau, J., & Gómez-Benito, J. (2017). Non-normal distributions commonly used in health, education, and social sciences: A systematic review. Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.01602

Diez, D., Barr, C., & Çetinkaya-Rundel, M. (2019). OpenIntro statistics (4th ed.).

Elassaiss-Schaap, J., & Duisters, K. (2020). Variability in the log domain and limitations to its approximation by the normal distribution. Pharmacometrics & Systems Pharmacology, 9(5). https://doi.org/10.1002/psp4.12507

Moen, P., Flood, S., & Wang, J. (2021). The uneven later work course: Intersectional gender, age, race, and class disparities. The Journals of Gerontology: Series B, 77(1). https://doi.org/10.1093/geronb/gbab039

Pham, C., & Phuoc, L. (2020). An augmented capital asset pricing model using new macroeconomic determinants. Heliyon, 6(10), Article e05185. https://doi.org/10.1016/j.heliyon.2020.e05185

Solomyak, L., Sharp, P., & Eldar, E. (2022). Training diversity promotes absolute-value-guided choice. PLoS Computational Biology, 18(11), Article e1010664. https://doi.org/10.1371/journal.pcbi.1010664

Thonon, F., Godon-Rensonnet, A., Perozziello, A., Garsi, J., Dab, W., & Emsalem, P. (2023). Return on investment of workplace-based prevention interventions: A systematic review. European Journal of Public Health, 33(4), 612–618. https://doi.org/10.1093/eurpub/ckad092

---
title: "Module_04_HW_Lab_04"
author: "Anthony V. Razzano, DHA"
date: "2024-09-09"
output:
  pdf_document:
    latex_engine: xelatex
  word_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

The assignment requires completion of exercises in Chapter 4 of the Diez et al. (2019) textbook: Exercises 4.2, 4.4, and 4.8. Each exercise has its own section and heading within this report. The assignment also requires Lab Chapter 4 from the OpenIntro textbook: https://openintrostat.github.io/oilabs-tidy/04_normal_distribution/normal_distribution.html

# Module 4 Homework

## Exercise 4.2

### Explanation

The solution calculates the percentage of a standard normal distribution, N(0, 1), that falls within specific regions using the cumulative distribution function (CDF). According to Elassaiss-Schaap and Duisters (2020), the normal distribution's standard deviation cannot exceed half the mean without extending into negative values, whereas the lognormal distribution spans beyond this range without producing negative values. For Z greater than -1.13, the percentage is 87.08%, calculated as 1 minus the CDF of -1.13, which gives the area to the right of Z = -1.13. For Z less than 0.18, the percentage is 57.14%, which is the CDF of 0.18, the area to the left of Z = 0.18. For Z greater than 8, the percentage is effectively 0.00%: 1 minus the CDF of 8 is negligible because the value lies far in the right tail. Absolute values function as a universal measure that standardizes preferences, allowing for consistent comparisons across different contexts with varying options (Solomyak et al., 2022). For the absolute value of Z less than 0.5, the percentage is 38.29%, calculated as the CDF of 0.5 minus the CDF of -0.5, which covers the area between Z = -0.5 and Z = 0.5. The graph below represents these regions with shaded areas.


### Visualization and R code

```{r}

# Begin code Exercise 4.2

# Load ggplot2 library, if not already done; for creating plots
library(ggplot2)

# Define parameters for the standard normal distribution
mu <- 0         # Mean of the standard normal distribution
sigma <- 1      # Standard deviation of the standard normal distribution

# PART_a: Calculate the percentage of the distribution where Z > -1.13
# pnorm calculates the cumulative probability up to Z = -1.13
PART_a <- 1 - pnorm(-1.13, mean = mu, sd = sigma) 
# Subtracting from 1 gives the probability of Z being greater than -1.13
cat("Percentage for Z > -1.13: ", PART_a * 100, "%\n")  # Print the result in percentage

# PART_b: Calculate the percentage of the distribution where Z < 0.18
# pnorm calculates the cumulative probability up to Z = 0.18
PART_b <- pnorm(0.18, mean = mu, sd = sigma)
# This directly gives the probability of Z being less than 0.18
cat("Percentage for Z < 0.18: ", PART_b * 100, "%\n")  # Print the result in percentage

# PART_c: Calculate the percentage of the distribution where Z > 8
# pnorm calculates the cumulative probability up to Z = 8
PART_c <- 1 - pnorm(8, mean = mu, sd = sigma)
# Subtracting from 1 gives the probability of Z being greater than 8
cat("Percentage for Z > 8: ", PART_c * 100, "%\n")  # Print the result in percentage

# PART_d: Calculate the percentage of the distribution where |Z| < 0.5
# pnorm calculates the cumulative probability up to Z = 0.5
# Subtracting pnorm for Z = -0.5 gives the probability between -0.5 and 0.5
PART_d <- pnorm(0.5, mean = mu, sd = sigma) - pnorm(-0.5, mean = mu, sd = sigma)
cat("Percentage for |Z| < 0.5: ", PART_d * 100, "%\n")  # Print the result in percentage

# Create a sequence of Z values for plotting
z_values <- seq(-4, 4, length.out = 1000)  # Range of Z values from -4 to 4
density_values <- dnorm(z_values, mean = mu, sd = sigma)  # Compute density values for the Z sequence

# Create a data frame for ggplot2
df <- data.frame(Z = z_values, Density = density_values)

# Generate the plot with ggplot2
ggplot(df, aes(x = Z, y = Density)) +
  geom_line(color = "lightblue") +  # Line for the standard normal distribution density
  geom_area(data = df[df$Z > -1.13, ], 
            aes(x = Z, y = Density), fill = "#B3CDE0", alpha = 0.5) +  # Light blue area for Z > -1.13
  geom_area(data = df[df$Z < 0.18, ], 
            aes(x = Z, y = Density), fill = "#A2C2E0", alpha = 0.5) +  # Slightly darker blue for Z < 0.18
  geom_area(data = df[df$Z > 8, ], 
            aes(x = Z, y = Density), fill = "#7D9AC3", alpha = 0.5) +  # Even darker blue for Z > 8
  geom_area(data = df[abs(df$Z) < 0.5, ], 
            aes(x = Z, y = Density), fill = "#6B9AC3", alpha = 0.5) +  # Dark blue for |Z| < 0.5
  labs(title = "4.2 Area under the curve, Part II",  # Title of the plot
       x = "Z", y = "Density") +  # Labels for x and y axes
  theme_dark() +  # Apply dark theme for the plot
  theme(
    plot.title = element_text(hjust = 0.5, size = 18, face = "bold", color = "navy"),  # Title formatting
    axis.title.x = element_text(size = 14, color = "navy"),  # X-axis title formatting
    axis.title.y = element_text(size = 14, color = "navy"),  # Y-axis title formatting
    axis.text = element_text(size = 12, color = "navy"),  # Axis text formatting
    panel.grid = element_blank(),  # Remove gridlines for a cleaner look
    panel.border = element_rect(color = "white", fill = NA, linewidth = 0.5)  # White border around the plot
  )


# End of code Exercise 4.2; summary of visualizations:
# The R code calculates and visualizes the percentages of a standard normal distribution within specific regions.
# It shows that 87.08% of the distribution lies to the right of Z = -1.13, while 57.14% is to the left of Z = 0.18.
# For Z > 8, the percentage is effectively 0.00%: this Z-score is far into the tail, and it also lies outside the plotted range of -4 to 4, so no shading appears for it.
# The percentage for |Z| < 0.5 is 38.29%, covering the area between Z = -0.5 and Z = 0.5.
# The graph uses overlapping shaded regions to depict these areas, showing how much of the distribution falls within or beyond the specified Z-score thresholds.

```
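
One numerical caveat applies to the Z > 8 case in the chunk above: `1 - pnorm(8)` subtracts two nearly equal numbers, so most of the significant digits are lost to floating-point rounding. The sketch below shows an alternative that asks `pnorm()` for the upper tail directly through its `lower.tail` argument. The variable names are illustrative only and are not used elsewhere in this report.

```{r}
# Upper-tail area for Z > 8, computed two ways (illustrative variable names)
tail_by_subtraction <- 1 - pnorm(8)            # pnorm(8) is almost exactly 1, so precision is lost
tail_direct <- pnorm(8, lower.tail = FALSE)    # computes the upper tail directly

cat("1 - pnorm(8):                ", format(tail_by_subtraction, digits = 4), "\n")
cat("pnorm(8, lower.tail = FALSE):", format(tail_direct, digits = 4), "\n")

# For moderate Z values the two approaches agree to rounding error
all.equal(1 - pnorm(-1.13), pnorm(-1.13, lower.tail = FALSE))
```

Both results round to 0.00%, so the reported answer does not change; the direct form simply keeps the tiny tail value accurate.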


## Exercise 4.4

### Explanation

At the Hermosa Beach Triathlon, competitors were categorized by gender and age. According to Moen et al. (2021), categorizing populations into groups based on shared characteristics, such as race, ethnicity, social class, employment, and gender, allows for intersectional analysis of the complexities within those populations. Racers Leo and Mary want to evaluate their performance within these categories. Leo competed in the male 30-34 age group and finished in 4948 seconds. Mary competed in the female 25-29 age group and finished in 5513 seconds. The male group has a mean finishing time of 4313 seconds with a standard deviation of 583 seconds. The female group has a mean of 5261 seconds with a standard deviation of 807 seconds. The finishing times in both groups are approximately normally distributed.

#### Exercise 4.4: Part A

The two normal distributions describe finishing times in Leo's and Mary's respective age and gender groups. For men ages 30-34, the distribution is N(4313, 583), where 4313 seconds is the mean and 583 seconds is the standard deviation. For women ages 25-29, the distribution is N(5261, 807), with a mean of 5261 seconds and a standard deviation of 807 seconds. This notation describes the expected distribution of finishing times in each group, assuming normality.

#### Exercise 4.4: Part B

To calculate the Z scores for Leo and Mary, use the formula Z = (X - mean) / standard deviation. For Leo, Z = (4948 - 4313) / 583, which is approximately 1.09. Because a larger time is a slower time, this means Leo finished 1.09 standard deviations slower than the mean time in his group. For Mary, Z = (5513 - 5261) / 807, which is approximately 0.31. This indicates Mary finished 0.31 standard deviations slower than the mean time in her group. The Z scores quantify how Leo's and Mary's times compare to their group averages.
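
As a quick check of the arithmetic, the Z scores can be computed directly in R. This is a minimal sketch: the helper name `z_score` and the variable names are illustrative and are not part of the exercise.

```{r}
# Hypothetical helper: standardize an observation x against a group mean and SD
z_score <- function(x, mean, sd) (x - mean) / sd

z_leo  <- z_score(4948, mean = 4313, sd = 583)   # men, ages 30-34
z_mary <- z_score(5513, mean = 5261, sd = 807)   # women, ages 25-29

round(c(Leo = z_leo, Mary = z_mary), 2)
```

Both Z scores are positive, which means both finishing times lie above their group means.
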

#### Exercise 4.4: Part C

Mary ranked better in her group than Leo did in his. Because a lower finishing time is better in a race, a lower Z score indicates a better performance relative to the group mean. Leo's Z score of approximately 1.09 places him more than one standard deviation slower than the average time in his group. In contrast, Mary's Z score of approximately 0.31 places her only slightly slower than the average time in her group. Mary's performance was therefore comparatively better within her group.

#### Exercise 4.4: Part D

Leo finished faster than approximately 13.8% of the men in his group. The cumulative distribution function (CDF) value at Z = 1.09 is approximately 0.8621, which is the proportion of men with shorter (faster) times than Leo. The proportion he finished ahead of is the area to the right of his Z score: 1 - 0.8621 = 0.1379.

#### Exercise 4.4: Part E

Mary finished faster than approximately 37.8% of the women in her group. The CDF value at Z = 0.31 is approximately 0.6217, which is the proportion of women with faster times than Mary. The proportion she finished ahead of is the area to the right of her Z score: 1 - 0.6217 = 0.3783.
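
Because a smaller finishing time is better, the share of a group that a racer finished ahead of is the upper-tail area beyond that racer's time. The sketch below computes this under the normal model stated in the exercise. The variable names are illustrative, and the calculation uses the unrounded Z scores, so results can differ from a table lookup in the last digit.

```{r}
# Proportion of each group with a finishing time greater (slower) than the racer's time
beat_leo  <- pnorm(4948, mean = 4313, sd = 583, lower.tail = FALSE)   # men, ages 30-34
beat_mary <- pnorm(5513, mean = 5261, sd = 807, lower.tail = FALSE)   # women, ages 25-29

# Express as percentages of the group that each racer finished ahead of
round(100 * c(Leo = beat_leo, Mary = beat_mary), 1)
```
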

#### Exercise 4.4: Part F

If the distributions of finishing times are not nearly normal, the results from parts B through E could change. Bono et al. (2017) suggest that when data do not conform to a normal distribution, researchers should consider alternative approaches, such as generalized additive models for location, scale, and shape, to determine a more appropriate distribution for the response variable. For example, if the distribution of finishing times for the men ages 30-34 were skewed, with a longer tail on the right, the mean and standard deviation might not accurately reflect the distribution's characteristics. The Z scores themselves could still be calculated, because they depend only on the mean and standard deviation, but the percentages in parts D and E rely on normal-model probabilities and might not correctly represent each racer's standing. If the actual distribution were heavily right skewed, the normal model might place Leo in a better percentile than he truly is. Similarly, if Mary's group had a distribution with a significant left skew, the normal model might underestimate how her performance compares to others. Non-normal distributions would require more tailored statistical methods to accurately assess performance.




```{r}

#Begin code Exercise 4.4

# Load libraries, if needed
library(ggplot2)   # For creating plots
library(gridExtra) # For arranging multiple plots

# Define the parameters for each group
# Men, Ages 30 - 34
mu_men <- 4313     # Mean finishing time in seconds for men, ages 30-34
sigma_men <- 583   # Standard deviation of finishing times for men, ages 30-34
time_leo <- 4948   # Leo's finishing time in seconds

# Women, Ages 25 - 29
mu_women <- 5261   # Mean finishing time in seconds for women, ages 25-29
sigma_women <- 807 # Standard deviation of finishing times for women, ages 25-29
time_mary <- 5513  # Mary's finishing time in seconds

# Create a sequence of finishing times for plotting
# This sequence will be used to plot the density functions for both groups
finishing_times_men <- seq(mu_men - 4*sigma_men, mu_men + 4*sigma_men, length.out = 1000)
finishing_times_women <- seq(mu_women - 4*sigma_women, mu_women + 4*sigma_women, length.out = 1000)

# Calculate densities for the normal distributions
# This is used to plot the normal distribution curves
density_men <- dnorm(finishing_times_men, mean = mu_men, sd = sigma_men)
density_women <- dnorm(finishing_times_women, mean = mu_women, sd = sigma_women)

# Create data frames for plotting
df_men <- data.frame(FinishingTime = finishing_times_men, Density = density_men)
df_women <- data.frame(FinishingTime = finishing_times_women, Density = density_women)

# Create the plot for Leo
plot_leo <- ggplot(df_men, aes(x = FinishingTime, y = Density)) +
  geom_line(color = "lightblue") +  # Line representing the density function
  geom_area(data = df_men[df_men$FinishingTime > time_leo, ],
            aes(x = FinishingTime, y = Density), fill = "#B3CDE0", alpha = 0.5) +  # Shaded area: times slower than Leo's (the racers he finished ahead of)
  geom_vline(xintercept = time_leo, color = "darkblue", linetype = "dashed") +  # Vertical line for Leo's finishing time
  labs(title = "Leo's Time: Men 30-34",  # Title of the plot
       x = "Finishing Time (seconds) \nMen, Ages 30-34",  # X-axis label
       y = "Density of Finishing Times") +  # Y-axis label
  theme_dark() +  # Dark theme for the plot background
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold", color = "darkblue"),  # Formatting for the title
    axis.title.x = element_text(size = 12, color = "darkblue"),  # X-axis title formatting
    axis.title.y = element_text(size = 12, color = "darkblue"),  # Y-axis title formatting
    axis.text = element_text(size = 10, color = "darkblue"),  # Axis text formatting
    panel.grid = element_blank(),  # Remove gridlines for a cleaner look
    panel.border = element_rect(color = "white", fill = NA, linewidth = 0.5),  # Border around the plot
    plot.margin = margin(20, 10, 10, 10)  # Increased top margin to avoid overlap
  )

# Create the plot for Mary
plot_mary <- ggplot(df_women, aes(x = FinishingTime, y = Density)) +
  geom_line(color = "lightpink") +  # Line representing the density function
  geom_area(data = df_women[df_women$FinishingTime > time_mary, ],
            aes(x = FinishingTime, y = Density), fill = "#F5B0B3", alpha = 0.5) +  # Shaded area: times slower than Mary's (the racers she finished ahead of)
  geom_vline(xintercept = time_mary, color = "darkred", linetype = "dashed") +  # Vertical line for Mary's finishing time
  labs(title = "Mary's Time: Women 25-29",  # Title of the plot
       x = "Finishing Time (seconds) \nWomen, Ages 25-29",  # X-axis label
       y = "Density of Finishing Times") +  # Y-axis label
  theme_dark() +  # Dark theme for the plot background
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold", color = "darkred"),  # Formatting for the title
    axis.title.x = element_text(size = 12, color = "darkred"),  # X-axis title formatting
    axis.title.y = element_text(size = 12, color = "darkred"),  # Y-axis title formatting
    axis.text = element_text(size = 10, color = "darkred"),  # Axis text formatting
    panel.grid = element_blank(),  # Remove gridlines for a cleaner look
    panel.border = element_rect(color = "white", fill = NA, linewidth = 0.5),  # Border around the plot
    plot.margin = margin(20, 10, 10, 10)  # Increased top margin to avoid overlap
  )

# Combine the plots into a single view
grid.arrange(plot_leo, plot_mary, ncol = 2, widths = c(1, 1), top = "Triathletes' Performance")

# End of code Exercise 4.4; summary of visualizations:
# The plots visualize the finishing times of Leo and Mary within their respective triathlon age groups.
# For Leo, competing in the Men, Ages 30-34 group, his finishing time of 4948 seconds is represented by a vertical dashed line on the plot.
# The shaded area to the right of this line indicates the proportion of men who finished slower than Leo, which corresponds to the percentage of triathletes Leo outperformed.
# For Mary, competing in the Women, Ages 25-29 group, her finishing time of 5513 seconds is similarly marked. The shaded area to the right of her line shows the proportion of women who finished slower than Mary.
# Leo's shaded area is small (he outperformed roughly 14% of his group), whereas Mary's shaded area is larger (she outperformed roughly 38% of her group).
# Although Leo's raw time was faster than Mary's, Mary performed relatively better within her group than Leo did in his, given the distribution of finishing times.


```






## Exercise 4.8


### Explanation

The Capital Asset Pricing Model (CAPM) is a financial model that assumes returns on a portfolio are normally distributed. The single-period, one-factor CAPM, developed by Sharpe (1964), is the most basic and widely used benchmark asset pricing model in both practice and research, according to a professional organization's survey report (Pham & Phuoc, 2020). The portfolio in this exercise has an average annual return of 14.7% and a standard deviation of 33%. A return of 0% means the portfolio's value is unchanged; negative returns indicate losses, while positive returns indicate growth. According to Thonon et al. (2023), financial models allow diverse methodologies for calculating the return on investments to be measured, and more standardized designs make it easier to compare positive and negative returns. Under the normality assumption, the model describes the distribution of returns and helps evaluate the probability of various financial outcomes.

#### Exercise 4.8: Part A

The percentage of years in which the portfolio loses money is the probability of a return below 0%. Returns are normally distributed with a mean of 14.7% and a standard deviation of 33%. The Z score for a 0% return is (0 - 0.147) / 0.33, which is approximately -0.445. The cumulative probability at this Z score is about 0.328, so the portfolio is expected to lose money in about 32.8% of years.

#### Exercise 4.8: Part B

The cutoff for the highest 15% of annual returns is the 85th percentile of the distribution. The Z score corresponding to the 85th percentile is approximately 1.036. Converting this Z score back to a return gives 0.147 + (1.036 × 0.33), which is approximately 0.489, or 48.9%. The cutoff for the highest 15% of annual returns is therefore about 48.9%.
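
Both parts can be checked in a few lines by passing the mean and standard deviation straight to `pnorm()` and `qnorm()`, which avoids standardizing by hand. This is a minimal sketch; the variable names are illustrative and are kept separate from those in the plotting chunk.

```{r}
capm_mean <- 0.147   # average annual return, as a proportion
capm_sd   <- 0.33    # standard deviation of annual returns, as a proportion

# Part A: probability that the annual return is below 0%
p_loss <- pnorm(0, mean = capm_mean, sd = capm_sd)

# Part B: return that separates the highest 15% of years (the 85th percentile)
top15_cutoff <- qnorm(0.85, mean = capm_mean, sd = capm_sd)

# Report both as percentages
round(100 * c(p_loss = p_loss, top15_cutoff = top15_cutoff), 1)
```
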






```{r}


# Load the necessary library
library(ggplot2)

# Define the parameters for the normal distribution
# Mean annual return (14.7%) is converted to decimal form (0.147)
mu <- 0.147  # Mean return
# Standard deviation (33%) is converted to decimal form (0.33)
sigma <- 0.33  # Standard deviation

# Create a sequence of return values from -0.5 to 0.5 with 1000 points
returns <- seq(-0.5, 0.5, length.out = 1000)
# Compute the probability density function for these return values
density <- dnorm(returns, mean = mu, sd = sigma)

# Create a data frame for plotting
df <- data.frame(Return = returns, Density = density)

# Calculate the Z-score for a return of 0% (i.e., when Return = 0)
z_score_below_zero <- (0 - mu) / sigma
# Compute the probability of the return being less than 0% using the Z-score
probability_below_zero <- pnorm(z_score_below_zero)

# Calculate the Z-score corresponding to the 85th percentile
z_score_85th <- qnorm(0.85)
# Compute the cutoff return value for the top 15% of returns
cutoff_return <- mu + z_score_85th * sigma

# Create the plot using ggplot2
p <- ggplot(df, aes(x = Return, y = Density)) +
  # Plot the normal distribution curve in dark grey
  geom_line(color = "#333333") +  # Dark grey color for the curve
  
  # Highlight the area where returns are less than 0%
  geom_area(data = df[df$Return < 0, ], aes(x = Return, y = Density), fill = "#CC3333", alpha = 0.6) +  # Deep red color
  
  # Add a vertical dashed line at 0% return
  geom_vline(xintercept = 0, linetype = "dashed", color = "#FF0000") +  # Red line for 0% return
  
  # Add a vertical dashed line at the cutoff for the highest 15% of returns
  geom_vline(xintercept = cutoff_return, linetype = "dashed", color = "#009900") +  # Dark green line for 85th percentile
  
  # Add descriptive labels and a caption to the plot
  labs(
    title = "4.8 CAPM",
    x = "Annual Return (proportion)",
    y = "Probability Density",
    caption = sprintf("Mean Return: %.2f%% | Std Dev: %.2f%%\nCutoff for Top 15%%: %.2f%% | Percentage of Years with Return < 0%%: %.2f%%",
                      mu * 100, sigma * 100, cutoff_return * 100, probability_below_zero * 100)
  ) +
  
  # Annotate the plot with text for clarity
  annotate("text", x = -0.4, y = max(df$Density) * 0.7, label = sprintf("Return < 0%%: %.2f%%", probability_below_zero * 100), color = "#FF0000") +
  annotate("text", x = cutoff_return, y = max(df$Density) * 0.7, label = sprintf("85th Percentile: %.2f%%", cutoff_return * 100), color = "#009900") +
  
  # Adjust the x-axis and y-axis limits to ensure the entire plot is visible
  xlim(-0.5, 0.5) +  # Extend x-axis limits
  ylim(0, max(df$Density) * 1.2) +  # Extend y-axis limits
  
  # Apply a minimal theme with customized text and plot appearance
  theme_minimal(base_size = 15) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12),
    plot.margin = margin(10, 10, 10, 10)  # Increase margins to prevent cut-off
  )

# Print the plot
print(p)

# End of code Exercise 4.8; summary of visualizations:
# This creates a visual representation of portfolio returns based on the Capital Asset Pricing Model (CAPM), assuming normally distributed returns.
# The plot shows the normal distribution curve for annual returns with a mean of 14.7% and a standard deviation of 33%.
# The dark grey curve illustrates the distribution, and the red shaded area highlights returns below 0%. A red dashed line indicates the 0% return threshold, and a dark green dashed line marks the cutoff for the highest 15% of returns.
# The plot includes annotations for the percentage of years with a return less than 0% and the cutoff return for the top 15% of returns.
# This visual helps in understanding the distribution of returns and the relative performance thresholds for the portfolio.



```


## Lab Chapter 4 from OpenIntro Textbook

```{r}
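# Set CRAN mirror so packages can be installed non-interactively while knitting
options(repos = c(CRAN = "https://cran.rstudio.com"))

# Install packages only if they are missing, then load them
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("openintro", quietly = TRUE)) install.packages("openintro")
library(tidyverse)
library(openintro)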

```





### Exercise 1

```{r view-girls-counts}
# Exercise 1: Filter Data
# Filter data for McDonald's and Dairy Queen
mcdonalds <- fastfood %>% filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>% filter(restaurant == "Dairy Queen")
```


### Exercise 2


```{r trend-girls}
# Exercise 2: Visualize Distributions

# McDonald's Data Visualization
# Plot histogram of calories from fat at McDonald's
ggplot(mcdonalds, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  labs(title = "Calories from Fat at McDonald's", x = "Calories from Fat", y = "Density") +
  theme_minimal() +
  theme(axis.text = element_text(color = "green"), axis.title = element_text(color = "green"))

# Dairy Queen Data Visualization
# Plot histogram of calories from fat at Dairy Queen
ggplot(dairy_queen, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  labs(title = "Calories from Fat at Dairy Queen", x = "Calories from Fat", y = "Density") +
  theme_minimal() +
  theme(axis.text = element_text(color = "green"), axis.title = element_text(color = "green"))

```


### Exercise 3


```{r plot-prop-boys-arbuthnot}

# Exercise 3: Normal Distribution

# Calculate mean and standard deviation for Dairy Queen
dqmean <- mean(dairy_queen$cal_fat, na.rm = TRUE)
dqsd <- sd(dairy_queen$cal_fat, na.rm = TRUE)

# Create a density histogram with a normal distribution curve for Dairy Queen
ggplot(dairy_queen, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd), color = "tomato", linewidth = 1) +
  labs(title = "Calories from Fat at Dairy Queen with Normal Curve", x = "Calories from Fat", y = "Density") +
  theme_minimal() +
  theme(axis.text = element_text(color = "green"), axis.title = element_text(color = "green"))

# Create a normal Q-Q plot for Dairy Queen’s calories from fat
ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Q-Q Plot of Calories from Fat at Dairy Queen", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal() +
  theme(axis.text = element_text(color = "red"), axis.title = element_text(color = "orange"))
```


### Exercise 4


```{r dim-present}
# Exercise 4: Simulate Normal Data and Create Q-Q Plot

# Simulate normal data
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

# Create a Q-Q plot of simulated normal data
ggplot(data.frame(sim_norm), aes(sample = sim_norm)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Q-Q Plot of Simulated Normal Data", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal() +
  theme(axis.text = element_text(color = "yellow"), axis.title = element_text(color = "pink"))
```


### Exercise 5


```{r count-compare}
# Exercise 5: Probability Calculations

# Calculate the theoretical probability for Dairy Queen items having more than 600 calories from fat
prob_theoretical <- 1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
print(paste("Theoretical probability of more than 600 calories from fat:", prob_theoretical))

# Calculate the empirical probability for Dairy Queen items having more than 600 calories from fat
prob_empirical <- dairy_queen %>%
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
print(prob_empirical)
```


### Exercise 6


```{r plot-prop-boys-present}
# Exercise 6: McDonald's Probability Calculations

# Calculate the mean and standard deviation for McDonald's calories from fat
mcmean <- mean(mcdonalds$cal_fat, na.rm = TRUE)
mcsd <- sd(mcdonalds$cal_fat, na.rm = TRUE)

# Calculate the theoretical probability for McDonald's items having more than 500 calories from fat
prob_mc_theoretical <- 1 - pnorm(q = 500, mean = mcmean, sd = mcsd)
print(paste("Theoretical probability for McDonald's with more than 500 calories from fat:", prob_mc_theoretical))

# Calculate the empirical probability for McDonald's items having more than 500 calories from fat
prob_mc_empirical <- mcdonalds %>%
  filter(cal_fat > 500) %>%
  summarise(percent = n() / nrow(mcdonalds))
print(prob_mc_empirical)

# End of Lab Chapter 4 from OpenIntro Textbook
```
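
Exercises 5 and 6 repeat the same comparison of a normal-model probability with an observed proportion, so the calculation can be wrapped in a small function. The sketch below is one possible arrangement and is not part of the lab: `compare_tail_prob` is a hypothetical helper, and the demonstration uses simulated values rather than the `fastfood` data so that the chunk runs on its own.

```{r}
# Hypothetical helper: P(X > cutoff) from a fitted normal model vs. the observed share
compare_tail_prob <- function(x, cutoff) {
  x <- x[!is.na(x)]                                   # drop missing values
  theoretical <- pnorm(cutoff, mean = mean(x), sd = sd(x), lower.tail = FALSE)
  empirical <- mean(x > cutoff)                       # proportion of observations above the cutoff
  c(theoretical = theoretical, empirical = empirical)
}

# Demonstration on simulated, right-skewed data (illustrative values only)
set.seed(42)
simulated_cal_fat <- rgamma(200, shape = 2, scale = 120)
compare_tail_prob(simulated_cal_fat, cutoff = 600)
```

With the lab data, the equivalent calls would be `compare_tail_prob(dairy_queen$cal_fat, cutoff = 600)` and `compare_tail_prob(mcdonalds$cal_fat, cutoff = 500)`.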





## References

Bono, R., Blanca, M., Arnau, J., & Gómez-Benito, J. (2017). Non-normal distributions commonly used in health, education, and social sciences: A systematic review. Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.01602

Diez, D., Barr, C., & Çetinkaya-Rundel, M. (2019). OpenIntro statistics (4th ed.).

Elassaiss-Schaap, J., & Duisters, K. (2020). Variability in the log domain and limitations to its approximation by the normal distribution. Pharmacometrics & Systems Pharmacology, 9(5). https://doi.org/10.1002/psp4.12507

Moen, P., Flood, S., & Wang, J. (2021). The uneven later work course: Intersectional gender, age, race, and class disparities. The Journals of Gerontology: Series B, 77(1). https://doi.org/10.1093/geronb/gbab039

Pham, C., & Phuoc, L. (2020). An augmented capital asset pricing model using new macroeconomic determinants. Heliyon, 6(10), Article e05185. https://doi.org/10.1016/j.heliyon.2020.e05185

Solomyak, L., Sharp, P., & Eldar, E. (2022). Training diversity promotes absolute-value-guided choice. PLoS Computational Biology, 18(11), Article e1010664. https://doi.org/10.1371/journal.pcbi.1010664

Thonon, F., Godon-Rensonnet, A., Perozziello, A., Garsi, J., Dab, W., & Emsalem, P. (2023). Return on investment of workplace-based prevention interventions: A systematic review. European Journal of Public Health, 33(4), 612–618. https://doi.org/10.1093/eurpub/ckad092



