# Load required libraries
library(ggplot2)
# I wanted to explore the relationship between study hours and test scores, so I created a fictional dataset.
set.seed(42)
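# I drew the study hours first so that the test scores could be built from the same values.
# I decided to simulate data for study hours ranging from 0 to 10 hours per week.
study_hours <- runif(100, min = 0, max = 10)
study_data <- data.frame(
  Study_Hours = study_hours,
  # I generated test scores as 50 + 5 * study hours plus normal noise (sd = 5) to mimic real-world variability.
  # I left the scores uncapped, so the noise can push a few simulated values slightly above 100.
  Test_Scores = 50 + 5 * study_hours + rnorm(100, mean = 0, sd = 5)
)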
# I used ggplot2 to create a scatterplot with a regression line to visualize the data.
ggplot(study_data, aes(x = Study_Hours, y = Test_Scores)) +
  # I chose purple for the points because I thought it would stand out and look appealing.
  geom_point(color = "purple", size = 2) +
  # I added a regression line in green to highlight the trend between study hours and test scores.
  # I used `linewidth` for the line because the `size` argument for lines was deprecated in ggplot2 3.4.0.
  geom_smooth(method = "lm", formula = y ~ x, color = "green", se = FALSE, linewidth = 1) +
  # I wanted to keep the design clean, so I used a minimal theme.
  theme_minimal() +
  # I customized the title and labels to make the plot easier to understand.
  labs(
    title = "Linear Regression: Study Hours vs. Test Scores",
    # I labeled the x-axis 'Study Hours' to make clear that it is the predictor variable.
    x = "Study Hours (per week)",
    # I labeled the y-axis 'Test Scores' because it is the response variable in my analysis.
    y = "Test Scores (out of 100)"
  ) +
  # I centered and bolded the title and set the axis text sizes to keep the plot readable.
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

# --- Narrative Explanation ---
# I analyzed the relationship between study hours and test scores using a simple linear regression model.
# The equation Y ≈ β0 + β1X assumes a linear relationship where X is the predictor (study hours) and Y is the response (test scores).
# The regression line drawn by geom_smooth(method = "lm") is fitted by least squares, which estimates the coefficients β0 (intercept) and β1 (slope).
# I used this method because it finds the best-fitting line: the one that minimizes the sum of squared differences between the observed and predicted test scores.
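# A minimal sketch of the least squares step described above, assuming `study_data` from earlier in this script.
# The names `fit`, `b1_manual`, and `b0_manual` are my own, chosen only for illustration.
fit <- lm(Test_Scores ~ Study_Hours, data = study_data)
coef(fit)  # "(Intercept)" estimates β0 and "Study_Hours" estimates β1
# Closed-form least squares estimates: β1 = cov(X, Y) / var(X) and β0 = mean(Y) - β1 * mean(X).
b1_manual <- with(study_data, cov(Study_Hours, Test_Scores) / var(Study_Hours))
b0_manual <- with(study_data, mean(Test_Scores) - b1_manual * mean(Study_Hours))
# The two approaches should agree up to floating-point error.
all.equal(unname(coef(fit)), c(b0_manual, b1_manual))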
# After plotting the results, I observed that the green regression line closely follows the trend of the data points, represented as purple dots.
# Each point on this line is the model's predicted test score for a given number of study hours.
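# To illustrate how the line yields predicted scores, here is a small sketch using predict(), assuming `study_data` from earlier in this script.
# `fit_pred` and `new_hours` are hypothetical names, and the study hours 2, 5, and 8 are arbitrary example values.
fit_pred <- lm(Test_Scores ~ Study_Hours, data = study_data)
new_hours <- data.frame(Study_Hours = c(2, 5, 8))
# Each prediction equals β0 + β1 * Study_Hours, that is, a point on the regression line.
predicted_scores <- predict(fit_pred, newdata = new_hours)
predicted_scores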
# As I reviewed the plot, I noticed that while the model captures the overall trend well,
# there are some deviations where students scored higher or lower than predicted. These are represented by the residuals,
# which are the differences between observed and predicted values.
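# A short sketch of how the residuals could be inspected, assuming `study_data` from earlier in this script.
# `fit_res` and `res` are names I made up for this example.
fit_res <- lm(Test_Scores ~ Study_Hours, data = study_data)
res <- residuals(fit_res)
# Each residual is the observed score minus the predicted (fitted) score.
all.equal(unname(res), study_data$Test_Scores - unname(fitted(fit_res)))
# Positive residuals are students who scored higher than predicted; these are the three largest.
head(sort(res, decreasing = TRUE), 3)
# Because the model includes an intercept, the least squares residuals sum to (numerically) zero.
sum(res)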
# By visualizing the data this way, I could confidently interpret the relationship: as study hours increase, test scores tend to improve.
# I also saw that study hours do not explain everything: in this simulation the random noise term stands in for other factors that influence test scores, reminding me of the complexity of real-world data.
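# As a rough way to quantify how much of the variation study hours leave unexplained, here is a sketch
# using summary() on the fitted model, assuming `study_data` from earlier in this script.
# `fit_sum` is a name I chose for illustration.
fit_sum <- summary(lm(Test_Scores ~ Study_Hours, data = study_data))
fit_sum$r.squared  # share of the variance in test scores explained by study hours
fit_sum$sigma      # residual standard error: the typical size of the unexplained deviations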