For a college student, grades are really important, but many factors can affect them, such as how long students sleep, how they study, and how hard the class is. In this analysis, we dig into which of these factors affect students' grades and how strongly.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyr)
library(broom)
Note: some of these libraries might not be installed in RStudio (web or desktop application) yet. To install one, run install.packages("library_name") in the R console.
data = read.csv("Exam_Score_Prediction.csv")
This data was collected from Kaggle, where it was published by KUNDAN SAGAR BEDMUTHA, at the following link: "https://www.kaggle.com/datasets/kundanbedmutha/exam-score-prediction-dataset". NOTE: the file has to be downloaded into the local working directory so that RStudio (web or application) can read it and run the code. Do not change the name of the file.
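A missing or renamed file is the most likely reason the read step fails, so a small guard can make the error clearer. The sketch below is only one way to do this: `load_exam_data()` is a hypothetical helper name, not part of the original analysis, and the column list is taken from the columns used later in this report.

```r
# Hypothetical helper: read the CSV, but fail early with a readable message
load_exam_data <- function(path = "Exam_Score_Prediction.csv") {
  if (!file.exists(path)) {
    stop("Cannot find '", path, "' in ", getwd(),
         ". Download it from Kaggle and keep the original file name.",
         call. = FALSE)
  }
  df <- read.csv(path)

  # Columns this report relies on
  needed <- c("sleep_quality", "sleep_hours", "exam_score",
              "class_attendance", "study_hours")
  missing_cols <- setdiff(needed, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing expected columns: ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  df
}
```

With the file in place, `data <- load_exam_data()` would behave like the `read.csv()` call above.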
# Split the data by sleep quality
poor_sleep <- data %>%
  filter(sleep_quality == "poor")
average_sleep <- data %>%
  filter(sleep_quality == "average")
good_sleep <- data %>%
  filter(sleep_quality == "good")

# Pearson correlation test (sleep hours vs exam score) within each group
bind_rows(
  tidy(cor.test(poor_sleep$sleep_hours, poor_sleep$exam_score)) %>% mutate(sleep_quality = "poor"),
  tidy(cor.test(average_sleep$sleep_hours, average_sleep$exam_score)) %>% mutate(sleep_quality = "average"),
  tidy(cor.test(good_sleep$sleep_hours, good_sleep$exam_score)) %>% mutate(sleep_quality = "good")
) %>%
  mutate(
    # Print very small p-values as "< 2.2e-16" instead of 0
    p.value = format.pval(p.value, digits = 4, eps = 2.2e-16)
  ) %>%
  select(sleep_quality, estimate, p.value, conf.low, conf.high)
## # A tibble: 3 × 5
## sleep_quality estimate p.value conf.low conf.high
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 poor 0.130 < 2.2e-16 0.106 0.153
## 2 average 0.137 < 2.2e-16 0.113 0.160
## 3 good 0.133 < 2.2e-16 0.109 0.156
ggplot(poor_sleep, aes(sleep_hours, exam_score)) +
  geom_point(alpha = 0.15, size = 1) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data, aes(sleep_hours, exam_score, color = sleep_quality)) +
  geom_point(alpha = 0.15, size = 1) +
  geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'
The scatter plots show a lot of variability, so the relationship between
sleep hours and exam score is not visually strong. However, each
correlation test gives a very small p-value (p < 2.2×10⁻¹⁶), meaning
the correlation is statistically different from zero. The effect size is
small (r ≈ 0.13 in every sleep-quality group). Because R^2 = r^2 for a
single predictor, sleep hours explain only about 1.7% of the variation in
exam scores. In other words, the relationship is real but weak, and most
of the variation in exam scores is explained by other factors.
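\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \] This is the formula for the Pearson correlation coefficient. It measures the strength and direction of the linear relationship between x and y, and ranges from -1 to 1.
\[ t = r \sqrt{\frac{n - 2}{1 - r^2}} \] This test statistic is used to decide whether the correlation is statistically significant. Under the null hypothesis of zero correlation, it follows a t distribution with df = n - 2 degrees of freedom.
\[ p = 2\Big(1 - F_t(|t|;\text{df})\Big) \] Here F_t is the cumulative distribution function (CDF) of the t distribution. In R, the function pt() evaluates this CDF: given a t-value and the degrees of freedom, it returns the probability of a value at or below that t-value. If the resulting p-value is below 0.05, we reject the null hypothesis of zero correlation and conclude that the correlation is statistically significant.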
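The three formulas can be checked by hand against cor.test(). The sketch below uses simulated data (not the Kaggle dataset), so the variable names and the slope of 0.3 are made up for illustration.

```r
# Simulated data (NOT the exam dataset): a weak positive linear relationship
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)

# Pearson r from the definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

# t statistic with n - 2 degrees of freedom
df <- n - 2
t_manual <- r_manual * sqrt(df / (1 - r_manual^2))

# Two-sided p-value from the t CDF.
# lower.tail = FALSE gives 1 - pt(...), but stays accurate for tiny p-values.
p_manual <- 2 * pt(abs(t_manual), df, lower.tail = FALSE)

# Compare with R's built-in test
ct <- cor.test(x, y)
all.equal(r_manual, unname(ct$estimate))
all.equal(t_manual, unname(ct$statistic))
all.equal(p_manual, ct$p.value)
```

All three comparisons should return TRUE, which shows that cor.test() is doing the same calculation as the formulas.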
bind_rows(
  tidy(cor.test(data$class_attendance, data$exam_score)) %>%
    mutate(x_axis = "class_attendance"),
  tidy(cor.test(data$study_hours, data$exam_score)) %>%
    mutate(x_axis = "study_hours")
) %>%
  mutate(
    p.value = format.pval(p.value, digits = 4, eps = 2.2e-16)
  ) %>%
  select(x_axis, estimate, p.value, conf.low, conf.high)
## # A tibble: 2 × 5
## x_axis estimate p.value conf.low conf.high
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 class_attendance 0.309 < 2.2e-16 0.296 0.321
## 2 study_hours 0.718 < 2.2e-16 0.711 0.724
ggplot(data, aes(class_attendance, exam_score)) +
  geom_point(alpha = 0.15, size = 1) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data, aes(study_hours, exam_score)) +
  geom_point(alpha = 0.15, size = 1) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
Two separate Pearson correlation tests were conducted to examine the
relationships between class attendance and exam score, and between study
hours and exam score. Both correlations were statistically significant,
with p-values below 2.2×10⁻¹⁶, far less than 0.05. The association
between study hours and exam score was particularly strong, with a
correlation coefficient of r = 0.718 and a coefficient of determination
of R^2 ≈ 0.516, indicating that study hours explain approximately 51.6%
of the variation in exam scores. In contrast, the relationship between
class attendance and exam score was more modest, with r = 0.309 and
R^2 ≈ 0.095, suggesting that class attendance accounts for about 9.5% of
the variation in exam performance. These relationships are also visible
in the two 2D scatter plots (class attendance vs exam score; study hours
vs exam score).
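The step from r to R^2 is plain arithmetic, and it can be reproduced from the estimates printed in the correlation table. This is a minimal sketch; the vector name `r` is only for illustration.

```r
# Correlation estimates copied from the table above
r <- c(class_attendance = 0.309, study_hours = 0.718)

# For a single predictor, R^2 is just r squared
r_squared <- r^2

# Share of variation in exam score explained, in percent
round(100 * r_squared, 1)
# class_attendance      study_hours
#              9.5             51.6
```

Squaring r after rounding it to two decimals (0.72) gives about 0.518 instead, so the exact percentage depends on how much r was rounded first.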
plot_ly(
  data = data,
  x = ~class_attendance,
  y = ~study_hours,
  z = ~exam_score,
  type = "scatter3d",
  mode = "markers",
  marker = list(
    size = 4,
    opacity = 0.6
  )
)
This is a 3D scatter plot of exam score versus class attendance and study hours. It shows a steep upward trend along the study hours axis and a weaker but still positive trend along the class attendance axis. Together, the numerical results and the 3D visualization indicate that while both predictors are statistically significant, study hours are a substantially stronger predictor of exam score than class attendance.
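One possible way to compare the two predictors in a single model, rather than by eye, is a linear regression on standardized variables. This is not part of the original analysis. The sketch below runs on simulated data whose column names mirror the dataset, and the slopes and noise level are made-up assumptions.

```r
# Simulated stand-in for the exam data (values are invented for this sketch)
set.seed(42)
n <- 500
sim <- data.frame(
  class_attendance = runif(n, 40, 100),  # percent of classes attended
  study_hours      = runif(n, 0, 8)      # hours of study
)
sim$exam_score <- 30 + 0.15 * sim$class_attendance +
  5 * sim$study_hours + rnorm(n, sd = 8)

# scale() puts every variable in standard-deviation units,
# so the two slopes can be compared directly
fit <- lm(scale(exam_score) ~ scale(class_attendance) + scale(study_hours),
          data = sim)

coef(fit)               # standardized slopes
summary(fit)$r.squared  # variation explained by both predictors together
```

To try this on the real data, `sim` would be replaced with `data`. The larger standardized slope then indicates the stronger predictor.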
From the data and calculations, we have these claims:
- Sleep hours have a statistically significant but weak effect on exam score, based on the p-value and the Pearson correlation coefficient
- Sleep quality appears to be associated with exam score (better sleep quality -> better score), although this was only compared visually in the grouped plot and not tested directly
- Study hours strongly predict exam score
- Class attendance has a medium positive effect on exam score
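The sleep quality claim could be checked with group means and a one-way ANOVA. This is a hedged sketch on simulated data, not a result from the exam dataset: the group effects below are invented, and only the column names follow the dataset.

```r
# Simulated stand-in data (group effects are made up for this sketch)
set.seed(7)
quality_levels <- c("poor", "average", "good")
sim <- data.frame(
  sleep_quality = factor(sample(quality_levels, 300, replace = TRUE),
                         levels = quality_levels)
)
effect <- c(poor = 0, average = 3, good = 6)
sim$exam_score <- 60 + unname(effect[as.character(sim$sleep_quality)]) +
  rnorm(300, sd = 10)

# Mean exam score in each sleep-quality group
group_means <- aggregate(exam_score ~ sleep_quality, data = sim, FUN = mean)
group_means

# One-way ANOVA: do the group means differ more than chance would explain?
anova_fit <- aov(exam_score ~ sleep_quality, data = sim)
summary(anova_fit)
```

On the real data, `sim` would be replaced with `data`. A small p-value in the ANOVA table would support the claim that exam scores differ between sleep-quality groups.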
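My own takeaways:
- Sleeping more and sleeping better tends to go with better grades
- Studying effectively is the factor most strongly associated with a good exam score, although it is not a guarantee
- Class attendance matters less than study time, so I treat it as optional unless the syllabus requires it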