HW3

Introduction

Why this data set?

For a college student, grades really matter, but many factors can affect them: how long students sleep, how they study, how hard the class is, and so on. This analysis digs into which of these factors affect a student's exam score, and how.

Goals of this analysis

Explore what affects exam score
Use statistical functions from base R and several other libraries
Visualize the results with ggplot2 and plotly

Library loading and data

Library loading

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(tidyr)
library(broom)

Note: some of these packages might not be installed yet in RStudio (web or application). To install one, run install.packages("package_name") in the R console (not the system terminal).
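A quick way to see which packages still need installing is sketched below. This is a minimal example and not part of the original homework: the helper name `missing_packages` is hypothetical, and the install line is left commented out because it assumes a CRAN mirror is reachable.

```r
# Packages loaded in this report.
pkgs <- c("dplyr", "ggplot2", "plotly", "tidyr", "broom")

# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them from CRAN
}
```

If nothing is printed, every package in `pkgs` is already available.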

Data

data = read.csv("Exam_Score_Prediction.csv")

This data set was published on Kaggle by KUNDAN SAGAR BEDMUTHA. The link is: https://www.kaggle.com/datasets/kundanbedmutha/exam-score-prediction-dataset

NOTE: the file has to be downloaded into the local working directory so that RStudio (web or application) can read it and run the code. Do not change the name of the file.
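If the file is missing or renamed, read.csv() stops with a fairly generic error. The sketch below shows one way to fail earlier with a clearer message. The helper name `load_exam_data` is hypothetical; the column names it checks are the ones used later in this report.

```r
# Hypothetical helper: read the CSV and stop early with a clear message.
load_exam_data <- function(path = "Exam_Score_Prediction.csv") {
  if (!file.exists(path)) {
    stop("File not found: ", path,
         " (working directory: ", getwd(), ")")
  }
  df <- read.csv(path)

  # Columns used by the analysis in this report.
  needed <- c("sleep_quality", "sleep_hours", "study_hours",
              "class_attendance", "exam_score")
  absent <- setdiff(needed, names(df))
  if (length(absent) > 0) {
    stop("Missing columns: ", paste(absent, collapse = ", "))
  }
  df
}
```

With the file in place, `data <- load_exam_data()` would give the same data frame as the read.csv() call above.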

Questions

Does sleeping longer affect the exam score, and does that depend on sleep quality?
Do longer study sessions actually raise a student's score, or is there a certain number of study hours needed for a good score?
Do students who come to class more often score higher?

Sleep hours vs exam score (grouped by sleep quality)

poor_sleep <- data %>%
  filter(sleep_quality == "poor")

average_sleep <- data %>%
  filter(sleep_quality == "average") 

good_sleep <- data %>%
  filter(sleep_quality == "good") 


bind_rows(
  tidy(cor.test(poor_sleep$sleep_hours, poor_sleep$exam_score))  %>% mutate(sleep_quality = "poor"),
  tidy(cor.test(average_sleep$sleep_hours, average_sleep$exam_score)) %>% mutate(sleep_quality="average"),
  tidy(cor.test(good_sleep$sleep_hours, good_sleep$exam_score)) %>% mutate(sleep_quality="good")
) %>%
  mutate(
    p.value = format.pval(p.value, digits = 4, eps = 2.2e-16)
  ) %>%
  select(sleep_quality, estimate, p.value, conf.low, conf.high)

## # A tibble: 3 × 5
##   sleep_quality estimate p.value   conf.low conf.high
##   <chr>            <dbl> <chr>        <dbl>     <dbl>
## 1 poor             0.130 < 2.2e-16    0.106     0.153
## 2 average          0.137 < 2.2e-16    0.113     0.160
## 3 good             0.133 < 2.2e-16    0.109     0.156

ggplot(poor_sleep, aes(sleep_hours, exam_score)) + 
  geom_point(alpha = 0.15, size = 1) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2)

## `geom_smooth()` using formula = 'y ~ x'

ggplot(data, aes(sleep_hours, exam_score, color = sleep_quality)) +
  geom_point(alpha = 0.15, size = 1) +
  geom_smooth(method = "lm", se = TRUE)

## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows a lot of variability, so the relationship between sleep hours and exam score is not visually strong. However, the correlation test gives a very small p-value (p < 2.2×10⁻¹⁶) in every sleep-quality group, meaning the correlation is statistically different from zero. The effect size is small (r ≈ 0.13); since R^2 = r^2, sleep hours explain only about 1.7% of the variation in exam scores. In other words, the relationship is real but weak, and most of the variation in exam scores is explained by other factors.
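The R^2 arithmetic can be checked directly. The short sketch below squares the three correlation estimates from the table above; the values are copied by hand from that output, not recomputed from the data.

```r
# Correlation estimates copied from the table above.
r <- c(poor = 0.130, average = 0.137, good = 0.133)

# Share of variance explained, as a percentage: 100 * r^2.
pct_explained <- round(100 * r^2, 1)
print(pct_explained)
```

All three groups land between 1.7% and 1.9%, so the "about 1.7%" figure holds regardless of sleep quality.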

The math behind the scenes

Pearson correlation coefficient

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \] This is the formula for the Pearson correlation coefficient. It measures the strength and direction of the linear relationship between x and y, on a scale from -1 to 1.
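To see that the formula matches what R computes, here is a minimal sketch that evaluates it by hand and compares the result with cor(). It uses the built-in `cars` data set as a stand-in (an assumption, so the example runs without the Kaggle file).

```r
# Stand-in data: the built-in `cars` data set (speed vs. stopping distance).
x <- cars$speed
y <- cars$dist

# Deviations from the means.
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson r: sum of cross-products over the product of the root sums of squares.
r_manual <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

# Compare with the built-in implementation.
r_builtin <- cor(x, y)
print(c(manual = r_manual, builtin = r_builtin))
```

The two numbers should agree up to floating-point rounding. Swapping in `data$sleep_hours` and `data$exam_score` would apply the same check to this report's data.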

t-statistic, hypothesis test

\[ t = r \sqrt{\frac{n - 2}{1 - r^2}} \] This statistic, which has n - 2 degrees of freedom, is used to test whether the correlation is significantly different from zero.
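As a small check of this formula, the sketch below computes t by hand and compares it with the statistic reported by cor.test(). It again uses the built-in `cars` data set as a stand-in, which is an assumption made only so the example is self-contained.

```r
# Stand-in data: built-in `cars` data set.
x <- cars$speed
y <- cars$dist

n <- length(x)   # sample size
r <- cor(x, y)   # Pearson correlation

# t statistic from the formula above.
t_manual <- r * sqrt((n - 2) / (1 - r^2))

# cor.test() reports the same statistic, with n - 2 degrees of freedom.
ct <- cor.test(x, y)
print(c(manual = t_manual, cor.test = unname(ct$statistic)))
print(unname(ct$parameter))  # degrees of freedom
```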

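p-value

\[ p = 2\Big(1 - F_t(|t|;\text{df})\Big) \] Here F_t is the cumulative distribution function of the t distribution with df = n - 2 degrees of freedom. In R, the function pt() computes this cumulative probability for a given t value and degrees of freedom. If the p-value is below 0.05, we reject the null hypothesis of zero correlation and conclude that the correlation is statistically significant.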
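The p-value formula can be reproduced with pt() as sketched below, once more on the built-in `cars` data set as a stand-in (an assumption). The sketch also shows the `lower.tail = FALSE` form, which computes the same tail probability without the subtraction from 1 and therefore keeps more precision when p is very small.

```r
# Stand-in data: built-in `cars` data set.
x <- cars$speed
y <- cars$dist

df <- length(x) - 2                 # degrees of freedom
r <- cor(x, y)
t_stat <- r * sqrt(df / (1 - r^2))  # t statistic

# Two-sided p-value, written exactly as in the formula.
p_formula <- 2 * (1 - pt(abs(t_stat), df))

# Same quantity via the upper tail, more accurate for tiny p-values.
p_upper <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

p_builtin <- cor.test(x, y)$p.value
print(c(formula = p_formula, upper_tail = p_upper, cor.test = p_builtin))
```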

Study hours vs exam score; class attendance vs exam score

bind_rows(
  tidy(cor.test(data$class_attendance, data$exam_score)) %>% 
    mutate(x_axis = "class_attendance"),
  
  tidy(cor.test(data$study_hours, data$exam_score)) %>% 
    mutate(x_axis = "study_hours")
) %>%
  mutate(
    p.value = format.pval(p.value, digits = 4, eps = 2.2e-16)
  ) %>%
  select(x_axis, estimate, p.value, conf.low, conf.high)

## # A tibble: 2 × 5
##   x_axis           estimate p.value   conf.low conf.high
##   <chr>               <dbl> <chr>        <dbl>     <dbl>
## 1 class_attendance    0.309 < 2.2e-16    0.296     0.321
## 2 study_hours         0.718 < 2.2e-16    0.711     0.724

ggplot(data, aes(class_attendance, exam_score)) + 
  geom_point(alpha = 0.15, size = 1)+
  geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(data, aes(study_hours, exam_score)) + 
  geom_point(alpha = 0.15, size = 1)+
  geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

Two separate Pearson correlation tests were conducted to examine the relationships between class attendance and exam score, and between study hours and exam score. Both correlations were statistically significant (p < 2.2×10⁻¹⁶ in each case). The association between study hours and exam score was particularly strong, with a correlation coefficient of r = 0.72 and a coefficient of determination of R^2 ≈ 0.52, indicating that study hours explain approximately 52% of the variation in exam scores. In contrast, the relationship between class attendance and exam score was more modest, with r = 0.31 and R^2 ≈ 0.095, suggesting that class attendance accounts for about 9.5% of the variation in exam performance. Both relationships are also visible in the two 2D scatter plots (class attendance vs exam score; study hours vs exam score).
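The step from r to R^2 relies on the fact that, for a linear regression with a single predictor and an intercept, the R^2 of the fitted line equals the squared Pearson correlation. A minimal sketch of that identity, using the built-in `cars` data set as a stand-in (an assumption):

```r
# Simple linear regression on the stand-in data.
fit <- lm(dist ~ speed, data = cars)

# R^2 reported by the regression summary.
r_squared_lm <- summary(fit)$r.squared

# Squared Pearson correlation between the same two variables.
r <- cor(cars$speed, cars$dist)

print(c(r_squared = r^2, lm_r_squared = r_squared_lm))
```

For this report's data, the analogous fit would be `lm(exam_score ~ study_hours, data = data)`. The identity does not carry over to models with more than one predictor.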

3D plot showing the relationship between study hours, class attendance, and exam score

plot_ly(
  data = data,
  x = ~class_attendance,
  y = ~study_hours,
  z = ~exam_score,
  type = "scatter3d",
  mode = "markers",
  marker = list(
    size = 4,
    opacity = 0.6
  ))

This is a 3D scatter plot of exam score versus class attendance and study hours. It shows a steep upward trend along the study hours axis and a weaker but still positive trend along the class attendance axis. Together, the numerical results and the 3D visualization indicate that while both predictors are statistically significant, study hours are a substantially stronger predictor of exam score than class attendance.

Conclusion

From the data and calculations, we can make these claims:
- Sleep hours have a statistically significant but weak effect on exam score, based on the p-value and the Pearson correlation coefficient
- Sleep quality appears to matter as well (better sleep quality, better score), although it was not tested directly here: the sleep hours correlation is about the same (r ≈ 0.13) in all three sleep-quality groups
- Study hours strongly predict exam score
- Class attendance has a medium positive effect on exam score

My own takeaways:
- Sleeping more and sleeping better tend to go with better grades
- Studying effectively is the most reliable route to a good exam score
- Class attendance is optional unless the syllabus requires it

Cong Kha Le