knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
library(tidyverse)
library(car)
library(lmtest)
library(sandwich)
library(tseries)
library(ggplot2)
library(patchwork)
rm(list = ls())
Cieľom tejto analýzy je modelovať výsledné skúškové skóre študentov (exam_score) v závislosti od troch vysvetľujúcich premenných:
hours_studied – počet hodín štúdia,
sleep_hours – priemerný počet hodín spánku,
attendance_percent – percentuálna účasť na vyučovaní.
H₁: Vyšší počet hodín štúdia, viac spánku a vyššia účasť vedú k lepšiemu skúškovému skóre.
Načítame a vyčistíme dáta z databázy student_exam_scores.csv. Chýbajúce hodnoty nahradíme mediánom danej premennej.
df <- read_csv("student_exam_scores.csv")
# Vyber relevantné stĺpce
df <- df %>%
select(exam_score, hours_studied, sleep_hours, attendance_percent, previous_scores)
# Imputácia chýbajúcich hodnôt mediánom
column_medians <- sapply(df, median, na.rm = TRUE)
for (col in names(df)) {
df[[col]][is.na(df[[col]])] <- column_medians[col]
}
summary(df)
exam_score hours_studied sleep_hours attendance_percent previous_scores
Min. :17.10 Min. : 1.000 Min. :4.000 Min. : 50.30 Min. :40.0
1st Qu.:29.50 1st Qu.: 3.500 1st Qu.:5.300 1st Qu.: 62.20 1st Qu.:54.0
Median :34.05 Median : 6.150 Median :6.700 Median : 75.25 Median :67.5
Mean :33.95 Mean : 6.325 Mean :6.622 Mean : 74.83 Mean :66.8
3rd Qu.:38.75 3rd Qu.: 9.000 3rd Qu.:8.025 3rd Qu.: 87.42 3rd Qu.:80.0
Max. :51.30 Max. :12.000 Max. :9.000 Max. :100.00 Max. :95.0
Vizuálne skontrolujeme premenné pomocou boxplotov.
par(mfrow = c(2, 3))
for (col in names(df)) {
boxplot(df[[col]], main = col, col = "#8ecae6", border = "#023047")
}
par(mfrow = c(1, 1))
model <- lm(exam_score ~ hours_studied + sleep_hours + attendance_percent, data = df)
summary(model)
Call:
lm(formula = exam_score ~ hours_studied + sleep_hours + attendance_percent,
data = df)
Residuals:
Min 1Q Median 3Q Max
-8.5534 -2.7064 -0.1704 3.1321 7.6393
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.90021 1.95832 5.566 8.49e-08 ***
hours_studied 1.62964 0.08502 19.168 < 2e-16 ***
sleep_hours 0.57941 0.18318 3.163 0.00181 **
attendance_percent 0.11907 0.01920 6.202 3.24e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.857 on 196 degrees of freedom
Multiple R-squared: 0.6821, Adjusted R-squared: 0.6773
F-statistic: 140.2 on 3 and 196 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model, col = "#023047", pch = 19, cex = 0.6)
par(mfrow = c(1, 1))
residuals <- residuals(model)
jarque.bera.test(residuals)
Jarque Bera Test
data: residuals
X-squared = 5.1125, df = 2, p-value = 0.0776
outlierTest(model)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
Ak p-hodnota testu Jarque–Bera > 0.05, rezíduá sú približne normálne rozložené. Žiadne významné odľahlé pozorovania sa nevyskytli.
Premennú hours_studied zlogaritmizujeme, aby sme zachytili možný nelineárny vzťah.
model2 <- lm(exam_score ~ I(log(hours_studied + 1)) + sleep_hours + attendance_percent, data = df)
summary(model2)
Call:
lm(formula = exam_score ~ I(log(hours_studied + 1)) + sleep_hours +
attendance_percent, data = df)
Residuals:
Min 1Q Median 3Q Max
-9.605 -2.687 0.091 3.340 9.939
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.11559 2.27489 0.930 0.35353
I(log(hours_studied + 1)) 10.04270 0.57918 17.339 < 2e-16 ***
sleep_hours 0.60652 0.19503 3.110 0.00215 **
attendance_percent 0.11990 0.02045 5.863 1.9e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.108 on 196 degrees of freedom
Multiple R-squared: 0.6394, Adjusted R-squared: 0.6339
F-statistic: 115.9 on 3 and 196 DF, p-value: < 2.2e-16
Diagnostika nového modelu:
par(mfrow = c(2, 2))
plot(model2, col = "#219ebc", pch = 19, cex = 0.6)
par(mfrow = c(1, 1))
Ak nový model má vyššie R² a lepšie diagnostické vlastnosti, je vhodnejší pre interpretáciu.
Vizualizácia rezíduí voči vybraným premenným:
p1 <- ggplot(df, aes(x = hours_studied, y = resid(model)^2)) +
geom_point(color = "#8ecae6", alpha = 0.7) +
geom_smooth(method = "loess", color = "#ffb703") +
labs(x = "Hours studied", y = "Squared residuals", title = "Residuals vs Hours studied") +
theme_minimal()
p2 <- ggplot(df, aes(x = attendance_percent, y = resid(model)^2)) +
geom_point(color = "#8ecae6", alpha = 0.7) +
geom_smooth(method = "loess", color = "#ffb703") +
labs(x = "Attendance (%)", y = "Squared residuals", title = "Residuals vs Attendance") +
theme_minimal()
p1 + p2
Formálny test heteroskedasticity (Breusch–Pagan):
bptest(model)
studentized Breusch-Pagan test
data: model
BP = 2.1436, df = 3, p-value = 0.5431
bptest(model2)
studentized Breusch-Pagan test
data: model2
BP = 1.7747, df = 3, p-value = 0.6205
Ak p-hodnota > 0.05 → predpoklad homoskedasticity nie je porušený.
✅ Výsledky:
Model je vhodný na interpretáciu vzťahu medzi študijnými návykmi a úspešnosťou študentov. Študenti, ktorí sa viac učia a častejšie navštevujú hodiny, dosahujú lepšie výsledky skúšok.