my_packages <- c("tidyverse", "stats", "effectsize", "ggtext")
lapply(my_packages, require, character.only = TRUE)
Student Performance Analysis
Purpose
Frank Johnson High School wants to understand how variables such as “Study Hours”, “Smartphone Usage”, “Subject”, and “Learning Method” relate to its students’ exam scores. The findings would help the school decide which interventions or programs to develop to keep supporting its students’ development.
Variables Examined
- Study Hours/Day: Hours each student studies per day
- Attendance %: Each student’s attendance percentage
- Exam Score: Each student’s general exam score (no subject specified)
- Favorite Subject: Each student’s favorite subject
- Smartphone Usage: Hours per day each student spends on their smartphone
- Learning Method: How students attend school (recorded in the data as Online, Offline for on-site, or Mixed for hybrid)
Analysis Goals
- Is there a relationship between smartphone usage and student exam scores? (Correlation Test)
- Is there a significant mean difference in exam scores between STEM & Non-STEM Majors? (T-Test)
- Can attendance predict student exam scores? (Linear Regression)
- Do students who learn On-site perform better on exams than students who are online? (T-Test)
Hypotheses
- H1: There will be a strong negative relationship between student smartphone usage and student exam scores.
- H2: Students in STEM Majors will have a significantly higher mean in exam scores compared to Non-STEM Majors.
- H3: Student attendance will be a strong positive predictor of student exam scores.
- H4: Students who attend school in person will have a significantly higher mean in exam scores compared to students who learn online.
1. INSTALL & LOAD PACKAGES
2. LOAD, VIEW DATA, & CHECK FOR NA’S
students_df <- read_csv("education_dataset.csv")
glimpse(students_df)
Rows: 500
Columns: 8
$ `Student Name` <chr> "Student_1", "Student_2", "Student_3", "S…
$ `Class/Grade` <chr> "Grade 11", "Grade 11", "Grade 12", "Grad…
$ `Study Hours/Day` <dbl> 4.88, 4.16, 3.23, 5.51, 3.47, 3.62, 5.57,…
$ `Attendance %` <dbl> 67.52, 99.44, 96.33, 66.64, 94.50, 89.79,…
$ `Exam Score` <dbl> 44.00, 53.48, 75.08, 83.41, 79.23, 79.19,…
$ `Favorite Subject` <chr> "Math", "English", "Math", "Computer", "C…
$ `Learning Method` <chr> "Online", "Online", "Mixed", "Offline", "…
$ `Smartphone Usage (hrs/day)` <dbl> 4.35, 1.26, 2.00, 3.69, 0.63, 3.81, 2.01,…
summary(students_df)
 Student Name       Class/Grade        Study Hours/Day  Attendance %
 Length:500         Length:500         Min.   :1.010    Min.   :60.05
 Class :character   Class :character   1st Qu.:2.328    1st Qu.:70.27
 Mode  :character   Mode  :character   Median :3.620    Median :80.69
                                       Mean   :3.614    Mean   :80.12
                                       3rd Qu.:4.952    3rd Qu.:89.94
                                       Max.   :5.990    Max.   :99.86
   Exam Score    Favorite Subject   Learning Method
 Min.   :35.09   Length:500         Length:500
 1st Qu.:51.45   Class :character   Class :character
 Median :67.08   Mode  :character   Mode  :character
 Mean   :67.70
 3rd Qu.:84.12
 Max.   :99.99
 Smartphone Usage (hrs/day)
 Min.   :0.510
 1st Qu.:1.607
 Median :2.605
 Mean   :2.685
 3rd Qu.:3.783
 Max.   :4.990
colSums(is.na(students_df))
              Student Name                Class/Grade
                         0                          0
           Study Hours/Day               Attendance %
                         0                          0
                Exam Score           Favorite Subject
                         0                          0
           Learning Method Smartphone Usage (hrs/day)
                         0                          0
3. DATA CLEANING FOR STATISTICAL TEST
3.1 Set up for correlation and linear regression test
relevant_cols <- c("Study Hours/Day",
"Attendance %",
"Exam Score",
"Favorite Subject",
"Smartphone Usage (hrs/day)",
"Learning Method")
analyzed_students_df <- students_df %>%
select(all_of(relevant_cols))
glimpse(analyzed_students_df)
Rows: 500
Columns: 6
$ `Study Hours/Day` <dbl> 4.88, 4.16, 3.23, 5.51, 3.47, 3.62, 5.57,…
$ `Attendance %` <dbl> 67.52, 99.44, 96.33, 66.64, 94.50, 89.79,…
$ `Exam Score` <dbl> 44.00, 53.48, 75.08, 83.41, 79.23, 79.19,…
$ `Favorite Subject` <chr> "Math", "English", "Math", "Computer", "C…
$ `Smartphone Usage (hrs/day)` <dbl> 4.35, 1.26, 2.00, 3.69, 0.63, 3.81, 2.01,…
$ `Learning Method` <chr> "Online", "Online", "Mixed", "Offline", "…
3.2 Set up for Independent Samples T-Test
# Group favorite subjects into STEM (1) and Non-STEM (0)
analyzed_students_df <- analyzed_students_df %>%
  mutate(
    Majors = case_when(
      `Favorite Subject` == "Math" ~ 1,
      `Favorite Subject` == "Computer" ~ 1,
      `Favorite Subject` == "Science" ~ 1,
      `Favorite Subject` == "English" ~ 0,
      `Favorite Subject` == "History" ~ 0
    ) %>% factor(
      levels = c(0, 1),
      labels = c("Non-STEM", "STEM")
    )
  )
# Keep only Online and Offline students (drops "Mixed") and label Offline as Onsite
Lmethod_t_test <- analyzed_students_df %>%
  filter(`Learning Method` %in% c("Online", "Offline")) %>%
  mutate(
    Learning_Group = case_when(
      `Learning Method` == "Online" ~ 1,
      `Learning Method` == "Offline" ~ 0
    ) %>%
      factor(
        levels = c(1, 0),
        labels = c("Online", "Onsite")
      )
  )
# Check whether the group sizes are balanced for each t-test
analyzed_students_df %>% count(Majors)
# A tibble: 2 × 2
  Majors       n
  <fct>    <int>
1 Non-STEM   184
2 STEM       316
Lmethod_t_test %>% count(Learning_Group)
# A tibble: 2 × 2
  Learning_Group     n
  <fct>          <int>
1 Online           161
2 Onsite           164
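The recoding in 3.2 lists every subject by hand, and `case_when()` returns `NA` for any subject that matches none of its conditions. As a minimal sketch in base R, assuming the five subjects seen above are the only ones in the data, the same grouping can be written as a set-membership test. The vector `stem_subjects` and the toy `favorite_subject` values are hypothetical illustrations, not taken from the dataset.

```r
# Hypothetical toy input standing in for the `Favorite Subject` column
favorite_subject <- c("Math", "English", "Computer", "History", "Science")

# Assumed list of subjects counted as STEM (mirrors the case_when() in 3.2)
stem_subjects <- c("Math", "Computer", "Science")

# Membership test, then convert to a factor with Non-STEM as the first level
majors <- factor(
  ifelse(favorite_subject %in% stem_subjects, "STEM", "Non-STEM"),
  levels = c("Non-STEM", "STEM")
)

table(majors)
```

Unlike the `case_when()` version, this sketch labels any unlisted subject as Non-STEM rather than `NA`, so the two are equivalent only if no other subjects occur.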
4. EXPLORATORY DATA ANALYSIS (VISUALIZATIONS)
4.1 Visualize the distribution for each single variable being analyzed
ggplot(analyzed_students_df, aes(x = `Study Hours/Day`)) +
  geom_histogram(color = "black", fill = "pink")
ggplot(analyzed_students_df, aes(x = `Attendance %`)) +
  geom_histogram(bins = 40, color = "black", fill = "blue")
ggplot(analyzed_students_df, aes(x = `Exam Score`)) +
  geom_histogram(color = "black", fill = "lightblue")
ggplot(analyzed_students_df, aes(x = `Smartphone Usage (hrs/day)`)) +
  geom_histogram(color = "black", fill = "darkred")
4.2 Visualize scatter plots for H1 & H3
ggplot(analyzed_students_df, aes(x = `Smartphone Usage (hrs/day)`,
                                 y = `Exam Score`)) +
  geom_point(alpha = 0.5, color = "black", size = 2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1, se = TRUE) +
  labs(x = "Students Smart Phone Usage (Hrs)",
       y = "Students Exam Score")
ggplot(analyzed_students_df, aes(x = `Attendance %`,
                                 y = `Exam Score`)) +
  geom_point(color = "black", size = 3) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1) +
  labs(x = "Students Attendance",
       y = "Students Exam Score")
4.3 Visualize T-Test for H2 & H4
ggplot(analyzed_students_df, aes(x = Majors,
                                 y = `Exam Score`)) +
  geom_boxplot(aes(fill = Majors))
ggplot(Lmethod_t_test, aes(x = Learning_Group,
                           y = `Exam Score`)) +
  geom_boxplot()
5. STATISTICAL TEST FOR ANALYSIS
5.1 Conduct Correlation Statistical Test
cor.test(analyzed_students_df$`Smartphone Usage (hrs/day)`,
analyzed_students_df$`Exam Score`,
alternative = "less",
method = "pearson",
conf.level = 0.95)
Pearson's product-moment correlation
data: analyzed_students_df$`Smartphone Usage (hrs/day)` and analyzed_students_df$`Exam Score`
t = -1.6978, df = 498, p-value = 0.04509
alternative hypothesis: true correlation is less than 0
95 percent confidence interval:
-1.000000000 -0.002223946
sample estimates:
cor
-0.0758597
Although the negative correlation between student smartphone usage and exam score was statistically significant (one-sided p = 0.045), the effect size was very small (r = -0.08), so the strong negative relationship predicted by H1 was not supported. There is no practically important relationship between smartphone usage and exam score; the significant result mostly reflects the large sample size (n = 500) rather than real-world impact.
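To see why such a small correlation can still reach significance, note that the t statistic of a Pearson correlation depends on both r and the sample size: t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below plugs in the values reported above. The second calculation, with a hypothetical sample of 100 students and the same r, is only an illustration and not a result from this dataset.

```r
# Reported values from the cor.test() output above
r <- -0.0758597
n <- 500

# t statistic for a Pearson correlation with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
# One-sided p-value for the alternative "true correlation is less than 0"
p_one_sided <- pt(t_stat, df = n - 2)

# Hypothetical comparison: the same r, but only 100 students
n_small <- 100
t_small <- r * sqrt(n_small - 2) / sqrt(1 - r^2)
p_small <- pt(t_small, df = n_small - 2)

round(c(t = t_stat, p = p_one_sided, t_small = t_small, p_small = p_small), 4)
```

With the reported values this reproduces the t statistic of about -1.698 on 498 degrees of freedom. With the hypothetical smaller sample, the same r would not be significant at the 0.05 level.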
5.2 Conduct Linear Regression Statistical Test
regression_model <- lm(`Exam Score` ~ `Attendance %`,
data = analyzed_students_df)
summary(regression_model)
Call:
lm(formula = `Exam Score` ~ `Attendance %`, data = analyzed_students_df)
Residuals:
    Min      1Q  Median      3Q     Max
-33.617 -16.350  -0.326  16.928  32.568
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    73.98221    5.97635  12.379   <2e-16 ***
`Attendance %` -0.07838    0.07384  -1.062    0.289
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19 on 498 degrees of freedom
Multiple R-squared: 0.002258, Adjusted R-squared: 0.0002544
F-statistic: 1.127 on 1 and 498 DF, p-value: 0.2889
The relationship between student attendance and exam scores was not statistically significant (p = 0.289). The slope was very small and, contrary to H3, negative (b = -0.08), and attendance explained almost none of the variance in exam scores (R² = 0.002). Student attendance therefore could not be used to predict exam performance in this data.
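One way to judge the practical size of the slope is to ask how much the fitted line changes across the whole observed attendance range. The sketch below uses only the coefficient from the regression output and the minimum and maximum attendance from the summary in section 2; the variable names are illustrative.

```r
# Slope from the regression output, attendance range from summary(students_df)
slope <- -0.07838
attendance_min <- 60.05
attendance_max <- 99.86

# Change in predicted exam score between the lowest and highest attendance
predicted_change <- slope * (attendance_max - attendance_min)
round(predicted_change, 2)
```

The fitted line moves by only about 3 points over the full attendance range, which is small next to the residual standard error of 19 points.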
5.3 Conduct Independent Sample T-Test Statistical Tests
t.test(`Exam Score` ~ Majors,
var.equal = FALSE,
conf.level = 0.95,
data = analyzed_students_df)
Welch Two Sample t-test
data: Exam Score by Majors
t = -2.4696, df = 387.37, p-value = 0.01396
alternative hypothesis: true difference in means between group Non-STEM and group STEM is not equal to 0
95 percent confidence interval:
-7.7483153 -0.8794572
sample estimates:
mean in group Non-STEM mean in group STEM
64.97592 69.28981
descriptive_stats <- analyzed_students_df %>%
  group_by(Majors) %>%
  summarise(
    mean = mean(`Exam Score`),
    sd = sd(`Exam Score`),
    n = n()
  )
# Difference in group means (STEM minus Non-STEM)
mean_difference <- descriptive_stats[2, 2] - descriptive_stats[1, 2]
# Stored under its own name so the cohens_d() function is not shadowed
majors_cohens_d <- cohens_d(`Exam Score` ~ Majors,
                            data = analyzed_students_df)
descriptive_stats
# A tibble: 2 × 4
  Majors    mean    sd     n
  <fct>    <dbl> <dbl> <int>
1 Non-STEM  65.0  18.7   184
2 STEM      69.3  19.0   316
mean_difference
      mean
1 4.313886
majors_cohens_d
Cohen's d |         95% CI
--------------------------
-0.23     | [-0.41, -0.05]
- Estimated using pooled SD.
There was a statistically significant mean difference in exam scores between STEM and Non-STEM majors (p = 0.014). Although STEM majors scored an average of 4.31 points higher than Non-STEM majors, the effect size was small (|d| = 0.23; the sign is negative above because the difference is computed as Non-STEM minus STEM). This means that the exam score distributions of the two groups overlap substantially.
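The effect size can be checked by hand from the descriptive statistics. This is a sketch of the pooled-SD form of Cohen's d, using the group means from the t-test output and the rounded standard deviations from the table, so it should agree with `cohens_d()` only to about two decimals.

```r
# Group summaries reported above (SDs are rounded to one decimal)
mean_non_stem <- 64.97592; sd_non_stem <- 18.7; n_non_stem <- 184
mean_stem     <- 69.28981; sd_stem     <- 19.0; n_stem     <- 316

# Pooled standard deviation across both groups
sd_pooled <- sqrt(
  ((n_non_stem - 1) * sd_non_stem^2 + (n_stem - 1) * sd_stem^2) /
    (n_non_stem + n_stem - 2)
)

# Cohen's d, computed as Non-STEM minus STEM to match the sign reported above
d <- (mean_non_stem - mean_stem) / sd_pooled
round(d, 2)
```

Dividing the mean difference by a pooled SD of roughly 19 points shows why a gap of about 4 points counts as a small effect.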
t.test(`Exam Score` ~ Learning_Group,
var.equal = FALSE,
data = Lmethod_t_test,
conf.level = 0.95)
Welch Two Sample t-test
data: Exam Score by Learning_Group
t = -1.721, df = 319.67, p-value = 0.08621
alternative hypothesis: true difference in means between group Online and group Onsite is not equal to 0
95 percent confidence interval:
-7.7864778 0.5200947
sample estimates:
mean in group Online mean in group Onsite
65.20553 68.83872
lm_descriptive_stats <- Lmethod_t_test %>%
  group_by(Learning_Group) %>%
  summarise(
    mean = mean(`Exam Score`),
    sd = sd(`Exam Score`),
    N = n()
  )
# Difference in group means (Onsite minus Online)
lm_mean_difference <- lm_descriptive_stats[2, 2] - lm_descriptive_stats[1, 2]
lm_cohen_d <- cohens_d(`Exam Score` ~ Learning_Group,
                       data = Lmethod_t_test)
lm_descriptive_stats
# A tibble: 2 × 4
  Learning_Group  mean    sd     N
  <fct>          <dbl> <dbl> <int>
1 Online          65.2  19.8   161
2 Onsite          68.8  18.2   164
lm_mean_difference
      mean
1 3.633192
lm_cohen_d
Cohen's d |        95% CI
-------------------------
-0.19     | [-0.41, 0.03]
- Estimated using pooled SD.
There was not a statistically significant mean difference in exam scores between Online and On-site students (p = 0.086). Although On-site students scored an average of 3.63 points higher than Online students, the effect size was small (d = -0.19) and its 95% confidence interval [-0.41, 0.03] includes zero, implying that the exam score distributions of the two groups largely overlap.
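The Welch t statistic reported above can also be reproduced from the group summaries, which makes explicit that it is the mean difference divided by a standard error built from each group's own variance. The sketch uses the rounded standard deviations from the table, so it matches the `t.test()` output only approximately.

```r
# Group summaries reported above (SDs are rounded to one decimal)
mean_online <- 65.20553; sd_online <- 19.8; n_online <- 161
mean_onsite <- 68.83872; sd_onsite <- 18.2; n_onsite <- 164

# Per-group variance of the mean, not pooled (Welch's approach)
v_online <- sd_online^2 / n_online
v_onsite <- sd_onsite^2 / n_onsite

# Welch t statistic, computed as Online minus Onsite to match the output
t_welch <- (mean_online - mean_onsite) / sqrt(v_online + v_onsite)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_online + v_onsite)^2 /
  (v_online^2 / (n_online - 1) + v_onsite^2 / (n_onsite - 1))

# Two-sided p-value
p_two_sided <- 2 * pt(-abs(t_welch), df = df_welch)
round(c(t = t_welch, df = df_welch, p = p_two_sided), 3)
```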
6. FINAL VISUALIZATIONS
6.1 Correlation visualization of student smart phone usage and exam score
ggplot(analyzed_students_df, aes(x = `Smartphone Usage (hrs/day)`,
                                 y = `Exam Score`)
) +
  geom_point(color = "#6200EE", size = 2, alpha = 0.5) +
  geom_smooth(method = "lm", color = "#03DAC5", se = TRUE, linewidth = 1.5) +
  labs(
    title = "No Practical Relationship between Student Smartphone Usage and Exam Scores",
    subtitle = "The relationship was statistically significant (*p* < 0.05), but the effect size was very small (*r* = -0.08). The significance likely reflects the large sample size rather than a real-world effect.",
    x = "Smartphone Usage (Hours)",
    y = "Exam Scores") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", margin = margin(b = 10)),
    plot.subtitle = element_textbox_simple(margin = margin(b = 15)),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
    axis.title = element_text(size = 11),
    axis.line = element_line(linetype = "dashed"),
    axis.text = element_text(size = 10))
6.2 Linear regression visualization of student attendance and exam scores
ggplot(analyzed_students_df, aes(x = `Attendance %`, y = `Exam Score`)
) +
  geom_jitter(width = 0.5, height = 0.5, alpha = 0.5, color = "black") +
  geom_smooth(method = "lm", color = "red", se = TRUE, linewidth = 1.5) +
  labs(x = "Attendance (Percentage)",
       title = "Student Attendance Unable to Predict Student Exam Scores",
       subtitle = "The relationship was not statistically significant, and the slope was weak and negative (*b* = -0.08). Student attendance is not a strong predictor of exam scores (*R*<sup>2</sup> = 0.00).") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 5)),
        plot.subtitle = element_textbox_simple(margin = margin(b = 15)),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 10),
        axis.line = element_line())
6.3 Independent samples t-test visualization (STEM & Non-STEM Majors)
ggplot(analyzed_students_df, aes(x = Majors, y = `Exam Score`)
) +
  geom_boxplot(aes(fill = Majors)) +
  labs(title = "No Strong Difference in Exam Scores between Majors",
       subtitle = "Although the difference was statistically significant (*p* < 0.05), the effect size was small (*d* = -0.23).") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 10)),
        plot.subtitle = element_textbox_simple(margin = margin(b = 15)),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(linetype = "dashed"),
        axis.text = element_text(size = 10))
6.4 Independent samples t-test visualization (Online & Onsite Students)
ggplot(Lmethod_t_test, aes(x = Learning_Group, y = `Exam Score`)
) +
  geom_boxplot(aes(fill = Learning_Group)) +
  scale_fill_brewer(palette = "BuPu") +
  labs(title = "No Strong Difference in Exam Scores between Learning Groups",
       subtitle = "The difference was not statistically significant (*p* > 0.05) and the effect size was small (*d* = -0.19), implying that the exam scores of the two groups overlap.") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", margin = margin(r = 10, b = 5)),
        plot.subtitle = element_textbox_simple(margin = margin(t = 10, b = 12)),
        panel.grid = element_blank(),
        axis.line = element_line(linetype = "dashed"),
        axis.text = element_text(size = 10)
  )
7. Conclusions
None of the chosen variables showed a strong effect on exam scores in these statistical tests. Frank Johnson High School should therefore examine additional variables to see how they relate to student academic performance. Those findings could inform the development of an initiative, or investment in a program, that helps the school support its students’ improvement.