Student Performance Analysis

Author

Rasheed Johnson

Purpose

Frank Johnson High School wants to understand how variables such as “Study Hours”, “Smartphone usage”, “Subject”, and “Learning Method” affect the exam scores of their students. This would help the school determine what interventions or programs they should develop to continue supporting their students’ development.

Variables Examined

Study Hours/Day: Number of hours each student studies per day
Attendance %: Attendance percentage per student
Exam Score: General exam score per student (no subject specified)
Favorite Subject: Each student’s favorite subject
Smartphone Usage (hrs/day): Number of hours per day each student uses a smartphone
Learning Method: How students attend school (Online, Offline [on-site], or Mixed [hybrid])

Analysis Goals

Is there a relationship between smartphone usage and student exam scores? (Correlation Test)
Is there a significant mean difference in exam scores between STEM & Non-STEM Majors? (T-Test)
Can attendance predict student exam scores? (Linear Regression)
Do students who learn On-site perform better on exams than students who are online? (T-Test)

Hypotheses

H1: There will be a strong negative relationship between student smartphone usage and student exam scores.
H2: Students in STEM Majors will have a significantly higher mean in exam scores compared to Non-STEM Majors.
H3: Student attendance will be a strong positive predictor of student exam scores.
H4: Students who attend school in person will have a significantly higher mean in exam scores compared to students who learn online.

1. INSTALL & LOAD PACKAGES

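# "stats" ships with base R and is attached by default, so it is not listed here
my_packages <- c("tidyverse", "effectsize", "ggtext")
# install.packages(my_packages)  # run once if any package is missing
invisible(lapply(my_packages, library, character.only = TRUE))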

2. LOAD, VIEW DATA, & CHECK FOR NAs

students_df <- read_csv("education_dataset.csv")
glimpse(students_df)

Rows: 500
Columns: 8
$ `Student Name`               <chr> "Student_1", "Student_2", "Student_3", "S…
$ `Class/Grade`                <chr> "Grade 11", "Grade 11", "Grade 12", "Grad…
$ `Study Hours/Day`            <dbl> 4.88, 4.16, 3.23, 5.51, 3.47, 3.62, 5.57,…
$ `Attendance %`               <dbl> 67.52, 99.44, 96.33, 66.64, 94.50, 89.79,…
$ `Exam Score`                 <dbl> 44.00, 53.48, 75.08, 83.41, 79.23, 79.19,…
$ `Favorite Subject`           <chr> "Math", "English", "Math", "Computer", "C…
$ `Learning Method`            <chr> "Online", "Online", "Mixed", "Offline", "…
$ `Smartphone Usage (hrs/day)` <dbl> 4.35, 1.26, 2.00, 3.69, 0.63, 3.81, 2.01,…

summary(students_df)

 Student Name       Class/Grade        Study Hours/Day  Attendance %  
 Length:500         Length:500         Min.   :1.010   Min.   :60.05  
 Class :character   Class :character   1st Qu.:2.328   1st Qu.:70.27  
 Mode  :character   Mode  :character   Median :3.620   Median :80.69  
                                       Mean   :3.614   Mean   :80.12  
                                       3rd Qu.:4.952   3rd Qu.:89.94  
                                       Max.   :5.990   Max.   :99.86  
   Exam Score    Favorite Subject   Learning Method   
 Min.   :35.09   Length:500         Length:500        
 1st Qu.:51.45   Class :character   Class :character  
 Median :67.08   Mode  :character   Mode  :character  
 Mean   :67.70                                        
 3rd Qu.:84.12                                        
 Max.   :99.99                                        
 Smartphone Usage (hrs/day)
 Min.   :0.510             
 1st Qu.:1.607             
 Median :2.605             
 Mean   :2.685             
 3rd Qu.:3.783             
 Max.   :4.990

colSums(is.na(students_df))

              Student Name                Class/Grade 
                         0                          0 
           Study Hours/Day               Attendance % 
                         0                          0 
                Exam Score           Favorite Subject 
                         0                          0 
           Learning Method Smartphone Usage (hrs/day) 
                         0                          0

3. DATA CLEANING FOR STATISTICAL TESTS

3.1 Set up for correlation and linear regression tests

relevant_cols <- c("Study Hours/Day", 
                   "Attendance %",
                   "Exam Score",
                   "Favorite Subject",
                   "Smartphone Usage (hrs/day)",
                   "Learning Method")

analyzed_students_df <- students_df %>% 
  select(all_of(relevant_cols)) 

glimpse(analyzed_students_df)

Rows: 500
Columns: 6
$ `Study Hours/Day`            <dbl> 4.88, 4.16, 3.23, 5.51, 3.47, 3.62, 5.57,…
$ `Attendance %`               <dbl> 67.52, 99.44, 96.33, 66.64, 94.50, 89.79,…
$ `Exam Score`                 <dbl> 44.00, 53.48, 75.08, 83.41, 79.23, 79.19,…
$ `Favorite Subject`           <chr> "Math", "English", "Math", "Computer", "C…
$ `Smartphone Usage (hrs/day)` <dbl> 4.35, 1.26, 2.00, 3.69, 0.63, 3.81, 2.01,…
$ `Learning Method`            <chr> "Online", "Online", "Mixed", "Offline", "…

3.2 Set up for Independent Samples T-Test

analyzed_students_df <- analyzed_students_df %>% 
  mutate(
    Majors = case_when(
      `Favorite Subject` == "Math" ~ 1,
      `Favorite Subject` == "Computer" ~ 1,
      `Favorite Subject` == "Science" ~ 1,
      `Favorite Subject` == "English" ~ 0,
      `Favorite Subject` == "History" ~ 0
    ) %>% factor(
      levels = c(0,1),
      labels = c("Non-STEM", "STEM")
    )
  )

Lmethod_t_test <- analyzed_students_df %>% 
  filter(`Learning Method` %in% c("Online", "Offline")) %>% 
  mutate(
    Learning_Group = case_when(
      `Learning Method` == "Online" ~ 1,
      `Learning Method` == "Offline" ~ 0
    ) %>% 
      factor(
        levels = c(1,0),
        labels = c("Online", "Onsite")
      )
  )

# Check whether group sizes are balanced for the t-tests
analyzed_students_df %>% count(Majors)

# A tibble: 2 × 2
  Majors       n
  <fct>    <int>
1 Non-STEM   184
2 STEM       316

Lmethod_t_test %>% count(Learning_Group)

# A tibble: 2 × 2
  Learning_Group     n
  <fct>          <int>
1 Online           161
2 Onsite           164
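
The STEM recode in 3.2 uses one case_when() branch per subject, and any subject not listed there would silently become NA. The following is a minimal base-R sketch, run on a made-up vector rather than the real data, of a more compact recode with an explicit check for unmatched values. The names toy_subjects, stem_subjects, and known_subjects are illustrative assumptions, not objects from the analysis.

```r
# Toy data: hypothetical values, not taken from education_dataset.csv
toy_subjects <- c("Math", "English", "Science", "History", "Computer", "Math")

# Subject groupings mirror the case_when() branches in section 3.2
stem_subjects  <- c("Math", "Computer", "Science")
known_subjects <- c(stem_subjects, "English", "History")

# Fail early if a subject outside the known set appears
stopifnot(all(toy_subjects %in% known_subjects))

# Recode to a two-level factor, Non-STEM first (same level order as section 3.2)
toy_majors <- factor(
  ifelse(toy_subjects %in% stem_subjects, "STEM", "Non-STEM"),
  levels = c("Non-STEM", "STEM")
)

table(toy_majors)
```

The same check could be applied to `Favorite Subject` before the recode, so that an unexpected category stops the script instead of producing missing group labels.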

4. EXPLORATORY DATA ANALYSIS (VISUALIZATIONS)

4.1 Visualize the distribution for each single variable being analyzed

ggplot(analyzed_students_df, aes(x = `Study Hours/Day`)) + 
  geom_histogram(color = "black", fill = "pink")

ggplot(analyzed_students_df, aes(x = `Attendance %`)) +
  geom_histogram(bins = 40, color = "black", fill = "blue")

ggplot(analyzed_students_df, aes(x = `Exam Score`)) +
  geom_histogram(color = "black", fill = "lightblue")

ggplot(analyzed_students_df, aes(x = `Smartphone Usage (hrs/day)`)) + 
  geom_histogram(color = "black", fill = "darkred")

4.2 Visualize scatter plots for H1 & H3

ggplot(analyzed_students_df, aes(x = `Smartphone Usage (hrs/day)`,
                                 y = `Exam Score`)) +
  geom_point(alpha = 0.5, color = "black", size = 2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1, se = TRUE) +
  labs(x = "Students Smart Phone Usage (Hrs)",
       y = "Students Exam Score")

ggplot(analyzed_students_df, aes(x = `Attendance %`,
                                 y = `Exam Score`)) + 
  geom_point(color = "black", size = 3) + 
  geom_smooth(method = "lm", se = TRUE, linewidth = 1) + 
  labs(x = "Students Attendance",
       y = "Students Exam Score")

4.3 Visualize T-Test for H2 & H4

ggplot(analyzed_students_df, aes(x = Majors,
                                 y = `Exam Score`)) + 
  geom_boxplot(aes(fill = Majors))

ggplot(Lmethod_t_test, aes(x = Learning_Group,
                           y = `Exam Score`)) +
  geom_boxplot()

5. STATISTICAL TESTS FOR ANALYSIS

5.1 Conduct Correlation Statistical Test

cor.test(analyzed_students_df$`Smartphone Usage (hrs/day)`, 
         analyzed_students_df$`Exam Score`,
         alternative = "less",
         method = "pearson",
         conf.level = 0.95)


    Pearson's product-moment correlation

data:  analyzed_students_df$`Smartphone Usage (hrs/day)` and analyzed_students_df$`Exam Score`
t = -1.6978, df = 498, p-value = 0.04509
alternative hypothesis: true correlation is less than 0
95 percent confidence interval:
 -1.000000000 -0.002223946
sample estimates:
       cor 
-0.0758597

Key Finding

Although the one-sided test found a statistically significant negative correlation between student smartphone usage and exam scores (p = 0.045), the effect size was very small (r = -0.08). There is therefore no practically important relationship between smartphone usage and exam scores; the significant result most likely reflects the large sample size (n = 500) rather than real-world impact. H1, which predicted a strong negative relationship, was not supported.
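
To illustrate why such a small correlation can still reach significance, here is a rough sketch that recomputes the t statistic and one-sided p-value from r and n, then repeats the calculation for a hypothetical sample of 100 students. The helper functions t_from_r and p_less are illustrative names, and the sample size of 100 is an assumption chosen only for comparison.

```r
# Values copied from the cor.test() output above
r <- -0.0758597
n <- 500

# t statistic for a Pearson correlation: t = r * sqrt(n - 2) / sqrt(1 - r^2)
t_from_r <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

# One-sided p-value for H1 (true correlation less than 0)
p_less <- function(r, n) pt(t_from_r(r, n), df = n - 2)

t_500 <- t_from_r(r, n)     # t statistic at the actual sample size
p_500 <- p_less(r, n)       # one-sided p-value at the actual sample size
p_100 <- p_less(r, 100)     # same r with a hypothetical sample of 100 students
shared_variance <- r^2      # proportion of variance the two variables share

round(c(t_500 = t_500, p_500 = p_500, p_100 = p_100,
        shared_variance = shared_variance), 4)
```

With the same r, the smaller hypothetical sample no longer falls below the 0.05 threshold, and the shared variance stays under 1% in both cases.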

5.2 Conduct Linear Regression Statistical Test

regression_model <- lm(`Exam Score` ~ `Attendance %`,
                       data = analyzed_students_df)
summary(regression_model)


Call:
lm(formula = `Exam Score` ~ `Attendance %`, data = analyzed_students_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.617 -16.350  -0.326  16.928  32.568 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    73.98221    5.97635  12.379   <2e-16 ***
`Attendance %` -0.07838    0.07384  -1.062    0.289    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19 on 498 degrees of freedom
Multiple R-squared:  0.002258,  Adjusted R-squared:  0.0002544 
F-statistic: 1.127 on 1 and 498 DF,  p-value: 0.2889

Key Finding

Student attendance was not a statistically significant predictor of exam scores (b = -0.08, p = 0.289). The slope was close to zero and, contrary to H3, slightly negative. Attendance explained almost none of the variance in exam scores (R² = 0.002), so it could not be used to predict student exam performance.
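
As a rough illustration of how weak this slope is in practical terms, the sketch below uses the coefficients printed above to compare predicted scores at 60% and 100% attendance (roughly the observed range) and to build a 95% confidence interval for the slope. The function predict_score is an illustrative helper, not part of the original analysis.

```r
# Coefficients copied from summary(regression_model) above
intercept <- 73.98221
slope     <- -0.07838
slope_se  <- 0.07384
df_resid  <- 498

# Predicted exam score at a given attendance percentage
predict_score <- function(attendance) intercept + slope * attendance

# Predicted change across roughly the observed attendance range (60% to 100%)
score_change <- predict_score(100) - predict_score(60)

# 95% confidence interval for the slope: estimate +/- t critical value * SE
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

round(score_change, 2)
round(slope_ci, 3)
```

The interval spans zero, which agrees with the non-significant p-value, and the predicted change across the whole attendance range is only about three points.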

5.3 Conduct Independent Samples T-Tests

t.test(`Exam Score` ~ Majors, 
       var.equal = FALSE, 
       conf.level = 0.95,
       data = analyzed_students_df)


    Welch Two Sample t-test

data:  Exam Score by Majors
t = -2.4696, df = 387.37, p-value = 0.01396
alternative hypothesis: true difference in means between group Non-STEM and group STEM is not equal to 0
95 percent confidence interval:
 -7.7483153 -0.8794572
sample estimates:
mean in group Non-STEM     mean in group STEM 
              64.97592               69.28981
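
H2 is directional (STEM higher than Non-STEM), whereas the call above tests a two-sided alternative. The following is a minimal sketch, on made-up scores rather than the real data, of how a one-sided Welch test could be specified through the alternative argument. The data frame toy_df and its values are assumptions for illustration only.

```r
# Toy data: hypothetical scores, not taken from education_dataset.csv
toy_df <- data.frame(
  score  = c(60, 62, 65, 70, 58, 68, 72, 75, 71, 80),
  majors = factor(rep(c("Non-STEM", "STEM"), each = 5),
                  levels = c("Non-STEM", "STEM"))
)

# Two-sided Welch test, as in the analysis above
two_sided <- t.test(score ~ majors, data = toy_df, var.equal = FALSE)

# One-sided Welch test: with Non-STEM as the first factor level,
# alternative = "less" tests mean(Non-STEM) - mean(STEM) < 0
one_sided <- t.test(score ~ majors, data = toy_df, var.equal = FALSE,
                    alternative = "less")

c(two_sided = two_sided$p.value, one_sided = one_sided$p.value)
```

When the observed t statistic falls in the hypothesized direction, as it does for H2 (t = -2.4696), the one-sided p-value is half the two-sided one.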

descriptive_stats <- analyzed_students_df %>% 
  group_by(Majors) %>% 
  summarise(
    mean = mean(`Exam Score`),
    sd = sd(`Exam Score`),
    n = n()
  )

# STEM mean (row 2) minus Non-STEM mean (row 1)
mean_difference <- descriptive_stats[2, 2] - descriptive_stats[1, 2]
cohens_d <- cohens_d(`Exam Score` ~ Majors, 
                     data = analyzed_students_df)

descriptive_stats

# A tibble: 2 × 4
  Majors    mean    sd     n
  <fct>    <dbl> <dbl> <int>
1 Non-STEM  65.0  18.7   184
2 STEM      69.3  19.0   316

mean_difference

      mean
1 4.313886

cohens_d

Cohen's d |         95% CI
--------------------------
-0.23     | [-0.41, -0.05]

- Estimated using pooled SD.

Key Finding

There was a statistically significant mean difference in exam scores between STEM and Non-STEM majors (p = 0.014). Although STEM majors scored an average of 4.31 points higher than Non-STEM majors, the effect size was small (d = -0.23, Non-STEM minus STEM). This means that the two groups’ exam score distributions overlap substantially.
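
To make the overlap concrete, here is a rough sketch that recomputes Cohen's d by hand from the group summaries above and converts it to an overlapping coefficient. The overlap formula assumes two normal distributions with equal standard deviations, which the exam scores may not satisfy, so the result is only indicative. The SDs are the rounded values from the tibble.

```r
# Group summaries copied from the descriptive statistics above
# (SDs are the rounded values shown in the tibble)
mean_non_stem <- 64.97592; sd_non_stem <- 18.7; n_non_stem <- 184
mean_stem     <- 69.28981; sd_stem     <- 19.0; n_stem     <- 316

# Pooled standard deviation
pooled_sd <- sqrt(((n_non_stem - 1) * sd_non_stem^2 + (n_stem - 1) * sd_stem^2) /
                    (n_non_stem + n_stem - 2))

# Cohen's d, Non-STEM minus STEM (same sign convention as the output above)
d <- (mean_non_stem - mean_stem) / pooled_sd

# Overlapping coefficient, ASSUMING two normal distributions with equal SD
overlap <- 2 * pnorm(-abs(d) / 2)

round(c(pooled_sd = pooled_sd, d = d, overlap = overlap), 2)
```

Under those assumptions, roughly nine tenths of the two score distributions would overlap.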

t.test(`Exam Score` ~ Learning_Group,
       var.equal = FALSE,
       data = Lmethod_t_test,
       conf.level = 0.95)


    Welch Two Sample t-test

data:  Exam Score by Learning_Group
t = -1.721, df = 319.67, p-value = 0.08621
alternative hypothesis: true difference in means between group Online and group Onsite is not equal to 0
95 percent confidence interval:
 -7.7864778  0.5200947
sample estimates:
mean in group Online mean in group Onsite 
            65.20553             68.83872

lm_descriptive_stats <- Lmethod_t_test %>% 
  group_by(Learning_Group) %>% 
  summarise(
    mean = mean(`Exam Score`),
    sd = sd(`Exam Score`),
    N = n()
  )

# Onsite mean (row 2) minus Online mean (row 1)
lm_mean_difference <- lm_descriptive_stats[2, 2] - lm_descriptive_stats[1, 2]
lm_cohen_d <- cohens_d(`Exam Score` ~ Learning_Group,
         data = Lmethod_t_test)

lm_descriptive_stats

# A tibble: 2 × 4
  Learning_Group  mean    sd     N
  <fct>          <dbl> <dbl> <int>
1 Online          65.2  19.8   161
2 Onsite          68.8  18.2   164

lm_mean_difference

      mean
1 3.633192

lm_cohen_d

Cohen's d |        95% CI
-------------------------
-0.19     | [-0.41, 0.03]

- Estimated using pooled SD.

Key Finding

There was not a statistically significant mean difference in exam scores between Online and Onsite students (p = 0.086). Although Onsite students scored an average of 3.63 points higher than Online students, the effect size was small (d = -0.19, Online minus Onsite), and its 95% confidence interval [-0.41, 0.03] includes zero. The two groups’ exam score distributions overlap substantially.
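
A non-significant result can reflect limited statistical power rather than the absence of a difference. The sketch below is a rough, after-the-fact approximation using power.t.test() from the stats package with the group summaries printed above. It assumes equal group sizes and a pooled-variance t-test rather than the Welch test actually run, so the figures are indicative only.

```r
# Group summaries copied from the descriptive statistics above (rounded SDs)
sd_online <- 19.8; n_online <- 161
sd_onsite <- 18.2; n_onsite <- 164
mean_diff <- 3.633192

# Pooled standard deviation across the two groups
pooled_sd <- sqrt(((n_online - 1) * sd_online^2 + (n_onsite - 1) * sd_onsite^2) /
                    (n_online + n_onsite - 2))

# Approximate power to detect the observed difference,
# ASSUMING equal group sizes of 161 (the smaller group)
observed_power <- power.t.test(n = min(n_online, n_onsite), delta = mean_diff,
                               sd = pooled_sd, sig.level = 0.05)$power

# Students per group needed for 80% power at the same difference and SD
n_needed <- power.t.test(delta = mean_diff, sd = pooled_sd,
                         sig.level = 0.05, power = 0.80)$n

round(c(observed_power = observed_power, n_needed = ceiling(n_needed)), 2)
```

If a difference of this size were real, a sample of this size would detect it less than half the time, so the result is better read as inconclusive than as evidence of no difference.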

6. FINAL VISUALIZATIONS

6.1 Correlation visualization of student smart phone usage and exam score

ggplot(analyzed_students_df, aes(x = `Smartphone Usage (hrs/day)`,
                                 y = `Exam Score`)
       ) +
  geom_point(color = "#6200EE", size = 2, alpha = 0.5) + 
  geom_smooth(method = "lm", color = "#03DAC5", se = TRUE, linewidth = 1.5) +
  labs(
    title = "No Practical Relationship between Student Smartphone Usage and Exam Scores",
    subtitle = "The relationship was statistically significant (*p* < 0.05), but the effect size was very small (*r* = -0.08). The significance most likely reflects the large sample size rather than a real-world effect.",
    x = "Smartphone Usage (Hours)",
    y = "Exam Scores") + 
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", margin = margin(b = 10)),
    plot.subtitle = element_textbox_simple(margin = margin(b = 15)),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
    axis.title = element_text(size = 11),
    axis.line = element_line(linetype = "dashed"),
    axis.text = element_text(size = 10))

6.2 Linear regression visualization of student attendance and exam scores

ggplot(analyzed_students_df, aes(x = `Attendance %`, y = `Exam Score`)
       ) +
  geom_jitter(width = 0.5, height = 0.5, alpha = 0.5, color = "black") +
  geom_smooth(method = "lm", color = "red", se = TRUE, linewidth = 1.5) +
  labs(x = "Attendance (Percentage)",
       title = "Student Attendance Unable to Predict Student Exam Scores",
       subtitle = "The relationship was not statistically significant, and the slope was close to zero (*b* = -0.08). Attendance explained almost none of the variance in exam scores (*R*<sup>2</sup> = 0.002).") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 5)),
        plot.subtitle = element_textbox_simple(margin = margin(b = 15)),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 10),
        axis.line = element_line())

6.3 Independent samples t-test visualization (STEM & Non-STEM Majors)

ggplot(analyzed_students_df, aes(x = Majors, y = `Exam Score`)
       ) +
  geom_boxplot( aes( fill = Majors)) +
  labs(title = "No Strong Difference in Exam Scores between Majors",
       subtitle = "Although the difference was statistically significant (*p* < 0.05), the effect size was small (*d* = -0.23).") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", margin = margin(b = 10)),
        plot.subtitle = element_textbox_simple(margin = margin(b = 15)),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(linetype = "dashed"),
        axis.text = element_text(size = 10))

6.4 Independent samples t-test visualization (Online & Onsite Students)

ggplot(Lmethod_t_test, aes(x = Learning_Group, y = `Exam Score`)
       ) +
  geom_boxplot(aes(fill = Learning_Group)) +
  scale_fill_brewer(palette = "BuPu") + 
  labs(title = "No Strong Difference in Exam Scores between Learning Groups",
       subtitle = "The difference was not statistically significant (*p* > 0.05) and the effect size was small (*d* = -0.19), implying that the exam scores of the two groups overlap substantially.") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", margin = margin(r = 10, b = 5)),
        plot.subtitle = element_textbox_simple(margin = margin(t = 10, b = 12)),
        panel.grid = element_blank(),
        axis.line = element_line(linetype = "dashed"),
        axis.text = element_text(size = 10)
  )

7. CONCLUSIONS

Final Remarks

The statistical tests found no strong effects of the chosen variables on exam scores. Frank Johnson High School should therefore examine additional variables to assess how they may affect students’ academic performance. Those findings could guide the development of an initiative, or investment in a program, that helps the school support its students’ improvement.