Author: Jethro Muwanguzi
library(readr)  # read_csv() is provided by readr
file_path <- "/Users/jethro/Downloads/Student_performance.csv"
students_data <- read_csv(file_path)
## Rows: 2392 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): StudentID, Age, Gender, Ethnicity, ParentalEducation, StudyTimeWee...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Passing show_col_types = FALSE to read_csv() suppresses the column-specification message above.
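A minimal sketch of that quieter import call, re-using the file_path defined above:
students_data <- read_csv(file_path, show_col_types = FALSE)  # same import, without the column-specification message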
head(students_data)
## # A tibble: 6 × 15
## StudentID Age Gender Ethnicity ParentalEducation StudyTimeWeekly Absences
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 17 1 0 2 19.8 7
## 2 1002 18 0 0 1 15.4 0
## 3 1003 15 0 2 3 4.21 26
## 4 1004 17 1 0 3 10.0 14
## 5 1005 17 1 0 2 4.67 17
## 6 1006 18 0 0 1 8.19 0
## # ℹ 8 more variables: Tutoring <dbl>, ParentalSupport <dbl>,
## # Extracurricular <dbl>, Sports <dbl>, Music <dbl>, Volunteering <dbl>,
## # GPA <dbl>, GradeClass <dbl>
Unit of Observation: A single student
Sample Size: 2,392 students (StudentID ranges from 1001 to 3392).
Variables: 15 numeric columns, shown in the column specification and head() output above.
Source: The dataset was obtained from Kaggle.
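The stated sample size and ID range can be checked directly against the data; a quick sketch:
nrow(students_data)              # number of rows (students)
range(students_data$StudentID)   # smallest and largest StudentID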
library(psych)
describe(students_data)
## vars n mean sd median trimmed mad min max
## StudentID 1 2392 2196.50 690.66 2196.50 2196.50 886.59 1001 3392.00
## Age 2 2392 16.47 1.12 16.00 16.46 1.48 15 18.00
## Gender 3 2392 0.51 0.50 1.00 0.51 0.00 0 1.00
## Ethnicity 4 2392 0.88 1.03 0.00 0.73 0.00 0 3.00
## ParentalEducation 5 2392 1.75 1.00 2.00 1.75 1.48 0 4.00
## StudyTimeWeekly 6 2392 9.77 5.65 9.71 9.73 6.97 0 19.98
## Absences 7 2392 14.54 8.47 15.00 14.57 10.38 0 29.00
## Tutoring 8 2392 0.30 0.46 0.00 0.25 0.00 0 1.00
## ParentalSupport 9 2392 2.12 1.12 2.00 2.14 1.48 0 4.00
## Extracurricular 10 2392 0.38 0.49 0.00 0.35 0.00 0 1.00
## Sports 11 2392 0.30 0.46 0.00 0.25 0.00 0 1.00
## Music 12 2392 0.20 0.40 0.00 0.12 0.00 0 1.00
## Volunteering 13 2392 0.16 0.36 0.00 0.07 0.00 0 1.00
## GPA 14 2392 1.91 0.92 1.89 1.90 1.07 0 4.00
## GradeClass 15 2392 2.98 1.23 4.00 3.16 0.00 0 4.00
## range skew kurtosis se
## StudentID 2391.00 0.00 -1.20 14.12
## Age 3.00 0.04 -1.37 0.02
## Gender 1.00 -0.04 -2.00 0.01
## Ethnicity 3.00 0.76 -0.77 0.02
## ParentalEducation 4.00 0.22 -0.29 0.02
## StudyTimeWeekly 19.98 0.05 -1.14 0.12
## Absences 29.00 -0.03 -1.18 0.17
## Tutoring 1.00 0.86 -1.25 0.01
## ParentalSupport 4.00 -0.17 -0.73 0.02
## Extracurricular 1.00 0.48 -1.77 0.01
## Sports 1.00 0.85 -1.27 0.01
## Music 1.00 1.52 0.32 0.01
## Volunteering 1.00 1.88 1.54 0.01
## GPA 4.00 0.01 -0.87 0.02
## GradeClass 4.00 -0.90 -0.42 0.03
# I begin by visualizing the distribution of GPA for students who attended tutoring and those who did not, using a histogram, to look for differences in the GPA distributions between the two groups.
# I use a joint (dodged) visualization, as in Seminar 1, Task 4, so the groups are easier to compare.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
students_data$Tutoring_Group <- ifelse(students_data$Tutoring == 1, "Tutoring", "No Tutoring")
ggplot(students_data, aes(x = GPA, fill = Tutoring_Group)) +
geom_histogram(binwidth = 0.3, position = "dodge", color = "ghostwhite", alpha = 0.9) +
scale_fill_manual(values = c("No Tutoring" = "gold", "Tutoring" = "black")) +
labs(
title = " Distribution of GPA",
x = "GPA",
y = "Frequency",
fill = "Tutoring Status"
) +
theme_minimal()
From the histogram, students who did not receive tutoring outnumber those who did at every GPA level, which mainly reflects the larger size of the “No Tutoring” group; that group is also concentrated at the very low and mid-range of the GPA scale.
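Since the two groups differ in size, raw counts can be misleading; a quick check of group sizes and shares (a sketch, using the Tutoring_Group variable created above):
table(students_data$Tutoring_Group)               # number of students in each tutoring group
prop.table(table(students_data$Tutoring_Group))   # share of the sample in each group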
# To ensure the validity of subsequent tests, I first check the normality of the GPA data for each group using the Shapiro-Wilk test.
library(dplyr)
gpa_tutoring_yes <- students_data %>%
filter(Tutoring_Group == "Tutoring") %>%
pull(GPA)
gpa_tutoring_no <- students_data %>%
filter(Tutoring_Group == "No Tutoring") %>%
pull(GPA)
shapiro_tutoring <- shapiro.test(gpa_tutoring_yes)
shapiro_no_tutoring <- shapiro.test(gpa_tutoring_no)
print(shapiro_tutoring)
##
## Shapiro-Wilk normality test
##
## data: gpa_tutoring_yes
## W = 0.98201, p-value = 9.628e-08
print(shapiro_no_tutoring)
##
## Shapiro-Wilk normality test
##
## data: gpa_tutoring_no
## W = 0.98076, p-value = 3.466e-14
i) Tutoring group:
- H₀: GPA values for students who attended tutoring are normally distributed.
- H₁: GPA values for students who attended tutoring are not normally distributed.
We reject the null hypothesis (H₀) at p-value < 0.001.
ii) No Tutoring group:
- H₀: GPA values for students who did not attend tutoring are normally distributed.
- H₁: GPA values for students who did not attend tutoring are not normally distributed.
We reject the null hypothesis (H₀) at p-value < 0.001.
The Shapiro-Wilk test results indicate that the GPA data for both students who attended tutoring and those who did not are not normally distributed, as the null hypothesis of normality was rejected for both groups.
This violation of the normality assumption suggests that a non-parametric test would be more appropriate. However, for completeness of the homework, I will proceed with both parametric and non-parametric tests.
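As a visual complement to the Shapiro-Wilk tests, normal Q-Q plots could also be inspected for each group; a minimal base-R sketch:
qqnorm(gpa_tutoring_yes, main = "Q-Q plot: GPA, Tutoring group")
qqline(gpa_tutoring_yes)   # reference line; strong departures indicate non-normality
qqnorm(gpa_tutoring_no, main = "Q-Q plot: GPA, No Tutoring group")
qqline(gpa_tutoring_no)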
# For the parametric test, I use the Welch Two Sample t-test.
t_test <- t.test(gpa_tutoring_yes, gpa_tutoring_no, var.equal = FALSE)
print(t_test)
##
## Welch Two Sample t-test
##
## data: gpa_tutoring_yes and gpa_tutoring_no
## t = 7.1725, df = 1366.6, p-value = 1.203e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2102165 0.3684970
## sample estimates:
## mean of x mean of y
## 2.108325 1.818968
We reject the null hypothesis at p-value < 0.001, indicating a statistically significant difference in the mean GPA between the two groups. However, due to the violation of normality assumptions, the t-test results should be interpreted cautiously, and a non-parametric test is more appropriate for robust analysis.
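To complement the t-test, a standardized mean difference could also be reported; a minimal sketch using effectsize::cohens_d() (the effectsize package is loaded further below for the rank-biserial correlation):
effectsize::cohens_d(gpa_tutoring_yes, gpa_tutoring_no)   # standardized difference in mean GPA between the two groups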
# Given the evidence of non-normality, I now use the Wilcoxon rank sum test to determine whether GPA values differ significantly between students who attended tutoring and those who did not.
wilcox_test <- wilcox.test(gpa_tutoring_yes, gpa_tutoring_no)
print(wilcox_test)
##
## Wilcoxon rank sum test with continuity correction
##
## data: gpa_tutoring_yes and gpa_tutoring_no
## W = 704808, p-value = 3.919e-11
## alternative hypothesis: true location shift is not equal to 0
We reject the null hypothesis at p-value < 0.001, indicating a statistically significant difference in GPA distributions between the two groups. I proceed to carry out further analysis of effect size to assess the practical significance of this difference.
# I calculate the rank-biserial correlation to quantify the effect size of the difference in GPA distributions between students who attended tutoring and those who did not.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effect <- effectsize(
wilcox.test(gpa_tutoring_yes, gpa_tutoring_no, alternative = "two.sided")
)
print(effect)
## r (rank biserial) | 95% CI
## --------------------------------
## 0.17 | [0.12, 0.22]
The rank-biserial correlation of 0.17 indicates a small effect size, suggesting that while there is a statistically significant difference between the two groups, the practical impact of this difference is small.
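To put the small effect size in context, the group medians and means can be compared directly; a sketch using dplyr (already loaded):
students_data %>%
  group_by(Tutoring_Group) %>%
  summarise(median_GPA = median(GPA), mean_GPA = mean(GPA), n = n())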
# To begin this part of the analysis, I examine the structure and summary statistics of the dataset, focusing on GPA and Parental Support, to ensure the data is ready for correlation analysis.
str(students_data)
## spc_tbl_ [2,392 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ StudentID : num [1:2392] 1001 1002 1003 1004 1005 ...
## $ Age : num [1:2392] 17 18 15 17 17 18 15 15 17 16 ...
## $ Gender : num [1:2392] 1 0 0 1 1 0 0 1 0 1 ...
## $ Ethnicity : num [1:2392] 0 0 2 0 0 0 1 1 0 0 ...
## $ ParentalEducation: num [1:2392] 2 1 3 3 2 1 1 4 0 1 ...
## $ StudyTimeWeekly : num [1:2392] 19.83 15.41 4.21 10.03 4.67 ...
## $ Absences : num [1:2392] 7 0 26 14 17 0 10 22 1 0 ...
## $ Tutoring : num [1:2392] 1 0 0 0 1 0 0 1 0 0 ...
## $ ParentalSupport : num [1:2392] 2 1 2 3 3 1 3 1 2 3 ...
## $ Extracurricular : num [1:2392] 0 0 0 1 0 1 0 1 0 1 ...
## $ Sports : num [1:2392] 0 0 0 0 0 0 1 0 1 0 ...
## $ Music : num [1:2392] 1 0 0 0 0 0 0 0 0 0 ...
## $ Volunteering : num [1:2392] 0 0 0 0 0 0 0 0 1 0 ...
## $ GPA : num [1:2392] 2.929 3.043 0.113 2.054 1.288 ...
## $ GradeClass : num [1:2392] 2 1 4 3 4 1 2 4 2 0 ...
## $ Tutoring_Group : chr [1:2392] "Tutoring" "No Tutoring" "No Tutoring" "No Tutoring" ...
## - attr(*, "spec")=
## .. cols(
## .. StudentID = col_double(),
## .. Age = col_double(),
## .. Gender = col_double(),
## .. Ethnicity = col_double(),
## .. ParentalEducation = col_double(),
## .. StudyTimeWeekly = col_double(),
## .. Absences = col_double(),
## .. Tutoring = col_double(),
## .. ParentalSupport = col_double(),
## .. Extracurricular = col_double(),
## .. Sports = col_double(),
## .. Music = col_double(),
## .. Volunteering = col_double(),
## .. GPA = col_double(),
## .. GradeClass = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(students_data$GPA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.175 1.893 1.906 2.622 4.000
summary(students_data$ParentalSupport)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 2.122 3.000 4.000
# Both variables are numeric, so I conduct the Pearson correlation test.
correlation_pearson <- cor.test(students_data$GPA, students_data$ParentalSupport, method = "pearson")
print(correlation_pearson)
##
## Pearson's product-moment correlation
##
## data: students_data$GPA and students_data$ParentalSupport
## t = 9.501, df = 2390, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1518567 0.2291002
## sample estimates:
## cor
## 0.1907737
Pearson Correlation Test:
H₀: There is no correlation between GPA and Parental Support. H₁: There is a correlation between GPA and Parental Support.
We reject the null hypothesis (H₀) at p-value < 0.001. The correlation coefficient is 0.19, indicating a weak positive correlation between GPA and Parental Support. This suggests that higher parental support is slightly associated with higher GPA values.
library(ggplot2)
# I proceed to create a jittered scatter plot combined with boxplots to visualize the relationship between Parental Support and GPA and to observe its variability.
ggplot(students_data, aes(x = as.factor(ParentalSupport), y = GPA)) +
geom_jitter(width = 0.25, alpha = 0.5, color = "gold") +
geom_boxplot(alpha = 0.3, fill = "gray", outlier.shape = NA) +
labs(
title = "Relationship Between Parental Support and GPA",
x = "Parental Support (0 = None, 4 = Very High)",
y = "GPA"
) +
theme_minimal()
#I then perform the Spearman correlation test to assess the relationship between GPA and Parental Support.
correlation_spearman <- cor.test(students_data$GPA, students_data$ParentalSupport, method = "spearman")
## Warning in cor.test.default(students_data$GPA, students_data$ParentalSupport, :
## Cannot compute exact p-value with ties
print(correlation_spearman)
##
## Spearman's rank correlation rho
##
## data: students_data$GPA and students_data$ParentalSupport
## S = 1879573489, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1760002
Spearman Correlation Test:
We reject the null hypothesis (H₀) at p-value < 0.001. The Spearman correlation coefficient is 0.176, indicating a weak positive correlation between GPA and Parental Support. This suggests that as parental support increases, GPA tends to increase slightly, but the relationship is not strong.
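Since Parental Support is an ordinal 0–4 scale, the weak positive trend can also be illustrated by the average GPA at each support level; a sketch using dplyr:
students_data %>%
  group_by(ParentalSupport) %>%
  summarise(mean_GPA = mean(GPA), median_GPA = median(GPA), n = n())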
# I start by creating a new variable to classify students as having high or low absenteeism.
students_data$HighAbsenteeism <- ifelse(students_data$Absences >= 15, "Yes", "No")
# I then recode GradeClass as a factor so the grade levels are treated as a categorical variable.
students_data$GradeClass <- factor(
students_data$GradeClass,
levels = c(0, 1, 2, 3, 4),
labels = c("A", "B", "C", "D", "F")
)
str(students_data$GradeClass)
## Factor w/ 5 levels "A","B","C","D",..: 3 2 5 4 5 2 3 5 3 1 ...
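A quick sanity check of the two new variables (a sketch): how the 15-absence cutoff splits the sample, and how students are distributed across the recoded grade classes.
table(students_data$HighAbsenteeism)   # students below vs at/above the 15-absence cutoff
table(students_data$GradeClass)        # counts per grade class after recoding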
# I proceed to create a contingency table summarizing the counts for each combination of grade class and absenteeism level.
contingency_table <- table(students_data$GradeClass, students_data$HighAbsenteeism)
print(contingency_table)
##
## No Yes
## A 91 16
## B 243 26
## C 367 24
## D 323 91
## F 161 1050
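Because the grade classes differ in size, row proportions make the pattern easier to read than raw counts; a sketch:
round(prop.table(contingency_table, margin = 1), 2)   # share of high-absenteeism students within each grade class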
# For further analysis, I focus on students with an 'F' grade and the frequency of high absenteeism.
observed_value <- contingency_table["F", "Yes"]
print(observed_value)
## [1] 1050
# I then run the chi-square test and extract the expected count for Grade F students with high absenteeism.
chi_test <- chisq.test(contingency_table)
expected_value <- chi_test$expected["F", "Yes"]
print(paste( round(expected_value, 2)))
## [1] "611.07"
# I then extract the standardized (Pearson) residual for Grade F and high absenteeism.
standardized_residuals <- (contingency_table - chi_test$expected) / sqrt(chi_test$expected)
std_residual <- standardized_residuals["F", "Yes"]
print(paste(round(std_residual, 2)))
## [1] "17.76"
# I will now calculate Cramér's V
#install.packages("rcompanion")
library(rcompanion)
##
## Attaching package: 'rcompanion'
## The following object is masked from 'package:effectsize':
##
## phi
## The following object is masked from 'package:psych':
##
## phi
cramers_v <- cramerV(contingency_table)
print(cramers_v)
## Cramer V
## 0.7403
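Cramér's V can also be computed by hand from the chi-square statistic as V = sqrt(χ² / (n · (k − 1))), where k is the smaller of the number of rows and columns; a sketch that should roughly reproduce the rcompanion value (rounding aside):
chi_sq <- unname(chi_test$statistic)                          # chi-square statistic from the earlier test
n_obs  <- sum(contingency_table)                              # total number of students in the table
k      <- min(nrow(contingency_table), ncol(contingency_table))
sqrt(chi_sq / (n_obs * (k - 1)))                              # manual Cramér's V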