Author: Jethro Muwanguzi
library(readr)  # read_csv() is provided by readr
file_path <- "/Users/jethro/Downloads/Student_performance.csv"
students_data <- read_csv(file_path)
## Rows: 2392 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): StudentID, Age, Gender, Ethnicity, ParentalEducation, StudyTimeWee...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Passing show_col_types = FALSE to read_csv() suppresses the column-specification message above.
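A minimal sketch of that quieter import call, re-using the file_path defined above:
students_data <- read_csv(file_path, show_col_types = FALSE)  # same import, without the column-specification message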
head(students_data)
## # A tibble: 6 × 15
## StudentID Age Gender Ethnicity ParentalEducation StudyTimeWeekly Absences
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 17 1 0 2 19.8 7
## 2 1002 18 0 0 1 15.4 0
## 3 1003 15 0 2 3 4.21 26
## 4 1004 17 1 0 3 10.0 14
## 5 1005 17 1 0 2 4.67 17
## 6 1006 18 0 0 1 8.19 0
## # ℹ 8 more variables: Tutoring <dbl>, ParentalSupport <dbl>,
## # Extracurricular <dbl>, Sports <dbl>, Music <dbl>, Volunteering <dbl>,
## # GPA <dbl>, GradeClass <dbl>
Unit of Observation: A single student
Sample Size: 2,392 students (StudentID ranges from 1001 to 3392).
Variables: 15 numeric columns, shown in the column specification and head() output above.
Source: The dataset was obtained from Kaggle.
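The stated sample size and ID range can be checked directly against the data; a quick sketch:
nrow(students_data)              # number of rows (students)
range(students_data$StudentID)   # smallest and largest StudentID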
library(psych)
describe(students_data)
## vars n mean sd median trimmed mad min max
## StudentID 1 2392 2196.50 690.66 2196.50 2196.50 886.59 1001 3392.00
## Age 2 2392 16.47 1.12 16.00 16.46 1.48 15 18.00
## Gender 3 2392 0.51 0.50 1.00 0.51 0.00 0 1.00
## Ethnicity 4 2392 0.88 1.03 0.00 0.73 0.00 0 3.00
## ParentalEducation 5 2392 1.75 1.00 2.00 1.75 1.48 0 4.00
## StudyTimeWeekly 6 2392 9.77 5.65 9.71 9.73 6.97 0 19.98
## Absences 7 2392 14.54 8.47 15.00 14.57 10.38 0 29.00
## Tutoring 8 2392 0.30 0.46 0.00 0.25 0.00 0 1.00
## ParentalSupport 9 2392 2.12 1.12 2.00 2.14 1.48 0 4.00
## Extracurricular 10 2392 0.38 0.49 0.00 0.35 0.00 0 1.00
## Sports 11 2392 0.30 0.46 0.00 0.25 0.00 0 1.00
## Music 12 2392 0.20 0.40 0.00 0.12 0.00 0 1.00
## Volunteering 13 2392 0.16 0.36 0.00 0.07 0.00 0 1.00
## GPA 14 2392 1.91 0.92 1.89 1.90 1.07 0 4.00
## GradeClass 15 2392 2.98 1.23 4.00 3.16 0.00 0 4.00
## range skew kurtosis se
## StudentID 2391.00 0.00 -1.20 14.12
## Age 3.00 0.04 -1.37 0.02
## Gender 1.00 -0.04 -2.00 0.01
## Ethnicity 3.00 0.76 -0.77 0.02
## ParentalEducation 4.00 0.22 -0.29 0.02
## StudyTimeWeekly 19.98 0.05 -1.14 0.12
## Absences 29.00 -0.03 -1.18 0.17
## Tutoring 1.00 0.86 -1.25 0.01
## ParentalSupport 4.00 -0.17 -0.73 0.02
## Extracurricular 1.00 0.48 -1.77 0.01
## Sports 1.00 0.85 -1.27 0.01
## Music 1.00 1.52 0.32 0.01
## Volunteering 1.00 1.88 1.54 0.01
## GPA 4.00 0.01 -0.87 0.02
## GradeClass 4.00 -0.90 -0.42 0.03
# I begin by visualizing the distribution of GPA for students who attended tutoring and those who did not, using a histogram, to look for differences in the GPA distributions between the two groups.
# I use a joint (dodged) visualization, as in Seminar 1, Task 4, so the groups are easier to compare.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
students_data$Tutoring_Group <- ifelse(students_data$Tutoring == 1, "Tutoring", "No Tutoring")
ggplot(students_data, aes(x = GPA, fill = Tutoring_Group)) +
geom_histogram(binwidth = 0.3, position = "dodge", color = "ghostwhite", alpha = 0.9) +
scale_fill_manual(values = c("No Tutoring" = "gold", "Tutoring" = "black")) +
labs(
title = " Distribution of GPA",
x = "GPA",
y = "Frequency",
fill = "Tutoring Status"
) +
theme_minimal()
From the histogram, students who did not receive tutoring outnumber those who did at every GPA level, which mainly reflects the larger size of the “No Tutoring” group; that group is also concentrated at the very low and mid-range of the GPA scale.
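Since the two groups differ in size, raw counts can be misleading; a quick check of group sizes and shares (a sketch, using the Tutoring_Group variable created above):
table(students_data$Tutoring_Group)               # number of students in each tutoring group
prop.table(table(students_data$Tutoring_Group))   # share of the sample in each group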
# To ensure the validity of subsequent tests, I first check the normality of the GPA data for each group using the Shapiro-Wilk test.
library(dplyr)
gpa_tutoring_yes <- students_data %>%
filter(Tutoring_Group == "Tutoring") %>%
pull(GPA)
gpa_tutoring_no <- students_data %>%
filter(Tutoring_Group == "No Tutoring") %>%
pull(GPA)
shapiro_tutoring <- shapiro.test(gpa_tutoring_yes)
shapiro_no_tutoring <- shapiro.test(gpa_tutoring_no)
print(shapiro_tutoring)
##
## Shapiro-Wilk normality test
##
## data: gpa_tutoring_yes
## W = 0.98201, p-value = 9.628e-08
print(shapiro_no_tutoring)
##
## Shapiro-Wilk normality test
##
## data: gpa_tutoring_no
## W = 0.98076, p-value = 3.466e-14
i) Tutoring group:
- H₀: GPA values for students who attended tutoring are normally distributed.
- H₁: GPA values for students who attended tutoring are not normally distributed.
We reject the null hypothesis (H₀) at p-value < 0.001.
ii) No Tutoring group:
- H₀: GPA values for students who did not attend tutoring are normally distributed.
- H₁: GPA values for students who did not attend tutoring are not normally distributed.
We reject the null hypothesis (H₀) at p-value < 0.001.
The Shapiro-Wilk test results indicate that the GPA data for both students who attended tutoring and those who did not are not normally distributed, as the null hypothesis of normality was rejected for both groups.
This violation of the normality assumption suggests that a non-parametric test would be more appropriate. However, for completeness of the homework, I will proceed with both parametric and non-parametric tests.
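As a visual complement to the Shapiro-Wilk tests, normal Q-Q plots could also be inspected for each group; a minimal base-R sketch:
qqnorm(gpa_tutoring_yes, main = "Q-Q plot: GPA, Tutoring group")
qqline(gpa_tutoring_yes)   # reference line; strong departures indicate non-normality
qqnorm(gpa_tutoring_no, main = "Q-Q plot: GPA, No Tutoring group")
qqline(gpa_tutoring_no)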
# For the parametric test, I use the Welch Two Sample t-test.
t_test <- t.test(gpa_tutoring_yes, gpa_tutoring_no, var.equal = FALSE)
print(t_test)
##
## Welch Two Sample t-test
##
## data: gpa_tutoring_yes and gpa_tutoring_no
## t = 7.1725, df = 1366.6, p-value = 1.203e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2102165 0.3684970
## sample estimates:
## mean of x mean of y
## 2.108325 1.818968
We reject the null hypothesis at p-value < 0.001, indicating a statistically significant difference in the mean GPA between the two groups. However, due to the violation of normality assumptions, the t-test results should be interpreted cautiously, and a non-parametric test is more appropriate for robust analysis.
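To complement the t-test, a standardized mean difference could also be reported; a minimal sketch using effectsize::cohens_d() (the effectsize package is loaded further below for the rank-biserial correlation):
effectsize::cohens_d(gpa_tutoring_yes, gpa_tutoring_no)   # standardized difference in mean GPA between the two groups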
# Given the evidence of non-normality, I now use the Wilcoxon rank sum test to determine whether GPA values differ significantly between students who attended tutoring and those who did not.
wilcox_test <- wilcox.test(gpa_tutoring_yes, gpa_tutoring_no)
print(wilcox_test)
##
## Wilcoxon rank sum test with continuity correction
##
## data: gpa_tutoring_yes and gpa_tutoring_no
## W = 704808, p-value = 3.919e-11
## alternative hypothesis: true location shift is not equal to 0
We reject the null hypothesis at p-value < 0.001, indicating a statistically significant difference in GPA distributions between the two groups. I proceed to carry out further analysis of effect size to assess the practical significance of this difference.
# I calculate the rank-biserial correlation to quantify the effect size of the difference in GPA distributions between students who attended tutoring and those who did not.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effect <- effectsize(
wilcox.test(gpa_tutoring_yes, gpa_tutoring_no, alternative = "two.sided")
)
print(effect)
## r (rank biserial) | 95% CI
## --------------------------------
## 0.17 | [0.12, 0.22]
The rank-biserial correlation of 0.17 indicates a small effect size, suggesting that while there is a statistically significant difference between the two groups, the practical impact of this difference is small.
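To put the small effect size in context, the group medians and means can be compared directly; a sketch using dplyr (already loaded):
students_data %>%
  group_by(Tutoring_Group) %>%
  summarise(median_GPA = median(GPA), mean_GPA = mean(GPA), n = n())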
# To begin this part of the analysis, I examine the structure and summary statistics of the dataset, focusing on GPA and Parental Support, to ensure the data is ready for correlation analysis.
str(students_data)
## spc_tbl_ [2,392 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ StudentID : num [1:2392] 1001 1002 1003 1004 1005 ...
## $ Age : num [1:2392] 17 18 15 17 17 18 15 15 17 16 ...
## $ Gender : num [1:2392] 1 0 0 1 1 0 0 1 0 1 ...
## $ Ethnicity : num [1:2392] 0 0 2 0 0 0 1 1 0 0 ...
## $ ParentalEducation: num [1:2392] 2 1 3 3 2 1 1 4 0 1 ...
## $ StudyTimeWeekly : num [1:2392] 19.83 15.41 4.21 10.03 4.67 ...
## $ Absences : num [1:2392] 7 0 26 14 17 0 10 22 1 0 ...
## $ Tutoring : num [1:2392] 1 0 0 0 1 0 0 1 0 0 ...
## $ ParentalSupport : num [1:2392] 2 1 2 3 3 1 3 1 2 3 ...
## $ Extracurricular : num [1:2392] 0 0 0 1 0 1 0 1 0 1 ...
## $ Sports : num [1:2392] 0 0 0 0 0 0 1 0 1 0 ...
## $ Music : num [1:2392] 1 0 0 0 0 0 0 0 0 0 ...
## $ Volunteering : num [1:2392] 0 0 0 0 0 0 0 0 1 0 ...
## $ GPA : num [1:2392] 2.929 3.043 0.113 2.054 1.288 ...
## $ GradeClass : num [1:2392] 2 1 4 3 4 1 2 4 2 0 ...
## $ Tutoring_Group : chr [1:2392] "Tutoring" "No Tutoring" "No Tutoring" "No Tutoring" ...
## - attr(*, "spec")=
## .. cols(
## .. StudentID = col_double(),
## .. Age = col_double(),
## .. Gender = col_double(),
## .. Ethnicity = col_double(),
## .. ParentalEducation = col_double(),
## .. StudyTimeWeekly = col_double(),
## .. Absences = col_double(),
## .. Tutoring = col_double(),
## .. ParentalSupport = col_double(),
## .. Extracurricular = col_double(),
## .. Sports = col_double(),
## .. Music = col_double(),
## .. Volunteering = col_double(),
## .. GPA = col_double(),
## .. GradeClass = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(students_data$GPA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.175 1.893 1.906 2.622 4.000
summary(students_data$ParentalSupport)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 2.122 3.000 4.000
# Both variables are numeric, so I conduct the Pearson correlation test.
correlation_pearson <- cor.test(students_data$GPA, students_data$ParentalSupport, method = "pearson")
print(correlation_pearson)
##
## Pearson's product-moment correlation
##
## data: students_data$GPA and students_data$ParentalSupport
## t = 9.501, df = 2390, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1518567 0.2291002
## sample estimates:
## cor
## 0.1907737
Pearson Correlation Test:
H₀: There is no correlation between GPA and Parental Support. H₁: There is a correlation between GPA and Parental Support.
We reject the null hypothesis (H₀) at p-value < 0.001. The correlation coefficient is 0.19, indicating a weak positive correlation between GPA and Parental Support. This suggests that higher parental support is slightly associated with higher GPA values.
library(ggplot2)
# I proceed to create a jittered scatter plot combined with boxplots to visualize the relationship between Parental Support and GPA and to observe its variability.
ggplot(students_data, aes(x = as.factor(ParentalSupport), y = GPA)) +
geom_jitter(width = 0.25, alpha = 0.5, color = "gold") +
geom_boxplot(alpha = 0.3, fill = "gray", outlier.shape = NA) +
labs(
title = "Relationship Between Parental Support and GPA",
x = "Parental Support (0 = None, 4 = Very High)",
y = "GPA"
) +
theme_minimal()
#I then perform the Spearman correlation test to assess the relationship between GPA and Parental Support.
correlation_spearman <- cor.test(students_data$GPA, students_data$ParentalSupport, method = "spearman")
## Warning in cor.test.default(students_data$GPA, students_data$ParentalSupport, :
## Cannot compute exact p-value with ties
print(correlation_spearman)
##
## Spearman's rank correlation rho
##
## data: students_data$GPA and students_data$ParentalSupport
## S = 1879573489, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1760002
Spearman Correlation Test:
We reject the null hypothesis (H₀) at p-value < 0.001. The Spearman correlation coefficient is 0.176, indicating a weak positive correlation between GPA and Parental Support. This suggests that as parental support increases, GPA tends to increase slightly, but the relationship is not strong.
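Since Parental Support is an ordinal 0–4 scale, the weak positive trend can also be illustrated by the average GPA at each support level; a sketch using dplyr:
students_data %>%
  group_by(ParentalSupport) %>%
  summarise(mean_GPA = mean(GPA), median_GPA = median(GPA), n = n())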
# I start by creating a new variable to classify students as having high or low absenteeism.
students_data$HighAbsenteeism <- ifelse(students_data$Absences >= 15, "Yes", "No")
# I then recode GradeClass as a factor so the grade levels are treated as a categorical variable.
students_data$GradeClass <- factor(
students_data$GradeClass,
levels = c(0, 1, 2, 3, 4),
labels = c("A", "B", "C", "D", "F")
)
str(students_data$GradeClass)
## Factor w/ 5 levels "A","B","C","D",..: 3 2 5 4 5 2 3 5 3 1 ...
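A quick sanity check of the two new variables (a sketch): how the 15-absence cutoff splits the sample, and how students are distributed across the recoded grade classes.
table(students_data$HighAbsenteeism)   # students below vs at/above the 15-absence cutoff
table(students_data$GradeClass)        # counts per grade class after recoding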
# I proceed to create a contingency table summarizing the counts for each combination of grade class and absenteeism level.
contingency_table <- table(students_data$GradeClass, students_data$HighAbsenteeism)
print(contingency_table)
##
## No Yes
## A 91 16
## B 243 26
## C 367 24
## D 323 91
## F 161 1050
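Because the grade classes differ in size, row proportions make the pattern easier to read than raw counts; a sketch:
round(prop.table(contingency_table, margin = 1), 2)   # share of high-absenteeism students within each grade class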
# For further analysis, I focus on students with an 'F' grade and the frequency of high absenteeism.
observed_value <- contingency_table["F", "Yes"]
print(observed_value)
## [1] 1050
# I then run the chi-square test and extract the expected count for Grade F students with high absenteeism.
chi_test <- chisq.test(contingency_table)
expected_value <- chi_test$expected["F", "Yes"]
print(paste( round(expected_value, 2)))
## [1] "611.07"
# I then extract the standardized (Pearson) residual for Grade F and high absenteeism.
standardized_residuals <- (contingency_table - chi_test$expected) / sqrt(chi_test$expected)
std_residual <- standardized_residuals["F", "Yes"]
print(paste(round(std_residual, 2)))
## [1] "17.76"
# I will now calculate Cramér's V
#install.packages("rcompanion")
library(rcompanion)
##
## Attaching package: 'rcompanion'
## The following object is masked from 'package:effectsize':
##
## phi
## The following object is masked from 'package:psych':
##
## phi
cramers_v <- cramerV(contingency_table)
print(cramers_v)
## Cramer V
## 0.7403
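Cramér's V can also be computed by hand from the chi-square statistic as V = sqrt(χ² / (n · (k − 1))), where k is the smaller of the number of rows and columns; a sketch that should roughly reproduce the rcompanion value (rounding aside):
chi_sq <- unname(chi_test$statistic)                          # chi-square statistic from the earlier test
n_obs  <- sum(contingency_table)                              # total number of students in the table
k      <- min(nrow(contingency_table), ncol(contingency_table))
sqrt(chi_sq / (n_obs * (k - 1)))                              # manual Cramér's V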