Source: Student Performance Factors (Kaggle)
#Importing the data
mydata <- read.table("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/MVA/MVA HW1/StudentPerformanceFactors.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## Hours_Studied Attendance Parental_Involvement Access_to_Resources
## 1 23 84 Low High
## 2 19 64 Low Medium
## 3 24 98 Medium Medium
## 4 29 89 Low Medium
## 5 19 92 Medium Medium
## 6 19 88 Medium Medium
## Extracurricular_Activities Sleep_Hours Previous_Scores
## 1 No 7 73
## 2 No 8 59
## 3 Yes 7 91
## 4 Yes 8 98
## 5 Yes 6 65
## 6 Yes 8 89
## Motivation_Level Internet_Access Tutoring_Sessions Family_Income
## 1 Low Yes 0 Low
## 2 Low Yes 2 Medium
## 3 Medium Yes 2 Medium
## 4 Medium Yes 1 Medium
## 5 Medium Yes 3 Medium
## 6 Medium Yes 3 Medium
## Teacher_Quality School_Type Peer_Influence Physical_Activity
## 1 Medium Public Positive 3
## 2 Medium Public Negative 4
## 3 Medium Public Neutral 4
## 4 Medium Public Negative 4
## 5 High Public Neutral 4
## 6 Medium Public Positive 3
## Learning_Disabilities Parental_Education_Level Distance_from_Home
## 1 No High School Near
## 2 No College Moderate
## 3 No Postgraduate Near
## 4 No High School Moderate
## 5 No College Near
## 6 No Postgraduate Near
## Gender Exam_Score
## 1 Male 67
## 2 Female 61
## 3 Male 74
## 4 Male 71
## 5 Female 70
## 6 Male 71
The unit of observation is one individual student.
Description of all variables before the data manipulation:
summary(mydata)
## Hours_Studied Attendance Parental_Involvement
## Min. : 1.00 Min. : 60.00 Length:6607
## 1st Qu.:16.00 1st Qu.: 70.00 Class :character
## Median :20.00 Median : 80.00 Mode :character
## Mean :19.98 Mean : 79.98
## 3rd Qu.:24.00 3rd Qu.: 90.00
## Max. :44.00 Max. :100.00
## Access_to_Resources Extracurricular_Activities Sleep_Hours
## Length:6607 Length:6607 Min. : 4.000
## Class :character Class :character 1st Qu.: 6.000
## Mode :character Mode :character Median : 7.000
## Mean : 7.029
## 3rd Qu.: 8.000
## Max. :10.000
## Previous_Scores Motivation_Level Internet_Access
## Min. : 50.00 Length:6607 Length:6607
## 1st Qu.: 63.00 Class :character Class :character
## Median : 75.00 Mode :character Mode :character
## Mean : 75.07
## 3rd Qu.: 88.00
## Max. :100.00
## Tutoring_Sessions Family_Income Teacher_Quality
## Min. :0.000 Length:6607 Length:6607
## 1st Qu.:1.000 Class :character Class :character
## Median :1.000 Mode :character Mode :character
## Mean :1.494
## 3rd Qu.:2.000
## Max. :8.000
## School_Type Peer_Influence Physical_Activity
## Length:6607 Length:6607 Min. :0.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :3.000
## Mean :2.968
## 3rd Qu.:4.000
## Max. :6.000
## Learning_Disabilities Parental_Education_Level Distance_from_Home
## Length:6607 Length:6607 Length:6607
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Gender Exam_Score
## Length:6607 Min. : 55.00
## Class :character 1st Qu.: 65.00
## Mode :character Median : 67.00
## Mean : 67.24
## 3rd Qu.: 69.00
## Max. :101.00
Here we can clearly see that the data needs to be adjusted. Since we have a lot of variables, I decided to delete some of the variables (columns) that I believe are not important for my research. My further steps continue as follows.
# Keep only the necessary variables for the analysis
mydata2 <- mydata[, colnames(mydata) %in% c("Gender", "Exam_Score", "Hours_Studied", "Extracurricular_Activities", "Parental_Involvement", "Motivation_Level")]
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Adding additional column ID
mydata2$ID <- seq(1, nrow(mydata2))
mydata2 <- mydata2 %>%
select(ID, everything())
#Factoring variables
mydata2$Gender <- factor(mydata2$Gender,
levels = c("Male", "Female"),
labels = c("Male", "Female"))
mydata2$Extracurricular_Activities <- factor(mydata2$Extracurricular_Activities,
levels = c("Yes", "No"),
labels = c("Yes", "No"))
mydata2$Parental_Involvement <- factor(mydata2$Parental_Involvement,
levels = c("High", "Medium", "Low"),
labels = c("High", "Medium", "Low"))
mydata2$Motivation_Level <- factor(mydata2$Motivation_Level,
levels = c("High", "Medium", "Low"),
labels = c("High", "Medium", "Low"))
summary(mydata2)
## ID Hours_Studied Parental_Involvement
## Min. : 1 Min. : 1.00 High :1908
## 1st Qu.:1652 1st Qu.:16.00 Medium:3362
## Median :3304 Median :20.00 Low :1337
## Mean :3304 Mean :19.98
## 3rd Qu.:4956 3rd Qu.:24.00
## Max. :6607 Max. :44.00
## Extracurricular_Activities Motivation_Level Gender
## Yes:3938 High :1319 Male :3814
## No :2669 Medium:3351 Female:2793
## Low :1937
##
##
##
## Exam_Score
## Min. : 55.00
## 1st Qu.: 65.00
## Median : 67.00
## Mean : 67.24
## 3rd Qu.: 69.00
## Max. :101.00
We can see that the summary already looks much better now that the categorical variables have been converted to factors.
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
mydata_stat.desc <- mydata2[ , c(-1, -3, -4, -5,-6)]
round(stat.desc(mydata_stat.desc), 2)
## Hours_Studied Exam_Score
## nbr.val 6607.00 6607.00
## nbr.null 0.00 0.00
## nbr.na 0.00 0.00
## min 1.00 55.00
## max 44.00 101.00
## range 43.00 46.00
## sum 131977.00 444226.00
## median 20.00 67.00
## mean 19.98 67.24
## SE.mean 0.07 0.05
## CI.mean.0.95 0.14 0.09
## var 35.89 15.14
## std.dev 5.99 3.89
## coef.var 0.30 0.06
The maximum number of hours studied for the exam was 44 and the minimum was 1 hour. In total, all students together studied for 131,977 hours. The mean of studying hours was 19.98 and the median was 20 hours, meaning that 50% of students studied up to and including 20 hours for the exam and 50% of students studied more than 20 hours.
The highest exam score was 101 points and the lowest was 55 points, so the range was 46 points. The average score was 67.24 points and the median was 67 points, meaning that 50% of students achieved up to and including 67 points on the exam and 50% of students earned more than 67 points.
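Both means can be double-checked by hand from the sums and the number of observations reported by stat.desc():
# Mean = sum / number of observations
131977 / 6607 # mean hours studied, about 19.98
444226 / 6607 # mean exam score, about 67.24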
Since the output shows no missing values (nbr.na = 0 for both variables), we do not have to remove any non-available observations.
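A quick way to confirm this directly (and, hypothetically, to drop incomplete rows if any were present) is sketched below:
# Count missing values in each column; all counts should be zero here
colSums(is.na(mydata2))
# If any NAs appeared, complete cases could be kept with:
# mydata2 <- na.omit(mydata2)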
Let’s take a look at what our data looks like with the function head() before we continue to our first research question:
head(mydata2)
## ID Hours_Studied Parental_Involvement Extracurricular_Activities
## 1 1 23 Low No
## 2 2 19 Low No
## 3 3 24 Medium Yes
## 4 4 29 Low Yes
## 5 5 19 Medium Yes
## 6 6 19 Medium Yes
## Motivation_Level Gender Exam_Score
## 1 Low Male 67
## 2 Low Female 61
## 3 Medium Male 74
## 4 Medium Male 71
## 5 Medium Female 70
## 6 Medium Male 71
# Summary of the data we are actually going to use for this analysis
library(psych)
summary(mydata2)
## ID Hours_Studied Parental_Involvement
## Min. : 1 Min. : 1.00 High :1908
## 1st Qu.:1652 1st Qu.:16.00 Medium:3362
## Median :3304 Median :20.00 Low :1337
## Mean :3304 Mean :19.98
## 3rd Qu.:4956 3rd Qu.:24.00
## Max. :6607 Max. :44.00
## Extracurricular_Activities Motivation_Level Gender
## Yes:3938 High :1319 Male :3814
## No :2669 Medium:3351 Female:2793
## Low :1937
##
##
##
## Exam_Score
## Min. : 55.00
## 1st Qu.: 65.00
## Median : 67.00
## Mean : 67.24
## 3rd Qu.: 69.00
## Max. :101.00
Here we are checking whether participation in extracurricular activities affects the final exam score.
We are going to conduct an analysis of two independent samples.
#Descriptive statistics by group - Male and Female
library(psych)
describeBy(mydata2$Exam_Score, mydata2$Extracurricular_Activities)
##
## Descriptive statistics by group
## group: Yes
## vars n mean sd median trimmed mad min max range skew
## X1 1 3938 67.44 3.94 67 67.3 2.97 57 101 44 1.8
## kurtosis se
## X1 11.58 0.06
## ----------------------------------------------------
## group: No
## vars n mean sd median trimmed mad min max range skew
## X1 1 2669 66.93 3.79 67 66.84 2.97 55 97 42 1.39
## kurtosis se
## X1 8.78 0.07
For the analysis, the following assumptions need to be met:
The variable is numeric: this assumption is fulfilled, since Exam_Score is a numeric variable.
The distribution of the variable is normal in both populations: I will test this with the Shapiro-Wilk normality test.
The data come from two independent populations: this is true.
The variable has the same variance in both populations: since this assumption is often violated, we apply the Welch correction (a sketch for checking it with a Levene test follows below).
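As a sketch (not run here), the equal-variance assumption could be checked with Levene's test from the car package; whatever its result, the Welch correction is applied anyway.
# Levene's test for homogeneity of variance across the two groups (sketch)
library(car)
leveneTest(Exam_Score ~ Extracurricular_Activities, data = mydata2)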
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ECA_yes <- ggplot(mydata2[mydata2$Extracurricular_Activities == "Yes", ], aes(x = Exam_Score)) +
theme_linedraw() +
geom_histogram(binwidth = 1, col = "black", fill = "orchid1") +
ylab("Frequency") +
ggtitle("Exam score distribution of students with extracurricular activities")
ECA_no <- ggplot(mydata2[mydata2$Extracurricular_Activities == "No", ], aes(x = Exam_Score)) +
theme_linedraw() +
geom_histogram(binwidth = 1, col = "black", fill = "steelblue1") +
ylab("Frequency") +
ggtitle("Exam score distribution of students without extracurricular activities")
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
ggarrange(ECA_yes, ECA_no,
ncol = 2, nrow = 1)
library(ggpubr)
ggqqplot(mydata2,
"Exam_Score",
facet.by = "Extracurricular_Activities")
Based on the histograms and the quantile-quantile plots above, the distribution could be roughly normal, possibly with some outliers. But let’s check it formally.
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata2 %>%
group_by(Extracurricular_Activities) %>%
shapiro_test(Exam_Score)
## # A tibble: 2 × 4
## Extracurricular_Activities variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Yes Exam_Score 0.892 3.40e-46
## 2 No Exam_Score 0.918 7.87e-36
Shapiro-Wilk test hypotheses:
H0: the variable Exam_Score is normally distributed
H1: the variable Exam_Score is not normally distributed
Based on the Shapiro-Wilk test we can reject the null hypothesis H0 that the variable Exam_Score is normally distributed, since p<0.01 for both groups of students.
Since normality is not confirmed, strictly speaking we should continue only with non-parametric tests, but later on we will also conduct the parametric tests for educational purposes.
With this we did not manage to confirm the second assumption for testing the arithmetic means of two independent samples.
I decided not to remove any of the units, since I do not see them as outliers but rather as high-performing, ambitious students who had the same opportunities as all the other students and simply performed well.
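For transparency, the candidate outliers could be listed with the IQR rule, for example via rstatix::identify_outliers (a sketch; I did not remove the flagged rows):
# Flag observations outside 1.5 * IQR (extreme: 3 * IQR) within each group
library(dplyr)
library(rstatix)
mydata2 %>%
group_by(Extracurricular_Activities) %>%
identify_outliers(Exam_Score)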
Since we have two independent samples from two populations (students who do and students who do not participate in extracurricular activities), we start with the parametric option: the t-test with the Welch correction (Welch two-sample t-test). This is the test usually used when normality is confirmed, which is not our case.
# Independent samples t-test
t.test(mydata2$Exam_Score ~ mydata2$Extracurricular_Activities,
var.equal = FALSE,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mydata2$Exam_Score by mydata2$Extracurricular_Activities
## t = 5.2826, df = 5873.3, p-value = 1.319e-07
## alternative hypothesis: true difference in means between group Yes and group No is not equal to 0
## 95 percent confidence interval:
## 0.3210011 0.6998262
## sample estimates:
## mean in group Yes mean in group No
## 67.44185 66.93143
The hypotheses:
ECA_yes = students who participate in extracurricular activities
ECA_no = students who do not participate in extracurricular activities
H0: mean(ECA_yes) = mean(ECA_no), i.e. mean(ECA_yes) - mean(ECA_no) = 0
H1: mean(ECA_yes) ≠ mean(ECA_no), i.e. mean(ECA_yes) - mean(ECA_no) ≠ 0
With the Welch two-sample t-test we can reject H0 at p<0.01, i.e. we reject the claim that the difference in mean exam score between students who do and students who do not participate in extracurricular activities is equal to 0.
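As a sanity check, the Welch t statistic can also be reproduced by hand from the group means, variances, and sizes:
# Welch t statistic: difference in means over the unpooled standard error
m <- tapply(mydata2$Exam_Score, mydata2$Extracurricular_Activities, mean)
v <- tapply(mydata2$Exam_Score, mydata2$Extracurricular_Activities, var)
n <- table(mydata2$Extracurricular_Activities)
(m["Yes"] - m["No"]) / sqrt(v["Yes"]/n["Yes"] + v["No"]/n["No"]) # about 5.28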
I also measured the effect size with the Cohen’s d statistic:
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydata2$Exam_Score ~ mydata2$Extracurricular_Activities,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## ------------------------
## 0.13 | [0.08, 0.18]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.13, rules = "sawilowsky2009")
## [1] "very small"
## (Rules: sawilowsky2009)
We observe only a very small difference in the average exam score between students who do and students who do not participate in extracurricular activities.
Moving on to the non-parametric alternative, which is used when normality is rejected with the Shapiro-Wilk test, as happened in our case:
wilcox.test(mydata2$Exam_Score ~ mydata2$Extracurricular_Activities,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata2$Exam_Score by mydata2$Extracurricular_Activities
## W = 5649817, p-value = 1.924e-07
## alternative hypothesis: true location shift is not equal to 0
Based on the p-value p<0.05, we can reject H0 that the distribution location of exam score is the same for students who do and students who do not participate in extracurricular activities.
library(ggplot2)
ggplot(mydata2, aes(x = Exam_Score, fill = Extracurricular_Activities)) +
geom_histogram(position = position_dodge(width = 2.5), binwidth = 5, colour = "Black") +
scale_x_continuous(breaks = seq(0, 75, 5)) +
scale_fill_manual(values = c("No" = "steelblue1", "Yes" = "orchid1")) +
ylab("Frequency") +
labs(fill = "Extracurricular Activities")
The function scale_fill_manual() allowed me to manually specify the fill colours so that this graph matches the previous ones in colour for both groups of students.
library(effectsize)
effectsize(wilcox.test(mydata2$Exam_Score ~ mydata2$Extracurricular_Activities,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.08 | [0.05, 0.10]
Based on the Funder & Ozer (2019) scale, the difference in distribution location is very small.
#RQ1 Conclusion and answer
In conclusion, I believe the more appropriate test is the non-parametric Wilcoxon rank-sum test, since normality was not confirmed. Based on the p-value p<0.05, we can reject H0 that the distribution location of exam score is the same for students who do and students who do not participate in extracurricular activities.
The answer to my research question is therefore that participation in extracurricular activities has an effect on the final exam score, yet the effect is shown to be very small (r=0.08). Based on the calculated average exam scores, students with extracurricular activities do perform slightly better than students without extracurricular activities.
library(psych)
psych::describe(mydata2[ , c("Hours_Studied", "Exam_Score")])
## vars n mean sd median trimmed mad min max range
## Hours_Studied 1 6607 19.98 5.99 20 19.97 5.93 1 44 43
## Exam_Score 2 6607 67.24 3.89 67 67.11 2.97 55 101 46
## skew kurtosis se
## Hours_Studied 0.01 0.02 0.07
## Exam_Score 1.64 10.56 0.05
library(ggplot2)
ggplot(mydata2, aes(x = Hours_Studied, y = Exam_Score)) +
geom_point()
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata2[ , c(2, 7)], smooth = FALSE)
We can see that these graphical representations suggest a possible positive correlation. Let’s quantify the linear relationship with a descriptive measure.
cor(mydata2$Hours_Studied, mydata2$Exam_Score,
method = "pearson",
use = "complete.obs")
## [1] 0.445455
Based on the result we can say that the linear relationship between hours spent studying for the exam and the exam score is positive and moderately strong (r=0.445).
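Since Exam_Score is not normally distributed, a Spearman rank correlation could serve as a robustness check alongside Pearson’s r (a sketch only; its output is not shown here):
# Spearman rank correlation as a non-parametric robustness check (sketch)
cor.test(mydata2$Hours_Studied, mydata2$Exam_Score,
method = "spearman",
exact = FALSE)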
We then formally test whether the population correlation coefficient differs from zero:
cor.test(mydata2$Hours_Studied, mydata2$Exam_Score,
method = "pearson",
use = "complete.obs")
##
## Pearson's product-moment correlation
##
## data: mydata2$Hours_Studied and mydata2$Exam_Score
## t = 40.436, df = 6605, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4259164 0.4645782
## sample estimates:
## cor
## 0.445455
H0: ρ = 0
H1: ρ ≠ 0
Based on the p-value p<0.01 we can reject the null hypothesis H0 that the population correlation coefficient is equal to 0.
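The reported t statistic can also be reproduced directly from r and the sample size, using t = r·sqrt(n − 2)/sqrt(1 − r²):
# Reconstructing the t statistic from the correlation coefficient
r <- 0.445455
n <- 6607
r * sqrt(n - 2) / sqrt(1 - r^2) # about 40.44, matching cor.test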
#RQ2 Conclusion and answer
Based on the statistical tests above we can conclude that there is a statistically significant correlation between hours spent studying and exam results.
In conclusion, the answer to my research question is that there is a positive, moderately strong linear relationship between the number of hours studied and the final exam score.
# Pearson Chi2 test
results <- chisq.test(mydata2$Parental_Involvement, mydata2$Motivation_Level,
correct = FALSE)
results
##
## Pearson's Chi-squared test
##
## data: mydata2$Parental_Involvement and mydata2$Motivation_Level
## X-squared = 4.3704, df = 4, p-value = 0.3582
H0: There is no association between parental involvement and student’s motivation level
H1: There is association between parental involvement and student’s motivation level
Based on the p-value p>0.05, we cannot reject H0 that there is no association between parental involvement and the student’s motivation level.
addmargins(results$observed)
## mydata2$Motivation_Level
## mydata2$Parental_Involvement High Medium Low Sum
## High 359 975 574 1908
## Medium 682 1685 995 3362
## Low 278 691 368 1337
## Sum 1319 3351 1937 6607
Here R creates the table of observed (empirical) frequencies from our data set.
For example, the number 359 means that there were 359 students with high parental involvement and high motivation, and 278 students with low parental involvement and high motivation. The right-hand column shows the totals for each category of parental involvement, the bottom row shows the totals for each motivation level, and the bottom-right corner is the overall total, which matches the number of units in the sample (n=6607).
We continue with calculating a table of expected frequencies.
#Calculating expected frequencies
round(results$expected, 2)
## mydata2$Motivation_Level
## mydata2$Parental_Involvement High Medium Low
## High 380.91 967.72 559.38
## Medium 671.18 1705.17 985.65
## Low 266.91 678.11 391.97
In this step we calculated the expected frequencies, i.e. how many students we would expect in each cell if there were no association between parental involvement and motivation level.
For example, if there were no association between parental involvement and motivation level, we would expect about 391.97 students with low parental involvement and low motivation.
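Each expected frequency equals the corresponding row total times the column total, divided by the grand total. A quick check for the low involvement / low motivation cell:
# Expected frequency = row total * column total / grand total
(1337 * 1937) / 6607 # about 391.97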
With these calculations we also checked one of the requirements for this analysis. The test of association between two categorical variables has two main assumptions, both of which we verified in the previous steps:
The observations are independent of each other: this is true.
All expected frequencies are greater than 5: the table above shows this is true.
Since all assumptions hold, we can continue with interpreting the chi-squared test and examining the residuals.
# Calculation of standardized residuals
round(results$res, 2)
## mydata2$Motivation_Level
## mydata2$Parental_Involvement High Medium Low
## High -1.12 0.23 0.62
## Medium 0.42 -0.49 0.30
## Low 0.68 0.49 -1.21
Here we can see that none of the standardized residuals is significant (all are below 1.96 in absolute value), meaning no significant cell-level differences were found. We could say we did not find any statistical effect of parental involvement on the student’s motivation level.
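The residuals reported by chisq.test() are Pearson residuals, (observed − expected) / sqrt(expected). For the low involvement / low motivation cell, for example:
# Standardized (Pearson) residual for the low/low cell
(368 - 391.97) / sqrt(391.97) # about -1.21, matching the table above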
addmargins(round(prop.table(results$observed), 3))
## mydata2$Motivation_Level
## mydata2$Parental_Involvement High Medium Low Sum
## High 0.054 0.148 0.087 0.289
## Medium 0.103 0.255 0.151 0.509
## Low 0.042 0.105 0.056 0.203
## Sum 0.199 0.508 0.294 1.001
Here is a table of joint proportions, where all cells together sum to 1 (1.001 in our case, due to rounding). For example, 10.3% of students have medium parental involvement and a high motivation level, and 5.6% of students have low motivation and low parental involvement. The margins show that 19.9% of students have a high motivation level and 50.9% of students have a medium level of parental involvement.
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydata2$Motivation_Level
## mydata2$Parental_Involvement High Medium Low Sum
## High 0.188 0.511 0.301 1.000
## Medium 0.203 0.501 0.296 1.000
## Low 0.208 0.517 0.275 1.000
In this table we focus on parental involvement: each row (category of parental involvement) sums to one. The results can be interpreted as follows:
30.1% of students with high parental involvement have a low motivation level, and 20.8% of students with low parental involvement have a high motivation level.
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydata2$Motivation_Level
## mydata2$Parental_Involvement High Medium Low
## High 0.272 0.291 0.296
## Medium 0.517 0.503 0.514
## Low 0.211 0.206 0.190
## Sum 1.000 1.000 1.000
In this table the proportions are computed within each motivation level, which we can see from the bottom row of ones.
The data can be interpreted as follows: of all students with high motivation, 27.2% have high parental involvement, 51.7% medium parental involvement, and 21.1% low parental involvement (and these sum to 100% of students with high motivation).
library(effectsize)
effectsize::cramers_v(mydata2$Parental_Involvement, mydata2$Motivation_Level)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 5.29e-03 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(5.29e-03)
## [1] "tiny"
## (Rules: funder2019)
Based on Cramér’s V we can say the association between parental involvement and students’ motivation level is tiny.
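For reference, the unadjusted Cramér’s V can be computed directly from the chi-squared statistic as sqrt(X² / (n · (k − 1))), with k the smaller number of categories; effectsize reports a bias-corrected (adjusted) version, which is why its value (0.0053) is even smaller. A rough sketch of the unadjusted value:
# Unadjusted Cramer's V from the chi-squared statistic (sketch)
chi2 <- 4.3704
n <- 6607
k <- 3 # both variables have 3 categories
sqrt(chi2 / (n * (k - 1))) # about 0.018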
#RQ3 Conclusion and answer
Based on the chi-squared test and the standardized residuals, we can conclude that no association was found between parental involvement and the motivation level of students, since we cannot reject the null hypothesis of no association (p>0.05). The Cramér’s V statistic likewise indicates only a tiny effect (V=0.0053).
Therefore, the official answer to the third research question is that there is no statistically significant association between parental involvement and the student’s motivation level.