Homework_2_Tobias_Schöfbeck
Exercise 2
- Importing the data into R Studio:
data <- read.csv("StudentPerformanceFactors.csv", header = TRUE)
Exercise 3
- Displaying it using head()
head(data)
Hours_Studied Attendance Parental_Involvement Access_to_Resources
1 23 84 Low High
2 19 64 Low Medium
3 24 98 Medium Medium
4 29 89 Low Medium
5 19 92 Medium Medium
6 19 88 Medium Medium
Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level
1 No 7 73 Low
2 No 8 59 Low
3 Yes 7 91 Medium
4 Yes 8 98 Medium
5 Yes 6 65 Medium
6 Yes 8 89 Medium
Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type
1 Yes 0 Low Medium Public
2 Yes 2 Medium Medium Public
3 Yes 2 Medium Medium Public
4 Yes 1 Medium Medium Public
5 Yes 3 Medium High Public
6 Yes 3 Medium Medium Public
Peer_Influence Physical_Activity Learning_Disabilities
1 Positive 3 No
2 Negative 4 No
3 Neutral 4 No
4 Negative 4 No
5 Neutral 4 No
6 Positive 3 No
Parental_Education_Level Distance_from_Home Gender Exam_Score
1 High School Near Male 67
2 College Moderate Female 61
3 Postgraduate Near Male 74
4 High School Moderate Male 71
5 College Near Female 70
6 Postgraduate Near Male 71
Exercise 4
- Data Manipulation and explanation of variables
data <- data[1:200, c(-3,-4,-5,-6,-8,-9,-10,-12,-13,-14,-15,-16,-17,-18)]
head(data)
Hours_Studied Attendance Previous_Scores Family_Income Gender Exam_Score
1 23 84 73 Low Male 67
2 19 64 59 Medium Female 61
3 24 98 91 Medium Male 74
4 29 89 98 Medium Male 71
5 19 92 65 Medium Female 70
6 19 88 89 Medium Male 71
The original data set has 6607 observations and 20 variables. After reducing it, the data has 200 observations and the following variables:
Hours_Studied is the number of hours spent per week studying.
Attendance is the percentage of classes attended.
Previous_Scores is the score for the previous exam.
Family_Income is the family income level in three categories (Low, Medium, High).
Gender is the gender of the student (Male, Female).
Exam_Score is the final exam score.
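The selection in Exercise 4 drops columns by position. As an alternative sketch, the same six variables can be kept by name, which may be easier to read and check. The data frame `toy` and the vector `keep` are illustrative names that do not appear in the homework. The values are the first three rows of the head() output in Exercise 3, plus Sleep_Hours as an example of a column that gets dropped.

```r
# Small stand-in for the full data set (first three rows from the head() output)
toy <- data.frame(
  Hours_Studied   = c(23, 19, 24),
  Attendance      = c(84, 64, 98),
  Sleep_Hours     = c(7, 8, 7),
  Previous_Scores = c(73, 59, 91),
  Family_Income   = c("Low", "Medium", "Medium"),
  Gender          = c("Male", "Female", "Male"),
  Exam_Score      = c(67, 61, 74)
)

# Names of the variables to keep (illustrative helper vector)
keep <- c("Hours_Studied", "Attendance", "Previous_Scores",
          "Family_Income", "Gender", "Exam_Score")

# Keep the first two rows and only the named columns
toy_small <- toy[1:2, keep]
print(toy_small)
```

With the real data, the row index would be 1:200 instead of 1:2.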
Exercise 5
I found the dataset on Kaggle under the link: “https://www.kaggle.com/datasets/lainguyn123/student-performance-factors?resource=download”
Exercise 6
- Data Manipulation
Creating factors
data$Gender <- factor(data$Gender)
data$Family_Income <- factor(data$Family_Income)
summary(data)
Hours_Studied Attendance Previous_Scores Family_Income Gender
Min. : 4.00 Min. : 60.0 Min. : 50.00 High :36 Female: 73
1st Qu.:16.00 1st Qu.: 70.0 1st Qu.: 65.00 Low :76 Male :127
Median :20.00 Median : 79.0 Median : 77.50 Medium:88
Mean :19.79 Mean : 80.2 Mean : 76.44
3rd Qu.:23.00 3rd Qu.: 91.0 3rd Qu.: 89.00
Max. :36.00 Max. :100.0 Max. :100.00
Exam_Score
Min. : 60.00
1st Qu.: 65.00
Median : 67.00
Mean : 67.33
3rd Qu.: 69.00
Max. :100.00
I converted Gender and Family_Income into factors. The summary shows that there are 73 female and 127 male students. By family income, 36 of the 200 students fall into the category High, 76 into Low and 88 into Medium.
- Altering the data
data[2,3] <- 60
After that, I changed the Previous_Scores value of one student (row 2) from 59 to 60. I changed it by only one point, as I didn't want to change the data itself too much.
- Creating a subset for the females
dataF <- data[data$Gender == "Female", ]
head(dataF)
Hours_Studied Attendance Previous_Scores Family_Income Gender Exam_Score
2 19 64 60 Medium Female 61
5 19 92 65 Medium Female 70
16 17 68 70 Medium Female 64
18 22 70 82 Low Female 65
19 15 80 91 Low Female 67
21 29 78 99 High Female 69
First, I created a subset with only female students. This subset has 73 observations.
- Creating a subset with high attendance
hattendance <- data[data$Attendance >= 80,]
head(hattendance)
Hours_Studied Attendance Previous_Scores Family_Income Gender Exam_Score
1 23 84 73 Low Male 67
3 24 98 91 Medium Male 74
4 29 89 98 Medium Male 71
5 19 92 65 Medium Female 70
6 19 88 89 Medium Male 71
7 29 84 68 Low Male 67
Then I created a subset of students with an attendance of at least 80%, which has 99 observations.
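As a small sketch of how the two subsets and their observation counts come about, the example below applies the same logical indexing to the first six rows of the head(data) output. The names `toy`, `high_attendance`, `females`, `n_high` and `n_female` are illustrative and not part of the homework.

```r
# First six rows of Attendance and Gender, copied from the head(data) output
toy <- data.frame(
  Attendance = c(84, 64, 98, 89, 92, 88),
  Gender     = c("Male", "Female", "Male", "Male", "Female", "Male")
)

# Logical indexing keeps the rows where the condition is TRUE;
# >= also keeps students with exactly 80% attendance
high_attendance <- toy[toy$Attendance >= 80, ]
females <- toy[toy$Gender == "Female", ]

# nrow() reports the number of observations in each subset
n_high <- nrow(high_attendance)
n_female <- nrow(females)
print(c(high = n_high, female = n_female))
```

On the full 200-row data, nrow(dataF) and nrow(hattendance) give the counts reported in the text.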
- Changing the name of a variable
library(dplyr)
data <- data %>% rename(Household_Income = Family_Income)
head(data)
Hours_Studied Attendance Previous_Scores Household_Income Gender Exam_Score
1 23 84 73 Low Male 67
2 19 64 60 Medium Female 61
3 24 98 91 Medium Male 74
4 29 89 98 Medium Male 71
5 19 92 65 Medium Female 70
6 19 88 89 Medium Male 71
Using the package “dplyr”, I renamed Family_Income to Household_Income.
- Using descriptive statistics
library(psych)
describeBy(data$Exam_Score, data$Gender)
Descriptive statistics by group
group: Female
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 73 67.47 4.98 67 66.95 2.97 61 100 39 3.88 22.62 0.58
------------------------------------------------------------
group: Male
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 127 67.26 3.15 67 67.25 2.97 60 76 16 0.08 -0.47 0.28
With the function describeBy we can now compare male and female students.
Arithmetic mean
The groups have very similar means (67.47 for female and 67.26 for male students), so the average exam score achieved is around 67 in both groups.
Median
The median is also 67 for both groups. This means that in each group at least half of the scores are less than or equal to 67 and at least half are greater than or equal to 67.
Range
Range is calculated by subtracting the minimum from the maximum, in this case by subtracting the lowest exam score from the highest one.
The female students have a larger range of exam scores, between 61 and 100, while the male students only have scores between 60 and 76.
This gives a range of 39 for the female students compared to a range of 16 for the male students.
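The range calculation can be checked by hand. The sketch below takes the minimum and maximum per group from the describeBy() output; the variable names are illustrative.

```r
# Minimum and maximum exam scores per group, from the describeBy() output
female <- c(min = 61, max = 100)
male   <- c(min = 60, max = 76)

# Range = maximum - minimum
range_female <- unname(female["max"] - female["min"])
range_male   <- unname(male["max"] - male["min"])
print(c(female = range_female, male = range_male))

# For a raw score vector, diff(range(x)) does the same calculation in one step
example_scores <- c(61, 67, 100)  # made-up vector with the same min and max
range_example <- diff(range(example_scores))
```

On the real data, diff(range(dataF$Exam_Score)) should reproduce the range for the female students.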
Exercise 7
Research question
Is there a difference between the scores that were achieved in the final exam and the scores achieved in the previous exam?
Checking if the assumptions are fulfilled
The values are numeric. Next, it needs to be checked whether the differences are normally distributed in the population.
Histogram:
data$Difference <- data$Exam_Score - data$Previous_Scores
library(ggplot2)
ggplot(data, aes(x = Difference)) +
  geom_histogram(position = "identity", binwidth = 3, colour = "black",
                 fill = "cyan3") + xlab("Difference")
The differences between the exam scores do not seem to be normally distributed. One sign of this is that the histogram has several peaks.
- Shapiro-Wilk Test
shapiro.test(data$Difference)
Shapiro-Wilk normality test
data: data$Difference
W = 0.96737, p-value = 0.0001356
H0: The differences between the scores achieved in the two exams are normally distributed.
HA: The differences between the scores achieved in the two exams are not normally distributed.
The null hypothesis can be rejected, as p<0.001. This means that the differences are not normally distributed.
- QQ-Plot
library(ggpubr)
ggqqplot(data$Difference)
By checking the QQ-Plot, we can see that many observations stray far from the QQ-Line and lie outside the confidence interval. This also indicates that the assumption of normality is violated.
- Paired T-Test
t.test(data$Exam_Score, data$Previous_Scores, paired = TRUE,
alternative = "two.sided")
Paired t-test
data: data$Exam_Score and data$Previous_Scores
t = -8.7676, df = 199, p-value = 8.121e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-11.165088 -7.064912
sample estimates:
mean difference
-9.115
H0: The mean difference of the exam scores is equal to zero.
HA: The mean difference of the exam scores is not equal to zero.
The null hypothesis can be rejected, as p<0.001. This means that the mean difference between the exam scores is not equal to zero.
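A paired t-test is a one-sample t-test on the differences, so the reported confidence interval can be reconstructed from the printed output. The sketch below uses only the rounded values from the output (mean difference, t and df), so the result should agree with the reported interval only up to rounding. The variable names are illustrative.

```r
# Values copied from the paired t-test output
mean_diff <- -9.115
t_stat    <- -8.7676
dof       <- 199

# t = mean difference / standard error, so the standard error is:
se <- mean_diff / t_stat

# Critical value of the t distribution for a two-sided 95% interval
t_crit <- qt(0.975, dof)

# Confidence interval: mean difference -/+ critical value * standard error
ci <- mean_diff + c(-1, 1) * t_crit * se
print(ci)
```

The interval does not contain zero, which matches the rejection of the null hypothesis.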
- Interpreting the effect size
library(effectsize)
cohens_d(data$Difference)
Cohen's d | 95% CI
--------------------------
-0.62 | [-0.77, -0.47]
interpret_cohens_d(0.62, rules = "sawilowsky2009")
[1] "medium"
(Rules: sawilowsky2009)
The effect size is medium. This means that there is a noticeable and meaningful difference between results of the final and the previous exam.
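For paired data, Cohen's d is the mean of the differences divided by their standard deviation, which equals the t statistic divided by the square root of the sample size. As a rough check, the sketch below recomputes d from the rounded t-test output. It is a hand calculation with illustrative variable names, not a description of how the effectsize package computes it internally.

```r
# Values copied from the paired t-test output
t_stat <- -8.7676
n      <- 200  # df + 1 paired observations

# d = mean(diff) / sd(diff) = t / sqrt(n) for a one-sample or paired design
d <- t_stat / sqrt(n)
d_rounded <- round(d, 2)
print(d_rounded)
```

The negative sign only reflects the direction of the subtraction (final minus previous); the interpretation uses the absolute value.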
- Performing a Wilcoxon Signed Rank Test
wilcox.test(data$Exam_Score, data$Previous_Scores, paired = TRUE,
correct = FALSE, exact = FALSE, alternative = "two.sided")
Wilcoxon signed rank test
data: data$Exam_Score and data$Previous_Scores
V = 3961, p-value = 1.782e-13
alternative hypothesis: true location shift is not equal to 0
H0: The distribution locations of the scores for the previous exam and the final exam are the same.
HA: The distribution locations of the scores for the previous exam and the final exam are not the same.
Because p<0.001, the null hypothesis that the distribution locations of the scores for the previous exam and the final exam are the same can be rejected.
- Interpreting the effect size
effectsize(wilcox.test(data$Exam_Score, data$Previous_Scores, paired = TRUE,
                       correct = FALSE, exact = FALSE, alternative = "two.sided"))
r (rank biserial) | 95% CI
----------------------------------
-0.60 | [-0.69, -0.49]
interpret_rank_biserial(0.6, rules = "funder2019")
[1] "very large"
(Rules: funder2019)
The effect size is very large. This means that the difference in distribution locations is very large and the scores achieved are very different in the two exams.
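To make the rank-biserial correlation more concrete, the sketch below computes it by hand for paired data. It compares the rank sum of the positive differences with the rank sum of the negative differences. The function `rank_biserial_paired` is a hypothetical helper written for this illustration. It assumes that zero differences are dropped, so its handling of zeros and ties may differ from the effectsize package. The example values are the first six rows of the data, after the change to row 2.

```r
# Hypothetical helper: matched-pairs rank-biserial correlation
rank_biserial_paired <- function(x, y) {
  d <- x - y
  d <- d[d != 0]            # assumption of this sketch: drop zero differences
  r <- rank(abs(d))         # rank the absolute differences (ties get mean ranks)
  w_pos <- sum(r[d > 0])    # rank sum of positive differences
  w_neg <- sum(r[d < 0])    # rank sum of negative differences
  (w_pos - w_neg) / (w_pos + w_neg)
}

# First six students: final exam score and previous score
final    <- c(67, 61, 74, 71, 70, 71)
previous <- c(73, 60, 91, 98, 65, 89)

rb <- rank_biserial_paired(final, previous)
print(rb)
```

A value near -1 means that almost all of the rank weight lies with students who scored lower in the final exam than in the previous one; a value near 0 means that the two directions balance out.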
- Conclusion
The non-parametric test is more suitable in this particular case because the assumption of normality is violated, which makes the Wilcoxon Signed Rank Test the better option.
Since the mean difference is negative (-9.115), the scores achieved in the previous exam were, on average, higher than the scores achieved in the final exam.
Based on the Wilcoxon Signed Rank Test and the very large effect size, we can say:
On average, the scores achieved in the previous exam were significantly higher than the scores achieved in the final exam.