Homework_2_Tobias_Schöfbeck

Exercise 2

  • Importing the data into RStudio:
data <- read.csv("StudentPerformanceFactors.csv", header = TRUE)

Exercise 3

  • Displaying it using head()
head(data)
  Hours_Studied Attendance Parental_Involvement Access_to_Resources
1            23         84                  Low                High
2            19         64                  Low              Medium
3            24         98               Medium              Medium
4            29         89                  Low              Medium
5            19         92               Medium              Medium
6            19         88               Medium              Medium
  Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level
1                         No           7              73              Low
2                         No           8              59              Low
3                        Yes           7              91           Medium
4                        Yes           8              98           Medium
5                        Yes           6              65           Medium
6                        Yes           8              89           Medium
  Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type
1             Yes                 0           Low          Medium      Public
2             Yes                 2        Medium          Medium      Public
3             Yes                 2        Medium          Medium      Public
4             Yes                 1        Medium          Medium      Public
5             Yes                 3        Medium            High      Public
6             Yes                 3        Medium          Medium      Public
  Peer_Influence Physical_Activity Learning_Disabilities
1       Positive                 3                    No
2       Negative                 4                    No
3        Neutral                 4                    No
4       Negative                 4                    No
5        Neutral                 4                    No
6       Positive                 3                    No
  Parental_Education_Level Distance_from_Home Gender Exam_Score
1              High School               Near   Male         67
2                  College           Moderate Female         61
3             Postgraduate               Near   Male         74
4              High School           Moderate   Male         71
5                  College               Near Female         70
6             Postgraduate               Near   Male         71

Exercise 4

  • Data Manipulation and explanation of variables
data <- data[1:200, c(-3,-4,-5,-6,-8,-9,-10,-12,-13,-14,-15,-16,-17,-18)]
head(data)
  Hours_Studied Attendance Previous_Scores Family_Income Gender Exam_Score
1            23         84              73           Low   Male         67
2            19         64              59        Medium Female         61
3            24         98              91        Medium   Male         74
4            29         89              98        Medium   Male         71
5            19         92              65        Medium Female         70
6            19         88              89        Medium   Male         71

The original dataset has 6607 observations and 20 variables. To keep it manageable, I reduced it. The reduced data has the following variables:

  • Hours_Studied is the number of hours spent per week studying.

  • Attendance is the percentage of classes attended.

  • Previous_Scores is the score for the previous exam.

  • Family_Income is the family income level in three categories (Low, Medium, High).

  • Gender is the gender of the student (Male, Female).

  • Exam_Score is the final exam score.

The reduced data has 200 observations.
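Dropping columns by negative position works, but it is easy to miscount. As a minimal sketch, the same reduction can be written by naming the columns to keep. The data frame `full` below is a hypothetical stand-in with made-up values and only a few columns; it is not the Kaggle file.

```r
# Toy stand-in for the full dataset (hypothetical values, only a few columns)
full <- data.frame(
  Hours_Studied   = c(23, 19, 24),
  Attendance      = c(84, 64, 98),
  Sleep_Hours     = c(7, 8, 7),
  Previous_Scores = c(73, 59, 91),
  Exam_Score      = c(67, 61, 74)
)

# Columns to keep, listed by name instead of by negative position
keep <- c("Hours_Studied", "Attendance", "Previous_Scores", "Exam_Score")

# Keep the first two rows and only the named columns
reduced <- full[1:2, keep]
print(reduced)
```

Selecting by name keeps working if the column order of the source file changes, which is the main reason to prefer it over position-based indexing.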

Exercise 5

I found the dataset on Kaggle under the following link: https://www.kaggle.com/datasets/lainguyn123/student-performance-factors?resource=download

Exercise 6

  • Data Manipulation
    Creating factors
data$Gender <- factor(data$Gender)
data$Family_Income <- factor(data$Family_Income)
summary(data)
 Hours_Studied     Attendance    Previous_Scores  Family_Income    Gender   
 Min.   : 4.00   Min.   : 60.0   Min.   : 50.00   High  :36     Female: 73  
 1st Qu.:16.00   1st Qu.: 70.0   1st Qu.: 65.00   Low   :76     Male  :127  
 Median :20.00   Median : 79.0   Median : 77.50   Medium:88                 
 Mean   :19.79   Mean   : 80.2   Mean   : 76.44                             
 3rd Qu.:23.00   3rd Qu.: 91.0   3rd Qu.: 89.00                             
 Max.   :36.00   Max.   :100.0   Max.   :100.00                             
   Exam_Score    
 Min.   : 60.00  
 1st Qu.: 65.00  
 Median : 67.00  
 Mean   : 67.33  
 3rd Qu.: 69.00  
 Max.   :100.00  

I converted Gender and Family_Income into factors. The summary shows that there are 73 female and 127 male students. By family income, 36 of the 200 students fall into the category High, 76 into Low and 88 into Medium.
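The summary lists the income levels alphabetically (High, Low, Medium), because factor() sorts levels by default. A small sketch of how the levels could be given in their natural order instead; the vector `income` holds hypothetical values and is not taken from the dataset.

```r
# Hypothetical income values, not taken from the dataset
income <- c("Low", "Medium", "High", "Medium", "Low")

# Passing levels explicitly fixes their order (default would be alphabetical)
income_f <- factor(income, levels = c("Low", "Medium", "High"))

print(levels(income_f))
print(table(income_f))
```

With this ordering, tables and plots show Low, Medium, High from left to right.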

  • Altering the data
data[2,3] <- 60

After that, I changed the previous exam score of one student (row 2) from 59 to 60. I changed it by only one point, as I didn’t want to alter the data too much.

  • Creating a subset for the females
dataF <- data[data$Gender == "Female", ]
head(dataF)
   Hours_Studied Attendance Previous_Scores Family_Income Gender Exam_Score
2             19         64              60        Medium Female         61
5             19         92              65        Medium Female         70
16            17         68              70        Medium Female         64
18            22         70              82           Low Female         65
19            15         80              91           Low Female         67
21            29         78              99          High Female         69

First, I created a subset with only female students. This subset has 73 observations.

  • Creating a subset with high attendance
hattendance <- data[data$Attendance >= 80,]
head(hattendance)
  Hours_Studied Attendance Previous_Scores Family_Income Gender Exam_Score
1            23         84              73           Low   Male         67
3            24         98              91        Medium   Male         74
4            29         89              98        Medium   Male         71
5            19         92              65        Medium Female         70
6            19         88              89        Medium   Male         71
7            29         84              68           Low   Male         67

Then I created a subset of students with an attendance of at least 80% (the condition is >= 80), which has 99 observations.
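The number of observations in a subset can be checked with nrow(). The sketch below also shows how the choice between >= and > changes the result at the boundary. The data frame `toy` is hypothetical and only illustrates the idea.

```r
# Hypothetical attendance values, including one exactly at the boundary (80)
toy <- data.frame(
  Attendance = c(79, 80, 81, 95),
  Gender     = c("Female", "Male", "Female", "Male")
)

at_least_80 <- toy[toy$Attendance >= 80, ]  # includes the student with exactly 80
over_80     <- toy[toy$Attendance > 80, ]   # excludes the student with exactly 80

print(nrow(at_least_80))
print(nrow(over_80))
```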

  • Changing the name of a variable
library(dplyr)
data <- data %>% rename(Household_Income = Family_Income)
head(data)
  Hours_Studied Attendance Previous_Scores Household_Income Gender Exam_Score
1            23         84              73              Low   Male         67
2            19         64              60           Medium Female         61
3            24         98              91           Medium   Male         74
4            29         89              98           Medium   Male         71
5            19         92              65           Medium Female         70
6            19         88              89           Medium   Male         71

With the help of the library “dplyr” I changed the name of Family_Income to Household_Income.

  • Using descriptive statistics
library(psych)
describeBy(data$Exam_Score, data$Gender)

 Descriptive statistics by group 
group: Female
   vars  n  mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 73 67.47 4.98     67   66.95 2.97  61 100    39 3.88    22.62 0.58
------------------------------------------------------------ 
group: Male
   vars   n  mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 127 67.26 3.15     67   67.25 2.97  60  76    16 0.08    -0.47 0.28

With the function describeBy we can now compare male and female students.

  • Arithmetic mean

    The groups have very similar means (67.47 for females and 67.26 for males), so the average exam score is about 67 in both groups.

  • Median

    The median is also 67 for both groups. This means that at least 50 percent of the scores are below or equal to 67 and at least 50 percent are above or equal to 67.

  • Range

    The range is calculated by subtracting the minimum from the maximum, in this case by subtracting the lowest exam score from the highest one.
    The females have a wider range of exam scores, between 61 and 100, while the males only have scores between 60 and 76.
    This gives a range of 39 for the females compared to a range of 16 for the males.
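The range calculation described above can be sketched per group with tapply(). Only the minimum and maximum of each group are taken from the describeBy output; the middle values are hypothetical fillers.

```r
# Min and max per group follow the describeBy output; middle values are made up
scores <- c(61, 70, 100, 60, 71, 76)
group  <- c("Female", "Female", "Female", "Male", "Male", "Male")

# Range = maximum minus minimum, computed separately for each group
ranges <- tapply(scores, group, function(x) max(x) - min(x))
print(ranges)
```

This reproduces the ranges of 39 (females) and 16 (males) reported by describeBy.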

Exercise 7

  • Research question

    Is there a difference between the scores that were achieved in the final exam and the scores achieved in the previous exam?

  • Checking if the assumptions are fulfilled

    The values are numeric. Next, it needs to be checked whether the differences are normally distributed in the population.

  • Histogram:

data$Difference <- data$Exam_Score - data$Previous_Scores
library(ggplot2)
ggplot(data, aes(x = Difference)) +
  geom_histogram(position = "identity", binwidth = 3, colour = "black",
                 fill = "cyan3") + xlab("Difference")

The differences between the exam scores do not seem to be normally distributed. One sign of this is the presence of several peaks.

  • Shapiro-Wilk Test
shapiro.test(data$Difference)

    Shapiro-Wilk normality test

data:  data$Difference
W = 0.96737, p-value = 0.0001356

H0: The differences between the scores achieved in the two exams are normally distributed.
HA: The differences between the scores achieved in the two exams are not normally distributed.

The null hypothesis can be rejected, as p < 0.001. This means that the differences are not normally distributed.

  • QQ-Plot
library(ggpubr)
ggqqplot(data$Difference)

By checking the QQ-Plot, we can see that many observations stray far from the QQ-Line and lie outside the confidence interval. This also indicates that the assumption of normality is violated.

  • Paired T-Test
t.test(data$Exam_Score, data$Previous_Scores, paired = TRUE,
       alternative = "two.sided")

    Paired t-test

data:  data$Exam_Score and data$Previous_Scores
t = -8.7676, df = 199, p-value = 8.121e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -11.165088  -7.064912
sample estimates:
mean difference 
         -9.115 

H0: The mean difference between the exam scores is equal to zero.
HA: The mean difference between the exam scores is not equal to zero.

The null hypothesis can be rejected, as p < 0.001. This means that the mean difference between the exam scores is not equal to zero.
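A paired t-test is the same as a one-sample t-test on the differences, which is why the normality check above looks at the Difference variable. A minimal sketch using only the first six rows shown in head(data) (with the altered value 60), so the numbers differ from the full result above:

```r
# First six rows from head(data); too few for a real analysis, illustration only
final    <- c(67, 61, 74, 71, 70, 71)
previous <- c(73, 60, 91, 98, 65, 89)

# Paired t-test on the two score vectors
paired <- t.test(final, previous, paired = TRUE)

# One-sample t-test on the differences against a mean of 0
one_sample <- t.test(final - previous, mu = 0)

# Both approaches give the same t statistic and p-value
print(all.equal(unname(paired$statistic), unname(one_sample$statistic)))
print(all.equal(paired$p.value, one_sample$p.value))
```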

  • Interpreting the effect size
library(effectsize)
cohens_d(data$Difference)
Cohen's d |         95% CI
--------------------------
-0.62     | [-0.77, -0.47]
interpret_cohens_d(0.62, rules = "sawilowsky2009")
[1] "medium"
(Rules: sawilowsky2009)

The effect size is medium. This means that there is a noticeable and meaningful difference between results of the final and the previous exam.
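As a plausibility check, Cohen's d for paired data is usually defined as the mean of the differences divided by their standard deviation, which equals t divided by the square root of n. I am assuming this is the definition cohens_d() applies to a single vector; the t value and degrees of freedom below come from the paired t-test output above.

```r
# Values taken from the paired t-test output above
t_value <- -8.7676
n       <- 199 + 1            # df + 1 observations

# d = mean(diff) / sd(diff) = t / sqrt(n)
d_from_t <- t_value / sqrt(n)
print(round(d_from_t, 2))

# The same identity on a toy vector of differences (first six rows of the data)
diffs    <- c(-6, 1, -17, -27, 5, -18)
d_direct <- mean(diffs) / sd(diffs)
d_via_t  <- unname(t.test(diffs)$statistic) / sqrt(length(diffs))
print(all.equal(d_direct, d_via_t))
```

The value derived from the t statistic agrees with the -0.62 reported by cohens_d().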

  • Performing a Wilcoxon Signed Rank Test
wilcox.test(data$Exam_Score, data$Previous_Scores, paired = TRUE,
            correct = FALSE, exact = FALSE, alternative = "two.sided")

    Wilcoxon signed rank test

data:  data$Exam_Score and data$Previous_Scores
V = 3961, p-value = 1.782e-13
alternative hypothesis: true location shift is not equal to 0

H0: The distribution locations of the scores for the previous exam and the final exam are the same.
HA: The distribution locations of the scores for the previous exam and the final exam are not the same.

Because p < 0.001, the null hypothesis that the distribution locations of the scores for the previous exam and the final exam are the same can be rejected.

  • Interpreting the effect size
effectsize(wilcox.test(data$Exam_Score, data$Previous_Scores, paired = TRUE,
            correct = FALSE, exact = FALSE, alternative = "two.sided"))
r (rank biserial) |         95% CI
----------------------------------
-0.60             | [-0.69, -0.49]
interpret_rank_biserial(0.6, rules = "funder2019")
[1] "very large"
(Rules: funder2019)

The effect size is very large. This means that the difference in distribution locations is very large and the scores achieved are very different in the two exams.
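The rank-biserial correlation can be roughly reconstructed from the V statistic. This sketch uses the matched-pairs formula (sum of positive ranks minus sum of negative ranks, divided by the total rank sum) and assumes that none of the 200 differences is zero. That assumption is mine and may not hold for this data, and effectsize() may treat zeros and ties differently, so the result only needs to be close to the reported value.

```r
# V from the Wilcoxon output above: the sum of ranks of the positive differences
v_plus <- 3961
n      <- 200                 # assumption: no zero differences were dropped

# Total rank sum and the sum of ranks of the negative differences
total   <- n * (n + 1) / 2
v_minus <- total - v_plus

# Matched-pairs rank-biserial correlation
r_rb <- (v_plus - v_minus) / total
print(round(r_rb, 2))
```

The negative sign reflects that most of the rank weight lies with negative differences, that is, with students whose final exam score is below their previous score.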

  • Conclusion

The nonparametric test is more suitable in this particular case, because the assumption of normality is violated, which makes the Wilcoxon test the better option.

Since the mean difference (final exam score minus previous score) is negative (-9.115), the scores achieved in the previous exam were, on average, higher than the scores achieved in the final exam.

When looking at the Wilcoxon Signed Rank Test and the very large effect size we can say:

On average, the points achieved in the previous exam were significantly higher than the scores achieved in the final exam.