Importing Data and Dependencies

library(ggplot2)
library(ggpubr)
theme_algoritma <- readRDS("theme_algoritma.rds")

student <- read.csv(file = "data_input/StudentsPerformance.csv")
str(student)

## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : chr  "female" "female" "female" "male" ...
##  $ race.ethnicity             : chr  "group B" "group C" "group B" "group A" ...
##  $ parental.level.of.education: chr  "bachelor's degree" "some college" "master's degree" "associate's degree" ...
##  $ lunch                      : chr  "standard" "standard" "standard" "free/reduced" ...
##  $ test.preparation.course    : chr  "none" "completed" "none" "none" ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...

Data Exploration

Head, Tail and Dimension

head(student)

tail(student)

dim(student)

## [1] 1000    8

Data Pre-Processing

Converting Data Types

# Convert gender, race/ethnicity, parental level of education, lunch, and test preparation course to factors

student$gender <- as.factor(student$gender)
student$race.ethnicity <- as.factor(student$race.ethnicity) 
student$parental.level.of.education <- as.factor(student$parental.level.of.education)
student$lunch <- as.factor(student$lunch)
student$test.preparation.course <- as.factor(student$test.preparation.course)

str(student)

## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
##  $ lunch                      : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...
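
The five conversions above can also be written as one pass over every character column. The sketch below is only an illustration: `toy` is a made-up stand-in for `student` with three of its columns, and `chr_cols` is a hypothetical helper name.

```r
# Toy stand-in for `student` (made-up rows, same column names as the real data)
toy <- data.frame(
  gender = c("female", "male", "female"),
  lunch = c("standard", "free/reduced", "standard"),
  math.score = c(72L, 47L, 90L),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert all of them to factors at once
chr_cols <- vapply(toy, is.character, logical(1))
toy[chr_cols] <- lapply(toy[chr_cols], as.factor)

str(toy)
```

Another option is to pass `stringsAsFactors = TRUE` to `read.csv()`, which converts the character columns while the file is read.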

Adding Total Column

Total is a new column that sums the math, reading, and writing scores. It is used as an overall measure of each student’s performance.

student$total <- student$math.score + student$reading.score + student$writing.score
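
As a hedged alternative, the same sum can be computed with `rowSums()` over a vector of column names, which is easier to extend if more score columns are added. The data frame `toy` below is made up for illustration, and its missing value is an assumption used to show the default behaviour.

```r
# Toy stand-in for the three score columns (the NA is made up on purpose)
toy <- data.frame(
  math.score = c(72L, 69L, NA),
  reading.score = c(72L, 90L, 95L),
  writing.score = c(74L, 88L, 93L)
)

score_cols <- c("math.score", "reading.score", "writing.score")

# Sum the score columns row by row; a row with a missing score gets NA
toy$total <- rowSums(toy[, score_cols])

# Check whether any total could not be computed
anyNA(toy$total)
```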

Summary and Basic Statistics

Summary of Data

summary(student)

##     gender    race.ethnicity     parental.level.of.education          lunch    
##  female:518   group A: 89    associate's degree:222          free/reduced:355  
##  male  :482   group B:190    bachelor's degree :118          standard    :645  
##               group C:319    high school       :196                            
##               group D:262    master's degree   : 59                            
##               group E:140    some college      :226                            
##                              some high school  :179                            
##  test.preparation.course   math.score     reading.score    writing.score   
##  completed:358           Min.   :  0.00   Min.   : 17.00   Min.   : 10.00  
##  none     :642           1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75  
##                          Median : 66.00   Median : 70.00   Median : 69.00  
##                          Mean   : 66.09   Mean   : 69.17   Mean   : 68.05  
##                          3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00  
##                          Max.   :100.00   Max.   :100.00   Max.   :100.00  
##      total      
##  Min.   : 27.0  
##  1st Qu.:175.0  
##  Median :205.0  
##  Mean   :203.3  
##  3rd Qu.:233.0  
##  Max.   :300.0

Interpretation from Summary:
1. The proportions of female and male students are close: 518 female and 482 male.
2. The largest ethnic group is group C (319 students), and the smallest is group A (89).
3. The most common parental level of education is some college (226), closely followed by associate’s degree (222). The least common is master’s degree (59).
4. Standard lunch (645) is more common than free/reduced lunch (355).
5. Most students did not complete the test preparation course: 358 completed it and 642 did not.

From the summary above, we can draw some insights about the students’ scores on the math, reading, and writing tests.
1. Math:
a. The minimum score is 0 and the maximum is 100. For the score of 0, I assume the student did not take the test.
b. The mean (66.09) and the median (66) are almost identical.
2. Reading:
a. The minimum score is 17 and the maximum is 100. The mean (69.17) is slightly higher than the math mean.
b. The median (70) is slightly higher than the mean.
3. Writing:
a. The minimum score is 10, which is below the reading minimum (17) but above the math minimum (0).
b. The mean (68.05) and the median (69) are close, and the maximum score is 100.
4. Total score:
a. The highest total score is 300, meaning at least one student scored 100 on every test. The lowest is 27, which equals the sum of the three minimum scores (0 + 17 + 10), so I guess it comes from a single student who had the lowest score on every test.
b. The mean (203.3) and the median (205) are similar.
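
The guess about the lowest total can be checked directly instead of being inferred from the summary. The sketch below uses a made-up data frame `toy` in place of `student`; the names `lowest` and `col_mins` are hypothetical.

```r
# Toy stand-in for `student` (made-up rows)
toy <- data.frame(
  math.score = c(0L, 72L, 40L),
  reading.score = c(17L, 72L, 43L),
  writing.score = c(10L, 74L, 39L)
)
toy$total <- toy$math.score + toy$reading.score + toy$writing.score

# Row(s) holding the lowest total score
lowest <- toy[toy$total == min(toy$total), ]
print(lowest)

# Does that row also hold the minimum of each individual test?
score_cols <- c("math.score", "reading.score", "writing.score")
col_mins <- vapply(toy[score_cols], min, numeric(1))
all(unlist(lowest[1, score_cols]) == col_mins)
```

Running the same lines on `student` would show whether the student with a total of 27 is also the one with a math score of 0.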

Basic Statistic: Finding What Factors Influence Student Performance

Main Question

The main question in this analysis is “What factors significantly influence student performance on exams?”. To answer it, several hypotheses are tested.

Hypotheses

Before going further, I want to state some hypotheses for the main question: what factors influence student performance? The basic hypothesis in this analysis is: “All variables significantly influence student performance on the tests”. However, many other factors may also play a role, and much research has tried to answer this same question. The main hypothesis is supported by the derivative hypotheses below:

Demography
From demography, I will use gender and race/ethnicity. Some of the hypotheses are based on published research.

H1A: Male students perform better than female students on math
H2A: Female students perform better than male students on verbal tests (reading and writing)

Race/ethnicity will not be included in the analysis, but it is still used to add more explanation.

Further reading: Balart, P., Oosterveen, M. Females show more sustained performance during test-taking than males. Nat Commun 10, 3798 (2019). https://doi.org/10.1038/s41467-019-11691-y

Socio-Economic Factor
For the socio-economic factor, the parents’ level of education and the lunch type serve as indicators of a student’s socio-economic status. A higher parental level of education is assumed to mean a better salary, and a standard lunch is assumed to indicate a middle- or upper-class household.

H3A: Parental level of education significantly affects student performance.

Further reading:
- Farooq, M. S., Chaudhry, A. H., Shafiq, M., Berhanu, G. 2011. Factors Affecting Students’ Quality of Academic Performance: A Case of Secondary School Level. Journal of Quality and Technology Management, Volume VII, Issue II, December 2011, pp. 1-14.
- Abdu-Raheem, B. O. 2015. Parents’ Socio-Economic Status as Predictor of Secondary School Students’ Academic Performance in Ekiti State, Nigeria. Journal of Education and Practice, v6 n1 p123-128.

Preparation Factor

H5A: Students who complete a test preparation course before the test perform better than those who do not join or finish it.

Analysis

To test the hypotheses, this analysis uses Student’s t-test and one-way ANOVA. The confidence level is 95%, which corresponds to a significance level of alpha = 0.05.

Analysis and Visualization

Data Wrangling and Student t-test

From the whole dataset, before doing t-test and ANOVA test, some steps must be followed:
1. Subset the data and save as new dataset
2. Perform t-test or ANOVA test
3. Visualize
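
The first two steps above can be sketched as one small helper. This is only an illustration, not the code used in the rest of this analysis: `compare_groups` is a hypothetical function, and `toy` holds simulated values, not the real scores.

```r
# Hypothetical helper: subset the data, then run a t-test for two groups
# or a one-way ANOVA for more than two groups
compare_groups <- function(data, outcome, group, alpha = 0.05) {
  # Step 1: keep only the two columns that are needed
  sub <- data[, c(group, outcome)]
  f <- reformulate(group, response = outcome)

  # Step 2: choose the test from the number of groups
  n_groups <- nlevels(factor(sub[[group]]))
  res <- if (n_groups == 2) {
    t.test(f, data = sub, var.equal = TRUE)
  } else {
    oneway.test(f, data = sub)
  }

  list(test = res, significant = res$p.value < alpha)
}

# Simulated stand-in for `student` (values are made up)
set.seed(4)
toy <- data.frame(
  gender = rep(c("female", "male"), each = 40),
  total = c(rnorm(40, mean = 209, sd = 40), rnorm(40, mean = 198, sd = 40))
)

out <- compare_groups(toy, outcome = "total", group = "gender")
out$test$method
```

The third step, visualization, is left out of the helper because each plot in this analysis needs its own labels and title.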

Demography

# H1A: Male perform better than female on math

h1a <- student[, c(1, 6)]
test1 <- t.test(math.score ~ gender, data = h1a, var.equal = TRUE)
print(test1)

## 
##  Two Sample t-test
## 
## data:  math.score by gender
## t = -5.3832, df = 998, p-value = 9.12e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.952285 -3.237737
## sample estimates:
## mean in group female   mean in group male 
##             63.63320             68.72822

Interpretation:
The t-test gives t = -5.38 with a p-value of 9.12e-08, which is far below alpha. Therefore, the mean math scores of the male and female groups are significantly different, with the male mean (68.7) higher than the female mean (63.6). This finding supports H1A.
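
The test above is two-sided and assumes equal variances, while H1A is a directional hypothesis. As a hedged sketch, the equal-variance assumption could be checked with `var.test()`, and the direction could be tested with a one-sided Welch test. The data frame `toy` is simulated for illustration only.

```r
# Simulated stand-in for the gender and math.score columns (made-up values)
set.seed(1)
toy <- data.frame(
  gender = factor(rep(c("female", "male"), each = 50)),
  math.score = c(rnorm(50, mean = 64, sd = 15), rnorm(50, mean = 69, sd = 15))
)

# F test of equal variances between the two groups
var_check <- var.test(math.score ~ gender, data = toy)
var_check$p.value

# One-sided Welch t-test; the difference is taken as female minus male,
# so "less" matches the claim that the male mean is higher
one_sided <- t.test(math.score ~ gender, data = toy,
                    alternative = "less", var.equal = FALSE)
one_sided$p.value
```

The direction of `alternative` depends on the order of the factor levels, so it is worth checking `levels(toy$gender)` before reading the result.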

# H2A: Female perform better than male on verbal test (reading and writing)

h2a <- student[, c(1, 7, 8)]

test2 <- t.test(reading.score + writing.score ~ gender, data = h2a, var.equal = TRUE)
test2

## 
##  Two Sample t-test
## 
## data:  reading.score + writing.score by gender
## t = 9.089, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12.77378 19.80833
## sample estimates:
## mean in group female   mean in group male 
##             145.0753             128.7842

test3 <- t.test(reading.score ~ gender, data = h2a, var.equal = TRUE)
test3

## 
##  Two Sample t-test
## 
## data:  reading.score by gender
## t = 7.9593, df = 998, p-value = 4.681e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5.375946 8.894212
## sample estimates:
## mean in group female   mean in group male 
##             72.60811             65.47303

test4 <- t.test(writing.score ~ gender, data = h2a, var.equal = TRUE)
test4

## 
##  Two Sample t-test
## 
## data:  writing.score by gender
## t = 9.9796, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.35558 10.95638
## sample estimates:
## mean in group female   mean in group male 
##             72.46718             63.31120

Interpretation:
Tests 2 to 4 compute the p-value for each comparison.
Test 2 compares the combined reading and writing score between genders; Tests 3 and 4 compare the reading score and the writing score separately. Comparing each p-value with alpha shows that in Tests 2, 3, and 4 the means are significantly different between the male and female groups.

In Tests 2, 3, and 4 the female mean is higher than the male mean. Therefore, H2A is accepted -> female students perform better on the verbal tests.

Demography Visualization

ggboxplot(h1a, x = "gender", y = "math.score", 
          color = "gender", palette = c("#00AFBB", "#E7B800"),
          ylab = "Math Score", xlab = "Gender") +
  labs(title = "Math Score between Male and Female Student") +
  theme_algoritma

Overall: Gender x Total Score

h4a <- student[, c(1 , 9)]
test5 <- t.test(total ~ gender, data = h4a, var.equal = TRUE)
test5

## 
##  Two Sample t-test
## 
## data:  total by gender
## t = 4.1699, df = 998, p-value = 3.312e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   5.927234 16.464858
## sample estimates:
## mean in group female   mean in group male 
##             208.7085             197.5124

Interpretation:
To examine which gender performs better overall (math, reading, and writing combined), Test 5 compares the total score between genders. The p-value is 3.312e-05, so the difference is significant. In conclusion, female students have a higher mean total score (208.7) than male students (197.5), even though male students score higher on math.

ggboxplot(student, x = "race.ethnicity", y = "total",
          color = "gender", palette = c("#00AFBB", "#E7B800"),
          ylab = "Total Score", xlab = "Race/Ethnicity") +
  labs(title = "Total Score based on Ethnicity") +
  theme_algoritma

Socio-Economic

# H3A: Student's parent level of education significantly affect student's performance.

h5a <- student[, c(3, 4, 9)]

# Lunch Category
test6 <- t.test(total ~ lunch, data = h5a, var.equal = TRUE)
test6

## 
##  Two Sample t-test
## 
## data:  total by lunch
## t = -9.5751, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -31.22541 -20.60348
## sample estimates:
## mean in group free/reduced     mean in group standard 
##                   186.5972                   212.5116

Interpretation:
Test 6 compares the total score between lunch categories. The lunch category is used as an indicator of whether a student comes from a higher or lower socio-economic class. Test 6 confirms that the mean total score of students with standard lunch (212.5) is significantly higher than that of students with free/reduced lunch (186.6). In addition, Test 7 (below) shows that parental level of education affects student performance as well (F = 10.882, p-value = 1.006e-09). In conclusion, hypothesis H3A is accepted.

# Parental level of education

test7 <- oneway.test(total ~ parental.level.of.education, data = h5a)
test7

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  total and parental.level.of.education
## F = 10.882, num df = 5.00, denom df = 339.03, p-value = 1.006e-09
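
The one-way test only says that at least one education level differs from the others. A possible follow-up, not part of the original analysis, is a post-hoc comparison with `TukeyHSD()`. Note that it is fitted with `aov()`, which assumes equal variances, unlike the `oneway.test()` call above. The data frame `toy` is simulated and uses only three of the six levels.

```r
# Simulated stand-in with three education levels (made-up values)
set.seed(2)
toy <- data.frame(
  parental.level.of.education = factor(
    rep(c("high school", "some college", "master's degree"), each = 30)
  ),
  total = c(rnorm(30, mean = 190, sd = 40),
            rnorm(30, mean = 205, sd = 40),
            rnorm(30, mean = 220, sd = 40))
)

# Classical one-way ANOVA fit, then all pairwise comparisons
fit <- aov(total ~ parental.level.of.education, data = toy)
posthoc <- TukeyHSD(fit)

# One row per pair of levels: difference, confidence bounds, adjusted p-value
posthoc$parental.level.of.education
```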

Socio-Economic Visualization

ggplot(h5a, aes(x = lunch, y = total, fill = lunch)) +
  geom_boxplot() +
  scale_fill_viridis_d(alpha = 0.6) +
  geom_jitter(color = "black", size = 0.9, alpha = 0.2) +
  theme_algoritma +
  theme(
    # legend.position="none",
    plot.title = element_text(size = 14)
  ) +
  ggtitle("Boxplot Lunch and Total Score Comparison") +
  xlab("Lunch Category") +
  ylab("Total")

ggplot(h5a, aes(x = parental.level.of.education, y = total, fill = lunch)) +
  geom_boxplot() +
  scale_fill_viridis_d(alpha = 0.6) +
  geom_jitter(color = "black", size = 0.9, alpha = 0.2) +
  theme_algoritma +
  theme(
    # legend.position="none",
    plot.title = element_text(size = 14)
  ) +
  ggtitle("Boxplot Lunch, Parent's Education and Total Score Comparison") +
  xlab("Level of Education") +
  ylab("Total")

Preparation Course

h6a <- student[, c(5, 9)]
test8 <- t.test(total ~ test.preparation.course, data = h6a, var.equal = TRUE)
test8

## 
##  Two Sample t-test
## 
## data:  total by test.preparation.course
## t = 8.3909, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  17.53804 28.24508
## sample estimates:
## mean in group completed      mean in group none 
##                218.0084                195.1168

Interpretation:
Test 8 is applied to determine which group (completed/none) performs better on the tests. The mean total score of the group that completed the course is 218, compared with only 195 for the group that did not. With a p-value smaller than 2.2e-16, the decision is to accept the alternative hypothesis H5A -> completing a preparation course before the test significantly affects student performance.

# Map fill to the grouping variable so that scale_fill_viridis_d() has an effect
ggplot(h6a, aes(x = test.preparation.course, y = total, fill = test.preparation.course)) +
  geom_boxplot() +
  scale_fill_viridis_d(alpha = 0.6) +
  geom_jitter(color = "black", size = 0.9, alpha = 0.2) +
  theme_algoritma +
  theme(
    # legend.position="none",
    plot.title = element_text(size = 14)
  ) +
  ggtitle("Boxplot Test Preparation Course and Total Score Comparison") +
  xlab("Test Preparation Course") +
  ylab("Total")

ggplot(h6a, aes(x=total, color=test.preparation.course, fill=test.preparation.course)) +
    geom_density(alpha=0.6) +
    labs(title = "Density Plot based on Preparation Course",
         x = "Total Score",
         y = "") +
    theme_algoritma

Additional Analysis

ggplot(student, aes(x=math.score, y=reading.score)) +
  geom_point() +
  labs(title = "Math Score x Reading Score") +
  geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
  theme_algoritma

## `geom_smooth()` using formula 'y ~ x'

ggplot(student, aes(x=math.score, y=writing.score)) +
  geom_point() +
  labs(title = "Math Score x Writing Score") +
  geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
  theme_algoritma

## `geom_smooth()` using formula 'y ~ x'

ggplot(student, aes(x=reading.score, y=writing.score)) +
  geom_point() +
  labs(title = "Reading Score x Writing Score") +
  geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
  theme_algoritma

## `geom_smooth()` using formula 'y ~ x'
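
The trend lines in the three scatterplots can be backed by numbers with a correlation matrix. The sketch below is an illustration on simulated data: `toy` and `base` are made up, so the printed values say nothing about the real scores.

```r
# Simulated stand-in: three scores that share a common component (made-up values)
set.seed(3)
base <- rnorm(100, mean = 68, sd = 14)
toy <- data.frame(
  math.score = base + rnorm(100, mean = 0, sd = 8),
  reading.score = base + rnorm(100, mean = 0, sd = 5),
  writing.score = base + rnorm(100, mean = 0, sd = 5)
)

# Pairwise Pearson correlations between the three scores
cors <- cor(toy[, c("math.score", "reading.score", "writing.score")])
round(cors, 2)

# Significance test for a single pair
cor.test(toy$reading.score, toy$writing.score)
```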

h7a <- student[, c(4, 5, 9)]
table1 <- table(h7a$lunch, h7a$test.preparation.course)
as.data.frame.matrix(table1)
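
Whether lunch category and test preparation course are related can also be tested formally rather than judged by eye. A minimal sketch, assuming a chi-squared test of independence is appropriate for this table, is shown below. The counts in `toy` are made up and are not the counts of the real data.

```r
# Made-up stand-in: 40 free/reduced and 60 standard lunch students
toy <- data.frame(
  lunch = factor(c(rep("free/reduced", 40), rep("standard", 60))),
  test.preparation.course = factor(c(
    rep(c("completed", "none"), times = c(15, 25)),
    rep(c("completed", "none"), times = c(22, 38))
  ))
)

tab <- table(toy$lunch, toy$test.preparation.course)

# Share of completed/none within each lunch category
prop.table(tab, margin = 1)

# Chi-squared test of independence (Yates' correction is the default for 2x2)
indep <- chisq.test(tab)
indep$p.value
```

A p-value above alpha would be consistent with the idea that completing the course does not depend on lunch category.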

Conclusion

Looking at the p-values from the tests above, we can answer the main question. On the demographic side (gender and race/ethnicity), each gender has its own strengths and weaknesses across the test categories. Male students perform better than female students on the math test, while female students do better on the verbal tests (writing and reading). Further testing of total score by gender shows that, overall, female students perform better than male students. Race/ethnicity could not be used as a factor because of its uniqueness. All groups perform fairly similarly on the tests, although some groups have lower total scores. This interpretation does not mean that one race is better than another.

Socio-economic status is also a significant factor. Using lunch category and parental level of education as indicators, several tests show that students from a middle-to-high socio-economic level perform better on the tests. Lastly, the third factor, the test preparation course (completed/none), significantly affects students’ scores. Students who completed the course before the test achieved higher scores than those who did not join or complete it. Moreover, the scatterplots and trend lines suggest that the three scores are strongly related to one another.

From the socio-economic and total score analysis, I wondered whether the test preparation course is correlated with a student’s socio-economic level. Therefore, I built a contingency table that shows the number of students by lunch category and test preparation course. If my assumption is correct, two major groups should form: free/reduced lunch with an incomplete course (assuming the course is paid, so students from a low economic level could not afford it), and standard lunch with a completed course. Instead, the table shows that the largest group has standard lunch (high socio-economic level) but did not complete the preparation course. The standard & none and free/reduced & completed groups are not clearly separated in the way the assumption predicts. In conclusion, socio-economic status does not appear to determine whether a student completes the course. The course might be provided for free by the school.

Student Performance: An Exploratory Data Analysis

Muhammad Asadullah Al Ghozi

05/04/2021