Main Question
The main question in this analysis is “What factors significantly influence student performance on exams?” To answer it, several hypotheses are tested.
library(ggplot2)
library(ggpubr)
theme_algoritma <- readRDS("theme_algoritma.rds")
student <- read.csv(file = "data_input/StudentsPerformance.csv")
str(student)
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : chr "female" "female" "female" "male" ...
## $ race.ethnicity : chr "group B" "group C" "group B" "group A" ...
## $ parental.level.of.education: chr "bachelor's degree" "some college" "master's degree" "associate's degree" ...
## $ lunch : chr "standard" "standard" "standard" "free/reduced" ...
## $ test.preparation.course : chr "none" "completed" "none" "none" ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
head(student)
tail(student)
dim(student)
## [1] 1000 8
# Convert gender, race/ethnicity, parental level of education, lunch, and test preparation course to factors
student$gender <- as.factor(student$gender)
student$race.ethnicity <- as.factor(student$race.ethnicity)
student$parental.level.of.education <- as.factor(student$parental.level.of.education)
student$lunch <- as.factor(student$lunch)
student$test.preparation.course <- as.factor(student$test.preparation.course)
str(student)
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
## $ race.ethnicity : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
## $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
## $ lunch : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
## $ test.preparation.course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
Total is a new column that sums the math, reading, and writing scores. It is used to represent each student’s overall performance.
student$total <- student$math.score + student$reading.score + student$writing.score
summary(student)
##     gender    race.ethnicity     parental.level.of.education          lunch
##  female:518   group A: 89    associate's degree:222          free/reduced:355
##  male  :482   group B:190    bachelor's degree :118          standard    :645
##               group C:319    high school       :196
##               group D:262    master's degree   : 59
##               group E:140    some college      :226
##                              some high school  :179
##  test.preparation.course   math.score     reading.score    writing.score
##  completed:358           Min.   :  0.00   Min.   : 17.00   Min.   : 10.00
##  none     :642           1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75
##                          Median : 66.00   Median : 70.00   Median : 69.00
##                          Mean   : 66.09   Mean   : 69.17   Mean   : 68.05
##                          3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00
##                          Max.   :100.00   Max.   :100.00   Max.   :100.00
##      total
##  Min.   : 27.0
##  1st Qu.:175.0
##  Median :205.0
##  Mean   :203.3
##  3rd Qu.:233.0
##  Max.   :300.0
Interpretation from Summary:
1. The proportions of male and female students are close: there are 518 female and 482 male students.
2. By ethnicity, the largest group is group C (319 students) and the smallest is group A (89).
3. The most common parental level of education is some college (226), and the least common is a master’s degree (59).
4. The lunch category is dominated by standard (645) rather than free/reduced (355).
5. The majority of students did not complete the test preparation course: only 358 students completed it, while the other 642 did not.
From the summary above, we can draw some insights about the students’ scores on the math, reading, and writing tests.
1. Math:
a. The minimum score is 0, and the maximum score achieved is 100. For the score of 0, I assume that the student did not take the test.
b. The mean (66.09) and median (66) are nearly identical.
2. Reading:
a. Reading has a minimum score of 17 and a maximum of 100. Its mean (69.17) is slightly higher than the math mean (66.09).
b. The median (70) is slightly higher than the mean.
3. Writing:
a. The lowest writing score is 10, which is below the minimum reading score (17) but above the minimum math score (0).
b. The mean (68.05) and median (69) are quite close, and the maximum score is 100.
4. Total score:
a. The highest total score is 300, meaning that at least one student scored 100 on every test. The lowest is 27, which equals the sum of the three minimum scores (0 + 17 + 10), so I guess it comes from a single student who had the lowest score on every test.
b. The mean (203.3) and median (205) are quite similar.
Before going further, I want to set out some hypotheses for the main question: what factors influence student performance? The basic hypothesis in this analysis is: “All variables significantly influence student performance on the test”. However, many factors and much existing research bear on this same question. The main hypothesis is therefore broken down into the derivative hypotheses below:
Demography
For demography, I use gender and race/ethnicity. The hypotheses below are supported by scientific publications.
H1A: Male students perform better than female students on math
H2A: Female students perform better than male students on verbal tests (reading and writing)
Race/ethnicity is not tested formally, but it is still used to add further explanation.
Further reading: Balart, P., Oosterveen, M. Females show more sustained performance during test-taking than males. Nat Commun 10, 3798 (2019). https://doi.org/10.1038/s41467-019-11691-y
Socio-Economic Factor
For the socio-economic factor, parental level of education and lunch type are used as indicators of a student’s socio-economic status. A higher parental level of education is assumed to mean a better salary, and a standard lunch is assumed to indicate a middle- to upper-class background.
H3A: Parental level of education significantly affects student performance.
Further reading:
Preparation Factor
H5A: Students who complete the test preparation course before the test perform better than those who do not join or finish it.
To test the hypotheses, this analysis uses Student’s t-test and the ANOVA test. The confidence level is 95%, so the significance level is alpha = 0.05.
For each hypothesis, the following steps are applied to the dataset:
1. Subset the data and save as new dataset
2. Perform t-test or ANOVA test
3. Visualize
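The steps above can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: the `toy` data, its `group` and `score` columns, and the `reject_h0` name are hypothetical and do not come from StudentsPerformance.csv. The sketch covers steps 1 and 2 plus the decision rule of comparing the p-value with alpha; the visualization step is left to the plotting calls used in this analysis.

```r
# Toy data: hypothetical scores, NOT taken from StudentsPerformance.csv
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 6)),
  score = c(61, 65, 70, 58, 66, 72, 75, 80, 69, 78, 84, 77)
)

alpha <- 0.05  # significance level used throughout the analysis

# Step 1: subset the columns needed for the hypothesis
sub <- toy[, c("group", "score")]

# Step 2: pooled-variance two-sample t-test (same call style as the analysis)
res <- t.test(score ~ group, data = sub, var.equal = TRUE)

# Decision rule: reject the null hypothesis when the p-value is below alpha
reject_h0 <- res$p.value < alpha
print(res$p.value)
print(reject_h0)
```

With `var.equal = TRUE` the degrees of freedom are n1 + n2 - 2, which is why the tests on the full dataset report df = 998.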
# H1A: Male perform better than female on math
h1a <- student[, c(1, 6)]
test1 <- t.test(math.score ~ gender, data = h1a, var.equal = TRUE)
print(test1)
##
## Two Sample t-test
##
## data: math.score by gender
## t = -5.3832, df = 998, p-value = 9.12e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.952285 -3.237737
## sample estimates:
## mean in group female mean in group male
## 63.63320 68.72822
Interpretation:
The t-test gives t = -5.38 with a p-value of 9.12e-08. Because the p-value is below alpha, the means of the male and female groups differ significantly: the male mean (68.73) is significantly higher than the female mean (63.63) on the math test. This finding supports H1A.
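H1A is a directional hypothesis, while the test reported above is two-sided. A one-sided version could look like the sketch below. It runs on hypothetical toy scores (not the real dataset) and relies on the fact that `t.test()` with a formula compares the first factor level minus the second, so with the levels ordered `female`, `male`, the option `alternative = "less"` tests whether the female mean is lower than the male mean.

```r
# Hypothetical toy scores, not the real dataset
toy <- data.frame(
  gender = factor(rep(c("female", "male"), each = 5)),
  math.score = c(60, 64, 58, 66, 62, 70, 68, 73, 65, 74)
)

# Two-sided test, as in the analysis
two_sided <- t.test(math.score ~ gender, data = toy, var.equal = TRUE)

# One-sided test: H1 is mean(female) - mean(male) < 0
one_sided <- t.test(math.score ~ gender, data = toy,
                    var.equal = TRUE, alternative = "less")

print(two_sided$p.value)
print(one_sided$p.value)
```

When the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one, so a result that is significant in the two-sided test stays significant in the one-sided test.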
# H2A: Female perform better than male on verbal test (reading and writing)
h2a <- student[, c(1, 7, 8)]
test2 <- t.test(reading.score + writing.score ~ gender, data = h2a, var.equal = TRUE)
test2
##
## Two Sample t-test
##
## data: reading.score + writing.score by gender
## t = 9.089, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 12.77378 19.80833
## sample estimates:
## mean in group female mean in group male
## 145.0753 128.7842
test3 <- t.test(reading.score ~ gender, data = h2a, var.equal = TRUE)
test3
##
## Two Sample t-test
##
## data: reading.score by gender
## t = 7.9593, df = 998, p-value = 4.681e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.375946 8.894212
## sample estimates:
## mean in group female mean in group male
## 72.60811 65.47303
test4 <- t.test(writing.score ~ gender, data = h2a, var.equal = TRUE)
test4
##
## Two Sample t-test
##
## data: writing.score by gender
## t = 9.9796, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.35558 10.95638
## sample estimates:
## mean in group female mean in group male
## 72.46718 63.31120
Interpretation:
Tests 2 to 4 compute the p-value for each verbal comparison.
Test 2 compares the combined reading and writing score between genders; Tests 3 and 4 compare the reading score and the writing score separately. Comparing each p-value with alpha, the means differ significantly between the male and female groups in Tests 2, 3, and 4.
In Tests 2, 3, and 4, the female mean is higher than the male mean. Therefore, H2A is accepted: female students perform better on the verbal tests.
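Because several tests are run on the same gender split, one optional safeguard is to adjust the p-values for multiple comparisons. The sketch below is not part of the original analysis. It applies Holm's correction via `p.adjust()` to the four p-values reported so far, and it assumes 2.2e-16 as a stand-in upper bound wherever the output only says `< 2.2e-16`.

```r
alpha <- 0.05

# p-values copied from the reported outputs of Tests 1 to 4;
# 2.2e-16 is an assumed upper bound for the "< 2.2e-16" results
p_reported <- c(test1 = 9.12e-08, test2 = 2.2e-16,
                test3 = 4.681e-15, test4 = 2.2e-16)

# Holm's step-down adjustment controls the family-wise error rate
p_holm <- p.adjust(p_reported, method = "holm")

print(p_holm)
print(p_holm < alpha)
```

Under these assumptions all four adjusted p-values remain far below alpha, so the conclusions for H1A and H2A would not change.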
ggboxplot(h1a, x = "gender", y = "math.score",
color = "gender", palette = c("#00AFBB", "#E7B800"),
ylab = "Math Score", xlab = "Gender") +
labs(title = "Math Score between Male and Female Student") +
theme_algoritma
h4a <- student[, c(1, 9)]
test5 <- t.test(total ~ gender, data = h4a, var.equal = TRUE)
test5
##
## Two Sample t-test
##
## data: total by gender
## t = 4.1699, df = 998, p-value = 3.312e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.927234 16.464858
## sample estimates:
## mean in group female mean in group male
## 208.7085 197.5124
Interpretation:
To examine which gender performs better across all tests combined (math, reading, writing), Test 5 compares the total score by gender. The p-value is 3.312e-05, and the female mean (208.71) is higher than the male mean (197.51). In conclusion, female students perform better than male students on the total score, even though male students score higher on math.
ggboxplot(student, x = "race.ethnicity", y = "total",
color = "gender", palette = c("#00AFBB", "#E7B800"),
ylab = "Total Score", xlab = "Race/Ethnicity") +
labs(title = "Total Score based on Ethnicity") +
theme_algoritma
# H3A: Student's parent level of education significantly affect student's performance.
h5a <- student[, c(3, 4, 9)]
# Lunch Category
test6 <- t.test(total ~ lunch, data = h5a, var.equal = TRUE)
test6
##
## Two Sample t-test
##
## data: total by lunch
## t = -9.5751, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -31.22541 -20.60348
## sample estimates:
## mean in group free/reduced mean in group standard
## 186.5972 212.5116
Interpretation
Test 6 compares the total score across lunch categories. Lunch category is used as a proxy for whether a student comes from a higher or lower socio-economic class. Test 6 confirms that the mean total score of students with standard lunch (212.51) is higher than that of students with free/reduced lunch (186.60). In addition, Test 7 shows that parental level of education affects student performance as well. In conclusion, hypothesis H3A is accepted.
# Parental level of education
test7 <- oneway.test(total ~ parental.level.of.education, data = h5a)
test7
##
## One-way analysis of means (not assuming equal variances)
##
## data: total and parental.level.of.education
## F = 10.882, num df = 5.00, denom df = 339.03, p-value = 1.006e-09
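Note that `oneway.test()` defaults to `var.equal = FALSE` (Welch's ANOVA), whereas the t-tests in this analysis use `var.equal = TRUE`. A significant F statistic also does not say which education levels differ. The sketch below, on hypothetical toy data with three made-up levels, shows both variants of the test and one possible follow-up with `TukeyHSD()`. It is an illustration, not the author's method.

```r
# Hypothetical toy data with three education levels (not the real dataset)
toy <- data.frame(
  education = factor(rep(c("high school", "bachelor", "master"), each = 5)),
  total = c(180, 175, 190, 185, 170,
            200, 210, 195, 205, 215,
            220, 230, 225, 240, 235)
)

# Welch's one-way test (default, does not assume equal variances)
welch <- oneway.test(total ~ education, data = toy)

# Classical one-way ANOVA (assumes equal variances)
classic <- oneway.test(total ~ education, data = toy, var.equal = TRUE)

# The same classical ANOVA via aov(), which TukeyHSD() needs as input
fit <- aov(total ~ education, data = toy)

# Pairwise comparisons between levels with family-wise error control
tukey <- TukeyHSD(fit)
print(tukey)
```

Each row of the Tukey table is one pair of levels with its mean difference, confidence interval, and adjusted p-value.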
ggplot(h5a, aes(x = lunch, y = total, fill = lunch))+
geom_boxplot()+
scale_fill_viridis_d(alpha = 0.6) +
geom_jitter(color="black", size=0.9, alpha=0.2) +
theme_algoritma +
theme(
# legend.position="none",
plot.title = element_text(size=14)
) +
ggtitle("Boxplot Lunch and Total Score Comparison") +
xlab("Lunch Category") +
ylab("Total")
ggplot(h5a, aes(x = parental.level.of.education, y = total, fill = lunch))+
geom_boxplot()+
scale_fill_viridis_d(alpha = 0.6) +
geom_jitter(color="black", size=0.9, alpha=0.2) +
theme_algoritma +
theme(
# legend.position="none",
plot.title = element_text(size=14)
) +
ggtitle("Boxplot Lunch, Parent's Education and Total Score Comparison") +
xlab("Level of Education") +
ylab("Total")
h6a <- student[, c(5, 9)]
test8 <- t.test(total ~ test.preparation.course, data = h6a, var.equal = TRUE)
test8
##
## Two Sample t-test
##
## data: total by test.preparation.course
## t = 8.3909, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 17.53804 28.24508
## sample estimates:
## mean in group completed mean in group none
## 218.0084 195.1168
Interpretation:
Test 8 determines which group (completed/none) performs better on the test. The mean total score of the group that completed the course is 218, compared with only 195 for the group that did not. With a p-value smaller than 2.2e-16, the decision is to accept the alternative hypothesis H5A: completing the preparation course before the test significantly affects student performance.
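With 1000 observations, even modest differences yield very small p-values, so an effect size helps to judge how large the difference is. For a pooled-variance t-test, Cohen's d can be recovered from the reported statistic as d = t * sqrt(1/n1 + 1/n2). The sketch below uses only numbers printed in this analysis (t from Test 8, group sizes from the summary); the variable names are my own.

```r
# Values copied from the reported output of Test 8 and summary(student)
t_stat      <- 8.3909
n_completed <- 358
n_none      <- 642

# Cohen's d for a pooled-variance two-sample t-test
cohens_d <- t_stat * sqrt(1 / n_completed + 1 / n_none)

# Implied pooled standard deviation: mean difference divided by d
mean_diff <- 218.0084 - 195.1168
pooled_sd <- mean_diff / cohens_d

print(round(cohens_d, 2))
print(round(pooled_sd, 1))
```

This gives a d of roughly 0.55, which is commonly read as a medium-sized effect.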
ggplot(h6a, aes(x = test.preparation.course, y = total))+
geom_boxplot()+
scale_fill_viridis_d(alpha = 0.6) +
geom_jitter(color="black", size=0.9, alpha=0.2) +
theme_algoritma +
theme(
# legend.position="none",
plot.title = element_text(size=14)
) +
ggtitle("Boxplot Test Preparation Course and Total Score Comparison") +
xlab("Test Preparation Course") +
ylab("Total")
ggplot(h6a, aes(x=total, color=test.preparation.course, fill=test.preparation.course)) +
geom_density(alpha=0.6) +
labs(title = "Density Plot based on Preparation Course",
x = "Total Score",
y = "") +
theme_algoritma
ggplot(student, aes(x=math.score, y=reading.score)) +
geom_point() +
labs(title = "Math Score x Reading Score") +
geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
theme_algoritma
## `geom_smooth()` using formula 'y ~ x'
ggplot(student, aes(x=math.score, y=writing.score)) +
geom_point() +
labs(title = "Math Score x Writing Score") +
geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
theme_algoritma
## `geom_smooth()` using formula 'y ~ x'
ggplot(student, aes(x=reading.score, y=writing.score)) +
geom_point() +
labs(title = "Reading Score x Writing Score") +
geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
theme_algoritma
## `geom_smooth()` using formula 'y ~ x'
h7a <- student[, c(4, 5, 9)]
table1 <- table(h7a$lunch, h7a$test.preparation.course)
as.data.frame.matrix(table1)
Looking at the p-values from the tests above, we can answer the main question. For the demographic characteristics, gender and race, each gender has its strengths and weaknesses by test category. Male students perform better than female students on the math test, yet female students do better on the verbal tests (writing and reading). Further testing of the total score by gender shows that, overall, female students perform better than male students. Race was not used as a factor because of its uniqueness. All race groups perform quite similarly on the tests, although some groups have lower total scores. This interpretation does not mean that one race is better than another.
Socio-economic status is also a significant factor. Using lunch category and parental level of education, the tests show that students from a middle-to-high socio-economic level perform better on the test. Lastly, the third factor, the test preparation course (completed/none), significantly affects students’ scores. Students who completed the course before the test achieved higher scores than those who did not join or complete it. Moreover, judging from the scatterplots and trend lines, the three scores appear to be strongly related to one another.
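The relationship between the scores is read off the scatterplots; a correlation coefficient would put a number on it. The sketch below shows how `cor()` and `cor.test()` could be used for that. The `toy` scores are hypothetical and only mimic the column names of the dataset, so the printed values say nothing about the real data.

```r
# Hypothetical toy scores that reuse the dataset's column names
toy <- data.frame(
  math.score    = c(47, 55, 62, 68, 71, 76, 83, 90),
  reading.score = c(52, 60, 61, 72, 70, 80, 85, 93),
  writing.score = c(50, 58, 63, 70, 69, 79, 86, 94)
)

# Pearson correlation matrix for all pairs of scores
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# Significance test for one pair (math vs. reading)
ct <- cor.test(toy$math.score, toy$reading.score)
print(ct$estimate)
print(ct$p.value)
```

On the real data the same calls would be applied to `student[, c("math.score", "reading.score", "writing.score")]`.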
From the socio-economic and total score analysis, I wondered whether the test preparation course is related to students’ socio-economic level. Therefore, a contingency table was built that shows the number of students by lunch category and test preparation course. If my assumption were correct, two major groups would form: free/reduced lunch with an incomplete course (assuming that the course is paid, so that students from a low economic level could not afford it), and standard lunch with a completed course. The table shows that the largest group has standard lunch (high socio-economic level) but did not complete the preparation course. The counts for standard lunch & none and for free/reduced lunch & completed are not significantly distinguished. In conclusion, socio-economic status does not determine whether a student completes the course. The course might be provided free of charge by the school.
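The comparison of cell counts above is made by eye; a chi-squared test of independence is one way to check it formally. The sketch below is an assumption-laden illustration: the cell counts are illustrative values that merely match the marginal totals printed by `summary(student)` (355/645 for lunch, 358/642 for the course), since the actual cells of `table1` are not shown here.

```r
# Illustrative cell counts, chosen only to match the reported margins;
# on the real data, use table1 instead of this matrix
tab <- matrix(c(131, 227, 224, 418), nrow = 2,
              dimnames = list(lunch  = c("free/reduced", "standard"),
                              course = c("completed", "none")))

# Share of students completing the course within each lunch category
completion_rate <- prop.table(tab, margin = 1)
print(round(completion_rate, 3))

# Chi-squared test of independence between lunch and course completion
res <- chisq.test(tab)
print(res$p.value)
```

With these illustrative counts the completion rates are nearly equal in both lunch categories and the test does not reject independence, which would be in line with the conclusion above. The actual verdict depends on the real cells of `table1`.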