Mun Kar Yenc Natalie s3991774, Livia Nathania Fireta s3980951, Qingqing Wang s3886626
2023-05-23
This dataset is used to test the various factors affecting a student’s test score in the subjects of math, reading and writing.
According to the study by Briggs (2001), test preparation had its largest effect on the math score. This study therefore concentrates on the effect of the test preparation course on math score.
A two-sample independent t-test is carried out to determine whether there is a significant mean difference in math score between the students who completed the test preparation course and students with no test preparation course.
The descriptive statistics of the dataset are summarised.
The missing values and outliers are checked.
The assumptions of the two-sample independent t-test, namely normality and homogeneity of variance, are checked using Q-Q plots and Levene’s test.
## Rows: 1,000
## Columns: 8
## $ gender <chr> "female", "female", "female", "male", "m…
## $ `race/ethnicity` <chr> "group B", "group C", "group B", "group …
## $ `parental level of education` <chr> "bachelor's degree", "some college", "ma…
## $ lunch <chr> "standard", "standard", "standard", "fre…
## $ `test preparation course` <chr> "none", "completed", "none", "none", "no…
## $ `math score` <dbl> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, …
## $ `reading score` <dbl> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, …
## $ `writing score` <dbl> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, …
stud_perf_exam$`test preparation course` <- stud_perf_exam$`test preparation course` %>%
  factor(levels = c("none", "completed"))
Data Source: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
The data were taken from a high school in the United States.
Important Variables:
gender - This variable consists of the gender of the students who took the tests in this dataset.
test preparation course - This variable states whether each student completed the test preparation course or did not take it.
math score - This variable states the score of the math test taken by each student in this dataset.
reading score - This variable states the score of the reading test taken by each student in this dataset.
writing score - This variable states the score of the writing test taken by each student in this dataset.
Preprocessing The Dataset:
The dataset is imported into RMarkdown using the read_csv() function. The structure and data type of each variable in the dataset are inspected using the str() function. The key grouping variable, test preparation course, is converted to a factor using factor(), stating the levels explicitly (none, completed) without ordering them.
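The factor conversion described above can be sketched on a small stand-in vector. The values below are illustrative only and are not taken from the dataset; the point is that listing the levels explicitly makes none the first level while keeping the factor unordered. The level order matters later, because the formula interface of t.test() reports the first level's mean minus the second's.

```r
# Toy stand-in for the test preparation column (illustrative values only).
prep <- c("none", "completed", "none", "none", "completed")

# Explicit, unordered levels: "none" becomes the first (reference) level.
prep_factor <- factor(prep, levels = c("none", "completed"))

print(levels(prep_factor))      # level order as declared
print(is.ordered(prep_factor))  # unordered factor
print(table(prep_factor))       # counts per level
```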
math_testprep <- stud_perf_exam %>%
  group_by(`test preparation course`) %>%
  summarise(Min = min(`math score`, na.rm = TRUE),
            Q1 = quantile(`math score`, probs = .25, na.rm = TRUE),
            Median = median(`math score`, na.rm = TRUE),
            Q3 = quantile(`math score`, probs = .75, na.rm = TRUE),
            Max = max(`math score`, na.rm = TRUE),
            Mean = mean(`math score`, na.rm = TRUE),
            SD = sd(`math score`, na.rm = TRUE),
            IQR = IQR(`math score`, na.rm = TRUE),
            # Range is the full spread (Max - Min); Q3 - Q1 is already reported as IQR
            Range = Max - Min,
            n = n())
knitr::kable(math_testprep)

| test preparation course | Min | Q1 | Median | Q3 | Max | Mean | SD | IQR | Range | n |
|---|---|---|---|---|---|---|---|---|---|---|
| none | 0 | 54 | 64 | 74.75 | 100 | 64.07788 | 15.19238 | 20.75 | 100 | 642 |
| completed | 23 | 60 | 69 | 79.00 | 100 | 69.69553 | 14.44470 | 19.00 | 77 | 358 |
The descriptive statistics for math score are summarised by test preparation course group using the group_by() and summarise() functions. The maximum math score is 100 in both the none and completed groups. This might indicate that students who excel in their studies can score 100 whether or not they have taken the course.
In contrast, the minimum score is lower in the group that did not take the test preparation course (0) than in the group that completed it (23).
On the descriptive statistics alone, the minimum, quartiles, median and mean are all higher for students who completed the test preparation course, while the maximum is the same in both groups and the SD and IQR are slightly smaller for the completed group.
Within each test preparation course group, the mean and median are close to each other. Furthermore, the IQR is approximately 1.33σ and the range (maximum minus minimum) is approximately 6σ. This indicates that the math scores in each group might be normally distributed.
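The rule-of-thumb check above can be reproduced from the summary table alone. The sketch below copies the Min, Max, SD and IQR values from the table and compares the ratios with what a normal distribution would give; the object names (stats, normal_iqr_ratio) are chosen here for illustration.

```r
# Summary statistics copied from the descriptive table above.
stats <- data.frame(
  group = c("none", "completed"),
  min   = c(0, 23),
  max   = c(100, 100),
  sd    = c(15.19238, 14.44470),
  iqr   = c(20.75, 19.00)
)

# For a normal distribution, IQR / SD equals qnorm(0.75) - qnorm(0.25),
# which is roughly 1.35.
normal_iqr_ratio <- qnorm(0.75) - qnorm(0.25)

# Observed ratios in each group.
stats$iqr_over_sd   <- stats$iqr / stats$sd
stats$range_over_sd <- (stats$max - stats$min) / stats$sd

print(normal_iqr_ratio)
print(stats[, c("group", "iqr_over_sd", "range_over_sd")])
```

Both IQR ratios land close to the normal benchmark, and the range ratios fall on either side of 6, which is consistent with the informal conclusion in the text.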
There are 642 students with no test preparation course, far more than the 358 students who completed the course.
## gender race/ethnicity
## 0 0
## parental level of education lunch
## 0 0
## test preparation course math score
## 0 0
## reading score writing score
## 0 0
Missing values are scanned by each column of the dataset using the colSums() and is.na() functions. No missing values are detected from the dataset. Therefore, no further actions are undertaken.
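As a minimal illustration of the colSums() and is.na() check described above, the sketch below uses a small made-up data frame with one deliberately missing value (the report's dataset has none). The names scores, missing_per_column and no_missing are hypothetical.

```r
# Toy data frame with one deliberately missing value (illustrative only).
scores <- data.frame(
  math    = c(72, 69, NA, 47),
  reading = c(72, 90, 95, 57)
)

# is.na() flags each missing cell; colSums() counts the flags per column.
missing_per_column <- colSums(is.na(scores))
print(missing_per_column)

# TRUE only when every column is complete.
no_missing <- all(missing_per_column == 0)
print(no_missing)
```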
Generally, the outliers fall below the lower boundary of the boxplots. There are no outliers above 100, which makes sense because marks are bounded between 0 and 100.
Therefore, the outliers detected from the boxplot are reasonable and are not abnormal or due to errors.
It can be observed from the boxplot that there are more outliers in the group with no test preparation course than in the group that completed the course.
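The boxplot observations can be cross-checked against the quartiles in the descriptive table, assuming the usual 1.5 × IQR rule for flagging outliers. This is a sketch under that assumption: boxplot() computes its hinges slightly differently from quantile(), so the fences below may differ marginally from the ones drawn in the plot.

```r
# Quartiles copied from the descriptive table above.
q1  <- c(none = 54, completed = 60)
q3  <- c(none = 74.75, completed = 79)
iqr <- q3 - q1

# 1.5 * IQR fences: points outside [lower, upper] are flagged as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(lower_fence)
print(upper_fence)

# Both upper fences exceed the maximum possible mark of 100,
# so no score can be flagged as a high outlier.
print(all(upper_fence > 100))
```

The lower fences (22.875 and 31.5) sit above the group minimums of 0 and 23, which agrees with the outliers appearing only at the low end.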
## [1] 43 632
## [1] 299 238
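The data points fall within the expected range in the Q-Q plots, so it is reasonable to assume that the math scores in both groups are approximately normally distributed. With 642 and 358 students in the two groups, the t-test is also robust to mild departures from normality because of the central limit theorem.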
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.533 0.4655
## 998
Levene’s test is used to compare the variance of math scores between the none and completed test preparation course groups. Its null hypothesis is that the two groups have equal variances, and its p-value is compared to the standard 0.05 significance level.
The p-value for Levene’s test for math score is p = 0.47, which is greater than the 0.05 significance level. We fail to reject the null hypothesis of equal variances, so it is reasonable to assume equal variance for the math score.
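To show what Levene’s test with center = median computes, the sketch below rebuilds it in base R as a one-way ANOVA on the absolute deviations of each score from its group median. The scores are simulated with arbitrary means and standard deviations and are not the report's data, so the F value and p-value will not match the output above; only the group sizes (642 and 358) are taken from the report.

```r
set.seed(1)  # simulated scores, NOT the report's data

# Group labels with the same sizes as in the report.
group <- factor(rep(c("none", "completed"), times = c(642, 358)),
                levels = c("none", "completed"))

# Assumed means and SDs, chosen only for illustration.
score <- c(rnorm(642, mean = 64, sd = 15), rnorm(358, mean = 70, sd = 15))

# Absolute deviation of each score from its own group median.
abs_dev <- abs(score - ave(score, group, FUN = median))

# Levene's test (median-centred) is a one-way ANOVA on these deviations.
levene_table <- anova(lm(abs_dev ~ group))
print(levene_table)
```

The degrees of freedom (1 and 998) match those in the reported Levene output because they depend only on the number of groups and students.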
The two-sample t-test has the following hypotheses:
\[ \begin{aligned} \mu_1 &= \text{The mean math score of students who completed the test preparation course}\\ \mu_2 &= \text{The mean math score of students who did not complete the test preparation course} \end{aligned} \]
\[ \begin{aligned} H_0&: \mu_1 = \mu_2 \\ H_A&: \mu_1 \neq \mu_2 \end{aligned} \]
Null Hypothesis (H0): The difference between the mean math scores of the none and completed test preparation course groups is zero.
Alternative Hypothesis (HA): The difference between the mean math scores of the none and completed test preparation course groups is not zero.
t.test(`math score` ~ `test preparation course`,
       data = stud_perf_exam,
       var.equal = TRUE,
       alternative = "two.sided")
##
## Two Sample t-test
##
## data: math score by test preparation course
## t = -5.7046, df = 998, p-value = 1.536e-08
## alternative hypothesis: true difference in means between group none and group completed is not equal to 0
## 95 percent confidence interval:
## -7.550077 -3.685221
## sample estimates:
## mean in group none mean in group completed
## 64.07788 69.69553
Since it is reasonable to assume equal variances for math scores, the var.equal argument of the t.test() function is set to TRUE. Because none is the first factor level, the reported difference is none minus completed: 64.07788 - 69.69553 = -5.61765, so students who completed the course scored about 5.62 points higher on average. We are 95% confident that the difference in means (none minus completed) is between -7.55 and -3.69, an interval that excludes zero. The test gives t(998) = -5.70 with a p-value of 1.54e-08, which is lower than 0.05, so we reject the null hypothesis. There is a statistically significant difference between the two groups in terms of mean math score.
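The reported t statistic and confidence interval can be verified by hand from the group sizes, means and standard deviations in the descriptive table. The sketch below applies the standard pooled-variance formulas; the variable names are chosen here for illustration.

```r
# Group sizes, means and SDs copied from the descriptive table.
n <- c(none = 642, completed = 358)
m <- c(none = 64.07788, completed = 69.69553)
s <- c(none = 15.19238, completed = 14.44470)

# Pooled standard deviation and standard error of the difference.
deg_free  <- sum(n) - 2
pooled_sd <- sqrt(sum((n - 1) * s^2) / deg_free)
std_error <- pooled_sd * sqrt(sum(1 / n))

# Difference is first level minus second level (none - completed),
# matching the direction used by t.test().
mean_diff <- unname(m["none"] - m["completed"])
t_stat    <- mean_diff / std_error
p_value   <- 2 * pt(-abs(t_stat), deg_free)
ci        <- mean_diff + c(-1, 1) * qt(0.975, deg_free) * std_error

print(c(t = t_stat, df = deg_free, p = p_value))
print(ci)
```

Up to rounding of the table values, this reproduces t = -5.70 on 998 degrees of freedom and the interval from about -7.55 to -3.69.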
A two-sample independent t-test was conducted to test whether mean math score differs between the test preparation course groups.
The test shows that the difference in mean math score between students who completed and did not complete the test preparation course is statistically significant. Students who completed the course scored on average about 5.62 points higher than students who did not.
The strength: The dataset has a large sample size (1,000 students), which gives a more precise estimate of each group mean.
The limitation: The two-sample t-test can only be applied to variables with exactly two categories/levels. For example, in this dataset it cannot be applied to the parental level of education, which has more than two categories.
Direction for future research: Future research should use another statistical test to accommodate variables with more than two categories, such as the ANOVA test. It could also apply the paired-sample t-test to compare the math scores of the same group of students before and after the test preparation course.
Take home message: The statistically significant result suggests that the test preparation course could potentially help students who are weak in math. Analysing datasets such as this one could help schools and policymakers in the education sector make better teaching plans and decisions for future students.
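As a rough illustration of the ANOVA direction mentioned above, the sketch below fits a one-way ANOVA to simulated scores for a three-level grouping variable. The level names (level A, level B, level C), the group sizes and the score distribution are all hypothetical and are not drawn from the dataset.

```r
set.seed(42)  # simulated data, NOT the report's dataset

# Hypothetical three-level factor standing in for a multi-category variable.
education <- factor(rep(c("level A", "level B", "level C"), each = 50))

# Simulated math scores with an assumed mean and SD.
math <- rnorm(150, mean = 66, sd = 15)

# One-way ANOVA: tests whether the group means are all equal.
fit         <- aov(math ~ education)
anova_table <- summary(fit)[[1]]
print(anova_table)
```

With three groups the test has 2 and 147 degrees of freedom, which is how ANOVA accommodates more than the two categories a t-test allows.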
Briggs DC (2001) ‘The Effect of Admissions Test Preparation: Evidence from NELS:88’, Chance, 14(1):10-18.
Seshapanpu J (2018) Students Performance in Exam, Kaggle website, accessed 19 May 2023. https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
Tafakori L (2023) ‘Sampling: Randomly Representative’ [Course Module, MATH1324], RMIT University, Melbourne.
Tafakori L (2023) ‘Testing the Null: Data on Trial’ [Course Module, MATH1324], RMIT University, Melbourne.