Laine Heidenreich s3866463 and Klara Vickov s3873315
Last updated: 24 October, 2020
Rpubs link is as follows: www
The data set was found on the Kaggle website and contains the marks obtained by high school students in the United States. The Kaggle page does not explain how the data were collected and gives no background information about the data. The purpose of the data is to help show how a variety of factors can influence student performance in exams.
The question driving this investigation is whether females or males are more likely to complete a test preparation course. To answer the question, we first preprocessed the data so that it was ready for analysis. We then used the Chi-square Test of Association, because both variables of interest, gender and test.preparation.course, are categorical, to see whether there was an association between them.
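The data set was found on the Kaggle website at the following URL: https://www.kaggle.com/spscientist/students-performance-in-exams. The data set includes the following 8 variables: gender, race.ethnicity, parental.level.of.education, lunch, test.preparation.course, math.score, reading.score and writing.score.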
The data set was imported into R using the read.csv() function as shown below. The first six rows of the data set were displayed using head().
StudentsPerformance <-
read.csv("D:/RMIT/Semester 2 2020/MATH1324/Assessments/StudentsPerformance.csv")
print.data.frame(head(StudentsPerformance))
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
To get the data ready for analysis, the str() function was used to check the data type of each variable. As shown below, the data set contains 1000 observations of 8 variables. The first five variables are character and the last three are integer.
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : chr "female" "female" "female" "male" ...
## $ race.ethnicity : chr "group B" "group C" "group B" "group A" ...
## $ parental.level.of.education: chr "bachelor's degree" "some college" "master's degree" "associate's degree" ...
## $ lunch : chr "standard" "standard" "standard" "free/reduced" ...
## $ test.preparation.course : chr "none" "completed" "none" "none" ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
The gender and test.preparation.course variables were converted from character to factor using the as.factor() function, and the levels() function was used to show the levels of each factor. They were converted to factors because these are the two variables involved in the hypothesis test.
StudentsPerformance$gender <- as.factor(StudentsPerformance$gender)
levels(StudentsPerformance$gender)
## [1] "female" "male"
StudentsPerformance$test.preparation.course <-
as.factor(StudentsPerformance$test.preparation.course)
levels(StudentsPerformance$test.preparation.course)
## [1] "completed" "none"
The math.score, reading.score and writing.score variables are numeric, and each is assumed to be a percentage score for the corresponding test.
To check whether there were any missing values, sum(is.na()) was used. As shown below, there were no missing values in the data set.
## [1] 0
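The missing-value check described above can be sketched as follows. This is a minimal illustration, not the original code: the small data frame `demo` is a stand-in built from the first six rows shown in the head() output (three columns only), and the names `demo`, `total_missing` and `missing_by_column` are assumptions introduced here.

```r
# Stand-in data: first six rows of three columns, copied from the head() output above
demo <- data.frame(
  gender = c("female", "female", "female", "male", "male", "female"),
  test.preparation.course = c("none", "completed", "none", "none", "none", "none"),
  math.score = c(72, 69, 90, 47, 76, 71)
)

# is.na() returns a logical matrix; summing it counts every missing cell
total_missing <- sum(is.na(demo))

# colSums() gives the same count broken down by variable
missing_by_column <- colSums(is.na(demo))

print(total_missing)
print(missing_by_column)
```

The per-column breakdown is useful when the total is not zero, because it shows which variable needs attention.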
The summary() function was used to provide the number of females and males in the data set and the number of students who completed or did not complete the test preparation course.
## female male
## 518 482
## completed none
## 358 642
A table containing the variables in question was created using the table() and prop.table() functions as shown below. This table was then used to create a barplot using the barplot() function to visualise the relationship between gender and test preparation course.
library(magrittr)  # provides the %>% pipe used below
# Column proportions: share of each gender that did / did not complete the course
prop_table <-
  table(StudentsPerformance$test.preparation.course, StudentsPerformance$gender) %>%
  prop.table(margin = 2)
knitr::kable(prop_table)

| | female | male |
|---|---|---|
| completed | 0.3552124 | 0.3609959 |
| none | 0.6447876 | 0.6390041 |

# Grouped bar chart of the column proportions; args.legend expects a list
barplot(prop_table, ylab = "Proportion within group", ylim = c(0, .9),
        legend = rownames(prop_table), beside = TRUE,
        args.legend = list(x = "top", horiz = TRUE, title = "Test Preparation Course"),
        xlab = "Gender", col = c("#99d8c9", "#2ca25f"), border = "#69b3a2")
The two variables in question contain no outliers as they are categorical variables.
The Null Hypothesis is as follows:
\[H_0: \text{There is no association in the population between gender and completion of a test preparation course.}\]
The Alternative Hypothesis is as follows:
\[H_A: \text{There is an association in the population between gender and completion of a test preparation course.}\]
The hypothesis test chosen for this data set was the Chi-square Test of Association, as we are interested in the relationship between gender and test preparation course, which are both categorical variables. The first step was to use the chisq.test() function to calculate the chi-squared statistic. For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default, as the output below shows.
chi1 <- chisq.test(
table(StudentsPerformance$gender, StudentsPerformance$test.preparation.course))
chi1
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(StudentsPerformance$gender, StudentsPerformance$test.preparation.course)
## X-squared = 0.015529, df = 1, p-value = 0.9008
The next step was to check the observed and expected values as shown below.
##
## completed none
## female 184 334
## male 174 308
##
## completed none
## female 185.444 332.556
## male 172.556 309.444
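The observed and expected counts above can be reproduced without the CSV file. The following is a minimal sketch that rebuilds the 2 x 2 table from the observed counts reported above and recomputes the expected counts and the Yates-corrected statistic by hand. The object names (`observed`, `chi_sketch`, `expected_by_hand`, `yates_stat`) are illustrative assumptions and are not taken from the original analysis.

```r
# Observed counts copied from the output above (rows: gender, columns: course)
observed <- matrix(
  c(184, 334,
    174, 308),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  course = c("completed", "none"))
)

# chisq.test() applies Yates' continuity correction to 2 x 2 tables by default
chi_sketch <- chisq.test(observed)

# Expected count for each cell = row total * column total / grand total
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates-corrected statistic: sum over cells of (|O - E| - 0.5)^2 / E
yates_stat <- sum((abs(observed - expected_by_hand) - 0.5)^2 / expected_by_hand)

print(chi_sketch$expected)
print(round(expected_by_hand, 3))
print(yates_stat)

# Common rule of thumb: every expected count should be at least 5
print(all(chi_sketch$expected >= 5))
```

For example, the expected count for females who completed the course is 518 x 358 / 1000 = 185.444, which matches the table above.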
Then the qchisq() function was used to find the critical value at the 0.05 significance level with 1 degree of freedom.
## [1] 3.841459
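The critical value above can be obtained as sketched below, assuming a significance level of 0.05 and degrees of freedom equal to (rows - 1) x (columns - 1) for the 2 x 2 table. The names `alpha`, `deg_free` and `critical_value` are illustrative.

```r
alpha <- 0.05                      # significance level used in this report
deg_free <- (2 - 1) * (2 - 1)      # (rows - 1) * (columns - 1) for a 2 x 2 table

# Upper-tail critical value: the 95th percentile of the chi-squared distribution
critical_value <- qchisq(1 - alpha, df = deg_free)
print(critical_value)
```

A test statistic larger than this value would lead to rejecting the Null Hypothesis at the 0.05 level.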
The Chi-squared value (0.015529) was less than the critical value (3.841459). This means we failed to reject the Null Hypothesis.
The pchisq() function was used to find the p-value.
## [1] 0.900828
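The p-value above is the upper-tail probability of the chi-squared distribution at the test statistic. A minimal sketch, using the statistic reported by chisq.test() and illustrative names `chi_stat` and `p_value`:

```r
chi_stat <- 0.015529   # X-squared reported by chisq.test() above

# lower.tail = FALSE returns P(X > chi_stat), i.e. the p-value of the test
p_value <- pchisq(chi_stat, df = 1, lower.tail = FALSE)
print(p_value)

# TRUE means the result is not significant at the 0.05 level
print(p_value > 0.05)
```

Comparing the p-value with 0.05 and comparing the statistic with the critical value are equivalent decisions, so the two approaches always agree.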
The p-value (0.900828) is greater than the significance level of 0.05. Therefore, the result is not statistically significant.
The question posed at the beginning of the investigation was whether females or males were more likely to complete a test preparation course. The Chi-square Test of Association was used to test for an association between gender and test preparation course. The result was not statistically significant (X-squared = 0.015529, df = 1, p-value = 0.9008), so we failed to reject the Null Hypothesis: the data provide no evidence of an association between the two variables. Failing to reject the Null Hypothesis does not prove that no association exists, so the results are inconclusive. The proportion of students who completed the course was almost identical for females (35.5%) and males (36.1%).

The strength of this investigation is its reasonably large sample size (n = 1000). However, it has several limitations: it is not known whether the students attended the same school, the students' ages were not disclosed, and all of the students were from the United States. Future investigations into the association between gender and test preparation course should include further information such as age, school and socioeconomic status to provide a clearer result. In summary, this investigation found no statistically significant association between gender and the completion of a test preparation course. Given the lack of statistical significance and the unknown data collection method, the results cannot be generalised to the wider population.
Data source: https://www.kaggle.com/spscientist/students-performance-in-exams