MATH1324 Applied Analytics

Assignment 2

Laine Heidenreich s3866463 and Klara Vickov s3873315

Last updated: 24 October, 2020

Introduction

The data set was obtained from the Kaggle website and contains the marks achieved by high school students in the United States. The website does not explain how the data were collected and gives no background information about the data. The purpose of the data set is to help provide an understanding of how a variety of factors can influence student performance in exams.

Problem Statement

The question driving this investigation is whether females or males are more likely to complete a test preparation course. To answer the question, we first preprocessed the data so that it was ready for analysis. We then used the Chi-square Test of Association, because both variables of interest, gender and test.preparation.course, are categorical, to see if there was an association between them.

Data

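The data set was found on the Kaggle website at the following URL: https://www.kaggle.com/spscientist/students-performance-in-exams. The data set includes 8 variables: gender, race.ethnicity, parental.level.of.education, lunch, test.preparation.course, math.score, reading.score and writing.score.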

Data Cont.

The data set was imported into R using the read.csv() function as shown below. The first six rows of the data set were displayed using head().

StudentsPerformance <- 
  read.csv("D:/RMIT/Semester 2 2020/MATH1324/Assessments/StudentsPerformance.csv")
print.data.frame(head(StudentsPerformance))
##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78

Data Cont.

To get the data ready for analysis, we first checked the data type of each variable using the str() function. As shown below, the data set contains 1000 observations of 8 variables. The first five variables are character and the last three are integer.

str(StudentsPerformance)
## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : chr  "female" "female" "female" "male" ...
##  $ race.ethnicity             : chr  "group B" "group C" "group B" "group A" ...
##  $ parental.level.of.education: chr  "bachelor's degree" "some college" "master's degree" "associate's degree" ...
##  $ lunch                      : chr  "standard" "standard" "standard" "free/reduced" ...
##  $ test.preparation.course    : chr  "none" "completed" "none" "none" ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...

Data Cont.

The gender and test.preparation.course variables were converted from character to factor using the as.factor() function, and the levels() function was used to show the levels of each factor. These two variables were converted because they are the ones involved in the hypothesis testing.

StudentsPerformance$gender <- as.factor(StudentsPerformance$gender)
levels(StudentsPerformance$gender)
## [1] "female" "male"
StudentsPerformance$test.preparation.course <- 
  as.factor(StudentsPerformance$test.preparation.course)
levels(StudentsPerformance$test.preparation.course)
## [1] "completed" "none"
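
As a hedged aside, the same conversion can be applied to several columns in one step. The sketch below uses a small made-up data frame called toy (a hypothetical stand-in, not the Kaggle data), and the vector name factor_cols is likewise only illustrative.

```r
# Hypothetical toy data frame that mimics three columns of the data set
toy <- data.frame(
  gender = c("female", "male", "female"),
  test.preparation.course = c("none", "completed", "none"),
  math.score = c(72L, 47L, 90L),
  stringsAsFactors = FALSE
)

# Names of the columns to convert (illustrative helper vector)
factor_cols <- c("gender", "test.preparation.course")

# Apply as.factor() to each selected column and write the results back
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Inspect the resulting column classes and factor levels
sapply(toy, class)
lapply(toy[factor_cols], levels)
```

For character input, as.factor() sorts the levels, which is why "completed" is listed before "none" in the output above.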

The math.score, reading.score and writing.score variables are numeric (integer) and each is assumed to be a percentage score on the corresponding test.

To check if there were any missing values, sum(is.na()) was used. As shown below, there were no missing values in the data set.

sum(is.na(StudentsPerformance))
## [1] 0
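
A single total does not show where missing values sit when there are any. As a minimal sketch, assuming a made-up data frame called toy with deliberately inserted NA values (it is not the Kaggle data), the count can be broken down by column with colSums():

```r
# Hypothetical toy data frame with two deliberately missing values
toy <- data.frame(
  gender = c("female", "male", NA),
  math.score = c(72L, NA, 47L),
  stringsAsFactors = FALSE
)

# Total number of missing values, as in the check above
total_missing <- sum(is.na(toy))

# Number of missing values in each column
missing_by_column <- colSums(is.na(toy))

# Logical vector marking the rows that have no missing values
complete_rows <- complete.cases(toy)

total_missing
missing_by_column
complete_rows
```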

Descriptive Statistics

The summary() function was used to provide the number of females and males in the data set and the number of students who completed or did not complete the test preparation course.

summary(StudentsPerformance$gender)
## female   male 
##    518    482
summary(StudentsPerformance$test.preparation.course)
## completed      none 
##       358       642

Descriptive Visualisation

A table of the proportion of each gender that completed or did not complete the test preparation course was created using the table() and prop.table() functions as shown below. This table was then used to create a barplot using the barplot() function to visualise the relationship between gender and test preparation course.

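library(magrittr)  # provides the %>% pipe operator
# Rows: test preparation course; columns: gender.
# margin = 2 makes each gender column sum to 1.
table <- 
  table(StudentsPerformance$test.preparation.course, StudentsPerformance$gender) %>% 
  prop.table(margin = 2)
knitr::kable(table)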
             female       male
completed    0.3552124    0.3609959
none         0.6447876    0.6390041
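
The column proportions can be checked by hand. The sketch below rebuilds the table from the cell counts reported in the chi1$observed output later in this report; the counts are typed in manually here (an assumption made so the example runs without the CSV file), and the names counts and col_props are only illustrative.

```r
# Cell counts copied from the chi1$observed output in this report,
# arranged with course status in rows and gender in columns
counts <- matrix(
  c(184, 334,   # female: completed, none
    174, 308),  # male:   completed, none
  nrow = 2,
  dimnames = list(course = c("completed", "none"),
                  gender = c("female", "male"))
)

# margin = 2 divides each cell by its column (gender) total
col_props <- prop.table(counts, margin = 2)

# The same proportion computed directly for one cell: 184 / (184 + 334)
female_completed <- 184 / 518

round(col_props, 7)
colSums(col_props)  # each gender column sums to 1
```

Because the proportions are taken within each gender, the two columns can be compared directly even though the sample contains 518 females and 482 males.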

Descriptive Visualisation Cont.

# Grouped bar chart of the within-gender proportions
barplot(table, ylab = "Proportion within group", ylim = c(0, .9), 
        legend = rownames(table), beside = TRUE, 
        args.legend = list(x = "top", horiz = TRUE, title = "Test Preparation Course"), 
        xlab = "Gender", col = c("#99d8c9", "#2ca25f"), border = "#69b3a2")

The two variables in question contain no outliers as they are categorical variables.

Hypothesis Testing

The Null Hypothesis is as follows:

\[H_0: \text{There is no association in the population between gender and completion of the test preparation course.}\]

The Alternative Hypothesis is as follows:

\[H_A: \text{There is an association in the population between gender and completion of the test preparation course.}\]

The hypothesis test chosen for this data set was the Chi-square Test of Association, as we are interested in the relationship between gender and test preparation course, which are both categorical variables. The first step was to use the chisq.test() function to compute the chi-squared statistic. For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default, as the output below indicates.

chi1 <- chisq.test(
  table(StudentsPerformance$gender, StudentsPerformance$test.preparation.course))
chi1
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(StudentsPerformance$gender, StudentsPerformance$test.preparation.course)
## X-squared = 0.015529, df = 1, p-value = 0.9008

Hypothesis Testing Cont.

The next step was to check the observed and expected values as shown below.

chi1$observed
##         
##          completed none
##   female       184  334
##   male         174  308
chi1$expected
##         
##          completed    none
##   female   185.444 332.556
##   male     172.556 309.444
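
To show where these numbers come from, the following sketch recomputes the expected counts and the test statistic by hand. The observed counts are typed in from the output above (an assumption made so the example runs without the CSV file), and the object names observed, expected, x2_yates and x2_plain are only illustrative.

```r
# Observed counts copied from the chi1$observed output above
observed <- matrix(
  c(184, 174,   # completed: female, male
    334, 308),  # none:      female, male
  nrow = 2,
  dimnames = list(gender = c("female", "male"),
                  course = c("completed", "none"))
)

n <- sum(observed)

# Expected count for each cell: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Statistic with Yates' continuity correction, as used for 2 x 2 tables:
# subtract 0.5 from each |observed - expected| before squaring
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Uncorrected Pearson statistic, shown only for comparison
x2_plain <- sum((observed - expected)^2 / expected)

round(expected, 3)
x2_yates
x2_plain
```

The corrected value should agree with the X-squared figure reported by chisq.test() above. Passing correct = FALSE to chisq.test() would give the uncorrected statistic instead.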

Hypothesis Testing Cont.

Then the qchisq() function was used to find the critical value at the 0.05 significance level.

qchisq(p = .95, df = 1)
## [1] 3.841459

The Chi-squared value (0.015529) was less than the critical value (3.841459). This means we failed to reject the Null Hypothesis.

The pchisq() function was used to find the p-value.

pchisq(q = 0.015529, df = 1, lower.tail = FALSE)
## [1] 0.900828

The p-value (0.900828) is greater than the significance level of 0.05. Therefore, the result is not statistically significant.
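
The critical value approach and the p-value approach are two views of the same decision rule, so they should always agree. The sketch below illustrates this using the statistic reported above; the variable names (alpha, stat, reject_by_critical, reject_by_p) are only illustrative.

```r
alpha <- 0.05      # significance level used in this report
stat  <- 0.015529  # X-squared value from the chisq.test() output above
df    <- 1         # (rows - 1) * (columns - 1) for a 2 x 2 table

# Critical value approach: reject H0 if the statistic exceeds the cut-off
critical_value <- qchisq(p = 1 - alpha, df = df)
reject_by_critical <- stat > critical_value

# p-value approach: reject H0 if the upper-tail probability is below alpha
p_value <- pchisq(q = stat, df = df, lower.tail = FALSE)
reject_by_p <- p_value < alpha

# Both rules lead to the same decision
c(critical = reject_by_critical, p_value = reject_by_p)
```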

Discussion

The question posed at the beginning of the investigation was whether females or males were more likely to complete a test preparation course. The Chi-square Test of Association was used to see if there was an association between gender and test preparation course. The test statistic (X-squared = 0.015529, df = 1) was below the critical value (3.841459) and the p-value (0.9008) was well above the 0.05 significance level, so we failed to reject the Null Hypothesis. The data therefore provide no statistically significant evidence of an association between the two variables. Failing to reject the Null Hypothesis does not prove that no association exists, so the result is inconclusive.

The strength of this investigation is its reasonably large sample size (n = 1000). However, there are several limitations: it is not known whether the students are from the same school, the age of the students was not divulged, and the students were only from the United States.

For future investigations into the association between gender and test preparation course, further information such as age, school and socioeconomic status should be included in the data to provide a clearer result. In this investigation, no association between gender and the completion of a test preparation course was detected. Given the lack of statistical significance, no conclusion about an association in the wider population can be drawn.

References

Data source: Kaggle, https://www.kaggle.com/spscientist/students-performance-in-exams