Tarana Tabassum (s3802509), Siddhant Godbole (s3777167), Muhammad Abdullah Anis (s3790720)
Last updated: 27 October, 2019
http://rpubs.com/tarana_pia/543023
For this assignment we have chosen certain columns to use from the “Student Alcohol Consumption” dataset (https://www.kaggle.com/uciml/student-alcohol-consumption) from www.kaggle.com. Source/ author information: UCI Machine Learning Repository
The data was obtained in a survey of students in math and Portuguese language courses in secondary school. It contains social, gender and study information about the students. From the two available courses, we chose the dataset representing the students in the Math course. There are 33 variables in this dataset, such as school, sex, age, family size, weekend and weekday drinking amount, record of failure in previous classes, and grades for the tests in the course. We chose 2 of these variables to work with: sex and G3 (final test grades). We want to understand which gender, on average, secures a better score in the final test for this course.
The goal of this investigation is to find out which gender, among these 395 male and female students, did better in the final test.
This is a very simple question, great for an independent t-test. We now convert this question to two hypotheses:
a null hypothesis representing no difference in the final grades between the two groups: H0.
an alternative hypothesis representing that there is a difference between the two groups’ scores on the final grade: Ha.
Our observations in the two groups are independent, which means they do not represent the same individuals surveyed at different times. Thus, the groups that we are comparing in the test are independent. Had the means that we use in the test been obtained from the same persons, we would have had to approach the test differently. Before running this test, we preprocess the data and check the assumptions.
This dataset has 33 variables and a large number of observations (395). Because the samples are large (n > 30), the Central Limit Theorem lets us treat the sampling distribution of each group mean as approximately normal, even if the grades themselves are not normally distributed.
From the 33 variables, we chose to work with 2 and made a subset called “Student”. Our chosen variables are:
sex = the gender of the students
G3 = their final grades in Maths.
G3 is the final grade on a scale from 0 to 20. This is a numeric variable. Sex is a categorical variable, which we converted to a factor with 2 levels: Female and Male.
library(readr)   # read_csv()
library(dplyr)   # %>%, group_by(), summarise(), filter()
Data <- read_csv("student-mat.csv")
Student <- subset(Data, select = c(sex, G3))
Student$sex <- Student$sex %>% factor(levels = c("F","M"),
labels=c("Female","Male"))
print(Student)
## # A tibble: 395 x 2
## sex G3
## <fct> <dbl>
## 1 Female 6
## 2 Female 6
## 3 Female 10
## 4 Female 15
## 5 Female 10
## 6 Male 15
## 7 Male 11
## 8 Female 6
## 9 Male 19
## 10 Male 15
## # … with 385 more rows
We summarised the dataset grouped by sex (female and male) to obtain the min, Q1, median, Q3, max, mean and SD of G3, the number of observations in each group, and the number of missing values. The summary is saved into a table called Student1.
Student %>% group_by(sex) %>% summarise(Min = min(G3,na.rm = TRUE),
Q1 = quantile(G3,probs = .25,na.rm = TRUE),
Median = median(G3, na.rm = TRUE),
Q3 = quantile(G3,probs = .75,na.rm = TRUE),
Max = max(G3,na.rm = TRUE),
Mean = mean(G3, na.rm = TRUE),
SD = sd(G3, na.rm = TRUE),
n = n(),
Missing = sum(is.na(G3)))-> Student1
print(Student1)
## # A tibble: 2 x 10
## sex Min Q1 Median Q3 Max Mean SD n Missing
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 Female 0 8 10 13 19 9.97 4.62 208 0
## 2 Male 0 9 11 14 20 10.9 4.50 187 0
We also created a histogram and a barplot to visualize the distribution of grades among students and the gender of students involved.
We can see some patterns in the data: there are only a few more females than males, and the average final score is around 10. However, a large number of students received a grade of 0. These values may prove to be outliers, and we decide how to deal with them in the next phase.
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0
We now want to run a two sample t-test on this dataset to identify any statistically significant difference between Male and Female grades.
Before running a hypothesis test, we need to confirm two facts now -
population data are normally distributed
population homogeneity of variance.
To check normality, we draw a Q-Q plot for each gender. Some values fall outside the 95% normality bands in both groups. Because both samples are large (n > 30), the Central Limit Theorem implies that the sampling distribution of each mean is approximately normal, so the t-test can still be applied.
# Q-Q Plot Female
library(car)   # provides qqPlot()
G3_Female <- Student %>% filter(sex == "Female")
G3_Female$G3 %>% qqPlot(dist="norm")
## [1] 66 67
## [1] 64 66
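The Central Limit Theorem argument used here can be illustrated with a small simulation. This is only a sketch: the grade distribution below (a spike at 0 and a flat spread over 1 to 20) is hypothetical and is not estimated from the survey data; `probs`, `reps` and the other names are ours. It draws many samples of the size of the smaller group and checks that the sample means behave like a normal variable with standard error sd / sqrt(n).

```r
# Hypothetical grade distribution on the 0-20 scale with a spike at 0.
# The probabilities are made up for illustration, not taken from the data.
set.seed(1)
grades <- 0:20
probs  <- c(0.10, rep(0.90 / 20, 20))   # 10% zeros, rest spread evenly over 1..20

n    <- 187    # size of the smaller group (Male) in the report
reps <- 5000   # number of simulated samples

# Mean of each simulated sample of n grades
sample_means <- replicate(
  reps,
  mean(sample(grades, n, replace = TRUE, prob = probs))
)

# Mean and standard error implied by the hypothetical distribution
pop_mean       <- sum(grades * probs)
pop_sd         <- sqrt(sum((grades - pop_mean)^2 * probs))
theoretical_se <- pop_sd / sqrt(n)

# If the means are roughly normal, about 95% of them should lie
# within 1.96 standard errors of the population mean
within_95 <- mean(abs(sample_means - pop_mean) <= 1.96 * theoretical_se)

print(c(simulated_se   = sd(sample_means),
        theoretical_se = theoretical_se,
        within_95      = within_95))
```

Even though the individual grades in this toy distribution are clearly non-normal, the simulated standard error and coverage should sit close to their theoretical values, which is the property the t-test relies on.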
Next, we check homogeneity of variance using Levene’s test to compare the variances of male and female grades. Hypotheses for the Levene’s test are -
\[H_0: \sigma_1^2 = \sigma_2^2 \]
\[H_A: \sigma_1^2 \ne \sigma_2^2 \]
Levene's test gives a p-value of 0.7937, which is greater than 0.05. So we fail to reject H0, which lets us assume equal variances.
We now run a two-sample t-test for any statistically significant difference between male and female grade means. The confidence level is 95% and the significance level is α = 0.05.
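The report does not show the call that produced this p-value. As a sketch of what a Levene-type test does, the median-centred version can be written in base R as a one-way ANOVA on the absolute deviations from each group's median. The data below are simulated stand-ins (random draws that only mimic the group sizes, means and SDs from the summary table), so the resulting p-value will not match 0.7937; `sim`, `abs_dev` and the other names are ours.

```r
# Simulated stand-in data: group sizes, means and SDs follow the summary
# table, but the individual values are random draws, not the real grades.
set.seed(42)
sim <- data.frame(
  sex = factor(rep(c("Female", "Male"), times = c(208, 187))),
  G3  = c(rnorm(208, mean = 9.97, sd = 4.62),
          rnorm(187, mean = 10.9, sd = 4.50))
)

# Absolute deviation of each observation from its own group median
group_median <- ave(sim$G3, sim$sex, FUN = median)
sim$abs_dev  <- abs(sim$G3 - group_median)

# One-way ANOVA on the absolute deviations (median-centred Levene test)
levene_fit <- anova(lm(abs_dev ~ sex, data = sim))
levene_F   <- levene_fit[["F value"]][1]
levene_p   <- levene_fit[["Pr(>F)"]][1]
levene_df  <- levene_fit[["Df"]]

print(levene_fit)
```

A large p-value from this ANOVA means the typical spread around the median is similar in both groups, which is what justifies the equal-variance form of the t-test.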
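The assumptions made and checked for this t-test are the following: the two groups are independent, the sampling distribution of each group mean is approximately normal (large samples, Central Limit Theorem), and the population variances are equal (Levene's test).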
Hypotheses for the two-sample t-test:
\[H_0: \mu_1 = \mu_2 \] \[H_A: \mu_1 \ne \mu_2\] where μ1 and μ2 refer to the population means of Female and Male grades respectively. The null hypothesis is simply that the difference between the two independent population means is 0.
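For reference, the standard equal-variance (pooled) two-sample t statistic for these hypotheses is written out here. The symbols \(\bar{x}_i\), \(s_i\) and \(n_i\) (sample mean, sample standard deviation and size of group \(i\)) are our notation and do not appear in the report:

\[ s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2} \]
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad df = n_1 + n_2 - 2 \]

With \(n_1 = 208\) and \(n_2 = 187\), this gives \(df = 393\).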
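The difference between female and male mean grades estimated from the sample was 9.966346 - 10.914439 = -0.948093.
We reject H0 if the p-value is below α = 0.05, or equivalently if the 95% CI of the difference between means does not capture 0. Otherwise, we fail to reject H0.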
##
## Two Sample t-test
##
## data: G3 by sex
## t = -2.062, df = 393, p-value = 0.03987
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.85205632 -0.04412838
## sample estimates:
## mean in group Female mean in group Male
## 9.966346 10.914439
A two-sample t-test was used to test for a significant difference between the grades of male and female students. According to the p-value method, as p = 0.03987 < α = 0.05, we reject H0: there is a statistically significant difference between male and female grade means.
According to the t-test, the estimated difference between means is 9.966346 - 10.914439 = -0.948093, with a 95% CI for the difference of [-1.85205632, -0.04412838]. As this interval does not capture the H0 value μ1 − μ2 = 0, we reject H0 once again, which again indicates a statistically significant difference between the means. As the t-test shows, the mean grade for males (10.914439) is higher than the mean grade for females (9.966346). So, according to the t-test, males tend to have a higher mean grade than females.
The two-sample t-test assuming equal variances found a statistically significant difference between the mean grades of male and female students, t(df=393) = -2.062, p = 0.040, 95% CI for the difference in means [-1.85205632, -0.04412838]. According to the t-test, males tend to have a higher mean grade than females in this class.
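As a cross-check on the printed t-test output, the statistic, p-value and confidence interval can be recomputed from the summary figures alone. This is a sketch under one caveat: the SDs are taken from the summary table, where they are rounded to two decimals, so the results should land close to, but not exactly on, the printed values (t = -2.062, p-value = 0.03987). All variable names are ours.

```r
# Summary statistics as printed in the Student1 table and the t-test output
n1 <- 208; mean1 <- 9.966346;  sd1 <- 4.62   # Female (SD rounded in the table)
n2 <- 187; mean2 <- 10.914439; sd2 <- 4.50   # Male   (SD rounded in the table)

# Pooled standard deviation under the equal-variance assumption
dof <- n1 + n2 - 2
sp  <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / dof)

# Standard error of the difference between the two means
se <- sp * sqrt(1 / n1 + 1 / n2)

# Test statistic, two-sided p-value and 95% confidence interval
diff_means <- mean1 - mean2
t_stat     <- diff_means / se
p_value    <- 2 * pt(-abs(t_stat), dof)
ci         <- diff_means + c(-1, 1) * qt(0.975, dof) * se

print(round(c(t = t_stat, df = dof, p = p_value,
              lower = ci[1], upper = ci[2]), 3))
```

The sign of t is negative because the difference is taken as Female minus Male, matching the order of the factor levels in the report.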
We cannot generalise this result beyond this class. We can only infer to a population that is akin to the individuals represented in the sample. That means that the final scores for males and females that we analyzed are relevant only to similar boys and girls from the same countries, at the same age, at the same point in time, etc. For a generalised answer to who is smarter, males or females, we would rather suggest exploring the result further by gathering bigger samples (from different classes, different cities and maybe even different countries) and then running the tests again.
We conclude that the average grades of male and female students differ in this specific maths class; in this class males tend to have higher grades than females.
The dataset was derived from www.kaggle.com and the following specific address - https://www.kaggle.com/uciml/student-alcohol-consumption Source/ author information: UCI Machine Learning Repository