RPubs link information

Introduction

For this assignment we use selected columns from the “Student Alcohol Consumption” dataset (https://www.kaggle.com/uciml/student-alcohol-consumption), published on www.kaggle.com. Source/author: UCI Machine Learning Repository.

The data was obtained in a survey of secondary school students in Math and Portuguese language courses. It contains social, gender and study information about the students. Of the two available courses, we chose the dataset for students in the Math course. There are 33 variables in this dataset, such as school, sex, age, family size, weekend and weekday alcohol consumption, number of past class failures and the grades for the tests in this Math course. We chose 2 of these variables to work with: sex and G3 (final test grade). We want to understand which gender, on average, achieves a better final grade in this course.

Problem Statement

The goal of this investigation is to find out which gender, among these 395 male and female students, did better in the final test.

Question: Is there a statistically significant difference between how well boys and girls do on their final scores?

This is a simple question that suits an independent two-sample t-test. We convert it into two hypotheses:

a null hypothesis that there is no difference in mean final grade between the two groups: H0.
an alternative hypothesis that there is a difference in mean final grade between the two groups: Ha.

The observations in the two groups are independent, which means they do not come from the same individuals surveyed at different times. The groups we compare in the test are therefore independent. Had the two means come from the same persons, we would have needed a paired test instead. Before running the test, we preprocess the data and check the assumptions.

Data

This dataset has 33 variables and 395 observations. Because both groups are large (n > 30), the Central Limit Theorem lets us treat the sampling distribution of each group mean as approximately normal, even if the grades themselves are not normally distributed.
From the 33 variables, we chose to work with 2 and made a subset called “Student”. Our chosen variables are:

sex = the gender of the students

G3 = their final grades in Maths.

G3 is the final grade on a scale from 0 to 20. This is a numeric variable. Sex is a categorical variable, which we convert to a factor with 2 levels: Female and Male.

Data Cont.

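library(readr)   # read_csv()
library(dplyr)   # %>%, group_by(), summarise(), filter()
library(car)     # qqPlot(), leveneTest()

Data <- read_csv("student-mat.csv")

# Keep only the two variables used in this analysis
Student <- subset(Data, select = c(sex, G3))

Student$sex <- Student$sex %>% factor(levels = c("F","M"),
                                      labels = c("Female","Male"))
print(Student)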

## # A tibble: 395 x 2
##    sex       G3
##    <fct>  <dbl>
##  1 Female     6
##  2 Female     6
##  3 Female    10
##  4 Female    15
##  5 Female    10
##  6 Male      15
##  7 Male      11
##  8 Female     6
##  9 Male      19
## 10 Male      15
## # … with 385 more rows

Descriptive Statistics and Visualisation

We summarised the dataset grouped by sex (female and male) to obtain the min, Q1, median, Q3, max, mean and SD of G3, the number of observations in each group and the number of missing values. The summary is saved in a table called Student1.

Student %>% group_by(sex) %>% summarise(Min = min(G3,na.rm = TRUE),
                                             Q1 = quantile(G3,probs = .25,na.rm = TRUE),
                                             Median = median(G3, na.rm = TRUE),
                                             Q3 = quantile(G3,probs = .75,na.rm = TRUE),
                                             Max = max(G3,na.rm = TRUE),
                                             Mean = mean(G3, na.rm = TRUE),
                                             SD = sd(G3, na.rm = TRUE),
                                             n = n(),
                                             Missing = sum(is.na(G3)))-> Student1
print(Student1)

## # A tibble: 2 x 10
##   sex      Min    Q1 Median    Q3   Max  Mean    SD     n Missing
##   <fct>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <int>
## 1 Female     0     8     10    13    19  9.97  4.62   208       0
## 2 Male       0     9     11    14    20 10.9   4.50   187       0

Descriptive Statistics and Visualisation (Cont)

We also created a histogram and a barplot to visualise the distribution of grades and the number of students of each gender.

We can see some patterns in the data: there are slightly more females (208) than males (187), and the average final grade is around 10. However, a large number of students who failed the tests received a grade of 0. These may be outliers, and we decide how to deal with them in the next step.

hist(Student$G3, breaks = 30, col = "grey")

sex_graph <- table(Student$sex)
barplot(sex_graph, col="grey", main = "Barplot of Student$sex")

Descriptive Statistics and Visualisation (Cont)

Dealing with outliers: We create a boxplot and check for outliers. The boxplot flags 38 outliers, all with a value of 0. These are students who failed their tests and received 0 marks. The values were not entered incorrectly, and they do affect the mean grades for males and females. A mark of 0 is still a measure of performance, so these values are part of what we are investigating. We therefore decided to keep the outliers.

outliers <-boxplot(G3 ~ sex, data = Student)$out

outliers

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0
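
As a rough check on why only the zeros are flagged, the sketch below applies the usual 1.5 × IQR boxplot rule to the quartiles reported in the summary table. It is an illustration only: the quartiles are typed in by hand rather than recomputed, and boxplot() works from hinges, which can differ slightly from quantile().

```r
# Quartiles copied by hand from the Student1 summary table in this report
quartiles <- data.frame(
  sex = c("Female", "Male"),
  Q1  = c(8, 9),
  Q3  = c(13, 14)
)

# 1.5 * IQR rule: values beyond these fences are drawn as outliers
quartiles$IQR         <- quartiles$Q3 - quartiles$Q1
quartiles$lower_fence <- quartiles$Q1 - 1.5 * quartiles$IQR
quartiles$upper_fence <- quartiles$Q3 + 1.5 * quartiles$IQR

print(quartiles)
```

The lower fences come out at 0.5 (Female) and 1.5 (Male), and the upper fences lie above the maximum grade of 20, so a grade of 0 is the only value that can fall outside the whiskers.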

Hypothesis Testing

We now want to run a two sample t-test on this dataset to identify any statistically significant difference between Male and Female grades.

Before running the hypothesis test, we need to check two assumptions:

the population data are normally distributed
the population variances are homogeneous.

Hypothesis Testing Cont.

Normal Distribution

To check for normality, we draw a Q-Q plot for each gender. Some values fall outside the 95% confidence band for normality in both groups. However, because both samples are large (n > 30), the Central Limit Theorem tells us that the sampling distribution of each mean is approximately normal, so the t-test can still be applied.

# Q-Q Plot Female
G3_Female <- Student %>% filter(sex == "Female")
G3_Female$G3 %>% qqPlot(dist="norm")

## [1] 66 67

# Q-Q Plot Male
G3_Male <- Student %>% filter(sex == "Male")
G3_Male$G3 %>% qqPlot(dist="norm")

## [1] 64 66
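
The Central Limit Theorem argument can be illustrated with a small simulation. This is only a sketch on a hypothetical, strongly skewed population (an exponential distribution), not on the survey data; the sample size of 187 is borrowed from the smaller of our two groups, and the seed and number of repetitions are arbitrary choices.

```r
# Hypothetical skewed population (exponential, mean 1): NOT the survey data
set.seed(1)
n_per_sample <- 187    # size of the smaller group in this report
n_reps       <- 5000   # number of simulated samples (arbitrary)

# Draw many samples and keep the mean of each one
sample_means <- replicate(n_reps, mean(rexp(n_per_sample, rate = 1)))

# Theory: the means centre on 1 with standard error 1 / sqrt(n)
mean(sample_means)
sd(sample_means)
1 / sqrt(n_per_sample)

# The histogram of the means looks roughly bell-shaped
hist(sample_means, breaks = 30, col = "grey")
```

Even though individual draws are far from normal, the sample means cluster symmetrically around the population mean, which is the property the t-test relies on.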

Hypothesis Testing Cont.

Homogeneity of Variance

Next, we check homogeneity of variance using Levene’s test to compare the variances of male and female grades. The hypotheses for Levene’s test are:

H0 : (σ1)^2 = (σ2)^2

HA : (σ1)^2 ≠ (σ2)^2

According to the test, p value = 0.7937 which is > 0.05. So we fail to reject H0, which lets us assume equal variances.

# levene test
leveneTest(G3 ~ sex, data = Student)
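
To show what Levene’s test measures, here is a minimal base-R sketch on hypothetical toy data (not the survey data). It assumes the median-centred variant, which to our knowledge is the default in car::leveneTest(): the test is a one-way ANOVA on each observation's absolute deviation from its group median. The data frame `toy` and its columns are made up for illustration.

```r
# Hypothetical toy data: NOT the survey data
toy <- data.frame(
  group = factor(rep(c("A", "B"), each = 6)),
  score = c(4, 6, 7, 9, 10, 12, 3, 6, 8, 9, 11, 14)
)

# Absolute deviation of each score from its own group median
toy$abs_dev <- abs(toy$score - ave(toy$score, toy$group, FUN = median))

# One-way ANOVA on the deviations; the F test for `group` is the Levene test
levene_fit <- anova(lm(abs_dev ~ group, data = toy))
levene_p   <- levene_fit[["Pr(>F)"]][1]
levene_p
```

A large p-value here means the typical spread around the median is similar in both groups, which is the same reading we apply to the result for G3 by sex.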

Hypothesis Testing Cont.

Two Sample t-test

We now run a two-sample t-test for a statistically significant difference between the male and female grade means. The confidence level is 95% and the significance level is α = 0.05.

The assumptions checked for this t-test are the following:

We compare two independent population means with unknown population variances.
Normality: large samples are used (n > 30 for both groups), so the population data do not need to be normally distributed.
Homogeneity of variance: the assumption is not violated (Levene’s test).

Hypotheses for the two-sample t-test:

\[H_0: \mu_1 = \mu_2\]
\[H_A: \mu_1 \ne \mu_2\]

where μ1 and μ2 refer to the population means of Female and Male grades respectively. The null hypothesis is simply that the difference between the two independent population means is 0.

The difference between female and male grades (female minus male) estimated from the sample means was 9.966346 - 10.914439 = -0.948093.

Hypothesis Testing Cont.

We will reject H0 if the p-value < 0.05 or, equivalently, if the 95% CI of the mean difference does not capture H0: μ1 − μ2 = 0.

Otherwise, we fail to reject H0.

#t test
t.test(
  G3 ~ sex,
  data = Student,
  var.equal = TRUE,
  alternative = "two.sided"
)

## 
##  Two Sample t-test
## 
## data:  G3 by sex
## t = -2.062, df = 393, p-value = 0.03987
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.85205632 -0.04412838
## sample estimates:
## mean in group Female   mean in group Male 
##             9.966346            10.914439
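
The reported statistic can be cross-checked by hand from the summary table. The sketch below recomputes the pooled-variance t-test from the group means, SDs and sizes given earlier in this report. Because the SDs are only available rounded to two decimals, the results should agree with the t.test() output only approximately; the variable names are ours.

```r
# Summary statistics copied from this report (SDs are rounded to 2 decimals)
mean_f <- 9.966346;  sd_f <- 4.62; n_f <- 208
mean_m <- 10.914439; sd_m <- 4.50; n_m <- 187

# Pooled variance and standard error of the difference (equal variances assumed)
df         <- n_f + n_m - 2
pooled_var <- ((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / df
se_diff    <- sqrt(pooled_var * (1 / n_f + 1 / n_m))

# Test statistic, two-sided p-value and 95% CI for (female - male)
mean_diff <- mean_f - mean_m
t_stat    <- mean_diff / se_diff
p_value   <- 2 * pt(-abs(t_stat), df)
ci        <- mean_diff + c(-1, 1) * qt(0.975, df) * se_diff

round(c(t = t_stat, df = df, p = p_value, lower = ci[1], upper = ci[2]), 4)
```

The hand calculation gives df = 393 and a t statistic, p-value and interval that match the t.test() output to about two or three decimal places, which confirms how the pooled test arrives at its result.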

Interpretation of the t-test:

A two-sample t-test was used to test for a significant difference between the mean grades of male and female students. By the p-value method, as p = 0.03987 < α = 0.05, we reject H0: there is a statistically significant difference between the male and female grade means.
The estimated difference between means (female minus male) is 9.966346 - 10.914439 = -0.948093, with 95% CI [-1.85205632, -0.04412838]. As this interval does not capture H0: μ1 − μ2 = 0, we reject H0 by the confidence interval method as well, which is consistent with the p-value result. The sample mean grade for males (10.914439) is higher than for females (9.966346). So, according to the t-test, males tend to have a higher mean grade than females.

Discussion

Findings:

The two-sample t-test assuming equal variances found a statistically significant difference between the mean grades of males and females, t(df=393) = -2.062, p = 0.040, 95% CI for the difference in means (female minus male) [-1.852, -0.044]. According to the t-test, males tend to have higher mean grades than females in this class.

Strength & Limitation/ Directions for future investigation:

We cannot generalise this result beyond this class. We can only make inferences about a population similar to the individuals in the sample. That means the final grades we analysed are only relevant to similar boys and girls from the same country, of the same age, at the same point in time, and so on. For a more general answer about whether males or females perform better in maths, we suggest gathering larger samples (from different classes, different cities and perhaps different countries) and running the tests again.

Conclusion:

We conclude that the average grades of males and females differ in this specific maths class; in this class, males tend to have higher grades than females.

References

The dataset was obtained from www.kaggle.com at https://www.kaggle.com/uciml/student-alcohol-consumption. Source/author: UCI Machine Learning Repository.

MATH 1324_1950 Assignment 3

Female and male math grades comparison
