MATH1324 Assignment

Understanding the Relationship between Gender and Weekly Alcohol Consumption

Mohan Pal Singh (S3797951), Akshay Aggarwal (S3793489)

Introduction

Problem Statement

Data

- The data were downloaded from kaggle.com, a free online data source. The dataset has a public domain license and can be used by anyone.
- The data were obtained from a survey of secondary school students taking mathematics and Portuguese language courses.
- Data reference: https://www.kaggle.com/uciml/student-alcohol-consumption

Data Cont.

library(dplyr) # provides %>%, group_by() and summarise()

alc1 <- read.csv("C:/data/student-mat.csv", stringsAsFactors = FALSE)
gender1 <- alc1$sex
week_alc <- alc1$Walc

dim(alc1) # checking dimensions of dataset
## [1] 395  33
# Analysing summary statistics of weekly alcohol consumption (Walc) by sex.
# Referring to the column Walc (not the global vector week_alc) makes each
# statistic be computed within its own group.
alc1 %>% group_by(sex) %>% summarise(Min = min(Walc, na.rm = TRUE),
Q1 = quantile(Walc, probs = .25, na.rm = TRUE),
Median = median(Walc, na.rm = TRUE),
Q3 = quantile(Walc, probs = .75, na.rm = TRUE),
Max = max(Walc, na.rm = TRUE),
Mean = mean(Walc, na.rm = TRUE),
SD = sd(Walc, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Walc)))

Descriptive Statistics and Visualisation

- Sex and weekly alcohol consumption (Walc) are the two variables we use to assess our hypothesis.
- We use a boxplot to visualise weekly alcohol consumption for both genders. An initial reading is that male students drink more than female students, but this is only a visual interpretation.
- There were no outliers or missing values in the data. Had outliers been present, we could have imputed, deleted or capped them.

# Boxplot of weekly alcohol consumption by sex, with labelled axes
boxplot(Walc ~ sex, data = alc1, xlab = "Sex", ylab = "Weekly alcohol consumption (Walc)")

Descriptive Statistics Cont.

# Summarise the column Walc within each sex. Using the global vector week_alc
# here would return the same whole-sample statistics in every row.
alc1 %>% group_by(sex) %>% summarise(Min = min(Walc, na.rm = TRUE),
Q1 = quantile(Walc, probs = .25, na.rm = TRUE),
Median = median(Walc, na.rm = TRUE),
Q3 = quantile(Walc, probs = .75, na.rm = TRUE),
Max = max(Walc, na.rm = TRUE),
Mean = mean(Walc, na.rm = TRUE),
SD = sd(Walc, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Walc))) -> table1
knitr::kable(table1)
sex  Min  Q1  Median  Q3  Max  Mean      SD        n    Missing
F    1    1   2       3   5    1.956731  1.055491  208  0
M    1    1   3       4   5    2.663102  1.417584  187  0
sum(is.na(alc1)) # checking missing values
## [1] 0
sum(is.nan(alc1$sex))
## [1] 0
sum(is.nan(alc1$Walc))
## [1] 0
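The per-group summaries can be cross-checked without the CSV file. The sketch below rebuilds one row per student from the sex-by-Walc counts reported in the contingency table of the Hypothesis Testing section and then summarises within each sex. The object names (`counts`, `students`, `group_means` and so on) are illustrative and not part of the original analysis.

```r
# Counts copied from the sex-by-Walc contingency table in this report
counts <- data.frame(
  sex  = rep(c("M", "F"), each = 5),
  Walc = rep(1:5, times = 2),
  n    = c(57, 34, 35, 37, 24,   # male students at Walc levels 1 to 5
           94, 51, 45, 14, 4)    # female students at Walc levels 1 to 5
)

# Expand to one row per student by repeating each row n times
students <- counts[rep(seq_len(nrow(counts)), counts$n), c("sex", "Walc")]

# Summaries computed within each sex
group_n       <- tapply(students$Walc, students$sex, length)
group_means   <- tapply(students$Walc, students$sex, mean)
group_medians <- tapply(students$Walc, students$sex, median)
group_sds     <- tapply(students$Walc, students$sex, sd)

data.frame(n = group_n, Mean = group_means,
           Median = group_medians, SD = group_sds)
```

Because the statistics are computed inside each group, the two rows differ, which is what the boxplot suggests.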

Hypothesis Testing

- We create a barplot to visualise the data and then perform a Chi-square test of association.

- Null hypothesis: gender has no effect on weekly alcohol consumption. Alternative hypothesis: weekly alcohol consumption differs between male and female students.

H0: There is no association in the population between the categorical variables (independence).
HA: There is an association in the population between the categorical variables (dependence).

alc1$sex <- factor(alc1$sex, levels=c("M","F"),
                                  labels = c("Male","Female"))

table2 <- table(alc1$Walc,alc1$sex)
table2
##    
##     Male Female
##   1   57     94
##   2   34     51
##   3   35     45
##   4   37     14
##   5   24      4
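# Bar heights are student counts; args.legend must be a list so that
# horiz stays logical instead of being coerced to a character string
barplot(table2, ylab = "Number of students",
          ylim = c(0, 100), legend.text = rownames(table2), beside = TRUE,
          args.legend = list(x = "top", horiz = TRUE, title = "Weekly alcohol consumption level (Walc)"),
          xlab = "Gender")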

chi2 <- chisq.test(table(alc1$Walc, alc1$sex))
chi2
## 
##  Pearson's Chi-squared test
## 
## data:  table(alc1$Walc, alc1$sex)
## X-squared = 37.364, df = 4, p-value = 1.516e-07
chi2$p.value
## [1] 1.515842e-07
chi2$observed
##    
##     Male Female
##   1   57     94
##   2   34     51
##   3   35     45
##   4   37     14
##   5   24      4
chi2$expected
##    
##         Male   Female
##   1 71.48608 79.51392
##   2 40.24051 44.75949
##   3 37.87342 42.12658
##   4 24.14430 26.85570
##   5 13.25570 14.74430
qchisq(p = .95,df = 4)
## [1] 9.487729
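The comparison between the test statistic and the critical value can be made explicit. This is a minimal sketch, assuming a significance level of 0.05 (implied by `p = .95` in the `qchisq()` call) and taking the statistic and degrees of freedom from the `chisq.test()` output above. The variable names are illustrative.

```r
chi_sq_stat <- 37.364  # X-squared reported by chisq.test() above
df          <- 4       # (5 rows - 1) * (2 columns - 1)
alpha       <- 0.05    # assumed significance level

# Critical value: the 95th percentile of the Chi-square distribution
crit_value <- qchisq(1 - alpha, df = df)

# Reject H0 when the statistic exceeds the critical value
reject_h0 <- chi_sq_stat > crit_value

# Equivalent decision through the upper-tail p-value
p_value <- pchisq(chi_sq_stat, df = df, lower.tail = FALSE)

crit_value
reject_h0
p_value < alpha
```

Both routes lead to the same decision, since the p-value falls below alpha exactly when the statistic exceeds the critical value.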

Hypothesis Testing Cont.

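- The Chi-square statistic $\chi^2$ is calculated as:

$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ is the observed count in the $i$th row of the $j$th column and $E_{ij}$ is the expected count assuming no association. $E_{ij}$ is calculated as

$$E_{ij} = n \left(\frac{r_i}{n}\right) \left(\frac{c_j}{n}\right) = \frac{r_i c_j}{n}$$

where $r_i$ is the total count of the $i$th row, $c_j$ is the total count of the $j$th column and $n$ is the total sample size.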
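The formulas above can be applied by hand to the observed counts reported earlier. The sketch below is illustrative rather than part of the original analysis: it rebuilds the contingency table as a matrix, computes the expected counts from the row and column totals, and sums the cell contributions. The object names are assumptions chosen for readability.

```r
# Observed counts copied from the contingency table (rows: Walc level)
observed <- matrix(
  c(57, 94,
    34, 51,
    35, 45,
    37, 14,
    24,  4),
  nrow = 5, byrow = TRUE,
  dimnames = list(Walc = as.character(1:5), Sex = c("Male", "Female"))
)

n          <- sum(observed)      # total sample size
row_totals <- rowSums(observed)  # r_i
col_totals <- colSums(observed)  # c_j

# E_ij = r_i * c_j / n for every cell
expected <- outer(row_totals, col_totals) / n

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(expected, 5)
chi_sq
df
p_value
```

The expected counts and the statistic should agree with `chi2$expected` and the `X-squared` value printed by `chisq.test()`.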

Discussion

If there were no association between gender and weekly alcohol consumption, the relative heights of the bars (i.e. the proportions of students at each consumption level) would be the same for both genders. In the barplot above, this does not seem to be the case. Male students are less likely to be at level 1 (the lowest level of alcohol consumption) and more likely to be at level 5 (the highest). This is an example of a categorical association. In other words, the probability of being at a given alcohol consumption level "depends" on gender. What we need to determine with a Chi-square test of association is whether this relationship is statistically significant or whether it is consistent with natural sampling variability, assuming gender and alcohol consumption are independent (i.e. H0).
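Because the two groups differ in size (187 male and 208 female students), comparing proportions within each gender is more reliable than comparing raw bar heights. A minimal sketch, using the observed counts from the contingency table, with illustrative object names:

```r
# Observed counts copied from the contingency table (rows: Walc level)
observed <- matrix(
  c(57, 94,
    34, 51,
    35, 45,
    37, 14,
    24,  4),
  nrow = 5, byrow = TRUE,
  dimnames = list(Walc = as.character(1:5), Sex = c("Male", "Female"))
)

# margin = 2 divides each cell by its column total, so every
# column gives the distribution of Walc levels within one gender
col_props <- prop.table(observed, margin = 2)

round(col_props, 3)
```

Under independence the two columns of proportions would be roughly equal. Here the share of students at level 5 is 24/187 for males against 4/208 for females.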

- A Chi-square test of association was used to test for a statistically significant association between gender and weekly alcohol consumption. The test found a statistically significant association, $\chi^2 = 37.364$, df = 4, p < .001. The statistic exceeds the critical value, $\chi^2 > \chi^2_{crit}$, that is 37.364 > 9.49. The results of this study suggest that male students were more likely than female students to be in the higher weekly alcohol consumption categories.

Discussion cont.

- H0 was rejected.
- We can conclude that there is an association in the population between the categorical variables, which indicates that gender and weekly alcohol consumption are related. Males are more likely than females to be at level 5 (the highest consumption level), while females are more likely to be at level 1 (the lowest).

For future work, we could add more data, such as a column of blood alcohol content measurements. Because that variable is numeric, its mean could be compared between genders: Levene's test would check the equality of variances, followed by a t-test of the difference in means.

We conclude that there is an association between gender and weekly alcohol consumption: males are more likely to be at level 5 of alcohol consumption.

References

- https://www.kaggle.com/uciml/student-alcohol-consumption