MATH1324 Assignment

Understanding the Relationship Between Gender and Weekly Alcohol Consumption

Mohan Pal Singh (S3797951), Akshay Aggarwal (S3793489)

Introduction

Rising alcohol consumption among young people is a major concern for authorities.
Many school students start drinking alcohol at an early age.
Alcohol consumption depends on various factors, such as parents' financial and relationship status, family problems, peer pressure, schooling and gender.
In this analysis, our group explores whether female or male students tend to drink more alcohol per week.
We try to understand the relationship between gender and alcohol consumption using the available data.

Problem Statement

Is there an association between gender and weekly alcohol consumption among school students?
Do female or male students drink more alcohol on a weekly basis? Answering this question indicates which gender is more likely to report higher consumption.
For the initial investigation we use descriptive statistics and visualise the data with a box plot and a bar plot.
We then use hypothesis testing to test the null hypothesis against the alternative hypothesis and draw a conclusion.
The choice of test is made after summarising and plotting the data.

Data

- The data were downloaded from kaggle.com, a free online data source. The dataset has a public domain licence and can be used by anyone.
- The data were obtained in a survey of students in mathematics and Portuguese language courses in secondary school.
- Data reference: https://www.kaggle.com/uciml/student-alcohol-consumption

Data Cont.

Sample size - 395
There are 33 variables in the dataset.
Gender (sex) and weekly alcohol consumption (Walc) are the key variables, as the research question is about them.
Categorical variables - sex, Pstatus, address
Numeric variables - famrel, freetime, goout, Dalc, Walc, health, absences, G1, G2, G3, traveltime, studytime, failures, Medu, Fedu.

- We extract the two variables we want to work with: sex and Walc (weekly alcohol consumption). The hypothesis test involves only these two variables, so we pull them out of the full dataset.

library(dplyr) # provides %>%, group_by() and summarise()

alc1 <- read.csv("C:/data/student-mat.csv", stringsAsFactors = FALSE)
gender1 <- alc1$sex
week_alc <- alc1$Walc

dim(alc1) # checking dimensions of dataset

## [1] 395  33

# Summary statistics of weekly alcohol consumption by gender.
# The column Walc is used inside summarise() (not the global vector week_alc)
# so that each statistic is computed within its own gender group.
alc1 %>% group_by(sex) %>% summarise(Min = min(Walc, na.rm = TRUE),
Q1 = quantile(Walc, probs = .25, na.rm = TRUE),
Median = median(Walc, na.rm = TRUE),
Q3 = quantile(Walc, probs = .75, na.rm = TRUE),
Max = max(Walc, na.rm = TRUE),
Mean = mean(Walc, na.rm = TRUE),
SD = sd(Walc, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Walc)))

Descriptive Statistics and Visualisation

- Sex and weekly alcohol consumption (Walc) are the variables used to assess our hypothesis.
- We use a box plot to compare weekly alcohol consumption between the two genders. From the plot, male students appear to drink more than female students. This is only a visual interpretation.
- There were no outliers or missing values in the data, so no imputation, deletion or capping was required.

# Box plot of weekly alcohol consumption (Walc) for each gender
boxplot(Walc ~ sex, data = alc1, xlab = "Gender", ylab = "Weekly alcohol consumption (Walc)")

Descriptive Statistics Cont.

We print a summary table of weekly alcohol consumption by gender.
In this section we also check for missing values (NA and NaN) so that they can be treated. Applying sum() to the results of is.na() and is.nan() shows that there are no missing values.

# Walc (the column) is summarised within each gender group;
# using the global vector week_alc would repeat the overall statistics in every row.
alc1 %>% group_by(sex) %>% summarise(Min = min(Walc, na.rm = TRUE),
Q1 = quantile(Walc, probs = .25, na.rm = TRUE),
Median = median(Walc, na.rm = TRUE),
Q3 = quantile(Walc, probs = .75, na.rm = TRUE),
Max = max(Walc, na.rm = TRUE),
Mean = mean(Walc, na.rm = TRUE),
SD = sd(Walc, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Walc))) -> table1
knitr::kable(table1)

sex	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
F	1	1	2	3	5	1.956731	1.055491	208	0
M	1	1	3	4	5	2.663102	1.417584	187	0

sum(is.na(alc1)) # checking missing values

## [1] 0

sum(is.nan(alc1$sex))

## [1] 0

sum(is.nan(alc1$Walc))

## [1] 0

Hypothesis Testing

We chose the Chi-square test of association because both variables are categorical: sex has two categories and Walc is recorded in five levels. A t-test is not appropriate because it compares the means of a continuous variable, and the Chi-square goodness-of-fit test only compares the counts of a single categorical variable against expected proportions. The Chi-square test of association therefore fits both the data and the research question.
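One requirement that can be checked directly is the size of the expected cell counts. The sketch below assumes the common rule of thumb that no more than 25% of expected counts should fall below 5; that threshold is an assumption here, not something stated in the data source. The observed counts are typed in by hand from the contingency table in this report rather than read from the CSV file, and the variable names are illustrative.

```r
# Observed counts copied from the contingency table in this report
# (rows: Walc levels 1-5, columns: gender). matrix() fills column by column.
observed <- matrix(c(57, 34, 35, 37, 24,
                     94, 51, 45, 14, 4),
                   nrow = 5,
                   dimnames = list(Walc = as.character(1:5),
                                   sex = c("Male", "Female")))

# Expected counts under independence, as returned by chisq.test()
expected <- chisq.test(observed)$expected

min_expected <- min(expected)        # smallest expected cell count
share_below_5 <- mean(expected < 5)  # proportion of cells with expected count < 5

min_expected
share_below_5
```

If every expected count is well above 5, the Chi-square approximation is usually considered reasonable for a table of this size.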

- We create a bar plot to visualise the data and then perform the Chi-square test.

- Null hypothesis (H0): There is no association in the population between gender and weekly alcohol consumption (independence).
- Alternative hypothesis (HA): There is an association in the population between gender and weekly alcohol consumption (dependence).

alc1$sex <- factor(alc1$sex, levels=c("M","F"),
                                  labels = c("Male","Female"))

table2 <- table(alc1$Walc,alc1$sex)
table2

##    
##     Male Female
##   1   57     94
##   2   34     51
##   3   35     45
##   4   37     14
##   5   24      4

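# Bars are grouped by gender; each bar is the number of students at one Walc level.
# args.legend must be a list so that horiz stays logical.
barplot(table2, ylab = "Number of students",
          ylim = c(0, 100), legend.text = rownames(table2), beside = TRUE,
          args.legend = list(x = "top", horiz = TRUE, title = "Weekly alcohol consumption level"),
          xlab = "Gender")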

chi2 <- chisq.test(table(alc1$Walc, alc1$sex))
chi2

## 
##  Pearson's Chi-squared test
## 
## data:  table(alc1$Walc, alc1$sex)
## X-squared = 37.364, df = 4, p-value = 1.516e-07

chi2$p.value

## [1] 1.515842e-07

chi2$observed

##    
##     Male Female
##   1   57     94
##   2   34     51
##   3   35     45
##   4   37     14
##   5   24      4

chi2$expected

##    
##         Male   Female
##   1 71.48608 79.51392
##   2 40.24051 44.75949
##   3 37.87342 42.12658
##   4 24.14430 26.85570
##   5 13.25570 14.74430

qchisq(p = .95,df = 4)

## [1] 9.487729
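The decision rule can be sketched from the numbers above: the test statistic is compared with the critical value, and the p-value is the upper-tail probability of the Chi-square distribution. The statistic is typed in from the test output, the 5% significance level is an assumption matching the `p = .95` used above, and the variable names are illustrative.

```r
chi_sq_stat <- 37.364  # X-squared reported by chisq.test() above
deg_free <- 4          # (5 rows - 1) * (2 columns - 1)
alpha <- 0.05          # assumed significance level

# Critical value: the 95th percentile of the Chi-square distribution
critical_value <- qchisq(p = 1 - alpha, df = deg_free)

# p-value: probability of a statistic at least this large if H0 is true
p_value <- pchisq(chi_sq_stat, df = deg_free, lower.tail = FALSE)

# H0 is rejected when the statistic exceeds the critical value
reject_h0 <- chi_sq_stat > critical_value

critical_value
p_value
reject_h0
```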

Hypothesis Testing Cont.

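- The Chi-square statistic $\chi^2$ is calculated as:

$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ is the observed count in the $i$th row of the $j$th column and $E_{ij}$ is the expected count assuming no association. $E_{ij}$ is calculated as:

$$E_{ij} = n \left(\frac{r_i}{n}\right)\left(\frac{c_j}{n}\right)$$

where $r_i$ refers to the total count of the $i$th row, $c_j$ is the total count of the $j$th column and $n$ is the total sample size.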
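As a worked illustration of the two formulas, the statistic can be computed by hand from the observed counts. This is a minimal sketch: the counts are re-entered from the contingency table in this report, and the variable names are illustrative.

```r
# Observed counts O_ij (rows: Walc levels 1-5, columns: gender)
observed <- matrix(c(57, 34, 35, 37, 24,
                     94, 51, 45, 14, 4),
                   nrow = 5,
                   dimnames = list(Walc = as.character(1:5),
                                   sex = c("Male", "Female")))

n <- sum(observed)                # total sample size
row_totals <- rowSums(observed)   # r_i
col_totals <- colSums(observed)   # c_j

# E_ij = n * (r_i / n) * (c_j / n), which simplifies to r_i * c_j / n
expected <- outer(row_totals, col_totals) / n

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

round(expected, 5)
chi_sq
deg_free
```

The result should agree with the `X-squared` and `df` values reported by `chisq.test()`, because no continuity correction is applied to tables larger than 2 x 2.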

Discussion

To understand the potential association and pattern, we plotted a box plot and a bar plot.

If there were no association between gender and weekly alcohol consumption, the proportion of students at each consumption level would be the same for both genders. In the bar chart above, this does not seem to be the case. Male students are less likely to be at level 1 (the lowest level of weekly alcohol consumption) and more likely to be at level 5 (the highest level). This is an example of a categorical association. In other words, the probability of being at a given alcohol level "depends" on gender. What we need to determine with a Chi-square test of association is whether this relationship is statistically significant or whether it reflects natural sampling variability assuming gender and alcohol consumption are independent (i.e. H0).
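Because the two groups differ in size (187 male and 208 female students), the comparison is easier to read as proportions within each gender. A minimal sketch, with the counts re-entered from the contingency table in this report and illustrative variable names:

```r
# Observed counts (rows: Walc levels 1-5, columns: gender)
observed <- matrix(c(57, 34, 35, 37, 24,
                     94, 51, 45, 14, 4),
                   nrow = 5,
                   dimnames = list(Walc = as.character(1:5),
                                   sex = c("Male", "Female")))

# Column proportions: share of each gender at each Walc level
# (margin = 2 makes every column sum to 1)
col_props <- prop.table(observed, margin = 2)

round(col_props, 3)
```

Under independence the two columns would be roughly equal at every level, so large gaps at levels 1 and 5 point to an association.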

- A Chi-square test of association was used to test for a statistically significant association between gender and weekly alcohol consumption. The test found a statistically significant association, χ2 = 37.364, df = 4, p < .001. The test statistic exceeds the critical value, χ2 > χ2crit, that is 37.364 > 9.49. The results of this study suggest that male students were more likely than female students to be in the higher weekly alcohol consumption levels.

Discussion cont.

- H0 was rejected.
- There is an association in the population between the two categorical variables: gender and weekly alcohol consumption are related. Male students are more likely than female students to be at level 5 (the highest consumption level), while female students are more often at level 1 (the lowest level).

For future work, we could collect more data, such as a column with blood alcohol content, a continuous measure whose mean can be compared between genders. We could then use Levene's test to check the equality of variances and a t-test to compare the means.

We conclude that there is an association between gender and weekly alcohol consumption: male students are more likely to be at level 5 of alcohol consumption.

References

- https://www.kaggle.com/uciml/student-alcohol-consumption