Student Alcohol Consumption

Author/s: Julian HEATH 480382056; Mulan ZHONG 470415520; Helena KE 470427198; Mikayla KIM 450404999

Project 2

University of Sydney | MATH1005 | September 2018

Executive Summary

The aim of this report is to detect whether there is any correlation between the environmental and personal factors of high school students and their likelihood of consuming alcohol. Our method of analysis was a cross-examination of variables such as sex, age, internet access, and parent occupation and education for Portuguese high school students against each student’s likelihood of consuming alcohol. The report also examines the impact of student alcohol consumption on overall examination performance, if these students were to drink during the weekday rather than the weekend. The results would be most useful for parents, teachers, and students because they provide insight into underlying or indirect causes of underage drinking. As an outcome of this research, more effective teaching and parenting approaches can be developed, improving the quality of education for students overall.

The main discoveries are that there is only a small difference in numbers between female and male students in the sample (208 female and 187 male, a difference of 21). Furthermore, both genders showed a larger demographic of drinkers at ages 16 and 17, which means that a large number of students are drinking illegally while underage (Portugal’s drinking age is 18). We also discovered that the socioeconomic background of most students was sufficient for around 83% of them to have paid internet access at home, although just over half (54%) did not pay for additional educational classes.

Analysis showed that there was no direct correlation between a parent’s education or occupation and the student’s level of alcohol consumption. We must consider that we used only an average of the level of consumption for each parent occupation and education level, so any outliers could be helpful in indicating whether certain jobs had a larger influence on a student’s likelihood of consuming alcohol.

Finally, our results also showed a negative correlation between exam performance and alcohol consumption: the more alcohol a student consumed, the more likely they were to achieve a lower score than students who abstained. Furthermore, we aimed to discover whether alcohol consumption affected the amount of time students dedicate to study, which would in turn influence exam performance. Our data suggests that an increase in alcohol consumption among students who do not usually drink had a larger influence on their study time than a further increase among those who already consumed a comparatively larger amount of alcohol.

Full Report

Initial Data Analysis (IDA)

##Set Working Directory
setwd("C:/Users/Helena/Desktop/MATH1005/Data/Project 2 - Alcohol Consumption")

##Select Data
student = read.csv("student-mat.csv", header = TRUE)
# Quick look at the first 6 rows of data
head(student)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          2         2        0       yes     no   no
## 2     course   father          1         2        0        no    yes   no
## 3      other   mother          1         2        3       yes     no  yes
## 4       home   mother          1         3        0        no    yes  yes
## 5       home   father          1         2        0        no    yes  yes
## 6 reputation   mother          1         2        0        no    yes  yes
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     2    2
## 4        yes     yes    yes      yes      yes      3        2     2    1
## 5         no     yes    yes       no       no      4        3     2    1
## 6        yes     yes    yes      yes       no      5        4     2    1
##   Walc health absences G1 G2 G3
## 1    1      3        6  5  6  6
## 2    1      3        4  5  5  6
## 3    3      3       10  7  8 10
## 4    1      5        2 15 14 15
## 5    2      5        4  6 10 10
## 6    2      5       10 15 15 15
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"
## [1] "data.frame"

Size of data

dim(student)
## R’s classification of variables
names(student)
class(student)
str(student$age) #quantitative continuous
str(student$sex) #qualitative nominal
str(student$Medu) #qualitative ordinal
str(student$Fedu) #qualitative ordinal
str(student$internet) #qualitative nominal
str(student$paid) #qualitative nominal
str(student$Dalc) #qualitative ordinal (1 = very low to 5 = very high)
str(student$Mjob) #qualitative nominal
str(student$Fjob) #qualitative nominal
str(student$studytime) #quantitative discrete
str(student$G1) #quantitative continuous
str(student$G2) #quantitative continuous
str(student$G3) #quantitative continuous

R’s classification of data
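One way to make these classifications explicit in R is to store nominal variables as factors and ordinal variables as ordered factors. The sketch below is only an illustration: `toy` is a small made-up data frame (not the real dataset) whose column names follow `student`, and `var_classes` is a name we introduce here.

```r
# Hypothetical toy data using the same column names as the student dataset
toy <- data.frame(
  sex  = c("F", "M", "F", "M"),
  age  = c(15L, 16L, 17L, 18L),
  Medu = c(0L, 2L, 4L, 3L),
  Dalc = c(1L, 1L, 3L, 2L)
)

# Nominal variable: unordered factor
toy$sex <- factor(toy$sex)

# Ordinal variables: ordered factors with an explicit level order
toy$Medu <- factor(toy$Medu, levels = 0:4, ordered = TRUE)
toy$Dalc <- factor(toy$Dalc, levels = 1:5, ordered = TRUE)

# Report the first class R assigns to each column
var_classes <- sapply(toy, function(col) class(col)[1])
print(var_classes)
```

Storing the levels this way means tables and plots keep the intended order (for example 1 to 5 for alcohol consumption) even when a level has no observations.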



The data came from (source of data)…

Data came from two different accessible sources:

Research paper on data mining to predict secondary school student performance

The initial data was collected by University of Minho professor Paulo Cortez and secondary school educator Alice Maria Goncalves Silva. Their data was collected during the 2005-2006 school year from two public schools in the Alentejo region of Portugal. The database was built from school reports, based on paper sheets, and questionnaires. The data was collected from 788 students and was used to predict secondary school exam performance. The source offers a higher level of validity because the variables used were reviewed by school professionals and the questionnaire was subsequently tested on a smaller sample of 15 students for feedback. The results are practical only to a degree, because the data was collected from just two schools in the same region of Portugal. We must consider factors such as the area’s socioeconomic level as well as the quality of education. If the data had been collected from a larger variety of schools in different areas, it would account for a wider range of potential confounding variables. Furthermore, assessing the students in only two subject areas, maths and Portuguese, limits the accuracy of the results.

Fabio Panott’s research on Prediction of Secondary School Students’ Alcohol Addiction Using Random Forest

This research paper was produced by professors of the College of Engineering, located in India, and can be found in the International Journal of Computer Applications. It refers to the research data collected by Paulo Cortez and Alice Silva, as mentioned above. The paper mentions that entries in which students did not reveal enough information were discarded to improve the reliability of the data, and that a final processing step discarded any inaccurate or inconsistent information. The data collected were anonymous, meaning that the researchers were less likely to have introduced bias.

Stakeholders: People interested in the results of this data would primarily be those who work in the education system, as well as parents and students. This information would be useful in predicting whether certain life factors have a direct influence on causing underage individuals to drink, and in seeing whether there are any correlations between underage drinking and exam performance. It could also aid in examining failure rates amongst students. Furthermore, this data could assist government research into how common underage drinking is in Portugal. Finally, the results could also help improve the quality of education and of the social and environmental factors affecting high school students, so that they may achieve high grades.

Domain knowledge: Studies have already shown, for example, that there is a positive association between alcohol consumption and decreased academic performance for students between 18 and 29. That research goes on to explain the importance of alcohol prevention activities so that students may abstain from heavy or consistent drinking. Further research has indicated that high levels of alcohol consumption are evident in the underage demographic, with students as young as year 9 having consumed alcohol on a regular basis. Our research into the variables and their possible correlation with alcohol consumption may help answer two questions: what might be the underlying causes of alcohol consumption amongst high school students, and what might be the impact of drinking on their academic achievements?

Research Question 1: Do the biological factors and socioeconomic status of a student influence alcohol consumption?

VARIABLES:

- Student alcohol consumption: dependent, qualitative, ordinal
- Sex: independent, qualitative, nominal
- Age: independent, quantitative
- Internet: independent, nominal
- Paid: independent, nominal

# Consider 1 qualitative variable sex
sex = student$sex # Isolate the variable sex
class(sex) # R's classification of sex
## [1] "factor"
table(sex) # Number of students of each sex
## sex
##   F   M 
## 208 187

# Consider 1 quantitative variable
age = student$age # Isolate the variable age
class(age) # R's classification of age
## [1] "integer"
ageN = as.numeric(age) # convert age into a numeric vector and label as ageN
class(ageN) # confirm the conversion
## [1] "numeric"
hist(ageN, main = "Alcohol Consumption", xlab = "age") # create a histogram of the students' ages
mean(ageN) # calculate the mean age of the students
## [1] 16.6962
median(ageN) # calculate the median age of the students
## [1] 17
abline(v = mean(ageN), col = "light blue") # label mean on histogram
abline(v = median(ageN), col = "purple") # label median on histogram

boxplot(ageN, main = "Alcohol Consumption") # create a boxplot
abline(h = mean(ageN), col = "light blue") # label mean on boxplot 
abline(h = median(ageN), col = "purple") # label median on boxplot 

# Consider 1 quantitative variable divided by 1 qualitative variable
# Control for sex
ageNF = ageN[student$sex == "F"]
ageNM = ageN[student$sex == "M"]
par(mfrow = c(2,1))
boxplot(ageNF,horizontal=T, col="light blue") #create a boxplot for female students 
boxplot(ageNM,horizontal=T) #create a boxplot for male students

# Consider 1 qualitative variable 
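internet = student$internet  # Isolate the variable internet
class(internet) # R's classification of internet
## [1] "factor"
table(internet) # Number of students with and without internet at home
## internet
##  no yes 
##  66 329
table(internet)/length(internet) # Proportion of students in each group
## internet
##        no       yes 
## 0.1670886 0.8329114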

# Consider 1 qualitative variable 
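paid = student$paid # Isolate the variable paid
table(paid) # Number of students with and without paid extra classes
## paid
##  no yes 
## 214 181
table(paid)/length(paid) # Proportion of students in each group
## paid
##        no       yes 
## 0.5417722 0.4582278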


The results from the study show that younger students (ages 16-17) are the ones who most commonly consume alcohol. Gender was not found to play a role in influencing alcohol consumption. Lastly, the results suggest that most students who consume alcohol have a middle socioeconomic status.

Sex - in a sample of 395 students there were 208 females and 187 males. Because the difference is only 21, there is no apparent correlation between alcohol consumption and gender based on the dataset. However, this study could have been improved by analysing the entire cohort (non-drinkers and drinkers) to see whether there are more females than males in the year group, in order to determine accurately whether there is no link.
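Whether a difference of 21 is "small" can be checked against an even split. The following is a minimal sketch, assuming we only want to test whether 208 females out of 395 students is compatible with a 50/50 sample; it uses the counts reported above and says nothing by itself about alcohol consumption. The variable names are ours.

```r
# Counts reported above: 208 female and 187 male students
n_female <- 208
n_male   <- 187

# Exact binomial test of H0: the proportion of female students is 0.5
res <- binom.test(n_female, n_female + n_male, p = 0.5)

print(res$estimate)  # observed proportion of female students
print(res$p.value)   # two-sided p-value for the even-split hypothesis
```

A large p-value here would only indicate that the sample is roughly balanced by sex; linking sex to drinking would need the alcohol levels of each group as well.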

Age - The histogram shows that 16 to 17 year olds most commonly consume alcohol. The mean age was 16.7 and the median age was 17. A box plot further revealed that the middle 50% of students who drink are between the ages of 16 and 18. Considering that the dataset ranged from 15 to 22 year olds, it is apparent that younger students make up most of those consuming alcohol.

Correlations between sex and age were then analysed. Comparing the boxplots for female and male students side by side shows that, with the exception of an outlier in the male group, their interquartile ranges are the same (16 to 18). What was unexpected, however, was that the median of the male boxplot sits on the lower quartile, which indicates a skewed male dataset.
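The quartiles and outliers read off a boxplot can also be extracted as numbers. The sketch below uses a small hypothetical data frame `toy` (not the real dataset) and asks `boxplot()` for its statistics without drawing anything; the row labels are added by us for readability.

```r
# Hypothetical ages and sexes (not the real dataset)
toy <- data.frame(
  age = c(15, 16, 16, 17, 18, 15, 16, 16, 17, 22),
  sex = c("F", "F", "F", "F", "F", "M", "M", "M", "M", "M")
)

# Five-number summaries per group, computed without drawing the plot
bp <- boxplot(age ~ sex, data = toy, plot = FALSE)
colnames(bp$stats) <- bp$names
rownames(bp$stats) <- c("lower whisker", "Q1", "median", "Q3", "upper whisker")

print(bp$stats)
print(bp$out)  # values drawn as outliers
```

In this toy sample the male median coincides with the lower quartile and the age of 22 is flagged as an outlier, which is the same pattern described for the male boxplot above.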

Next, socioeconomic status was examined broadly by looking at which student homes pay for internet and for extended education (paid). For internet, 0.83 (83%) of students have access at home, whilst 0.17 (17%) do not. For extended paid learning, 0.54 (54%) opt not to pay for extra education, whilst 0.46 (46%) do. From these results an assumption can be made that the students who consume alcohol have a middle socioeconomic status. The results on socioeconomic status are expected, given that students need to be able to afford alcoholic beverages. However, the method and analysis would most certainly be improved if more specific addresses (e.g. suburb), or whether the students were employed, were included in the dataset.
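The percentages quoted above follow directly from the counts in the frequency tables. As a short sketch (the vector names are ours; the counts are the ones printed in the output above):

```r
# Counts printed in the frequency tables above
internet_counts <- c(no = 66, yes = 329)
paid_counts     <- c(no = 214, yes = 181)

# prop.table() divides each count by the total
internet_prop <- prop.table(internet_counts)
paid_prop     <- prop.table(paid_counts)

print(round(internet_prop, 2))
print(round(paid_prop, 2))
```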

Research Question 2

Is there a link between student alcohol consumption and parent’s education and occupation?

Variables:

- Student Alcohol Consumption: dependent, qualitative, ordinal
- Parent’s Education: independent, qualitative, ordinal, non-binary
- Parent’s Jobs: independent, qualitative, nominal, non-binary

For Alcohol Consumption and Parents Education

##Select Data
data2 = rbind(student, student) #Merging dataset for Edu variable
## Select variables
Walc = student$Walc #Weekend alcohol consumption variable (Dalc is the workday variable)
Medu = student$Medu #Mother's education variable
Fedu = student$Fedu # Father's education variable
Edu = rbind(Medu, Fedu) #combines Medu and Fedu variable to form Edu variable (parent's education as a whole)
Alcohol = data2$Walc #Weekend alcohol consumption variable from the merged dataset
##Produce Graph
d1 = table(Edu,Alcohol) #Let d1 be the table
##    Alcohol
## Edu  1  2  3  4  5
##   0  2  1  1  1  0
##   1 55 32 26 15 13
##   2 88 44 47 28 11
##   3 78 39 41 21 20
##   4 79 54 45 37 12
rowSums(d1) #Sum of each row (horizontal)
##   0   1   2   3   4 
##   5 141 218 199 227
##Recreate table to manipulate
f = t(unclass(d1)) #Recreate d1 with alcohol levels as rows and education levels as columns
##Calculate average Student Alcohol Consumption Value for each Education Level
Dependent=c(1,2,3,4,5) #Alcohol Consumption
z=Dependent%*%f #Matrix Multiplication
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   11  322  484  463  530
Average = z/rowSums(d1) #Determining Mean Student Alcohol Consumption Value for each Education Level
##      [,1]     [,2]     [,3]     [,4]     [,5]
## [1,]  2.2 2.283688 2.220183 2.326633 2.334802
##Create Graph
barplot(Average, main = "Student Alcohol Consumption and Parent's Education",
xlab = "Parent's Education Level",
ylab = "Mean Student Alcohol Consumption")
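One detail worth checking in the code above is how `rbind(Medu, Fedu)` lines up with the doubled alcohol vector. R stores a matrix column by column, so a matrix built with `rbind()` flattens to mother, father, mother, father, and so on, whereas the doubled alcohol vector runs through all students once and then again. The sketch below uses three made-up students to show the difference; `c(Medu, Fedu)` is offered only as one possible alternative, not as the method used in this report.

```r
# Hypothetical values for three students (not the real data)
Medu <- c(1, 2, 3)   # mother's education
Fedu <- c(4, 4, 4)   # father's education
Walc <- c(1, 3, 5)   # student's alcohol consumption

# Doubled alcohol vector: students 1..3, then students 1..3 again
Alcohol <- c(Walc, Walc)

# rbind() builds a 2 x 3 matrix; flattened column-wise it interleaves the parents
edu_interleaved <- as.vector(rbind(Medu, Fedu))

# c() keeps all mothers first and then all fathers, matching the order of Alcohol
edu_stacked <- c(Medu, Fedu)

# Compare how each version pairs a parent with a student's alcohol level
print(data.frame(Alcohol, edu_interleaved, edu_stacked))
```

In the `edu_stacked` column, each parent sits on the same row as their own child's alcohol level; in the `edu_interleaved` column some parents are paired with a different student.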

For Alcohol Consumption and Parent’s Occupation

## Select variables
Mjob = student$Mjob #Mother's job variable
Fjob = student$Fjob # Father's job variable
Job = rbind(Mjob, Fjob)
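##Produce Graph
d2 = table(Job, Alcohol) #Let d2 be the table of occupation by alcohol consumption
d3 = rowSums(d2) #Sum of each row (horizontal), saved for the averages below
d3
##   1   2   3   4   5 
##  79  52 358 214  87
##Recreate table to manipulate (rows = alcohol levels 1-5, columns = occupations 1-5)
a2 = c(27, 26, 140, 81, 28)
b2 = c(21, 10, 71, 43, 25)
c2 = c(16, 4, 75, 46, 19)
d2 = c(9, 7, 43, 30, 13) #Reuses the name d2; the row sums are already stored in d3
e2 = c(6, 5, 29, 14, 2)
f2=rbind(a2,b2,c2,d2,e2) #Recreate the table with alcohol levels as rows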
##    [,1] [,2] [,3] [,4] [,5]
## a2   27   26  140   81   28
## b2   21   10   71   43   25
## c2   16    4   75   46   19
## d2    9    7   43   30   13
## e2    6    5   29   14    2
##Calculate average Student Alcohol Consumption Value for each Occupation
Dependent=c(1,2,3,4,5) #Alcohol Consumption
w=Dependent%*%f2 #Matrix Multiplication
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  183  111  824  495  197
Average2= w/d3 #Determining Mean Student Alcohol Consumption Value for each Occupation
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 2.316456 2.134615 2.301676 2.313084 2.264368
##Create Graph
barplot(Average2, main = "Student Alcohol Consumption and Parent's Occupation",
xlab = "Parent's Occupation",
ylab = "Mean Student Alcohol Consumption")

Summary:

Parent’s Education

- Sample of 790 parents (395 students)
- 0.6% no formal education
- 17.8% up to 4th grade
- 27.6% 5th to 9th grade
- 25.2% secondary education
- 28.7% higher education

For Student Alcohol Consumption (SAC) and Parent’s Education

To determine the link between Student Alcohol Consumption and Parent’s Education, SAC was treated as a quantitative discrete variable:

- 1 = very low
- 2 = low
- 3 = moderate
- 4 = high
- 5 = very high

The mean (average) value of SAC was determined for each level of formal education undertaken by parents.

Findings (SAC value):

- No formal education: 2.200
- Up to 4th grade: 2.284
- 5th to 9th grade: 2.220
- Secondary: 2.327
- Higher: 2.335

As the SAC values are relatively similar (all between 2.20 and 2.35), these findings do not suggest a clear relationship between SAC and a parent’s formal education level. However, it should be noted that SAC appears to be slightly higher where parents have completed higher levels of formal education. This unexpected result may be explained by the small sample size and lack of diversity: only 5 of the 790 parents sampled had no formal education, compared with 218 parents who completed 5th to 9th grade. The comparison between education levels is therefore unreliable, because the groups are so unevenly sized. This study could be improved if a large and relatively even sample of parents from each education level, together with their child’s alcohol consumption, were investigated.

Parent Occupation

- Sample of 790 parents (395 students)
- 10.0% at home
- 6.6% health related
- 45.3% other
- 27.1% civil services
- 11.0% teacher

For Student Alcohol Consumption (SAC) and Parent Occupation

To determine the link between Student Alcohol Consumption and Parent’s Occupation, SAC was again treated as a quantitative discrete variable on the same scale from 1 (very low) to 5 (very high), and the mean (average) value of SAC was determined for each occupation type.

Findings (SAC value):

- At home (10.0% of parents): 2.316
- Health related (6.6%): 2.135
- Other (45.3%): 2.302
- Civil services (27.1%): 2.313
- Teacher (11.0%): 2.264

As these SAC values are relatively similar (between 2.10 and 2.35), these findings do not suggest a relationship between SAC and parent’s occupation.

Again, the data could be improved by obtaining a larger sample with an even distribution of parents across the different occupations.
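The mean SAC values reported for parent's education are weighted means: each SAC level is multiplied by the number of parent-student pairs at that level, and the total is divided by the group size. A minimal sketch of that arithmetic follows, re-entering by hand the education-by-alcohol counts printed earlier in this report; the names `counts`, `levels_sac` and `mean_sac` are ours.

```r
# Counts from the Edu x Alcohol table printed earlier in this report
# (rows = parent's education level 0-4, columns = SAC level 1-5)
counts <- matrix(
  c( 2,  1,  1,  1,  0,
    55, 32, 26, 15, 13,
    88, 44, 47, 28, 11,
    78, 39, 41, 21, 20,
    79, 54, 45, 37, 12),
  nrow = 5, byrow = TRUE,
  dimnames = list(Edu = as.character(0:4), Alcohol = as.character(1:5))
)

levels_sac <- 1:5

# Weighted mean per education level: sum(level * count) / sum(count)
mean_sac <- as.vector(counts %*% levels_sac) / rowSums(counts)
print(round(mean_sac, 3))
```

The rounded values match the findings listed for parent's education (2.200, 2.284, 2.220, 2.327 and 2.335).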

Research Question 3: Does an increased level of alcohol consumption cause a decrease in exam performance?

By comparing (1) the level of alcohol consumption with exam performance, (2) the level of alcohol consumption with hours of study, and (3) hours of study with exam performance, it can be determined firstly whether there is a negative correlation between exam performance and alcohol consumption, and secondly whether study is a confounding variable. If the relationship between alcohol consumption and study time matches the relationship between study time and exam performance, it could be said that students who drink more study less, and that this leads to poor results. However, if the relationships do not completely align, it may be the case that alcohol consumption directly affects a student’s exam performance.

Is there a negative correlation between level of alcohol consumption and exam performance?

Boxplot of the distribution of exam marks at different alcohol consumption levels for each of the three exams:

boxplot(student$G1 ~ student$Walc, ylab = "First Exam Mark", xlab = "Alcohol Consumption")

boxplot(student$G2 ~ student$Walc, ylab = "Second Exam Mark", xlab = "Alcohol Consumption")

boxplot(student$G3 ~ student$Walc, ylab = "Final Exam Mark", xlab = "Alcohol Consumption")


In the first exam, students who marked their consumption at 1 or 2 achieved a median result above 50%, while those who marked themselves at a 3-5 had a median just below 50%. It should also be noted that an increased level of drinking correlated with a decrease in the mark of the 75th percentile. However, the mark at the 25th percentile stays stable. The interquartile range also shrinks at a consumption level of 4-5 with the box plot becoming skewed to the right. It seems there is some negative correlation between alcohol consumption and exam performance in the first exam, but the trend is not very strong.

In the second exam there is a more linear correlation between the level of alcohol consumption and the median mark in the exam. This trend points to a much clearer negative correlation between drinking and exam performance. Similarly, the mark at the 75th percentile also trends down from an alcohol consumption level of 1 to 4. Even though this trend is not followed at the 5th level of consumption, this boxplot is skewed, with the lower quartile group (from the 25th to 50th percentile) much more clustered than the upper quartile group.

In the Final exam the trend is quite similar to that of the first exam, with a noticeable decrease in median mark between students who drink at a level of 1-2 and then those who drink more. It appears the impact on performance becomes more prominent over this threshold. The mark at the 75th percentile also decreases with an increased consumption from levels 1-4 further supporting this correlation.

While increased alcohol consumption correlates negatively with exam performance, it could be that the confounding variable of study time is the more important factor. Potentially, students who drink less study more, and thus the impact on exam performance is more directly caused by study.
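The medians and quartiles read from the boxplots can also be tabulated directly, which makes comparisons between consumption levels easier to state. The sketch below uses a small hypothetical data frame `toy` (the marks and levels are made up, not taken from the dataset); only the column names follow `student`.

```r
# Hypothetical marks (0-20) and alcohol levels; not the real data
toy <- data.frame(
  Walc = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  G3   = c(14, 12, 10, 13, 11, 9, 11, 10, 8, 10, 9, 7)
)

# Lower quartile, median and upper quartile of the mark within each alcohol level
group_summary <- t(sapply(split(toy$G3, toy$Walc), quantile,
                          probs = c(0.25, 0.5, 0.75)))
print(group_summary)
```

Each row of `group_summary` corresponds to one alcohol consumption level, so a downward trend in the "50%" column corresponds to the falling medians seen in a boxplot.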

Is there a negative correlation between the level of alcohol consumption and hours studied?

Boxplot of the hours studied at different alcohol consumption levels:

boxplot(student$studytime ~ student$Walc, xlab = "alcohol consumption", ylab = "hours of study")


There appears to be a strong negative correlation between alcohol consumption and study time. While the median study time remains stable at 2 hours, and the interquartile range also consistently spans two hours, the skew of the plot changes dramatically between a consumption level of 1 and 2. At an alcohol consumption level of 1, the median and 25th percentile are the same value of 2 hours of study. This means that at least a quarter of these students study for exactly 2 hours, while the upper half study for 2 hours or more. While there is a clear skew to the right in the middle 50%, there is a normal distribution at both ends.

However, an increase in alcohol consumption level from 1 to 2 leads to a very significant change in the distribution of study time. While the median and the width of the interquartile range remain the same, the box shifts down: the upper quartile decreases to the median, while the lower quartile decreases to 1 hour of study. This is a strong skew to the right, with 75% of students studying 2 hours or less. Students with an alcohol consumption level of 1 studied a maximum of 4 hours, while students with a higher alcohol consumption studied a maximum of 3 hours.

An increase in alcohol consumption from 2-5 causes no change in the distribution of study. Thus, an increase in alcohol consumption from 1 to 2 correlates with a significant decrease in study, but any further increase in consumption has no correlation.
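Because both alcohol consumption and study time are recorded on ordered scales, a rank correlation is one way to put a single number on the relationship described above. This is a sketch on hypothetical data only (the values in `toy` are made up); Spearman's method is our choice for illustration, not something used in the analysis above.

```r
# Hypothetical alcohol levels and study-time values; not the real data
toy <- data.frame(
  Walc      = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 5),
  studytime = c(4, 3, 3, 2, 2, 2, 2, 1, 1, 1)
)

# Spearman rank correlation: suitable for ordinal variables with ties
rho <- cor(toy$Walc, toy$studytime, method = "spearman")
print(rho)

# Approximate test of H0: no monotonic association (exact = FALSE because of ties)
rank_test <- cor.test(toy$Walc, toy$studytime, method = "spearman", exact = FALSE)
print(rank_test$p.value)
```

A value of `rho` close to -1 would indicate that higher consumption levels go with lower study time; a value near 0 would indicate no monotonic trend.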

Is there a correlation between study time and exam performance?

Boxplot of the distribution of exam marks with different amounts of study:

boxplot(student$G1 ~ student$studytime, xlab = "hours of study", ylab = "first exam results")

boxplot(student$G2 ~ student$studytime, xlab = "hours of study", ylab = "second exam results")

boxplot(student$G3 ~ student$studytime, xlab = "hours of study", ylab = "final exam results")


Across all three exams there is almost no difference in the distribution of marks for students who studied 1-2 hours. However, for students who study more than 2 hours, the median score, lower quartile and upper quartile mark are all higher across the board. There doesn’t seem to be a clear trend between 3 and 4 hours of study.


  1. A comparison between an alcohol level of 1 or higher

Students who consume the lowest level of alcohol study more than those who consume any more. If hours of study were a confounding variable and the main cause of exam results, this would imply a clear difference between the results of alcohol consumers in this lowest bracket and the others. This is not the case, however, as an increase in consumption to level 2 has little correlation with exam performance. On the other hand, students who study for 3-4 hours appear to perform better, and these students all consume in the lowest bracket. It could be that another confounding variable, like personality or intelligence, is a factor: for example, high IQ students may tend to avoid alcohol (and thus fall in the lowest bracket) and be more likely to study. It is therefore not simply the case that students who drink the least study the most and thus outperform the others, who all study the same amount.

  2. A comparison between an alcohol level from 2-5

75% of students who consume alcohol at a level of 2 or higher study 2 hours or less. When comparing hours of study with exam performance, there is no difference in the distribution of marks between 1 and 2 hours of study. It may be the case that this level of study has too negligible an impact on a student’s knowledge to improve exam performance. However, if study were a confounding variable and there were no causation between alcohol consumption and exam results, it would be expected that all students who consumed alcohol at levels 2-5 would achieve a similar distribution of results. This is not the case. In fact, there is a clear negative correlation between increased consumption at those levels and exam performance. For students who consume alcohol at a level of 2 or higher, the level of consumption has a greater correlation with performance than study time. This implies that there may be a causal link between consumption at these higher levels and performance in exams.
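The reasoning above compares boxplots to judge whether study time accounts for the link between alcohol and marks. One alternative way to ask the same question is to fit a regression with and without study time and see how much the alcohol coefficient changes. The sketch below is illustrative only: `toy` is hypothetical data, the ordinal scales are treated as numeric (an assumption), and a regression on observational data cannot establish causation by itself.

```r
# Hypothetical data; only the column names follow the student dataset
toy <- data.frame(
  Walc      = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 2),
  studytime = c(4, 3, 2, 2, 1, 2, 1, 2, 1, 1, 2, 3),
  G3        = c(16, 14, 13, 12, 11, 11, 10, 9, 8, 7, 8, 13)
)

# Unadjusted model: final mark explained by alcohol level only
fit_simple <- lm(G3 ~ Walc, data = toy)

# Adjusted model: alcohol level and study time entered together
fit_adjusted <- lm(G3 ~ Walc + studytime, data = toy)

print(coef(fit_simple))
print(coef(fit_adjusted))
```

If the `Walc` coefficient shrinks towards zero once `studytime` is included, study time is absorbing part of the association; if it stays about the same, the association with alcohol is not explained by study time in that sample.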

In conclusion, small levels of alcohol consumption have some degree of correlation with improved exam marks but are not a clear causal factor. On the other hand, a high level of alcohol consumption is associated with a decrease in exam marks and may be a cause of it. It could be that low levels of alcohol consumption go along with a lifestyle suited to high school performance, while high levels directly harm a student’s ability to perform.


Style: APA

Ansari, W., Stock, C., & Mills, C. (October 2013). Is Alcohol Consumption Associated with Poor Academic Achievement in University Students? International Journal of Preventive Medicine, 4(10), 1175-1188.

Bremner, P., Burnett, J., Nunney, F., Ravat, M., & Mistral, W. (June 2011). Young People, Alcohol, and Influences. Retrieved from

Cortez, P., & Silva, A. (April 2008). Using Data Mining to Predict Secondary School Student Performance. Retrieved from

Hariharan, B., Krithivasan, R., & Deborah, A. (September 2016). Prediction of Secondary School Students’ Alcohol Addiction using Random Forest. International Journal of Computer Applications, 149(6), 0975-8887.

UCI Machine Learning. (2016). Student Alcohol Consumption. Retrieved from

Personal reflection on group work

My contribution: In this group project, I contributed by working alongside my group members to find the Student Alcohol Consumption (SAC) dataset on Kaggle. My role in the group was also to focus on research question 2. To answer this, I treated the SAC variable as a discrete quantitative variable. Through matrix multiplication and division, I was able to determine the mean SAC values for each level of formal education achieved by the parents and created a graph. This method was repeated with SAC and parent occupation.

What I learnt about group work: I personally felt that this group project was very successful, as all group members contributed equally and communicated well. Through this group project, I realised that if all members of the group contribute and collaborate successfully, a lot of work can be completed within a short period of time. I was also able to improve my critical and statistical thinking skills. To create my data visualisation, I could not simply copy code that I had seen in lecture slides or tutorials. I was forced to do personal research and undergo trial and error to manipulate my data so that I could present my findings in a coherent and easily understandable manner.