INTRODUCTION

Secondary and higher education are two extremely important factors in determining the trajectory of a young person’s life. These early years are among the most developmental and influential years for people, and for this reason, can be extraordinarily stressful and anxiety-inducing at times. Students, in turn, develop coping mechanisms and ways to handle this stress, with one significant stress-reliever being the consumption of alcohol. For some students, the consumption of alcohol may be a way to ease their nerves so that they do not become overwhelmed by the stress of their education. For other students, however, the consumption of alcohol can be an escape that causes them to ignore the importance of schooling.

Among students of many different ages and education levels, alcohol consumption has been known to affect how much time students devote to studying and completing school assignments, thus affecting a student’s overall success in their academic setting. As a group of young adults pursuing higher education, we were interested in examining the effects of alcohol consumption on education for ourselves. We all believe education to be extremely important, and it is no secret that alcohol consumption levels tend to be high among college students. To examine this relationship, we approached our data in two different but related ways. Firstly, we wanted to understand if we could predict student alcohol consumption given specific identifying information about a student. Secondly, we wanted to explore if we could predict a student’s academic success given information about that student, including data on their levels of alcohol consumption.

The first question we asked is as follows: Which regressors in the data set are the best in predicting a student’s total weekly alcohol consumption? Does the total weekly consumption of alcohol vary among students who do or do not want to pursue higher education? These questions are important for multiple reasons and yield intriguing results for students and instructors alike. For students, this question could help them understand if their levels of alcohol consumption are appropriate for their academic goals. This information could also be helpful in identifying the underlying reasons why students feel the need to drink – academics are certainly not the only stressors in a young person’s life, and understanding what factors play a role in determining alcohol consumption levels could help a student understand themselves better. While we certainly have no experience or expertise in the realm of educational instruction, we acknowledge that part of an instructor’s role is to cater to their students and prepare them for success in their futures. Instructors likely will never know how much alcohol their students consume but might benefit from knowing if the way they are teaching students is leading to unhealthy coping mechanisms. The years of secondary and higher education are arguably the most formative years of a person’s life, and it is critical that students during these years learn to appropriately handle stress and worry in their lives.

The second question we asked is similar to the first but carries significance of its own: What are the best predictor variables for a student’s final grade in the class? Does the distribution of final grades vary among students who consume different amounts of alcohol each week? Like the first set of questions, these are crucial for students who are seeking good grades in their classes. Many factors go into a student’s success in the classroom, such as the amount of time that can be devoted to studying and completing school assignments. Other factors include but are not limited to gender, socioeconomic status, parental education level, and possibly even alcohol consumption. This list is not exhaustive, but each of these factors could be significant in determining how well a student does in their classes. While grades may not always be the best indicator of how well a student is learning, they are important and are an effective way to quantify understanding and success in the classroom. Especially for students who want to pursue higher education, these grades can determine where they will pursue it, or whether they can pursue it at all. Looking at this question from an alcohol perspective, we wanted to see if students who consume more alcohol per week tend to have less success in an educational setting based on their final grades.

DATA

The dataset we used, Student Alcohol Consumption, was uploaded by UC Irvine onto Kaggle through their Machine Learning Repository. This repository is a collection of databases and generators created by David Aha in 1987 so that the machine learning community could evaluate its algorithms. It has been cited over 1000 times, making the repository one of computer science’s top 100 most cited papers. This particular dataset was obtained four years ago through a survey of students in math and Portuguese language courses in secondary school. While the data consisted of two separate sets for math and Portuguese students, we chose to focus on the math set to avoid counting students who appear in both sets twice. The math set consists of 395 observations, each representing a student who attended either Gabriel Pereira or Mousinho da Silveira secondary school. Attributes collected include school, sex, age, family size, parental education, extracurricular activities, amount of free time after school, and how often the student goes out with friends. These factors could be used to predict the students’ final grades or their alcohol consumption. The first six rows of the data are shown below.

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher course mother 2 2 0 yes no no no yes yes no no 4 3 4 1 1 3 6 5 6 6
GP F 17 U GT3 T 1 1 at_home other course father 1 2 0 no yes no no no yes yes no 5 3 3 1 1 3 4 5 5 6
GP F 15 U LE3 T 1 1 at_home other other mother 1 2 3 yes no yes no yes yes yes no 4 3 2 2 3 3 10 7 8 10
GP F 15 U GT3 T 4 2 health services home mother 1 3 0 no yes yes yes yes yes yes yes 3 2 2 1 1 5 2 15 14 15
GP F 16 U GT3 T 3 3 other other home father 1 2 0 no yes yes no yes yes no no 4 3 2 1 2 5 4 6 10 10
GP M 16 U LE3 T 4 3 services other reputation mother 1 2 0 no yes yes yes yes yes yes no 5 4 2 1 2 5 10 15 15 15

For the first question, the two primary variables we used were Dalc, which represents weekday alcohol consumption with a numeric value from 1 to 5, and Walc, which represents weekend alcohol consumption with a numeric value from 1 to 5. We added these two variables together and divided the sum by 2 to create a new variable that represents total weekly alcohol consumption on a scale of 1 to 5. We used the following variables as the pool of possible predictors from which to identify the best model: sex, age, Pstatus, guardian, studytime, failures, paid, activities, higher, internet, romantic, famrel, freetime, goout, health, and absences. The variable sex is binary and describes gender, age is a numeric value from 15 to 22, Pstatus represents whether or not the student’s parents cohabitate, and guardian describes the student’s guardian as mother, father, or other. The variable studytime describes weekly study time on a scale of 1 to 4 (1 for less than 2 hours, 2 for 2 to 5 hours, 3 for 5 to 10 hours, and 4 for more than 10 hours), failures is the number of past class failures, paid represents extra paid classes within the course subject, and activities is a binary variable describing whether or not the student has extracurricular activities. The variable internet represents whether or not the student has internet access at home, romantic is a binary variable stating whether or not the student is in a relationship, and famrel describes the quality of family relationships from 1 to 5. Finally, freetime represents the student’s free time after school on a scale of 1 to 5, goout measures how often a student goes out with friends on a scale of 1 to 5, health is the student’s health status (1-5), and absences is the number of school absences from 0 to 93. For the follow-up question that examines the relationship between the desire to pursue higher education and total weekly consumption, we considered the binary variable higher, which indicates with Yes or No whether the student wants to pursue higher education. With these variables working together, we were able to obtain insights into the relationship between alcohol consumption and higher education.
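As a minimal sketch of how the combined consumption variable can be built (this is not the code used for the analysis), the Python below averages Dalc and Walc for the six sample rows shown in the table above. The function name weekly_alcohol and the dictionary layout are hypothetical choices made for illustration.

```python
# Dalc (weekday) and Walc (weekend) scores copied from the six sample rows above.
sample_rows = [
    {"Dalc": 1, "Walc": 1},
    {"Dalc": 1, "Walc": 1},
    {"Dalc": 2, "Walc": 3},
    {"Dalc": 1, "Walc": 1},
    {"Dalc": 1, "Walc": 2},
    {"Dalc": 1, "Walc": 2},
]


def weekly_alcohol(row):
    """Average the weekday and weekend scores so the result stays on the 1-to-5 scale."""
    return (row["Dalc"] + row["Walc"]) / 2


weekly = [weekly_alcohol(row) for row in sample_rows]
print(weekly)  # [1.0, 1.0, 2.5, 1.0, 1.5, 1.5]
```

Note that a simple average gives the weekday and weekend scores equal weight, even though a week contains more weekdays than weekend days.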

For the second question, we chose to explore the best predictor variables for a student’s final grade in the class. The relevant grade variables are G1, G2, and G3, which represent the student’s first-period grade, second-period grade, and final grade respectively on a scale of 0 to 20. We chose to predict G3 since this value represents the student’s final performance. The variables we considered as predictors of G3 were failures, Mjob, sex, goout, Medu, romantic, famsup, studytime, absences, schoolsup, age, and famsize. Several of these were not used in the first question: Mjob describes the mother’s job with the options of teacher, health care related, civil services, at home, and other, and Medu describes the mother’s education as 0 (none), 1 (primary education), 2 (5th to 9th grade), 3 (secondary education), or 4 (higher education). The variable famsup is binary and describes whether or not the student has family educational support, schoolsup is a binary variable stating whether or not the student gets extra educational support, and famsize is a binary variable with less than or equal to 3 and greater than 3 as options. We used these variables as our pool of candidate predictors for building the most accurate model.

RESULTS

QUESTION 1

To begin answering our first question (predicting weekly alcohol consumption), we first had to take a look at all of our different variables and discover which of them are the best predictors. This process was difficult and time-consuming, as the data has many different variables which we hypothesized might be significant predictors. First, we began by considering models without interaction terms and without polynomial terms to get an understanding of which variables we might want to give attention to. To do this, we created five models using different combinations of variables we suspected might be significant in predicting weekly alcohol consumption. We included models with many variables along with models with far fewer variables – the five models we created shared many variables, while some variables were unique to specific models. We ensured that all five models were different from each other before undergoing specific statistical analyses on them.

The largest model included the following variables: sex, age, Pstatus, guardian, studytime, failures, paid, activities, higher, internet, romantic, famrel, freetime, goout, health, and absences. The other four models included subsets of those same variables. To compare these five models, we created predictions for weekly alcohol consumption using each model and compared the accuracy of the results. One metric we used to judge the effectiveness of these models was the root mean squared error (RMSE). We chose this metric because it measures the typical size of the residuals, that is, how far the predictions fall from the observed values, and we wanted a model whose predictions of weekly alcohol consumption had the smallest errors. The model with the full set of predictors (the ones listed above) had the lowest RMSE. Below is a table displaying the RMSE values for each of the five models we created.

Model      RMSE
Model 1    0.8111433
Model 2    0.8517678
Model 3    0.8740960
Model 4    0.8649660
Model 5    0.8382558
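For readers who want to reproduce values like those in the table above, here is a minimal Python sketch of the RMSE calculation. It assumes the mean is taken over all n observations (some software divides by the residual degrees of freedom instead), and the actual and predicted values below are invented for illustration rather than taken from our models.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical weekly consumption scores and model predictions.
actual = [1.0, 2.5, 1.5, 4.0]
predicted = [1.5, 2.0, 1.5, 3.0]
print(rmse(actual, predicted))  # square root of 0.375
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.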

We did not, however, want to only consider RMSE when evaluating which of these models was most appropriate for the data. One other thing we sought to look at was the distribution of residuals for each model. Below are histograms of residuals for each of the five models.

We did notice that these histograms appear to be slightly right-skewed, indicating there are some instances where the models significantly underpredict weekly alcohol consumption. However, since the residuals for each model were centered at zero, we went on to consider more complex models to predict weekly alcohol consumption. Based on RMSE and the histograms of residuals, we decided that the fifth “reduced” model would be good to work with moving forward: it had the second-lowest RMSE while using fewer predictors than the full model, which would simplify the search for potential interaction terms or polynomial regressors.
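The two residual checks described above (centering at zero and right skew) can also be summarized numerically. The following Python sketch is illustrative only: the data are invented, the residual is defined as actual minus predicted, and the skewness formula is the simple moment-based version, which may differ slightly from what a given statistics package reports.

```python
import statistics


def residuals(actual, predicted):
    """Residual = actual - predicted, so a positive value is an underprediction."""
    return [a - p for a, p in zip(actual, predicted)]


def skewness(values):
    """Moment-based skewness; positive values indicate a longer right tail."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Hypothetical scores: one student drinks far more than the model predicts.
actual = [1.0, 1.0, 1.5, 2.0, 4.5]
predicted = [1.5, 1.5, 1.5, 2.5, 3.0]

res = residuals(actual, predicted)
print(statistics.mean(res))  # centered at zero
print(skewness(res))         # positive: right-skewed
```

A mean near zero with positive skewness matches the pattern we saw in the histograms: most predictions are slightly high, and a few are much too low.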

Our process for looking at interaction terms and potential polynomial regressors was similar to the approach we used for the initial untransformed variables: we created models with different interactions and polynomial regressors and analyzed their effectiveness. We created four different models here: one with no polynomial or interaction terms, one with only polynomial terms, one with only interaction terms, and one with both types of terms. After creating these models, generating predictions with them, and calculating the RMSE for each, we found that the model with both types of terms had the lowest RMSE, although the differences among the four models were small. The RMSEs for these models are listed below.

Model                                           RMSE
Reduced (no interaction or polynomial terms)    0.8382558
Interaction (no polynomial terms)               0.8362003
Polynomial (no interaction terms)               0.8370526
Both interaction and polynomial terms           0.8351100

Our final objective in this first question was to see if weekly alcohol consumption varies among students who do or do not want to pursue higher education. Coming to a conclusion here was difficult, as there were only 20 such observations in the data where a student did not plan to pursue higher education. Nevertheless, we tried to effectively answer this question we posed. We started by looking at two side-by-side boxplots of weekly alcohol consumption levels between students who do and do not plan on pursuing higher education.

While the boxplots aren’t extremely telling in and of themselves, we noticed two important things: firstly, the median consumption for students who do not plan on pursuing higher education was much higher than that of students who do; secondly, among students who did plan to pursue higher education, a weekly consumption level of five was flagged as an outlier. To investigate this further, we did a t-test for difference in means between the two groups.

This led to us failing to reject the null hypothesis that the true difference in means between the two groups is equal to zero; in other words, a 90% confidence interval for the difference in means for the two groups did include zero. Our original hypothesis that these means would significantly differ was therefore not supported. Failing to reject the null hypothesis does not show that the means are equal, however, and we suspect the small number of students who do not plan to pursue higher education could have led to this result. We still find it interesting that the sample means differ as much as they do.
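To make the test above concrete, here is a hedged Python sketch of a two-sample t-test for a difference in means. It assumes Welch’s unequal-variance version (the report does not depend on that choice, and other software defaults may differ), the two small groups are invented, and the critical value t_crit is a number the reader looks up in a t table rather than something the code computes.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return (difference in means, standard error, t statistic, Welch df)."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    se = math.sqrt(va / na + vb / nb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
    )
    return diff, se, diff / se, df


def confidence_interval(diff, se, t_crit):
    """Interval for the difference in means given a table critical value."""
    return diff - t_crit * se, diff + t_crit * se


# Hypothetical weekly consumption scores for the two groups.
no_higher = [1.0, 2.0, 3.0, 2.0, 4.0]
higher = [1.0, 1.5, 1.0, 2.0, 1.5, 1.0]

diff, se, t_stat, df = welch_t(no_higher, higher)
# 2.015 is the two-sided 90% critical value for 5 degrees of freedom,
# used here as an approximation because the Welch df is close to 5.
low, high = confidence_interval(diff, se, t_crit=2.015)
print(diff, t_stat, df, (low, high))
```

In this invented example the sample means differ by about one point, yet the 90% interval still contains zero, which illustrates how a small group can leave a sizable difference statistically inconclusive.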

Final single predictors

  • sex
  • paid
  • goout

Final interaction terms

  • studytime * famrel

Final polynomial terms

  • studytime ^ 2
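The terms listed above can be turned into model inputs as in the Python sketch below. This is an illustration under stated assumptions, not our fitting code: the function name question1_features, the 0/1 encodings for sex and paid, and the term labels are all hypothetical, and only the terms in the lists above are built. The example record uses values from the sixth sample row of the data table.

```python
def question1_features(student):
    """Build the single, interaction, and polynomial terms for one student."""
    studytime = student["studytime"]
    return {
        "sex_M": 1 if student["sex"] == "M" else 0,        # assumed 0/1 coding
        "paid_yes": 1 if student["paid"] == "yes" else 0,  # assumed 0/1 coding
        "goout": student["goout"],
        "studytime*famrel": studytime * student["famrel"],  # interaction term
        "studytime^2": studytime ** 2,                      # polynomial term
    }


# Sixth sample row: male, takes paid classes, studytime 2, famrel 5, goout 2.
student = {"sex": "M", "paid": "yes", "goout": 2, "studytime": 2, "famrel": 5}
print(question1_features(student))
```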

QUESTION 2

Final grades are very important in an educational setting, so with this question, we aimed to see if we could use any combination of these predictors to effectively predict a student’s final grade. To start, we wanted to find an initial set of predictors. Before doing this, we chose to omit the G1 and G2 variables. While these would certainly be excellent predictors of the final grade, they are so highly correlated with it that they would dominate the model, and we wanted to find the variables that drive those grades in the first place (we expect that the set of predictors we used to predict the final grade would also be good predictors of the first-period and second-period grades). To find an initial set of predictors, we used stepwise regression. Doing this gave us the following initial set: failures, Mjob, sex, goout, Medu, romantic, famsup, studytime, absences, schoolsup, age, and famsize. From here, we followed a process similar to the one we used for our first question: we created some reduced models with subsets of those predictors and measured their effectiveness against each other.

We tested their effectiveness in two ways – first, we did a series of nested F-tests to determine if all the variables we had in the model were significant. This resulted in a resounding yes to that question; none of the reduced models were as effective in predicting final grade as the full model, according to the F-tests we ran. Second, we analyzed the RMSE values for each of the models by comparing their predicted values to the actual values in the data.

Model             RMSE
Model 1 (Full)    3.997189
Model 2           4.119726
Model 3           4.074816
Model 4           4.029175
Model 5           4.335027

Looking at the RMSEs, the full model, which our nested F-tests determined to be the best option, also had the lowest RMSE. Notably, the second-best model by the F-test metric also had the second-lowest RMSE, which was only slightly higher than that of the full model.
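The nested F-tests mentioned above compare a reduced model against a full model that contains all of its terms. The Python sketch below shows the standard test statistic; the residual sums of squares and degrees of freedom in the example are invented, and converting the statistic to a p-value (which requires the F distribution) is left to statistical software.

```python
def nested_f_statistic(rss_reduced, rss_full, df_reduced, df_full):
    """F statistic for a nested model comparison.

    rss_* are residual sums of squares; df_* are residual degrees of
    freedom (number of observations minus number of fitted coefficients).
    """
    if df_reduced <= df_full:
        raise ValueError("the reduced model must have more residual df")
    extra_terms = df_reduced - df_full
    numerator = (rss_reduced - rss_full) / extra_terms
    denominator = rss_full / df_full
    return numerator / denominator


# Hypothetical example: dropping 4 coefficients raises the RSS from 100 to 120.
f_stat = nested_f_statistic(rss_reduced=120.0, rss_full=100.0,
                            df_reduced=50, df_full=46)
print(f_stat)  # approximately 2.3
```

A large F statistic means the dropped terms explained more variation than would be expected by chance, which favors keeping the full model.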

Looking at a graph of the predicted final grades with the actual final grades, we see the predictions have much less variability than the actual values themselves – note that the purple line represents where predicted values are equal to actual values. This is something we expect, but we sought to improve upon these predictions by including interaction and higher-order polynomial terms in our model.
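The reason we expect less variability in the predictions can be checked directly: for a least-squares fit with an intercept, evaluated on the data it was fit to, the variance of the fitted values equals R-squared times the variance of the actual values, so it can never be larger. The Python sketch below demonstrates this with a one-predictor regression on invented data; it is an illustration of the general property, not a refit of our model.

```python
import statistics


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical predictor values and final grades.
x = [0, 1, 2, 3, 4, 5]
y = [15, 9, 14, 6, 11, 5]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]

var_y = statistics.pvariance(y)
var_fit = statistics.pvariance(fitted)
r_squared = var_fit / var_y  # share of the variance the model reproduces
print(var_fit, var_y, r_squared)
```

The lower the R-squared, the more the predictions cluster around the mean grade, which is the pattern visible in the graph.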

Determining which interaction terms and which higher-order polynomial terms to include involved a good number of nested F-tests: we added combinations of predictors and higher-order polynomial terms to see which terms significantly improved the model. Once again using RMSE as a measure of the size of the prediction errors, we compared the different models to see which ones were most effective. Here is a table displaying the results of this process.

Model                                    RMSE
No interaction or polynomial terms       3.997189
Only interaction terms                   3.932965
Only polynomial terms                    3.922543
Both interaction and polynomial terms    3.764697

The RMSE for the model with both interaction and polynomial terms is clearly the lowest of these options. This gave us a final model to predict a student’s final grade with the following predictors.

Single predictors

  • failures
  • Mjob
  • sex
  • goout
  • Medu
  • romantic
  • famsup
  • studytime
  • absences
  • schoolsup
  • age
  • famsize

Interaction terms

  • failures * studytime
  • failures * absences
  • failures ^ 2 * absences ^ 2

Polynomials

  • failures ^ 2
  • absences ^ 2

Our final objective for the second question was to see if the distribution of final grades varied among students with different weekly alcohol consumption levels. To do this, we made a couple of visualizations to see how final course grades varied among these alcohol consumption levels.

From these visualizations, we thought there was some negative relationship between the two, with final course grades lower among students who consumed more alcohol, but we weren’t sure if the difference would be statistically significant. To test this, we created a new variable, heavy, which took the value No when the total alcohol consumption level was less than three and took the value Yes otherwise. We did a t-test for difference in means, and we rejected the null hypothesis that the true difference in means between the two groups was equal to zero at an alpha level of 0.1. At this significance level, the data provide evidence that mean final course grades differ between students who were considered heavy drinkers and students who were not.
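As a small illustration of how the heavy indicator splits the students, the Python sketch below labels each record and compares mean final grades by group. The cutoff of three comes from the definition above; the function name heavy_label and the (weekly consumption, G3) records are hypothetical.

```python
import statistics


def heavy_label(weekly_consumption):
    """'Yes' when the combined consumption score is three or more, else 'No'."""
    return "Yes" if weekly_consumption >= 3 else "No"


# Hypothetical (weekly consumption, final grade G3) pairs.
records = [(1.0, 14), (1.5, 12), (2.5, 11), (3.0, 10), (4.5, 8), (2.0, 13)]

grades_by_group = {"No": [], "Yes": []}
for consumption, grade in records:
    grades_by_group[heavy_label(consumption)].append(grade)

group_means = {label: statistics.mean(grades)
               for label, grades in grades_by_group.items()}
print(group_means)  # {'No': 12.5, 'Yes': 9}
```

The two grade lists produced here are the inputs a t-test for difference in means would then compare.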


CONCLUSION

With our first question, we sought to discover which variables in the data were the best predictors of total alcohol consumption, and we investigated whether alcohol consumption varied among students who did and did not plan to pursue higher education. With our second question, we sought to discover the best predictors of a student’s final course grade and investigated whether final grades varied among students who consumed different amounts of alcohol per week. For the first question, we used a model with single predictors, an interaction term, and a higher-order polynomial term. Looking at side-by-side boxplots, we hypothesized there would be a difference in alcohol consumption between students who did and did not plan to pursue higher education. However, after a formal t-test for a difference in means between the two groups, we could not conclude that mean alcohol consumption differed significantly between them. We expected some difference here but were unable to confirm it statistically, although the result came close to significance.

For the second question, we decided our best model was the one with the lowest RMSE. This model consisted of single predictors, along with interaction terms and higher-order polynomial terms. Next, we investigated our follow-up question of whether students who consumed less alcohol per week earned higher final grades in the course. The visualizations we made suggested a potential relationship in which higher alcohol consumption levels were associated with lower final grades. We then created a new variable to indicate whether a student was a heavy drinker and ran a formal t-test for a difference in final grades between students who were heavy drinkers and those who were not. This test showed a statistically significant difference in final grades between the two groups, which was in line with what we expected to see.

We strongly believe that our results are important in the real world. Because we are students much like those we analyzed, these results are highly applicable to us and to the people around us. In investigating our first question, we learned that many different factors can contribute to how much alcohol a student consumes. Among these are financial situations, how often a student goes out with friends, how much time a student spends studying, and the quality of a student’s relationship with their family. While this is not an exhaustive list, we believe it indicates the importance of family in the lives of students. Consuming alcohol does not have to be a bad thing, but poor family relationships may contribute to overconsumption, which can lead to other bad consequences for students. For parents, this underscores the importance of leading a strong family and suggests that the way they raise their children affects their children’s lives.

Our investigation also leads us to believe some of these same variables are important in how well students do academically. As education becomes more and more important in today’s world, it is important to create environments for students that foster success. We find our observation that heavy consumers of alcohol performed worse academically than other students very telling: alcohol does not have to be detrimental to academic success, but it has the potential to be if its consumption is not moderated. It is vital for students to develop healthy habits at this stage in their lives, and our analysis suggests that, while students are responsible for their own success, outside influences such as family structures also play a large role in the actions students take. Knowing what factors can lead someone to drink more heavily could help instructors, advisors, or even other students reach out and offer support to students who may be struggling with alcoholism. Similarly, knowing what factors can set students up for academic success can help instructors, advisors, and family members create positive learning environments.

We are pleased with our analysis, and we acknowledge that much could be done in the future to improve it and uncover new insights. One aspect of modeling in which we did not find much success was transformations of the response and predictor variables; including such transformations might produce more meaningful models than those we created. Doing so could both improve the predictors we already used and introduce new predictors that yield more accurate predictions of the response variables. Another area where other methods may work better is model selection. We used a combination of residual analysis, nested F-tests, and stepwise regression to choose our predictors, and we acknowledge that other methods might lead to different results. We believe we made good decisions in choosing these methods, but we may have included predictors that other methods would not have deemed necessary or excluded predictors that other methods would have deemed significant.

The biggest area where this work could be continued is the collection of more data of the same kind. Potentially the biggest shortcoming of this dataset is its restriction to a small number of students and its lack of any time-related data. One issue we ran into with our analysis was having only twenty students who did not plan on pursuing higher education; with far more data, this number would increase and we could investigate more meaningful questions about the intersection of higher education and alcohol consumption. With data across different time periods, we could also investigate whether consumption levels vary over time and whether they increase during certain periods, such as stressful election periods. People have consumed alcohol for thousands of years and will continue to do so far into the future; given past information on alcohol consumption over time, we could make meaningful projections that could help many different groups of people.

One final area in which this work could be continued is collecting data on adults as well. In analyzing consumption among students, questions about the drinking age and the legality of consumption begin to play a role. If we could consider other age groups, we could draw similarly meaningful insights for people who do not fall into the student category. Alcohol will continue to have an important role in the world, and further work on our analysis has the potential to yield insights that could change the ways people live their lives.