Predictive Modeling of Alcohol Consumption Among Students and Hospital Patients
Setup
Packages
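The package chunk is collapsed in this render. Based on the functions called throughout, the analysis presumably loads at least the following packages; this list is our assumption, not the authors' exact chunk:
Code
# Assumed package list (sketch): wrangling/plotting, modeling, and engines
library(tidyverse)   # readr, dplyr, ggplot2, stringr, tibble
library(tidymodels)  # parsnip, recipes, workflows, rsample, tune, yardstick
library(here)        # project-relative file paths
library(leaps)       # regsubsets() for best subsets selection
library(discrim)     # discrim_linear() for LDA
library(kknn)        # KNN engine
library(ranger)      # random forest engine
library(kernlab)     # SVM engine
library(rpart)       # decision tree engine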
Data
Student data: https://www.kaggle.com/datasets/uciml/student-alcohol-consumption
Cancer data: https://www.kaggle.com/datasets/thedevastator/cancer-patients-and-air-pollution-a-new-link
Code
# student data
alc_mat <- read_csv(here::here("Project Data", "student-mat.csv"))
alc_por <- read_csv(here::here("Project Data", "student-por.csv"))
alc <- rbind(alc_mat, alc_por)
alc$class <- c(rep("math", nrow(alc_mat)), rep("portuguese", nrow(alc_por)))
# cancer data
cancer <- read_csv(here::here("Project Data",
"cancer patient data sets.csv"))Data Cleaning
Code
# alcohol data
alc <- alc %>% na.omit()
alc <- alc %>%
mutate(
across(.cols = c(school, sex, address, famsize, Pstatus, Medu, Fedu,
Mjob, Fjob, reason, guardian, schoolsup, famsup, paid,
activities, nursery, higher, internet, romantic),
.fns = ~ as.factor(.x)
)
)
# cancer data
cancer <- cancer %>% na.omit()
cancer <- cancer %>% dplyr::select(-c(index, `Patient Id`))
cancer$Gender <- ifelse(cancer$Gender == 1, "M", "F")
cancer <- cancer %>% mutate(Gender = as.factor(Gender))
cancer <- cancer %>% mutate(Level = as.factor(Level))
1. Introduction
It is our understanding that your organization, the federal Interagency Coordinating Committee on the Prevention of Underage Drinking (ICCPUD), works with state, territorial, and local governments and organizations to reduce underage drinking and its consequences. We are working on a research project to determine key factors that could impact whether secondary school students are likely to drink. We are also interested in what factors lead to high levels of alcohol use among the general population. We will be providing you with an analysis report outlining the key factors leading to underage drinking, as well as solutions that could be implemented to prevent more adolescents from drinking underage. We believe our analysis can bring your organization additional insight into why students drink while underage, and we hope it will both raise awareness of the consequences of drinking and inform governmental action to address this issue.
The data we chose to utilize in our analysis come from the University of Minho, a research institution in Portugal where two faculty members, Professor Cortez and Professor Silva, collected data through a survey of students taking math and Portuguese classes in secondary school. We plan to determine the key factors of underage drinking using the weekend alcohol consumption reported by each student surveyed. We also plan to use the characteristics surveyed about each student to predict whether a secondary school student is likely to drink. Listed below are the characteristics we will use for the research analysis:
sex: student’s sex (binary: ‘F’ - female or ‘M’ - male)
famsize: family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
paid: extra paid classes within the course subject (math or Portuguese) (binary: yes or no)
absences: number of school absences (numeric: from 0 to 93)
G3: final grade (numeric: from 0 to 20, output target)
goout: going out with friends (numeric: from 1 - very low to 5 - very high)
age: student’s age (numeric: from 15 to 22)
Medu: mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu: father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob: mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
guardian: student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
schoolsup: extra educational support (binary: yes or no)
activities: extra-curricular activities (binary: yes or no)
internet: Internet access at home (binary: yes or no)
romantic: with a romantic relationship (binary: yes or no)
freetime: free time after school (numeric: from 1 - very low to 5 - very high)
address: student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
Pstatus: parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
traveltime: home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures: number of past class failures (numeric: n if 1<=n<3, else 4)
higher: wants to take higher education (binary: yes or no)
famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
health: current health status (numeric: from 1 - very bad to 5 - very good)
To determine the factors leading to high levels of alcohol consumption, we will use a separate data set based on a study in the journal Nature Medicine that drew on data from over 462,000 patients in China. We wanted to use this data set, which surveys adults, to compare our results with the student alcohol data set and learn more about why some people have higher levels of alcohol consumption than others. The patients in the Nature Medicine data set were followed for approximately 6 years. Some of the recorded characteristics are listed below:
age: the age of the patient (numeric: from 14 to 73)
gender: the gender of the patient (binary: ‘1’ or ‘2’)
air pollution: the level of air pollution exposure of the patient (categorical: 1 to 10)
dust allergy: the level of dust allergy of the patient (categorical: 1 to 10)
occuPational hazard: the level of occupational hazards of the patient (categorical: 1 to 10)
genetic risk: the level of genetic risk of the patient (categorical: 1 to 10)
chronic lung disease: the level of chronic lung disease of the patient (categorical: 1 to 10)
While looking through the dataset, we noticed that many of the variables are stored as numeric types even though they represent discrete levels. We will clean the data by converting those variables to factors. We also noticed that the student dataset consists of math and Portuguese students at only two secondary schools in Portugal, which makes the data difficult to generalize to a larger teenage population. In addition, we expect response bias, the bias that occurs when students provide inaccurate answers during surveys, which leads students to under-report underage drinking. Students might provide inaccurate responses because they worry that answering truthfully about something they should not be doing at their age will get them in trouble with parents, guardians, teachers, or the researchers conducting the survey. Because of this response bias and the small number of schools surveyed, we recommend interpreting the conclusions of our analysis with some caution.
Some individuals and groups who can benefit from our analysis include parents, guardians, and teachers of teen students; teenage students themselves; drug agencies; public health organizations; and cancer researchers. Conversely, individuals who could be harmed include innocent students who are wrongly accused of underage drinking.
2. Previous Work
https://www.frontiersin.org/articles/10.3389/fpsyt.2021.684406/full
The researchers of the article, “A Deep Learning Algorithm to Predict Hazardous Drinkers and the Severity of Alcohol-Related Problems Using K-NHANES,” noticed an increasing number of patients with alcohol-related problems, so they used a large survey data set and a deep learning algorithm to predict hazardous drinkers and the severity of patients’ alcohol-related problems. The data set is nationally representative of the South Korean population, since it comes from the National Health and Nutrition Examination Survey of South Korea (K-NHANES). The algorithm was able to effectively classify hazardous drinkers and screen people who need treatment. The researchers used deep learning, which they claimed shows higher performance than conventional machine learning. Although we will not be using deep learning in our analysis, our analysis will be similar in that we will use related techniques to classify and predict individuals who are likely to drink. We will use regression and classification models on two completely different demographic groups to see whether there are any similar trends between the two.
https://www.stevens.edu/news/artificial-intelligence-predicts-drinking-alcohol-using-smartphone-data
Researchers at Stevens Institute of Technology developed an artificial intelligence prediction system that uses smartphone data to predict whether an individual is binge drinking. Sang-Won Bae, the professor leading the research, began developing this system with other researchers in an effort to address binge drinking on college campuses, with the goal of preventing alcohol-related incidents such as accidents and violence. The system uses the AWARE app, a non-profit app that provides strategies for improving mental health, wellbeing, and inner development, to analyze sensor data such as location, motion, phone usage, and social interactions to detect early signs of intoxication. The system achieved high accuracy in predicting heavy drinking events during the pilot test. The goals of the researchers at Stevens align with ours: we also aim to lower underage drinking and bring awareness to the effects of overconsumption of alcohol. The article does not mention the specifics of the data analysis itself, such as what methods or models were used to build the prediction system, although it does mention that data were collected after participants installed the AWARE app. The difference between our analysis and Bae’s is that we did not go through a data collection process of our own, whereas Bae and her team collected their data with the AWARE app.
3. Exploratory Analysis
Student Data
Response Variable(s)
Code
# Bar Plot on Weekend Alcohol Use
alc %>%
ggplot(aes(x = Walc)) +
geom_histogram(aes(y=..count../sum(..count..)), fill = "yellow4") +
labs(x = "Alcohol Use (1 - very low to 5 - very high)",
y = "Proportion",
title = "Weekend Alcohol Use",
subtitle = "Proportion of Student Reported Alcohol Use During the Weekend") +
theme(axis.title.y = element_blank())
# Bar Plot on Workday Alcohol Use
alc %>%
ggplot(aes(x = Dalc)) +
geom_histogram(aes(y=..count../sum(..count..)), fill = "yellow4") +
labs(x = "Alcohol Use (1 - very low to 5 - very high)",
y = "Proportion",
title = "Workday Alcohol Use",
subtitle = "Proportion of Student Reported Alcohol Use During the Weekday") +
theme(axis.title.y = element_blank())
# Scatter plot on Weekend Alcohol Use vs. Workday Alcohol Use
alc %>%
ggplot(aes(x = Walc, y = Dalc)) +
geom_jitter(pch = 16) +
labs(x = "Weekend Alcohol Use (1 - very low to 5 - very high)",
y = "Workday Alcohol Use (1 - very low to 5 - very high)",
title = "Weekend vs. Workday Alcohol Use")From the first two histograms, we can see that most students tend to have both low workday and weekend alcohol consumption. We can also see that weekend alcohol consumption is slightly higher which makes sense since students tend to have more free time on the weekends without class. A scatter plot of the two variables also shows that the vast majority of students have equal or lower weekend alcohol consumption compared to their workday consumption as the bottom right portion of the plot is filled.
Predictor vs. Response Variables
Code
# Bar Plot on Weekend Alcohol Use by Sex
alc %>%
ggplot(aes(x = Walc, fill = sex)) +
geom_histogram(aes(y=..count../sum(..count..)), position = "dodge") +
labs(x = "Weekend Alcohol Use (1 - very low to 5 - very high)",
y = "Proportion",
fill = "Sex",
title = "Weekend Alcohol Use by Sex",
subtitle = "Proportion of Reported Alcohol Use During the Weekend For Males and Females") +
theme(axis.title.y = element_blank())
# Bar Plot on Weekend Alcohol Use by Family Size
alc %>%
ggplot(aes(x = Walc, fill = famsize)) +
geom_histogram(aes(y=..count../sum(..count..)), position = "dodge") +
labs(x = "Weekend Alcohol Use (1 - very low to 5 - very high)",
y = "Proportion",
fill = "Family Size",
title = "Weekend Alcohol Use by Family Size",
subtitle = "Proportion of Reported Alcohol Use During the Weekend Based on Size of Student's Family") +
theme(axis.title.y = element_blank())
# Bar Plot on Weekend Alcohol Use by Extra Paid Classes
alc %>%
mutate(paid = factor(paid, levels = c("yes", "no"))) %>%
ggplot(aes(x = Walc, fill = paid)) +
geom_histogram(aes(y=..count../sum(..count..)), position = "dodge") +
labs(x = "Weekend Alcohol Use (1 - very low to 5 - very high)",
y = "Proportion",
fill = "Extra Paid Classes",
title = "Weekend Alcohol Use by Extra Paid Classes",
subtitle = "Proportion of Reported Alcohol Use During the Weekend Based Enrollment in Paid Classes") +
theme(axis.title.y = element_blank())
The above histograms all show weekend alcohol consumption grouped by different categorical variables. In the first plot, as reported alcohol consumption increases, the proportion of males also increases, indicating that males are prone to higher levels of drinking. The same relationship, although weaker, can be seen for students from families with three or fewer members and for students paying for extra classes.
Code
# Boxplot of Weekend Alcohol Use vs. Absences by Sex
alc %>%
ggplot(aes(x = as.factor(Walc), y = absences, color = sex)) +
geom_boxplot() +
labs(x = "Weekend Alcohol Use (1 - very low to 5 - very high)",
y = "Number of Absences",
fill = "Sex",
title = "Weekend Alcohol Use",
subtitle = "Number of Absences Between Males and Females Based on Weekend Alcohol Use") +
scale_y_continuous(limits = c(0, 20)) +
theme(axis.title.y = element_blank())
# Boxplot of Weekend Alcohol Use vs. Final Grade by Sex
alc %>%
ggplot(aes(x = as.factor(Walc), y = G3, color = sex)) +
geom_boxplot(pch = 16) +
labs(x = "Weekend Alcohol Use (1 - very low to 5 - very high)",
y = "Final Grade",
fill = "Sex",
title = "Weekend Alcohol Use vs. Final Grade by Sex",
subtitle = "Final Grade Between Males and Females Based on Weekend Alcohol Use") +
theme(axis.title.y = element_blank())
# Scatter plot of Weekend Alcohol Use vs. Going Out by Sex
alc %>%
ggplot(aes(x = Walc, y = goout, color = sex)) +
geom_jitter(pch = 16) +
labs(x = "Weekend Alcohol Use (1 - very low to 5 - very high)",
y = "Going Out (1 - never to 5 - always)",
fill = "Sex",
title = "Weekend Alcohol Use vs. Going Out by Sex")The two box plots above shows that as weekend alcohol consumption increases, absences increase and final grades decrease. However, it is important to note that these trends are only clearly visible for male students. Additionally, the scatter plot shows that as weekend alcohol consumption increases, students report to going out more. It is important to note that many students report to going out, but do not report to high levels of drinking, and most of these students are females. Although high levels of drinking are concerning, one silver lining from this plot is that it may indicate that the students most likely view drinking as a social activity, which is less concerning than if they were drinking alone.
Cancer Data
Response Variable
Code
Based on the histogram above, among patients with lung cancer there appears to be a bi-modal distribution: patients primarily report having low or high alcohol consumption rather than a medium amount.
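The chunk above is collapsed in this render. A minimal sketch of a plot that would produce this picture, in the same style as the student plots (the bin count and fill color are our assumptions):
Code
# Histogram on Alcohol Use among patients (sketch of the collapsed chunk)
cancer %>%
ggplot(aes(x = `Alcohol use`)) +
geom_histogram(aes(y=..count../sum(..count..)), fill = "yellow4", bins = 8) +
labs(x = "Alcohol Use (1 - very low to 8 - very high)",
y = "Proportion",
title = "Alcohol Use Among Lung Cancer Patients") +
theme(axis.title.y = element_blank())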
Predictor vs. Response Variables
Code
# Histogram on Alcohol Use by Sex
cancer %>%
ggplot(aes(x = `Alcohol use`, fill = Gender)) +
geom_histogram(aes(y=..count../sum(..count..)), position = "dodge") +
labs(x = "Alcohol Use (1 - very low to 8 - very high)",
y = "Proportion",
fill = "Sex",
title = "Proportion of Alcohol Use by Sex") +
theme(axis.title.y = element_blank())
# Histogram on Alcohol Use by Cancer Level
cancer %>%
mutate(Level = factor(Level, levels = c("Low", "Medium", "High"))) %>%
ggplot(aes(x = `Alcohol use`, fill = Level)) +
geom_histogram(aes(y=..count../sum(..count..)), position = "dodge") +
labs(x = "Alcohol Use (1 - very low to 8 - very high)",
y = "Proportion",
fill = "Cancer Level",
title = "Proportion of Alcohol Use by Cancer Level") +
theme(axis.title.y = element_blank())
As with the student data, the change in proportions in the first histogram shows that alcohol use tends to be greater among males than females. Alcohol consumption is also higher among patients with more severe cancer than among those with milder disease.
Code
# Scatter plot on Alcohol Use vs. Occupational Hazards by Sex
cancer %>%
ggplot(aes(x = `Alcohol use`, y = `OccuPational Hazards`, color = Gender)) +
geom_jitter(pch = 16) +
labs(x = "Alcohol Use (1 - very low to 8 - very high)",
y = "Occupational Hazards (1 - safe to 8 - dangerous)",
fill = "Sex",
title = "Alcohol Use vs. Occupational Hazards by Sex")
# Scatter plot on Alcohol Use vs. Balanced Diet by Sex
cancer %>%
ggplot(aes(x = `Alcohol use`, y = `Balanced Diet`, color = Gender)) +
geom_jitter(pch = 16) +
labs(x = "Alcohol Use (1 - very low to 8 - very high)",
y = "Balanced Diet (1 - very bad to 8 - very good)",
fill = "Sex",
title = "Alcohol Use vs. Balanced Diet by Sex")
# Scatter plot on Alcohol Use vs. Obesity by Sex
cancer %>%
ggplot(aes(x = `Alcohol use`, y = Obesity, color = Gender)) +
geom_jitter(pch = 16) +
labs(x = "Alcohol Use (1 - very low to 8 - very high)",
y = "Obesity (1 - very good to 8 - very bad)",
fill = "Sex",
title = "Alcohol Use vs. Obesity by Sex")
# Scatter plot on Alcohol Use vs. Fatigue by Sex
cancer %>%
ggplot(aes(x = `Alcohol use`, y = Fatigue, color = Gender)) +
geom_jitter(pch = 16) +
labs(x = "Alcohol Use (1 - very low to 8 - very high)",
y = "Fatigue (1 - very low to 8 - very high)",
fill = "Sex",
title = "Alcohol Use vs. Fatigue by Sex")The plots above show that occupational hazards, balanced diet, obesity, and fatigue are all positively correlated with alcohol use.
4. Model Fit Exploration
Setup
Models to fit:
Code
# Linear Regression
lin_reg_mod <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
# KNN
k_grid <- grid_regular(neighbors(c(1,50)),
levels = 25)
knn_reg_mod_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn")
knn_cla_mod_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("classification") %>%
set_engine("kknn")
# LDA
lda_mod <- discrim_linear() %>%
set_engine("MASS") %>%
set_mode("classification")
# Decision Tree
tree_mod <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
# Random Forest
rf_grid <- grid_regular(mtry(c(1, 10)), levels = 10)
rf_mod_tune <- rand_forest(mtry = tune(), min_n = 5, trees = 30) %>%
set_engine("ranger") %>%
set_mode("classification")
# Support Vector Machine
svm_poly_mod <- svm_poly(cost = 1, margin = 0.5, degree = 4) %>%
set_mode("classification") %>%
set_engine("kernlab")Student Data
Since our response variable, weekend alcohol consumption (Walc), is on a Likert scale from 1 to 5, we can treat it as either a quantitative or a categorical variable. Therefore, we will be fitting the following models for our analysis (a sketch of the train/validation split, cross-validation folds, and recipes these models rely on follows this list):
Linear regression
KNN Regression
KNN Classification
Linear Discriminant Analysis
Decision Tree
Random Forest
Support Vector Machine
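The chunks that create the training split, cross-validation folds, and recipes are collapsed in this render. Below is a minimal sketch of the objects the later chunks rely on. The 90/10 split is inferred from the validation set size in Section 5 (104 of the 1,044 students), and the 5 folds from n = 5 in the metric tables; the seed and recipe steps are our assumptions:
Code
# Sketch: train/validation split and CV folds referenced by the chunks below
set.seed(42) # hypothetical seed
alc_split <- initial_split(alc, prop = 0.9)
alc_train <- training(alc_split)
alc_valid <- testing(alc_split)
alc_cv <- vfold_cv(alc_train, v = 5) # matches n = 5 in the metric tables
# Sketch of the KNN recipe referenced later; KNN is distance-based, so
# dummy-code nominal predictors and put everything on a common scale
knn_reg_alc_rec <- recipe(Walc ~ sex + famsize + paid + absences + G3 + goout +
age + Medu + Fedu + Mjob + guardian + schoolsup +
activities + internet + romantic + freetime,
data = alc_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
# knn_cla_alc_rec and tree_alc_rec would be analogous, with Walc as a factor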
Best Subsets - Backwards Selection:
Code
models <- regsubsets(Walc ~ .,
data = alc_train, method = "backward",
nvmax = 52)
model_bic <- tibble(bic = summary(models)$bic,
model = seq(from = 1,
to = 52,
by = 1
)
)
model_bic %>%
mutate(model = str_c("Model ", model)
) %>%
arrange(bic) %>%
column_to_rownames(var = "model")
min_bic <- slice_min(model_bic, order_by = bic) %>%
pull(model)
summary(models)$outmat[min_bic,]
Predictor variables selection:
We are using multiple approaches to identify which predictors may be strong for determining differences in weekend alcohol consumption. These include identifying potential relationships from visualizations, using best-subsets backward selection, and experimenting with models by adding and removing variables (a programmatic way to read off the selected predictors is sketched after the list). From this approach, we have decided on the following variables for fitting our models:
sex: student’s sex (binary: ‘F’ - female or ‘M’ - male)
famsize: family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
paid: extra paid classes within the course subject (math or Portuguese) (binary: yes or no)
absences: number of school absences (numeric: from 0 to 93)
G3: final grade (numeric: from 0 to 20, output target)
goout: going out with friends (numeric: from 1 - very low to 5 - very high)
age: student’s age (numeric: from 15 to 22)
Medu: mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu: father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob: mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
guardian: student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
schoolsup: extra educational support (binary: yes or no)
activities: extra-curricular activities (binary: yes or no)
internet: Internet access at home (binary: yes or no)
romantic: with a romantic relationship (binary: yes or no)
freetime: free time after school (numeric: from 1 - very low to 5 - very high)
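For reference, the predictors kept by the best-subsets run above can be read off programmatically rather than from the printed matrix; a small sketch using the objects already defined:
Code
# Names of terms included in the BIC-optimal model ("*" marks inclusion)
selected <- names(which(summary(models)$outmat[min_bic, ] == "*"))
selected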
Linear Regression:
Code
# cv metrics
lin_alc_rec <- recipe(Walc ~ sex + famsize + paid + absences + G3 + goout + age +
Medu + Fedu + Mjob + guardian + schoolsup +
activities + internet + romantic + freetime,
data = alc_train)
lin_alc_wflow <- workflow() %>%
add_model(lin_reg_mod) %>%
add_recipe(lin_alc_rec)
lin_alc_fit_cv <- lin_alc_wflow %>%
fit_resamples(resamples = alc_cv)
lin_alc_fit_cv %>% collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 1.11 5 0.0341 Preprocessor1_Model1
2 rsq standard 0.256 5 0.0185 Preprocessor1_Model1
The cross-validated RMSE = 1.11 indicates that our model will typically predict weekend alcohol consumption within about 1.1 units of the actual value. The cross-validated R-squared = 0.26 indicates that 26% of the variation in weekend alcohol consumption can be explained by the model.
KNN Regression:
Code
# k tuning
knn_reg_alc_wflow_tune <- workflow() %>%
add_recipe(knn_reg_alc_rec) %>%
add_model(knn_reg_mod_tune)
knn_reg_alc_grid_search <-
tune_grid(
knn_reg_alc_wflow_tune,
resamples = alc_cv,
grid = k_grid
)
knn_reg_alc_grid_search %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
slice_min(mean, n = 3)
# A tibble: 3 × 7
neighbors .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 7 rmse standard 1.10 5 0.0295 Preprocessor1_Model04
2 9 rmse standard 1.10 5 0.0289 Preprocessor1_Model05
3 5 rmse standard 1.10 5 0.0306 Preprocessor1_Model03
Tuning shows that k = 5, 7, and 9 are essentially tied for the lowest RMSE (about 1.10); we proceed with k = 9, which also has the smallest standard error.
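The same choice can be made programmatically with tune's helpers rather than by reading the table; a sketch (select_best() picks the single lowest mean RMSE, which may differ slightly from the manual choice of k = 9):
Code
# Pick the k with the lowest cross-validated RMSE and lock it into the workflow
best_k <- select_best(knn_reg_alc_grid_search, metric = "rmse")
knn_reg_alc_wflow_final <- finalize_workflow(knn_reg_alc_wflow_tune, best_k)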
Code
# cv metrics
knn_reg_alc_mod <- nearest_neighbor(neighbors = 9) %>%
set_mode("regression") %>%
set_engine("kknn")
knn_reg_alc_wflow <- workflow() %>%
add_model(knn_reg_alc_mod) %>%
add_recipe(knn_reg_alc_rec)
knn_reg_alc_fit_cv <- knn_reg_alc_wflow %>%
fit_resamples(resamples = alc_cv)
knn_reg_alc_fit_cv %>% collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 1.10 5 0.0289 Preprocessor1_Model1
2 rsq standard 0.271 5 0.0153 Preprocessor1_Model1
The cross-validated RMSE = 1.10 indicates that our model will typically predict weekend alcohol consumption within about 1.1 units of the actual value. The cross-validated R-squared = 0.27 indicates that 27% of the variation in weekend alcohol consumption can be explained by the model.
KNN Classification:
Code
# k tuning
knn_cla_alc_wflow_tune <- workflow() %>%
add_recipe(knn_cla_alc_rec) %>%
add_model(knn_cla_mod_tune)
knn_cla_alc_grid_search <-
tune_grid(
knn_cla_alc_wflow_tune,
resamples = alc_cv,
grid = k_grid
)
knn_cla_alc_grid_search %>%
collect_metrics() %>%
filter(.metric == "accuracy") %>%
slice_max(mean, n = 3)
# A tibble: 3 × 7
neighbors .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 accuracy multiclass 0.583 5 0.0174 Preprocessor1_Model01
2 3 accuracy multiclass 0.583 5 0.0174 Preprocessor1_Model02
3 5 accuracy multiclass 0.583 5 0.0174 Preprocessor1_Model03
Code
# A tibble: 3 × 7
neighbors .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 roc_auc hand_till 0.755 5 0.00933 Preprocessor1_Model03
2 7 roc_auc hand_till 0.752 5 0.00928 Preprocessor1_Model04
3 9 roc_auc hand_till 0.748 5 0.00920 Preprocessor1_Model05
Tuning determines that our best value is k = 5, as it is tied for the highest accuracy (0.58) and has the highest ROC-AUC (0.76).
Code
# cv metrics
knn_cla_alc_mod <- nearest_neighbor(neighbors = 5) %>%
set_mode("classification") %>%
set_engine("kknn")
knn_cla_alc_wflow <- workflow() %>%
add_model(knn_cla_alc_mod) %>%
add_recipe(knn_cla_alc_rec)
knn_cla_alc_fit_cv <- knn_cla_alc_wflow %>%
fit_resamples(resamples = alc_cv)
knn_cla_alc_fit_cv %>% collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.583 5 0.0174 Preprocessor1_Model1
2 roc_auc hand_till 0.755 5 0.00933 Preprocessor1_Model1
With an accuracy of 0.58, our model predicts alcohol consumption correctly 58% of the time.
LDA Classification:
Code
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.404 5 0.0169 Preprocessor1_Model1
2 roc_auc hand_till 0.692 5 0.0110 Preprocessor1_Model1
With an accuracy = 0.40, our model predicts alcohol consumption correctly 40% of the time.
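The LDA chunk is collapsed in this render. A minimal sketch of a fit that would produce metrics like these, assuming it reuses the classification recipe and folds from above:
Code
# LDA workflow (sketch of the collapsed chunk)
lda_alc_wflow <- workflow() %>%
add_model(lda_mod) %>%
add_recipe(knn_cla_alc_rec) # assumption: same classification recipe
lda_alc_fit_cv <- lda_alc_wflow %>%
fit_resamples(resamples = alc_cv)
lda_alc_fit_cv %>% collect_metrics()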
Decision Tree:
Code
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.426 5 0.0102 Preprocessor1_Model1
2 roc_auc hand_till 0.670 5 0.0108 Preprocessor1_Model1
With an accuracy of 0.43, our model predicts alcohol consumption correctly 43% of the time.
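The decision tree chunk is likewise collapsed; a sketch using the tree_mod spec from the setup and the tree_alc_rec recipe referenced by the random forest chunk below:
Code
# Decision tree workflow (sketch of the collapsed chunk)
tree_alc_wflow <- workflow() %>%
add_model(tree_mod) %>%
add_recipe(tree_alc_rec)
tree_alc_fit_cv <- tree_alc_wflow %>%
fit_resamples(resamples = alc_cv)
tree_alc_fit_cv %>% collect_metrics()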
Random Forest:
Code
# A tibble: 3 × 7
mtry .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 accuracy multiclass 0.438 5 0.0170 Preprocessor1_Model01
2 8 accuracy multiclass 0.531 5 0.0161 Preprocessor1_Model08
3 10 accuracy multiclass 0.532 5 0.0233 Preprocessor1_Model10
Code
# cv metrics
rf_alc_mod <- rand_forest(mtry = 7, min_n = 5, trees = 30) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_alc_wflow <- workflow() %>%
add_model(rf_alc_mod) %>%
add_recipe(tree_alc_rec)
rf_alc_fit_cv <- rf_alc_wflow %>%
fit_resamples(resamples = alc_cv)
rf_alc_fit_cv %>% collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.532 5 0.0239 Preprocessor1_Model1
2 roc_auc hand_till 0.791 5 0.0121 Preprocessor1_Model1
With an accuracy of 0.53, our model predicts alcohol consumption correctly 53% of the time.
Support Vector Machine:
Code
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.572 5 0.0174 Preprocessor1_Model1
2 roc_auc hand_till 0.758 5 0.0134 Preprocessor1_Model1
With an accuracy of 0.57, our model predicts alcohol consumption correctly 57% of the time.
Cancer Data
Again, since our response variable, alcohol use, is on a Likert scale from 1 to 8, we can treat it as either a quantitative or a categorical variable. Therefore, we will be fitting the following models for our analysis (the corresponding split and folds are sketched after this list):
Linear regression
KNN Regression
KNN Classification
Linear Discriminant Analysis
Random Forest
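As with the student data, the chunks creating the split, folds, and recipes are collapsed. A sketch of the assumed objects: the ranger output in Section 5 reports a sample size of 900 of the 1,000 patients, implying a 90/10 split, and n = 5 in the metric tables implies 5 folds; the seed is hypothetical:
Code
# Sketch: cancer train/validation split and CV folds
set.seed(42) # hypothetical seed
can_split <- initial_split(cancer, prop = 0.9)
can_train <- training(can_split)
can_valid <- testing(can_split)
can_cv <- vfold_cv(can_train, v = 5)
# knn_reg_can_rec, knn_cla_can_rec, and tree_can_rec would mirror the student
# recipes, built from the predictors listed below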
Best Subsets - Backwards Selection:
Code
models <- regsubsets(`Alcohol use` ~.,
data = cancer,
method = "backward",
nvmax = 24)
model_bic <- tibble(bic = summary(models)$bic,
model = seq(from = 1,
to = 24,
by = 1
)
)
model_bic %>%
mutate(model = str_c("Model ", model)
) %>%
arrange(bic) %>%
column_to_rownames(var = "model")
min_bic <- slice_min(model_bic, order_by = bic) %>%
pull(model)
summary(models)$outmat[min_bic,]
Predictor variables selection:
There are multiple approaches we used to identify predictors that may be strong for determining differences in patient alcohol use. Through the visualizations in the exploratory analysis, best-subsets backward selection, and manual experimentation with adding and removing variables, we identified these predictors:
gender: the gender of the patient (binary: ‘1’ or ‘2’)
level: the severity of lung cancer (nominal: ‘Low’, ‘Medium’, ‘High’)
occuPational hazard: the level of occupational hazards of the patient (categorical: 1 to 10)
balanced diet: the level of balanced diet of the patient. (categorical: from 1 - very low to 7 - very high)
obesity: the level of obesity of the patient. (categorical: from 1 - very low to 7 - very high)
fatigue: the level of fatigue of the patient. (categorical: from 1 - very low to 7 - very high)
Linear Regression:
Code
# cv metrics
lin_can_rec <- recipe(`Alcohol use` ~ Gender + Level + `OccuPational Hazards` +
`Balanced Diet` + Obesity + Fatigue,
data = can_train)
lin_can_wflow <- workflow() %>%
add_model(lin_reg_mod) %>%
add_recipe(lin_can_rec)
lin_can_fit_cv <- lin_can_wflow %>%
fit_resamples(resamples = can_cv)
lin_can_fit_cv %>% collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 1.06 5 0.0287 Preprocessor1_Model1
2 rsq standard 0.837 5 0.00921 Preprocessor1_Model1
The cross-validated RMSE = 1.06 indicates that our model will typically predict alcohol use within about 1.06 units of the actual value. The cross-validated R-squared = 0.84 indicates that 84% of the variation in alcohol use can be explained by the model.
KNN Regression:
Code
# k tuning
knn_reg_can_wflow_tune <- workflow() %>%
add_recipe(knn_reg_can_rec) %>%
add_model(knn_reg_mod_tune)
knn_reg_can_grid_search <-
tune_grid(
knn_reg_can_wflow_tune,
resamples = can_cv,
grid = k_grid
)
knn_reg_can_grid_search %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
slice_min(mean, n = 3)
# A tibble: 3 × 7
neighbors .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 0.356 5 0.0522 Preprocessor1_Model03
2 7 rmse standard 0.357 5 0.0426 Preprocessor1_Model04
3 9 rmse standard 0.367 5 0.0433 Preprocessor1_Model05
Code
# A tibble: 3 × 7
neighbors .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 7 rsq standard 0.980 5 0.00454 Preprocessor1_Model04
2 5 rsq standard 0.980 5 0.00570 Preprocessor1_Model03
3 9 rsq standard 0.979 5 0.00471 Preprocessor1_Model05
Tuning shows that k = 5, 7, and 9 perform very similarly; k = 5 has the lowest RMSE (0.356), and we proceed with k = 9, whose RMSE (0.367) is comparable.
Code
# cv metrics
knn_reg_can_mod <- nearest_neighbor(neighbors = 9) %>%
set_mode("regression") %>%
set_engine("kknn")
knn_reg_can_wflow <- workflow() %>%
add_model(knn_reg_can_mod) %>%
add_recipe(knn_reg_can_rec)
knn_reg_can_fit_cv <- knn_reg_can_wflow %>%
fit_resamples(resamples = can_cv)
knn_reg_can_fit_cv %>% collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.367 5 0.0433 Preprocessor1_Model1
2 rsq standard 0.979 5 0.00471 Preprocessor1_Model1
The cross-validated RMSE = 0.37 indicates that our model will typically predict alcohol use within about 0.37 units of the actual value. The cross-validated R-squared = 0.98 indicates that 98% of the variation in alcohol use can be explained by the model.
KNN Classification:
Code
# k tuning
knn_cla_can_wflow_tune <- workflow() %>%
add_recipe(knn_cla_can_rec) %>%
add_model(knn_cla_mod_tune)
knn_cla_can_grid_search <-
tune_grid(
knn_cla_can_wflow_tune,
resamples = can_cv,
grid = k_grid
)
knn_cla_can_grid_search %>%
collect_metrics() %>%
filter(.metric == "accuracy") %>%
slice_max(mean, n = 3)
# A tibble: 3 × 7
neighbors .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 accuracy multiclass 0.972 5 0.00393 Preprocessor1_Model01
2 3 accuracy multiclass 0.97 5 0.00283 Preprocessor1_Model02
3 5 accuracy multiclass 0.963 5 0.00816 Preprocessor1_Model03
Code
# A tibble: 3 × 7
neighbors .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 17 roc_auc hand_till 0.990 5 0.00345 Preprocessor1_Model09
2 7 roc_auc hand_till 0.990 5 0.00207 Preprocessor1_Model04
3 11 roc_auc hand_till 0.990 5 0.00233 Preprocessor1_Model06
Tuning determines that our best value is k = 3: it is essentially tied with k = 1 for the highest accuracy (0.97) while being less prone to overfitting than k = 1, and its ROC-AUC (0.99) is comparable to the best values.
Code
# cv metrics
knn_cla_can_mod <- nearest_neighbor(neighbors = 3) %>%
set_mode("classification") %>%
set_engine("kknn")
knn_cla_can_wflow <- workflow() %>%
add_model(knn_cla_can_mod) %>%
add_recipe(knn_cla_can_rec)
knn_cla_can_fit_cv <- knn_cla_can_wflow %>%
fit_resamples(resamples = can_cv)
knn_cla_can_fit_cv %>% collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.97 5 0.00283 Preprocessor1_Model1
2 roc_auc hand_till 0.988 5 0.00230 Preprocessor1_Model1
With an accuracy = 0.97, our model predicts alcohol use correctly 97% of the time.
LDA Classification:
Code
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.681 5 0.0306 Preprocessor1_Model1
2 roc_auc hand_till 0.949 5 0.00849 Preprocessor1_Model1
With an accuracy of 0.68, our model predicts alcohol use correctly 68% of the time.
Random Forest:
Code
# tuning
rf_grid <- grid_regular(mtry(c(1, 7)), levels = 7)
rf_can_wflow <- workflow() %>%
add_model(rf_mod_tune) %>%
add_recipe(tree_can_rec)
rf_can_grid_search <- rf_can_wflow %>%
tune_grid(
grid = rf_grid,
resamples = can_cv
)
rf_can_grid_search %>%
collect_metrics() %>%
filter(.metric == "accuracy") %>%
slice_min(mean, n = 3)
# A tibble: 4 × 7
mtry .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 accuracy multiclass 0.911 5 0.0276 Preprocessor1_Model1
2 7 accuracy multiclass 0.978 5 0.00465 Preprocessor1_Model7
3 2 accuracy multiclass 0.98 5 0.00377 Preprocessor1_Model2
4 6 accuracy multiclass 0.98 5 0.00451 Preprocessor1_Model6
Code
# A tibble: 3 × 7
mtry .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 roc_auc hand_till 1.00 5 0.0000536 Preprocessor1_Model5
2 4 roc_auc hand_till 1.00 5 0.0000810 Preprocessor1_Model4
3 6 roc_auc hand_till 1.00 5 0.0000834 Preprocessor1_Model6
Code
# cv metrics
rf_can_mod <- rand_forest(mtry = 3, min_n = 5, trees = 30) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_can_wflow <- workflow() %>%
add_model(rf_can_mod) %>%
add_recipe(tree_can_rec)
rf_can_fit_cv <- rf_can_wflow %>%
fit_resamples(resamples = can_cv)
rf_can_fit_cv %>% collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy multiclass 0.983 5 0.00430 Preprocessor1_Model1
2 roc_auc hand_till 1.00 5 0.0000629 Preprocessor1_Model1
With an accuracy = 0.98, our model predicts alcohol use correctly 98% of the time.
5. Final Models
Student Data
Model Fit:
Code
parsnip model object
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(5, data, 5))
Type of response variable: nominal
Minimal misclassification: 0.3319149
Best kernel: optimal
Best k: 5
Validation Data Metrics:
Code
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.712
Code
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision macro 0.763
Code
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 recall macro 0.723
Code
Truth
Prediction 1 2 3 4 5
1 27 2 2 0 0
2 4 18 2 4 1
3 6 4 15 2 0
4 1 2 0 10 0
5 0 0 0 0 4
Our K-Nearest Neighbors classification model achieved a validation accuracy of 0.71, indicating that 71% of its predictions were correct, the best of all the models we fit. Although this may not seem especially high, it is important to note that with 5 categories, an accuracy based on random chance alone would only be around 20%. Our model's success can also be seen in the confusion matrix, where the diagonal contains most of the observations, indicating that most predictions were correct. Both the precision and recall values were also over 0.7, indicating accurate positive predictions as well as proper adjustment by the model for the imbalance in the number of observations per group. Since most students reported a 1 on the weekend alcohol consumption scale, there were more 1's in the data set, and predictions are typically easier for observations that make up a higher proportion of the data. It is therefore promising that the other groups also had accurate predictions. (The sketch below shows how these validation metrics are typically computed.)
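For reference, here is a sketch of how the validation metrics and confusion matrix above are typically produced with yardstick. The object names knn_cla_alc_fit and alc_valid are our assumptions, and Walc is assumed to have been converted to a factor for classification:
Code
# Fit the chosen workflow on the full training set, then score the hold-out set
knn_cla_alc_fit <- knn_cla_alc_wflow %>% fit(data = alc_train)
alc_val_preds <- augment(knn_cla_alc_fit, new_data = alc_valid)
alc_val_preds %>% accuracy(truth = Walc, estimate = .pred_class)
alc_val_preds %>% precision(truth = Walc, estimate = .pred_class) # macro
alc_val_preds %>% recall(truth = Walc, estimate = .pred_class) # macro
alc_val_preds %>% conf_mat(truth = Walc, estimate = .pred_class)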
Cancer Data
Model Fit:
parsnip model object
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~3, x), num.trees = ~30, min.node.size = min_rows(~5, x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
Type: Probability estimation
Number of trees: 30
Sample size: 900
Number of independent variables: 7
Mtry: 3
Target node size: 5
Variable importance mode: none
Splitrule: gini
OOB prediction error (Brier s.): 0.02338681
Validation Data Metrics:
Code
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.99
Code
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision macro 0.992
Code
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 recall macro 0.975
Code
Truth
Prediction 1 2 3 4 5 6 7 8
1 17 0 0 0 0 0 0 0
2 0 18 0 0 0 0 0 0
3 0 0 4 0 0 0 0 0
4 0 0 0 4 0 0 0 0
5 0 0 0 0 12 0 0 0
6 0 0 0 0 0 10 0 0
7 0 0 1 0 0 0 14 0
8 0 0 0 0 0 0 0 20
The random forest model achieved near-perfect validation metrics: accuracy of 0.99, precision of 0.99, and recall of 0.98. The high accuracy indicates that the model correctly classified almost all patients in the validation set to their respective alcohol use level. The high precision means that nearly every patient's predicted level of alcohol use matched their actual level, and the high recall means that the model captured nearly all patients at each actual alcohol use level. These metrics are illustrated in the confusion matrix, where only a single observation falls outside the main diagonal. It is somewhat surprising that the metrics are this close to perfect even after cross-validation, considering that this rarely occurs in real-life scenarios. Because of this, we speculate that there may be issues concerning the integrity of the recorded data.
6. Conclusions
Overall, our final model for the student data indicated several key factors for predicting weekend alcohol consumption among adolescents, including sex, family size, extra paid classes, absences, final grades, going out with friends, parents' education, mother's occupation, student guardian, supplementary educational support, extracurricular activities, internet access, romantic relationships, and free time. None of these indicators come as much of a surprise, since they are generally related to student responsibility and the adult influence around them. One factor we were surprised was not a key indicator is family relationship quality, as we assumed that troubled family situations might indicate alcohol issues.
The fact that our best model was a KNN classification model while some of our worst models were regression models tells us a lot about the structure of our data and the population the model was fit on. First, it tells us that the relationships between variables in our data set were not very linear; in fact, they most likely did not follow any easily identifiable distribution. Instead, there were probably clusters of students associated with each level of weekend alcohol consumption. Thought about logically, this makes some sense, as students develop cliques with unique backgrounds and interests. Although we cannot be certain, since the data came from students in the same schools and classes, it would make sense for these students to be friends with others who share many of the same qualities, and one of those shared qualities was most likely alcohol consumption. Although this is fascinating, it is also concerning: if this is the case, applying this model to groups of people outside these schools may be ineffective, as the cluster structure may not be the same.
Our final model for the cancer data indicated gender, lung cancer severity, occupational hazards, balanced diet, obesity, and fatigue as key factors that impact a patient's level of alcohol use. These factors do not surprise us, since all seem like plausible influences on how often someone drinks. In particular, the relationship between lung cancer and level of alcohol use does not surprise us, because there have been many studies, like this one from the National Library of Medicine (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4706677/), linking lung cancer to how often someone drinks.
The best model we found for predicting level of alcohol use was the random forest, because of its high metrics, whereas the worst model was LDA classification, with only 68% accuracy. The difference in accuracy between the two models tells us a lot about the data itself. LDA is a linear method that assumes linear decision boundaries, so it struggles to model nonlinear relationships. Random forest and KNN, on the other hand, which both had high metrics, are nonparametric models, so they do not rely on a linearity assumption. Since KNN and random forest performed well while LDA did not, we can conclude that there are nonlinear patterns among the variables.
7. Limitations
It is our intention to respect the privacy of our data subjects and to ensure that their information is treated with care. As data scientists, we must consider the impact, harm, and power that our analysis has over the general public, as well as the potential impact and harm the data itself has on our study. Both data sets used in this research project collected sensitive data about the surveyed students and patients. Although the data collection process in both datasets omitted the patients' and students' names to protect their subjects' identities, the process was harmful in that researchers asked questions perceived as uncomfortable. For instance, underage students in the student dataset were asked about their alcohol use during the weekend and weekday. Questions such as these introduce response bias, because students may feel uneasy disclosing information about their drinking habits, especially considering the consequences of these actions, leading them to answer the researchers untruthfully. Likewise, patients with lung cancer in the cancer dataset are likely to find it uncomfortable to answer questions about their lifestyle habits, such as rating how balanced their diet is on a scale.
In terms of impact, our cancer model could detrimentally impact people who are a minority in our dataset. The largest group of people in the dataset was aged 30-39, making up a little over 30% of the total recorded number of patients, and 60% of the patients identified as male. Therefore, our model might not be the best fit for predicting the level of alcohol use of someone in the minority of the dataset, such as someone outside the 30-39 age range or someone who identifies as a woman.
The student dataset, on the other hand, could detrimentally impact students who are wrongly classified as underage drinkers when the reality is otherwise. Such erroneous classification can result in unintended consequences, discrimination, and even peer stigmatization. The most important variables found in predicting the level of weekend alcohol consumption were sex, family size, paid math/Portuguese classes, number of absences, final grade, age, level of going out with friends, mother's education, father's education, mother's job, student's guardian, extra educational support, extracurricular activities, internet access at home, relationship status, and free time after school. Students who are lacking in one of these characteristics, for example having a low final grade or no internet access at home, are likely to be subjected to detrimental impact. To mitigate this, we plan to be transparent about the variables our models use, how the models work, and their limitations. We believe transparency can help readers understand the impact of our models. In addition to being transparent, we intend to outline the intended use of our models so they can be used for ethical purposes.