High School Alcoholism and Academic Performance: EDA and Clustering

Introduction

Alcohol consumption among high school students has witnessed a significant rise on a global scale in recent decades, prompting growing concern regarding its impact on the academic achievements and future trajectories of adolescents. The predominant issue linked to alcohol consumption during adolescence lies in its adverse effects on students’ school performance, often resulting in an elevated risk of school failure and increased school absences.

Alcohol consumption during adolescence has become a subject of debate, particularly in developed countries, with implications for public health and for social and cultural dynamics. Studies conducted in Portugal over the past five years, such as the one led by Silvestre et al., report a high prevalence of alcohol consumption among Portuguese school adolescents. Notably, binge drinking levels are reported to be comparable to those observed in the United States. Silvestre et al. also found that a significant number of participants started drinking under the influence of friends and used alcohol as a means of socializing, while some resorted to it as a coping mechanism for life’s challenges.

In a parallel vein, research in the United States, as articulated in Teenage Alcohol Use and Educational Attainment, explores the correlation between teenage alcohol use and school failure, raising questions about whether this relationship is causal or spurious. Human capital models predict a direct negative impact of teenage alcohol use on educational attainment. Heavy alcohol use, in particular, is implicated in diverting time away from studying, completing homework, and seeking academic assistance. Beyond these immediate effects, heavy alcohol use during adolescence may compromise long-term educational attainment by influencing brain structure, functioning, and neuropsychological performance. Moreover, heavy alcohol use heightens the risk of motor vehicle accidents, physical and mental health problems, and violence, thereby impacting intervening variables that contribute to reduced educational attainment.

The dataset available for my study offers a unique opportunity to scrutinize the intricate relationships between student backgrounds, alcoholism, and academic achievements. This investigation is pivotal for comprehending the multifaceted dynamics of alcohol consumption among adolescents and its far-reaching implications for their educational journeys.

# Check for, install, and load the required packages
requiredPackages = c("tidyverse", "pastecs", "readr", "dplyr", "ggplot2", "cowplot", "flexclust", "clustertend", "clusterability", "grid", "patchwork", "hopkins", "cluster", "factoextra", "ClusterR", "clValid", "gridExtra", "ggpubr", "tidyr", "vioplot", "corrplot", "viridis")

# First pass: try to load each package and install it if loading fails
for (i in requiredPackages) { if (!require(i, character.only = TRUE)) install.packages(i) }

# Second pass: load any package that was only installed in the first pass
for (i in requiredPackages) { if (!require(i, character.only = TRUE)) library(i, character.only = TRUE) }

Dataset Source

*The data were gathered by P. Cortez and A. Silva for the study “Using Data Mining to Predict Secondary School Student Performance,” presented at the 5th Future Business Technology Conference (FUBUTEC 2008) in Porto, Portugal.*

# Load the dataset
df = read.csv("https://raw.githubusercontent.com/Samidullo-Abdullayev/datasets-/main/en_lpor_explorer.csv", header = TRUE, sep = ',')

# Show the five oldest students (rows sorted by Age in descending order)
df %>% arrange(desc(Age)) %>% slice(1:5)

##            School Gender Age Housing_Type Family_Size Parental_Status
## 1 Gabriel Pereira   Male  22        Urban     Above 3 Living Together
## 2 Gabriel Pereira Female  21        Urban     Up to 3 Living Together
## 3 Gabriel Pereira   Male  21        Rural     Up to 3 Living Together
## 4 Gabriel Pereira Female  20        Rural     Above 3 Living Together
## 5 Gabriel Pereira   Male  20        Urban     Above 3       Separated
##         Mother_Education       Father_Education Mother_Work Father_Work
## 1            High School         Primary School    Services    Services
## 2       Higher Education       Higher Education       other       other
## 3         Primary School         Primary School   Homemaker       other
## 4 Lower Secondary School         Primary School       other       other
## 5            High School Lower Secondary School    Services       other
##   Reason_School_Choice Legal_Responsibility Commute_Time Weekly_Study_Time
## 1                Other               Mother Up to 15 min          Up to 2h
## 2           Reputation                Other Up to 15 min          5 to 10h
## 3    Course Preference                Other 15 to 30 min           2 to 5h
## 4    Course Preference                Other 15 to 30 min           2 to 5h
## 5    Course Preference                Other Up to 15 min          Up to 2h
##   Extra_Educational_Support Parental_Educational_Support Private_Tutoring
## 1                        No                           No               No
## 2                        No                           No              Yes
## 3                        No                          Yes               No
## 4                        No                          Yes              Yes
## 5                        No                           No               No
##   Extracurricular_Activities Attended_Daycare Desire_Graduate_Education
## 1                         No               No                        No
## 2                        Yes              Yes                       Yes
## 3                        Yes              Yes                        No
## 4                        Yes              Yes                        No
## 5                        Yes              Yes                       Yes
##   Has_Internet Is_Dating Good_Family_Relationship Free_Time_After_School
## 1          Yes       Yes                Excellent                   High
## 2          Yes        No                     Fair               Moderate
## 3          Yes       Yes                Excellent               Moderate
## 4          Yes       Yes                Very Poor                    Low
## 5           No        No                Excellent              Very High
##   Time_with_Friends Alcohol_Weekdays Alcohol_Weekends Health_Status
## 1         Very High        Very High        Very High     Very Poor
## 2               Low         Very Low         Very Low     Very Good
## 3          Moderate        Very High              Low          Good
## 4          Moderate         Very Low              Low          Poor
## 5          Moderate         Very Low         Very Low     Very Good
##   School_Absence Grade_1st_Semester Grade_2nd_Semester
## 1             12                  7                  8
## 2              0                  9                 12
## 3             21                  9                 10
## 4              8                 10                 12
## 5              0                 14                 15

# Check the shape of the data: number of rows and columns
cat("Data dimension: ", dim(df)[1], "rows", dim(df)[2], "columns")

## Data dimension:  649 rows 31 columns

Features description

School: Student's school (char: Gabriel Pereira, Mousinho da Silveira)

Gender: Student's sex (char: Female, Male)

Age: Student age (numeric: 15 to 22)

Housing_Type: Type of student's residential address (char: Urban, Rural)

Family_Size: Family size (char: 'Up to 3' - less than or equal to 3 or 'Above 3' - greater than 3)

Parental_Status: Parents' cohabitation status (char: Living Together, Separated)

Mother_Education: Mother's education level (char: None, Primary School, Lower Secondary School, High School or Higher Education)

Father_Education: Father's education level (char: None, Primary School, Lower Secondary School, High School or Higher Education)

Mother_Work: Mother's job (char: Health, Homemaker, other, Services, Teacher)

Father_Work: Father's job (char: Health, Homemaker, other, Services, Teacher)

Reason_School_Choice: Reason for choosing this school (char: Course Preference, Near Home, Other, Reputation)

Legal_Responsibility: Student's guardian (char: Mother, Father or Other)

Commute_Time: Travel time from home to school (time intervals: Up to 15 min, 15 to 30 min, 30 min to 1h, More than 1h)

Weekly_Study_Time: Weekly study time (time intervals: Up to 2h, 2 to 5h, 5 to 10h, More than 10h)

Extra_Educational_Support: Extra educational support (binary: Yes or No)

Parental_Educational_Support: Family educational support (binary: Yes or No)

Private_Tutoring: Private classes on subjects related to the course (binary: Yes or No)

Extracurricular_Activities: Performs extracurricular activities (binary: Yes or No)

Attended_Daycare: Attended daycare (binary: Yes or No)

Desire_Graduate_Education: Desire to pursue a degree (binary: Yes or No)

Has_Internet: Internet access at home (binary: Yes or No)

Is_Dating: In a romantic relationship (binary: Yes or No)

Good_Family_Relationship: Quality of family relationships (categorical: Very Poor, Poor, Fair, Good, Excellent)

Free_Time_After_School: Free time after school (categorical: Very Low, Low, Moderate, High, Very High)

Time_with_Friends: Time with friends (categorical: Very Low, Low, Moderate, High, Very High)

Alcohol_Weekdays: Alcohol consumption on school days (categorical: Very Low, Low, Moderate, High, Very High)

Alcohol_Weekends: Alcohol consumption on the weekend (categorical: Very Low, Low, Moderate, High, Very High)

Health_Status: Current health status (categorical: Very Poor, Poor, Fair, Good, Very Good)

School_Absence: Number of school absences (numeric: from 0 to 32 in this dataset)

Grade_1st_Semester: First semester grade (numeric: from 0 to 20)

Grade_2nd_Semester: Second semester grade (numeric: from 0 to 20)

In the first steps of the analysis, I will inspect the detailed information of the data, calculate basic statistics and explore the relationship between variables.

During this exploratory phase, I will specifically address two hypotheses:

H1: Correlation between Family Background and Student Alcoholism

I will investigate whether there is a correlation between family background variables and the occurrence of student alcoholism. This involves analyzing features related to family structure, education levels, and occupations to understand their potential influence on student alcohol consumption.

H2: Relationship between Alcoholism and Grade

I will explore the relationship between student alcohol consumption and academic performance (grades). This analysis aims to uncover patterns or trends that suggest a potential connection between alcohol habits and academic outcomes.

# Check for missing values and count them in each column of the data frame
cat("Number of missing values for each column : \n")

## Number of missing values for each column :

lapply(df, function(x) sum(is.na(x)))

## $School
## [1] 0
## 
## $Gender
## [1] 0
## 
## $Age
## [1] 0
## 
## $Housing_Type
## [1] 0
## 
## $Family_Size
## [1] 0
## 
## $Parental_Status
## [1] 0
## 
## $Mother_Education
## [1] 0
## 
## $Father_Education
## [1] 0
## 
## $Mother_Work
## [1] 0
## 
## $Father_Work
## [1] 0
## 
## $Reason_School_Choice
## [1] 0
## 
## $Legal_Responsibility
## [1] 0
## 
## $Commute_Time
## [1] 0
## 
## $Weekly_Study_Time
## [1] 0
## 
## $Extra_Educational_Support
## [1] 0
## 
## $Parental_Educational_Support
## [1] 0
## 
## $Private_Tutoring
## [1] 0
## 
## $Extracurricular_Activities
## [1] 0
## 
## $Attended_Daycare
## [1] 0
## 
## $Desire_Graduate_Education
## [1] 0
## 
## $Has_Internet
## [1] 0
## 
## $Is_Dating
## [1] 0
## 
## $Good_Family_Relationship
## [1] 0
## 
## $Free_Time_After_School
## [1] 0
## 
## $Time_with_Friends
## [1] 0
## 
## $Alcohol_Weekdays
## [1] 0
## 
## $Alcohol_Weekends
## [1] 0
## 
## $Health_Status
## [1] 0
## 
## $School_Absence
## [1] 0
## 
## $Grade_1st_Semester
## [1] 0
## 
## $Grade_2nd_Semester
## [1] 0

cat("\n\nNumber of missing values in whole dataset : ", sum(is.na(df)))

## 
## 
## Number of missing values in whole dataset :  0

The output above shows that the dataset contains no missing values, so we can proceed to the next steps.
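The per-column check can also be written more compactly. The following is a minimal sketch, assuming a small made-up data frame (`toy` is illustrative and not the study data), that uses `colSums(is.na(...))` to get a named vector instead of a list:

```r
# Toy data frame standing in for `df` (illustrative values only)
toy <- data.frame(
  Age    = c(15, 16, NA, 18),
  Gender = c("Female", NA, "Male", "Male"),
  Grade  = c(12, 9, 14, NA)
)

# Missing values per column, returned as a named numeric vector
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Total number of missing values in the whole data frame
na_total <- sum(is.na(toy))
print(na_total)
```

The same two calls should work unchanged on `df`; the vector form is easier to sort or filter than the list returned by `lapply()`.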

# Convert the ordinal categorical columns to factors with their levels listed in ascending order
df$Mother_Education <- factor(df$Mother_Education, levels = c('None', 'Primary School','Lower Secondary School','High School','Higher Education'))
df$Father_Education <- factor(df$Father_Education, levels = c('None', 'Primary School','Lower Secondary School','High School','Higher Education'))
df$Commute_Time <- factor(df$Commute_Time, levels = c('Up to 15 min','15 to 30 min','30 min to 1h','More than 1h'))
df$Weekly_Study_Time <- factor(df$Weekly_Study_Time, levels = c('Up to 2h','2 to 5h','5 to 10h','More than 10h'))
df$Good_Family_Relationship <- factor(df$Good_Family_Relationship, levels = c('Very Poor','Poor','Fair','Good','Excellent'))
df$Alcohol_Weekdays <- factor(df$Alcohol_Weekdays, levels = c('Very Low', 'Low', 'Moderate', 'High', 'Very High'))
df$Alcohol_Weekends <- factor(df$Alcohol_Weekends, levels = c('Very Low', 'Low', 'Moderate', 'High', 'Very High'))
df$Time_with_Friends <- factor(df$Time_with_Friends, levels = c('Very Low', 'Low', 'Moderate', 'High', 'Very High'))
df$Free_Time_After_School <- factor(df$Free_Time_After_School, levels = c('Very Low', 'Low', 'Moderate', 'High', 'Very High'))
df$Health_Status <- factor(df$Health_Status, levels = c('Very Poor','Poor', 'Fair','Good','Very Good'))

# Convert the remaining character columns (binary Yes/No and other nominal categories) to factors
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)

# check dataframe structure
str(df)

## 'data.frame':    649 obs. of  31 variables:
##  $ School                      : Factor w/ 2 levels "Gabriel Pereira",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender                      : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 2 2 ...
##  $ Age                         : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ Housing_Type                : Factor w/ 2 levels "Rural","Urban": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Family_Size                 : Factor w/ 2 levels "Above 3","Up to 3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Parental_Status             : Factor w/ 2 levels "Living Together",..: 2 1 1 1 1 1 1 2 2 1 ...
##  $ Mother_Education            : Factor w/ 5 levels "None","Primary School",..: 5 2 2 5 4 5 3 5 4 4 ...
##  $ Father_Education            : Factor w/ 5 levels "None","Primary School",..: 5 2 2 3 4 4 3 5 3 5 ...
##  $ Mother_Work                 : Factor w/ 5 levels "Health","Homemaker",..: 2 2 2 1 3 4 3 3 4 3 ...
##  $ Father_Work                 : Factor w/ 5 levels "Health","Homemaker",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ Reason_School_Choice        : Factor w/ 4 levels "Course Preference",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ Legal_Responsibility        : Factor w/ 3 levels "Father","Mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ Commute_Time                : Factor w/ 4 levels "Up to 15 min",..: 2 1 1 1 1 1 1 2 1 1 ...
##  $ Weekly_Study_Time           : Factor w/ 4 levels "Up to 2h","2 to 5h",..: 2 2 2 3 2 2 2 2 2 2 ...
##  $ Extra_Educational_Support   : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ Parental_Educational_Support: Factor w/ 2 levels "No","Yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ Private_Tutoring            : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Extracurricular_Activities  : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ Attended_Daycare            : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ Desire_Graduate_Education   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Has_Internet                : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ Is_Dating                   : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ Good_Family_Relationship    : Factor w/ 5 levels "Very Poor","Poor",..: 4 5 4 3 4 5 4 4 4 5 ...
##  $ Free_Time_After_School      : Factor w/ 5 levels "Very Low","Low",..: 3 3 3 2 3 4 4 1 2 5 ...
##  $ Time_with_Friends           : Factor w/ 5 levels "Very Low","Low",..: 4 3 2 2 2 2 4 4 2 1 ...
##  $ Alcohol_Weekdays            : Factor w/ 5 levels "Very Low","Low",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Alcohol_Weekends            : Factor w/ 5 levels "Very Low","Low",..: 1 1 3 1 2 2 1 1 1 1 ...
##  $ Health_Status               : Factor w/ 5 levels "Very Poor","Poor",..: 3 3 3 5 5 5 3 1 1 5 ...
##  $ School_Absence              : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ Grade_1st_Semester          : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ Grade_2nd_Semester          : int  11 11 13 14 13 12 12 13 16 12 ...
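Note that `factor(..., levels = ...)` fixes the order of the levels but still creates a nominal factor, as the `Factor w/ 5 levels` lines in the `str()` output show. The sketch below, which uses a short made-up vector rather than the study data, illustrates what changes when `ordered = TRUE` is added. Whether the analysis needs truly ordinal factors is a design choice; this only shows the difference.

```r
# Illustrative vector; the labels mimic the Alcohol_Weekdays categories
alcohol <- c("Low", "Very High", "Very Low", "Moderate", "Low")
lvls <- c("Very Low", "Low", "Moderate", "High", "Very High")

# Levels in a fixed order, but still a nominal (unordered) factor
f_nominal <- factor(alcohol, levels = lvls)

# Same levels, declared as ordinal
f_ordinal <- factor(alcohol, levels = lvls, ordered = TRUE)

is.ordered(f_nominal)   # FALSE
is.ordered(f_ordinal)   # TRUE

# Ordinal factors support comparisons against a level name
at_least_moderate <- f_ordinal >= "Moderate"
print(at_least_moderate)

# Integer codes follow the level order (1 = Very Low, ..., 5 = Very High)
ranks <- as.integer(f_ordinal)
print(ranks)
```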

What is the age of high school students?

par(mfrow = c(1,2))

boxplot(df$Age ~ df$Gender,
        xlab = 'Age', ylab = 'Gender',
        main = 'Comparison of Age by Gender',
        col = viridis(2), horizontal = TRUE)

vioplot(df$Age ~ df$Gender, xlab = "Gender", ylab = "Age",
        col = 'lightblue')

Most students start high school at 15 and finish at 18 or 19, although some boys and girls who repeat a year finish as late as 21 or 22.

Do families live more in urban or rural areas? Does family size influence the choice?

# Bars are grouped by housing type (table columns) and stacked by family size (table rows)
barplot(table(df$Family_Size, df$Housing_Type), col = c("lightblue", "green"),
        main = "Distribution of Families by Family Size and Housing Type",
        xlab = "Housing Type", ylab = "Frequency", ylim = c(0, 500))

legend("topleft", inset = 0.1, title = "Family Size", legend = levels(df$Family_Size), fill = c("lightblue", "green"), cex = 1.3, bty = "n")

As we can see, most families have more than 3 members, and more families live in urban areas than in rural areas.

What is the average level of education of the parents?

df_l <- gather(df, key = "Parent", value = "Education", Mother_Education, Father_Education)
# Order education levels in ascending order
df_l$Education <- factor(df_l$Education, levels = c('None', 'Primary School','Lower Secondary School','High School','Higher Education'))

ggplot(df_l, aes(x = Education, fill = Parent)) +
  geom_bar(position = "dodge", color = "white") +
  labs(title = "Distribution of Mother and Father Education Levels",
       x = "Education Level",
       y = "Frequency") +
  scale_fill_manual(values = c("Mother_Education" = "lightblue", "Father_Education" = "lightgreen"))

As can be seen, the educational level of parents varies widely, although the largest group of parents (32.2% of fathers and 28.6% of mothers) stopped at lower secondary school.
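Percentages like the ones quoted above can be derived from a frequency table with `prop.table()`. The sketch below assumes ten made-up education values (not the study data) and shows the calculation:

```r
# Made-up education levels for ten parents (illustrative only)
education <- factor(
  c("Primary School", "Lower Secondary School", "High School",
    "Lower Secondary School", "Higher Education", "Lower Secondary School",
    "Primary School", "High School", "Lower Secondary School", "Higher Education"),
  levels = c("None", "Primary School", "Lower Secondary School",
             "High School", "Higher Education")
)

# Share of each level in percent, rounded to one decimal place
share <- round(100 * prop.table(table(education)), 1)
print(share)

# Most common level
top_level <- names(share)[which.max(share)]
print(top_level)
```

Applied to `df$Mother_Education` and `df$Father_Education`, the same calls would give the shares per parent.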

Do students' parents live together, and who is their legal guardian?

ggplot(df, aes(x = Parental_Status, fill = Legal_Responsibility)) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Students by Parental Status and Legal Responsibility",
       x = "Parental Status",
       y = "Frequency") +
  scale_fill_manual(values = c("Mother" = "skyblue", "Father" = "green", "Other" = "yellow"))

For most students the mother is the legal guardian, although for some students the legally responsible person is someone other than a parent.

Is there a relationship between student alcohol consumption and the quality of family relationships?

# Create a grouped bar plot for Alcohol_Weekdays
ggplot(df, aes(x = Good_Family_Relationship, fill = Alcohol_Weekdays)) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Relationship between Family Relationship and Alcohol Consumption on Weekdays",
       x = "Family Relationship",
       y = "Count") +
  scale_fill_manual(values = c("Very Low" = "lightblue", "Low" = "lightgreen", "Moderate" = "yellow", "High" = "orange", "Very High" = "red")) +
  theme_minimal()

# Create a similar plot for Alcohol_Weekends
ggplot(df, aes(x = Good_Family_Relationship, fill = Alcohol_Weekends)) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Relationship between Family Relationship and Alcohol Consumption on Weekends",
       x = "Family Relationship",
       y = "Count") +
  scale_fill_manual(values = c("Very Low" = "lightblue", "Low" = "lightgreen", "Moderate" = "yellow", "High" = "orange", "Very High" = "red")) +
  theme_minimal()

One might expect poor family relationships to go along with higher alcohol consumption. In the graphs above, however, more students with good family relationships report drinking than students with poor family relationships, and weekend consumption rises noticeably among students with good or excellent relationships. This unexpected pattern suggests an association between family background and student alcohol consumption, which is most visible on weekends, when most students spend time at home. Because the plots show raw counts and the relationship groups may differ in size, the pattern should be read with caution.
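Raw counts can mislead when groups differ in size, so it can help to compare shares within each group. The following sketch assumes made-up counts (10 students with poor and 30 with good family relationships; none of these numbers come from the study data) and shows how the two views can disagree:

```r
# Made-up records: 10 students in the "Poor" group, 30 in the "Good" group
family  <- factor(rep(c("Poor", "Good"), times = c(10, 30)),
                  levels = c("Poor", "Good"))
alcohol <- factor(c(rep(c("Low", "High"), times = c(6, 4)),    # Poor group
                    rep(c("Low", "High"), times = c(24, 6))),  # Good group
                  levels = c("Low", "High"))

# Raw counts: the "Good" group has more heavy drinkers in absolute terms
counts <- table(family, alcohol)
print(counts)

# Row-wise proportions: share of each alcohol level within a family group
within_group <- prop.table(counts, margin = 1)
print(round(within_group, 2))
```

In this toy example the "Good" group has more heavy drinkers (6 versus 4), yet a smaller share of them (20% versus 40%). In ggplot2, `geom_bar(position = "fill")` draws the same within-group shares.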

Is there a relationship between weekday alcohol consumption and study time?

# Bar groups are study-time categories (table columns); bars within a group are alcohol levels (table rows)
barplot(table(df$Alcohol_Weekdays, df$Weekly_Study_Time), beside = TRUE, col = viridis(5), ylim=c(0, 250),
        main = "Weekdays Alcohol Consumption vs. Study Time",
        xlab = "Weekly Study Time",
        ylab = "Frequency",
        legend.text = TRUE,
        args.legend = list(x = "topright", bty = "n", legend = levels(df$Alcohol_Weekdays), cex = 1.3)
)

Alcohol consumption is noticeably higher among students who study up to 2 hours per week. Students who study more than 10 hours per week also show higher levels of alcohol consumption than those who study between 5 and 10 hours. Interestingly, students who study between 2 and 5 hours per week show the lowest levels of alcohol consumption.

Let’s look at the relationship between age and alcohol consumption on weekdays and weekends

par(mfrow = c(1, 2))
# Weekday alcohol consumption by age (plot() with a factor on the x-axis draws box plots)
plot(df$Alcohol_Weekdays, df$Age, col = viridis(5), pch = 16, ylim = c(14, 25),
     xlab = "Alcohol Consumption on Weekdays",
     ylab = "Age",
     main = "Age by Alcohol Consumption on Weekdays")

#legend("topleft", legend = levels(df$Alcohol_Weekdays), col = viridis(5), pch = 16, bty = 'n')


# Weekend alcohol consumption by age (plot() with a factor on the x-axis draws box plots)
plot(df$Alcohol_Weekends, df$Age, col = viridis(5), pch = 16, ylim = c(14, 25),
     xlab = "Alcohol Consumption on Weekends",
     ylab = "Age",
     main = "Age by Alcohol Consumption on Weekends")

#legend("topleft", legend = levels(df$Alcohol_Weekends), col = viridis(5), pch = 16, bty = 'n')

Age is only weakly related to alcohol consumption on weekdays and weekends. Most drinking occurs among 16- to 18-year-old students. Most of them drink very little on weekdays, with 17- to 18-year-olds more often reaching high or very high levels. On weekends, alcohol consumption is nearly equal across all age groups. Additionally, 20- to 22-year-old students tend to have very high alcohol consumption on weekdays.
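The strength of such a relationship can be quantified with a rank correlation, which suits ordinal categories. This is a minimal sketch with eight made-up students (not the study data), using Spearman's rho on the integer level codes:

```r
# Made-up ages and weekday alcohol levels for eight students
age <- c(15, 16, 16, 17, 17, 18, 19, 21)
alcohol <- factor(
  c("Very Low", "Very Low", "Low", "Low",
    "Moderate", "Moderate", "High", "Very High"),
  levels = c("Very Low", "Low", "Moderate", "High", "Very High"),
  ordered = TRUE
)

# Spearman's rho is computed on ranks, so the level codes 1..5 are sufficient
rho <- cor(age, as.integer(alcohol), method = "spearman")
print(rho)
```

The toy values are deliberately monotone, so rho comes out close to 1; on the real data the observation above suggests a much smaller value.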

Does alcohol consumption on weekdays and weekends affect students’ school absences, by student age?

# How weekday alcohol consumption relates to school absences, by student age

ggplot(df, aes(x= School_Absence, y= Age, color = Alcohol_Weekdays)) +
  geom_point() +
  labs(title = "Scatter Plot of Age vs. School Absences",
       y = "Age",
       x = "School Absences",
       color = "Alcohol Weekdays") +
  scale_color_manual(values = c("Very Low" = "blue", "Low" = "green", "Moderate" = "yellow", "High" = "coral", "Very High" = "salmon"))

# How weekend alcohol consumption relates to school absences, by student age

ggplot(df, aes(x= School_Absence, y= Age, color = Alcohol_Weekends)) +
  geom_point() +
  labs(title = "Scatter Plot of Age vs. School Absences",
       y = "Age",
       x = "School Absences",
       color = "Alcohol Weekends") +
  scale_color_manual(values = c("Very Low" = "blue", "Low" = "green", "Moderate" = "yellow", "High" = "coral", "Very High" = "salmon"))

In the graph above, we examine the potential impact of alcohol consumption on students’ school absences. The data suggests a correlation between drinking habits and truancy.

Moderate, low, and even very low levels of alcohol consumption are associated with higher school absences, exceeding 10 days in a semester. Notably, 17-year-old students who consume alcohol at a low level during weekdays have the highest number of school days missed.

On weekends, alcohol consumption increases across all levels, with students drinking more than on weekdays. The majority tend to drink at a moderate level, which coincides with more school absences than is seen for weekday drinking.
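A scatter plot makes it hard to compare typical absence counts between consumption levels. The sketch below assumes a small made-up data frame (`toy`, with only three of the five levels) and summarises absences per level, followed by a Kruskal-Wallis test as one possible, non-authoritative way to check whether the groups differ:

```r
# Made-up absences for nine students (illustrative only)
toy <- data.frame(
  Alcohol_Weekdays = factor(
    c("Very Low", "Very Low", "Very Low",
      "Low", "Low", "Low",
      "High", "High", "High"),
    levels = c("Very Low", "Low", "High")
  ),
  School_Absence = c(0, 2, 4, 1, 6, 8, 3, 10, 12)
)

# Median and mean number of absences within each consumption level
median_abs <- tapply(toy$School_Absence, toy$Alcohol_Weekdays, median)
mean_abs   <- tapply(toy$School_Absence, toy$Alcohol_Weekdays, mean)
print(rbind(median = median_abs, mean = mean_abs))

# Rank-based test for a difference in absences between the levels
kw <- kruskal.test(School_Absence ~ Alcohol_Weekdays, data = toy)
print(kw$p.value)
```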

Does alcohol consumption on weekdays and weekends affect students’ grades?

# Set the overall size of the plot
par(mfrow = c(2, 2), mar = c(5, 5, 4, 2), pin = c(2, 2))

# Weekends alcohol consumption with Grade 1st Semester
plot(df$Alcohol_Weekends, df$Grade_1st_Semester, col = viridis(5), pch = 16, ylim = c(0, 25),
     xlab = "Alcohol Consumption on Weekends",
     ylab = "Grade 1st Semester",
     main = "Alcohol Weekends vs. Grade 1st Semester")

#legend("topleft", legend = levels(df$Alcohol_Weekends), col = viridis(5), pch = 16, bty = 'n')

# Weekends alcohol consumption with Grade 2nd Semester
plot(df$Alcohol_Weekends, df$Grade_2nd_Semester, col = viridis(5), pch = 16, ylim = c(0, 25),
     xlab = "Alcohol Consumption on Weekends",
     ylab = "Grade 2nd Semester",
     main = "Alcohol Weekends vs. Grade 2nd Semester")

#legend("topleft", legend = levels(df$Alcohol_Weekends), col = viridis(5), pch = 16, bty = 'n')

# Weekdays alcohol consumption with Grade 1st Semester
plot(df$Alcohol_Weekdays, df$Grade_1st_Semester, col = viridis(5), pch = 16, ylim = c(0, 25),
     xlab = "Alcohol Consumption on Weekdays",
     ylab = "Grade 1st Semester",
     main = "Alcohol Weekdays vs. Grade 1st Semester")

#legend("topleft", legend = levels(df$Alcohol_Weekdays), col = viridis(5), pch = 16, bty = 'n')

# Weekdays alcohol consumption with Grade 2nd Semester
plot(df$Alcohol_Weekdays, df$Grade_2nd_Semester, col = viridis(5), pch = 16, ylim = c(0, 25),
     xlab = "Alcohol Consumption on Weekdays",
     ylab = "Grade 2nd Semester",
     main = "Alcohol Weekdays vs. Grade 2nd Semester")

#legend("topleft", legend = levels(df$Alcohol_Weekdays), col = viridis(5), pch = 16, bty = 'n')

In this analysis, a student passes a semester with at least 60% of the available marks, that is, a grade of 12 out of 20. The box plots above show the grades of both semesters by alcohol consumption on school days and on weekends. Notably, 25% of the students who consumed alcohol, even at very low levels, failed in both semesters. Meanwhile, 50% of those who drank at the lowest level advanced to the next semester with the lowest passing grade.

However, students with moderate, high, or very high weekday consumption tended to fail in both semesters, although some of those who drank at moderate and very high levels still managed to pass.

A distinct difference is observed between weekday and weekend drinking. More students who consumed alcohol on weekends passed to the next semester, even when their consumption exceeded moderate levels, than students who drank predominantly on weekdays.
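Pass rates per consumption level can be computed directly instead of being read off the box plots. This sketch assumes a made-up data frame (`toy`) and a hypothetical variable `pass_mark` set to the threshold of 12 used in the text:

```r
# Threshold used in the text; adjust if a different pass mark applies
pass_mark <- 12

# Made-up grades for eight students (illustrative only)
toy <- data.frame(
  Alcohol_Weekends = factor(
    c("Very Low", "Very Low", "Very Low", "Very Low",
      "Moderate", "Moderate", "Very High", "Very High"),
    levels = c("Very Low", "Moderate", "Very High")
  ),
  Grade_1st_Semester = c(14, 12, 10, 16, 11, 13, 8, 12)
)

# Flag each student as passed (TRUE) or failed (FALSE)
toy$Passed <- toy$Grade_1st_Semester >= pass_mark

# The mean of a logical vector is the share of TRUE values, i.e. the pass rate
pass_rate <- tapply(toy$Passed, toy$Alcohol_Weekends, mean)
print(pass_rate)
```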

School Absence and Grade with drinking alcoholic beverages during school days

summary(df$School_Absence)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   3.659   6.000  32.000

# School absences vs. 1st semester grade, colored by weekday alcohol consumption
ggplot(df, aes(x = School_Absence, y = Grade_1st_Semester, color = Alcohol_Weekdays)) +
  geom_point() +
  labs(title = "Scatter Plot of School Absences vs. Grade 1st Semester",
       x = "School Absences",
       y = "Grade 1st Semester",
       color = "Alcohol Weekdays") +
  scale_color_manual(values = c("Very Low" = "green", "Low" = "yellow", "Moderate" = "orange", "High" = "red", "Very High" = "darkred")) +
  ylim(0, 20) + 
  theme_minimal()

# School absences vs. 2nd semester grade, colored by weekday alcohol consumption
ggplot(df, aes(x = School_Absence, y = Grade_2nd_Semester, color = Alcohol_Weekdays)) +
  geom_point() +
  labs(title = "Scatter Plot of School Absences vs. Grade 2nd Semester",
       x = "School Absences",
       y = "Grade 2nd Semester",
       color = "Alcohol Weekdays") +
  scale_color_manual(values = c("Very Low" = "green", "Low" = "yellow", "Moderate" = "orange", "High" = "red", "Very High" = "darkred")) +
  ylim(0, 20) + 
  theme_minimal()

Drinking alcohol on school days is associated with increased truancy and lower grades. The mean number of absences is about 4 days (3.66) across all students. We can also see that alcohol consumption is higher among students who failed (grade below 12) than among those who passed. Overall, higher alcohol consumption goes together with lower semester grades and more school absences.

What is the relationship between alcoholism, gender, and students’ grades?

# Create a box plot for Alcohol_Weekdays
ggplot(df, aes(x = Gender, y = Grade_1st_Semester, fill = Alcohol_Weekdays)) +
  geom_boxplot() +
  labs(title = "Relationship between Gender, Alcohol Consumption on Weekdays, and 1st Semester Grades",
       x = "Gender",
       y = "1st Semester Grades") +
  scale_fill_manual(values = c("Very Low" = "lightblue", "Low" = "lightgreen", "Moderate" = "lightyellow", "High" = "orange", "Very High" = "red")) +
  ylim(0, 20) + theme_minimal()

# Create a similar plot for Alcohol_Weekends
ggplot(df, aes(x = Gender, y = Grade_1st_Semester, fill = Alcohol_Weekends)) +
  geom_boxplot() +
  labs(title = "Relationship between Gender, Alcohol Consumption on Weekends, and 1st Semester Grades",
       x = "Gender",
       y = "1st Semester Grades") +
  scale_fill_manual(values = c("Very Low" = "lightblue", "Low" = "lightgreen", "Moderate" = "lightyellow", "High" = "orange", "Very High" = "red")) +
  ylim(0, 20) + theme_minimal()

In the first semester, alcoholism appears to have a more pronounced impact on male students compared to their female counterparts. The prevalence of alcohol consumption is higher among male students, potentially contributing to a higher rate of academic challenges, with many males failing in the first semester.

However, in the second semester, there is a noticeable decrease in alcoholism for both genders. This positive trend correlates with improved academic performance, as a significant number of students, regardless of gender, managed to pass their exams, achieving at least a grade of 12. This success suggests a positive shift, indicating that students either successfully progressed to the next semester or completed their high school education.

Clustering

In preparation for the clustering section, I’ll encode the categorical columns as numeric types. After that, I’ll separate the columns according to our hypotheses. This transforms the categorical features into a format suitable for clustering analysis and lets us investigate the relationships outlined in our hypotheses efficiently.

# Identify categorical columns
categorical_columns <- c("School", "Gender", "Housing_Type", "Family_Size", "Parental_Status",
                         "Mother_Education", "Father_Education", "Mother_Work", "Father_Work",
                         "Reason_School_Choice", "Legal_Responsibility", "Commute_Time",
                         "Weekly_Study_Time", "Extra_Educational_Support",
                         "Parental_Educational_Support", "Private_Tutoring",
                         "Extracurricular_Activities", "Attended_Daycare",
                         "Desire_Graduate_Education", "Has_Internet", "Is_Dating",
                         "Good_Family_Relationship", "Free_Time_After_School",
                         "Time_with_Friends", "Alcohol_Weekdays", "Alcohol_Weekends",
                         "Health_Status")

# Encode categorical columns to numeric
df_encoded <- df %>%
  mutate(across(all_of(categorical_columns), as.integer))

# Convert the grade columns to pass/fail indicators (1 = pass, 0 = fail) for each semester and for the whole year
df$pass_year <- (df_encoded$Grade_1st_Semester + df_encoded$Grade_2nd_Semester) / 2
df_encoded$pass_year <- ifelse(df$pass_year >= 12, 1, 0)
df_encoded$Grade_1st_Semester <- ifelse(df_encoded$Grade_1st_Semester >= 12, 1, 0)
df_encoded$Grade_2nd_Semester <- ifelse(df_encoded$Grade_2nd_Semester >= 12, 1, 0)

# Check the structure of the new encoded dataframe
str(df_encoded)

## 'data.frame':    649 obs. of  32 variables:
##  $ School                      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender                      : int  1 1 1 1 1 2 2 1 2 2 ...
##  $ Age                         : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ Housing_Type                : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Family_Size                 : int  1 1 2 1 1 2 2 1 2 1 ...
##  $ Parental_Status             : int  2 1 1 1 1 1 1 2 2 1 ...
##  $ Mother_Education            : int  5 2 2 5 4 5 3 5 4 4 ...
##  $ Father_Education            : int  5 2 2 3 4 4 3 5 3 5 ...
##  $ Mother_Work                 : int  2 2 2 1 3 4 3 3 4 3 ...
##  $ Father_Work                 : int  5 3 3 4 3 3 3 5 3 3 ...
##  $ Reason_School_Choice        : int  1 1 3 2 2 4 2 2 2 2 ...
##  $ Legal_Responsibility        : int  2 1 2 2 1 2 2 2 2 2 ...
##  $ Commute_Time                : int  2 1 1 1 1 1 1 2 1 1 ...
##  $ Weekly_Study_Time           : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ Extra_Educational_Support   : int  2 1 2 1 1 1 1 2 1 1 ...
##  $ Parental_Educational_Support: int  1 2 1 2 2 2 1 2 2 2 ...
##  $ Private_Tutoring            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Extracurricular_Activities  : int  1 1 1 2 1 2 1 1 1 2 ...
##  $ Attended_Daycare            : int  2 1 2 2 2 2 2 2 2 2 ...
##  $ Desire_Graduate_Education   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Has_Internet                : int  1 2 2 2 1 2 2 1 2 2 ...
##  $ Is_Dating                   : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ Good_Family_Relationship    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ Free_Time_After_School      : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ Time_with_Friends           : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Alcohol_Weekdays            : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Alcohol_Weekends            : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ Health_Status               : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ School_Absence              : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ Grade_1st_Semester          : num  0 0 1 1 0 1 1 0 1 1 ...
##  $ Grade_2nd_Semester          : num  0 0 1 1 1 1 1 1 1 1 ...
##  $ pass_year                   : num  0 0 1 1 1 1 1 0 1 1 ...

Since the variables are on different scales, I will standardize them with the scale() function in order to get proper and interpretable results.

df1 <- scale(df_encoded)
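
To make the effect of scale() concrete, here is a minimal sketch on a made-up vector (the values are illustrative and not taken from the dataset). It assumes the default arguments, under which each column is centered on its mean and divided by its standard deviation.

```r
# Illustrative values only, not from the student dataset
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() with default arguments does the same for each column
z_scale <- as.numeric(scale(x))

print(round(z_scale, 3))
print(all.equal(z_manual, z_scale))
```

After standardization every column has mean 0 and standard deviation 1, so a variable with a wide range such as School_Absence no longer dominates the distance calculations.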

Now I separate the columns to build clusters according to each hypothesis:

H1: There is a correlation between family background and student alcoholism

H2: There is a relationship between alcoholism and grades

h1_columns <- c('Housing_Type','Family_Size','Parental_Status','Mother_Education','Father_Education','Mother_Work',
                'Father_Work','Good_Family_Relationship','Reason_School_Choice','Legal_Responsibility', 'Parental_Educational_Support',
                'Attended_Daycare','Alcohol_Weekdays','Alcohol_Weekends')

h2_columns <- c('Weekly_Study_Time','Extra_Educational_Support','Private_Tutoring','Desire_Graduate_Education',
                'Has_Internet','Free_Time_After_School','Time_with_Friends','Alcohol_Weekdays','Alcohol_Weekends',
                'School_Absence','Grade_1st_Semester','Grade_2nd_Semester', 'pass_year')

df_h1 <- df1[, h1_columns] # for hypothesis 1
df_h2 <- df1[, h2_columns] # for hypothesis 2

corr_1 <- corrplot(cor(df_h1), type = 'upper',number.cex = 0.65, order = 'hclust', method = "number")

corr_2 <- corrplot(cor(df_h2,  use="complete.obs"), number.cex = 0.65, type = 'upper', method = 'number', order = 'hclust')

Prediagnostics

I will start by running pre-diagnostics to check whether the data can be clustered and to choose the optimal number of clusters.

To assess the clusterability of the data, I will compute the Hopkins statistic. The null hypothesis states that the dataset is uniformly distributed and does not contain meaningful clusters.

library(factoextra)
get_clust_tendency(df_h1, 20, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))

## $hopkins_stat
## [1] 0.62192
## 
## $plot

The Hopkins statistic is a measure used to assess the clustering tendency of a dataset:

If the Hopkins statistic is close to 0.5, it indicates that the dataset is uniformly distributed, resembling random data. In this case, finding meaningful clusters might be challenging, as there is no apparent structure or tendency for points to cluster together.
If the Hopkins statistic is close to 1, it suggests a good clustering tendency. This means that the data points are not uniformly distributed, and there is a structure or pattern that can be exploited by clustering algorithms.
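
The interpretation above can be illustrated with a simplified base R sketch. The function hopkins_sketch() is a hypothetical helper written for this example. It is not the implementation used by factoextra or the hopkins package, which differ in details such as distance exponents, and the simulated data are assumptions chosen only to show the two extremes.

```r
# Simplified Hopkins-style statistic (hypothetical helper, for illustration only)
hopkins_sketch <- function(X, m = 20) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)

  # w: distance from m sampled real points to their nearest other real point
  D <- as.matrix(dist(X))
  diag(D) <- Inf
  idx <- sample(n, m)
  w <- apply(D[idx, , drop = FALSE], 1, min)

  # u: distance from m uniform random points (inside the bounding box)
  #    to their nearest real point
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  U <- matrix(sapply(seq_len(p), function(j) runif(m, lo[j], hi[j])), nrow = m)
  u <- apply(U, 1, function(pt) min(sqrt(colSums((t(X) - pt)^2))))

  sum(u) / (sum(u) + sum(w))
}

set.seed(123)
uniform_data <- matrix(runif(400), ncol = 2)
clustered_data <- rbind(matrix(rnorm(200, mean = 0, sd = 0.05), ncol = 2),
                        matrix(rnorm(200, mean = 5, sd = 0.05), ncol = 2))

h_uniform <- hopkins_sketch(uniform_data)
h_clustered <- hopkins_sketch(clustered_data)
print(c(uniform = h_uniform, clustered = h_clustered))
```

For uniform data the two distance sums are of similar size, so the ratio stays near 0.5. For tightly clustered data the real points sit much closer to each other than random points do, which pushes the ratio toward 1.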

Let’s check with the hopkins library function.

# Check whether the data is suitable for clustering using the Hopkins statistic
# first dataset
hopkins_statistic <- hopkins(df_h1, m = nrow(df1) - 1)
cat("Hopkins statistic score for the first dataset: ", hopkins_statistic)

## Hopkins statistic score for the first dataset:  0.9470124

# second dataset
hopkins_statistic_1 <- hopkins(df_h2, m = nrow(df1) - 1)
cat("\nHopkins statistic score for the second dataset: ", hopkins_statistic_1)

## 
## Hopkins statistic score for the second dataset:  0.9994266

We can conclude that the dataset is significantly clusterable. The same conclusion can be drawn from the ordered dissimilarity plot: blocks of different colors indicate that clusters can be found in the data.

The next step is to find the optimal number of clusters. To do this, I will use the silhouette statistic with two clustering algorithms: k-means and PAM.
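
As a rough illustration of what the silhouette statistic measures, the sketch below computes silhouette widths by hand in base R. The helper silhouette_widths() and the simulated data are hypothetical and serve only this example. The analysis itself relies on the silhouette() function used in the code that follows.

```r
# Silhouette width s(i) = (b - a) / max(a, b), where
#   a = mean distance from point i to the other points in its own cluster
#   b = smallest mean distance from point i to the points of any other cluster
silhouette_widths <- function(X, cl) {
  D <- as.matrix(dist(X))
  sapply(seq_len(nrow(D)), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE
    if (!any(own)) return(0)  # convention for singleton clusters
    a <- mean(D[i, own])
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(k) mean(D[i, cl == k])))
    (b - a) / max(a, b)
  })
}

set.seed(1)
# Two well-separated groups of 20 points each (simulated)
X <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 4, sd = 0.3), ncol = 2))

good_labels <- rep(1:2, each = 20)   # labels that match the true groups
bad_labels  <- rep(1:2, times = 20)  # labels that ignore the structure

mean_good <- mean(silhouette_widths(X, good_labels))
mean_bad  <- mean(silhouette_widths(X, bad_labels))
print(c(good = mean_good, bad = mean_bad))
```

A partition that matches the structure of the data yields an average width close to 1, while an arbitrary partition yields a value near 0. This is why the number of clusters with the highest average silhouette width is chosen.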

# Run K-Means clustering with different values of k for first dataset df_h1
k_values <- 2:10
silhouette_scores_1 <- sapply(k_values, function(k) {
  kmeans_result <- kmeans(df_h1, centers = k, nstart = 25)
  cluster_silhouette <- silhouette(kmeans_result$cluster, dist(df_h1))
  mean(cluster_silhouette[, 3])
})

# Run K-Means clustering with different values of k for second dataset df_h2
silhouette_scores_2 <- sapply(k_values, function(k) {
  kmeans_result <- kmeans(df_h2, centers = k, nstart = 25)
  cluster_silhouette <- silhouette(kmeans_result$cluster, dist(df_h2))
  mean(cluster_silhouette[, 3])
})

# Plot Silhouette Score vs. Number of Clusters
plot(k_values, silhouette_scores_1, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters (k)", ylab = "Silhouette Score",
     main = "Silhouette Score for Different Numbers of Clusters (df_h1)")

# Plot Silhouette Score vs. Number of Clusters
plot(k_values, silhouette_scores_2, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters (k)", ylab = "Silhouette Score",
     main = "Silhouette Score for Different Numbers of Clusters (df_h2)")

Let’s repeat this with the automated fviz_nbclust() function, which also works with other clustering models.

a <- fviz_nbclust(df_h1, FUNcluster = kmeans, method = "silhouette") + theme_classic() 
b <- fviz_nbclust(df_h1, FUNcluster = cluster::pam, method = "silhouette") + theme_classic() 

plot_grid(a, b, labels = c("kmeans", "pam"), ncol = 2, nrow = 1)

# second dataset
a <- fviz_nbclust(df_h2, FUNcluster = kmeans, method = "silhouette") + theme_classic() 
b <- fviz_nbclust(df_h2, FUNcluster = cluster::pam, method = "silhouette") + theme_classic() 

plot_grid(a, b, labels = c("kmeans", "pam"), ncol = 2, nrow = 1)

K-means

The basic concept of k-means is that each cluster is represented by its centre (mean). Its goal is to minimize the squared error of the intra-cluster dissimilarity. This means that the algorithm aims for clusters that are internally consistent and distinct from each other.
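
The quantity that k-means minimizes can be checked directly. The following minimal sketch uses simulated data (an assumption for illustration, not the student dataset) and recomputes the total within-cluster sum of squares from the fitted centres.

```r
set.seed(42)
# Simulated data: two groups of 30 points in two dimensions
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))

km <- kmeans(X, centers = 2, nstart = 10)

# Squared Euclidean distance of every point to the centre of its own cluster
wss_manual <- sum((X - km$centers[km$cluster, ])^2)

print(c(manual = wss_manual, kmeans = km$tot.withinss))
```

The manual sum matches the tot.withinss value reported by kmeans(), which is the objective the algorithm tries to make as small as possible for a given k.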

Pearson correlation for first dataset (df_h1)

library(factoextra)
cl_kmeans <- eclust(df_h1, k=2, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)

a <- fviz_silhouette(cl_kmeans)

##   cluster size ave.sil.width
## 1       1  293          0.09
## 2       2  356          0.12

b <- fviz_cluster(cl_kmeans, data = df_h1, ellipse.type = "convex") + theme_minimal()
grid.arrange(a, b, ncol=2)

Efficiency testing:

table(cl_kmeans$cluster, df$Alcohol_Weekdays)

##    
##     Very Low Low Moderate High Very High
##   1      207  52       21    8         5
##   2      244  69       22    9        12

Pearson correlation for second dataset (df_h2)

cl_kmeans <- eclust(df_h2, k=2, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)

a <- fviz_silhouette(cl_kmeans)

##   cluster size ave.sil.width
## 1       1  361          0.13
## 2       2  288          0.34

b <- fviz_cluster(cl_kmeans, data = df_h2, ellipse.type = "convex") + theme_minimal()
grid.arrange(a, b, ncol=2)

Efficiency testing:

table(cl_kmeans$cluster, df_encoded$pass_year)

##    
##       0   1
##   1 361   0
##   2   2 286

Euclidean distance:

cl_kmeans <- eclust(df_h2, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)

a <- fviz_silhouette(cl_kmeans)

##   cluster size ave.sil.width
## 1       1  361          0.13
## 2       2  288          0.34

b <- fviz_cluster(cl_kmeans, data = df_h2, ellipse.type = "convex") + theme_minimal()
grid.arrange(a, b, ncol=2)

Efficiency testing:

table(cl_kmeans$cluster, df_encoded$pass_year)

##    
##       0   1
##   1 361   0
##   2   2 286

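Summarizing the results above, there is no difference between the Pearson correlation and Euclidean distance settings: the cluster sizes, silhouette widths, and cross-tables are identical. This is expected, because according to the factoextra documentation eclust() uses hc_metric only for hierarchical clustering functions, so k-means works with Euclidean distance in both runs. The per-cluster average silhouette widths are 0.13 and 0.34, which corresponds to an overall average of roughly 0.22. As to accuracy, both runs separate the pass_year classes almost perfectly: cluster 1 contains only students who failed the year (361), while cluster 2 contains 286 students who passed and 2 who failed.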

PAM

PAM (Partitioning Around Medoids) is a clustering algorithm that is similar to K-Means but uses medoids (data points that are representative of a cluster) instead of centroids. The medoid is the data point within a cluster whose average dissimilarity to all the other points in the cluster is minimized.
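
The difference between a medoid and a centroid can be seen in a minimal sketch with made-up numbers (illustrative values, not from the dataset). The medoid is found by picking the observation with the smallest average distance to all others.

```r
# Five one-dimensional observations, the last one is an outlier
x <- matrix(c(0, 1, 2, 3, 20), ncol = 1)

# Average dissimilarity of each point to all points in the group
d <- as.matrix(dist(x))
avg_dissimilarity <- rowMeans(d)

# Medoid: the actual observation with the smallest average dissimilarity
medoid_idx <- unname(which.min(avg_dissimilarity))
medoid_value <- x[medoid_idx, 1]

# Centroid: the arithmetic mean, which need not be an observed point
centroid_value <- mean(x)

print(c(medoid = medoid_value, centroid = centroid_value))
```

Here the medoid stays at an observed value in the bulk of the data, while the centroid is pulled toward the outlier. This is the usual argument for PAM being less sensitive to extreme values than k-means.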

# Manhattan
cl_pam <- eclust(df_h1, k=3, FUNcluster="pam", hc_metric="manhattan", graph=FALSE)
c <- fviz_silhouette(cl_pam)

##   cluster size ave.sil.width
## 1       1   73          0.17
## 2       2  392          0.14
## 3       3  184          0.05

d <- fviz_cluster(cl_pam, data = df_h1, ellipse.type = "convex") + theme_minimal()
grid.arrange(c, d, ncol=2)

Efficiency testing:

table(cl_pam$cluster, df$Alcohol_Weekdays)

##    
##     Very Low Low Moderate High Very High
##   1       59   8        5    1         0
##   2      311  64       12    4         1
##   3       81  49       26   12        16

Euclidean distance

cl_pam1 <- eclust(df_h1, k=3, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
c <- fviz_silhouette(cl_pam1)

##   cluster size ave.sil.width
## 1       1   73          0.17
## 2       2  392          0.14
## 3       3  184          0.05

d <- fviz_cluster(cl_pam1, data = df_h1, ellipse.type = "convex") + theme_minimal()
grid.arrange(c, d, ncol=2)

Efficiency testing:

table(cl_pam1$cluster, df$Alcohol_Weekdays)

##    
##     Very Low Low Moderate High Very High
##   1       59   8        5    1         0
##   2      311  64       12    4         1
##   3       81  49       26   12        16

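Using different distance settings (Manhattan and Euclidean) does not change the results: the cluster assignments and the average silhouette widths are the same. This is likely because, according to the factoextra documentation, eclust() applies hc_metric only to hierarchical clustering functions, so PAM runs with its default distance in both cases. Let’s check the second dataset with the same two settings.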

# Manhattan
cl_pam <- eclust(df_h2, k=3, FUNcluster="pam", hc_metric="manhattan", graph=FALSE)
c <- fviz_silhouette(cl_pam)

##   cluster size ave.sil.width
## 1       1  259          0.13
## 2       2  280          0.32
## 3       3  110          0.12

d <- fviz_cluster(cl_pam, data = df_h2, ellipse.type = "convex") + theme_minimal()
grid.arrange(c, d, ncol=2)

Efficiency testing:

table(cl_pam$cluster, df_encoded$pass_year)

##    
##       0   1
##   1 259   0
##   2   0 280
##   3 104   6

# Euclidean
cl_pam1 <- eclust(df_h2, k=3, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
c <- fviz_silhouette(cl_pam1)

##   cluster size ave.sil.width
## 1       1  259          0.13
## 2       2  280          0.32
## 3       3  110          0.12

d <- fviz_cluster(cl_pam1, data = df_h2, ellipse.type = "convex") + theme_minimal()
grid.arrange(c, d, ncol=2)

Efficiency testing:

table(cl_pam1$cluster, df_encoded$pass_year)

##    
##       0   1
##   1 259   0
##   2   0 280
##   3 104   6

Using different distance metrics such as Manhattan and Euclidean does not lead to significant changes in the clustering results. Both metrics yield the same outcomes, and the Average Silhouette Width remains consistent across the different distance algorithms.

I add the cluster assignment as a column to the data to check how well the model results match the data.

df_encoded$cluster <- cl_pam1$cluster
df$cluster <- cl_pam1$cluster

corr_2 <- corrplot(cor(df_encoded[, append(h2_columns, "cluster")],  use="complete.obs"), number.cex = 0.65, type = 'upper', method = 'number', order = 'alphabet')

We observe a positive correlation of 0.57 between alcohol consumption during weekdays and our model results. Additionally, weekend alcohol drinking shows a positive correlation of 0.48. Furthermore, passing the study year exhibits a negative correlation with both alcohol consumption on weekdays and weekends.

ggplot(df[, append(h2_columns, "cluster")], aes(x = pass_year, y = Alcohol_Weekdays, color = factor(cluster))) +
  geom_point() +
  labs(title = "Relationship between Pass Year, Alcohol Weekdays, and Cluster",
       x = "Pass Year",
       y = "Alcohol Weekdays",
       color = "Cluster") +
  theme_minimal()

# Cluster 1
cluster_1 <- df[df$cluster == 1, ]
summary(cluster_1$Alcohol_Weekdays)

##  Very Low       Low  Moderate      High Very High 
##       217        38         3         1         0

summary(cluster_1$Alcohol_Weekends)

##  Very Low       Low  Moderate      High Very High 
##       119        73        44        20         3

summary(cluster_1$Weekly_Study_Time)

##      Up to 2h       2 to 5h      5 to 10h More than 10h 
##            88           124            36            11

summary(cluster_1$School_Absence)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   3.405   5.000  18.000

summary(cluster_1$Grade_1st_Semester)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   9.000  10.000   9.525  11.000  13.000

summary(cluster_1$Grade_2nd_Semester)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   9.000  10.000   9.687  11.000  13.000

# Cluster 2
cluster_2 <- df[df$cluster == 2, ]
summary(cluster_2$Alcohol_Weekdays)

##  Very Low       Low  Moderate      High Very High 
##       228        44         6         2         0

summary(cluster_2$Alcohol_Weekends)

##  Very Low       Low  Moderate      High Very High 
##       126        72        54        23         5

summary(cluster_2$Weekly_Study_Time)

##      Up to 2h       2 to 5h      5 to 10h More than 10h 
##            58           145            58            19

summary(cluster_2$School_Absence)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   2.632   4.000  32.000

summary(cluster_2$Grade_1st_Semester)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   13.00   14.00   13.85   15.00   19.00

summary(cluster_2$Grade_2nd_Semester)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   13.00   14.00   14.04   15.00   19.00

# Cluster 3
cluster_3 <- df[df$cluster == 3, ]
summary(cluster_3$Alcohol_Weekdays)

##  Very Low       Low  Moderate      High Very High 
##         6        39        34        14        17

summary(cluster_3$Alcohol_Weekends)

##  Very Low       Low  Moderate      High Very High 
##         2         5        22        44        37

summary(cluster_3$Weekly_Study_Time)

##      Up to 2h       2 to 5h      5 to 10h More than 10h 
##            66            36             3             5

summary(cluster_3$School_Absence)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   5.500   6.873  10.000  30.000

summary(cluster_3$Grade_1st_Semester)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   8.000  10.000   9.582  11.000  16.000

summary(cluster_3$Grade_2nd_Semester)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   9.000  10.000   9.709  11.000  16.000

Conclusion

I explored the complex relationships between high school students’ family backgrounds, alcohol consumption habits, and academic performance. The dataset, obtained from a diverse group of Portuguese high school students, allowed us to conduct a comprehensive analysis, addressing specific hypotheses related to family background, alcoholism, and academic outcomes.

In conclusion, this study sheds light on the intricate relationships among family background, alcohol consumption, and academic performance in high school students. The findings emphasize the need for comprehensive, age-specific interventions that consider the role of family dynamics. By addressing these factors, educators, parents, and policymakers can contribute to creating a conducive environment for students to thrive academically while promoting overall well-being.

High school alcoholism and school achievements

Samidullo Abdullaev

2023-12-28
