Introduction

This study aims to contribute to understanding the interplay between students’ academic outcomes, their lifestyles, familial influences, and other social factors. Apart from natural differences in learning abilities, we know that success in school can depend on many other factors, such as parental support, extra-curricular activities, time devoted to study, leisure time and how that time is spent.

The goal of this work is to uncover patterns and dependencies that contribute to students’ final grades, using association rules. This method allows us to study the relationship between many variables within a data set. Rules can be examined by such indicators as: support, confidence and lift.

Dataset description

The dataset used in this study comes from Kaggle. The data were collected in a survey of secondary school students taking math and Portuguese language courses.

Market Basket Analysis of these data can help draw conclusions about students in Portuguese schools or about the Portuguese education system. More broadly, it can also give a general picture of students and their academic performance.

Variables

Association rule mining works on categorical items, so every value needs a descriptive label. That is why I recoded the columns of this data set as follows:

  • sex - student’s sex (binary: ‘Female’ or ‘Male’)
  • age - student’s age (numeric: from 15 to 22)
  • address - student’s home address type (binary: “Urban” or “Rural”)
  • famsize - family size (binary: “Family size less than or equal to 3” or “Family size greater than 3”)
  • Pstatus - parent’s cohabitation status (binary: “Parents living together” or “Parents living apart”)
  • Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  • Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  • Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • traveltime - home to school travel time (ordinal: < 15 min., 15 to 30 min., 30 min. to 1 hour, > 1 hour)
  • studytime - weekly study time (ordinal: < 2 hours, 2 to 5 hours, 5 to 10 hours, > 10 hours)
  • failures - number of past class failures (numeric: n if 1<=n<3, else 4)
  • schoolsup - extra educational support (binary: yes or no)
  • famsup - family educational support (binary: yes or no)
  • paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  • activities - extra-curricular activities (binary: yes or no)
  • nursery - attended nursery school (binary: yes or no)
  • higher - wants to take higher education (binary: yes or no)
  • internet - Internet access at home (binary: yes or no)
  • romantic - with a romantic relationship (binary: yes or no)
  • famrel - quality of family relationships (ordinal: Bad/Neutral/Good quality of family relationships)
  • freetime - free time after school (ordinal: Little/Moderate amount/A lot of free time after school)
  • goout - going out with friends (ordinal: Low/Moderate/High frequency of going out with friends)
  • Dalc - workday alcohol consumption (ordinal: Low/Moderate/High workday alcohol consumption)
  • Walc - weekend alcohol consumption (ordinal: Low/Moderate/High weekend alcohol consumption)
  • health - current health status (ordinal: Poor/Moderate/Good health status)
  • absences - number of school absences (ordinal: Occasional/Frequent/Very frequent absences)
  • G1 - first period grade
  • G2 - second period grade
  • G3 - final grade

Grades are binned as follows: above 0 up to 10 - “Poor”, above 10 up to 16 - “Good”, above 16 up to 20 - “Very good”.

Libraries

library(arules)
library(arulesViz)
library(arulesCBA)
library(knitr)
library(kableExtra)
library(dplyr)

Dataset preparation

DATA <- read.csv('student-alcohol.csv', header=TRUE, sep=",")
# drop the first two columns, which are not used in the analysis
DATA <- DATA[, -c(1, 2)]

DATA$sex[DATA$sex == "F"] <- "Female"
DATA$sex[DATA$sex == "M"] <- "Male"

DATA$address[DATA$address == "U"] <- "Urban"
DATA$address[DATA$address == "R"] <- "Rural"

DATA$famsize[DATA$famsize == "LE3"] <- "Family size less than or equal to 3"
DATA$famsize[DATA$famsize == "GT3"] <- "Family size greater than 3"

DATA$Pstatus[DATA$Pstatus == "T"] <- "Parents living together"
DATA$Pstatus[DATA$Pstatus == "A"] <- "Parents living apart"

DATA$Medu[DATA$Medu == 0] <- "Mother - no education"
DATA$Medu[DATA$Medu == 1 | DATA$Medu == 2] <- "Mother - primary education"
DATA$Medu[DATA$Medu == 3] <- "Mother - secondary education"
DATA$Medu[DATA$Medu == 4] <- "Mother - higher education"

DATA$Fedu[DATA$Fedu == 0] <- "Father - no education"
DATA$Fedu[DATA$Fedu == 1 | DATA$Fedu == 2] <- "Father - primary education"
DATA$Fedu[DATA$Fedu == 3] <- "Father - secondary education"
DATA$Fedu[DATA$Fedu == 4] <- "Father - higher education"

DATA$Mjob <- paste("Mother's job -", DATA$Mjob)
DATA$Fjob <- paste("Father's job -", DATA$Fjob)

DATA$traveltime[DATA$traveltime == 1] <- "< 15 min. home to school travel"
DATA$traveltime[DATA$traveltime == 2] <- "15 to 30 min. home to school travel"
DATA$traveltime[DATA$traveltime == 3] <- "30 min. to 1 hour home to school travel"
DATA$traveltime[DATA$traveltime == 4] <- "> 1 hour home to school travel"

DATA$studytime[DATA$studytime == 1] <- "< 2 hours weekly study time"
DATA$studytime[DATA$studytime == 2] <- "2 to 5 hours weekly study time"
DATA$studytime[DATA$studytime == 3] <- "5 to 10 hours weekly study time"
DATA$studytime[DATA$studytime == 4] <- "> 10 hours weekly study time"

DATA$failures <- paste(DATA$failures, "past class failures")

DATA$schoolsup[DATA$schoolsup == "yes"] <- "Extra educational support"
DATA$schoolsup[DATA$schoolsup == "no"] <- "No extra educational support"

DATA$famsup[DATA$famsup == "yes"] <- "Family educational support"
DATA$famsup[DATA$famsup == "no"] <- "No family educational support"

DATA$paid[DATA$paid == "yes"] <- "Extra paid classes"
DATA$paid[DATA$paid == "no"] <- "No extra paid classes"

DATA$activities[DATA$activities == "yes"] <- "Extra-curricular activities"
DATA$activities[DATA$activities == "no"] <- "No extra-curricular activities"

DATA$nursery[DATA$nursery == "yes"] <- "Attended nursery school"
DATA$nursery[DATA$nursery == "no"] <- "Did not attend nursery school"

DATA$higher[DATA$higher == "yes"] <- "Wants to take higher education"
DATA$higher[DATA$higher == "no"] <- "Does not want to take higher education"

DATA$internet[DATA$internet == "yes"] <- "Has internet access at home"
DATA$internet[DATA$internet == "no"] <- "Does not have internet access at home"

DATA$romantic[DATA$romantic == "yes"] <- "With a romantic relationship"
DATA$romantic[DATA$romantic == "no"] <- "Without a romantic relationship"

DATA$famrel[DATA$famrel == "1" | DATA$famrel == "2"] <- "Bad quality of family relationships"
DATA$famrel[DATA$famrel == "3"] <- "Neutral quality of family relationships"
DATA$famrel[DATA$famrel == "4" | DATA$famrel == "5"] <- "Good quality of family relationships"

DATA$freetime[DATA$freetime == "1" | DATA$freetime == "2"] <- "Little free time after school"
DATA$freetime[DATA$freetime == "3"] <- "Moderate amount of free time after school"
DATA$freetime[DATA$freetime == "4" | DATA$freetime == "5"] <- "A lot of free time after school"

DATA$goout[DATA$goout == "1" | DATA$goout == "2"] <- "Low frequency of going out with friends"
DATA$goout[DATA$goout == "3"] <- "Moderate frequency of going out with friends"
DATA$goout[DATA$goout == "4" | DATA$goout == "5"] <- "High frequency of going out with friends"

DATA$Dalc[DATA$Dalc == "1" | DATA$Dalc == "2"] <- "Low workday alcohol consumption"
DATA$Dalc[DATA$Dalc == "3"] <- "Moderate workday alcohol consumption"
DATA$Dalc[DATA$Dalc == "4" | DATA$Dalc == "5"] <- "High workday alcohol consumption"

DATA$Walc[DATA$Walc == "1" | DATA$Walc == "2"] <- "Low weekend alcohol consumption"
DATA$Walc[DATA$Walc == "3"] <- "Moderate weekend alcohol consumption"
DATA$Walc[DATA$Walc == "4" | DATA$Walc == "5"] <- "High weekend alcohol consumption"

DATA$health[DATA$health == "1" | DATA$health == "2"] <- "Poor health status"
DATA$health[DATA$health == "3"] <- "Moderate health status"
DATA$health[DATA$health == "4" | DATA$health == "5"] <- "Good health status"

# fewer than 15 absences: occasional; 15 to 34: frequent; 35 or more: very frequent
DATA$absences <- ifelse(DATA$absences < 15, "Occasional absences",
                 ifelse(DATA$absences < 35, "Frequent absences", "Very frequent absences"))

vars <- c("G1", "G2", "G3")

for(var in vars) {
  DATA[[var]] <- cut(DATA[[var]], breaks = c(0, 10, 16, 20), labels = c("Poor", "Good", "Very good"))
}

DATA$G1 <- paste(DATA$G1, "first period grade")
DATA$G2 <- paste(DATA$G2, "second period grade")
DATA$G3 <- paste(DATA$G3, "final grade")
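
The grade binning relies on how cut() treats interval boundaries, which is easy to get wrong. The sketch below uses a hypothetical vector of grades (not the survey data) with the same breaks and labels as above. It shows that the default intervals are closed on the right, so a grade of exactly 0 falls outside the first interval, and that include.lowest = TRUE is one way to include it.

```r
# Hypothetical grades on the 0-20 scale, chosen to hit every boundary
grades <- c(0, 10, 11, 16, 17, 20)
breaks <- c(0, 10, 16, 20)
labels <- c("Poor", "Good", "Very good")

# Default: intervals are (0,10], (10,16], (16,20], so 0 is not covered
binned_default <- cut(grades, breaks = breaks, labels = labels)
print(data.frame(grade = grades, bin = binned_default))

# include.lowest = TRUE closes the first interval on the left: [0,10]
binned_closed <- cut(grades, breaks = breaks, labels = labels,
                     include.lowest = TRUE)
print(data.frame(grade = grades, bin = binned_closed))
```

If grades of exactly 0 occur in the data, the default call labels them NA, and the paste() step then produces items such as "NA final grade". Whether that is intended is not stated in this report, so it may be worth checking before interpreting the grade frequencies.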

Market Basket Analysis

With manually categorised data

Market basket analysis is a data mining technique for examining patterns within a data set. It is used to reveal groups of values that are likely to occur together.

Statistics available to examine the association rules:

  • support - frequency of a combination of features in the data set, i.e. the share of transactions in which the features occur together;
  • confidence - estimated conditional probability that a transaction contains the right-hand-side feature, given that it contains the left-hand-side features; the maximum value is 1, and the higher the confidence, the stronger the rule;
  • coverage - support of the left-hand side of the rule;
  • lift - how much more often the features occur together than would be expected if they were unrelated;
  • count - number of transactions in which the features of the rule occur together.
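
As a worked illustration of these measures, the base R sketch below computes them by hand for a single rule on a toy data set. The ten students and the two feature names are hypothetical and serve only to make the arithmetic visible; they are not taken from the survey.

```r
# Toy data: 10 hypothetical students, one LHS feature and one RHS feature
goes_out   <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0) == 1   # left-hand side
poor_grade <- c(1, 1, 1, 0, 0, 0, 1, 0, 0, 0) == 1   # right-hand side

n          <- length(goes_out)                 # number of transactions
count      <- sum(goes_out & poor_grade)       # transactions with both features
support    <- count / n                        # share with both features
coverage   <- sum(goes_out) / n                # support of the LHS alone
confidence <- support / coverage               # P(RHS | LHS)
lift       <- confidence / (sum(poor_grade) / n)  # confidence relative to P(RHS)

print(c(count = count, support = support, coverage = coverage,
        confidence = confidence, lift = lift))
```

Here 3 of the 6 students who go out also have a poor grade, so the confidence is 0.5, while 4 of all 10 students have a poor grade, which gives a lift of 0.5 / 0.4 = 1.25.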

# dropping the columns that are not used as items and saving the result to a CSV file
data.sel<-DATA[,-c(6,7,10, 11, 30, 31)]
write.csv(data.sel, file="Students_selected.csv")


trans1<-read.transactions("Students_selected.csv", format="basket", sep=",", skip=0) # reading the file as transactions
trans1
## transactions in sparse format with
##  396 transactions (rows) and
##  496 items (columns)
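
The item count above is large for a table with only a few dozen distinct labels. One plausible contributor, stated here as an assumption rather than a finding, is that write.csv() by default writes a header line and a column of row names, and a basket-format read with skip=0 treats every field on every line as an item. The base R sketch below uses a hypothetical two-row data frame to show what ends up in the file in each case.

```r
# Toy stand-in for data.sel (hypothetical values, not the survey data)
toy <- data.frame(
  sex  = c("Female", "Male"),
  paid = c("Extra paid classes", "No extra paid classes"),
  stringsAsFactors = FALSE
)

# Default write.csv: a header line plus a leading column of row names
f_default <- tempfile(fileext = ".csv")
write.csv(toy, file = f_default)
lines_default <- readLines(f_default)
print(lines_default)

# row.names = FALSE drops the row-name column; the header line is still written
f_clean <- tempfile(fileext = ".csv")
write.csv(toy, file = f_clean, row.names = FALSE)
lines_clean <- readLines(f_clean)
print(lines_clean)
```

With the default call, each data line starts with a unique row number and the first line holds the column names. Setting skip=1 in read.transactions would skip that first line; doing so here would change the counts reported above, so the original call is left as it is.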

From these transactions we keep only items (feature values) that occur frequently in the data set, here in more than 5% of transactions. We can then display the names of the features and their frequencies.

#excluding rare observations
trans1<-trans1[, itemFrequency(trans1)>0.05]

#providing a table with observations and their frequency
#displaying only features with frequency > 25%
values <- sort(itemFrequency(trans1, type="relative"))
df <- data.frame(Feature = names(values), Frequency = values)
row.names(df) <- NULL
df_filtered <- df[df$Frequency > 0.25, ]
kable(df_filtered, format = "markdown", col.names = c("Feature", "Frequency")) %>%
  kable_styling() %>%
  row_spec(which(df_filtered$Feature %in% c("Poor final grade", "Good final grade")), bold = TRUE)

|    | Feature | Frequency |
|---:|:--------|----------:|
| 32 | Mother’s job - services | 0.2601010 |
| 33 | < 2 hours weekly study time | 0.2651515 |
| 34 | 15 to 30 min. home to school travel | 0.2702020 |
| 35 | Father’s job - services | 0.2803030 |
| 36 | Family size less than or equal to 3 | 0.2878788 |
| 37 | Low frequency of going out with friends | 0.3181818 |
| 38 | Moderate frequency of going out with friends | 0.3282828 |
| 39 | With a romantic relationship | 0.3333333 |
| 40 | High frequency of going out with friends | 0.3510101 |
| 41 | Mother’s job - other | 0.3535354 |
| 42 | Poor final grade | 0.3737374 |
| 43 | No family educational support | 0.3863636 |
| 44 | A lot of free time after school | 0.3914141 |
| 45 | Moderate amount of free time after school | 0.3964646 |
| 46 | Extra paid classes | 0.4570707 |
| 47 | Good final grade | 0.4671717 |
| 48 | Male | 0.4722222 |
| 49 | No extra-curricular activities | 0.4898990 |
| 50 | 2 to 5 hours weekly study time | 0.5000000 |
| 51 | Extra-curricular activities | 0.5075758 |
| 52 | Female | 0.5252525 |
| 53 | Good health status | 0.5353535 |
| 54 | No extra paid classes | 0.5404040 |
| 55 | Father’s job - other | 0.5479798 |
| 56 | Low weekend alcohol consumption | 0.5959596 |
| 57 | Family educational support | 0.6111111 |
| 58 | < 15 min. home to school travel | 0.6489899 |
| 59 | Without a romantic relationship | 0.6641414 |
| 60 | Family size greater than 3 | 0.7095960 |
| 61 | Good quality of family relationships | 0.7601010 |
| 62 | Urban | 0.7752525 |
| 63 | 0 past class failures | 0.7878788 |
| 64 | Attended nursery school | 0.7929293 |
| 65 | Has internet access at home | 0.8055556 |
| 66 | No extra educational support | 0.8686869 |
| 67 | Low workday alcohol consumption | 0.8863636 |
| 68 | Parents living together | 0.8939394 |
| 69 | Occasional absences | 0.9065657 |
| 70 | Wants to take higher education | 0.9469697 |

Poor final grade

We want to examine the factors that influence students’ academic outcomes. This can be done by finding recurring profiles of students who meet the condition being tested. First, we focus on students with the worst final grades. From the table above we can see that students with a poor final grade constitute 37% of all students.

To obtain rules and patterns about the students in our data set, we use the apriori algorithm, which builds frequent sets of features and derives rules from them.

We leave the left-hand side of the rule at its default, so that the apriori algorithm extracts the patterns associated with the tested effect. The right-hand side defines the consequence, that is, the effect we want to analyze; in this case we set the RHS to “Poor final grade”.

We set the minimum support to 0.1 and the minimum confidence to 0.5.
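
To make the two thresholds concrete, the base R sketch below applies them by brute force to rules with a single feature on the left-hand side. It is not the apriori algorithm itself, which grows and prunes candidate item sets level by level; it only illustrates the filtering step. The table and its column names are hypothetical.

```r
# Toy one-hot table: rows are hypothetical students, columns are features
toy <- data.frame(
  goes_out_often = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  studies_little = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  has_internet   = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE),
  poor_grade     = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

rhs <- "poor_grade"
lhs_candidates <- setdiff(names(toy), rhs)

# Support and confidence of every rule {feature} => {poor_grade}
candidates <- do.call(rbind, lapply(lhs_candidates, function(f) {
  both <- mean(toy[[f]] & toy[[rhs]])   # support of LHS and RHS together
  data.frame(lhs = f, support = both, confidence = both / mean(toy[[f]]),
             stringsAsFactors = FALSE)
}))

# Keep only the rules that pass both thresholds (same values as in the text)
min_supp <- 0.1
min_conf <- 0.5
kept <- candidates[candidates$support >= min_supp &
                   candidates$confidence >= min_conf, ]
print(kept)
```

In this toy table the rule with has_internet on the left-hand side passes the support threshold but is dropped because its confidence (3 of 7) is below 0.5.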

rules<-apriori(data=trans1, parameter=list(supp=0.1, conf=0.5), appearance=list(default="lhs", rhs="Poor final grade"), control=list(verbose=F)) 
summary(rules)
## set of 10 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 
## 1 2 3 4 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.25    4.00    4.00    5.00    5.00 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1010   Min.   :0.5000   Min.   :0.1970   Min.   :1.338  
##  1st Qu.:0.1016   1st Qu.:0.5062   1st Qu.:0.1995   1st Qu.:1.354  
##  Median :0.1048   Median :0.5115   Median :0.2033   Median :1.369  
##  Mean   :0.1053   Mean   :0.5117   Mean   :0.2058   Mean   :1.369  
##  3rd Qu.:0.1061   3rd Qu.:0.5171   3rd Qu.:0.2102   3rd Qu.:1.384  
##  Max.   :0.1162   Max.   :0.5316   Max.   :0.2273   Max.   :1.423  
##      count      
##  Min.   :40.00  
##  1st Qu.:40.25  
##  Median :41.50  
##  Mean   :41.70  
##  3rd Qu.:42.00  
##  Max.   :46.00  
## 
## mining info:
##    data ntransactions support confidence
##  trans1           396     0.1        0.5
##                                                                                                                                                         call
##  apriori(data = trans1, parameter = list(supp = 0.1, conf = 0.5), appearance = list(default = "lhs", rhs = "Poor final grade"), control = list(verbose = F))

rules.byconf<-sort(rules, by="confidence", decreasing=TRUE)

Only 10 rules for students with a poor final grade meet our conditions. Even with so few rules, it is better to exclude insignificant rules (tested with Fisher’s exact test) and redundant rules (those for which a more general rule with the same or higher confidence exists), and to keep only maximal rules (those for which no rule with a proper superset of their items exists).

There are no insignificant rules, but two are redundant and one is also not maximal. This leaves 8 rules.
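
The significance check can be illustrated with base R’s fisher.test() on the 2x2 table behind a single rule. The sketch below is only an illustration of the test named above, not a reproduction of what is.significant() does internally, and the counts are hypothetical rather than taken from the survey.

```r
# Hypothetical 2x2 counts for one rule LHS => RHS
#                 RHS present  RHS absent
# LHS present          45          45
# LHS absent          100         200
tab <- matrix(c(45, 100, 45, 200), nrow = 2,
              dimnames = list(LHS = c("present", "absent"),
                              RHS = c("present", "absent")))

# One-sided test: is the RHS over-represented when the LHS holds?
test <- fisher.test(tab, alternative = "greater")
print(test$p.value)

# Confidence and lift computed from the same table, for comparison
confidence <- tab["present", "present"] / sum(tab["present", ])
lift <- confidence / (sum(tab[, "present"]) / sum(tab))
print(c(confidence = confidence, lift = lift))
```

A small p-value means that the co-occurrence of the left-hand side and the right-hand side is unlikely under independence, which is a different question from whether the confidence of the rule is high.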

rules.clean<-rules[!is.redundant(rules)]
rules.clean<-rules.clean[is.significant(rules.clean, trans1)]
rules.clean<-rules.clean[is.maximal(rules.clean)]

inspectDT(rules.clean)

The rule statistics are rather low. All rules have a similar support of approximately 0.1, meaning that each combination of features, together with a poor final grade, occurs in about 10% of all transactions. Confidence is also similar across the rules, at approximately 0.5, so the rules are not very strong: among students who have the features listed on the left-hand side, about 50% have a poor final grade. Lift is greater than 1 in all rules, which indicates that these features occur together with a poor final grade about 1.3 to 1.4 times more often than would be expected if they were unrelated.
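
The reported figures can be cross-checked with a few lines of arithmetic. Because every rule here has the same right-hand side, lift is simply confidence divided by the frequency of “Poor final grade”, and count is support multiplied by the number of transactions. The sketch below plugs in values copied from the summary output and the frequency table above; the variable names are illustrative.

```r
n_transactions <- 396          # from the transactions summary
rhs_support    <- 0.3737374    # frequency of "Poor final grade" in the table

# Lift = confidence / support(RHS), for the reported min and max confidence
conf_range <- c(min = 0.5000, max = 0.5316)
lift_range <- conf_range / rhs_support
print(round(lift_range, 2))

# Count = support * number of transactions, for the reported min and max support
supp_range  <- c(min = 0.1010, max = 0.1162)
count_range <- round(supp_range * n_transactions)
print(count_range)
```

The results agree with the summary up to rounding of the printed inputs: lift between roughly 1.34 and 1.42, and counts of 40 and 46.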

Interestingly, only one rule has a single feature on the left-hand side (high weekend alcohol consumption), and its support does not differ from that of the others. Intuitively, such a general rule should occur more often than the longer ones. It suggests that high weekend alcohol consumption is associated with a poor final grade.

We can conclude that students with poor final grades frequently go out with friends and study rather little, only 2 to 5 hours a week. These features plausibly explain poor academic outcomes. There are also groups of students with poor grades who have good health, receive family educational support, have good family relationships and want to pursue higher education. Intuitively, these factors should be conducive to good final grades, so such results may suggest that they are not that significant for students’ academic performance. Additionally, female students are more likely to be found among students with poor grades. This is also counter-intuitive, because other studies show that girls usually have better grades. The Portuguese case may therefore be worth examining more closely, drawing for example on PISA surveys.

plot(rules.clean, method="graph", engine="htmlwidget")

The graph above visualizes the revealed rules. High frequency of going out with friends, 2 to 5 hours of studying per week, wanting to pursue higher education and good quality of family relationships are the features around which three different groups of patterns form. Rules 1, 7, 6 and 5 have the highest values of the association statistics, so they can be considered the most accurate.

Good final grade

Next, let’s look at the features of students who have good final grades. We conduct the same analysis, this time with “Good final grade” as the consequence on the right-hand side.

rules<-apriori(data=trans1, parameter=list(supp=0.1, conf=0.5), appearance=list(default="lhs", rhs="Good final grade"), control=list(verbose=F)) 

rules.byconf<-sort(rules, by="confidence", decreasing=TRUE)

rules.clean<-rules[!is.redundant(rules)]
rules.clean<-rules.clean[is.significant(rules.clean, trans1)] 
rules.clean<-rules.clean[is.maximal(rules.clean)]

summary(rules.clean)
## set of 407 rules
## 
## rule length distribution (lhs + rhs):sizes
##   4   5   6   7   8   9  10 
##   5  29  79 122  82  69  21 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   6.000   7.000   7.322   8.000  10.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1010   Min.   :0.5688   Min.   :0.1389   Min.   :1.218  
##  1st Qu.:0.1010   1st Qu.:0.6111   1st Qu.:0.1591   1st Qu.:1.308  
##  Median :0.1035   Median :0.6269   Median :0.1667   Median :1.342  
##  Mean   :0.1094   Mean   :0.6339   Mean   :0.1735   Mean   :1.357  
##  3rd Qu.:0.1111   3rd Qu.:0.6557   3rd Qu.:0.1742   3rd Qu.:1.404  
##  Max.   :0.1919   Max.   :0.7273   Max.   :0.3258   Max.   :1.557  
##      count      
##  Min.   :40.00  
##  1st Qu.:40.00  
##  Median :41.00  
##  Mean   :43.34  
##  3rd Qu.:44.00  
##  Max.   :76.00  
## 
## mining info:
##    data ntransactions support confidence
##  trans1           396     0.1        0.5
##                                                                                                                                                         call
##  apriori(data = trans1, parameter = list(supp = 0.1, conf = 0.5), appearance = list(default = "lhs", rhs = "Good final grade"), control = list(verbose = F))

There are many more rules this time: after excluding the unwanted ones, we still have 407 rules.

inspectDT(rules.clean)

This time the confidence is much higher: there are 13 rules with confidence above 70%. In the interactive table, we can sort the rules by confidence in descending order to examine those with the highest values.

Students with good grades usually have no past class failures and only occasional school absences. They rarely go out with friends or drink alcohol. Such students more often live in urban areas with a short trip from home to school. Surprisingly, students with good academic performance have no extra educational support. They usually have good family relationships, their parents live together and their family size is usually greater than 3.

plot(rules.clean, method="graph", engine="htmlwidget")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).

In the middle of this graph, we can see the most important features for the revealed rules. Students with good grades usually have no past class failures, have internet access at home, want to pursue higher education, have only occasional school absences and have good family relationships. Interestingly, male students tend to have good grades more often. Living in an urban area and close to school may also have a positive impact on academic performance.

Conclusions

Market Basket Analysis makes it easy to identify patterns among groups characterized by the attribute under study. These patterns can be measured with statistics such as support, confidence, coverage and lift. This study focused on students’ academic performance and on finding factors associated with poor and good final grades.

From the analysis we can conclude that students with poor grades often socialize and do not study much, only 2 to 5 hours weekly. Factors such as good health or family support also appear among students with poor grades, so they do not seem to be decisive for academic performance.

Key features in the rules explaining good final grades include no past failures, home internet access, aspirations for higher education, occasional absences and good family relationships. Urban living and proximity to school also seem to influence academic performance.

Surprisingly, girls were more likely to appear among students with worse grades, whereas boys were more likely to appear among students with good grades. It is worth noting that the survey was conducted among students in Portuguese schools, so the conclusions apply primarily to the situation of Portuguese students.