This study aims to contribute to understanding the interplay between students’ academic outcomes, their lifestyles, familial influences, and other social factors. Apart from natural differences in learning abilities, we know that success in school can depend on many other factors, such as parental support, extra-curricular activities, time devoted to study, leisure time and how that time is spent.
The goal of this work is to uncover patterns and dependencies that contribute to students’ final grades, using association rules. This method allows us to study the relationships between many variables within a data set. Rules can be assessed with indicators such as support, confidence and lift.
The dataset used in this study comes from Kaggle. The data were obtained in a survey of students in math and Portuguese language courses in secondary school.
Analysis through Market Basket Analysis can serve to draw conclusions about students in Portuguese schools or about the Portuguese education system. More broadly, it can also give a general view about students and their academic performance.
For association rules, each variable needs descriptive labels. That is why I recoded the columns of this data set as follows:
Grades: 1-10 - “Poor”, 11-16 - “Good”, 17-20 - “Very good” (a grade of 0 falls outside the first interval of cut() as written and is left unlabelled)
library(arules)
library(arulesViz)
library(arulesCBA)
library(knitr)
library(kableExtra)
library(dplyr)
# reading the data and dropping the first two columns, which are not used
DATA <- read.csv('student-alcohol.csv', header=TRUE, sep=",")
DATA<-DATA[,c(-1,-2)]
DATA$sex[DATA$sex == "F"] <- "Female"
DATA$sex[DATA$sex == "M"] <- "Male"
DATA$address[DATA$address == "U"] <- "Urban"
DATA$address[DATA$address == "R"] <- "Rural"
DATA$famsize[DATA$famsize == "LE3"] <- "Family size less than or equal to 3"
DATA$famsize[DATA$famsize == "GT3"] <- "Family size greater than 3"
DATA$Pstatus[DATA$Pstatus == "T"] <- "Parents living together"
DATA$Pstatus[DATA$Pstatus == "A"] <- "Parents living apart"
DATA$Medu[DATA$Medu == 0] <- "Mother - no education"
DATA$Medu[DATA$Medu == 1 | DATA$Medu == 2] <- "Mother - primary education"
DATA$Medu[DATA$Medu == 3] <- "Mother - secondary education"
DATA$Medu[DATA$Medu == 4] <- "Mother - higher education"
DATA$Fedu[DATA$Fedu == 0] <- "Father - no education"
DATA$Fedu[DATA$Fedu == 1 | DATA$Fedu == 2] <- "Father - primary education"
DATA$Fedu[DATA$Fedu == 3] <- "Father - secondary education"
DATA$Fedu[DATA$Fedu == 4] <- "Father - higher education"
DATA$Mjob <- paste("Mother's job -", DATA$Mjob)
DATA$Fjob <- paste("Father's job -", DATA$Fjob)
DATA$traveltime[DATA$traveltime == 1] <- "< 15 min. home to school travel"
DATA$traveltime[DATA$traveltime == 2] <- "15 to 30 min. home to school travel"
DATA$traveltime[DATA$traveltime == 3] <- "30 min. to 1 hour home to school travel"
DATA$traveltime[DATA$traveltime == 4] <- "> 1 hour home to school travel"
DATA$studytime[DATA$studytime == 1] <- "< 2 hours weekly study time"
DATA$studytime[DATA$studytime == 2] <- "2 to 5 hours weekly study time"
DATA$studytime[DATA$studytime == 3] <- "5 to 10 hours weekly study time"
DATA$studytime[DATA$studytime == 4] <- "> 10 hours weekly study time"
DATA$failures <- paste(DATA$failures, "past class failures")
DATA$schoolsup[DATA$schoolsup == "yes"] <- "Extra educational support"
DATA$schoolsup[DATA$schoolsup == "no"] <- "No extra educational support"
DATA$famsup[DATA$famsup == "yes"] <- "Family educational support"
DATA$famsup[DATA$famsup == "no"] <- "No family educational support"
DATA$paid[DATA$paid == "yes"] <- "Extra paid classes"
DATA$paid[DATA$paid == "no"] <- "No extra paid classes"
DATA$activities[DATA$activities == "yes"] <- "Extra-curricular activities"
DATA$activities[DATA$activities == "no"] <- "No extra-curricular activities"
DATA$nursery[DATA$nursery == "yes"] <- "Attended nursery school"
DATA$nursery[DATA$nursery == "no"] <- "Did not attend nursery school"
DATA$higher[DATA$higher == "yes"] <- "Wants to take higher education"
DATA$higher[DATA$higher == "no"] <- "Does not want to take higher education"
DATA$internet[DATA$internet == "yes"] <- "Has internet access at home"
DATA$internet[DATA$internet == "no"] <- "Does not have internet access at home"
DATA$romantic[DATA$romantic == "yes"] <- "With a romantic relationship"
DATA$romantic[DATA$romantic == "no"] <- "Without a romantic relationship"
DATA$famrel[DATA$famrel == "1" | DATA$famrel == "2"] <- "Bad quality of family relationships"
DATA$famrel[DATA$famrel == "3"] <- "Neutral quality of family relationships"
DATA$famrel[DATA$famrel == "4" | DATA$famrel == "5"] <- "Good quality of family relationships"
DATA$freetime[DATA$freetime == "1" | DATA$freetime == 2] <- "Little free time after school"
DATA$freetime[DATA$freetime == "3"] <- "Moderate amount of free time after school"
DATA$freetime[DATA$freetime == "4" | DATA$freetime == 5] <- "A lot of free time after school"
DATA$goout[DATA$goout == "1" | DATA$goout == 2] <- "Low frequency of going out with friends"
DATA$goout[DATA$goout == "3"] <- "Moderate frequency of going out with friends"
DATA$goout[DATA$goout == "4" | DATA$goout == 5] <- "High frequency of going out with friends"
DATA$Dalc[DATA$Dalc == "1" | DATA$Dalc== "2"] <- "Low workday alcohol consumption"
DATA$Dalc[DATA$Dalc == "3"] <- "Moderate workday alcohol consumption"
DATA$Dalc[DATA$Dalc == "4" | DATA$Dalc == "5"] <- "High workday alcohol consumption"
DATA$Walc[DATA$Walc == "1" | DATA$Walc== "2"] <- "Low weekend alcohol consumption"
DATA$Walc[DATA$Walc == "3"] <- "Moderate weekend alcohol consumption"
DATA$Walc[DATA$Walc == "4" | DATA$Walc == "5"] <- "High weekend alcohol consumption"
DATA$health[DATA$health == "1" | DATA$health== "2"] <- "Poor health status"
DATA$health[DATA$health == "3"] <- "Moderate health status"
DATA$health[DATA$health == "4" | DATA$health == "5"] <- "Good health status"
# binning absences; the last label covers every value of 35 or more
DATA$absences<-ifelse(DATA$absences<15, "Occasional absences", ifelse(DATA$absences<35, "Frequent absences", "Very frequent absences"))
vars <- c("G1", "G2", "G3")
for(var in vars) {
DATA[[var]] <- cut(DATA[[var]], breaks = c(0, 10, 16, 20), labels = c("Poor", "Good", "Very good"))
}
DATA$G1 <- paste(DATA$G1, "first period grade")
DATA$G2 <- paste(DATA$G2, "second period grade")
DATA$G3 <- paste(DATA$G3, "final grade")
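The grade binning above relies on cut(), whose intervals are open on the left by default. The sketch below uses a few hypothetical grades placed on and around the bin edges to show where each value lands, and how include.lowest = TRUE would bring a grade of 0 into the first bin. It is an illustration only and is not part of the analysis pipeline; the vector grades and the object names are assumptions.

```r
# Hypothetical grades chosen to sit on and around the bin edges.
grades <- c(0, 5, 10, 11, 16, 17, 20)
breaks <- c(0, 10, 16, 20)
labels <- c("Poor", "Good", "Very good")

# Default: right-closed intervals (0,10], (10,16], (16,20]; 0 is outside all of them.
default_bins <- cut(grades, breaks = breaks, labels = labels)

# include.lowest = TRUE closes the first interval on the left: [0,10].
closed_bins <- cut(grades, breaks = breaks, labels = labels, include.lowest = TRUE)

print(data.frame(grade = grades, default = default_bins, closed = closed_bins))
```

With the default settings a grade of 0 becomes NA, and a later paste() call turns that NA into the text "NA", so such students end up with a label like "NA final grade" instead of "Poor final grade".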
Market basket analysis is a data mining technique that makes it possible to examine patterns within a data set. It is used to reveal groups of values that are likely to occur together.
The statistics available to examine the association rules are support, confidence, coverage and lift.
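As a minimal sketch of how these statistics are computed, the example below evaluates a single rule by hand on a hypothetical table of eight students. The data frame toy and its two columns are invented for illustration and do not come from the survey.

```r
# Hypothetical 0/1 data: rows are students, columns are items.
toy <- data.frame(
  high_goout = c(1, 1, 1, 0, 0, 1, 0, 0),
  poor_grade = c(1, 1, 0, 0, 0, 1, 1, 0)
)

# Rule under study: {high_goout} => {poor_grade}
support_lhs  <- mean(toy$high_goout == 1)                        # share of rows with the LHS (coverage)
support_rhs  <- mean(toy$poor_grade == 1)                        # share of rows with the RHS
support_rule <- mean(toy$high_goout == 1 & toy$poor_grade == 1)  # share of rows with both

confidence <- support_rule / support_lhs  # how often the RHS holds when the LHS holds
lift       <- confidence / support_rhs    # confidence relative to the baseline RHS share

print(c(support = support_rule, coverage = support_lhs,
        confidence = confidence, lift = lift))
```

A lift above 1 means the left-hand side and the right-hand side occur together more often than they would if they were independent.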
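# dropping columns that are not used in the rules
data.sel<-DATA[,-c(6,7,10, 11, 30, 31)]
# note: write.csv() also writes a header line and a row-name column by default,
# and with skip=0 both are read back as part of the transactions
write.csv(data.sel, file="Students_selected.csv")
trans1<-read.transactions("Students_selected.csv", format="basket", sep=",", skip=0) # reading the file as transactions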
trans1
## transactions in sparse format with
## 396 transactions (rows) and
## 496 items (columns)
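A possible explanation for the counts printed above, offered as a hedged aside: by default write.csv() writes a header line and a leading column of row names, and read.transactions() with format="basket" and skip=0 treats every line, header included, as a transaction. The sketch below uses a hypothetical two-row data frame (toy) and temporary files to show what ends up in each file. write.table() is used for the second file because write.csv() does not allow the column names to be switched off.

```r
# Hypothetical two-row data frame standing in for data.sel.
toy <- data.frame(sex = c("Female", "Male"), address = c("Urban", "Rural"))

# Default write.csv(): a header line plus a leading row-name column.
with_extras <- tempfile(fileext = ".csv")
write.csv(toy, file = with_extras)
default_lines <- readLines(with_extras)

# write.table() with both switched off: one line per row, items only.
basket_only <- tempfile(fileext = ".csv")
write.table(toy, file = basket_only, sep = ",",
            row.names = FALSE, col.names = FALSE)
basket_lines <- readLines(basket_only)

print(default_lines)  # three lines: header, then two rows that start with a row name
print(basket_lines)   # two lines: only the item labels
```

If the file is written the second way, every line corresponds to exactly one student and no row numbers or column names can appear as items.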
From these transactions we want to keep only items that occur frequently in our data set. We can display the names of the features and their frequencies.
#excluding rare observations
trans1<-trans1[, itemFrequency(trans1)>0.05]
#providing a table with observations and their frequency
#displaying only features with frequency > 25%
values <- sort(itemFrequency(trans1, type="relative"))
df <- data.frame(Feature = names(values), Frequency = values)
row.names(df) <- NULL
df_filtered <- df[df$Frequency > 0.25, ]
kable(df_filtered, format = "markdown", col.names = c("Feature", "Frequency")) %>%
kable_styling() %>%
row_spec(which(df_filtered$Feature %in% c("Poor final grade", "Good final grade")), bold = TRUE)
| | Feature | Frequency |
|---|---|---|
| 32 | Mother’s job - services | 0.2601010 |
| 33 | < 2 hours weekly study time | 0.2651515 |
| 34 | 15 to 30 min. home to school travel | 0.2702020 |
| 35 | Father’s job - services | 0.2803030 |
| 36 | Family size less than or equal to 3 | 0.2878788 |
| 37 | Low frequency of going out with friends | 0.3181818 |
| 38 | Moderate frequency of going out with friends | 0.3282828 |
| 39 | With a romantic relationship | 0.3333333 |
| 40 | High frequency of going out with friends | 0.3510101 |
| 41 | Mother’s job - other | 0.3535354 |
| 42 | Poor final grade | 0.3737374 |
| 43 | No family educational support | 0.3863636 |
| 44 | A lot of free time after school | 0.3914141 |
| 45 | Moderate amount of free time after school | 0.3964646 |
| 46 | Extra paid classes | 0.4570707 |
| 47 | Good final grade | 0.4671717 |
| 48 | Male | 0.4722222 |
| 49 | No extra-curricular activities | 0.4898990 |
| 50 | 2 to 5 hours weekly study time | 0.5000000 |
| 51 | Extra-curricular activities | 0.5075758 |
| 52 | Female | 0.5252525 |
| 53 | Good health status | 0.5353535 |
| 54 | No extra paid classes | 0.5404040 |
| 55 | Father’s job - other | 0.5479798 |
| 56 | Low weekend alcohol consumption | 0.5959596 |
| 57 | Family educational support | 0.6111111 |
| 58 | < 15 min. home to school travel | 0.6489899 |
| 59 | Without a romantic relationship | 0.6641414 |
| 60 | Family size greater than 3 | 0.7095960 |
| 61 | Good quality of family relationships | 0.7601010 |
| 62 | Urban | 0.7752525 |
| 63 | 0 past class failures | 0.7878788 |
| 64 | Attended nursery school | 0.7929293 |
| 65 | Has internet access at home | 0.8055556 |
| 66 | No extra educational support | 0.8686869 |
| 67 | Low workday alcohol consumption | 0.8863636 |
| 68 | Parents living together | 0.8939394 |
| 69 | Occasional absences | 0.9065657 |
| 70 | Wants to take higher education | 0.9469697 |
In our data set we want to examine factors that influence students’ academic outcomes. This can be done by finding recurring profiles of students who meet the conditions being tested. First, we focus on students who have the worst final grades. From the table above we can see that students with a poor final grade constitute 37% of all students.
To obtain rules and patterns about students in our data set, we can use the apriori algorithm, which builds frequent itemsets from the features and derives rules from them.
We leave the left-hand side (LHS) of the rule at its default, so that the apriori algorithm extracts the patterns that lead to the tested outcome. The right-hand side (RHS) of the rule defines the consequence, the effect we want to analyze; in this case we set the RHS to “Poor final grade”.
We set the minimum support to 0.1 and the minimum confidence to 0.5.
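To make the support threshold more tangible, it can be translated into a minimum number of transactions. The short calculation below is a sketch that assumes the 396 transactions reported when the transactions object was printed; the variable names are illustrative.

```r
n_transactions <- 396  # taken from the printed transactions object
min_support    <- 0.1

# Smallest whole number of transactions that reaches the support threshold.
min_count <- ceiling(min_support * n_transactions)

# Support actually achieved by a rule that occurs exactly min_count times.
achieved_support <- min_count / n_transactions

print(c(min_count = min_count, achieved_support = round(achieved_support, 4)))
```

In other words, a rule has to describe at least 40 students to pass the support filter, which is consistent with the minimum count reported by summary(rules).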
rules<-apriori(data=trans1, parameter=list(supp=0.1, conf=0.5), appearance=list(default="lhs", rhs="Poor final grade"), control=list(verbose=F))
summary(rules)
## set of 10 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 1 2 3 4
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 3.25 4.00 4.00 5.00 5.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1010 Min. :0.5000 Min. :0.1970 Min. :1.338
## 1st Qu.:0.1016 1st Qu.:0.5062 1st Qu.:0.1995 1st Qu.:1.354
## Median :0.1048 Median :0.5115 Median :0.2033 Median :1.369
## Mean :0.1053 Mean :0.5117 Mean :0.2058 Mean :1.369
## 3rd Qu.:0.1061 3rd Qu.:0.5171 3rd Qu.:0.2102 3rd Qu.:1.384
## Max. :0.1162 Max. :0.5316 Max. :0.2273 Max. :1.423
## count
## Min. :40.00
## 1st Qu.:40.25
## Median :41.50
## Mean :41.70
## 3rd Qu.:42.00
## Max. :46.00
##
## mining info:
## data ntransactions support confidence
## trans1 396 0.1 0.5
## call
## apriori(data = trans1, parameter = list(supp = 0.1, conf = 0.5), appearance = list(default = "lhs", rhs = "Poor final grade"), control = list(verbose = F))
rules.byconf<-sort(rules, by="confidence", decreasing=TRUE)
Only 10 rules for students with a poor final grade meet our conditions. Even though the number of rules is small, it is better to exclude insignificant rules (tested with Fisher’s exact test) and redundant rules (those for which a more general rule with the same or higher confidence exists), and to keep only maximal rules (those whose itemset is not contained in the itemset of another rule).
There are no insignificant rules, but two are redundant and one of them is also not maximal. We are left with only 8 rules.
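The idea behind the redundancy check can be sketched in base R. The example below is an illustration under stated assumptions rather than a reproduction of how is.redundant() is implemented: the four rules, their items (A, B, C) and their confidence values are hypothetical, and all rules are assumed to share the same right-hand side.

```r
# Hypothetical rules sharing one RHS; each LHS is a character vector of items.
lhs <- list(
  c("A"),
  c("A", "B"),
  c("A", "C"),
  c("B", "C")
)
confidence <- c(0.60, 0.55, 0.70, 0.52)

# TRUE when x is a proper subset of y.
is_proper_subset <- function(x, y) all(x %in% y) && length(x) < length(y)

# A rule is flagged when a more general rule (proper subset LHS)
# has a confidence that is at least as high.
is_redundant <- vapply(seq_along(lhs), function(i) {
  any(vapply(seq_along(lhs), function(j) {
    j != i &&
      is_proper_subset(lhs[[j]], lhs[[i]]) &&
      confidence[j] >= confidence[i]
  }, logical(1)))
}, logical(1))

print(data.frame(lhs = vapply(lhs, paste, character(1), collapse = ", "),
                 confidence = confidence, redundant = is_redundant))
```

Here only the rule with left-hand side A, B is flagged, because the shorter rule with left-hand side A already reaches a higher confidence; adding B brings no improvement.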
rules.clean<-rules[!is.redundant(rules)]
rules.clean<-rules.clean[is.significant(rules.clean, trans1)]
rules.clean<-rules.clean[is.maximal(rules.clean)]
inspectDT(rules.clean)
The association statistics are rather low. All rules have a similar support of approximately 0.1, which means that the features listed in each rule occur together with a poor final grade in about 10% of all transactions. Confidence values are also similar across the rules, at approximately 0.5, so the rules are not very strong: a student with the listed combination of features has a poor final grade in about 50% of cases. Lift is greater than 1 in all rules, which indicates that these features appear together with a poor final grade more often than would be expected if they were independent, approximately 1.3 to 1.4 times more often.
Interestingly, only one rule contains a single feature (high weekend alcohol consumption), yet its support does not differ from the others, although intuitively a shorter rule should occur more often than the rest. This suggests that high weekend alcohol consumption is associated with having a poor final grade.
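Because every rule here has the same right-hand side, lift is simply confidence divided by the overall share of students with a poor final grade. The check below is a sketch that plugs in the rounded values printed earlier (the item frequency of “Poor final grade” and the minimum and maximum confidence from summary(rules)); the variable names are illustrative.

```r
rhs_support <- 0.3737374  # frequency of "Poor final grade" from the item table

# Minimum and maximum confidence as printed by summary(rules).
confidence_range <- c(min = 0.5000, max = 0.5316)

# For a fixed RHS, lift = confidence / support(RHS).
lift_range <- confidence_range / rhs_support

print(round(lift_range, 3))
```

The results land very close to the minimum and maximum lift in the summary (1.338 and 1.423); any small gap comes from using confidence values that were already rounded to four decimals.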
We can conclude that students who have poor final grades frequently go out with friends and study relatively little, only 2 to 5 hours a week. These features seem reasonable explanations for students’ poor academic outcomes. We can see that there are also groups of students with poor grades who have good health status, receive family educational support, have good family relationships and want to take higher education in the future. Intuitively, these factors should be conducive to getting good final grades. Such results may therefore suggest that these factors are not that significant for students’ academic performance. Additionally, female students are more likely to be found among students with poor grades. This is also counter-intuitive, because other studies show that girls usually have better grades. The Portuguese case may therefore be worth exploring further, drawing for example on PISA surveys.
plot(rules.clean, method="graph", engine="htmlwidget")
The graph above visualizes the revealed rules. We can see that a high frequency of going out with friends, 2 to 5 hours of studying per week, wanting to take higher education and having good quality of family relationships are the features that form three different groups of patterns. Rules 1, 7, 6 and 5 have the highest values of the association statistics, so they can be considered the strongest.
This time let’s see features of students who have good final grades. We can conduct the same analysis, but this time on the right hand side we have another consequence - “Good final grade”.
rules<-apriori(data=trans1, parameter=list(supp=0.1, conf=0.5), appearance=list(default="lhs", rhs="Good final grade"), control=list(verbose=F))
rules.byconf<-sort(rules, by="confidence", decreasing=TRUE)
rules.clean<-rules[!is.redundant(rules)]
rules.clean<-rules.clean[is.significant(rules.clean, trans1)]
rules.clean<-rules.clean[is.maximal(rules.clean)]
summary(rules.clean)
## set of 407 rules
##
## rule length distribution (lhs + rhs):sizes
## 4 5 6 7 8 9 10
## 5 29 79 122 82 69 21
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 6.000 7.000 7.322 8.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1010 Min. :0.5688 Min. :0.1389 Min. :1.218
## 1st Qu.:0.1010 1st Qu.:0.6111 1st Qu.:0.1591 1st Qu.:1.308
## Median :0.1035 Median :0.6269 Median :0.1667 Median :1.342
## Mean :0.1094 Mean :0.6339 Mean :0.1735 Mean :1.357
## 3rd Qu.:0.1111 3rd Qu.:0.6557 3rd Qu.:0.1742 3rd Qu.:1.404
## Max. :0.1919 Max. :0.7273 Max. :0.3258 Max. :1.557
## count
## Min. :40.00
## 1st Qu.:40.00
## Median :41.00
## Mean :43.34
## 3rd Qu.:44.00
## Max. :76.00
##
## mining info:
## data ntransactions support confidence
## trans1 396 0.1 0.5
## call
## apriori(data = trans1, parameter = list(supp = 0.1, conf = 0.5), appearance = list(default = "lhs", rhs = "Good final grade"), control = list(verbose = F))
There are many more rules: after excluding unwanted ones, we still have 407 rules.
inspectDT(rules.clean)
This time the confidence is much higher: there are 13 rules with confidence above 70%. In the interactive table provided, we can sort the rules by confidence in descending order to examine the rules with the highest confidence.
Students with good grades usually have no past class failures and only occasional school absences. They rarely go out with friends or drink alcohol. Such students more often live in urban areas with a short journey from home to school. Surprisingly, students with good academic performance have no extra educational support. They usually have good family relationships, their parents live together and their family size is usually greater than 3.
plot(rules.clean, method="graph", engine="htmlwidget")
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).
In the middle of this graph, we can see the most important features for the revealed rules. Students with good grades usually have 0 past class failures and internet access at home, want to take higher education, and have occasional school absences and good quality of family relationships. Interestingly, male students tend to have good grades more often. Living in urban areas and close to school may also have a positive impact on academic performance.
Market Basket Analysis makes it easy to find patterns among groups characterized by the attribute under study. To measure those patterns, we can analyze statistics such as support, confidence, coverage and lift. This study focused on students’ academic performance and on finding factors that can affect the achievement of poor and good final grades.
From the analysis we can conclude that students with poor grades often socialize and do not study much, only 2 to 5 hours weekly. Factors like good health or family support also appear among students with poor grades, so they do not seem to be decisive for grades.
Key features for rules explaining having good final grades include no past failures, home internet access, aspirations for higher education, occasional absences, and good family relationships. Urban living and proximity to school also seem to influence academic performance.
Surprisingly, girls were more likely to appear among students with worse grades, whereas boys were more likely to appear among students with good grades. It is worth noting that the survey was conducted on students in Portuguese schools, so the conclusions drawn may serve to analyze the situation of Portuguese students.