Association rules using Apriori algorithm

The goal of this project is to analyze survey dataset using the Apriori algorithm to obtain association rules, which are patterns of behavior observed in the survey participants.

Young people, due to their still developing minds and worldviews, are very vulnerable to pathological behaviors. Smoking, alcohol use, drug use, and violence can leave a significant mark on them, which may affect their lives in the future. It is very important to understand why young people exhibit undesirable behaviors and what their consequences are. In this way, social programs can more effectively help them, minimizing the risk of problems in adult life.

This project uses data from the ‘Global School-Based Student Health Survey’ conducted by WHO, specifically the 2018 data from Argentina. It is an interesting country to observe due to the many social and economic issues it faces in the 21st century.

website: https://www.who.int/teams/noncommunicable-diseases/surveillance/systems-tools/global-school-based-student-health-survey

Opis obrazka source: https://news.iu.edu/live/news/28031-cigarette-smoking-more-prevalent-harder-to-quit

#Loading packages
library(arules)
library(corrplot)
library(arulesViz)
library(knitr)
library(kableExtra)

Loading data, survey contains a large number of variables, but many of them are highly correlated or empty. It is necessary to limit the dataset to only the variables suitable for analysis.

data <- read.csv("C:/Users/User/OneDrive/Pulpit/Studia magisterskie/USL/Project part 3/National.csv",  sep=";")
data <- data[, 3:57]
any(is.na(data))
## [1] TRUE
corrplot(cor(data, use = "complete.obs", method = "pearson"), method = "color")

Due to the complexity of the data, variables were manually selected based on the codebook provided on the WHO website, now correlations look much better.

data_1 <- data[, (names(data) %in% c("q1","q2","q3","q6","q10","q16","q17","q22","q23","q24","q27","q28","q29","q30","q33","q35","q39",
                                     "q41","q43","q44","q49","q51","q53","q56","q57","q58","q66","q67","q68"))]
data_1 <- na.omit(data_1)
corrplot(cor(data_1, use = "complete.obs", method = "pearson"), method = "color")

head(data_1)
##    q1 q2 q3 q6 q10 q16 q17 q22 q23 q24 q27 q28 q29 q30 q33 q35 q39 q41 q43 q44
## 7   4  2  4  3   6   1   1   3   3   2   4   1   1   1   1   1   1   1   1   2
## 8   2  2  4  1   1   1   1   2   2   2   4   1   1   1   1   1   1   1   1   2
## 10  4  2  2  2   1   1   1   5   1   1   1   1   1   1   1   1   1   1   1   2
## 11  5  1  2  1   5   2   2   1   4   2   4   3   3   1   1   2   1   1   1   1
## 12  3  2  4  1   2   2   3   3   5   1   4   1   1   1   1   2   1   1   1   2
## 13  3  2  2  1   1   1   1   3   1   2   4   1   1   1   1   1   1   1   1   2
##    q49 q51 q53 q56 q57 q58 q66 q67 q68
## 7    6   6   1   5   5   2   2   2   2
## 8    3   5   1   1   5   1   2   2   2
## 10   2   3   1   1   5   1   1   2   2
## 11   3   6   3   5   1   3   2   2   2
## 12   1   1   1   5   5   1   1   1   1
## 13   1   3   1   1   5   1   2   2   2

As we can see, Variables have a numerical form with many NA values. In order to apply the association rules algorithm, they need to be in binary form, with values of 0 or 1. This is complicated in this case because many variables have different ranges, some range from 0 to 4, others from 0 to 9, and so on. This situation significantly complicates the process of converting the variables to binary. What is more, many of the questions are phrased in such a way that it is easy to misinterpret the results

In order to best interpret the values for each question, it was decided to manually convert the values to binary in most cases, so that they could be used in the model while maintaining their real meaning. During this process, many questions should have been phrased differently, which is why I created dataframe named “Dictionary”, listing all the questions to help interpret the results.

data_1$q2 <- ifelse(data_1$q2 == 1, 1, 0) #is_male 1-male, 0 female
data_1$q6 <- ifelse(data_1$q6 == 1, 0, 1)
data_1$q10 <- ifelse(data_1$q10 == 1, 0, 1)
data_1$q16 <- ifelse(data_1$q16 == 1, 0, 1)
data_1$q17 <- ifelse(data_1$q17 == 1, 0, 1)
data_1$q24 <- ifelse(data_1$q24 == 1, 1, 0)
data_1$q27 <- ifelse(data_1$q27 == 4, 1, 0) #Do you have many friends?
data_1$q28 <- ifelse(data_1$q28 == 1, 0, 1) #Have you ever smoked ciggarets,
data_1$q29 <- ifelse(data_1$q29 == 1, 0, 1) #Have you smoked ciggarets during last 30 days
data_1$q30 <- ifelse(data_1$q30 == 1, 0, 1) #Have you used any other form of tabacco during last 30 days
data_1$q33 <- ifelse(data_1$q33 == 1, 0, 1) #Does you parents use any form of tabacco?
data_1$q35 <- ifelse(data_1$q35 == 1, 0, 1) #Did you have at least one alcohol drink during last 30 days?
data_1$q39 <- ifelse(data_1$q35 == 1, 0, 1) #Have you get into trouble because of alcohol?
data_1$q41 <- ifelse(data_1$q41 == 1, 0, 1) #Have you used marijuana?
data_1$q43 <- ifelse(data_1$q43 == 1, 0, 1) #Have you used amphetamines or methamphetamines?
data_1$q44 <- ifelse(data_1$q44 == 1, 1, 0)
data_1$q53 <- ifelse(data_1$q53 == 1, 0, 1) #Have you skipped school during past 30 days?
data_1$q66 <- ifelse(data_1$q66 == 1, 1, 0)
data_1$q67 <- ifelse(data_1$q67 == 1, 1, 0)
data_1$q68 <- ifelse(data_1$q68 == 1, 1, 0)

The remaining variables were converted using the median. If the value is above the median in a given column, it receives a value of 0, if it is below the median, it is a 1.

cols_to_change=c("q1","q3","q22","q23","q49","q51","q56","q57","q58")
for (col in cols_to_change) {
  median_value <- median(data_1[[col]], na.rm = TRUE)
  data_1[[col]] <- ifelse(data_1[[col]] <= median_value, 0, 1)
} #if value is below median it gets 0, if higher than median 1
sapply(data_1, function(col) length(unique(col)) == 2) 
##   q1   q2   q3   q6  q10  q16  q17  q22  q23  q24  q27  q28  q29  q30  q33  q35 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 
##  q39  q41  q43  q44  q49  q51  q53  q56  q57  q58  q66  q67  q68 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Right now all variables are binary.

Creating dictionary to interprate varaibles

Explenation = c("How old are you?",
                "1-male, 0-female",
                "In what grade are you?",
                "During the past 30 days, how often did you go hungry because there was not enough food in your home?",
                "During the past 7 days, on how many days did you eat food from a fast food restaurant?",
                "During the past 12 months, have you been in a physical fight?",
                "During the past 12 months, have you been seriously injured?",
                "During the past 12 months, have you felt lonely?",
                "During the past 12 months, have you been so worried about something that you could not sleep at night?",
                "During the past 12 months, did you ever seriously consider attempting suicide?",
                "Do you have many friends?",
                "Have you ever smoked cigarettes?",
                "Have you smoked cigarettes during the last 30 days?",
                "Have you used any other form of tobacco during the last 30 days?",
                "Do your parents use any form of tobacco?",
                "Did you have at least one alcoholic drink during the last 30 days?",
                "Have you gotten into trouble because of alcohol?",
                "Have you ever used marijuana?",
                "Have you ever used amphetamines or methamphetamines?",
                "Have you ever had sexual intercourse?",
                "The last time you had sexual intercourse, did you or your partner use any other method of birth control, or any other method to prevent pregnancy?",
                "During this school year, on how many days did you go to physical education (PE) class each week?",
                "Have you skipped school during the past 30 days?",
                "During the past 30 days, how often did your parents or guardians understand your problems and worries?",
                "During the past 30 days, how often did your parents or guardians really know what you were doing with your free time?",
                "During the past 30 days, how often did your parents or guardians go through your things without your approval?",
                "During the past 12 months, have you ever been bullied on school property?",
                "During the past 12 months, have you ever been bullied when you were not on school property?",
                "During the past 12 months, have you ever been cyberbullied?")
Dictionary <- data.frame(
  variable = colnames(data_1),
  Explenation = Explenation
)
kable(Dictionary, caption = "Dictionary", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "bordered"))
Dictionary
variable Explenation
q1 How old are you?
q2 1-male, 0-female
q3 In what grade are you?
q6 During the past 30 days, how often did you go hungry because there was not enough food in your home?
q10 During the past 7 days, on how many days did you eat food from a fast food restaurant?
q16 During the past 12 months, have you been in a physical fight?
q17 During the past 12 months, have you been seriously injured?
q22 During the past 12 months, have you felt lonely?
q23 During the past 12 months, have you been so worried about something that you could not sleep at night?
q24 During the past 12 months, did you ever seriously consider attempting suicide?
q27 Do you have many friends?
q28 Have you ever smoked cigarettes?
q29 Have you smoked cigarettes during the last 30 days?
q30 Have you used any other form of tobacco during the last 30 days?
q33 Do your parents use any form of tobacco?
q35 Did you have at least one alcoholic drink during the last 30 days?
q39 Have you gotten into trouble because of alcohol?
q41 Have you ever used marijuana?
q43 Have you ever used amphetamines or methamphetamines?
q44 Have you ever had sexual intercourse?
q49 The last time you had sexual intercourse, did you or your partner use any other method of birth control, or any other method to prevent pregnancy?
q51 During this school year, on how many days did you go to physical education (PE) class each week?
q53 Have you skipped school during the past 30 days?
q56 During the past 30 days, how often did your parents or guardians understand your problems and worries?
q57 During the past 30 days, how often did your parents or guardians really know what you were doing with your free time?
q58 During the past 30 days, how often did your parents or guardians go through your things without your approval?
q66 During the past 12 months, have you ever been bullied on school property?
q67 During the past 12 months, have you ever been bullied when you were not on school property?
q68 During the past 12 months, have you ever been cyberbullied?

Now, data needs to be transformed into a format that can be used by the read.transactions function.

data_3=data_1
for (col in colnames(data_3)) {
  data_3[[col]] <- ifelse(data_3[[col]] == 1, col, '')
}

data_4 <- data.frame(SingleColumn = apply(data_3, 1, function(row) {
  non_empty_values <- row[!is.na(row) & row != ""]
  quoted_values <- paste0('"', non_empty_values, '"')
  paste(quoted_values, collapse = ", ")
}))
write.table(data_4, file = "C:/Users/User/OneDrive/Pulpit/Studia magisterskie/USL/Project part 3/National_1.csv", 
            row.names = FALSE, col.names = FALSE, sep = ",", quote=F)
trans <- read.transactions("C:/Users/User/OneDrive/Pulpit/Studia magisterskie/USL/Project part 3/National_1.csv",format="basket", sep = ",")
summary(trans)
## transactions as itemMatrix in sparse format with
##  38353 rows (elements/itemsets/transactions) and
##  29 columns (items) and a density of 0.3334199 
## 
## most frequent items:
##     q27     q35     q58      q2     q39 (Other) 
##   27002   20867   19130   17799   17486  268558 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##   21   97  452 1112 2236 3346 4094 4435 4335 3943 3461 2841 2373 1820 1334  945 
##   17   18   19   20   21   22   23 
##  642  388  246  128   66   30    8 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   7.000   9.000   9.669  12.000  23.000 
## 
## includes extended item information - examples:
##   labels
## 1     q1
## 2    q10
## 3    q16

Data has 38353 rows and 29 columns, the most popular variable is “q27”, which is question about friends.

Apriori algorithm

Now we can proceed to generating association rules, using the Apriori algorithm, which detects the most frequent individual items in the dataset and progressively combines them into larger itemsets, as long as these larger itemsets remain frequent

itemFrequencyPlot(trans, topN=30, type="absolute", main="Frequency of variables", col="#cc99ff")

itemSupport <- itemFrequency(trans)
hist(itemSupport, 
     breaks = 10,
     col = "#cc99ff", 
     main = "Support distribution", 
     xlab = "Support", 
     ylab = "number of items")

Let’s see the frequency of variables and the distribution of support. It is evident that most variables have values in the range of 0.2-0.5, with one variable having a very high support. Now, we need to find the optimal level of support and confidence to obtain the appropriate number of rules

rules_apriori<-apriori(trans, parameter=list(supp=0.1, conf=0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 3835 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[29 item(s), 38353 transaction(s)] done [0.02s].
## sorting and recoding items ... [27 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [685 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

685 rules is definitely too many, we need to increase support and confidence

rules_apriori<-apriori(trans, parameter=list(supp=0.12, conf=0.75))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5    0.12      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4602 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[29 item(s), 38353 transaction(s)] done [0.02s].
## sorting and recoding items ... [27 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 done [0.04s].
## writing ... [76 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
hist(quality(rules_apriori)$lift,
     breaks = 30,
     col='#89CFF0',
     main = "Lift distribution", 
     xlab = "Lift", 
     ylab = "number of items"
     )

While lift around 1 implies independent itemsets, we are not interested in those. At the end rules 59 are obtained.

rules_apriori_1 <- subset(rules_apriori, lift >= 1.25)
hist(quality(rules_apriori_1)$lift,
     breaks = 30,
     col='#89CFF0',
     main = "Lift distribution", 
     xlab = "Lift", 
     ylab = "number of items"
)

length(rules_apriori_1)
## [1] 59
plot(rules_apriori_1, 
     method = "graph", 
     measure = "support", 
     colors = c("#9933cc", "#ffccff")
)

plot(rules_apriori_1, method="paracoord", control=list(reorder=TRUE))

Analyzing rules

inspect(head(sort(rules_apriori_1, by="confidence", decreasing=TRUE),10))
##      lhs                rhs   support   confidence coverage  lift     count
## [1]  {q29}           => {q28} 0.1851224 1.0000000  0.1851224 2.594048 7100 
## [2]  {q29, q44}      => {q28} 0.1343050 1.0000000  0.1343050 2.594048 5151 
## [3]  {q29, q35}      => {q28} 0.1662191 1.0000000  0.1662191 2.594048 6375 
## [4]  {q27, q29}      => {q28} 0.1283602 1.0000000  0.1283602 2.594048 4923 
## [5]  {q29, q35, q44} => {q28} 0.1234845 1.0000000  0.1234845 2.594048 4736 
## [6]  {q41}           => {q28} 0.1236148 0.9356621  0.1321148 2.427152 4741 
## [7]  {q29, q44}      => {q35} 0.1234845 0.9194331  0.1343050 1.689894 4736 
## [8]  {q28, q29, q44} => {q35} 0.1234845 0.9194331  0.1343050 1.689894 4736 
## [9]  {q29}           => {q35} 0.1662191 0.8978873  0.1851224 1.650293 6375 
## [10] {q28, q29}      => {q35} 0.1662191 0.8978873  0.1851224 1.650293 6375

Now let’s take a look at the most basic rules of this survey. We can see that the occurrence of q28 (Have you ever smoked cigarettes?) is most influenced by q29 (Have you smoked cigarettes during the last 30 days?), q44 (Have you ever had sexual intercourse?), and q35 (Did you have at least one alcoholic drink during the last 30 days?).The relationship between q29 and q28 is quite clear and obvious, which is why variable q28 was excluded from the dataset.

Smoking cigarettes

trans_without_q28 <- trans[, !itemLabels(trans) %in% "q28"]

rules_apriori_cigs_reasons<-apriori(trans_without_q28, parameter=list(supp=0.05, conf=0.5),
                            appearance=list(default="lhs", rhs="q29"), control=list(verbose=F))
inspect(head(sort(rules_apriori_cigs_reasons, by="confidence", decreasing=TRUE)))
##     lhs                rhs   support    confidence coverage   lift     count
## [1] {q30, q35}      => {q29} 0.05509347 0.8738627  0.06304592 4.720459 2113 
## [2] {q30}           => {q29} 0.05783120 0.8633710  0.06698303 4.663784 2218 
## [3] {q35, q41, q44} => {q29} 0.06781738 0.6850145  0.09900138 3.700332 2601 
## [4] {q35, q41}      => {q29} 0.07861184 0.6657099  0.11808724 3.596052 3015 
## [5] {q27, q35, q41} => {q29} 0.05355513 0.6655865  0.08046307 3.595386 2054 
## [6] {q41, q44}      => {q29} 0.07151983 0.6577938  0.10872683 3.553291 2743

As seen above, the occurrence of variables such as:

-q30 - Have you used any other form of tobacco during the last 30 days?

-q35 - Did you have at least one alcoholic drink during the last 30 days?

-q41 - Have you ever used marijuana?

-q44 - Have you ever had sexual intercourse?

Affects whether a person has smoked a cigarette in the last 30 days. We can observe the influence of other substances, such as alcohol, marijuana, and other forms of tobacco. Therefore, we can conclude that this form of substance use is frequently associated with other risky behaviors

rules_apriori_cigs_result<-apriori(trans_without_q28, parameter=list(supp=0.05, conf=0.5),
                            appearance=list(default="rhs", lhs="q29"), control=list(verbose=F))
inspect(head(sort(rules_apriori_cigs_result, by="confidence", decreasing=TRUE)))
##     lhs      rhs   support   confidence coverage  lift      count
## [1] {q29} => {q35} 0.1662191 0.8978873  0.1851224 1.6502934  6375
## [2] {q29} => {q44} 0.1343050 0.7254930  0.1851224 1.7497693  5151
## [3] {}    => {q27} 0.7040388 0.7040388  1.0000000 1.0000000 27002
## [4] {q29} => {q27} 0.1283602 0.6933803  0.1851224 0.9848609  4923
## [5] {q29} => {q58} 0.1049722 0.5670423  0.1851224 1.1368412  4026
## [6] {q29} => {q22} 0.1048419 0.5663380  0.1851224 1.3261348  4021

On the other hand, individuals who smoke cigarettes are characterized by:

-Drinking alcohol

-Having sexual intercourse

-Having many friends

-Feeling lonely

-Having their belongings checked by guardians.

We observe here two contradictory conclusions: having many friends and feeling lonely. A very logical conclusion is that guardians checking their belongings might help divert them from smoking

Using marijuana

rules_apriori_weed_reasons<-apriori(trans_without_q28, parameter=list(supp=0.05, conf=0.5),
                            appearance=list(default="lhs", rhs="q41"), control=list(verbose=F))
inspect(head(sort(rules_apriori_weed_reasons, by="confidence", decreasing=TRUE)))
##     lhs                rhs   support    confidence coverage   lift     count
## [1] {q29, q35, q44} => {q41} 0.06781738 0.5491976  0.12348447 4.156972 2601 
## [2] {q1, q29}       => {q41} 0.05089563 0.5352344  0.09509034 4.051282 1952 
## [3] {q29, q44}      => {q41} 0.07151983 0.5325180  0.13430501 4.030721 2743
rules_apriori_weed_results<-apriori(trans_without_q28, parameter=list(supp=0.05, conf=0.5),
                            appearance=list(default="rhs", lhs="q41"), control=list(verbose=F))
inspect(head(sort(rules_apriori_weed_results, by="confidence", decreasing=TRUE)))
##     lhs      rhs   support    confidence coverage  lift      count
## [1] {q41} => {q35} 0.11808724 0.8938228  0.1321148 1.6428229  4529
## [2] {q41} => {q44} 0.10872683 0.8229722  0.1321148 1.9848731  4170
## [3] {}    => {q27} 0.70403880 0.7040388  1.0000000 1.0000000 27002
## [4] {q41} => {q27} 0.08898913 0.6735741  0.1321148 0.9567287  3413
## [5] {q41} => {q29} 0.08343545 0.6315374  0.1321148 3.4114583  3200
## [6] {q41} => {q1}  0.08179282 0.6191040  0.1321148 1.6901200  3137

When it comes to marijuana use, there are sets of characteristics that contribute to the decision to use it. In addition to the typical factors similar to those associated with smoking cigarettes, but here age is important, as older individuals are more likely to use marijuana. The variables influenced by marijuana use are very similar, so it can be assumed that they naturally coexist

Using amphetamines or methamphetamines

rules_apriori_weed<-apriori(trans, parameter=list(supp=0.001, conf=0.4),
                            appearance=list(default="lhs", rhs="q43"), control=list(verbose=F))
inspect(head(sort(rules_apriori_weed, by="confidence", decreasing=TRUE)))
##     lhs                                            rhs   support     confidence
## [1] {q2, q23, q30, q35, q41, q44, q66}          => {q43} 0.001042943 0.4819277 
## [2] {q2, q23, q28, q30, q35, q41, q44, q66}     => {q43} 0.001042943 0.4819277 
## [3] {q16, q2, q23, q29, q30, q41, q44, q53, q6} => {q43} 0.001042943 0.4819277 
## [4] {q2, q23, q30, q41, q44, q66}               => {q43} 0.001069017 0.4767442 
## [5] {q2, q23, q28, q30, q41, q44, q66}          => {q43} 0.001069017 0.4767442 
## [6] {q16, q17, q22, q23, q30, q41, q53, q6}     => {q43} 0.001069017 0.4712644 
##     coverage    lift     count
## [1] 0.002164107 25.67135 40   
## [2] 0.002164107 25.67135 40   
## [3] 0.002164107 25.67135 40   
## [4] 0.002242328 25.39524 41   
## [5] 0.002242328 25.39524 41   
## [6] 0.002268401 25.10334 41
rules_apriori_drugs<-apriori(trans, parameter=list(supp=0.001, conf=0.4),
                            appearance=list(default="rhs", lhs="q43"), control=list(verbose=F))
inspect(head(sort(rules_apriori_drugs, by="confidence", decreasing=TRUE)))
##     lhs      rhs   support    confidence coverage   lift     count
## [1] {q43} => {q41} 0.01775611 0.9458333  0.01877298 7.159176   681
## [2] {q43} => {q28} 0.01767789 0.9416667  0.01877298 2.442729   678
## [3] {q43} => {q35} 0.01741715 0.9277778  0.01877298 1.705231   668
## [4] {q43} => {q44} 0.01699997 0.9055556  0.01877298 2.184051   652
## [5] {q43} => {q29} 0.01407973 0.7500000  0.01877298 4.051373   540
## [6] {}    => {q27} 0.70403880 0.7040388  1.00000000 1.000000 27002

In the case of hard drugs, we are dealing with a very small number of observations, which is why the ‘support’ value is so low. In the causes, we can observe sets of many variables, much more than in previous cases. The repeating variables create a profile of an individual using hard drugs, which includes:

-Male

-Can’t sleep at night

-Using tobacco, alcohol, and marijuana

-Having sexual intercourse

-Being bullied on school property

These findings are particularly interesting because many of these variables did not appear in the case of smoking cigarettes or using marijuana. What is especially concerning is the appearance of sleeping problems and being bullied at school.

Additionally, the analysis focused on two positive behaviors: understanding the factors that contribute to individuals having many friends and exploring the effects of parents showing concern for their children’s problems

Friends

rules_apriori_friends<-apriori(trans, parameter=list(supp=0.05, conf=0.5),
                            appearance=list(default="lhs", rhs="q27"), control=list(verbose=F))
inspect(head(sort(rules_apriori_friends, by="confidence", decreasing=TRUE)))
##     lhs                rhs   support    confidence coverage   lift     count
## [1] {q10, q49, q56} => {q27} 0.05793549 0.8205318  0.07060725 1.165464 2222 
## [2] {q2, q49, q56}  => {q27} 0.08648606 0.8188102  0.10562407 1.163019 3317 
## [3] {q2, q35, q56}  => {q27} 0.08283576 0.8177606  0.10129586 1.161528 3177 
## [4] {q10, q2, q51}  => {q27} 0.05569317 0.8174512  0.06813026 1.161088 2136 
## [5] {q10, q51, q56} => {q27} 0.05042630 0.8174134  0.06169009 1.161035 1934 
## [6] {q49, q51, q56} => {q27} 0.06009960 0.8159292  0.07365786 1.158926 2305

The obtained rules suggest that in order to have many friends, it is needed to eat fast food, use contraception, be male, and have guardians who are concerned about their problems

At this point, it’s important to consider the risks associated with surveys. Often, respondents do not answer according to true, but rather in a way they want to be perceived. Many responses, especially to subjective questions like ‘close friends,’ may be distorted.

Taking care of children

rules_apriori_care=apriori(trans, parameter=list(supp=0.05, conf=0.5),
                               appearance=list(default="rhs", lhs="q56"), control=list(verbose=F))
inspect(head(sort(rules_apriori_care, by="confidence", decreasing=TRUE)))
##     lhs      rhs   support   confidence coverage  lift     count
## [1] {q56} => {q27} 0.3241989 0.7593746  0.4269288 1.078598 12434
## [2] {}    => {q27} 0.7040388 0.7040388  1.0000000 1.000000 27002
## [3] {}    => {q35} 0.5440774 0.5440774  1.0000000 1.000000 20867
## [4] {q56} => {q39} 0.2207389 0.5170392  0.4269288 1.134050  8466
## [5] {q56} => {q57} 0.2191745 0.5133749  0.4269288 1.539924  8406

The analysis focused on the question of whether parental concern for children’s problems has a positive impact on certain behaviors:

-q27: Do you have many friends?

-q39: Have you gotten into trouble because of alcohol?

-q57: During the past 30 days, how often did your parents or guardians really know what you were doing with your free time?

The results of these rules were not satisfactory. The variable q39, indicating alcohol-related issues, seems to be more of a cause than an effect in this context. Additionally, q57 appears to be similair to q56.

One positive finding is that when parents are more aware of what their children are doing, the children tend to have more friends (q27). This suggests that when parents are more involved, it could help children build stronger social connections and friendships.

Questions and future research

While working with the survey data, it was noticed that the vast majority of questions was focused on negative phenomenons. There were very few questions about habits, past experiences, or more social behaviors of the respondents,while a significant portion of the survey was dedicated to substance use, problems, and other negative issues.

Results generated by the model often failed to connect standard behaviors with pathological ones. Instead, undesirable behaviors frequently correlated with other undesirable behaviors (like smoking cigarettes -> using marijuana). While this pattern may reflect reality, because people often combine these substances, the algorithm still had too few variables related to positive traits. Future research should include more neutral variables to obtain more insightful and useful results.

Conclusions

In this project, use of association rules was highly effective, leading to interesting conclusions. A key aspect was properly preparing binary data and setting appropriate support and confidence thresholds to limit the number of rules.

Findings indicate that young people often struggle with multiple substances and tend to combine them. Smokers frequently have issues with alcohol, and there is a notable pattern of mixing different types of drugs. An important observation is that individuals experiencing school bullying and sleep problems may be more likely to use hard drugs. Additionally, parental involvement plays a crucial role in young adults lives, as it appears to contribute to a larger number of friends for their children.