The goal of this project is to analyze survey dataset using the Apriori algorithm to obtain association rules, which are patterns of behavior observed in the survey participants.
Young people, due to their still developing minds and worldviews, are very vulnerable to pathological behaviors. Smoking, alcohol use, drug use, and violence can leave a significant mark on them, which may affect their lives in the future. It is very important to understand why young people exhibit undesirable behaviors and what their consequences are. In this way, social programs can more effectively help them, minimizing the risk of problems in adult life.
This project uses data from the ‘Global School-Based Student Health Survey’ conducted by WHO, specifically the 2018 data from Argentina. It is an interesting country to observe due to the many social and economic issues it faces in the 21st century.
source: https://news.iu.edu/live/news/28031-cigarette-smoking-more-prevalent-harder-to-quit
#Loading packages
library(arules)
library(corrplot)
library(arulesViz)
library(knitr)
library(kableExtra)
Loading data, survey contains a large number of variables, but many of them are highly correlated or empty. It is necessary to limit the dataset to only the variables suitable for analysis.
data <- read.csv("C:/Users/User/OneDrive/Pulpit/Studia magisterskie/USL/Project part 3/National.csv", sep=";")
data <- data[, 3:57]
any(is.na(data))
## [1] TRUE
corrplot(cor(data, use = "complete.obs", method = "pearson"), method = "color")
Due to the complexity of the data, variables were manually selected based on the codebook provided on the WHO website, now correlations look much better.
data_1 <- data[, (names(data) %in% c("q1","q2","q3","q6","q10","q16","q17","q22","q23","q24","q27","q28","q29","q30","q33","q35","q39",
"q41","q43","q44","q49","q51","q53","q56","q57","q58","q66","q67","q68"))]
data_1 <- na.omit(data_1)
corrplot(cor(data_1, use = "complete.obs", method = "pearson"), method = "color")
head(data_1)
## q1 q2 q3 q6 q10 q16 q17 q22 q23 q24 q27 q28 q29 q30 q33 q35 q39 q41 q43 q44
## 7 4 2 4 3 6 1 1 3 3 2 4 1 1 1 1 1 1 1 1 2
## 8 2 2 4 1 1 1 1 2 2 2 4 1 1 1 1 1 1 1 1 2
## 10 4 2 2 2 1 1 1 5 1 1 1 1 1 1 1 1 1 1 1 2
## 11 5 1 2 1 5 2 2 1 4 2 4 3 3 1 1 2 1 1 1 1
## 12 3 2 4 1 2 2 3 3 5 1 4 1 1 1 1 2 1 1 1 2
## 13 3 2 2 1 1 1 1 3 1 2 4 1 1 1 1 1 1 1 1 2
## q49 q51 q53 q56 q57 q58 q66 q67 q68
## 7 6 6 1 5 5 2 2 2 2
## 8 3 5 1 1 5 1 2 2 2
## 10 2 3 1 1 5 1 1 2 2
## 11 3 6 3 5 1 3 2 2 2
## 12 1 1 1 5 5 1 1 1 1
## 13 1 3 1 1 5 1 2 2 2
As we can see, Variables have a numerical form with many NA values. In order to apply the association rules algorithm, they need to be in binary form, with values of 0 or 1. This is complicated in this case because many variables have different ranges, some range from 0 to 4, others from 0 to 9, and so on. This situation significantly complicates the process of converting the variables to binary. What is more, many of the questions are phrased in such a way that it is easy to misinterpret the results
In order to best interpret the values for each question, it was decided to manually convert the values to binary in most cases, so that they could be used in the model while maintaining their real meaning. During this process, many questions should have been phrased differently, which is why I created dataframe named “Dictionary”, listing all the questions to help interpret the results.
data_1$q2 <- ifelse(data_1$q2 == 1, 1, 0) #is_male 1-male, 0 female
data_1$q6 <- ifelse(data_1$q6 == 1, 0, 1)
data_1$q10 <- ifelse(data_1$q10 == 1, 0, 1)
data_1$q16 <- ifelse(data_1$q16 == 1, 0, 1)
data_1$q17 <- ifelse(data_1$q17 == 1, 0, 1)
data_1$q24 <- ifelse(data_1$q24 == 1, 1, 0)
data_1$q27 <- ifelse(data_1$q27 == 4, 1, 0) #Do you have many friends?
data_1$q28 <- ifelse(data_1$q28 == 1, 0, 1) #Have you ever smoked ciggarets,
data_1$q29 <- ifelse(data_1$q29 == 1, 0, 1) #Have you smoked ciggarets during last 30 days
data_1$q30 <- ifelse(data_1$q30 == 1, 0, 1) #Have you used any other form of tabacco during last 30 days
data_1$q33 <- ifelse(data_1$q33 == 1, 0, 1) #Does you parents use any form of tabacco?
data_1$q35 <- ifelse(data_1$q35 == 1, 0, 1) #Did you have at least one alcohol drink during last 30 days?
data_1$q39 <- ifelse(data_1$q35 == 1, 0, 1) #Have you get into trouble because of alcohol?
data_1$q41 <- ifelse(data_1$q41 == 1, 0, 1) #Have you used marijuana?
data_1$q43 <- ifelse(data_1$q43 == 1, 0, 1) #Have you used amphetamines or methamphetamines?
data_1$q44 <- ifelse(data_1$q44 == 1, 1, 0)
data_1$q53 <- ifelse(data_1$q53 == 1, 0, 1) #Have you skipped school during past 30 days?
data_1$q66 <- ifelse(data_1$q66 == 1, 1, 0)
data_1$q67 <- ifelse(data_1$q67 == 1, 1, 0)
data_1$q68 <- ifelse(data_1$q68 == 1, 1, 0)
The remaining variables were converted using the median. If the value is above the median in a given column, it receives a value of 0, if it is below the median, it is a 1.
cols_to_change=c("q1","q3","q22","q23","q49","q51","q56","q57","q58")
for (col in cols_to_change) {
median_value <- median(data_1[[col]], na.rm = TRUE)
data_1[[col]] <- ifelse(data_1[[col]] <= median_value, 0, 1)
} #if value is below median it gets 0, if higher than median 1
sapply(data_1, function(col) length(unique(col)) == 2)
## q1 q2 q3 q6 q10 q16 q17 q22 q23 q24 q27 q28 q29 q30 q33 q35
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## q39 q41 q43 q44 q49 q51 q53 q56 q57 q58 q66 q67 q68
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Right now all variables are binary.
Creating dictionary to interprate varaibles
Explenation = c("How old are you?",
"1-male, 0-female",
"In what grade are you?",
"During the past 30 days, how often did you go hungry because there was not enough food in your home?",
"During the past 7 days, on how many days did you eat food from a fast food restaurant?",
"During the past 12 months, have you been in a physical fight?",
"During the past 12 months, have you been seriously injured?",
"During the past 12 months, have you felt lonely?",
"During the past 12 months, have you been so worried about something that you could not sleep at night?",
"During the past 12 months, did you ever seriously consider attempting suicide?",
"Do you have many friends?",
"Have you ever smoked cigarettes?",
"Have you smoked cigarettes during the last 30 days?",
"Have you used any other form of tobacco during the last 30 days?",
"Do your parents use any form of tobacco?",
"Did you have at least one alcoholic drink during the last 30 days?",
"Have you gotten into trouble because of alcohol?",
"Have you ever used marijuana?",
"Have you ever used amphetamines or methamphetamines?",
"Have you ever had sexual intercourse?",
"The last time you had sexual intercourse, did you or your partner use any other method of birth control, or any other method to prevent pregnancy?",
"During this school year, on how many days did you go to physical education (PE) class each week?",
"Have you skipped school during the past 30 days?",
"During the past 30 days, how often did your parents or guardians understand your problems and worries?",
"During the past 30 days, how often did your parents or guardians really know what you were doing with your free time?",
"During the past 30 days, how often did your parents or guardians go through your things without your approval?",
"During the past 12 months, have you ever been bullied on school property?",
"During the past 12 months, have you ever been bullied when you were not on school property?",
"During the past 12 months, have you ever been cyberbullied?")
Dictionary <- data.frame(
variable = colnames(data_1),
Explenation = Explenation
)
kable(Dictionary, caption = "Dictionary", format = "html") %>%
kable_styling(bootstrap_options = c("striped", "bordered"))
| variable | Explenation |
|---|---|
| q1 | How old are you? |
| q2 | 1-male, 0-female |
| q3 | In what grade are you? |
| q6 | During the past 30 days, how often did you go hungry because there was not enough food in your home? |
| q10 | During the past 7 days, on how many days did you eat food from a fast food restaurant? |
| q16 | During the past 12 months, have you been in a physical fight? |
| q17 | During the past 12 months, have you been seriously injured? |
| q22 | During the past 12 months, have you felt lonely? |
| q23 | During the past 12 months, have you been so worried about something that you could not sleep at night? |
| q24 | During the past 12 months, did you ever seriously consider attempting suicide? |
| q27 | Do you have many friends? |
| q28 | Have you ever smoked cigarettes? |
| q29 | Have you smoked cigarettes during the last 30 days? |
| q30 | Have you used any other form of tobacco during the last 30 days? |
| q33 | Do your parents use any form of tobacco? |
| q35 | Did you have at least one alcoholic drink during the last 30 days? |
| q39 | Have you gotten into trouble because of alcohol? |
| q41 | Have you ever used marijuana? |
| q43 | Have you ever used amphetamines or methamphetamines? |
| q44 | Have you ever had sexual intercourse? |
| q49 | The last time you had sexual intercourse, did you or your partner use any other method of birth control, or any other method to prevent pregnancy? |
| q51 | During this school year, on how many days did you go to physical education (PE) class each week? |
| q53 | Have you skipped school during the past 30 days? |
| q56 | During the past 30 days, how often did your parents or guardians understand your problems and worries? |
| q57 | During the past 30 days, how often did your parents or guardians really know what you were doing with your free time? |
| q58 | During the past 30 days, how often did your parents or guardians go through your things without your approval? |
| q66 | During the past 12 months, have you ever been bullied on school property? |
| q67 | During the past 12 months, have you ever been bullied when you were not on school property? |
| q68 | During the past 12 months, have you ever been cyberbullied? |
Now, data needs to be transformed into a format that can be used by the read.transactions function.
data_3=data_1
for (col in colnames(data_3)) {
data_3[[col]] <- ifelse(data_3[[col]] == 1, col, '')
}
data_4 <- data.frame(SingleColumn = apply(data_3, 1, function(row) {
non_empty_values <- row[!is.na(row) & row != ""]
quoted_values <- paste0('"', non_empty_values, '"')
paste(quoted_values, collapse = ", ")
}))
write.table(data_4, file = "C:/Users/User/OneDrive/Pulpit/Studia magisterskie/USL/Project part 3/National_1.csv",
row.names = FALSE, col.names = FALSE, sep = ",", quote=F)
trans <- read.transactions("C:/Users/User/OneDrive/Pulpit/Studia magisterskie/USL/Project part 3/National_1.csv",format="basket", sep = ",")
summary(trans)
## transactions as itemMatrix in sparse format with
## 38353 rows (elements/itemsets/transactions) and
## 29 columns (items) and a density of 0.3334199
##
## most frequent items:
## q27 q35 q58 q2 q39 (Other)
## 27002 20867 19130 17799 17486 268558
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 21 97 452 1112 2236 3346 4094 4435 4335 3943 3461 2841 2373 1820 1334 945
## 17 18 19 20 21 22 23
## 642 388 246 128 66 30 8
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 7.000 9.000 9.669 12.000 23.000
##
## includes extended item information - examples:
## labels
## 1 q1
## 2 q10
## 3 q16
Data has 38353 rows and 29 columns, the most popular variable is “q27”, which is question about friends.
Now we can proceed to generating association rules, using the Apriori algorithm, which detects the most frequent individual items in the dataset and progressively combines them into larger itemsets, as long as these larger itemsets remain frequent
itemFrequencyPlot(trans, topN=30, type="absolute", main="Frequency of variables", col="#cc99ff")
itemSupport <- itemFrequency(trans)
hist(itemSupport,
breaks = 10,
col = "#cc99ff",
main = "Support distribution",
xlab = "Support",
ylab = "number of items")
Let’s see the frequency of variables and the distribution of support. It is evident that most variables have values in the range of 0.2-0.5, with one variable having a very high support. Now, we need to find the optimal level of support and confidence to obtain the appropriate number of rules
rules_apriori<-apriori(trans, parameter=list(supp=0.1, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 3835
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[29 item(s), 38353 transaction(s)] done [0.02s].
## sorting and recoding items ... [27 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [685 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
685 rules is definitely too many, we need to increase support and confidence
rules_apriori<-apriori(trans, parameter=list(supp=0.12, conf=0.75))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.12 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4602
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[29 item(s), 38353 transaction(s)] done [0.02s].
## sorting and recoding items ... [27 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 done [0.04s].
## writing ... [76 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
hist(quality(rules_apriori)$lift,
breaks = 30,
col='#89CFF0',
main = "Lift distribution",
xlab = "Lift",
ylab = "number of items"
)
While lift around 1 implies independent itemsets, we are not interested in those. At the end rules 59 are obtained.
rules_apriori_1 <- subset(rules_apriori, lift >= 1.25)
hist(quality(rules_apriori_1)$lift,
breaks = 30,
col='#89CFF0',
main = "Lift distribution",
xlab = "Lift",
ylab = "number of items"
)
length(rules_apriori_1)
## [1] 59
plot(rules_apriori_1,
method = "graph",
measure = "support",
colors = c("#9933cc", "#ffccff")
)
plot(rules_apriori_1, method="paracoord", control=list(reorder=TRUE))
inspect(head(sort(rules_apriori_1, by="confidence", decreasing=TRUE),10))
## lhs rhs support confidence coverage lift count
## [1] {q29} => {q28} 0.1851224 1.0000000 0.1851224 2.594048 7100
## [2] {q29, q44} => {q28} 0.1343050 1.0000000 0.1343050 2.594048 5151
## [3] {q29, q35} => {q28} 0.1662191 1.0000000 0.1662191 2.594048 6375
## [4] {q27, q29} => {q28} 0.1283602 1.0000000 0.1283602 2.594048 4923
## [5] {q29, q35, q44} => {q28} 0.1234845 1.0000000 0.1234845 2.594048 4736
## [6] {q41} => {q28} 0.1236148 0.9356621 0.1321148 2.427152 4741
## [7] {q29, q44} => {q35} 0.1234845 0.9194331 0.1343050 1.689894 4736
## [8] {q28, q29, q44} => {q35} 0.1234845 0.9194331 0.1343050 1.689894 4736
## [9] {q29} => {q35} 0.1662191 0.8978873 0.1851224 1.650293 6375
## [10] {q28, q29} => {q35} 0.1662191 0.8978873 0.1851224 1.650293 6375
Now let’s take a look at the most basic rules of this survey. We can see that the occurrence of q28 (Have you ever smoked cigarettes?) is most influenced by q29 (Have you smoked cigarettes during the last 30 days?), q44 (Have you ever had sexual intercourse?), and q35 (Did you have at least one alcoholic drink during the last 30 days?).The relationship between q29 and q28 is quite clear and obvious, which is why variable q28 was excluded from the dataset.
Smoking cigarettes
trans_without_q28 <- trans[, !itemLabels(trans) %in% "q28"]
rules_apriori_cigs_reasons<-apriori(trans_without_q28, parameter=list(supp=0.05, conf=0.5),
appearance=list(default="lhs", rhs="q29"), control=list(verbose=F))
inspect(head(sort(rules_apriori_cigs_reasons, by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {q30, q35} => {q29} 0.05509347 0.8738627 0.06304592 4.720459 2113
## [2] {q30} => {q29} 0.05783120 0.8633710 0.06698303 4.663784 2218
## [3] {q35, q41, q44} => {q29} 0.06781738 0.6850145 0.09900138 3.700332 2601
## [4] {q35, q41} => {q29} 0.07861184 0.6657099 0.11808724 3.596052 3015
## [5] {q27, q35, q41} => {q29} 0.05355513 0.6655865 0.08046307 3.595386 2054
## [6] {q41, q44} => {q29} 0.07151983 0.6577938 0.10872683 3.553291 2743
As seen above, the occurrence of variables such as:
-q30 - Have you used any other form of tobacco during the last 30 days?
-q35 - Did you have at least one alcoholic drink during the last 30 days?
-q41 - Have you ever used marijuana?
-q44 - Have you ever had sexual intercourse?
Affects whether a person has smoked a cigarette in the last 30 days. We can observe the influence of other substances, such as alcohol, marijuana, and other forms of tobacco. Therefore, we can conclude that this form of substance use is frequently associated with other risky behaviors
rules_apriori_cigs_result<-apriori(trans_without_q28, parameter=list(supp=0.05, conf=0.5),
appearance=list(default="rhs", lhs="q29"), control=list(verbose=F))
inspect(head(sort(rules_apriori_cigs_result, by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {q29} => {q35} 0.1662191 0.8978873 0.1851224 1.6502934 6375
## [2] {q29} => {q44} 0.1343050 0.7254930 0.1851224 1.7497693 5151
## [3] {} => {q27} 0.7040388 0.7040388 1.0000000 1.0000000 27002
## [4] {q29} => {q27} 0.1283602 0.6933803 0.1851224 0.9848609 4923
## [5] {q29} => {q58} 0.1049722 0.5670423 0.1851224 1.1368412 4026
## [6] {q29} => {q22} 0.1048419 0.5663380 0.1851224 1.3261348 4021
On the other hand, individuals who smoke cigarettes are characterized by:
-Drinking alcohol
-Having sexual intercourse
-Having many friends
-Feeling lonely
-Having their belongings checked by guardians.
We observe here two contradictory conclusions: having many friends and feeling lonely. A very logical conclusion is that guardians checking their belongings might help divert them from smoking
Using marijuana
rules_apriori_weed_reasons<-apriori(trans_without_q28, parameter=list(supp=0.05, conf=0.5),
appearance=list(default="lhs", rhs="q41"), control=list(verbose=F))
inspect(head(sort(rules_apriori_weed_reasons, by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {q29, q35, q44} => {q41} 0.06781738 0.5491976 0.12348447 4.156972 2601
## [2] {q1, q29} => {q41} 0.05089563 0.5352344 0.09509034 4.051282 1952
## [3] {q29, q44} => {q41} 0.07151983 0.5325180 0.13430501 4.030721 2743
rules_apriori_weed_results<-apriori(trans_without_q28, parameter=list(supp=0.05, conf=0.5),
appearance=list(default="rhs", lhs="q41"), control=list(verbose=F))
inspect(head(sort(rules_apriori_weed_results, by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {q41} => {q35} 0.11808724 0.8938228 0.1321148 1.6428229 4529
## [2] {q41} => {q44} 0.10872683 0.8229722 0.1321148 1.9848731 4170
## [3] {} => {q27} 0.70403880 0.7040388 1.0000000 1.0000000 27002
## [4] {q41} => {q27} 0.08898913 0.6735741 0.1321148 0.9567287 3413
## [5] {q41} => {q29} 0.08343545 0.6315374 0.1321148 3.4114583 3200
## [6] {q41} => {q1} 0.08179282 0.6191040 0.1321148 1.6901200 3137
When it comes to marijuana use, there are sets of characteristics that contribute to the decision to use it. In addition to the typical factors similar to those associated with smoking cigarettes, but here age is important, as older individuals are more likely to use marijuana. The variables influenced by marijuana use are very similar, so it can be assumed that they naturally coexist
Using amphetamines or methamphetamines
rules_apriori_weed<-apriori(trans, parameter=list(supp=0.001, conf=0.4),
appearance=list(default="lhs", rhs="q43"), control=list(verbose=F))
inspect(head(sort(rules_apriori_weed, by="confidence", decreasing=TRUE)))
## lhs rhs support confidence
## [1] {q2, q23, q30, q35, q41, q44, q66} => {q43} 0.001042943 0.4819277
## [2] {q2, q23, q28, q30, q35, q41, q44, q66} => {q43} 0.001042943 0.4819277
## [3] {q16, q2, q23, q29, q30, q41, q44, q53, q6} => {q43} 0.001042943 0.4819277
## [4] {q2, q23, q30, q41, q44, q66} => {q43} 0.001069017 0.4767442
## [5] {q2, q23, q28, q30, q41, q44, q66} => {q43} 0.001069017 0.4767442
## [6] {q16, q17, q22, q23, q30, q41, q53, q6} => {q43} 0.001069017 0.4712644
## coverage lift count
## [1] 0.002164107 25.67135 40
## [2] 0.002164107 25.67135 40
## [3] 0.002164107 25.67135 40
## [4] 0.002242328 25.39524 41
## [5] 0.002242328 25.39524 41
## [6] 0.002268401 25.10334 41
rules_apriori_drugs<-apriori(trans, parameter=list(supp=0.001, conf=0.4),
appearance=list(default="rhs", lhs="q43"), control=list(verbose=F))
inspect(head(sort(rules_apriori_drugs, by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {q43} => {q41} 0.01775611 0.9458333 0.01877298 7.159176 681
## [2] {q43} => {q28} 0.01767789 0.9416667 0.01877298 2.442729 678
## [3] {q43} => {q35} 0.01741715 0.9277778 0.01877298 1.705231 668
## [4] {q43} => {q44} 0.01699997 0.9055556 0.01877298 2.184051 652
## [5] {q43} => {q29} 0.01407973 0.7500000 0.01877298 4.051373 540
## [6] {} => {q27} 0.70403880 0.7040388 1.00000000 1.000000 27002
In the case of hard drugs, we are dealing with a very small number of observations, which is why the ‘support’ value is so low. In the causes, we can observe sets of many variables, much more than in previous cases. The repeating variables create a profile of an individual using hard drugs, which includes:
-Male
-Can’t sleep at night
-Using tobacco, alcohol, and marijuana
-Having sexual intercourse
-Being bullied on school property
These findings are particularly interesting because many of these variables did not appear in the case of smoking cigarettes or using marijuana. What is especially concerning is the appearance of sleeping problems and being bullied at school.
Additionally, the analysis focused on two positive behaviors: understanding the factors that contribute to individuals having many friends and exploring the effects of parents showing concern for their children’s problems
Friends
rules_apriori_friends<-apriori(trans, parameter=list(supp=0.05, conf=0.5),
appearance=list(default="lhs", rhs="q27"), control=list(verbose=F))
inspect(head(sort(rules_apriori_friends, by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {q10, q49, q56} => {q27} 0.05793549 0.8205318 0.07060725 1.165464 2222
## [2] {q2, q49, q56} => {q27} 0.08648606 0.8188102 0.10562407 1.163019 3317
## [3] {q2, q35, q56} => {q27} 0.08283576 0.8177606 0.10129586 1.161528 3177
## [4] {q10, q2, q51} => {q27} 0.05569317 0.8174512 0.06813026 1.161088 2136
## [5] {q10, q51, q56} => {q27} 0.05042630 0.8174134 0.06169009 1.161035 1934
## [6] {q49, q51, q56} => {q27} 0.06009960 0.8159292 0.07365786 1.158926 2305
The obtained rules suggest that in order to have many friends, it is needed to eat fast food, use contraception, be male, and have guardians who are concerned about their problems
At this point, it’s important to consider the risks associated with surveys. Often, respondents do not answer according to true, but rather in a way they want to be perceived. Many responses, especially to subjective questions like ‘close friends,’ may be distorted.
Taking care of children
rules_apriori_care=apriori(trans, parameter=list(supp=0.05, conf=0.5),
appearance=list(default="rhs", lhs="q56"), control=list(verbose=F))
inspect(head(sort(rules_apriori_care, by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {q56} => {q27} 0.3241989 0.7593746 0.4269288 1.078598 12434
## [2] {} => {q27} 0.7040388 0.7040388 1.0000000 1.000000 27002
## [3] {} => {q35} 0.5440774 0.5440774 1.0000000 1.000000 20867
## [4] {q56} => {q39} 0.2207389 0.5170392 0.4269288 1.134050 8466
## [5] {q56} => {q57} 0.2191745 0.5133749 0.4269288 1.539924 8406
The analysis focused on the question of whether parental concern for children’s problems has a positive impact on certain behaviors:
-q27: Do you have many friends?
-q39: Have you gotten into trouble because of alcohol?
-q57: During the past 30 days, how often did your parents or guardians really know what you were doing with your free time?
The results of these rules were not satisfactory. The variable q39, indicating alcohol-related issues, seems to be more of a cause than an effect in this context. Additionally, q57 appears to be similair to q56.
One positive finding is that when parents are more aware of what their children are doing, the children tend to have more friends (q27). This suggests that when parents are more involved, it could help children build stronger social connections and friendships.
While working with the survey data, it was noticed that the vast majority of questions was focused on negative phenomenons. There were very few questions about habits, past experiences, or more social behaviors of the respondents,while a significant portion of the survey was dedicated to substance use, problems, and other negative issues.
Results generated by the model often failed to connect standard behaviors with pathological ones. Instead, undesirable behaviors frequently correlated with other undesirable behaviors (like smoking cigarettes -> using marijuana). While this pattern may reflect reality, because people often combine these substances, the algorithm still had too few variables related to positive traits. Future research should include more neutral variables to obtain more insightful and useful results.
In this project, use of association rules was highly effective, leading to interesting conclusions. A key aspect was properly preparing binary data and setting appropriate support and confidence thresholds to limit the number of rules.
Findings indicate that young people often struggle with multiple substances and tend to combine them. Smokers frequently have issues with alcohol, and there is a notable pattern of mixing different types of drugs. An important observation is that individuals experiencing school bullying and sleep problems may be more likely to use hard drugs. Additionally, parental involvement plays a crucial role in young adults lives, as it appears to contribute to a larger number of friends for their children.