Association Rule Mining for Bullying Risk Patterns

Author

Sebastian Chmielewski

Introduction

Bullying is a widespread phenomenon affecting adolescents across different social and cultural contexts. It is commonly understood as intentional and repetitive aggressive behavior occurring within relationships characterized by an imbalance of power, which distinguishes bullying from isolated peer conflicts. Due to its high prevalence and potential for long-term harm, bullying remains a major concern in adolescent health research.

Research consistently shows that bullying involvement, especially victimization, is associated with negative mental health outcomes such as loneliness, emotional distress, and depressive symptoms. Emotional vulnerability, including persistent feelings of loneliness, may both result from bullying and increase the risk of being victimized, suggesting a bidirectional relationship between emotional well-being and bullying.

In addition to psychosocial factors, individual characteristics such as weight status have been linked to bullying experiences. Adolescents who are underweight, overweight, or obese appear to be at higher risk of peer victimization, possibly due to weight-based stigma and appearance-related norms.

Despite extensive evidence on individual risk factors, less is known about how psychosocial, behavioral, and individual characteristics co-occur within adolescents’ everyday lives. This study addresses this gap by applying association rule mining to data from a large, nationally representative survey of secondary school students in Argentina. The analysis examines how loneliness, social support, school absence, and weight status combine into profiles associated with bullying involvement, with separate analyses for girls and boys to identify potential gender-specific patterns.

Data and preprocessing

Bullying in Schools – Kaggle dataset

The dataset is derived from the Global School-Based Student Health Survey (GSHS), an international school-based survey designed to collect information on health-related behaviors and protective factors among adolescents. The survey is conducted using a self-administered questionnaire completed by students during regular school hours.

In this study, we use data collected in Argentina in 2018. The survey covers a large nationally representative sample of secondary school students. Nearly 57,000 students participated in the study, with satisfactory school-level and student-level response rates, which ensures good coverage of the target population and reduces the risk of systematic non-response bias.

The GSHS questionnaire includes multiple thematic modules addressing physical health, mental well-being, social relationships, and risk behaviors. For the purpose of this project, we focus on a subset of variables related to bullying experiences and selected psychosocial and behavioral factors that have been linked in previous research to bullying involvement and victimization.

The selected variables describe:

different forms of bullying, including bullying on school property, bullying outside school, and cyberbullying;
experiences of physical aggression, such as physical attacks and participation in physical fights;
indicators of emotional well-being, including feelings of loneliness and sadness;
social support and peer relationships, such as having close friends and perceiving other students as kind and helpful;
school-related behaviors, including skipping classes without permission;
selected individual characteristics, such as sex and weight status (underweight, overweight, obese).

Loading the data and initial inspection

We begin by loading the data and performing an initial inspection.

import numpy as np
import pandas as pd

df = pd.read_csv(r"Bullying_2018.csv", sep=';')
df = df.replace(r'^\s*$', np.nan, regex=True)
df.shape

for i in df.columns:
    print(i)
    print(df[i].unique())

record
[    1     2     3 ... 57093 57094 57095]
Bullied_on_school_property_in_past_12_months
['Yes' 'No' nan]
Bullied_not_on_school_property_in_past_12_months
['Yes' 'No' nan]
Cyber_bullied_in_past_12_months
[nan 'No' 'Yes']
Custom_Age
['13 years old' '14 years old' '16 years old' '12 years old'
 '15 years old' '11 years old or younger' '17 years old' nan
 '18 years old or older']
Sex
['Female' 'Male' nan]
Physically_attacked
['0 times' '1 time' '12 or more times' '4 or 5 times' '2 or 3 times'
 '10 or 11 times' '8 or 9 times' '6 or 7 times' nan]
Physical_fighting
['0 times' '2 or 3 times' '1 time' '4 or 5 times' '6 or 7 times'
 '8 or 9 times' '10 or 11 times' nan '12 or more times']
Felt_lonely
['Always' 'Never' 'Rarely' 'Sometimes' 'Most of the time' nan]
Close_friends
['2' '3 or more' '0' nan '1']
Miss_school_no_permission
['10 or more days' '0 days' '6 to 9 days' '3 to 5 days' nan '1 or 2 days']
Other_students_kind_and_helpful
['Never' 'Sometimes' 'Most of the time' nan 'Always' 'Rarely']
Parents_understand_problems
['Always' nan 'Most of the time' 'Never' 'Sometimes' 'Rarely']
Most_of_the_time_or_always_felt_lonely
['Yes' 'No' nan]
Missed_classes_or_school_without_permission
['Yes' 'No' nan]
Were_underweight
[nan 'No' 'Yes']
Were_overweight
[nan 'No' 'Yes']
Were_obese
[nan 'No' 'Yes']

The dataset contains 56,981 observations and 18 variables.

df.head()

	record	Bullied_on_school_property_in_past_12_months	Bullied_not_on_school_property_in_past_12_months	Cyber_bullied_in_past_12_months	Custom_Age	Sex	Physical_fighting	Felt_lonely	Close_friends	Miss_school_no_permission	Other_students_kind_and_helpful	Parents_understand_problems	Most_of_the_time_or_always_felt_lonely	Missed_classes_or_school_without_permission	Were_underweight	Were_overweight	Were_obese
0	1	Yes	Yes	NaN	13 years old	Female	0 times	Always	2	10 or more days	Never	Always	Yes	Yes	NaN	NaN	NaN
1	2	No	No	No	13 years old	Female	0 times	Never	3 or more	0 days	Sometimes	Always	No	No	NaN	NaN	NaN
2	3	No	No	No	14 years old	Male	0 times	Never	3 or more	0 days	Sometimes	Always	No	No	No	No	No
3	4	No	No	No	16 years old	Male	2 or 3 times	Never	3 or more	0 days	Sometimes	NaN	No	No	No	No	No
4	5	No	No	No	13 years old	Female	0 times	Rarely	3 or more	0 days	Most of the time	Most of the time	No	No	NaN	NaN	NaN

An overview of variable types and the number of non-missing observations is obtained below.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56981 entries, 0 to 56980
Data columns (total 18 columns):
 #   Column                                            Non-Null Count  Dtype 
---  ------                                            --------------  ----- 
 0   record                                            56981 non-null  int64 
 1   Bullied_on_school_property_in_past_12_months      55742 non-null  object
 2   Bullied_not_on_school_property_in_past_12_months  56492 non-null  object
 3   Cyber_bullied_in_past_12_months                   56410 non-null  object
 4   Custom_Age                                        56873 non-null  object
 5   Sex                                               56445 non-null  object
 6   Physically_attacked                               56741 non-null  object
 7   Physical_fighting                                 56713 non-null  object
 8   Felt_lonely                                       56615 non-null  object
 9   Close_friends                                     55905 non-null  object
 10  Miss_school_no_permission                         55117 non-null  object
 11  Other_students_kind_and_helpful                   55422 non-null  object
 12  Parents_understand_problems                       54608 non-null  object
 13  Most_of_the_time_or_always_felt_lonely            56615 non-null  object
 14  Missed_classes_or_school_without_permission       55117 non-null  object
 15  Were_underweight                                  36052 non-null  object
 16  Were_overweight                                   36052 non-null  object
 17  Were_obese                                        36052 non-null  object
dtypes: int64(1), object(17)
memory usage: 7.8+ MB

Most variables are categorical and stored as character strings, which is appropriate for subsequent transformation into binary indicators.

Missing values analysis

Next, we examine the amount of missing data in each variable.

df.isnull().sum().sort_values(ascending=False)

Were_underweight                                    20929
Were_obese                                          20929
Were_overweight                                     20929
Parents_understand_problems                          2373
Missed_classes_or_school_without_permission          1864
Miss_school_no_permission                            1864
Other_students_kind_and_helpful                      1559
Bullied_on_school_property_in_past_12_months         1239
Close_friends                                        1076
Cyber_bullied_in_past_12_months                       571
Sex                                                   536
Bullied_not_on_school_property_in_past_12_months      489
Most_of_the_time_or_always_felt_lonely                366
Felt_lonely                                           366
Physical_fighting                                     268
Physically_attacked                                   240
Custom_Age                                            108
record                                                  0
dtype: int64

Three variables related to weight status (Were_underweight, Were_overweight, Were_obese) are missing for 20,929 of the 56,981 observations, or approximately 37%. Although such a level of missingness poses challenges for association rule mining, these variables are theoretically important, as weight status is linked to peer victimization, stigma, and psychosocial vulnerability.

Rather than excluding these variables entirely, we adopt an alternative strategy and perform the main analysis on a reduced dataset consisting of observations with available weight status information. This choice represents a deliberate trade-off between sample size and conceptual richness: while the number of transactions is reduced, the resulting dataset allows the inclusion of potentially important individual characteristics that may contribute to bullying involvement.

For the remaining variables, which contain relatively small proportions of missing values, observations with missing entries are removed in later preprocessing steps. Since association rule mining relies on the presence or absence of items within transactions, retaining only complete cases ensures a consistent transactional representation and avoids introducing artificial co-occurrence patterns through imputation.
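The missing-value shares discussed above follow directly from the reported counts. The short sketch below recomputes them for three of the variables; the counts are copied from the outputs shown earlier, and the variable names `n_rows`, `missing`, and `missing_share` are introduced here only for illustration.

```python
# Counts copied from the df.info() and df.isnull().sum() outputs above.
n_rows = 56981
missing = {
    "Were_underweight": 20929,
    "Parents_understand_problems": 2373,
    "Custom_Age": 108,
}

# Share of rows with a missing value, per variable.
missing_share = {name: count / n_rows for name, count in missing.items()}

for name, share in missing_share.items():
    print(f"{name}: {share:.1%}")
```

On the full data frame, the same shares for all columns could be obtained with `df.isnull().mean()`.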

Binary encoding and transaction creation

In association rule mining, each observation must be represented as a transaction containing a set of items. Therefore, categorical survey responses are transformed into binary indicators (0/1), where value 1 denotes the presence of a given attribute for a student. Only complete observations are retained to ensure consistent transactional representation.

df = df.dropna()
df.shape

(32938, 18)

Converting responses to binary format

Survey variables differ in their measurement scales (binary, ordinal, and frequency-based). Therefore, instead of applying a single uniform transformation, variable-specific recoding rules are used. The objective is to preserve meaningful distinctions in response intensity while avoiding excessive sparsity that could negatively affect support in association rule mining.

Variables describing physical attacks and participation in physical fights are not used as antecedent features. These behaviors conceptually overlap with bullying and can be interpreted as manifestations or direct consequences of bullying rather than independent risk factors. Excluding them allows the analysis to focus on psychosocial, behavioral, and individual vulnerability factors that may precede or co-occur with bullying involvement.

Binary Yes/No variables are mapped directly to 0/1 indicators. The two ordinal support variables (perceived kindness of other students and parental understanding) are collapsed into low (Never, Rarely) and elevated (Sometimes, Most of the time, Always) levels. The frequency-based variables are each discretized into three ordered categories: school absence into none, occasional (1 to 5 days), and frequent (6 or more days), and number of close friends into none, few (1 or 2), and many (3 or more).

Age is extracted from the original textual format and discretized into four groups (11–13, 14–15, 16–17, and 18 or older), where the youngest group includes the response "11 years old or younger". Finally, the multi-category variables are converted into binary indicators using one-hot encoding.

This preprocessing strategy represents a compromise between full granularity of the original survey scales and overly simplistic binary encoding, and is intended to produce stable and interpretable association rules.
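To make the age recoding concrete, the sketch below mirrors in plain Python the regular expression and the right-closed bin edges used in the encoding code that follows. It is an illustration only, not the code used in the analysis; the helper `age_group` and the constant `BINS` are hypothetical names.

```python
import re

# Right-closed intervals matching the bin edges [10, 13, 15, 17, 20] used with pd.cut.
BINS = [
    (10, 13, "Age_11_13"),
    (13, 15, "Age_14_15"),
    (15, 17, "Age_16_17"),
    (17, 20, "Age_18_plus"),
]

def age_group(label):
    """Map a raw Custom_Age string to its age-group label (None if no number is found)."""
    match = re.search(r"(\d+)", label)
    if match is None:
        return None
    age = int(match.group(1))
    for low, high, name in BINS:
        if low < age <= high:
            return name
    return None

for raw in ["11 years old or younger", "13 years old", "14 years old", "18 years old or older"]:
    print(raw, "->", age_group(raw))
```

Because only the first number in the response is extracted, the open-ended answers "11 years old or younger" and "18 years old or older" are treated as ages 11 and 18 and fall into the lowest and highest groups, respectively.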

df_enc = df.copy()

# ============================================================
# NOTE: Physical violence variables are intentionally excluded
# ============================================================


# ========== School absence ==========
def recode_absence(x):
    if x == "0 days":
        return "None"
    elif x in ["1 or 2 days","3 to 5 days"]:
        return "Occasional"
    elif x in ["6 to 9 days","10 or more days"]:
        return "Frequent"
    else:
        return np.nan

df_enc["School_absence_level"] = df_enc["Miss_school_no_permission"].apply(recode_absence)


# ========== Friends ==========
def recode_friends(x):
    if x == "0":
        return "None"
    elif x in ["1","2"]:
        return "Few"
    elif x == "3 or more":
        return "Many"
    else:
        return np.nan

df_enc["Friends_level"] = df_enc["Close_friends"].apply(recode_friends)


# ========== Any bullying (target) ==========
df_enc["Any_bullying"] = (
    (df_enc["Bullied_on_school_property_in_past_12_months"]=="Yes") |
    (df_enc["Bullied_not_on_school_property_in_past_12_months"]=="Yes") |
    (df_enc["Cyber_bullied_in_past_12_months"]=="Yes")
).astype(int)


# ========== Loneliness ==========
df_enc["Lonely"] = (
    df_enc["Most_of_the_time_or_always_felt_lonely"]=="Yes"
).astype(int)


# ========== Sex ==========
df_enc["Sex_Female"] = (df_enc["Sex"]=="Female").astype(int)


# ========== Age ==========
df_enc["Age_num"] = df_enc["Custom_Age"].str.extract(r"(\d+)").astype(float)

df_enc["Age_group"] = pd.cut(
    df_enc["Age_num"],
    bins=[10,13,15,17,20],
    labels=["Age_11_13","Age_14_15","Age_16_17","Age_18_plus"]
)

df_enc = pd.get_dummies(df_enc, columns=["Age_group"], dtype=int)
df_enc = df_enc.drop(columns=["Custom_Age","Age_num"])


# ========== Support ==========
support_map = {
    "Never":0,
    "Rarely":0,
    "Sometimes":1,
    "Most of the time":1,
    "Always":1
}

df_enc["Peers_support"] = df_enc["Other_students_kind_and_helpful"].map(support_map)
df_enc["Parents_support"] = df_enc["Parents_understand_problems"].map(support_map)

df_enc = df_enc.drop(columns=[
    "Other_students_kind_and_helpful",
    "Parents_understand_problems"
])

# ========== Weight status (binary) ==========
df_enc["Excess_weight"] = (
    (df_enc["Were_overweight"]=="Yes") | 
    (df_enc["Were_obese"]=="Yes")
).astype(int)

df_enc["Underweight"] = (
    df_enc["Were_underweight"]=="Yes"
).astype(int)


# ========== Drop raw columns ==========
df_enc = df_enc.drop(columns=[
    "Physically_attacked",
    "Physical_fighting",
    "Miss_school_no_permission",
    "Close_friends",
    "Bullied_on_school_property_in_past_12_months",
    "Bullied_not_on_school_property_in_past_12_months",
    "Cyber_bullied_in_past_12_months",
    "Most_of_the_time_or_always_felt_lonely",
    "Sex",
    "Felt_lonely",
    "Missed_classes_or_school_without_permission",
    "record",
    "Were_underweight",
    "Were_overweight",
    "Were_obese"
], errors="ignore")



# ========== One-hot encoding ==========
df_enc = pd.get_dummies(
    df_enc,
    columns=["School_absence_level", "Friends_level"],
    dtype=int
)


df_enc = df_enc.dropna()
df_enc.head()

	Any_bullying	Lonely	Age_group_Age_11_13	Age_group_Age_14_15	Peers_support	Parents_support	Excess_weight	School_absence_level_None	School_absence_level_Occasional	Friends_level_Few	Friends_level_Many
2	0	0	0	1	1	1	0	1	0	0	1
5	0	0	1	0	1	1	0	1	0	0	1
10	0	0	0	1	1	1	0	0	1	0	1
22	1	1	1	0	0	1	0	1	0	0	1
23	0	1	0	1	1	1	1	1	0	1	0

Creating transactions

To avoid generating rules dominated by the sex variable, the dataset was stratified by sex and transactions were created separately for girls and boys. This allows identification of gender-specific association patterns.

df_girls = df_enc[df_enc["Sex_Female"] == 1].drop(columns=["Sex_Female"])
df_boys  = df_enc[df_enc["Sex_Female"] == 0].drop(columns=["Sex_Female"])

def make_transactions(df):
    return [list(row[row == 1].index) for _, row in df.iterrows()]

trans_girls = make_transactions(df_girls)
trans_boys  = make_transactions(df_boys)
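The conversion from binary rows to transactions can be illustrated on a toy example. The sketch below uses two invented rows, not survey records, and a hypothetical helper `to_transaction` that applies the same rule as `make_transactions`: an item enters the transaction only when its indicator equals 1.

```python
# Two hypothetical encoded rows (column -> 0/1); the values are invented for illustration.
rows = [
    {"Any_bullying": 1, "Lonely": 1, "Peers_support": 0, "Friends_level_Few": 1},
    {"Any_bullying": 0, "Lonely": 0, "Peers_support": 1, "Friends_level_Few": 0},
]

def to_transaction(row):
    """Keep only the items whose indicator equals 1."""
    return [item for item, flag in row.items() if flag == 1]

transactions = [to_transaction(r) for r in rows]
print(transactions)
```

One consequence of this representation is that an attribute coded 0 (for example Peers_support = 0) is simply absent from the transaction, so the mined rules can only refer to the presence of attributes, never explicitly to their absence.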

Item frequency analysis

Before mining frequent itemsets and association rules, we examine how often individual items occur in the dataset. This step helps to understand the prevalence of different behaviors and to select appropriate support thresholds.

For girls, the most frequent single items indicate generally favorable social environments and low levels of school absence. The highest supports were observed for School_absence_level_None (0.73), Peers_support (0.72), and Friends_level_Many (0.66). At the same time, approximately 46% of girls reported experiencing at least one form of bullying (Any_bullying), while around 22% reported frequent loneliness. Excess body weight was present in about 26% of girls, whereas underweight status was relatively rare (1.7%). These results suggest substantial heterogeneity in psychosocial and individual vulnerability factors within the female subgroup.

from collections import Counter

item_counts_girls = Counter()
for t in trans_girls:
    item_counts_girls.update(t)

item_freq_girls = pd.DataFrame.from_dict(
    item_counts_girls, orient="index", columns=["count"]
)
item_freq_girls["support"] = item_freq_girls["count"] / len(trans_girls)

item_freq_girls.sort_values("support", ascending=False).head(15)

	count	support
School_absence_level_None	12910	0.730245
Peers_support	12744	0.720855
Friends_level_Many	11680	0.660671
Parents_support	10323	0.583913
Any_bullying	8053	0.455512
Age_group_Age_14_15	7708	0.435998
Age_group_Age_16_17	6917	0.391255
Friends_level_Few	5041	0.285141
Excess_weight	4527	0.256067
Lonely	3971	0.224617
School_absence_level_Occasional	3943	0.223033
Age_group_Age_11_13	2946	0.166638
Friends_level_None	958	0.054189
School_absence_level_Frequent	826	0.046722
Underweight	292	0.016517

Among boys, a similar pattern emerges for the most prevalent attributes. The most frequent items include Peers_support (0.76), Friends_level_Many (0.74), and School_absence_level_None (0.69). Bullying involvement is reported by approximately 35% of boys, which is lower than among girls. Excess body weight is more common in boys (34%) than in girls, while underweight status remains infrequent (2.4%). Notably, loneliness is less prevalent among boys (9%) than among girls (22%), suggesting potential sex differences in emotional distress.

item_counts_boys = Counter()
for t in trans_boys:
    item_counts_boys.update(t)

item_freq_boys = pd.DataFrame.from_dict(
    item_counts_boys, orient="index", columns=["count"]
)
item_freq_boys["support"] = item_freq_boys["count"] / len(trans_boys)

item_freq_boys.sort_values("support", ascending=False).head(15)

	count	support
Peers_support	11618	0.761387
Friends_level_Many	11258	0.737794
School_absence_level_None	10556	0.691788
Parents_support	9594	0.628744
Age_group_Age_14_15	6608	0.433056
Age_group_Age_16_17	6169	0.404286
Any_bullying	5332	0.349433
Excess_weight	5182	0.339603
School_absence_level_Occasional	3951	0.258929
Friends_level_Few	3206	0.210106
Age_group_Age_11_13	2359	0.154597
Lonely	1419	0.092994
Friends_level_None	795	0.052100
School_absence_level_Frequent	752	0.049282
Underweight	362	0.023724

In addition to single-item frequencies, frequent 2-itemsets were analyzed to identify commonly co-occurring attributes. For both girls and boys, the most frequent pairs involve combinations of positive social indicators, such as (Peers_support, School_absence_level_None) and (Friends_level_Many, Peers_support), with supports exceeding 0.50 in both subgroups. These results indicate that supportive peer environments and regular school attendance tend to co-occur. Pairs involving Any_bullying appear with substantially lower support than pairs reflecting positive social conditions, but they are nonetheless present among the most frequent combinations. For girls, notable examples include (Any_bullying, School_absence_level_None) and (Any_bullying, Peers_support), each with support around 0.30. For boys, bullying-related pairs occur less frequently, consistent with the lower overall prevalence of bullying in this subgroup.

from itertools import combinations

pair_counts_girls = Counter()

for t in trans_girls:
    for pair in combinations(sorted(t), 2):
        pair_counts_girls.update([pair])

pair_freq_girls = pd.DataFrame.from_dict(
    pair_counts_girls, orient="index", columns=["count"]
)

pair_freq_girls["support"] = pair_freq_girls["count"] / len(trans_girls)

pair_freq_girls.sort_values("support", ascending=False).head(15)

	count	support
(Peers_support, School_absence_level_None)	9561	0.540811
(Friends_level_Many, Peers_support)	8922	0.504667
(Friends_level_Many, School_absence_level_None)	8660	0.489847
(Parents_support, Peers_support)	8114	0.458963
(Parents_support, School_absence_level_None)	7907	0.447254
(Friends_level_Many, Parents_support)	7248	0.409978
(Age_group_Age_14_15, School_absence_level_None)	5851	0.330958
(Any_bullying, School_absence_level_None)	5522	0.312348
(Age_group_Age_14_15, Peers_support)	5479	0.309916
(Any_bullying, Peers_support)	5349	0.302562
(Age_group_Age_14_15, Friends_level_Many)	5320	0.300922
(Any_bullying, Friends_level_Many)	5074	0.287007
(Age_group_Age_16_17, Peers_support)	5010	0.283387
(Age_group_Age_16_17, School_absence_level_None)	4615	0.261044
(Age_group_Age_14_15, Parents_support)	4466	0.252616

pair_counts_boys = Counter()

for t in trans_boys:
    for pair in combinations(sorted(t), 2):
        pair_counts_boys.update([pair])

pair_freq_boys = pd.DataFrame.from_dict(
    pair_counts_boys, orient="index", columns=["count"]
)

pair_freq_boys["support"] = pair_freq_boys["count"] / len(trans_boys)

pair_freq_boys.sort_values("support", ascending=False).head(15)

	count	support
(Friends_level_Many, Peers_support)	8917	0.584376
(Peers_support, School_absence_level_None)	8217	0.538502
(Parents_support, Peers_support)	7890	0.517072
(Friends_level_Many, School_absence_level_None)	7882	0.516548
(Friends_level_Many, Parents_support)	7418	0.486139
(Parents_support, School_absence_level_None)	6923	0.453699
(Age_group_Age_14_15, Peers_support)	4974	0.325972
(Age_group_Age_14_15, Friends_level_Many)	4952	0.324530
(Age_group_Age_16_17, Peers_support)	4786	0.313651
(Age_group_Age_14_15, School_absence_level_None)	4777	0.313061
(Age_group_Age_16_17, Friends_level_Many)	4412	0.289141
(Age_group_Age_14_15, Parents_support)	4208	0.275772
(Excess_weight, Peers_support)	3915	0.256570
(Age_group_Age_16_17, School_absence_level_None)	3911	0.256308
(Excess_weight, Friends_level_Many)	3793	0.248575

Overall, the single-item and pair frequency analyses reveal broadly similar structural patterns across sexes, while also highlighting differences in the prevalence of loneliness and weight-related characteristics. These findings motivate the use of a relatively low minimum support threshold in subsequent association rule mining and provide an empirical foundation for interpreting multi-item rules.

Association rule mining configuration

Frequent itemsets were mined with the FP-Growth algorithm, and association rules were then derived from them. The following parameter settings were applied:

Minimum support = 0.01 – to exclude extremely rare patterns while preserving sufficient coverage of the population.
Minimum confidence = 0.6 – to retain rules with reasonable predictive strength.
Minimum lift = 1.3 – to focus on associations that exceed chance co-occurrence.
Maximum antecedent length = 3 and single-item consequents – to ensure interpretability of the resulting rules.
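As a reminder of what the three thresholds above measure, the following minimal sketch computes support, confidence, and lift for a single rule on five invented transactions. The data and the helper `support` are hypothetical and serve only to show the definitions; they are not part of the analysis.

```python
# Five invented transactions, used only to illustrate the definitions.
transactions = [
    {"Lonely", "Any_bullying"},
    {"Lonely", "Any_bullying", "Peers_support"},
    {"Lonely", "Peers_support"},
    {"Peers_support"},
    {"Any_bullying", "Peers_support"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"Lonely"}
consequent = {"Any_bullying"}

rule_support = support(antecedent | consequent, transactions)   # P(A and C)
confidence = rule_support / support(antecedent, transactions)   # P(C | A)
lift = confidence / support(consequent, transactions)           # P(C | A) / P(C)

print(f"support={rule_support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift of 1 corresponds to statistical independence between antecedent and consequent, which is why a threshold above 1 filters out rules that merely reflect a frequent consequent.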

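from mlxtend.frequent_patterns import association_rules, fpgrowth
from mlxtend.preprocessing import TransactionEncoder

def mine_rules(transactions, min_support=0.01):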
    te = TransactionEncoder()
    arr = te.fit(transactions).transform(transactions)
    df_tf = pd.DataFrame(arr, columns=te.columns_)

    frequent_itemsets = fpgrowth(
        df_tf,
        min_support=min_support,
        use_colnames=True
    )

    rules = association_rules(
        frequent_itemsets,
        metric="confidence",
        min_threshold=0.6
    )

    rules = rules[
        (rules["lift"] >= 1.3)
    ].copy()

    rules["antecedent_len"] = rules["antecedents"].apply(len)
    rules["consequent_len"] = rules["consequents"].apply(len)

    rules = rules[
        (rules["antecedent_len"] <= 3) &
        (rules["consequent_len"] == 1)
    ]

    return rules

rules_girls = mine_rules(trans_girls)
rules_boys  = mine_rules(trans_boys)

FP-Growth and association rule mining were performed separately for girls and boys using identical parameter settings to ensure comparability of results.

Sensitivity analysis of minimum support threshold

To assess the impact of the minimum support threshold on the number of discovered rules, a sensitivity analysis was performed for selected support values (0.01, 0.02, 0.03, and 0.05) separately for girls and boys.

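for label, transactions in [("girls", trans_girls), ("boys", trans_boys)]:
    for s in [0.01, 0.02, 0.03, 0.05]:
        # Number of rules that pass all filters at this support threshold
        n_rules = len(mine_rules(transactions, min_support=s))
        print(label, s, n_rules)
    print()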

For girls, decreasing the minimum support threshold leads to a substantial increase in the number of discovered rules (53 rules for support = 0.01, 35 for 0.02, 22 for 0.03, and 10 for 0.05). This monotonic pattern indicates a clear trade-off between rule richness and rule strictness.

For boys, bullying-related rules emerge only at the lowest support threshold (0.01), while higher thresholds result in no rules. This suggests that associations related to bullying among boys are weaker and less frequent, and therefore require more permissive support settings to be detected.

Based on these results, a minimum support threshold of 0.01 was selected for the main analysis as a compromise between retaining a sufficient number of rules and avoiding extremely rare patterns.

Bullying-related association rules

Rules with Any_bullying as the consequent were extracted in order to identify combinations of factors associated with bullying involvement. The analysis was performed separately for girls and boys.

Among girls, the strongest rules consistently include loneliness as a central component of the antecedent. The highest-lift rules involve combinations of loneliness with school absence and additional psychosocial or individual characteristics. For example, the combination (Parents_support, School_absence_level_Occasional, Lonely) is associated with bullying with a confidence of approximately 0.74 and a lift of 1.62. Similarly, (Age_group_Age_14_15, School_absence_level_Occasional, Lonely) and (School_absence_level_Frequent, Lonely) yield confidence values above 0.73 and lift values exceeding 1.6.

Rules with two-item antecedents also highlight the importance of loneliness. The rule (School_absence_level_Occasional, Lonely) → Any_bullying exhibits a confidence of 0.72 and a lift of 1.57, indicating that girls who feel lonely and occasionally miss school are substantially more likely to report bullying experiences than the average girl in the sample.

Weight-related characteristics appear in several high-ranking rules. For instance, combinations such as (Parents_support, Excess_weight, Lonely) and (Age_group_Age_14_15, Excess_weight, Lonely) are associated with bullying with confidence values around 0.70 and lift values close to 1.55. This suggests that excess body weight may amplify the association between emotional distress and bullying involvement.

Overall, the results indicate that bullying among girls is most strongly associated with profiles characterized by emotional vulnerability (loneliness), combined with school disengagement and, in some cases, excess body weight.

girls_bully = rules_girls[
    rules_girls["consequents"]
    .astype(str)
    .str.contains("Any_bullying")
]

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

girls_bully.sort_values("lift", ascending=False).head(10)

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski	antecedent_len	consequent_len
510	(Lonely, School_absence_level_Occasional, Parents_support)	(Any_bullying)	0.019119	0.455512	0.014141	0.739645	1.623765	1.0	0.005432	2.091328	0.391635	0.030709	0.521835	0.385345	3	1
503	(Lonely, Age_group_Age_14_15, School_absence_level_Occasional)	(Any_bullying)	0.025171	0.455512	0.018553	0.737079	1.618132	1.0	0.007087	2.070915	0.391867	0.040147	0.517122	0.388904	3	1
813	(Lonely, School_absence_level_Frequent)	(Any_bullying)	0.015951	0.455512	0.011709	0.734043	1.611466	1.0	0.004443	2.047274	0.385598	0.025468	0.511546	0.379874	2	1
500	(Lonely, Friends_level_Many, School_absence_level_Occasional)	(Any_bullying)	0.029809	0.455512	0.021438	0.719165	1.578805	1.0	0.007859	1.938818	0.377874	0.046214	0.484222	0.383114	3	1
506	(Lonely, Friends_level_Few, School_absence_level_Occasional)	(Any_bullying)	0.023418	0.455512	0.016800	0.717391	1.574911	1.0	0.006133	1.926649	0.373797	0.036353	0.480964	0.377136	3	1
499	(Lonely, School_absence_level_Occasional)	(Any_bullying)	0.059110	0.455512	0.042367	0.716746	1.573496	1.0	0.015441	1.922263	0.387370	0.089711	0.479780	0.404878	2	1
667	(Lonely, Excess_weight, Parents_support)	(Any_bullying)	0.021721	0.455512	0.015385	0.708333	1.555026	1.0	0.005491	1.866815	0.364849	0.033313	0.464328	0.371055	3	1
507	(Lonely, School_absence_level_Occasional, Excess_weight)	(Any_bullying)	0.016517	0.455512	0.011652	0.705479	1.548761	1.0	0.004129	1.848726	0.360273	0.025310	0.459087	0.365530	3	1
508	(Lonely, Age_group_Age_16_17, School_absence_level_Occasional)	(Any_bullying)	0.027151	0.455512	0.019062	0.702083	1.541305	1.0	0.006695	1.827651	0.361001	0.041118	0.452850	0.371966	3	1
676	(Lonely, Age_group_Age_14_15, Excess_weight)	(Any_bullying)	0.027094	0.455512	0.019006	0.701461	1.539940	1.0	0.006664	1.823844	0.360388	0.040996	0.451707	0.371592	3	1
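The metric columns in the table are linked by simple formulas. As a check, the sketch below recomputes confidence, lift, leverage, and conviction for rule 499, (Lonely, School_absence_level_Occasional) → Any_bullying, from the three support values in its row. The variable names are chosen here for illustration. The results agree with the tabulated values up to rounding of the inputs.

```python
# Values copied from row 499 of the girls' table:
# (Lonely, School_absence_level_Occasional) -> Any_bullying.
antecedent_support = 0.059110
consequent_support = 0.455512
joint_support = 0.042367

confidence = joint_support / antecedent_support                      # P(C | A)
lift = confidence / consequent_support                               # P(C | A) / P(C)
leverage = joint_support - antecedent_support * consequent_support   # P(A and C) - P(A) P(C)
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.2f} lift={lift:.2f} "
      f"leverage={leverage:.4f} conviction={conviction:.2f}")
```

Read this way, the lift of about 1.57 says that bullying is reported roughly 1.57 times as often in this subgroup as among girls overall (a confidence of about 0.72 against a base rate of about 0.46).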

For boys, considerably fewer bullying-related rules were discovered. Nevertheless, the identified rules reveal a pattern similar in structure to that observed among girls. The strongest rule is (School_absence_level_Occasional, Lonely, Age_group_Age_16_17) → Any_bullying, with a confidence of 0.64 and a lift of 1.83. Rules with two-item antecedents, such as (School_absence_level_Occasional, Lonely) and (Friends_level_Few, Lonely), also show elevated confidence (above 0.60) and lift values exceeding 1.7.

These findings suggest that, among boys, bullying involvement is primarily associated with loneliness combined with limited peer relationships or occasional school absence. However, only three rules pass the thresholds, each covering between 1% and 2% of boys, so these patterns are less frequent and rest on smaller subgroups than those found for girls, even though their lift values (1.72–1.83) are higher.

boys_bully = rules_boys[
    rules_boys["consequents"]
    .astype(str)
    .str.contains("Any_bullying")
]

boys_bully.sort_values("lift", ascending=False).head(10)

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski	antecedent_len	consequent_len
457	(Lonely, Age_group_Age_16_17, School_absence_level_Occasional)	(Any_bullying)	0.015663	0.349433	0.010027	0.640167	1.832017	1.0	0.004554	1.807971	0.461380	0.028239	0.446894	0.334431	3	1
456	(Lonely, School_absence_level_Occasional)	(Any_bullying)	0.028704	0.349433	0.017694	0.616438	1.764110	1.0	0.007664	1.696121	0.445942	0.049091	0.410419	0.333538	2	1
448	(Lonely, Friends_level_Few)	(Any_bullying)	0.031457	0.349433	0.018940	0.602083	1.723029	1.0	0.007948	1.634932	0.433256	0.052327	0.388354	0.328142	2	1
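
The metrics in these tables can be recomputed by hand, which is useful when comparing rules between the two groups. The sketch below uses only support values copied from the boys' and girls' result tables for the same antecedent set (Lonely, Age_group_Age_16_17, School_absence_level_Occasional). The helper name `rule_metrics` is an illustrative assumption and not part of the original analysis; the formulas are the standard definitions of confidence, lift, leverage, and conviction.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Recompute standard association rule metrics from three support values."""
    confidence = support / antecedent_support
    return {
        "confidence": confidence,
        "lift": confidence / consequent_support,
        "leverage": support - antecedent_support * consequent_support,
        "conviction": (1 - consequent_support) / (1 - confidence),
    }


# Support values copied from the result tables for the rule
# (Lonely, Age_group_Age_16_17, School_absence_level_Occasional) -> (Any_bullying)
boys = rule_metrics(0.015663, 0.349433, 0.010027)
girls = rule_metrics(0.027151, 0.455512, 0.019062)

for name, metrics in (("boys", boys), ("girls", girls)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

For this profile the confidence is higher among girls (about 0.70 versus 0.64), yet the lift is higher among boys (about 1.83 versus 1.54), because the baseline rate of Any_bullying is lower for boys (0.35 versus 0.46). Lift values should therefore be compared across the two groups with the differing consequent support in mind.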

Visualization of bullying-related association rules (girls)

To support the interpretation of the discovered association rules for girls, two complementary visualizations were produced.

The first visualization presents a scatter plot of support versus confidence, with point size proportional to lift. Each point corresponds to one association rule. The plot indicates that most bullying-related rules are characterized by relatively low to moderate support (approximately 0.01–0.08) and high confidence (around 0.60–0.74). Each rule therefore applies to a small subgroup of students, but within that subgroup bullying involvement is considerably more common than in the sample as a whole. Larger points, corresponding to higher lift values, are mainly concentrated in this region, indicating that the strongest rules exceed the baseline rate of bullying involvement by a clear margin. Lift is a descriptive measure, however, and does not by itself rule out chance co-occurrence.

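# Scatter plot of the bullying-related rules for girls:
# support on the x-axis, confidence on the y-axis, marker area scaled by lift.
plt.figure()
plt.scatter(
    girls_bully["support"],
    girls_bully["confidence"],
    s=girls_bully["lift"] * 20,
)
plt.xlabel("Support")
plt.ylabel("Confidence")
plt.title("Girls: Support vs Confidence (size = Lift)")
plt.show()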

The second visualization shows a bar chart of the most frequent antecedent items appearing in bullying-related rules. Lonely is by far the most common antecedent, appearing much more frequently than any other item. Other frequently occurring antecedents include Peers_support, Parents_support, Friends_level_Few, Friends_level_Many, School_absence_level_Occasional, the age groups 14–15 and 16–17, and Excess_weight. This distribution indicates that emotional distress, aspects of peer relationships, school attendance patterns, and body weight status are the dominant components of the high-risk profiles associated with bullying among girls.

cnt = Counter()
for ants in girls_bully["antecedents"]:
    cnt.update(list(ants))

pd.Series(cnt).sort_values(ascending=False).head(10).plot(kind="bar")
plt.title("Most frequent antecedent items (girls)")
plt.show()
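
Raw counts depend on how many rules were retained, so it can help to express each item's frequency as a share of rules. The following is a minimal sketch, assuming only the four girls' rules visible in the last rows of the girls' result table shown earlier, not the full rule set behind the bar chart; the variable names are illustrative.

```python
from collections import Counter

# Antecedents of four girls' rules copied from the result table
# (illustrative subset, not the full rule set used for the bar chart).
antecedents = [
    frozenset({"Lonely", "Excess_weight", "Parents_support"}),
    frozenset({"Lonely", "School_absence_level_Occasional", "Excess_weight"}),
    frozenset({"Lonely", "Age_group_Age_16_17", "School_absence_level_Occasional"}),
    frozenset({"Lonely", "Age_group_Age_14_15", "Excess_weight"}),
]

counts = Counter()
for items in antecedents:
    counts.update(items)  # each item is counted once per rule

# Share of rules in which each item appears, most frequent first.
shares = {item: n / len(antecedents) for item, n in counts.most_common()}
for item, share in shares.items():
    print(f"{item}: {share:.2f}")
```

In this small subset Lonely appears in every rule and Excess_weight in three of four, which mirrors the ordering described for the full set of rules.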

Together, these visualizations provide a concise summary of rule quality and structure, reinforcing the central role of loneliness and psychosocial factors in bullying-related association patterns.

Conclusion

This study applied association rule mining to adolescent health survey data in order to identify combinations of psychosocial, behavioral, and individual characteristics associated with bullying involvement. By transforming survey responses into transactional form and analyzing girls and boys separately, the analysis revealed clear and interpretable patterns.

Across both sexes, loneliness emerged as the most central factor appearing in bullying-related rules. Bullying involvement is most strongly associated with profiles that combine emotional distress with school disengagement and limited peer relationships. Among girls, excess body weight additionally appears in several high-lift rules, suggesting that weight-related vulnerability may intensify the association between loneliness and bullying.

The results demonstrate that association rule mining can uncover meaningful multi-factor profiles that go beyond simple bivariate relationships. Rather than identifying single predictors, this approach highlights how combinations of characteristics jointly characterize high-risk groups.

Overall, the findings emphasize the importance of addressing emotional well-being and school connectedness as key components of bullying prevention strategies.

Limitations and future work

Several limitations of this study should be acknowledged.

First, the analysis is based on cross-sectional self-reported data, which prevents any causal interpretation. The discovered rules describe co-occurrence patterns rather than directional effects.

Second, a substantial proportion of missing values in weight-related variables required restricting the analysis to complete cases, which reduces sample size and may introduce selection bias.

Third, association rule mining is sensitive to parameter choices such as minimum support and confidence. Although a sensitivity analysis was performed, different thresholds may yield alternative rule sets.

Fourth, the study focused on a limited subset of available survey variables. Incorporating additional contextual factors (e.g., family environment, mental health indicators, or school climate variables) could reveal richer patterns.
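
The threshold sensitivity noted in the third limitation can be illustrated with the rules reported in the results. The sketch below is an illustration under stated assumptions: it filters only the seven rules visible in the result tables (four for girls, three for boys) by several hypothetical minimum-confidence values, rather than re-running the mining step. The function name `surviving` and the threshold values are not part of the original analysis.

```python
# (group, confidence) pairs copied from the rules visible in the result tables.
rules = [
    ("girls", 0.708333), ("girls", 0.705479),
    ("girls", 0.702083), ("girls", 0.701461),
    ("boys", 0.640167), ("boys", 0.616438), ("boys", 0.602083),
]


def surviving(rules, min_confidence):
    """Count rules per group whose confidence is at least min_confidence."""
    counts = {"girls": 0, "boys": 0}
    for group, confidence in rules:
        if confidence >= min_confidence:
            counts[group] += 1
    return counts


# Hypothetical thresholds chosen for illustration only.
for threshold in (0.60, 0.65, 0.71):
    print(threshold, surviving(rules, threshold))
```

Raising the confidence threshold from 0.60 to 0.65 would remove all three boys' rules listed here while leaving the four girls' rules intact, which shows how strongly the contrast between the groups depends on the chosen cut-off.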

Future work could extend this analysis by:

* incorporating additional psychosocial and family-level variables
* conducting cross-country and cross-cultural analyses

	Any_bullying	Lonely	Age_group_Age_11_13	Age_group_Age_14_15	Peers_support	Parents_support	Excess_weight	School_absence_level_None	School_absence_level_Occasional	Friends_level_Few	Friends_level_Many
2	0	0	0	1	1	1	0	1	0	0	1
5	0	0	1	0	1	1	0	1	0	0	1
10	0	0	0	1	1	1	0	0	1	0	1
22	1	1	1	0	0	1	0	1	0	0	1
23	0	1	0	1	1	1	1	1	0	1	0
