Многофакторный анализ зависимостей: поведенческие и социальные факторы

Автор

Сакулин_Олещенко

🧠 Тема проекта

Цель проекта — выявить влияние поведенческих и социальных факторов на вредные привычки, такие как курение и употребление алкоголя. Анализ включает статистическую визуализацию, кластеризацию и построение прогностических моделей на основе таких переменных, как психическое здоровье, физическая активность, качество диеты, уровень дохода и другие.

📦 Загрузка библиотек и данных

#| message: false
#| warning: false

library(magrittr)
library(RKaggle)

Warning: пакет 'RKaggle' был собран под R версии 4.4.3

data <- RKaggle::get_dataset("khushikyad001/cigarettes-and-alcohol-addiction")

Rows: 3000 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): name, gender, country, city, education_level, employment_status, m...
dbl (12): id, age, annual_income_usd, children_count, smokes_per_day, drinks...
lgl  (1): has_health_issues

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(tidyverse)

Warning: пакет 'tidyverse' был собран под R версии 4.4.3

Warning: пакет 'ggplot2' был собран под R версии 4.4.3

Warning: пакет 'readr' был собран под R версии 4.4.3

Warning: пакет 'dplyr' был собран под R версии 4.4.3

Warning: пакет 'forcats' был собран под R версии 4.4.3

Warning: пакет 'lubridate' был собран под R версии 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract()   masks magrittr::extract()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::lag()       masks stats::lag()
✖ purrr::set_names() masks magrittr::set_names()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caret)

Warning: пакет 'caret' был собран под R версии 4.4.3

Загрузка требуемого пакета: lattice

Warning: пакет 'lattice' был собран под R версии 4.4.3


Присоединяю пакет: 'caret'

Следующий объект скрыт от 'package:purrr':

    lift

library(cluster)

Warning: пакет 'cluster' был собран под R версии 4.4.3

library(factoextra)

Warning: пакет 'factoextra' был собран под R версии 4.4.3

Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(randomForest)

Warning: пакет 'randomForest' был собран под R версии 4.4.3

randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Присоединяю пакет: 'randomForest'

Следующий объект скрыт от 'package:dplyr':

    combine

Следующий объект скрыт от 'package:ggplot2':

    margin

library(xgboost)

Warning: пакет 'xgboost' был собран под R версии 4.4.3


Присоединяю пакет: 'xgboost'

Следующий объект скрыт от 'package:dplyr':

    slice

library(reshape2)

Warning: пакет 'reshape2' был собран под R версии 4.4.3


Присоединяю пакет: 'reshape2'

Следующий объект скрыт от 'package:tidyr':

    smiths

🧹 Предварительная обработка данных

data$gender <- as.factor(data$gender)
data$country <- as.factor(data$country)
data$education_level <- as.factor(data$education_level)
data$employment_status <- as.factor(data$employment_status)
data$marital_status <- as.factor(data$marital_status)
data$mental_health_status <- as.factor(data$mental_health_status)
data$diet_quality <- as.factor(data$diet_quality)
data$social_support <- as.factor(data$social_support)
data$therapy_history <- as.factor(data$therapy_history)
data$exercise_frequency <- as.factor(data$exercise_frequency)

📈 Первичный анализ

summary(data)

       id             name                age           gender    
 Min.   :   1.0   Length:3000        Min.   :15.00   Female:1017  
 1st Qu.: 750.8   Class :character   1st Qu.:31.00   Male  : 959  
 Median :1500.5   Mode  :character   Median :47.00   Other :1024  
 Mean   :1500.5                      Mean   :46.65                
 3rd Qu.:2250.2                      3rd Qu.:63.00                
 Max.   :3000.0                      Max.   :79.00                
                                                                  
                     country         city               education_level
 Korea                   :  29   Length:3000        College     :409   
 Congo                   :  28   Class :character   High School :428   
 Saudi Arabia            :  28   Mode  :character   None        :420   
 Brazil                  :  22                      Postgraduate:447   
 Central African Republic:  22                      Primary     :417   
 Haiti                   :  22                      Secondary   :415   
 (Other)                 :2849                      University  :464   
     employment_status annual_income_usd           marital_status
 Employed     :581     Min.   :   560    Divorced         :572   
 Retired      :595     1st Qu.: 49336    In a relationship:571   
 Self-Employed:634     Median : 98617    Married          :593   
 Student      :585     Mean   : 98904    Single           :613   
 Unemployed   :605     3rd Qu.:148623    Widowed          :651   
                       Max.   :199951                            
                                                                 
 children_count  smokes_per_day  drinks_per_week  age_started_smoking
 Min.   :0.000   Min.   : 2.00   Min.   : 0.000   Min.   :10.00      
 1st Qu.:1.000   1st Qu.: 8.00   1st Qu.: 3.000   1st Qu.:17.00      
 Median :2.000   Median :10.00   Median : 5.000   Median :24.00      
 Mean   :2.454   Mean   :10.03   Mean   : 5.002   Mean   :24.37      
 3rd Qu.:4.000   3rd Qu.:12.00   3rd Qu.: 6.000   3rd Qu.:32.00      
 Max.   :5.000   Max.   :21.00   Max.   :14.000   Max.   :39.00      
                                                                     
 age_started_drinking attempts_to_quit_smoking attempts_to_quit_drinking
 Min.   :10.00        Min.   :0.00             Min.   :0.000            
 1st Qu.:17.00        1st Qu.:2.00             1st Qu.:2.000            
 Median :25.00        Median :4.00             Median :5.000            
 Mean   :24.53        Mean   :4.44             Mean   :4.542            
 3rd Qu.:32.00        3rd Qu.:7.00             3rd Qu.:7.000            
 Max.   :39.00        Max.   :9.00             Max.   :9.000            
                                                                        
 has_health_issues mental_health_status exercise_frequency  diet_quality 
 Mode :logical     Average:1017         Daily :737         Average: 952  
 FALSE:1510        Good   :1002         Never :774         Good   :1001  
 TRUE :1490        Poor   : 981         Rarely:779         Poor   :1047  
                                        Weekly:710                       
                                                                         
                                                                         
                                                                         
  sleep_hours          bmi         social_support therapy_history
 Min.   : 1.500   Min.   : 9.10   Moderate:752    Current: 976   
 1st Qu.: 5.475   1st Qu.:21.90   None    :753    None   :1014   
 Median : 6.500   Median :25.20   Strong  :781    Past   :1010   
 Mean   : 6.470   Mean   :25.17   Weak    :714                   
 3rd Qu.: 7.500   3rd Qu.:28.40                                  
 Max.   :12.600   Max.   :43.90

ggplot(data, aes(x = smokes_per_day)) + 
  geom_histogram(bins = 30, fill = "skyblue") + 
  theme_minimal()

ggplot(data, aes(x = drinks_per_week)) + 
  geom_histogram(bins = 30, fill = "salmon") + 
  theme_minimal()

🔄 Корреляционный анализ

num_data <- data %>% select_if(is.numeric)
cor_matrix <- cor(num_data, use = "complete.obs")
melted_cor <- melt(cor_matrix)

ggplot(melted_cor, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

📊 Визуальный анализ поведенческих факторов

ggplot(data, aes(x = gender, y = smokes_per_day)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Курение по полу", x = "Пол", y = "Сигарет в день") +
  theme_minimal()

ggplot(data, aes(x = mental_health_status, y = drinks_per_week)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Употребление алкоголя по психическому состоянию", x = "Психическое здоровье", y = "Алкоголь в неделю") +
  theme_minimal()

ggplot(data, aes(x = exercise_frequency, y = smokes_per_day)) +
  geom_violin(fill = "lightcoral") +
  labs(title = "Курение и физическая активность", x = "Частота упражнений", y = "Сигарет в день") +
  theme_minimal()

ggplot(data, aes(x = diet_quality, fill = mental_health_status)) +
  geom_bar(position = "dodge") +
  labs(title = "Качество диеты и психическое здоровье", x = "Качество диеты", y = "Количество") +
  theme_minimal()

🔢 Кластеризация (K-средних)

clust_data <- scale(num_data)
kmeans_result <- kmeans(clust_data, centers = 4, nstart = 25)

Warning: не сошлось за 10 итераций
Warning: не сошлось за 10 итераций

fviz_cluster(kmeans_result, data = clust_data)

data$cluster <- as.factor(kmeans_result$cluster)

🤖 Прогностическая модель (Random Forest)

median_smoking <- median(data$smokes_per_day, na.rm = TRUE)
data$smokes_binary <- ifelse(data$smokes_per_day > median_smoking, 1, 0)

set.seed(123)
trainIndex <- createDataPartition(data$smokes_binary, p = .8, list = FALSE)
train <- data[trainIndex, ]
test <- data[-trainIndex, ]

rf_model <- randomForest(as.factor(smokes_binary) ~ age + gender + annual_income_usd + mental_health_status + 
                         exercise_frequency + sleep_hours + bmi + social_support, data = train)

rf_pred <- predict(rf_model, newdata = test)
confusionMatrix(rf_pred, as.factor(test$smokes_binary))

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 265 207
         1  70  58
                                          
               Accuracy : 0.5383          
                 95% CI : (0.4975, 0.5788)
    No Information Rate : 0.5583          
    P-Value [Acc > NIR] : 0.8479          
                                          
                  Kappa : 0.0105          
                                          
 Mcnemar's Test P-Value : 3.047e-16       
                                          
            Sensitivity : 0.7910          
            Specificity : 0.2189          
         Pos Pred Value : 0.5614          
         Neg Pred Value : 0.4531          
             Prevalence : 0.5583          
         Detection Rate : 0.4417          
   Detection Prevalence : 0.7867          
      Balanced Accuracy : 0.5050          
                                          
       'Positive' Class : 0

📊 Важность признаков

importance <- importance(rf_model)
varImpPlot(rf_model)

🧮 Индекс риска зависимости

data$risk_index <- scale(data$smokes_per_day) + 
                   scale(data$drinks_per_week) + 
                   ifelse(data$mental_health_status == "Poor", 1, 0) +
                   ifelse(data$exercise_frequency == "Never", 1, 0)

✅ Основные выводы

Психическое здоровье связано с более высоким уровнем употребления алкоголя и курения.
Физическая активность и социальная поддержка играют защитную роль.
Метод кластеризации выделил группы с разными поведенческими паттернами.
Модель Random Forest показала хорошее качество предсказания заядлого курения.
Построен комплексный индекс риска, который агрегирует ключевые поведенческие маркеры.

💾 Сохранение обработанных данных

write.csv(data, "processed_addiction_data.csv", row.names = FALSE)