UK Smoking

Research Question

How do age,gender,marital status,educational level, and gross income predict the likelihood of smoking in the UK?

Introduction

The data set I am using titled “UK Smoking Data”, which comes from the https://www.openintro.org/data/index.php?data=smoking . The UK Smoking Data is a survey data set capturing smoking habits among adults in the United Kingdom. The data set contains 1691 observations and include 12 variables on demographic, socioeconomic status, and smoking behavior.I chose to work with this data set, because I have recently found myself really interested in crime-related,like this kinds of topics.

For this project I focused on five variables.

gender age marital_status highest_qualification gross_income

The age column shows the age of the individual in years, and it is a numerical variable. The gender columns shows whether the individual is male or female, and it is a categorical variable. The martial_status columns indicates the person’s martial situation (such as single, married,divorced,separated,widowed), and it is a categorical variable. The highest_qualification column represents the individual’s level of education, and it is a categorical variable. The gross_income columns shows the income group the individual belongs to, reported in income ranges, and it is a categorical variable.

Source

National STEM Centre, Large Datasets from stats4schools, https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools.

Data Analysis

Before building my model, I first cleaned and explored the data set to make sure I was working with reliable information. I checked for missing values and found that the variables I needed for my analysis did not contain any NA values, so no imputation was required. Since the outcome variable (smoke) had two categories, I converted it into a binary variable (0 = non-smoker, 1 = smoker) to fit a logistic regression model.Next, I created clearer and more meaningful categories for several predictors. I grouped ages into three categories (Under 30, 30–49, and 50+) so the variable would be easier to interpret. I also combined the detailed income ranges into broader groups (Low, Middle, High Income, and Not Reported) and simplified the highest_qualification variable into three levels (Low, Medium, and High Education). After renaming, I converted all predictors into factors so they would work correctly in the model.Once the variables were cleaned and renamed, I created a smaller data set containing only the variables needed for logistic regression. I used several dplyr functions—including mutate, case_when, and select—to organize and prepare the data. For quality control, I checked that each category had enough observations in both smoker and non-smoker groups. All predictors met this requirement, so no levels needed to be removed or merged.

Regression Analysis

For the analysis, I used a logistic regression model to examine whether age group, gender, martial status, education level, and gross income could predict the likelihood of smoking among adults in the UK. Since the smoking variable had two categories(smoker and non-smoker), it was converted into a binary outcome variable(0=non -smoker,1=smoker)to make it suitable for logistic regression. After cleaning the data and recoding the predictors into meaningful categories, I fit the final model using the selected variables. The model results showed that age group, education level, marital status, and income were important predictors of smoking behavior. In particular, individuals aged 50 and above were significantly less likely to smoker compared to aged 30-49. Education level was one of the strongest predictors,with individual had low or medium education showing much higher odds of smoking compared to those with high education. Income level was also meaningful, as individuals in the low- and middle-income groups were more likely to smoke than those in the high-income group. Marital status showed that married and widowed individuals had lower odds of smoking compared to divorced individuals. Gender had only a weak and marginally significant effect on smoking behavior.Although several predictors were statistically significant, The pseudo R² was low. It was approximately 0.078, meaning that the model explains only about 7.8% of the variation of smoking behavior. In addition, the confusion matrix and performance metrics showed that the model performed well in identifying non-smokers but struggled to correctly identify smokers, largely due to class imbalance in data set. Overall, while the model reveals meaningful relationships between factors and smoking, it is not very strong at predicting individual smoking behavior.

Loading the libraries

library(tidyverse)

Loading the data set

smoking <- read_csv("smoking_uk.csv")

To look at the data type and first 6 rows

str(smoking)

## spc_tbl_ [1,691 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender               : chr [1:1691] "Male" "Female" "Male" "Female" ...
##  $ age                  : num [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
##  $ marital_status       : chr [1:1691] "Divorced" "Single" "Married" "Married" ...
##  $ highest_qualification: chr [1:1691] "No Qualification" "No Qualification" "Degree" "Degree" ...
##  $ nationality          : chr [1:1691] "British" "British" "English" "English" ...
##  $ ethnicity            : chr [1:1691] "White" "White" "White" "White" ...
##  $ gross_income         : chr [1:1691] "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
##  $ region               : chr [1:1691] "The North" "The North" "The North" "The North" ...
##  $ smoke                : chr [1:1691] "No" "Yes" "No" "No" ...
##  $ amt_weekends         : num [1:1691] NA 12 NA NA NA NA 6 NA 8 15 ...
##  $ amt_weekdays         : num [1:1691] NA 12 NA NA NA NA 6 NA 8 12 ...
##  $ type                 : chr [1:1691] NA "Packets" NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   marital_status = col_character(),
##   ..   highest_qualification = col_character(),
##   ..   nationality = col_character(),
##   ..   ethnicity = col_character(),
##   ..   gross_income = col_character(),
##   ..   region = col_character(),
##   ..   smoke = col_character(),
##   ..   amt_weekends = col_double(),
##   ..   amt_weekdays = col_double(),
##   ..   type = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(smoking)

## # A tibble: 6 × 12
##   gender   age marital_status highest_qualification nationality ethnicity
##   <chr>  <dbl> <chr>          <chr>                 <chr>       <chr>    
## 1 Male      38 Divorced       No Qualification      British     White    
## 2 Female    42 Single         No Qualification      British     White    
## 3 Male      40 Married        Degree                English     White    
## 4 Female    40 Married        Degree                English     White    
## 5 Female    39 Married        GCSE/O Level          British     White    
## 6 Female    37 Married        GCSE/O Level          British     White    
## # ℹ 6 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## #   amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>

Checking NAs

colSums(is.na(smoking))

##                gender                   age        marital_status 
##                     0                     0                     0 
## highest_qualification           nationality             ethnicity 
##                     0                     0                     0 
##          gross_income                region                 smoke 
##                     0                     0                     0 
##          amt_weekends          amt_weekdays                  type 
##                  1270                  1270                  1270

We can see that there is no missing values in variables that is going to use in this project.

Cnverting smoking into binary outcome

In smoking variable there is only two categories, So we are converting it to 0 and 1

smoking <- smoking |>
  mutate(smoke_binary = ifelse(smoke == "No",0,1))
smoking

## # A tibble: 1,691 × 13
##    gender   age marital_status highest_qualification nationality ethnicity
##    <chr>  <dbl> <chr>          <chr>                 <chr>       <chr>    
##  1 Male      38 Divorced       No Qualification      British     White    
##  2 Female    42 Single         No Qualification      British     White    
##  3 Male      40 Married        Degree                English     White    
##  4 Female    40 Married        Degree                English     White    
##  5 Female    39 Married        GCSE/O Level          British     White    
##  6 Female    37 Married        GCSE/O Level          British     White    
##  7 Male      53 Married        Degree                British     White    
##  8 Male      44 Single         Degree                English     White    
##  9 Male      40 Single         GCSE/CSE              English     White    
## 10 Female    41 Married        No Qualification      English     White    
## # ℹ 1,681 more rows
## # ℹ 7 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## #   amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>, smoke_binary <dbl>

Renaming the age variable

I did this because there are many ages and now it is easy to read. (Especially when I did quality control there were so many categories in each of this variable, so that is the main reason I grouped them into three categories.)

smoking <- smoking |>
  mutate(age_group = case_when(
  age < 30 ~ "Under 30",
  age >= 30 & age < 50 ~ "30-49",
  age >= 50 ~ "50+")
)
smoking

## # A tibble: 1,691 × 14
##    gender   age marital_status highest_qualification nationality ethnicity
##    <chr>  <dbl> <chr>          <chr>                 <chr>       <chr>    
##  1 Male      38 Divorced       No Qualification      British     White    
##  2 Female    42 Single         No Qualification      British     White    
##  3 Male      40 Married        Degree                English     White    
##  4 Female    40 Married        Degree                English     White    
##  5 Female    39 Married        GCSE/O Level          British     White    
##  6 Female    37 Married        GCSE/O Level          British     White    
##  7 Male      53 Married        Degree                British     White    
##  8 Male      44 Single         Degree                English     White    
##  9 Male      40 Single         GCSE/CSE              English     White    
## 10 Female    41 Married        No Qualification      English     White    
## # ℹ 1,681 more rows
## # ℹ 8 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## #   amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>, smoke_binary <dbl>,
## #   age_group <chr>

For this age group, we can grouped them into different types of categories, but I chose this one.

Renaming the gross_income variable

smoking <- smoking |>
  mutate(income_group = case_when(
    gross_income %in% c("Under 2,600","2,600 to 5,200","5,200 to 10,400") ~ "Low Income",
    gross_income %in% c("10,400 to 15,600","15,600 to 20,800","20,800 to 28,600") ~ "Middle Income",
    gross_income %in% c("28,600 to 36,400","Above 36,400") ~ "High Income",
    gross_income %in% c("Refused","Unknown") ~ "Not Reported"
  ))
smoking

## # A tibble: 1,691 × 15
##    gender   age marital_status highest_qualification nationality ethnicity
##    <chr>  <dbl> <chr>          <chr>                 <chr>       <chr>    
##  1 Male      38 Divorced       No Qualification      British     White    
##  2 Female    42 Single         No Qualification      British     White    
##  3 Male      40 Married        Degree                English     White    
##  4 Female    40 Married        Degree                English     White    
##  5 Female    39 Married        GCSE/O Level          British     White    
##  6 Female    37 Married        GCSE/O Level          British     White    
##  7 Male      53 Married        Degree                British     White    
##  8 Male      44 Single         Degree                English     White    
##  9 Male      40 Single         GCSE/CSE              English     White    
## 10 Female    41 Married        No Qualification      English     White    
## # ℹ 1,681 more rows
## # ℹ 9 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## #   amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>, smoke_binary <dbl>,
## #   age_group <chr>, income_group <chr>

Renaming the highest_qualification variable

smoking <- smoking |>
  mutate(education = case_when(
    highest_qualification %in% c("No Qualification","GCSE/CSE","GCSE/O Level") ~ "Low Education",
    highest_qualification %in% c("A Levels","ONC/BTEC","Other/Sub Degree") ~ "Medium Education",
    highest_qualification %in% c("Degree","Higher/Sub Degree") ~ "High Education")
  )
smoking

## # A tibble: 1,691 × 16
##    gender   age marital_status highest_qualification nationality ethnicity
##    <chr>  <dbl> <chr>          <chr>                 <chr>       <chr>    
##  1 Male      38 Divorced       No Qualification      British     White    
##  2 Female    42 Single         No Qualification      British     White    
##  3 Male      40 Married        Degree                English     White    
##  4 Female    40 Married        Degree                English     White    
##  5 Female    39 Married        GCSE/O Level          British     White    
##  6 Female    37 Married        GCSE/O Level          British     White    
##  7 Male      53 Married        Degree                British     White    
##  8 Male      44 Single         Degree                English     White    
##  9 Male      40 Single         GCSE/CSE              English     White    
## 10 Female    41 Married        No Qualification      English     White    
## # ℹ 1,681 more rows
## # ℹ 10 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## #   amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>, smoke_binary <dbl>,
## #   age_group <chr>, income_group <chr>, education <chr>

Converting predictors into factors

smoking <- smoking |>
  mutate(gender = as.factor(gender),
         marital_status = as.factor(marital_status),
         education = as.factor(education),
         income_group = as.factor(income_group),
         age_group = as.factor(age_group))
str(smoking)

## tibble [1,691 × 16] (S3: tbl_df/tbl/data.frame)
##  $ gender               : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 2 2 1 ...
##  $ age                  : num [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
##  $ marital_status       : Factor w/ 5 levels "Divorced","Married",..: 1 4 2 2 2 2 2 4 4 2 ...
##  $ highest_qualification: chr [1:1691] "No Qualification" "No Qualification" "Degree" "Degree" ...
##  $ nationality          : chr [1:1691] "British" "British" "English" "English" ...
##  $ ethnicity            : chr [1:1691] "White" "White" "White" "White" ...
##  $ gross_income         : chr [1:1691] "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
##  $ region               : chr [1:1691] "The North" "The North" "The North" "The North" ...
##  $ smoke                : chr [1:1691] "No" "Yes" "No" "No" ...
##  $ amt_weekends         : num [1:1691] NA 12 NA NA NA NA 6 NA 8 15 ...
##  $ amt_weekdays         : num [1:1691] NA 12 NA NA NA NA 6 NA 8 12 ...
##  $ type                 : chr [1:1691] NA "Packets" NA NA ...
##  $ smoke_binary         : num [1:1691] 0 1 0 0 0 0 1 0 1 1 ...
##  $ age_group            : Factor w/ 3 levels "30-49","50+",..: 1 1 1 1 1 1 2 1 1 1 ...
##  $ income_group         : Factor w/ 4 levels "High Income",..: 2 2 1 3 2 3 1 3 2 2 ...
##  $ education            : Factor w/ 3 levels "High Education",..: 2 2 1 1 2 2 1 1 2 2 ...

Making the final clean data set

smoking_clean <- smoking |>
  select(smoke_binary,age_group,gender,marital_status,education,income_group)

smoking_clean

## # A tibble: 1,691 × 6
##    smoke_binary age_group gender marital_status education      income_group 
##           <dbl> <fct>     <fct>  <fct>          <fct>          <fct>        
##  1            0 30-49     Male   Divorced       Low Education  Low Income   
##  2            1 30-49     Female Single         Low Education  Low Income   
##  3            0 30-49     Male   Married        High Education High Income  
##  4            0 30-49     Female Married        High Education Middle Income
##  5            0 30-49     Female Married        Low Education  Low Income   
##  6            0 30-49     Female Married        Low Education  Middle Income
##  7            1 50+       Male   Married        High Education High Income  
##  8            0 30-49     Male   Single         High Education Middle Income
##  9            1 30-49     Male   Single         Low Education  Low Income   
## 10            1 30-49     Female Married        Low Education  Low Income   
## # ℹ 1,681 more rows

Quality control

xtabs(~smoke_binary +age_group,data = smoking_clean)

##             age_group
## smoke_binary 30-49 50+ Under 30
##            0   435 678      157
##            1   187 131      103

xtabs(~smoke_binary +gender,data = smoking_clean)

##             gender
## smoke_binary Female Male
##            0    731  539
##            1    234  187

xtabs(~smoke_binary +marital_status,data = smoking_clean)

##             marital_status
## smoke_binary Divorced Married Separated Single Widowed
##            0      103     669        46    269     183
##            1       58     143        22    158      40

xtabs(~smoke_binary +education,data = smoking_clean)

##             education
## smoke_binary High Education Low Education Medium Education
##            0            321           716              233
##            1             66           280               75

xtabs(~smoke_binary +income_group,data = smoking_clean)

##             income_group
## smoke_binary High Income Low Income Middle Income Not Reported
##            0         144        579           445          102
##            1          24        207           166           24

Each category has more more than 5 in both groups, so these variables are safe to use in modeling, no need to drop or merge any levels.

Logistic Regression Model

model <- glm(smoke_binary ~ age_group + gender + marital_status + education + income_group,data = smoking_clean,family = "binomial")

summary(model)

## 
## Call:
## glm(formula = smoke_binary ~ age_group + gender + marital_status + 
##     education + income_group, family = "binomial", data = smoking_clean)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.28697    0.30077  -4.279 1.88e-05 ***
## age_group50+              -0.88541    0.15069  -5.876 4.21e-09 ***
## age_groupUnder 30          0.18061    0.18090   0.998   0.3181    
## genderMale                 0.24992    0.12764   1.958   0.0502 .  
## marital_statusMarried     -0.92863    0.19599  -4.738 2.16e-06 ***
## marital_statusSeparated   -0.25391    0.31800  -0.798   0.4246    
## marital_statusSingle      -0.29171    0.22128  -1.318   0.1874    
## marital_statusWidowed     -0.63870    0.25500  -2.505   0.0123 *  
## educationLow Education     0.71939    0.17181   4.187 2.82e-05 ***
## educationMedium Education  0.39681    0.20192   1.965   0.0494 *  
## income_groupLow Income     0.59021    0.27242   2.167   0.0303 *  
## income_groupMiddle Income  0.50919    0.25710   1.981   0.0476 *  
## income_groupNot Reported   0.07894    0.35043   0.225   0.8218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1898.0  on 1690  degrees of freedom
## Residual deviance: 1749.1  on 1678  degrees of freedom
## AIC: 1775.1
## 
## Number of Fisher Scoring iterations: 4

Interpretation

Now that we have fitted the logistic regression model predicting whether an individual smokes, we can look at the predictors that are statistically significant and interpret their effects. Below are the variables with meaningful log-odds ratios.

age_group50+ = -0.88541 (p = 4.21e-09) Individuals aged 50 or older have significantly lower log-odds of smoking compared to the reference group (age 30-49). This means older adults are much less likely to smoke.

age_groupUnder30 = 0.18061 (p=0.3181) Individuals under age 30 do not differ significantly from 30-49 gropu in likelihood of smoking. This age group is not a meaningful predictor once other variables are controlled for.

genderMale = 0.24992 (p=0.0502) Being male slightly increases the log-odds of smoking compared to females, but the effect is only marginally significant(around p = 0.05). Gender has a weak influence in this model.

marital_statusMarried = -0.92863 (p=2.16e-06) Married individuals have significantly lower log-odds of smoking compared to the reference group(Divorced). Marriage appears to be associated with reduced smoking likelihood.

marital_statusWidowed = -0.63870 (p=0.0123) Widowed individuals show significantly lower odds of smoking. While the effect is smaller than the married group, it is still statistically meaningful.

marital_statusSeparated and marital_statusSingle (p>0.05) These categories are not significant predictors. Their p value indicate that separated and single individuals do not differ meaningfully from others in their likelihood of smoking.

educationLow Education = 0.71939 (p=2.82e-05) Individuals with low education have much higher log-odds of smoking compared to those with high education. This is one of the strongest predictors in the model.

educationMedium Education = 0.39681 (p=0.0494) Individuals with medium education levels also have higher odds of smoking. This effect is statistically significant but smaller than that of low education.

income_groupLow Income = 0.59021 (p=0.0303) Low-income individuals have significantly higher log-odds of smoking compared to high-income individuals. Income level meaningfully influences smoking behavior.

income_groupMiddle Income = 0.50919 (p=0.0476) Middle income individuals also show increased likelihood of smoking. This effect s significant but weaker than the low-income group.

income_groupNot Reported = 0.07894 (p=0.8218) Not significant.Individuals who did not report income do not differ in smoking behavior compared to the high-income group.

Overall, The residual deviance is lower than the null deviance, meaning the predictors improve model fit compared to an intercept-only model.

Calculating the r-squared

r_square_all <- 1 - (model$deviance/model$null.deviance)

r_square_all

## [1] 0.07845487

The R² value for this logistic regression model is approximately 0.078, meaning the predictors explain about 7.8% of the variation in smoking behavior.This value is considered low, which is common in logistic regression, especially with social or behavioral data sets. A low R² does not mean the model is incorrect, it simply indicates that smoking behavior is influenced by many factors not captured in this model. I report that the R² is low, and future models could include additional predictors to improve the explanatory power.

Calculating the p-value

1 - pchisq((model$null.deviance - model$deviance), df=(length(model$coefficients)-1))

## [1] 0

The overall p-value of the model is effectively 0, meaning the model is statistically significant.

Confusion matrix

predicted.probs <- model$fitted.values

predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0) 

confusion <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(smoking_clean$smoke_binary, levels = c(0, 1))
)

confusion

##          Actual
## Predicted    0    1
##         0 1230  388
##         1   40   33

1230 people were actually non-smokers, and the model correctly predicted them as non-smokers. This is True Negative (TN).

388 people were actually smokers, but the model predicted them as non-smokers. This is False Negative (FN).

40 people were actually non-smokers, but the model predicted them as smokers. This is False Positive(FP).

33 people were actually smokers, and the model correctly predicted them as smokers. This is True Positive(TP).

Calculating key performance metrics

TN <- 1230
FP <- 40
FN <- 388
TP <- 33

# Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)  
specificity <- TN / (TN + FP)  
precision <- TP / (TP + FP)   
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)

# Print results
cat("Accuracy:    ", round(accuracy, 4), "\n")

## Accuracy:     0.7469

cat("Sensitivity: ", round(sensitivity, 4), "\n")

## Sensitivity:  0.0784

cat("Specificity: ", round(specificity, 4), "\n")

## Specificity:  0.9685

cat("Precision:   ", round(precision, 4), "\n")

## Precision:    0.4521

cat("F1 Score:    ", round(f1_score, 4), "\n")

## F1 Score:     0.1336

The logistic regression model achieves an overall accuracy of about 74.7%, meaning that it correctly classifies most individuals in the data set.

The specificity is very high (96.9%), indicating that the model is excellent at correctly identifying non-smokers. This expected because the data set contains many more non-smokers than smokers.

The sensitivity is extremely low(7.8%), meaning that the model rarely identifies smokers correctly. Most smokers are mistakenly predicted as non-smokers, which aligns with the large number of false negatives in the confusion matrix.

The precision is about 45.2, meaning that less than half of the individuals predicted as smokers are actually smokers. This occurs because the model predicts “smoker” very rarely.

The F1-score (0.1336), which combines precision and sensivity, is also low, reflecting the model’s weakness in predicting the minority case(smokers).

Overall, the model performs strongly for identifying non-smokers, but performs poorly in identifying smokers, mainly due to the class imbalance in the data set.

ROC and AUC

library(pROC)

## Warning: package 'pROC' was built under R version 4.5.2

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# ROC curve & AUC on full data
roc_obj <- roc(response = smoking_clean$smoke_binary,
               predictor = model$fitted.values,
               levels = c(0,1),
               direction = "<")  

# Print AUC value
auc_val <- auc(roc_obj); auc_val

## Area under the curve: 0.6961

# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")

The AUC = 0.696 means the model has a moderate ability to distinguish between people who smoker and those who do not.

On the ROC plot, the curve is above the diagonal “random guess” line, which shows that the model performs better than chance.

In plain words: If you randomly pick one person who smokes and one person who does not, the model has about a 69.6% chance of giving a higher predicted probability to the person who actually smoke.

This means he predictors (age_group,gender,martial_status,education, and income_group) provide some useful information, but the model does not completely separate smokers from non-smokers. Important factors influencing smoking behavior may be missing from the data set, so adding more predictors could improve the model’s performance in the future.

Conclusion

In this project, I examined whether age group, gender, marital status, education level, and gross income could help predict whether a person is a smoker in the UK. The results showed that several variables, especially age, education level, and income were related to smoking behavior. Younger people, individuals with lower education, and those with lower or middle income were more likely to smoke. However, the overall model did not predict smoking very well. The pseudo R² was low (around 0.078), meaning the predictors explained only a small portion of the variation in smoking behavior. The AUC value of 0.696 also showed that the model performs only moderately better than random guessing.

These results suggest that smoking behavior is influenced by many additional factors that were not included in this model. Important elements such as stress levels, lifestyle habits, peer influence, cultural background, and health conditions might play a major role in smoking decisions. Including such variables could help improve prediction. In the future, adding more meaningful predictors, exploring more detailed health or lifestyle information, or trying different types of models may lead to better results. Overall, while the current model provides some useful insights, more information is needed to better understand the factors related to smoking in the UK population.