How do age,gender,marital status,educational level, and gross income predict the likelihood of smoking in the UK?
The data set I am using titled “UK Smoking Data”, which comes from the https://www.openintro.org/data/index.php?data=smoking . The UK Smoking Data is a survey data set capturing smoking habits among adults in the United Kingdom. The data set contains 1691 observations and include 12 variables on demographic, socioeconomic status, and smoking behavior.I chose to work with this data set, because I have recently found myself really interested in crime-related,like this kinds of topics.
For this project I focused on five variables.
gender age marital_status highest_qualification gross_income
The age column shows the age of the individual in years, and it is a numerical variable. The gender columns shows whether the individual is male or female, and it is a categorical variable. The martial_status columns indicates the person’s martial situation (such as single, married,divorced,separated,widowed), and it is a categorical variable. The highest_qualification column represents the individual’s level of education, and it is a categorical variable. The gross_income columns shows the income group the individual belongs to, reported in income ranges, and it is a categorical variable.
National STEM Centre, Large Datasets from stats4schools, https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools.
Before building my model, I first cleaned and explored the data set to make sure I was working with reliable information. I checked for missing values and found that the variables I needed for my analysis did not contain any NA values, so no imputation was required. Since the outcome variable (smoke) had two categories, I converted it into a binary variable (0 = non-smoker, 1 = smoker) to fit a logistic regression model.Next, I created clearer and more meaningful categories for several predictors. I grouped ages into three categories (Under 30, 30–49, and 50+) so the variable would be easier to interpret. I also combined the detailed income ranges into broader groups (Low, Middle, High Income, and Not Reported) and simplified the highest_qualification variable into three levels (Low, Medium, and High Education). After renaming, I converted all predictors into factors so they would work correctly in the model.Once the variables were cleaned and renamed, I created a smaller data set containing only the variables needed for logistic regression. I used several dplyr functions—including mutate, case_when, and select—to organize and prepare the data. For quality control, I checked that each category had enough observations in both smoker and non-smoker groups. All predictors met this requirement, so no levels needed to be removed or merged.
For the analysis, I used a logistic regression model to examine whether age group, gender, martial status, education level, and gross income could predict the likelihood of smoking among adults in the UK. Since the smoking variable had two categories(smoker and non-smoker), it was converted into a binary outcome variable(0=non -smoker,1=smoker)to make it suitable for logistic regression. After cleaning the data and recoding the predictors into meaningful categories, I fit the final model using the selected variables. The model results showed that age group, education level, marital status, and income were important predictors of smoking behavior. In particular, individuals aged 50 and above were significantly less likely to smoker compared to aged 30-49. Education level was one of the strongest predictors,with individual had low or medium education showing much higher odds of smoking compared to those with high education. Income level was also meaningful, as individuals in the low- and middle-income groups were more likely to smoke than those in the high-income group. Marital status showed that married and widowed individuals had lower odds of smoking compared to divorced individuals. Gender had only a weak and marginally significant effect on smoking behavior.Although several predictors were statistically significant, The pseudo R² was low. It was approximately 0.078, meaning that the model explains only about 7.8% of the variation of smoking behavior. In addition, the confusion matrix and performance metrics showed that the model performed well in identifying non-smokers but struggled to correctly identify smokers, largely due to class imbalance in data set. Overall, while the model reveals meaningful relationships between factors and smoking, it is not very strong at predicting individual smoking behavior.
library(tidyverse)
smoking <- read_csv("smoking_uk.csv")
str(smoking)
## spc_tbl_ [1,691 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:1691] "Male" "Female" "Male" "Female" ...
## $ age : num [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
## $ marital_status : chr [1:1691] "Divorced" "Single" "Married" "Married" ...
## $ highest_qualification: chr [1:1691] "No Qualification" "No Qualification" "Degree" "Degree" ...
## $ nationality : chr [1:1691] "British" "British" "English" "English" ...
## $ ethnicity : chr [1:1691] "White" "White" "White" "White" ...
## $ gross_income : chr [1:1691] "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
## $ region : chr [1:1691] "The North" "The North" "The North" "The North" ...
## $ smoke : chr [1:1691] "No" "Yes" "No" "No" ...
## $ amt_weekends : num [1:1691] NA 12 NA NA NA NA 6 NA 8 15 ...
## $ amt_weekdays : num [1:1691] NA 12 NA NA NA NA 6 NA 8 12 ...
## $ type : chr [1:1691] NA "Packets" NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. marital_status = col_character(),
## .. highest_qualification = col_character(),
## .. nationality = col_character(),
## .. ethnicity = col_character(),
## .. gross_income = col_character(),
## .. region = col_character(),
## .. smoke = col_character(),
## .. amt_weekends = col_double(),
## .. amt_weekdays = col_double(),
## .. type = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(smoking)
## # A tibble: 6 × 12
## gender age marital_status highest_qualification nationality ethnicity
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## # ℹ 6 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## # amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>
colSums(is.na(smoking))
## gender age marital_status
## 0 0 0
## highest_qualification nationality ethnicity
## 0 0 0
## gross_income region smoke
## 0 0 0
## amt_weekends amt_weekdays type
## 1270 1270 1270
We can see that there is no missing values in variables that is going to use in this project.
smoking <- smoking |>
mutate(smoke_binary = ifelse(smoke == "No",0,1))
smoking
## # A tibble: 1,691 × 13
## gender age marital_status highest_qualification nationality ethnicity
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## 7 Male 53 Married Degree British White
## 8 Male 44 Single Degree English White
## 9 Male 40 Single GCSE/CSE English White
## 10 Female 41 Married No Qualification English White
## # ℹ 1,681 more rows
## # ℹ 7 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## # amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>, smoke_binary <dbl>
smoking <- smoking |>
mutate(age_group = case_when(
age < 30 ~ "Under 30",
age >= 30 & age < 50 ~ "30-49",
age >= 50 ~ "50+")
)
smoking
## # A tibble: 1,691 × 14
## gender age marital_status highest_qualification nationality ethnicity
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## 7 Male 53 Married Degree British White
## 8 Male 44 Single Degree English White
## 9 Male 40 Single GCSE/CSE English White
## 10 Female 41 Married No Qualification English White
## # ℹ 1,681 more rows
## # ℹ 8 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## # amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>, smoke_binary <dbl>,
## # age_group <chr>
For this age group, we can grouped them into different types of categories, but I chose this one.
smoking <- smoking |>
mutate(income_group = case_when(
gross_income %in% c("Under 2,600","2,600 to 5,200","5,200 to 10,400") ~ "Low Income",
gross_income %in% c("10,400 to 15,600","15,600 to 20,800","20,800 to 28,600") ~ "Middle Income",
gross_income %in% c("28,600 to 36,400","Above 36,400") ~ "High Income",
gross_income %in% c("Refused","Unknown") ~ "Not Reported"
))
smoking
## # A tibble: 1,691 × 15
## gender age marital_status highest_qualification nationality ethnicity
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## 7 Male 53 Married Degree British White
## 8 Male 44 Single Degree English White
## 9 Male 40 Single GCSE/CSE English White
## 10 Female 41 Married No Qualification English White
## # ℹ 1,681 more rows
## # ℹ 9 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## # amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>, smoke_binary <dbl>,
## # age_group <chr>, income_group <chr>
smoking <- smoking |>
mutate(education = case_when(
highest_qualification %in% c("No Qualification","GCSE/CSE","GCSE/O Level") ~ "Low Education",
highest_qualification %in% c("A Levels","ONC/BTEC","Other/Sub Degree") ~ "Medium Education",
highest_qualification %in% c("Degree","Higher/Sub Degree") ~ "High Education")
)
smoking
## # A tibble: 1,691 × 16
## gender age marital_status highest_qualification nationality ethnicity
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## 7 Male 53 Married Degree British White
## 8 Male 44 Single Degree English White
## 9 Male 40 Single GCSE/CSE English White
## 10 Female 41 Married No Qualification English White
## # ℹ 1,681 more rows
## # ℹ 10 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## # amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>, smoke_binary <dbl>,
## # age_group <chr>, income_group <chr>, education <chr>
smoking <- smoking |>
mutate(gender = as.factor(gender),
marital_status = as.factor(marital_status),
education = as.factor(education),
income_group = as.factor(income_group),
age_group = as.factor(age_group))
str(smoking)
## tibble [1,691 × 16] (S3: tbl_df/tbl/data.frame)
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 2 2 1 ...
## $ age : num [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
## $ marital_status : Factor w/ 5 levels "Divorced","Married",..: 1 4 2 2 2 2 2 4 4 2 ...
## $ highest_qualification: chr [1:1691] "No Qualification" "No Qualification" "Degree" "Degree" ...
## $ nationality : chr [1:1691] "British" "British" "English" "English" ...
## $ ethnicity : chr [1:1691] "White" "White" "White" "White" ...
## $ gross_income : chr [1:1691] "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
## $ region : chr [1:1691] "The North" "The North" "The North" "The North" ...
## $ smoke : chr [1:1691] "No" "Yes" "No" "No" ...
## $ amt_weekends : num [1:1691] NA 12 NA NA NA NA 6 NA 8 15 ...
## $ amt_weekdays : num [1:1691] NA 12 NA NA NA NA 6 NA 8 12 ...
## $ type : chr [1:1691] NA "Packets" NA NA ...
## $ smoke_binary : num [1:1691] 0 1 0 0 0 0 1 0 1 1 ...
## $ age_group : Factor w/ 3 levels "30-49","50+",..: 1 1 1 1 1 1 2 1 1 1 ...
## $ income_group : Factor w/ 4 levels "High Income",..: 2 2 1 3 2 3 1 3 2 2 ...
## $ education : Factor w/ 3 levels "High Education",..: 2 2 1 1 2 2 1 1 2 2 ...
smoking_clean <- smoking |>
select(smoke_binary,age_group,gender,marital_status,education,income_group)
smoking_clean
## # A tibble: 1,691 × 6
## smoke_binary age_group gender marital_status education income_group
## <dbl> <fct> <fct> <fct> <fct> <fct>
## 1 0 30-49 Male Divorced Low Education Low Income
## 2 1 30-49 Female Single Low Education Low Income
## 3 0 30-49 Male Married High Education High Income
## 4 0 30-49 Female Married High Education Middle Income
## 5 0 30-49 Female Married Low Education Low Income
## 6 0 30-49 Female Married Low Education Middle Income
## 7 1 50+ Male Married High Education High Income
## 8 0 30-49 Male Single High Education Middle Income
## 9 1 30-49 Male Single Low Education Low Income
## 10 1 30-49 Female Married Low Education Low Income
## # ℹ 1,681 more rows
xtabs(~smoke_binary +age_group,data = smoking_clean)
## age_group
## smoke_binary 30-49 50+ Under 30
## 0 435 678 157
## 1 187 131 103
xtabs(~smoke_binary +gender,data = smoking_clean)
## gender
## smoke_binary Female Male
## 0 731 539
## 1 234 187
xtabs(~smoke_binary +marital_status,data = smoking_clean)
## marital_status
## smoke_binary Divorced Married Separated Single Widowed
## 0 103 669 46 269 183
## 1 58 143 22 158 40
xtabs(~smoke_binary +education,data = smoking_clean)
## education
## smoke_binary High Education Low Education Medium Education
## 0 321 716 233
## 1 66 280 75
xtabs(~smoke_binary +income_group,data = smoking_clean)
## income_group
## smoke_binary High Income Low Income Middle Income Not Reported
## 0 144 579 445 102
## 1 24 207 166 24
Each category has more more than 5 in both groups, so these variables are safe to use in modeling, no need to drop or merge any levels.
model <- glm(smoke_binary ~ age_group + gender + marital_status + education + income_group,data = smoking_clean,family = "binomial")
summary(model)
##
## Call:
## glm(formula = smoke_binary ~ age_group + gender + marital_status +
## education + income_group, family = "binomial", data = smoking_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.28697 0.30077 -4.279 1.88e-05 ***
## age_group50+ -0.88541 0.15069 -5.876 4.21e-09 ***
## age_groupUnder 30 0.18061 0.18090 0.998 0.3181
## genderMale 0.24992 0.12764 1.958 0.0502 .
## marital_statusMarried -0.92863 0.19599 -4.738 2.16e-06 ***
## marital_statusSeparated -0.25391 0.31800 -0.798 0.4246
## marital_statusSingle -0.29171 0.22128 -1.318 0.1874
## marital_statusWidowed -0.63870 0.25500 -2.505 0.0123 *
## educationLow Education 0.71939 0.17181 4.187 2.82e-05 ***
## educationMedium Education 0.39681 0.20192 1.965 0.0494 *
## income_groupLow Income 0.59021 0.27242 2.167 0.0303 *
## income_groupMiddle Income 0.50919 0.25710 1.981 0.0476 *
## income_groupNot Reported 0.07894 0.35043 0.225 0.8218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1898.0 on 1690 degrees of freedom
## Residual deviance: 1749.1 on 1678 degrees of freedom
## AIC: 1775.1
##
## Number of Fisher Scoring iterations: 4
Interpretation
Now that we have fitted the logistic regression model predicting whether an individual smokes, we can look at the predictors that are statistically significant and interpret their effects. Below are the variables with meaningful log-odds ratios.
age_group50+ = -0.88541 (p = 4.21e-09) Individuals aged 50 or older have significantly lower log-odds of smoking compared to the reference group (age 30-49). This means older adults are much less likely to smoke.
age_groupUnder30 = 0.18061 (p=0.3181) Individuals under age 30 do not differ significantly from 30-49 gropu in likelihood of smoking. This age group is not a meaningful predictor once other variables are controlled for.
genderMale = 0.24992 (p=0.0502) Being male slightly increases the log-odds of smoking compared to females, but the effect is only marginally significant(around p = 0.05). Gender has a weak influence in this model.
marital_statusMarried = -0.92863 (p=2.16e-06) Married individuals have significantly lower log-odds of smoking compared to the reference group(Divorced). Marriage appears to be associated with reduced smoking likelihood.
marital_statusWidowed = -0.63870 (p=0.0123) Widowed individuals show significantly lower odds of smoking. While the effect is smaller than the married group, it is still statistically meaningful.
marital_statusSeparated and marital_statusSingle (p>0.05) These categories are not significant predictors. Their p value indicate that separated and single individuals do not differ meaningfully from others in their likelihood of smoking.
educationLow Education = 0.71939 (p=2.82e-05) Individuals with low education have much higher log-odds of smoking compared to those with high education. This is one of the strongest predictors in the model.
educationMedium Education = 0.39681 (p=0.0494) Individuals with medium education levels also have higher odds of smoking. This effect is statistically significant but smaller than that of low education.
income_groupLow Income = 0.59021 (p=0.0303) Low-income individuals have significantly higher log-odds of smoking compared to high-income individuals. Income level meaningfully influences smoking behavior.
income_groupMiddle Income = 0.50919 (p=0.0476) Middle income individuals also show increased likelihood of smoking. This effect s significant but weaker than the low-income group.
income_groupNot Reported = 0.07894 (p=0.8218) Not significant.Individuals who did not report income do not differ in smoking behavior compared to the high-income group.
Overall, The residual deviance is lower than the null deviance, meaning the predictors improve model fit compared to an intercept-only model.
r_square_all <- 1 - (model$deviance/model$null.deviance)
r_square_all
## [1] 0.07845487
The R² value for this logistic regression model is approximately 0.078, meaning the predictors explain about 7.8% of the variation in smoking behavior.This value is considered low, which is common in logistic regression, especially with social or behavioral data sets. A low R² does not mean the model is incorrect, it simply indicates that smoking behavior is influenced by many factors not captured in this model. I report that the R² is low, and future models could include additional predictors to improve the explanatory power.
1 - pchisq((model$null.deviance - model$deviance), df=(length(model$coefficients)-1))
## [1] 0
The overall p-value of the model is effectively 0, meaning the model is statistically significant.
predicted.probs <- model$fitted.values
predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)
confusion <- table(
Predicted = factor(predicted.classes, levels = c(0, 1)),
Actual = factor(smoking_clean$smoke_binary, levels = c(0, 1))
)
confusion
## Actual
## Predicted 0 1
## 0 1230 388
## 1 40 33
1230 people were actually non-smokers, and the model correctly predicted them as non-smokers. This is True Negative (TN).
388 people were actually smokers, but the model predicted them as non-smokers. This is False Negative (FN).
40 people were actually non-smokers, but the model predicted them as smokers. This is False Positive(FP).
33 people were actually smokers, and the model correctly predicted them as smokers. This is True Positive(TP).
TN <- 1230
FP <- 40
FN <- 388
TP <- 33
# Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
# Print results
cat("Accuracy: ", round(accuracy, 4), "\n")
## Accuracy: 0.7469
cat("Sensitivity: ", round(sensitivity, 4), "\n")
## Sensitivity: 0.0784
cat("Specificity: ", round(specificity, 4), "\n")
## Specificity: 0.9685
cat("Precision: ", round(precision, 4), "\n")
## Precision: 0.4521
cat("F1 Score: ", round(f1_score, 4), "\n")
## F1 Score: 0.1336
The logistic regression model achieves an overall accuracy of about 74.7%, meaning that it correctly classifies most individuals in the data set.
The specificity is very high (96.9%), indicating that the model is excellent at correctly identifying non-smokers. This expected because the data set contains many more non-smokers than smokers.
The sensitivity is extremely low(7.8%), meaning that the model rarely identifies smokers correctly. Most smokers are mistakenly predicted as non-smokers, which aligns with the large number of false negatives in the confusion matrix.
The precision is about 45.2, meaning that less than half of the individuals predicted as smokers are actually smokers. This occurs because the model predicts “smoker” very rarely.
The F1-score (0.1336), which combines precision and sensivity, is also low, reflecting the model’s weakness in predicting the minority case(smokers).
Overall, the model performs strongly for identifying non-smokers, but performs poorly in identifying smokers, mainly due to the class imbalance in the data set.
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# ROC curve & AUC on full data
roc_obj <- roc(response = smoking_clean$smoke_binary,
predictor = model$fitted.values,
levels = c(0,1),
direction = "<")
# Print AUC value
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.6961
# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
The AUC = 0.696 means the model has a moderate ability to distinguish
between people who smoker and those who do not.
On the ROC plot, the curve is above the diagonal “random guess” line, which shows that the model performs better than chance.
In plain words: If you randomly pick one person who smokes and one person who does not, the model has about a 69.6% chance of giving a higher predicted probability to the person who actually smoke.
This means he predictors (age_group,gender,martial_status,education, and income_group) provide some useful information, but the model does not completely separate smokers from non-smokers. Important factors influencing smoking behavior may be missing from the data set, so adding more predictors could improve the model’s performance in the future.
In this project, I examined whether age group, gender, marital status, education level, and gross income could help predict whether a person is a smoker in the UK. The results showed that several variables, especially age, education level, and income were related to smoking behavior. Younger people, individuals with lower education, and those with lower or middle income were more likely to smoke. However, the overall model did not predict smoking very well. The pseudo R² was low (around 0.078), meaning the predictors explained only a small portion of the variation in smoking behavior. The AUC value of 0.696 also showed that the model performs only moderately better than random guessing.
These results suggest that smoking behavior is influenced by many additional factors that were not included in this model. Important elements such as stress levels, lifestyle habits, peer influence, cultural background, and health conditions might play a major role in smoking decisions. Including such variables could help improve prediction. In the future, adding more meaningful predictors, exploring more detailed health or lifestyle information, or trying different types of models may lead to better results. Overall, while the current model provides some useful insights, more information is needed to better understand the factors related to smoking in the UK population.