### Import dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/Mulut/Desktop/Classes/Data101/projects/project 1")
smoking <- read_csv("smoking.csv")
## Rows: 1691 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): gender, marital_status, highest_qualification, nationality, ethnici...
## dbl (3): age, amt_weekends, amt_weekdays
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
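The column-specification message above can be avoided by declaring the column types when the file is read. The following is a minimal sketch, not the code used in this project: it writes a three-row stand-in CSV to a temporary file (the rows and the `tmp` path are made up for illustration) and reads it back with explicit types.

```r
library(readr)

# Hypothetical stand-in for smoking.csv: three made-up rows in a temp file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "gender,age,smoke",
  "Male,38,No",
  "Female,42,Yes",
  "Male,40,No"
), tmp)

# Declare types up front so read_csv() has nothing to guess or report
demo <- read_csv(
  tmp,
  col_types = cols(
    age = col_double(),
    .default = col_character()
  )
)

demo
```

Passing `show_col_types = FALSE` instead would also quiet the message while leaving the type guessing to `read_csv()`.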
Does smoking status (Yes/No) vary by gender, age, education level and income level in the UK?
This project explores how demographic and socioeconomic characteristics relate to smoking behavior among adults in the United Kingdom. Smoking remains a leading cause of preventable disease and death, so identifying which groups are more likely to smoke can inform targeted public health interventions. The dataset is the UK Smoking Data (smoking) from OpenIntro.org. It contains 1,691 observations on 12 variables, including each participant's gender, age, nationality, education, income, and whether they smoke.
I start by cleaning and exploring the dataset to understand its structure. First, I view the top rows with head() and check the data types with str(). Then I select only the five variables needed for this project. I use summary() to get basic statistics for age, such as the mean, median, and maximum. Finally, I group and summarize the data to explore smoking patterns by age, gender, education, and income: mean and median age by smoking status, and counts and percentages for the other groups. The variables I use in this project are:
gender – Participant's gender (Male or Female)
age – Participant's age in years
highest_qualification – Education level (e.g., Degree, A Levels, No Qualification)
gross_income – Income range (e.g., “Under 2,600”, “Above 36,400”)
smoke – Smoking status (Yes or No)
head(smoking)
## # A tibble: 6 × 12
## gender age marital_status highest_qualification nationality ethnicity
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## # ℹ 6 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## # amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>
str(smoking)
## spc_tbl_ [1,691 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:1691] "Male" "Female" "Male" "Female" ...
## $ age : num [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
## $ marital_status : chr [1:1691] "Divorced" "Single" "Married" "Married" ...
## $ highest_qualification: chr [1:1691] "No Qualification" "No Qualification" "Degree" "Degree" ...
## $ nationality : chr [1:1691] "British" "British" "English" "English" ...
## $ ethnicity : chr [1:1691] "White" "White" "White" "White" ...
## $ gross_income : chr [1:1691] "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
## $ region : chr [1:1691] "The North" "The North" "The North" "The North" ...
## $ smoke : chr [1:1691] "No" "Yes" "No" "No" ...
## $ amt_weekends : num [1:1691] NA 12 NA NA NA NA 6 NA 8 15 ...
## $ amt_weekdays : num [1:1691] NA 12 NA NA NA NA 6 NA 8 12 ...
## $ type : chr [1:1691] NA "Packets" NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. marital_status = col_character(),
## .. highest_qualification = col_character(),
## .. nationality = col_character(),
## .. ethnicity = col_character(),
## .. gross_income = col_character(),
## .. region = col_character(),
## .. smoke = col_character(),
## .. amt_weekends = col_double(),
## .. amt_weekdays = col_double(),
## .. type = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
smoking_data <- smoking |>
  select(gender, age, highest_qualification, gross_income, smoke)
summary(smoking_data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 34.00 48.00 49.84 65.50 97.00
smoker_byage <- smoking_data |>
  group_by(smoke) |>
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE)
  )
smoker_byage
## # A tibble: 2 × 3
## smoke mean_age median_age
## <chr> <dbl> <dbl>
## 1 No 52.2 52
## 2 Yes 42.7 40
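Comparing mean ages by smoking status is one way to look at age. Another is to bin age into bands and compute the smoking rate within each band. The sketch below shows that approach on made-up toy data. The cut points (16, 30, 45, 60) and the names `toy`, `age_band`, and `rate_by_age_band` are assumptions chosen for illustration and are not part of the original analysis.

```r
library(dplyr)

# Made-up toy data for illustration only (not rows from the survey)
toy <- tibble(
  age   = c(18, 25, 29, 33, 41, 52, 67, 70),
  smoke = c("Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes")
)

rate_by_age_band <- toy |>
  # right = FALSE makes each band include its lower bound, e.g. [16, 30)
  mutate(age_band = cut(
    age,
    breaks = c(16, 30, 45, 60, Inf),
    labels = c("16-29", "30-44", "45-59", "60+"),
    right  = FALSE
  )) |>
  group_by(age_band) |>
  summarise(
    n = n(),
    percent_yes = mean(smoke == "Yes") * 100,  # share of smokers in the band
    .groups = "drop"
  )

rate_by_age_band
```

Applied to `smoking_data`, the same pipeline would show whether the smoking rate falls steadily with age or only in the oldest bands.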
smoker_bygender <- smoking_data |>
  group_by(gender, smoke) |>
  summarise(count = n()) |>
  mutate(percent = count / sum(count) * 100)
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
smoker_bygender
## # A tibble: 4 × 4
## # Groups: gender [2]
## gender smoke count percent
## <chr> <chr> <int> <dbl>
## 1 Female No 731 75.8
## 2 Female Yes 234 24.2
## 3 Male No 539 74.2
## 4 Male Yes 187 25.8
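The percentages above add up to 100 within each gender because `summarise()` drops the last grouping variable (`smoke`) and leaves the result grouped by `gender`, so `sum(count)` in the following `mutate()` is a per-gender total. The sketch below makes that grouping explicit. It rebuilds a toy data frame from the counts in the table above rather than reading the survey file, and the names `toy` and `by_gender` are illustrative.

```r
library(dplyr)

# Toy rows rebuilt from the counts in the table above (not the raw survey file)
toy <- tibble(
  gender = rep(c("Female", "Female", "Male", "Male"), times = c(731, 234, 539, 187)),
  smoke  = rep(c("No", "Yes", "No", "Yes"),           times = c(731, 234, 539, 187))
)

by_gender <- toy |>
  count(gender, smoke, name = "count") |>  # count() returns an ungrouped result
  group_by(gender) |>                      # so the denominator is stated explicitly
  mutate(percent = round(count / sum(count) * 100, 1)) |>
  ungroup()

by_gender
```

Because `count()` returns ungrouped output, this version also avoids the `.groups` message.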
smoking_data <- smoking_data |>
  mutate(education_group = case_when(
    highest_qualification %in% c("No Qualification", "GCSE/CSE", "GCSE/O Level") ~ "Low level Education",
    highest_qualification %in% c("A Levels", "Higher/Sub Degree") ~ "Mid Level Education",
    TRUE ~ "Higher or Other Education"
  ))
smoker_byeducation <- smoking_data |>
  group_by(education_group, smoke) |>
  summarise(count = n()) |>
  mutate(percent = count / sum(count) * 100)
## `summarise()` has grouped output by 'education_group'. You can override using
## the `.groups` argument.
smoker_byeducation
## # A tibble: 6 × 4
## # Groups: education_group [3]
## education_group smoke count percent
## <chr> <chr> <int> <dbl>
## 1 Higher or Other Education No 372 80
## 2 Higher or Other Education Yes 93 20
## 3 Low level Education No 716 71.9
## 4 Low level Education Yes 280 28.1
## 5 Mid Level Education No 182 79.1
## 6 Mid Level Education Yes 48 20.9
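One detail of the `case_when()` recoding is worth checking: the final `TRUE ~` branch catches every value not matched earlier, and that includes missing values, because `NA %in% c(...)` evaluates to `FALSE`. The sketch below runs the same rules on a short toy vector with an added `is.na()` branch so that any missing qualification stays missing. Whether `highest_qualification` actually contains missing values is not shown in the output above, so treat this as a precaution. The names `quals` and `grouped` are illustrative.

```r
library(dplyr)

# Toy vector: four qualification labels used in this project plus one missing value
quals <- c("No Qualification", "GCSE/O Level", "A Levels", "Degree", NA)

grouped <- case_when(
  is.na(quals) ~ NA_character_,  # keep missing values out of the catch-all
  quals %in% c("No Qualification", "GCSE/CSE", "GCSE/O Level") ~ "Low level Education",
  quals %in% c("A Levels", "Higher/Sub Degree") ~ "Mid Level Education",
  TRUE ~ "Higher or Other Education"
)

grouped
```

Running `count(smoking_data, highest_qualification, education_group)` on the real data would list exactly which qualification labels end up in each group.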
smoker_byincome <- smoking_data |>
  group_by(gross_income, smoke) |>
  summarise(count = n()) |>
  mutate(percent = count / sum(count) * 100)
## `summarise()` has grouped output by 'gross_income'. You can override using the
## `.groups` argument.
smoker_byincome
## # A tibble: 20 × 4
## # Groups: gross_income [10]
## gross_income smoke count percent
## <chr> <chr> <int> <dbl>
## 1 10,400 to 15,600 No 185 69.0
## 2 10,400 to 15,600 Yes 83 31.0
## 3 15,600 to 20,800 No 143 76.1
## 4 15,600 to 20,800 Yes 45 23.9
## 5 2,600 to 5,200 No 193 75.1
## 6 2,600 to 5,200 Yes 64 24.9
## 7 20,800 to 28,600 No 117 75.5
## 8 20,800 to 28,600 Yes 38 24.5
## 9 28,600 to 36,400 No 70 88.6
## 10 28,600 to 36,400 Yes 9 11.4
## 11 5,200 to 10,400 No 289 73.0
## 12 5,200 to 10,400 Yes 107 27.0
## 13 Above 36,400 No 74 83.1
## 14 Above 36,400 Yes 15 16.9
## 15 Refused No 87 80.6
## 16 Refused Yes 21 19.4
## 17 Under 2,600 No 97 72.9
## 18 Under 2,600 Yes 36 27.1
## 19 Unknown No 15 83.3
## 20 Unknown Yes 3 16.7
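Because `gross_income` is stored as text, the table above is sorted alphabetically ("10,400 to 15,600" comes before "2,600 to 5,200"), which makes any trend across income bands hard to read. The sketch below shows one way to order the bands from lowest to highest with a factor. It uses the counts printed above rather than the raw file, and the names `income_levels`, `income_counts`, and `income_rates` are illustrative.

```r
library(dplyr)

# Income bands in ascending order, with the two non-answers last
income_levels <- c(
  "Under 2,600", "2,600 to 5,200", "5,200 to 10,400", "10,400 to 15,600",
  "15,600 to 20,800", "20,800 to 28,600", "28,600 to 36,400", "Above 36,400",
  "Refused", "Unknown"
)

# Counts copied from the table above (in its alphabetical row order)
income_counts <- tibble(
  gross_income = c(
    "10,400 to 15,600", "15,600 to 20,800", "2,600 to 5,200", "20,800 to 28,600",
    "28,600 to 36,400", "5,200 to 10,400", "Above 36,400", "Refused",
    "Under 2,600", "Unknown"
  ),
  no  = c(185, 143, 193, 117, 70, 289, 74, 87, 97, 15),
  yes = c(83, 45, 64, 38, 9, 107, 15, 21, 36, 3)
)

income_rates <- income_counts |>
  mutate(
    gross_income = factor(gross_income, levels = income_levels),
    percent_yes  = round(yes / (yes + no) * 100, 1)
  ) |>
  arrange(gross_income)  # sorts by factor level, not alphabetically

income_rates
```

In the full pipeline, the same `factor()` call could be applied to `smoking_data$gross_income` before grouping, so that tables and plots follow the income order.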
The analysis shows several patterns in smoking behavior in this UK sample. Men smoke slightly more often than women (25.8% versus 24.2%), a difference of under two percentage points. Participants in the low education group are more likely to smoke (28.1%) than those in the mid (20.9%) or higher-or-other (20.0%) groups. Income shows a looser relationship: smoking rates range from about 24% to 31% in the bands below 28,600 and are lowest in the two highest bands (11.4% for 28,600 to 36,400 and 16.9% for above 36,400), but the rate does not fall steadily from one band to the next. In terms of age, smokers are younger on average than non-smokers (mean 42.7 versus 52.2 years; median 40 versus 52).
Overall, this analysis suggests that smoking in the UK is linked to socioeconomic disadvantage: lower income and education levels are associated with higher smoking prevalence. Future research could include additional demographic factors such as region, occupation, or ethnicity to explore cultural and geographic differences. Collecting more recent data could also help determine whether public health initiatives have been effective in reducing smoking rates among lower-income and less-educated populations.
National STEM Centre. (n.d.). Large Datasets from stats4schools. Retrieved from https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools
OpenIntro. (n.d.). UK Smoking Data (smoking dataset) in the openintro R package.