### Import dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/Mulut/Desktop/Classes/Data101/projects/project 1")
smoking <- read_csv("smoking.csv")
## Rows: 1691 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): gender, marital_status, highest_qualification, nationality, ethnici...
## dbl (3): age, amt_weekends, amt_weekdays
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Does smoking status (Yes/No) vary by gender, age, education level and income level in the UK?
This project explores how demographic and socioeconomic characteristics are related to smoking behavior among adults in the United Kingdom. Smoking remains a leading cause of preventable disease and death, so identifying which groups are more likely to smoke can inform targeted public health interventions. The dataset is the UK Smoking Data (smoking) from OpenIntro.org. It contains 1,691 observations on 12 variables, including each participant's gender, age, nationality, education, income, and whether they smoke.
I will examine how gender, age, education level, and income are related to smoking behavior. The variables I will use in this project are:
gender – Whether they are Male or Female
age – the participant’s age
highest_qualification – Education level (e.g., Degree, A Levels, No Qualification)
gross_income – Income range (e.g., “Under 2,600”, “Above 36,400”)
smoke – Smoking status (Yes or No)
head(smoking)
## # A tibble: 6 × 12
## gender age marital_status highest_qualification nationality ethnicity
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## # ℹ 6 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## # amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>
str(smoking)
## spc_tbl_ [1,691 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:1691] "Male" "Female" "Male" "Female" ...
## $ age : num [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
## $ marital_status : chr [1:1691] "Divorced" "Single" "Married" "Married" ...
## $ highest_qualification: chr [1:1691] "No Qualification" "No Qualification" "Degree" "Degree" ...
## $ nationality : chr [1:1691] "British" "British" "English" "English" ...
## $ ethnicity : chr [1:1691] "White" "White" "White" "White" ...
## $ gross_income : chr [1:1691] "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
## $ region : chr [1:1691] "The North" "The North" "The North" "The North" ...
## $ smoke : chr [1:1691] "No" "Yes" "No" "No" ...
## $ amt_weekends : num [1:1691] NA 12 NA NA NA NA 6 NA 8 15 ...
## $ amt_weekdays : num [1:1691] NA 12 NA NA NA NA 6 NA 8 12 ...
## $ type : chr [1:1691] NA "Packets" NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. marital_status = col_character(),
## .. highest_qualification = col_character(),
## .. nationality = col_character(),
## .. ethnicity = col_character(),
## .. gross_income = col_character(),
## .. region = col_character(),
## .. smoke = col_character(),
## .. amt_weekends = col_double(),
## .. amt_weekdays = col_double(),
## .. type = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
smoking_data <- smoking |>
select(gender, age, highest_qualification, gross_income, smoke)
summary(smoking_data)
## gender age highest_qualification gross_income
## Length:1691 Min. :16.00 Length:1691 Length:1691
## Class :character 1st Qu.:34.00 Class :character Class :character
## Mode :character Median :48.00 Mode :character Mode :character
## Mean :49.84
## 3rd Qu.:65.50
## Max. :97.00
## smoke
## Length:1691
## Class :character
## Mode :character
##
##
##
smoker_byage <- smoking_data |>
group_by(smoke) |>
summarise(
mean_age = mean(age, na.rm = TRUE),
median_age = median(age, na.rm = TRUE)
)
smoker_byage
## # A tibble: 2 × 3
## smoke mean_age median_age
## <chr> <dbl> <dbl>
## 1 No 52.2 52
## 2 Yes 42.7 40
smoker_bygender <- smoking_data |>
group_by(gender, smoke) |>
summarise(count = n()) |>
mutate(percent = count / sum(count) * 100)
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
smoker_bygender
## # A tibble: 4 × 4
## # Groups: gender [2]
## gender smoke count percent
## <chr> <chr> <int> <dbl>
## 1 Female No 731 75.8
## 2 Female Yes 234 24.2
## 3 Male No 539 74.2
## 4 Male Yes 187 25.8
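The gap between men and women in the table above is small (25.8% versus 24.2%), so it is worth checking whether it could be sampling noise. The sketch below is not part of the original analysis: it rebuilds the 2 × 2 table from the printed counts and runs a chi-squared test of independence with base R's `chisq.test()`. The object names `gender_counts` and `gender_test` are illustrative.

```r
# Counts copied from the smoker_bygender output above.
gender_counts <- matrix(
  c(731, 234,
    539, 187),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"), smoke = c("No", "Yes"))
)

# Chi-squared test of independence between gender and smoking status.
# For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default.
gender_test <- chisq.test(gender_counts)
gender_test$p.value
```

A large p-value would mean the observed difference is within what sampling variation alone could produce, so the gender gap should be described cautiously.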
smoker_byeducation <- smoking_data |>
group_by(highest_qualification, smoke) |>
summarise(count = n()) |>
mutate(percent = count / sum(count) * 100)
## `summarise()` has grouped output by 'highest_qualification'. You can override
## using the `.groups` argument.
smoker_byeducation
## # A tibble: 16 × 4
## # Groups: highest_qualification [8]
## highest_qualification smoke count percent
## <chr> <chr> <int> <dbl>
## 1 A Levels No 84 80
## 2 A Levels Yes 21 20
## 3 Degree No 223 85.1
## 4 Degree Yes 39 14.9
## 5 GCSE/CSE No 64 62.7
## 6 GCSE/CSE Yes 38 37.3
## 7 GCSE/O Level No 203 65.9
## 8 GCSE/O Level Yes 105 34.1
## 9 Higher/Sub Degree No 98 78.4
## 10 Higher/Sub Degree Yes 27 21.6
## 11 No Qualification No 449 76.6
## 12 No Qualification Yes 137 23.4
## 13 ONC/BTEC No 53 69.7
## 14 ONC/BTEC Yes 23 30.3
## 15 Other/Sub Degree No 96 75.6
## 16 Other/Sub Degree Yes 31 24.4
smoker_byincome <- smoking_data |>
group_by(gross_income, smoke) |>
summarise(count = n()) |>
mutate(percent = count / sum(count) * 100)
## `summarise()` has grouped output by 'gross_income'. You can override using the
## `.groups` argument.
smoker_byincome
## # A tibble: 20 × 4
## # Groups: gross_income [10]
## gross_income smoke count percent
## <chr> <chr> <int> <dbl>
## 1 10,400 to 15,600 No 185 69.0
## 2 10,400 to 15,600 Yes 83 31.0
## 3 15,600 to 20,800 No 143 76.1
## 4 15,600 to 20,800 Yes 45 23.9
## 5 2,600 to 5,200 No 193 75.1
## 6 2,600 to 5,200 Yes 64 24.9
## 7 20,800 to 28,600 No 117 75.5
## 8 20,800 to 28,600 Yes 38 24.5
## 9 28,600 to 36,400 No 70 88.6
## 10 28,600 to 36,400 Yes 9 11.4
## 11 5,200 to 10,400 No 289 73.0
## 12 5,200 to 10,400 Yes 107 27.0
## 13 Above 36,400 No 74 83.1
## 14 Above 36,400 Yes 15 16.9
## 15 Refused No 87 80.6
## 16 Refused Yes 21 19.4
## 17 Under 2,600 No 97 72.9
## 18 Under 2,600 Yes 36 27.1
## 19 Unknown No 15 83.3
## 20 Unknown Yes 3 16.7
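The analysis reveals several patterns. Men smoke slightly more than women (25.8% versus 24.2%), a small gap. Smoking is least common among participants with a Degree (14.9%) and most common among those whose highest qualification is GCSE/CSE (37.3%) or GCSE/O Level (34.1%); participants with no qualification fall in between at 23.4%, so the relationship with education is not strictly monotonic. Income shows a similarly uneven pattern: the two highest brackets report the lowest smoking rates (11.4% for 28,600 to 36,400 and 16.9% for Above 36,400), while every bracket below 28,600 sits between 23.9% and 31.0%. Smokers are also younger on average than non-smokers (mean age 42.7 versus 52.2). Taken together, these descriptive results suggest that smoking in this sample is associated with lower education and income, although they do not establish cause.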
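Because `gross_income` is stored as text, the income table is sorted alphabetically ("10,400 to 15,600" comes before "2,600 to 5,200"), which makes the income pattern hard to read. A minimal sketch of one way to fix this, assuming the eight bracket labels printed above, is to turn the column into a factor whose levels run from lowest to highest. The counts are copied from the `smoker_byincome` output with "Refused" and "Unknown" left out, and `income_counts` and `income_rates` are illustrative names.

```r
library(dplyr)

# Counts copied from the smoker_byincome output, listed from lowest to highest bracket.
income_counts <- data.frame(
  gross_income = c("Under 2,600", "2,600 to 5,200", "5,200 to 10,400",
                   "10,400 to 15,600", "15,600 to 20,800", "20,800 to 28,600",
                   "28,600 to 36,400", "Above 36,400"),
  no  = c(97, 193, 289, 185, 143, 117, 70, 74),
  yes = c(36, 64, 107, 83, 45, 38, 9, 15)
)

income_rates <- income_counts |>
  mutate(
    # Fix the level order so tables and plots follow income, not the alphabet.
    gross_income = factor(gross_income, levels = gross_income),
    # Share of smokers within each bracket, in percent.
    percent_yes = round(yes / (yes + no) * 100, 1)
  )

income_rates
```

The same `factor(..., levels = ...)` call could be applied to `smoking_data` before `group_by(gross_income, smoke)` so that the original summary prints in income order.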
National STEM Centre. (n.d.). Large Datasets from stats4schools. Retrieved from https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools
OpenIntro. (n.d.). UK Smoking Data (smoking dataset) in the openintro R package.