Project 1

### Import dataset

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/Mulut/Desktop/Classes/Data101/projects/project 1")

smoking <- read_csv("smoking.csv")

## Rows: 1691 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): gender, marital_status, highest_qualification, nationality, ethnici...
## dbl (3): age, amt_weekends, amt_weekdays
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Does smoking status (Yes/No) vary by gender, age, education level and income level in the UK?

Introduction

This project explores how demographic and socioeconomic characteristics relate to smoking behavior among adults in the United Kingdom. Smoking remains a leading cause of preventable disease and death, so identifying which groups are more likely to smoke can inform targeted public health interventions. The dataset is the UK Smoking Data (smoking) from OpenIntro.org. It contains 1,691 observations on 12 variables, including each participant's gender, age, nationality, education, income, and smoking status.

Data Analysis

I start by exploring and cleaning the dataset to understand its structure. First, I view the top rows with head() and check the data types with str(). Then I select only the five variables I need for this project. I use summary() to find basic statistics for age, such as the mean, median, and maximum. Finally, I group and summarize the data to explore smoking patterns by age, gender, education, and income, showing the mean and median age of smokers and non-smokers and the counts and percentages for each of the other groups. The variables I will use in this project are:

gender – Whether the participant is Male or Female

age – The participant’s age

highest_qualification – Education level (e.g., Degree, A Levels, No Qualification)

gross_income – Income range (e.g., “Under 2,600”, “Above 36,400”)

smoke – Smoking status (Yes or No)

head(smoking)

## # A tibble: 6 × 12
##   gender   age marital_status highest_qualification nationality ethnicity
##   <chr>  <dbl> <chr>          <chr>                 <chr>       <chr>    
## 1 Male      38 Divorced       No Qualification      British     White    
## 2 Female    42 Single         No Qualification      British     White    
## 3 Male      40 Married        Degree                English     White    
## 4 Female    40 Married        Degree                English     White    
## 5 Female    39 Married        GCSE/O Level          British     White    
## 6 Female    37 Married        GCSE/O Level          British     White    
## # ℹ 6 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## #   amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>

str(smoking)

## spc_tbl_ [1,691 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender               : chr [1:1691] "Male" "Female" "Male" "Female" ...
##  $ age                  : num [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
##  $ marital_status       : chr [1:1691] "Divorced" "Single" "Married" "Married" ...
##  $ highest_qualification: chr [1:1691] "No Qualification" "No Qualification" "Degree" "Degree" ...
##  $ nationality          : chr [1:1691] "British" "British" "English" "English" ...
##  $ ethnicity            : chr [1:1691] "White" "White" "White" "White" ...
##  $ gross_income         : chr [1:1691] "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
##  $ region               : chr [1:1691] "The North" "The North" "The North" "The North" ...
##  $ smoke                : chr [1:1691] "No" "Yes" "No" "No" ...
##  $ amt_weekends         : num [1:1691] NA 12 NA NA NA NA 6 NA 8 15 ...
##  $ amt_weekdays         : num [1:1691] NA 12 NA NA NA NA 6 NA 8 12 ...
##  $ type                 : chr [1:1691] NA "Packets" NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   marital_status = col_character(),
##   ..   highest_qualification = col_character(),
##   ..   nationality = col_character(),
##   ..   ethnicity = col_character(),
##   ..   gross_income = col_character(),
##   ..   region = col_character(),
##   ..   smoke = col_character(),
##   ..   amt_weekends = col_double(),
##   ..   amt_weekdays = col_double(),
##   ..   type = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

smoking_data <- smoking |>
  select(gender, age, highest_qualification, gross_income, smoke) 

summary(smoking_data$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   34.00   48.00   49.84   65.50   97.00

Smoking analysis by age

smoker_byage <- smoking_data |>
  group_by(smoke) |>
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE)
  )


smoker_byage

## # A tibble: 2 × 3
##   smoke mean_age median_age
##   <chr>    <dbl>      <dbl>
## 1 No        52.2         52
## 2 Yes       42.7         40
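The table above compares the average age of smokers and non-smokers. A complementary view is the smoking rate within age bands. The following is a minimal sketch of that idea on made-up toy data; the `toy` tibble, the band boundaries, and the `age_band` and `percent_smokers` names are assumptions chosen for illustration, not part of the original analysis.

```r
library(dplyr)

# Toy data standing in for smoking_data (values are made up for illustration)
toy <- tibble(
  age   = c(18, 25, 33, 41, 47, 58, 66, 72, 80, 90),
  smoke = c("Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No")
)

rate_by_age <- toy |>
  # Left-closed bands: [16, 35), [35, 50), [50, 65), [65, Inf)
  mutate(age_band = cut(age,
                        breaks = c(16, 35, 50, 65, Inf),
                        right = FALSE,
                        labels = c("16-34", "35-49", "50-64", "65+"))) |>
  group_by(age_band) |>
  summarise(
    n = n(),                                     # participants per band
    percent_smokers = mean(smoke == "Yes") * 100 # share of smokers per band
  )

rate_by_age
```

Replacing `toy` with `smoking_data` would apply the same banding to the real data; the bands themselves are a judgment call and could be drawn differently.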

Smoking analysis by gender

smoker_bygender <- smoking_data |>
  group_by(gender, smoke) |>
  summarise(count = n()) |>
  mutate(percent = count / sum(count) * 100)

## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.

smoker_bygender

## # A tibble: 4 × 4
## # Groups:   gender [2]
##   gender smoke count percent
##   <chr>  <chr> <int>   <dbl>
## 1 Female No      731    75.8
## 2 Female Yes     234    24.2
## 3 Male   No      539    74.2
## 4 Male   Yes     187    25.8
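The percentages above are computed within each gender, not across the whole sample. After summarise(), the last grouping variable (smoke) is dropped, so the result is still grouped by gender and sum(count) totals each gender separately. As a small check, the sketch below rebuilds the percentages from the printed counts; the `gender_counts` and `gender_pct` names are introduced here only for illustration.

```r
library(dplyr)

# Counts copied from the table above
gender_counts <- tibble(
  gender = c("Female", "Female", "Male", "Male"),
  smoke  = c("No", "Yes", "No", "Yes"),
  count  = c(731, 234, 539, 187)
)

gender_pct <- gender_counts |>
  group_by(gender) |>                            # the grouping summarise() leaves behind
  mutate(percent = count / sum(count) * 100) |>  # divide by the per-gender total
  ungroup()

gender_pct
```

For example, the female smoker share is 234 / (731 + 234) * 100, which rounds to 24.2.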

Smoking analysis by education

# Collapse the qualification categories into three broader education groups
smoking_data <- smoking_data |>
  mutate(education_group = case_when(
    highest_qualification %in% c("No Qualification", "GCSE/CSE", "GCSE/O Level") ~ "Low level Education",
    highest_qualification %in% c("A Levels", "Higher/Sub Degree") ~ "Mid Level Education",
    # Catch-all: every remaining value, including "Degree", lands here
    TRUE ~ "Higher or Other Education"
  ))

smoker_byeducation <- smoking_data |>
  group_by(education_group, smoke) |>
  summarise(count = n()) |>
  mutate(percent = count / sum(count) * 100)

## `summarise()` has grouped output by 'education_group'. You can override using
## the `.groups` argument.

smoker_byeducation

## # A tibble: 6 × 4
## # Groups:   education_group [3]
##   education_group           smoke count percent
##   <chr>                     <chr> <int>   <dbl>
## 1 Higher or Other Education No      372    80  
## 2 Higher or Other Education Yes      93    20  
## 3 Low level Education       No      716    71.9
## 4 Low level Education       Yes     280    28.1
## 5 Mid Level Education       No      182    79.1
## 6 Mid Level Education       Yes      48    20.9
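One thing to keep in mind with a `TRUE ~` catch-all in case_when() is that it also captures missing values, because `NA %in% c(...)` evaluates to FALSE. If highest_qualification ever contained NA, those rows would be counted as "Higher or Other Education". The sketch below shows one possible guard on a toy vector; the `quals` tibble is hypothetical, and whether the real column contains any missing values is not checked here.

```r
library(dplyr)

# Toy values for illustration only, not rows from the dataset
quals <- tibble(
  highest_qualification = c("No Qualification", "A Levels", "Degree", NA)
)

grouped <- quals |>
  mutate(
    education_group = case_when(
      is.na(highest_qualification) ~ NA_character_,  # keep missing values missing
      highest_qualification %in% c("No Qualification", "GCSE/CSE", "GCSE/O Level") ~ "Low level Education",
      highest_qualification %in% c("A Levels", "Higher/Sub Degree") ~ "Mid Level Education",
      TRUE ~ "Higher or Other Education"             # everything else that is not NA
    )
  )

grouped
```

Putting the is.na() condition first matters, since case_when() uses the first condition that is TRUE for each row.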

Smoking analysis by income

smoker_byincome <- smoking_data |>
  group_by(gross_income, smoke) |>
  summarise(count = n()) |>
  mutate(percent = count / sum(count) * 100)

## `summarise()` has grouped output by 'gross_income'. You can override using the
## `.groups` argument.

smoker_byincome

## # A tibble: 20 × 4
## # Groups:   gross_income [10]
##    gross_income     smoke count percent
##    <chr>            <chr> <int>   <dbl>
##  1 10,400 to 15,600 No      185    69.0
##  2 10,400 to 15,600 Yes      83    31.0
##  3 15,600 to 20,800 No      143    76.1
##  4 15,600 to 20,800 Yes      45    23.9
##  5 2,600 to 5,200   No      193    75.1
##  6 2,600 to 5,200   Yes      64    24.9
##  7 20,800 to 28,600 No      117    75.5
##  8 20,800 to 28,600 Yes      38    24.5
##  9 28,600 to 36,400 No       70    88.6
## 10 28,600 to 36,400 Yes       9    11.4
## 11 5,200 to 10,400  No      289    73.0
## 12 5,200 to 10,400  Yes     107    27.0
## 13 Above 36,400     No       74    83.1
## 14 Above 36,400     Yes      15    16.9
## 15 Refused          No       87    80.6
## 16 Refused          Yes      21    19.4
## 17 Under 2,600      No       97    72.9
## 18 Under 2,600      Yes      36    27.1
## 19 Unknown          No       15    83.3
## 20 Unknown          Yes       3    16.7
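Because gross_income is stored as text, the bands above are sorted alphabetically ("10,400 to 15,600" comes before "2,600 to 5,200"), which makes the trend harder to read. One way to handle this, sketched below on toy rows, is to convert the column to a factor with the levels listed from lowest to highest. The `income_levels` vector uses the band labels from the output above; placing "Refused" and "Unknown" last is an arbitrary choice, and the `toy_income` rows are made up for illustration.

```r
library(dplyr)

# Band labels from the output above, ordered from lowest to highest;
# "Refused" and "Unknown" are placed last by choice
income_levels <- c(
  "Under 2,600", "2,600 to 5,200", "5,200 to 10,400", "10,400 to 15,600",
  "15,600 to 20,800", "20,800 to 28,600", "28,600 to 36,400", "Above 36,400",
  "Refused", "Unknown"
)

# Toy rows for illustration only
toy_income <- tibble(
  gross_income = c("10,400 to 15,600", "Under 2,600", "Above 36,400", "2,600 to 5,200"),
  smoke        = c("Yes", "No", "No", "Yes")
)

ordered_income <- toy_income |>
  mutate(gross_income = factor(gross_income, levels = income_levels)) |>
  arrange(gross_income)  # factors sort by level order, not alphabetically

ordered_income
```

Applying the same mutate() step to smoking_data before group_by(gross_income, smoke) would print the summary table in income order.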

Results

The analysis shows several patterns in smoking behavior in this sample. Men smoke slightly more often than women (25.8% versus 24.2%), a difference of under two percentage points. Participants in the low education group are more likely to smoke (28.1%) than those in the mid-level group (20.9%) or the higher or other education group (20.0%), which includes degree holders. Income shows a less even pattern: smoking rates range from about 24% to 31% in every band up to 28,600, then fall to 11.4% in the 28,600 to 36,400 band and 16.9% above 36,400. In terms of age, smokers are younger on average than non-smokers (mean age 42.7 versus 52.2; median 40 versus 52).

Overall, this analysis suggests that smoking in the UK is associated with socioeconomic disadvantage: the low education group and most lower income bands have higher smoking prevalence than the highest education and income groups. Future research could include additional factors such as region, occupation, or ethnicity to explore cultural and geographic differences. More recent data could also help determine whether public health initiatives have reduced smoking rates among lower income and less educated populations.

References

National STEM Centre. (n.d.). Large Datasets from stats4schools. Retrieved from https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools

OpenIntro. (n.d.). UK Smoking Data (smoking dataset) in the openintro R package.