Project 1

Import dataset

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/Mulut/Desktop/Classes/Data101/projects/project 1")

smoking <- read_csv("smoking.csv")

## Rows: 1691 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): gender, marital_status, highest_qualification, nationality, ethnici...
## dbl (3): age, amt_weekends, amt_weekdays
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Does smoking status (Yes/No) vary by gender, age, education level and income level in the UK?

Introduction

This project explores how demographic and socioeconomic characteristics are related to smoking behavior among adults in the United Kingdom. Smoking remains a leading cause of preventable disease and death, so identifying which groups are more likely to smoke can inform targeted public health interventions. The dataset is the UK Smoking Data (smoking) from OpenIntro.org. It contains 1,691 observations on 12 variables, including each participant’s gender, age, nationality, education, income, and whether they smoke.

Data Analysis

I will examine how gender, age, education level, and income are related to smoking behavior. The variables I will use in this project are:

gender – the participant’s gender (Male or Female)

age – the participant’s age

highest_qualification – Education level (e.g., Degree, A Levels, No Qualification)

gross_income – Income range (e.g., “Under 2,600”, “Above 36,400”)

smoke – Smoking status (Yes or No)

head(smoking)

## # A tibble: 6 × 12
##   gender   age marital_status highest_qualification nationality ethnicity
##   <chr>  <dbl> <chr>          <chr>                 <chr>       <chr>    
## 1 Male      38 Divorced       No Qualification      British     White    
## 2 Female    42 Single         No Qualification      British     White    
## 3 Male      40 Married        Degree                English     White    
## 4 Female    40 Married        Degree                English     White    
## 5 Female    39 Married        GCSE/O Level          British     White    
## 6 Female    37 Married        GCSE/O Level          British     White    
## # ℹ 6 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## #   amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>

str(smoking)

## spc_tbl_ [1,691 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender               : chr [1:1691] "Male" "Female" "Male" "Female" ...
##  $ age                  : num [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
##  $ marital_status       : chr [1:1691] "Divorced" "Single" "Married" "Married" ...
##  $ highest_qualification: chr [1:1691] "No Qualification" "No Qualification" "Degree" "Degree" ...
##  $ nationality          : chr [1:1691] "British" "British" "English" "English" ...
##  $ ethnicity            : chr [1:1691] "White" "White" "White" "White" ...
##  $ gross_income         : chr [1:1691] "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
##  $ region               : chr [1:1691] "The North" "The North" "The North" "The North" ...
##  $ smoke                : chr [1:1691] "No" "Yes" "No" "No" ...
##  $ amt_weekends         : num [1:1691] NA 12 NA NA NA NA 6 NA 8 15 ...
##  $ amt_weekdays         : num [1:1691] NA 12 NA NA NA NA 6 NA 8 12 ...
##  $ type                 : chr [1:1691] NA "Packets" NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   marital_status = col_character(),
##   ..   highest_qualification = col_character(),
##   ..   nationality = col_character(),
##   ..   ethnicity = col_character(),
##   ..   gross_income = col_character(),
##   ..   region = col_character(),
##   ..   smoke = col_character(),
##   ..   amt_weekends = col_double(),
##   ..   amt_weekdays = col_double(),
##   ..   type = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

smoking_data <- smoking |>
  select(gender, age, highest_qualification, gross_income, smoke) 

summary(smoking_data)

##     gender               age        highest_qualification gross_income      
##  Length:1691        Min.   :16.00   Length:1691           Length:1691       
##  Class :character   1st Qu.:34.00   Class :character      Class :character  
##  Mode  :character   Median :48.00   Mode  :character      Mode  :character  
##                     Mean   :49.84                                           
##                     3rd Qu.:65.50                                           
##                     Max.   :97.00                                           
##     smoke          
##  Length:1691       
##  Class :character  
##  Mode  :character  
##                    
##                    
##
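For character columns, summary() reports only length and class, so it says nothing about how many participants fall in each category. A minimal sketch of one workaround, converting character columns to factors before summarising, is shown here on the first four rows printed by head() and str() above. The name first_rows is a hypothetical helper used only for illustration; the report itself would apply the same step to smoking_data.

```r
# First four rows of the selected columns, copied from the printed output above.
first_rows <- data.frame(
  gender = c("Male", "Female", "Male", "Female"),
  age = c(38, 42, 40, 40),
  highest_qualification = c("No Qualification", "No Qualification",
                            "Degree", "Degree"),
  gross_income = c("2,600 to 5,200", "Under 2,600",
                   "28,600 to 36,400", "10,400 to 15,600"),
  smoke = c("No", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are left alone.
is_chr <- vapply(first_rows, is.character, logical(1))
first_rows[is_chr] <- lapply(first_rows[is_chr], factor)

# summary() now prints a count per category instead of Length/Class/Mode.
summary(first_rows)
```

With factors in place, the same summary() call gives a quick check of category sizes before any grouping is done.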

Smoking analysis by age

smoker_byage <- smoking_data |>
  group_by(smoke) |>
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE)
  )


smoker_byage

## # A tibble: 2 × 3
##   smoke mean_age median_age
##   <chr>    <dbl>      <dbl>
## 1 No        52.2         52
## 2 Yes       42.7         40
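Means and medians compare the ages of smokers and non-smokers, but they do not show how the smoking rate changes across the age range. One possible complement is to bin age and compute the share of smokers per band. The sketch below uses a small made-up data frame (toy), so its numbers say nothing about the real survey; the band boundaries are also an assumption, chosen only so that the lowest band starts at the minimum age of 16 reported by summary().

```r
# Hypothetical toy data for illustration only; these are NOT rows of smoking.csv.
toy <- data.frame(
  age   = c(18, 25, 29, 33, 41, 47, 58, 63, 72, 80),
  smoke = c("Yes", "No", "Yes", "Yes", "No", "Yes", "No", "No", "No", "No"),
  stringsAsFactors = FALSE
)

# Assumed age bands; right = FALSE makes each band include its lower bound,
# e.g. [16, 30) is labelled "16-29".
toy$age_band <- cut(
  toy$age,
  breaks = c(16, 30, 45, 60, Inf),
  labels = c("16-29", "30-44", "45-59", "60+"),
  right  = FALSE
)

# Percentage of smokers within each age band.
rate_by_band <- tapply(toy$smoke == "Yes", toy$age_band, mean) * 100
rate_by_band
```

Applied to smoking_data, the same cut() call would add an age_band column that can be grouped on in the same way as gender or education.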

Smoking analysis by gender

smoker_bygender <- smoking_data |>
  group_by(gender, smoke) |>
  summarise(count = n()) |>
  mutate(percent = count / sum(count) * 100)

## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.

smoker_bygender

## # A tibble: 4 × 4
## # Groups:   gender [2]
##   gender smoke count percent
##   <chr>  <chr> <int>   <dbl>
## 1 Female No      731    75.8
## 2 Female Yes     234    24.2
## 3 Male   No      539    74.2
## 4 Male   Yes     187    25.8
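The gap between the two smoking rates is small, so it is worth asking whether it could be due to chance. As a hedged sketch, and not part of the original analysis, the counts from the table above can be placed in a 2 x 2 matrix and passed to chisq.test() from base R. The names gender_tab, smoker_rate, and gender_test are hypothetical.

```r
# Counts copied from the smoker_bygender table above.
gender_tab <- matrix(
  c(731, 234,
    539, 187),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"), smoke = c("No", "Yes"))
)

# Row-wise proportions reproduce the "percent" column for smokers.
smoker_rate <- round(100 * prop.table(gender_tab, margin = 1)[, "Yes"], 1)
smoker_rate

# Pearson chi-squared test of independence between gender and smoking status
# (applies Yates' continuity correction by default for 2 x 2 tables).
gender_test <- chisq.test(gender_tab)
gender_test$p.value
```

A large p-value here would mean the data do not provide evidence that the smoking rate differs between men and women.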

Smoking analysis by education

smoker_byeducation <- smoking_data |>
  group_by(highest_qualification, smoke) |>
  summarise(count = n()) |>
  mutate(percent = count / sum(count) * 100)

## `summarise()` has grouped output by 'highest_qualification'. You can override
## using the `.groups` argument.

smoker_byeducation

## # A tibble: 16 × 4
## # Groups:   highest_qualification [8]
##    highest_qualification smoke count percent
##    <chr>                 <chr> <int>   <dbl>
##  1 A Levels              No       84    80  
##  2 A Levels              Yes      21    20  
##  3 Degree                No      223    85.1
##  4 Degree                Yes      39    14.9
##  5 GCSE/CSE              No       64    62.7
##  6 GCSE/CSE              Yes      38    37.3
##  7 GCSE/O Level          No      203    65.9
##  8 GCSE/O Level          Yes     105    34.1
##  9 Higher/Sub Degree     No       98    78.4
## 10 Higher/Sub Degree     Yes      27    21.6
## 11 No Qualification      No      449    76.6
## 12 No Qualification      Yes     137    23.4
## 13 ONC/BTEC              No       53    69.7
## 14 ONC/BTEC              Yes      23    30.3
## 15 Other/Sub Degree      No       96    75.6
## 16 Other/Sub Degree      Yes      31    24.4
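With sixteen rows, the table is easier to compare when it is reduced to one smoker percentage per qualification and sorted. The sketch below rebuilds that view from the counts printed above; education_counts and percent_yes are hypothetical names, and the report itself could get the same result by filtering smoker_byeducation to smoke == "Yes" and arranging by percent.

```r
# Counts copied from the smoker_byeducation table above.
education_counts <- data.frame(
  highest_qualification = c("A Levels", "Degree", "GCSE/CSE", "GCSE/O Level",
                            "Higher/Sub Degree", "No Qualification",
                            "ONC/BTEC", "Other/Sub Degree"),
  no  = c(84, 223, 64, 203, 98, 449, 53, 96),
  yes = c(21, 39, 38, 105, 27, 137, 23, 31),
  stringsAsFactors = FALSE
)

# Percentage of smokers within each qualification group.
education_counts$percent_yes <- round(
  100 * education_counts$yes / (education_counts$no + education_counts$yes), 1
)

# Sort from the highest to the lowest smoking rate.
education_counts <- education_counts[order(-education_counts$percent_yes), ]
education_counts
```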

Smoking analysis by income

smoker_byincome <- smoking_data |>
  group_by(gross_income, smoke) |>
  summarise(count = n()) |>
  mutate(percent = count / sum(count) * 100)

## `summarise()` has grouped output by 'gross_income'. You can override using the
## `.groups` argument.

smoker_byincome

## # A tibble: 20 × 4
## # Groups:   gross_income [10]
##    gross_income     smoke count percent
##    <chr>            <chr> <int>   <dbl>
##  1 10,400 to 15,600 No      185    69.0
##  2 10,400 to 15,600 Yes      83    31.0
##  3 15,600 to 20,800 No      143    76.1
##  4 15,600 to 20,800 Yes      45    23.9
##  5 2,600 to 5,200   No      193    75.1
##  6 2,600 to 5,200   Yes      64    24.9
##  7 20,800 to 28,600 No      117    75.5
##  8 20,800 to 28,600 Yes      38    24.5
##  9 28,600 to 36,400 No       70    88.6
## 10 28,600 to 36,400 Yes       9    11.4
## 11 5,200 to 10,400  No      289    73.0
## 12 5,200 to 10,400  Yes     107    27.0
## 13 Above 36,400     No       74    83.1
## 14 Above 36,400     Yes      15    16.9
## 15 Refused          No       87    80.6
## 16 Refused          Yes      21    19.4
## 17 Under 2,600      No       97    72.9
## 18 Under 2,600      Yes      36    27.1
## 19 Unknown          No       15    83.3
## 20 Unknown          Yes       3    16.7
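Because gross_income is stored as text, the brackets above are sorted alphabetically ("10,400 to 15,600" comes first), which hides any trend across income. A minimal sketch of one fix is to turn the column into a factor whose levels run from the lowest to the highest bracket. The level order below is inferred from the bracket labels, and placing "Refused" and "Unknown" last is an assumption; income_levels and income_counts are hypothetical names.

```r
# Brackets in ascending order; non-responses are placed last (an assumption).
income_levels <- c(
  "Under 2,600", "2,600 to 5,200", "5,200 to 10,400", "10,400 to 15,600",
  "15,600 to 20,800", "20,800 to 28,600", "28,600 to 36,400", "Above 36,400",
  "Refused", "Unknown"
)

# Counts copied from the smoker_byincome table above.
income_counts <- data.frame(
  gross_income = c("10,400 to 15,600", "15,600 to 20,800", "2,600 to 5,200",
                   "20,800 to 28,600", "28,600 to 36,400", "5,200 to 10,400",
                   "Above 36,400", "Refused", "Under 2,600", "Unknown"),
  no  = c(185, 143, 193, 117, 70, 289, 74, 87, 97, 15),
  yes = c(83, 45, 64, 38, 9, 107, 15, 21, 36, 3),
  stringsAsFactors = FALSE
)

# Apply the explicit level order, compute the smoker percentage, and sort.
income_counts$gross_income <- factor(income_counts$gross_income,
                                     levels = income_levels)
income_counts$percent_yes <- round(
  100 * income_counts$yes / (income_counts$no + income_counts$yes), 1
)
income_counts <- income_counts[order(income_counts$gross_income), ]
income_counts
```

In the report, the same factor() call could be applied to smoking_data$gross_income before grouping, so that the summary table and any later plot follow the bracket order.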

Results

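The analysis reveals several patterns. Men smoke slightly more than women (25.8% vs. 24.2%). Smoking is least common among participants with a degree (14.9%) and most common among those whose highest qualification is GCSE/CSE (37.3%) or GCSE/O Level (34.1%); participants with no qualification fall in between (23.4%), so the rate does not fall steadily as education rises. Income shows a rough inverse relationship with smoking: the two highest brackets report the lowest rates (11.4% for 28,600 to 36,400 and 16.9% for Above 36,400), while the brackets below 28,600 range from 23.9% to 31.0%. Smokers are also younger on average than non-smokers (mean age 42.7 vs. 52.2). These descriptive findings suggest that smoking in the UK is associated with socioeconomic disadvantage.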

References

National STEM Centre. (n.d.). Large Datasets from stats4schools. Retrieved from https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools
OpenIntro. (n.d.). UK Smoking Data (smoking dataset) in the openintro R package.