Introduction

My Research Question Is the true proportion of adults in the United Kingdom who smoke, less than 20%

In this project, we’ll examine the smoking behavior among adults from the United Kingdom, utilizing a data set that contains the observations of 1,691 Individuals. This data set includes information such as Demographics, Gender, Age, Martial Status, Region, Education Level, Ethnicity, Salary, and the most essential part of our research, their smoking behaviors. The variable that is needed here is mainly smoke, which will indicate whether or not the individual smokes (Yes/No). Due to this variable being categorical and represents the smoking status, its best to test in a population proportion.

Lastly, understanding the smoking prevalence is important national health guidelines and government data frequently refer smoking rates as indicators for public health. The key point of interest in this project is whether the proportion of adults who smoke may now be lower than 20%, given the recent health campaigns and efforts to reduce smoking habits. We’ll utilize visualizations, and Single Proportion Test to determine whether the data set provides a statistical proof that the true smoking rate in the United Kingdom is below 20%.

Dataset Details

The data set used contains 1,691 rows and 12 columns. The variables that are relevant to this study will be: “smoke”.

Approach

As mentioned earlier, the perfect analysis for this project will be using a Single Proportion Test. This method will evaluate whether the population of adults who smoke is less than the hypothesized value which 0.20. We’re comparing the one binary variable “smoke” to a benchmark of 20%, due to the variable having only two outcomes which is “Yes” or “No” and the sample is large, therefore the Single Proportion Test is the most appropriate for this project.

\(H_0\): p = 0.20 \(H_a\): p < 0.20

This would be a left-tailed test because we want to determine whether smoking prevalence is lower than 20%.

Data Analysis

Exploring the data set, I started with the simplest cleaning steps and EDA functions. I checked for any missing values using the colsums(is.na()) and made sure that the smoke variable contained no NAs, making it perfect to calculate. Additionally, I also used functions such as head(), summary(), str(), table() to understand the structure of the data set. I calculated the smokers and creating a simplified the data set that just contains only the variables that are relevant to this study. In terms of visualization, I created a bar plot of smoking status to show the amount of individuals smoking versus the ones who don’t smoke, which is prefect because the variable is a categorical one. I also added a histogram to this study for age, providing a simple overview of the demographics.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
smoking_data <- read.csv("smoking.csv")

Dataset Cleaning

#Looking for any missing values (NA's)
colSums(is.na(smoking_data))
##                gender                   age        marital_status 
##                     0                     0                     0 
## highest_qualification           nationality             ethnicity 
##                     0                     0                     0 
##          gross_income                region                 smoke 
##                     0                     0                     0 
##          amt_weekends          amt_weekdays                  type 
##                  1270                  1270                     0
#Making sure the smoke variable has no NA
smoking_data <- filter(smoking_data, !is.na(smoke))

EDA Functions

head(smoking_data)
##   gender age marital_status highest_qualification nationality ethnicity
## 1   Male  38       Divorced      No Qualification     British     White
## 2 Female  42         Single      No Qualification     British     White
## 3   Male  40        Married                Degree     English     White
## 4 Female  40        Married                Degree     English     White
## 5 Female  39        Married          GCSE/O Level     British     White
## 6 Female  37        Married          GCSE/O Level     British     White
##       gross_income    region smoke amt_weekends amt_weekdays    type
## 1   2,600 to 5,200 The North    No           NA           NA        
## 2      Under 2,600 The North   Yes           12           12 Packets
## 3 28,600 to 36,400 The North    No           NA           NA        
## 4 10,400 to 15,600 The North    No           NA           NA        
## 5   2,600 to 5,200 The North    No           NA           NA        
## 6 15,600 to 20,800 The North    No           NA           NA
summary(smoking_data)
##     gender               age        marital_status     highest_qualification
##  Length:1691        Min.   :16.00   Length:1691        Length:1691          
##  Class :character   1st Qu.:34.00   Class :character   Class :character     
##  Mode  :character   Median :48.00   Mode  :character   Mode  :character     
##                     Mean   :49.84                                           
##                     3rd Qu.:65.50                                           
##                     Max.   :97.00                                           
##                                                                             
##  nationality         ethnicity         gross_income          region         
##  Length:1691        Length:1691        Length:1691        Length:1691       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     smoke            amt_weekends    amt_weekdays       type          
##  Length:1691        Min.   : 0.00   Min.   : 0.00   Length:1691       
##  Class :character   1st Qu.:10.00   1st Qu.: 7.00   Class :character  
##  Mode  :character   Median :15.00   Median :12.00   Mode  :character  
##                     Mean   :16.41   Mean   :13.75                     
##                     3rd Qu.:20.00   3rd Qu.:20.00                     
##                     Max.   :60.00   Max.   :55.00                     
##                     NA's   :1270    NA's   :1270
str(smoking_data)
## 'data.frame':    1691 obs. of  12 variables:
##  $ gender               : chr  "Male" "Female" "Male" "Female" ...
##  $ age                  : int  38 42 40 40 39 37 53 44 40 41 ...
##  $ marital_status       : chr  "Divorced" "Single" "Married" "Married" ...
##  $ highest_qualification: chr  "No Qualification" "No Qualification" "Degree" "Degree" ...
##  $ nationality          : chr  "British" "British" "English" "English" ...
##  $ ethnicity            : chr  "White" "White" "White" "White" ...
##  $ gross_income         : chr  "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
##  $ region               : chr  "The North" "The North" "The North" "The North" ...
##  $ smoke                : chr  "No" "Yes" "No" "No" ...
##  $ amt_weekends         : int  NA 12 NA NA NA NA 6 NA 8 15 ...
##  $ amt_weekdays         : int  NA 12 NA NA NA NA 6 NA 8 12 ...
##  $ type                 : chr  "" "Packets" "" "" ...
dim(smoking_data)
## [1] 1691   12
table(smoking_data$smoke)
## 
##   No  Yes 
## 1270  421
#Key variables
selected <- select(smoking_data, smoke, age)

#Counting the amount who smoke
table(smoking_data$smoke)
## 
##   No  Yes 
## 1270  421

Visualization: Bar Plot for Smoking Status

barplot(table(smoking_data$smoke), main = "Smoking Status of Individuals in the UK", xlab = "Responses", ylab = "Count")

Statistical Analysis

For the statistical analysis, I used the single proportion test to determine whether the proportion of individuals in the United Kingdom who smoke is less than 20%. As mentioned before the variable “smoke” was essential as it was a categorical one. I calculated the sample proportion of smokers and found out that 421 out of the 1,691 individuals stated that they smoke. This gives the sample proportion to be approx. (24.9%). I later conducted the hypothesis test using the prop.test() function, specifying count of smokers, the total size of the sample, and the hypothesized proportion of 0.20. Because the sample proportion is higher than 0.20, its indicating that there is no statistical proof to conclude that the true proportion of smokers in the United Kingdom is less than 20%, therefore the test results addresses the research question and indicates that must fail to reject the null hypothesis.

sample_proportion <- mean(smoking_data$smoke == "Yes")
sample_proportion
## [1] 0.2489651

Single Proportion Test

prop.test(421, 1691, 0.20, alternative = "less")
## 
##  1-sample proportions test with continuity correction
## 
## data:  421 out of 1691, null probability 0.2
## X-squared = 25.034, df = 1, p-value = 1
## alternative hypothesis: true p is less than 0.2
## 95 percent confidence interval:
##  0.0000000 0.2669554
## sample estimates:
##         p 
## 0.2489651

Conclusion

Based on the results of this single proportion test, there is no statistical proof to support the claim that the true proportion of adult smokers in the United Kingdom is less than 20%. The sample proportion of smokers in the dataset is approx. 24.9% which is higher than the expected hypothesized value of 20%. This indicates that the data does not support the idea that smoking prevalence is lower than 20%. Instead, the sample suggests the smoking rate may actually be higher than the expected. Although the data provided a useful insight into smoking behaviors, additional data and research is very much needed with an updated and representative sample that may offer a much stronger conclusion about the United Kingdom’s national smoking rate.

References https://www.openintro.org/data/index.php?data=smoking