My Research Question Is the true proportion of adults in the United Kingdom who smoke, less than 20%
In this project, we’ll examine the smoking behavior among adults from the United Kingdom, utilizing a data set that contains the observations of 1,691 Individuals. This data set includes information such as Demographics, Gender, Age, Martial Status, Region, Education Level, Ethnicity, Salary, and the most essential part of our research, their smoking behaviors. The variable that is needed here is mainly smoke, which will indicate whether or not the individual smokes (Yes/No). Due to this variable being categorical and represents the smoking status, its best to test in a population proportion.
Lastly, understanding the smoking prevalence is important national health guidelines and government data frequently refer smoking rates as indicators for public health. The key point of interest in this project is whether the proportion of adults who smoke may now be lower than 20%, given the recent health campaigns and efforts to reduce smoking habits. We’ll utilize visualizations, and Single Proportion Test to determine whether the data set provides a statistical proof that the true smoking rate in the United Kingdom is below 20%.
The data set used contains 1,691 rows and 12 columns. The variables that are relevant to this study will be: “smoke”.
As mentioned earlier, the perfect analysis for this project will be using a Single Proportion Test. This method will evaluate whether the population of adults who smoke is less than the hypothesized value which 0.20. We’re comparing the one binary variable “smoke” to a benchmark of 20%, due to the variable having only two outcomes which is “Yes” or “No” and the sample is large, therefore the Single Proportion Test is the most appropriate for this project.
\(H_0\): p = 0.20 \(H_a\): p < 0.20
This would be a left-tailed test because we want to determine whether smoking prevalence is lower than 20%.
Exploring the data set, I started with the simplest cleaning steps and EDA functions. I checked for any missing values using the colsums(is.na()) and made sure that the smoke variable contained no NAs, making it perfect to calculate. Additionally, I also used functions such as head(), summary(), str(), table() to understand the structure of the data set. I calculated the smokers and creating a simplified the data set that just contains only the variables that are relevant to this study. In terms of visualization, I created a bar plot of smoking status to show the amount of individuals smoking versus the ones who don’t smoke, which is prefect because the variable is a categorical one. I also added a histogram to this study for age, providing a simple overview of the demographics.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
smoking_data <- read.csv("smoking.csv")
#Looking for any missing values (NA's)
colSums(is.na(smoking_data))
## gender age marital_status
## 0 0 0
## highest_qualification nationality ethnicity
## 0 0 0
## gross_income region smoke
## 0 0 0
## amt_weekends amt_weekdays type
## 1270 1270 0
#Making sure the smoke variable has no NA
smoking_data <- filter(smoking_data, !is.na(smoke))
head(smoking_data)
## gender age marital_status highest_qualification nationality ethnicity
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## gross_income region smoke amt_weekends amt_weekdays type
## 1 2,600 to 5,200 The North No NA NA
## 2 Under 2,600 The North Yes 12 12 Packets
## 3 28,600 to 36,400 The North No NA NA
## 4 10,400 to 15,600 The North No NA NA
## 5 2,600 to 5,200 The North No NA NA
## 6 15,600 to 20,800 The North No NA NA
summary(smoking_data)
## gender age marital_status highest_qualification
## Length:1691 Min. :16.00 Length:1691 Length:1691
## Class :character 1st Qu.:34.00 Class :character Class :character
## Mode :character Median :48.00 Mode :character Mode :character
## Mean :49.84
## 3rd Qu.:65.50
## Max. :97.00
##
## nationality ethnicity gross_income region
## Length:1691 Length:1691 Length:1691 Length:1691
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## smoke amt_weekends amt_weekdays type
## Length:1691 Min. : 0.00 Min. : 0.00 Length:1691
## Class :character 1st Qu.:10.00 1st Qu.: 7.00 Class :character
## Mode :character Median :15.00 Median :12.00 Mode :character
## Mean :16.41 Mean :13.75
## 3rd Qu.:20.00 3rd Qu.:20.00
## Max. :60.00 Max. :55.00
## NA's :1270 NA's :1270
str(smoking_data)
## 'data.frame': 1691 obs. of 12 variables:
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : int 38 42 40 40 39 37 53 44 40 41 ...
## $ marital_status : chr "Divorced" "Single" "Married" "Married" ...
## $ highest_qualification: chr "No Qualification" "No Qualification" "Degree" "Degree" ...
## $ nationality : chr "British" "British" "English" "English" ...
## $ ethnicity : chr "White" "White" "White" "White" ...
## $ gross_income : chr "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
## $ region : chr "The North" "The North" "The North" "The North" ...
## $ smoke : chr "No" "Yes" "No" "No" ...
## $ amt_weekends : int NA 12 NA NA NA NA 6 NA 8 15 ...
## $ amt_weekdays : int NA 12 NA NA NA NA 6 NA 8 12 ...
## $ type : chr "" "Packets" "" "" ...
dim(smoking_data)
## [1] 1691 12
table(smoking_data$smoke)
##
## No Yes
## 1270 421
#Key variables
selected <- select(smoking_data, smoke, age)
#Counting the amount who smoke
table(smoking_data$smoke)
##
## No Yes
## 1270 421
barplot(table(smoking_data$smoke), main = "Smoking Status of Individuals in the UK", xlab = "Responses", ylab = "Count")
For the statistical analysis, I used the single proportion test to determine whether the proportion of individuals in the United Kingdom who smoke is less than 20%. As mentioned before the variable “smoke” was essential as it was a categorical one. I calculated the sample proportion of smokers and found out that 421 out of the 1,691 individuals stated that they smoke. This gives the sample proportion to be approx. (24.9%). I later conducted the hypothesis test using the prop.test() function, specifying count of smokers, the total size of the sample, and the hypothesized proportion of 0.20. Because the sample proportion is higher than 0.20, its indicating that there is no statistical proof to conclude that the true proportion of smokers in the United Kingdom is less than 20%, therefore the test results addresses the research question and indicates that must fail to reject the null hypothesis.
sample_proportion <- mean(smoking_data$smoke == "Yes")
sample_proportion
## [1] 0.2489651
prop.test(421, 1691, 0.20, alternative = "less")
##
## 1-sample proportions test with continuity correction
##
## data: 421 out of 1691, null probability 0.2
## X-squared = 25.034, df = 1, p-value = 1
## alternative hypothesis: true p is less than 0.2
## 95 percent confidence interval:
## 0.0000000 0.2669554
## sample estimates:
## p
## 0.2489651
Based on the results of this single proportion test, there is no statistical proof to support the claim that the true proportion of adult smokers in the United Kingdom is less than 20%. The sample proportion of smokers in the dataset is approx. 24.9% which is higher than the expected hypothesized value of 20%. This indicates that the data does not support the idea that smoking prevalence is lower than 20%. Instead, the sample suggests the smoking rate may actually be higher than the expected. Although the data provided a useful insight into smoking behaviors, additional data and research is very much needed with an updated and representative sample that may offer a much stronger conclusion about the United Kingdom’s national smoking rate.
References https://www.openintro.org/data/index.php?data=smoking