library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.9     ✔ recipes      1.3.1
## ✔ dials        1.4.2     ✔ rsample      1.3.1
## ✔ dplyr        1.1.4     ✔ tailor       0.1.0
## ✔ ggplot2      4.0.0     ✔ tidyr        1.3.1
## ✔ infer        1.0.9     ✔ tune         2.0.0
## ✔ modeldata    1.5.1     ✔ workflows    1.3.0
## ✔ parsnip      1.3.3     ✔ workflowsets 1.1.1
## ✔ purrr        1.1.0     ✔ yardstick    1.3.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
setwd("C:/Users/sarah/OneDrive/Desktop/Data 101 - Data 110")

smoking <- read.csv("smoking.csv")

Introduction

Research Question: Do males have a significantly higher proportion of smokers than females?

The smoking data come from stats4schools, a United Kingdom organization, and contain survey responses from 1,691 people, giving 12 variables and 1,691 observations. The variables I will focus on are “gender” and “smoke”. The “gender” variable shows the sex of the case, either male or female. The “smoke” variable shows the smoking status of the case, either “Yes” or “No.”

In this project, I will create visualizations showing the counts and proportions of male and female smokers. Then I will check whether the observed difference (if any) is statistically significant using a one-sided hypothesis test, specifically the difference of proportions test.

This dataset can be accessed on the OpenIntro repository (https://www.openintro.org/data/index.php?data=smoking) or through the National STEM Centre of the UK (https://www.stem.org.uk/resources/library/resource/28452/large-datasets-stats4schools).

Data Analysis

In this section, I will check for N/A values, check the structure and dimensions of the dataset, and closely examine my selected variables. I will then create two smaller data frames from the main dataset for the graphs and the statistical test. Finally, I will visualize male and female smoking data with a bar graph and a proportional stacked bar graph.

Cleaning and EDA

Check for N/A values.

colSums(is.na(smoking))
##                gender                   age        marital_status 
##                     0                     0                     0 
## highest_qualification           nationality             ethnicity 
##                     0                     0                     0 
##          gross_income                region                 smoke 
##                     0                     0                     0 
##          amt_weekends          amt_weekdays                  type 
##                  1270                  1270                     0

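No N/A values for the selected variables (gender and smoke). The only missing values are in amt_weekends and amt_weekdays, which are not used in this analysis.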

Check the dimensions of the dataset

dim(smoking)
## [1] 1691   12

12 variables, 1691 observations.

Check the structure of the dataset

str(smoking)
## 'data.frame':    1691 obs. of  12 variables:
##  $ gender               : chr  "Male" "Female" "Male" "Female" ...
##  $ age                  : int  38 42 40 40 39 37 53 44 40 41 ...
##  $ marital_status       : chr  "Divorced" "Single" "Married" "Married" ...
##  $ highest_qualification: chr  "No Qualification" "No Qualification" "Degree" "Degree" ...
##  $ nationality          : chr  "British" "British" "English" "English" ...
##  $ ethnicity            : chr  "White" "White" "White" "White" ...
##  $ gross_income         : chr  "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
##  $ region               : chr  "The North" "The North" "The North" "The North" ...
##  $ smoke                : chr  "No" "Yes" "No" "No" ...
##  $ amt_weekends         : int  NA 12 NA NA NA NA 6 NA 8 15 ...
##  $ amt_weekdays         : int  NA 12 NA NA NA NA 6 NA 8 12 ...
##  $ type                 : chr  "" "Packets" "" "" ...

Gender - Character

Smoke - Character

Check individual values for the gender variable

unique(smoking$gender)
## [1] "Male"   "Female"

Check individual values for the smoke variable

unique(smoking$smoke)
## [1] "No"  "Yes"

Create Data Frames for Visualization and Statistical Analysis

Create a new data frame, select the two necessary variables, group by the gender and count the number of observations by the smoke variable. Create a new variable called “total” and calculate the sum of the counts by their group (male and female).

# For Visualization
smoking2 <- smoking |>
  select("gender", "smoke") |>
  group_by(gender) |>
  count(smoke) |>
  mutate(total = sum(n))
smoking2
## # A tibble: 4 × 4
## # Groups:   gender [2]
##   gender smoke     n total
##   <chr>  <chr> <int> <int>
## 1 Female No      731   965
## 2 Female Yes     234   965
## 3 Male   No      539   726
## 4 Male   Yes     187   726

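Create a new data frame from the previous one and filter just for those who smoke. Arrange counts in ascending order, which places the male row first. The order matters later: prop.test() treats the first row as group 1, so the “greater” alternative tests whether the male proportion exceeds the female proportion.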

# For Statistical Analysis
smoking3 <- smoking2 |>
  filter(smoke == "Yes") |>
  arrange(n)
smoking3
## # A tibble: 2 × 4
## # Groups:   gender [2]
##   gender smoke     n total
##   <chr>  <chr> <int> <int>
## 1 Male   Yes     187   726
## 2 Female Yes     234   965
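
Sorting by n happens to put Male first for this dataset, but that would change if the counts changed. The following is a minimal sketch of one way to state the group order explicitly instead. The counts are copied from the smoking2 output above, and the object names (smokers, totals, group_order) are hypothetical, not part of the original analysis.

```r
# Counts copied from the smoking2 output above (smokers and group totals).
smokers <- c(Female = 234, Male = 187)
totals  <- c(Female = 965, Male = 726)

# Put Male first so that alternative = "greater" means p_male > p_female.
group_order <- c("Male", "Female")

result <- prop.test(smokers[group_order], totals[group_order],
                    alternative = "greater")

# prop 1 is now Male and prop 2 is Female, regardless of the counts.
result$estimate
```

Indexing by name keeps the direction of the one-sided test tied to the research question rather than to whichever group has the smaller count.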

Visualizations

Create a bar graph of male and female smoking data.

ggplot(smoking2, aes(x = gender, y = n, fill = smoke)) + 
  geom_col(color = "black") + 
  labs(title = "Bar Graph of Smoking Status by Gender in the United Kingdom", x = "Gender", y = "Count", caption = "Source: stats4schools", fill = "Smoking Status") 

This bar graph shows counts of smokers and non-smokers for each gender. There are more female cases in the dataset (965 versus 726) and also more female smokers (234 versus 187). Read on its own, the graph could suggest that women are more likely to smoke than men. However, the larger number of female smokers mainly reflects the larger number of female respondents, so counts alone cannot tell us which gender has the higher smoking rate.

Create a proportional stacked bar graph of male and female smoking data using position = “fill”.

ggplot(smoking2, aes(x = gender, y = n, fill = smoke)) + 
  geom_col(color = "black", position = "fill") + # position = "fill" rescales each bar to a total height of 1 (see References)
  labs(title = "Proportional Stacked Bar Graph of Smoking Status by Gender", x = "Gender", y = "Proportion", caption = "Source: stats4schools", fill = "Smoking Status") 

This proportional stacked bar graph shows the proportion on the y-axis (each bar sums to 1, or 100%) and gender on the x-axis. Each bar is split by smoking status. We can see that the proportion of male smokers is slightly higher than the proportion of female smokers. However, is the observed difference in proportions significant? We will use a difference of proportions test to figure this out.

Comparing this visualization to the previous one, if we’re trying to see the proportion of male and female smokers, then we should use the proportional stacked bar graph. If we’re just trying to see the counts, then we should use a regular bar graph. I’ve included the regular bar graph to show how a count-based graph can mislead when the groups differ in size.
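
To put numbers on what the proportional graph displays, here is a short sketch that computes each group's share directly. It uses base R and the four counts copied from the smoking2 output above; the object names (counts, smoker_props) are hypothetical.

```r
# Counts copied from the smoking2 output above.
counts <- data.frame(
  gender = c("Female", "Female", "Male", "Male"),
  smoke  = c("No", "Yes", "No", "Yes"),
  n      = c(731, 234, 539, 187)
)

# Add the group total to each row, then the share of that group.
counts$total <- ave(counts$n, counts$gender, FUN = sum)
counts$prop  <- counts$n / counts$total

# Smokers only: about 24.2% of females and 25.8% of males.
smoker_props <- counts[counts$smoke == "Yes", c("gender", "prop")]
print(smoker_props)
```

These are the same two values that appear as the bar segments for “Yes” in the proportional graph.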

Statistical Analysis

Hypothesis

\(H_0: p_m = p_f\)

\(H_a: p_m > p_f\)

Where \(p_m\) is the proportion of males that smoke and \(p_f\) is the proportion of females that smoke.

Basic Assumptions (counts)

xtabs(~ gender + smoke, data=smoking)
##         smoke
## gender    No Yes
##   Female 731 234
##   Male   539 187

Every cell of the table (each combination of gender and smoking status) has a count well above 5, so the basic sample-size assumption for the proportions test is met.
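
Some textbooks state this condition in terms of expected counts under the null hypothesis rather than observed counts. The sketch below checks that version using the pooled proportion. The threshold of 10 is an assumption (other sources use 5), and the object names are hypothetical.

```r
# Counts copied from the xtabs output above.
smokers <- c(Male = 187, Female = 234)
totals  <- c(Male = 726, Female = 965)

# Under H0 both groups share one smoking rate: the pooled proportion.
p_pooled <- sum(smokers) / sum(totals)

# Expected smokers and non-smokers in each group if H0 were true.
expected <- rbind(
  Yes = totals * p_pooled,
  No  = totals * (1 - p_pooled)
)
print(round(expected, 1))

threshold <- 10  # assumed cutoff; some texts use 5
all(expected >= threshold)
```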

Difference of Proportions Test

prop.test(smoking3$n, smoking3$total, alternative = "greater")
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  smoking3$n out of smoking3$total
## X-squared = 0.42699, df = 1, p-value = 0.2567
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.02115591  1.00000000
## sample estimates:
##    prop 1    prop 2 
## 0.2575758 0.2424870

The p-value is 0.2567, which is above the significance level of α = 0.05. We therefore fail to reject the null hypothesis; there is insufficient evidence that the proportion of male smokers is greater than the proportion of female smokers.

Furthermore, the one-sided 95 percent confidence interval for the difference in proportions (males minus females) runs from -0.021 to 1.00 and includes 0. A difference of zero is therefore plausible, which agrees with the decision not to reject the null hypothesis.
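
To show what prop.test() is doing, here is a sketch of the pooled two-proportion z-test computed by hand. It leaves out the continuity correction, so its statistic and p-value differ slightly from the output above; it should instead agree with prop.test() called with correct = FALSE. The variable names are hypothetical.

```r
# Smokers and group sizes, Male first and Female second.
x <- c(187, 234)
n <- c(726, 965)

p_hat    <- x / n            # sample proportions
p_pooled <- sum(x) / sum(n)  # common proportion under H0

# Standard error of the difference under H0, then the z statistic.
se <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))
z  <- (p_hat[1] - p_hat[2]) / se

# One-sided p-value for H_a: p_m > p_f.
p_value <- pnorm(z, lower.tail = FALSE)

# Compare with prop.test() without the continuity correction.
check <- prop.test(x, n, alternative = "greater", correct = FALSE)
c(z_squared = z^2, x_squared = unname(check$statistic))
c(by_hand = p_value, prop_test = check$p.value)
```

The X-squared value reported by prop.test() is the square of this z statistic, which is why the two approaches give the same one-sided p-value.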

Conclusion

From the visualizations, we can see differences between male and female smokers. The regular bar graph displays counts and shows more female smokers than male smokers. The proportional stacked bar graph presents proportions instead, and it shows a slightly higher proportion of smokers among males than among females. When comparing smoking between the two genders, the proportional stacked bar graph is the better choice, because proportions account for the different group sizes.

Even though the proportional stacked bar graph shows a slightly larger proportion of male smokers than female smokers, is this difference actually significant? From the difference of proportions test, we conclude, with a p-value of 0.2567 (greater than 0.05), that there is insufficient evidence that the proportion of male smokers is greater than the proportion of female smokers. Additionally, the 95% confidence interval for the difference includes zero, so the data are consistent with no difference between the two proportions.

If the observations in this dataset are independent, the sampling method was random, and there were no biases when collecting responses, then these findings can be generalized to the population of the United Kingdom. Under those conditions, we conclude that there is insufficient evidence that the proportion of male smokers is greater than the proportion of female smokers.

Future research can look at which factors are associated with smoking, for example income, nationality, ethnicity, gender, and age. A logistic regression model could estimate which variables are most strongly associated with smoking status, although it cannot by itself show that they cause people to smoke.
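
As a possible starting point for that follow-up, here is a minimal sketch of a logistic regression with gender as the only predictor. It is fitted to individual-level rows rebuilt from the four counts in the tables above rather than to smoking.csv, so that it runs on its own. A fuller analysis would add the other predictors from the dataset. The names smoking_rebuilt, smokes, and fit are hypothetical.

```r
# Rebuild one row per respondent from the four observed counts.
cell_counts <- c(731, 234, 539, 187)
smoking_rebuilt <- data.frame(
  gender = rep(c("Female", "Female", "Male", "Male"), times = cell_counts),
  smoke  = rep(c("No", "Yes", "No", "Yes"), times = cell_counts)
)

# Code the outcome as 1 for smokers and 0 for non-smokers.
smoking_rebuilt$smokes <- as.integer(smoking_rebuilt$smoke == "Yes")

# Logistic regression; Female is the reference level (alphabetical order).
fit <- glm(smokes ~ gender, data = smoking_rebuilt, family = binomial)

# Odds ratio of smoking for males relative to females.
odds_ratio <- exp(coef(fit)[["genderMale"]])
odds_ratio
```

With a single binary predictor, the odds ratio from the model equals the odds ratio computed directly from the 2 × 2 table, so this sketch adds nothing beyond the test above until more variables are included.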

References

“Large Datasets from Stats4schools.” STEM Learning, 2009, www.stem.org.uk/resources/library/resource/28452/large-datasets-stats4schools.

Position = “fill” reference: https://r-resources.massey.ac.nz/rgcookbook/RECIPE-BAR-GRAPH-PROPORTIONAL-STACKED-BAR.html