library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.9 ✔ recipes 1.3.1
## ✔ dials 1.4.2 ✔ rsample 1.3.1
## ✔ dplyr 1.1.4 ✔ tailor 0.1.0
## ✔ ggplot2 4.0.0 ✔ tidyr 1.3.1
## ✔ infer 1.0.9 ✔ tune 2.0.0
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ purrr 1.1.0 ✔ yardstick 1.3.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
setwd("C:/Users/sarah/OneDrive/Desktop/Data 101 - Data 110")
smoking <- read.csv("smoking.csv")
Research Question: Do males have a significantly higher proportion of smokers than females?
Smoking data is provided by stats4schools, a United Kingdom organization. The dataset contains survey responses from 1,691 people (observations) and has 12 variables. The variables I will be focusing on are “gender” and “smoke”. The “gender” variable records the sex of each respondent, either male or female. The “smoke” variable records the smoking status of each respondent, either “Yes” or “No”.
In this project, I will create visualizations showing the counts and proportions of male and female smokers. Then I will check whether the observed difference (if any) is statistically significant using a hypothesis test, specifically the difference of proportions test.
This dataset can be accessed through the OpenIntro repository (https://www.openintro.org/data/index.php?data=smoking) or the National STEM Centre of the UK (https://www.stem.org.uk/resources/library/resource/28452/large-datasets-stats4schools).
In this section, I will check for missing (NA) values, inspect the structure and dimensions of the dataset, examine my selected variables, derive two smaller data frames from the main dataset for the graphs and the statistical test, and finally create two visualizations of male and female smoking data: a bar graph and a proportional stacked bar graph.
Check for missing (NA) values.
colSums(is.na(smoking))
## gender age marital_status
## 0 0 0
## highest_qualification nationality ethnicity
## 0 0 0
## gross_income region smoke
## 0 0 0
## amt_weekends amt_weekdays type
## 1270 1270 0
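The selected variables (gender and smoke) have no missing values. Only amt_weekends and amt_weekdays contain NAs (1270 each), and neither is used in this analysis.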
Check the dimensions of the dataset
dim(smoking)
## [1] 1691 12
12 variables, 1691 observations.
Check the structure of the dataset
str(smoking)
## 'data.frame': 1691 obs. of 12 variables:
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : int 38 42 40 40 39 37 53 44 40 41 ...
## $ marital_status : chr "Divorced" "Single" "Married" "Married" ...
## $ highest_qualification: chr "No Qualification" "No Qualification" "Degree" "Degree" ...
## $ nationality : chr "British" "British" "English" "English" ...
## $ ethnicity : chr "White" "White" "White" "White" ...
## $ gross_income : chr "2,600 to 5,200" "Under 2,600" "28,600 to 36,400" "10,400 to 15,600" ...
## $ region : chr "The North" "The North" "The North" "The North" ...
## $ smoke : chr "No" "Yes" "No" "No" ...
## $ amt_weekends : int NA 12 NA NA NA NA 6 NA 8 15 ...
## $ amt_weekdays : int NA 12 NA NA NA NA 6 NA 8 12 ...
## $ type : chr "" "Packets" "" "" ...
Gender - Character
Smoke - Character
Check individual values for the gender variable
unique(smoking$gender)
## [1] "Male" "Female"
Check individual values for the smoke variable
unique(smoking$smoke)
## [1] "No" "Yes"
Create a new data frame: select the two necessary variables, group by gender, and count the observations at each level of smoke. Then add a variable called “total” that holds the sum of the counts within each gender group (male and female).
# For Visualization
smoking2 <- smoking |>
  select("gender", "smoke") |>
  group_by(gender) |>
  count(smoke) |>
  mutate(total = sum(n))
smoking2
## # A tibble: 4 × 4
## # Groups: gender [2]
## gender smoke n total
## <chr> <chr> <int> <int>
## 1 Female No 731 965
## 2 Female Yes 234 965
## 3 Male No 539 726
## 4 Male Yes 187 726
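The proportions shown later in the stacked bar graph can also be computed directly from these counts. The block below is a minimal sketch that retypes the four counts from the output above instead of reading the CSV; the names `smoking2_demo`, `smoking_props`, and `prop` are hypothetical and are not part of the original analysis.

```r
library(dplyr)

# Counts copied by hand from the smoking2 output above (assumption: no typos)
smoking2_demo <- data.frame(
  gender = c("Female", "Female", "Male", "Male"),
  smoke  = c("No", "Yes", "No", "Yes"),
  n      = c(731L, 234L, 539L, 187L)
)

# Within each gender, divide each count by that gender's total
smoking_props <- smoking2_demo |>
  group_by(gender) |>
  mutate(total = sum(n), prop = n / total) |>
  ungroup()

smoking_props
```

The `prop` values in the "Yes" rows are the two sample proportions that the hypothesis test compares.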
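Create a new data frame from the previous one, keeping only those who smoke. Arrange the counts in ascending order; this places Male (187) before Female (234), so the proportions test treats males as the first group and females as the second.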
# For Statistical Analysis
smoking3 <- smoking2 |>
  filter(smoke == "Yes") |>
  arrange(n)
smoking3
## # A tibble: 2 × 4
## # Groups: gender [2]
## gender smoke n total
## <chr> <chr> <int> <int>
## 1 Male Yes 187 726
## 2 Female Yes 234 965
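Because `prop.test()` compares the first group with the second in whatever order it receives them, relying on a sort by count to put males first could silently flip the one-sided test if the counts changed. The sketch below shows one possible way to pin the order explicitly. It retypes the counts from the output above, and the names `smokers`, `totals`, and `group_order` are hypothetical.

```r
# Counts copied by hand from the smoking3 output above
smokers <- c(Male = 187, Female = 234)
totals  <- c(Male = 726, Female = 965)

# Fix the group order explicitly so that "prop 1" is always Male
group_order <- c("Male", "Female")

result <- prop.test(
  x = smokers[group_order],
  n = totals[group_order],
  alternative = "greater"   # tests whether prop 1 (Male) exceeds prop 2 (Female)
)

result$estimate
result$p.value
```

Indexing by name keeps the direction of the alternative hypothesis tied to the group labels rather than to the row order of a data frame.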
Create a bar graph of male and female smoking data.
ggplot(smoking2, aes(x = gender, y = n, fill = smoke)) +
  geom_col(color = "black") +
  labs(
    title = "Bar Graph of Smoking Status by Gender in the United Kingdom",
    x = "Gender",
    y = "Count",
    caption = "Source: stats4schools",
    fill = "Smoking Status"
  )
This bar graph shows the counts of male and female smokers and non-smokers. There are more female cases in the dataset than male cases, and also more female smokers. This visualization may be misleading to readers without a strong mathematical foundation: because the bars show counts rather than proportions, one might conclude that females are more likely to smoke than males. That reading ignores the fact that the dataset simply contains more female observations (965 versus 726), so a larger count of female smokers is expected even if the smoking rates are similar. In this dataset, there are fewer male smokers largely because there are fewer male observations.
Create a proportional stacked bar graph of male and female smoking data using position = “fill”.
# position = "fill" rescales each bar to proportions (see references).
# Note: I remember using position = "fill" in a previous statistics class,
# but I figured it's best to include documentation as well.
ggplot(smoking2, aes(x = gender, y = n, fill = smoke)) +
  geom_col(color = "black", position = "fill") +
  labs(
    title = "Proportional Stacked Bar Graph of Smoking Status by Gender",
    x = "Gender",
    y = "Proportion",
    caption = "Source: stats4schools",
    fill = "Smoking Status"
  )
This proportional stacked bar graph shows the proportion (out of 100%) on the y-axis and gender on the x-axis. Each bar is split by smoking status. We can see that the proportion of male smokers is slightly higher than the proportion of female smokers. However, is the observed difference in proportions significant? We will use a difference of proportions test to find out.
Comparing this visualization to the previous one: if the goal is to compare the proportions of male and female smokers, the proportional stacked bar graph is the right choice. If the goal is only to see the counts, a regular bar graph works. I’ve included the regular bar graph to show how graphs can be misleading to people without fundamental knowledge of math.
\(H_0\): \(p_m = p_f\)
\(H_a\): \(p_m > p_f\)
where \(p_m\) is the proportion of males who smoke and \(p_f\) is the proportion of females who smoke.
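For reference, the large-sample statistic behind a difference of proportions test can be written as follows. The notation is introduced here for illustration only: \(x_m\) and \(x_f\) are the numbers of male and female smokers, \(n_m\) and \(n_f\) are the group sizes, \(\hat{p}_m = x_m / n_m\) and \(\hat{p}_f = x_f / n_f\) are the sample proportions, and \(\hat{p}\) is the pooled proportion under the null hypothesis.

```latex
\[
z = \frac{\hat{p}_m - \hat{p}_f}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_m} + \dfrac{1}{n_f}\right)}},
\qquad
\hat{p} = \frac{x_m + x_f}{n_m + n_f}
\]
```

R's `prop.test()` applies a continuity correction by default and reports the squared statistic as X-squared, so its output differs slightly from this uncorrected form.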
xtabs(~ gender + smoke, data=smoking)
## smoke
## gender No Yes
## Female 731 234
## Male 539 187
Every cell of the two-way table has a count far above 5 (the smallest is 187), so the sample-size condition for the proportions test is satisfied.
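The sample-size condition is often stated in terms of expected counts under the null hypothesis rather than observed counts. As a rough sketch, the block below rebuilds the two-way table from the output above and computes the expected count for each cell as (row total × column total) / grand total. The object names `observed` and `expected` are hypothetical.

```r
# Two-way table retyped from the xtabs() output above
observed <- matrix(
  c(731, 234,
    539, 187),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"), smoke = c("No", "Yes"))
)

# Expected counts if smoking status were independent of gender
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

expected
all(expected >= 5)   # TRUE means the large-sample approximation is reasonable
```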
prop.test(smoking3$n, smoking3$total, alternative = "greater")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: smoking3$n out of smoking3$total
## X-squared = 0.42699, df = 1, p-value = 0.2567
## alternative hypothesis: greater
## 95 percent confidence interval:
## -0.02115591 1.00000000
## sample estimates:
## prop 1 prop 2
## 0.2575758 0.2424870
The p-value is 0.2567, which is above the significance level of alpha = 0.05. We therefore fail to reject the null hypothesis; there is insufficient evidence that the proportion of males who smoke is greater than the proportion of females who smoke.
Furthermore, the one-sided 95 percent confidence interval for the difference in proportions (males minus females) runs from -0.021 to 1.00, and this interval includes 0. A difference of zero is therefore plausible, which is consistent with failing to reject the null hypothesis.
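To see where these numbers come from, the sketch below recomputes the test statistic, the one-sided p-value, and the lower confidence bound by hand from the four counts. It assumes the default behavior of `prop.test()`: a Yates continuity correction, a pooled standard error for the test, and an unpooled standard error for the interval. All object names are hypothetical.

```r
# Counts copied by hand from the smoking3 output above
x <- c(Male = 187, Female = 234)   # smokers
n <- c(Male = 726, Female = 965)   # group totals

p_hat  <- x / n                                   # sample proportions
delta  <- p_hat[["Male"]] - p_hat[["Female"]]     # observed difference
pooled <- sum(x) / sum(n)                         # pooled proportion under H0

# Yates continuity correction (capped so it never exceeds the difference)
yates <- min(0.5, abs(delta) / sum(1 / n))

# Test statistic: corrected difference divided by the pooled standard error
se_pooled <- sqrt(pooled * (1 - pooled) * sum(1 / n))
z <- sign(delta) * (abs(delta) - yates * sum(1 / n)) / se_pooled

x_squared <- z^2
p_value   <- pnorm(z, lower.tail = FALSE)         # one-sided: "greater"

# Lower bound of the one-sided 95% interval (unpooled standard error)
se_unpooled <- sqrt(sum(p_hat * (1 - p_hat) / n))
ci_lower <- max(delta - (qnorm(0.95) * se_unpooled + yates * sum(1 / n)), -1)

round(c(x_squared = x_squared, p_value = p_value, ci_lower = ci_lower), 5)
```

If the assumptions above hold, the three values should agree with the X-squared, p-value, and lower interval bound printed by `prop.test()`.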
From the visualizations, we can see differences between male and female smokers. The regular bar plot displays counts and shows more female smokers than male smokers. The proportional stacked bar graph presents proportions rather than counts, and it shows a slightly higher proportion of smokers among males (about 25.8%) than among females (about 24.2%). When comparing smoking between males and females, the proportional stacked bar graph is the more appropriate choice, because the two groups have different sizes and proportions account for that difference.
Even though the proportional stacked bar graph shows a slightly larger proportion of male smokers than female smokers, is this difference actually significant? From the difference of proportions test, we conclude, with a p-value of 0.2567 (greater than 0.05), that there is insufficient evidence that the proportion of male smokers is greater than the proportion of female smokers. Additionally, the 95% confidence interval for the difference includes zero, so the data are consistent with no difference between the two proportions.
If the observations in this dataset are independent, the sampling method was random, and there were no biases in collecting responses, then these findings can be generalized to the population of the United Kingdom. Under those conditions, we conclude that there is insufficient evidence that the proportion of males who smoke is greater than the proportion of females who smoke.
Future research can look at which factors are associated with smoking, for example income, nationality, ethnicity, gender, and age. A logistic regression model could be used to identify which variables are most strongly associated with smoking.
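As a minimal sketch of what such a model might look like, the block below fits a logistic regression with gender as the only predictor, using the four counts from the two-way table instead of the full CSV. The names `counts_demo`, `fit`, and `odds_ratio` are hypothetical, and a real analysis would add the other predictors from the original data.

```r
# Counts copied by hand from the xtabs() output; one row per gender
counts_demo <- data.frame(
  gender = c("Female", "Male"),
  yes    = c(234, 187),   # smokers
  no     = c(731, 539)    # non-smokers
)

# Grouped-binomial logistic regression: cbind(successes, failures) ~ predictor
# Female is the reference level because factor levels sort alphabetically
fit <- glm(cbind(yes, no) ~ gender, family = binomial, data = counts_demo)

# Exponentiate the slope to get the odds ratio of smoking, Male vs. Female
odds_ratio <- exp(coef(fit)[["genderMale"]])
odds_ratio

summary(fit)$coefficients
```

An odds ratio close to 1 would point the same way as the proportions test: little difference in smoking between males and females in this sample.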
“Large Datasets from Stats4schools.” STEM Learning, 2009, www.stem.org.uk/resources/library/resource/28452/large-datasets-stats4schools.
Position = “fill” reference: https://r-resources.massey.ac.nz/rgcookbook/RECIPE-BAR-GRAPH-PROPORTIONAL-STACKED-BAR.html