How does the distribution of age differ between smokers and non-smokers in London?

Introduction

My project focuses on the relationship between age and smoking status among individuals in London. The original dataset had 12 variables and 1691 observations, but I filtered it down to 3 variables and 182 observations. My research question is: How does the distribution of age differ between smokers and non-smokers in London? Understanding this relationship is important because smoking is influenced by many factors, and age may be one of them.

I got the dataset from openintro.org, and the overall dataset is about smoking in the United Kingdom. For the research question, I am going to focus on three variables: age, region, and smoke. The age variable is the age of the person. The region variable has been filtered to include only London, as that is the region I am focusing on. The smoke variable is Yes or No, indicating whether or not the person smokes.

Load the Library

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Set the Working Directory and Load the Dataset

setwd("~/Documents/Data 101")

uk_smoking <- read.csv("smoking.csv")

Exploratory Data Analysis and Code Explanation

First Chunk:

In the first line, I created a new dataset called “london_smoking”, a name that describes what the data contains. Then, I used the “select()” function to select the age, region and smoke columns, as these are the variables that I need to answer my question. In the third line, I used the “filter()” function to keep only the cases from London, as this is the only region relevant to my question. Then, I used “head()” to show the first six rows of the new dataset, “london_smoking”. Lastly, I used “str()” to check my column types and confirm that I did not need to mutate them.

Second Chunk:

In the first line, I used “tolower” to lowercase all of the column names in my revised dataset, “london_smoking”, which came from the first chunk. Then I used “gsub” to replace any spaces in the variable names with underscores. Finally, I used “head” to see the first six rows with the cleaned names.
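To illustrate what these two name-cleaning calls do, here is a small sketch on a made-up data frame. The data frame `example_df` and its column names are hypothetical and are not part of the smoking dataset. As the output further down shows, the three selected columns were already lowercase with no spaces, so the cleaning step leaves them unchanged.

```r
# Hypothetical data frame with messy column names (not from the smoking data)
example_df <- data.frame(
  "Age" = c(40, 28),
  "Gross Income" = c("A", "B"),
  check.names = FALSE  # keep the space in "Gross Income" instead of auto-fixing it
)

# Step 1: lowercase every column name
names(example_df) <- tolower(names(example_df))

# Step 2: replace each space in a column name with an underscore
names(example_df) <- gsub(" ", "_", names(example_df))

names(example_df)
```

After both steps the names are "age" and "gross_income".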

Third Chunk:

In the first line, I used ggplot and aes to set my dataset and map my x and y variables, smoke and age. I also mapped fill to smoke so that each box in the side-by-side boxplot is colored by smoking status. Then I used geom_boxplot to create the boxplot, labs to set the title and label the x and y axes, and theme_minimal to apply a simple theme.

london_smoking <- uk_smoking |>
  select(age, region, smoke) |>
  filter(region == "London")
head(london_smoking)
##   age region smoke
## 1  40 London    No
## 2  28 London    No
## 3  40 London    No
## 4  48 London    No
## 5  35 London    No
## 6  30 London   Yes
str(london_smoking)
## 'data.frame':    182 obs. of  3 variables:
##  $ age   : int  40 28 40 48 35 30 26 81 77 25 ...
##  $ region: chr  "London" "London" "London" "London" ...
##  $ smoke : chr  "No" "No" "No" "No" ...
names(london_smoking) <- tolower(names(london_smoking))
names(london_smoking) <- gsub(" ","_",names(london_smoking))
head(london_smoking)
##   age region smoke
## 1  40 London    No
## 2  28 London    No
## 3  40 London    No
## 4  48 London    No
## 5  35 London    No
## 6  30 London   Yes
ggplot(london_smoking, aes(x = smoke, y = age, fill = smoke)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Age Distribution by Smoking Status in London",
       x = "Smoking Status",
       y = "Age") +
  theme_minimal()

Conclusion

Discussion of Key Results

The boxplot, which was my main visualization, answers my question well. It shows that non-smokers tend to be older than smokers, as their median age is higher. The non-smokers' box also covers a wider range of ages, which means that there is more variation in age among non-smokers. The smokers have a lower median age and a smaller, more clustered distribution, meaning that their ages are more similar to one another. To answer my question, the distribution of age does differ between smokers and non-smokers in London. In this sample, smokers' ages start at around 17 and extend all the way up to the mid-90s. However, there are more non-smokers, especially in the older age groups, which suggests that older people have either quit smoking or never started in the first place. The age distribution for smokers is centered on teens and young to middle-aged adults, while the age distribution for non-smokers stretches from young adults all the way to the elderly.
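One way to check this reading of the boxplot is to compute the same numbers the boxplot draws (median, quartiles, minimum and maximum) for each group. The sketch below uses a small made-up data frame, `toy_smoking`, so that it runs on its own; the ages in it are invented for illustration and are not results from the London data. Replacing `toy_smoking` with `london_smoking` would produce the summary for the real sample.

```r
library(dplyr)

# Made-up ages for illustration only (NOT the London data)
toy_smoking <- data.frame(
  age   = c(22, 35, 41, 58, 67, 80, 19, 27, 33, 45),
  smoke = c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")
)

# One row per smoking status with the values a boxplot displays
age_summary <- toy_smoking |>
  group_by(smoke) |>
  summarise(
    n      = n(),                                    # group size
    min    = min(age),
    q1     = quantile(age, 0.25, names = FALSE),     # bottom of the box
    median = median(age),                            # line inside the box
    q3     = quantile(age, 0.75, names = FALSE),     # top of the box
    max    = max(age),
    iqr    = IQR(age)                                # height of the box
  )

age_summary
```

Comparing the median and iqr columns between the two rows gives a numeric version of "older on average" and "more spread out", and the n column shows how many people are in each group, which the boxplot itself does not display.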

Implications and Relevance

These findings answer the research question because the graph shows a clear visual difference in age distributions between smokers and non-smokers in London. These findings are also relevant to the real world: since smoking appears to be more common among younger adults in London, anti-smoking campaigns could be targeted at the younger demographic there. This could help reduce smoking among teens and young adults.

Potential Areas for Future Research

One potential area for future research is to confirm whether the difference in age is statistically significant. Another is to include the gross_income variable and to explore patterns across all of the regions in the U.K., along with other factors in the dataset.
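As a rough sketch of what such a significance check could look like, the code below runs a Wilcoxon rank-sum test, which compares two groups without assuming the ages are normally distributed. The choice of this test is my assumption; a two-sample t-test with `t.test()` would be another option. The data frame `toy_smoking` is made up so that the example runs on its own, and its p-value says nothing about the London data.

```r
# Made-up ages for illustration only (NOT the London data)
toy_smoking <- data.frame(
  age   = c(22, 35, 41, 58, 67, 80, 19, 27, 33, 45),
  smoke = c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")
)

# Wilcoxon rank-sum (Mann-Whitney) test of age by smoking status.
# Null hypothesis: the two age distributions are not shifted relative to each other.
test_result <- wilcox.test(age ~ smoke, data = toy_smoking)

# A small p-value (for example, below 0.05) would suggest a real difference
test_result$p.value
```

With the real data, swapping in `london_smoking` would run the same test. Because several people in that sample share the same age, R may warn that it cannot compute an exact p-value with ties and use an approximation instead.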

Works Cited

Smoking dataset. (n.d.). OpenIntro. Retrieved February 27, 2026, from https://www.openintro.org/data/index.php?data=smoking