INTRODUCTION
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
smokingData <- read_csv("smoking.csv")
## New names:
## • `` -> `...1`
## Rows: 1691 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): gender, marital_status, highest_qualification, nationality, ethnici...
## dbl (4): ...1, age, amt_weekends, amt_weekdays
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(smokingData)
## # A tibble: 6 × 13
## ...1 gender age marital_status highest_qualification nationality ethnicity
## <dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 1 Male 38 Divorced No Qualification British White
## 2 2 Female 42 Single No Qualification British White
## 3 3 Male 40 Married Degree English White
## 4 4 Female 40 Married Degree English White
## 5 5 Female 39 Married GCSE/O Level British White
## 6 6 Female 37 Married GCSE/O Level British White
## # ℹ 6 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## # amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>
# (Q1) Understand my data set
glimpse(smokingData)
## Rows: 1,691
## Columns: 13
## $ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ gender <chr> "Male", "Female", "Male", "Female", "Female", "F…
## $ age <dbl> 38, 42, 40, 40, 39, 37, 53, 44, 40, 41, 72, 49, …
## $ marital_status <chr> "Divorced", "Single", "Married", "Married", "Mar…
## $ highest_qualification <chr> "No Qualification", "No Qualification", "Degree"…
## $ nationality <chr> "British", "British", "English", "English", "Bri…
## $ ethnicity <chr> "White", "White", "White", "White", "White", "Wh…
## $ gross_income <chr> "2,600 to 5,200", "Under 2,600", "28,600 to 36,4…
## $ region <chr> "The North", "The North", "The North", "The Nort…
## $ smoke <chr> "No", "Yes", "No", "No", "No", "No", "Yes", "No"…
## $ amt_weekends <dbl> NA, 12, NA, NA, NA, NA, 6, NA, 8, 15, NA, NA, NA…
## $ amt_weekdays <dbl> NA, 12, NA, NA, NA, NA, 6, NA, 8, 12, NA, NA, NA…
## $ type <chr> NA, "Packets", NA, NA, NA, NA, "Packets", NA, "H…
# (Q2) Identify missing values in my data set
anyNA(smokingData)
## [1] TRUE
sum(is.na(smokingData))
## [1] 3810
colSums(is.na(smokingData))
## ...1 gender age
## 0 0 0
## marital_status highest_qualification nationality
## 0 0 0
## ethnicity gross_income region
## 0 0 0
## smoke amt_weekends amt_weekdays
## 0 1270 1270
## type
## 1270
Observation: Three columns (amt_weekends, amt_weekdays and type) each have 1,270 missing values, which together account for all 3,810 NAs in the data set. We deal with these NAs later, because removing those rows now would discard most of the information needed for the analysis.
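A quick way to see why the three columns share the same NA count is to cross-tabulate smoking status against missingness. The sketch below uses a small made-up data frame (`toy`, not the real smoking.csv) and assumes, as the first rows of the data suggest, that the amounts are missing only for non-smokers.

```r
# Toy data standing in for smokingData (illustrative values, not from smoking.csv)
toy <- data.frame(
  smoke        = c("No", "Yes", "No", "Yes", "No"),
  amt_weekends = c(NA, 12, NA, 6, NA),
  amt_weekdays = c(NA, 12, NA, 6, NA),
  type         = c(NA, "Packets", NA, "Hand-Rolled", NA)
)

# Cross-tabulate smoking status against "weekend amount is missing"
na_by_smoke <- table(smoke = toy$smoke, missing = is.na(toy$amt_weekends))
print(na_by_smoke)

# TRUE when every row with a missing amount belongs to a non-smoker
all_na_are_nonsmokers <- all(toy$smoke[is.na(toy$amt_weekends)] == "No")
print(all_na_are_nonsmokers)
```

Running the same two checks on the real data would confirm (or refute) that the NAs are structural rather than random.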
# (Q3) How many individuals belong to each age group (e.g., 10-20, 21-30, 31-40, etc.)?
age_group <- cut(smokingData$age,
                 breaks = seq(10, 100, by = 10),
                 include.lowest = TRUE)
smokingData <- cbind(smokingData, age_group)
head(smokingData)
## ...1 gender age marital_status highest_qualification nationality ethnicity
## 1 1 Male 38 Divorced No Qualification British White
## 2 2 Female 42 Single No Qualification British White
## 3 3 Male 40 Married Degree English White
## 4 4 Female 40 Married Degree English White
## 5 5 Female 39 Married GCSE/O Level British White
## 6 6 Female 37 Married GCSE/O Level British White
## gross_income region smoke amt_weekends amt_weekdays type age_group
## 1 2,600 to 5,200 The North No NA NA <NA> (30,40]
## 2 Under 2,600 The North Yes 12 12 Packets (40,50]
## 3 28,600 to 36,400 The North No NA NA <NA> (30,40]
## 4 10,400 to 15,600 The North No NA NA <NA> (30,40]
## 5 2,600 to 5,200 The North No NA NA <NA> (30,40]
## 6 15,600 to 20,800 The North No NA NA <NA> (30,40]
smokingData %>%
  ggplot(aes(age_group, fill = age_group)) +
  geom_bar() +
  labs(title = "Frequency of people by age band",
       x = "Interval of age",
       y = "Frequency") +
  theme_classic()
Observation: I first created a column of age intervals from the age variable with cut(), then added it to the data set and plotted the bands as a bar chart. As the chart shows, most respondents are between 20 and 80 years old.
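The bar chart shows the shape of the distribution, but the question asks how many people fall in each band. A minimal sketch of the counting step, using a hypothetical vector `ages` rather than the real data, with the same `cut()` settings as above:

```r
# Hypothetical ages (illustrative only)
ages <- c(16, 20, 21, 38, 40, 41, 72)

# Same banding as in the analysis: [10,20], (20,30], ..., (90,100]
bands <- cut(ages, breaks = seq(10, 100, by = 10), include.lowest = TRUE)

# Count respondents per band and show only the non-empty bands
band_counts <- table(bands)
print(band_counts[band_counts > 0])
```

Note that each band is closed on the right, so an age of exactly 40 is counted in (30,40], which matches the "31-40" grouping in the question.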
# (Q4) Are there many smokers?
smokingData %>%
  ggplot(aes(smoke, fill = smoke)) +
  geom_bar() +
  labs(title = "Difference between people who smoke and people who don't smoke",
       x = "Smoke",
       y = "Frequency") +
  scale_fill_discrete(name = "Smoke")
Observation: Non-smokers clearly outnumber smokers in this data set.
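To put numbers on the chart, the counts and percentages can be tabulated directly. This is a sketch on a hypothetical vector `smoke`, not the real column:

```r
# Hypothetical smoking status values (illustrative only)
smoke <- c("No", "Yes", "No", "No", "Yes", "No", "No", "No")

# Absolute counts per status
smoke_counts <- table(smoke)
print(smoke_counts)

# Share of each status, shown as a percentage
smoke_share <- prop.table(smoke_counts)
print(round(100 * smoke_share, 1))
```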
# (Q5) Omit all the NA values from the data set to create a subset for smokers only
smokersData <- na.omit(smokingData)
sum(is.na(smokersData))
## [1] 0
head(smokersData)
## ...1 gender age marital_status highest_qualification nationality ethnicity
## 2 2 Female 42 Single No Qualification British White
## 7 7 Male 53 Married Degree British White
## 9 9 Male 40 Single GCSE/CSE English White
## 10 10 Female 41 Married No Qualification English White
## 21 21 Female 34 Married GCSE/CSE British White
## 22 22 Female 36 Married GCSE/O Level English White
## gross_income region smoke amt_weekends amt_weekdays type
## 2 Under 2,600 The North Yes 12 12 Packets
## 7 Above 36,400 The North Yes 6 6 Packets
## 9 2,600 to 5,200 The North Yes 8 8 Hand-Rolled
## 10 5,200 to 10,400 The North Yes 15 12 Packets
## 21 2,600 to 5,200 The North Yes 6 12 Packets
## 22 5,200 to 10,400 The North Yes 5 2 Packets
## age_group
## 2 (40,50]
## 7 (50,60]
## 9 (30,40]
## 10 (40,50]
## 21 (30,40]
## 22 (30,40]
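Observation: The only missing values are in the three smoking-detail columns, which appear to be NA for non-smokers, so removing every row that contains an NA leaves a subset with only people who smoke.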
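Dropping NAs works here only because missingness lines up with smoking status. A more explicit alternative is to filter on the smoke column itself. The sketch below uses a made-up data frame `toy` in which one smoker has a missing amount, to show how the two approaches can differ:

```r
# Toy data (illustrative only); row 4 is a smoker whose amount was not recorded
toy <- data.frame(
  age          = c(38, 42, 53, 40, 61),
  smoke        = c("No", "Yes", "Yes", "Yes", "No"),
  amt_weekends = c(NA, 12, 6, NA, NA)
)

# Approach used in the analysis: drop every row containing an NA
by_na <- na.omit(toy)

# Alternative: keep rows by smoking status, regardless of missing amounts
by_status <- subset(toy, smoke == "Yes")

print(nrow(by_na))      # smokers with complete records
print(nrow(by_status))  # all smokers
```

If both approaches return the same number of rows on the real data, the na.omit() shortcut is safe.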
# (Q6) What is the average age of smokers?
average_age_smokers <- mean(smokersData$age)
print(average_age_smokers)
## [1] 42.71496
Observation: On average, a smoker is about 43 years old (mean 42.7).
# (Q7) Extract all smokers older than 50
older_smokers <- subset(smokersData, age > 50)
head(older_smokers)
## ...1 gender age marital_status highest_qualification nationality ethnicity
## 7 7 Male 53 Married Degree British White
## 23 23 Female 56 Married No Qualification English White
## 27 27 Female 58 Divorced No Qualification English White
## 38 38 Female 78 Widowed No Qualification English White
## 48 48 Female 76 Widowed No Qualification English White
## 50 50 Male 59 Married Other/Sub Degree English White
## gross_income region smoke amt_weekends amt_weekdays type age_group
## 7 Above 36,400 The North Yes 6 6 Packets (50,60]
## 23 2,600 to 5,200 The North Yes 20 20 Packets (50,60]
## 27 5,200 to 10,400 The North Yes 25 20 Packets (50,60]
## 38 Under 2,600 The North Yes 20 20 Packets (70,80]
## 48 2,600 to 5,200 The North Yes 6 6 Packets (70,80]
## 50 15,600 to 20,800 The North Yes 25 25 Packets (50,60]
Observation: 123 smokers are older than 50.
# (Q8) Extract all smokers younger than 20
younger_smokers <- subset(smokersData, age < 20)
head(younger_smokers)
## ...1 gender age marital_status highest_qualification nationality ethnicity
## 81 81 Female 18 Single GCSE/O Level English White
## 210 210 Female 16 Single GCSE/O Level British White
## 307 307 Female 18 Single GCSE/O Level British White
## 323 323 Male 18 Single No Qualification British White
## 493 493 Female 16 Single GCSE/CSE British White
## 548 548 Female 18 Single Higher/Sub Degree British White
## gross_income region smoke amt_weekends amt_weekdays
## 81 2,600 to 5,200 The North Yes 10 3
## 210 5,200 to 10,400 The North Yes 12 12
## 307 Refused The North Yes 8 8
## 323 10,400 to 15,600 The North Yes 15 10
## 493 Under 2,600 Midlands & East Anglia Yes 2 2
## 548 2,600 to 5,200 Midlands & East Anglia Yes 15 15
## type age_group
## 81 Packets [10,20]
## 210 Packets [10,20]
## 307 Hand-Rolled [10,20]
## 323 Both/Mainly Hand-Rolled [10,20]
## 493 Packets [10,20]
## 548 Packets [10,20]
Observation: 18 smokers are younger than 20.
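The head() calls display only the first six rows, so the counts quoted for the over-50 and under-20 subsets have to come from a separate step. A minimal sketch of that step with nrow(), on a hypothetical data frame `toy_smokers`:

```r
# Hypothetical smokers (illustrative ages only)
toy_smokers <- data.frame(age = c(16, 18, 34, 42, 53, 58, 76))

# Same filters as in the analysis
older   <- subset(toy_smokers, age > 50)
younger <- subset(toy_smokers, age < 20)

# Number of rows in each subset
n_older   <- nrow(older)
n_younger <- nrow(younger)
print(c(older = n_older, younger = n_younger))
```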
# (Q9) How are smokers distributed by gender?
smokers_gender <- smokersData %>%
  group_by(gender) %>%
  summarise(count = n())
smokers_gender %>%
  ggplot(aes(gender, count, fill = gender)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of smokers by gender",
       x = "Gender",
       y = "Number of smokers") +
  scale_fill_discrete(name = "Sex of smoker") +
  theme_classic()
Observation: In this data set, female smokers outnumber male smokers.
# (Q10) What types of cigarettes are most preferred by individuals who smoke?
smokersData %>%
  ggplot(aes(type, fill = type)) +
  geom_bar() +
  labs(title = "Type of cigarettes most smoked",
       x = "Type of cigarettes smoked",
       y = "Frequency") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45,
                                   hjust = 1,
                                   vjust = 0.9))
Observation: Smokers prefer packets to the other types of cigarettes.
# (Q11) Is there any difference between the amount of cigarettes smoked on weekdays and on weekends?
difference <- sum(smokersData$amt_weekdays) - sum(smokersData$amt_weekends)
print(difference)
## [1] -1120
Observation: The difference is negative (-1,120), so summed over all smokers the weekend amounts exceed the weekday amounts: people tend to smoke more on weekends than on weekdays. One possible explanation is that the institutions they work for restrict smoking.
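A difference of sums grows with the number of smokers, so it is hard to interpret on its own. A per-smoker average and a paired t-test give a clearer picture. The sketch below uses made-up vectors `weekdays` and `weekends` (one pair of values per hypothetical smoker); the choice of a paired t-test is an assumption for illustration, not something the analysis above ran.

```r
# Hypothetical amounts for six smokers (illustrative only)
weekdays <- c(12, 6, 8, 12, 10, 2)
weekends <- c(12, 8, 8, 15, 12, 5)

# Average weekday-minus-weekend difference per smoker
mean_diff <- mean(weekdays - weekends)
print(mean_diff)

# Paired t-test: each smoker is compared with themselves
paired <- t.test(weekdays, weekends, paired = TRUE)
print(paired$p.value)
```

A negative `mean_diff` points the same way as the negative sum above, while the p-value indicates whether the gap could plausibly be noise.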
# (Q12) How many unique ethnicities are represented in the data set? Which ethnicity has the most smokers, and which type of cigarettes do they consume the most?
unique(smokingData$ethnicity)
## [1] "White" "Mixed" "Black" "Refused" "Asian" "Chinese" "Unknown"
smokersData %>%
  filter(!ethnicity %in% c("Refused", "Unknown")) %>%
  ggplot(aes(ethnicity, fill = type)) +
  geom_bar() +
  labs(title = "Relationship between ethnicity, type of cigarettes and number of smokers",
       x = "Ethnicity",
       y = "Frequency") +
  scale_fill_discrete(name = "Type") +
  theme_classic()
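Observation: The ethnicity column has seven distinct values, two of which (Refused and Unknown) are non-responses. White respondents make up the largest group of smokers, and packets are the type they smoke most. Because the bars show raw counts, this mainly reflects how many White respondents are in the sample rather than a stronger addiction.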
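To compare cigarette preferences between ethnic groups of very different sizes, the counts can be converted to within-group shares. A minimal sketch on a made-up data frame `toy` (the values are not from the real data):

```r
# Toy smokers (illustrative only)
toy <- data.frame(
  ethnicity = c("White", "White", "White", "White", "Asian", "Asian"),
  type      = c("Packets", "Packets", "Packets", "Hand-Rolled",
                "Packets", "Hand-Rolled")
)

# Raw counts: rows are ethnicities, columns are cigarette types
counts <- table(toy$ethnicity, toy$type)
print(counts)

# Row-wise shares: each ethnicity's row sums to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

With shares, a small group and a large group can be compared on the same scale.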
# (Q13) Are there any differences in smoking habits based on marital status?
smokersData %>%
  ggplot(aes(x = marital_status, fill = marital_status)) +
  geom_bar() +
  labs(title = "Smokers by marital status",
       x = "Marital status",
       y = "Count") +
  theme(legend.position = "none")
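Observation: Single people, followed by married people, are the largest groups among smokers. These are raw counts, so on their own they do not show that either group is more likely to smoke than the others.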
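Whether a marital status is associated with smoking is better judged from the smoking rate within each status, computed on the full data set rather than on smokers only. A sketch on a made-up data frame `toy`:

```r
# Toy respondents, smokers and non-smokers together (illustrative only)
toy <- data.frame(
  marital_status = c("Single", "Single", "Single", "Married",
                     "Married", "Married", "Married", "Widowed"),
  smoke          = c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No")
)

# Proportion of smokers within each marital status
rate <- tapply(toy$smoke == "Yes", toy$marital_status, mean)
print(round(rate, 2))
```

In this toy example married respondents are the largest group but have a lower smoking rate than single respondents, which is exactly the distinction raw counts hide.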
# (Q14) What is the relationship between age and smoking amount on weekdays?
corr <- smokersData %>%
  select(age, amt_weekdays) %>%
  cor(method = "pearson")
corr
## age amt_weekdays
## age 1.0000000 0.1927826
## amt_weekdays 0.1927826 1.0000000
smokersData %>%
  ggplot(aes(x = age, y = amt_weekdays)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(x = "Age", y = "Amount (weekday)",
       title = "Age Vs. Smoking Amount on weekdays")
## `geom_smooth()` using formula = 'y ~ x'
Observation: The Pearson correlation is a weak positive 0.19. Older smokers tend to smoke slightly more on weekdays, but age explains only a small part of the variation.
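The correlation matrix gives the coefficient but not its uncertainty. As a sketch, `cor.test()` returns the same Pearson coefficient together with a p-value and a confidence interval. The vectors `age` and `amt` below are made up for illustration.

```r
# Hypothetical ages and weekday amounts (illustrative only)
age <- c(18, 25, 34, 42, 53, 61, 70)
amt <- c(5, 10, 8, 12, 15, 10, 20)

# Pearson correlation with a significance test
res <- cor.test(age, amt, method = "pearson")

# The estimate is the same value that cor() returns
r <- unname(res$estimate)
print(r)
print(res$p.value)
print(res$conf.int)
```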
# (Q15) Which region has the most smokers? And how are smokers distributed by gender?
smokersData %>%
  ggplot(aes(region, fill = gender)) +
  geom_bar() +
  labs(x = "Region", y = "Frequency",
       title = "Number of smokers by region and gender") +
  theme(axis.text.x = element_text(angle = 45,
                                   hjust = 1,
                                   vjust = 0.9)) +
  scale_fill_discrete(name = "Gender")
Observation: According to this graph, the North has the most smokers, followed by the Midlands and East Anglia. In these regions, female smokers also outnumber male smokers.