SMOKING DATA ANALYSIS

INTRODUCTION

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

smokingData <- read_csv("smoking.csv")

## New names:
## • `` -> `...1`
## Rows: 1691 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): gender, marital_status, highest_qualification, nationality, ethnici...
## dbl (4): ...1, age, amt_weekends, amt_weekdays
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(smokingData)

## # A tibble: 6 × 13
##    ...1 gender   age marital_status highest_qualification nationality ethnicity
##   <dbl> <chr>  <dbl> <chr>          <chr>                 <chr>       <chr>    
## 1     1 Male      38 Divorced       No Qualification      British     White    
## 2     2 Female    42 Single         No Qualification      British     White    
## 3     3 Male      40 Married        Degree                English     White    
## 4     4 Female    40 Married        Degree                English     White    
## 5     5 Female    39 Married        GCSE/O Level          British     White    
## 6     6 Female    37 Married        GCSE/O Level          British     White    
## # ℹ 6 more variables: gross_income <chr>, region <chr>, smoke <chr>,
## #   amt_weekends <dbl>, amt_weekdays <dbl>, type <chr>

# (Q1) Understand my data set
glimpse(smokingData)

## Rows: 1,691
## Columns: 13
## $ ...1                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ gender                <chr> "Male", "Female", "Male", "Female", "Female", "F…
## $ age                   <dbl> 38, 42, 40, 40, 39, 37, 53, 44, 40, 41, 72, 49, …
## $ marital_status        <chr> "Divorced", "Single", "Married", "Married", "Mar…
## $ highest_qualification <chr> "No Qualification", "No Qualification", "Degree"…
## $ nationality           <chr> "British", "British", "English", "English", "Bri…
## $ ethnicity             <chr> "White", "White", "White", "White", "White", "Wh…
## $ gross_income          <chr> "2,600 to 5,200", "Under 2,600", "28,600 to 36,4…
## $ region                <chr> "The North", "The North", "The North", "The Nort…
## $ smoke                 <chr> "No", "Yes", "No", "No", "No", "No", "Yes", "No"…
## $ amt_weekends          <dbl> NA, 12, NA, NA, NA, NA, 6, NA, 8, 15, NA, NA, NA…
## $ amt_weekdays          <dbl> NA, 12, NA, NA, NA, NA, 6, NA, 8, 12, NA, NA, NA…
## $ type                  <chr> NA, "Packets", NA, NA, NA, NA, "Packets", NA, "H…

# (Q2) Identify missing values in my data set
anyNA(smokingData)

## [1] TRUE

sum(is.na(smokingData))

## [1] 3810

colSums(is.na(smokingData))

##                  ...1                gender                   age 
##                     0                     0                     0 
##        marital_status highest_qualification           nationality 
##                     0                     0                     0 
##             ethnicity          gross_income                region 
##                     0                     0                     0 
##                 smoke          amt_weekends          amt_weekdays 
##                     0                  1270                  1270 
##                  type 
##                  1270

Observation: Three columns (amt_weekends, amt_weekdays and type) each have 1,270 missing values, which together account for all 3,810 NAs (3 × 1,270). We will deal with the NAs later: removing those rows now would drop at least 1,270 of the 1,691 rows and leave little information for the rest of the analysis.
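To put these counts in proportion, the share of missing values per column can be computed with colMeans(). The sketch below uses a small made-up data frame (toy is a stand-in, not the real smoking.csv) so that it runs on its own; it also checks whether the NAs line up with the non-smokers, which is what I assume is happening in the real data. On the real data, 1,270 of 1,691 rows is about 75% missing in each of the three columns.

```r
# Toy data standing in for smokingData (values are made up for illustration)
toy <- data.frame(
  smoke        = c("No", "Yes", "No", "Yes"),
  amt_weekends = c(NA, 12, NA, 6),
  amt_weekdays = c(NA, 12, NA, 6),
  type         = c(NA, "Packets", NA, "Packets"),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; its column means are the NA shares
na_share <- colMeans(is.na(toy))
print(round(100 * na_share, 1))

# Cross-tabulate smoking status against "row has at least one NA"
na_rows <- !complete.cases(toy)
print(table(smoke = toy$smoke, has_na = na_rows))
```

If the cross-table of the real data shows NAs only for smoke == "No", the missing values are structural (non-smokers have no amounts to report) rather than lost answers.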

# (Q3) How many individuals belong to each age group (e.g., 10-20, 21-30, 31-40, etc.)?
age_group <- cut(smokingData$age, 
                  breaks = seq(10, 100, by = 10), 
                  include.lowest = TRUE)

smokingData <- cbind(smokingData, age_group)

head(smokingData)

##   ...1 gender age marital_status highest_qualification nationality ethnicity
## 1    1   Male  38       Divorced      No Qualification     British     White
## 2    2 Female  42         Single      No Qualification     British     White
## 3    3   Male  40        Married                Degree     English     White
## 4    4 Female  40        Married                Degree     English     White
## 5    5 Female  39        Married          GCSE/O Level     British     White
## 6    6 Female  37        Married          GCSE/O Level     British     White
##       gross_income    region smoke amt_weekends amt_weekdays    type age_group
## 1   2,600 to 5,200 The North    No           NA           NA    <NA>   (30,40]
## 2      Under 2,600 The North   Yes           12           12 Packets   (40,50]
## 3 28,600 to 36,400 The North    No           NA           NA    <NA>   (30,40]
## 4 10,400 to 15,600 The North    No           NA           NA    <NA>   (30,40]
## 5   2,600 to 5,200 The North    No           NA           NA    <NA>   (30,40]
## 6 15,600 to 20,800 The North    No           NA           NA    <NA>   (30,40]

smokingData %>% 
  ggplot(aes(age_group, fill = age_group))+
  geom_bar()+
  labs(title = "Frequency of people by age band",
       x = "Interval of age",
       y = "Frequency")+
  theme_classic()

Observation: I first built an age_group column by cutting the age variable into ten-year intervals, then added it to the data set and plotted a bar chart. As the chart shows, most respondents are between 20 and 80 years old.
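The bar chart shows the shape of the distribution, but Q3 asks for the number of individuals per group. A minimal sketch of how the counts could be read off with table(), using a handful of hypothetical ages rather than the real age column:

```r
# Hypothetical ages, for illustration only
ages <- c(16, 20, 21, 38, 40, 42, 53, 72, 97)

# Same binning as in the analysis: [10,20], (20,30], ..., (90,100]
age_group <- cut(ages, breaks = seq(10, 100, by = 10), include.lowest = TRUE)

# table() returns the number of individuals per band, including empty bands
age_counts <- table(age_group)
print(age_counts)
```

Note that the intervals are closed on the right, so a 40-year-old is counted in (30,40] and not in (40,50]. On the real data the equivalent call would be table(smokingData$age_group).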

# (Q4) Are there many smokers?
smokingData %>% 
  ggplot(aes(smoke, fill = smoke))+
  geom_bar()+
  labs(title = "Difference between people who smoke and people who don't smoke",
       x = "Smoke",
       y = "Frequency")+
  scale_fill_discrete(name = "Smoke")

Observation: Non-smokers outnumber smokers in this data set.
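The chart can be backed up with exact counts and percentages. The sketch below uses only the first eight values of the smoke column shown in the glimpse() output, so its numbers are illustrative and not the result for the full data set:

```r
# First eight values of smoke, copied from the glimpse() output
smoke <- c("No", "Yes", "No", "No", "No", "No", "Yes", "No")

smoke_counts <- table(smoke)             # absolute counts
smoke_share  <- prop.table(smoke_counts) # shares that sum to 1

print(smoke_counts)
print(round(100 * smoke_share, 1))
```

On the full data the same two calls would be applied to smokingData$smoke.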

# (Q5) Omit all the NA values from the data set to create a subset for smokers only 
smokersData <- na.omit(smokingData)
sum(is.na(smokersData))

## [1] 0

head(smokersData)

##    ...1 gender age marital_status highest_qualification nationality ethnicity
## 2     2 Female  42         Single      No Qualification     British     White
## 7     7   Male  53        Married                Degree     British     White
## 9     9   Male  40         Single              GCSE/CSE     English     White
## 10   10 Female  41        Married      No Qualification     English     White
## 21   21 Female  34        Married              GCSE/CSE     British     White
## 22   22 Female  36        Married          GCSE/O Level     English     White
##       gross_income    region smoke amt_weekends amt_weekdays        type
## 2      Under 2,600 The North   Yes           12           12     Packets
## 7     Above 36,400 The North   Yes            6            6     Packets
## 9   2,600 to 5,200 The North   Yes            8            8 Hand-Rolled
## 10 5,200 to 10,400 The North   Yes           15           12     Packets
## 21  2,600 to 5,200 The North   Yes            6           12     Packets
## 22 5,200 to 10,400 The North   Yes            5            2     Packets
##    age_group
## 2    (40,50]
## 7    (50,60]
## 9    (30,40]
## 10   (40,50]
## 21   (30,40]
## 22   (30,40]

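Observation: The only missing values are in the three smoking-specific columns, so removing every row that contains an NA leaves a subset of rows with complete smoking information, that is, the smokers.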
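Using na.omit() to select smokers works only as long as missing values and non-smokers coincide. A more explicit alternative is to filter on the smoke column itself. The sketch below shows how the two approaches can differ on made-up data in which one smoker has a missing amount (the toy data frame and its id column are hypothetical):

```r
# Made-up data: row 5 is a smoker whose weekday amount is missing
toy <- data.frame(
  id           = 1:5,
  smoke        = c("No", "Yes", "No", "Yes", "Yes"),
  amt_weekdays = c(NA, 12, NA, 6, NA),
  stringsAsFactors = FALSE
)

by_na     <- na.omit(toy)               # drops every row with any NA
by_status <- toy[toy$smoke == "Yes", ]  # keeps every row marked as a smoker

print(nrow(by_na))
print(nrow(by_status))
```

With dplyr the explicit version would be filter(smokingData, smoke == "Yes"). If both approaches return the same number of rows on the real data, the na.omit() shortcut is safe here.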

# (Q6) Calculate the average age of smokers
average_age_smokers <- mean(smokersData$age)
print(average_age_smokers)

## [1] 42.71496

Observation: On average, a smoker is about 42.7 years old.

# (Q7) Extract all smokers older than 50
older_smokers <- subset(smokersData, age > 50)
head(older_smokers)

##    ...1 gender age marital_status highest_qualification nationality ethnicity
## 7     7   Male  53        Married                Degree     British     White
## 23   23 Female  56        Married      No Qualification     English     White
## 27   27 Female  58       Divorced      No Qualification     English     White
## 38   38 Female  78        Widowed      No Qualification     English     White
## 48   48 Female  76        Widowed      No Qualification     English     White
## 50   50   Male  59        Married      Other/Sub Degree     English     White
##        gross_income    region smoke amt_weekends amt_weekdays    type age_group
## 7      Above 36,400 The North   Yes            6            6 Packets   (50,60]
## 23   2,600 to 5,200 The North   Yes           20           20 Packets   (50,60]
## 27  5,200 to 10,400 The North   Yes           25           20 Packets   (50,60]
## 38      Under 2,600 The North   Yes           20           20 Packets   (70,80]
## 48   2,600 to 5,200 The North   Yes            6            6 Packets   (70,80]
## 50 15,600 to 20,800 The North   Yes           25           25 Packets   (50,60]

Observation: 123 smokers are older than 50.

# (Q8) Extract all smokers younger than 20
younger_smokers <- subset(smokersData, age < 20)
head(younger_smokers)

##     ...1 gender age marital_status highest_qualification nationality ethnicity
## 81    81 Female  18         Single          GCSE/O Level     English     White
## 210  210 Female  16         Single          GCSE/O Level     British     White
## 307  307 Female  18         Single          GCSE/O Level     British     White
## 323  323   Male  18         Single      No Qualification     British     White
## 493  493 Female  16         Single              GCSE/CSE     British     White
## 548  548 Female  18         Single     Higher/Sub Degree     British     White
##         gross_income                 region smoke amt_weekends amt_weekdays
## 81    2,600 to 5,200              The North   Yes           10            3
## 210  5,200 to 10,400              The North   Yes           12           12
## 307          Refused              The North   Yes            8            8
## 323 10,400 to 15,600              The North   Yes           15           10
## 493      Under 2,600 Midlands & East Anglia   Yes            2            2
## 548   2,600 to 5,200 Midlands & East Anglia   Yes           15           15
##                        type age_group
## 81                  Packets   [10,20]
## 210                 Packets   [10,20]
## 307             Hand-Rolled   [10,20]
## 323 Both/Mainly Hand-Rolled   [10,20]
## 493                 Packets   [10,20]
## 548                 Packets   [10,20]

Observation: 18 smokers are younger than 20.
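head() only prints the first six rows, so the counts reported for Q7 and Q8 need a separate step. A minimal sketch of two ways to count the rows, using hypothetical ages (the smokers data frame here is made up, not smokersData):

```r
# Hypothetical smoker ages, for illustration only
smokers <- data.frame(age = c(16, 18, 19, 20, 34, 50, 51, 53, 78))

# Count rows of the extracted subset (strictly older than 50)
older_n <- nrow(subset(smokers, age > 50))

# Or count directly: TRUE is summed as 1 (strictly younger than 20)
younger_n <- sum(smokers$age < 20)

print(older_n)
print(younger_n)
```

Both comparisons are strict, so people aged exactly 50 or exactly 20 are not counted in either group. On the real data the counts would be nrow(older_smokers) and nrow(younger_smokers).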

# (Q9) How are smokers distributed by gender?
smokers_gender <- smokersData %>% 
  group_by(gender) %>% 
  summarise(count = n())

smokers_gender %>% 
  ggplot(aes(gender, count, fill = gender))+
  geom_bar(stat = "identity")+
  labs(title = "Number of smokers by gender",
       x = "Gender",
       y = "Number of smokers")+
  scale_fill_discrete(name = "Sex of smoker")+
  theme_classic()

Observation: In this sample, there are more female smokers than male smokers.
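A higher number of female smokers does not by itself mean that women are more likely to smoke, because the sample may simply contain more women. The sketch below compares the count of smokers with the smoking rate per gender on made-up data (toy is hypothetical; the real comparison would use the full smokingData, not the smokers-only subset):

```r
# Made-up sample: 6 women and 4 men
toy <- data.frame(
  gender = c(rep("Female", 6), rep("Male", 4)),
  smoke  = c("Yes", "Yes", "Yes", "No", "No", "No",
             "Yes", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Number of smokers per gender (what the bar chart shows)
smoker_counts <- table(toy$gender[toy$smoke == "Yes"])

# Share of smokers within each gender: the mean of a logical is a proportion
smoking_rate <- tapply(toy$smoke == "Yes", toy$gender, mean)

print(smoker_counts)
print(smoking_rate)
```

In this toy example women account for more smokers, yet the smoking rate is identical for both genders, which is why the rate is worth checking alongside the count.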

# (Q10) What types of cigarettes are most preferred by individuals who smoke?
smokersData %>% 
  ggplot(aes(type, fill = type))+
  geom_bar()+
  labs(title = "Type of cigarettes most smoked",
       x = "Type of cigarettes smoked",
       y = "Frequency")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, 
                                   hjust = 1, 
                                   vjust = 0.9))

Observation: Smokers prefer packets over the other types of cigarettes.

# (Q11) Is there any difference between the amount of cigarettes smoked on weekdays and on weekends?
difference <- sum(smokersData$amt_weekdays) - sum(smokersData$amt_weekends)
print(difference)

## [1] -1120

Observation: The difference is negative: the summed weekend amounts exceed the summed weekday amounts by 1,120 cigarettes, so smokers tend to smoke more on weekends than on weekdays. One possible reason is the smoking restrictions at the places where they work.
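A difference of totals grows with the number of smokers, so a per-smoker view is easier to interpret. One way to sketch this is to take the weekend minus weekday difference for each smoker, average it, and run a t-test on the differences (equivalent to a paired t-test). The values below are made up for illustration and are not taken from smokersData:

```r
# Hypothetical amounts for eight smokers
smokers <- data.frame(
  amt_weekdays = c(12, 6, 8, 12, 12, 2, 20, 10),
  amt_weekends = c(12, 6, 8, 15, 6, 5, 25, 15)
)

# Per-smoker difference: positive means more cigarettes on weekends
diffs <- smokers$amt_weekends - smokers$amt_weekdays
mean_diff <- mean(diffs)

# One-sample t-test on the differences (same as a paired t-test)
test <- t.test(diffs)

print(mean_diff)
print(test$p.value)
```

On the real data, a small p-value would suggest that the weekend increase is unlikely to be due to chance alone; this sketch does not claim what that result is.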

# (Q12) How many unique ethnicities are represented in the data set? Which ethnicity has the most smokers, and which type of cigarettes does each consume the most?
unique(smokingData$ethnicity)

## [1] "White"   "Mixed"   "Black"   "Refused" "Asian"   "Chinese" "Unknown"

smokersData %>%
  filter(!ethnicity %in% c("Refused", "Unknown")) %>% 
  ggplot(aes(ethnicity, fill = type))+
  geom_bar()+
  labs(title = "Relationship between ethnicity, type of cigarettes and number of smokers",
       x = "Ethnicity",
       y = "Frequency")+
  scale_fill_discrete(name = "Type")+
  theme_classic()

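Observation: The data set contains seven distinct ethnicity values, including "Refused" and "Unknown". White respondents account for the largest number of smokers, and most of them smoke packets. These are raw counts, so they largely reflect how many people of each ethnicity are in the sample rather than a stronger addiction.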
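To answer the "how many" part directly and to compare cigarette types across groups of different sizes, the values can be counted and converted to shares within each ethnicity. A minimal sketch on made-up data (toy is hypothetical):

```r
# Made-up smokers, for illustration only
toy <- data.frame(
  ethnicity = c("White", "White", "White", "White",
                "Asian", "Asian", "Black", "Refused"),
  type      = c("Packets", "Packets", "Packets", "Hand-Rolled",
                "Packets", "Hand-Rolled", "Packets", "Packets"),
  stringsAsFactors = FALSE
)

# Number of distinct ethnicity values
n_ethnicities <- length(unique(toy$ethnicity))

# Drop non-answers, then cross-tabulate ethnicity by cigarette type
known  <- toy[!toy$ethnicity %in% c("Refused", "Unknown"), ]
counts <- table(known$ethnicity, known$type)

# margin = 1 turns each row into shares that sum to 1 per ethnicity
row_share <- prop.table(counts, margin = 1)

print(n_ethnicities)
print(round(row_share, 2))
```

In the plot, the same within-group view could be obtained with geom_bar(position = "fill"). Shares for very small groups should be read with caution.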

# (Q13) Are there any differences in smoking habits based on marital status?
smokersData %>% 
  ggplot(aes(x = marital_status, fill = marital_status)) +
  geom_bar() +
  labs(title = "Smokers by marital status",
       x = "Marital status",
       y = "Count")+
  theme(legend.position = "none")

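Observation: Single people, followed by married people, account for the most smokers. Note that this chart counts smokers per marital status; it does not show how much each group smokes.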
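Since Q13 asks about smoking habits, the average amount smoked per marital status is a useful complement to the counts. A sketch with base R's aggregate() on made-up values (the smokers data frame below is hypothetical):

```r
# Made-up smokers, for illustration only
smokers <- data.frame(
  marital_status = c("Single", "Single", "Single",
                     "Married", "Married", "Widowed"),
  amt_weekdays   = c(10, 12, 8, 6, 12, 20),
  amt_weekends   = c(15, 12, 9, 6, 14, 20),
  stringsAsFactors = FALSE
)

# Mean weekday and weekend amounts per marital status
habits <- aggregate(cbind(amt_weekdays, amt_weekends) ~ marital_status,
                    data = smokers, FUN = mean)

# Add the group size so small groups can be spotted
habits$n <- as.vector(table(smokers$marital_status)[habits$marital_status])

print(habits)
```

In this toy example the largest group (Single) is not the one with the highest average amount (Widowed), which illustrates why counts and amounts answer different questions.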

# (Q14) What is the relationship between age and smoking amount on weekdays?
corr <- smokersData %>% 
  select(age, amt_weekdays) %>% 
  cor(method = "pearson")
corr

##                    age amt_weekdays
## age          1.0000000    0.1927826
## amt_weekdays 0.1927826    1.0000000

smokersData %>% 
  ggplot(aes(x = age, y = amt_weekdays)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(x = "Age", y = "Amount (weekday)", 
       title = "Age Vs. Smoking Amount on weekdays")

## `geom_smooth()` using formula = 'y ~ x'

Observation: The Pearson correlation is 0.19, a weak positive relationship. Older smokers tend to smoke slightly more on weekdays.
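Two quick checks help to judge how much weight a correlation of this size deserves: its square, and a significance test. With r = 0.19, r² is about 0.037, so a straight-line fit on age would explain under 4% of the variation in weekday amounts. The sketch below shows how cor.test() could be used for the test; the values are made up, so its output says nothing about the real data:

```r
# Hypothetical smokers, for illustration only
smokers <- data.frame(
  age          = c(18, 25, 34, 40, 42, 53, 58, 76),
  amt_weekdays = c(3, 10, 12, 8, 12, 6, 20, 6)
)

# Pearson correlation and its square
r <- cor(smokers$age, smokers$amt_weekdays, method = "pearson")
r_squared <- r^2

# Test of the null hypothesis that the true correlation is zero
result <- cor.test(smokers$age, smokers$amt_weekdays, method = "pearson")

print(r)
print(r_squared)
print(result$p.value)
```

On the real data the call would be cor.test(smokersData$age, smokersData$amt_weekdays).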

# (Q15) Which region has the most smokers? And how are smokers distributed by gender?
smokersData %>% 
  ggplot(aes(region, fill = gender))+
  geom_bar()+
  labs(x = "Region", y = "Frequency", 
       title = "Number of smokers by region and gender")+
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1, 
                                   vjust = 0.9))+
  scale_fill_discrete(name = "Gender")

Observation: According to this graph, the North has the most smokers, followed by the Midlands & East Anglia. In these regions, female smokers outnumber male smokers.

Annu Kumari

2024-05-03