adult tobacco consumption in the U.S.

Author

Ayomide Joe-Adigwe

Introduction

This project analyzes Adult Tobacco Consumption in the U.S. from 2000 to the Present. Using data from cdc.gov, the dataset tracks annual tobacco use across different products like cigarettes, cigars, and smokeless tobacco. The aim is to explore patterns and trends in tobacco consumption over time, and provide insights that can help shape public health policies.

# Load libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)

# Load the dataset
library(readr)

# Make sure the file path is correct
Adult_Tobacco_Consumption_In_The_U_S_2000_Present <- read_csv(file.path("AYOMIDE'S DATAVISUALITIOM", "DATASETS", "Adult_Tobacco_Consumption_In_The_U.S.__2000-Present.csv"))

Rows: 312 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): LocationAbbrev, LocationDesc, Topic, Measure, Submeasure, Data Valu...
dbl (8): Year, Population, Domestic, Imports, Total, Domestic Per Capita, Im...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# View the first few rows
head(Adult_Tobacco_Consumption_In_The_U_S_2000_Present)

# A tibble: 6 × 14
   Year LocationAbbrev LocationDesc Population Topic          Measure Submeasure
  <dbl> <chr>          <chr>             <dbl> <chr>          <chr>   <chr>     
1  2000 US             National      209786736 Noncombustibl… Smokel… Chewing T…
2  2000 US             National      209786736 Combustible T… Cigare… Cigarette…
3  2000 US             National      209786736 Combustible T… Cigars  Total Cig…
4  2000 US             National      209786736 Combustible T… Loose … Total Loo…
5  2000 US             National      209786736 Combustible T… Loose … Total Loo…
6  2000 US             National      209786736 Combustible T… Cigars  Small Cig…
# ℹ 7 more variables: `Data Value Unit` <chr>, Domestic <dbl>, Imports <dbl>,
#   Total <dbl>, `Domestic Per Capita` <dbl>, `Imports Per Capita` <dbl>,
#   `Total Per Capita` <dbl>

Data Cleaning

# Checking for missing values
sum(is.na(Adult_Tobacco_Consumption_In_The_U_S_2000_Present))

[1] 0

# Removing rows with missing values
Adult_Tobacco_Consumption_In_The_U_S_2000_Present  <- na.omit(Adult_Tobacco_Consumption_In_The_U_S_2000_Present )

# Filter for relevant years (e.g., 2000 onwards)
Adult_Tobacco_Consumption_In_The_U_S_2000_Present  <- Adult_Tobacco_Consumption_In_The_U_S_2000_Present  %>% filter(Year >= 2000)

# Convert any necessary columns to appropriate data types
Adult_Tobacco_Consumption_In_The_U_S_2000_Present $Year <- as.numeric(Adult_Tobacco_Consumption_In_The_U_S_2000_Present $Year)

Linear Regression Analysis:

# Linear Regression: Predicting 'Total Per Capita' based on 'Year' and 'Total'
model <- lm(`Total Per Capita` ~ Year + Total, data = Adult_Tobacco_Consumption_In_The_U_S_2000_Present)

# Display model summary
model_summary <- summary(model)
model_summary


Call:
lm(formula = `Total Per Capita` ~ Year + Total, data = Adult_Tobacco_Consumption_In_The_U_S_2000_Present)

Residuals:
   Min     1Q Median     3Q    Max 
-97.48 -11.27   0.96  14.73 163.05 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.171e+03  5.576e+02   7.481 7.73e-13 ***
Year        -2.076e+00  2.771e-01  -7.489 7.32e-13 ***
Total        4.360e-09  1.638e-11 266.223  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.75 on 309 degrees of freedom
Multiple R-squared:  0.9957,    Adjusted R-squared:  0.9957 
F-statistic: 3.593e+04 on 2 and 309 DF,  p-value: < 2.2e-16

# Extract p-values and adjusted R-squared
p_values <- coef(summary(model))[, "Pr(>|t|)"]
adjusted_r2 <- model_summary$adj.r.squared

# Display p-values and adjusted R-squared for analysis
p_values

 (Intercept)         Year        Total 
7.728110e-13 7.319974e-13 0.000000e+00

adjusted_r2

[1] 0.9956911

# Regression equation
equation <- paste0("Total Per Capita = ", round(coef(model)[1], 2), 
                   " + ", round(coef(model)[2], 2), "*Year",
                   " + ", round(coef(model)[3], 2), "*Total")
equation

[1] "Total Per Capita = 4170.86 + -2.08*Year + 0*Total"

Diagnostic Plots

# Diagnostic Plots
# To avoid margin issues, reset plot layout and plot each diagnostic plot separately

# Plot: Residuals vs Fitted
par(mfrow = c(1, 1), mar = c(5, 5, 2, 2))  # Adjust margins
plot(model, which = 1)

# Plot: Normal Q-Q
plot(model, which = 2)

# Plot: Scale-Location (Homoscedasticity Check)
plot(model, which = 3)

# Plot: Residuals vs Leverage
plot(model, which = 5)

Create Scatterplot

# Create Heatmap
heatmap_plot <- ggplot(Adult_Tobacco_Consumption_In_The_U_S_2000_Present, aes(x = Year, y = Measure, fill = `Total Per Capita`)) +
  geom_tile() +
  labs(
    title = "Heatmap of Total Tobacco Consumption per Capita Over Time",
    x = "Year",
    y = "Type of Tobacco Product (Measure)",
    fill = "Total Per Capita",
    caption = "Data Source: CDC"
  ) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  # Adjust the color scale
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "right"
  )

heatmap_plot

Conclusion and Analysis

a. How the Data Was Cleaned

The dataset was cleaned using several steps to ensure that the data was accurate and relevant for analysis. First, any missing data points were identified and removed using the `na.omit()` function. This ensured that no incomplete rows were present, which could have affected the analysis.

Next, the dataset was filtered to include only records from the year 2000 onward, to focus the analysis on recent trends in tobacco consumption. Additionally, the `Year` column was converted to a numeric data type to ensure correct handling in calculations and visualizations. These cleaning steps helped create a consistent and reliable dataset for further analysis.

b. Visualization Interpretation

The primary visualization in this project was a heatmap, which illustrated the changes in Total Tobacco Consumption per Capita across different years and types of tobacco products (the `Measure` variable). The heatmap used color intensity to show higher or lower levels of consumption.

Several insights emerged from the heatmap:

General Trends: The heatmap revealed a noticeable decline in per capita tobacco consumption for several tobacco product categories over the years, particularly for combustible tobacco products like cigarettes.
Variation by Product: While some types of tobacco (e.g., cigarettes) showed a steady decline, non-combustible products like smokeless tobacco demonstrated more stable consumption patterns. This suggests that as smoking rates decrease, other tobacco products might be gaining popularity.
Unexpected Findings: In certain years, there was a sudden increase in consumption for specific products such as cigars, which could be attributed to changes in consumer behavior or product marketing.

The heatmap’s ability to visually represent multiple variables (year, product type, and consumption level) helped to uncover these patterns and allowed for easy comparison between product categories.

c. Challenges and Limitations

One of the challenges encountered during this analysis was the representation of multiple product categories within the heatmap. The dataset contained a large variety of tobacco products, and although the heatmap efficiently displayed trends over time, too many categories made it difficult to capture detailed trends for each individual product. Simplifying the product categories, or focusing on a select few, would improve the clarity of the visualization.

Another limitation was the lack of demographic data, such as age or geographic region, which would have provided additional context for the analysis. With demographic information, it would have been possible to explore how tobacco consumption trends differ across various population groups, potentially providing more targeted insights for public health initiatives.

Introduction

Data Cleaning

Linear Regression Analysis:

Diagnostic Plots

Create Scatterplot

Conclusion and Analysis

a. How the Data Was Cleaned

The dataset was cleaned using several steps to ensure that the data was accurate and relevant for analysis. First, any missing data points were identified and removed using the na.omit() function. This ensured that no incomplete rows were present, which could have affected the analysis.

b. Visualization Interpretation

The primary visualization in this project was a heatmap, which illustrated the changes in Total Tobacco Consumption per Capita across different years and types of tobacco products (the Measure variable). The heatmap used color intensity to show higher or lower levels of consumption.

Several insights emerged from the heatmap:

The heatmap’s ability to visually represent multiple variables (year, product type, and consumption level) helped to uncover these patterns and allowed for easy comparison between product categories.

c. Challenges and Limitations

The dataset was cleaned using several steps to ensure that the data was accurate and relevant for analysis. First, any missing data points were identified and removed using the `na.omit()` function. This ensured that no incomplete rows were present, which could have affected the analysis.

The primary visualization in this project was a heatmap, which illustrated the changes in Total Tobacco Consumption per Capita across different years and types of tobacco products (the `Measure` variable). The heatmap used color intensity to show higher or lower levels of consumption.