Adult Tobacco Consumption in the U.S.

Author

Ayomide Joe-Adigwe

Introduction

This project analyzes adult tobacco consumption in the U.S. from 2000 to the present. The dataset, obtained from cdc.gov, tracks annual tobacco use across products such as cigarettes, cigars, and smokeless tobacco. The aim is to explore patterns and trends in tobacco consumption over time and to provide insights that can help shape public health policy.

```{r}
# Load libraries
library(tidyverse)
library(ggplot2)
library(dplyr)
```

```{r}
# Load the dataset
library(readr)

# Make sure the file path is correct
Adult_Tobacco_Consumption_In_The_U_S_2000_Present <- read_csv(
  file.path("AYOMIDE'S DATAVISUALITIOM", "DATASETS",
            "Adult_Tobacco_Consumption_In_The_U.S.__2000-Present.csv")
)

# View the first few rows
head(Adult_Tobacco_Consumption_In_The_U_S_2000_Present)
```

Data Cleaning

```{r}
# Checking for missing values
sum(is.na(Adult_Tobacco_Consumption_In_The_U_S_2000_Present))

# Removing rows with missing values
Adult_Tobacco_Consumption_In_The_U_S_2000_Present <-
  na.omit(Adult_Tobacco_Consumption_In_The_U_S_2000_Present)

# Filter for relevant years (e.g., 2000 onwards)
Adult_Tobacco_Consumption_In_The_U_S_2000_Present <-
  Adult_Tobacco_Consumption_In_The_U_S_2000_Present %>%
  filter(Year >= 2000)

# Convert any necessary columns to appropriate data types
Adult_Tobacco_Consumption_In_The_U_S_2000_Present$Year <-
  as.numeric(Adult_Tobacco_Consumption_In_The_U_S_2000_Present$Year)
```

Linear Regression Analysis

```{r}
# Linear Regression: Predicting 'Total Per Capita' based on 'Year' and 'Total'
# The response column name contains spaces, so it must be wrapped in backticks
model <- lm(`Total Per Capita` ~ Year + Total,
            data = Adult_Tobacco_Consumption_In_The_U_S_2000_Present)

# Display model summary
model_summary <- summary(model)
model_summary

# Extract p-values and adjusted R-squared
p_values <- coef(model_summary)[, "Pr(>|t|)"]
adjusted_r2 <- model_summary$adj.r.squared

# Display p-values and adjusted R-squared for analysis
p_values
adjusted_r2

# Regression equation
equation <- paste0("Total Per Capita = ", round(coef(model)[1], 2),
                   " + ", round(coef(model)[2], 2), " * Year",
                   " + ", round(coef(model)[3], 2), " * Total")
equation
```
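The regression step refers to a column whose name contains spaces, which R only accepts in a formula when it is wrapped in backticks. The following is a minimal sketch on a made-up toy data frame (the values are invented for illustration and are not taken from the CDC data) showing the backtick syntax and how the p-values and adjusted R-squared can be pulled from the fitted model.

```r
# Toy data: invented values, used only to illustrate the syntax
toy <- data.frame(
  Year = 2000:2009,
  Total = c(50, 48, 47, 45, 44, 41, 40, 38, 37, 35)
)
# Add a column whose name contains spaces, as in the CDC file
toy[["Total Per Capita"]] <- c(5.1, 4.9, 4.8, 4.5, 4.5, 4.1, 4.0, 3.9, 3.6, 3.5)

# Backticks let the formula reference the non-syntactic column name
fit <- lm(`Total Per Capita` ~ Year + Total, data = toy)

# The coefficient table is a matrix; one row per term, p-values in "Pr(>|t|)"
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|t|)"]
adj_r2 <- summary(fit)$adj.r.squared

p_values
adj_r2
```

Without the backticks, R would try to parse `Total Per Capita` as three separate symbols and the call would fail with a syntax error.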

Diagnostic Plots

```{r}
# Diagnostic Plots
# To avoid margin issues, reset plot layout and plot each diagnostic plot separately

# Plot: Residuals vs Fitted
par(mfrow = c(1, 1), mar = c(5, 5, 2, 2)) # Adjust margins
plot(model, which = 1)

# Plot: Normal Q-Q
plot(model, which = 2)

# Plot: Scale-Location (Homoscedasticity Check)
plot(model, which = 3)

# Plot: Residuals vs Leverage
plot(model, which = 5)
```

Create Heatmap

```{r}
# Create Heatmap
# `Total Per Capita` contains spaces, so it is wrapped in backticks inside aes()
heatmap_plot <- ggplot(Adult_Tobacco_Consumption_In_The_U_S_2000_Present,
                       aes(x = Year, y = Measure, fill = `Total Per Capita`)) +
  geom_tile() +
  labs(
    title = "Heatmap of Total Tobacco Consumption per Capita Over Time",
    x = "Year",
    y = "Type of Tobacco Product (Measure)",
    fill = "Total Per Capita",
    caption = "Data Source: CDC"
  ) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") + # Adjust the color scale
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "right"
  )

heatmap_plot
```
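Because `geom_tile()` draws one tile per row, rows that share the same Year and Measure are drawn on top of each other and only the last one stays visible. A quick check, sketched here on a made-up toy data frame (the rows are invented; whether the CDC table actually repeats a Year and Measure pair is an assumption to verify), is to count the rows behind each heatmap cell before plotting.

```r
# Toy data: invented rows, two of which share the same Year and Measure
toy <- data.frame(
  Year = c(2000, 2000, 2000, 2001, 2001),
  Measure = c("Cigars", "Cigars", "Cigarettes", "Cigars", "Cigarettes")
)

# Cross-tabulate: each entry is the number of rows behind one heatmap cell
rows_per_cell <- table(toy$Year, toy$Measure)
rows_per_cell

# TRUE means at least one cell would have overlapping tiles
has_overlap <- any(rows_per_cell > 1)
has_overlap
```

If the check reports overlap, the rows would need to be aggregated or filtered to one value per cell; which of the two is appropriate depends on what the repeated rows represent.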

Conclusion and Analysis

  1. How the Data Was Cleaned

The dataset was cleaned in several steps to ensure that it was accurate and relevant for analysis. First, missing values were counted with is.na(), and every row containing at least one missing value was removed with the na.omit() function. This ensured that no incomplete rows remained that could have affected the analysis.

Next, the dataset was filtered to include only records from the year 2000 onward, to focus the analysis on recent trends in tobacco consumption. Additionally, the Year column was converted to a numeric data type to ensure correct handling in calculations and visualizations. These cleaning steps helped create a consistent and reliable dataset for further analysis.
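The cleaning steps described above can be sketched on a small made-up data frame. The values and the three-column layout are invented for illustration; the sketch also counts missing values per column and records how many rows `na.omit()` drops, which are additions not present in the original workflow.

```r
# Toy data: invented values with two incomplete rows
toy <- data.frame(
  Year = c("2000", "2001", "2002", "2003"),
  Measure = c("Cigarettes", "Cigars", NA, "Cigarettes"),
  Total = c(100, NA, 30, 90),
  stringsAsFactors = FALSE
)

# Count missing values per column to see where the gaps are
missing_by_column <- colSums(is.na(toy))
missing_by_column

# Drop every row that has at least one missing value
cleaned <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(cleaned)
rows_dropped

# Keep records from 2000 onward and store Year as numeric
cleaned$Year <- as.numeric(cleaned$Year)
cleaned <- cleaned[cleaned$Year >= 2000, ]
```

Reporting the number of dropped rows makes it easier to judge whether removing incomplete rows is acceptable or whether too much of the data is being discarded.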

  2. Visualization Interpretation

The primary visualization in this project was a heatmap, which illustrated the changes in Total Tobacco Consumption per Capita across different years and types of tobacco products (the Measure variable). The heatmap used color intensity to show higher or lower levels of consumption.

Several insights emerged from the heatmap:

General Trends: The heatmap revealed a noticeable decline in per capita tobacco consumption for several tobacco product categories over the years, particularly for combustible tobacco products like cigarettes.

Variation by Product: While some types of tobacco (e.g., cigarettes) showed a steady decline, non-combustible products like smokeless tobacco demonstrated more stable consumption patterns. This suggests that as cigarette consumption decreases, other tobacco products may account for a growing share of total consumption, although the heatmap alone does not show whether consumers are switching between products.

Unexpected Findings: In certain years, there was a sudden increase in consumption for specific products such as cigars, which could be attributed to changes in consumer behavior or product marketing.

The heatmap’s ability to visually represent multiple variables (year, product type, and consumption level) helped to uncover these patterns and allowed for easy comparison between product categories.
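The visual impressions of "steady decline" and "stable consumption" can be backed by a number. One option, sketched here on made-up data (the `PerCapita` column name and all values are hypothetical, not taken from the CDC table), is to fit a separate linear trend per product and compare the yearly slopes.

```r
# Toy data: one declining product and one flat product (invented values)
toy <- data.frame(
  Year = rep(2000:2004, times = 2),
  Measure = rep(c("Cigarettes", "Smokeless Tobacco"), each = 5),
  PerCapita = c(10, 9, 8, 7, 6, 2, 2, 2, 2, 2)
)

# Fit PerCapita ~ Year within each product and keep the Year coefficient,
# i.e. the average change in per capita consumption per year
slopes <- sapply(split(toy, toy$Measure), function(d) {
  coef(lm(PerCapita ~ Year, data = d))[["Year"]]
})
slopes
```

A clearly negative slope corresponds to a declining product, while a slope near zero corresponds to a stable one.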

  3. Challenges and Limitations

One of the challenges encountered during this analysis was the representation of multiple product categories within the heatmap. The dataset contained a large variety of tobacco products, and although the heatmap efficiently displayed trends over time, too many categories made it difficult to capture detailed trends for each individual product. Simplifying the product categories, or focusing on a select few, would improve the clarity of the visualization.
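Focusing on a select few categories amounts to filtering the data before plotting. A minimal sketch, assuming the product type is stored in a `Measure` column and using invented category labels and values:

```r
# Toy data: three hypothetical product categories over two years
toy <- data.frame(
  Year = rep(2000:2001, each = 3),
  Measure = rep(c("Cigarettes", "Cigars", "Smokeless Tobacco"), times = 2),
  PerCapita = c(10, 3, 2, 9, 3, 2)
)

# Categories to keep; adjust to the labels that actually occur in the data
selected_measures <- c("Cigarettes", "Cigars")

# Keep only the rows whose Measure is in the selected set
focused <- toy[toy$Measure %in% selected_measures, ]
focused
```

The filtered data frame can then be passed to the plotting code in place of the full dataset, which gives a heatmap with fewer rows that is easier to read.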

Another limitation was the lack of demographic data, such as age or geographic region, which would have provided additional context for the analysis. With demographic information, it would have been possible to explore how tobacco consumption trends differ across various population groups, potentially providing more targeted insights for public health initiatives.