This project analyzes Adult Tobacco Consumption in the U.S. from 2000 to the Present. Using data from cdc.gov, the dataset tracks annual tobacco use across different products like cigarettes, cigars, and smokeless tobacco. The aim is to explore patterns and trends in tobacco consumption over time, and provide insights that can help shape public health policies.
# Load librarieslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)library(dplyr)
# Load the datasetlibrary(readr)# Make sure the file path is correctAdult_Tobacco_Consumption_In_The_U_S_2000_Present <-read_csv(file.path("AYOMIDE'S DATAVISUALITIOM", "DATASETS", "Adult_Tobacco_Consumption_In_The_U.S.__2000-Present.csv"))
Rows: 312 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): LocationAbbrev, LocationDesc, Topic, Measure, Submeasure, Data Valu...
dbl (8): Year, Population, Domestic, Imports, Total, Domestic Per Capita, Im...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few rowshead(Adult_Tobacco_Consumption_In_The_U_S_2000_Present)
# A tibble: 6 × 14
Year LocationAbbrev LocationDesc Population Topic Measure Submeasure
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
1 2000 US National 209786736 Noncombustibl… Smokel… Chewing T…
2 2000 US National 209786736 Combustible T… Cigare… Cigarette…
3 2000 US National 209786736 Combustible T… Cigars Total Cig…
4 2000 US National 209786736 Combustible T… Loose … Total Loo…
5 2000 US National 209786736 Combustible T… Loose … Total Loo…
6 2000 US National 209786736 Combustible T… Cigars Small Cig…
# ℹ 7 more variables: `Data Value Unit` <chr>, Domestic <dbl>, Imports <dbl>,
# Total <dbl>, `Domestic Per Capita` <dbl>, `Imports Per Capita` <dbl>,
# `Total Per Capita` <dbl>
Data Cleaning
# Checking for missing valuessum(is.na(Adult_Tobacco_Consumption_In_The_U_S_2000_Present))
[1] 0
# Removing rows with missing valuesAdult_Tobacco_Consumption_In_The_U_S_2000_Present <-na.omit(Adult_Tobacco_Consumption_In_The_U_S_2000_Present )# Filter for relevant years (e.g., 2000 onwards)Adult_Tobacco_Consumption_In_The_U_S_2000_Present <- Adult_Tobacco_Consumption_In_The_U_S_2000_Present %>%filter(Year >=2000)# Convert any necessary columns to appropriate data typesAdult_Tobacco_Consumption_In_The_U_S_2000_Present $Year <-as.numeric(Adult_Tobacco_Consumption_In_The_U_S_2000_Present $Year)
Linear Regression Analysis:
# Linear Regression: Predicting 'Total Per Capita' based on 'Year' and 'Total'model <-lm(`Total Per Capita`~ Year + Total, data = Adult_Tobacco_Consumption_In_The_U_S_2000_Present)# Display model summarymodel_summary <-summary(model)model_summary
Call:
lm(formula = `Total Per Capita` ~ Year + Total, data = Adult_Tobacco_Consumption_In_The_U_S_2000_Present)
Residuals:
Min 1Q Median 3Q Max
-97.48 -11.27 0.96 14.73 163.05
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.171e+03 5.576e+02 7.481 7.73e-13 ***
Year -2.076e+00 2.771e-01 -7.489 7.32e-13 ***
Total 4.360e-09 1.638e-11 266.223 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 33.75 on 309 degrees of freedom
Multiple R-squared: 0.9957, Adjusted R-squared: 0.9957
F-statistic: 3.593e+04 on 2 and 309 DF, p-value: < 2.2e-16
# Extract p-values and adjusted R-squaredp_values <-coef(summary(model))[, "Pr(>|t|)"]adjusted_r2 <- model_summary$adj.r.squared# Display p-values and adjusted R-squared for analysisp_values
(Intercept) Year Total
7.728110e-13 7.319974e-13 0.000000e+00
[1] "Total Per Capita = 4170.86 + -2.08*Year + 0*Total"
Diagnostic Plots
# Diagnostic Plots# To avoid margin issues, reset plot layout and plot each diagnostic plot separately# Plot: Residuals vs Fittedpar(mfrow =c(1, 1), mar =c(5, 5, 2, 2)) # Adjust marginsplot(model, which =1)
# Plot: Normal Q-Qplot(model, which =2)
# Plot: Scale-Location (Homoscedasticity Check)plot(model, which =3)
# Plot: Residuals vs Leverageplot(model, which =5)
Create Scatterplot
# Create Heatmapheatmap_plot <-ggplot(Adult_Tobacco_Consumption_In_The_U_S_2000_Present, aes(x = Year, y = Measure, fill =`Total Per Capita`)) +geom_tile() +labs(title ="Heatmap of Total Tobacco Consumption per Capita Over Time",x ="Year",y ="Type of Tobacco Product (Measure)",fill ="Total Per Capita",caption ="Data Source: CDC" ) +scale_fill_gradient(low ="lightblue", high ="darkblue") +# Adjust the color scaletheme_minimal() +theme(plot.title =element_text(hjust =0.5),legend.position ="right" )heatmap_plot
Conclusion and Analysis
a. How the Data Was Cleaned
The dataset was cleaned using several steps to ensure that the data was accurate and relevant for analysis. First, any missing data points were identified and removed using the na.omit() function. This ensured that no incomplete rows were present, which could have affected the analysis.
Next, the dataset was filtered to include only records from the year 2000 onward, to focus the analysis on recent trends in tobacco consumption. Additionally, the Year column was converted to a numeric data type to ensure correct handling in calculations and visualizations. These cleaning steps helped create a consistent and reliable dataset for further analysis.
b. Visualization Interpretation
The primary visualization in this project was a heatmap, which illustrated the changes in Total Tobacco Consumption per Capita across different years and types of tobacco products (the Measure variable). The heatmap used color intensity to show higher or lower levels of consumption.
Several insights emerged from the heatmap:
General Trends: The heatmap revealed a noticeable decline in per capita tobacco consumption for several tobacco product categories over the years, particularly for combustible tobacco products like cigarettes.
Variation by Product: While some types of tobacco (e.g., cigarettes) showed a steady decline, non-combustible products like smokeless tobacco demonstrated more stable consumption patterns. This suggests that as smoking rates decrease, other tobacco products might be gaining popularity.
Unexpected Findings: In certain years, there was a sudden increase in consumption for specific products such as cigars, which could be attributed to changes in consumer behavior or product marketing.
The heatmap’s ability to visually represent multiple variables (year, product type, and consumption level) helped to uncover these patterns and allowed for easy comparison between product categories.
c. Challenges and Limitations
One of the challenges encountered during this analysis was the representation of multiple product categories within the heatmap. The dataset contained a large variety of tobacco products, and although the heatmap efficiently displayed trends over time, too many categories made it difficult to capture detailed trends for each individual product. Simplifying the product categories, or focusing on a select few, would improve the clarity of the visualization.
Another limitation was the lack of demographic data, such as age or geographic region, which would have provided additional context for the analysis. With demographic information, it would have been possible to explore how tobacco consumption trends differ across various population groups, potentially providing more targeted insights for public health initiatives.