Project 1: Adult Tobacco Use in the United States

Author

Aminata Diatta

Introduction :

The dataset on adult tobacco use in the United States, obtained from the CDC (Centers for Disease Control and Prevention), comprises 15 variables. “Year” runs from 2000 to 2020, “LocationAbbrev” gives the US state abbreviation, “LocationDesc” indicates whether the location is national or not, and “Population” gives the population count or estimate for each year. “Topic” describes the aspect of tobacco use, such as combustible or non-combustible tobacco; “Measure” may be smokeless tobacco, cigars, loose tobacco, cigarettes, or all combustibles; and “Submeasure” adds further detail such as the quantity or specific type of tobacco. “Data Value” holds the numerical value of the measure or submeasure. The dataset also covers domestic and imported tobacco products through “Domestic”, “Imports”, and “Total”, along with “Domestic Per Capita”, “Imports Per Capita”, and “Total Per Capita”, which indicate consumption rates.

This dataset makes it possible to analyze tobacco consumption trends in the United States over two decades. I will narrow the focus to the last three years: 2018, 2019, and 2020. First, a boxplot shows the distribution of total tobacco in each of those years. Second, a dual-axis chart shows domestic tobacco alongside the cumulative proportion of imported tobacco between 2018 and 2020.

Load necessary libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

Load the dataset

setwd("C:/Users/satad/Desktop/data110")
tobacco_data <- read.csv("adultTobaccoUseUS.csv")

I will focus on the years 2018, 2019, and 2020, and I want all data values in pounds

tobacco_data2 <- tobacco_data %>%
  filter(Data.Value.Unit == "Pounds" & Year %in% c(2018, 2019, 2020)) %>%
  select(Data.Value.Unit, Domestic, Imports, Total, LocationAbbrev, Year)
view(tobacco_data2)

Create a dataframe

tobacco_data2 <- data.frame(

  Domestic = c(30724855, 16841656, 5352683, 11488973, 66134346, 27422318, 22266231, 5922424, 28188655, 99572886, 29757852, 1724585, 31482437, 15027164, 116621350),

  Imports = c(476662, 1427444, 739887, 687557, 17729, 192666, 2822447, 511908, 3334355, 339291, 466403, 202296, 668699, 690399, 1738338),

  Total = c(31201517, 182690100, 6092570, 12176530, 66152375, 27614984, 25088678, 6434332, 31523010, 99912177, 30224255, 1926881, 32151136, 15717563, 118359688),

  LocationAbbrev = rep("US", 15),

  Year = c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020))
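As a quick sanity check, the sketch below tests whether each Total equals Domestic + Imports. That relationship is an assumption about how the columns relate, not something confirmed by the dataset documentation here. The names totals_check and find_total_mismatches are hypothetical; the values are copied from the data frame above.

```r
# Values copied from the data frame above.
totals_check <- data.frame(
  Domestic = c(30724855, 16841656, 5352683, 11488973, 66134346, 27422318, 22266231, 5922424, 28188655, 99572886, 29757852, 1724585, 31482437, 15027164, 116621350),
  Imports = c(476662, 1427444, 739887, 687557, 17729, 192666, 2822447, 511908, 3334355, 339291, 466403, 202296, 668699, 690399, 1738338),
  Total = c(31201517, 182690100, 6092570, 12176530, 66152375, 27614984, 25088678, 6434332, 31523010, 99912177, 30224255, 1926881, 32151136, 15717563, 118359688)
)

# Hypothetical helper: return the rows where Total differs from Domestic + Imports.
find_total_mismatches <- function(df) {
  df$Difference <- df$Total - (df$Domestic + df$Imports)
  df[df$Difference != 0, , drop = FALSE]
}

mismatches <- find_total_mismatches(totals_check)
print(mismatches)
```

With these values, rows 2 and 5 are flagged, with differences of 164,421,000 and 300 pounds. If the assumption holds, those two totals may be worth comparing against the source file.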

Fit linear regression model

lm_model <- lm(Total ~ Domestic + Imports, data = tobacco_data2)
summary(lm_model)

Call:
lm(formula = Total ~ Domestic + Imports, data = tobacco_data2)

Residuals:
      Min        1Q    Median        3Q       Max 
-26306097 -13311625  -9672137  -4242177 147574790 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 1.125e+07  2.021e+07   0.557   0.5879  
Domestic    8.220e-01  3.532e-01   2.327   0.0383 *
Imports     7.020e+00  1.232e+01   0.570   0.5794  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44970000 on 12 degrees of freedom
Multiple R-squared:  0.3267,    Adjusted R-squared:  0.2145 
F-statistic: 2.911 on 2 and 12 DF,  p-value: 0.09316
plot(lm_model)
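
One thing to keep in mind when reading this output: if Total were exactly Domestic + Imports in every row, the regression would recover a coefficient of 1 for each predictor. The sketch below illustrates this on a small data frame built from the first six Domestic and Imports values above, with Total recomputed as their sum. The recomputed Total and the names demo, demo_model, and demo_coefs are assumptions for illustration, not part of the original analysis.

```r
# First six Domestic and Imports values from the data frame above.
demo <- data.frame(
  Domestic = c(30724855, 16841656, 5352683, 11488973, 66134346, 27422318),
  Imports  = c(476662, 1427444, 739887, 687557, 17729, 192666)
)

# Assumption for illustration: Total is exactly the sum of the two columns.
demo$Total <- demo$Domestic + demo$Imports

# Same model formula as above, fitted to the illustrative data.
demo_model <- lm(Total ~ Domestic + Imports, data = demo)
demo_coefs <- coef(demo_model)
print(round(demo_coefs, 6))
```

Both slope coefficients come out at 1 up to floating-point error. Calling summary() on such a model warns about an essentially perfect fit, which is expected when the response is an exact linear combination of the predictors.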

options(scipen = 999)

Make a boxplot that represents the distribution of total tobacco in 2018, 2019, and 2020

ggplot(tobacco_data2, aes(x = factor(Year), y = Total, fill = factor(Year))) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Total Tobacco Distribution in 2018, 2019 and 2020",
       x = "Year",
       y = "Total",
       fill = "Year") +
  theme_dark()

Comments: This boxplot shows the distribution of the Total values for 2018, 2019, and 2020. Each box corresponds to one year and is filled with a distinct color from the Set1 palette. The x-axis shows the year and the y-axis shows the total amount of tobacco in pounds. In each box, the line marks the median and the box spans the interquartile range; values far from the rest are drawn as individual points (outliers). For the values in this data frame, the median is highest in 2020 (about 30.2 million pounds) and lowest in 2018 (about 21.7 million pounds). The largest single value, about 182.7 million pounds, falls in 2018 and is drawn as an outlier, so the plot alone does not show that more tobacco was consumed in 2018 than in the other years.
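
The medians that the boxplot draws can also be checked numerically. The sketch below uses base R's tapply() on the Year and Total values copied from the data frame above; the name medians_by_year is hypothetical.

```r
# Year and Total values copied from the data frame above.
Year <- c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020)
Total <- c(31201517, 182690100, 6092570, 12176530, 66152375, 27614984, 25088678, 6434332, 31523010, 99912177, 30224255, 1926881, 32151136, 15717563, 118359688)

# Median of Total within each year (the line inside each box).
medians_by_year <- tapply(Total, Year, median)
print(medians_by_year)
```

For these values the medians are 21,689,023.5 for 2018, 29,568,997 for 2019, and 30,224,255 for 2020.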

Second graph: a dual-axis chart showing domestic tobacco and the cumulative proportion of imported tobacco in 2018, 2019, and 2020.

Filter data for domestic tobacco in 2018, 2019, and 2020

domestic_data <- subset(tobacco_data2, Year %in% c(2018, 2019, 2020))
view(domestic_data)

Calculate cumulative proportion for import tobacco

import_data <- subset(tobacco_data2, Year %in% c(2018, 2019, 2020))
import_data <- import_data[order(import_data$Year), ]
import_data$cumulative_import <- cumsum(import_data$Imports) / sum(import_data$Imports)
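
The code above computes the cumulative proportion row by row, so each year carries several values. An alternative sketch, assuming a single value per year is wanted for the line, sums the imports by year first and then takes the cumulative proportion. The names imports_demo and imports_by_year are hypothetical; the values are copied from the data frame above.

```r
# Year and Imports values copied from the data frame above.
imports_demo <- data.frame(
  Year = c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020),
  Imports = c(476662, 1427444, 739887, 687557, 17729, 192666, 2822447, 511908, 3334355, 339291, 466403, 202296, 668699, 690399, 1738338)
)

# Sum imports within each year; aggregate() returns one row per year, sorted by Year.
imports_by_year <- aggregate(Imports ~ Year, data = imports_demo, FUN = sum)

# Running share of all imports reached by the end of each year.
imports_by_year$cumulative_import <- cumsum(imports_by_year$Imports) / sum(imports_by_year$Imports)
print(imports_by_year)
```

Because the rows in this data frame appear to mix measures and subtotals, the yearly sums are only as meaningful as that mix; the sketch shows the mechanics rather than a corrected result.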

Plot dual-axis chart

ggplot() +
  # Bars: domestic tobacco (rows within the same year are stacked)
  geom_col(data = domestic_data, aes(x = Year, y = Domestic), fill = "yellow", alpha = 0.5) +
  # Line: cumulative import proportion, rescaled to the domestic axis
  geom_line(data = import_data, aes(x = Year, y = cumulative_import * max(domestic_data$Domestic)), color = "red") +
  # The secondary axis undoes the rescaling and shows a percentage
  scale_y_continuous(name = "Domestic Tobacco",
                     sec.axis = sec_axis(~ . / max(domestic_data$Domestic) * 100, name = "Cumulative Import Tobacco (%)")) +
  labs(title = "Dual-Axis Chart for Tobacco Distribution",
       x = "Year") +
  # theme_update() changes the global theme; theme_gray() applies the default theme to this plot
  theme_gray()

Comments:

This dual-axis chart shows domestic and imported tobacco for 2018, 2019, and 2020. The primary (left) axis shows the amount of domestic tobacco as yellow bars, one per year. The secondary (right) axis shows the cumulative proportion of imported tobacco as a red line. To draw both series on one chart, the cumulative proportion is multiplied by the maximum domestic value, and the secondary axis converts it back to a percentage. The chart makes it possible to compare domestic amounts with the build-up of imports over the three years.

Essay :

Choosing the adult tobacco dataset for my project came from a curiosity to explore a topic I wasn’t very familiar with. I find that working on projects involving unfamiliar subjects allows me to learn and grow more. In the following paragraphs, I will describe the process of this project, starting with the cleaning step, then my choices, and finally the difficulties I ran into.

While cleaning the dataset, I aimed to simplify my work by first getting rid of columns I did not need, like Submeasure or Topic, which held text I didn’t use. I made sure all my measurements were in pounds, which were the values I used in the dataset. I carefully chose which variables to focus on, like Year, Domestic, Imports, Total, and LocationAbbrev. These choices were made because they helped me understand tobacco use trends better and kept my analysis focused and organized.

For the graphs, I decided to create two. The first one, a boxplot, helps me understand not only which year had the highest tobacco use, but also where the middle value lies for each year. The second graph gives more insight. I calculated the cumulative proportion of imported tobacco. Considering that the United States, as one of the richest countries, imports tobacco from others, it’s concerning to see how many people are dependent on tobacco.

Working on this dataset was fascinating, but I faced a challenge when trying to figure out linear regression. It took me over an hour of watching videos and using various websites to understand it. I also had difficulty creating a bar graph that included a line representing the progression of imported tobacco into the United States.

In conclusion, working on the adult tobacco dataset proved to be an enriching experience driven by my curiosity to explore unfamiliar subjects. The meticulous cleaning process allowed me to streamline my analysis by focusing on essential variables such as Year, Domestic, Imports, Total, DataValue, and LocationAbbrev. Despite encountering challenges with linear regression and graph creation, the project provided valuable learning opportunities and insights into the complexities of tobacco use trends. Moving forward, I aim to apply the knowledge gained from this project to further my understanding of data analysis techniques and their application in addressing societal issues.

Data Source: https://www.cdc.gov/tobacco/data_statistics/index.htm