I’m using the Adult Tobacco Use data set for this project. This data set shows figures for adult tobacco use in the United States over a period of 20 years (2000 - 2020). It looks at the population using tobacco in the U.S. from 2000 to 2020, and includes other variables like the type of tobacco being used, the domestic imports Per Capita and so on.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
Rows: 273 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): LocationAbbrev, LocationDesc, Topic, Measure, Submeasure, Data Valu...
dbl (2): Year, Imports Per Capita
num (6): Population, Domestic, Imports, Total, Domestic Per Capita, Total Pe...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(adultTobaccoUseUS_cdc, aes(x = Year, y = Population, color = Measure)) +geom_line() +geom_point() +theme_classic(base_size =12) +labs(title ="Population Using Tobacco Over the Years", caption ="Source = CDC")
This is a simple data visualization showing the increase in tobacco use overtime through the years. This also shows that in modern times smokeless tobacco seems to be the most used kind.
Graph_1 <-ggplot(adultTobaccoUseUS_cdc, aes(x = Year, y = Domestic, color = Measure)) +theme_light(base_size =12) +labs(title ="Domestic purchase of Tobacco over the Years", caption ="Source = CDC")Graph_2 <- Graph_1 +geom_bar(aes(x = Year, y = Domestic, color = Measure), position ="dodge", stat ="identity")Graph_2
While in the first graph we saw that the population of those using tobacco over the years slowly but surely rose, assumption being people in general were buying more, yet here we see a different story. From 2000 to 2020 the domestic purchase (within the U.S.) of tobacco as a whole seems to have been decreasing over the years. It may help to take a close look at the specific measure of tobacco to see if we can answer any questions.
Steps Towards Linear Regression
Lin_Reg <-ggplot(adultTobaccoUseUS_cdc, aes(x =`Domestic Per Capita`, y =`Imports Per Capita`)) +theme_light(base_size =12) +labs(title ="Domestic vs Imports Per Capita", x ="Domestic", y ="Imports", caption ="Source = CDC") +geom_point(aes(color = Year)) Lin_Reg
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
From this graph we find a relatively positve relationship between domestic and imports per capita. In a simple sense this is probably showing that the the tobacco bought domestically was originally imported. The numbers tell us that the more a person seems to buy Domestically the more they become acquainted with tobacco. My theory is that as people get used to the product the more they want authentic product. One other interesting find supporting my previous claim is that the highest correlation between domestic and imports per capita can be found in 2020.
cor(adultTobaccoUseUS_cdc$`Domestic Per Capita`, adultTobaccoUseUS_cdc$`Imports Per Capita`)
[1] 0.8866531
The correlation is almost perfectly linear. (Like I alluded to) The variables have an incredibly strong association to each other.
Equation <-lm(`Imports Per Capita`~`Domestic Per Capita`, data = adultTobaccoUseUS_cdc)summary(Equation)
Call:
lm(formula = `Imports Per Capita` ~ `Domestic Per Capita`, data = adultTobaccoUseUS_cdc)
Residuals:
Min 1Q Median 3Q Max
-33.149 -3.350 -3.344 -0.192 86.106
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.343500 0.783354 4.268 2.73e-05 ***
`Domestic Per Capita` 0.044007 0.001394 31.564 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.84 on 271 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.7854
F-statistic: 996.3 on 1 and 271 DF, p-value: < 2.2e-16
IPC stands for Imports Per Capita and DPC stands for Domestic Per Capita.
IPC = 0.044(DPC) + 3.344
These two variables have a very strong correlation that is almost linear. The Point of all this is to help us understand the trends we find between 2000 and 2020. There is a reason why people seem to have decreased their use of tobacco over the years and it’s important not to allow population to deceive us. One theory I can postulate and look deeper into is that in the last 5 years a specific category of adults have been using tobacco and especially those that are imported from outside rather than grown, produced, or purchased here domestically.
Data Visualization
highchart() |>hc_add_series(data = adultTobaccoUseUS_cdc, type ="Line", hcaes(x = Year, y =`Total Per Capita`, group =`Imports Per Capita`)) |>hc_xAxis(title =list(text ="Year")) |>hc_yAxis(title =list(text ="TPC")) |>hc_legend(title =list(text ="IPC"))