Title: “Data_110_Final”
Author: “Joyce Liang”
Format: revealjs
Editor: visual

Image Source:https://reason.org/commentary/why-nobody-cares-about-teen-smoking/

Image Source:https://reason.org/commentary/why-nobody-cares-about-teen-smoking/

INTRODUCTION

Before jumping into my dataset, I want to emphasize that the dataset does not just contain subjects, they are people. People that are children and are subjected to a practice that can cause health problems lasting for the rest of their lives. The children in the dataset smoke not just because they want to, but as a result of many factors. For example, some partake in tobacco products because of social and physical environments- the media depicting tobacco products as a normal activity or being surrounded by friends and family regularly participating in the practice. A couple other factors that can lead to tobacco use could be biological and genetic factors, mental health, personal views, lack of support or involvement from parents, and accessibility, availability, and price of tobacco products. It is critical to educate the youth about the use of tobacco products because “youth who use multiple tobacco products are at higher risk for developing nicotine dependence and might be more likely to continue using tobacco products into adulthood.”(cdc.gov) My dataset is a collection of values of both middle school and high school students using tobacco in participating states. The data was collected using the the STATE System, “an interactive application that presents current and historical state-level data on tobacco use prevention and control” (https://www.cdc.gov/statesystem/index.html). Published by the Centers for Disease Control (under the U.S. Department of Health & Human Services), the original dataset contained 10,600 observations and 31 variables. The variables I want to focus on are Year, State, TopicDesc, Data_Value, Data_Value_Unit, Education, GeoLocation, Sample_Size, Gender, and Race. I will be using linear regression to answer my questions about the dataset. Tobacco use has been a popular practice since the 1900s. In the 21st century, items such as electronic cigarettes have grown to be the next popular item among younger generations. Despite the growing use of e-cigarettes, I am anticipating a decreasing trend in the use of tobacco. The dataset has a variable called “TopicDesc” and under this variable contains values such as Cessation (Youth), Cigarette Use (Youth), and Smokeless Tobacco Use (Youth). In my exploration, I want to focus on cigarettes (tobacco or tobacco essence and nicotine that is smoked) and smokeless tobacco (chewing tobacco, snuff, Nicotine pouches, and dis-solvable tobacco) and discover whether or not the relationship between time and TopicDesc is negative (decreasing). The other question I want to explore is if middle school students are more prone to obtaining tobacco products compared to high school students? Last question is, what year and state experienced the highest use of tobacco. My original intent was to find a dataset on diabetes and the attributes that come with it, for example blood pressure, blood sugar, BMI, diet, etc. However, I had a hard time looking for a dataset that was rich enough to provide with such information. So I continued my search and found the Youth Tobacco Survey (YTS) dataset. This topic is significant to me because many older members of my family smoke(cigarettes), I smoke (cigarettes) occasionally, and my friends smoke (predominantly e-cigarettes). I am curious if tobacco use is still a popular item, despite many of my friends and family partaking in the practice. Link to data source: https://catalog.data.gov/dataset/youth-tobacco-survey-yts-data

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(dplyr)
library(RColorBrewer)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(leaflet)
library(knitr)
library(treemap)
library(ggplot2)
library(dplyr)

Loading data and change names

youth_tobacco <- read_csv("Youth_Tobacco_Survey__YTS__Data.csv")
Rows: 10600 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): LocationAbbr, LocationDesc, TopicType, TopicDesc, MeasureDesc, Dat...
dbl  (7): YEAR, Data_Value, Data_Value_Std_Err, Low_Confidence_Limit, High_C...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(youth_tobacco)[names(youth_tobacco)== "LocationDesc"] <- "State" #I did not prefer the original name "LocationDesc", so I changed it to "State"
names(youth_tobacco)[names(youth_tobacco)== "YEAR"] <- "Year" #I did not prefer the all caps, so I changed it from "YEAR" to "Year"
names(youth_tobacco)
 [1] "Year"                       "LocationAbbr"              
 [3] "State"                      "TopicType"                 
 [5] "TopicDesc"                  "MeasureDesc"               
 [7] "DataSource"                 "Response"                  
 [9] "Data_Value_Unit"            "Data_Value_Type"           
[11] "Data_Value"                 "Data_Value_Footnote_Symbol"
[13] "Data_Value_Footnote"        "Data_Value_Std_Err"        
[15] "Low_Confidence_Limit"       "High_Confidence_Limit"     
[17] "Sample_Size"                "Gender"                    
[19] "Race"                       "Age"                       
[21] "Education"                  "GeoLocation"               
[23] "TopicTypeId"                "TopicId"                   
[25] "MeasureId"                  "StratificationID1"         
[27] "StratificationID2"          "StratificationID3"         
[29] "StratificationID4"          "SubMeasureID"              
[31] "DisplayOrder"              

Columns I want to focus on

refined_youth_tobacco <- youth_tobacco |>
  select("Year", "State", "TopicDesc", "Data_Value","Data_Value_Unit","Education","GeoLocation","Sample_Size", "Gender", "Race")

Extract NAs

sum(is.na(refined_youth_tobacco)) #1044 NAs
[1] 1044
youth_NA <-
is.na(refined_youth_tobacco$Year)|
is.na(refined_youth_tobacco$State)|
is.na(refined_youth_tobacco$TopicDesc)|
is.na(refined_youth_tobacco$Data_Value)|
is.na(refined_youth_tobacco$Data_Value_Unit)|
is.na(refined_youth_tobacco$Education)|
is.na(refined_youth_tobacco$GeoLocation)|
is.na(refined_youth_tobacco$Sample_Size)|
is.na(refined_youth_tobacco$Gender)|
is.na(refined_youth_tobacco$Race) 

#youth_NA will contain all the NAs. In the next code, I will exclude the dataset youth_NA- this would extract all NAs

refined_youth_tobacco <- subset(refined_youth_tobacco, subset = !youth_NA) 
#Creates a dataset called refined_youth_tobacco without NAs

sum(is.na(refined_youth_tobacco))
[1] 0
#Now there are no NAs

Separate lat and long

refined_youth_tobacco <- refined_youth_tobacco|>
  mutate(GeoLocation = str_replace_all(GeoLocation, "[()]", ""))|>
  separate(GeoLocation, into = c("lat", "long"), sep = ",", convert = TRUE)
#Separates latitude and longitude and gets rid of the ()
head(refined_youth_tobacco)
# A tibble: 6 × 11
   Year State         TopicDesc Data_Value Data_Value_Unit Education   lat  long
  <dbl> <chr>         <chr>          <dbl> <chr>           <chr>     <dbl> <dbl>
1  2004 Ohio          Cigarett…       32.2 %               Middle S…  40.1 -82.4
2  2008 Alabama       Cigarett…       47.8 %               High Sch…  32.8 -86.6
3  2015 West Virginia Smokeles…       12.7 %               High Sch…  38.7 -80.7
4  2005 Illinois      Cigarett…       28.5 %               Middle S…  40.5 -89.0
5  2005 Connecticut   Smokeles…       20.5 %               High Sch…  41.6 -72.6
6  2009 Connecticut   Smokeles…        1.2 %               High Sch…  41.6 -72.6
# ℹ 3 more variables: Sample_Size <dbl>, Gender <chr>, Race <chr>

Filter column values

refined_youth_tobacco <- filter(refined_youth_tobacco, TopicDesc %in% c("Cigarette Use (Youth)", "Smokeless Tobacco Use (Youth)")) 
#I wanted to explore Cigarette Use (Youth)" and "Smokeless Tobacco Use (Youth)" under TopicDesc

unique(refined_youth_tobacco$TopicDesc) 
[1] "Cigarette Use (Youth)"         "Smokeless Tobacco Use (Youth)"
#Confirmed there are only two categories of TopicDesc

refined_youth_tobacco <- refined_youth_tobacco|>
  filter(Sample_Size > 1000) |>
  #I decided a reliable sample size would be over 1000 
  filter(Gender=="Overall")     
  #I want to explore smoking habits from all kinds of people regardless of gender

Obtaining the average percentage after cleaning the dataset

str(refined_youth_tobacco) 
tibble [2,694 × 11] (S3: tbl_df/tbl/data.frame)
 $ Year           : num [1:2694] 2004 2005 2009 2007 2004 ...
 $ State          : chr [1:2694] "Ohio" "Illinois" "Connecticut" "Connecticut" ...
 $ TopicDesc      : chr [1:2694] "Cigarette Use (Youth)" "Cigarette Use (Youth)" "Smokeless Tobacco Use (Youth)" "Cigarette Use (Youth)" ...
 $ Data_Value     : num [1:2694] 32.2 28.5 1.2 13 9.4 5 6.9 7.2 14.8 6 ...
 $ Data_Value_Unit: chr [1:2694] "%" "%" "%" "%" ...
 $ Education      : chr [1:2694] "Middle School" "Middle School" "High School" "Middle School" ...
 $ lat            : num [1:2694] 40.1 40.5 41.6 41.6 39 ...
 $ long           : num [1:2694] -82.4 -89 -72.6 -72.6 -75.6 ...
 $ Sample_Size    : num [1:2694] 1377 1323 2214 2184 2226 ...
 $ Gender         : chr [1:2694] "Overall" "Overall" "Overall" "Overall" ...
 $ Race           : chr [1:2694] "All Races" "All Races" "All Races" "All Races" ...
#Curious about the structure and classification of each variable. Also to make sure the variables are the right data type

refined_youth_tobacco <- arrange(refined_youth_tobacco,Year) 
#Arranges the data in ascending order
refined_youth_tobacco
# A tibble: 2,694 × 11
    Year State        TopicDesc Data_Value Data_Value_Unit Education   lat  long
   <dbl> <chr>        <chr>          <dbl> <chr>           <chr>     <dbl> <dbl>
 1  1999 New Jersey   Cigarett…        2.2 %               Middle S…  40.1 -74.3
 2  1999 Missouri     Cigarett…       14.9 %               Middle S…  38.6 -92.6
 3  1999 North Carol… Cigarett…       15   %               Middle S…  35.5 -79.2
 4  1999 Mississippi  Smokeles…       24.3 %               High Sch…  32.7 -89.5
 5  1999 North Carol… Smokeles…        3.9 %               Middle S…  35.5 -79.2
 6  1999 Oklahoma     Smokeles…       18.6 %               Middle S…  35.5 -97.5
 7  1999 Mississippi  Smokeles…        9.6 %               High Sch…  32.7 -89.5
 8  1999 Tennessee    Cigarett…       44.2 %               Middle S…  35.7 -85.8
 9  1999 Oklahoma     Smokeles…       13   %               High Sch…  35.5 -97.5
10  1999 Mississippi  Cigarett…       16.6 %               High Sch…  32.7 -89.5
# ℹ 2,684 more rows
# ℹ 3 more variables: Sample_Size <dbl>, Gender <chr>, Race <chr>
refined_youth_tobacco$Data_Value <- refined_youth_tobacco$Data_Value/100 
#Converts percentage to decimal so I can get an accurate reading of the average percentage
refined_youth_tobacco
# A tibble: 2,694 × 11
    Year State        TopicDesc Data_Value Data_Value_Unit Education   lat  long
   <dbl> <chr>        <chr>          <dbl> <chr>           <chr>     <dbl> <dbl>
 1  1999 New Jersey   Cigarett…      0.022 %               Middle S…  40.1 -74.3
 2  1999 Missouri     Cigarett…      0.149 %               Middle S…  38.6 -92.6
 3  1999 North Carol… Cigarett…      0.15  %               Middle S…  35.5 -79.2
 4  1999 Mississippi  Smokeles…      0.243 %               High Sch…  32.7 -89.5
 5  1999 North Carol… Smokeles…      0.039 %               Middle S…  35.5 -79.2
 6  1999 Oklahoma     Smokeles…      0.186 %               Middle S…  35.5 -97.5
 7  1999 Mississippi  Smokeles…      0.096 %               High Sch…  32.7 -89.5
 8  1999 Tennessee    Cigarett…      0.442 %               Middle S…  35.7 -85.8
 9  1999 Oklahoma     Smokeles…      0.13  %               High Sch…  35.5 -97.5
10  1999 Mississippi  Cigarett…      0.166 %               High Sch…  32.7 -89.5
# ℹ 2,684 more rows
# ℹ 3 more variables: Sample_Size <dbl>, Gender <chr>, Race <chr>
ave_youth_tobacco <- refined_youth_tobacco |>
group_by(Year,State, Education, TopicDesc)|> #In a new dataset, ave_youth_tobacco, I grouped the data based on Year, State, Education, and TopicDesc

    summarise(count=n(),
    latitude = first(lat),
    longitude = first(long),
    .groups = "drop", # source to the code was ChatGPT- my question was: is there a way to keep the latitude and longitude values after using the summarise function in R? Option 1: Keep the first (or last) lat/lon value in each group
    Average_Data_Value= round(mean(Data_Value)*100), #putting it back to percentage form, rather than decimal
    Average_Sample_Size=round(mean(Sample_Size)))

str(ave_youth_tobacco) #Check to see if lat and long are numeric
tibble [902 × 9] (S3: tbl_df/tbl/data.frame)
 $ Year               : num [1:902] 1999 1999 1999 1999 1999 ...
 $ State              : chr [1:902] "Georgia" "Georgia" "Kansas" "Kansas" ...
 $ Education          : chr [1:902] "Middle School" "Middle School" "Middle School" "Middle School" ...
 $ TopicDesc          : chr [1:902] "Cigarette Use (Youth)" "Smokeless Tobacco Use (Youth)" "Cigarette Use (Youth)" "Smokeless Tobacco Use (Youth)" ...
 $ count              : int [1:902] 3 3 3 3 3 3 3 3 3 3 ...
 $ latitude           : num [1:902] 32.8 32.8 38.3 38.3 32.7 ...
 $ longitude          : num [1:902] -83.6 -83.6 -98.2 -98.2 -89.5 ...
 $ Average_Data_Value : num [1:902] 22 6 21 8 42 12 28 10 22 7 ...
 $ Average_Sample_Size: num [1:902] 1239 1291 1124 1121 1525 ...

Separate High school and middle school average percentages and separate TopicDesc- Cigarette Use (Youth) and Smokeless Tobacco Use.

getwd()
[1] "/Users/joyceliang/Desktop/DATA_110/FINAL"
H_cig_use <- ave_youth_tobacco |>
  filter(TopicDesc == "Cigarette Use (Youth)")|>
  filter(Education == "High School")

M_cig_use <- ave_youth_tobacco |>
  filter(TopicDesc == "Cigarette Use (Youth)")|>
  filter(Education == "Middle School")
M_cig_use
# A tibble: 239 × 9
    Year State   Education TopicDesc count latitude longitude Average_Data_Value
   <dbl> <chr>   <chr>     <chr>     <int>    <dbl>     <dbl>              <dbl>
 1  1999 Georgia Middle S… Cigarett…     3     32.8     -83.6                 22
 2  1999 Kansas  Middle S… Cigarett…     3     38.3     -98.2                 21
 3  1999 Missis… Middle S… Cigarett…     3     32.7     -89.5                 28
 4  1999 Missou… Middle S… Cigarett…     3     38.6     -92.6                 22
 5  1999 Nebras… Middle S… Cigarett…     3     41.6     -99.4                 15
 6  1999 New Je… Middle S… Cigarett…     3     40.1     -74.3                 16
 7  1999 North … Middle S… Cigarett…     3     35.5     -79.2                 19
 8  1999 Oklaho… Middle S… Cigarett…     3     35.5     -97.5                 23
 9  1999 South … Middle S… Cigarett…     3     44.4    -100.                  17
10  1999 Tennes… Middle S… Cigarett…     3     35.7     -85.8                 21
# ℹ 229 more rows
# ℹ 1 more variable: Average_Sample_Size <dbl>
H_smkeless_use <- ave_youth_tobacco|>
   filter(TopicDesc == "Smokeless Tobacco Use (Youth)")|>
  filter(Education == "High School")
H_smkeless_use
# A tibble: 212 × 9
    Year State   Education TopicDesc count latitude longitude Average_Data_Value
   <dbl> <chr>   <chr>     <chr>     <int>    <dbl>     <dbl>              <dbl>
 1  1999 Missis… High Sch… Smokeles…     3     32.7     -89.5                 12
 2  1999 New Je… High Sch… Smokeles…     3     40.1     -74.3                  9
 3  1999 North … High Sch… Smokeles…     3     35.5     -79.2                 10
 4  1999 Oklaho… High Sch… Smokeles…     3     35.5     -97.5                 17
 5  1999 Tennes… High Sch… Smokeles…     3     35.7     -85.8                 17
 6  2000 Alabama High Sch… Smokeles…     3     32.8     -86.6                 13
 7  2000 Arkans… High Sch… Smokeles…     3     34.7     -92.3                 15
 8  2000 Califo… High Sch… Smokeles…     3     37.6    -121.                   5
 9  2000 Colora… High Sch… Smokeles…     3     38.8    -106.                  12
10  2000 Connec… High Sch… Smokeles…     3     41.6     -72.6                  7
# ℹ 202 more rows
# ℹ 1 more variable: Average_Sample_Size <dbl>
M_smkeless_use <- ave_youth_tobacco|>
   filter(TopicDesc == "Smokeless Tobacco Use (Youth)")|>
   filter(Education == "Middle School")
M_smkeless_use 
# A tibble: 237 × 9
    Year State   Education TopicDesc count latitude longitude Average_Data_Value
   <dbl> <chr>   <chr>     <chr>     <int>    <dbl>     <dbl>              <dbl>
 1  1999 Georgia Middle S… Smokeles…     3     32.8     -83.6                  6
 2  1999 Kansas  Middle S… Smokeles…     3     38.3     -98.2                  8
 3  1999 Missis… Middle S… Smokeles…     3     32.7     -89.5                 10
 4  1999 Missou… Middle S… Smokeles…     3     38.6     -92.6                  7
 5  1999 Nebras… Middle S… Smokeles…     3     41.6     -99.4                  5
 6  1999 New Je… Middle S… Smokeles…     3     40.1     -74.3                  4
 7  1999 North … Middle S… Smokeles…     3     35.5     -79.2                  6
 8  1999 Oklaho… Middle S… Smokeles…     3     35.5     -97.5                  9
 9  1999 South … Middle S… Smokeles…     3     44.4    -100.                   8
10  1999 Tennes… Middle S… Smokeles…     3     35.7     -85.8                  8
# ℹ 227 more rows
# ℹ 1 more variable: Average_Sample_Size <dbl>
#By separating ave_youth_tobacco into 4 "subsets," I can explore the trends of each value more in depth.

Correlation, linear regression, and analysis

cor(ave_youth_tobacco$Year, ave_youth_tobacco$Average_Data_Value) 
[1] -0.327961
#The overall correlation for all youths in participating states is -0.328, which is a weak correlation. After finding the overall correlation, I want to explore the correlation for high school and middle school students, separately

cor(H_cig_use$Year, H_cig_use$Average_Data_Value) #-0.8140186
[1] -0.8156326
#It appears the the correlation between the average percentage and time in years is strong for high school students, meaning high school students are quickly losing interest in the use of cigarettes 

cor(H_smkeless_use$Year, H_smkeless_use$Average_Data_Value) #-0.2274458
[1] -0.2292833
#It appears the the correlation between the average percentage in the use of smokeless products and time in years is weak for high school students.

cor(M_cig_use$Year, M_cig_use$Average_Data_Value) #-0.757
[1] -0.7572516
#It appears the the correlation between the average percentage and time in years is somewhat strong for middle school students, meaning middle school students are also losing interest in the use of cigarettes 

cor(M_smkeless_use$Year, M_smkeless_use$Average_Data_Value) #-0.3364725
[1] -0.3402697
#It appears the the correlation between the average percentage in the use of smokeless products and time in years is aslo weak for middle school students.

Analysis: It is interesting to see how weak of a correlation both high school and middle school students have with smokeless tobacco products.

Correlation, linear regression, and analysis (Continued)

ave_youth_tobacco_lm <-lm(Average_Data_Value ~ Year, data=ave_youth_tobacco) #Average Percentage = -0.5614x +  1139.4351 
ave_youth_tobacco_lm 

Call:
lm(formula = Average_Data_Value ~ Year, data = ave_youth_tobacco)

Coefficients:
(Intercept)         Year  
  1139.4351      -0.5614  
#With every Year, the average percentage of youths using cigarettes decreases by 0.5614 units

H_cig_use_1 <-lm(Average_Data_Value ~ Year, data=H_cig_use) #Average Percentage = -1.276x + 2587.580
H_cig_use_1 

Call:
lm(formula = Average_Data_Value ~ Year, data = H_cig_use)

Coefficients:
(Intercept)         Year  
   2587.580       -1.276  
#With every Year, the average percentage of high school students using cigarettes decreases by 1.274 units

H_smkeless_use_1 <-lm(Average_Data_Value ~ Year, data=H_smkeless_use) #Average Percentage = -0.172x + 355
H_smkeless_use_1

Call:
lm(formula = Average_Data_Value ~ Year, data = H_smkeless_use)

Coefficients:
(Intercept)         Year  
    355.002       -0.172  
#With every Year, the average percentage of high school students using smokeless tobacco products decreases by 0.172 units

M_cig_use_1 <- lm(Average_Data_Value ~ Year, data=M_cig_use) #Average Percentage = -0.7645x + 1545.681
M_cig_use_1

Call:
lm(formula = Average_Data_Value ~ Year, data = M_cig_use)

Coefficients:
(Intercept)         Year  
  1545.6812      -0.7645  
#With every Year, the average percentage of middle schools students using cigarettes decreases by 0.7645 units

M_smkeless_use_1 <-lm(Average_Data_Value ~ Year, data=M_smkeless_use) #Average Percentage = -0.2018x + 409.974 
M_smkeless_use_1

Call:
lm(formula = Average_Data_Value ~ Year, data = M_smkeless_use)

Coefficients:
(Intercept)         Year  
   409.9742      -0.2018  
#With every Year, the average percentage of youths using smokeless tobacco products decreases by 0.001983 units

Summary of the linear models

summary(ave_youth_tobacco_lm)

Call:
lm(formula = Average_Data_Value ~ Year, data = ave_youth_tobacco)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.617  -6.196  -2.196   4.620  28.383 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1139.4351   108.1926   10.53   <2e-16 ***
Year          -0.5614     0.0539  -10.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.831 on 900 degrees of freedom
Multiple R-squared:  0.1076,    Adjusted R-squared:  0.1066 
F-statistic: 108.5 on 1 and 900 DF,  p-value: < 2.2e-16
#p-value: < 2.2e-16- the p-value is lower than alpha 0.05 therefore time(years) is statistically significant and that it is a good predictor in the average percentage for tobacco use in youths
#R-squared: 0.1066- the r-squared suggests that 11% of the variance can be explained by time(years).

summary(H_cig_use_1)

Call:
lm(formula = Average_Data_Value ~ Year, data = H_cig_use)

Residuals:
     Min       1Q   Median       3Q      Max 
-27.5227  -2.2050   0.4979   2.9082  10.3714 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2587.58020  124.83199   20.73   <2e-16 ***
Year          -1.27648    0.06219  -20.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.908 on 212 degrees of freedom
Multiple R-squared:  0.6653,    Adjusted R-squared:  0.6637 
F-statistic: 421.3 on 1 and 212 DF,  p-value: < 2.2e-16
#p-value: < 2.2e-16- the p-value is lower than alpha 0.05 therefore time(years) is statistically significant and that it is a good predictor in the average percentage for tobacco use in youths
#R-squared: 0.664- the r-squared suggests that 66% of the variance can be explained by time(years).
plot(H_cig_use_1)

summary(H_smkeless_use_1)

Call:
lm(formula = Average_Data_Value ~ Year, data = H_smkeless_use)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0312 -3.3224 -0.0693  2.3558 24.3128 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 355.00173  101.13584   3.510 0.000548 ***
Year         -0.17199    0.05038  -3.414 0.000769 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.975 on 210 degrees of freedom
Multiple R-squared:  0.05257,   Adjusted R-squared:  0.04806 
F-statistic: 11.65 on 1 and 210 DF,  p-value: 0.0007693
# p-value: 0.0007693- the p-value is lower than alpha 0.05 therefore time(years) is statistically significant and that it is a good predictor in the average percentage for tobacco use in youths
#R-squared: 0.04806- the r-squared suggests that 4.78% of the variance can be explained by time(years).
plot(H_smkeless_use_1)

summary(M_cig_use_1)

Call:
lm(formula = Average_Data_Value ~ Year, data = M_cig_use)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.6566 -2.3920 -0.1275  2.3434 10.5789 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1545.68117   85.95906   17.98   <2e-16 ***
Year          -0.76451    0.04283  -17.85   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.627 on 237 degrees of freedom
Multiple R-squared:  0.5734,    Adjusted R-squared:  0.5716 
F-statistic: 318.6 on 1 and 237 DF,  p-value: < 2.2e-16
#p-value: < 2.2e-16- the p-value is lower than alpha 0.05 therefore time(years) is statistically significant and that it is a good predictor in the average percentage for tobacco use in youths
#R-squared:  0.5712- the r-squared suggests that 57% of the variance can be explained by time(years).
plot(M_cig_use_1)

summary(M_smkeless_use_1)

Call:
lm(formula = Average_Data_Value ~ Year, data = M_smkeless_use)

Residuals:
   Min     1Q Median     3Q    Max 
-4.414 -2.002 -0.598  1.393 29.990 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 409.97419   73.00094   5.616 5.50e-08 ***
Year         -0.20178    0.03637  -5.547 7.79e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.08 on 235 degrees of freedom
Multiple R-squared:  0.1158,    Adjusted R-squared:  0.112 
F-statistic: 30.77 on 1 and 235 DF,  p-value: 7.794e-08
#p-value: < 7.794e-08 the p-value is lower than alpha 0.05 therefore time(years) is statistically significant and that it is a good predictor in the average percentage for tobacco use in youths
#R-squared: 0.112- the r-squared suggests that 11% of the variance can be explained by time(years).
plot(M_smkeless_use_1)

Analysis: Overall, the summaries depicting the residuals vs fitted, normal Q-Q, scale-location, and Cook’s distance indicate that the model represents the data well.

VISUALIZATION 1: Scatterplot

plot2<-ggplot(ave_youth_tobacco, aes(x = Year, 
                                 y = Average_Data_Value,
                                 color = Education,
                                 size = Average_Sample_Size, 
                                 text = paste("State:", State)))+
  
  geom_point(alpha = 0.5) +  #Because the geompoint size is going to be based off of sample size, alpha is needed as there will be overlapping
  facet_wrap(~ TopicDesc) +  #Facet by TopicDesc
  theme_minimal() +
  labs(title = "Cigarette Use VS Smokeless Tobacco use Among Youths",
       x= "\nYear",
       y= "\nPercentage(Decimal)",
      caption = "Source: Centers for Disease Control and Prevention (CDC)")+
theme_grey(base_size = 10)

plot2<- plot2+scale_color_manual(values=c("#E69F00", "#56B4E9")) #The colors will help distinguish between high school and middle school students
plot2

ggplotly(plot2) #for some reason plotly is not allowing the caption for the source to appear, but it appears just fine without plotly

Analysis: This is a great visualization because I am able to see the overall trend of both high school and middle school students based on the two TopicDesc, Cigarette Use and Smokeless Tobacco Use. The visualization depicts a scatterplot that plots the average percentage of each state in a year. The trend is a negative relationship in both areas meaning youths are progressively not using tobacco products. The most notable observation is the steep decrease in cigarette use through the years for both high school and middle school students. It also appears that high school students tend to have a higher chance of using cigarettes compared to middle school students.The scatterplot for smokeless tobacco, on the other hand, indicates a steady use among high school and middle school students despite it slowly decreasing. Smokeless tobacco also appears to be a less popular item compared to cigarettes. High school students still have a higher chance in using smokeless tobacco products compared to middle school students.

To answer my previous question, what year and state experienced the highest use in tobacco- in terms of cigarette use, high school students in the state of West Virginia experienced the highest average percentage in the year 2000. In terms of smokeless tobacco, high school students in the state of Kentucky experienced the highest average percentage in the year 2010. We will exclude the data values pertaining to the state of Florida for smokeless tobacco as it is considered an outlier.

VISUAL 2: Treemap

# After reviewing the scatterplot, I was curious about the state with the highest average occurrence of youths using tobacco related products. The scatter plot unfortunately does not depict the cumulative visual of the state with the highest average occurrence of youths using tobacco related products

treemap(ave_youth_tobacco, 
        index="State", 
        vSize="Average_Sample_Size", 
        vColor="Average_Data_Value", #The states with a greater amount if youths using tobacco products is green. The states with lesser amount of youths using tobacco products is yellow- red.
        type="manual",
        border.col = c("white"), #I don't prefer the black borders, so I changed it to white
        border.lwds = c(7), 
        fontsize.title = 12,
        title = "States With the Highest Occurance of Youth Using Tobacco Products", 
        title.legend = "Frequency",  
        palette="RdYlGn")

Analysis: I am unable to add a caption to the tree map in R so I will state it here. Source: https://catalog.data.gov/dataset/youth-tobacco-survey-yts-data The treemap depict states from the lowest to highest number of youths using tobacco related products. States with the highest number of youths using tobacco related products is illustrated in a range green. States with a lower number of youths using tobacco related products are conveyed through a range of red to yellow. The size of the blocks represent the sample size- the larger the sample size, the bigger the block. The states, Mississippi, Alabama, West Virginia, Oklahoma, Kentucky appear to be the top 5 states with the highest rate of youths using tobacco related products. I cross check this observation by separating the dataset based on Education and TopicDesc. Then I explore which of the states contain the highest number of youths based on the specific type of TopicDesc.

Cross- check

Top_5_ave <- ave_youth_tobacco |> #This is the overall top 5 states that is visualized throught the treemap
  group_by(State)|>
  summarize(sum = sum(Average_Data_Value)) |>
  slice_max(order_by = sum, n=5)
Top_5_ave
# A tibble: 5 × 2
  State           sum
  <chr>         <dbl>
1 Mississippi     819
2 Alabama         581
3 West Virginia   537
4 Oklahoma        467
5 Kentucky        463
#Mississippi, Alabama, West Virginia, Oklahoma, Kentucky

H_smkeless_5 <- H_smkeless_use |> #This dataset is a collection of high school students using smokeless tobacco products
  group_by(State)|>
  summarize(sum = sum(Average_Data_Value)) |>
  slice_max(order_by = sum, n=5) #operates on a grouped table, and returns the largest observations in each group.
H_smkeless_5
# A tibble: 5 × 2
  State           sum
  <chr>         <dbl>
1 Mississippi     151
2 West Virginia   125
3 Alabama         117
4 Kentucky        103
5 North Dakota    101
#Mississippi, West Virginia, Alabama, Kentucky, North Dakota

H_cig_5 <- H_cig_use |> #This dataset is a collection of high school students using cigarettes
  group_by(State)|>
  summarize(sum = sum(Average_Data_Value)) |>
  slice_max(order_by = sum, n=5) 
H_cig_5
# A tibble: 5 × 2
  State            sum
  <chr>          <dbl>
1 Mississippi      362
2 Alabama          254
3 West Virginia    233
4 North Carolina   221
5 Kentucky         217
#Mississippi, Alabama, West Virginia, North Carolina, Kentucky

M_smkeless_5 <- M_smkeless_use |> #This dataset is a collection of middle school students using smokeless tobacco products
  group_by(State)|>
  summarize(sum = sum(Average_Data_Value)) |>
  slice_max(order_by = sum, n=5)
M_smkeless_5
# A tibble: 5 × 2
  State           sum
  <chr>         <dbl>
1 Mississippi     104
2 Alabama          79
3 West Virginia    65
4 Oklahoma         64
5 Kentucky         51
#Mississippi, Alabama, West Virginia, Oklahoma, Kentucky

M_cig_5 <- M_cig_use |> #This dataset is a collection of middle school students using cigarettes
  group_by(State)|>
  summarize(sum = sum(Average_Data_Value)) |>
  slice_max(order_by = sum, n=5)
M_cig_5
# A tibble: 5 × 2
  State           sum
  <chr>         <dbl>
1 Mississippi     202
2 Alabama         131
3 Oklahoma        131
4 West Virginia   114
5 Louisiana        96
#Mississippi,Alabama, Oklahoma, West Virginia, Louisiana

Analysis: After exploring each sub dataset, I can confirm each sub dataset correspond to the main dataset used for the treemap, therefore the treemap is reliable in terms of which state contained the highest occurrence of youths using tobacco products.

Conclusion

Although my dataset is based on data collected from 1999-2017, according to newly released data from the 2024 National Youth Tobacco Survey (NYTS), “current tobacco product use among U.S. middle and high school students has dropped to the lowest recorded level in 25 years.”(www.cdc.gov) Despite the misconception of e-cigarettes and even hookah, becoming a popular practice, evidence has shown that there is a decrease in all methods of tobacco use. In other words, it is confirmed that there is a decreasing trend in terms of tobacco use in youths. The article states that the decrease might be due to multiple factors such as price increases on tobacco related products, mass media campaigns that educate youths of the harmful effects of tobacco use, and stronger implementation of smoke-free policies in and outside of establishments. I thoroughly enjoyed this final project because I was able to utilize a lot of the knowledge I learned throughout the semester. What I wished I could have included is the use of Tabluea and GIS. I did initially use GIS, but decided against it and used a treemap because I felt like it visualized the message better. As for Tabluea, I did attempt to use it, but as a result of other obligations, I had to scratch that method. I definitely look forward to other data visualizations!

Link to background research: https://www.cdc.gov/media/releases/2024/p1017-youth-tobacco-use.html#:~:text=Within%20the%20past%20year%20alone,to%202.80%20million%20in%202023. https://www.cdc.gov/tobacco/php/data-statistics/youth-data-tobacco/?CDC_AAref_Val=https://www.cdc.gov/tobacco/data_statistics/fact_sheets/youth_data/tobacco_use/index.htm

Citation: Centers for Disease Control and Prevention. (n.d.). Youth Tobacco product use at a 25-year low, yet disparities persist. Centers for Disease Control and Prevention. https://www.cdc.gov/media/releases/2024/p1017-youth-tobacco-use.html#:~:text=Within%20the%20past%20year%20alone,to%202.80%20million%20in%202023. Centers for Disease Control and Prevention. (n.d.-a). Youth and tobacco use. Centers for Disease Control and Prevention. https://www.cdc.gov/tobacco/php/data-statistics/youth-data-tobacco/?CDC_AAref_Val=https%3A%2F%2Fwww.cdc.gov%2Ftobacco%2Fdata_statistics%2Ffact_sheets%2Fyouth_data%2Ftobacco_use%2Findex.htm