Before jumping into my dataset, I want to emphasize that the dataset does not just contain subjects, they are people. People that are children and are subjected to a practice that can cause health problems lasting for the rest of their lives. The children in the dataset smoke not just because they want to, but as a result of many factors. For example, some partake in tobacco products because of social and physical environments- the media depicting tobacco products as a normal activity or being surrounded by friends and family regularly participating in the practice. A couple other factors that can lead to tobacco use could be biological and genetic factors, mental health, personal views, lack of support or involvement from parents, and accessibility, availability, and price of tobacco products. It is critical to educate the youth about the use of tobacco products because “youth who use multiple tobacco products are at higher risk for developing nicotine dependence and might be more likely to continue using tobacco products into adulthood.”(cdc.gov) My dataset is a collection of values of both middle school and high school students using tobacco in participating states. The data was collected using the the STATE System, “an interactive application that presents current and historical state-level data on tobacco use prevention and control” (https://www.cdc.gov/statesystem/index.html). Published by the Centers for Disease Control (under the U.S. Department of Health & Human Services), the original dataset contained 10,600 observations and 31 variables. The variables I want to focus on are Year, State, TopicDesc, Data_Value, Data_Value_Unit, Education, GeoLocation, Sample_Size, Gender, and Race. I will be using linear regression to answer my questions about the dataset. Tobacco use has been a popular practice since the 1900s. In the 21st century, items such as electronic cigarettes have grown to be the next popular item among younger generations. Despite the growing use of e-cigarettes, I am anticipating a decreasing trend in the use of tobacco. The dataset has a variable called “TopicDesc” and under this variable contains values such as Cessation (Youth), Cigarette Use (Youth), and Smokeless Tobacco Use (Youth). In my exploration, I want to focus on cigarettes (tobacco or tobacco essence and nicotine that is smoked) and smokeless tobacco (chewing tobacco, snuff, Nicotine pouches, and dis-solvable tobacco) and discover whether or not the relationship between time and TopicDesc is negative (decreasing). The other question I want to explore is if middle school students are more prone to obtaining tobacco products compared to high school students? Last question is, what year and state experienced the highest use of tobacco. My original intent was to find a dataset on diabetes and the attributes that come with it, for example blood pressure, blood sugar, BMI, diet, etc. However, I had a hard time looking for a dataset that was rich enough to provide with such information. So I continued my search and found the Youth Tobacco Survey (YTS) dataset. This topic is significant to me because many older members of my family smoke(cigarettes), I smoke (cigarettes) occasionally, and my friends smoke (predominantly e-cigarettes). I am curious if tobacco use is still a popular item, despite many of my friends and family partaking in the practice. Link to data source: https://catalog.data.gov/dataset/youth-tobacco-survey-yts-data
Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 10600 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): LocationAbbr, LocationDesc, TopicType, TopicDesc, MeasureDesc, Dat...
dbl (7): YEAR, Data_Value, Data_Value_Std_Err, Low_Confidence_Limit, High_C...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(youth_tobacco)[names(youth_tobacco)=="LocationDesc"] <-"State"#I did not prefer the original name "LocationDesc", so I changed it to "State"names(youth_tobacco)[names(youth_tobacco)=="YEAR"] <-"Year"#I did not prefer the all caps, so I changed it from "YEAR" to "Year"names(youth_tobacco)
youth_NA <-is.na(refined_youth_tobacco$Year)|is.na(refined_youth_tobacco$State)|is.na(refined_youth_tobacco$TopicDesc)|is.na(refined_youth_tobacco$Data_Value)|is.na(refined_youth_tobacco$Data_Value_Unit)|is.na(refined_youth_tobacco$Education)|is.na(refined_youth_tobacco$GeoLocation)|is.na(refined_youth_tobacco$Sample_Size)|is.na(refined_youth_tobacco$Gender)|is.na(refined_youth_tobacco$Race) #youth_NA will contain all the NAs. In the next code, I will exclude the dataset youth_NA- this would extract all NAsrefined_youth_tobacco <-subset(refined_youth_tobacco, subset =!youth_NA) #Creates a dataset called refined_youth_tobacco without NAssum(is.na(refined_youth_tobacco))
[1] 0
#Now there are no NAs
Separate lat and long
refined_youth_tobacco <- refined_youth_tobacco|>mutate(GeoLocation =str_replace_all(GeoLocation, "[()]", ""))|>separate(GeoLocation, into =c("lat", "long"), sep =",", convert =TRUE)#Separates latitude and longitude and gets rid of the ()head(refined_youth_tobacco)
# A tibble: 6 × 11
Year State TopicDesc Data_Value Data_Value_Unit Education lat long
<dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 2004 Ohio Cigarett… 32.2 % Middle S… 40.1 -82.4
2 2008 Alabama Cigarett… 47.8 % High Sch… 32.8 -86.6
3 2015 West Virginia Smokeles… 12.7 % High Sch… 38.7 -80.7
4 2005 Illinois Cigarett… 28.5 % Middle S… 40.5 -89.0
5 2005 Connecticut Smokeles… 20.5 % High Sch… 41.6 -72.6
6 2009 Connecticut Smokeles… 1.2 % High Sch… 41.6 -72.6
# ℹ 3 more variables: Sample_Size <dbl>, Gender <chr>, Race <chr>
Filter column values
refined_youth_tobacco <-filter(refined_youth_tobacco, TopicDesc %in%c("Cigarette Use (Youth)", "Smokeless Tobacco Use (Youth)")) #I wanted to explore Cigarette Use (Youth)" and "Smokeless Tobacco Use (Youth)" under TopicDescunique(refined_youth_tobacco$TopicDesc)
[1] "Cigarette Use (Youth)" "Smokeless Tobacco Use (Youth)"
#Confirmed there are only two categories of TopicDescrefined_youth_tobacco <- refined_youth_tobacco|>filter(Sample_Size >1000) |>#I decided a reliable sample size would be over 1000 filter(Gender=="Overall") #I want to explore smoking habits from all kinds of people regardless of gender
Obtaining the average percentage after cleaning the dataset
str(refined_youth_tobacco)
tibble [2,694 × 11] (S3: tbl_df/tbl/data.frame)
$ Year : num [1:2694] 2004 2005 2009 2007 2004 ...
$ State : chr [1:2694] "Ohio" "Illinois" "Connecticut" "Connecticut" ...
$ TopicDesc : chr [1:2694] "Cigarette Use (Youth)" "Cigarette Use (Youth)" "Smokeless Tobacco Use (Youth)" "Cigarette Use (Youth)" ...
$ Data_Value : num [1:2694] 32.2 28.5 1.2 13 9.4 5 6.9 7.2 14.8 6 ...
$ Data_Value_Unit: chr [1:2694] "%" "%" "%" "%" ...
$ Education : chr [1:2694] "Middle School" "Middle School" "High School" "Middle School" ...
$ lat : num [1:2694] 40.1 40.5 41.6 41.6 39 ...
$ long : num [1:2694] -82.4 -89 -72.6 -72.6 -75.6 ...
$ Sample_Size : num [1:2694] 1377 1323 2214 2184 2226 ...
$ Gender : chr [1:2694] "Overall" "Overall" "Overall" "Overall" ...
$ Race : chr [1:2694] "All Races" "All Races" "All Races" "All Races" ...
#Curious about the structure and classification of each variable. Also to make sure the variables are the right data typerefined_youth_tobacco <-arrange(refined_youth_tobacco,Year) #Arranges the data in ascending orderrefined_youth_tobacco
# A tibble: 2,694 × 11
Year State TopicDesc Data_Value Data_Value_Unit Education lat long
<dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 1999 New Jersey Cigarett… 2.2 % Middle S… 40.1 -74.3
2 1999 Missouri Cigarett… 14.9 % Middle S… 38.6 -92.6
3 1999 North Carol… Cigarett… 15 % Middle S… 35.5 -79.2
4 1999 Mississippi Smokeles… 24.3 % High Sch… 32.7 -89.5
5 1999 North Carol… Smokeles… 3.9 % Middle S… 35.5 -79.2
6 1999 Oklahoma Smokeles… 18.6 % Middle S… 35.5 -97.5
7 1999 Mississippi Smokeles… 9.6 % High Sch… 32.7 -89.5
8 1999 Tennessee Cigarett… 44.2 % Middle S… 35.7 -85.8
9 1999 Oklahoma Smokeles… 13 % High Sch… 35.5 -97.5
10 1999 Mississippi Cigarett… 16.6 % High Sch… 32.7 -89.5
# ℹ 2,684 more rows
# ℹ 3 more variables: Sample_Size <dbl>, Gender <chr>, Race <chr>
refined_youth_tobacco$Data_Value <- refined_youth_tobacco$Data_Value/100#Converts percentage to decimal so I can get an accurate reading of the average percentagerefined_youth_tobacco
# A tibble: 2,694 × 11
Year State TopicDesc Data_Value Data_Value_Unit Education lat long
<dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 1999 New Jersey Cigarett… 0.022 % Middle S… 40.1 -74.3
2 1999 Missouri Cigarett… 0.149 % Middle S… 38.6 -92.6
3 1999 North Carol… Cigarett… 0.15 % Middle S… 35.5 -79.2
4 1999 Mississippi Smokeles… 0.243 % High Sch… 32.7 -89.5
5 1999 North Carol… Smokeles… 0.039 % Middle S… 35.5 -79.2
6 1999 Oklahoma Smokeles… 0.186 % Middle S… 35.5 -97.5
7 1999 Mississippi Smokeles… 0.096 % High Sch… 32.7 -89.5
8 1999 Tennessee Cigarett… 0.442 % Middle S… 35.7 -85.8
9 1999 Oklahoma Smokeles… 0.13 % High Sch… 35.5 -97.5
10 1999 Mississippi Cigarett… 0.166 % High Sch… 32.7 -89.5
# ℹ 2,684 more rows
# ℹ 3 more variables: Sample_Size <dbl>, Gender <chr>, Race <chr>
ave_youth_tobacco <- refined_youth_tobacco |>group_by(Year,State, Education, TopicDesc)|>#In a new dataset, ave_youth_tobacco, I grouped the data based on Year, State, Education, and TopicDescsummarise(count=n(),latitude =first(lat),longitude =first(long),.groups ="drop", # source to the code was ChatGPT- my question was: is there a way to keep the latitude and longitude values after using the summarise function in R? Option 1: Keep the first (or last) lat/lon value in each groupAverage_Data_Value=round(mean(Data_Value)*100), #putting it back to percentage form, rather than decimalAverage_Sample_Size=round(mean(Sample_Size)))str(ave_youth_tobacco) #Check to see if lat and long are numeric
tibble [902 × 9] (S3: tbl_df/tbl/data.frame)
$ Year : num [1:902] 1999 1999 1999 1999 1999 ...
$ State : chr [1:902] "Georgia" "Georgia" "Kansas" "Kansas" ...
$ Education : chr [1:902] "Middle School" "Middle School" "Middle School" "Middle School" ...
$ TopicDesc : chr [1:902] "Cigarette Use (Youth)" "Smokeless Tobacco Use (Youth)" "Cigarette Use (Youth)" "Smokeless Tobacco Use (Youth)" ...
$ count : int [1:902] 3 3 3 3 3 3 3 3 3 3 ...
$ latitude : num [1:902] 32.8 32.8 38.3 38.3 32.7 ...
$ longitude : num [1:902] -83.6 -83.6 -98.2 -98.2 -89.5 ...
$ Average_Data_Value : num [1:902] 22 6 21 8 42 12 28 10 22 7 ...
$ Average_Sample_Size: num [1:902] 1239 1291 1124 1121 1525 ...
Separate High school and middle school average percentages and separate TopicDesc- Cigarette Use (Youth) and Smokeless Tobacco Use.
getwd()
[1] "/Users/joyceliang/Desktop/DATA_110/FINAL"
H_cig_use <- ave_youth_tobacco |>filter(TopicDesc =="Cigarette Use (Youth)")|>filter(Education =="High School")M_cig_use <- ave_youth_tobacco |>filter(TopicDesc =="Cigarette Use (Youth)")|>filter(Education =="Middle School")M_cig_use
#The overall correlation for all youths in participating states is -0.328, which is a weak correlation. After finding the overall correlation, I want to explore the correlation for high school and middle school students, separatelycor(H_cig_use$Year, H_cig_use$Average_Data_Value) #-0.8140186
[1] -0.8156326
#It appears the the correlation between the average percentage and time in years is strong for high school students, meaning high school students are quickly losing interest in the use of cigarettes cor(H_smkeless_use$Year, H_smkeless_use$Average_Data_Value) #-0.2274458
[1] -0.2292833
#It appears the the correlation between the average percentage in the use of smokeless products and time in years is weak for high school students.cor(M_cig_use$Year, M_cig_use$Average_Data_Value) #-0.757
[1] -0.7572516
#It appears the the correlation between the average percentage and time in years is somewhat strong for middle school students, meaning middle school students are also losing interest in the use of cigarettes cor(M_smkeless_use$Year, M_smkeless_use$Average_Data_Value) #-0.3364725
[1] -0.3402697
#It appears the the correlation between the average percentage in the use of smokeless products and time in years is aslo weak for middle school students.
Analysis: It is interesting to see how weak of a correlation both high school and middle school students have with smokeless tobacco products.
Correlation, linear regression, and analysis (Continued)
Call:
lm(formula = Average_Data_Value ~ Year, data = ave_youth_tobacco)
Coefficients:
(Intercept) Year
1139.4351 -0.5614
#With every Year, the average percentage of youths using cigarettes decreases by 0.5614 unitsH_cig_use_1 <-lm(Average_Data_Value ~ Year, data=H_cig_use) #Average Percentage = -1.276x + 2587.580H_cig_use_1
Call:
lm(formula = Average_Data_Value ~ Year, data = H_cig_use)
Coefficients:
(Intercept) Year
2587.580 -1.276
#With every Year, the average percentage of high school students using cigarettes decreases by 1.274 unitsH_smkeless_use_1 <-lm(Average_Data_Value ~ Year, data=H_smkeless_use) #Average Percentage = -0.172x + 355H_smkeless_use_1
Call:
lm(formula = Average_Data_Value ~ Year, data = H_smkeless_use)
Coefficients:
(Intercept) Year
355.002 -0.172
#With every Year, the average percentage of high school students using smokeless tobacco products decreases by 0.172 unitsM_cig_use_1 <-lm(Average_Data_Value ~ Year, data=M_cig_use) #Average Percentage = -0.7645x + 1545.681M_cig_use_1
Call:
lm(formula = Average_Data_Value ~ Year, data = M_cig_use)
Coefficients:
(Intercept) Year
1545.6812 -0.7645
#With every Year, the average percentage of middle schools students using cigarettes decreases by 0.7645 unitsM_smkeless_use_1 <-lm(Average_Data_Value ~ Year, data=M_smkeless_use) #Average Percentage = -0.2018x + 409.974 M_smkeless_use_1
Call:
lm(formula = Average_Data_Value ~ Year, data = M_smkeless_use)
Coefficients:
(Intercept) Year
409.9742 -0.2018
#With every Year, the average percentage of youths using smokeless tobacco products decreases by 0.001983 units
Summary of the linear models
summary(ave_youth_tobacco_lm)
Call:
lm(formula = Average_Data_Value ~ Year, data = ave_youth_tobacco)
Residuals:
Min 1Q Median 3Q Max
-14.617 -6.196 -2.196 4.620 28.383
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1139.4351 108.1926 10.53 <2e-16 ***
Year -0.5614 0.0539 -10.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.831 on 900 degrees of freedom
Multiple R-squared: 0.1076, Adjusted R-squared: 0.1066
F-statistic: 108.5 on 1 and 900 DF, p-value: < 2.2e-16
#p-value: < 2.2e-16- the p-value is lower than alpha 0.05 therefore time(years) is statistically significant and that it is a good predictor in the average percentage for tobacco use in youths#R-squared: 0.1066- the r-squared suggests that 11% of the variance can be explained by time(years).summary(H_cig_use_1)
Call:
lm(formula = Average_Data_Value ~ Year, data = H_cig_use)
Residuals:
Min 1Q Median 3Q Max
-27.5227 -2.2050 0.4979 2.9082 10.3714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2587.58020 124.83199 20.73 <2e-16 ***
Year -1.27648 0.06219 -20.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.908 on 212 degrees of freedom
Multiple R-squared: 0.6653, Adjusted R-squared: 0.6637
F-statistic: 421.3 on 1 and 212 DF, p-value: < 2.2e-16
#p-value: < 2.2e-16- the p-value is lower than alpha 0.05 therefore time(years) is statistically significant and that it is a good predictor in the average percentage for tobacco use in youths#R-squared: 0.664- the r-squared suggests that 66% of the variance can be explained by time(years).plot(H_cig_use_1)
summary(H_smkeless_use_1)
Call:
lm(formula = Average_Data_Value ~ Year, data = H_smkeless_use)
Residuals:
Min 1Q Median 3Q Max
-8.0312 -3.3224 -0.0693 2.3558 24.3128
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 355.00173 101.13584 3.510 0.000548 ***
Year -0.17199 0.05038 -3.414 0.000769 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.975 on 210 degrees of freedom
Multiple R-squared: 0.05257, Adjusted R-squared: 0.04806
F-statistic: 11.65 on 1 and 210 DF, p-value: 0.0007693
# p-value: 0.0007693- the p-value is lower than alpha 0.05 therefore time(years) is statistically significant and that it is a good predictor in the average percentage for tobacco use in youths#R-squared: 0.04806- the r-squared suggests that 4.78% of the variance can be explained by time(years).plot(H_smkeless_use_1)
summary(M_cig_use_1)
Call:
lm(formula = Average_Data_Value ~ Year, data = M_cig_use)
Residuals:
Min 1Q Median 3Q Max
-9.6566 -2.3920 -0.1275 2.3434 10.5789
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1545.68117 85.95906 17.98 <2e-16 ***
Year -0.76451 0.04283 -17.85 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.627 on 237 degrees of freedom
Multiple R-squared: 0.5734, Adjusted R-squared: 0.5716
F-statistic: 318.6 on 1 and 237 DF, p-value: < 2.2e-16
#p-value: < 2.2e-16- the p-value is lower than alpha 0.05 therefore time(years) is statistically significant and that it is a good predictor in the average percentage for tobacco use in youths#R-squared: 0.5712- the r-squared suggests that 57% of the variance can be explained by time(years).plot(M_cig_use_1)
summary(M_smkeless_use_1)
Call:
lm(formula = Average_Data_Value ~ Year, data = M_smkeless_use)
Residuals:
Min 1Q Median 3Q Max
-4.414 -2.002 -0.598 1.393 29.990
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 409.97419 73.00094 5.616 5.50e-08 ***
Year -0.20178 0.03637 -5.547 7.79e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.08 on 235 degrees of freedom
Multiple R-squared: 0.1158, Adjusted R-squared: 0.112
F-statistic: 30.77 on 1 and 235 DF, p-value: 7.794e-08
#p-value: < 7.794e-08 the p-value is lower than alpha 0.05 therefore time(years) is statistically significant and that it is a good predictor in the average percentage for tobacco use in youths#R-squared: 0.112- the r-squared suggests that 11% of the variance can be explained by time(years).plot(M_smkeless_use_1)
Analysis: Overall, the summaries depicting the residuals vs fitted, normal Q-Q, scale-location, and Cook’s distance indicate that the model represents the data well.
VISUALIZATION 1: Scatterplot
plot2<-ggplot(ave_youth_tobacco, aes(x = Year, y = Average_Data_Value,color = Education,size = Average_Sample_Size, text =paste("State:", State)))+geom_point(alpha =0.5) +#Because the geompoint size is going to be based off of sample size, alpha is needed as there will be overlappingfacet_wrap(~ TopicDesc) +#Facet by TopicDesctheme_minimal() +labs(title ="Cigarette Use VS Smokeless Tobacco use Among Youths",x="\nYear",y="\nPercentage(Decimal)",caption ="Source: Centers for Disease Control and Prevention (CDC)")+theme_grey(base_size =10)plot2<- plot2+scale_color_manual(values=c("#E69F00", "#56B4E9")) #The colors will help distinguish between high school and middle school studentsplot2
ggplotly(plot2) #for some reason plotly is not allowing the caption for the source to appear, but it appears just fine without plotly
Analysis: This is a great visualization because I am able to see the overall trend of both high school and middle school students based on the two TopicDesc, Cigarette Use and Smokeless Tobacco Use. The visualization depicts a scatterplot that plots the average percentage of each state in a year. The trend is a negative relationship in both areas meaning youths are progressively not using tobacco products. The most notable observation is the steep decrease in cigarette use through the years for both high school and middle school students. It also appears that high school students tend to have a higher chance of using cigarettes compared to middle school students.The scatterplot for smokeless tobacco, on the other hand, indicates a steady use among high school and middle school students despite it slowly decreasing. Smokeless tobacco also appears to be a less popular item compared to cigarettes. High school students still have a higher chance in using smokeless tobacco products compared to middle school students.
To answer my previous question, what year and state experienced the highest use in tobacco- in terms of cigarette use, high school students in the state of West Virginia experienced the highest average percentage in the year 2000. In terms of smokeless tobacco, high school students in the state of Kentucky experienced the highest average percentage in the year 2010. We will exclude the data values pertaining to the state of Florida for smokeless tobacco as it is considered an outlier.
VISUAL 2: Treemap
# After reviewing the scatterplot, I was curious about the state with the highest average occurrence of youths using tobacco related products. The scatter plot unfortunately does not depict the cumulative visual of the state with the highest average occurrence of youths using tobacco related productstreemap(ave_youth_tobacco, index="State", vSize="Average_Sample_Size", vColor="Average_Data_Value", #The states with a greater amount if youths using tobacco products is green. The states with lesser amount of youths using tobacco products is yellow- red.type="manual",border.col =c("white"), #I don't prefer the black borders, so I changed it to whiteborder.lwds =c(7), fontsize.title =12,title ="States With the Highest Occurance of Youth Using Tobacco Products", title.legend ="Frequency", palette="RdYlGn")
Analysis: I am unable to add a caption to the tree map in R so I will state it here. Source: https://catalog.data.gov/dataset/youth-tobacco-survey-yts-data The treemap depict states from the lowest to highest number of youths using tobacco related products. States with the highest number of youths using tobacco related products is illustrated in a range green. States with a lower number of youths using tobacco related products are conveyed through a range of red to yellow. The size of the blocks represent the sample size- the larger the sample size, the bigger the block. The states, Mississippi, Alabama, West Virginia, Oklahoma, Kentucky appear to be the top 5 states with the highest rate of youths using tobacco related products. I cross check this observation by separating the dataset based on Education and TopicDesc. Then I explore which of the states contain the highest number of youths based on the specific type of TopicDesc.
Cross- check
Top_5_ave <- ave_youth_tobacco |>#This is the overall top 5 states that is visualized throught the treemapgroup_by(State)|>summarize(sum =sum(Average_Data_Value)) |>slice_max(order_by = sum, n=5)Top_5_ave
# A tibble: 5 × 2
State sum
<chr> <dbl>
1 Mississippi 819
2 Alabama 581
3 West Virginia 537
4 Oklahoma 467
5 Kentucky 463
#Mississippi, Alabama, West Virginia, Oklahoma, KentuckyH_smkeless_5 <- H_smkeless_use |>#This dataset is a collection of high school students using smokeless tobacco productsgroup_by(State)|>summarize(sum =sum(Average_Data_Value)) |>slice_max(order_by = sum, n=5) #operates on a grouped table, and returns the largest observations in each group.H_smkeless_5
# A tibble: 5 × 2
State sum
<chr> <dbl>
1 Mississippi 151
2 West Virginia 125
3 Alabama 117
4 Kentucky 103
5 North Dakota 101
#Mississippi, West Virginia, Alabama, Kentucky, North DakotaH_cig_5 <- H_cig_use |>#This dataset is a collection of high school students using cigarettesgroup_by(State)|>summarize(sum =sum(Average_Data_Value)) |>slice_max(order_by = sum, n=5) H_cig_5
# A tibble: 5 × 2
State sum
<chr> <dbl>
1 Mississippi 362
2 Alabama 254
3 West Virginia 233
4 North Carolina 221
5 Kentucky 217
#Mississippi, Alabama, West Virginia, North Carolina, KentuckyM_smkeless_5 <- M_smkeless_use |>#This dataset is a collection of middle school students using smokeless tobacco productsgroup_by(State)|>summarize(sum =sum(Average_Data_Value)) |>slice_max(order_by = sum, n=5)M_smkeless_5
# A tibble: 5 × 2
State sum
<chr> <dbl>
1 Mississippi 104
2 Alabama 79
3 West Virginia 65
4 Oklahoma 64
5 Kentucky 51
#Mississippi, Alabama, West Virginia, Oklahoma, KentuckyM_cig_5 <- M_cig_use |>#This dataset is a collection of middle school students using cigarettesgroup_by(State)|>summarize(sum =sum(Average_Data_Value)) |>slice_max(order_by = sum, n=5)M_cig_5
# A tibble: 5 × 2
State sum
<chr> <dbl>
1 Mississippi 202
2 Alabama 131
3 Oklahoma 131
4 West Virginia 114
5 Louisiana 96
#Mississippi,Alabama, Oklahoma, West Virginia, Louisiana
Analysis: After exploring each sub dataset, I can confirm each sub dataset correspond to the main dataset used for the treemap, therefore the treemap is reliable in terms of which state contained the highest occurrence of youths using tobacco products.
Conclusion
Although my dataset is based on data collected from 1999-2017, according to newly released data from the 2024 National Youth Tobacco Survey (NYTS), “current tobacco product use among U.S. middle and high school students has dropped to the lowest recorded level in 25 years.”(www.cdc.gov) Despite the misconception of e-cigarettes and even hookah, becoming a popular practice, evidence has shown that there is a decrease in all methods of tobacco use. In other words, it is confirmed that there is a decreasing trend in terms of tobacco use in youths. The article states that the decrease might be due to multiple factors such as price increases on tobacco related products, mass media campaigns that educate youths of the harmful effects of tobacco use, and stronger implementation of smoke-free policies in and outside of establishments. I thoroughly enjoyed this final project because I was able to utilize a lot of the knowledge I learned throughout the semester. What I wished I could have included is the use of Tabluea and GIS. I did initially use GIS, but decided against it and used a treemap because I felt like it visualized the message better. As for Tabluea, I did attempt to use it, but as a result of other obligations, I had to scratch that method. I definitely look forward to other data visualizations!
Link to background research: https://www.cdc.gov/media/releases/2024/p1017-youth-tobacco-use.html#:~:text=Within%20the%20past%20year%20alone,to%202.80%20million%20in%202023. https://www.cdc.gov/tobacco/php/data-statistics/youth-data-tobacco/?CDC_AAref_Val=https://www.cdc.gov/tobacco/data_statistics/fact_sheets/youth_data/tobacco_use/index.htm
Citation: Centers for Disease Control and Prevention. (n.d.). Youth Tobacco product use at a 25-year low, yet disparities persist. Centers for Disease Control and Prevention. https://www.cdc.gov/media/releases/2024/p1017-youth-tobacco-use.html#:~:text=Within%20the%20past%20year%20alone,to%202.80%20million%20in%202023. Centers for Disease Control and Prevention. (n.d.-a). Youth and tobacco use. Centers for Disease Control and Prevention. https://www.cdc.gov/tobacco/php/data-statistics/youth-data-tobacco/?CDC_AAref_Val=https%3A%2F%2Fwww.cdc.gov%2Ftobacco%2Fdata_statistics%2Ffact_sheets%2Fyouth_data%2Ftobacco_use%2Findex.htm