I chose the topic of teenager smoking. So I found a datasheet on data.gov sourced by the CDC. And made the main question as to how has teens smoking increasing trought the year.
#Loading libraries tidyverse and highcharterlibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Warning: package 'highcharter' was built under R version 4.4.2
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
setwd("C:/Users/Usuário/Documents/Data101")
#Getting the dataset and filtering for 6 variablesdf <-read.csv("Final.csv")summary(df)
YEAR LocationAbbr LocationDesc TopicType
Min. :1999 Length:10600 Length:10600 Length:10600
1st Qu.:2002 Class :character Class :character Class :character
Median :2006 Mode :character Mode :character Mode :character
Mean :2007
3rd Qu.:2011
Max. :2017
TopicDesc MeasureDesc DataSource Response
Length:10600 Length:10600 Length:10600 Length:10600
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Data_Value_Unit Data_Value_Type Data_Value
Length:10600 Length:10600 Min. : 0.00
Class :character Class :character 1st Qu.: 3.00
Mode :character Mode :character Median :10.55
Mean :20.34
3rd Qu.:36.42
Max. :98.00
NA's :520
Data_Value_Footnote_Symbol Data_Value_Footnote Data_Value_Std_Err
Length:10600 Length:10600 Min. : 0.000
Class :character Class :character 1st Qu.: 0.600
Mode :character Mode :character Median : 1.300
Mean : 1.851
3rd Qu.: 2.400
Max. :16.100
NA's :520
Low_Confidence_Limit High_Confidence_Limit Sample_Size Gender
Min. : 0.00 Min. : 0.00 Min. : 50 Length:10600
1st Qu.: 1.70 1st Qu.: 4.30 1st Qu.: 671 Class :character
Median : 7.90 Median :13.20 Median : 1001 Mode :character
Mean :16.72 Mean :23.96 Mean : 1483
3rd Qu.:29.60 3rd Qu.:42.50 3rd Qu.: 1642
Max. :97.60 Max. :98.40 Max. :36910
NA's :517 NA's :520 NA's :520
Race Age Education GeoLocation
Length:10600 Length:10600 Length:10600 Length:10600
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
TopicTypeId TopicId MeasureId StratificationID1
Length:10600 Length:10600 Length:10600 Length:10600
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
StratificationID2 StratificationID3 StratificationID4 SubMeasureID
Length:10600 Length:10600 Length:10600 Length:10600
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
DisplayOrder
Min. : 1.000
1st Qu.: 7.000
Median : 9.000
Mean : 8.107
3rd Qu.:11.000
Max. :12.000
#Creating a linear regression model for the percentage of teens smoking per year model <-lm(Data_Value ~ YEAR, data = df1)summary(model)
Call:
lm(formula = Data_Value ~ YEAR, data = df1)
Residuals:
Min 1Q Median 3Q Max
-26.377 -15.601 -9.033 16.624 73.925
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1560.46645 76.57256 20.38 <2e-16 ***
YEAR -0.76743 0.03816 -20.11 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.76 on 10078 degrees of freedom
(520 observations deleted due to missingness)
Multiple R-squared: 0.03859, Adjusted R-squared: 0.0385
F-statistic: 404.5 on 1 and 10078 DF, p-value: < 2.2e-16
#ploting the graph for the linaer modelplot_data <- df1 %>%select(Data_Value, YEAR) %>%na.omit() highchart() %>%hc_chart(type ="scatter") %>%hc_title(text ="Scatter Plot with Linear Regression Line") %>%hc_xAxis(title =list(text ="Year")) %>%hc_yAxis(title =list(text ="Data Value")) %>%hc_add_series(data = plot_data, type ="scatter", hcaes(x = YEAR, y = Data_Value), marker =list(radius =4)) %>%hc_add_series(data = plot_data, type ="line", hcaes(x = YEAR, y =predict(model, newdata = plot_data)), color ="red", lineWidth =2)
#creating another model but for the total ammount of students per school that smoked per yeardf1 <- df1 %>%mutate(New_Column = (Data_Value * Sample_Size) /100)model_new_column <-lm(New_Column ~ YEAR, data = df1)summary(model_new_column)
Call:
lm(formula = New_Column ~ YEAR, data = df1)
Residuals:
Min 1Q Median 3Q Max
-346.4 -182.0 -84.3 24.6 18248.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34749.8005 1947.5453 17.84 <2e-16 ***
YEAR -17.2103 0.9704 -17.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 527.9 on 10078 degrees of freedom
(520 observations deleted due to missingness)
Multiple R-squared: 0.03026, Adjusted R-squared: 0.03017
F-statistic: 314.5 on 1 and 10078 DF, p-value: < 2.2e-16
#ploting the second methodplot_data <- df1 %>%select(New_Column, YEAR) %>%na.omit() highchart() %>%hc_chart(type ="scatter") %>%hc_title(text ="Scatter Plot with Linear Regression Line for New_Column") %>%hc_xAxis(title =list(text ="Year")) %>%hc_yAxis(title =list(text ="Ammount that smoke")) %>%hc_add_series(data = plot_data, type ="scatter", hcaes(x = YEAR, y = New_Column), marker =list(radius =4)) %>%hc_add_series(data = plot_data, type ="line", hcaes(x = YEAR, y =predict(model_new_column, newdata = plot_data)), color ="red", lineWidth =2)
#creating a linear regression of the percentage of teens smokig in all schools per year.df_yearly <- df1 %>%group_by(YEAR) %>%summarise(Total_Value_Sample =sum(New_Column, na.rm =TRUE),Total_Sample_Size =sum(Sample_Size, na.rm =TRUE) )df_yearly <- df_yearly %>%mutate(Percentage_Smokers = Total_Value_Sample / Total_Sample_Size)model <-lm(Percentage_Smokers ~ YEAR, data = df_yearly)summary(model)
Call:
lm(formula = Percentage_Smokers ~ YEAR, data = df_yearly)
Residuals:
Min 1Q Median 3Q Max
-0.031095 -0.004273 -0.001216 0.008449 0.018201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.9413876 0.9982063 11.96 1.06e-09 ***
YEAR -0.0058832 0.0004971 -11.84 1.24e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01187 on 17 degrees of freedom
Multiple R-squared: 0.8918, Adjusted R-squared: 0.8854
F-statistic: 140.1 on 1 and 17 DF, p-value: 1.244e-09
#Ploting the last regression modeldf_yearly$Predicted_Percentage_Smokers <-predict(model, newdata = df_yearly)highchart() %>%hc_chart(type ="scatter") %>%hc_title(text ="Percentage of Smokers Over Years with Regression Line") %>%hc_subtitle(text ="Scatter plot showing observed and predicted percentages of smokers by year") %>%hc_xAxis(title =list(text ="Year")) %>%hc_yAxis(title =list(text ="Percentage of Smokers")) %>%hc_add_series(data = df_yearly, type ="scatter", hcaes(x = YEAR, y = Percentage_Smokers), marker =list(radius =8, symbol ="circle"), name ="Observed Data") %>%hc_add_series(data = df_yearly, type ="line", hcaes(x = YEAR, y = Predicted_Percentage_Smokers),color ="red", lineWidth =1, name ="Regression Line") %>%hc_tooltip(shared =TRUE, useHTML =TRUE, pointFormat ="Year: {point.x}<br>Observed: {point.y}%<br>Predicted: {point.Predicted_Percentage_Smokers}%")
After analysing both the model and the graph we can see a strong correlation between percentage of smokers and year. This also shows a complete opposite initial supposition made that the percentage of smokers would increase as the years also increase.
#creating a boxplot of Education level and percentage of smokersggplot(df, aes(x = Education, y = Data_Value)) +geom_boxplot(fill ="lightblue", color ="black") +theme_minimal() +labs(title ="Distribution of Percentage of Smokers by Topic",x ="Topic Description",y ="Percentage of Smokers" ) +theme(axis.text.x =element_text(angle =25, hjust =1, size =7),plot.title =element_text(hjust =0.5), axis.title =element_text(size =7) )
Warning: Removed 520 rows containing non-finite outside the scale range
(`stat_boxplot()`).
#Making a t test for the education level and percentage of smokerst_test_Education <-t.test(Data_Value ~ Education, data = df)print(t_test_Education)
Welch Two Sample t-test
data: Data_Value by Education
t = 24.653, df = 9827.9, p-value < 2.2e-16
alternative hypothesis: true difference in means between group High School and group Middle School is not equal to 0
95 percent confidence interval:
9.318871 10.928815
sample estimates:
mean in group High School mean in group Middle School
25.51523 15.39139
Analyzing we can see that they have a difference between each other with a verry small p value making it significant.
#Also creating a t test for gender and percentage of smokers but filtering for only male and female schoolsdf_filtered <- df[df$Gender %in%c("Male", "Female"), ]t_test_gender <-t.test(Data_Value ~ Gender, data = df_filtered)print(t_test_gender)
Welch Two Sample t-test
data: Data_Value by Gender
t = -5.7827, df = 6534.5, p-value = 7.691e-09
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
-3.983523 -1.966477
sample estimates:
mean in group Female mean in group Male
18.22973 21.20473
Also we can see a very strong statistical significance with a very small p value. Showing that its very likely that males smoke more than females in school.