Final Project Data 101

Author

Arthur De Almeida

Final project for Data 101

I chose the topic of teenager smoking. So I found a datasheet on data.gov sourced by the CDC. And made the main question as to how has teens smoking increasing trought the year.

#Loading libraries tidyverse and highcharter
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(highcharter)

Warning: package 'highcharter' was built under R version 4.4.2

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use

setwd("C:/Users/Usuário/Documents/Data101")

#Getting the dataset and filtering for 6 variables
df <- read.csv("Final.csv")
summary(df)

      YEAR      LocationAbbr       LocationDesc        TopicType        
 Min.   :1999   Length:10600       Length:10600       Length:10600      
 1st Qu.:2002   Class :character   Class :character   Class :character  
 Median :2006   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2007                                                           
 3rd Qu.:2011                                                           
 Max.   :2017                                                           
                                                                        
  TopicDesc         MeasureDesc         DataSource          Response        
 Length:10600       Length:10600       Length:10600       Length:10600      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 Data_Value_Unit    Data_Value_Type      Data_Value   
 Length:10600       Length:10600       Min.   : 0.00  
 Class :character   Class :character   1st Qu.: 3.00  
 Mode  :character   Mode  :character   Median :10.55  
                                       Mean   :20.34  
                                       3rd Qu.:36.42  
                                       Max.   :98.00  
                                       NA's   :520    
 Data_Value_Footnote_Symbol Data_Value_Footnote Data_Value_Std_Err
 Length:10600               Length:10600        Min.   : 0.000    
 Class :character           Class :character    1st Qu.: 0.600    
 Mode  :character           Mode  :character    Median : 1.300    
                                                Mean   : 1.851    
                                                3rd Qu.: 2.400    
                                                Max.   :16.100    
                                                NA's   :520       
 Low_Confidence_Limit High_Confidence_Limit  Sample_Size       Gender         
 Min.   : 0.00        Min.   : 0.00         Min.   :   50   Length:10600      
 1st Qu.: 1.70        1st Qu.: 4.30         1st Qu.:  671   Class :character  
 Median : 7.90        Median :13.20         Median : 1001   Mode  :character  
 Mean   :16.72        Mean   :23.96         Mean   : 1483                     
 3rd Qu.:29.60        3rd Qu.:42.50         3rd Qu.: 1642                     
 Max.   :97.60        Max.   :98.40         Max.   :36910                     
 NA's   :517          NA's   :520           NA's   :520                       
     Race               Age             Education         GeoLocation       
 Length:10600       Length:10600       Length:10600       Length:10600      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 TopicTypeId          TopicId           MeasureId         StratificationID1 
 Length:10600       Length:10600       Length:10600       Length:10600      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 StratificationID2  StratificationID3  StratificationID4  SubMeasureID      
 Length:10600       Length:10600       Length:10600       Length:10600      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  DisplayOrder   
 Min.   : 1.000  
 1st Qu.: 7.000  
 Median : 9.000  
 Mean   : 8.107  
 3rd Qu.:11.000  
 Max.   :12.000

df1 <- df %>%
  select("Data_Value", "YEAR", "LocationDesc", "TopicDesc", "Sample_Size", "Gender")

#Creating a linear regression model for the percentage of teens smoking per year 
model <- lm(Data_Value ~ YEAR, data = df1)

summary(model)


Call:
lm(formula = Data_Value ~ YEAR, data = df1)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.377 -15.601  -9.033  16.624  73.925 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1560.46645   76.57256   20.38   <2e-16 ***
YEAR          -0.76743    0.03816  -20.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.76 on 10078 degrees of freedom
  (520 observations deleted due to missingness)
Multiple R-squared:  0.03859,   Adjusted R-squared:  0.0385 
F-statistic: 404.5 on 1 and 10078 DF,  p-value: < 2.2e-16

#ploting the graph for the linaer model
plot_data <- df1 %>%
  select(Data_Value, YEAR) %>%
  na.omit() 
highchart() %>%
  hc_chart(type = "scatter") %>%
  hc_title(text = "Scatter Plot with Linear Regression Line") %>%
  hc_xAxis(title = list(text = "Year")) %>%
  hc_yAxis(title = list(text = "Data Value")) %>%
  hc_add_series(data = plot_data, type = "scatter", 
                hcaes(x = YEAR, y = Data_Value), 
                marker = list(radius = 4)) %>%
  hc_add_series(data = plot_data, type = "line", 
                hcaes(x = YEAR, y = predict(model, newdata = plot_data)), 
                color = "red", 
                lineWidth = 2)

#creating another model but for the total ammount of students per school that smoked per year
df1 <- df1 %>%
  mutate(New_Column = (Data_Value * Sample_Size) / 100)

model_new_column <- lm(New_Column ~ YEAR, data = df1)

summary(model_new_column)


Call:
lm(formula = New_Column ~ YEAR, data = df1)

Residuals:
    Min      1Q  Median      3Q     Max 
 -346.4  -182.0   -84.3    24.6 18248.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34749.8005  1947.5453   17.84   <2e-16 ***
YEAR          -17.2103     0.9704  -17.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 527.9 on 10078 degrees of freedom
  (520 observations deleted due to missingness)
Multiple R-squared:  0.03026,   Adjusted R-squared:  0.03017 
F-statistic: 314.5 on 1 and 10078 DF,  p-value: < 2.2e-16

#ploting the second method
plot_data <- df1 %>%
  select(New_Column, YEAR) %>%
  na.omit() 
highchart() %>%
  hc_chart(type = "scatter") %>%
  hc_title(text = "Scatter Plot with Linear Regression Line for New_Column") %>%
  hc_xAxis(title = list(text = "Year")) %>%
  hc_yAxis(title = list(text = "Ammount that smoke")) %>%
  hc_add_series(data = plot_data, type = "scatter", 
                hcaes(x = YEAR, y = New_Column), 
                marker = list(radius = 4)) %>%
  hc_add_series(data = plot_data, type = "line", 
                hcaes(x = YEAR, y = predict(model_new_column, newdata = plot_data)), 
                color = "red", 
                lineWidth = 2)

#creating a linear regression of the percentage of teens smokig in all schools per year.
df_yearly <- df1 %>%
  group_by(YEAR) %>%
  summarise(
    Total_Value_Sample = sum(New_Column, na.rm = TRUE),
    Total_Sample_Size = sum(Sample_Size, na.rm = TRUE)
  )
df_yearly <- df_yearly %>%
  mutate(Percentage_Smokers = Total_Value_Sample / Total_Sample_Size)

model <- lm(Percentage_Smokers ~ YEAR, data = df_yearly)

summary(model)


Call:
lm(formula = Percentage_Smokers ~ YEAR, data = df_yearly)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.031095 -0.004273 -0.001216  0.008449  0.018201 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.9413876  0.9982063   11.96 1.06e-09 ***
YEAR        -0.0058832  0.0004971  -11.84 1.24e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01187 on 17 degrees of freedom
Multiple R-squared:  0.8918,    Adjusted R-squared:  0.8854 
F-statistic: 140.1 on 1 and 17 DF,  p-value: 1.244e-09

#Ploting the last regression model
df_yearly$Predicted_Percentage_Smokers <- predict(model, newdata = df_yearly)
highchart() %>%
  hc_chart(type = "scatter") %>%
  hc_title(text = "Percentage of Smokers Over Years with Regression Line") %>%
  hc_subtitle(text = "Scatter plot showing observed and predicted percentages of smokers by year") %>%
  hc_xAxis(title = list(text = "Year")) %>%
  hc_yAxis(title = list(text = "Percentage of Smokers")) %>%

  hc_add_series(data = df_yearly, type = "scatter", hcaes(x = YEAR, y = Percentage_Smokers), 
                marker = list(radius = 8, symbol = "circle"), name = "Observed Data") %>%

  hc_add_series(data = df_yearly, type = "line", hcaes(x = YEAR, y = Predicted_Percentage_Smokers),
                color = "red", lineWidth = 1, name = "Regression Line") %>%

  hc_tooltip(shared = TRUE, useHTML = TRUE, 
             pointFormat = "Year: {point.x}<br>Observed: {point.y}%<br>Predicted: {point.Predicted_Percentage_Smokers}%")

After analysing both the model and the graph we can see a strong correlation between percentage of smokers and year. This also shows a complete opposite initial supposition made that the percentage of smokers would increase as the years also increase.

#creating a boxplot of Education level and percentage of smokers
ggplot(df, aes(x = Education, y = Data_Value)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(
    title = "Distribution of Percentage of Smokers by Topic",
    x = "Topic Description",
    y = "Percentage of Smokers"
  ) +
  theme(
    axis.text.x = element_text(angle = 25, hjust = 1, size = 7),
    plot.title = element_text(hjust = 0.5), 
    axis.title = element_text(size = 7) 
  )

Warning: Removed 520 rows containing non-finite outside the scale range
(`stat_boxplot()`).

#Making a t test for the education level and percentage of smokers
t_test_Education <- t.test(Data_Value ~ Education, data = df)
print(t_test_Education)


    Welch Two Sample t-test

data:  Data_Value by Education
t = 24.653, df = 9827.9, p-value < 2.2e-16
alternative hypothesis: true difference in means between group High School and group Middle School is not equal to 0
95 percent confidence interval:
  9.318871 10.928815
sample estimates:
  mean in group High School mean in group Middle School 
                   25.51523                    15.39139

Analyzing we can see that they have a difference between each other with a verry small p value making it significant.

#Also creating a t test for gender and percentage of smokers but filtering for only male and female schools
df_filtered <- df[df$Gender %in% c("Male", "Female"), ]

t_test_gender <- t.test(Data_Value ~ Gender, data = df_filtered)
print(t_test_gender)


    Welch Two Sample t-test

data:  Data_Value by Gender
t = -5.7827, df = 6534.5, p-value = 7.691e-09
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -3.983523 -1.966477
sample estimates:
mean in group Female   mean in group Male 
            18.22973             21.20473

Also we can see a very strong statistical significance with a very small p value. Showing that its very likely that males smoke more than females in school.