Term Paper
Fall 2021, DSPA (HS650)
Name: Kristian Shin
SID: #### - 2932 (last 4 digits only)
UMich E-mail: shinkris@umich.edu
I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.
Remember that students are allowed and encouraged to discuss, on a conceptual level, the problems with your class mates, however, this can not involve the exchange of actual code, printouts, solutions, e-mails or other explicit electronic or paper handouts.

Younger Generations’ Perception and Self-Evaluation of Well-Being

With the rise of social media usage in the past two decades, younger generations are gaining access to more resources and information faster than ever before. In 2017, research indicated that approximately 69% of adults and 81% of teenagers in the United States utilize social media (Smith 2017). This phenomenon has created a paradox of sorts in the development of both the Millennial generation as well as Generation Z. On one hand, the speed and openness with which experienced social media users can share information with another has created an environment where people can freely share their thoughts and feelings online. Consequently, the speed and accessibility of online information can exacerbate feelings of missing out compared to peers, increase awareness about worldwide injustice, and ultimately cause young adults to feel burdened by knowledge. 
One particular issue worth highlighting is the double-edged sword of younger generations being more acutely aware of mental health issues. Studies indicate that both Generation Z and Millennials are concerned issues such as childhood trauma and problems they have with their peers. Mass shootings, massive amounts of debt, and other modern-era issues were cited as some of the major concerns facing younger generations today (Benthune 2019). As a result of this increased awareness, younger Americans are more likely to overcome the stigma of seeking therapy and be seen at an early age (Morin 2021) than older generations. The particular conundrum is that this focus and awareness of what constitutes good versus poor mental health may lead younger generations to rate their well-being more poorly than older generations that may have a higher sense of “dealing” with or not even perceiving problems as more than mere inconvenience. In other words, younger generations, despite being more attuned to their overall well-being, may be more likely to report their mental health and overall well-being as worse than older generations. 
To test our hypothesis that younger generations report worse mental health and well-being than older generations, we will conduct analysis on a dataset based on well-being and associated factors. This dataset is comprised of a selection of American adults who were asked to self-evaluate their well-being and commenting on associated life experiences: examples of this include physical health, income levels, and sleep. We will use a variety of data analysis techniques to evaluate younger vs older generations of patients. In the process, we will examine behavioral factors that may affect younger generations' perceptions of their own wellbeing.

1 Hypothesis: Younger generations are more likely to self-evaluate on metrics of well-being and mental health more negatively than older generations due to changes in perception of mental health.

#loading in the appropriate libraries
library(dplyr)
library(ggplot2)
library(reshape2)
library(plotly)
library(rvest)
library(tidyr)
library(rmarkdown)
library(DT)
library(Amelia)
library(e1071)
library(gmodels)

As our source of well-being information for this analysis, we will be utilizing the dataset created in Case-Study 27: Evidence of well-being, health behavior, and environmental factors. This dataset is an aggregate from multiple data sources containing information on factors associated with well-being as well as individuals’ perception of their mental health.

#reading in the well-being data
wb_data <- read.csv("27_Well_being_Health_Behavior_Environmental_Factors_Dataset.csv")

Generating summary statistic of the data

#summarizing the data
summary(wb_data)

##        X               age          wellbeing     SocialSupport  
##  Min.   :     1   Min.   : 7.00   Min.   :0.000   Min.   :1.000  
##  1st Qu.: 98925   1st Qu.:45.00   1st Qu.:0.000   1st Qu.:4.000  
##  Median :197849   Median :57.00   Median :0.000   Median :5.000  
##  Mean   :197849   Mean   :56.32   Mean   :0.055   Mean   :4.181  
##  3rd Qu.:296773   3rd Qu.:69.00   3rd Qu.:0.000   3rd Qu.:5.000  
##  Max.   :395697   Max.   :99.00   Max.   :1.000   Max.   :5.000  
##                                   NA's   :17025   NA's   :20998  
##       race           income      GeneralHealth   PoorPhysicalHealthDays
##  Min.   :0.000   Min.   :1.00    Min.   :1.000   Min.   : 0.000        
##  1st Qu.:1.000   1st Qu.:2.00    1st Qu.:2.000   1st Qu.: 0.000        
##  Median :1.000   Median :4.00    Median :3.000   Median : 0.000        
##  Mean   :0.806   Mean   :3.61    Mean   :2.586   Mean   : 4.472        
##  3rd Qu.:1.000   3rd Qu.:5.00    3rd Qu.:3.000   3rd Qu.: 3.000        
##  Max.   :1.000   Max.   :5.00    Max.   :5.000   Max.   :30.000        
##  NA's   :5309    NA's   :54657   NA's   :1555    NA's   :9341          
##  PoorMentalHealthDays     Asthma       HealthInsurance       CVD       
##  Min.   : 0.000       Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.: 0.000       1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.000  
##  Median : 0.000       Median :0.0000   Median :1.0000   Median :0.000  
##  Mean   : 3.497       Mean   :0.0936   Mean   :0.8955   Mean   :0.067  
##  3rd Qu.: 2.000       3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.000  
##  Max.   :30.000       Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
##  NA's   :7361         NA's   :2590     NA's   :991      NA's   :3928   
##  LimitedActivity     Diabetes          employ           BMI       
##  Min.   :0.0000   Min.   :0.0000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :0.0000   Median :0.0000   Median :3.000   Median :2.000  
##  Mean   :0.2723   Mean   :0.1274   Mean   :3.885   Mean   :1.933  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:7.000   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :8.000   Max.   :3.000  
##  NA's   :1858     NA's   :395      NA's   :1322    NA's   :17290  
##   HeavyDrinker   CurrentSmoker    PoorSleepDays    PhysicalActivity
##  Min.   :0.000   Min.   :0.0000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000   Median : 3.000   Median :1.0000  
##  Mean   :0.046   Mean   :0.1583   Mean   : 7.707   Mean   :0.7303  
##  3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:12.000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :1.0000   Max.   :30.000   Max.   :1.0000  
##  NA's   :10742   NA's   :2463     NA's   :7560     NA's   :533     
##     marital       PrematureMortality HouseholdIncome    ParkAccess    
##  Min.   :0.0000   Min.   :133.5      Min.   : 22289   Min.   :  1.00  
##  1st Qu.:0.0000   1st Qu.:294.0      1st Qu.: 41103   1st Qu.: 13.00  
##  Median :1.0000   Median :345.0      Median : 47930   Median : 32.00  
##  Mean   :0.5552   Mean   :355.8      Mean   : 50177   Mean   : 34.58  
##  3rd Qu.:1.0000   3rd Qu.:405.7      3rd Qu.: 56166   3rd Qu.: 52.00  
##  Max.   :1.0000   Max.   :883.5      Max.   :119525   Max.   :100.00  
##  NA's   :1559                                         NA's   :9633    
##    CrimeRate      UnemploymentRate  WaterSafety      HighSchoolRate 
##  Min.   :  12.0   Min.   : 1.100   Min.   :  0.000   Min.   : 27.0  
##  1st Qu.: 202.0   1st Qu.: 7.000   1st Qu.:  0.000   1st Qu.: 74.0  
##  Median : 343.0   Median : 8.400   Median :  1.000   Median : 80.0  
##  Mean   : 405.6   Mean   : 8.716   Mean   :  6.511   Mean   : 79.1  
##  3rd Qu.: 558.0   3rd Qu.:10.200   3rd Qu.:  5.000   3rd Qu.: 86.0  
##  Max.   :2062.0   Max.   :29.700   Max.   :100.000   Max.   :100.0  
##  NA's   :4174                      NA's   :5039      NA's   :16     
##  SomeCollegeRate AccessToRecFacility FastFoodPercentage     PM2.5      
##  Min.   :23.20   Min.   : 0.00       Min.   :  8.00     Min.   : 6.50  
##  1st Qu.:54.00   1st Qu.: 6.90       1st Qu.: 44.00     1st Qu.:10.86  
##  Median :61.20   Median : 9.80       Median : 50.00     Median :11.74  
##  Mean   :60.45   Mean   :10.13       Mean   : 48.33     Mean   :11.63  
##  3rd Qu.:67.70   3rd Qu.:13.00       3rd Qu.: 54.00     3rd Qu.:12.78  
##  Max.   :90.40   Max.   :57.50       Max.   :100.00     Max.   :14.78  
##                                      NA's   :42                        
##  PopulationDensity
##  Min.   :    1.7  
##  1st Qu.:   66.3  
##  Median :  276.9  
##  Mean   : 1186.5  
##  3rd Qu.:  991.3  
##  Max.   :69468.4  
##

There are some observed NAs in the dataset. We will go ahead and visualize this data using Amelia to see if there are any columsn that are not worth using due to overly missing data.

missmap(wb_data, main = "Missing values vs observed")

Most values are readily available. We will remove the index column (X) and some of the county-level variables we will not be examining during this analysis.

#subsetting the data
wb_data = subset(wb_data, select = -c(X, PM2.5, PopulationDensity))

Because missing data is approximately 1% of the dataset, we will ignore missing values instead of imputing with Amelia.

Well-being appears to be defined as a categorical factor. We will redefine it, along with several other factors that are categorical as factors.

wb_data$wellbeing <- as.factor(wb_data$wellbeing)
wb_data$race <- as.factor(wb_data$race)
wb_data$Asthma <- as.factor(wb_data$Asthma)
wb_data$HealthInsurance <- as.factor(wb_data$HealthInsurance)
wb_data$CVD <- as.factor(wb_data$CVD)
wb_data$LimitedActivity <- as.factor(wb_data$LimitedActivity)
wb_data$Diabetes <- as.factor(wb_data$Diabetes)
wb_data$HeavyDrinker <- as.factor(wb_data$HeavyDrinker)
wb_data$CurrentSmoker <- as.factor(wb_data$CurrentSmoker)
wb_data$PhysicalActivity <- as.factor(wb_data$PhysicalActivity)
wb_data$marital <- as.factor(wb_data$marital)

#reexamining the data with categorical variables accurately defined
summary(wb_data)

##       age        wellbeing     SocialSupport     race            income     
##  Min.   : 7.00   0   :357734   Min.   :1.000   0   : 75583   Min.   :1.00   
##  1st Qu.:45.00   1   : 20938   1st Qu.:4.000   1   :314805   1st Qu.:2.00   
##  Median :57.00   NA's: 17025   Median :5.000   NA's:  5309   Median :4.00   
##  Mean   :56.32                 Mean   :4.181                 Mean   :3.61   
##  3rd Qu.:69.00                 3rd Qu.:5.000                 3rd Qu.:5.00   
##  Max.   :99.00                 Max.   :5.000                 Max.   :5.00   
##                                NA's   :20998                 NA's   :54657  
##  GeneralHealth   PoorPhysicalHealthDays PoorMentalHealthDays  Asthma      
##  Min.   :1.000   Min.   : 0.000         Min.   : 0.000       0   :356310  
##  1st Qu.:2.000   1st Qu.: 0.000         1st Qu.: 0.000       1   : 36797  
##  Median :3.000   Median : 0.000         Median : 0.000       NA's:  2590  
##  Mean   :2.586   Mean   : 4.472         Mean   : 3.497                    
##  3rd Qu.:3.000   3rd Qu.: 3.000         3rd Qu.: 2.000                    
##  Max.   :5.000   Max.   :30.000         Max.   :30.000                    
##  NA's   :1555    NA's   :9341           NA's   :7361                      
##  HealthInsurance   CVD         LimitedActivity Diabetes          employ     
##  0   : 41258     0   :365585   0   :286595     0   :344933   Min.   :1.000  
##  1   :353448     1   : 26184   1   :107244     1   : 50369   1st Qu.:1.000  
##  NA's:   991     NA's:  3928   NA's:  1858     NA's:   395   Median :3.000  
##                                                              Mean   :3.885  
##                                                              3rd Qu.:7.000  
##                                                              Max.   :8.000  
##                                                              NA's   :1322   
##       BMI        HeavyDrinker  CurrentSmoker PoorSleepDays    PhysicalActivity
##  Min.   :1.000   0   :367224   0   :330981   Min.   : 0.000   0   :106578     
##  1st Qu.:1.000   1   : 17731   1   : 62253   1st Qu.: 0.000   1   :288586     
##  Median :2.000   NA's: 10742   NA's:  2463   Median : 3.000   NA's:   533     
##  Mean   :1.933                               Mean   : 7.707                   
##  3rd Qu.:3.000                               3rd Qu.:12.000                   
##  Max.   :3.000                               Max.   :30.000                   
##  NA's   :17290                               NA's   :7560                     
##  marital       PrematureMortality HouseholdIncome    ParkAccess    
##  0   :175318   Min.   :133.5      Min.   : 22289   Min.   :  1.00  
##  1   :218820   1st Qu.:294.0      1st Qu.: 41103   1st Qu.: 13.00  
##  NA's:  1559   Median :345.0      Median : 47930   Median : 32.00  
##                Mean   :355.8      Mean   : 50177   Mean   : 34.58  
##                3rd Qu.:405.7      3rd Qu.: 56166   3rd Qu.: 52.00  
##                Max.   :883.5      Max.   :119525   Max.   :100.00  
##                                                    NA's   :9633    
##    CrimeRate      UnemploymentRate  WaterSafety      HighSchoolRate 
##  Min.   :  12.0   Min.   : 1.100   Min.   :  0.000   Min.   : 27.0  
##  1st Qu.: 202.0   1st Qu.: 7.000   1st Qu.:  0.000   1st Qu.: 74.0  
##  Median : 343.0   Median : 8.400   Median :  1.000   Median : 80.0  
##  Mean   : 405.6   Mean   : 8.716   Mean   :  6.511   Mean   : 79.1  
##  3rd Qu.: 558.0   3rd Qu.:10.200   3rd Qu.:  5.000   3rd Qu.: 86.0  
##  Max.   :2062.0   Max.   :29.700   Max.   :100.000   Max.   :100.0  
##  NA's   :4174                      NA's   :5039      NA's   :16     
##  SomeCollegeRate AccessToRecFacility FastFoodPercentage
##  Min.   :23.20   Min.   : 0.00       Min.   :  8.00    
##  1st Qu.:54.00   1st Qu.: 6.90       1st Qu.: 44.00    
##  Median :61.20   Median : 9.80       Median : 50.00    
##  Mean   :60.45   Mean   :10.13       Mean   : 48.33    
##  3rd Qu.:67.70   3rd Qu.:13.00       3rd Qu.: 54.00    
##  Max.   :90.40   Max.   :57.50       Max.   :100.00    
##                                      NA's   :42

2 Defining Generations of Interest by Age Bracket

This data utilizes data from 2010, which would mean that the oldest possible age of a Generation Z individual is 13 years old using the following definitions (2021):

Gen Z born 1997-2012 –> 9-24 (2021), 0-13 (2010)

Millenials born 1981-1996 –> 25-40 (2021), 14-29 (2010)

Gen X born 1965-1980 –> 41-56 (2021) –> 30-45 (2010)

Boomers generally born 1965 –> 57+ (2021) –> 46+ (2010)

We will examine summary statistics of three defined groups: the “young” generation (gen Z + millenial), gen x, and baby boomers using the definitions outlined above for 2010.

2.1 Defining Gen Z - Millenial Generation (0-29)

young_gen <- wb_data %>%
  filter(age < 30, .keep_all = TRUE)

datatable(summary(young_gen))

2.2 Defining Gen X (30-45)

gen_x <- wb_data %>%
  filter(age < 46 & age > 29, .keep_all = TRUE)

datatable(summary(gen_x))

2.3 Defining the Baby Boomer Generation (46+)

oldest_gen <- wb_data %>%
  filter(age > 45, .keep_all = TRUE)

datatable(summary(oldest_gen))

3 Visualizing well-being proportion by age bracket

3.1 Whole Sample Histogram

plot_ly(x = wb_data$wellbeing, type = "histogram", name = "Full Population Well-Being Histogram")

## Warning: Ignoring 17025 observations

#proportion of wellbeing responses for young generation
prop.table(table(wb_data$wellbeing))

## 
##          0          1 
## 0.94470676 0.05529324

3.2 Gen Z - Millennial Well-Being Histogram

plot_ly(x = young_gen$wellbeing, type = "histogram", name = "Gen Z + Millenial Well-Being Histogram")

## Warning: Ignoring 1401 observations

#proportion of wellbeing responses for young generation
prop.table(table(young_gen$wellbeing))

## 
##          0          1 
## 0.94493132 0.05506868

3.3 Gen X Well-Being Histogram

plot_ly(x = gen_x$wellbeing, type = "histogram", name = "Gen Z Well-Being Histogram")

## Warning: Ignoring 3141 observations

#proportion of wellbeing responses for generation x
prop.table(table(gen_x$wellbeing))

## 
##          0          1 
## 0.94455555 0.05544445

3.4 Baby Boomer Well-Being Histogram

plot_ly(x = oldest_gen$wellbeing, type = "histogram", name = "Baby Boomer Well-Being Histogram")

## Warning: Ignoring 12483 observations

#proportion of wellbeing responses for oldest generation
prop.table(table(oldest_gen$wellbeing))

## 
##          0          1 
## 0.94472736 0.05527264

Based on the histogram visualizations and the outcomes of the proportion tables, there appears to be minimal differences in self-evaluated well-being ratings between the three generations. We will continue investigating other factors related to well-being to determine if there are differences between young and older generations.

Poor Mental Health Days are another self-rated metric where participants rate how many poor mental health days they had in a month. Using Linear Regression, we can examine any trends in self-reported mental health days against age.

4 Linear Regression Model for Age vs Poor Mental Health Days

#making dataframe that only contains complete cases for linear regression
complete_wb <- wb_data[complete.cases(wb_data), ]

fit_wb<-lm(PoorMentalHealthDays ~ age, data=complete_wb)

plot_ly(complete_wb, x = ~age, y = ~PoorMentalHealthDays, type = 'scatter', mode = "markers", name="Data") %>%
    add_trace(x=~mean(age), y=~mean(PoorMentalHealthDays), type="scatter", mode="markers",
            name="(mean(age), mean(PoorMentalHealthDays))", marker=list(size=20, color='blue', line=list(color='yellow', width=2))) %>%
    add_lines(x = ~age, y = fit_wb$fitted.values, mode = "lines", name="Linear Model") %>%
    layout(title=paste0("Linear Regression Model of Age Vs. Poor Mental Health Days", 
                        round(cor(complete_wb$age, complete_wb$PoorMentalHealthDays),3)))

This linear regression model indicates to us that self-rated Poor Mental Health Days decrease linearly with age: in other words, as age increases, the number of self-perceived poor mental health days in a month decreases. This supports our hypothesis that younger individuals are more likely to report more negative well-being metrics.

Physical activity and good physical health in general has been shown to be related to better mental health (Lawton et. al 2017). To that end, we will examine a linear regression model of age vs self-reported physical activity to examine any possible trends in exercise.

5 Linear Regression Model for Age vs Poor Physical Health Days

fit_wb_phys <-lm(PoorPhysicalHealthDays ~ age, data=complete_wb)

plot_ly(complete_wb, x = ~age, y = ~PoorPhysicalHealthDays, type = 'scatter', mode = "markers", name="Data") %>%
    add_trace(x=~mean(age), y=~mean(PoorPhysicalHealthDays), type="scatter", mode="markers",
            name="(mean(age), mean(PoorPhysicalHealthDays))", marker=list(size=20, color='blue', line=list(color='yellow', width=2))) %>%
    add_lines(x = ~age, y = fit_wb_phys$fitted.values, mode = "lines", name="Linear Model") %>%
    layout(title=paste0("Linear Regression Model of Age Vs. Poor Physical Health Days", 
                        round(cor(complete_wb$age, complete_wb$PoorPhysicalHealthDays),3)))

Self-reported poor physical days actually rise with age conversely with poor mental health days. This is consistent with the aging process with difficulty moving and being active but serves no purpose to our hypothesis.

Poor sleep could also be an explanatory behavioral factor for poor reports of mental health and can also be explained by poor mental health. The Sleep Foudnation reports that poor sleep is strongly correlated with poor mental health (Suni 2020). To that end, we will perform a linear regression between age and PoorSleepDays to determine if there is any notable association.

6 Linear Regression Model for Age vs Poor Sleep

fit_wb_sleep <-lm(PoorSleepDays ~ age, data=complete_wb)

plot_ly(complete_wb, x = ~age, y = ~PoorSleepDays, type = 'scatter', mode = "markers", name="Data") %>%
    add_trace(x=~mean(age), y=~mean(PoorSleepDays), type="scatter", mode="markers",
            name="(mean(age), mean(PoorSleepDays))", marker=list(size=20, color='blue', line=list(color='yellow', width=2))) %>%
    add_lines(x = ~age, y = fit_wb_sleep$fitted.values, mode = "lines", name="Linear Model") %>%
    layout(title=paste0("Linear Regression Model of Age Vs. Poor Sleep Days", 
                        round(cor(complete_wb$age, complete_wb$PoorSleepDays),3)))

The results of the linear regression model indicate a strongly negative linear relationship between age and poor sleep days. This self-reported metric of poor sleep by younger generations may be a supporting cause of poor mental health reported by younger participants. However, it is noteworthy that direction of causality cannot necessarily be inferred by this analysis.

7 Creating a Naive Bayes Predictor Model for Wellbeing Status vs General Health and Social Support by Generation

7.1 Naive Bayes for the Youngest Generation

set.seed(1234)
nbc_model <- naiveBayes(young_gen[, c("GeneralHealth", "SocialSupport")], young_gen[,"wellbeing"])
predicted.nbcvalues <- predict(nbc_model, young_gen[,c("GeneralHealth", "SocialSupport")])
CT1 <- CrossTable(predicted.nbcvalues, young_gen[,"wellbeing"])

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  24533 
## 
##  
##                     | young_gen[, "wellbeing"] 
## predicted.nbcvalues |         0 |         1 | Row Total | 
## --------------------|-----------|-----------|-----------|
##                   0 |     22298 |      1069 |     23367 | 
##                     |     2.148 |    36.861 |           | 
##                     |     0.954 |     0.046 |     0.952 | 
##                     |     0.962 |     0.791 |           | 
##                     |     0.909 |     0.044 |           | 
## --------------------|-----------|-----------|-----------|
##                   1 |       884 |       282 |      1166 | 
##                     |    43.050 |   738.707 |           | 
##                     |     0.758 |     0.242 |     0.048 | 
##                     |     0.038 |     0.209 |           | 
##                     |     0.036 |     0.011 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |     23182 |      1351 |     24533 | 
##                     |     0.945 |     0.055 |           | 
## --------------------|-----------|-----------|-----------|
## 
##

young_TN <- CT1$prop.row[1, 1]  
young_FP <- CT1$prop.row[1, 2]
young_FN <- CT1$prop.row[2, 1]
young_TP <- CT1$prop.row[2, 2]

plot_ly(x = c("young_TN", "young_FN", "young_FP", "young_TP"),
  y = c(young_TN, young_FN, young_FP, young_TP),
  name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Young Generation Confusion Matrix", 
           legend=list(title=list(text='<b> Metrics </b>')), 
           xaxis=list(title='Metrics'), yaxis=list(title='Probability'))

table(predicted.nbcvalues, young_gen[,"wellbeing"])

##                    
## predicted.nbcvalues     0     1
##                   0 22298  1069
##                   1   884   282

The accuracy for the young generation model is 22580/24533 = 0.920

7.2 Naive Bayes Model for Generation X

set.seed(1234)
nbc_model_x <- naiveBayes(gen_x[, c("GeneralHealth", "SocialSupport")], gen_x[,"wellbeing"])
predicted.nbcvalues_x <- predict(nbc_model_x, gen_x[,c("GeneralHealth", "SocialSupport")])
CT2 <- CrossTable(predicted.nbcvalues_x, gen_x[,"wellbeing"])

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  74507 
## 
##  
##                       | gen_x[, "wellbeing"] 
## predicted.nbcvalues_x |         0 |         1 | Row Total | 
## ----------------------|-----------|-----------|-----------|
##                     0 |     67935 |      2985 |     70920 | 
##                       |    13.391 |   228.131 |           | 
##                       |     0.958 |     0.042 |     0.952 | 
##                       |     0.965 |     0.723 |           | 
##                       |     0.912 |     0.040 |           | 
## ----------------------|-----------|-----------|-----------|
##                     1 |      2441 |      1146 |      3587 | 
##                       |   264.760 |  4510.464 |           | 
##                       |     0.681 |     0.319 |     0.048 | 
##                       |     0.035 |     0.277 |           | 
##                       |     0.033 |     0.015 |           | 
## ----------------------|-----------|-----------|-----------|
##          Column Total |     70376 |      4131 |     74507 | 
##                       |     0.945 |     0.055 |           | 
## ----------------------|-----------|-----------|-----------|
## 
##

mid_TN <- CT2$prop.row[1, 1]  
mid_FP <- CT2$prop.row[1, 2]
mid_FN <- CT2$prop.row[2, 1]
mid_TP <- CT2$prop.row[2, 2]

plot_ly(x = c("mid_TN", "mid_FN", "mid_FP", "mid_TP"),
  y = c(mid_TN, mid_FN, mid_FP, mid_TP),
  name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Gen X Confusion Matrix", 
           legend=list(title=list(text='<b> Metrics </b>')), 
           xaxis=list(title='Metrics'), yaxis=list(title='Probability'))

table(predicted.nbcvalues_x, gen_x[,"wellbeing"])

##                      
## predicted.nbcvalues_x     0     1
##                     0 67935  2985
##                     1  2441  1146

Accuracy of Young Generation = 70920/76346 = 92.9%

7.3 Naive Bayes Model for Baby Boomer Generation

set.seed(1234)
nbc_model_bb <- naiveBayes(oldest_gen[, c("GeneralHealth", "SocialSupport")], oldest_gen[,"wellbeing"])
predicted.nbcvalues_bb <- predict(nbc_model_bb, oldest_gen[,c("GeneralHealth", "SocialSupport")])
CT3 <- CrossTable(predicted.nbcvalues_bb, oldest_gen[,"wellbeing"])

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  279632 
## 
##  
##                        | oldest_gen[, "wellbeing"] 
## predicted.nbcvalues_bb |         0 |         1 | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                      0 |    254086 |     12547 |    266633 | 
##                        |    19.049 |   325.587 |           | 
##                        |     0.953 |     0.047 |     0.954 | 
##                        |     0.962 |     0.812 |           | 
##                        |     0.909 |     0.045 |           | 
## -----------------------|-----------|-----------|-----------|
##                      1 |     10090 |      2909 |     12999 | 
##                        |   390.728 |  6678.373 |           | 
##                        |     0.776 |     0.224 |     0.046 | 
##                        |     0.038 |     0.188 |           | 
##                        |     0.036 |     0.010 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |    264176 |     15456 |    279632 | 
##                        |     0.945 |     0.055 |           | 
## -----------------------|-----------|-----------|-----------|
## 
##

old_TN <- CT3$prop.row[1, 1]  
old_FP <- CT3$prop.row[1, 2]
old_FN <- CT3$prop.row[2, 1]
old_TP <- CT3$prop.row[2, 2]

plot_ly(x = c("old_TN", "old_FN", "old_FP", "old_TP"),
  y = c(old_TN, old_FN, old_FP, old_TP),
  name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Baby Boomer Confusion Matrix", 
           legend=list(title=list(text='<b> Metrics </b>')), 
           xaxis=list(title='Metrics'), yaxis=list(title='Probability'))

table(predicted.nbcvalues_bb, oldest_gen[,"wellbeing"])

##                       
## predicted.nbcvalues_bb      0      1
##                      0 254086  12547
##                      1  10090   2909

Accuracy of old generation = 256995 / 279,632 = 0.919

The prediction accuracy for each defined generation based on Naive Bayes between wellbeing status vs. General Health and Social Support was approximately 0.92. Minute differences in accuracy and overall performance were observed between each of the generations. As a result, it seems likely that high self-ratings of General Health and Social Support are accurately associated with positive self-reported wellbeing.

Conclusion:

Ultimately, while there were some indications that younger generations may fare better, there was not enough evidence to support the hypothesis that younger generations report more poorly on self-reported mental health measures. Several immediate limitations were likely to have contributed to the inconclusive nature of the study design: the data utilized in this analysis was from 2010: more recent study data may better capture some of the trends in mental health, particularly with concepts such as school shootings becoming more prevalent in more recent years. Another problem was the inability to capture Generation Z for various reasons: not only was the oldest Generation Z member roughly 13 years old at the time the data was generated, but the data was heavily skewed towards adults over the age of 50 years old. As a result, while data could be drawn from the set, the sheer amount of older adults interviewed made this data a poor choice to analyze the original question. Future analyses would stand to observe more modern surveys and also draw upon resources surveying younger generations of participants.

Citations:

Smith, A. (2020, August 25). Record shares of Americans have smartphones, Home Broadband. Pew Research Center. Retrieved December 6, 2021, from https://www.pewresearch.org/fact-tank/2017/01/12/evolution-of-technology/.

Amy Morin, L. C. S. W. (2021, June 22). Gen Z is the most stressed out generation right now. Verywell Mind. Retrieved December 6, 2021, from https://www.verywellmind.com/state-of-mental-health-across-generations-5189603.

Bethune, S. (2019, January). Gen Z more likely to report mental health concerns. Monitor on Psychology. Retrieved December 6, 2021, from https://www.apa.org/monitor/2019/01/gen-z.

Lawton, E., Brymer, E., Clough, P., & Denovan, A. (1AD, January 1). The relationship between the physical activity environment, nature relatedness, anxiety, and the psychological well-being benefits of regular exercisers. Frontiers. Retrieved December 8, 2021, from https://www.frontiersin.org/articles/10.3389/fpsyg.2017.01058/full.

Mental health and sleep. Sleep Foundation. (2020, September 18). Retrieved December 8, 2021, from https://www.sleepfoundation.org/mental-health.

Term Paper: Younger Generations’ Perception and Self-Evaluation of Well-Being

Kristian Shin

December 2021