Question-1: Read the data into a variable named “Mental_Health_Survey”. Remove all the rows that contain null values in any of the variables (except “Comments” and “State”). Perform all the data cleaning required. For example, you might want to do the following:

  1. Convert any categorical variables to numeric values.
  2. The column “Gender” seems to have non-uniform coding (like “Male”, “M”, “m”). Perform the required transformations to create consistency.
  3. The “Age” column might have values that do not make sense. Remove them. Decide on any other forms of cleaning that are required.

Answer the following questions using either a simple R-command or an appropriate visualization:

Which state in the United States seems to have employees with the most diagnosed cases of depression? Which state has the least?

What is the relationship among Self-employment, working for a tech company, ability to get a leave and how does it relate to depression for the employee? Also, how do physical health consequences play a role in incidence of depression? You might use a combination of statistics and visualizations to answer this question.

How do work interference and remote work option relate to the incidence of depression? What role do coworkers play in this case?

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
Mental_Health_Survey <- read.csv("C:/Users/PALLAVI/Downloads/BML/survey.csv",
                                 stringsAsFactors = FALSE)

head(Mental_Health_Survey)
View(Mental_Health_Survey)
str(Mental_Health_Survey)
## 'data.frame':    1259 obs. of  19 variables:
##  $ Age                    : num  37 44 32 31 31 33 35 39 42 23 ...
##  $ Gender                 : chr  "Female" "M" "Male" "Male" ...
##  $ Country                : chr  "United States" "United States" "Canada" "United Kingdom" ...
##  $ state                  : chr  "IL" "IN" NA NA ...
##  $ self_employed          : chr  NA NA NA NA ...
##  $ family_history         : chr  "No" "No" "No" "Yes" ...
##  $ treatment              : chr  "Yes" "No" "No" "Yes" ...
##  $ work_interfere         : chr  "Often" "Rarely" "Rarely" "Often" ...
##  $ remote_work            : chr  "No" "No" "No" "No" ...
##  $ tech_company           : chr  "Yes" "No" "Yes" "Yes" ...
##  $ benefits               : chr  "Yes" "Don't know" "No" "No" ...
##  $ care_options           : chr  "Not sure" "No" "No" "Yes" ...
##  $ wellness_program       : chr  "No" "Don't know" "No" "No" ...
##  $ seek_help              : chr  "Yes" "Don't know" "No" "No" ...
##  $ leave                  : chr  "Somewhat easy" "Don't know" "Somewhat difficult" "Somewhat difficult" ...
##  $ phys_health_consequence: chr  "No" "No" "No" "Yes" ...
##  $ coworkers              : chr  "Some of them" "No" "Yes" "Some of them" ...
##  $ obs_consequence        : chr  "No" "No" "No" "Yes" ...
##  $ comments               : chr  NA NA NA NA ...
summary(Mental_Health_Survey)
##       Age                Gender            Country             state          
##  Min.   :-1.726e+03   Length:1259        Length:1259        Length:1259       
##  1st Qu.: 2.700e+01   Class :character   Class :character   Class :character  
##  Median : 3.100e+01   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 7.943e+07                                                           
##  3rd Qu.: 3.600e+01                                                           
##  Max.   : 1.000e+11                                                           
##  self_employed      family_history      treatment         work_interfere    
##  Length:1259        Length:1259        Length:1259        Length:1259       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  remote_work        tech_company         benefits         care_options      
##  Length:1259        Length:1259        Length:1259        Length:1259       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  wellness_program    seek_help            leave          
##  Length:1259        Length:1259        Length:1259       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  phys_health_consequence  coworkers         obs_consequence   
##  Length:1259             Length:1259        Length:1259       
##  Class :character        Class :character   Class :character  
##  Mode  :character        Mode  :character   Mode  :character  
##                                                               
##                                                               
##                                                               
##    comments        
##  Length:1259       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Interpretation: The dataset was loaded into a variable named Mental_Health_Survey. The head(), View(), str(), and summary() commands were used to understand the structure, variable types, and initial data quality issues.

colSums(is.na(Mental_Health_Survey))
##                     Age                  Gender                 Country 
##                       0                       0                       0 
##                   state           self_employed          family_history 
##                     515                      18                       0 
##               treatment          work_interfere             remote_work 
##                       0                     264                       0 
##            tech_company                benefits            care_options 
##                       0                       0                       0 
##        wellness_program               seek_help                   leave 
##                       0                       0                       0 
## phys_health_consequence               coworkers         obs_consequence 
##                       0                       0                       0 
##                comments 
##                    1095

Interpretation: This output shows which columns contain missing values before cleaning. The question says to remove rows with null values except in comments and state.

Mental_Health_Survey <- Mental_Health_Survey %>%
  filter(complete.cases(select(., -comments, -state)))

colSums(is.na(Mental_Health_Survey))
##                     Age                  Gender                 Country 
##                       0                       0                       0 
##                   state           self_employed          family_history 
##                     386                       0                       0 
##               treatment          work_interfere             remote_work 
##                       0                       0                       0 
##            tech_company                benefits            care_options 
##                       0                       0                       0 
##        wellness_program               seek_help                   leave 
##                       0                       0                       0 
## phys_health_consequence               coworkers         obs_consequence 
##                       0                       0                       0 
##                comments 
##                     837

Interpretation: Rows with missing values were removed from all variables except comments and state, as required by the question.
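As a minimal sketch of the same idea in base R, the example below runs `complete.cases()` on only the columns that must be complete. The data frame `toy` and its values are hypothetical and exist only to make the behavior visible.

```r
# Toy data frame (hypothetical values) with NAs in several columns.
toy <- data.frame(
  Age      = c(30, NA, 41, 25),
  state    = c("IL", NA, NA, "NY"),
  comments = c(NA, "ok", NA, NA),
  stringsAsFactors = FALSE
)

# Columns whose NAs should be tolerated.
keep_na <- c("state", "comments")

# Check completeness only on the remaining columns.
checked   <- setdiff(names(toy), keep_na)
toy_clean <- toy[complete.cases(toy[, checked, drop = FALSE]), ]

nrow(toy_clean)            # rows that survive the filter
colSums(is.na(toy_clean))  # NAs may remain only in state and comments
```

Only the row with a missing Age is dropped; missing values in state and comments are left in place.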

# Normalize case and surrounding whitespace before matching.
Mental_Health_Survey$Gender <- tolower(Mental_Health_Survey$Gender)
Mental_Health_Survey$Gender <- trimws(Mental_Health_Survey$Gender)

# Map known spellings to "Male" or "Female"; anything else becomes "Other".
# Values are already trimmed, so trailing-space variants are not listed.
Mental_Health_Survey$Gender <- ifelse(
  Mental_Health_Survey$Gender %in% c("male", "m", "man", "cis male", "male-ish",
                                     "maile", "mal", "male (cis)", "make",
                                     "msle", "cis man"),
  "Male",
  ifelse(
    Mental_Health_Survey$Gender %in% c("female", "f", "woman", "cis female",
                                       "femake", "female (cis)",
                                       "cis-female/femme"),
    "Female",
    "Other"
  )
)

Mental_Health_Survey$Gender <- as.factor(Mental_Health_Survey$Gender)

table(Mental_Health_Survey$Gender)
## 
## Female   Male  Other 
##    205    749     23

Interpretation: The Gender column had non-uniform values such as “Male”, “M”, and “m”. These were standardized into Male, Female, and Other.
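An alternative to nested `ifelse()` calls is a named lookup vector, which can be easier to extend. This is only a sketch: the vector `raw` and the short `gender_map` are hypothetical, and the real column contains more spellings than are listed here.

```r
# Hypothetical raw responses; the real column has many more variants.
raw <- c("Male", " m", "F", "Woman", "non-binary", "MALE ")

# Named lookup: cleaned spelling -> canonical label.
gender_map <- c(male = "Male", m = "Male", man = "Male",
                female = "Female", f = "Female", woman = "Female")

cleaned <- trimws(tolower(raw))
gender  <- unname(gender_map[cleaned])  # unmapped spellings come back as NA
gender[is.na(gender)] <- "Other"

table(gender)
```

Adding a new spelling then only requires a new entry in the lookup vector.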

summary(Mental_Health_Survey$Age)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.726e+03  2.700e+01  3.100e+01  1.024e+08  3.600e+01  1.000e+11
Mental_Health_Survey <- Mental_Health_Survey %>%
  filter(Age >= 18 & Age <= 100)

summary(Mental_Health_Survey$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   27.00   31.00   32.31   36.00   72.00

Interpretation: Invalid age values were removed. Only realistic employee ages between 18 and 100 were kept.

Mental_Health_Survey <- distinct(Mental_Health_Survey)

nrow(Mental_Health_Survey)
## [1] 968

Interpretation: Duplicate rows were removed to avoid repeated records affecting the analysis.

Mental_Health_Survey_numeric <- Mental_Health_Survey

categorical_cols <- names(Mental_Health_Survey_numeric)[
  sapply(Mental_Health_Survey_numeric, is.character) |
    sapply(Mental_Health_Survey_numeric, is.factor)
]

categorical_cols <- setdiff(categorical_cols, c("comments", "state"))

Mental_Health_Survey_numeric[categorical_cols] <- lapply(
  Mental_Health_Survey_numeric[categorical_cols],
  function(x) as.numeric(as.factor(x))
)

head(Mental_Health_Survey_numeric)
str(Mental_Health_Survey_numeric)
## 'data.frame':    968 obs. of  19 variables:
##  $ Age                    : num  46 29 31 46 41 33 35 35 34 37 ...
##  $ Gender                 : num  2 2 2 2 2 2 2 1 2 2 ...
##  $ Country                : num  38 38 38 38 38 38 38 38 38 37 ...
##  $ state                  : chr  "MD" "NY" "NC" "MA" ...
##  $ self_employed          : num  2 1 2 1 1 1 1 1 1 1 ...
##  $ family_history         : num  2 2 1 1 1 2 2 2 1 1 ...
##  $ treatment              : num  1 2 1 2 2 2 2 2 2 1 ...
##  $ work_interfere         : num  4 4 1 2 1 3 4 3 4 4 ...
##  $ remote_work            : num  2 1 2 2 1 1 1 2 2 1 ...
##  $ tech_company           : num  2 2 2 2 1 2 1 2 2 2 ...
##  $ benefits               : num  3 3 2 3 1 3 3 3 1 2 ...
##  $ care_options           : num  2 3 1 3 1 2 3 3 2 1 ...
##  $ wellness_program       : num  3 2 2 2 2 1 2 1 2 2 ...
##  $ seek_help              : num  1 2 2 2 1 3 1 1 1 2 ...
##  $ leave                  : num  5 2 2 1 1 1 5 1 2 4 ...
##  $ phys_health_consequence: num  2 2 2 2 2 2 2 2 2 1 ...
##  $ coworkers              : num  3 2 2 2 1 3 2 3 2 2 ...
##  $ obs_consequence        : num  2 1 1 1 1 1 1 1 1 1 ...
##  $ comments               : chr  NA NA NA NA ...

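Interpretation: Categorical variables were converted into numeric values so they can be used for statistical analysis and modeling. The codes are the positions of the factor levels in alphabetical order (for work_interfere: Never = 1, Often = 2, Rarely = 3, Sometimes = 4), so they act as labels and do not form an ordered scale. The original dataset is kept for interpretation, while the numeric version can be used later if needed.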
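Where an ordered coding is wanted, the level order can be given explicitly. The sketch below assumes the ordering Never < Rarely < Sometimes < Often for work_interfere; that ordering is an assumption about the survey scale, and the vector `wi` is a hypothetical sample.

```r
# Example responses using the levels seen in the survey output.
wi <- c("Often", "Never", "Sometimes", "Rarely", "Never")

# Default coding is alphabetical: Never=1, Often=2, Rarely=3, Sometimes=4.
alphabetical <- as.numeric(as.factor(wi))

# An explicit level order gives codes that increase with frequency.
ordered_levels <- c("Never", "Rarely", "Sometimes", "Often")
ordinal <- as.numeric(factor(wi, levels = ordered_levels))

data.frame(wi, alphabetical, ordinal)
```

With the explicit order, a larger code always means more frequent interference, which makes means and distances easier to interpret.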

us_state_depression <- Mental_Health_Survey %>%
  filter(Country == "United States", !is.na(state), obs_consequence == "Yes") %>%
  group_by(state) %>%
  summarise(depression_cases = n(), .groups = "drop") %>%
  arrange(desc(depression_cases))

us_state_depression
head(us_state_depression, 1)
tail(us_state_depression, 1)
ggplot(us_state_depression, aes(x = reorder(state, depression_cases),
                                y = depression_cases)) +
  geom_col() +
  coord_flip() +
  labs(title = "Diagnosed Depression Cases by State",
       x = "State",
       y = "Number of Diagnosed Cases")

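Interpretation: The table is sorted in descending order, so head() returns the state with the most diagnosed cases and tail() returns a state with the fewest. In this output, California (CA) is first with 15 cases and Virginia (VA) is last with 1 case. Two caveats apply: tail(us_state_depression, 1) returns only one row, so any other states tied at the minimum are not shown, and states with no “Yes” responses are dropped by the filter and never appear in the table.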
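One way to handle ties and zero-count states is to count per state before filtering and then report every state at the maximum or minimum. The following is a sketch on a hypothetical toy data frame (`toy`); the states and responses in it are invented for illustration.

```r
# Hypothetical toy data: one row per respondent.
toy <- data.frame(
  state = c("CA", "CA", "CA", "NY", "NY", "TX", "WA"),
  obs_consequence = c("Yes", "Yes", "No", "Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Count "Yes" and total respondents per state, keeping zero-count states.
yes_n <- tapply(toy$obs_consequence == "Yes", toy$state, sum)
total <- tapply(toy$obs_consequence, toy$state, length)
by_state <- data.frame(state = names(yes_n),
                       cases = as.vector(yes_n),
                       rate  = as.vector(yes_n / total))

# Report every state tied for the maximum or minimum, not just one row.
most  <- by_state$state[by_state$cases == max(by_state$cases)]
least <- by_state$state[by_state$cases == min(by_state$cases)]
most
least
```

The `rate` column is included because raw counts favor states with more respondents; whether counts or rates answer the question better is a judgment call.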

table(Mental_Health_Survey$self_employed,
      Mental_Health_Survey$obs_consequence)
##      
##        No Yes
##   No  715 133
##   Yes  92  28
prop.table(table(Mental_Health_Survey$self_employed,
                 Mental_Health_Survey$obs_consequence), 1)
##      
##              No       Yes
##   No  0.8431604 0.1568396
##   Yes 0.7666667 0.2333333
ggplot(Mental_Health_Survey,
       aes(x = self_employed, fill = obs_consequence)) +
  geom_bar(position = "fill") +
  labs(title = "Self Employment vs Depression",
       x = "Self Employed",
       y = "Proportion",
       fill = "Depression")

table(Mental_Health_Survey$tech_company,
      Mental_Health_Survey$obs_consequence)
##      
##        No Yes
##   No  136  40
##   Yes 671 121
prop.table(table(Mental_Health_Survey$tech_company,
                 Mental_Health_Survey$obs_consequence), 1)
##      
##              No       Yes
##   No  0.7727273 0.2272727
##   Yes 0.8472222 0.1527778
ggplot(Mental_Health_Survey,
       aes(x = tech_company, fill = obs_consequence)) +
  geom_bar(position = "fill") +
  labs(title = "Tech Company vs Depression",
       x = "Tech Company",
       y = "Proportion",
       fill = "Depression")

table(Mental_Health_Survey$leave,
      Mental_Health_Survey$obs_consequence)
##                     
##                       No Yes
##   Don't know         358  53
##   Somewhat difficult  71  35
##   Somewhat easy      181  28
##   Very difficult      57  32
##   Very easy          140  13
prop.table(table(Mental_Health_Survey$leave,
                 Mental_Health_Survey$obs_consequence), 1)
##                     
##                              No        Yes
##   Don't know         0.87104623 0.12895377
##   Somewhat difficult 0.66981132 0.33018868
##   Somewhat easy      0.86602871 0.13397129
##   Very difficult     0.64044944 0.35955056
##   Very easy          0.91503268 0.08496732
ggplot(Mental_Health_Survey,
       aes(x = leave, fill = obs_consequence)) +
  geom_bar(position = "fill") +
  labs(title = "Leave Difficulty vs Depression",
       x = "Leave",
       y = "Proportion",
       fill = "Depression")

table(Mental_Health_Survey$phys_health_consequence,
      Mental_Health_Survey$obs_consequence)
##        
##          No Yes
##   Maybe 167  53
##   No    604  91
##   Yes    36  17
prop.table(table(Mental_Health_Survey$phys_health_consequence,
                 Mental_Health_Survey$obs_consequence), 1)
##        
##                No       Yes
##   Maybe 0.7590909 0.2409091
##   No    0.8690647 0.1309353
##   Yes   0.6792453 0.3207547
ggplot(Mental_Health_Survey,
       aes(x = phys_health_consequence, fill = obs_consequence)) +
  geom_bar(position = "fill") +
  labs(title = "Physical Health Consequence vs Depression",
       x = "Physical Health Consequence",
       y = "Proportion",
       fill = "Depression")

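Interpretation: The contingency tables and proportional bar charts show how the share of “Yes” responses for obs_consequence differs across workplace and health-related factors. Self-employed respondents show a higher share than others (23.3% vs. 15.7%), and respondents at tech companies show a lower share than those at non-tech companies (15.3% vs. 22.7%). Leave difficulty shows the largest gap: the share is 36.0% when leave is very difficult and 33.0% when somewhat difficult, compared with 13.4% when somewhat easy and 8.5% when very easy. Respondents who expect physical health consequences also show a higher share (32.1% for Yes, 24.1% for Maybe) than those who do not (13.1%). These are descriptive associations and do not by themselves show causation.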
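To check whether differences like these could plausibly arise by chance, a chi-squared test of independence can be run on the same tables. The sketch below re-enters the counts printed above by hand (the matrix names `self_emp` and `leave_tab` are introduced here); in the actual workflow, passing the `table()` result to `chisq.test()` would do the same thing.

```r
# Counts copied from the self_employed x obs_consequence table above.
self_emp <- matrix(c(715, 133,
                      92,  28),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(self_employed = c("No", "Yes"),
                                   obs_consequence = c("No", "Yes")))

# Pearson chi-squared test of independence (the Yates continuity
# correction is applied by default for 2x2 tables).
test_self <- chisq.test(self_emp)
test_self$p.value

# Same idea for the 5x2 leave table.
leave_tab <- matrix(c(358, 53, 71, 35, 181, 28, 57, 32, 140, 13),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(
                      leave = c("Don't know", "Somewhat difficult",
                                "Somewhat easy", "Very difficult", "Very easy"),
                      obs_consequence = c("No", "Yes")))
test_leave <- chisq.test(leave_tab)
test_leave$p.value
```

A small p-value indicates that the row variable and obs_consequence are unlikely to be independent; it does not indicate the direction or cause of the association.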

table(Mental_Health_Survey$work_interfere,
      Mental_Health_Survey$obs_consequence)
##            
##              No Yes
##   Never     189  17
##   Often     103  34
##   Rarely    144  24
##   Sometimes 371  86
prop.table(table(Mental_Health_Survey$work_interfere,
                 Mental_Health_Survey$obs_consequence), 1)
##            
##                     No        Yes
##   Never     0.91747573 0.08252427
##   Often     0.75182482 0.24817518
##   Rarely    0.85714286 0.14285714
##   Sometimes 0.81181619 0.18818381
ggplot(Mental_Health_Survey,
       aes(x = work_interfere, fill = obs_consequence)) +
  geom_bar(position = "fill") +
  labs(title = "Work Interference vs Depression",
       x = "Work Interference",
       y = "Proportion",
       fill = "Depression")

table(Mental_Health_Survey$remote_work,
      Mental_Health_Survey$obs_consequence)
##      
##        No Yes
##   No  556 118
##   Yes 251  43
prop.table(table(Mental_Health_Survey$remote_work,
                 Mental_Health_Survey$obs_consequence), 1)
##      
##              No       Yes
##   No  0.8249258 0.1750742
##   Yes 0.8537415 0.1462585
ggplot(Mental_Health_Survey,
       aes(x = remote_work, fill = obs_consequence)) +
  geom_bar(position = "fill") +
  labs(title = "Remote Work vs Depression",
       x = "Remote Work",
       y = "Proportion",
       fill = "Depression")

table(Mental_Health_Survey$coworkers,
      Mental_Health_Survey$obs_consequence)
##               
##                 No Yes
##   No           168  32
##   Some of them 489 110
##   Yes          150  19
prop.table(table(Mental_Health_Survey$coworkers,
                 Mental_Health_Survey$obs_consequence), 1)
##               
##                       No       Yes
##   No           0.8400000 0.1600000
##   Some of them 0.8163606 0.1836394
##   Yes          0.8875740 0.1124260
ggplot(Mental_Health_Survey,
       aes(x = coworkers, fill = obs_consequence)) +
  geom_bar(position = "fill") +
  labs(title = "Coworkers vs Depression",
       x = "Coworker Support",
       y = "Proportion",
       fill = "Depression")

Interpretation: The share of “Yes” responses rises with work interference: 8.3% for Never, 14.3% for Rarely, 18.8% for Sometimes, and 24.8% for Often. Remote work shows only a small difference (14.6% for remote workers vs. 17.5% for others), so the tables give little evidence of a remote-work effect in either direction. For coworkers, respondents who answered Yes (comfortable discussing mental health with coworkers) show the lowest share (11.2%), compared with 16.0% for No and 18.4% for Some of them.
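Because work interference has a natural order, the trend in the “Yes” share can be tested directly. The sketch below assumes the ordering Never < Rarely < Sometimes < Often and reuses the counts from the work_interfere table above; the variable names are introduced here for illustration.

```r
# "Yes" counts and group sizes from the work_interfere table above,
# reordered from least to most frequent interference.
levels_ord <- c("Never", "Rarely", "Sometimes", "Often")
yes_counts <- c(17, 24, 86, 34)
group_n    <- c(189 + 17, 144 + 24, 371 + 86, 103 + 34)

rates <- setNames(yes_counts / group_n, levels_ord)
round(rates, 3)

# Chi-squared test for a linear trend in proportions across the
# ordered groups (scores 1..4 by default).
trend <- prop.trend.test(yes_counts, group_n)
trend$p.value
```

`prop.trend.test()` uses equally spaced scores by default, which treats the steps between adjacent answers as equal; that is a modeling assumption, not something the survey guarantees.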

Final Conclusion: The dataset was cleaned by removing rows with missing values (except for comments and state), standardizing the Gender variable, removing unrealistic Age values, eliminating duplicate records, and converting categorical variables into numeric form to ensure consistency and reliability of analysis.

Based on the analysis of U.S. states, California (CA) shows the highest number of diagnosed depression cases (15), while Virginia (VA) appears last in the table with 1 case. These are raw counts rather than rates: they are not adjusted for the number of respondents per state, and states with no “Yes” responses are not listed because the filter keeps only “Yes” rows.

The relationship analysis using both statistical summaries (contingency tables and proportions) and visualizations (bar plots) suggests that workplace and health-related factors are associated with depression incidence. The “Yes” share rises from about 8% when leave is very easy to about 36% when it is very difficult, and from about 13% to about 32% when respondents expect physical health consequences. Self-employed respondents (about 23%) and respondents outside tech companies (about 23%) also show higher shares than their counterparts (about 16% and 15%). These are descriptive associations, not evidence of causation.

Work interference shows a positive relationship with depression: the “Yes” share increases from about 8% for Never to about 25% for Often. Remote work shows only a small difference (about 15% for remote workers vs. 18% for others), so no clear effect can be read from the tables.

Additionally, coworker support appears to matter: respondents who are comfortable discussing mental health with coworkers show the lowest share (about 11%), compared with 16% for No and 18% for Some of them.

Overall, the results highlight that workplace conditions, physical health, and social support are key factors influencing mental health among employees.

Question-2: Based on your understanding of answers to Q1, pick a clustering technique and appropriate variables to identify clusters of employees with/without depression. Tabulate the results where there are two rows (“obs_consequence” either Yes or No) and the columns contain the values of centroids for the variables. What do you observe and infer from this?

library(dplyr)

q2_data <- Mental_Health_Survey %>%
  select(obs_consequence,
         work_interfere,
         remote_work,
         self_employed,
         tech_company,
         leave,
         phys_health_consequence,
         coworkers)

Interpretation: These variables were selected based on Q1 because work interference, remote work, self-employment, tech company status, leave difficulty, physical health consequences, and coworker support were related to depression.

q2_numeric <- q2_data

q2_numeric[] <- lapply(q2_numeric, function(x) {
  as.numeric(as.factor(x))
})

head(q2_numeric)
str(q2_numeric)
## 'data.frame':    968 obs. of  8 variables:
##  $ obs_consequence        : num  2 1 1 1 1 1 1 1 1 1 ...
##  $ work_interfere         : num  4 4 1 2 1 3 4 3 4 4 ...
##  $ remote_work            : num  2 1 2 2 1 1 1 2 2 1 ...
##  $ self_employed          : num  2 1 2 1 1 1 1 1 1 1 ...
##  $ tech_company           : num  2 2 2 2 1 2 1 2 2 2 ...
##  $ leave                  : num  5 2 2 1 1 1 5 1 2 4 ...
##  $ phys_health_consequence: num  2 2 2 2 2 2 2 2 2 1 ...
##  $ coworkers              : num  3 2 2 2 1 3 2 3 2 2 ...

Interpretation: All categorical variables were converted into numeric form so they can be used in clustering.

depression_label <- q2_data$obs_consequence

cluster_vars <- q2_numeric %>%
  select(-obs_consequence)

Interpretation: obs_consequence is kept as the depression label, while the remaining variables are used to form clusters.

cluster_scaled <- scale(cluster_vars)

Interpretation: Scaling was done so that all variables contribute equally to the clustering process.

set.seed(123)

kmeans_model <- kmeans(cluster_scaled,
                      centers = 2,
                      nstart = 25)

kmeans_model
## K-means clustering with 2 clusters of sizes 327, 641
## 
## Cluster means:
##   work_interfere remote_work self_employed tech_company       leave
## 1     0.02046454   1.2939859     0.7277437   0.19379556  0.18232034
## 2    -0.01043979  -0.6601145    -0.3712515  -0.09886295 -0.09300897
##   phys_health_consequence   coworkers
## 1             -0.01573577  0.12131337
## 2              0.00802745 -0.06188685
## 
## Clustering vector:
##   [1] 1 2 1 1 2 2 2 1 1 2 1 1 1 1 2 2 2 1 2 2 2 2 2 1 2 2 1 2 1 1 2 1 2 2 1 1 1
##  [38] 1 2 2 2 2 1 1 1 1 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 2 1 2
##  [75] 2 1 2 1 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2
## [112] 2 2 1 2 1 2 2 1 2 2 2 1 2 1 1 2 2 1 2 2 1 2 2 2 2 1 1 2 2 2 2 1 2 1 1 2 2
## [149] 2 1 2 2 1 1 2 2 2 2 1 1 2 1 2 1 2 1 2 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 1 2 2
## [186] 1 1 2 1 1 2 1 1 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2
## [223] 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 1 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## [260] 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 1 2 1 2 1 2 2 2 2 1 2 2 2 2 2 2
## [297] 2 2 1 1 1 2 2 2 1 2 2 2 1 1 1 2 2 2 2 1 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2
## [334] 2 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 2 2 2 1 2 2 1 2 2 1 2 2 2 2 1 2 2 2 1 2
## [371] 2 1 1 2 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 1 1 1 1 1 2 1 2 1 2 2 2 1 2 2 2 2 2
## [408] 2 1 2 1 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 2 2 2 1 2 1
## [445] 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 1 1 2 2 1 1 2 2 2 2 1 1
## [482] 1 2 2 1 1 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2
## [519] 1 2 1 1 2 2 1 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 2 2 1 2 1 2 1 1 2 1 2 1 2 1 2
## [556] 2 2 1 1 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2
## [593] 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2
## [630] 1 2 2 2 2 1 1 2 1 2 1 1 2 1 2 1 2 2 1 1 2 1 2 2 2 1 1 1 1 1 2 2 1 1 2 1 2
## [667] 1 1 1 2 2 1 1 2 2 1 2 2 1 1 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 2
## [704] 1 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 1 1 1 2 1 1 1 2 1 1 1 2 2 2 2 2 1 1 1 2 2
## [741] 1 1 2 2 2 2 1 1 2 2 1 1 2 1 2 2 1 2 1 1 2 1 2 2 1 1 1 2 2 1 1 2 2 1 2 1 2
## [778] 2 1 1 2 1 1 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 2 1 2 1 1 2 2 1 2 2 1 1 2 2 2
## [815] 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 1 1 1 1 2 2 2 1 2 1 2 2 2
## [852] 2 1 1 2 2 2 1 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2
## [889] 1 1 2 2 1 2 2 2 2 2 2 2 1 1 2 2 1 1 2 2 1 1 2 2 2 1 2 2 2 2 1 1 2 1 2 1 2
## [926] 2 2 1 2 2 1 1 2 1 2 1 1 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 1
## [963] 2 1 2 1 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 2367.716 3270.351
##  (between_SS / total_SS =  16.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

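Interpretation: K-means clustering was selected because the goal is to divide employees into two groups: employees with depression and employees without depression. Therefore, centers = 2 was used. K-means is unsupervised, so the two clusters are not guaranteed to line up with the Yes/No label. The between_SS / total_SS ratio of 16.7% indicates that the two clusters account for only a small share of the total variance.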
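A common way to judge a choice of k is to compare the total within-cluster sum of squares across several values. The sketch below does this on hypothetical synthetic data (`toy`), not on the survey; on the survey data, `cluster_scaled` would take the place of `toy_scaled`.

```r
set.seed(123)
# Hypothetical toy data: two well-separated groups in two dimensions.
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..5.
wss <- sapply(1:5, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# Share of total variance explained by the clustering at each k
# (for k = 2 this is the same ratio as between_SS / total_SS).
explained <- 1 - wss / wss[1]
round(explained, 3)
```

When the data really contain two groups, the explained share jumps at k = 2 and flattens afterwards; a low value at k = 2, as in the survey output, suggests the clusters are not well separated.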

q2_results <- q2_data %>%
  mutate(Cluster = kmeans_model$cluster)

head(q2_results)
cluster_depression_table <- table(q2_results$Cluster,
                                  q2_results$obs_consequence)

cluster_depression_table
##    
##      No Yes
##   1 276  51
##   2 531 110

Interpretation: This table shows how the two clusters are associated with obs_consequence values of Yes and No.
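Because the clusters differ in size, row proportions are easier to compare than raw counts. The sketch below re-enters the counts from cluster_depression_table by hand (the name `tab` is introduced here); in the workflow itself, `prop.table(cluster_depression_table, 1)` would give the same result.

```r
# Counts copied from cluster_depression_table above.
tab <- matrix(c(276,  51,
                531, 110),
              nrow = 2, byrow = TRUE,
              dimnames = list(Cluster = c("1", "2"),
                              obs_consequence = c("No", "Yes")))

# Row proportions: share of "No" and "Yes" within each cluster.
cluster_rates <- prop.table(tab, 1)
round(cluster_rates, 3)
```

The “Yes” shares come out at roughly 15.6% for Cluster 1 and 17.2% for Cluster 2, so the two clusters are close in depression incidence.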

centroid_table <- q2_numeric %>%
  group_by(obs_consequence) %>%
  summarise(across(everything(), mean), .groups = "drop")

centroid_table

Interpretation: This table satisfies the question requirement because the rows represent obs_consequence values and the columns show the average centroid-like values for the selected variables.

kmeans_centroids <- as.data.frame(kmeans_model$centers)

kmeans_centroids

Interpretation: The centroid table shows the average standardized value of each variable within each cluster.
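Standardized centroids can be converted back to the original units using the attributes that `scale()` stores. The following is a sketch on hypothetical synthetic data (`toy`); on the survey data, the same two `sweep()` calls would apply to `kmeans_model$centers` and `cluster_scaled`.

```r
set.seed(1)
# Hypothetical toy data on two different scales.
toy <- data.frame(x = c(rnorm(20, 10, 2),   rnorm(20, 20, 2)),
                  y = c(rnorm(20, 100, 10), rnorm(20, 200, 10)))
toy_scaled <- scale(toy)
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Undo the scaling: multiply by the stored SDs, then add the stored means.
centers_raw <- sweep(km$centers, 2, attr(toy_scaled, "scaled:scale"), `*`)
centers_raw <- sweep(centers_raw, 2, attr(toy_scaled, "scaled:center"), `+`)
centers_raw

# Cross-check against per-cluster means of the unscaled data.
agg <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
agg
```

The back-transformed centroids match the per-cluster means of the raw data, which makes them directly comparable with a table of group means such as centroid_table.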

Final Conclusion: K-means clustering was applied using variables related to workplace conditions, physical health, and social support, including work interference, remote work, self-employment, tech company status, leave difficulty, physical health consequences, and coworker support. These variables were selected based on the findings from Q1, where they showed relationships with depression incidence.

The clustering results show that Cluster 1 contains 276 employees without depression and 51 employees with depression, while Cluster 2 contains 531 employees without depression and 110 employees with depression. Cluster 2 has more depressed employees in absolute terms because it is almost twice as large. As a share, the two clusters are close (51/327 ≈ 15.6% in Cluster 1 vs. 110/641 ≈ 17.2% in Cluster 2), so cluster membership does not separate employees with and without depression.

The centroid table further shows differences between employees with and without depression. Employees with depression (obs_consequence = 2) have a higher mean code for work interference (3.11 compared to 2.86). Because the codes follow alphabetical level order (Never = 1, Often = 2, Rarely = 3, Sometimes = 4), this mean is not an ordered severity score, and the Q1 proportions are the clearer evidence that depression rises with interference. The mean for self-employment is also slightly higher (1.17 compared to 1.11); since 1 = No and 2 = Yes, this corresponds to roughly 17% vs. 11% self-employed.

The K-means centroid values also indicate variation between the two clusters. Cluster 1 shows positive centroid values for remote work (1.29) and self-employment (0.73), while Cluster 2 shows negative centroid values for these variables (-0.66 and -0.37). The centroids for work interference and physical health consequences are close to 0 in both clusters, so the clusters mainly distinguish remote and self-employed respondents from the rest.

Overall, the clustering analysis separates employees mainly by remote work and self-employment rather than by depression status: the two clusters account for 16.7% of the total variance and have similar depression shares. The group means by obs_consequence are consistent with the Q1 findings, but the k-means clusters themselves should not be read as “depressed” and “not depressed” groups.

Question-3: Based on your analysis in the previous two questions, pick an appropriate Association Rule Mining algorithm that uses the consequent as “obs_consequence”. Pick the appropriate values of support, confidence and lift in this case and justify. What are the top five rules by support, confidence and lift?

library(arules)
## Warning: package 'arules' was built under R version 4.4.3
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(dplyr)
arm_data <- Mental_Health_Survey %>%
  select(work_interfere,
         remote_work,
         self_employed,
         tech_company,
         leave,
         phys_health_consequence,
         coworkers,
         obs_consequence)

head(arm_data)
str(arm_data)
## 'data.frame':    968 obs. of  8 variables:
##  $ work_interfere         : chr  "Sometimes" "Sometimes" "Never" "Often" ...
##  $ remote_work            : chr  "Yes" "No" "Yes" "Yes" ...
##  $ self_employed          : chr  "Yes" "No" "Yes" "No" ...
##  $ tech_company           : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ leave                  : chr  "Very easy" "Somewhat difficult" "Somewhat difficult" "Don't know" ...
##  $ phys_health_consequence: chr  "No" "No" "No" "No" ...
##  $ coworkers              : chr  "Yes" "Some of them" "Some of them" "Some of them" ...
##  $ obs_consequence        : chr  "Yes" "No" "No" "No" ...

Interpretation: These variables were selected based on Q1 and Q2 because workplace conditions, physical health consequences, and coworker support showed relationships with depression.

arm_data[] <- lapply(arm_data, as.factor)

str(arm_data)
## 'data.frame':    968 obs. of  8 variables:
##  $ work_interfere         : Factor w/ 4 levels "Never","Often",..: 4 4 1 2 1 3 4 3 4 4 ...
##  $ remote_work            : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 1 2 2 1 ...
##  $ self_employed          : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ tech_company           : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 2 2 ...
##  $ leave                  : Factor w/ 5 levels "Don't know","Somewhat difficult",..: 5 2 2 1 1 1 5 1 2 4 ...
##  $ phys_health_consequence: Factor w/ 3 levels "Maybe","No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ coworkers              : Factor w/ 3 levels "No","Some of them",..: 3 2 2 2 1 3 2 3 2 2 ...
##  $ obs_consequence        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 1 1 ...

Interpretation: Association rule mining works with categorical transaction data, so all selected variables were converted to factors.
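The three rule metrics that Question 3 asks about can be computed by hand for a single rule, which helps when choosing thresholds. The sketch below uses the rule {leave=Very difficult} => {obs_consequence=Yes} with counts taken from the Q1 leave table; the variable names are introduced here for illustration.

```r
# Counts taken from the leave x obs_consequence table in Question 1.
n_total   <- 968   # all respondents after cleaning
n_lhs     <- 89    # leave = "Very difficult" (57 + 32)
n_rhs     <- 161   # obs_consequence = "Yes" (sum of the Yes column)
n_lhs_rhs <- 32    # both together

support    <- n_lhs_rhs / n_total             # P(LHS and RHS)
confidence <- n_lhs_rhs / n_lhs               # P(RHS | LHS)
lift       <- confidence / (n_rhs / n_total)  # confidence vs. baseline rate

round(c(support = support, confidence = confidence, lift = lift), 3)
```

Because only about 17% of respondents answered “Yes”, rules with that consequent necessarily have low support, so a minimum support well below 0.17 is needed for any such rule to be found. A lift above 1 means the antecedent raises the “Yes” rate relative to the baseline.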

transactions_data <- as(arm_data, "transactions")

summary(transactions_data)
## transactions as itemMatrix in sparse format with
##  968 rows (elements/itemsets/transactions) and
##  23 columns (items) and a density of 0.3478261 
## 
## most frequent items:
##           self_employed=No         obs_consequence=No 
##                        848                        807 
##           tech_company=Yes phys_health_consequence=No 
##                        792                        695 
##             remote_work=No                    (Other) 
##                        674                       3928 
## 
## element (itemset/transaction) length distribution:
## sizes
##   8 
## 968 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       8       8       8       8       8       8 
## 
## includes extended item information - examples:
##                  labels      variables levels
## 1  work_interfere=Never work_interfere  Never
## 2  work_interfere=Often work_interfere  Often
## 3 work_interfere=Rarely work_interfere Rarely
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
inspect(head(transactions_data, 5))
##     items                         transactionID
## [1] {work_interfere=Sometimes,                 
##      remote_work=Yes,                          
##      self_employed=Yes,                        
##      tech_company=Yes,                         
##      leave=Very easy,                          
##      phys_health_consequence=No,               
##      coworkers=Yes,                            
##      obs_consequence=Yes}                     1
## [2] {work_interfere=Sometimes,                 
##      remote_work=No,                           
##      self_employed=No,                         
##      tech_company=Yes,                         
##      leave=Somewhat difficult,                 
##      phys_health_consequence=No,               
##      coworkers=Some of them,                   
##      obs_consequence=No}                      2
## [3] {work_interfere=Never,                     
##      remote_work=Yes,                          
##      self_employed=Yes,                        
##      tech_company=Yes,                         
##      leave=Somewhat difficult,                 
##      phys_health_consequence=No,               
##      coworkers=Some of them,                   
##      obs_consequence=No}                      3
## [4] {work_interfere=Often,                     
##      remote_work=Yes,                          
##      self_employed=No,                         
##      tech_company=Yes,                         
##      leave=Don't know,                         
##      phys_health_consequence=No,               
##      coworkers=Some of them,                   
##      obs_consequence=No}                      4
## [5] {work_interfere=Never,                     
##      remote_work=No,                           
##      self_employed=No,                         
##      tech_company=No,                          
##      leave=Don't know,                         
##      phys_health_consequence=No,               
##      coworkers=No,                             
##      obs_consequence=No}                      5

Interpretation: The dataset was converted into transactions so that association rules can be generated using the Apriori algorithm.
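
As a quick sanity check on the summary above, the item count and density can be reproduced from the factor level counts reported by str(arm_data). This is a small base R sketch; the vector `levels_per_variable` is typed in by hand from that output rather than computed from the data, and its name is chosen here for illustration.

```r
# Number of factor levels per variable, copied from the str(arm_data) output
levels_per_variable <- c(
  work_interfere          = 4,
  remote_work             = 2,
  self_employed           = 2,
  tech_company            = 2,
  leave                   = 5,
  phys_health_consequence = 3,
  coworkers               = 3,
  obs_consequence         = 2
)

# Each "variable=level" pair becomes one item (one column of the item matrix)
n_items <- sum(levels_per_variable)

# Every respondent contributes exactly one level per variable
items_per_transaction <- length(levels_per_variable)

# Density is the share of filled cells in the transaction-by-item matrix
density <- items_per_transaction / n_items

n_items
items_per_transaction
round(density, 7)
```

The three values agree with the 23 items, the fixed transaction length of 8, and the density of 0.3478261 reported by summary(transactions_data).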

rules <- apriori(
  transactions_data,
  parameter = list(
    supp = 0.005,
    conf = 0.30,
    minlen = 2
  ),
  appearance = list(
    rhs = "obs_consequence=Yes",
    default = "lhs"
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[23 item(s), 968 transaction(s)] done [0.00s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
## writing ... [338 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
length(rules)
## [1] 338
summary(rules)
## set of 338 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5   6   7 
##   3  32 104 123  63  13 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    4.00    5.00    4.74    5.00    7.00 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.005165   Min.   :0.3000   Min.   :0.006198   Min.   :1.804  
##  1st Qu.:0.006198   1st Qu.:0.3403   1st Qu.:0.013430   1st Qu.:2.046  
##  Median :0.007231   Median :0.3889   Median :0.018595   Median :2.338  
##  Mean   :0.009713   Mean   :0.4273   Mean   :0.025096   Mean   :2.569  
##  3rd Qu.:0.011364   3rd Qu.:0.5000   3rd Qu.:0.029959   3rd Qu.:3.006  
##  Max.   :0.036157   Max.   :1.0000   Max.   :0.109504   Max.   :6.012  
##      count       
##  Min.   : 5.000  
##  1st Qu.: 6.000  
##  Median : 7.000  
##  Mean   : 9.402  
##  3rd Qu.:11.000  
##  Max.   :35.000  
## 
## mining info:
##               data ntransactions support confidence
##  transactions_data           968   0.005        0.3
##                                                                                                                                                        call
##  apriori(data = transactions_data, parameter = list(supp = 0.005, conf = 0.3, minlen = 2), appearance = list(rhs = "obs_consequence=Yes", default = "lhs"))

Interpretation: The Apriori algorithm generated 338 association rules from 968 transactions. The support threshold was set to 0.005 and the confidence threshold to 0.30 because higher thresholds produced no usable rules. Support ranges from 0.0052 to 0.0362 (5 to 35 respondents per rule) and confidence from 0.30 to 1.00, so many rules rest on only a handful of respondents and should be read with caution.

The lift values range from 1.804 to 6.012. Because every rule has the same consequent, lift equals confidence divided by the base rate of obs_consequence=Yes (161/968 ≈ 0.166), so the 0.30 confidence cutoff by itself guarantees a lift of at least 1.804. Lift above 1 therefore shows that the retained antecedents co-occur with obs_consequence=Yes more often than the base rate, but it is a consequence of the filter rather than independent evidence. Most rules contain 4 or 5 items, suggesting that depression outcomes are associated with combinations of workplace and health-related factors rather than a single variable.
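
The range of lift values can be reproduced by hand, which shows how it depends on the thresholds chosen. The sketch below uses only numbers printed in the summaries above (968 transactions, 807 of them with obs_consequence=No); the variable names are introduced here for illustration.

```r
n_transactions <- 968          # from summary(transactions_data)
n_rhs_yes      <- 968 - 807    # obs_consequence=No occurs 807 times, so Yes occurs 161 times

# Base rate (support) of the consequent obs_consequence=Yes
rhs_support <- n_rhs_yes / n_transactions

# Lift = confidence / support(rhs), so the confidence cutoff implies a lift floor
min_conf     <- 0.30
lift_floor   <- min_conf / rhs_support
lift_ceiling <- 1.00 / rhs_support     # reached when confidence is 1

# Smallest whole number of transactions that satisfies supp = 0.005
min_count <- ceiling(0.005 * n_transactions)

round(c(rhs_support = rhs_support,
        lift_floor = lift_floor,
        lift_ceiling = lift_ceiling), 3)
min_count
```

The floor and ceiling match the minimum (1.804) and maximum (6.012) lift in summary(rules), and the minimum count matches the smallest count of 5.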

Overall, the generated rules point to associations between workplace conditions, physical health consequences, and depression outcomes, although the low support of most rules means these patterns are suggestive rather than conclusive.

inspect(head(sort(rules, by = "support", decreasing = TRUE), 5))
##     lhs                                rhs                      support confidence   coverage     lift count
## [1] {leave=Somewhat difficult}      => {obs_consequence=Yes} 0.03615702  0.3301887 0.10950413 1.985234    35
## [2] {leave=Very difficult}          => {obs_consequence=Yes} 0.03305785  0.3595506 0.09194215 2.161770    32
## [3] {work_interfere=Sometimes,                                                                              
##      self_employed=No,                                                                                      
##      phys_health_consequence=Maybe} => {obs_consequence=Yes} 0.03099174  0.3030303 0.10227273 1.821946    30
## [4] {self_employed=No,                                                                                      
##      leave=Somewhat difficult}      => {obs_consequence=Yes} 0.02995868  0.3452381 0.08677686 2.075717    29
## [5] {tech_company=Yes,                                                                                      
##      leave=Somewhat difficult}      => {obs_consequence=Yes} 0.02685950  0.3170732 0.08471074 1.906378    26
inspect(head(sort(rules, by = "confidence", decreasing = TRUE), 5))
##     lhs                              rhs                       support confidence    coverage     lift count
## [1] {work_interfere=Rarely,                                                                                 
##      tech_company=Yes,                                                                                      
##      phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347  1.0000000 0.006198347 6.012422     6
## [2] {work_interfere=Rarely,                                                                                 
##      self_employed=No,                                                                                      
##      tech_company=Yes,                                                                                      
##      phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347  1.0000000 0.006198347 6.012422     6
## [3] {work_interfere=Rarely,                                                                                 
##      phys_health_consequence=Yes} => {obs_consequence=Yes} 0.007231405  0.8750000 0.008264463 5.260870     7
## [4] {work_interfere=Rarely,                                                                                 
##      self_employed=No,                                                                                      
##      phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347  0.8571429 0.007231405 5.153505     6
## [5] {work_interfere=Rarely,                                                                                 
##      remote_work=No,                                                                                        
##      phys_health_consequence=Yes} => {obs_consequence=Yes} 0.005165289  0.8333333 0.006198347 5.010352     5
inspect(head(sort(rules, by = "lift", decreasing = TRUE), 5))
##     lhs                              rhs                       support confidence    coverage     lift count
## [1] {work_interfere=Rarely,                                                                                 
##      tech_company=Yes,                                                                                      
##      phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347  1.0000000 0.006198347 6.012422     6
## [2] {work_interfere=Rarely,                                                                                 
##      self_employed=No,                                                                                      
##      tech_company=Yes,                                                                                      
##      phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347  1.0000000 0.006198347 6.012422     6
## [3] {work_interfere=Rarely,                                                                                 
##      phys_health_consequence=Yes} => {obs_consequence=Yes} 0.007231405  0.8750000 0.008264463 5.260870     7
## [4] {work_interfere=Rarely,                                                                                 
##      self_employed=No,                                                                                      
##      phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347  0.8571429 0.007231405 5.153505     6
## [5] {work_interfere=Rarely,                                                                                 
##      remote_work=No,                                                                                        
##      phys_health_consequence=Yes} => {obs_consequence=Yes} 0.005165289  0.8333333 0.006198347 5.010352     5

Final Conclusion: Association Rule Mining was performed using the Apriori algorithm to identify workplace and health-related patterns associated with depression outcomes. The consequent was restricted to obs_consequence=Yes, so every generated rule describes conditions associated with that outcome.

The support threshold was set to 0.005 and the confidence threshold to 0.30 because higher thresholds produced no usable rules. With these values, 338 association rules were generated from 968 transactions. The lift values ranged from 1.804 to 6.012. Since the base rate of obs_consequence=Yes is 161/968 ≈ 0.166, any rule that passes the 0.30 confidence cutoff has a lift of at least 1.804, so the lift values are best used to rank rules rather than as proof of an association.

The top rules by support show that difficult leave conditions are the most common antecedents. For example, the rule {leave=Somewhat difficult} => {obs_consequence=Yes} has the highest support value of 0.0362 (35 of 968 respondents) with a confidence of 0.33, meaning about one in three respondents who reported somewhat difficult leave also reported obs_consequence=Yes. The other highly supported rules combine leave difficulty with self_employed=No or tech_company=Yes, or combine work_interfere=Sometimes with self_employed=No and phys_health_consequence=Maybe.

The top rules by confidence have high confidence but very low support. The rule {work_interfere=Rarely, tech_company=Yes, phys_health_consequence=Yes} => {obs_consequence=Yes} achieved a confidence of 1.00 and a lift of 6.01, meaning that all 6 respondents who satisfied these conditions had obs_consequence=Yes. The remaining top rules, which add self_employed=No or remote_work=No to work_interfere=Rarely and phys_health_consequence=Yes, have confidence values between 0.83 and 1.00 but are each based on only 5 to 7 respondents, so they may not generalize beyond this sample.
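
To show how much uncertainty sits behind a confidence value computed from a few respondents, an exact binomial interval can be placed around it using only the counts printed above. This is a rough sketch, not part of the original analysis: it assumes each rule's matching respondents can be treated as an independent binomial sample, it makes no correction for the 338 rules that were searched, and `confidence_interval` is a hypothetical helper name.

```r
# Exact (Clopper-Pearson) 95% interval for a rule confidence, given how many
# respondents match the lhs and how many of those also match the rhs
confidence_interval <- function(rhs_count, lhs_count) {
  ci <- binom.test(rhs_count, lhs_count)$conf.int
  c(confidence = rhs_count / lhs_count, lower = ci[1], upper = ci[2])
}

# Top rule by confidence: 6 of 6 respondents
top_confidence <- confidence_interval(6, 6)

# Top rule by support: 35 of 106 respondents (coverage 0.1095 * 968 = 106)
top_support <- confidence_interval(35, 106)

round(rbind(top_confidence, top_support), 3)
```

For the 6-of-6 rule the lower bound is roughly 0.54, so the data are compatible with a true confidence well below 1.00.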

The top rules by lift are the same as the top rules by confidence, because lift is confidence divided by a constant base rate when the consequent is fixed. The highest lift value of 6.01 corresponds to the 6-respondent rule with confidence 1.00. Physical health consequences appear in all of the strongest rules, suggesting that physical and mental health are closely connected.

Overall, the association rule analysis is consistent with the findings from Q1 and Q2. Leave difficulty, work interference, and physical health consequences appear most prominently in the top rules; remote work appears in one of them, and coworker support does not appear among the top rules shown. Because most rules are supported by fewer than a dozen respondents, the rules describe patterns in this sample and should not be taken as reliable predictors on their own.

Question-4: Based on the exploration you did in the previous three questions, pick at least five variables that you think can predict incidence of depression. Pick at least three classification techniques and use the appropriate cross validation techniques. Provide the values of Precision, Recall, Overall accuracy and AUC. Which technique performs the best?

library(dplyr)
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(rpart)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
model_data <- Mental_Health_Survey %>%
  select(obs_consequence,
         work_interfere,
         leave,
         phys_health_consequence,
         coworkers,
         remote_work,
         self_employed,
         tech_company)

head(model_data)
str(model_data)
## 'data.frame':    968 obs. of  8 variables:
##  $ obs_consequence        : chr  "Yes" "No" "No" "No" ...
##  $ work_interfere         : chr  "Sometimes" "Sometimes" "Never" "Often" ...
##  $ leave                  : chr  "Very easy" "Somewhat difficult" "Somewhat difficult" "Don't know" ...
##  $ phys_health_consequence: chr  "No" "No" "No" "No" ...
##  $ coworkers              : chr  "Yes" "Some of them" "Some of them" "Some of them" ...
##  $ remote_work            : chr  "Yes" "No" "Yes" "Yes" ...
##  $ self_employed          : chr  "Yes" "No" "Yes" "No" ...
##  $ tech_company           : chr  "Yes" "Yes" "Yes" "Yes" ...

Interpretation: These variables were selected based on Q1–Q3 because they were related to depression outcomes.

model_data[] <- lapply(model_data, as.factor)

model_data$obs_consequence <- factor(model_data$obs_consequence,
                                     levels = c("No", "Yes"))

str(model_data)
## 'data.frame':    968 obs. of  8 variables:
##  $ obs_consequence        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ work_interfere         : Factor w/ 4 levels "Never","Often",..: 4 4 1 2 1 3 4 3 4 4 ...
##  $ leave                  : Factor w/ 5 levels "Don't know","Somewhat difficult",..: 5 2 2 1 1 1 5 1 2 4 ...
##  $ phys_health_consequence: Factor w/ 3 levels "Maybe","No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ coworkers              : Factor w/ 3 levels "No","Some of them",..: 3 2 2 2 1 3 2 3 2 2 ...
##  $ remote_work            : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 1 2 2 1 ...
##  $ self_employed          : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ tech_company           : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 2 2 ...
table(model_data$obs_consequence)
## 
##  No Yes 
## 807 161
set.seed(123)

train_index <- createDataPartition(model_data$obs_consequence,
                                   p = 0.70,
                                   list = FALSE)

train_data <- model_data[train_index, ]
test_data  <- model_data[-train_index, ]

nrow(train_data)
## [1] 678
nrow(test_data)
## [1] 290

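Interpretation: A 70/30 train-test split was used. createDataPartition samples within each class, so the training and test sets keep roughly the same Yes/No proportions as the full data (about 17% Yes). The classes themselves remain imbalanced.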
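
The effect of the stratified split can be checked from counts reported in this section. The sketch below is a hand calculation in base R: the full-data counts come from table(model_data$obs_consequence), the test counts are the column totals of the test-set confusion matrices reported further down, and the training counts are obtained by subtraction.

```r
# Class counts taken from the outputs in this section
full_counts  <- c(No = 807, Yes = 161)   # table(model_data$obs_consequence)
test_counts  <- c(No = 242, Yes = 48)    # column totals of the test-set confusion matrices
train_counts <- full_counts - test_counts

# Share of each class in the full, training and test data
proportions <- rbind(
  full  = full_counts  / sum(full_counts),
  train = train_counts / sum(train_counts),
  test  = test_counts  / sum(test_counts)
)

sum(train_counts)
round(proportions, 3)
```

The Yes share stays close to 0.166 in all three rows, which is what stratification is meant to achieve; it does not make the two classes equal in size.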

control <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = TRUE
)

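Interpretation: 10-fold cross validation on the training set was used to estimate out-of-sample performance and to choose tuning parameters (by ROC). Cross validation does not prevent overfitting by itself, but it helps detect it because each fold is scored on data the model was not fitted to.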
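
To make the resampling scheme concrete, here is a base R sketch of how ten stratified folds might be assigned. It is only an illustration: caret builds its own folds internally, the outcome vector is a stand-in with the training-set class counts implied by the outputs in this section (565 No, 113 Yes), and `fold` and `fold_table` are hypothetical names.

```r
set.seed(123)

# Stand-in outcome vector with the training-set class counts (assumed: 565 No, 113 Yes)
y <- factor(rep(c("No", "Yes"), times = c(565, 113)))
k <- 10

# Assign fold labels within each class so every fold keeps a similar Yes share
fold <- integer(length(y))
for (lv in levels(y)) {
  idx <- which(y == lv)
  fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# Each fold is held out once; the other nine folds are used for fitting
fold_table <- table(fold, y)
fold_table

# Size of the fitting set when each fold is held out
length(y) - rowSums(fold_table)
```

Each held-out fold contains only 11 or 12 Yes cases, which is one reason the cross-validated Spec values (the detection rate for Yes in the caret output, where No is the reference level) are noisy.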

set.seed(123)

log_model <- train(
  obs_consequence ~ .,
  data = train_data,
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = control
)

log_model
## Generalized Linear Model 
## 
## 678 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 610, 610, 611, 611, 610, 611, ... 
## Resampling results:
## 
##   ROC        Sens       Spec      
##   0.6971043  0.9733396  0.09772727
log_pred <- predict(log_model, test_data)
log_prob <- predict(log_model, test_data, type = "prob")

log_cm <- confusionMatrix(log_pred,
                          test_data$obs_consequence,
                          positive = "Yes")

log_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  239  45
##        Yes   3   3
##                                           
##                Accuracy : 0.8345          
##                  95% CI : (0.7866, 0.8754)
##     No Information Rate : 0.8345          
##     P-Value [Acc > NIR] : 0.5384          
##                                           
##                   Kappa : 0.0772          
##                                           
##  Mcnemar's Test P-Value : 3.262e-09       
##                                           
##             Sensitivity : 0.06250         
##             Specificity : 0.98760         
##          Pos Pred Value : 0.50000         
##          Neg Pred Value : 0.84155         
##              Prevalence : 0.16552         
##          Detection Rate : 0.01034         
##    Detection Prevalence : 0.02069         
##       Balanced Accuracy : 0.52505         
##                                           
##        'Positive' Class : Yes             
## 
log_precision <- log_cm$byClass["Precision"]
log_recall <- log_cm$byClass["Recall"]
log_accuracy <- log_cm$overall["Accuracy"]

log_auc <- auc(roc(test_data$obs_consequence,
                   log_prob$Yes,
                   levels = c("No", "Yes"),
                   direction = "<"))

log_precision
## Precision 
##       0.5
log_recall
## Recall 
## 0.0625
log_accuracy
##  Accuracy 
## 0.8344828
log_auc
## Area under the curve: 0.6775
set.seed(123)

tree_model <- train(
  obs_consequence ~ .,
  data = train_data,
  method = "rpart",
  metric = "ROC",
  trControl = control
)

tree_model
## CART 
## 
## 678 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 610, 610, 611, 611, 610, 611, ... 
## Resampling results across tuning parameters:
## 
##   cp           ROC        Sens       Spec      
##   0.000000000  0.5969576  0.9663221  0.06212121
##   0.004424779  0.5942947  0.9768797  0.05378788
##   0.017699115  0.5239462  0.9982456  0.00000000
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
tree_pred <- predict(tree_model, test_data)
tree_prob <- predict(tree_model, test_data, type = "prob")

tree_cm <- confusionMatrix(tree_pred,
                           test_data$obs_consequence,
                           positive = "Yes")

tree_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  232  45
##        Yes  10   3
##                                           
##                Accuracy : 0.8103          
##                  95% CI : (0.7604, 0.8538)
##     No Information Rate : 0.8345          
##     P-Value [Acc > NIR] : 0.8808          
##                                           
##                   Kappa : 0.0299          
##                                           
##  Mcnemar's Test P-Value : 4.549e-06       
##                                           
##             Sensitivity : 0.06250         
##             Specificity : 0.95868         
##          Pos Pred Value : 0.23077         
##          Neg Pred Value : 0.83755         
##              Prevalence : 0.16552         
##          Detection Rate : 0.01034         
##    Detection Prevalence : 0.04483         
##       Balanced Accuracy : 0.51059         
##                                           
##        'Positive' Class : Yes             
## 
tree_precision <- tree_cm$byClass["Precision"]
tree_recall <- tree_cm$byClass["Recall"]
tree_accuracy <- tree_cm$overall["Accuracy"]

tree_auc <- auc(roc(test_data$obs_consequence,
                    tree_prob$Yes,
                    levels = c("No", "Yes"),
                    direction = "<"))

tree_precision
## Precision 
## 0.2307692
tree_recall
## Recall 
## 0.0625
tree_accuracy
##  Accuracy 
## 0.8103448
tree_auc
## Area under the curve: 0.5331
set.seed(123)

rf_model <- train(
  obs_consequence ~ .,
  data = train_data,
  method = "rf",
  metric = "ROC",
  trControl = control
)

rf_model
## Random Forest 
## 
## 678 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 610, 610, 611, 611, 610, 611, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.6657371  1.0000000  0.0000000
##    8    0.6568033  0.9326754  0.1416667
##   14    0.6538885  0.9097118  0.1840909
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
rf_pred <- predict(rf_model, test_data)
rf_prob <- predict(rf_model, test_data, type = "prob")

rf_cm <- confusionMatrix(rf_pred,
                         test_data$obs_consequence,
                         positive = "Yes")

rf_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  242  48
##        Yes   0   0
##                                           
##                Accuracy : 0.8345          
##                  95% CI : (0.7866, 0.8754)
##     No Information Rate : 0.8345          
##     P-Value [Acc > NIR] : 0.5384          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.17e-11        
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8345          
##              Prevalence : 0.1655          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Yes             
## 
rf_precision <- rf_cm$byClass["Precision"]
rf_recall <- rf_cm$byClass["Recall"]
rf_accuracy <- rf_cm$overall["Accuracy"]

rf_auc <- auc(roc(test_data$obs_consequence,
                  rf_prob$Yes,
                  levels = c("No", "Yes"),
                  direction = "<"))

rf_precision
## Precision 
##        NA
rf_recall
## Recall 
##      0
rf_accuracy
##  Accuracy 
## 0.8344828
rf_auc
## Area under the curve: 0.6002
model_comparison <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "Random Forest"),
  Precision = c(log_precision, tree_precision, rf_precision),
  Recall = c(log_recall, tree_recall, rf_recall),
  Accuracy = c(log_accuracy, tree_accuracy, rf_accuracy),
  AUC = c(as.numeric(log_auc),
          as.numeric(tree_auc),
          as.numeric(rf_auc))
)

model_comparison

Final Conclusion: Three classification techniques were used to predict depression incidence: Logistic Regression, Decision Tree, and Random Forest. The predictor variables were selected based on findings from Q1–Q3, where work interference, leave difficulty, physical health consequences, coworker support, remote work, self-employment, and tech company status showed relationships with depression outcomes. The data were split 70/30; 10-fold cross validation on the training set was used for tuning, and the metrics below were computed on the 30% test set (290 respondents, 48 of them Yes).

The Logistic Regression model achieved a Precision of 0.50, Recall of 0.0625, Overall Accuracy of 0.8345, and an AUC of 0.6775. It correctly identified only 3 of the 48 positive cases, and its accuracy equals the No Information Rate (0.8345), so it is no more accurate than always predicting "No". It does, however, have the highest AUC among the three models, which indicates the best ranking of cases by predicted probability.

The Decision Tree model achieved a Precision of 0.2308, Recall of 0.0625, Overall Accuracy of 0.8103, and an AUC of 0.5331. It also identified 3 of the 48 positive cases, but with more false positives (10), an accuracy below the No Information Rate, and an AUC close to 0.5, so its predictive performance was weaker than that of Logistic Regression.

The Random Forest model achieved an Overall Accuracy of 0.8345 and an AUC of 0.6002. However, the model failed to correctly classify any positive depression cases, resulting in a Recall of 0 and undefined Precision (NA). This indicates that the model was biased toward predicting the majority class (“No”).
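
The reported metrics can be recomputed directly from the three test-set confusion matrices, which also makes the comparison with a majority-class baseline explicit. This is a base R sketch using only counts printed above; `classification_metrics` is a hypothetical helper, not a caret function.

```r
# Precision, recall and accuracy from raw confusion-matrix counts ("Yes" is the positive class)
classification_metrics <- function(tp, fp, fn, tn) {
  c(
    precision = tp / (tp + fp),          # NaN when the model never predicts "Yes"
    recall    = tp / (tp + fn),
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
  )
}

# Counts copied from the three test-set confusion matrices above
results <- rbind(
  logistic_regression = classification_metrics(tp = 3, fp = 3,  fn = 45, tn = 239),
  decision_tree       = classification_metrics(tp = 3, fp = 10, fn = 45, tn = 232),
  random_forest       = classification_metrics(tp = 0, fp = 0,  fn = 48, tn = 242)
)

# Accuracy of always predicting "No" (the No Information Rate)
no_information_rate <- 242 / 290

round(results, 4)
round(no_information_rate, 4)
```

None of the three accuracies exceeds the baseline of 242/290, which is why accuracy alone is not a useful criterion for this imbalanced outcome.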

Based on the comparison of Precision, Recall, Overall Accuracy, and AUC, the Logistic Regression model performs best overall because it achieved the highest AUC (0.6775) and the highest Precision (0.50) among the three techniques. Its accuracy (0.8345) only matches the No Information Rate, however, and all three models have very low Recall (at most 0.0625) because of class imbalance in the dataset (161 Yes vs. 807 No). Logistic Regression is therefore the best of the three, but none of the models identifies positive cases well at the default 0.5 probability cutoff.
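
One common way to trade precision for recall on an imbalanced outcome is to lower the probability cutoff used to call a case "Yes". The sketch below illustrates the idea on simulated data only: the predictor `x`, the coefficients, and the cutoffs are all assumptions made for this example, and nothing here is computed from the survey or from the models fitted above.

```r
set.seed(123)

# Simulated stand-in data (NOT the survey): one predictor, a minority positive class
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1.9 + 0.8 * x))

# Plain logistic regression and its fitted probabilities
fit  <- glm(y ~ x, family = binomial)
prob <- predict(fit, type = "response")

# Recall and precision for a given probability cutoff
metrics_at <- function(cutoff) {
  pred <- as.integer(prob >= cutoff)
  tp <- sum(pred == 1 & y == 1)
  c(cutoff    = cutoff,
    recall    = tp / sum(y == 1),
    precision = tp / sum(pred == 1))   # NaN if no case is predicted positive
}

threshold_table <- t(sapply(c(0.5, 0.3, 0.2), metrics_at))
threshold_table
```

Lowering the cutoff can only add predicted positives, so recall never decreases as the cutoff drops, while precision typically falls. Whether that trade is worthwhile for the survey models would need to be checked on the actual test set.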

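Overall, the classification results give only modest support to the findings from Q1–Q3. Workplace stress, leave difficulty, physical health consequences, and coworker-related support carry some predictive signal (test AUC up to 0.68), but on their own they do not separate the Yes and No classes well enough to predict depression incidence reliably.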