Question-1: Read the data into a variable named “Mental_Health_Survey”. Remove all the rows that contain null values in any of the variables (except “Comments” and “State”). Perform all the data cleaning required. For example, you might want to (a) convert any categorical variables to numeric values.
Answer the following questions using either a simple R-command or an appropriate visualization:
Which state in the United States seems to have employees with the most diagnosed cases of depression? Which state has the least?
What is the relationship among Self-employment, working for a tech company, ability to get a leave and how does it relate to depression for the employee? Also, how do physical health consequences play a role in incidence of depression? You might use a combination of statistics and visualizations to answer this question.
How do work interference and remote work option relate to the incidence of depression? What role do coworkers play in this case?
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
Mental_Health_Survey <- read.csv("C:/Users/PALLAVI/Downloads/BML/survey.csv",
                                 stringsAsFactors = FALSE)
head(Mental_Health_Survey)
View(Mental_Health_Survey)
str(Mental_Health_Survey)
## 'data.frame': 1259 obs. of 19 variables:
## $ Age : num 37 44 32 31 31 33 35 39 42 23 ...
## $ Gender : chr "Female" "M" "Male" "Male" ...
## $ Country : chr "United States" "United States" "Canada" "United Kingdom" ...
## $ state : chr "IL" "IN" NA NA ...
## $ self_employed : chr NA NA NA NA ...
## $ family_history : chr "No" "No" "No" "Yes" ...
## $ treatment : chr "Yes" "No" "No" "Yes" ...
## $ work_interfere : chr "Often" "Rarely" "Rarely" "Often" ...
## $ remote_work : chr "No" "No" "No" "No" ...
## $ tech_company : chr "Yes" "No" "Yes" "Yes" ...
## $ benefits : chr "Yes" "Don't know" "No" "No" ...
## $ care_options : chr "Not sure" "No" "No" "Yes" ...
## $ wellness_program : chr "No" "Don't know" "No" "No" ...
## $ seek_help : chr "Yes" "Don't know" "No" "No" ...
## $ leave : chr "Somewhat easy" "Don't know" "Somewhat difficult" "Somewhat difficult" ...
## $ phys_health_consequence: chr "No" "No" "No" "Yes" ...
## $ coworkers : chr "Some of them" "No" "Yes" "Some of them" ...
## $ obs_consequence : chr "No" "No" "No" "Yes" ...
## $ comments : chr NA NA NA NA ...
summary(Mental_Health_Survey)
## Age Gender Country state
## Min. :-1.726e+03 Length:1259 Length:1259 Length:1259
## 1st Qu.: 2.700e+01 Class :character Class :character Class :character
## Median : 3.100e+01 Mode :character Mode :character Mode :character
## Mean : 7.943e+07
## 3rd Qu.: 3.600e+01
## Max. : 1.000e+11
## self_employed family_history treatment work_interfere
## Length:1259 Length:1259 Length:1259 Length:1259
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## remote_work tech_company benefits care_options
## Length:1259 Length:1259 Length:1259 Length:1259
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## wellness_program seek_help leave
## Length:1259 Length:1259 Length:1259
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## phys_health_consequence coworkers obs_consequence
## Length:1259 Length:1259 Length:1259
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## comments
## Length:1259
## Class :character
## Mode :character
##
##
##
Interpretation: The dataset was loaded into a variable named Mental_Health_Survey. The head(), View(), str(), and summary() commands were used to understand the structure, variable types, and initial data quality issues.
colSums(is.na(Mental_Health_Survey))
## Age Gender Country
## 0 0 0
## state self_employed family_history
## 515 18 0
## treatment work_interfere remote_work
## 0 264 0
## tech_company benefits care_options
## 0 0 0
## wellness_program seek_help leave
## 0 0 0
## phys_health_consequence coworkers obs_consequence
## 0 0 0
## comments
## 1095
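Interpretation: Before cleaning, missing values occur in state (515), self_employed (18), work_interfere (264), and comments (1095). The question requires removing rows with null values in any variable except comments and state, so only the gaps in self_employed and work_interfere lead to dropped rows.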
Mental_Health_Survey <- Mental_Health_Survey %>%
  filter(complete.cases(select(., -comments, -state)))
colSums(is.na(Mental_Health_Survey))
## Age Gender Country
## 0 0 0
## state self_employed family_history
## 386 0 0
## treatment work_interfere remote_work
## 0 0 0
## tech_company benefits care_options
## 0 0 0
## wellness_program seek_help leave
## 0 0 0
## phys_health_consequence coworkers obs_consequence
## 0 0 0
## comments
## 837
Interpretation: Rows with missing values in any variable other than comments and state were removed, as required by the question. This leaves 977 rows, and the remaining NAs are confined to state (386) and comments (837).
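The row filter used above can be illustrated on a small made-up data frame. The sketch below uses base R only, and the toy values and the names `toy`, `exempt`, and `toy_clean` are hypothetical, not taken from the survey. It keeps every row that is complete in all columns except the two exempt ones.

```r
# Toy data (hypothetical values) with NAs in exempt and non-exempt columns
toy <- data.frame(
  Age           = c(30, 41, NA, 25),
  self_employed = c("No", NA, "Yes", "No"),
  state         = c("CA", NA, "NY", NA),
  comments      = c(NA, "text", NA, NA),
  stringsAsFactors = FALSE
)

# Columns in which an NA is tolerated
exempt <- c("state", "comments")

# TRUE for rows with no NA in any non-exempt column
keep <- complete.cases(toy[, setdiff(names(toy), exempt)])

# Rows 2 and 3 are dropped; row 4 stays even though its state is NA
toy_clean <- toy[keep, ]
print(toy_clean)
```

Comparing `nrow()` before and after the filter is a quick way to confirm how many rows were dropped.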
Mental_Health_Survey$Gender <- tolower(Mental_Health_Survey$Gender)
Mental_Health_Survey$Gender <- trimws(Mental_Health_Survey$Gender)
# Values are already lowercased and trimmed, so entries with trailing spaces are not needed
Mental_Health_Survey$Gender <- ifelse(
  Mental_Health_Survey$Gender %in% c("male", "m", "man", "cis male", "male-ish",
                                     "maile", "mal", "male (cis)", "make",
                                     "msle", "cis man"),
  "Male",
  ifelse(
    Mental_Health_Survey$Gender %in% c("female", "f", "woman", "cis female",
                                       "femake", "female (cis)",
                                       "cis-female/femme"),
    "Female",
    "Other"
  )
)
Mental_Health_Survey$Gender <- as.factor(Mental_Health_Survey$Gender)
table(Mental_Health_Survey$Gender)
##
## Female Male Other
## 205 749 23
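Interpretation: The Gender column contained inconsistent free-text entries such as “Male”, “M”, and “m”, along with misspellings. After lowercasing and trimming whitespace, these were standardized into Male, Female, and Other, where Other holds every response not matched by the two lists.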
summary(Mental_Health_Survey$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.726e+03 2.700e+01 3.100e+01 1.024e+08 3.600e+01 1.000e+11
Mental_Health_Survey <- Mental_Health_Survey %>%
  filter(Age >= 18 & Age <= 100)
summary(Mental_Health_Survey$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 27.00 31.00 32.31 36.00 72.00
Interpretation: Invalid age values were removed. Only realistic employee ages between 18 and 100 were kept.
Mental_Health_Survey <- distinct(Mental_Health_Survey)
nrow(Mental_Health_Survey)
## [1] 968
Interpretation: Duplicate rows were removed to avoid repeated records affecting the analysis.
Mental_Health_Survey_numeric <- Mental_Health_Survey
categorical_cols <- names(Mental_Health_Survey_numeric)[
  sapply(Mental_Health_Survey_numeric, is.character) |
    sapply(Mental_Health_Survey_numeric, is.factor)
]
categorical_cols <- setdiff(categorical_cols, c("comments", "state"))
Mental_Health_Survey_numeric[categorical_cols] <- lapply(
  Mental_Health_Survey_numeric[categorical_cols],
  function(x) as.numeric(as.factor(x))
)
head(Mental_Health_Survey_numeric)
str(Mental_Health_Survey_numeric)
## 'data.frame': 968 obs. of 19 variables:
## $ Age : num 46 29 31 46 41 33 35 35 34 37 ...
## $ Gender : num 2 2 2 2 2 2 2 1 2 2 ...
## $ Country : num 38 38 38 38 38 38 38 38 38 37 ...
## $ state : chr "MD" "NY" "NC" "MA" ...
## $ self_employed : num 2 1 2 1 1 1 1 1 1 1 ...
## $ family_history : num 2 2 1 1 1 2 2 2 1 1 ...
## $ treatment : num 1 2 1 2 2 2 2 2 2 1 ...
## $ work_interfere : num 4 4 1 2 1 3 4 3 4 4 ...
## $ remote_work : num 2 1 2 2 1 1 1 2 2 1 ...
## $ tech_company : num 2 2 2 2 1 2 1 2 2 2 ...
## $ benefits : num 3 3 2 3 1 3 3 3 1 2 ...
## $ care_options : num 2 3 1 3 1 2 3 3 2 1 ...
## $ wellness_program : num 3 2 2 2 2 1 2 1 2 2 ...
## $ seek_help : num 1 2 2 2 1 3 1 1 1 2 ...
## $ leave : num 5 2 2 1 1 1 5 1 2 4 ...
## $ phys_health_consequence: num 2 2 2 2 2 2 2 2 2 1 ...
## $ coworkers : num 3 2 2 2 1 3 2 3 2 2 ...
## $ obs_consequence : num 2 1 1 1 1 1 1 1 1 1 ...
## $ comments : chr NA NA NA NA ...
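Interpretation: Categorical variables were converted into numeric values so they can be used for statistical analysis and modeling. The original dataset is kept for interpretation, while the numeric version can be used later if needed. Note that as.numeric(as.factor(x)) assigns codes in alphabetical order of the labels (for leave: Don't know = 1, Somewhat difficult = 2, Somewhat easy = 3, Very difficult = 4, Very easy = 5), so the codes identify categories but do not form an ordered scale.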
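For ordinal answers such as leave, one alternative is to state the level order explicitly instead of relying on alphabetical codes. The sketch below is a hedged illustration in base R, not the encoding used in this report. The ordering in `leave_levels` and the choice to map "Don't know" to NA are assumptions made for the example.

```r
# One example response per label, as the labels appear in the survey output
leave <- c("Very easy", "Somewhat difficult", "Don't know",
           "Very difficult", "Somewhat easy")

# Default coding: levels sorted alphabetically
alphabetical <- as.numeric(as.factor(leave))

# Assumed ordering from easiest to hardest; "Don't know" is not listed,
# so it becomes NA (one possible choice, not the only one)
leave_levels <- c("Very easy", "Somewhat easy",
                  "Somewhat difficult", "Very difficult")
ordered_code <- as.integer(factor(leave, levels = leave_levels, ordered = TRUE))

print(data.frame(leave, alphabetical, ordered_code))
```

With the explicit ordering, a larger code always means leave is harder to take, which makes means and distances easier to interpret.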
us_state_depression <- Mental_Health_Survey %>%
  filter(Country == "United States", !is.na(state), obs_consequence == "Yes") %>%
  group_by(state) %>%
  summarise(depression_cases = n(), .groups = "drop") %>%
  arrange(desc(depression_cases))
us_state_depression
head(us_state_depression, 1)
tail(us_state_depression, 1)
ggplot(us_state_depression, aes(x = reorder(state, depression_cases),
                                y = depression_cases)) +
  geom_col() +
  coord_flip() +
  labs(title = "Diagnosed Depression Cases by State",
       x = "State",
       y = "Number of Diagnosed Cases")
Interpretation: The table is sorted in descending order of depression_cases,
so head(..., 1) returns the state with the most diagnosed depression cases
among U.S. employees and tail(..., 1) returns a state with the fewest. These
are raw counts, so states with more respondents tend to rank higher, and
states with no "Yes" responses do not appear in the table at all.
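Because raw counts depend on how many people responded from each state, it can help to look at the rate per state as well. The sketch below uses made-up rows (the `toy` values are hypothetical, not survey results) and base R `tapply()` to compute both the count and the share of "Yes" per state.

```r
# Toy respondents (hypothetical values, not survey results)
toy <- data.frame(
  state           = c(rep("CA", 5), rep("WA", 2), rep("TX", 2)),
  obs_consequence = c("Yes", "Yes", "No", "No", "No",
                      "Yes", "No",
                      "No", "No"),
  stringsAsFactors = FALSE
)

is_yes <- toy$obs_consequence == "Yes"

# Number of "Yes" responses per state (states with zero are kept)
cases <- tapply(is_yes, toy$state, sum)

# Share of "Yes" responses per state
rate <- tapply(is_yes, toy$state, mean)

print(data.frame(state = names(cases),
                 cases = as.vector(cases),
                 rate  = as.vector(rate)))
```

In this toy example the state with the most cases is not the state with the highest rate, which is the situation raw counts can hide.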
table(Mental_Health_Survey$self_employed,
Mental_Health_Survey$obs_consequence)
##
## No Yes
## No 715 133
## Yes 92 28
prop.table(table(Mental_Health_Survey$self_employed,
Mental_Health_Survey$obs_consequence), 1)
##
## No Yes
## No 0.8431604 0.1568396
## Yes 0.7666667 0.2333333
ggplot(Mental_Health_Survey,
aes(x = self_employed, fill = obs_consequence)) +
geom_bar(position = "fill") +
labs(title = "Self Employment vs Depression",
x = "Self Employed",
y = "Proportion",
fill = "Depression")
table(Mental_Health_Survey$tech_company,
Mental_Health_Survey$obs_consequence)
##
## No Yes
## No 136 40
## Yes 671 121
prop.table(table(Mental_Health_Survey$tech_company,
Mental_Health_Survey$obs_consequence), 1)
##
## No Yes
## No 0.7727273 0.2272727
## Yes 0.8472222 0.1527778
ggplot(Mental_Health_Survey,
aes(x = tech_company, fill = obs_consequence)) +
geom_bar(position = "fill") +
labs(title = "Tech Company vs Depression",
x = "Tech Company",
y = "Proportion",
fill = "Depression")
table(Mental_Health_Survey$leave,
Mental_Health_Survey$obs_consequence)
##
## No Yes
## Don't know 358 53
## Somewhat difficult 71 35
## Somewhat easy 181 28
## Very difficult 57 32
## Very easy 140 13
prop.table(table(Mental_Health_Survey$leave,
Mental_Health_Survey$obs_consequence), 1)
##
## No Yes
## Don't know 0.87104623 0.12895377
## Somewhat difficult 0.66981132 0.33018868
## Somewhat easy 0.86602871 0.13397129
## Very difficult 0.64044944 0.35955056
## Very easy 0.91503268 0.08496732
ggplot(Mental_Health_Survey,
aes(x = leave, fill = obs_consequence)) +
geom_bar(position = "fill") +
labs(title = "Leave Difficulty vs Depression",
x = "Leave",
y = "Proportion",
fill = "Depression")
table(Mental_Health_Survey$phys_health_consequence,
Mental_Health_Survey$obs_consequence)
##
## No Yes
## Maybe 167 53
## No 604 91
## Yes 36 17
prop.table(table(Mental_Health_Survey$phys_health_consequence,
Mental_Health_Survey$obs_consequence), 1)
##
## No Yes
## Maybe 0.7590909 0.2409091
## No 0.8690647 0.1309353
## Yes 0.6792453 0.3207547
ggplot(Mental_Health_Survey,
aes(x = phys_health_consequence, fill = obs_consequence)) +
geom_bar(position = "fill") +
labs(title = "Physical Health Consequence vs Depression",
x = "Physical Health Consequence",
y = "Proportion",
fill = "Depression")
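Interpretation: The contingency tables and proportional bar charts show how the share of obs_consequence = Yes differs across workplace and health-related factors. It is higher among self-employed respondents (23.3% vs 15.7%) and lower among those working at tech companies (15.3% vs 22.7%). Leave difficulty shows the largest gap: 36.0% for "Very difficult" and 33.0% for "Somewhat difficult", compared with 8.5% for "Very easy". Respondents who expect physical health consequences also show a higher share (32.1% for Yes and 24.1% for Maybe, vs 13.1% for No). These are pairwise comparisons, so they describe associations with depression incidence and do not show how the factors interact with each other.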
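Whether a difference in proportions is larger than chance variation would explain can be checked with a chi-square test of independence. The sketch below is an illustrative addition, not part of the original analysis. It re-enters the leave counts printed above as a matrix (the name `leave_tab` is introduced for the example) and applies base R `chisq.test()`.

```r
# Counts copied from the leave x obs_consequence table above
leave_tab <- matrix(
  c(358, 53,
     71, 35,
    181, 28,
     57, 32,
    140, 13),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    leave = c("Don't know", "Somewhat difficult", "Somewhat easy",
              "Very difficult", "Very easy"),
    obs_consequence = c("No", "Yes")
  )
)

# Chi-square test of independence between leave and obs_consequence
leave_test <- chisq.test(leave_tab)
print(leave_test)
```

On the full data frame the same test can be run directly with `chisq.test(table(Mental_Health_Survey$leave, Mental_Health_Survey$obs_consequence))`.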
table(Mental_Health_Survey$work_interfere,
Mental_Health_Survey$obs_consequence)
##
## No Yes
## Never 189 17
## Often 103 34
## Rarely 144 24
## Sometimes 371 86
prop.table(table(Mental_Health_Survey$work_interfere,
Mental_Health_Survey$obs_consequence), 1)
##
## No Yes
## Never 0.91747573 0.08252427
## Often 0.75182482 0.24817518
## Rarely 0.85714286 0.14285714
## Sometimes 0.81181619 0.18818381
ggplot(Mental_Health_Survey,
aes(x = work_interfere, fill = obs_consequence)) +
geom_bar(position = "fill") +
labs(title = "Work Interference vs Depression",
x = "Work Interference",
y = "Proportion",
fill = "Depression")
table(Mental_Health_Survey$remote_work,
Mental_Health_Survey$obs_consequence)
##
## No Yes
## No 556 118
## Yes 251 43
prop.table(table(Mental_Health_Survey$remote_work,
Mental_Health_Survey$obs_consequence), 1)
##
## No Yes
## No 0.8249258 0.1750742
## Yes 0.8537415 0.1462585
ggplot(Mental_Health_Survey,
aes(x = remote_work, fill = obs_consequence)) +
geom_bar(position = "fill") +
labs(title = "Remote Work vs Depression",
x = "Remote Work",
y = "Proportion",
fill = "Depression")
table(Mental_Health_Survey$coworkers,
Mental_Health_Survey$obs_consequence)
##
## No Yes
## No 168 32
## Some of them 489 110
## Yes 150 19
prop.table(table(Mental_Health_Survey$coworkers,
Mental_Health_Survey$obs_consequence), 1)
##
## No Yes
## No 0.8400000 0.1600000
## Some of them 0.8163606 0.1836394
## Yes 0.8875740 0.1124260
ggplot(Mental_Health_Survey,
aes(x = coworkers, fill = obs_consequence)) +
geom_bar(position = "fill") +
labs(title = "Coworkers vs Depression",
x = "Coworker Support",
y = "Proportion",
fill = "Depression")
Interpretation: The share of obs_consequence = Yes rises with how often
mental health interferes with work: 8.3% for Never, 14.3% for Rarely,
18.8% for Sometimes, and 24.8% for Often. Remote work shows only a small
difference (14.6% for remote vs 17.5% for non-remote employees). For
coworkers, the share is lowest among employees who would discuss mental
health with their coworkers (11.2%), compared with 16.0% for No and 18.4%
for Some of them, which is consistent with coworker openness going along
with fewer negative outcomes.
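To compare how strongly each of the three factors is associated with obs_consequence, an effect size such as Cramér's V can be computed from the printed counts. The sketch below is an illustrative addition; `cramers_v()` is a helper written for this example, not a base R function, and the matrices repeat the counts shown above.

```r
# Cramér's V: sqrt(chi-square / (n * (min(rows, cols) - 1)))
cramers_v <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

# Counts copied from the tables above (columns: No, Yes)
work_interfere <- matrix(c(189, 17, 103, 34, 144, 24, 371, 86),
                         ncol = 2, byrow = TRUE)
remote_work    <- matrix(c(556, 118, 251, 43), ncol = 2, byrow = TRUE)
coworkers      <- matrix(c(168, 32, 489, 110, 150, 19),
                         ncol = 2, byrow = TRUE)

v <- c(work_interfere = cramers_v(work_interfere),
       remote_work    = cramers_v(remote_work),
       coworkers      = cramers_v(coworkers))
print(round(v, 3))
```

Values near 0 indicate little association and values near 1 a very strong one, so the three numbers can be read on a common scale.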
Final Conclusion: The dataset was cleaned by removing rows with missing values (except for comments and state), standardizing the Gender variable, removing unrealistic Age values, eliminating duplicate records, and converting categorical variables into numeric form to ensure consistency and reliability of analysis.
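Based on the analysis of U.S. states, California (CA) shows the highest number of diagnosed depression cases (15), while Virginia (VA), the last row of the sorted table, shows the lowest (1). These are raw counts that are not adjusted for the number of respondents per state, and other states may share the minimum count, so the ranking reflects where respondents are concentrated as much as differences in mental health outcomes.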
The relationship analysis using both statistical summaries (contingency tables and proportions) and visualizations (bar plots) shows that workplace and health-related factors are associated with depression incidence. Employees who find it difficult to take leave (33% to 36%, versus 8.5% when leave is very easy) and those expecting physical health consequences (32.1% versus 13.1%) show the highest proportions.
Work interference shows a positive relationship with depression: the proportion rises from 8.3% when mental health never interferes with work to 24.8% when it often does. Remote work shows only a small difference (14.6% for remote versus 17.5% for non-remote employees).
Additionally, coworker support appears to play a role: the proportion is lowest (11.2%) among employees who would discuss mental health with their coworkers.
Overall, the results highlight that workplace conditions, physical health, and social support are key factors influencing mental health among employees.
Question-2: Based on your understanding of answers to Q1, pick a clustering technique and appropriate variables to identify clusters of employees with/without depression. Tabulate the results where there are two rows (“obs_consequence” either Yes or No) and the columns contain the values of centroids for the variables. What do you observe and infer from this?
library(dplyr)
q2_data <- Mental_Health_Survey %>%
  select(obs_consequence,
         work_interfere,
         remote_work,
         self_employed,
         tech_company,
         leave,
         phys_health_consequence,
         coworkers)
Interpretation: These variables were selected based on Q1 because work interference, remote work, self-employment, tech company status, leave difficulty, physical health consequences, and coworker support were related to depression.
q2_numeric <- q2_data
q2_numeric[] <- lapply(q2_numeric, function(x) {
  as.numeric(as.factor(x))
})
head(q2_numeric)
str(q2_numeric)
## 'data.frame': 968 obs. of 8 variables:
## $ obs_consequence : num 2 1 1 1 1 1 1 1 1 1 ...
## $ work_interfere : num 4 4 1 2 1 3 4 3 4 4 ...
## $ remote_work : num 2 1 2 2 1 1 1 2 2 1 ...
## $ self_employed : num 2 1 2 1 1 1 1 1 1 1 ...
## $ tech_company : num 2 2 2 2 1 2 1 2 2 2 ...
## $ leave : num 5 2 2 1 1 1 5 1 2 4 ...
## $ phys_health_consequence: num 2 2 2 2 2 2 2 2 2 1 ...
## $ coworkers : num 3 2 2 2 1 3 2 3 2 2 ...
Interpretation: All categorical variables were converted into numeric form so they can be used in clustering.
depression_label <- q2_data$obs_consequence
cluster_vars <- q2_numeric %>%
select(-obs_consequence)
Interpretation: obs_consequence is kept as the depression label, while the remaining variables are used to form clusters.
cluster_scaled <- scale(cluster_vars)
Interpretation: Scaling was done so that all variables contribute equally to the clustering process.
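Scaling integer codes treats categories as points on a number line, which is only meaningful when the categories are ordered. For nominal variables with more than two levels, one common alternative is indicator (one-hot) columns. The sketch below is an illustration on made-up values, not the encoding used in this report; it uses base R `model.matrix()`, and the name `onehot` is introduced for the example.

```r
# Toy column (hypothetical values) with the three coworkers labels
toy <- data.frame(
  coworkers = factor(c("No", "Some of them", "Yes", "Yes"))
)

# "- 1" removes the intercept so every level gets its own 0/1 column
onehot <- model.matrix(~ coworkers - 1, data = toy)
print(onehot)
```

Each row has exactly one 1, so no level is treated as numerically "between" the others when distances are computed.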
set.seed(123)
kmeans_model <- kmeans(cluster_scaled,
                       centers = 2,
                       nstart = 25)
kmeans_model
## K-means clustering with 2 clusters of sizes 327, 641
##
## Cluster means:
## work_interfere remote_work self_employed tech_company leave
## 1 0.02046454 1.2939859 0.7277437 0.19379556 0.18232034
## 2 -0.01043979 -0.6601145 -0.3712515 -0.09886295 -0.09300897
## phys_health_consequence coworkers
## 1 -0.01573577 0.12131337
## 2 0.00802745 -0.06188685
##
## Clustering vector:
## [1] 1 2 1 1 2 2 2 1 1 2 1 1 1 1 2 2 2 1 2 2 2 2 2 1 2 2 1 2 1 1 2 1 2 2 1 1 1
## [38] 1 2 2 2 2 1 1 1 1 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 2 1 2
## [75] 2 1 2 1 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2
## [112] 2 2 1 2 1 2 2 1 2 2 2 1 2 1 1 2 2 1 2 2 1 2 2 2 2 1 1 2 2 2 2 1 2 1 1 2 2
## [149] 2 1 2 2 1 1 2 2 2 2 1 1 2 1 2 1 2 1 2 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 1 2 2
## [186] 1 1 2 1 1 2 1 1 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2
## [223] 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 1 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## [260] 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 1 2 1 2 1 2 2 2 2 1 2 2 2 2 2 2
## [297] 2 2 1 1 1 2 2 2 1 2 2 2 1 1 1 2 2 2 2 1 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2
## [334] 2 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 2 2 2 1 2 2 1 2 2 1 2 2 2 2 1 2 2 2 1 2
## [371] 2 1 1 2 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 1 1 1 1 1 2 1 2 1 2 2 2 1 2 2 2 2 2
## [408] 2 1 2 1 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 2 2 2 1 2 1
## [445] 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 1 1 2 2 1 1 2 2 2 2 1 1
## [482] 1 2 2 1 1 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2
## [519] 1 2 1 1 2 2 1 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 2 2 1 2 1 2 1 1 2 1 2 1 2 1 2
## [556] 2 2 1 1 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2
## [593] 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2
## [630] 1 2 2 2 2 1 1 2 1 2 1 1 2 1 2 1 2 2 1 1 2 1 2 2 2 1 1 1 1 1 2 2 1 1 2 1 2
## [667] 1 1 1 2 2 1 1 2 2 1 2 2 1 1 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 2
## [704] 1 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 1 1 1 2 1 1 1 2 1 1 1 2 2 2 2 2 1 1 1 2 2
## [741] 1 1 2 2 2 2 1 1 2 2 1 1 2 1 2 2 1 2 1 1 2 1 2 2 1 1 1 2 2 1 1 2 2 1 2 1 2
## [778] 2 1 1 2 1 1 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 2 1 2 1 1 2 2 1 2 2 1 1 2 2 2
## [815] 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 1 1 1 1 2 2 2 1 2 1 2 2 2
## [852] 2 1 1 2 2 2 1 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2
## [889] 1 1 2 2 1 2 2 2 2 2 2 2 1 1 2 2 1 1 2 2 1 1 2 2 2 1 2 2 2 2 1 1 2 1 2 1 2
## [926] 2 2 1 2 2 1 1 2 1 2 1 1 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 1
## [963] 2 1 2 1 2 2
##
## Within cluster sum of squares by cluster:
## [1] 2367.716 3270.351
## (between_SS / total_SS = 16.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
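Interpretation: K-means clustering was selected with centers = 2 because the goal is to compare two groups: employees with depression and employees without depression. K-means is unsupervised, so the two clusters are formed from the input variables alone and need not line up with obs_consequence. Here the clusters account for 16.7% of the total sum of squares (between_SS / total_SS).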
q2_results <- q2_data %>%
mutate(Cluster = kmeans_model$cluster)
head(q2_results)
cluster_depression_table <- table(q2_results$Cluster,
q2_results$obs_consequence)
cluster_depression_table
##
## No Yes
## 1 276 51
## 2 531 110
Interpretation: This table shows how the two clusters are associated with obs_consequence values of Yes and No.
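Row proportions make the cross-tabulation easier to read than raw counts, because the two clusters differ in size. The sketch below re-enters the printed counts as a matrix (the name `cluster_tab` is introduced for the example) and uses base R `prop.table()`.

```r
# Counts copied from the cluster x obs_consequence table above
cluster_tab <- matrix(
  c(276,  51,
    531, 110),
  ncol = 2, byrow = TRUE,
  dimnames = list(Cluster = c("1", "2"),
                  obs_consequence = c("No", "Yes"))
)

# Share of No / Yes within each cluster
row_share <- prop.table(cluster_tab, 1)

# Share of Yes in the whole sample, for comparison
overall_yes <- sum(cluster_tab[, "Yes"]) / sum(cluster_tab)

print(round(row_share, 3))
print(round(overall_yes, 3))
```

If the clusters tracked depression status, the Yes share would differ clearly between the two rows and from the overall share.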
centroid_table <- q2_numeric %>%
  group_by(obs_consequence) %>%
  summarise(across(everything(), mean), .groups = "drop")
centroid_table
Interpretation: This table satisfies the question requirement: the rows are the two obs_consequence values (coded 1 = No, 2 = Yes) and the columns hold the mean numeric code of each selected variable, that is, the centroid of each group.
kmeans_centroids <- as.data.frame(kmeans_model$centers)
kmeans_centroids
Interpretation: The centroid table shows the average standardized value of each variable within each cluster.
Final Conclusion: K-means clustering was applied using variables related to workplace conditions, physical health, and social support, including work interference, remote work, self-employment, tech company status, leave difficulty, physical health consequences, and coworker support. These variables were selected based on the findings from Q1, where they showed relationships with depression incidence.
The clustering results show that Cluster 1 contains 276 employees without depression and 51 with depression (15.6% of 327), while Cluster 2 contains 531 without and 110 with depression (17.2% of 641). Cluster 2 has more employees with depression in absolute terms because it is roughly twice as large, but the proportion is nearly the same in both clusters, so the clusters do not separate employees by depression status.
The centroid table shows some differences between employees with and without depression. Employees with depression (obs_consequence = 2) have a higher average code for work interference (3.11 compared to 2.86). Because the codes are alphabetical (Never = 1, Often = 2, Rarely = 3, Sometimes = 4), this mainly reflects a smaller share of "Never" responses in that group and is not a direct measure of interference strength. Employees with depression also show a slightly higher value for self-employment (1.17 compared to 1.11, where 1 = No and 2 = Yes), meaning about 17% of them are self-employed versus about 11% of the others.
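The work interference means quoted above can be reproduced from the Q1 contingency table, which also makes it possible to see how they change under a different coding. The sketch below is an illustrative addition: `by_frequency` is an assumed ordering (Never < Rarely < Sometimes < Often) and `group_mean()` is a helper written for this example.

```r
# Counts copied from the work_interfere x obs_consequence table in Q1
counts <- matrix(
  c(189, 17,
    103, 34,
    144, 24,
    371, 86),
  ncol = 2, byrow = TRUE,
  dimnames = list(work_interfere = c("Never", "Often", "Rarely", "Sometimes"),
                  obs_consequence = c("No", "Yes"))
)

# Codes produced by as.numeric(as.factor(x)): alphabetical order
alphabetical <- c(Never = 1, Often = 2, Rarely = 3, Sometimes = 4)

# Assumed ordinal coding by how often interference occurs
by_frequency <- c(Never = 1, Rarely = 2, Sometimes = 3, Often = 4)

# Mean code within each obs_consequence group, weighted by the counts
group_mean <- function(codes) {
  colSums(counts * codes[rownames(counts)]) / colSums(counts)
}

print(round(group_mean(alphabetical), 2))
print(round(group_mean(by_frequency), 2))
```

Under the frequency-based coding a higher mean does correspond to more frequent interference, so the group comparison can be interpreted directly.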
The K-means centroid values indicate what distinguishes the two clusters. Cluster 1 has positive standardized centroids for remote work (1.29) and self-employment (0.73), while Cluster 2 has negative values for both (-0.66 and -0.37). The centroids for work interference and physical health consequences are close to zero in both clusters. The clusters therefore mainly separate remote and self-employed respondents from the rest.
Overall, the clustering groups employees by working arrangement rather than by depression status, and the share of obs_consequence = Yes is similar in both clusters. The contingency tables from Q1 remain the clearer evidence that work interference, leave difficulty, and physical health consequences are related to depression.
Question-3: Based on your analysis in the previous two questions, pick an appropriate Association Rule Mining algorithm that uses the consequent as “obs_consequence”. Pick the appropriate values of support, confidence and lift in this case and justify. What are the top five rules by support, confidence and lift?
library(arules)
## Warning: package 'arules' was built under R version 4.4.3
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(dplyr)
arm_data <- Mental_Health_Survey %>%
  select(work_interfere,
         remote_work,
         self_employed,
         tech_company,
         leave,
         phys_health_consequence,
         coworkers,
         obs_consequence)
head(arm_data)
str(arm_data)
## 'data.frame': 968 obs. of 8 variables:
## $ work_interfere : chr "Sometimes" "Sometimes" "Never" "Often" ...
## $ remote_work : chr "Yes" "No" "Yes" "Yes" ...
## $ self_employed : chr "Yes" "No" "Yes" "No" ...
## $ tech_company : chr "Yes" "Yes" "Yes" "Yes" ...
## $ leave : chr "Very easy" "Somewhat difficult" "Somewhat difficult" "Don't know" ...
## $ phys_health_consequence: chr "No" "No" "No" "No" ...
## $ coworkers : chr "Yes" "Some of them" "Some of them" "Some of them" ...
## $ obs_consequence : chr "Yes" "No" "No" "No" ...
Interpretation: These variables were selected based on Q1 and Q2 because workplace conditions, physical health consequences, and coworker support showed relationships with depression.
arm_data[] <- lapply(arm_data, as.factor)
str(arm_data)
## 'data.frame': 968 obs. of 8 variables:
## $ work_interfere : Factor w/ 4 levels "Never","Often",..: 4 4 1 2 1 3 4 3 4 4 ...
## $ remote_work : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 1 2 2 1 ...
## $ self_employed : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ tech_company : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 2 2 ...
## $ leave : Factor w/ 5 levels "Don't know","Somewhat difficult",..: 5 2 2 1 1 1 5 1 2 4 ...
## $ phys_health_consequence: Factor w/ 3 levels "Maybe","No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ coworkers : Factor w/ 3 levels "No","Some of them",..: 3 2 2 2 1 3 2 3 2 2 ...
## $ obs_consequence : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 1 1 ...
Interpretation: Association rule mining works with categorical transaction data, so all selected variables were converted to factors.
transactions_data <- as(arm_data, "transactions")
summary(transactions_data)
## transactions as itemMatrix in sparse format with
## 968 rows (elements/itemsets/transactions) and
## 23 columns (items) and a density of 0.3478261
##
## most frequent items:
## self_employed=No obs_consequence=No
## 848 807
## tech_company=Yes phys_health_consequence=No
## 792 695
## remote_work=No (Other)
## 674 3928
##
## element (itemset/transaction) length distribution:
## sizes
## 8
## 968
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8 8 8 8 8 8
##
## includes extended item information - examples:
## labels variables levels
## 1 work_interfere=Never work_interfere Never
## 2 work_interfere=Often work_interfere Often
## 3 work_interfere=Rarely work_interfere Rarely
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
inspect(head(transactions_data, 5))
## items transactionID
## [1] {work_interfere=Sometimes,
## remote_work=Yes,
## self_employed=Yes,
## tech_company=Yes,
## leave=Very easy,
## phys_health_consequence=No,
## coworkers=Yes,
## obs_consequence=Yes} 1
## [2] {work_interfere=Sometimes,
## remote_work=No,
## self_employed=No,
## tech_company=Yes,
## leave=Somewhat difficult,
## phys_health_consequence=No,
## coworkers=Some of them,
## obs_consequence=No} 2
## [3] {work_interfere=Never,
## remote_work=Yes,
## self_employed=Yes,
## tech_company=Yes,
## leave=Somewhat difficult,
## phys_health_consequence=No,
## coworkers=Some of them,
## obs_consequence=No} 3
## [4] {work_interfere=Often,
## remote_work=Yes,
## self_employed=No,
## tech_company=Yes,
## leave=Don't know,
## phys_health_consequence=No,
## coworkers=Some of them,
## obs_consequence=No} 4
## [5] {work_interfere=Never,
## remote_work=No,
## self_employed=No,
## tech_company=No,
## leave=Don't know,
## phys_health_consequence=No,
## coworkers=No,
## obs_consequence=No} 5
Interpretation: The dataset was converted into transactions so that association rules can be generated using the Apriori algorithm.
rules <- apriori(
  transactions_data,
  parameter = list(
    supp = 0.005,
    conf = 0.30,
    minlen = 2
  ),
  appearance = list(
    rhs = "obs_consequence=Yes",
    default = "lhs"
  )
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.005 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[23 item(s), 968 transaction(s)] done [0.00s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
## writing ... [338 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
length(rules)
## [1] 338
summary(rules)
## set of 338 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7
## 3 32 104 123 63 13
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 4.00 5.00 4.74 5.00 7.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005165 Min. :0.3000 Min. :0.006198 Min. :1.804
## 1st Qu.:0.006198 1st Qu.:0.3403 1st Qu.:0.013430 1st Qu.:2.046
## Median :0.007231 Median :0.3889 Median :0.018595 Median :2.338
## Mean :0.009713 Mean :0.4273 Mean :0.025096 Mean :2.569
## 3rd Qu.:0.011364 3rd Qu.:0.5000 3rd Qu.:0.029959 3rd Qu.:3.006
## Max. :0.036157 Max. :1.0000 Max. :0.109504 Max. :6.012
## count
## Min. : 5.000
## 1st Qu.: 6.000
## Median : 7.000
## Mean : 9.402
## 3rd Qu.:11.000
## Max. :35.000
##
## mining info:
## data ntransactions support confidence
## transactions_data 968 0.005 0.3
## call
## apriori(data = transactions_data, parameter = list(supp = 0.005, conf = 0.3, minlen = 2), appearance = list(rhs = "obs_consequence=Yes", default = "lhs"))
Interpretation: The Apriori algorithm generated 338 association rules from 968 transactions. The support threshold was set to 0.005 and the confidence threshold to 0.30 because higher threshold values produced no usable rules. Low thresholds are expected here: only 161 of the 968 respondents (16.6%) have obs_consequence=Yes, so no rule with this consequent can exceed a support of 0.166. The generated rules have support from 0.0052 to 0.0362 and confidence from 0.30 to 1.00.
The lift values range from 1.804 to 6.012. Lift is the rule's confidence divided by the base rate of obs_consequence=Yes, so a minimum confidence of 0.30 already implies a lift of at least 0.30 / 0.166 ≈ 1.80, and no separate lift threshold was needed. Most rules contain 4 or 5 items (counting both sides), and many cover only a handful of respondents (median count 7), so the longer rules should be read as tentative patterns.
Overall, the generated rules point to associations between workplace conditions, physical health consequences, coworker support, and obs_consequence=Yes.
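The link between the confidence threshold and the observed lift range can be checked with a few lines of arithmetic. The sketch below is an illustrative addition that uses only counts printed in this report; `lift()` is a small helper written for the example, not an arules function.

```r
n_total <- 968   # transactions
n_yes   <- 161   # transactions with obs_consequence=Yes (968 - 807)

# Base rate of the consequent
base_rate <- n_yes / n_total

# Lift of a rule with a fixed consequent: confidence / P(consequent)
lift <- function(confidence) confidence / base_rate

min_lift <- lift(0.30)   # lowest lift allowed by conf = 0.30
max_lift <- lift(1.00)   # lift of a rule with confidence 1

# Top rule by support: {leave=Somewhat difficult}, 35 of 106 covered rows
rule_lift <- lift(35 / 106)

print(round(c(min = min_lift, max = max_lift, top_support_rule = rule_lift), 3))
```

The minimum and maximum agree with the lift column of summary(rules), which confirms that the lift range is a direct consequence of the confidence threshold and the base rate.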
inspect(head(sort(rules, by = "support", decreasing = TRUE), 5))
## lhs rhs support confidence coverage lift count
## [1] {leave=Somewhat difficult} => {obs_consequence=Yes} 0.03615702 0.3301887 0.10950413 1.985234 35
## [2] {leave=Very difficult} => {obs_consequence=Yes} 0.03305785 0.3595506 0.09194215 2.161770 32
## [3] {work_interfere=Sometimes,
## self_employed=No,
## phys_health_consequence=Maybe} => {obs_consequence=Yes} 0.03099174 0.3030303 0.10227273 1.821946 30
## [4] {self_employed=No,
## leave=Somewhat difficult} => {obs_consequence=Yes} 0.02995868 0.3452381 0.08677686 2.075717 29
## [5] {tech_company=Yes,
## leave=Somewhat difficult} => {obs_consequence=Yes} 0.02685950 0.3170732 0.08471074 1.906378 26
inspect(head(sort(rules, by = "confidence", decreasing = TRUE), 5))
## lhs rhs support confidence coverage lift count
## [1] {work_interfere=Rarely,
## tech_company=Yes,
## phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347 1.0000000 0.006198347 6.012422 6
## [2] {work_interfere=Rarely,
## self_employed=No,
## tech_company=Yes,
## phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347 1.0000000 0.006198347 6.012422 6
## [3] {work_interfere=Rarely,
## phys_health_consequence=Yes} => {obs_consequence=Yes} 0.007231405 0.8750000 0.008264463 5.260870 7
## [4] {work_interfere=Rarely,
## self_employed=No,
## phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347 0.8571429 0.007231405 5.153505 6
## [5] {work_interfere=Rarely,
## remote_work=No,
## phys_health_consequence=Yes} => {obs_consequence=Yes} 0.005165289 0.8333333 0.006198347 5.010352 5
inspect(head(sort(rules, by = "lift", decreasing = TRUE), 5))
## lhs rhs support confidence coverage lift count
## [1] {work_interfere=Rarely,
## tech_company=Yes,
## phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347 1.0000000 0.006198347 6.012422 6
## [2] {work_interfere=Rarely,
## self_employed=No,
## tech_company=Yes,
## phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347 1.0000000 0.006198347 6.012422 6
## [3] {work_interfere=Rarely,
## phys_health_consequence=Yes} => {obs_consequence=Yes} 0.007231405 0.8750000 0.008264463 5.260870 7
## [4] {work_interfere=Rarely,
## self_employed=No,
## phys_health_consequence=Yes} => {obs_consequence=Yes} 0.006198347 0.8571429 0.007231405 5.153505 6
## [5] {work_interfere=Rarely,
## remote_work=No,
## phys_health_consequence=Yes} => {obs_consequence=Yes} 0.005165289 0.8333333 0.006198347 5.010352 5
Final Conclusion: Association Rule Mining was performed using the Apriori algorithm to identify workplace and health-related patterns associated with depression outcomes. The consequent was restricted to obs_consequence=Yes, meaning all generated rules explain conditions associated with diagnosed depression.
The support threshold was set to 0.005 and the confidence threshold was set to 0.30 because higher threshold values produced no usable rules. With 968 transactions, a support of 0.005 means a rule must match at least 5 records. Using these values, a total of 338 association rules were generated. The lift values ranged from 1.804 to 6.012, indicating positive associations between the antecedent variables and depression outcomes beyond random chance.
The top rules by support show that difficult leave conditions are the most frequent patterns associated with diagnosed depression. For example, the rule {leave=Somewhat difficult} => {obs_consequence=Yes} has the highest support, 0.0362, which corresponds to 35 of the 968 records. Other highly supported rules involve combinations of work interference, self-employment status, tech company status, physical health consequences, and leave difficulty.
The top rules by confidence show the strongest conditional relationships with depression. The rule {work_interfere=Rarely, tech_company=Yes, phys_health_consequence=Yes} => {obs_consequence=Yes} achieved a confidence of 1.00 and a lift of 6.01, meaning that all 6 employees in the dataset who satisfied these conditions had diagnosed depression. Similar rules involving physical health consequences, self-employment status, and remote work showed confidence values of 0.83 or higher. Each of these rules is based on only 5 to 7 records, so the high confidence values should be interpreted with caution.
The top rules by lift are the same five rules as the top rules by confidence. Because every rule has the same consequent, lift is simply confidence divided by the overall rate of obs_consequence=Yes (161/968 ≈ 0.166), so the two rankings coincide. The highest lift value of 6.01 means that depression was about six times as frequent among records matching that antecedent as in the data overall. Physical health consequences appear in all of the strongest rules, suggesting that physical and mental health are closely connected.
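The three measures can be recomputed by hand from counts. The sketch below does this for the top-support rule {leave=Somewhat difficult} => {obs_consequence=Yes}. The antecedent count of 106 is an assumption inferred from the reported coverage (0.1095 × 968), and the consequent count of 161 is the number of Yes records among the 968 transactions. Variable names are illustrative.

```r
# Counts for the rule {leave=Somewhat difficult} => {obs_consequence=Yes}
n_total <- 968   # transactions
n_rhs   <- 161   # records with obs_consequence = Yes
n_lhs   <- 106   # records with leave = Somewhat difficult (assumed: coverage * n_total)
n_both  <- 35    # records matching both sides (the rule's count)

support    <- n_both / n_total            # share of all records matching the rule
confidence <- n_both / n_lhs              # share of antecedent records with Yes
lift       <- confidence / (n_rhs / n_total)  # confidence relative to the base rate

round(c(support = support, confidence = confidence, lift = lift), 4)

# Smallest count a rule can have at the 0.005 support threshold
min_count <- ceiling(0.005 * n_total)
min_count
```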
Overall, the association rule analysis supports the findings from Q1 and Q2. Workplace stress, leave difficulty, physical health consequences, remote work conditions, and coworker-related support are factors associated with depression among employees. The generated rules reveal recurring patterns that help explain depression outcomes in the workplace environment.
Question-4: Based on the exploration you did in the previous three questions, pick at least five variables that you think can predict incidence of depression. Pick at least three classification techniques and use the appropriate cross validation techniques. Provide the values of Precision, Recall, Overall accuracy and AUC. Which technique performs the best?
library(dplyr)
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
model_data <- Mental_Health_Survey %>%
  select(obs_consequence,
         work_interfere,
         leave,
         phys_health_consequence,
         coworkers,
         remote_work,
         self_employed,
         tech_company)
head(model_data)
str(model_data)
## 'data.frame': 968 obs. of 8 variables:
## $ obs_consequence : chr "Yes" "No" "No" "No" ...
## $ work_interfere : chr "Sometimes" "Sometimes" "Never" "Often" ...
## $ leave : chr "Very easy" "Somewhat difficult" "Somewhat difficult" "Don't know" ...
## $ phys_health_consequence: chr "No" "No" "No" "No" ...
## $ coworkers : chr "Yes" "Some of them" "Some of them" "Some of them" ...
## $ remote_work : chr "Yes" "No" "Yes" "Yes" ...
## $ self_employed : chr "Yes" "No" "Yes" "No" ...
## $ tech_company : chr "Yes" "Yes" "Yes" "Yes" ...
Interpretation: These variables were selected based on Q1–Q3 because they were related to depression outcomes.
model_data[] <- lapply(model_data, as.factor)
model_data$obs_consequence <- factor(model_data$obs_consequence,
                                     levels = c("No", "Yes"))
str(model_data)
## 'data.frame': 968 obs. of 8 variables:
## $ obs_consequence : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ work_interfere : Factor w/ 4 levels "Never","Often",..: 4 4 1 2 1 3 4 3 4 4 ...
## $ leave : Factor w/ 5 levels "Don't know","Somewhat difficult",..: 5 2 2 1 1 1 5 1 2 4 ...
## $ phys_health_consequence: Factor w/ 3 levels "Maybe","No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ coworkers : Factor w/ 3 levels "No","Some of them",..: 3 2 2 2 1 3 2 3 2 2 ...
## $ remote_work : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 1 2 2 1 ...
## $ self_employed : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ tech_company : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 2 2 ...
table(model_data$obs_consequence)
##
## No Yes
## 807 161
set.seed(123)
train_index <- createDataPartition(model_data$obs_consequence,
                                   p = 0.70,
                                   list = FALSE)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
nrow(train_data)
## [1] 678
nrow(test_data)
## [1] 290
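Interpretation: A 70/30 train-test split was used. Because createDataPartition samples within each outcome class, the training and test sets keep roughly the same Yes/No proportions as the full data (about 17% Yes). The classes remain imbalanced in both sets.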
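The class shares in each partition can be checked from counts reported in this analysis. In the sketch below, the test counts (242 No, 48 Yes) come from the confusion matrices on the test set, and the training counts are obtained by subtracting them from the full-data totals (807 No, 161 Yes). The object name split_counts is illustrative.

```r
# Rows = partition, columns = outcome class
split_counts <- rbind(
  train = c(No = 807 - 242, Yes = 161 - 48),
  test  = c(No = 242,       Yes = 48)
)

# Partition sizes (should match nrow(train_data) and nrow(test_data))
rowSums(split_counts)

# Share of each class within each partition
round(prop.table(split_counts, margin = 1), 3)
```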
control <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = TRUE
)
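Interpretation: 10-fold cross validation on the training data was used to estimate out-of-sample performance and to choose tuning parameters by ROC. Note that twoClassSummary treats the first factor level (“No”) as the event, so in the resampling output below Sens is the recall for “No” and Spec is the recall for “Yes”.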
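To make the resampling concrete, the following base-R sketch assigns the 678 training rows to 10 folds and reports how many rows each model fit would be trained on. It uses a simple random, non-stratified assignment purely as an illustration; caret builds its own folds, so the exact size of each fold can differ slightly.

```r
set.seed(123)
n_train <- 678
k <- 10

# Shuffle fold labels 1..k so each row lands in exactly one hold-out fold
fold_id <- sample(rep(seq_len(k), length.out = n_train))

# Rows held out in each fold, and rows left for fitting
holdout_size  <- as.integer(table(fold_id))
training_size <- n_train - holdout_size
training_size
```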
set.seed(123)
log_model <- train(
  obs_consequence ~ .,
  data = train_data,
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = control
)
log_model
## Generalized Linear Model
##
## 678 samples
## 7 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 610, 610, 611, 611, 610, 611, ...
## Resampling results:
##
## ROC Sens Spec
## 0.6971043 0.9733396 0.09772727
log_pred <- predict(log_model, test_data)
log_prob <- predict(log_model, test_data, type = "prob")
log_cm <- confusionMatrix(log_pred,
                          test_data$obs_consequence,
                          positive = "Yes")
log_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 239 45
## Yes 3 3
##
## Accuracy : 0.8345
## 95% CI : (0.7866, 0.8754)
## No Information Rate : 0.8345
## P-Value [Acc > NIR] : 0.5384
##
## Kappa : 0.0772
##
## Mcnemar's Test P-Value : 3.262e-09
##
## Sensitivity : 0.06250
## Specificity : 0.98760
## Pos Pred Value : 0.50000
## Neg Pred Value : 0.84155
## Prevalence : 0.16552
## Detection Rate : 0.01034
## Detection Prevalence : 0.02069
## Balanced Accuracy : 0.52505
##
## 'Positive' Class : Yes
##
log_precision <- log_cm$byClass["Precision"]
log_recall <- log_cm$byClass["Recall"]
log_accuracy <- log_cm$overall["Accuracy"]
log_auc <- auc(roc(test_data$obs_consequence,
                   log_prob$Yes,
                   levels = c("No", "Yes"),
                   direction = "<"))
log_precision
## Precision
## 0.5
log_recall
## Recall
## 0.0625
log_accuracy
## Accuracy
## 0.8344828
log_auc
## Area under the curve: 0.6775
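The reported Precision, Recall and Accuracy follow directly from the four cells of the confusion matrix with “Yes” as the positive class. The helper below, metrics_from_counts (a hypothetical name, not part of caret), recomputes them from the logistic regression counts as a cross-check.

```r
# tp/fp/fn/tn are counts with "Yes" as the positive class
metrics_from_counts <- function(tp, fp, fn, tn) {
  c(
    precision = tp / (tp + fp),                    # Pos Pred Value
    recall    = tp / (tp + fn),                    # Sensitivity
    accuracy  = (tp + tn) / (tp + fp + fn + tn),
    balanced_accuracy = (tp / (tp + fn) + tn / (tn + fp)) / 2
  )
}

# Logistic regression test-set counts from the confusion matrix above
log_metrics <- metrics_from_counts(tp = 3, fp = 3, fn = 45, tn = 239)
round(log_metrics, 4)
```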
set.seed(123)
tree_model <- train(
  obs_consequence ~ .,
  data = train_data,
  method = "rpart",
  metric = "ROC",
  trControl = control
)
tree_model
## CART
##
## 678 samples
## 7 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 610, 610, 611, 611, 610, 611, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.000000000 0.5969576 0.9663221 0.06212121
## 0.004424779 0.5942947 0.9768797 0.05378788
## 0.017699115 0.5239462 0.9982456 0.00000000
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
tree_pred <- predict(tree_model, test_data)
tree_prob <- predict(tree_model, test_data, type = "prob")
tree_cm <- confusionMatrix(tree_pred,
                           test_data$obs_consequence,
                           positive = "Yes")
tree_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 232 45
## Yes 10 3
##
## Accuracy : 0.8103
## 95% CI : (0.7604, 0.8538)
## No Information Rate : 0.8345
## P-Value [Acc > NIR] : 0.8808
##
## Kappa : 0.0299
##
## Mcnemar's Test P-Value : 4.549e-06
##
## Sensitivity : 0.06250
## Specificity : 0.95868
## Pos Pred Value : 0.23077
## Neg Pred Value : 0.83755
## Prevalence : 0.16552
## Detection Rate : 0.01034
## Detection Prevalence : 0.04483
## Balanced Accuracy : 0.51059
##
## 'Positive' Class : Yes
##
tree_precision <- tree_cm$byClass["Precision"]
tree_recall <- tree_cm$byClass["Recall"]
tree_accuracy <- tree_cm$overall["Accuracy"]
tree_auc <- auc(roc(test_data$obs_consequence,
                    tree_prob$Yes,
                    levels = c("No", "Yes"),
                    direction = "<"))
tree_precision
## Precision
## 0.2307692
tree_recall
## Recall
## 0.0625
tree_accuracy
## Accuracy
## 0.8103448
tree_auc
## Area under the curve: 0.5331
set.seed(123)
rf_model <- train(
  obs_consequence ~ .,
  data = train_data,
  method = "rf",
  metric = "ROC",
  trControl = control
)
rf_model
## Random Forest
##
## 678 samples
## 7 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 610, 610, 611, 611, 610, 611, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.6657371 1.0000000 0.0000000
## 8 0.6568033 0.9326754 0.1416667
## 14 0.6538885 0.9097118 0.1840909
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
rf_pred <- predict(rf_model, test_data)
rf_prob <- predict(rf_model, test_data, type = "prob")
rf_cm <- confusionMatrix(rf_pred,
                         test_data$obs_consequence,
                         positive = "Yes")
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 242 48
## Yes 0 0
##
## Accuracy : 0.8345
## 95% CI : (0.7866, 0.8754)
## No Information Rate : 0.8345
## P-Value [Acc > NIR] : 0.5384
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.17e-11
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8345
## Prevalence : 0.1655
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Yes
##
rf_precision <- rf_cm$byClass["Precision"]
rf_recall <- rf_cm$byClass["Recall"]
rf_accuracy <- rf_cm$overall["Accuracy"]
rf_auc <- auc(roc(test_data$obs_consequence,
                  rf_prob$Yes,
                  levels = c("No", "Yes"),
                  direction = "<"))
rf_precision
## Precision
## NA
rf_recall
## Recall
## 0
rf_accuracy
## Accuracy
## 0.8344828
rf_auc
## Area under the curve: 0.6002
model_comparison <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "Random Forest"),
  Precision = c(log_precision, tree_precision, rf_precision),
  Recall = c(log_recall, tree_recall, rf_recall),
  Accuracy = c(log_accuracy, tree_accuracy, rf_accuracy),
  AUC = c(as.numeric(log_auc),
          as.numeric(tree_auc),
          as.numeric(rf_auc))
)
model_comparison
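The printed comparison table is not shown, so the sketch below rebuilds it from the values reported for each model and orders the rows by AUC. The object names reported and best_by_auc are illustrative; the numbers are copied from the outputs above, rounded to four decimals.

```r
# Test-set metrics as reported for each model (Precision is NA for Random Forest)
reported <- data.frame(
  Model     = c("Logistic Regression", "Decision Tree", "Random Forest"),
  Precision = c(0.5000, 0.2308, NA),
  Recall    = c(0.0625, 0.0625, 0.0000),
  Accuracy  = c(0.8345, 0.8103, 0.8345),
  AUC       = c(0.6775, 0.5331, 0.6002)
)

# Rows ordered from highest to lowest AUC
reported[order(-reported$AUC), ]

# Model with the highest AUC
best_by_auc <- reported$Model[which.max(reported$AUC)]
best_by_auc
```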
Final Conclusion: Three classification techniques were used to predict depression incidence: Logistic Regression, Decision Tree, and Random Forest. The predictor variables were selected based on findings from Q1–Q3, where work interference, leave difficulty, physical health consequences, coworker support, remote work, self-employment, and tech company status showed relationships with depression outcomes. A 70/30 train-test split was used: 10-fold cross validation on the 678 training records tuned each model, and the metrics reported here come from the 290 held-out test records.
The Logistic Regression model achieved a Precision of 0.50, Recall of 0.0625, Overall Accuracy of 0.8345, and an AUC of 0.6775. It correctly identified only 3 of the 48 depression cases in the test set, and its accuracy equals the no-information rate of 0.8345. It did, however, achieve the highest AUC of the three models, meaning its predicted probabilities rank depression cases above non-cases better than the other models do.
The Decision Tree model achieved a Precision of 0.2308, Recall of 0.0625, Overall Accuracy of 0.8103, and an AUC of 0.5331. It also identified 3 of the 48 depression cases but produced 10 false positives, and its AUC is only slightly above 0.5, so its predictive performance was weaker than that of Logistic Regression.
The Random Forest model achieved an Overall Accuracy of 0.8345 and an AUC of 0.6002. However, the model predicted “No” for every test record, so it classified no depression cases correctly, resulting in a Recall of 0 and an undefined Precision (0/0, reported as NA). This indicates that the model was biased toward predicting the majority class (“No”).
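A model can have a Recall of 0 and still an AUC above 0.5, because class predictions apply a fixed cutoff while AUC depends only on how the predicted probabilities rank the cases. The toy example below uses made-up probabilities (not the survey data) and a hypothetical helper, auc_rank, to show this.

```r
# Toy data, invented for illustration only (not from the survey)
truth <- c("No", "No", "No", "No", "Yes", "Yes")
p_yes <- c(0.05, 0.10, 0.20, 0.30, 0.25, 0.40)

# Class prediction at a 0.5 cutoff: every probability is below 0.5
pred   <- ifelse(p_yes >= 0.5, "Yes", "No")
recall <- sum(pred == "Yes" & truth == "Yes") / sum(truth == "Yes")

# AUC as the share of (Yes, No) pairs in which the Yes case has the higher
# score; tied pairs count as one half
auc_rank <- function(scores, labels) {
  pos <- scores[labels == "Yes"]
  neg <- scores[labels == "No"]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc_toy <- auc_rank(p_yes, truth)
c(recall = recall, auc = auc_toy)
```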
Based on the comparison of Precision, Recall, Overall Accuracy, and AUC, the Logistic Regression model performs the best overall because it achieved the highest AUC value (0.6775) and the highest Precision (0.50) among the three techniques. All models showed very low Recall (at most 0.0625) at the default classification cutoff, which reflects the class imbalance in the dataset (about 17% Yes), and none exceeded the no-information rate in accuracy. Logistic Regression is therefore the best of the three, but its ability to identify individual depression cases remains limited.
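Overall, the selected predictors carry some signal (test AUC up to 0.68), which is consistent with the findings from Q1–Q3 that workplace stress, leave difficulty, physical health consequences, and coworker-related support are associated with depression among employees. The low Recall values show, however, that these variables alone are not sufficient to predict individual cases.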