The dataset “Medical Students’ Empathy, Mental Health, and Burnout in Switzerland: A Cross-Sectional Analysis” by Carrard et al. (2022) contains survey data from 886 Swiss medical students examining the relationships between empathy, mental health, and burnout. It includes 20 variables covering demographic and academic factors (such as age, gender, study year, study hours, health status), psychological measures (Jefferson Scale of Physician Empathy, Cognitive and Affective Empathy subscales, Attitudes toward Mental Suffering and Patients), and mental-health indicators (depression, anxiety, and three burnout dimensions). The dataset enables cross-sectional analyses of how empathy and psychological well-being vary across medical training stages and demographic groups, providing valuable insights into factors contributing to student burnout and emotional resilience.
library(tidyverse) #load library tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych) #load the package for each session
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
medteach <- read.csv("C:/Users/ASUS/Downloads/medteach.csv") #imported data set name it medteach using read.csv function because the dataset is a csv file
#Q3
summary(medteach) #using summary function to show descriptive stats for dataset medteach, and it gives descriptive stats for each variable such as range, quantiles, mean, median
## id age year sex
## Min. : 2.0 Min. :17.00 Min. :1.000 Min. :1.000
## 1st Qu.: 447.5 1st Qu.:20.00 1st Qu.:1.000 1st Qu.:1.000
## Median : 876.0 Median :22.00 Median :3.000 Median :2.000
## Mean : 889.7 Mean :22.38 Mean :3.103 Mean :1.695
## 3rd Qu.:1341.8 3rd Qu.:24.00 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :1790.0 Max. :49.00 Max. :6.000 Max. :3.000
## glang part job stud_h
## Min. : 1.00 Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.: 1.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:12.00
## Median : 1.00 Median :1.0000 Median :0.0000 Median :25.00
## Mean : 14.33 Mean :0.5632 Mean :0.3488 Mean :25.29
## 3rd Qu.: 1.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:36.00
## Max. :121.00 Max. :1.0000 Max. :1.0000 Max. :70.00
## health psyt jspe qcae_cog
## Min. :1.000 Min. :0.0000 Min. : 67.0 Min. :37.00
## 1st Qu.:3.000 1st Qu.:0.0000 1st Qu.:101.0 1st Qu.:54.00
## Median :4.000 Median :0.0000 Median :107.0 Median :58.00
## Mean :3.778 Mean :0.2246 Mean :106.4 Mean :58.53
## 3rd Qu.:5.000 3rd Qu.:0.0000 3rd Qu.:113.0 3rd Qu.:63.00
## Max. :5.000 Max. :1.0000 Max. :125.0 Max. :76.00
## qcae_aff amsp erec_mean cesd
## Min. :18.00 Min. : 6.00 Min. :0.3571 Min. : 0.00
## 1st Qu.:31.00 1st Qu.:20.00 1st Qu.:0.6667 1st Qu.: 9.00
## Median :35.00 Median :23.00 Median :0.7262 Median :16.00
## Mean :34.78 Mean :23.15 Mean :0.7201 Mean :18.05
## 3rd Qu.:39.00 3rd Qu.:26.75 3rd Qu.:0.7857 3rd Qu.:25.00
## Max. :48.00 Max. :35.00 Max. :0.9524 Max. :56.00
## stai_t mbi_ex mbi_cy mbi_ea
## Min. :20.0 Min. : 5.00 Min. : 4.00 Min. :10.00
## 1st Qu.:34.0 1st Qu.:13.00 1st Qu.: 6.00 1st Qu.:21.00
## Median :43.0 Median :17.00 Median : 9.00 Median :24.00
## Mean :42.9 Mean :16.88 Mean :10.08 Mean :24.21
## 3rd Qu.:51.0 3rd Qu.:20.00 3rd Qu.:13.00 3rd Qu.:28.00
## Max. :77.0 Max. :30.00 Max. :24.00 Max. :36.00
describe(medteach) #using function describe() from package psych to show descriptive stats, do similar things to summary but have add details such as standard deviation, skew, kurtosis and standard error
## vars n mean sd median trimmed mad min max range
## id 1 886 889.71 515.56 876.00 888.23 667.91 2.00 1790.00 1788.0
## age 2 886 22.38 3.30 22.00 22.04 2.97 17.00 49.00 32.0
## year 3 886 3.10 1.76 3.00 3.00 2.97 1.00 6.00 5.0
## sex 4 886 1.70 0.47 2.00 1.74 0.00 1.00 3.00 2.0
## glang 5 886 14.33 32.37 1.00 5.12 0.00 1.00 121.00 120.0
## part 6 886 0.56 0.50 1.00 0.58 0.00 0.00 1.00 1.0
## job 7 886 0.35 0.48 0.00 0.31 0.00 0.00 1.00 1.0
## stud_h 8 886 25.29 15.93 25.00 24.34 19.27 0.00 70.00 70.0
## health 9 886 3.78 1.06 4.00 3.90 1.48 1.00 5.00 4.0
## psyt 10 886 0.22 0.42 0.00 0.16 0.00 0.00 1.00 1.0
## jspe 11 886 106.37 8.78 107.00 106.90 8.90 67.00 125.00 58.0
## qcae_cog 12 886 58.53 6.57 58.00 58.53 5.93 37.00 76.00 39.0
## qcae_aff 13 886 34.78 5.38 35.00 34.90 5.93 18.00 48.00 30.0
## amsp 14 886 23.15 4.99 23.00 23.18 4.45 6.00 35.00 29.0
## erec_mean 15 886 0.72 0.09 0.73 0.72 0.09 0.36 0.95 0.6
## cesd 16 886 18.05 11.48 16.00 17.16 11.86 0.00 56.00 56.0
## stai_t 17 886 42.90 11.98 43.00 42.55 11.86 20.00 77.00 57.0
## mbi_ex 18 886 16.88 5.26 17.00 16.82 5.93 5.00 30.00 25.0
## mbi_cy 19 886 10.08 4.59 9.00 9.65 4.45 4.00 24.00 20.0
## mbi_ea 20 886 24.21 4.63 24.00 24.30 4.45 10.00 36.00 26.0
## skew kurtosis se
## id 0.02 -1.21 17.32
## age 2.07 9.34 0.11
## year 0.26 -1.29 0.06
## sex -0.69 -1.10 0.02
## glang 2.27 3.47 1.09
## part -0.25 -1.94 0.02
## job 0.63 -1.60 0.02
## stud_h 0.45 -0.49 0.54
## health -0.88 0.22 0.04
## psyt 1.32 -0.26 0.01
## jspe -0.66 0.70 0.30
## qcae_cog -0.09 0.10 0.22
## qcae_aff -0.20 -0.24 0.18
## amsp -0.11 -0.07 0.17
## erec_mean -0.40 0.22 0.00
## cesd 0.68 -0.09 0.39
## stai_t 0.24 -0.42 0.40
## mbi_ex 0.10 -0.40 0.18
## mbi_cy 0.75 -0.05 0.15
## mbi_ea -0.17 -0.20 0.16
dim(medteach) #shows total number of rows and columns for dataset medteach first number is row, second is column
## [1] 886 20
str(medteach) #using function str on medteach dataset to show the type of each variable and peak into the data, show a few entries of the variable
## 'data.frame': 886 obs. of 20 variables:
## $ id : int 2 4 9 10 13 14 17 21 23 24 ...
## $ age : int 18 26 21 21 21 26 23 23 23 22 ...
## $ year : int 1 4 3 2 3 5 5 4 4 2 ...
## $ sex : int 1 1 2 2 1 2 2 1 2 2 ...
## $ glang : int 120 1 1 1 1 1 1 1 1 1 ...
## $ part : int 1 1 0 0 1 1 1 1 1 1 ...
## $ job : int 0 0 0 1 0 1 0 1 1 0 ...
## $ stud_h : int 56 20 36 51 22 10 15 8 20 20 ...
## $ health : int 3 4 3 5 4 2 3 4 2 5 ...
## $ psyt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ jspe : int 88 109 106 101 102 102 117 118 118 108 ...
## $ qcae_cog : int 62 55 64 52 58 48 58 65 69 56 ...
## $ qcae_aff : int 27 37 39 33 28 37 38 40 46 36 ...
## $ amsp : int 17 22 17 18 21 17 23 32 23 22 ...
## $ erec_mean: num 0.738 0.69 0.69 0.833 0.69 ...
## $ cesd : int 34 7 25 17 14 14 45 6 43 11 ...
## $ stai_t : int 61 33 73 48 46 56 56 36 43 43 ...
## $ mbi_ex : int 17 14 24 16 22 18 28 11 26 18 ...
## $ mbi_cy : int 13 11 7 10 14 15 17 10 21 6 ...
## $ mbi_ea : int 20 26 23 21 23 18 16 27 22 23 ...
sum(is.na(medteach)) #check for total of NA values if there is any in the data set using sum and is.na functions
## [1] 0
attr(medteach$year, "label") <- "Curriculum Year" #set a label for variable year using attr function, assigning the name "Curriculum year" for the label
attr(medteach$year, "label") #check for label
## [1] "Curriculum Year"
attr(medteach$sex, "label") <- "1=Man, 2=Woman, 3=Nonbinary" #set label for variable six using attr function, assignment "1=Man, 2=Woman, 3=Nonbinary" for the label
attr(medteach$sex, "label") #check for label
## [1] "1=Man, 2=Woman, 3=Nonbinary"
attr(medteach$stud_h, "label") <- "Hours of study per week on average" #set a label for variable stud_h using attr function, assigning the name "Hours of study per week on average" for the label
attr(medteach$stud_h, "label") #check for label
## [1] "Hours of study per week on average"
The dataset has a total of 886 rows and 20 columns, meaning that it has 886 observations and 20 variables. There are no missing values reported. Variables of interest include curriculum year, gender, Jefferson Scale of Physician Empathy (JSPE) total scores, Center for Epidemiologic Studies Depression Scale (CES-D) total scores, State-Trait Anxiety Inventory (STAI) scores, and average hours of study per week. Descriptive analysis of all variables was performed. Descriptive analysis revealed that participants were spread across all six curriculum years, with women making up the bulk of the sample. The mean scores indicated typically high levels of empathy, mild depression symptoms, and moderate anxiety levels. On average, participants reported studying for about 25 hours per week.
Using summary() and describe() from the psych package, I further characterized the key variables age, curriculum year, sex, study hours, empathy (JSPE), and depressive symptoms (CES-D). Age ranged from 17 to 49 years (range = 32), with a mean of 22.38 (SD = 3.30) and near-symmetric distribution (skew = 0.07, kurtosis = 9.34, SE = 0.11). Curriculum year ranged from 1 to 6 (range = 5), with a mean of 3.10 (SD = 1.76); the slight positive skew (0.26) and negative kurtosis (-1.29) suggest more students in earlier years and a relatively flat distribution across years. Sex (coded 1 = man, 2 = woman, 3 = nonbinary) had a mean of 1.70 (SD = 0.47, range = 2) with negative skew (-0.69) and kurtosis around -1.10, consistent with most participants identifying as men or women and very few in the third category. Weekly study hours (stud_h) varied widely from 0 to 70 hours (range = 70), with a mean of 25.29 (SD = 15.93); the positive skew (0.45) and slight negative kurtosis (-0.49, SE = 0.54) indicate that a subset of students reported very high study hours. JSPE empathy scores ranged from 67 to 125 (range = 58), with a mean of about 106.37 (SD = 8.8), a modest negative skew (-0.66), and kurtosis around 0.70, suggesting generally high empathy with a slight tail toward lower scores. CES-D depressive symptom scores ranged from 0 to 56 (range = 56), with a mean of 18.05 (SD = 11.48), positive skew (0.68), and near-zero kurtosis (-0.09, SE = 0.39), indicating that while many students had relatively low depressive symptoms, there is a noticeable tail of students with higher scores.
#Q4
medteach$cesd_high <- as.numeric(medteach$cesd >= 16) #create new column (variable) cesd_high for dataset medteach using as.numeric function, if cesd is 16 or higher then put 1, and 0 if not
medteach$stage <- as.numeric(medteach$year >= 4) #create new column (variable) stage for dataset medteach using as.numeric function, if year is 4 or higher then put 1, and 0 if not
For Q4, I created two dichotomous variables and then formed subsets for later analysis. I generated cesd_high to classify students as having high vs low depressive symptoms using the common CES-D screening threshold of 16. Values are coded 1 = CES-D ≥ 16 (high) and 0 = CES-D < 16 (low), with appropriate value labels. This dichotomization supports my research question analysis on depressive symptoms and empathy scores since a CES-D score of over 16 is the cutoff scorere based on the scale (Lewinsohn et al., 1997). Then, using tabstat cesd, by(cesd_high) to verify the transformation, which reports group counts and distribution checks (mean, SD, range).
I collapsed the curriculum year variable into a two-level training stage variable stage: 0 = Preclinical (BMed1-3; years 1–3) and 1 = Clinical (MMed1-3; years 4-6). These transformations allow stratum-specific analyses that are directly aligned with the study aims, which are to compare empathy by depression level within preclinical and clinical students. Value labels were added and then usingthe tabulate function to confirm the variable. This helps simplify reporting and support correlation tests within each stratum that will be used later on. All changes were saved to a Stata dataset (medteach_master.dta) to ensure reproducibility and to enable creating physical subsets for each stratum in later steps.
#Q5
#create 2 subsets from the data set one for preclinical select stage = 0 and clinical for stage = 1, while keeping the following variables: id, age, year, stud_h, jspe, cesd, cesd_high and stage
medteach_preclinical <- medteach[medteach$stage == 0, c(1:4, 8, 11,16, 21:22)]
medteach_clinical <- medteach[medteach$stage == 1, c(1:4, 8, 11,16, 21:22) ]
#For Preclinical subset
#Get a descriptive stats of all variables in the preclinical subset using function describe() from psych package to have more details
describe(medteach_preclinical)
## vars n mean sd median trimmed mad min max range skew
## id 1 523 893.37 521.52 890 892.35 696.82 2 1790 1788 0.00
## age 2 523 21.00 2.91 21 20.60 1.48 17 49 32 3.42
## year 3 523 1.80 0.84 2 1.76 1.48 1 3 2 0.38
## sex 4 523 1.71 0.46 2 1.75 0.00 1 3 2 -0.84
## stud_h 5 523 31.36 15.33 30 31.04 14.83 0 70 70 0.15
## jspe 6 523 104.69 9.25 105 105.14 8.90 67 125 58 -0.55
## cesd 7 523 19.98 11.58 18 19.22 11.86 0 56 56 0.57
## cesd_high 8 523 0.61 0.49 1 0.64 0.00 0 1 1 -0.45
## stage 9 523 0.00 0.00 0 0.00 0.00 0 0 0 NaN
## kurtosis se
## id -1.24 22.80
## age 21.58 0.13
## year -1.48 0.04
## sex -1.13 0.02
## stud_h -0.49 0.67
## jspe 0.48 0.40
## cesd -0.24 0.51
## cesd_high -1.80 0.02
## stage NaN 0.00
#create a frequency table of the dichotomous variable 'cesd_high' in the preclinical subset. This shows how many students are in each group. This is important because it shows how many students have high depressive symptoms
table(medteach_preclinical$cesd_high)
##
## 0 1
## 204 319
prop.table(table(medteach_preclinical$cesd_high)) #Convert the frequency table of 'cesd_high' into proportions of students in this preclinical subset
##
## 0 1
## 0.3900574 0.6099426
#show mean jspe in each depression group using tapply and include any NA values
tapply(medteach_preclinical$jspe, medteach_preclinical$cesd_high, mean, na.rm = TRUE) #summary of jspe by cesd_high (0 and 1)
## 0 1
## 105.7206 104.0313
#show number of observations for jspe in each depression group tapply and include any NA values
tapply(medteach_preclinical$jspe, medteach_preclinical$cesd_high, function(x) sum(!is.na(x))) #summary of jspe by cesd_high (0 and 1)
## 0 1
## 204 319
#spearman correlation in preclinical subset usinh cor function, specify method as spearman
cor(medteach_preclinical$cesd,
medteach_preclinical$jspe,
method = "spearman")
## [1] -0.06791901
#For Clinical Subset
#Get a descriptive stats of all variables in the clinical subset using function describe() from psych package to have more details
describe(medteach_clinical)
## vars n mean sd median trimmed mad min max range skew
## id 1 363 884.43 507.51 866 882.38 625.66 4 1789 1785 0.06
## age 2 363 24.38 2.77 24 23.93 1.48 21 44 23 2.85
## year 3 363 4.97 0.81 5 4.97 1.48 4 6 2 0.05
## sex 4 363 1.68 0.49 2 1.71 0.00 1 3 2 -0.49
## stud_h 5 363 16.54 12.30 15 15.22 11.86 0 65 65 0.98
## jspe 6 363 108.80 7.43 109 109.22 7.41 82 125 43 -0.61
## cesd 7 363 15.26 10.74 13 14.19 10.38 0 54 54 0.88
## cesd_high 8 363 0.39 0.49 0 0.36 0.00 0 1 1 0.44
## stage 9 363 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## kurtosis se
## id -1.17 26.64
## age 11.97 0.15
## year -1.47 0.04
## sex -1.07 0.03
## stud_h 0.79 0.65
## jspe 0.65 0.39
## cesd 0.32 0.56
## cesd_high -1.81 0.03
## stage NaN 0.00
table(medteach_clinical$cesd_high) #create a frequency table of the dichotomous variable 'cesd_high' in the clinical subset. This shows how many students are in each group. This is important because it shows how many students have high depressive symptoms
##
## 0 1
## 221 142
prop.table(table(medteach_clinical$cesd_high)) #convert the frequency table of 'cesd_high' into proportions of students in this clinical subset
##
## 0 1
## 0.6088154 0.3911846
#show mean jspe in each depression group using tapply and include any NA values
tapply(medteach_clinical$jspe, medteach_clinical$cesd_high, mean, na.rm = TRUE) #summary of jspe by cesd_high (0 and 1)
## 0 1
## 108.5204 109.2394
#show number of observations for jspe in each depression group tapply and include any NA values
tapply(medteach_clinical$jspe, medteach_clinical$cesd_high, function(x) sum(!is.na(x))) #summary of jspe by cesd_high (0 and 1)
## 0 1
## 221 142
#spearman correlation in clinical subset using cor function, specify method as spearman
cor(medteach_clinical$cesd,
medteach_clinical$jspe,
method = "spearman")
## [1] -0.008727817
For Q5, I first split the dataset into two subsets based on the dichotomous stage variable: a preclinical group (stage = 0) and a clinical group (stage = 1), keeping only the variables id, age, year, stud_h, jspe, cesd, cesd_high, and stage for further analysis. For each subset, I used describe() to obtain descriptive statistics for all variables (including mean, sd, se and skew, kurtosis for later analysis). I then examined the distribution of depressive symptoms in each group by creating a frequency table and proportions for cesd_high, which shows how many and what proportion of students scored above the CES-D cut-off for high depressive symptoms. Within each subset, I used tapply() to calculate the mean JSPE empathy score and the number of non-missing JSPE observations separately for students with low versus high depressive symptoms, this allows comparison of average empathy levels by depression group in preclinical and clinical students. Finally, I computed the Spearman correlation between CES-D scores and JSPE scores within each subset using cor(…, method = “spearman”), Spearman’s rank correlation was selected instead of Pearson’s correlation to assess the relationship between JSPE and CES-D within both preclinical and clinical group.This correlation shows the direction and strength of the association between depressive symptoms and empathy in preclinical and clinical students, where a negative value would indicate that higher depressive symptoms tend to be associated with lower empathy, and a positive value would indicate the opposite.
For the preclinical students, the Spearman correlation between CES-D scores and JSPE empathy scores was ρ = -0.068. This value is very close to zero, indicating no meaningful association between depressive symptoms and empathy in the preclinical group.
The Spearman correlation between CES-D scores and JSPE empathy scores in the clinical students was ρ = -0.009, indicating essentially no association between depressive symptoms and empathy in this group.
#Q6
#Scatterplot for preclinical students
ggplot(medteach_preclinical, aes(x = cesd, y = jspe)) + #using ggplot functions() from tidyverse package that includes ggplot2 package to plot scatter plot with fitted line
geom_point() + #scatterplot points
geom_smooth(method = "lm", se = FALSE) + #fitted regression line without including standard error on the plot
labs(
title = "Preclinical: JSPE vs CES-D",
x = "CES-D score",
y = "JSPE empathy score")
## `geom_smooth()` using formula = 'y ~ x'
This scatterplot shows the relationship between CES-D scores (x-axis,
for depressive symptoms) and JSPE empathy scores (y-axis) among
preclinical students. Each point represents one student. CES-D scores
range from about 0 to just over 50, while empathy scores range roughly
from 70 to 125. The points form a fairly wide cloud rather than a clear
line, and there is no clear strong upward or downward trend. This
suggests that there is little to no association between depressive
symptoms and empathy scores meaning in the preclinical group, students
with higher CES-D scores do not consistently show higher or lower
empathy compared with those who have lower CES-D scores.
#Scatterplot for clinical students
ggplot(medteach_clinical, aes(x = cesd, y = jspe)) + #using ggplot functions() from tidyverse package that includes ggplot2 package to plot scatter plot with fitted line
geom_point() + #scatterplot points
geom_smooth(method = "lm", se = FALSE) + #fitted regression line without including standard error on the plot
labs(
title = "Clinical: JSPE vs CES-D",
x = "CES-D score",
y = "JSPE empathy score")
## `geom_smooth()` using formula = 'y ~ x'
This scatterplot shows the relationship between CES-D scores (x-axis, for depressive symptoms) and JSPE empathy scores (y-axis) among clinical students. Each point represents one student. CES-D scores range from about 0 to just over 50, while empathy scores are mostly between about 90 and 125. The points form a fairly concentrated point without a clear upward or downward trend (from fitted line), suggesting little to no association between depressive symptoms and empathy in clinical students. In other words, students with higher CES-D scores do not consistently have clearly higher or lower empathy scores compared with those with lower CES-D scores.
Overall, although empathy scores were slightly lower among students with higher CES-D values, the differences were small, and the Spearman correlations in both groups were weak and not statistically significant. This suggests that depressive symptoms may not play a major role in shaping empathy levels at either stage of medical training. The scatterplots further supported this pattern, where data points were widely dispersed indicating little to no linear trend when evaluating fitted line across both graphs. Empathy tended to remain high across both groups, while variations in depressive symptoms showed minimal association with empathy scores.
R Program Pros: R is free and open-source, so it is easily accessible on any computer without a license. It has many packages for advanced statistics, graphics, and reproducible reports like R Markdown, which is useful for more complex analyses and visualization.
Cons: Different packages sometimes do similar things in slightly different ways, which can be confusing for beginners. Need to be careful whenever using packages to do analysis because anyone can create a function, so it might not be statistically correct. The learning curve can be steep because almost everything is done through coding rather than menus. And it’s hard to know how a function works if it’s not a well-established package.
Stata Pros: Stata has a user-friendly interface with clear commands and help files, which makes it easier to learn basic analyses. It is very stable and widely used in applied research, so many tutorials and example codes are available, especially in epidemiology and social sciences.
Cons: Stata requires a paid license, which can be expensive and limits access outside of institutional computers. You can only work on 1 dataset at a time, which is not flexible and can be difficult when working with a large dataset (that needs stratified or breaking down into multiple subsets for analysis) or multiple datasets.
References Carrard, et al. (2022). Medical Students’ Empathy Mental Health and Burnout in Switzerland: A Cross Sectional Analysis. Kaggle. Retrieved October 20, 2025, from https://www.kaggle.com/code/a3amat02/medical-students-analysis-and-classification/input.