BIOSTAT9519 - Project 2

Dataset selection: Medical Students’ Empathy Mental Health and Burnout in Switzerland: A Cross Sectional Analysis by Carrard et al. 2022

The dataset “Medical Students’ Empathy, Mental Health, and Burnout in Switzerland: A Cross-Sectional Analysis” by Carrard et al. (2022) contains survey data from 886 Swiss medical students examining the relationships between empathy, mental health, and burnout. It includes 20 variables covering demographic and academic factors (such as age, gender, study year, study hours, health status), psychological measures (Jefferson Scale of Physician Empathy, Cognitive and Affective Empathy subscales, Attitudes toward Mental Suffering and Patients), and mental-health indicators (depression, anxiety, and three burnout dimensions). The dataset enables cross-sectional analyses of how empathy and psychological well-being vary across medical training stages and demographic groups, providing valuable insights into factors contributing to student burnout and emotional resilience.

Research question: Among Swiss medical students, is there a relationship between depressive symptoms (CES-D) and physician empathy (JSPE), and does this relationship differ between preclinical and clinical students?

library(tidyverse) #load library tidyverse

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(psych) #load the package for each session

## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

medteach <- read.csv("C:/Users/ASUS/Downloads/medteach.csv") #imported data set name it medteach using read.csv function because the dataset is a csv file

Describe the number of rows, columns, variable types, different values, quartiles, etc.

#Q3
summary(medteach) #using summary function to show descriptive stats for dataset medteach, and it gives descriptive stats for each variable such as range, quantiles, mean, median

##        id              age             year            sex       
##  Min.   :   2.0   Min.   :17.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 447.5   1st Qu.:20.00   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 876.0   Median :22.00   Median :3.000   Median :2.000  
##  Mean   : 889.7   Mean   :22.38   Mean   :3.103   Mean   :1.695  
##  3rd Qu.:1341.8   3rd Qu.:24.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1790.0   Max.   :49.00   Max.   :6.000   Max.   :3.000  
##      glang             part             job             stud_h     
##  Min.   :  1.00   Min.   :0.0000   Min.   :0.0000   Min.   : 0.00  
##  1st Qu.:  1.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:12.00  
##  Median :  1.00   Median :1.0000   Median :0.0000   Median :25.00  
##  Mean   : 14.33   Mean   :0.5632   Mean   :0.3488   Mean   :25.29  
##  3rd Qu.:  1.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:36.00  
##  Max.   :121.00   Max.   :1.0000   Max.   :1.0000   Max.   :70.00  
##      health           psyt             jspe          qcae_cog    
##  Min.   :1.000   Min.   :0.0000   Min.   : 67.0   Min.   :37.00  
##  1st Qu.:3.000   1st Qu.:0.0000   1st Qu.:101.0   1st Qu.:54.00  
##  Median :4.000   Median :0.0000   Median :107.0   Median :58.00  
##  Mean   :3.778   Mean   :0.2246   Mean   :106.4   Mean   :58.53  
##  3rd Qu.:5.000   3rd Qu.:0.0000   3rd Qu.:113.0   3rd Qu.:63.00  
##  Max.   :5.000   Max.   :1.0000   Max.   :125.0   Max.   :76.00  
##     qcae_aff          amsp         erec_mean           cesd      
##  Min.   :18.00   Min.   : 6.00   Min.   :0.3571   Min.   : 0.00  
##  1st Qu.:31.00   1st Qu.:20.00   1st Qu.:0.6667   1st Qu.: 9.00  
##  Median :35.00   Median :23.00   Median :0.7262   Median :16.00  
##  Mean   :34.78   Mean   :23.15   Mean   :0.7201   Mean   :18.05  
##  3rd Qu.:39.00   3rd Qu.:26.75   3rd Qu.:0.7857   3rd Qu.:25.00  
##  Max.   :48.00   Max.   :35.00   Max.   :0.9524   Max.   :56.00  
##      stai_t         mbi_ex          mbi_cy          mbi_ea     
##  Min.   :20.0   Min.   : 5.00   Min.   : 4.00   Min.   :10.00  
##  1st Qu.:34.0   1st Qu.:13.00   1st Qu.: 6.00   1st Qu.:21.00  
##  Median :43.0   Median :17.00   Median : 9.00   Median :24.00  
##  Mean   :42.9   Mean   :16.88   Mean   :10.08   Mean   :24.21  
##  3rd Qu.:51.0   3rd Qu.:20.00   3rd Qu.:13.00   3rd Qu.:28.00  
##  Max.   :77.0   Max.   :30.00   Max.   :24.00   Max.   :36.00

describe(medteach) #using function describe() from package psych to show descriptive stats, do similar things to summary but have add details such as standard deviation, skew, kurtosis and standard error

##           vars   n   mean     sd median trimmed    mad   min     max  range
## id           1 886 889.71 515.56 876.00  888.23 667.91  2.00 1790.00 1788.0
## age          2 886  22.38   3.30  22.00   22.04   2.97 17.00   49.00   32.0
## year         3 886   3.10   1.76   3.00    3.00   2.97  1.00    6.00    5.0
## sex          4 886   1.70   0.47   2.00    1.74   0.00  1.00    3.00    2.0
## glang        5 886  14.33  32.37   1.00    5.12   0.00  1.00  121.00  120.0
## part         6 886   0.56   0.50   1.00    0.58   0.00  0.00    1.00    1.0
## job          7 886   0.35   0.48   0.00    0.31   0.00  0.00    1.00    1.0
## stud_h       8 886  25.29  15.93  25.00   24.34  19.27  0.00   70.00   70.0
## health       9 886   3.78   1.06   4.00    3.90   1.48  1.00    5.00    4.0
## psyt        10 886   0.22   0.42   0.00    0.16   0.00  0.00    1.00    1.0
## jspe        11 886 106.37   8.78 107.00  106.90   8.90 67.00  125.00   58.0
## qcae_cog    12 886  58.53   6.57  58.00   58.53   5.93 37.00   76.00   39.0
## qcae_aff    13 886  34.78   5.38  35.00   34.90   5.93 18.00   48.00   30.0
## amsp        14 886  23.15   4.99  23.00   23.18   4.45  6.00   35.00   29.0
## erec_mean   15 886   0.72   0.09   0.73    0.72   0.09  0.36    0.95    0.6
## cesd        16 886  18.05  11.48  16.00   17.16  11.86  0.00   56.00   56.0
## stai_t      17 886  42.90  11.98  43.00   42.55  11.86 20.00   77.00   57.0
## mbi_ex      18 886  16.88   5.26  17.00   16.82   5.93  5.00   30.00   25.0
## mbi_cy      19 886  10.08   4.59   9.00    9.65   4.45  4.00   24.00   20.0
## mbi_ea      20 886  24.21   4.63  24.00   24.30   4.45 10.00   36.00   26.0
##            skew kurtosis    se
## id         0.02    -1.21 17.32
## age        2.07     9.34  0.11
## year       0.26    -1.29  0.06
## sex       -0.69    -1.10  0.02
## glang      2.27     3.47  1.09
## part      -0.25    -1.94  0.02
## job        0.63    -1.60  0.02
## stud_h     0.45    -0.49  0.54
## health    -0.88     0.22  0.04
## psyt       1.32    -0.26  0.01
## jspe      -0.66     0.70  0.30
## qcae_cog  -0.09     0.10  0.22
## qcae_aff  -0.20    -0.24  0.18
## amsp      -0.11    -0.07  0.17
## erec_mean -0.40     0.22  0.00
## cesd       0.68    -0.09  0.39
## stai_t     0.24    -0.42  0.40
## mbi_ex     0.10    -0.40  0.18
## mbi_cy     0.75    -0.05  0.15
## mbi_ea    -0.17    -0.20  0.16

dim(medteach) #shows total number of rows and columns for dataset medteach first number is row, second is column

## [1] 886  20

str(medteach) #using function str on medteach dataset to show the type of each variable and peak into the data, show a few entries of the variable

## 'data.frame':    886 obs. of  20 variables:
##  $ id       : int  2 4 9 10 13 14 17 21 23 24 ...
##  $ age      : int  18 26 21 21 21 26 23 23 23 22 ...
##  $ year     : int  1 4 3 2 3 5 5 4 4 2 ...
##  $ sex      : int  1 1 2 2 1 2 2 1 2 2 ...
##  $ glang    : int  120 1 1 1 1 1 1 1 1 1 ...
##  $ part     : int  1 1 0 0 1 1 1 1 1 1 ...
##  $ job      : int  0 0 0 1 0 1 0 1 1 0 ...
##  $ stud_h   : int  56 20 36 51 22 10 15 8 20 20 ...
##  $ health   : int  3 4 3 5 4 2 3 4 2 5 ...
##  $ psyt     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ jspe     : int  88 109 106 101 102 102 117 118 118 108 ...
##  $ qcae_cog : int  62 55 64 52 58 48 58 65 69 56 ...
##  $ qcae_aff : int  27 37 39 33 28 37 38 40 46 36 ...
##  $ amsp     : int  17 22 17 18 21 17 23 32 23 22 ...
##  $ erec_mean: num  0.738 0.69 0.69 0.833 0.69 ...
##  $ cesd     : int  34 7 25 17 14 14 45 6 43 11 ...
##  $ stai_t   : int  61 33 73 48 46 56 56 36 43 43 ...
##  $ mbi_ex   : int  17 14 24 16 22 18 28 11 26 18 ...
##  $ mbi_cy   : int  13 11 7 10 14 15 17 10 21 6 ...
##  $ mbi_ea   : int  20 26 23 21 23 18 16 27 22 23 ...

sum(is.na(medteach)) #check for total of NA values if there is any in the data set using sum and is.na functions

## [1] 0

attr(medteach$year, "label") <- "Curriculum Year" #set a label for variable year using attr function, assigning the name "Curriculum year" for the label
attr(medteach$year, "label") #check for label

## [1] "Curriculum Year"

attr(medteach$sex, "label") <- "1=Man, 2=Woman, 3=Nonbinary" #set label for variable six using attr function, assignment "1=Man, 2=Woman, 3=Nonbinary" for the label
attr(medteach$sex, "label") #check for label

## [1] "1=Man, 2=Woman, 3=Nonbinary"

attr(medteach$stud_h, "label") <- "Hours of study per week on average" #set a label for variable stud_h using attr function, assigning the name "Hours of study per week on average" for the label
attr(medteach$stud_h, "label") #check for label

## [1] "Hours of study per week on average"

The dataset has a total of 886 rows and 20 columns, meaning that it has 886 observations and 20 variables. There are no missing values reported. Variables of interest include curriculum year, gender, Jefferson Scale of Physician Empathy (JSPE) total scores, Center for Epidemiologic Studies Depression Scale (CES-D) total scores, State-Trait Anxiety Inventory (STAI) scores, and average hours of study per week. Descriptive analysis of all variables was performed. Descriptive analysis revealed that participants were spread across all six curriculum years, with women making up the bulk of the sample. The mean scores indicated typically high levels of empathy, mild depression symptoms, and moderate anxiety levels. On average, participants reported studying for about 25 hours per week.

Using summary() and describe() from the psych package, I further characterized the key variables age, curriculum year, sex, study hours, empathy (JSPE), and depressive symptoms (CES-D). Age ranged from 17 to 49 years (range = 32), with a mean of 22.38 (SD = 3.30) and near-symmetric distribution (skew = 0.07, kurtosis = 9.34, SE = 0.11). Curriculum year ranged from 1 to 6 (range = 5), with a mean of 3.10 (SD = 1.76); the slight positive skew (0.26) and negative kurtosis (-1.29) suggest more students in earlier years and a relatively flat distribution across years. Sex (coded 1 = man, 2 = woman, 3 = nonbinary) had a mean of 1.70 (SD = 0.47, range = 2) with negative skew (-0.69) and kurtosis around -1.10, consistent with most participants identifying as men or women and very few in the third category. Weekly study hours (stud_h) varied widely from 0 to 70 hours (range = 70), with a mean of 25.29 (SD = 15.93); the positive skew (0.45) and slight negative kurtosis (-0.49, SE = 0.54) indicate that a subset of students reported very high study hours. JSPE empathy scores ranged from 67 to 125 (range = 58), with a mean of about 106.37 (SD = 8.8), a modest negative skew (-0.66), and kurtosis around 0.70, suggesting generally high empathy with a slight tail toward lower scores. CES-D depressive symptom scores ranged from 0 to 56 (range = 56), with a mean of 18.05 (SD = 11.48), positive skew (0.68), and near-zero kurtosis (-0.09, SE = 0.39), indicating that while many students had relatively low depressive symptoms, there is a noticeable tail of students with higher scores.

Using at least two different statistical software. Create a data transformation, i.e., a numerical, dichotomization or categorization involving a relevant factor.

#Q4
medteach$cesd_high <- as.numeric(medteach$cesd >= 16) #create new column (variable) cesd_high for dataset medteach using as.numeric function, if cesd is 16 or higher then put 1, and 0 if not

medteach$stage <- as.numeric(medteach$year >= 4) #create new column (variable) stage for dataset medteach using as.numeric function, if year is 4 or higher then put 1, and 0 if not

For Q4, I created two dichotomous variables and then formed subsets for later analysis. I generated cesd_high to classify students as having high vs low depressive symptoms using the common CES-D screening threshold of 16. Values are coded 1 = CES-D ≥ 16 (high) and 0 = CES-D < 16 (low), with appropriate value labels. This dichotomization supports my research question analysis on depressive symptoms and empathy scores since a CES-D score of over 16 is the cutoff scorere based on the scale (Lewinsohn et al., 1997). Then, using tabstat cesd, by(cesd_high) to verify the transformation, which reports group counts and distribution checks (mean, SD, range).

I collapsed the curriculum year variable into a two-level training stage variable stage: 0 = Preclinical (BMed1-3; years 1–3) and 1 = Clinical (MMed1-3; years 4-6). These transformations allow stratum-specific analyses that are directly aligned with the study aims, which are to compare empathy by depression level within preclinical and clinical students. Value labels were added and then usingthe tabulate function to confirm the variable. This helps simplify reporting and support correlation tests within each stratum that will be used later on. All changes were saved to a Stata dataset (medteach_master.dta) to ensure reproducibility and to enable creating physical subsets for each stratum in later steps.

Identify potential strata/subgroups in your data that may improve your understanding of the research question by performing stratum-specific analyses. Using at least two different statistical software, create a subset of your dataset and summarize each stratum/subgroup separately.

#Q5
#create 2 subsets from the data set one for preclinical select stage = 0 and clinical for stage = 1, while keeping the following variables: id, age, year, stud_h, jspe, cesd, cesd_high and stage
medteach_preclinical <- medteach[medteach$stage == 0, c(1:4, 8, 11,16, 21:22)]
medteach_clinical <- medteach[medteach$stage == 1, c(1:4, 8, 11,16, 21:22) ]

#For Preclinical subset
#Get a descriptive stats of all variables in the preclinical subset using function describe() from psych package to have more details
describe(medteach_preclinical)

##           vars   n   mean     sd median trimmed    mad min  max range  skew
## id           1 523 893.37 521.52    890  892.35 696.82   2 1790  1788  0.00
## age          2 523  21.00   2.91     21   20.60   1.48  17   49    32  3.42
## year         3 523   1.80   0.84      2    1.76   1.48   1    3     2  0.38
## sex          4 523   1.71   0.46      2    1.75   0.00   1    3     2 -0.84
## stud_h       5 523  31.36  15.33     30   31.04  14.83   0   70    70  0.15
## jspe         6 523 104.69   9.25    105  105.14   8.90  67  125    58 -0.55
## cesd         7 523  19.98  11.58     18   19.22  11.86   0   56    56  0.57
## cesd_high    8 523   0.61   0.49      1    0.64   0.00   0    1     1 -0.45
## stage        9 523   0.00   0.00      0    0.00   0.00   0    0     0   NaN
##           kurtosis    se
## id           -1.24 22.80
## age          21.58  0.13
## year         -1.48  0.04
## sex          -1.13  0.02
## stud_h       -0.49  0.67
## jspe          0.48  0.40
## cesd         -0.24  0.51
## cesd_high    -1.80  0.02
## stage          NaN  0.00

#create a frequency table of the dichotomous variable 'cesd_high' in the preclinical subset. This shows how many students are in each group. This is important because it shows how many students have high depressive symptoms
table(medteach_preclinical$cesd_high)

## 
##   0   1 
## 204 319

prop.table(table(medteach_preclinical$cesd_high)) #Convert the frequency table of 'cesd_high' into proportions of students in this preclinical subset

## 
##         0         1 
## 0.3900574 0.6099426

#show mean jspe in each depression group using tapply and include any NA values
tapply(medteach_preclinical$jspe, medteach_preclinical$cesd_high, mean, na.rm = TRUE) #summary of jspe by cesd_high (0 and 1)

##        0        1 
## 105.7206 104.0313

#show number of observations for jspe in each depression group tapply and include any NA values
tapply(medteach_preclinical$jspe, medteach_preclinical$cesd_high, function(x) sum(!is.na(x))) #summary of jspe by cesd_high (0 and 1)

##   0   1 
## 204 319

#spearman correlation in preclinical subset usinh cor function, specify method as spearman
cor(medteach_preclinical$cesd,
    medteach_preclinical$jspe,
    method = "spearman")

## [1] -0.06791901

#For Clinical Subset
#Get a descriptive stats of all variables in the clinical subset using function describe() from psych package to have more details
describe(medteach_clinical)

##           vars   n   mean     sd median trimmed    mad min  max range  skew
## id           1 363 884.43 507.51    866  882.38 625.66   4 1789  1785  0.06
## age          2 363  24.38   2.77     24   23.93   1.48  21   44    23  2.85
## year         3 363   4.97   0.81      5    4.97   1.48   4    6     2  0.05
## sex          4 363   1.68   0.49      2    1.71   0.00   1    3     2 -0.49
## stud_h       5 363  16.54  12.30     15   15.22  11.86   0   65    65  0.98
## jspe         6 363 108.80   7.43    109  109.22   7.41  82  125    43 -0.61
## cesd         7 363  15.26  10.74     13   14.19  10.38   0   54    54  0.88
## cesd_high    8 363   0.39   0.49      0    0.36   0.00   0    1     1  0.44
## stage        9 363   1.00   0.00      1    1.00   0.00   1    1     0   NaN
##           kurtosis    se
## id           -1.17 26.64
## age          11.97  0.15
## year         -1.47  0.04
## sex          -1.07  0.03
## stud_h        0.79  0.65
## jspe          0.65  0.39
## cesd          0.32  0.56
## cesd_high    -1.81  0.03
## stage          NaN  0.00

table(medteach_clinical$cesd_high) #create a frequency table of the dichotomous variable 'cesd_high' in the clinical subset. This shows how many students are in each group. This is important because it shows how many students have high depressive symptoms

## 
##   0   1 
## 221 142

prop.table(table(medteach_clinical$cesd_high)) #convert the frequency table of 'cesd_high' into proportions of students in this clinical subset

## 
##         0         1 
## 0.6088154 0.3911846

#show mean jspe in each depression group using tapply and include any NA values
tapply(medteach_clinical$jspe, medteach_clinical$cesd_high, mean, na.rm = TRUE) #summary of jspe by cesd_high (0 and 1)

##        0        1 
## 108.5204 109.2394

#show number of observations for jspe in each depression group tapply and include any NA values
tapply(medteach_clinical$jspe, medteach_clinical$cesd_high, function(x) sum(!is.na(x))) #summary of jspe by cesd_high (0 and 1)

##   0   1 
## 221 142

#spearman correlation in clinical subset using cor function, specify method as spearman
cor(medteach_clinical$cesd,
    medteach_clinical$jspe,
    method = "spearman")

## [1] -0.008727817

For Q5, I first split the dataset into two subsets based on the dichotomous stage variable: a preclinical group (stage = 0) and a clinical group (stage = 1), keeping only the variables id, age, year, stud_h, jspe, cesd, cesd_high, and stage for further analysis. For each subset, I used describe() to obtain descriptive statistics for all variables (including mean, sd, se and skew, kurtosis for later analysis). I then examined the distribution of depressive symptoms in each group by creating a frequency table and proportions for cesd_high, which shows how many and what proportion of students scored above the CES-D cut-off for high depressive symptoms. Within each subset, I used tapply() to calculate the mean JSPE empathy score and the number of non-missing JSPE observations separately for students with low versus high depressive symptoms, this allows comparison of average empathy levels by depression group in preclinical and clinical students. Finally, I computed the Spearman correlation between CES-D scores and JSPE scores within each subset using cor(…, method = “spearman”), Spearman’s rank correlation was selected instead of Pearson’s correlation to assess the relationship between JSPE and CES-D within both preclinical and clinical group.This correlation shows the direction and strength of the association between depressive symptoms and empathy in preclinical and clinical students, where a negative value would indicate that higher depressive symptoms tend to be associated with lower empathy, and a positive value would indicate the opposite.

For the preclinical students, the Spearman correlation between CES-D scores and JSPE empathy scores was ρ = -0.068. This value is very close to zero, indicating no meaningful association between depressive symptoms and empathy in the preclinical group.

The Spearman correlation between CES-D scores and JSPE empathy scores in the clinical students was ρ = -0.009, indicating essentially no association between depressive symptoms and empathy in this group.

Using your favourite statistical software. Generate a data visualization that can aid answering your research question.

#Q6 
#Scatterplot for preclinical students
ggplot(medteach_preclinical, aes(x = cesd, y = jspe)) + #using ggplot functions() from tidyverse package that includes ggplot2 package to plot scatter plot with fitted line
  geom_point() +                                  #scatterplot points
  geom_smooth(method = "lm", se = FALSE) + #fitted regression line without including standard error on the plot
  labs(
    title = "Preclinical: JSPE vs CES-D",
    x = "CES-D score",
    y = "JSPE empathy score")

## `geom_smooth()` using formula = 'y ~ x'

This scatterplot shows the relationship between CES-D scores (x-axis, for depressive symptoms) and JSPE empathy scores (y-axis) among preclinical students. Each point represents one student. CES-D scores range from about 0 to just over 50, while empathy scores range roughly from 70 to 125. The points form a fairly wide cloud rather than a clear line, and there is no clear strong upward or downward trend. This suggests that there is little to no association between depressive symptoms and empathy scores meaning in the preclinical group, students with higher CES-D scores do not consistently show higher or lower empathy compared with those who have lower CES-D scores.

#Scatterplot for clinical students
ggplot(medteach_clinical, aes(x = cesd, y = jspe)) + #using ggplot functions() from tidyverse package that includes ggplot2 package to plot scatter plot with fitted line
  geom_point() +                                  #scatterplot points
  geom_smooth(method = "lm", se = FALSE) + #fitted regression line without including standard error on the plot
  labs(
    title = "Clinical: JSPE vs CES-D",
    x = "CES-D score",
    y = "JSPE empathy score")

## `geom_smooth()` using formula = 'y ~ x'

This scatterplot shows the relationship between CES-D scores (x-axis, for depressive symptoms) and JSPE empathy scores (y-axis) among clinical students. Each point represents one student. CES-D scores range from about 0 to just over 50, while empathy scores are mostly between about 90 and 125. The points form a fairly concentrated point without a clear upward or downward trend (from fitted line), suggesting little to no association between depressive symptoms and empathy in clinical students. In other words, students with higher CES-D scores do not consistently have clearly higher or lower empathy scores compared with those with lower CES-D scores.

Overall, although empathy scores were slightly lower among students with higher CES-D values, the differences were small, and the Spearman correlations in both groups were weak and not statistically significant. This suggests that depressive symptoms may not play a major role in shaping empathy levels at either stage of medical training. The scatterplots further supported this pattern, where data points were widely dispersed indicating little to no linear trend when evaluating fitted line across both graphs. Empathy tended to remain high across both groups, while variations in depressive symptoms showed minimal association with empathy scores.

Considering two statistical software you used in the previous points. List of at least two pros and cons for each software.

R Program Pros: R is free and open-source, so it is easily accessible on any computer without a license. It has many packages for advanced statistics, graphics, and reproducible reports like R Markdown, which is useful for more complex analyses and visualization.

Cons: Different packages sometimes do similar things in slightly different ways, which can be confusing for beginners. Need to be careful whenever using packages to do analysis because anyone can create a function, so it might not be statistically correct. The learning curve can be steep because almost everything is done through coding rather than menus. And it’s hard to know how a function works if it’s not a well-established package.

Stata Pros: Stata has a user-friendly interface with clear commands and help files, which makes it easier to learn basic analyses. It is very stable and widely used in applied research, so many tutorials and example codes are available, especially in epidemiology and social sciences.

Cons: Stata requires a paid license, which can be expensive and limits access outside of institutional computers. You can only work on 1 dataset at a time, which is not flexible and can be difficult when working with a large dataset (that needs stratified or breaking down into multiple subsets for analysis) or multiple datasets.

References Carrard, et al. (2022). Medical Students’ Empathy Mental Health and Burnout in Switzerland: A Cross Sectional Analysis. Kaggle. Retrieved October 20, 2025, from https://www.kaggle.com/code/a3amat02/medical-students-analysis-and-classification/input.

BIOSTAT9519 - Project 2

Ha Ai Vi Vo

2025-12-03