Introduction

Obstetrics and gynecology clerkship directors and medical students alike desire reliable prognostic matching information tailored to the individual. One predictive tool is the nomogram, which creates a simple graphical representation of a statistical predictive model that generates a numerical probability of an event. There is an inverse relationship between model accuracy and model interpretability. Here was have chosen for a more interpretable model with a simple logistic regression. The goal is to predict accurately whether a new student, not in this dataset, will match or not. The outcome variable is a variable called Match_Status: coded as “Matched” or “Not_Matched”. Other classification questions are to be or not to be, to vote or not to vote, and to click or not to click.

Objective: We sought to construct and validate a model that predict a medical student’s chances of matching into an obstetrics and gynecology residency.

This is the skeleton I used for creating the lrm and the GLM: * Factorize categorical variables because R didn’t do the job * Remove collinear variables * Add relevant power-transformations * Add relevant variable interaction * Remove insignificant variables with relevant testing criteria * Repeat above steps until you’ve exhausted your options * Remove any outlying datapoints * Evaluate your model * Visualizing your findings

Create folder structure, load bespoke functions, and Load necessary R libraries:

Load data

Data Dictionary

data_dictionary <- read_csv("~/Dropbox (Personal)/Nomogram/nomogram/data_dictionary.csv")
knitr::kable(x = data_dictionary, align = "l", caption = "Data Dictionary of Features", padding = 2) %>% 
    kable_styling(bootstrap_options = "striped", full_width = F) %>% kable_paper("hover", full_width = F) %>% 
    # row_spec(0, angle = -45) %>%
footnote(c("See 03-Feature_Selection.Rmd for more details"))
Data Dictionary of Features
Varialble Name Data Type Definition
Match_Status factor Not_Matched versus Matched, Did the applicant match? Based on search of “meet our residents” web page.
Gender factor Male versus Female. One person did not list gender and was removed for privacy concerns.
Age double Age at the time they applied to OBGYN residency.
US_or_Canadian_Applicant factor Yes versus No, US or Canadian Senior applicant
Medical_Degree factor MD versus DO
Military_Service_Obligation factor Yes or No
Visa_Sponsorship_Needed factor Yes or No
Medical_Education_or_Training_Interrupted factor Yes or No
Misdemeanor_Conviction factor Yes or No
Alpha_Omega_Alpha factor Yes or No
Gold_Humanism_Honor_Society factor Yes or No
Couples_Match factor Yes or No
Count_of_Oral_Presentation integer Number
Count_of_Peer_Reviewed_Book_Chapter integer Number
Count_of_Peer_Reviewed_Journal_Articles_Abstracts integer Number
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published integer Number
Count_of_Poster_Presentation integer Number
Year factor Season of application (2017, 2018, 2019, 2020)
USMLE_Pass_Fail_replaced factor Passed versus Failed attempt
Location factor Residency site of application: CU, Duke, Baylor Scott & White, Cleveland Clinic Akron,
Meeting_Name_Presented factor Did not present at a meeting versus Presented at a meeting
TopNIHfunded factor Attended a NIH top-funded Medical School versus Did not attend NIH top-funded medical school
Higher_Education_Institution factor Ivy League College vs. No Ivy league college
Higher_Education_Degree factor BS vs. Not a BS degree
Interest_Group factor Did they participate in OBGYN interest group
Birth_Place factor Born outside the USA versus Born in the USA
Language_Fluency factor Speaks English and Another Language versus Speaks English
ACLS factor Yes or No
PALS factor Yes or No
Other_Service_Obligation factor National Service Corps versus No
Medical_Training_Discipline factor Prior Residency Training versus No prior residency training
Photo_Received factor Yes or No
Tracks_Applied_by_Applicant_1 factor Applying for a preliminary position versus Categorical applicant
AMA factor AMA membership versus No AMA membership
ACOG factor ACOG Member versus No ACOG member
Latin_Honors character Cum laude versus no latin honors
Scholarship character Scholarship versus No_scholarship
Grant character Grant_funding versus No_Grant_funding
Phi_beta_kappa character Phi_Beta_Kappa versus No_Phi_Beta_Kappa
NCAA character NCAA_athlete versus Not_a_NCAA_athlete
Boy_Scouts character Boy_Scout/Girl_Scout versus Not_a_Boy_Scout/Girl_Scout
Valedictorian character Valedictorian versus Not_a_Valedictorian
NIH character NIH_present versus No_NIH_involvement
NCI character NCI_present versus No_NCI_involvement
total_OBGYN_letter_writers integer Total number of letters written by OBGYNs
reco_count integer Total number of letters of recommendation available
number_of_applicant_first_author_publications integer Number of first author publications
Advance_Degree factor M.B.A. versus No Advanced Degree versus Ph.D. versus Other
Birth_Resident_status character Born outside the USA and now lives in USA versus Born in USA and lives in USA
female_reco_letter integer Number of female letter of recommendation authors
Note:
See 03-Feature_Selection.Rmd for more details
# Remove year and location columns from data as those are not modifiable risk factors.
all_data %<>% select(-Year, -Location)

info <- sessionInfo()
r_ver <- paste(info$R.version$major, info$R.version$minor, sep = ".")

All analyses were performed using R (ver. 4.0.3)[R Core Team, 2013] and packages rms (ver. 6.1-0) [F. Harrell, 2014] for analysis, Gmisc for plot and table output (ver. 1.11.0), and knitr (ver 1.30) [Xie, 2013] for reproducible research.

Data Preparation: Factorize categorical variables

# Copy and pasted fullformula output here.
var.labels <- c(ACLS = "ACLS", ACOG = "ACOG member", Advance_Degree = "Additional Degree", Age = "Age, years", 
    Alpha_Omega_Alpha = "AOA member", AMA = "AMA member", Birth_Place = "Location of Birth", Birth_Resident_status = "Immigrant", 
    Boy_Scouts = "Boy or Girl Scout", Count_of_Poster_Presentation = "Count of Poster Presentations", 
    Count_of_Oral_Presentation = "Count of Oral Presentations", Count_of_Peer_Reviewed_Journal_Articles_Abstracts = "Count of Published Abstracts", 
    Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published = "Count of Other Published Products", 
    Count_of_Online_Publications = "Count of Online Publications", Count_of_Peer_Reviewed_Book_Chapter = "Count of Book Chapters", 
    Couples_Match = "Couples Match", female_reco_letter = "Number of letters of recommendation written by women", 
    Gender = "Sex", Gold_Humanism_Honor_Society = "Gold Humanism membership", Grant = "Received grants", 
    Higher_Education_Degree = "Undergraduate degree", Higher_Education_Institution = "Ivy League College Education", 
    Interest_Group = "Participates in OBGYN Interest Group", Language_Fluency = "Speaks foreign language", 
    Latin_Honors = "Graduated cum laude", Medical_Degree = "Medical degree", Medical_Education_or_Training_Interrupted = "Medical Education Interrupted", 
    Medical_Training_Discipline = "Previous residency training", Meeting_Name_Presented = "Presented at a scientific meeting", 
    Military_Service_Obligation = "Military service obligation", Misdemeanor_Conviction = "Prior Misdemeanor", 
    NCAA = "College athlete", NCI = "National Cancer Institute experience", NIH = "National Institutes of Health experience", 
    number_of_applicant_first_author_publications = "Count of First-Author Publications", Other_Service_Obligation = "National Health Service Corps", 
    PALS = "PALS", Phi_beta_kappa = "Phi Beta Kappa membership", Photo_Received = "Photo in Application", 
    reco_count = "Number of letters of recommendation", Scholarship = "Received scholarship", TopNIHfunded = "Attended Top NIH funded Medical School", 
    total_OBGYN_letter_writers = "Count of OBGYN writing letter of recommendation", Tracks_Applied_by_Applicant_1 = "Application Track", 
    US_or_Canadian_Applicant = "US or Canadian Applicant", USMLE_Pass_Fail_replaced = "USMLE Step 1", 
    Valedictorian = "Prior valedictorian", Visa_Sponsorship_Needed = "Visa Sponsorship needed", y = "Match Outcome")

# These Hmisc labels will become the labels on the nomogram ************
Hmisc::label(all_data$ACLS) = "ACLS"
Hmisc::label(all_data$ACOG) = "ACOG member"
Hmisc::label(all_data$Advance_Degree) = "Additional Degree"
Hmisc::label(all_data$Age) = "Age"
units(all_data$Age) <- "years"
Hmisc::label(all_data$Alpha_Omega_Alpha) = "AOA member"
Hmisc::label(all_data$AMA) = "AMA member"
Hmisc::label(all_data$Birth_Place) = "Location of Birth"
Hmisc::label(all_data$Birth_Resident_status) = "Immigrant"
Hmisc::label(all_data$Boy_Scouts) = "Boy or Girl Scout"
Hmisc::label(all_data$Count_of_Poster_Presentation) = "Count of Poster Presentations"
Hmisc::label(all_data$Count_of_Oral_Presentation) = "Count of Oral Presentations"
Hmisc::label(all_data$Count_of_Peer_Reviewed_Journal_Articles_Abstracts) = "Count of Published Abstracts"
Hmisc::label(all_data$Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published) = "Count of Other Published Work"
# Hmisc::label(all_data$Count`) = 'Count of Online Publications'
Hmisc::label(all_data$Count_of_Peer_Reviewed_Book_Chapter) = "Count of Book Chapters"
Hmisc::label(all_data$female_reco_letter) = "Number of letters of recommendation written by women"
Hmisc::label(all_data$Gender) = "Sex"
Hmisc::label(all_data$Gold_Humanism_Honor_Society) = "Gold Humanism membership"
Hmisc::label(all_data$Grant) = "Received grants"
Hmisc::label(all_data$Higher_Education_Degree) = "Undergraduate degree"
Hmisc::label(all_data$Higher_Education_Institution) = "Ivy League College Education"
Hmisc::label(all_data$Interest_Group) = "Participates in OBGYN Interest Group"
Hmisc::label(all_data$Language_Fluency) = "Speaks foreign language"
Hmisc::label(all_data$Latin_Honors) = "Graduated cum laude"
Hmisc::label(all_data$Medical_Degree) = "Medical degree"
Hmisc::label(all_data$Medical_Education_or_Training_Interrupted) = "Medical Education Interrupted"
Hmisc::label(all_data$Medical_Training_Discipline) = "Prior Residency Training"
Hmisc::label(all_data$Meeting_Name_Presented) = "Presented at a scientific meeting"
Hmisc::label(all_data$Military_Service_Obligation) = "Military service obligation"
Hmisc::label(all_data$Misdemeanor_Conviction) = "Prior Misdemeanor"
Hmisc::label(all_data$NCAA) = "College athlete"
Hmisc::label(all_data$NCI) = "National Cancer Institute experience"
Hmisc::label(all_data$NIH) = "National Institutes of Health experience"
Hmisc::label(all_data$number_of_applicant_first_author_publications) = "Count of First Author Publications"
Hmisc::label(all_data$Other_Service_Obligation) = "National Health Service Corps"
Hmisc::label(all_data$PALS) = "PALS"
Hmisc::label(all_data$Phi_beta_kappa) = "Phi Beta Kappa membership"
Hmisc::label(all_data$Photo_Received) = "Photo in Application"
Hmisc::label(all_data$reco_count) = "Number of letters of recommendation"
Hmisc::label(all_data$Scholarship) = "Received scholarship"
Hmisc::label(all_data$TopNIHfunded) = "Attended Top NIH funded Medical School"
Hmisc::label(all_data$total_OBGYN_letter_writers) = "Count of OBGYN writing letter of recommendation"
Hmisc::label(all_data$Tracks_Applied_by_Applicant_1) = "Application Track"
Hmisc::label(all_data$US_or_Canadian_Applicant) = "US or Canadian Applicant"
Hmisc::label(all_data$USMLE_Pass_Fail_replaced) = "USMLE Step 1"
Hmisc::label(all_data$Valedictorian) = "Prior valedictorian"
Hmisc::label(all_data$Visa_Sponsorship_Needed) = "Visa Sponsorship needed"
Hmisc::label(all_data$Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published) = "Unpublished Articles"
Hmisc::label(all_data$Count_of_Peer_Reviewed_Journal_Articles_Abstracts) = "Peer-Reviewed Articles"
Hmisc::label(all_data$Count_of_Oral_Presentation) = "Oral Presentations"
Hmisc::label(all_data$Count_of_Peer_Reviewed_Book_Chapter) = "Book Chapters"
Hmisc::label(all_data$Count_of_Poster_Presentation) = "Poster Presentations"
# Hmisc::label(all_data$Self_Identify) = 'Race/Ethnicity'
# Hmisc::label(all_data$Type_of_medical_school) = 'Type of Medical School'
Hmisc::label(all_data$Medical_Education_or_Training_Interrupted) = "Medical Education Interrupted"
Hmisc::label(all_data$Alpha_Omega_Alpha) = "Alpha Omega Alpha"
Hmisc::label(all_data$Gold_Humanism_Honor_Society) = "Gold Humanism Society"
Hmisc::label(all_data$Couples_Match) = "Couples Match"
Hmisc::label(all_data$Visa_Sponsorship_Needed) = "Visa Sponsorship Needed"
Hmisc::label(all_data$USMLE_Pass_Fail_replaced) = "USMLE Step 1"
Hmisc::label(all_data$Medical_Degree) = "Medical Degree"

There are 822 missing variables from the all_data dataframe addressed with imputation using mice.

# Assigns imputed data to all_data variable name
all_data <- completeData

Data split

Validation set Approach - The validation set approach consists of randomly splitting the data into two sets: one set is used to train the model and the remaining other set sis used to test the model.

Results

The rate of successfully matching into an OBGYN residency for this population was 56.97% with a 95% confidence interval of 55.63% to 58.31%. The sample is slightly unbalanced, with almost 56.97percent of the samples having positive y/Matching. There is no outcome imbalance.

A total of 47 features were available in the trainSet dataframe before checking for multicollinearity. While this is a large number of predictors, we start by using all of them, look at performance, and then explore reducing the dimension.

all.features.full model

We started by looking at feature importance of all the variables in all.features.full.

The estimated glm.all.features.full model with 47 predictors is shown below.

Multicollinearity (Variance inflation factor)

In model adequacy checking, we will check for multicollinearity in order to understand the linear dependence amongst the covariates. Explore the vif values of the all.features.full model and determine if you meet the assumption of additivity (meaning no multicollinearity).

Hence, we will use VIF to calculate multicollinearity for each covariate. We kept on removing the covariates one after the other with p values greater than 0.05. Based on multicollinearity result, removing one covariate linked with each other:

# create vector of VIF values
vif_values <- vif(all.features.full)
  * Type_of_medical_school+ #Removed due to VIF
  * Count_of_Scientific_Monograph+ #Stupid variable that I really don't understand
  * Year #removed for VIF issues
  * Medical_School_of_Graduation+ # removed due to VIF issues
  * work_exp_percentage+ #removed for VIF issues
  * Research_exp_percentage+ #removed for VIF issues
  * Advance_Degree,  #removed for VIF issues
  * Year_diff_from_graduation_to_application,,  #removed for VIF issues

Ideally, there should be no multicollinearity between these covariates now so we will check multicollinearity between these covariates. The revised_formula has no collinear terms present and only includes p<0.05. Our VIF values for the covariates are close to 1 hence there exists no multicollinearity between these covariates and all.features is our best model now.

all.features model without multicollinear variables

all.features_model
Adjusted and unadjusted estimates for Matching into OBGYN Residency.
  Crude   Adjusted
Variable Total Event   OR 2.5 % to 97.5 %   OR 2.5 % to 97.5 %
ACOG member
  No 1640 728 (44.4%)   ref   ref
  Yes 2090 1,397 (66.8%)   2.5 2.2 to 2.9   1.5 1.3 to 1.8
Additional Degree
  No Addl Degree 3,348 1,941 (58.0%)   ref   ref
  MBA 20 6 (30.0%)   0.3 0.1 to 0.8   0.6 0.2 to 1.8
  Other 344 173 (50.3%)   0.7 0.6 to 0.9   0.7 0.5 to 0.9
  PhD 18 5 (27.8%)   0.3 0.1 to 0.8   0.6 0.2 to 1.9
Age 29.1 (±4.1) 28.1 (±2.5)   0.8 0.8 to 0.9   0.9 0.9 to 1.0
Location of Birth
  Born in the USA 2899 1,821 (62.8%)   ref   ref
  Born outside the USA 831 304 (36.6%)   0.3 0.3 to 0.4   0.6 0.5 to 0.8
Peer-Reviewed Articles 1.1 (±2.4) 1.2 (±2.3)   1.0 1.0 to 1.1   1.1 1.0 to 1.1
Unpublished Articles 0.4 (±1.0) 0.4 (±1.1)   1.2 1.1 to 1.3   1.1 1.0 to 1.2
Gold Humanism Society
  No 3231 1,776 (55.0%)   ref   ref
  Yes 499 349 (69.9%)   1.9 1.6 to 2.3   1.2 1.0 to 1.5
Ivy League College Education
  No Ivy League Education 3529 1,981 (56.1%)   ref   ref
  Ivy League 201 144 (71.6%)   2.0 1.4 to 2.7   1.4 1.0 to 2.0
Medical Degree
  MD 3125 1,848 (59.1%)   ref   ref
  DO 605 277 (45.8%)   0.6 0.5 to 0.7   0.5 0.4 to 0.6
Medical Education Interrupted
  No 3200 1,913 (59.8%)   ref   ref
  Yes 530 212 (40.0%)   0.4 0.4 to 0.5   0.6 0.5 to 0.7
National Health Service Corps
  No 3694 2,099 (56.8%)   ref   ref
  Yes 36 26 (72.2%)   2.0 1.0 to 4.1   2.7 1.2 to 6.1
Number of letters of recommendation 3.6 (±0.6) 3.6 (±0.5)   1.4 1.2 to 1.5   1.4 1.2 to 1.6
Attended Top NIH funded Medical School
  Did not attend NIH top-funded medical school 2461 1,204 (48.9%)   ref   ref
  Attended a NIH top-funded Medical School 1269 921 (72.6%)   2.8 2.4 to 3.2   1.4 1.1 to 1.6
Count of OBGYN writing letter of recommendation 2.0 (±0.9) 2.2 (±0.7)   1.9 1.7 to 2.0   1.4 1.3 to 1.5
Application Track
  Categorical Applicant 2010 1,283 (63.8%)   ref   ref
  Applying for a Preliminary Position 1720 842 (49.0%)   0.5 0.5 to 0.6   0.7 0.6 to 0.9
US or Canadian Applicant
  No 717 157 (21.9%)   ref   ref
  Yes 3013 1,968 (65.3%)   6.7 5.5 to 8.1   3.0 2.3 to 3.9
USMLE Step 1
  Passed 3657 2,117 (57.9%)   ref   ref
  Failed attempt 73 8 (11.0%)   0.1 0.0 to 0.2   0.2 0.1 to 0.5
Visa Sponsorship Needed
  No 3483 2,071 (59.5%)   ref   ref
  Yes 247 54 (21.9%)   0.2 0.1 to 0.3   0.7 0.5 to 1.1
plot(anova(all.features))

# summary(all.features)

Variable importance for the all.features_model. After adjusting for the 47 variables, only 15 remained significant.

Stepwise Backwards selection using aic and p-values

kable(colnames(trainSet1), digits = 2, caption = "Features following Backwards Regression Selection", 
    align = "l", padding = 2, format.args = list(decimal.mark = ",", big.mark = "'"))
Features following Backwards Regression Selection
x
Match_Status
Age
Visa_Sponsorship_Needed
Alpha_Omega_Alpha
Count_of_Peer_Reviewed_Journal_Articles_Abstracts
Other_Service_Obligation
Medical_Training_Discipline
NCI
number_of_applicant_first_author_publications
Birth_Resident_status
female_reco_letter

Table 1

# must have results="asis"

trainSet_ordered <- trainSet1

Table_1 <- arsenal::tableby(formula = Match_Status ~ ., data=trainSet1, 
                 control =arsenal::tableby.control(test = TRUE,
                 total = TRUE,
                 digits = 1L, digits.count = 0L, cat.simplify = TRUE, numeric.simplify = TRUE,numeric.stats = c(#"Nmiss2", 
                                                                                                                  "median",  "q1q3"), cat.stats =c(#"Nmiss2", 
                                                                                                                                                   "countpct"), stats.labels = list(Nmiss = "N Missing", Nmiss2 ="N Missing", meansd = "Mean (SD)", medianrange = "Median (Range)",median ="Median", medianq1q3 = "Median (Q1, Q3)",q1q3 = "Q1, Q3",iqr = "IQR",range = "Range",countpct = "Count (Pct)", Nevents = "Events", medSurv ="Median Survival",medTime = "Median Follow-Up")))
  
summary(Table_1,
          text=F,
          title = 'Applicant Descriptive Variables by Matched or Did Not Match from 2017 to 2020', 
          labelTranslations = mylabels, #Seen in additional functions file
          pfootnote=TRUE)
Applicant Descriptive Variables by Matched or Did Not Match from 2017 to 2020
Not_Matched (N=1605) Matched (N=2125) Total (N=3730) p value
Age, yrs < 0.0011
   Median 29.0 28.0 28.0
   Q1, Q3 27.0, 31.0 27.0, 29.0 27.0, 30.0
Visa Sponsorship Needed 193 (12.0%) 54 (2.5%) 247 (6.6%) < 0.0012
Alpha Omega Alpha Member 91 (5.7%) 277 (13.0%) 368 (9.9%) < 0.0012
Count of Peer Reviewed Journal Articles 0.0041
   Median 0.0 0.0 0.0
   Q1, Q3 0.0, 1.0 0.0, 1.0 0.0, 1.0
National Health Service Corps 10 (0.6%) 26 (1.2%) 36 (1.0%) 0.0632
Prior Residency Training 200 (12.5%) 46 (2.2%) 246 (6.6%) < 0.0012
National Cancer Institute experience 53 (3.3%) 84 (4.0%) 137 (3.7%) 0.2952
Count of First Author Publications < 0.0011
   Median 0.0 1.0 1.0
   Q1, Q3 0.0, 2.0 0.0, 3.0 0.0, 3.0
Immigrant 602 (37.5%) 360 (16.9%) 962 (25.8%) < 0.0012
Number of letters of recommendation written by women < 0.0011
   Median 1.0 1.0 1.0
   Q1, Q3 0.0, 2.0 1.0, 2.0 1.0, 2.0
  1. Linear Model ANOVA
  2. Pearson’s Chi-squared test
table1 <- Table_1 %>% as.data.frame() 

#View(as.data.frame(Table_1))
#table1$Total[[2]]
#tm_write2word(tm_arsenal_table_output, "tm_arsenal_table_output1")
#tm_write2pdf(tm_arsenal_table_output, "tm_arsenal_table_output1")

The typical applicant to the included Obstetrics and Gynecology residency program is a 28-year-old (IQR: 25 - 31)female. A majority of applicants were trained in an allopathic medical school. The majority of students do not need visa sponsorship. About 13% of applicants had interrupted their medical school training either to get another degree or to take time for an illness or another reason. Typically applicants were not members of AOA or of the Gold Humanism Society.

Unsurprisingly the most common research output was a poster. In terms of research the applicants had presented a median of 1 poster (IQR: -2 - 4).

In terms of prior history predicting the future we looked at prior USMLE Step 1 failure. The most common was that most students Step 1.

Table 1 by Year

Model Training: monica

The monica model is going to include only the variables chosen by backwards selection above.

# monica is the model that uses only statistically significant features from backwards selection.

d <- rms::datadist(trainSet1)
options(datadist = "d")  #often a good idea, to store info with fit

monica <- rms::lrm(formula = ssformula, data = trainSet1, method = "lrm.fit", tol = 0.000000000001, linear.predictors = TRUE, 
    x = T, y = T)

No multicollinearity present in monica model

# Must have results='asis' in the rmarkdown header

vif_monica <- vif(monica)  # test for multicolinearity
# Fox, J. and Monette, G. (1992) Generalized collinearity diagnostics. JASA, 87, 178–183.

stargazer::stargazer(vif_monica, title = "Variable-Inflation Factors for monica model", type = "html", 
    notes = ("GVIF = Generalized Variance-inflation Factors.  Df = Degrees of freedom.  "), notes.align = "l", 
    notes.append = TRUE)
Variable-Inflation Factors for monica model
Age Alpha_Omega_Alpha Birth_Resident_status Count_of_Peer_Reviewed_Journal_Articles_Abstracts female_reco_letter Medical_Training_Discipline NCI number_of_applicant_first_author_publications Other_Service_Obligation Visa_Sponsorship_Needed
110.000 1.100 1.600 1.600 3.000 1.200 1.000 1.700 1.000 1.300
GVIF = Generalized Variance-inflation Factors. Df = Degrees of freedom.

The mean variable inflation factor for the monica model is 12.35.

monica.glm

Variable Importance

Adding relevant power-transformations using the Nadaraya-Watson kernel regression estimate

Here, we’ll check the linear relationship between continuous predictor variables and the logit of the outcome. This can be done by visually inspecting the scatter plot between each predictor and the logit values.

Linearity assumptions

ggplot(mydata, aes(logit, predictor.value)) + geom_point(size = 0.5, alpha = 0.5) + geom_smooth(method = "loess") + 
    theme_bw() + facet_wrap(~predictors, scales = "free_y")

The variables Age, female_reco_letter, and Number_of_Peer_reviewed_journal_articles are not linear and might need some transformations. If the scatter plot shows non-linearity, you need other methods to build the model such as including 2 or 3-power terms, fractional polynomials and spline function.

Here we compare the significance of terms as linear or as quadratic. In regression, polynomial effects are predictors that have a power greater than one.

The original monica model without polynomials effects had an AIC of 4536.5 but the updated model with the second degree polynomials for age, number of peer-reviewed articles, and number of applicants where they are the first author was 4527.75. There was no change in r.squared as well. This AIC difference of 8.76 was not significant. Plots could be used in a supplement. We found no quadratic or cubic terms that explained the variation in matching above and beyond the linear terms.

Adding variable interactions

Use the anova function to answer if the addition of the interaction was significant.

knitr::kable(as.data.frame(x) %>% filter(x$`Pr(>Chi)` < 0.05) %>% select("Pr(>Chi)"), digits = 4, padding = 2, 
    caption = "Interactions between Variables in the monica.glm", format.args = list(decimal.mark = ",", 
        big.mark = "'"))
Interactions between Variables in the monica.glm
Pr(>Chi)
Age:Alpha_Omega_Alpha 0,001
Count_of_Peer_Reviewed_Journal_Articles_Abstracts:number_of_applicant_first_author_publications 0,014
Age:Count_of_Peer_Reviewed_Journal_Articles_Abstracts 0,018
Count_of_Peer_Reviewed_Journal_Articles_Abstracts:Medical_Training_Discipline 0,024
Alpha_Omega_Alpha:Other_Service_Obligation 0,034
Birth_Resident_status:Count_of_Peer_Reviewed_Journal_Articles_Abstracts 0,049
# Includes table of interactions Counts the number of statistically significant interactions between
# variables

There are 6 interactions between the factors in the monica model.

Let’s test for more interactions after adding the variable interaction Age:Alpha_Omega_Alpha:

The original monica model without interactions had an AIC of 4536.5 but the updated model with the six feature interactions produced an AIC of 4527.68. This difference of 8.83 was not significant. We found no quadratic or cubic terms that affected the model.

Remove insignficant variables

The p-values in monica.glm indicate the probability of a variable not being important for prediction.

There is no change in AIC between the two models with and without the statistically insignificant model. The main advantage of this model over our previous is the added simplicity inherent in the reduced number of explanatory variables! Removing these variables does not violate the principle of marginality.

# Update monica model where we remove NCI and other_service
monica <- rms::lrm(formula = Match_Status ~ Age + Alpha_Omega_Alpha + Birth_Resident_status + female_reco_letter + 
    Medical_Training_Discipline + number_of_applicant_first_author_publications + Visa_Sponsorship_Needed, 
    data = trainSet1, method = "lrm.fit", tol = 0.000000000001, linear.predictors = TRUE, x = T, y = T)

Removing outliers

So now that we have a model we’re satisfied with we can look for outliers that negatively effect the model.

When you have outliers in a continuous predictor, potential solutions include: * Removing the concerned records - I chose this option.
* Transform the data into log scale * Use non parametric methods

Data point that is greater than three standard deviations different from the mean.

# Filter potential influential data points with abs(.std.res) > 3:
model.data %>% filter(abs(.std.resid) > 3)

We see that the datapoint 1925 is classified as an outlier in both tests, we can take a look at the point in a few of our plots to gauge whether or not to remove it. Code turned off.

Down to almost 4527.1 AIC from the original 4536.5, this isn’t really a relevant measure of performance when comparing the AIC of two different sets of data.

Look for overfitting

In the code above we’re using k-fold cross-validation with k = 73 (since 73 is a factor of 3730 which is the length of trainSet1) this means we’re splitting our data in 73 ‘chunks’.

Table 2: Interpreting Model Coefficients

Performance?

The overall performance of the monica model without interactions was evaluated using the Brier score, rescaled to range from 0 to 1 (with higher values indicating better performance) as suggested by Steyerberg et al (2010). Brier score for monica was 0.21. The C-statistic/AUC was 0.71 for the monica model. The R2 was 0.19.

# rms::Predict is not working and so will use glm predict as another option.

# Page 139-140 in Harrell Text p <- rms::Predict(object = monica, #rms fit object digits=4, conf.int
# = FALSE) plot(p) ggplot(p, groups=FALSE)
# https://darrendahly.github.io/post/homr/
trainSet1$Match_Status <- as.numeric(trainSet1$Match_Status) - 1

monica.glm <- glm(formula = glm_ssformula, data = trainSet1, family = binomial(link = "logit"))

dust(monica.glm)
term estimate std.error statistic p.value
(Intercept) 4.25 0.37 11.39 0
Age -0.15 0.01 -11.29 0
Alpha_Omega_AlphaYes 0.52 0.13 3.97 0
Birth_Resident_statusImmigrated to US -0.52 0.09 -5.55 0
Count_of_Peer_Reviewed_Journal_Articles_Abstracts 0.11 0.02 6 0
female_reco_letter 0.2 0.04 5.44 0
Medical_Training_DisciplineYes -0.73 0.2 -3.71 0
NCIYes 0.17 0.19 0.88 0.38
number_of_applicant_first_author_publications 0.03 0.01 2.28 0.02
Other_Service_ObligationYes 0.86 0.4 2.12 0.03
Visa_Sponsorship_NeededYes -0.97 0.18 -5.37 0


Generating Predictions

The first step is predicting the match probabilities of test data set using predict( ) function. Setting a cut-off value (0.5 for binary classification). Below 0.5 of probability did not match (0) and above that pos (1) for matching.

#https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62

#Create Predictions
probs <- 
  stats::predict(monica.glm, # Model created with the trainSet data
  newdata = testSet, #Need a test set called testSet
  type = "response")    # Return predicted probabilities
pred <- factor(ifelse(probs > 0.5, 1, 0))

# #Set levels so they are the same
# pred <- factor(pred, levels=c("1", "0"), labels = c("1", "0"))
# pred <- relevel(pred, ref = "1")
# levels(pred)
# class(pred)

# #Set levels so they are the same
# testSet$Match_Status <- factor(testSet$Match_Status, levels=c("1", "0"), labels = c("1", "0"))
# testSet$Match_Status <- relevel(testSet$Match_Status, ref = "1")
# levels(testSet$Match_Status)
# class(testSet$Match_Status)

Model Evaluation on Test data Set

Confusion Matrix

After fitting a binary logistic regression model, the next step is to check how well the fitted model performs on unseen data i.e. 30% test data. Thus, the next step is to predict the classes in the test data set and generating a confusion matrix.

testSet$Match_Status <- as.numeric(testSet$Match_Status) - 1

cm <- caret::confusionMatrix(data = pred, reference = factor(testSet$Match_Status), positive = as.character(1))
cm
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 291 124
         1 397 787
                                             
               Accuracy : 0.674              
                 95% CI : (0.651, 0.697)     
    No Information Rate : 0.57               
    P-Value [Acc > NIR] : <0.0000000000000002
                                             
                  Kappa : 0.301              
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.864              
            Specificity : 0.423              
         Pos Pred Value : 0.665              
         Neg Pred Value : 0.701              
             Prevalence : 0.570              
         Detection Rate : 0.492              
   Detection Prevalence : 0.740              
      Balanced Accuracy : 0.643              
                                             
       'Positive' Class : 1                  
                                             
draw_confusion_matrix(cm)

The accuracy for the model was 0.67 with sensitivity of NA, and finally NA.

Nomogram

Figure 1: Nomogram: An analogue tool to deliver digital knowledge

We generated the nomogram to provide a pre-match, personalized estimate of the chance of matching into OBGYN residency at all institutions whereby points in the nomogram were assigned in proportion to the effect sizes in the multivariable logistic regression analysis model. The nomogram was based on presurgical variables including pre-Match education preparations, research accomplishments, and applicant demographics.
Points were allocated for each variable, summed, and then used to calculate a medical student-specific, pre-application risk chance of Matching. The nomogram illustrates the strength of association of the predictors to the outcome as well as the nonlinear associations between age and count of poster presentations and matching.

The goal of machine learning is to build models that generate accurate predictions on previously unseen data (Max Kuhn, http://www.feat.engineering). Here we see that in action:

par(mar=c(1, 1, 1, 1)) # 1 inch margins all over 
#https://www.statmethods.net/advgraphs/parameters.html

plot(nomo.nomo, 
     xfrac=0.5, #distance of variable names to the bars
     cex.axis=0.6, #Size of the words  (e.g. "Male")
     cex.var=0.6, #Variable name size (e.g. size of "Age, years")
     total.points.label="Sum of all points",
     force.label = TRUE,
     lmgp = 0.3,
     #tcl = 0.8,
       label.every=2)

Equation

varNames <- colnames(stats::model.matrix(monica))
equationStr <- paste(round(coef(monica), 2), varNames, sep = "*", collapse = "   +   ")
equationStr <- gsub("*(Intercept)", "", equationStr, fixed = TRUE)
equationStr <- paste(all.features$terms[[2]], "=", equationStr)
equationStr[[2]] %>% knitr::kable(format = "html", align = "l", caption = "Polynomial Equation to Predict Matching for OBGYN Residency:", 
    padding = 2) %>% kableExtra::kable_styling()
Polynomial Equation to Predict Matching for OBGYN Residency:
x
Match_Status = 3.82 + -0.13Age + 0.52Alpha_Omega_AlphaYes + -0.5Birth_Resident_statusImmigrated to US + 0.19female_reco_letter + -0.64Medical_Training_DisciplineYes + 0.06number_of_applicant_first_author_publications + -0.95*Visa_Sponsorship_NeededYes

The overall performance of these models was evaluated using the Brier score, rescaled to range from 0 to 1 (with higher values indicating better performance) as suggested by Steyerberg et al (2010). We assessed calibration graphically, in addition to using the maximum and average difference in predicted vs loess-calibrated probabilities (Emax and Eavg); and we used the c-statistic to assess discrimination. For each of these metrics, we reported bootstrapped 95% confidence intervals. Finally, for Model Revision, we estimated an optimism-corrected c-statistic and shrinkage factor using bootstrapping, as described in Harrell, Lee and Mark (1996). Uncertainty in these metrics was evaluated with bootstrapping:

# # Metrics

# val_monica <- rms::val.prob( p = testSet$monica.glm_pred, # predicted probability y =
# testSet$Match_Status, #vector of binary outcomes pl = FALSE) %>% #plot calibration curves round(3)
# rescale_brier <- function(x, p, ...){ format(round(1 - (x / (mean(p) * (1 - mean(p)))), digits =
# 2), nsmall = 2) } b1 <- rescale_brier(val_monica['Brier'], 0.41) # Note: 0.41 is the marginal
# probabilty of not matching in the entire sample pander(val_monica) #Uncertainty of the ???
# set.seed(12345678) boot_val <- function(data, formula, ...){ out <- list() for(i in 1:500){ df <-
# sample_n(data, nrow(data), replace = TRUE) md <- glm(formula, data = df, family = binomial)
# out[[i]] <- rms::val.prob(predict(md, type = 'response'), as.numeric(df$Match_Status) - 1, pl =
# FALSE) %>% round(3) } return(out) } forglm_revisedformula <- as.formula( paste0('Match_Status ~',
# linearphrase) ) boot_vals_monica <- boot_val(trainSet, forglm_revisedformula) ##Should this be
# trainSet or testSet?  #This code just pulls out 95% intervals from the bootstrapped values.
# calc_ci <- function(metric, boot_vals, n){ x <- unlist(map(boot_vals, `[`, c(metric))) if(metric ==
# 'Brier'){x <- as.numeric(rescale_brier(x, 0.184))} paste0('(', round(quantile(x, 0.025), n), ' to
# ', round(quantile(x, 0.975), n), ')') } # m1 monica_c_boot_ci <- calc_ci('C (ROC)',
# boot_vals_monica, 2) monica_brier_boot_ci <- calc_ci('Brier', boot_vals_monica, 2)
# monica_emax_boot_ci <- calc_ci('Emax', boot_vals_monica, 2) monica_eavg_boot_ci <- calc_ci('Eavg',
# boot_vals_monica, 2)
# # Function to produce the calibration plots cal_plot <- function(model, model_name, pred_var, ...){
# require(tidyverse) require(viridis) require(gridExtra) # The calibration plot g1 <-
# dplyr::mutate(all_data, bin = ntile(get(pred_var), 10)) %>% # Bin prediction into 10ths
# group_by(bin) %>% dplyr::mutate(n = n(), # Get ests and CIs bin_pred = mean(get(pred_var)),
# bin_prob = mean(as.numeric(alive) - 1), se = sqrt((bin_prob * (1 - bin_prob)) / n), ul = bin_prob +
# 1.96 * se, ll = bin_prob - 1.96 * se) %>% ungroup() %>% ggplot(aes(x = bin_pred, y = bin_prob, ymin
# = ll, ymax = ul)) + geom_pointrange(size = 0.5, color = 'black') + scale_y_continuous(limits = c(0,
# 1), breaks = seq(0, 1, by = 0.1)) + scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by =
# 0.1)) + geom_abline() + # 45 degree line indicating perfect calibration geom_smooth(method = 'lm',
# se = FALSE, linetype = 'dashed', color = 'black', formula = y~-1 + x) + # straight line fit through
# estimates geom_smooth(aes(x = get(pred_var), y = as.numeric(alive) - 1), color = 'red', se = FALSE,
# method = 'loess') + # loess fit through estimates xlab('') + ylab('Observed Probability') +
# theme_minimal() + ggtitle(model_name) # The distribution plot g2 <- ggplot(all_data, aes(x =
# get(pred_var))) + geom_histogram(fill = 'black', bins = 200) + scale_x_continuous(limits = c(0, 1),
# breaks = seq(0, 1, by = 0.1)) + xlab('Predicted Probability') + ylab('') + theme_minimal() +
# scale_y_continuous(breaks = c(0, 40)) + theme(panel.grid.minor = element_blank()) # Combine them g
# <- arrangeGrob(g1, g2, respect = TRUE, heights = c(1, 0.25), ncol = 1) grid.newpage() grid.draw(g)
# return(g[[3]]) }

# model_tab_1 <- data_frame( Measures = c('Intercept', 'Slope', 'Residual deviance', 'Df', 'LRT Chisq
# p-value', 'Brier score (rescaled)', 'Emax', 'Eavg', 'c-statistic'), all.features_Values =
# c(round(c(0, 1, monica$deviance, monica$df.residual, 0), 2), paste(b1, monica_brier_boot_ci),
# paste(val_monica['Emax'], monica_emax_boot_ci), paste(val_monica['Eavg'], monica_eavg_boot_ci),
# paste(round(val_monica['C (ROC)'], 2), monica_c_boot_ci)), ) names(model_tab_1) <- c('', 'monica
# model') knitr::kable(model_tab_1)

Plot the Full distributions of bootstrapped values.

# AMAZING!!!!: https://darrendahly.github.io/post/homr/ boots <- function(metric, boot_vals){ x <-
# as.numeric(unlist(map(boot_vals, `[`, metric))) if(metric == 'Brier'){x <-
# as.numeric(rescale_brier(x, 0.184))} return(x) } boot_list <- list(boot_vals_monica) metrics <-
# c('C (ROC)', 'Brier', 'Eavg', 'Emax') x <- c() for(i in metrics){ for(j in 1:1){ x <- c(x, boots(i,
# boot_list[[j]])) } } y <- rep(c(paste0('all.features model', ' c-index'), paste0('all.features
# model', ' Brier'), paste0('all.features model', ' Emax'), paste0('all.features model',' Eavg')),
# each = 500) boot_data <- data_frame(var = y, val = x) #Density plots ggplot(boot_data, aes(x =
# val)) + geom_density() + facet_wrap(~var, nrow = 4, scales = 'free') + theme_minimal() + xlab('') +
# ylab('Density')

Evaluating the model: Overview

To evaluate the monica Model, we followed the procedure outlined in Vergouwe et al (2016).

monica_1 <- glm(formula = ssformula, data = trainSet, family = binomial(link = "logit"))

# Model predicted probabilties
trainSet$monica_pred <- stats::predict(monica_1, type = "response")

# x <- cal_plot(monica_1, 'monica model', 'all_data$monica_pred')
# sjPlot::tab_model(model = monica.glm, 
#                     show.p = TRUE, 
#                     show.ci = FALSE, 
#                     show.se = TRUE, 
#                     transform = NULL)
  
# sjlabelled::term_labels(monica.glm)

sjPlot::plot_model(model = monica.glm, 
                     transform = NULL, 
                     show.values = TRUE, 
                     #axis.labels = TRUE, 
                     show.legend = TRUE,
                     value.offset = .4, 
                     colors = "bw",
                     auto.label = FALSE) #Uses the column header names

#jtools::effect_plot(monica.glm, pred = Age, data = all_data, interval = TRUE)

Odds Ratios

Table 2: Interpreting Model Coefficients

#http://rstudio-pubs-static.s3.amazonaws.com/481253_60f332ede0ea4ad9bb8e8c1148f560cd.html#5_multiple_imputation_with_a_logistic_regression_model

modeltab <- arsenal::modelsum(formula = Match_Status ~ ., 
             family = "binomial", 
             data = trainSet,  #train or test?
             binomial.stats=c("estimate","CI.lower.estimate","CI.upper.estimate","p.value"))
             #adjust = ~ Age + Gender + US_or_Canadian_Applicant + Medical_Degree + Alpha_Omega_Alpha)

summary(modeltab, 
        title='Odds Ratios for monica model', 
        control=arsenal::modelsum.control(gaussian.stats=c("N","estimate", "Nmiss2")), 
        show.intercept = FALSE)
Odds Ratios for monica model
estimate CI.lower.estimate CI.upper.estimate p.value
Sex Male -0.417 -0.583 -0.252 < 0.001
Age -0.166 -0.188 -0.144 < 0.001
US or Canadian Applicant Yes 1.905 1.715 2.100 < 0.001
Medical Degree DO -0.539 -0.714 -0.364 < 0.001
Military service obligation Yes 0.126 -0.598 0.885 0.737
Visa Sponsorship Needed Yes -1.657 -1.975 -1.355 < 0.001
Medical Education Interrupted Yes -0.802 -0.990 -0.615 < 0.001
Prior Misdemeanor Yes -0.349 -0.847 0.146 0.166
Alpha Omega Alpha Yes 0.914 0.672 1.165 < 0.001
Gold Humanism Society Yes 0.645 0.444 0.851 < 0.001
Couples Match Yes 0.670 0.419 0.928 < 0.001
Oral Presentations 0.016 -0.021 0.055 0.403
Book Chapters -0.188 -0.411 0.022 0.086
Peer-Reviewed Articles 0.044 0.015 0.075 0.004
Unpublished Articles 0.182 0.108 0.261 < 0.001
Poster Presentations 0.123 0.092 0.156 < 0.001
USMLE Step 1 Failed attempt -2.413 -3.231 -1.737 < 0.001
Presented at a scientific meeting Presented at a meeting 0.361 0.226 0.497 < 0.001
Attended Top NIH funded Medical School Attended a NIH top-funded Medical School 1.016 0.871 1.164 < 0.001
Ivy League College Education Ivy League 0.680 0.372 1.001 < 0.001
Undergraduate degree Not a B.S. degree -0.305 -0.436 -0.175 < 0.001
Participates in OBGYN Interest Group Mentions Interest Group -0.975 -2.952 0.660 0.260
Location of Birth Born outside the USA -1.074 -1.235 -0.915 < 0.001
Speaks foreign language Speaks English and Another Language -0.118 -0.266 0.030 0.120
ACLS Yes -0.247 -0.378 -0.116 < 0.001
PALS Yes -0.966 -1.260 -0.680 < 0.001
National Health Service Corps Yes 0.681 -0.020 1.461 0.068
Prior Residency Training Yes -1.862 -2.200 -1.544 < 0.001
Photo in Application No -1.767 -2.559 -1.086 < 0.001
Application Track Applying for a Preliminary Position -0.610 -0.741 -0.479 < 0.001
AMA member American Medical Association Member 0.521 0.389 0.653 < 0.001
ACOG member Yes 0.926 0.793 1.060 < 0.001
Graduated cum laude Latin_honors -1.229 -2.056 -0.489 0.002
Received scholarship Scholarship 0.313 0.166 0.462 < 0.001
Received grants Grant_funding 0.466 0.186 0.755 0.001
Phi Beta Kappa membership Phi_Beta_Kappa 0.886 0.386 1.432 < 0.001
College athlete NCAA_athlente 0.627 0.003 1.310 0.058
Boy or Girl Scout Boy/Girl_Scouts 0.495 -0.436 1.543 0.317
Prior valedictorian Valedictorian 0.520 -0.095 1.185 0.108
National Institutes of Health experience NIH_present -0.509 -1.183 0.150 0.131
National Cancer Institute experience Yes 0.187 -0.160 0.542 0.296
Count of OBGYN writing letter of recommendation 0.618 0.538 0.700 < 0.001
Number of letters of recommendation 0.317 0.200 0.434 < 0.001
Count of First Author Publications 0.054 0.032 0.077 < 0.001
Additional Degree MBA -1.169 -2.209 -0.253 0.017
Additional Degree Other -0.310 -0.532 -0.088 0.006
Additional Degree PhD -1.277 -2.416 -0.300 0.015
Immigrant Immigrated to US -1.079 -1.232 -0.928 < 0.001
Number of letters of recommendation written by women 0.306 0.240 0.372 < 0.001
monica_pred 4.742 4.310 5.186 < 0.001
# all.features_predict <- stats::predict(monica, x=TRUE, y=TRUE)

Internal Validation

Bootstrap validation of the monica model. In bootstrap validation random samples are drawn with replacement from the original data set are the same size as the original cohort.

# page 284 in Harrell's book, page 301
set.seed(123456)

monica <- update(monica, x = TRUE, y = TRUE)

v_monica <- rms::validate(fit = monica, method = "boot", B = 1000, pr = FALSE, digits = 2, bw = FALSE, 
    tol = 0.000000000001, maxdim = 5, maxiter = 100, dxy = TRUE)
head(v_monica, B = 20, digits = 3)
          index.orig training   test optimism index.corrected    n
Dxy             0.40     0.40 0.3946   0.0044          0.3911 1000
R2              0.18     0.18 0.1785   0.0044          0.1762 1000
Intercept       0.00     0.00 0.0056  -0.0056          0.0056 1000
Slope           1.00     1.00 0.9840   0.0160          0.9840 1000
Emax            0.00     0.00 0.0045   0.0045          0.0045 1000
D               0.14     0.15 0.1425   0.0038          0.1404 1000

Figure 2: Calibration of the monica model

graphics::plot(cal_monica, main = "Bootstrap Overfitting-Corrected Calibration Curve", xlim = c(0, 1), 
    ylim = c(0, 1), xlab = "Nomogram-Predicted Probability of Matching", ylab = "Actual Matching (proportion)", 
    lwd = 2, lty = 1, errbar.col = c(rgb(0, 118, 192, maxColorValue = 255)), legend = FALSE, subtitles = FALSE, 
    cex.lab = 1.2, cex.axis = 1, cex.main = 1, cex.sub = 0.6, scat1d.opts = list(nhistSpike = 240, side = 3, 
        frac = 0.08, lwd = 1, nint = 50))

n=3730   Mean absolute error=0.013   Mean squared error=0.00033
0.9 Quantile of absolute error=0.019
lines(cal_monica, lwd = 2, lty = 3, col = c(rgb(255, 0, 0, maxColorValue = 255)))
abline(0, 1, lty = 5, lwd = 2, col = c(rgb(0, 0, 255, maxColorValue = 255)))
legend("bottomright", cex = 0.8, legend = c("Apparent", "Bias-corrected", "Ideal"), col = c("red", "black", 
    "blue"), lwd = c(1, 1, 1), lty = c(1, 1, 2))

Site Validation

The process works as follow:

  • Build (train) the model on the training data set
  • Apply the model to the test data set to predict the outcome of new unseen observations
  • Quantify the prediction error as the mean squared difference between the observed and the predicted outcome values.
# Create the data for each individual site source('Code/000_1f_Site_Data_cleaning_function_maybe.R')